
Exploiting Task-Parallelism on GPU Clusters via OmpSs and rCUDA Virtualization

Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí
Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12071–Castellón, Spain
Emails: {adcastel,mayo,quintana}
Barcelona Supercomputing Center (BSC-CNS), 08034–Barcelona, Spain
Abstract—OmpSs is a task-parallel programming model con-
sisting of a reduced collection of OpenMP-like directives, a
front-end compiler, and a runtime system. This directive-based
programming interface helps developers accelerate their ap-
plication’s execution, e.g. in a cluster equipped with graphics
processing units (GPUs), with a low programming effort. On the
other hand, the virtualization package rCUDA provides seamless
and transparent remote access to any CUDA GPU in a cluster,
via the CUDA Driver and Runtime programming interfaces.
In this paper we investigate the hurdles and practical ad-
vantages of combining these two technologies. Our experimental
study targets two cluster configurations: a system where all the
GPUs are located into a single cluster node; and a cluster with
the GPUs distributed among the nodes. Two applications, the N-
body particle simulation and the Cholesky factorization of a dense
matrix, are employed to expose the bottlenecks and performance
of a remote virtualization solution applied to these two OmpSs
task-parallel codes.
Keywords—Task parallelism; graphics processing units (GPUs);
OmpSs; CUDA; remote virtualization.
I. INTRODUCTION

Heterogeneous computing has arisen as a powerful approach to leverage the growth in transistor counts dictated by Moore's law [1] while facing the challenge posed by the power wall [2], [3]. In the middle of the past decade, the power wall marked the end of the “GHz race” and the shift towards multicore processor designs. More recently, it has promoted “dark silicon” [4] and the deployment of heterogeneous facilities for high performance computing [5], [6].
Heterogeneous clusters, equipped with general-purpose
multicore processors (CPUs) and many-core graphics process-
ing units (GPUs), have been broadly adopted over the last few
years. This is motivated by their favourable price-performance-
power ratio as well as the availability of relatively simple par-
allel application programming interfaces (APIs), programming
tools and libraries. One of the most accepted APIs for this
purpose is CUDA [7], developed by NVIDIA. In addition, this
technology has demonstrated remarkable performance for the
solution of many scientific and engineering problems with high
computational requirements. In consequence, some of the most
powerful supercomputers in the world are currently built using
heterogeneous CPU-GPU servers [5].
The conventional approach to build a GPU-enabled cluster
furnishes each (cluster) node with one or more of these devices.
Unfortunately, this configuration often yields a low utilization
of the accelerators, due to mismatches between the type of
concurrency featured by many applications and the GPU archi-
tecture. rCUDA [8], [9] is a remote virtualization framework
that addresses this problem by transforming GPU-equipped
nodes into a pool of GPU servers. With this technology
the servers become available to all compute nodes (clients),
rendering a higher overall utilization of the cluster GPUs.
Task-parallelism is an efficient means to exploit the increasing number of cores in current CPUs, and it has gained acceptance as a parallel programming model that yields more flexible codes. OmpSs [10], developed by BSC, provides a task runtime scheduler that decomposes complex codes into a collection of simpler tasks by means of compiler directives added to the code.
In this paper we investigate the effectiveness of GPU
remote virtualization for task-parallel applications, identifying
the difficulties encountered during the integration of OmpSs
with rCUDA. To validate the study, we analyze the perfor-
mance of two scientific codes, specifically the solution of
the N-body problem and the Cholesky factorization for dense
linear systems, parallelized via OmpSs and executed on top of
rCUDA. Furthermore, in the experimentation we consider two
cluster configurations (scenarios) consisting of a pool of GPUs
located in a single node and a cluster of GPU server nodes with
one and several GPUs per node. Our results reveal affordable
overheads and fair scaling trends for OmpSs task-parallel
applications making use of remote graphics accelerators via
rCUDA instead of local hardware.
The rest of the paper is structured as follows. In Section II
we review work related to our research. In Section III we offer
a brief summary of the tools combined in this work: OmpSs
and rCUDA. In Section IV we describe the ongoing effort
towards their integration, and in Section V we evaluate and
estimate the performance of the result. Finally, in Section VI
we close the paper with a few concluding remarks and a
discussion of future work.
II. RELATED WORK

With the aim of reducing the total cost of ownership of clusters, several GPU virtualization frameworks have been developed over the last few years. These technologies share the same goal: enabling remote access to GPU facilities. rCUDA [8], [9] is a production-ready framework that offers support for the latest CUDA revisions and provides wide coverage of current GPU APIs. Compared with other CUDA and OpenCL virtualization tools [11], [12], [13], [14], rCUDA is more mature and up-to-date with CUDA versions.

Fig. 1: rCUDA modular architecture.
On the other hand, several efforts, pioneered by Cilk [15],
aim to ease the development and improve the performance of
task-parallel programs by embedding task scheduling policies
inside a runtime. OmpSs [10] and StarPU [16] have demon-
strated the benefits of this technology to execute a task-parallel
application on a single server equipped with multiple GPUs.
III. OMPSS AND RCUDA

A. Extracting task-parallelism via OmpSs
OmpSs is a task-based programming model, developed
at the Barcelona Supercomputing Center, that detects data
dependencies between tasks at execution time, with the help
of directionality clauses embedded in the code as OpenMP-
like directives. OmpSs leverages this information to generate
a task graph during the execution that is next employed by the
threads to exploit the task parallelism implicit to the operation,
via a dynamic out-of-order but dependency-aware schedule.
B. Remote virtualization via rCUDA
rCUDA is a virtualization middleware that enables seam-
less access to any CUDA device in a cluster from any compute
node. In particular, with rCUDA a GPU can be shared among
several nodes and a single node can use all graphic accelerators
available in the facility as if they were local. rCUDA aims
to deliver higher accelerator utilization rates in the overall
system in order to reduce resource, space and energy re-
quirements [17], [18]. The software is structured following
a classical client-server distributed architecture, illustrated in
Figure 1. In rCUDA the client middleware is embedded in the
code of the application demanding GPU acceleration services,
providing a transparent replacement for the native CUDA
libraries. The server middleware, in turn, runs on the cluster nodes whose GPUs provide the requested acceleration services.
The rCUDA client exposes the same interface as the regular
NVIDIA CUDA 6.5 release [7], including the Runtime and
Driver APIs. With this solution, applications are not aware
that they run on top of a virtualization layer.
IV. INTEGRATION OF OMPSS AND RCUDA

In this section we describe the effort conducted towards the integration of OmpSs and rCUDA, focusing on the main difficulties encountered during this process due to the differences in behaviour between OmpSs+CUDA and OmpSs+rCUDA, as well as on the intermediate results of this ongoing project.
A. CUDA APIs and compilation chain
rCUDA supports the Runtime and Driver APIs native to
CUDA, capturing the invocations to these interfaces in the
client and forwarding them to the appropriate GPU server. On
the other hand, OmpSs only interacts with the local GPUs
via the CUDA Runtime API. Therefore, integrating these
two frameworks should be, in principle, effortless: OmpSs relies on the Nanos++ runtime and the Mercurium C/C++ compiler, a front-end source-to-source transformer that in turn relies on NVIDIA’s nvcc compiler and any x86 compiler (in our case, GNU’s g++). Thus, the compilation processes that generate OmpSs-based task-parallel code for a local GPU (OmpSs+CUDA) or a remote accelerator (OmpSs+rCUDA) differ only in the final stage, where the output of the compilation needs to be linked with rCUDA instead of CUDA in order to execute the binary over remote GPUs.
B. Initialization of CUDA functions
One aspect of rCUDA that had to be modified is the point at which the CUDA kernels and the Runtime and Driver functions are loaded. Concretely, CUDA performs this initialization at the beginning of the application’s execution while, in the OmpSs framework, this event occurs the first time a GPU task (i.e., any work item targeted to a GPU) is issued for execution.
rCUDA was designed with CUDA in mind, and it is
therefore natural that its initialization procedure mimics that of
CUDA. Nevertheless, in order to integrate OmpSs code with
rCUDA, this behaviour had to be modified. In particular, for
each thread created on the client side, the first call to any kernel
configuration routine triggers a load of the corresponding
modules in the associated GPU server. From then on, any further kernel configuration call directed to the same GPU server has no effect.
C. Avoiding communication overhead
Current hardware from NVIDIA can promote idle GPUs
from a high–performance active state to an energy–saving
passive one following the same approach that, for several years
now, has been intensively exploited in x86 architectures via the
C-states defined in the ACPI standard [19]. When this change
of state happens, the next CUDA call takes longer because the GPU driver first needs to reactivate the GPU before executing the CUDA function. OmpSs favours high performance over energy efficiency, preventing GPUs from entering an energy-saving state via regular (i.e., periodic) calls to the cudaFree function with a null argument. These calls have no effect other than keeping the GPU active, even when idle.
Fortunately, rCUDA does not need this mechanism, which would otherwise introduce a certain communication overhead in the network in the form of many short messages between the client and the remote GPU server. In particular, the rCUDA server runs on the GPU server node itself, which already prevents the GPU from entering the passive state. In addition, we modified the rCUDA client middleware to intercept all cudaFree calls, forwarding them to the server only when they are true memory deallocations and discarding them when, as in the OmpSs+rCUDA scenario, they are unnecessary activation commands.
D. Ongoing work
OmpSs features a work scheduling policy [20] in which
tasks are initially distributed among the threads via private
queues but, when a thread runs out of work, it polls the queues
from other threads and “steals” part of the tasks in one of them.
When GPUs are involved, due to the existence of a separate
memory address space per GPU, this change may imply data
transfers between the memories of two GPUs. These copies
are carried out in OmpSs via calls to cudaMemcpyPeer,
which directly transfers the data from the memory of the source
GPU to the memory of the destination GPU. This kind of data transfer is only available between threads created within the same process.
Unfortunately, the current version of rCUDA does not
support this type of data transfers when the origin and des-
tination spaces in memory have been allocated by different
threads. The reason is that, for each thread present on the
client side, rCUDA spawns a new process on the server side.
Each server process is thus placed inside a different GPU
context and, in this scenario, memory allocated from within
this context is not visible to other contexts/processes. This
problem basically arises because rCUDA was initially designed
before version 4.0 of CUDA, which was the first release that allowed concurrent access to a GPU from more than one thread.
In order to tackle this problem, and allow the invocation to
cudaMemcpyPeer when the origin and destination spaces
in memory have been allocated by different threads, several
modifications are needed in the current version of rCUDA, in
both the server and client sides. The changes in the server side
are nontrivial, because GPU servers should now distinguish
between the main rCUDA process and the threads spawned
from inside a CUDA application, in order to decide whether or not a new context must be created.
Each process or thread inside a process spawned during
a CUDA application execution is considered to be an inde-
pendent rCUDA client, with its own connection to a process
in the server side. Therefore, a client should be aware of the
existence of other clients, as communication between clients
is needed in order to perform a remote GPU memory transfer.
Moreover, due to the overhead introduced by the network, the
solution should not implement device-to-device memory data
transfers between servers through the client.
rCUDA mainly targets two types of cluster configurations:
a system with a pool of GPUs connected to a single node or
a cluster where several (or all) nodes are equipped with one
or more GPUs; see Figure 2. Therefore, a complete solution
needs to consider the following situations:

Fig. 2: rCUDA cluster configurations: (a) GPU pool configuration; (b) several GPUs per node.

- A GPU G_A in a server accessed by two remote clients from different nodes. In this case, an intra-node communication between the two processes running on the server is needed.
- Two GPUs G_A and G_B attached to the same server node. As in the previous situation, an intra-node communication between the two processes in the GPU server is needed.
- Two GPUs G_A and G_B attached to different server nodes (only for the cluster of GPU nodes). GPU Direct RDMA (GDR) could be used in this situation to grant access from G_A to the remote memory of G_B (or vice versa). This communication can be implemented, e.g., on top of

These situations must be carefully analyzed, as several options are possible to perform this type of GPU memory transfer.
V. EXPERIMENTAL EVALUATION

A. Experimental setup
For our experiments, we have employed two systems with
different GPU configurations:
- Tintorrum is a 2-node system; each node has two Intel Xeon E5520 (quad-core) processors at 2.27 GHz and 24 GB of DDR3-1866 RAM. One of these nodes is connected to two NVIDIA C2050 boards, and the other is furnished with four NVIDIA C2050 GPUs. Inter-node communications employ an InfiniBand (IB) QDR network.
- Minotauro is a cluster with 126 nodes, each with two Intel Xeon E5649 processors (6 cores per processor) at 2.53 GHz, 24 GB of DDR3-1333 RAM, and two NVIDIA M2050 GPUs. The cluster network is IB QDR.
rCUDA 5.0 and OmpSs 14.10 were used for both systems.
In Tintorrum, we installed CUDA 6.5. On the other hand,
Minotauro is a production system and the most recent
CUDA version available there is CUDA 5.0. The back-end
compiler was g++ 4.4.7 in Tintorrum, and g++ 4.4.4
in Minotauro.
Two applications were selected for the evaluation:
- N-body. This is the classical simulation of a dynamical
system of particles, under the influence of physical forces
(e.g., gravity), widely used in physics and astronomy.
For our tests we set the number of particles to 57,600
which corresponds to the largest volume that fits into the
memory of a single GPU of Tintorrum. By dividing the
domain into separate “blocks” (or tasks), and replicating the “ghost” borders in spatially neighbouring tasks, the code turns out to be embarrassingly parallel. Thus, there are
no dependencies between tasks and, in general, there is
no benefit from using work stealing. In consequence,
OmpSs performs no copies between GPU memories via
cudaMemcpyPeer; see Section IV-D.
- Cholesky factorization. This is a crucial operation for
the solution of dense systems of linear equations with
symmetric positive definite coefficient matrix. In our
experiments we set the problem size to 45,056×45,056.
When operating with single precision data, this is the
largest matrix that fits into the memory of a single GPU
of Minotauro. OmpSs decomposes this operation into
four types of kernels (Cholesky factorization of diagonal blocks, triangular system solve, symmetric rank-k update,
and matrix multiplication), organized as a number of tasks
with a rich collection of dependencies among them. In
this application, work stealing is in place and involves data copies between the GPU memories, incurring the problem described in Section IV-D.
B. N-body
This application was only tested in Tintorrum. For this
particular case study, there are no GPU memory transfers,
and thus we can directly compare the performance of the
OmpSs–accelerated code executed on top of either CUDA or
rCUDA. GPUs were “added” in the experiments with this 2-
node cluster by first populating the 4-GPU node with up to
4 server processes (one per GPU) and, from then on, the 2-
GPU node. Figure 3 reports the outcome from this evaluation,
showing a linear reduction of the execution time when up to 5
server processes/GPUs are employed. The speed-ups observed
there demonstrate the scalability of this embarrassingly parallel
application when combined with rCUDA in order to leverage
a reduced number of GPUs. The same plot also exposes
a notable increase of execution time when the 6th server
process/GPU is included in the experiment. The reason is that,
due to the particular architecture of the node with 2 GPUs,
the transfers between this last GPU and the IB network occur
through the QPI interconnect, which poses a considerable
bottleneck for this application.
Fig. 3: Execution time for N-body in Tintorrum.
Fig. 4: Synchronization time for N-body in Tintorrum.
An additional observation to emphasize in this experiment
is that OmpSs+rCUDA clearly outperforms OmpSs+CUDA.
Figure 4 demonstrates that the difference in favour of rCUDA
is due to the overhead introduced with the synchronization that
is enforced by OmpSs with the #pragma omp taskwait
directive. The reason is that rCUDA integrates an aggressive
synchronization which carries out a continuous polling that
significantly benefits high performance. In contrast, CUDA features a more relaxed synchronization mechanism with important negative consequences for the execution time of this application.
C. Cholesky factorization
As argued earlier, the task-parallel version of the Cholesky
factorization in OmpSs transfers data between GPU mem-
ories via cudaMemcpyPeer, but this feature is not yet
supported in rCUDA. In order to overcome this problem, while
still delivering a fair comparison between the scalability of
OmpSs+rCUDA vs OmpSs+CUDA, we will first determine
the overhead introduced by a device-to-device communication
for setup configurations corresponding to a pool of GPUs (in
a single node) and a cluster with several GPUs per node. In
the first scenario, these copies occur between GPUs in the
same node while, in the second one, they may also involve the interconnect, as the GPUs participating in the transfer may be located in different nodes.

Fig. 5: Device-to-device memory bandwidth using cudaMemcpyPeer in Minotauro.
1) Pools of GPUs: When all the GPUs are collected into a single server node, any memory transfer between the memories of two accelerators only involves this node and can, for instance, be implemented on top of GPU Direct 2.0. In order to
estimate the cost of these intra-node transfers, we conducted a
bandwidth test with the cudaMemcpyPeer routine. Figure 5
reveals that the lowest bandwidth attained with this test in
Minotauro is considerably superior to the throughput re-
ported for CUDA in Figure 6.
In a preliminary test, we also detected that the optimal al-
gorithmic block size for the Cholesky factorization using both
CUDA and rCUDA is 2,048. In practice, this implies that all
tasks operate (and communicate) with blocks of 2,048×2,048
(single precision) elements which occupy 16 MB. Therefore,
the throughput curve in Figure 5 indicates that the cost of a direct data transfer of this dimension, between two GPUs in the same node, can be expected to be around 0.192 ms in Minotauro.
2) Multi-GPU cluster: In this scenario, transferring data
between two GPUs introduces the communication costs of both the PCIe bus and, if the GPUs reside in different nodes, the interconnect. To estimate this aggregated cost, we ran a
test using the common CUDA transfer calls combined with
the ib_send_bw test (included in the OFED-3.12 package)
between the GPUs in two separate nodes of Minotauro.
Figure 6 shows that, for a 16-MB data chunk, CUDA (i.e.,
local transfers) outperforms the throughput of the IB QDR
network. Also, a transfer of a 16-MB block between two
GPUs located in different cluster nodes, via e.g. GDR, can
be expected to cost 5.283 ms in Minotauro.
3) Performance: For the evaluation of the Cholesky fac-
torization, we run the application on Minotauro and report
the GFLOPS metric, which reflects the throughput in terms
of billions of floating-point arithmetic operations (flops) per
second. For CUDA, the code is executed using 1 and 2 GPUs
in a single node of this cluster. With rCUDA, we employ up to
four nodes with one GPU per node. Unfortunately, an internal
problem in the current implementation of OmpSs when linked with NVIDIA’s mathematical library CUBLAS impeded the experimentation with a larger number of GPUs.

Fig. 6: IB QDR and CUDA bandwidths in Minotauro. For the latter option, H2D and D2H respectively refer to the host-to-device and device-to-host communication throughput.

The results from the experiment are reported in Figure 7. The three
lines with the labels rCUDA, rCUDA (intra) and rCUDA
(QDR) there correspond to the execution of the task-parallel
code, linked with rCUDA to operate in a distributed-memory
scenario, with three different workarounds for the lack of support for device-to-device copies. Concretely, in the first case, the copy is simply omitted, so that this line reflects the peak performance that could be observed if the cost of this type of copies was negligible. In the second case, we assume
that inter-node communications proceed at the rate of intra-
node copies, see subsection V-C1; this result thus approximates
the performance that could be observed in a configuration where
all GPUs were installed in the same node (pool of GPUs).
The third case considers that all the device-to-device copies
occur at the inter-node speed; see subsection V-C2. To estimate
the performance for the last two cases, the invocations to
cudaMemcpyPeer are intercepted and replaced by sleep commands/tasks for the duration of the corresponding data transfer.
The performance results for rCUDA in the plot clearly
demonstrate the benefits of the remote virtualization approach
for this particular application. We note that the lines there
correspond to the GFLOPS per GPU and show an almost flat profile. Therefore, increasing the number of GPUs by a factor of g increases the aggregated (total) GFLOPS rate by almost the same factor and, in consequence, yields an inversely proportional reduction of the execution time, as observed in Figure 8. The overhead
of OmpSs+rCUDA with respect to OmpSs+CUDA, when
executed using a single local GPU, is roughly 5% (781
vs 737 GFLOPS, respectively).
VI. CONCLUSIONS AND FUTURE WORK

We have presented our work-in-progress towards the integration of the OmpSs task-parallel programming model
with the rCUDA virtualization middleware. The combination
of these two technologies allows a more efficient use of the
resources in a GPU cluster, e.g., by enabling the deployment
Fig. 7: GFLOPS per GPU for the Cholesky factorization in Minotauro.
Fig. 8: Execution time for the Cholesky factorization in Minotauro.
of facilities where all the GPUs are located in a single node
(a pool of accelerators) and the remaining compute nodes
access/share them. In addition, when the GPUs are distributed among the cluster nodes, applications using rCUDA are no longer constrained by the number of local GPUs but can exploit the acceleration power of any GPU in the system.
These ideal configurations have been tested using task-
parallel OmpSs-based codes for two complex applications,
namely an N-body simulation of particles and the Cholesky
factorization of a dense matrix. The results with these case
studies reveal that OmpSs+rCUDA is an appealing solution to
improve programming productivity as well as deliver higher
accelerator utilization rates, potentially reducing resource,
space and energy usage.
As part of future work, and a natural extension of this
work, we plan to redesign rCUDA in order to accommodate the
type of device-to-device communications embedded for high
performance in OmpSs.
ACKNOWLEDGMENTS

The researchers from the Universidad Jaume I were supported by the Universitat Jaume I research project P11B2013-21, project TIN2011-23283, and FEDER. The researcher from the Barcelona Supercomputing Center (BSC-CNS) was supported by the European Commission (HiPEAC-3 Network of Excellence, FP7-ICT 287759), the Intel-BSC Exascale Lab collaboration, the IBM/BSC Exascale Initiative collaboration agreement, Computación de Altas Prestaciones VI (TIN2012-34557), and the Generalitat de Catalunya (2014-SGR-1051).
REFERENCES

[1] G. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, pp. 114–117, 1965.
[2] M. Duranton et al., “HiPEAC vision 2015. High performance and embedded architecture and compilation,” 2015.
[3] R. Lucas et al., “Top ten Exascale research challenges,” 2014.
[4] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and
D. Burger, “Dark silicon and the end of multicore scaling,” in Proc.
38th Annual Int. Symp. on Computer architecture, ser. ISCA’11, 2011,
pp. 365–376.
[5] “The TOP500 list,” 2015. [Online]. Available:
[6] “The GREEN500 list,” 2015. [Online]. Available: http://www.green500.
[7] NVIDIA Corporation, CUDA API Reference Manual Version 6.5, 2014.
[8] A. J. Peña, “Virtualization of accelerators in high performance clusters,” Ph.D. dissertation, Universitat Jaume I, Castellón, Spain, Jan. 2013.
[9] A. J. Peña, C. Reaño, F. Silla, R. Mayo, E. S. Quintana-Ortí, and J. Duato, “A complete and efficient CUDA-sharing solution for HPC clusters,” Parallel Computing, vol. 40, no. 10, pp. 574–588, 2014.
[10] “OmpSs project home page,”
[11] A. Kawai, K. Yasuoka, K. Yoshikawa, and T. Narumi, “Distributed-
shared CUDA: Virtualization of large-scale GPU systems for pro-
grammability and reliability,” in The Fourth International Conference
on Future Computational Technologies and Applications (FUTURE
COMPUTING), 2012, pp. 7–12.
[12] L. Shi, H. Chen, J. Sun, and K. Li, “vCUDA: GPU-accelerated high-
performance computing in virtual machines,” IEEE Transactions on
Computers, vol. 61, no. 6, pp. 804–816, 2012.
[13] S. Xiao, P. Balaji, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen,
J. Hong, and W. Feng, “VOCL: An optimized environment for trans-
parent virtualization of graphics processing units,” in Innovative Parallel
Computing (InPar). IEEE, 2012.
[14] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, “SnuCL: an OpenCL
framework for heterogeneous CPU/GPU clusters,” in 26th International
Conference on Supercomputing (ICS). ACM, 2012, pp. 341–352.
[15] “Cilk project home page,”
[16] “StarPU project home page,”
[17] A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla, “On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications,” in The Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY), April 2014, pp. 57–62.
[18] S. Iserte, A. Castelló, R. Mayo, E. S. Quintana-Ortí, C. Reaño, J. Prades, F. Silla, and J. Duato, “SLURM support for remote GPU virtualization: Implementation and performance study,” in Proceedings of the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Paris, France, Oct. 2014.
[19] HP Corp., Intel Corp., Microsoft Corp., Phoenix Tech. Ltd., and Toshiba
Corp., “Advanced configuration and power interface specification, revi-
sion 5.0,” 2011.
[20] R. Blumofe and C. Leiserson, “Scheduling multithreaded computations by work stealing,” in Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, New Mexico, November 1994, pp. 356–368.
... In [19] we introduced a first study of simple OpenACC directives on top of rCUDA. In this paper we extend an in-depth analysis of the OmpSs programming model over virtualized remote GPUs [20] with an analogous study for the OpenACC programming model. ...
... The results indicate that each data movement, respectively, added 0.192 and 5.283 ms to the rCUDA execution time. The study can be found in [20]. ...
Full-text available
Directive-based programming models, such as OpenMP, OpenACC, and OmpSs, enable users to accelerate applications by using coprocessors with little effort. These devices offer significant computing power, but their use can introduce two problems: an increase in the total cost of ownership and their underutilization because not all codes match their architecture. Remote accelerator virtualization frameworks address those problems. In particular, rCUDA provides transparent access to any graphic processor unit installed in a cluster, reducing the number of accelerators and increasing their utilization ratio. Joining these two technologies, directive-based programming models and rCUDA, is thus highly appealing. In this work, we study the integration of OmpSs and OpenACC with rCUDA, describing and analyzing several applications over three different hardware configurations that include two InfiniBand interconnections and three NVIDIA accelerators. Our evaluation reveals favorable performance results, showing low overhead and similar scaling factors when using remote accelerators instead of local devices.
... rCUDA is structured following a client-server distributed architecture and its client exposes the same interface as the regular NVIDIA CUDA 6.5 release (NVIDIA Corp., ).With this middleware, applications are not aware that they are executed on top of a virtualization layer. To deal with new GPU programming models, rCUDA has been recently extended to accommodate directivebased models such as OmpSs (Castelló et al., 2015a) and OpenACC (Castelló et al., 2015b). The integration of remote GPGPU virtualization with global resource schedulers such as SLURM (Iserte et al., 2014) completes this appealing technology, making accelerator-enabled clusters more flexible and energyefficient . ...
Conference Paper
Full-text available
The use of accelerators, such as graphics processing units (GPUs), to reduce the execution time of compute-intensive applications has become popular during the past few years. These devices increment the computational power of a node thanks to their parallel architecture. This trend has led cloud service providers as Amazon or middlewares such as OpenStack to add virtual machines (VMs) including GPUs to their facilities instances. To fulfill these needs, the guest hosts must be equipped with GPUs which, unfortunately, will be barely utilized if a non GPU-enabled VM is running in the host. The solution presented in this work is based on GPU virtualization and shareability in order to reach an equilibrium between service supply and the ap-plications' demand of accelerators. Concretely, we propose to decouple real GPUs from the nodes by using the virtualization technology rCUDA. With this software configuration, GPUs can be accessed from any VM avoiding the need of placing a physical GPUs in each guest host. Moreover, we study the viability of this approach using a public cloud service configuration, and we develop a module for OpenStack in order to add support for the virtualized devices and the logic to manage them. The results demonstrate this is a viable configuration which adds flexibility to current and well-known cloud solutions.
N-body simulations consist of computing the mutual gravitational forces exerted on each body, at a cost of O(N) per body. The Barnes-Hut approximation allows processing a group of bodies in O(1) if they are far enough from a given body, which drops the complexity of the whole simulation to O(N log N). An octree is used to ease the pruning process, but at the cost of some irregularity in the access pattern. In a parallel N-body implementation, the bodies are partitioned among threads that are executed on multiple cores. A depth-first traversal of the octree is used for processing each body, which causes repeated cache misses during traversal. This paper proposes different types of tiling methods to improve the performance of N-body simulations. It presents an experimental analysis of the octree traversal under these tiling methods to identify the potential for cache data reuse. It then evaluates these tiling methods for varying tile sizes with different galaxy sizes and a varying number of threads on several machine architectures. The efficiency of the tiling approaches depends on the chosen tile size: a speedup of 8x can be achieved by choosing the appropriate tile size on a 60-core Intel accelerator. In order to determine an appropriate tile size, the paper proposes an adaptive tiling approach that implicitly adapts the tile size to the distribution of threads, the cache capacity, the cache latency, the problem size, and dynamic changes in the access pattern over the iterations. The proposed adaptive tiling approach can be used as an optimization option in parallel compilers.
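The "far enough" test that makes the Barnes-Hut pruning possible is usually the opening-angle criterion: a cell of side s at distance d from a body may be summarized by its center of mass when s/d falls below a threshold θ. A minimal sketch, assuming a θ of 0.5 and an illustrative function name (`accept_cell` is not from the paper):

```python
import math

def accept_cell(body, cell_center, cell_size, theta=0.5):
    """Barnes-Hut opening criterion: treat the whole cell as a single
    pseudo-body when its angular size s/d falls below theta; otherwise
    the cell must be opened and its eight children visited."""
    d = math.dist(body, cell_center)
    return d > 0 and cell_size / d < theta

# A far-away 1x1x1 cell can be summarized by its center of mass ...
far = accept_cell((0.0, 0.0, 0.0), (100.0, 0.0, 0.0), 1.0)
# ... while a nearby cell of the same size has to be opened.
near = accept_cell((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 1.0)
```

Since the criterion is evaluated per body during the depth-first descent, the traversal order (and hence the tiling of bodies) determines how often the same cells are revisited, which is precisely the cache-reuse opportunity the paper exploits.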
Over the last decades, the number of cores per processor has increased steadily, reaching impressive counts such as the 260 cores per socket in the Sunway TaihuLight supercomputer. This hardware evolution requires an extra effort to extract all the on-node computational power via concurrent programming models (PMs) and applications. Moreover, this trend indicates that future exascale systems will elevate this massive on-node parallelism to thousands of cores per socket, and that hardware will require efficient libraries and PMs. One of the most popular approaches to obtain acceptable on-node parallel performance relies on the use of operating system (OS) threads, via high-level PMs such as OpenMP or via the Pthreads application programming interface (API). Unfortunately, the Pthreads API fails to accommodate new software paradigms that target dynamically scheduled and fine-grained parallelism. In contrast, several lightweight thread (LWT) libraries have been proposed in recent years to tackle fine-grained and dynamic software requirements. These libraries are based on the concept of lighter threads that are managed by OS threads in user space, so that the overhead of mechanisms such as context switching is almost negligible. Some LWT solutions are ConverseThreads, Nanos++, MassiveThreads, Qthreads, and Argobots. LWT libraries demonstrate semantic and performance benefits over classic Pthreads. However, the variety of LWT libraries hinders portability and restricts their usage to certain solutions; this lack of portability also reduces the use of LWT implementations in the field of high-performance computing (HPC). In this scenario, a unified standard interface can be highly beneficial, as long as it supports most of the functionality offered by the LWT libraries while maintaining their performance.
Moreover, the wide adoption of Pthreads, both as a low-level API and as the base for high-level PMs, increases the effort required to give visibility to those LWT solutions. Therefore, high-level PMs and the Pthreads API implemented on top of a unified LWT API are necessary to achieve wider adoption. This thesis aims to promote the use of LWT solutions by tackling the problem of portability via a common API. In addition, this work provides solutions to easily migrate from current high-level PM implementations to LWT-based solutions without code modifications. More concretely, the contributions of the thesis are: 1) a decomposition of several threading solutions from a semantic point of view, identifying the strong and weak points of each; 2) the design and implementation of a unified LWT API, named Generic Lightweight Threads (GLT), that groups the functionality of general-purpose LWT solutions for HPC under the same PM; 3) the implementation of a complete interaction between the existing Pthreads API and the new GLT API; and 4) the design and implementation of OpenMP and OmpSs runtimes on top of the GLT API, called Generic Lightweight Thread OpenMP (GLTO) and Generic Lightweight Thread OmpSs (GOmpSs), respectively.
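The unified-API idea can be sketched as a thin façade: user code targets one generic spawn/join interface, while the backend library underneath is interchangeable. The sketch below uses Python's `threading` module to stand in for an OS-thread or LWT backend; the class and method names are illustrative assumptions, not the actual GLT signatures.

```python
import threading
import queue

class GenericThreadAPI:
    """Façade in the spirit of a unified LWT interface: callers program
    against one API while the backend (here, plain OS threads) could be
    swapped for an LWT library such as Qthreads or Argobots."""
    def __init__(self, backend="os-threads"):
        self.backend = backend  # a real implementation would dispatch on this

    def spawn(self, fn, *args):
        t = threading.Thread(target=fn, args=args)
        t.start()
        return t

    def join(self, handle):
        handle.join()

# Unchanged user code: spawn four work units and combine their results.
api = GenericThreadAPI()
results = queue.Queue()
handles = [api.spawn(lambda i=i: results.put(i * i)) for i in range(4)]
for h in handles:
    api.join(h)
total = sum(results.get() for _ in range(4))  # 0 + 1 + 4 + 9
```

The portability claim in the abstract amounts to keeping the code above unchanged while only the façade's backend varies.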
SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs in execution in a cluster. However, SLURM is not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can use a generic resource plug-in (GRes) to manage GPUs, with this solution the hardware accelerators can only be accessed by the job that is in execution on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim at providing user-transparent access to all the GPUs in the cluster, independently of the location of the node where the application is running with respect to the GPU node. In this work we introduce a new type of device in SLURM, "rgpu", in order to gain access from any application node to any GPU node in the cluster using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a user can access any number of GPUs, as SLURM schedules the tasks taking into account all the graphics accelerators available in the complete cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
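The difference between node-local GRes scheduling and cluster-wide "rgpu" scheduling can be illustrated with a toy allocator. Everything here (the cluster layout, the two policy functions) is an invented illustration of the two policies, not SLURM's actual plug-in logic.

```python
# Toy cluster: which node hosts which GPUs.
gpus_by_node = {"node0": ["gpu0"], "node1": [], "node2": ["gpu1", "gpu2"]}

def schedule_local(job_node, n_gpus):
    """GRes-style policy: a job may only use GPUs physically attached
    to the node it runs on."""
    avail = gpus_by_node[job_node]
    return avail[:n_gpus] if len(avail) >= n_gpus else None

def schedule_rgpu(job_node, n_gpus):
    """'rgpu'-style policy: with remote virtualization, every GPU in
    the cluster is reachable, regardless of the job's node."""
    avail = [g for gpus in gpus_by_node.values() for g in gpus]
    return avail[:n_gpus] if len(avail) >= n_gpus else None

# A 2-GPU job placed on the GPU-less node1 fails under the local policy ...
local = schedule_local("node1", 2)
# ... but succeeds when the whole cluster's GPUs are schedulable.
remote = schedule_rgpu("node1", 2)
```

This is the flexibility gain the abstract refers to: the scheduler's candidate set grows from one node's accelerators to the whole cluster's.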
Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, remote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several "heterogeneous" scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
Graphics processing units (GPUs) have been widely used to accelerate general-purpose computation. However, current programming models such as CUDA and OpenCL can support GPUs only on the local computing node, where the application execution is tightly coupled to the physical GPU hardware. In this work, we propose a virtual OpenCL (VOCL) framework to support the transparent utilization of local or remote GPUs. This framework, based on the OpenCL programming model, exposes physical GPUs as decoupled virtual resources that can be transparently managed independently of the application execution. The proposed framework requires no source code modifications. We also propose various strategies for reducing the overhead caused by data communication and kernel launching, and demonstrate about 85% of the data write bandwidth and 90% of the data read bandwidth of a native, nonvirtualized environment. We evaluate the performance of VOCL using four real-world applications with various computation and memory access intensities, and demonstrate that compute-intensive applications can execute with negligible overhead in the VOCL environment.
Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
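The power-limited speedup argument can be made concrete with a deliberately simplified Amdahl-style model in which the power budget only allows a fraction of the cores to be active at once; the remainder of the chip stays "dark". The formula and the example numbers below are illustrative, not the paper's detailed device/core/multicore model.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Classic Amdahl's law for a symmetric multicore: the serial
    fraction bounds the achievable speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

def dark_silicon_speedup(parallel_fraction, cores, powered_fraction):
    """Same law, but the power budget only lets a fraction of the
    cores run simultaneously; the rest of the chip stays dark."""
    active = max(1, int(cores * powered_fraction))
    return amdahl_speedup(parallel_fraction, active)

# 64 cores, 95% parallel code: capping power to half the chip
# sacrifices a sizable share of the ideal multicore speedup.
ideal = amdahl_speedup(0.95, 64)
capped = dark_silicon_speedup(0.95, 64, 0.5)
```

Even this crude model shows the paper's qualitative point: once the powered fraction shrinks with each process generation, adding cores yields rapidly diminishing returns.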
In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, as well as permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator’s policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated.
In this paper, we propose SnuCL, an OpenCL framework for heterogeneous CPU/GPU clusters. We show that the original OpenCL semantics naturally fits the heterogeneous cluster programming environment, and that the framework achieves high performance and ease of programming. The target cluster architecture consists of a designated, single host node and many compute nodes, connected by an interconnection network such as Gigabit Ethernet or InfiniBand switches. Each compute node is equipped with multicore CPUs and multiple GPUs; a set of CPU cores or each GPU becomes an OpenCL compute device. The host node executes the host program in an OpenCL application. SnuCL provides the user with a single-system image running a single operating system instance for the heterogeneous CPU/GPU cluster. It allows the application to utilize compute devices in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. SnuCL also provides collective communication extensions to OpenCL to facilitate manipulating memory objects. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.
This paper describes vCUDA, a general-purpose graphics processing unit (GPGPU) computing solution for virtual machines (VMs). vCUDA allows applications executing within VMs to leverage hardware acceleration, which can be beneficial to the performance of a class of high-performance computing (HPC) applications. The key insights in our design are API call interception and redirection, together with a dedicated RPC system for VMs. With API interception and redirection, Compute Unified Device Architecture (CUDA) applications in VMs can access the graphics hardware device and achieve high computing performance in a transparent way. In the current study, vCUDA achieved near-native performance with the dedicated RPC system. We carried out a detailed analysis of the performance of our framework. Using a number of unmodified official examples from the CUDA SDK and third-party applications in the evaluation, we observed that CUDA applications running with vCUDA exhibited a very low performance penalty in comparison with the native environment, thereby demonstrating the viability of the vCUDA architecture.
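API call interception and redirection, the technique named above, can be sketched as swapping the entry an unmodified caller resolves at call time so the call transparently reaches a redirected implementation. In native code this is typically done via LD_PRELOAD/dlsym symbol override; the Python sketch below stands in for that mechanism, and all names (`api_table`, the forwarded stub) are illustrative, not vCUDA's actual internals.

```python
# Unmodified "application" code resolves the API through a call table
# and is oblivious to whether the target is local or redirected.
calls_log = []

def native_cuda_malloc(size):
    # Stands in for the real driver entry point on a native machine.
    return ("local-devptr", size)

api_table = {"cudaMalloc": native_cuda_malloc}

def app_allocate(size):
    # The application's code path: it never changes.
    return api_table["cudaMalloc"](size)

def intercepted_cuda_malloc(size):
    # vCUDA-style redirection: record/forward the call (the log stands
    # in for the RPC channel to the host-side service owning the GPU).
    calls_log.append(("cudaMalloc", size))
    return ("vm-forwarded-devptr", size)

api_table["cudaMalloc"] = intercepted_cuda_malloc  # interception point

ptr = app_allocate(256)
```

Because only the table entry changes, the application binary needs no modification, which is what makes this class of solutions transparent to guest software.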