Exploiting Task-Parallelism on GPU Clusters
via OmpSs and rCUDA Virtualization
Adrián Castelló*, Rafael Mayo*, Judit Planas†, Enrique S. Quintana-Ortí*
*Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12071–Castellón, Spain
Emails: {adcastel,mayo,quintana}@uji.es
†Barcelona Supercomputing Center (BSC-CNS), 08034–Barcelona, Spain
Email: judit.planas@bsc.es
Abstract—OmpSs is a task-parallel programming model con-
sisting of a reduced collection of OpenMP-like directives, a
front-end compiler, and a runtime system. This directive-based
programming interface helps developers accelerate their ap-
plication’s execution, e.g. in a cluster equipped with graphics
processing units (GPUs), with a low programming effort. On the
other hand, the virtualization package rCUDA provides seamless
and transparent remote access to any CUDA GPU in a cluster,
via the CUDA Driver and Runtime programming interfaces.
In this paper we investigate the hurdles and practical ad-
vantages of combining these two technologies. Our experimental
study targets two cluster configurations: a system where all the GPUs are located in a single cluster node; and a cluster with
the GPUs distributed among the nodes. Two applications, the N-
body particle simulation and the Cholesky factorization of a dense
matrix, are employed to expose the bottlenecks and performance
of a remote virtualization solution applied to these two OmpSs
task-parallel codes.
Keywords—Task parallelism; graphics processing units (GPUs);
OmpSs; CUDA; remote virtualization.
I. INTRODUCTION
Heterogeneous computing has arisen as a powerful ap-
proach to leverage the doubling of transistors on chip dictated
by Moore’s law [1], while facing the challenge posed by the
power wall [2], [3]. In the middle of the past decade, this event
marked the end of the “GHz race” and the shift towards mul-
ticore processor designs. More recently, it has promoted “dark
silicon” [4] and the deployment of heterogeneous facilities for
high performance computing [5], [6].
Heterogeneous clusters, equipped with general-purpose
multicore processors (CPUs) and many-core graphics process-
ing units (GPUs), have been broadly adopted over the last few
years. This is motivated by their favourable price-performance-
power ratio as well as the availability of relatively simple par-
allel application programming interfaces (APIs), programming
tools and libraries. One of the most accepted APIs for this
purpose is CUDA [7], developed by NVIDIA. In addition, this
technology has demonstrated remarkable performance for the
solution of many scientific and engineering problems with high
computational requirements. In consequence, some of the most
powerful supercomputers in the world are currently built using
heterogeneous CPU-GPU servers [5].
The conventional approach to build a GPU-enabled cluster
furnishes each (cluster) node with one or more of these devices.
Unfortunately, this configuration often yields a low utilization
of the accelerators, due to mismatches between the type of
concurrency featured by many applications and the GPU archi-
tecture. rCUDA [8], [9] is a remote virtualization framework
that addresses this problem by transforming GPU-equipped
nodes into a pool of GPU servers. With this technology
the servers become available to all compute nodes (clients),
rendering a higher overall utilization of the cluster GPUs.
Task-parallelism is an efficient means to exploit the increasing number of cores in current CPUs, and it has gained acceptance as a parallelism model that allows more flexible codes. OmpSs [10], developed by BSC, is a programming model whose runtime schedules tasks, allowing a complex code to be decomposed into a set of simple tasks by adding compiler directives to the code.
In this paper we investigate the effectiveness of GPU
remote virtualization for task-parallel applications, identifying
the difficulties encountered during the integration of OmpSs
with rCUDA. To validate the study, we analyze the perfor-
mance of two scientific codes, specifically the solution of
the N-body problem and the Cholesky factorization for dense
linear systems, parallelized via OmpSs and executed on top of
rCUDA. Furthermore, in the experimentation we consider two
cluster configurations (scenarios) consisting of a pool of GPUs
located in a single node and a cluster of GPU server nodes with
one and several GPUs per node. Our results reveal affordable
overheads and fair scaling trends for OmpSs task-parallel
applications making use of remote graphics accelerators via
rCUDA instead of local hardware.
The rest of the paper is structured as follows. In Section II
we review work related to our research. In Section III we offer
a brief summary of the tools combined in this work: OmpSs
and rCUDA. In Section IV we describe the ongoing effort
towards their integration, and in Section V we evaluate and
estimate the performance of the result. Finally, in Section VI
we close the paper with a few concluding remarks and a
discussion of future work.
II. RELATED WORK
With the aim of reducing the total cost of ownership of clusters, several GPU virtualization frameworks have been developed in recent years. These technologies share the same goal: enabling remote access to the GPU facilities of a cluster. rCUDA [8], [9] is a production-ready framework that offers support for the latest CUDA revisions and provides wide coverage of current GPU APIs. Compared with other CUDA and OpenCL virtualization tools [11], [12], [13], [14], rCUDA is more mature and is up-to-date with CUDA versions.
Fig. 1: rCUDA modular architecture (client side: application, rCUDA libraries, communications; server side: rCUDA server daemon, communications, GPU; both sides connected through the network).
On the other hand, several efforts, pioneered by Cilk [15],
aim to ease the development and improve the performance of
task-parallel programs by embedding task scheduling policies
inside a runtime. OmpSs [10] and StarPU [16] have demon-
strated the benefits of this technology to execute a task-parallel
application on a single server equipped with multiple GPUs.
III. BRIEF REVIEW OF TECHNOLOGIES
A. Extracting task-parallelism via OmpSs
OmpSs is a task-based programming model, developed
at the Barcelona Supercomputing Center, that detects data
dependencies between tasks at execution time, with the help
of directionality clauses embedded in the code as OpenMP-
like directives. OmpSs leverages this information to generate
a task graph during the execution that is next employed by the
threads to exploit the task parallelism implicit to the operation,
via a dynamic out-of-order but dependency-aware schedule.
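As an illustration of this model, the sketch below (our own example, loosely following BSC's published OmpSs samples, not code from this paper; saxpy_kernel is a hypothetical kernel whose body would be compiled by nvcc) shows how a CUDA kernel can be exposed as an OmpSs task: the in/inout clauses declare what each task reads and writes, so that Nanos++ can build the task graph and, via copy_deps, trigger the required host-device transfers.

```cpp
// Hedged sketch: exposing a CUDA kernel as an OmpSs task.
// The exact clause syntax may differ slightly across OmpSs releases.

// ndrange(1, n, 128) lets the runtime derive a 1D launch configuration
// with n work items and a block size of 128 threads.
#pragma omp target device(cuda) copy_deps ndrange(1, n, 128)
#pragma omp task in([n]x) inout([n]y)
__global__ void saxpy_kernel(int n, float a, const float *x, float *y);

void saxpy(int n, float a, const float *x, float *y)
{
    // Calling the annotated kernel spawns an asynchronous GPU task;
    // Nanos++ schedules it once its input dependencies are satisfied.
    saxpy_kernel(n, a, x, y);
    #pragma omp taskwait   // block until all outstanding tasks complete
}
```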
B. Remote virtualization via rCUDA
rCUDA is a virtualization middleware that enables seam-
less access to any CUDA device in a cluster from any compute
node. In particular, with rCUDA a GPU can be shared among
several nodes and a single node can use all graphic accelerators
available in the facility as if they were local. rCUDA aims
to deliver higher accelerator utilization rates in the overall
system in order to reduce resource, space and energy re-
quirements [17], [18]. The software is structured following
a classical client-server distributed architecture, illustrated in
Figure 1. In rCUDA the client middleware is embedded in the code of the application demanding GPU acceleration services, providing a transparent replacement for the native CUDA libraries, while the server middleware runs on the cluster nodes whose GPUs deliver the requested acceleration service.
The rCUDA client exposes the same interface as the regular
NVIDIA CUDA 6.5 release [7], including the Runtime and
Driver APIs. With this solution, applications are not aware
that they run on top of a virtualization layer.
IV. INTEGRATION OF OMPSS AND RCUDA
In this section we describe the effort towards the integration of OmpSs and rCUDA, focusing on the main difficulties encountered during this process due to the differences in behaviour between OmpSs+CUDA and OmpSs+rCUDA, as well as on the intermediate results of this ongoing project.
A. CUDA APIs and compilation chain
rCUDA supports the Runtime and Driver APIs native to
CUDA, capturing the invocations to these interfaces in the
client and forwarding them to the appropriate GPU server. On
the other hand, OmpSs only interacts with the local GPUs
via the CUDA Runtime API. Therefore, integrating these
two frameworks should be, in principle, effortless: OmpSs
relies on the Nanos++ runtime1 and the Mercurium C/C++
compiler, which is a front-end source-to-source transformer
that relies on NVIDIA’s nvcc compiler and any x86 compiler
(in our case, GNU’s g++). Thus, the compilation processes
to generate OmpSs-based task-parallel code for a local GPU
(OmpSs+CUDA) or a remote accelerator (OmpSs+rCUDA)
differ only in the final stage, where the output from the
compilation needs to be linked with rCUDA instead of CUDA
in order to execute the binary over remote GPUs.
B. Initialization of CUDA functions
One of the aspects of rCUDA that had to be modified is the point at which the CUDA kernels and the Runtime and Driver functions are loaded. Concretely, CUDA performs this
initialization at the beginning of the application’s execution
while, in the OmpSs framework, this event occurs the first
time that a GPU task (i.e. any work item that is targeted to a
GPU) is issued for execution.
rCUDA was designed with CUDA in mind, and it is
therefore natural that its initialization procedure mimics that of
CUDA. Nevertheless, in order to integrate OmpSs code with
rCUDA, this behaviour had to be modified. In particular, for each thread created on the client side, the first call to any kernel configuration routine triggers a load of the corresponding modules in the associated GPU server. From then on, any further kernel configuration call directed to the same GPU server has no effect.
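The following minimal sketch (our own illustration, not actual rCUDA code; the server-side request is hypothetical and only indicated as a comment) captures the modified behaviour: each client thread keeps its own flag, so only its first kernel-configuration call triggers the module load on the associated GPU server.

```cpp
// Minimal sketch of per-thread lazy module loading on the rCUDA client.
thread_local bool modules_loaded = false;   // one flag per client thread

// Hypothetical hook executed whenever the client intercepts a kernel
// configuration call directed to a given GPU server.
void on_kernel_configuration_call(/* server connection, fat binary, ... */)
{
    if (!modules_loaded) {
        // First configuration call from this thread: request the GPU
        // server to load the CUDA modules (kernels, Runtime/Driver symbols).
        // send_load_modules_request(...);   // hypothetical RPC
        modules_loaded = true;
    }
    // Any further configuration call from this thread is a no-op here.
}
```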
C. Avoiding communication overhead
Current hardware from NVIDIA can demote idle GPUs from a high-performance active state to an energy-saving passive one, following the same approach that, for several years now, has been intensively exploited in x86 architectures via the C-states defined in the ACPI standard [19]. When this change of state happens, the next CUDA call takes longer because the GPU driver first needs to reactivate the GPU before executing the CUDA function. OmpSs favours high performance over energy efficiency, preventing GPUs from entering an energy-saving state via regular (i.e., periodic) calls to the cudaFree function with a null pointer argument. Such a call has no effect other than keeping the GPU active, even when idle.
1 http://www.bsc.es/computer-sciences/programming-models/nanos-environment
Fortunately, rCUDA does not need this mechanism, which would otherwise create a certain communication overhead in the network by introducing a stream of short messages between the client and the remote GPU server. In particular, the rCUDA server runs on the GPU server node, thus preventing the GPU from entering the passive state. In addition, we modified the rCUDA client middleware to intercept all cudaFree calls and forward to the server only those that are true memory deallocations, discarding the unnecessary keep-alive commands issued in the OmpSs+rCUDA scenario.
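A minimal sketch of this filter is shown below (our own reconstruction, assuming the rCUDA client interposes its own cudaFree; the forwarding call to the server is hypothetical and only indicated as a comment): null-pointer keep-alive calls are absorbed locally, whereas genuine deallocations are still forwarded to the server.

```cpp
#include <cuda_runtime.h>

// Sketch of the client-side cudaFree wrapper.
extern "C" cudaError_t cudaFree(void *devPtr)
{
    if (devPtr == nullptr) {
        // OmpSs keep-alive call: nothing to deallocate, so do not
        // generate any network traffic towards the GPU server.
        return cudaSuccess;
    }
    // Genuine deallocation: forward the request to the remote server.
    // return forward_cudafree_to_server(devPtr);   // hypothetical RPC
    return cudaSuccess;                              // placeholder
}
```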
D. Ongoing work
OmpSs features a work scheduling policy [20] in which
tasks are initially distributed among the threads via private
queues but, when a thread runs out of work, it polls the queues
from other threads and “steals” part of the tasks in one of them.
When GPUs are involved, due to the existence of a separate memory address space per GPU, stealing a task may imply data transfers between the memories of two GPUs. These copies are carried out in OmpSs via calls to cudaMemcpyPeer, which directly transfers the data from the memory of the source GPU to the memory of the destination GPU. This kind of data transfer is only available to threads created within the same process.
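For reference, the sketch below (our own example; device identifiers and the buffer are illustrative) shows how such a copy is expressed with the CUDA Runtime API.

```cpp
#include <cuda_runtime.h>

// Copy a block directly from the memory of GPU src_dev to GPU dst_dev.
void copy_block_between_gpus(void *dst, int dst_dev,
                             const void *src, int src_dev, size_t bytes)
{
    // Optional: enable direct peer access so the copy can bypass host memory.
    cudaSetDevice(dst_dev);
    cudaDeviceEnablePeerAccess(src_dev, 0);

    // Device-to-device copy between the two GPUs.
    cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
}
```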
Unfortunately, the current version of rCUDA does not
support this type of data transfers when the origin and des-
tination spaces in memory have been allocated by different
threads. The reason is that, for each thread present on the
client side, rCUDA spawns a new process on the server side.
Each server process is thus placed inside a different GPU
context and, in this scenario, memory allocated from within
this context is not visible to other contexts/processes. This
problem basically arises because rCUDA was initially designed
before version 4.0 of CUDA, which was the first release that
allowed concurrent access to a GPU from more than one
thread.
In order to tackle this problem, and allow the invocation to
cudaMemcpyPeer when the origin and destination spaces
in memory have been allocated by different threads, several
modifications are needed in the current version of rCUDA, in
both the server and client sides. The changes in the server side
are nontrivial, because GPU servers should now distinguish
between the main rCUDA process and the threads spawned
from inside a CUDA application, in order to create or not a
new context respectively.
Each process or thread inside a process spawned during
a CUDA application execution is considered to be an inde-
pendent rCUDA client, with its own connection to a process
in the server side. Therefore, a client should be aware of the
existence of other clients, as communication between clients
is needed in order to perform a remote GPU memory transfer.
Moreover, due to the overhead introduced by the network, the
solution should not implement device-to-device memory data
transfers between servers through the client.
rCUDA mainly targets two types of cluster configurations:
a system with a pool of GPUs connected to a single node or
a cluster where several (or all) nodes are equipped with one
or more GPUs; see Figure 2. Therefore, a complete solution
needs to consider the following situations:
Fig. 2: rCUDA cluster configurations: (a) GPU pool configuration; (b) several GPUs per node.
• A GPU GA in a server accessed by two remote clients from different nodes. In this case, an intra-node communication between the two processes running on the server is needed.
• Two GPUs GA and GB attached to the same server node. As in the previous situation, an intra-node communication between the two processes in the GPU server is needed.
• Two GPUs GA and GB attached to different server nodes (only for the cluster of GPU nodes). GPUDirect RDMA (GDR) could be used in this situation to grant access from GA to the remote memory of GB (or vice versa). This communication can be implemented, e.g., on top of MPI, as sketched below.
These situations must be carefully analyzed as several options
are possible to perform this type of GPU memory transfers.
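As an illustration of the third situation, the sketch below (our own example, not part of the current rCUDA implementation) moves a buffer between the two server processes using a CUDA-aware MPI library, which can exploit GPUDirect RDMA when available; ranks, the tag and the buffer are illustrative.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Move 'bytes' bytes of GPU memory between two rCUDA server processes
// running on different nodes. A CUDA-aware MPI implementation accepts
// device pointers directly, so no explicit staging in host memory is needed.
void exchange_between_servers(void *d_buf, size_t bytes,
                              int my_rank, int peer_rank)
{
    const int tag = 0;
    if (my_rank < peer_rank) {
        // Source server: send the device buffer to the peer server.
        MPI_Send(d_buf, (int)bytes, MPI_BYTE, peer_rank, tag, MPI_COMM_WORLD);
    } else {
        // Destination server: receive straight into GPU memory.
        MPI_Recv(d_buf, (int)bytes, MPI_BYTE, peer_rank, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}
```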
V. EVALUATION OF REMOTE VIRTUALIZATION WITH TASK-PARALLEL APPLICATIONS
A. Experimental setup
For our experiments, we have employed two systems with
different GPU configurations:
• Tintorrum is a 2-node system, each node with two Intel Xeon E5520 (quad-core) processors at 2.27 GHz and 24 GB of DDR3-1866 RAM. One of these
nodes is connected to two NVIDIA C2050 boards, and the
other is furnished with four NVIDIA C2050 GPUs. Inter-
node communications employ an InfiniBand (IB) QDR
fabric.
• Minotauro is a cluster with 126 nodes, each with two
Intel Xeon E5649 processors (6 cores per processor), at
2.53 GHz, 24 GB of DDR3-1333 RAM, and connected
to two NVIDIA M2050 GPUs. The cluster network is IB
QDR.
rCUDA 5.0 and OmpSs 14.10 were used for both systems.
In Tintorrum, we installed CUDA 6.5. On the other hand,
Minotauro is a production system and the most recent
CUDA version available there is CUDA 5.0. The back-end
compiler was g++ 4.4.7 in Tintorrum, and g++ 4.4.4
in Minotauro.
Two applications were selected for the evaluation:
• N-body. This is the classical simulation of a dynamical
system of particles, under the influence of physical forces
(e.g., gravity), widely used in physics and astronomy.
For our tests we set the number of particles to 57,600, which corresponds to the largest problem size that fits into the memory of a single GPU of Tintorrum. By dividing the
domain into separate “blocks” (or tasks), and replicating
the “ghost” borders in spatially neighbour tasks, the code
turns out to be embarrassingly parallel. Thus, there are
no dependencies between tasks and, in general, there is
no benefit from using work stealing. In consequence,
OmpSs performs no copies between GPU memories via
cudaMemcpyPeer; see Section IV-D.
• Cholesky factorization. This is a crucial operation for
the solution of dense systems of linear equations with
symmetric positive definite coefficient matrix. In our
experiments we set the problem size to 45,056×45,056.
When operating with single precision data, this is the
largest matrix that fits into the memory of a single GPU
of Minotauro. OmpSs decomposes this operation into four types of kernels (Cholesky factorization of diagonal blocks, triangular system solve, symmetric rank-k update, and matrix multiplication), organized as a number of tasks with a rich collection of dependencies among them; a sketch of this decomposition is given after this list. In this application, work stealing is in place and involves data copies between the GPU memories, incurring the problem described in Section IV-D.
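The sketch below outlines this task decomposition (our own illustration: potrf_task, trsm_task, syrk_task and gemm_task are hypothetical GPU task wrappers whose bodies would invoke the corresponding CUBLAS/CUDA kernels, annotated as CUDA tasks in the style of Section III-A; the dependence-clause syntax is simplified).

```cpp
#define NB 2048   // algorithmic block size found to be optimal in Section V-C1

// Hypothetical task wrappers (annotated prototypes); their bodies, omitted
// here, would launch the corresponding CUBLAS/CUDA kernels on the GPU.
#pragma omp task inout([NB*NB]A)
void potrf_task(float *A);                     // Cholesky of a diagonal block
#pragma omp task in([NB*NB]A) inout([NB*NB]B)
void trsm_task(float *A, float *B);            // triangular system solve
#pragma omp task in([NB*NB]A) inout([NB*NB]B)
void syrk_task(float *A, float *B);            // symmetric rank-k update
#pragma omp task in([NB*NB]A, [NB*NB]B) inout([NB*NB]C)
void gemm_task(float *A, float *B, float *C);  // matrix multiplication

// A[i*nt+j] points to the NB x NB block (i,j); nt = blocks per dimension.
void cholesky_blocked(float **A, int nt)
{
    for (int k = 0; k < nt; k++) {
        potrf_task(A[k*nt + k]);
        for (int i = k + 1; i < nt; i++)
            trsm_task(A[k*nt + k], A[i*nt + k]);
        for (int i = k + 1; i < nt; i++) {
            for (int j = k + 1; j < i; j++)
                gemm_task(A[i*nt + k], A[j*nt + k], A[i*nt + j]);
            syrk_task(A[i*nt + k], A[i*nt + i]);
        }
    }
    #pragma omp taskwait   // wait for the whole factorization to complete
}
```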
B. N-body
This application was only tested in Tintorrum. For this
particular case study, there are no GPU memory transfers,
and thus we can directly compare the performance of the
OmpSs–accelerated code executed on top of either CUDA or
rCUDA. GPUs were “added” in the experiments with this 2-
node cluster by first populating the 4-GPU node with up to
4 server processes (one per GPU) and, from then on, the 2-
GPU node. Figure 3 reports the outcome from this evaluation,
showing a linear reduction of the execution time when up to 5
server processes/GPUs are employed. The speed-ups observed
there demonstrate the scalability of this embarrassingly parallel
application when combined with rCUDA in order to leverage
a reduced number of GPUs. The same plot also exposes
a notable increase of execution time when the 6th server
process/GPU is included in the experiment. The reason is that,
due to the particular architecture of the node with 2 GPUs,
the transfers between this last GPU and the IB network occur
through the QPI interconnect, which poses a considerable
bottleneck for this application.
Fig. 3: Execution time for N-body in Tintorrum.
Fig. 4: Synchronization time for N-body in Tintorrum.
An additional observation to emphasize in this experiment
is that OmpSs+rCUDA clearly outperforms OmpSs+CUDA.
Figure 4 demonstrates that the difference in favour of rCUDA
is due to the overhead introduced with the synchronization that
is enforced by OmpSs with the #pragma omp taskwait
directive. The reason is that rCUDA integrates an aggressive synchronization mechanism, based on continuous polling, which significantly benefits performance. In contrast, CUDA features a more relaxed synchronization mechanism with important negative consequences for the execution time of this application.
C. Cholesky factorization
As argued earlier, the task-parallel version of the Cholesky
factorization in OmpSs transfers data between GPU mem-
ories via cudaMemcpyPeer, but this feature is not yet
supported in rCUDA. In order to overcome this problem, while
still delivering a fair comparison between the scalability of
OmpSs+rCUDA vs OmpSs+CUDA, we will first determine
the overhead introduced by a device-to-device communication
for setup configurations corresponding to a pool of GPUs (in
a single node) and a cluster with several GPUs per node. In
the first scenario, these copies occur between GPUs in the same node while, in the second one, they may also involve the interconnect, as the GPUs participating in the transfer may be in different nodes.
Fig. 5: Device-to-device memory bandwidth using cudaMemcpyPeer in Minotauro.
1) Pools of GPUs: When all the GPUs are collected into a single server node, any memory transfer between the memories of two accelerators only involves this node and can be implemented, for instance, on top of GPUDirect 2.0. In order to
estimate the cost of these intra-node transfers, we conducted a
bandwidth test with the cudaMemcpyPeer routine. Figure 5
reveals that the lowest bandwidth attained with this test in
Minotauro is considerably superior to the throughput re-
ported for CUDA in Figure 6.
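The following sketch (our own reconstruction; buffer size and repetition count are illustrative, not the exact settings of the experiment) shows how such a bandwidth test can be written: repeated cudaMemcpyPeer calls between two GPUs are timed with CUDA events.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 16u << 20;   // 16 MB, i.e., one 2,048 x 2,048 SP block
    const int reps = 100;

    void *d0, *d1;
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; i++)
        cudaMemcpyPeer(d1, 1, d0, 0, bytes);   // copy from GPU 0 to GPU 1
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg time: %.3f ms  bandwidth: %.2f GB/s\n",
           ms / reps, (bytes / 1.0e9) / (ms / reps / 1.0e3));
    return 0;
}
```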
In a preliminary test, we also detected that the optimal al-
gorithmic block size for the Cholesky factorization using both
CUDA and rCUDA is 2,048. In practice, this implies that all
tasks operate (and communicate) with blocks of 2,048×2,048
(single precision) elements which occupy 16 MB. Therefore,
the throughput curve in Figure 5 indicates that the cost of a
direct data transfer of this dimension, between two GPUs in
the same node, can be expected to be around 0.192 ms in
Minotauro.
2) Multi-GPU cluster: In this scenario, transferring data between two GPUs involves the communication costs of both the PCIe link and, if the GPUs are located in different nodes, the interconnect. To estimate this aggregated cost, we ran a
test using the common CUDA transfer calls combined with
the ib_send_bw test (included in the OFED-3.12 package)
between the GPUs in two separate nodes of Minotauro.
Figure 6 shows that, for a 16-MB data chunk, CUDA (i.e.,
local transfers) outperforms the throughput of the IB QDR
network. Also, a transfer of a 16-MB block between two
GPUs located in different cluster nodes, via e.g. GDR, can
be expected to cost 5.283 ms in Minotauro.
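As a rough consistency check (our own arithmetic, using the 16 MB block size determined above), this estimated transfer time corresponds to an effective bandwidth of
\[
  \frac{16\ \mathrm{MB}}{5.283\ \mathrm{ms}} \approx 3.2\ \mathrm{GB/s} \approx 25\ \mathrm{Gb/s},
\]
which is in line with the roughly 32 Gb/s of payload bandwidth that an IB QDR link can sustain.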
3) Performance: For the evaluation of the Cholesky fac-
torization, we run the application on Minotauro and report
the GFLOPS metric, which reflects the throughput in terms
of billions of floating-point arithmetic operations (flops) per
second. For CUDA, the code is executed using 1 and 2 GPUs
in a single node of this cluster. With rCUDA, we employ up to
four nodes with one GPU per node. Unfortunately, an internal
problem in the current implementation of OmpSs, when linked with NVIDIA's mathematical library CUBLAS, impeded the experimentation with a larger number of GPUs.
Fig. 6: IB QDR and CUDA bandwidths in Minotauro. For the latter option, H2D and D2H respectively refer to the host-to-device and device-to-host communication throughput.
The results
from the experiment are reported in Figure 7. The three
lines with the labels rCUDA, rCUDA (intra) and rCUDA
(QDR) there correspond to the execution of the task-parallel
code, linked with rCUDA to operate in a distributed-memory
scenario, with three different by-passes to deal with the lack
of support for device-to-device copies. Concretely, in the first case, the copy is simply omitted, so that this line reflects the peak performance that could be observed if the cost of this type of copy were negligible. In the second case, we assume that inter-node communications proceed at the rate of intra-node copies, see subsection V-C1; this result thus approximates the performance that could be observed in a configuration where all GPUs were installed in the same node (pool of GPUs).
The third case considers that all the device-to-device copies
occur at the inter-node speed; see subsection V-C2. To estimate
the performance for the last two cases, the invocations to
cudaMemcpyPeer are intercepted and replaced by sleep
commands/tasks for the duration of the corresponding data
transfers.
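A minimal sketch of this interception follows (our own reconstruction, assuming the wrapper is interposed in front of the CUDA Runtime, e.g., via LD_PRELOAD, and that the emulated scenario is selected through a hypothetical environment variable); the delays are the estimates obtained in subsections V-C1 and V-C2, while the first case (rCUDA) simply drops the call with no delay.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <unistd.h>

// Estimated duration of a 16 MB device-to-device transfer, in microseconds.
static const useconds_t INTRA_NODE_US = 192;    // ~0.192 ms (pool of GPUs)
static const useconds_t INTER_NODE_US = 5283;   // ~5.283 ms (IB QDR cluster)

// Interposed wrapper: emulate the cost of the copy instead of performing it.
extern "C" cudaError_t cudaMemcpyPeer(void *dst, int dstDevice,
                                      const void *src, int srcDevice,
                                      size_t count)
{
    (void)dst; (void)dstDevice; (void)src; (void)srcDevice; (void)count;
    // Hypothetical selector for the scenario being emulated.
    const bool inter_node = std::getenv("EMULATE_INTER_NODE") != nullptr;
    usleep(inter_node ? INTER_NODE_US : INTRA_NODE_US);   // stand-in for the transfer
    return cudaSuccess;
}
```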
The performance results for rCUDA in the plot clearly
demonstrate the benefits of the remote virtualization approach
for this particular application. We note that the lines there
correspond to the GFLOPS per GPU and show a profile
that is almost flat. Therefore, an increase in the number of GPUs by a factor of g implies an increase in the aggregated (total) GFLOPS rate by almost the same factor and, in consequence, an inversely proportional reduction of the execution time, as observed in Figure 8. The overhead of OmpSs+rCUDA with respect to OmpSs+CUDA, when executed using a single local GPU, is roughly 5% (781 vs 737 GFLOPS, respectively).
VI. CONCLUDING REMARKS
We have presented our work-in-progress towards the in-
tegration of the OmpSs task-parallel programming language
with the rCUDA virtualization middleware. The combination
of these two technologies allows a more efficient use of the
resources in a GPU cluster, e.g., by enabling the deployment
Fig. 7: GFLOPS per GPU for the Cholesky factorization in
Minotauro.
Fig. 8: Execution time for the Cholesky factorization in
Minotauro.
of facilities where all the GPUs are located in a single node
(a pool of accelerators) and the remaining compute nodes
access/share them. In addition, when the GPUs are distributed among the cluster nodes, rCUDA frees applications from being constrained by the number of local GPUs, allowing them to exploit the acceleration power of any GPU in the system.
These ideal configurations have been tested using task-
parallel OmpSs-based codes for two complex applications,
namely an N-body simulation of particles and the Cholesky
factorization of a dense matrix. The results with these case
studies reveal that OmpSs+rCUDA is an appealing solution to
improve programming productivity as well as deliver higher
accelerator utilization rates, potentially reducing resource,
space and energy usage.
As future work, and as a natural extension of this effort, we plan to redesign rCUDA in order to accommodate the type of device-to-device communication that OmpSs embeds for high performance.
ACKNOWLEDGMENTS
The researchers from the Universidad Jaume I were sup-
ported by Universitat Jaume I research project (P11B2013-
21), project TIN2011-23283 and FEDER. The researcher from
the Barcelona Supercomputing Center (BSC-CNS) was sup-
ported by the European Commission (HiPEAC-3 Network of
Excellence, FP7-ICT 287759), Intel-BSC Exascale Lab collab-
oration, IBM/BSC Exascale Initiative collaboration agreement,
Computación de Altas Prestaciones VI (TIN2012-34557) and the Generalitat de Catalunya (2014-SGR-1051).
REFERENCES
[1] G. Moore, “Cramming more components onto integrated circuits,”
Electronics, vol. 38, no. 8, pp. 114–117, 1965.
[2] M. Duranton et al, “HiPEAC vision 2015. High performance and
embedded architecture and compilation,” 2015, http://www.hipeac.net/
vision.
[3] R. Lucas et al, “Top ten Exascale research challenges,” 2014,
http://science.energy.gov/∼/media/ascr/ascac/pdf/meetings/20140210/
Top10reportFEB14.pdf.
[4] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and
D. Burger, “Dark silicon and the end of multicore scaling,” in Proc.
38th Annual Int. Symp. on Computer architecture, ser. ISCA’11, 2011,
pp. 365–376.
[5] “The TOP500 list,” 2015. [Online]. Available: http://www.top500.org
[6] “The GREEN500 list,” 2015. [Online]. Available: http://www.green500.
org
[7] NVIDIA Corporation, CUDA API Reference Manual Version 6.5, 2014.
[8] A. J. Peña, "Virtualization of accelerators in high performance clusters," Ph.D. dissertation, Universitat Jaume I, Castellón, Spain, Jan. 2013.
[9] A. J. Peña, C. Reaño, F. Silla, R. Mayo, E. S. Quintana-Ortí, and J. Duato, "A complete and efficient CUDA-sharing solution for HPC clusters," Parallel Computing, vol. 40, no. 10, pp. 574–588, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167819114001227
[10] “OmpSs project home page,” http://pm.bsc.es/ompss.
[11] A. Kawai, K. Yasuoka, K. Yoshikawa, and T. Narumi, “Distributed-
shared CUDA: Virtualization of large-scale GPU systems for pro-
grammability and reliability,” in The Fourth International Conference
on Future Computational Technologies and Applications (FUTURE
COMPUTING), 2012, pp. 7–12.
[12] L. Shi, H. Chen, J. Sun, and K. Li, “vCUDA: GPU-accelerated high-
performance computing in virtual machines,” IEEE Transactions on
Computers, vol. 61, no. 6, pp. 804–816, 2012.
[13] S. Xiao, P. Balaji, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen,
J. Hong, and W. Feng, “VOCL: An optimized environment for trans-
parent virtualization of graphics processing units,” in Innovative Parallel
Computing (InPar). IEEE, 2012.
[14] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, “SnuCL: an OpenCL
framework for heterogeneous CPU/GPU clusters,” in 26th International
Conference on Supercomputing (ICS). ACM, 2012, pp. 341–352.
[15] “Cilk project home page,” http://supertech.csail.mit.edu/cilk/.
[16] “StarPU project home page,” http://runtime.bordeaux.inria.fr/StarPU/.
[17] A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla, "On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications," in The Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY), April 2014, pp. 57–62.
[18] S. Iserte, A. Castelló, R. Mayo, E. S. Quintana-Ortí, C. Reaño, J. Prades, F. Silla, and J. Duato, "SLURM support for remote GPU virtualization: Implementation and performance study," in Proceedings of the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Paris, France, Oct. 2014.
[19] HP Corp., Intel Corp., Microsoft Corp., Phoenix Tech. Ltd., and Toshiba
Corp., “Advanced configuration and power interface specification, revi-
sion 5.0,” 2011.
[20] R. Blumofe and C. Leiserson, “Scheduling multithreaded computations
by work stealing,” in Proceedings of the 35th Annual Symposium on
Foundations of Computer Science, Santa Fe, New Mexico., November
1994, pp. 356–368. [Online]. Available: citeseer.ist.psu.edu/article/blumofe94scheduling.html