Enabling GPU Virtualization in Cloud Environments
Sergio Iserte, Francisco J. Clemente-Castelló, Adrián Castelló, Rafael Mayo and Enrique S. Quintana-Ortí
Department of Computer Science and Engineering, Universitat Jaume I, Castelló de la Plana, Spain
{siserte, fclement, adcastel, mayo, quintana}
Keywords: Cloud Computing, GPU Virtualization, Amazon Web Services (AWS), OpenStack, Resource Management.
Abstract: The use of accelerators, such as graphics processing units (GPUs), to reduce the execution time of compute-intensive applications has become popular during the past few years. These devices increase the computational power of a node thanks to their parallel architecture. This trend has led cloud service providers such as Amazon, and middleware such as OpenStack, to add virtual machines (VMs) that include GPUs to their instance offerings. To fulfill these needs, the guest hosts must be equipped with GPUs which, unfortunately, will be barely utilized if a non-GPU-enabled VM is running in the host. The solution presented in this work is based on GPU virtualization and shareability in order to reach an equilibrium between service supply and the applications' demand for accelerators. Concretely, we propose to decouple real GPUs from the nodes by using the virtualization technology rCUDA. With this software configuration, GPUs can be accessed from any VM, avoiding the need to place a physical GPU in each guest host. Moreover, we study the viability of this approach using a public cloud service configuration, and we develop a module for OpenStack in order to add support for the virtualized devices and the logic to manage them. The results demonstrate that this is a viable configuration which adds flexibility to current and well-known cloud solutions.
Nowadays, many cloud vendors have started offering virtual machines (VMs) with graphics processing units (GPUs) in order to provide GPGPU (general-purpose GPU) computation services. A few relevant examples include Amazon Web Services (AWS), Penguin Computing, Softlayer and Microsoft Azure. In the public scope, one of the most popular cloud vendors is AWS, which offers a wide range of preconfigured instances ready to be launched. Alternatively, given the proper infrastructure, a private cloud can be deployed using a specific middleware such as OpenStack or OpenNebula.
Unfortunately, sharing GPU resources among multiple VMs in cloud environments is more complex than in physical servers. On the one hand, instances in public clouds are not easily customizable. On the other, although the instances in a private cloud can be customized in many aspects, when referring to GPUs the number of options is reduced. As a result, neither vendors nor tools offer GPGPU services. Remote virtualization has been recently proposed to deal with the low-usage problem. Some relevant examples include rCUDA (Peña, 2013), DS-CUDA (Kawai et al., 2012), gVirtus (Giunta et al., 2010), vCUDA (Shi et al., 2012), VOCL (Xiao et al., 2012), and SnuCL (Kim et al., 2012). Roughly speaking, these virtualization frameworks enable cluster configurations with fewer GPUs than nodes. The goal is that GPU-equipped nodes act as GPGPU servers, yielding a GPU-sharing solution that potentially achieves a higher overall utilization of the accelerators in the system.
The main goals of this work are to study current cloud solutions in an HPC GPU-enabled scenario, and to analyze and improve them by adding flexibility via GPU virtualization. In order to reach this goal, we select rCUDA, a virtualization tool that is possibly the most complete and up-to-date for NVIDIA GPUs.
The rest of the paper is structured as follows. In Section 2 we introduce the technologies used in this work; Section 3 summarizes related work; the effort to use AWS is explained in Section 4, while the work with OpenStack is described in Section 5; finally, Section 6 summarizes the advances and Section 7 outlines the next steps of this research.
Iserte, S., Clemente-Castelló, F., Castelló, A., Mayo, R. and Quintana-Ortí, E.
Enabling GPU Virtualization in Cloud Environments.
In Proceedings of the 6th International Conference on Cloud Computing and Services Science (CLOSER 2016) - Volume 2, pages 249-256
ISBN: 978-989-758-182-3
Copyright © 2016 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
2.1 The rCUDA Framework
rCUDA (Peña et al., 2014) is a middleware that enables transparent access to any NVIDIA GPU device present in a cluster from all compute nodes. The GPUs can also be shared among nodes, and a single node can use all the graphics accelerators as if they were local. rCUDA is structured following a client-server distributed architecture and its client exposes the same interface as the regular NVIDIA CUDA 6.5 release (NVIDIA Corp.). With this middleware, applications are not aware that they are executed on top of a virtualization layer. To deal with new GPU programming models, rCUDA has been recently extended to accommodate directive-based models such as OmpSs (Castelló et al., 2015a) and OpenACC (Castelló et al., 2015b). The integration of remote GPGPU virtualization with global resource schedulers such as SLURM (Iserte et al., 2014) completes this appealing technology, making accelerator-enabled clusters more flexible and energy-efficient (Castelló et al., 2014).
2.2 Amazon Web Services
AWS (Amazon Web Services, 2015) is a public cloud computing provider, composed of several services, such as cloud-based computation, storage and other functionality, that enables organizations and/or individuals to deploy services and applications on demand. These services replace company-owned local IT infrastructure and provide agility and instant elasticity, matching perfectly with enterprise software requirements.
From the point of view of HPC, AWS offers high performance facilities via instances equipped with GPUs and high performance network interconnections.
2.3 OpenStack
OpenStack (OpenStack Foundation, 2015) is a cloud operating system (OS) that provides Infrastructure as a Service (IaaS). OpenStack controls large pools of compute, storage, and networking resources throughout a datacenter. All these resources are managed through a dashboard or an API that gives administrators control while empowering their users to provision resources through a web interface or a command-line interface. OpenStack supports most recent hypervisors and handles provisioning and life-cycle management of VMs. The OpenStack architecture offers flexibility to create a custom cloud, with no proprietary hardware or software requirements, and the ability to integrate with legacy systems and third party technologies.
From the HPC perspective, OpenStack offers high performance virtual machine configurations with different hardware architectures. Even though in OpenStack it is possible to work with GPUs, the Nova project does not support this architecture yet.
Our solution to the deficiencies exposed in the previous section relies on GPU virtualization, sharing resources in order to attain a fair balance between supply and demand. While several efforts with the same goal have been initiated in the past, as discussed next, none of them is as ambitious as ours. The work in (Younge et al., 2013) allows the VMs managed by the Xen hypervisor to access the GPUs in a physical node, but with this solution a node cannot use more GPUs than those locally hosted, and an idle GPU cannot be shared with other nodes. The solution presented by gVirtus (Giunta et al., 2010) virtualizes GPUs and makes them accessible for any VM in the cluster. However, this kind of virtualization strongly depends on the hypervisor, and so does its performance. A similar solution is presented in gCloud (Diab et al., 2013). While this solution is not yet integrated in a Cloud Computing Manager, its main drawback is that the application's code must be modified in order to run in the virtual-GPU environment. A runtime component to provide abstraction and sharing of GPUs is presented in (Becchi et al., 2012), which allows scheduling policies to isolate and share GPUs in a cluster for a set of applications. The work introduced in (Jun et al., 2014) is more mature; however, it is only focused on compute-intensive HPC applications.
Our proposal goes further, not only bringing solutions for all kinds of HPC applications, but also aiming to boost flexibility in the use of GPUs.
4.1 Current Features
4.1.1 Instances
An instance is a pre-configured VM focused on a specific target. Among the large list of instances offered by AWS, we can find specialized versions for general purpose (T2, M4 and M3); compute (C4 and C3); memory (R3); storage (I2 and D2) and GPU-capable workloads (G2). Each type of instance has its own purpose and cost (price). Moreover, each type offers a different number of CPUs as well as a different network interconnection, which can be: low, medium, high or 10 Gb. For our study, we worked in the AWS availability zone US EAST (N. VIRGINIA). The instances available in that case present the features reported in Table 1.
Table 1: HPC instances available in US EAST (N. VIRGINIA) in June 2015.
Name        vCPUs  Memory  Network  GPUs  Price per hour
c3.2xlarge  8      15 GiB  High     0     $0.42
c3.8xlarge  32     60 GiB  10 Gb    0     $1.68
g2.2xlarge  8      15 GiB  High     1     $0.65
g2.8xlarge  32     60 GiB  10 Gb    4     $2.60
For the following experiments, we select C3 fam-
ily instances, which are not equipped with GPUs, as
clients; whereas instances of the G2 family will act as
GPU-enabled servers.
4.1.2 Networking
Table 1 shows that each instance integrates a different network. As the bandwidth is critical when GPU virtualization is applied, we first perform a simple test to verify the real network bandwidth.
Table 2: IPERF results between selected instances.
Server      Client      Network  Bandwidth
g2.8xlarge  c3.2xlarge  High     1 Gb/s
g2.8xlarge  c3.8xlarge  10 Gb    7.5 Gb/s
g2.8xlarge  g2.2xlarge  High     1 Gb/s
g2.8xlarge  g2.8xlarge  10 Gb    7.5 Gb/s
To evaluate the actual bandwidth, we executed the IPERF tool between the instances described in Table 1, with the results shown in Table 2. From this experiment, we can derive that the “High” network corresponds to a 1 Gb/s interconnect, while “10 Gb” has a real bandwidth of 7.5 Gb/s. Moreover, it seems that the bandwidth of the instances equipped with a “High” interconnection network is constrained by software to 1 Gb/s, since the theoretical and real bandwidth match perfectly. A real gap between sustained and theoretical bandwidth can only be observed with the 10 Gb interconnection, which reaches up to 7.5 Gb/s.
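The arithmetic behind this comparison can be sketched in a few lines of Python. The measured values are copied from Table 2; the nominal rate of 1 Gb/s for the “High” network follows the interpretation given in the text, and the function names are illustrative only.

```python
# Nominal and measured bandwidths (Gb/s) for the two AWS network classes.
# Measured values come from the IPERF results in Table 2; the nominal value
# for "High" is the 1 Gb/s interpretation given in the text.
LINKS = {
    "High": {"nominal_gbps": 1.0, "measured_gbps": 1.0},
    "10 Gb": {"nominal_gbps": 10.0, "measured_gbps": 7.5},
}


def efficiency(nominal_gbps: float, measured_gbps: float) -> float:
    """Fraction of the nominal bandwidth that was actually sustained."""
    return measured_gbps / nominal_gbps


def gbps_to_mbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000.0 / 8.0


for name, link in LINKS.items():
    eff = efficiency(link["nominal_gbps"], link["measured_gbps"])
    mbs = gbps_to_mbytes_per_s(link["measured_gbps"])
    print(f"{name}: {eff:.0%} of nominal, about {mbs:.1f} MB/s")
```

Expressing the sustained rates in MB/s gives an approximate upper bound for any transfer that has to cross the network, which is useful when reading GPU copy bandwidths reported in MB/s.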
4.1.3 GPUs
An instance relies on a VM that runs on a real node with its own virtualized components. Therefore, AWS could be leveraging a virtualization framework to offer GPU services to all the instances. Although the nvidia-smi command indicates that the GPUs installed are NVIDIA GRID K520, we need to verify that these are non-virtualized devices. For this purpose, we execute the NVIDIA SDK bandwidthtest. As shown in Table 3, the bandwidth achieved in this test is higher than the network bandwidth, which suggests that the accelerator is an actual GPU attached to the node rather than a device accessed over the network.
Table 3: Results of bandwidthtest transferring 32 MB using pageable memory in a local GPU.
Name        Data Movement   Bandwidth
g2.2xlarge  Host to Device  3,004 MB/s
g2.2xlarge  Device to Host  2,809 MB/s
g2.8xlarge  Host to Device  2,814 MB/s
g2.8xlarge  Device to Host  3,182 MB/s
4.2 Testbed Scenarios
All scenarios are based on the maximum number of instances that a user can freely select without submitting a formal request. In particular, the maximum number for “g2.2xlarge” is 5; for “g2.8xlarge” it is 2. All the instances run the RHEL 7.1 64-bit OS and version 6.5 of CUDA. We design three configuration scenarios for our tests:
Scenario A (Figure 1(a)) shows a common configuration in GPU-accelerated clusters, with each node populated with a single GPU. Here, a node can access 5 GPUs using the “High” network.
Scenario B (Figure 1(b)) is composed of 2 server nodes, equipped with 4 GPUs each, and a GPU-less client. This scenario includes a 10 Gb network, and the client can execute the application using up to 8 GPUs.
Scenario C (Figure 1(c)) combines scenarios A and B. A single client, using a 1 Gb network interconnection, can leverage 13 GPUs as if they were local.
Once the scenarios are configured from the point of view of hardware, the rCUDA middleware needs to be installed in order to add the required flexibility to the system. The rCUDA server is executed in the server nodes and the rCUDA libraries are invoked from the node that acts as client.
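As an illustration of the client side, the following Python sketch builds the environment that would point an application at the five remote GPUs of Scenario A. It assumes the convention we recall from the rCUDA user guide, where RCUDA_DEVICE_COUNT holds the number of remote GPUs and RCUDA_DEVICE_n holds a server:gpu pair; these variable names are not given in this paper and should be checked against the documentation of the rCUDA release in use. The host names are hypothetical.

```python
def rcuda_client_env(servers):
    """Build client-side variables for a list of (host, gpu_index) pairs.

    Assumption: the RCUDA_DEVICE_COUNT / RCUDA_DEVICE_<n> naming is taken
    from the rCUDA user guide as we recall it, not from this paper.
    """
    env = {"RCUDA_DEVICE_COUNT": str(len(servers))}
    for i, (host, gpu_index) in enumerate(servers):
        env[f"RCUDA_DEVICE_{i}"] = f"{host}:{gpu_index}"
    return env


# Scenario A: five single-GPU servers (hypothetical host names).
scenario_a = [(f"gpu-server-{n}", 0) for n in range(1, 6)]
env = rcuda_client_env(scenario_a)

# A real launch would merge this mapping into the environment of the CUDA
# application on the client node; here we only print the result.
for key in sorted(env):
    print(f"{key}={env[key]}")
```

Keeping the mapping in a plain dictionary makes it easy to generate the settings for Scenarios B and C by changing only the list of (host, GPU index) pairs.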
In order to evaluate the bandwidth to a remote GPU, we ran the NVIDIA SDK bandwidthtest again. Table 4 shows that the bandwidth is limited by the network.
Table 4: Results of bandwidthtest transferring 32 MB using pageable memory in a remote GPU using rCUDA.
Scenario  Data Movement   Network  Bandwidth
A         Host-to-Device  High     127 MB/s
A         Device-to-Host  High     126 MB/s
B         Host-to-Device  10 Gb    858 MB/s
B         Device-to-Host  10 Gb    843 MB/s
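To put these figures side by side with the local measurements, the short Python sketch below computes how much slower a remote host-to-device copy is than a local one. The numbers are copied from Tables 3 and 4; pairing Scenario A with “g2.2xlarge” servers and Scenario B with “g2.8xlarge” servers follows the scenario descriptions, and the names used are illustrative.

```python
# Host-to-device bandwidth in MB/s, copied from Tables 3 and 4.
LOCAL_H2D = {"g2.2xlarge": 3004.0, "g2.8xlarge": 2814.0}
REMOTE_H2D = {"A": 127.0, "B": 858.0}
# Scenario A uses single-GPU servers, Scenario B uses 4-GPU servers.
SERVER_TYPE = {"A": "g2.2xlarge", "B": "g2.8xlarge"}


def slowdown(scenario: str) -> float:
    """How many times slower the remote copy is than the local one."""
    return LOCAL_H2D[SERVER_TYPE[scenario]] / REMOTE_H2D[scenario]


for scenario in ("A", "B"):
    print(f"Scenario {scenario}: remote copy is {slowdown(scenario):.1f}x slower")
```

The ratio is far larger over the “High” network than over the “10 Gb” network, which is consistent with the network, and not the GPU, being the limiting factor.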
4.3 Experimental Results
The first application is MonteCarloMultiGPU, from the NVIDIA SDK, a code that is compute bound (its execution barely involves memory operations). It was launched with the default configuration, “scaling=weak”, which adjusts the size of the problem depending on the number of accelerators. Figure 2 depicts the options per second calculated by the application running on the scenarios in Figure 1 as well as using local GPUs. For clarity, we have removed the results observed for Scenario B as they are exactly the same as those obtained from Scenario C with up to 8 GPUs. In this particular case, rCUDA (remote GPUs) outperforms CUDA (local GPUs) because the former loads the libraries when the daemon is started (Peña, 2013). With rCUDA we can observe differences in the results between both scenarios. Here, Scenario A can increase the throughput because the GPUs do not share the PCI bus with other devices, as each node is equipped with only one GPU. On the other hand, when the 4-GPU instances (“g2.8xlarge”) are added (Scenario C), the PCI bus constrains the communication bandwidth, hurting the scalability.
The second application, LAMMPS, is a classical molecular dynamics simulator that can be applied at the atomic, meso, or continuum scale. From the implementation perspective, this multi-process application employs at least one GPU to host its processes, but can benefit from the presence of multiple GPUs. Figure 3(a) shows that, for this application, the use of remote GPUs does not offer any advantage over the original CUDA. Furthermore, for the execution on remote GPUs, the difference between both networks is small, although the results observed with the “High” network are worse than those obtained with the “10 Gb” network. In the execution of LAMMPS on a larger problem (see Figure 3(b)), CUDA still performs better, but the interesting point is the execution time when using remote GPUs. The times are almost the same even with different networks, which indicates that the transfers turn the interconnection network into a bottleneck. For this type of application, enlarging the problem size compensates for the negative effect of a slower network.
4.4 Discussion
The previous experiments reveal that the AWS GPU-instances are not appropriate for HPC because neither the network nor the accelerators are powerful enough to deliver high performance when running compute-intensive parallel applications. As (Peña, 2013) demonstrates, network and device types are critical factors for performance. In other words, AWS is more oriented toward offering a general-purpose service than toward providing high performance. Also, AWS fails to offer flexibility, as it forces the user to choose between a basic instance with a GPU and a powerful instance with 4 GPUs. Table 1 shows that the resources of the “g2.8xlarge” are quadrupled, but so is the cost per hour. Therefore, when other needs arise (a different instance type), GPU virtualization technology would in principle allow us to attach an accelerator to any type of instance. Furthermore, reducing the budget spent on cloud services is possible by customizing the resources of the available instances. For example, we can work on an instance with 2 GPUs for only $1.30 by launching 2 “g2.2xlarge” and using remote GPUs, which avoids paying double for features of “g2.8xlarge” that we do not need. In terms of GPU-shareability, AWS reserves GPU-capable physical machines which will be waiting for a GPU-instance request. Investing in expensive accelerators to keep them in a standby state is counter-productive. It makes more sense to dedicate fewer devices, accessible from any machine, resulting in a higher utilization rate.
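The cost argument can be reproduced with a short Python sketch. Prices and GPU counts are copied from Table 1; the helper names are illustrative, and the cost of the client instance is deliberately left out because it is the same in both options.

```python
# Hourly prices (USD) and GPU counts, copied from Table 1.
INSTANCES = {
    "g2.2xlarge": {"gpus": 1, "price": 0.65},
    "g2.8xlarge": {"gpus": 4, "price": 2.60},
}


def instances_needed(instance: str, gpus: int) -> int:
    """Smallest number of instances of this type that provides `gpus` GPUs."""
    per_instance = INSTANCES[instance]["gpus"]
    return -(-gpus // per_instance)  # ceiling division


def hourly_cost(instance: str, count: int) -> float:
    """Hourly price of `count` instances of the given type."""
    return INSTANCES[instance]["price"] * count


wanted_gpus = 2
for name in INSTANCES:
    n = instances_needed(name, wanted_gpus)
    print(f"{n} x {name}: ${hourly_cost(name, n):.2f}/h for {wanted_gpus} GPUs")
```

For two GPUs, the two small instances cost half as much per hour as the single large one, which is the saving described above.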
Figure 4: OpenStack Architecture. (a) Version Icehouse. (b) With GPGPU module.
Figure 5: Internal Communication among modules.
mode the instance monopolizes all the GPUs; while
in the “shared” mode, the GPUs are partitioned. As a
result of sharing the GPU memory, the instance will
be able to work with up to 8 GPUs, provided that each
partition can be addressed as an independent GPU.
Moreover, the users are also responsible for deciding whether a GPU (or a pool) will be assigned to other instances. This behavior is referred to as “scope”, and it determines that a group of instances is logically connected to a pool of GPUs. Working with the “public” scope (bottom row of Figure 6) implies that the GPUs of a pool can be used simultaneously by all the instances linked to it. Again, the GPU pool can be composed of “exclusive” or “shared” GPUs.
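The working modes described above can be modelled with a small data structure. The Python sketch below is a hypothetical model written for illustration, not the data model of the GPGPU module itself; the class and field names are assumptions, and the example supposes four physical GPUs split into two partitions each, which yields the eight addressable GPUs mentioned in the text.

```python
from dataclasses import dataclass


@dataclass
class GpuPool:
    """Hypothetical model of a GPU pool; not the module's actual data model."""

    physical_gpus: int
    mode: str = "exclusive"      # "exclusive" or "shared"
    partitions_per_gpu: int = 1  # only meaningful in "shared" mode
    scope: str = "public"        # scope label, as used in the text

    def logical_gpus(self) -> int:
        """Number of devices an instance can address through this pool."""
        if self.mode == "exclusive":
            return self.physical_gpus
        if self.mode == "shared":
            return self.physical_gpus * self.partitions_per_gpu
        raise ValueError(f"unknown mode: {self.mode!r}")


exclusive = GpuPool(physical_gpus=4)
shared = GpuPool(physical_gpus=4, mode="shared", partitions_per_gpu=2)
print(exclusive.logical_gpus(), shared.logical_gpus())
```

Separating the mode (how a physical GPU is divided) from the scope (which instances may reach the pool) mirrors the two independent decisions that the text leaves to the user.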
5.3 User Interface
In order to deal with the new features, several modifications have been planned in the OpenStack Dashboard, although they have not been implemented yet. First of all, the Instance Launch Panel should be extended with a new field, where the user could assign an existing GPU pool, create a new one, or keep the instance without accelerators of this kind. When the option “New GPU Pool” is chosen, fields for the pool configuration would appear (see Figure 7). Furthermore, a new panel with the existing GPUs would display all the information related to GPUs (see Figure 8).
Figure 6: Examples of working modes.
Figure 7: Launching Instances and assigning GPUs.
Figure 8: GPU Information Panel.
5.4 Experimental Results
The following tests were executed on 2 sets of nodes using a 1 Gb Ethernet network. All the nodes were equipped with an Intel Xeon E7420 quad-core processor at 2.13 GHz and 16 GB of DDR2 RAM at 667 MHz.
The first set, in charge of providing the cloud environment, consisted of 3 nodes. To deploy an IaaS, we used the OpenStack Icehouse version, with QEMU/KVM 0.12.1 as the hypervisor. A fully-featured OpenStack deployment requires at least three nodes: a controller, which manages the compute nodes where the VMs are hosted; a network node, which manages the logical virtual network for the VMs; and one or more compute nodes, which run the hypervisor and the VMs.
The second set, composed of 4 nodes, consisted of auxiliary servers with a Tesla C1060 GPU each. The OS was CentOS 6.6; the GPUs used CUDA 6.5; and rCUDA v5.0 was the GPU virtualization framework.
We have designed 6 different set-ups which can be divided into 2 groups: exclusive and shared GPUs. The exclusive mode provides, at most, 4 accelerators. The number of available GPUs in shared mode depends on the partition size. In this case, we halved the GPU memory, resulting in 8 partitions that can be addressed as independent GPUs. For each group, we deployed virtual clusters of 1, 2 and 3 nodes, where the application processes were executed. The instances were launched with the OpenStack predefined flavor m1.medium, which determines a configuration of VMs consisting of 2 cores and 4 GB of memory.
MCUDA-MEME (Liu et al., 2010) was the application selected to test the set-ups. This is an MPI application, where each process must have access to a GPU. Therefore, the number of GPUs determines the maximum number of processes we can launch. Figure 9 compares the execution time of the application with different configurations over different set-ups. We used up to 3 nodes to spread the processes and launched up to 8 processes (only 4 in exclusive mode), one per remote GPU. We can first observe that the performance is higher with more than one node, because the network traffic is distributed among the nodes. In addition, the shortest execution time is obtained by both modes (exclusive and shared) when running their maximum number of processes with more than one node. This seems to imply that it is not worth scaling (increasing) the number of resources, because the performance growth rate is barely increasing. Although the time is lower when the GPUs are shared, the set-up cannot take advantage of an increase in the number of devices.
Figure 9: Scalability results of MCUDA-MEME with a different number of MPI processes.
5.5 Discussion
The network interconnect restricts the performance of our executions. The analysis in (Peña, 2013) reveals that improving the network infrastructure can make a big difference for GPU virtualization.
The most remarkable achievement is the wide range of possible configurations and the flexibility to adapt a system to fit the user requirements. In addition, with this virtualization technology, the requests for GPU devices can be fulfilled with a small investment in infrastructure and maintenance. Energy can be saved not only thanks to the remote access and the ability to emulate several GPUs using only a few real ones, but also by consolidating the accelerators in a single machine (when possible), or turning off nodes when their GPUs are idle.
We have presented a complete study of the possibilities offered by AWS when it comes to GPUs. The constraints imposed by this service motivated us to deploy our own private cloud, in order to gain flexibility when dealing with these accelerators. For this purpose, we have introduced an extension of OpenStack which can be easily exploited to create GPU-instances as well as to manage the physical GPUs and profit better from them.
As we expected, due to the limited bandwidth of the interconnects used in the experimentation, the performance observed for the GPU virtualized scenarios in the tests was quite low. On the other hand, we have created new operation modes that open interesting new ways to leverage GPUs in situations where having access to a GPU is more important than having a powerful GPU to boost the performance.
The first item in the list of pending work is an upgrade of the network to an interconnect that is better suited to HPC. In particular, porting the setup and the tests to an infrastructure with an InfiniBand network will shed light on the viability of this kind of solution. Similar reasons motivate us to try other cloud vendors with better support for HPC. Looking for situations where performance is less important than flexibility will drive us to explore alternative tools to easily deploy GPU-programming computer labs.
Finally, an interesting line of future work is to design new strategies to decide where a remote GPU is created and assigned to a physical device. Concretely, innovative scheduling policies can enhance the flexibility offered by the GPGPU module for OpenStack.
The authors would like to thank the IT members of the department, Gustavo Edo and Vicente Roca, for their help. This research was supported by Universitat Jaume I research project (P11B2013-21); and project TIN2014-53495-R from MINECO and FEDER. The initial version of rCUDA was jointly developed by Universitat Politècnica de València (UPV) and Universitat Jaume I de Castellón until year 2010. This initial development was later split into two branches. Part of the UPV version was used in this paper and it was supported by Generalitat Valenciana under Grants PROMETEO 2008/060 and Prometeo II
Amazon Web Services (2015). Amazon web services. Accessed: 2015-10.
Becchi, M., Sajjapongse, K., Graves, I., Procter, A., Ravi, V., and Chakradhar, S. (2012). A virtual memory based runtime to support multi-tenancy in clusters with GPUs. In 21st Int. Symp. on High-Performance Parallel and Distributed Computing.
Castelló, A., Duato, J., Mayo, R., Peña, A. J., Quintana-Ortí, E. S., Roca, V., and Silla, F. (2014). On the use of remote GPUs and low-power processors for the acceleration of scientific applications. In The Fourth Int. Conf. on Smart Grids, Green Communications and IT Energy-aware Technologies, pages 57–62, France.
Castelló, A., Mayo, R., Planas, J., and Quintana-Ortí, E. S. (2015a). Exploiting task-parallelism on GPU clusters via OmpSs and rCUDA virtualization. In I IEEE Int. Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms, Helsinki (Finland).
Castelló, A., Peña, A. J., Mayo, R., Balaji, P., and Quintana-Ortí, E. S. (2015b). Exploring the suitability of remote GPGPU virtualization for the OpenACC programming model using rCUDA. In IEEE Int. Conference on Cluster Computing, Chicago, IL (USA).
Diab, K. M., Rafique, M. M., and Hefeeda, M. (2013). Dynamic sharing of GPUs in cloud systems. In Parallel and Distributed Processing Symp. Workshops & PhD Forum, 2013 IEEE 27th International.
Giunta, G., Montella, R., Agrillo, G., and Coviello, G. (2010). A GPGPU transparent virtualization component for high performance computing clouds. In Euro-Par, Parallel Processing, pages 379–391. Springer.
Iserte, S., Castelló, A., Mayo, R., Quintana-Ortí, E. S., Reaño, C., Prades, J., Silla, F., and Duato, J. (2014). SLURM support for remote GPU virtualization: Implementation and performance study. In Int. Symposium on Computer Architecture and High Performance Computing, Paris, France.
Jun, T. J., Van Quoc Dung, M. H. Y., Kim, D., Cho, H., and Hahm, J. (2014). GPGPU enabled HPC cloud platform based on OpenStack.
Kawai, A., Yasuoka, K., Yoshikawa, K., and Narumi, T. (2012). Distributed-shared CUDA: Virtualization of large-scale GPU systems for programmability and reliability. In The Fourth Int. Conf. on Future Computational Technologies and Applications, pages 7–12.
Kim, J., Seo, S., Lee, J., Nah, J., Jo, G., and Lee, J. (2012). SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In Int. Conf. on Supercomputing (ICS).
Liu, Y., Schmidt, B., Liu, W., and Maskell, D. L. (2010). CUDA-MEME: Accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units. Pattern Recognition Letters, 31(14).
NVIDIA Corp. CUDA API Reference Manual Version 6.5.
OpenStack Foundation (2015). OpenStack. Accessed: 2015-10.
Peña, A. J. (2013). Virtualization of accelerators in high performance clusters. PhD thesis, Universitat Jaume I, Castellón, Spain.
Peña, A. J., Reaño, C., Silla, F., Mayo, R., Quintana-Ortí, E. S., and Duato, J. (2014). A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing, 40(10).
Shi, L., Chen, H., Sun, J., and Li, K. (2012). vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Trans. on Comput., 61(6).
Xiao, S., Balaji, P., Zhu, Q., Thakur, R., Coghlan, S., Lin, H., Wen, G., Hong, J., and Feng, W. (2012). VOCL: An optimized environment for transparent virtualization of graphics processing units. In Innovative Parallel Computing. IEEE.
Younge, A. J., Walters, J. P., Crago, S., and Fox, G. C. (2013). Enabling high performance computing in cloud infrastructure using virtualized GPUs.
CLOSER 2016 - 6th International Conference on Cloud Computing and Services Science
... • A new architectural component to provide VMs access to cGPUs in multitenant environments, hiding the real location of the accelerators and detaching their traffic from the VMs traffic. • A complete cGPU resource management and scheduling system, which is an extension of our previous general purpose GPU management system [22]. It has been improved by including the necessary logic to support new working modes based on the locality of the physical GPUs and the exclusivity when accessing them. ...
... Authors in [22] presented the general purpose GPU management system (GPGPUMS), a module for OpenStack that is in charge of managing remote access to a set of GPUs registered in the cloud infrastructure. This development leverages rCUDA to grant remote access to the GPUs in a provider network from any VM. ...
Full-text available
Cloud technology is an attractive infrastructure solution that provides customers with an almost unlimited on-demand computational capacity using a pay-per-use approach, and allows data centers to increase their energy and economic savings by adopting a virtualized resource sharing model. However, resources such as graphics processing units (GPUs), have not been fully adapted to this model. Although, general-purpose computing on graphics processing units (GPGPU) is becoming more and more popular, cloud providers lack of fiexibility to manage accelerators, because of the extended use of peripheral component interconnect (PCI) passthrough techniques to attach GPUs to virtual machines (VMs). For this reason, we design, develop, and evaluate a service that provides a complete management of cloudified GPUs (cGPUs) in public cloud platforms. Our solution enables an effective, anonymous, and transparent access from VMs to cGPUs that are previously scheduled and assigned by a full resource manager, taking into account newGPU selection policies and newworking modes based on the locality of the physical accelerators and the exclusivity when accessing them. This easy-to-adopt tool improves the resource availability through different cGPUs configurations for end-users, whilst cloud providers are able to achieve a better utilization of their infrastructures and offer more competitive services. Scalability results in a real cloud environment demonstrate that our solution introduces a virtually null overhead in the deployment of VMs. Besides, performance experiments reveal that GPU-enabled clusters based on cloud infrastructures can benefit from our proposal not only exploiting better the accelerators, but also serving more jobs requests per unit of time.
... They have also provided a naïve scheduler that first attempts to allocate local GPUs, but if it is not possible it randomly selects other available GPUs in the cluster. Iserte et al. [29] have done another work that extends OpenStack [30] to support remote GPUs using rCUDA. They extended OpenStack to allow the user to allocate local or disaggregated GPUs from a pool of GPUs. ...
Modern applications demand resources at an unprecedented level. In this sense, data centers are required to scale efficiently to cope with such demand. Resource disaggregation has the potential to improve resource efficiency by allowing the deployment of workloads in more flexible ways. Therefore, the industry is shifting towards disaggregated architectures, which enable new ways to structure hardware resources in data centers. However, determining the best performing resource provisioning is a complicated task. The optimality of resource allocation in a disaggregated data center depends on its topology and the workload collocation. This paper presents DRMaestro, a framework to orchestrate disaggregated resources transparently from the applications. DRMaestro uses a novel flow-network model to determine the optimal placement in multiple phases while making a best effort to prevent workload performance interference. We first evaluate the impact of disaggregation regarding the additional network requirements under higher network load. The results show that for some applications the impact is minimal, but others can suffer up to 80% slowdown in the data transfer part. After that, we evaluate DRMaestro via a real prototype on Kubernetes and a trace-driven simulation. The results show that DRMaestro can reduce the total job makespan with a speedup of up to ≈1.20x and decrease the QoS violation up to ≈2.64x compared with another orchestrator that does not support resource disaggregation.
... Furthermore, another great advantage of this solution is that no source code modification is required. Taking advantage of rCUDA, the authors of [71] discuss the implementation of an extension module to OpenStack, which enables configuring virtual machines initialized by OpenStack using remote GPUs. In [72], a GPU Scheduler as a Service (GSaaS) extension is implemented for OpenStack, based on rCUDA. ...
Industrial IoT has special communication requirements, including high reliability, low latency, flexibility, and security. These are instinctively provided by the 5G mobile technology, making it a successful candidate for supporting Industrial IoT (IIoT) scenarios. The aim of this paper is to identify current research challenges and solutions in relation to 5G-enabled Industrial IoT, based on the initial requirements and promises of both domains. The methodology of the paper follows the steps of surveying the state of the art, comparing results to identify further challenges, and drawing conclusions as lessons learned for each research domain. These areas include IIoT applications and their requirements; mobile edge cloud; back-end performance tuning; network function virtualization; and security, blockchains for IIoT, Artificial Intelligence support for 5G, and private campus networks. Besides surveying the current challenges and solutions, the paper aims to provide meaningful comparisons for each of these areas (in relation to 5G-enabled IIoT) to draw conclusions on current research gaps.
... However, this solution shows some bottlenecks for intensive applications (communication optimization challenge), does not consider different access modes from the user's point of view (space sharing challenge), and details of its integration in a cloud infrastructure are not given. Iserte et al. [38] introduced an extension of OpenStack that can be easily exploited to create GPU-instances as well as manage the physical GPUs to better profit by considering different working modes. Nonetheless, the performance observed for the GPU virtualized scenarios was quite low ...
The cloud model allows access to a vast amount of computational resources on a pay-per-use basis, alleviating acquisition and maintenance costs. However, other resources, such as graphics processing units (GPUs), have not been fully adapted to this model. Many areas would benefit from suitable cloud solutions based on GPUs: video encoding, sequencing in bioinformatics, scene rendering in remote gaming, or machine learning. Cloud providers offer local and exclusive access to GPUs by using PCI passthrough. This limitation can be overcome by integrating new virtual GPUs (vGPUs) in cloud infrastructures or by providing mechanisms to cloudify existing GPUs, cloudified GPUs (cGPUs), which do not support native virtualization. The proposed architecture enables an effective and transparent integration of cGPUs in public cloud infrastructures. Our solution offers several access modes (local/remote and exclusive/shared) and autonomously configures its components by integrating with the message middleware of the cloud infrastructure. A prototype of the proposed architecture has been evaluated in a real cloud deployment. Experiments assess overhead in the infrastructure and performance of GPU-based applications by considering three different programs: matrix multiplication, sequencing read alignment, and Monte-Carlo on multiple GPUs. Results show that our solution introduces low impact both on the infrastructure and the performance of applications.
... While GPUs have shown performance improvements in terms of execution time, GPU virtualization is still in its early stage (cf. [16], [17]) and remains an open research direction. ...
Conference Paper
Future factory automation systems are expected to process vast amounts of data and orchestrate complex cyber-physical components. Edge Computing (EC) is a promising approach to address the requirements set by upcoming industrial systems. While EC caters to the computation requirements, it requires a solution to perform flexible network management of these computation resources. Software-Defined Networking (SDN) is a promising candidate to tackle such challenges. While most of the related work on EC and SDN focuses on multimedia or automotive applications, this paper presents the relevance of both paradigms for industrial applications. By introducing two most prominent industrial use cases, namely proactive system surveillance and intelligent technical assistance, this paper discusses the challenges involved and proposes a solution space for realizing these applications using the combination of EC and SDN. Furthermore, it presents future research directions regarding the combination of both paradigms in the context of factory automation.
This proposal addresses, from two different approaches, the improvement of data center productivity through efficient resource management. On the one hand, the combination of GPU remote virtualization technologies with workload managers in HPC clusters demonstrated an interesting increase in throughput, in terms of completed jobs per unit of time, during the research conducted in the predoctoral period. The dissertation begins with an extended study on its impact not only on productivity, but also on resource utilization and energy consumption. Hence, an efficient management of the access to these accelerators is crucial in order to obtain a higher rate of completed jobs per unit of time. On the same basis, cloud computing environments (public or private) also deal with GPUs, since virtual machines can be equipped with these devices. As detailed in this document, the adoption of a GPU remote virtualization technology together with a resource manager introduces new working modes aimed at improving global throughput. On the other hand, the second approach involves job reconfigurations in terms of varying the number of processes during execution (commonly referred to as MPI malleability) in order to increase the system throughput. Currently, MPI jobs account for a high percentage of the total load in an HPC facility. In an effort to ease the adoption of malleability in scientific applications, this manuscript presents two solutions, one based on an OmpSs-like programming model and one based on an MPI-friendly syntax, which provide the necessary tools for easily converting an application into a malleable one. Performance evaluations reveal a non-negligible improvement not only in the throughput, but also in the job waiting time and in the energy consumption.
In the last decades, the number of cores per processor has increased steadily, reaching impressive counts such as the 260 cores per socket in the Sunway TaihuLight supercomputer. This hardware evolution requires an extra effort to extract all the on-node computational power via concurrent programming models (PMs) and applications. Moreover, this trend indicates that future exascale systems will elevate this massive on-node parallelism to thousands of cores per socket. Therefore, that hardware will require efficient libraries and PMs. One of the most popular approaches to obtain acceptable on-node parallel performance relies on the use of operating system (OS) threads via high-level PMs such as OpenMP or via the Pthreads application programming interface (API). Unfortunately the Pthreads API fails to accommodate new software paradigms that target dynamically scheduled and fine-grained parallelism. In contrast with those threads, several lightweight thread (LWT) libraries have been proposed in the last years to tackle fine-grained and dynamic software requirements. These libraries are based on the concept of lighter threads that are managed by OS threads in the user-space. Therefore, the mechanisms' overheads such as context-switch are almost negligible. Some LWT solutions are ConverseThreads, Nanos++, MassiveThreads, Qthreads, or Argobots. LWT libraries demonstrate semantic and performance benefits over the classic Pthreads. However, the variety of LWT libraries hinders portability and reduces its usage to certain solutions. Moreover, this lack of portability reduces the use of LWT implementations in the field of high performance computing (HPC). In this scenario, a unified standard interface can be highly beneficial, as long as it supports most of the functionalities offered by the LWT libraries while maintaining their performance. 
Moreover, the widespread use of Pthreads, both as a low-level API and as the base for high-level PMs, increases the effort needed to give visibility to those LWT solutions. Therefore, high-level PMs and the Pthreads API implemented on top of a unified LWT API are necessary to achieve a wider adoption. This thesis aims to highlight the use of LWT solutions by tackling the problem of portability via a common API. In addition, this work provides solutions to easily migrate from current high-level PM implementations to LWT-based solutions without code modifications. More concretely, the contributions of the thesis are: 1) Decomposition of several threading solutions from a semantic point of view, identifying the strong and weak points of each threading solution; 2) Design and implementation of a unified LWT API, named Generic Lightweight Threads (GLT), that groups the functionality of general-purpose LWT solutions for HPC under the same PM; 3) Implementation of a complete interaction between the already existing Pthreads API and the new GLT API; and 4) Design and implementation of OpenMP and OmpSs runtimes on top of the GLT API, called Generic Lightweight Thread OpenMP (GLTO) and Generic Lightweight Thread OmpSs (GOmpSs), respectively.
Conference Paper
SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs in execution in a cluster. However, SLURM is not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can use a generic resource plug-in (GRes) to manage GPUs, with this solution the hardware accelerators can only be accessed by the job that is in execution on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim at providing a user-transparent access to all GPUs in the cluster, independently of the specific location of the node where the application is running with respect to the GPU node. In this work we introduce a new type of device in SLURM, "rgpu", in order to gain access from any application node to any GPU node in the cluster using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a user can access any number of GPUs, as SLURM schedules the tasks taking into account all the graphics accelerators available in the complete cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
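The difference between node-local allocation (the GRes behaviour described above) and cluster-wide allocation (the "rgpu" idea) can be illustrated with a small sketch. This is not SLURM code and does not use any SLURM API: the function names, the `free_by_node` dictionary, and the choice to fill from the job's own node first are all assumptions made for the example.

```python
def allocate_local(free_by_node, job_node, wanted):
    """Node-local policy: a job may only use GPUs attached to its own node."""
    if free_by_node.get(job_node, 0) < wanted:
        return None  # the request cannot be satisfied on this node
    return {job_node: wanted}


def allocate_cluster_wide(free_by_node, job_node, wanted):
    """Cluster-wide policy: a job may use free GPUs on any node.

    Assumption for this sketch: GPUs on the job's own node are taken first,
    then the remaining nodes are visited in name order.
    """
    if sum(free_by_node.values()) < wanted:
        return None  # not enough free GPUs in the whole cluster
    allocation = {}
    remaining = wanted
    for node in sorted(free_by_node, key=lambda n: (n != job_node, n)):
        if remaining == 0:
            break
        take = min(free_by_node[node], remaining)
        if take:
            allocation[node] = take
            remaining -= take
    return allocation
```

With one free GPU on `n1` and two on `n2`, a job on `n1` asking for two GPUs is rejected by the node-local policy but served by the cluster-wide one, which is the added scheduling flexibility the abstract refers to.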
Conference Paper
Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, remote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several "heterogeneous" scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, as well as permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator’s policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated.
Conference Paper
The use of computational accelerators, specifically programmable GPUs, is becoming popular in cloud computing environments. Cloud vendors currently provide GPUs as dedicated resources to cloud users, which may result in under-utilization of the expensive GPU resources. In this work, we propose gCloud, a framework to provide GPUs as on-demand computing resources to cloud users. gCloud gives cloud users on-demand access to local and remote GPUs only when the target GPU kernel is ready for execution. In order to improve the utilization of GPUs, gCloud efficiently shares the GPU resources among concurrent applications from different cloud users. Moreover, it reduces the inter-application interference of concurrent kernels for GPU resources by considering the local and global memory, number of threads, and the number of thread blocks of each kernel. It schedules concurrent kernels on available GPUs such that the overall inter-application interference across the cluster is minimal. We implemented gCloud as an independent module, and integrated it with the OpenStack cloud computing platform. Evaluation of gCloud using representative applications shows that it improves the utilization of GPU resources by 56.3% on average compared to the current state-of-the-art systems that serialize GPU kernel executions. Moreover, gCloud significantly reduces the completion time of GPU applications, e.g., in our experiments of running a mix of 8 to 28 GPU applications on 4 NVIDIA Tesla GPUs, gCloud achieves up to 430% reduction in the total completion time.
In this paper, we propose SnuCL, an OpenCL framework for heterogeneous CPU/GPU clusters. We show that the original OpenCL semantics naturally fits to the heterogeneous cluster programming environment, and the framework achieves high performance and ease of programming. The target cluster architecture consists of a designated, single host node and many compute nodes. They are connected by an interconnection network, such as Gigabit Ethernet and InfiniBand switches. Each compute node is equipped with multicore CPUs and multiple GPUs. A set of CPU cores or each GPU becomes an OpenCL compute device. The host node executes the host program in an OpenCL application. SnuCL provides a system image running a single operating system instance for heterogeneous CPU/GPU clusters to the user. It allows the application to utilize compute devices in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. SnuCL also provides collective communication extensions to OpenCL to facilitate manipulating memory objects. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.