Deadline aware virtual machine scheduler for scientific grids and cloud computing
ABSTRACT Virtualization technology has enabled applications to be decoupled from the underlying hardware, providing the benefits of portability, better control over the execution environment, and isolation. It has been widely adopted in scientific grids and commercial clouds. However, virtualization, despite its benefits, incurs a performance penalty, which could be significant for systems dealing with uncertainty such as High Performance Computing (HPC) applications, where jobs have tight deadlines and depend on other jobs before they can run. The major obstacle lies in bridging the gap between the performance requirements of a job and the performance offered by the virtualization technology if the jobs are to be executed in virtual machines. In this paper, we present a novel approach to optimize job deadlines when run in virtual machines by developing a deadline-aware algorithm that responds to job execution delays in real time and dynamically optimizes jobs to meet their deadline obligations. Our approaches borrow concepts from both signal processing and statistical techniques, and their comparative performance results are presented later in the paper, including the impact on the utilization rate of the hardware resources. Comment: 6 pages, 4 figures
Deadline Aware Virtual Machine Scheduler for Grid and
Cloud Computing
Omer Khalid
CERN
Geneva, Switzerland
Omer.Khalid@cern.ch
Kevin Parrott
University of Greenwich
London, United Kingdom
A.K.Parrott@gre.ac.uk
Miltos Petridis
University of Greenwich
London, United Kingdom
M.Petridis@gre.ac.uk
Ivo Maljevic
Soma Networks
Toronto, Canada
IMaljevic@somanetworks.com
Richard Anthony
University of Greenwich
London, United Kingdom
R.J.Anthony@gre.ac.uk
Markus Schulz
CERN
Geneva, Switzerland
Markus.Schulz@cern.ch
Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and
Enhancement – portability; Metrics – performance measures;
General Terms
Algorithms, Measurement, Performance, Design, and Experimentation.
Keywords
Xen, Virtualization, Grid, Pilot jobs, HPC, Cloud, Cluster, Job
Scheduling, ATLAS
1. INTRODUCTION
In the past few years, virtualization technology has remarkably
shaped the way data centres increase their resource utilization, by
virtualizing more and more computing applications to reap the
benefits of lower support costs, higher mobility of applications,
and higher fault tolerance with lower capital cost [1, 4, 5]. This
has led to the evolution of the idea of Grid computing [2], where
resources were distributed geographically, spanning very many data
centres, and shared among a diverse community of users in a
decentralised fashion. But the rigid boundaries between the many
systems in the “Grid” and the constraints they posed led to the
development of “Clouds” such as Amazon EC2 [25], where users can
access computing and storage resources on demand, metered by
per-hour use. The terms Grid and Cloud are almost interchangeable
since both aim to provide computing resources to their respective
user communities through an abstract and well-defined set of
interfaces. In this evolution over the last decade, virtualization
has increasingly brought a paradigm shift where applications and
services can be packaged as thin appliances and moved across
distributed infrastructure with minimum disruption, not only on
servers but also across desktops in large organizations.
Despite these major developments, some fundamental questions remain
unanswered, especially for running HPC jobs in virtual machines
deployed on the Grid or Cloud: how significant the virtualization
overhead could be under different workloads, and whether jobs with
tight deadlines could meet their obligations if resource providers
were to fully virtualize their worker nodes [3].
Given this potential, we investigated how this technology could
benefit the ATLAS [6] (one of CERN’s high-energy physics
experiments) grid infrastructure and improve its efficiency by
simulating its High Performance Computing (HPC) jobs on virtual
machines.
This poses a particular challenge in scientific grids such as LCG¹
that have to serve the needs of diverse communities, often with
competing and opposite demands, but it is simpler to manage in
commercial clouds such as Amazon’s Elastic Compute Cloud (EC2) [25]
or scientific clouds like Nimbus [13], where users have a clear
understanding that they pay for per-hour usage and that their SLA
terminates when they stop paying.
¹ Large Hadron Collider Computing Grid (LCG) and Open Science Grid
(OSG).
2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops
978-0-7695-4019-1/10 $26.00 © 2010 IEEE
DOI 10.1109/WAINA.2010.107
This enables cloud users to estimate their workloads differently
from HPC users, for whom it is sometimes not possible to accurately
estimate job execution times. This problem is further compounded
when job durations extended by virtualization result in an
increased deadline miss rate. We define the deadline miss rate as
the fraction of jobs that do not meet their deadlines.
Our study attempts to provide a way forward to address the
above-mentioned challenges in a way that is transparent to the
users, who need not know that their jobs are run in a virtual
machine, and tries to optimize the job execution rate.
2. MOTIVATION AND BACKGROUND
Since some jobs are more CPU or memory intensive than others, this
requires dynamic and intelligent resource scheduling that adapts as
the nature of the workloads changes at any given moment. Simply
throwing more resources at a virtual machine at the expense of
other competing ones does not solve the research problem, and in
fact leads to lower resource utilization. We aim to investigate how
a scheduling model could be achieved that maximizes the success
rate of jobs while maintaining high resource utilization.
Our work is novel in the sense that it is extensible: our system
parameters can be translated into monetary terms for a scenario
where a cloud provider attaches some currency value to costs and
incentives in the system.
2.1 ATLAS Experiment
The ATLAS experiment uses the PanDA [7] software framework to
submit jobs to the grid. In our previous experiments, we
demonstrated how such an existing Grid application framework can be
modified to deploy grid jobs in virtual machines [10, 11] while
delivering higher job performance by tweaking the parameters of the
Xen hypervisor [24].
Since various ATLAS jobs have different execution times, ranging
from 6 hours to 24 hours each, it was not feasible to run the
actual jobs to quantify the performance of our virtual machine
scheduling algorithm. Because we needed to run thousands of jobs to
test the algorithm, we developed a simulator that resembles a
typical compute node in a Grid or a Cloud to overcome this
constraint.
3. THEORETICAL MODEL
We have n jobs j_i, i ∈ {1 … n}, for slots s, each with deadline
d_i and execution time e_T, where each j_i requires R_i amount of
resource quantum (CPU, memory). The Service Level Agreement (SLA)
for the ATLAS experiment guarantees one CPU core per j_i. Our
scheduler is truthful in that it is built on the assumption that
the user provides accurate resource requirements. It should also be
noted that, in our experimental context, the ATLAS experiment’s
jobs are of two types: user analysis jobs and production jobs. We
focus on production jobs as they have higher priority, since their
output is used to calibrate the detectors, and their resource
requirements are well known and tend to be truthful.
In our simulation model, the algorithm intelligently schedules the
jobs, learns over time about the missed deadlines under various
conditions, and tries to predict whether j_i will meet its deadline
d_i; if not, it takes appropriate measures to improve its chances
of meeting d_i. We assume that ∀j_i : e_T ≤ d_i. Since
virtualization incurs a constant overhead, let it be the
coefficient ∂ and let e_V be the virtual execution time; then the
new deadline would be d_new. This implies:
$$ \text{virtual exec. time} = (\text{duration} \times \text{overhead}) + \text{duration} \qquad (1) $$
It can also be formally expressed as follows:
$$ e_V = e_T + \partial e_T, \quad \partial \ge 0 \;\Rightarrow\; d_{new} \approx e_V \qquad (2) $$
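As a minimal sketch of equations (1) and (2), assuming the overhead coefficient ∂ is supplied as a fraction (for example 0.10 for 10%), the virtual execution time can be computed as shown here. The function name and the sample values are illustrative and are not taken from the simulator.

```python
def virtual_exec_time(e_t: float, overhead: float) -> float:
    """Equation (2): e_V = e_T + overhead * e_T, with overhead >= 0."""
    if overhead < 0:
        raise ValueError("overhead coefficient must be non-negative")
    return e_t + overhead * e_t


# Hypothetical job: 10 hours on physical hardware, 10% virtualization overhead.
e_v = virtual_exec_time(10.0, 0.10)
print(f"virtual execution time: {e_v:.2f} hours")
```

With a zero overhead the function returns the physical execution time unchanged, which corresponds to the case e_V = e_T.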
Unlimited deadlines: For the purpose of proof, let D be the
deadline for j_i. In this case, we show that if D = ∞ then every
job will meet its deadline despite the virtualization overhead, so
it would be an ideal scheduler. This is the upper bound of our
system, which it would never be able to cross, where
$$ \sum_{i=1}^{n} e_V \le \sum_{i=1}^{n} D_i \qquad (3) $$
Tight deadlines: In this case, we show that if each j_i has
e_T = d_i, and since e_V = e_T + ∂e_T from (2), thus
$$ \forall j_i : d_{new} > d_i \ \text{and}\ d_{new} < e_V \ \text{given}\ \Delta d \ge 0 \qquad (4) $$
This would be the worst-case (lower bound) scenario since all the
jobs would miss their deadlines, and if the batch system kills them
all when they exceed their individual allocated run time, then the
system would be heavily underutilized, since all these jobs would
have to be re-submitted and the previously utilized resources would
be considered wasted.
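The two bounds can be illustrated with a short sketch. It assumes that a job counts as missed when its virtual execution time e_V exceeds its original deadline d_i, and it uses hypothetical job durations and a hypothetical 10% overhead; none of these values come from the simulator.

```python
import math


def misses_deadline(e_t: float, deadline: float, overhead: float) -> bool:
    """A job misses when e_V = e_T + overhead * e_T exceeds its deadline."""
    e_v = e_t + overhead * e_t
    return e_v > deadline


def miss_rate(jobs, overhead: float) -> float:
    """Fraction of (e_T, deadline) pairs that miss their deadline."""
    missed = sum(misses_deadline(e_t, d, overhead) for e_t, d in jobs)
    return missed / len(jobs)


durations = [6.0, 12.0, 24.0]                       # hypothetical job lengths in hours
unlimited = [(e_t, math.inf) for e_t in durations]  # upper bound: D = infinity
tight = [(e_t, e_t) for e_t in durations]           # lower bound: e_T = d_i

print(miss_rate(unlimited, 0.10))
print(miss_rate(tight, 0.10))
```

Under these assumptions no job misses in the unlimited case and every job misses in the tight case, for any strictly positive overhead.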
Although no monetary incentive is involved in our experimental
context, to schedule virtual machines commercially we could have
introduced economic parameters into the system, following the
approach of Feldman et al. for scheduling Google’s sponsored
advertisement slots to a set of bidders [9]. Since we abstract
physical machine resources (CPU and memory) as resource units, the
boundaries set for each resource slot s lie within the time
interval [0, t).
If N is the number of time units required for a job, then s_N is
the number of slots a job needs to complete, where t acts as the
frequency of the system to measure the slot booking and resource
utilization ratio for a given time span.
$$ s_N := \frac{e_V}{t} \qquad (5) $$
If t = ∞, then the slots have unlimited life and it would not be
possible to measure their utilization. But this dimension of
scheduling is outside the scope of our deployment scenario, since
scientific grids do not involve monetary factors as an input to
their set of scheduling parameters.
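Equation (5) can be sketched as follows. The function name and sample values are hypothetical, and rounding the ratio up to a whole number of slots is an assumption added here for illustration; the text defines only the ratio itself.

```python
import math


def slots_needed(e_v: float, slot_length: float) -> float:
    """Equation (5): s_N = e_V / t, where t is the slot length."""
    if slot_length <= 0:
        raise ValueError("slot_length must be positive")
    return e_v / slot_length


ratio = slots_needed(11.0, 2.0)   # hypothetical: 11 hours of work, 2-hour slots
whole_slots = math.ceil(ratio)    # assumption: a job books whole slots only
print(ratio, whole_slots)

# With t = infinity the ratio collapses to zero, so utilization cannot be measured.
print(slots_needed(11.0, math.inf))
```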
3.1.1 Performance Metrics
To assess the performance of our scheduling algorithm, we define
the following metrics in our system:
o System performance, to measure the total number of jobs completed
during a period of time.
o Deadline miss rate, representing the number of jobs missing their
deadline and thus being terminated by the scheduler.
o Utilization rate for the CPU and memory, to measure how long each
resource has been active.
To allow the scheduling algorithm to respond to the system
properties, we introduced the duration-executed vs
duration-remaining ratio, denoted as x, for a job j_i that is
projected to miss its current deadline. It is determined by:
$$ x_i = \frac{\text{job duration remaining} - \text{time to deadline}}{\text{job duration remaining}} \qquad (6) $$
The first, and easiest, method is to set an acceptance threshold
such that when x_i < x threshold (X) jobs are accepted, and
rejected otherwise. The basic idea behind this approach is that
acceptance of jobs beyond a certain threshold is expected to be
counter-productive, as most of them would fail.
The adaptive threshold update structure we have adopted, shown in
figure 1, has been motivated by similar structures that are used in
communication systems and digital signal processing. Namely, there
are many control loops that have a similar goal: to track a given
reference value, such as time (time synchronizers), phase (phase
locked loops), etc., and most notably in adaptive gain control,
where the goal is to maintain a certain reference power level. A
good overview of these techniques can be found in [8].
Fig 1. Adaptive x threshold algorithm: θ_i, θ_i+1 - threshold
values, D - delay element, ∆ - delta step, F_target - target
failure rate, F_measured - measured failure rate
In our implementation, a simple threshold update algorithm is based
on trying to keep the failure rate close to a selected target
failure rate. The threshold value for job acceptance is lowered if
the failure rate increases and vice versa. The update step has been
quantized to a small value ∆x in order to avoid fast threshold
changes if the difference between the measured and targeted failure
rate is large. The optimal value for the step ∆x has been
determined through experimentation.
A third approach we have taken is to calculate the Probability
Distribution Function (PDF) and Cumulative Distribution Function
(CDF) of the success rate and use it to select the threshold value
dynamically in such a way that X_Thresh corresponds to the
probability P[x < X_Thresh] < P_Thresh, where P_Thresh is some
pre-selected target probability of success. It is possible to use
the CDF curve of the failure rate to drive the adaptive algorithm,
but the number of successfully completed jobs
is larger than the number of failed jobs, which results in more
meaningful statistics. We have tested several P_Thresh values to
determine the best value. The CDF curve has been generated after
the first 1000 jobs, and has been updated subsequently to reflect
any possible changes in the system behavior.
Failure rate is a representation of the jobs missing their
deadlines in a certain window of time, independent of the global
system performance, and drives the adaptive x algorithm. Let f
denote the failure rate, whereas F is the target failure for the
system (0.2, 0.25]. There are two additional factors, alpha α and
success constant S, for n jobs succeeding in meeting their deadline
out of a total of N jobs. Such a method avoids having to keep track
of an ever-increasing number of jobs, and the rates in question are
calculated as:
$$ f = 1 - \left( \frac{n}{N}(1 - \alpha) + S\alpha \right) \qquad (7) $$
4. CASE STUDY
Since the present study is based on simulation results, we used the
same machine resource configuration for the virtual server as for
the physical servers, with 4 CPUs and 8GB RAM. For the training
phase, to derive some core parameters, we ran the simulator for a
job queue length of 10,000 hours of workload, while for the steady
phase the job queue length was 100,000 hours of workload.
In our previous experiments, we focused only on the one job type
that had the highest resource requirements with lower priority, and
showed through empirical results that para-virtualization, despite
its overhead, provides very neat solutions for many of the problems
faced by the ATLAS Virtual Organization (VO). The questions we are
trying to answer are:
o How will dynamic scheduling of workloads at the machine level
work once the batch system has scheduled a job to a particular
node, given that the job will be executed in a virtual machine
container?
o What kind of scheduling technique could be used to optimize
multiple virtual machines running HPC jobs?
o Which parameter of the job could be considered pivotal in the
scheduling policy: deadline-duration ratio or failure rate? Or
could both of them be part of the same scheduling technique?
o What would be the mix of executing job types on the machine to
increase resource utilization?
All the above questions are of particular importance, especially in
the context of LHC’s ATLAS experiment, where short-running event
generation jobs have higher priority with low resource requirements
over the competing long-running reconstruction jobs, which have
very high resource requirements. Thus, traditional scheduling
methodologies could not be used if virtual machine based execution
has to take place on the LCG Grid.
4.1 Training Phase
Since there are many different input parameters in the system, such
as the resource ratio (memory to CPU), the frequency of the
scheduler, the deadline buffer that was set to 5% of the job
duration, and the alpha and delta values for the adaptive x
algorithm, we first ran the simulator in the training mode to
establish optimum values for
the above-mentioned parameters before running the actual
simulation. In this section, the results of this training phase are
presented.
4.1.1 Resource Ratio Optimization
We ran training simulations for different resource ratio
configurations, denoted as R_i per job slot, to measure the global
success rate and the job deadline miss rate, where the resource
(CPU: Memory) ratios were 1, 1.5, 2 and 3 (res_1, res_2, res_3,
res_4) respectively. The number of jobs concurrently running in the
system is constrained by the available CPUs, since the ATLAS [19]
policy for Grid jobs is one job per CPU slot, so we kept the same
policy for jobs being executed in the virtual machines.
Let R_i = M/C, where M denotes memory and C denotes the number of
CPUs on the worker node.
Figure 2 presents the results for res_1 and res_2. It should be
noted that higher C reduced the overall system performance and
increased the job deadline miss rate, whereas R=2 led to the best
performance, though jobs ran slower as it took more simulation time
units, but more resources were available for the competing jobs.
For all other experiments, we kept R=1.5 as the golden middle
resource ratio since it matched the configuration of the physical
servers.
Fig 2. Comparison of job success rate and termination rate for
different resource configurations
4.1.2 Alpha Delta Optimization
The next step was to train the algorithm for optimum alpha α and ∆x
threshold values, which could then be used for the steady phase. We
ran the simulation with the adaptive X scheduler and learning mode
enabled for α [0.01, 0.05, 0.1] (configurations alpha_1, alpha_2,
alpha_3 respectively, while keeping ∆x=0.1) and ∆x [0.05, 0.1, 0.2]
(delta_1, delta_2, delta_3 respectively, while keeping α=0.01). It
was observed that as long as α and ∆x have a ratio factor between 5
and 10, they performed better than the rest of the combinations of
values, as shown in the following table 1.
Table 1. Global success rate, deadline success rate and deadline
miss rate for different alpha and delta values
Conf      Global success rate   Deadline success   Deadline miss
alpha_1   0.92                  0.83               0.17
alpha_2   0.93                  0.86               0.14
alpha_3   0.90                  0.82               0.18
delta_1   0.92                  0.84               0.16
delta_2   0.93                  0.86               0.14
delta_3   0.93                  0.85               0.15
4.1.3 X threshold Optimization
The job distribution input dataset is randomized, since event
generation, simulation and reconstruction jobs have different
durations and resource requirements, and the system’s deadline miss
rate metric is driven by the virtualization overhead at a given
moment, which will affect whether a job meets its deadline or not
when it first appears to miss it. This ratio of deadline to job
duration is expressed as x in the system. Although at any given
moment x may be recorded as missing its deadline, its success or
failure is heavily influenced by the profile of the concurrent
jobs, which determines the virtualization overhead as described in
our previous study [11].
Since the success or failure of x_i is linked with the evolution of
virtualization overhead over the length of the job, the initial x
threshold value for the system is significant: keeping it too low
will result in termination of jobs which otherwise might have
succeeded, and keeping it too high would lead to all jobs being
allowed to run, including those which might not eventually meet
their deadlines, thus decreasing resource utilization.
We trained the algorithm for different x values [0.3, 0.7, +0.1]
while keeping the α and ∆x thresholds at 0.05 and 0.1 respectively.
Our results showed that the optimum x lay somewhere between
[0.5, 0.6], as shown in figure 3, where the failure rate converges
to < 0.2, but a lower x threshold in the early phases leads to a
higher failure rate, and the algorithm then responds to it by
altering the x threshold. We selected x=0.6 as a golden middle for
the initial threshold value for later experiments.
Fig 3. Failure rate evolution for different x threshold values over
the period of time
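The x ratio of equation (6), the failure rate of equation (7) and the quantized threshold update described for the adaptive x algorithm can be sketched together as below. This is an illustrative reading rather than the simulator's code: all function names are hypothetical, clamping the threshold to [0, 1] is an assumption, and the sample values for n, N and S are made up, while α=0.05, ∆x=0.1 and the initial x=0.6 echo the training values reported above.

```python
def deadline_ratio(duration_remaining: float, time_to_deadline: float) -> float:
    """Equation (6): (duration remaining - time to deadline) / duration remaining."""
    if duration_remaining <= 0:
        raise ValueError("duration_remaining must be positive")
    return (duration_remaining - time_to_deadline) / duration_remaining


def failure_rate(n_success: int, n_total: int, alpha: float, s: float) -> float:
    """Equation (7): f = 1 - ((n/N)(1 - alpha) + S * alpha)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 1.0 - ((n_success / n_total) * (1.0 - alpha) + s * alpha)


def update_threshold(threshold: float, f_measured: float,
                     f_target: float, delta_x: float) -> float:
    """Move the x threshold by one quantized step of size delta_x."""
    if f_measured > f_target:
        threshold -= delta_x   # failure rate too high: lower the threshold
    elif f_measured < f_target:
        threshold += delta_x   # failure rate below target: raise the threshold
    # Assumption: keep the threshold within [0, 1]; the text states no bounds.
    return min(1.0, max(0.0, threshold))


def accept(x_i: float, threshold: float) -> bool:
    """Jobs are accepted when x_i is below the threshold, rejected otherwise."""
    return x_i < threshold


threshold = 0.6
f = failure_rate(n_success=70, n_total=100, alpha=0.05, s=1.0)
threshold = update_threshold(threshold, f, f_target=0.2, delta_x=0.1)
x = deadline_ratio(duration_remaining=10.0, time_to_deadline=8.0)
print(round(f, 3), round(threshold, 2), round(x, 2), accept(x, threshold))
```

Because the step is fixed at ∆x regardless of how far the measured rate is from the target, a large error moves the threshold no faster than a small one, which is the damping behaviour the quantization is meant to provide.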
4.2 Steady Phase
4.2.1 Configuration
Once the key optimization parameters were measured through the
training phase, we ran the simulation for 100k hours of workloads
for different sets of configurations (alg_1, alg_2, alg_3, alg_4,
alg_5) to measure how HPC workloads perform when run on virtual
machines. alg_1 is our physical baseline, whereas alg_2 represents
the virtual baseline without the use of any optimization technique.
Afterwards, we progressively tested various optimization
techniques. alg_3 employs Virtual Dynamic optimization, where
virtualization overhead was adapted without the adaptive algorithm.
alg_4 (Virtual Dynamic Adaptive) is a step-up function of alg_3
using the adaptive algorithm, whereas alg_5 (Virtual Dynamic
Statistical) was run using the CDF to adapt the x threshold for the
executing workloads.
4.2.2 Performance Results
The physical baseline had the best performance in overall system
success rate with the minimum job deadline miss rate, as seen in
figure 4, and the core objective of this study was to develop an
algorithm that could deliver performance within a 10%-15% range of
this baseline. Without any optimization technique, static
virtualization overhead (VO) (alg_2) led to the worst performance,
where the deadline miss rate was 0.58, but the empirical data shows
that considerable performance gains could be made by altering the
virtualization overhead (alg_3) according to the running workloads.
Our adaptive algorithm (both alg_4 and alg_5) further improved the
job success rate from 0.78 to 0.84, by 7.7%, while bringing the job
deadline miss rate to approximately 0.17, which is 26% less than
alg_3’s deadline miss rate of 0.23.
Fig 4. Comparative performance of different algorithms tested for
the simulation
This study provided ample evidence that there is considerable room
for improvement in the performance of virtual HPC workloads by
using dynamic VO, which is driven by the CPU and memory
characteristics of the workloads, and by adapting the x threshold
at run time, driven by the subset of near-past job history
(successes and failures). It has to be noted that we determined VO
for ATLAS HPC workloads [11], and thus the VO levels we used have
to be trained before the adaptive x algorithm can be applied to
workloads with different computational and memory requirements.
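The statistical variant (alg_5) selects the x threshold from a CDF so that P[x < X_Thresh] < P_Thresh. A minimal sketch of one way to do this with an empirical CDF is given below. It assumes the sample consists of the x values recorded for successfully completed jobs; the function name, the sample and the choice of P_Thresh are hypothetical, and the periodic regeneration of the curve after the first 1000 jobs is not modelled.

```python
def cdf_threshold(success_x, p_thresh: float) -> float:
    """Pick the largest observed x whose empirical P[x < value] stays below p_thresh."""
    if not success_x:
        raise ValueError("need at least one observation")
    if not 0.0 < p_thresh <= 1.0:
        raise ValueError("p_thresh must lie in (0, 1]")
    ordered = sorted(success_x)
    m = len(ordered)
    # At most k of the m samples are strictly below ordered[k],
    # so P[x < ordered[k]] <= k / m for the empirical distribution.
    k = max(i for i in range(m) if i / m < p_thresh)
    return ordered[k]


# Hypothetical x ratios observed for successfully completed jobs.
observed = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.55, 0.60, 0.80]
x_thresh = cdf_threshold(observed, p_thresh=0.75)
print(x_thresh)
```

A larger P_Thresh pushes the threshold towards the upper tail of the observed ratios and so admits more jobs, which mirrors the trade-off between early termination and wasted resources discussed for the initial x threshold.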
5. RELATED WORK
Present-day virtual machine deployment engines such as OpenNebula
[12], Nimbus [13] and others have focused on developing standard
and transparent mechanisms and interfaces to deploy virtual
machines on managed clusters, which could be incorporated into
scientific grids and clouds, but they do not address the issues of
job performance or jobs meeting their deadlines, which is critical
for HPC experiments. Similarly, efforts [14, 15, 16] have been made
to abstract compute resources and make them available as utility
computing using the above-mentioned VM deployment engines, but
their research focus has been on developing leasing mechanisms for
bringing up on-demand virtual clusters for parallel jobs.
The work done by the VSched project [17] is very relevant to our
research problem and complements our work, as they have worked
towards meeting job deadlines when batch and user-interactive (UI)
applications are mixed and executed in virtual machines. Their
underlying scheduling technique is significantly different from our
approach, since user-interactive applications require minimum
response latency, where a deadline represents a response latency,
as compared to our use case, where the issue at hand has been to
optimize the job success ratio for HPC batch jobs with no UI
constraints. We addressed the issue through monitoring job success
rates in real time, whereas Sadon et al [18] tackled the same
problem by working towards a scheduling model which could reshape
and re-size virtual machines as the workload changes under
different conditions. Lingrand et al [19] have taken an alternative
approach of applying optimization techniques to job submissions,
and it would be very interesting to explore this further in the
context of job submission to virtual machines.
A lot of effort has also been put into hotspot management and live
migration of virtual machines as service demands for resources go
up and down during peak and off-peak times [20, 21], to consolidate
VMs on few resources while powering down free resources to reduce
the electricity and cooling footprint. This requires that VMs be
migrated between servers with minimum interruption to their running
applications, by retaining network connectivity and checkpointing
volatile memory [22, 23]. This is a very interesting arena of
research but is beyond the scope of our study.
6. CONCLUSION
In this paper, we have presented a dynamic and real-time virtual
machine scheduler that monitors job execution pathways and
optimizes the job success rate for HPC workloads when run on
virtual machines in scientific grids. To achieve this, we developed
a job execution simulator and implemented various scheduling
techniques, including statistical methods for prediction purposes.
We observed that both the statistical and adaptive x threshold
models yielded the best performance as compared to the static
virtualization model. Empirical data also showed that adapting the
virtualization overhead dynamically is critical for HPC jobs to
deliver performance comparable to that on physical machines. We
also learnt that resource ratios affect the job termination rate
more than job deadline rates, and that for lower ratios the job
termination rate increased.
The limitation of our approach is that for a given set of workload
types, an initial performance metric has to be established to define