Evaluation of gang scheduling performance
and cost in a cloud computing system
Ioannis A. Moschakis ·Helen D. Karatza
© Springer Science+Business Media, LLC 2010
Abstract Cloud Computing refers to the notion of outsourcing on-site available services, computational facilities, or data storage to an off-site, location-transparent centralized facility or “Cloud.” Gang Scheduling is an efficient job scheduling algorithm for time sharing, already applied in parallel and distributed systems. This paper studies the performance of a distributed Cloud Computing model, based on the Amazon Elastic Compute Cloud (EC2) architecture, that implements a Gang Scheduling scheme. Our model utilizes the concept of Virtual Machines (VMs), which act as the computational units of the system. Initially, the system includes no VMs, but depending on the computational needs of the jobs being serviced, new VMs can be leased and later released dynamically. A simulation of the aforementioned model is used to study, analyze, and evaluate both the performance and the overall cost of two major gang scheduling algorithms. Results reveal that Gang Scheduling can be effectively applied in a Cloud Computing environment both performance-wise and cost-wise.

Keywords Cloud computing · Gang scheduling · HPC · Virtual machines
1 Introduction

Cloud Computing is a revolutionary way of providing shared resources over the Internet. Through the use of low-level virtualization software, such as Xen, the Cloud provides virtualized computing hardware infrastructure in a manner similar to the public utilities; thus, it is also termed Infrastructure-as-a-Service (IaaS) or Utility Computing. Since all hardware is virtualized, the Cloud gives the illusion of limitless resources which can be made available to the user on demand and can be dynamically scaled up or down. On the other hand, Cloud Computing also refers to the applications and software platforms offered through the Cloud, usually under the notion of a service model, hence called Software-as-a-Service (SaaS).

I.A. Moschakis (✉) · H.D. Karatza
Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
The importance of Cloud Computing lies in the opportunity that it provides for the development of application services without the requirement of Capital Expenditure (CapEx) prior to deployment. This allows startup Internet companies with tight budgets to use their profits for Operational Expenditure (OpEx) alone. Furthermore, in the scientific field, Cloud Computing presents us with the ability to lease computational resources from its virtually infinite pool for use in High Performance Computing (HPC). In this way, even small institutions or individuals can have access to a large number of computational resources at a fraction of the cost of maintaining a supercomputer center. Since the Cloud is cost-associative, we pay only for the computing time that we spend running each VM and for data transfers in and out of the Cloud. One could, of course, argue that this problem is already addressed by the Grid, but the Grid poses certain restrictions on the availability of software, while Cloud VMs can be custom-built with virtually any software a user needs.
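Since the billing model just described is pay-per-use, the cost idea can be sketched in a few lines. The prices and names below are hypothetical placeholders for illustration, not actual EC2 rates:

```python
# Hypothetical prices, for illustration only (not actual EC2 rates).
PRICE_PER_VM_HOUR = 0.10   # dollars per VM-hour of computing time
PRICE_PER_GB = 0.09        # dollars per GB transferred in/out of the Cloud

def lease_cost(vm_hours, gb_transferred):
    """Cost-associative billing: pay only for VM time actually used
    plus data transferred in and out of the Cloud."""
    return vm_hours * PRICE_PER_VM_HOUR + gb_transferred * PRICE_PER_GB
```

With these placeholder rates, running VMs for 10 hours while moving 2 GB of data would cost 10 × 0.10 + 2 × 0.09 dollars.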
In order to take advantage of computational resources that span more than one server, or in our case more than one virtual machine, a parallel or distributed computing scheme must be applied. Although Cloud Computing infrastructure is virtualized, and thus provides no direct access to the underlying hardware, the Amazon EC2 specification provides multicore VMs, hence parallelization even on a single VM is possible. Moreover, one of the main features of Cloud Computing is its ability to adapt, so a user can expand or contract his system dynamically. In conclusion, if Cloud Computing is going to be used for HPC, whose market share comprises a third of the server market [2, 5], appropriate methods must be considered for both parallel job scheduling and VM scalability.
The importance of scheduling methods is apparent in every distributed system. The scheduling algorithm must seek a way to maximize the performance of the system by avoiding unnecessary delays and, in our case, must also maintain a good response-time to leasing-cost ratio. The main task of the scheduler is to allocate processors to parallel jobs that have entered the system. In the system modeled, parallel jobs consist of tasks that are in very frequent communication and, therefore, must execute both simultaneously and concurrently. Gang scheduling is a special case of job
scheduling that allows the scheduling of such jobs. A system that applies this kind
of scheduling must guarantee that every task of a given job will be allocated on a
different processor so that it will begin and finish its execution at the same time as
the other tasks. In this way, the system can avoid cases where a task is blocked while
waiting for input from another task that is not currently executing.
This type of scheduling has been extensively studied in the past in the area of distributed and Grid systems [9, 11, 13, 23, 24]. In [11–14] Karatza has studied the performance of the Adaptive First Come First Serve (AFCFS) and Largest Job First Served (LJFS) gang scheduling policies. Gang Scheduling has also been examined in situations involving more than one cluster of processors [7, 18, 19]. Task migration strategies that include high-priority jobs in the process have also been considered. In the aforementioned publications, the number of processors available to the system was always static during the simulation, and the workload consisted of jobs with a degree of parallelism in the range [1..P], P being the total number of available processors, regardless of distribution.
Scheduling strategies have been studied before under the notion of Cloud Computing. Assunção et al. studied the use of Cloud Computing as an extension to private clusters. In their model, tasks were separate from each other and did not communicate. Virtual Machine usage and leasing has also been studied in [21, 22] through the use of the Haizea VM-based lease management architecture.
In this paper, the simulation model consists of one distributed and dynamically scaling Cloud Computing cluster of VMs. The workload consists of parallel jobs (gangs) that are either small or large, based on a job size coefficient specified before the simulation. We compare AFCFS and LJFS under this model in order to study their performance and cost efficiency in a Cloud Computing environment. Additionally, we implement a complex system for adding and removing virtual machines from the
system depending on the system’s load at any specific time. To the best of our knowledge, no other publication has addressed this specific problem.

The structure of this paper is as follows. Section 2 presents an in-depth description of the system and workload models. Section 3 describes the Dispatching and Scheduling strategies utilized in the simulation. In Sect. 4, we discuss the VM handling system that we have implemented. Section 5 presents the metrics used to measure performance and cost, the parameters of the simulation, and the results along with an analysis of them. Finally, Sect. 6 provides some concluding remarks along with our thoughts about future work on the subject.
2 System and workload models
The simulation model consists of a single cluster of Virtual Machines connected with
a Dispatcher Virtual Machine (DVM). Initially, the system leases no VMs so the
cluster is empty. Depending on the workload at any specific moment, the system has
the ability to lease new VMs up to a total number of Pmax = 120. This is a limitation posed by Amazon EC2, which allows up to 20 “Regular” VMs and up to 100 “Spot” VMs, the latter leased under certain conditions, hence virtually up to 120 VMs. The user can request even more VMs through an electronic request, but the approval of the request is not certain, nor is an answer guaranteed within a specified time limit.
Therefore, for the time being, such a feature is excluded from the model.
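The 20 + 100 instance cap described above amounts to a simple guard on leasing. A minimal sketch (the constant and function names are our own, not from the paper):

```python
P_MAX_REGULAR = 20   # "Regular" instances allowed by EC2
P_MAX_SPOT = 100     # "Spot" instances, leased under certain conditions
P_MAX = P_MAX_REGULAR + P_MAX_SPOT  # overall cap of 120 VMs

def can_lease(current_vms, requested):
    """Return how many of the requested VMs can actually be leased
    without exceeding the overall cap P_MAX."""
    return min(requested, P_MAX - current_vms)
```

For example, a system already holding 115 VMs that requests 10 more would be granted only 5.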
Each Virtual Machine incorporates its own task waiting queue where the tasks of
parallel jobs are dispatched by the Dispatcher Virtual Machine (DVM). The DVM
also includes a waiting queue for jobs that were unable to be dispatched at the moment of their arrival, due either to an inadequate number of VMs at that moment or to overloaded VMs. For the sake of simplicity, the DVM is not counted within the overall limit of VMs, Pmax.
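The queueing structure just described (one task queue per VM, plus a job queue at the DVM) can be sketched as follows; the class and attribute names are assumptions for illustration, not identifiers from the paper:

```python
from collections import deque

class SystemModel:
    """Minimal sketch of the model's queueing structure: one task
    queue per leased VM, plus the DVM's waiting queue for gangs that
    could not be dispatched on arrival."""
    def __init__(self):
        self.vm_queues = []        # one deque of tasks per leased VM
        self.dvm_queue = deque()   # jobs waiting at the Dispatcher VM

    def lease_vm(self):
        """Lease a new VM with an empty task queue; return its index."""
        self.vm_queues.append(deque())
        return len(self.vm_queues) - 1
```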
In this paper, we assume that the communication between the virtual machines
is contention-free. Therefore, we consider that the communication latencies are included implicitly in the jobs’ execution time. However, we do consider explicit delays when jobs are not immediately dispatched, for the reasons discussed in the previous paragraph.

Fig. 1 The system model
We also assume that all virtual machines are identical, that is, they all belong to the
same class of EC2 virtual machines. As is true with nonvirtualized systems, VMs can
suffer from inequalities in their performance depending on the state of the underlying
hardware at any specific moment. However, studies [2, 16, 17] have shown that VMs
are able to provide near homogeneous performance as long as no I/O takes place.
Even this problem is expected to be resolved in the near future through the use of newer storage technologies such as solid-state drives (SSDs). For these reasons, we consider that any overhead that may exist due to temporal performance differences between VMs is implicitly included in the execution time of jobs.
Gang scheduling is a special case of scheduling parallel jobs in which the tasks of a job need to communicate very frequently. Thus, each job requires a number of processors equal to its degree of parallelism, i.e., the number of tasks that it consists of, in order to be dispatched and executed. In the model under study, degrees of parallelism are random numbers following the discrete uniform distribution. Furthermore, jobs fall into two different categories of size:
– Lowly Parallel Jobs, which have job sizes in the range [1..16], with probability q
– Highly Parallel Jobs, which have job sizes in the range [17..32], with probability 1 − q

where q is the job size coefficient which determines the proportion of jobs that belong to the first or the second category.
So we can compute the average number of tasks per job, or Average Job Size (AJS), in the following way:

AJS = q · (1 + 16)/2 + (1 − q) · (17 + 32)/2
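The AJS expression can be checked numerically. A small sketch, assuming job sizes uniform on [1..16] and [17..32] as defined above (function names are ours):

```python
import random

def ajs_closed_form(q):
    """Closed-form AJS: q times the mean of uniform [1..16] (8.5)
    plus (1 - q) times the mean of uniform [17..32] (24.5)."""
    return q * (1 + 16) / 2 + (1 - q) * (17 + 32) / 2

def average_job_size(q, trials=200_000, seed=1):
    """Monte Carlo estimate of AJS: with probability q a job is lowly
    parallel (uniform on 1..16), otherwise highly parallel (17..32)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        if rng.random() < q:
            total += rng.randint(1, 16)   # lowly parallel job
        else:
            total += rng.randint(17, 32)  # highly parallel job
    return total / trials
```

For q = 0.5 the closed form gives 0.5 · 8.5 + 0.5 · 24.5 = 16.5 tasks per job, and the simulation estimate converges to the same value.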
The interarrival times of jobs are exponentially distributed with a mean of 1/λ. The task service times are exponentially distributed with a mean of 1/μ. There is no correlation between service times and job size; for example, a large job does not necessarily have a long service time.
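A workload with these distributional assumptions can be generated as follows. The parameter values, LAMBDA, MU, and the function name are illustrative choices of ours, not parameters reported in the paper:

```python
import random

# Hypothetical parameter values, for illustration only.
LAMBDA = 1.0   # arrival rate (1/LAMBDA = mean interarrival time)
MU = 0.25      # service rate (1/MU = mean task service time)

def generate_jobs(n, q=0.5, seed=42):
    """Generate n (arrival_time, service_time, size) tuples.
    Interarrival and service times are exponential; job size is
    uniform on [1..16] or [17..32], independent of service time."""
    rng = random.Random(seed)
    jobs, clock = [], 0.0
    for _ in range(n):
        clock += rng.expovariate(LAMBDA)          # exponential interarrival
        service = rng.expovariate(MU)             # uncorrelated with size
        size = rng.randint(1, 16) if rng.random() < q else rng.randint(17, 32)
        jobs.append((clock, service, size))
    return jobs
```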
We must emphasize here that jobs always execute to completion and that no preemption takes place. This is because context switching in the case of Gang Scheduling involves high overhead, since network status must be saved and then restored when switching between tasks. Also, as noted in the same reference, there is a possibility that some messages that should have been received by a process before it was switched may be received by another process after the context switch. For this reason, it is impractical and possibly dangerous to either preempt or migrate gang tasks while they are already running.
3 Dispatching and scheduling strategies
3.1 Job routing
The job entry point for the system is the Dispatcher VM. If the degree of parallelism
of any arriving job is less than or equal to the number of the available VMs, the job
is immediately dispatched. The allocation of VMs to tasks is handled by the DVM
which employs the Shortest Queue First (SQF) algorithm for this. SQF dispatches
tasks to VMs with the shortest, least loaded, queues. Tasks that belong to the same
job, also called sibling tasks, cannot occupy the same queue since gang scheduling
requires that there exists a one-to-one mapping of tasks to server VMs. An abstracted view of SQF is provided in Algorithm 1.
Algorithm 1 Shortest Queue First
vmsByQueueLength := getVMsByQueueLengthIncremental();
for i = 0 to numberOfTasks − 1 do
    dispatch task i to the queue of vmsByQueueLength[i];
end for
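For reference, the dispatch step of Algorithm 1 might look like this in Python. This is a sketch under our reading of SQF, and the function and variable names are ours:

```python
def dispatch_gang(queue_lengths, num_tasks):
    """Sketch of SQF gang dispatch: place each sibling task on a
    distinct VM, picking the VMs with the currently shortest queues,
    as gang scheduling's one-to-one task-to-VM mapping requires.
    Returns the chosen VM indices, or None if too few VMs exist."""
    if num_tasks > len(queue_lengths):
        return None  # not enough VMs: the job waits at the DVM
    # VM indices ordered by queue length, ascending
    by_length = sorted(range(len(queue_lengths)), key=lambda v: queue_lengths[v])
    chosen = by_length[:num_tasks]   # one sibling task per VM
    for v in chosen:
        queue_lengths[v] += 1        # each chosen VM receives one task
    return chosen
```

Note that sorting VMs once per job and taking the first num_tasks entries guarantees that no two sibling tasks share a queue.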