Evaluation of gang scheduling performance
and cost in a cloud computing system
Ioannis A. Moschakis ·Helen D. Karatza
© Springer Science+Business Media, LLC 2010
Abstract Cloud Computing refers to the notion of outsourcing on-site available ser-
vices, computational facilities, or data storage to an off-site, location-transparent cen-
tralized facility or “Cloud.” Gang Scheduling is an efficient job scheduling algorithm
for time sharing, already applied in parallel and distributed systems. This paper stud-
ies the performance of a distributed Cloud Computing model, based on the Ama-
zon Elastic Compute Cloud (EC2) architecture that implements a Gang Scheduling
scheme. Our model utilizes the concept of Virtual Machines (or VMs) which act as
the computational units of the system. Initially, the system includes no VMs, but
depending on the computational needs of the jobs being serviced new VMs can be
leased and later released dynamically. A simulation of the aforementioned model is
used to study, analyze, and evaluate both the performance and the overall cost of
two major gang scheduling algorithms. Results reveal that Gang Scheduling can be
effectively applied in a Cloud Computing environment both performance-wise and
cost-wise.

Keywords Cloud computing · Gang scheduling · HPC · Virtual machines
1 Introduction

Cloud Computing is a revolutionary way of providing shared resources over the Inter-
net. Through the use of low-level virtualization software, such as Xen, the Cloud
provides virtualized computing hardware infrastructure in a manner similar to the
public utilities; thus it is also termed Infrastructure-as-a-Service (IaaS) or Utility
Computing. Since all hardware is virtualized, the Cloud gives the illusion of limitless
resources which can be made available to the user on-demand and can be dynamically
scaled up or down. On the other hand, Computing refers to the applications and software
platforms being offered through the Cloud, usually under the notion of a service
model, hence called Software-as-a-Service (SaaS).

I.A. Moschakis (✉) · H.D. Karatza
Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
The importance of Cloud Computing arises in the opportunity that it provides for
the development of application services without the requirement of Capital
Expenditure (CapEx) prior to deployment. This allows startup Internet companies
with tight budgets to use their funds for Operational Expenditure (OpEx) alone.
Furthermore, in the scientific field of study, Cloud Computing (CC) presents us with the ability to lease
computational resources from its virtually infinite pool for use in High Performance
Computing (HPC). In this way, even small institutions or individuals can have access
to a large number of computational resources at a fraction of the cost of maintain-
ing a supercomputer center. Since the Cloud is cost-associative, we pay only for the
computing time that we spend running each VM and for data transfers in and out of
the Cloud. One, of course, could argue that this problem is already addressed by the
Grid, but the Grid poses certain restrictions on the availability of software, while
Cloud VMs can be custom built with virtually any software a user needs.
In order to take advantage of computational resources that span more than one server,
or in our case a virtual machine, a parallel or distributed computing scheme must be
applied. Although Cloud Computing infrastructure is virtualized, and thus provides
no direct access to the underlying hardware, the Amazon EC2 specification provides
multicore VMs, hence parallelization even on a single VM is possible. Moreover, one
of the main features of Cloud Computing is its ability to adapt, so a user can expand
or contract his system dynamically. Consequently, if CC is going to be used for HPC,
whose market share comprises a third of the server market [2, 5], appropriate methods
must be considered for both parallel job scheduling and VM scalability.
The importance of scheduling methods is apparent in every distributed system.
The scheduling algorithm must seek a way to maximize the performance of the sys-
tem by avoiding unnecessary delays and also, in our case, maintain a good re-
sponse time to leasing cost ratio. The main task of the scheduler is to allocate proces-
sors to parallel jobs that have entered the system. In the system modeled, parallel
jobs consist of tasks that are in very frequent communication and, therefore, must ex-
ecute both simultaneously and concurrently. Gang scheduling is a special case of job
scheduling that allows the scheduling of such jobs. A system that applies this kind
of scheduling must guarantee that every task of a given job will be allocated on a
different processor so that it will begin and finish its execution at the same time as
the other tasks. In this way, the system can avoid cases where a task is blocked while
waiting for input from another task that is not currently executing.
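The one-to-one mapping constraint described above can be illustrated with a small validity check. This is our own illustrative sketch, not code from the paper; the function name and the task-to-VM dictionary encoding are assumptions made for the example:

```python
def is_valid_gang_allocation(task_to_vm):
    """Check the one-to-one constraint of gang scheduling:
    every task of a job must be allocated a different VM (processor),
    so that all sibling tasks can start and finish together."""
    vms = list(task_to_vm.values())
    # Valid only if no two tasks share the same VM.
    return len(vms) == len(set(vms))

# A 3-task job spread over three distinct VMs forms a valid gang:
print(is_valid_gang_allocation({"t0": "vm1", "t1": "vm2", "t2": "vm3"}))  # True
# Two sibling tasks sharing a VM could block each other while waiting for input:
print(is_valid_gang_allocation({"t0": "vm1", "t1": "vm1", "t2": "vm3"}))  # False
```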
This type of scheduling has been extensively studied in the past in the area of dis-
tributed and Grid systems [9, 11, 13, 23, 24]. In [11–14] Karatza has studied the per-
formance of Adaptive First Come First Serve (AFCFS) and Largest Job First Served
(LJFS) gang scheduling policies. Gang Scheduling has also been examined in situa-
tions involving more than one cluster of processors [7, 18, 19]. Task migration
strategies with the inclusion of high-priority jobs have also been considered. In the
aforementioned publications, the number of processors available to the system was
always static during the simulation and the workload consisted of jobs with a degree
of parallelism in the range [1..P], P being the total number of available processors,
regardless of distribution.
Scheduling strategies have been studied before under the notion of Cloud Comput-
ing. Assunção et al. studied the use of CC as an extension to private clusters.
In their model, tasks were separate from each other and did not communicate. Vir-
tual Machine usage and leasing has also been studied in [21, 22] through the use of
the Haizea VM-based lease management architecture.
In this paper, the simulation model consists of one distributed and dynamically
scaling Cloud Computing cluster of VMs. The workload consists of parallel jobs
(gangs) that are either small or large based on a pre-simulation specified job size
coefficient. We compare AFCFS and LJFS under this model in order to study their
performance and cost efficiency in a Cloud Computing environment. Additionally,
we implement a complex system for adding and removing virtual machines from the
system depending on the system’s load at any specific time. To the best of our knowl-
edge, there have not been any other publications that have addressed this specific
problem.
The structure of this paper is as follows. Section 2 presents an in-depth descrip-
tion of the system and workload models. Section 3 describes the Dispatching and the
Scheduling strategies utilized in the simulation. In Sect. 4, we discuss the VM han-
dling system that we have implemented. Section 5 presents the metrics used to mea-
sure performance and cost, the parameters of the simulation, and the results along
with an analysis of them. Finally, Sect. 6 provides some concluding remarks along
with our thoughts about future work on the subject.
2 System and workload models
The simulation model consists of a single cluster of Virtual Machines connected with
a Dispatcher Virtual Machine (DVM). Initially, the system leases no VMs so the
cluster is empty. Depending on the workload at any specific moment, the system has
the ability to lease new VMs up to a total number of Pmax = 120. This is a limitation
posed by Amazon EC2, which allows up to 20 “Regular” and up to 100 “Spot” VMs,
which can be leased under certain conditions, hence virtually up to 120 VMs.
The user can request even more VMs through an electronic request, but approval of
the request is not certain, nor is an answer guaranteed within a specified time limit.
Therefore, for the time being, such a feature is excluded from the model.
Each Virtual Machine incorporates its own task waiting queue where the tasks of
parallel jobs are dispatched by the Dispatcher Virtual Machine (DVM). The DVM
also includes a waiting queue for jobs that were unable to be dispatched at the
moment of their arrival, due either to an inadequate number of VMs at that moment
or to overloaded VMs. For the sake of simplicity, the DVM is not counted within the overall
limit of VMs, Pmax.
In this paper, we assume that the communication between the virtual machines
is contention-free. Therefore, we consider that the communication latencies are in-
cluded implicitly in the jobs’ execution time. However, we do consider explicit delays
when jobs are not immediately dispatched, for the reasons discussed in the previous
paragraph.

Fig. 1 The system model
We also assume that all virtual machines are identical, that is, they all belong to the
same class of EC2 virtual machines. As is true with nonvirtualized systems, VMs can
suffer from inequalities in their performance depending on the state of the underlying
hardware at any specific moment. However, studies [2, 16, 17] have shown that VMs
are able to provide near-homogeneous performance as long as no I/O takes place.
Even this problem is expected to be resolved in the near future through the use of
newer types of flash memory such as solid state drives (SSD). For these reasons, we
consider that any overhead that may exist due to temporal performance difference
between VMs is implicitly included in the execution time of jobs.
Gang scheduling is a special case of scheduling parallel jobs in which tasks of
jobs need to communicate very frequently. Thus, each job requires a number of
processors equal to its degree of parallelism, the number of tasks that it consists of, in
order to be dispatched and executed. In the model under study, degrees of parallelism
are random numbers following the discrete uniform distribution. Furthermore, jobs
fall in two different categories of size:
– Lowly Parallel Jobs, which have job sizes in the range [1..16], with probability q
– Highly Parallel Jobs, which have job sizes in the range [17..32], with probability
(1 − q)

where q is the job size coefficient which determines the proportion of jobs that belong
to the first or the second category.
So we can compute the average number of tasks per job, or Average Job Size (AJS),
in the following way:

AJS = q · (1 + 16)/2 + (1 − q) · (17 + 32)/2 = 8.5q + 24.5(1 − q)
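Since each category's sizes follow a discrete uniform distribution, the AJS is just the probability-weighted average of the two range midpoints. A small illustrative computation (our own sketch; the function name is an assumption):

```python
# Average Job Size (AJS): jobs are lowly parallel (sizes uniform on 1..16)
# with probability q, or highly parallel (sizes uniform on 17..32) with
# probability 1 - q. The mean of a discrete uniform [a..b] is (a + b) / 2.
def average_job_size(q):
    return q * (1 + 16) / 2 + (1 - q) * (17 + 32) / 2

print(average_job_size(1.0))  # 8.5  (all jobs lowly parallel)
print(average_job_size(0.0))  # 24.5 (all jobs highly parallel)
print(average_job_size(0.5))  # 16.5 (even split between the two categories)
```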
The interarrival times of jobs are exponentially distributed with a mean of 1/λ,
and the task service times are exponentially distributed with a mean of 1/μ.
There exists no correlation between service times and job size, for example, it is not
necessary for a large job to have a long service time.
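The workload model above can be sketched as a small generator. This is a minimal illustration under the stated assumptions (exponential interarrival and service times, uniform job sizes per category, service time independent of size); the function and field names are ours, not the paper's:

```python
import random

def generate_jobs(n, lam, mu, q, seed=0):
    """Generate n jobs for the simulated workload:
    - arrivals form a Poisson process (exponential interarrivals, mean 1/lam),
    - per-task service times are exponential with mean 1/mu,
    - job size is uniform on [1..16] with probability q, else uniform on [17..32].
    Service time is drawn independently of job size."""
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)  # next arrival time
        size = rng.randint(1, 16) if rng.random() < q else rng.randint(17, 32)
        service = rng.expovariate(mu)  # no correlation with size
        jobs.append({"arrival": t, "size": size, "service": service})
    return jobs

jobs = generate_jobs(5, lam=1.0, mu=0.5, q=0.5)
```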
We must emphasize here that jobs always execute to completion and that no pre-
emption takes place. This happens because context switching in the case of Gang
Scheduling involves high overhead since network status must be saved and then be
restored when switching between tasks. Also, as noted in the same reference,
there is a possibility that some messages that should have been received by a process
before it was switched may be received by another process after the context switch.
For this reason, it is impractical and possibly dangerous to either preempt or migrate
gang tasks when they are already running.
3 Dispatching and scheduling strategies
3.1 Job routing
The job entry point for the system is the Dispatcher VM. If the degree of parallelism
of any arriving job is less than or equal to the number of the available VMs, the job
is immediately dispatched. The allocation of VMs to tasks is handled by the DVM
which employs the Shortest Queue First (SQF) algorithm for this. SQF dispatches
tasks to VMs with the shortest, least loaded, queues. Tasks that belong to the same
job, also called sibling tasks, cannot occupy the same queue since gang scheduling
requires that there exists a one-to-one mapping of tasks to server VMs. An abstracted
view of SQF is provided in Algorithm 1.
Algorithm 1 Shortest Queue First
vmsByQueueLength := getVMsByQueueLengthIncremental();
for i = 0 to numberOfTasks do
    dispatch task i to vmsByQueueLength[i];
end for
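The SQF dispatch described above can be sketched in Python. This is our own illustration, not the paper's implementation; function and variable names are assumptions, and queues are modeled as plain lists:

```python
def sqf_dispatch(job_tasks, vm_queues):
    """Shortest Queue First sketch: order VMs by current queue length and
    give each sibling task its own VM, so no two tasks of the same job
    ever share a queue (the one-to-one gang constraint)."""
    if job_tasks > len(vm_queues):
        return None  # not enough VMs: the job waits in the DVM queue
    # VM indices ordered by increasing queue length ("incremental" order).
    order = sorted(range(len(vm_queues)), key=lambda v: len(vm_queues[v]))
    chosen = order[:job_tasks]
    for i, vm in enumerate(chosen):
        vm_queues[vm].append(("task", i))  # enqueue task i on its own VM
    return chosen

queues = [[], ["a"], ["b", "c"]]  # three VMs with queue lengths 0, 1, 2
print(sqf_dispatch(2, queues))    # the two tasks land on the two shortest queues: [0, 1]
```

A job whose degree of parallelism exceeds the number of leased VMs is returned unplaced, mirroring the DVM waiting queue behavior described in Sect. 2.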