Hierarchical Scheduling of Independent
Tasks with Shared Files
Hermes Senger, Fabrício A. B. Silva, Waneron M. Nascimento
Universidade Católica de Santos (UniSantos)
R. Dr. Carvalho de Mendonca, 144
Santos, SP - Brazil - 11070-906
Email: {senger,fabricio}@unisantos.br
Abstract—Parallel computing platforms such as grids, clusters and multi-clusters constitute promising alternatives for executing applications comprised of a large number of independent tasks.
However, some application and architectural characteristics may
severely limit performance gains. For instance, tasks with fine
granularity, huge data files to be transmitted to or from data
repositories, and tasks which share common input files are ex-
amples of such characteristics that may cause poor performance.
Bottlenecks may also appear due to the existence of a centralized
controller in the master-slave architecture, or centralized data
repositories within the system. This paper shows how system
efficiency decreases under such conditions.
To overcome such limitations, a hierarchical strategy for file
distribution which aims at improving the system capacity of
delivering input files to processing nodes is proposed and assessed.
Such a strategy arranges the processors in a tree topology,
clusters tasks that share common input files together, and maps
such groups of tasks to clusters of processors. By means of such
strategy, significant improvements in the application scalability
can be achieved.
I. INTRODUCTION
Because of their computational power, clusters, multi-clusters, and grid platforms are suitable for executing high performance applications comprised of a great number of tasks, often ranging from hundreds to several thousands. This
paper is dedicated to a class of applications studied in [2],
[3], [7], [8], [12], which can be decomposed as a set of
independent tasks and executed in any order. In this class,
tasks do not communicate with each other and depend only
upon one or more input data files to be executed. The output
produced is also one or more files. Such an application class is referred to in the literature as parameter-sweep [3] or bag-of-tasks [4], and typical examples include applications that
can be structured as a set of tasks that realize independent
experiments with different parameters. Other typical examples
include several applications that involve data mining, image
manipulation, Monte Carlo simulations, and massive searches.
They are frequent in fields such as astronomy, high-energy
physics, bioinformatics, and many others.
In this paper, we focus on a specific class of applications
comprised by a large number of independent tasks that may
share input files. By large we mean a number at least one order
of magnitude greater than the number of available processors.
Such applications were studied in [2], [3], [7], [8], being the
typical case for grid-based environments such as the AppLeS
Parameter Sweep Template [1] (for a detailed description of
some real-world applications see [2], [3]). According to our
previous experience [16]–[18], some data mining applications
clearly fit this model, in particular those which follow the
Independent Parallelism model, as mentioned in [20]. Fur-
thermore, many other science and engineering computational
applications are potential candidates.
Many challenges arise when scheduling a large number
of tasks in large clusters or computational grids. Typically,
there is no guarantee regarding availability levels and quality
of services, as the involved computational resources may
be heterogeneous, non-dedicated, geographically distributed,
and owned by multiple institutions. Furthermore, data grids
that harness distributed resources such as computers and
data repositories need support for moving high volume data
files in an efficient manner. In [15], Ranganathan and Foster
propose and assess some independent policies for assigning
jobs to processors, and moving data files among sites of a
grid environment, showing the importance of considering file
locality when scheduling tasks.
Scheduling independent tasks (with no file sharing) on heterogeneous processors was studied by Maheswaran et al. [12]. In that paper, the authors propose and assess three scheduling heuristics, named min-min, max-min, and sufferage. However, many problems may arise when the application tasks share files which are stored in a single repository and have to be transmitted to the processing nodes in which the tasks will be executed. In [2], [3], Casanova et al. improved these heuristics to schedule independent tasks with an additional constraint: the possibility of sharing common input files.
As the number of tasks to schedule in such applications is
typically large, scheduling heuristics must have low compu-
tational costs. With this purpose, Giersch, Robert, and Vivien
[6], [8] propose and assess some scheduling heuristics that
produce schedules with similar qualities to those proposed by
Casanova et al. in [2], [3], while keeping the computational
costs one order of magnitude lower. In a further work, Giersch,
Robert, and Vivien [7] extend their work on heuristics, and
establish theoretical limits for the problem of scheduling
independent tasks with shared files, for all possible locations
of data repositories (e.g., centralized, decentralized).
This paper addresses the problem of scheduling independent
Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06)
0-7695-2585-7/06 $20.00 © 2006
IEEE
Fig. 1. The shared files model
tasks with shared files which are stored in a centralized repository. This paper is not concerned with proposing a new heuristic, nor with stressing heuristic approaches. Instead, we present some scalability limits that are intrinsic to the execution of such applications on a master-slave architecture, and propose a strategy that can significantly improve application scalability by reducing the bottleneck at the master computer and multiplying the capacity of the grid to distribute data files. The proposed
strategy implements a hierarchical scheme that integrates task scheduling with file distribution. Such a strategy is carried out in three steps: it starts by clustering tasks that share common files into jobs; it organizes the available processors into a hierarchy; and then it maps groups of tasks onto the processor hierarchy. Experimental results obtained by means of simulation suggest that application scalability can be improved by one order of magnitude.
The remainder of this paper is organized as follows. Section 2 presents a model that describes the architecture and application features. Section 3 describes the problem. Section 4 evaluates the potential benefits of using hierarchical scheduling. A hierarchical strategy is proposed in section 5 and assessed by means of simulation. The final considerations are presented in section 6.
II. A SYSTEM MODEL
The motivating architecture for this work is comprised of one master computer and a set of S slaves distributed among C clusters. The master is responsible for controlling the slave computers, and is usually implemented by the user's machine from which the application is launched.
An application consists of a set of T independent tasks. By
independent we mean that no task depends upon
the execution of any other task, and there is no communication
among tasks. There is a set of F data files which are initially
stored in the master computer, and must be transmitted to the
slave computers to serve as input for the application tasks.
Every task requires one or more input files, and produces
only one output file. Each file provides input data for at least
one task, and not rarely, for a very large number of tasks
of the same application. Such relationship can be represented
as a bipartite graph as depicted in Fig. 1. For instance, this example shows a task T1 that depends upon files F1 and F2. The master communicates with slave processors in order
to transmit input files sequentially (i.e., with no concurrent
transmission) by means of a dedicated full-duplex link. In this
model, we consider that file sizes and task execution times are
known a priori.
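The task/file relationship described above can be sketched as a small bipartite structure. This is an illustrative sketch only; the names `Task` and `shared_files` are ours, not the paper's:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    tid: int
    input_files: frozenset  # names of the input files this task reads

def shared_files(tasks):
    """Return the files required by more than one task."""
    counts = Counter(f for t in tasks for f in t.input_files)
    return {f for f, n in counts.items() if n > 1}

# the example of Fig. 1: task T1 depends on files F1 and F2
tasks = [
    Task(1, frozenset({"F1", "F2"})),
    Task(2, frozenset({"F2", "F3"})),
    Task(3, frozenset({"F3"})),
]
```

Files in the `shared_files` set are exactly those a scheduler can avoid retransmitting by co-locating the tasks that need them.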
A. The Motivating Application
This model is motivated by previous works that involve
the execution of computationally expensive machine learning
algorithms for data mining in computational grids [18] and
clusters [16]. For instance, the project mentioned in [16]
involves the adaptation of Weka [21], a tool that is widely
used among data mining researchers and practitioners, to
execute on clusters of PCs. A specific class of data mining algorithms is the classification algorithms [5], which analyze the characteristics of a dataset from a specific domain and try to produce models that can characterize a set of examples well. For instance, a common application consists in evaluating "which is the best classification algorithm from a list, for a given dataset", i.e., which algorithm can produce a model that best represents the characteristics of a given dataset. The
tenfold Cross Validation procedure [5] is a standard way of predicting the error rate of a classifier given a single, fixed sample of data. In the tenfold cross validation process, the dataset is divided into ten equal parts (folds); the classifier is trained on nine folds and tested on the remaining one. This procedure is repeated with ten different training sets, and the estimated error rate is the average over the test sets. If one would like to evaluate a list of, say, 30 classifier algorithms by means of tenfold cross validation, 300 tasks could be created. Alternatively, a more accurate method is the
N-fold cross validation [7]. In this method, a dataset with N instances (items) is divided into N folds (each one containing a single item); the algorithm is then trained with N − 1 folds and tested on the remaining one. The error rate for a given algorithm is computed as the mean of the N validation tasks. Thus, a dataset with 5,000 items to be tested with the same 30 algorithms will produce 150,000 tasks, all of them using the same input file.
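The task counts quoted above follow directly from the cross-validation setup; a minimal sketch (the function name is ours):

```python
def cv_tasks(algorithms, num_folds):
    # one independent task per (algorithm, held-out fold) pair;
    # all tasks over the same dataset share the same input file
    return [(alg, fold) for alg in algorithms for fold in range(num_folds)]
```

For tenfold cross validation with 30 algorithms this enumerates 300 tasks; N-fold validation of a 5,000-item dataset with the same 30 algorithms yields 150,000 tasks.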
III. DISTRIBUTING SHARED FILES
In order to illustrate scalability problems in the master-
slave architecture, this section shows some results of a real
application carried out in the Unisantos’ laboratory, which in-
volves the execution of the Cluster Genetic Algorithm (CGA)
[10]. The goal of CGA is to identify a finite set of categories
(clusters) to describe a given data set, both maximizing
homogeneity within each cluster and heterogeneity among
different clusters. Thus, objects that belong to the same cluster
should be more similar to each other than those objects that
belong to different clusters. In these experiments, we used a
dataset that is a benchmark for data mining applications (the
Congressional Voting Records), available at the UCI Machine
Learning Repository [13]. This dataset contains 435 instances
(267 democrats, 168 republicans) and 16 Boolean valued
Fig. 2. The makespan for real experiments on a dedicated cluster.
attributes that represent key votes for each of the U.S. House
of Representatives Congressmen. Also, 203 instances present
missing values which have been removed for our experiments.
For these experiments we adopted the MyGrid [4] platform,
which is a lightweight and easy to use grid middleware,
intended to execute applications comprised by independent
tasks. MyGrid implements the Workqueue [4] scheduling
algorithm. Initially, a set of tasks were executed sequentially
on a single machine with a Pentium IV (1.8 GHz) processor
with 1 GB of main memory. Then, the same set of tasks
were run on 2, 4, 8, 12, 16 and 20 dedicated machines
with similar hardware characteristics located in the laboratory
at Unisantos, interconnected by a 100 Mbps Ethernet LAN.
As one can note from Fig. 2, no significant reduction in
the execution time can be achieved with more than 9 slave
processors. After this limit, no performance gains can be
realized by adding slave computers. Such behavior is typical
for some critical applications that cause higher data transfer
rates and higher resource consumption at the master processor,
as we will explain in the following. The CGA application
in the Congressional dataset is such an example. Additional
information about these experiments as well as such scalability
problems can be found in [17], [18].
Initially, the dataset is stored in some repository that is
accessible to the user’s machine which plays the role of
master processor. The master is responsible for controlling
the execution of the application tasks as well as transferring
input and output files to and from the grid machines. In such
a master-slave platform, the more slave machines are added
to the system, the greater is the demand for data transfers
imposed to the master computer. If the number of slaves
exceeds the master’s capacity of delivering input files, the
addition of new slave processors in the system will force some
of them to stay idle while waiting for file transfers. Such a
situation is aggravated in presence of fine grain application
tasks, i.e., tasks with a low computation per communication
ratio.
In order to manage the execution of remote tasks, the master
usually spawns a small number of local processes that transmit
input files to slave processors and receive output files which
are sent back with results. Such processes are eliminated
Fig. 3. Simulated makespan for tasks with different granularities running on a master-slave platform.
after their corresponding tasks are completed. Thus, short
application tasks will raise the rate of creation and destruction
of control processes, thus degrading the performance. Also,
the number of control processes executing concurrently at the
master node is proportional to the number of slaves under
its control. Thus, the consumption of resources in the master
processor (e.g. memory, I/O subsystem, network bandwidth,
CPU cycles) tend to be proportional to the number of slaves
with which it directly interacts for performing control and data
transfer operations.
A. Granularity and Scalability in Master-Slave Platforms
Performance limitations of the master-slave architecture
may appear with different numbers of processors, depending
on some application characteristics. For instance, the execution
times of the application tasks, and the transmission times
of their input files can determine the maximum number of
processors that can be added to the system without losing performance. Under some assumptions, such an effective number of slaves can be estimated as follows. Let RT be the mean
response time for the execution of tasks, which is given by the
sum of their mean execution time (ET), mean time required
to transmit input files (TTIF), and mean time required to
transmit output files (TTOF):
RT = TTIF + ET + TTOF. (1)
Also, suppose that files are sequentially transmitted to slave processors, i.e., the master does not perform concurrent transmissions. In such a scenario, the maximum number of slave processors without loss of performance, S_eff, can be estimated as

S_eff = RT / (TTIF + TTOF). (2)
Obviously, this number depends on the granularity of the
application tasks, since it reflects the ratio between the times
involved in the execution of tasks and in the transmission of
files.
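Equations (1) and (2) can be checked with a few lines of Python; this is a sketch under the paper's sequential-transmission assumption, and the function name is ours:

```python
def effective_slaves(ttif, et, ttof):
    """Equations (1)-(2): mean task response time, and the maximum
    number of slaves a sequential (one-port) master can keep busy."""
    rt = ttif + et + ttof          # equation (1)
    return rt / (ttif + ttof)      # equation (2)
```

For example, a granularity of 8 with a negligible output transfer (ttif = 1, et = 8, ttof = 0) gives an effective limit of 9 slaves.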
In order to illustrate the influence of granularity (here ex-
pressed as the computation time per transmission time ratio) on
the effective number of slaves, some simulation experiments
were carried out using SimGrid [11]. For the experiments, we
consider a dedicated cluster, comprised by the master node
and a set of slave nodes. We assume a shared link, and the
master’s capability of handling only one transmission at any
given time (no concurrent transmissions). The link is assumed
full-duplex, so that the master node can send an input file
to one processor and receive the output file from another
processor concurrently. According to Yang and Casanova [22], this model corresponds to the one-port model, which is suitable for simulating LAN network connections.
For these experiments, different granularities were produced by varying the time to transmit files, while all other parameters (e.g., number of tasks, their execution times, total amount of work of the application) were fixed. In this example, a granularity equal to 2 means that the execution of a task takes twice as long as the transmission of the files involved in that task. The experiment involves 32,000 tasks which take 128 time units to execute, and the bandwidth of the communication links is 1,000,000 bytes per second. File sizes started at 2,000,000 bytes, doubling up to 64,000,000 bytes, producing tasks whose computation per communication ratio varies from 2 to 64. The metric adopted
for this experiment is the makespan [14], which is the time taken to execute all the application tasks, including the transmission of input and output files. As shown in Fig. 3, the reduction of the makespan is strongly dependent on the granularity of the application tasks. These results show that the coarser the granularity of the application tasks, the more processors can be effectively added to the system, leading to reductions in execution times.
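The saturation effect can be reproduced with a toy greedy simulation of the one-port model; this is our simplification of the SimGrid experiments (output transfers are ignored, and the function name is ours):

```python
def makespan_one_port(num_tasks, num_slaves, ttif, et):
    """Greedy simulation of the one-port master-slave model: the master
    transmits input files one at a time over the link; each slave
    computes a task after receiving its file."""
    link_free = 0.0
    slave_free = [0.0] * num_slaves
    for _ in range(num_tasks):
        j = min(range(num_slaves), key=lambda k: slave_free[k])
        link_free += ttif                      # sequential file transfer
        start = max(link_free, slave_free[j])  # slave may still be busy
        slave_free[j] = start + et
    return max(slave_free)
```

In this toy model, with a granularity of 8 (et = 8, ttif = 1), adding slaves well beyond the effective limit no longer reduces the makespan, because the master's link becomes the bottleneck.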
Although performance gains are critical for fine-granularity applications, it is worth noting that such a problem may also occur in the presence of coarse-grain applications, depending upon the application characteristics. Our work aims at extending
such scalability limits, i.e., improving the effective number
of processors that can be added to a grid system for a given
application, so that its overall execution time can be reduced.
B. Grouping Tasks to Improve Scalability
As mentioned before, the main cause of poor scalability is the bottleneck at the master computer. Despite the severe limitations in scalability, only a small number of papers are devoted to this problem. In [17], [18], Silva et al. propose a scheduling algorithm that organizes a given application into groups of tasks that share the same input files, thus minimizing file transfers. For didactic purposes, this section presents details about the grouping of tasks that share input files, because our proposal also adopts the grouping technique.
At scheduling time, tasks are clustered to form jobs, so
that only one job is created per machine. The number of file
transfers is minimized because tasks that comprise a given
job share the same input files, thus, reducing the number of
file transmissions to be delivered by the master computer.
The principle behind this technique is the improvement of the computation per communication ratio, i.e., the granularity of the work units, since a job is composed of a group of tasks.
Fig. 4. Makespan for simulated experiments using pure Workqueue and Workqueue with Grouping, for granularities of 2, 16, and 128.
To illustrate the scalability gains obtained with the grouping technique, we simulated a dedicated cluster comprised of the master and a set of slave processors. The application involves 32,000 fine-grain tasks which take 128 time units to be executed by homogeneous slaves. We assume that (similarly
to our motivating application) each task requires one input file
and produces one output file that are transmitted through the
communication link. Such a situation often occurs in many other real applications mentioned in section I. For the sake of
simplicity, we also consider that each task produces one output
file whose size is at least one order of magnitude smaller than
the input file, and its transmission time is negligible.¹ Fig. 4
shows the simulation results as the total execution times, for
an application executing on grids scaling from 1 to 1,000
slave nodes. Our simulation evaluates the Workqueue, and
Workqueue with Grouping algorithms.
In summary, the Workqueue algorithm puts all tasks in a queue and maps each of them to an idle processor. Whenever an idle processor is detected, it is assigned the next task in the queue; when a task finishes, it is removed from the queue. The algorithm proceeds while there remain tasks to be completed. Workqueue originally does not consider file sharing, and transmits input files to a slave processor every time a task is scheduled to it. Workqueue with Grouping extends the original algorithm by grouping tasks that share input files to compose one job. The latter algorithm creates only one job for each slave processor, so that each input file is sent only once to each slave. As suggested
by the experiments shown in Fig. 4, there are scalability
limits for both the scheduling algorithms. The pure Workqueue
performs very poorly, scaling up to 9 slave processors. After
this number of processors, no reduction in the execution time
can be achieved. As grouping tasks raises this limit to a few hundred slave nodes, this technique will be applied in our hierarchical strategy.

¹Such an assumption is commonly adopted in simulation models found in the literature [7], [15], and reflects a characteristic of our motivating application mentioned in section II-A, which may use very large input files while producing output files of only a few hundred bytes.

Fig. 5. The master-supervisor-slave architecture

Since each input file is transmitted only once to each slave processor that executes tasks which need
it, the maximum number of slaves that can be effectively added to the system without loss of performance may be estimated for schedules that adopt task grouping. Under the same assumptions considered for equation 2, the expected response time for a job (RT_job) can be computed as

RT_job = TTIF + NT_job (ET + TTOF), (3)

where NT_job is the number of tasks grouped to form one job, TTIF is the time required to transmit the input file (which is the same for all tasks in the job), and TTOF is the transmission time for the output file produced by each task in the job. Thus, the number of slaves that can be effectively controlled by the master node is

S_eff = RT_job / (TTIF + NT_job TTOF). (4)
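Equations (3) and (4) can be sketched in a few lines; this is an illustrative helper under the same assumptions as above, with a name of our choosing:

```python
def effective_slaves_grouped(ttif, et, ttof, nt_job):
    """Equations (3)-(4): response time of a job of nt_job grouped tasks
    sharing one input file, and the resulting effective slave count."""
    rt_job = ttif + nt_job * (et + ttof)        # equation (3)
    return rt_job / (ttif + nt_job * ttof)      # equation (4)
```

With a negligible output transfer (ttof = 0), grouping 100 tasks per job raises the effective limit from (ttif + et)/ttif slaves to roughly nt_job times that, which is why grouping extends scalability by orders of magnitude.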
As illustrated in Fig. 4, there are scalability limits for both scheduling algorithms. The pure Workqueue performs very poorly, while the task clustering performs slightly better. Thus, though task clustering has been demonstrated to reduce the minimum makespan by about one order of magnitude, a performance limit can still be observed because of the bottleneck at the master processor. In the next section we investigate whether the use of a hierarchical topology, as opposed to the traditional master-slave architecture, can reduce the bottleneck and improve the scalability of the system.
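The grouping step itself can be sketched as follows; this is our simplified reading of the technique (one job per slave, jobs of near-equal size, tasks with identical input-file sets kept adjacent), not the exact algorithm of [17]:

```python
def group_tasks_into_jobs(tasks, num_slaves):
    """Cluster tasks so that tasks sharing input files land in the same
    job, then split them into num_slaves jobs of near-equal size.
    `tasks` is a list of (task_id, frozenset_of_input_files) pairs."""
    # sorting by the (sorted) file set makes tasks with identical
    # input files adjacent in the ordering
    ordered = sorted(tasks, key=lambda t: sorted(t[1]))
    n = len(ordered)
    jobs = []
    for j in range(num_slaves):
        chunk = ordered[j * n // num_slaves:(j + 1) * n // num_slaves]
        files = frozenset().union(*(fs for _, fs in chunk)) if chunk else frozenset()
        jobs.append({"tasks": [tid for tid, _ in chunk], "files": files})
    return jobs
```

Each job's `files` set is what the master actually has to transmit to the corresponding slave, so tasks sharing a file cost that file's transfer only once per slave.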
IV. ISOEFFICIENCY AND SCALABILITY
Scalability may be defined as "the system's ability to increase speedup as the number of processors increases" [9]. Another definition, not based on the concept of speedup, is the following: "An algorithm-machine combination is scalable if the achieved average speed of the algorithm on the given machine can remain constant with increasing numbers of processors, provided the problem size can be increased with the system size" [19]. This last definition is important since it relates scalability to the combination of a machine and an algorithm, instead of being a property of either the machine or the algorithm alone. Based on those definitions, in this section we introduce one scalability metric: the isoefficiency function proposed by Grama, Gupta and Kumar [9], based on the concept of parallel computing efficiency. Isoefficiency fixes the efficiency and measures how much work must be added to keep the efficiency unchanged as the machine scales up. An isoefficiency function f(P) relates
machine size (P) to the amount of work needed to maintain the efficiency.

Fig. 6. Efficiency of the executions when all tasks share the same input file

Fig. 7. Isoefficiency function when all tasks share the same input file

Parallel computing efficiency is defined as
E = T_seq / (T_par P), (5)

where T_seq is the time for a sequential execution, and T_par is the time for a parallel execution with P processors.
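Under a crude analytic stand-in for the simulated platform (our simplification, not the paper's SimGrid setup), equation (5) and an isoefficiency search can be sketched as:

```python
import math

def efficiency(t_seq, t_par, p):
    return t_seq / (t_par * p)          # equation (5)

def isoefficiency_point(p, et, ttif, target=0.99):
    """Smallest task count (found by doubling) whose efficiency reaches
    the target, using a crude one-port estimate of T_par."""
    if et / (ttif * p) <= target:       # link saturated: target unreachable
        return None
    def t_par(n):
        # either the master's link or the computation is the bottleneck
        return max(n * ttif + et, math.ceil(n / p) * et + ttif)
    n = p
    while efficiency(n * et, t_par(n), p) < target:
        n *= 2
    return n
```

In this toy model the required work grows quickly with the machine size, and beyond the link-saturation point no amount of extra work recovers the target efficiency, mirroring the parabolic curves and the non-scalable cases discussed below.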
Fig. 8. Isoefficiency function when each task has its own input file

Fig. 9. Isoefficiency function for the hierarchical platform

Fig. 10. Comparing the isoefficiency functions of the hierarchical and master-slave platforms

In this section we present the isoefficiency functions for the execution of applications comprised of independent tasks with shared files on a master-slave platform. The simulated platform was composed of up to 400 homogeneous and dedicated processors, and the application was composed of a variable number of tasks, depending on the amount of work necessary to keep the
efficiency constant. Each task takes 8 time units to complete
(ET ), and the amount of time needed to send the input
files (TTIF) varies in order to obtain different ratios. In the
following experiments we simulated the following ratios: 2, 4,
8, 16 and 32. Figure 6 shows the efficiency of the experiments
when tasks share the same input file. It can be verified that the
efficiency is kept around 0.99 for all ratios. Figure 7 shows
the corresponding Isoefficiency functions. It is worth noting
the parabolic shape of all curves. We also executed the same
application when each task has its own input file. In this case a
round-robin strategy was used to map tasks to processors. The
round-robin strategy is optimal for master-slave platforms that
are homogeneous and dedicated [8]. For those executions the
input files are sent to slave nodes before each task execution.
Figure 8 shows the corresponding isoefficiency functions, for efficiencies around 0.99. It can be seen that the execution of an application comprised of independent tasks which have different input files is not scalable. A third set of experiments is shown in Figure 9. We have simulated a hierarchical platform composed of one master node and several supervisor and slave nodes. The maximum number of slave nodes considered is 200. It is possible to verify that the curves also have a parabolic shape. However, the rate of growth of the parabolic-like function is smaller when compared to the master-slave platform, as a direct comparison of the curves of Figure 7 and Figure 9 shows (see Figure 10).
V. HIERARCHICAL SCHEDULING
As shown in section IV, an architecture with a hierarchical topology presents higher scalability than a master-slave architecture for the execution of applications comprised of independent tasks that share input files. In this section, we present a hierarchical scheme for file distribution and task scheduling that aims at improving the application scalability in such a scenario. The problem of scheduling independent tasks with shared input files stored in a centralized repository appears in many applications mentioned in section I, and has been studied in [2], [3], [6], [8], [17], [18]. In such a scenario, the master
node is implemented by the user’s machine which accesses
a centralized repository and launches the application tasks
to be executed by the slave processors. The main function
of the master node concerns taking actions for application
coordination (e.g., scheduling, control of completed tasks) and
file distribution. Our hierarchical scheme aims at alleviating the bottleneck at the master node by reducing the number of file transfers and control actions it must perform.
In order to reduce the workload at the master node, we
propose the addition of a number of supervisor nodes to the
master-slave architecture. The supervisor is responsible for
controlling the execution of the application tasks as well as
transmitting files to the slave nodes. The master groups tasks together to form execution units comprised of tasks that share common files, namely the jobs, which will be distributed among
the supervisor nodes. In turn, each supervisor is responsible for controlling the execution of its job on a subset of slave nodes. In this model, there is no direct interaction between the
master and slave nodes. Instead, the master delegates jobs to
supervisors which are responsible for communicating to slave
nodes and managing the execution of the application tasks.
A. A Strategy for Hierarchical Scheduling
First, consider a distributed architecture comprised of one master node M and a collection of P processors placed in a collection of C clusters interconnected by communication links. The application is comprised of T tasks, where T is at least one order of magnitude larger than P (T >> P). This collection of processors may be partitioned into two sets: the former comprised of S slave computers, and the latter comprised of N supervisor computers. Under such conditions, we propose the following steps to be carried out by the master node for launching the application execution:
1) Initially, the master obtains static information about the
available resources, e.g., the number and identification
of available computers in the system, processor speed,
and memory. By the end of this step, the parameter P
can be known.
Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06), 0-7695-2585-7/06 $20.00 © 2006 IEEE

2) For each known cluster, compute its number of supervisors
N_i, so that every cluster has at least one supervisor, and
clusters with a large number of processors have additional
supervisors according to the maximum number of slaves they
can efficiently control. More precisely, the number N_i of
supervisors for the i-th cluster is computed as

    N_i = ⌈P_i / S_eff⌉,   (6)

where P_i is the number of processors in the i-th cluster,
and S_eff is the maximum number of processors a supervisor
can efficiently control when task grouping is applied (see
equation 4 in section III). Then, obtain N, the total number
of supervisors in the system, as

    N = Σ_{i=1}^{C} N_i.   (7)
3) Compute the number of slaves S = P − N.
4) As proposed by Silva et al. in [17], group the application
tasks into S execution units, namely the jobs, so that
each job can be assigned to one supervisor and its slaves.
Each job will group together a number of tasks (in the range
[⌊T/S⌋, ⌈T/S⌉]) that share common input files.
The mapping of jobs to processors can be done at
random, or by means of some heuristic similar to those
presented in [6]–[8], [17]. For instance, the granularity
(i.e., the number of tasks) of each job could be adjusted
according to slave processors' capacities. The result of
this step is a list of 2-tuples containing the identification
of a job and the slave processor it was assigned to.
5) Decompose the job list into N sub-lists, so that each sub-
list corresponds to one supervisor and contains the jobs
assigned to the slave processors under its control. This
step can be accomplished in time O(N ) by selecting the
sub-list each job must be moved to. For each supervisor,
send the sub-list containing the information on jobs it
has been assigned to, and transmit the required input
files (only once).
6) Wait for tasks to be executed and results to be returned.
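Steps 1 to 5 above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the bucket-by-file-set grouping, the round-robin split of slaves among supervisors, and all names (`launch`, `s_eff`, etc.) are ours; [17] uses a more careful grouping heuristic.

```python
import math
import random

def launch(cluster_sizes, tasks, s_eff):
    """Master-side setup: supervisors per cluster (eqs. 6-7), task
    grouping into jobs, and decomposition of the job list.

    cluster_sizes: list with P_i for each cluster
    tasks:         list of (task_id, input_files) pairs
    s_eff:         max slaves one supervisor can efficiently control
    """
    # Step 1: total number of processors P.
    p = sum(cluster_sizes)

    # Step 2: N_i = ceil(P_i / S_eff) supervisors per cluster (eq. 6),
    # summed over all clusters to obtain N (eq. 7).
    n = sum(math.ceil(p_i / s_eff) for p_i in cluster_sizes)

    # Step 3: the remaining processors act as slaves.
    s = p - n

    # Step 4: group tasks that share input files into S jobs.
    # Here we simply bucket tasks by their (sorted) file set and
    # spread the buckets over the S jobs.
    buckets = {}
    for task_id, files in tasks:
        buckets.setdefault(tuple(sorted(files)), []).append(task_id)
    jobs = [[] for _ in range(s)]
    for i, bucket in enumerate(buckets.values()):
        jobs[i % s].extend(bucket)
    # Random job-to-slave mapping (one of the options in the text);
    # the result is a list of (slave_id, job) 2-tuples.
    assignment = list(enumerate(jobs))
    random.shuffle(assignment)

    # Step 5: decompose the job list into N sub-lists, one per
    # supervisor (round-robin over the slave assignments).
    sublists = [assignment[i::n] for i in range(n)]
    return n, s, sublists
```

Each returned sub-list would then be sent to its supervisor together with the required input files, transmitted only once.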
As soon as a supervisor receives a job list and input files,
it distributes the tasks to its slave processors. Every time a
supervisor is notified that a task has concluded, it performs
the following steps:
1) Set the task status as DONE.
2) Forward the notification and results back to the master.
3) While there remain tasks to execute (READY) whose
input files have already been transmitted to the idle slave,
the supervisor assigns the next task to the idle processor.
4) When no READY task can be found, the supervisor
looks for some uncompleted task (RUNNING) whose
input files have already been transmitted to the idle
slave, and creates a replica of such task in the idle slave
processor. Replication can improve the probability that such
a task will be concluded earlier, as well as provide
some level of fault tolerance.
5) When steps 3 and 4 are completed (all tasks have been
concluded), the supervisor asks the master node for
incomplete tasks from other processors.
6) If the master has no tasks to be executed, then it finishes.
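A minimal sketch of the supervisor's completion handler (steps 1 to 4 above), assuming simple dictionary-based bookkeeping; class, method, and attribute names are illustrative, and forwarding results to the master (step 2) is reduced to a callback:

```python
READY, RUNNING, DONE = "READY", "RUNNING", "DONE"

class Supervisor:
    """Per-supervisor bookkeeping for the completion handler.

    task_files:    task_id -> set of required input files
    slave_files:   slave_id -> set of files already sent to that slave
    notify_master: callback invoked when results are forwarded
    """
    def __init__(self, task_files, slave_files, notify_master):
        self.task_files = task_files
        self.slave_files = slave_files
        self.notify_master = notify_master
        self.status = {t: READY for t in task_files}

    def on_task_done(self, task, idle_slave, result=None):
        # Step 1: mark the task as DONE.
        self.status[task] = DONE
        # Step 2: forward the notification and results to the master.
        self.notify_master(task, result)
        have = self.slave_files[idle_slave]
        # Step 3: prefer a READY task whose inputs are already
        # on the idle slave (no extra file transfer needed).
        for t, st in self.status.items():
            if st == READY and self.task_files[t] <= have:
                self.status[t] = RUNNING
                return ("run", t)
        # Step 4: otherwise replicate a RUNNING task whose inputs
        # are local, to speed completion and tolerate faults.
        for t, st in self.status.items():
            if st == RUNNING and self.task_files[t] <= have:
                return ("replicate", t)
        # Step 5: no local work left; ask the master for
        # incomplete tasks from other processors.
        return ("ask_master", None)
```

The sketch returns an action for the idle slave rather than dispatching it, which keeps the state machine (READY, RUNNING, DONE) easy to test in isolation.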
TABLE I
PERFORMANCE FOR WORKQUEUE (WQ), WORKQUEUE WITH GROUPING (WQ+G), AND HIERARCHICAL WORKQUEUE (WQ+H)

Number of        WQ                   WQ+G                 WQ+H
processors   Makespan  Effic.    Makespan  Effic.    Makespan  Effic.
(P)
     1        256000    1.00      224001    1.00      224002    1.00
     5         51204    0.87       44805    1.00       44806    1.00
     8         32007    0.87       28008    1.00       22410    1.00
    10         32007    0.70       22409    1.00       22410    1.00
    50         32007    0.14        4509    0.99        4510    0.99
   100         32007    0.07        2294    0.98        2270    0.99
   400         32007    0.02         764    0.73         589    0.95
   500         32007    0.01         702    0.64         482    0.93
  1000         32007    0.01         673    0.33         263    0.85
  2000         32007    0.00         674    0.17         161    0.70
  3000         32007    0.00         674    0.11         134    0.56
  5000         32007    0.00         675    0.07         123    0.36
 10000         32007    0.00         684    0.03         124    0.18
Every supervisor node maintains a list of the jobs and
tasks under its control, and the master maintains a list of
all jobs and tasks of the application. For each task, there
is a status information (READY, RUNNING, DONE),
which is updated by means of messages exchanged
among the processors, upon corresponding events.
VI. SIMULATION RESULTS
The hierarchical scheduling strategy was evaluated by
means of simulation, using the same application and archi-
tecture scenario described in section III. For the execution of
simulations, homogeneous, dedicated computers and commu-
nication links were considered. Also, the simulated application
is comprised by homogeneous tasks. The results are shown in
Table I. For these experiments, two metrics were used: the total
makespan [14] and the computational efficiency (see equation
5). The total makespan is computed as the time between the
transmission of the first input file and the return of the last
output file.
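Equation 5 is not reproduced in this section, but the efficiency figures in Table I are consistent with the usual definition E = T_seq / (P × makespan), taking as T_seq the best sequential time in the table (224,001 time units, obtained with grouping at P = 1). The following reconstruction is an assumption that merely matches the published rows:

```python
T_SEQ = 224001  # best sequential time in Table I (WQ+G, P = 1)

def efficiency(p, makespan, t_seq=T_SEQ):
    """Computational efficiency E = T_seq / (P * makespan)."""
    return t_seq / (p * makespan)

# Spot-check a few rows of Table I against this definition.
rows = [
    ("WQ",   5,    51204, 0.87),
    ("WQ+G", 1000, 673,   0.33),
    ("WQ+H", 3000, 134,   0.56),
]
for name, p, makespan, expected in rows:
    assert round(efficiency(p, makespan), 2) == expected
```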
As shown in Table I, with the pure Workqueue algorithm, the
makespan can be reduced to a minimum of 32,007 time units,
using eight processors. Beyond this threshold, the makespan
cannot be shortened by adding processors. With the Workqueue
with Grouping (WQ+G) algorithm, the makespan can be reduced
to a minimum of 673 time units, by employing 1,000 nodes.
In this experiment, the execution with 500 nodes is concluded
in 702 time units and the computational efficiency is around
0.64. However, beyond this threshold the addition of processors
does not lead to a significant reduction in execution times, and
the efficiency drops fast. Finally, with the Hierarchical
(WQ+H) algorithm, the makespan can be reduced to a
minimum of 123 time units, which can be achieved with
5,000 processors. In this experiment, the execution with 3,000
nodes is concluded in 134 time units and the computational
efficiency is 0.56. However, it is worth noting that beyond
this threshold the addition of processors does not lead to a
significant reduction in the makespan, and the efficiency
decreases gracefully.

Fig. 11. Makespan for simulated experiments using Workqueue (WQ),
Workqueue with Grouping (WQ+G), and Hierarchical Workqueue (WQ+H)
scheduling, for a fixed granularity.
VII. CONCLUSIONS AND FUTURE WORK
Independent tasks with shared files constitute an important
class of applications that can benefit from the computational
power delivered by computational grids and clusters. The
execution of such applications in master-slave architectures
creates a bottleneck in the master computer, which limits the
system scalability. The bottleneck appears because the master
node is responsible for scheduling tasks and transmitting input
files to slave processors. Such a limitation is aggravated by the
existence of fine-grain application tasks, which increase both
the rate of files transmitted to slave processors and the rate of
scheduling actions taken by the master.
As a contribution, we propose and assess a strategy that
orchestrates file transfers and the mapping of tasks to proces-
sors, in an integrated manner. Our strategy works by
mapping groups of tasks to a hierarchical division of proces-
sors, leading to significant improvements in the application
scalability, mainly in the presence of fine-grain tasks. The basis
for such improvement comes from the facts that our strategy:
(i) groups tasks that share input files together, so improving
the application granularity and reducing the number of file
transfers on the communication links; (ii) multiplies the system
capacity of transmitting input files to slave processors, by
means of adding supervisors to the master-slave model; and
(iii), reduces the amount of scheduling actions to be taken by
the master, by delegating them to a set of supervisor nodes.
These results apply to the execution of applications comprised
by independent tasks with shared files, executing on top
of master-slave platforms.
Although in this paper we evaluated an instance of the master-
supervisor-slave hierarchy with only one level of supervisors,
a more general scheme must be evaluated in the future. Also,
future evaluations must emphasize analytical and theoretical
aspects of the scalability limits that can be achieved by means
of techniques discussed here.
REFERENCES
[1] Berman, F. High-performance schedulers. In: Foster, I., Kesselman, C.
(editors) The Grid: Blueprints for a New Computing Infrastructure, pp.
279-309. Morgan-Kaufmann, 1999.
[2] Casanova, H., Legrand, A., Zagorodnov, D., Berman, F.: Using Simulation
to Evaluate Scheduling Heuristics for a Class of Applications in Grid
Environments. Research Report 99-46, LIP-ENS, Lyon, France, 1999.
[3] Casanova, H., Legrand, A., Zagorodnov, D., Berman, F.: Heuristics for
Scheduling Parameter Sweep Applications in Grid Environments. In: 9th
Het. Computing Workshop, 2000, pp. 349-363. IEEE CS Press, 2000.
[4] Cirne, W., Paranhos, D., Costa, L., Santos-Neto, E., Brasileiro, F.,
Sauvé, J., Oshtoff, C., Silva, F.A.B., Silveira, C.: Running Bag-of-Tasks
Applications on Computational Grids: The MyGrid Approach. In: Intl.
Conf. Parallel Processing (ICPP), 2003.
[5] Fayyad, U. M., Shapiro, G. P., Smyth, P.: From Data Mining to Knowl-
edge Discovery: An Overview. In: Advances in Knowledge Discovery
and Data Mining, Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthu-
rusamy, R., Editors, MIT Press, pp. 1-37, 1996.
[6] Giersch, A., Robert, Y., Vivien, F.: Scheduling Tasks Sharing Files on
heterogeneous clusters. Research Report RR-2003-28, LIP, ENS Lyon,
France, May, 2003.
[7] Giersch, A., Robert, Y., Vivien, F.: Scheduling Tasks Sharing Files from
Distributed Repositories. Technical Report N. 5214, INRIA, France, 2004.
[8] Giersch, A., Robert, Y., Vivien, F.: Scheduling tasks sharing files on
heterogeneous master-slave platforms. In: 12th Euromicro Workshop in
Par., Dist. and Network-based Proc., pp. 364-371. IEEE CS Press, 2004.
[9] Grama, A., Gupta, A., Kumar, V.: Isoefficiency: Measuring the
Scalability of Parallel Algorithms and Architectures. IEEE Parallel and
Distributed Technology, Vol 1, No 3, 1993.
[10] Hruschka, E. R., Ebecken, N.F.F.: A genetic algorithm for cluster
analysis, Intelligent Data Analysis (IDA), v.7, pp. 15-25, IOS Press, 2003.
[11] Legrand, A., Lerouge, J.: MetaSimGrid: Towards Realistic Scheduling
Simulation of Distributed Applications. Research Report N. 2002-28,
Laboratoire de l'Informatique du Parallélisme, ENS, Lyon, 2002.
[12] Maheswaran, M., Ali, S., Siegel, H.J., Hensgen, D., Freund, R.:
Dynamic Matching and Scheduling of a Class of Independent Tasks
onto Heterogeneous Computing Systems. 8th Heterogeneous Computing
Workshop (HCW’99), April, 1999.
[13] Merz, C.J., Murphy, P.M.: UCI Repository of Machine Learning
Databases, http://www.ics.uci.edu, Irvine, CA, University of California.
[14] Pinedo, M.: Scheduling: Theory, Algorithms, and Systems. Prentice Hall,
Englewood Cliffs, NJ, 1995.
[15] Ranganathan, K., Foster, I.: Simulation Studies of Computation and Data
Scheduling Algorithms for Data Grids. Journal of Grid Computing 1(1),
2003, pp.53-62. Kluwer Academic Publishers, The Netherlands, 2003.
[16] Senger, H., Hruschka, E.R., Silva, F.A.B., Sato, L.M., Bianchini, C.P.,
Esperidio, M.D.: Inhambu: Data Mining Using Idle Cycles in Clusters
of PCs. In: Proc. IFIP Intl. Conf. on Network and Parallel Computing
(NPC’04), Wuhan, China, 2004. LNCS, Vol. 3222, pp.213-220. Springer-
Verlag, Berlin Heidelberg New York (2004).
[17] Silva, F.A.B., Carvalho, S., Senger, H., Hruschka, E.R., Farias, C.R.G.:
Running Data Mining Applications on the Grid: a Bag-of-Tasks Ap-
proach. In: Int. Conf. on Computational Science and its Applications
(ICCSA), Assisi, Italy. LNCS, Vol. 3044, pp. 168-177. Springer-Verlag,
Berlin Heidelberg New York (2004).
[18] Silva, F.A.B., Carvalho, S., Hruschka, E.R.: A Scheduling Algorithm for
Running Bag-of-Tasks Data Mining Application on the Grid. In: Euro-Par
2004, Pisa, Italy, 2004. LNCS, Vol. 3419, pp.254-262. Springer-Verlag,
Berlin Heidelberg New York (2004).
[19] Sun, X. and Rover, D.T. Scalability of Parallel Algorithm-Machine
Combinations. IEEE Transactions on Parallel and Distributed Systems,
Vol 5, No 6, June 1994.
[20] Talia, D.: Parallelism in Knowledge Discovery Techniques. In: Proc. Sixth
Int. Conference on Applied Parallel Computing, Helsinki, LNCS 2367,
pp. 127-136, June 2002.
[21] Witten, I. H., Frank, E.: Data Mining: Practical machine learning tools
with Java implementations. Morgan Kaufmann, San Francisco, 2000.
[22] Yang, Y., van der Raadt, K., Casanova, H.: Multi-Round Algorithms for
Scheduling Divisible Workloads. IEEE Transactions on Parallel and
Distributed Systems (TPDS), 16(11), 1092–1102, 2005.
Isoefficiency analysis helps us determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions.
Article
This paper is devoted to scheduling a large collection of independent tasks onto heterogeneous clusters. The tasks depend upon (input) files which initially reside on a master processor. A given file may well be shared by several tasks. The role of the master is to distribute the files to the processors, so that they can execute the tasks. The objective for the master is to select which file to send to which slave, and in which order, so as to minimize the total execution time. The contribution of this paper is twofold. On the theoretical side, we establish complexity results that assess the difficulty of the problem. On the practical side, we design several new heuristics, which are shown to perform as efficiently as the best heuristics in [H. Casanova, A. Legrand, D. Zagorodnov, F. Berman, Heuristics for scheduling parameter sweep applications in Grid environments, in: Ninth Heterogeneous Computing Workshop, IEEE Computer Society Press, Silver Spring, MD, 2000, pp. 349–363; H. Casanova, A. Legrand, D. Zagorodnov, F. Berman, Using simulation to evaluate scheduling heuristics for a class of applications in Grid environments, Research Report RR-1999-46, LIP, ENS Lyon, France, 1999] although their cost is an order of magnitude lower.