Hierarchical Scheduling of Independent
Tasks with Shared Files
Hermes Senger, Fabrício A. B. Silva, Waneron M. Nascimento
Universidade Católica de Santos (UniSantos)
R. Dr. Carvalho de Mendonca, 144
Santos, SP - Brazil - 11070-906
Email: {senger,fabricio}@unisantos.br
Abstract— Parallel computing platforms such as grids, clusters
and multi-clusters constitute promising alternatives for executing
applications comprised of a large number of independent tasks.
However, some application and architectural characteristics may
severely limit performance gains. For instance, tasks with fine
granularity, huge data files to be transmitted to or from data
repositories, and tasks which share common input files are
examples of characteristics that may cause poor performance.
Bottlenecks may also appear due to the existence of a centralized
controller in the master-slave architecture, or of centralized data
repositories within the system. This paper shows how system
efficiency decreases under such conditions.
To overcome such limitations, a hierarchical strategy for file
distribution, which aims at improving the system's capacity to
deliver input files to processing nodes, is proposed and assessed.
Such a strategy arranges the processors in a tree topology,
clusters tasks that share common input files together, and maps
such groups of tasks to clusters of processors. By means of such
strategy, significant improvements in the application scalability
can be achieved.
I. INTRODUCTION
Because of their computational power, clusters, multi-clusters,
and grid platforms are suitable for executing high-performance
applications comprised of a great number of tasks, often
ranging from hundreds to several thousands. This paper is
dedicated to a class of applications studied in [2], [3], [7],
[8], [12], which can be decomposed into a set of independent
tasks and executed in any order. In this class, tasks do not
communicate with each other and depend only upon one or more
input data files to be executed. The output produced is also
one or more files. Such an application class is referred to in
the literature as parameter-sweep [3] or bag-of-tasks [4], and
typical examples include applications that can be structured
as a set of tasks that perform independent
experiments with different parameters. Other typical examples
include several applications that involve data mining, image
manipulation, Monte Carlo simulations, and massive searches.
They are frequent in fields such as astronomy, high-energy
physics, bioinformatics, and many others.
In this paper, we focus on a specific class of applications
comprised of a large number of independent tasks that may
share input files. By large we mean a number at least one order
of magnitude greater than the number of available processors.
Such applications were studied in [2], [3], [7], [8], and are the
typical case for grid-based environments such as the AppLeS
Parameter Sweep Template [1] (for a detailed description of
some real-world applications see [2], [3]). According to our
previous experience [16]–[18], some data mining applications
clearly fit this model, in particular those which follow the
Independent Parallelism model, as mentioned in [20]. Fur-
thermore, many other science and engineering computational
applications are potential candidates.
Many challenges arise when scheduling a large number
of tasks in large clusters or computational grids. Typically,
there is no guarantee regarding availability levels and quality
of services, as the involved computational resources may
be heterogeneous, non-dedicated, geographically distributed,
and owned by multiple institutions. Furthermore, data grids
that harness distributed resources such as computers and
data repositories need support for moving high volume data
files in an efficient manner. In [15], Ranganathan and Foster
propose and assess some independent policies for assigning
jobs to processors, and moving data files among sites of a
grid environment, showing the importance of considering file
locality when scheduling tasks.
Scheduling independent tasks (with no file sharing) on
heterogeneous processors was studied by Maheswaran et
al. [12]. In that paper, the authors propose and assess
three scheduling heuristics, named min-min, max-min, and
sufferage. However, problems may arise when the
application tasks share files that are stored in a single
repository and have to be transmitted to the processing nodes
on which the tasks will be executed. In [2], [3], Casanova et al.
extended these heuristics to schedule independent tasks with
an additional constraint: the possibility of sharing common
input files.
As the number of tasks to schedule in such applications is
typically large, scheduling heuristics must have low compu-
tational costs. To this end, Giersch, Robert, and Vivien
[6], [8] propose and assess scheduling heuristics that
produce schedules of similar quality to those proposed by
Casanova et al. in [2], [3], while keeping the computational
costs one order of magnitude lower. In further work, Giersch,
Robert, and Vivien [7] extend their work on heuristics, and
establish theoretical limits for the problem of scheduling
independent tasks with shared files, for all possible locations
of data repositories (e.g., centralized, decentralized).
This paper addresses the problem of scheduling independent
Fig. 1. The shared files model: a bipartite graph relating input files F1, ..., Fm to tasks T1, ..., Tn.
tasks with shared files that are stored in a centralized repository.
The focus of this paper is not on proposing a new heuristic,
nor on refining heuristic approaches. Instead, we present some
scalability limits that are intrinsic to the execution of such
applications on a master-slave architecture, and we propose a
strategy that can significantly improve application scalability
by reducing the bottleneck at the master computer and
multiplying the capacity of the grid to distribute data files.
The proposed strategy implements a hierarchical scheme that
integrates task scheduling with file distribution. The strategy
is carried out in three steps: it clusters tasks that share common
files into jobs; it organizes the available processors into a
hierarchy; and it then maps groups of tasks onto the processor
hierarchy. Experimental results obtained by means of simulation
suggest that application scalability can be improved by one
order of magnitude.
The remainder of this paper is organized as follows. Section
II presents a model that describes architecture and application
features. Section III describes the problem. Section IV evaluates
the potential benefits of using hierarchical scheduling. A
hierarchical strategy is proposed in Section V and assessed by
means of simulation in Section VI. Final considerations are
presented in Section VII.
II. A SYSTEM MODEL
The motivating architecture for this work is comprised of
one master computer and a set of S slaves distributed among
C clusters. The master is responsible for controlling the slave
computers, and is usually the user's machine
from which the application is launched.
An application consists of a set of T independent tasks. By
independent we mean that no task depends upon
the execution of any other task, and there is no communication
among tasks. There is a set of F data files which are initially
stored in the master computer, and must be transmitted to the
slave computers to serve as input for the application tasks.
Every task requires one or more input files, and produces
only one output file. Each file provides input data for at least
one task and, quite often, for a very large number of tasks
of the same application. Such a relationship can be represented
as a bipartite graph, as depicted in Fig. 1. For instance, this
example shows a task T1 that depends upon files F1 and
F2. The master communicates with slave processors in order
to transmit input files sequentially (i.e., with no concurrent
transmission) by means of a dedicated full-duplex link. In this
model, we consider that file sizes and task execution times are
known a priori.
A. The Motivating Application
This model is motivated by previous works that involve
the execution of computationally expensive machine learning
algorithms for data mining in computational grids [18] and
clusters [16]. For instance, the project mentioned in [16]
involves the adaptation of Weka [21], a tool that is widely
used among data mining researchers and practitioners, to
execute on clusters of PCs. A specific class of data mining
algorithms is that of classification algorithms [5], which analyze
the characteristics of a dataset from a specific domain and try to
produce models that characterize a set of examples well.
For instance, a common application consists in evaluating
"which is the best classification algorithm from a list, for
a given dataset", i.e., which algorithm can produce a model
that best represents the characteristics of a given dataset. The
tenfold Cross Validation procedure [5] is a standard way of
predicting the error rate of a classifier given a single, fixed
sample of data. In the tenfold cross validation process, the
dataset is divided into ten equal parts (folds) and the classifier
is trained on nine folds and tested on the remaining
one. This procedure is repeated with ten different training sets,
and the estimated error rate is the average over the test sets. If
one would like to evaluate a list of classifier algorithms, say
30 algorithms, by means of tenfold cross validation, 300 tasks
could be created. Alternatively, a more accurate method is
N-fold cross validation [7]. In this method, a dataset with N
instances (items) is divided into N folds (each one containing
a single item), then the algorithm is trained with N−1 folds
and tested on the remaining one. The error rate for a given
algorithm is computed as the mean of the N validation tasks.
Thus, a dataset with 5,000 items to be tested with the same 30
algorithms will produce 150,000 tasks, all of them using the
same input file.
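To make the size of such a workload concrete, the short Python sketch below enumerates the (algorithm, held-out fold) pairs for the procedures described above; the function name and structure are only illustrative and are not part of any tool mentioned in this paper.

# Illustrative sketch: enumerate independent cross-validation tasks.
# Each task trains one algorithm on all folds but one and tests it on the
# held-out fold; every task reads the same input dataset file.
def enumerate_tasks(num_algorithms, num_folds):
    """Return (algorithm_id, held_out_fold) pairs, one per independent task."""
    return [(alg, fold)
            for alg in range(num_algorithms)
            for fold in range(num_folds)]

if __name__ == "__main__":
    print(len(enumerate_tasks(30, 10)))    # tenfold CV, 30 algorithms: 300 tasks
    print(len(enumerate_tasks(30, 5000)))  # leave-one-out on 5,000 items: 150,000 tasks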
III. DISTRIBUTING SHARED FILES
In order to illustrate scalability problems in the master-
slave architecture, this section shows some results of a real
application carried out in the Unisantos’ laboratory, which in-
volves the execution of the Cluster Genetic Algorithm (CGA)
[10]. The goal of CGA is to identify a finite set of categories
(clusters) to describe a given data set, both maximizing
homogeneity within each cluster and heterogeneity among
different clusters. Thus, objects that belong to the same cluster
should be more similar to each other than those objects that
belong to different clusters. In these experiments, we used a
dataset that is a benchmark for data mining applications (the
Congressional Voting Records), available at the UCI Machine
Learning Repository [13]. This dataset contains 435 instances
(267 Democrats, 168 Republicans) and 16 Boolean-valued
Fig. 2. The makespan for real experiments on a dedicated cluster (Workqueue; execution time versus number of processors).
attributes that represent key votes for each of the U.S. House
of Representatives Congressmen. Also, 203 instances that present
missing values have been removed for our experiments.
For these experiments we adopted the MyGrid [4] platform,
which is a lightweight and easy-to-use grid middleware
intended to execute applications comprised of independent
tasks. MyGrid implements the Workqueue [4] scheduling
algorithm. Initially, a set of tasks was executed sequentially
on a single machine with a Pentium IV (1.8 GHz) processor
and 1 GB of main memory. Then, the same set of tasks
was run on 2, 4, 8, 12, 16 and 20 dedicated machines
with similar hardware characteristics located in the laboratory
at Unisantos, interconnected by a 100 Mbps Ethernet LAN.
As one can note from Fig. 2, no significant reduction in
the execution time can be achieved with more than 9 slave
processors. After this limit, no performance gains can be
realized by adding slave computers. Such behavior is typical
of applications that cause high data transfer rates and high
resource consumption at the master processor, as we explain
in the following. The CGA application on the Congressional
dataset is such an example. Additional information about these
experiments, as well as about such scalability problems, can
be found in [17], [18].
Initially, the dataset is stored in some repository that is
accessible to the user’s machine which plays the role of
master processor. The master is responsible for controlling
the execution of the application tasks as well as transferring
input and output files to and from the grid machines. In such
a master slave platform, the more slave machines are added
to the system, the greater is the demand for data transfers
imposed to the master computer. If the number of slaves
exceeds the master’s capacity of delivering input files, the
addition of new slave processors in the system will force some
of them to stay idle while waiting for file transfers. Such a
situation is aggravated in presence of fine grain application
tasks, i.e., tasks with a low computation per communication
ratio.
In order to manage the execution of remote tasks, the master
usually spawns a small number of local processes that transmit
input files to slave processors and receive output files which
are sent back with results. Such processes are eliminated
Fig. 3. Simulated makespan for tasks with different granularities running on a master-slave platform (curves for granularities 2, 4, 8, 16, 32, 64, and 128; makespan versus number of processors).
after their corresponding tasks are completed. Thus, short
application tasks will raise the rate of creation and destruction
of control processes, degrading performance. Also,
the number of control processes executing concurrently at the
master node is proportional to the number of slaves under
its control. Thus, the consumption of resources at the master
processor (e.g., memory, I/O subsystem, network bandwidth,
CPU cycles) tends to be proportional to the number of slaves
with which it directly interacts for performing control and data
transfer operations.
A. Granularity and Scalability in Master-Slave Platforms
Performance limitations of the master-slave architecture
may appear with different numbers of processors, depending
on application characteristics. For instance, the execution
times of the application tasks and the transmission times
of their input files determine the maximum number of
processors that can be added to the system without losing
performance. Under some assumptions, such an effective number
of slaves can be estimated as follows. Let RT be the mean
ber of slaves can be estimated as follows. Let RT be the mean
response time for the execution of tasks, which is given by the
sum of their mean execution time (ET), mean time required
to transmit input files (TTIF), and mean time required to
transmit output files (TTOF):
RT = TTIF + ET + TTOF. (1)
Also, suppose that files are sequentially transmitted to
slave processors, i.e., the master does not perform concurrent
transmissions. In such a scenario, the maximum number of slave
processors that can be employed without loss of performance,
Seff, can be estimated as:

Seff = RT / (TTIF + TTOF). (2)
Obviously, this number depends on the granularity of the
application tasks, since it reflects the ratio between the times
involved in the execution of tasks and in the transmission of
files.
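Equations (1) and (2) can be read directly as the small calculation sketched below; this is only an illustration of the estimate under the stated assumptions (sequential transmissions, mean times known a priori), not part of any scheduler.

def effective_slaves(et, ttif, ttof):
    """Seff = RT / (TTIF + TTOF), with RT = TTIF + ET + TTOF (Equations 1 and 2)."""
    rt = ttif + et + ttof
    return rt / (ttif + ttof)

# Example: tasks computing for 128 time units with 16 time units of input-file
# transfer and negligible output (granularity 8) saturate the master at ~9 slaves.
print(effective_slaves(et=128.0, ttif=16.0, ttof=0.0))  # 9.0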
In order to illustrate the influence of granularity (here ex-
pressed as the computation time per transmission time ratio) on
the effective number of slaves, some simulation experiments
were carried out using SimGrid [11]. For the experiments, we
consider a dedicated cluster comprised of the master node
and a set of slave nodes. We assume a shared link and a
master capable of handling only one transmission at any
given time (no concurrent transmissions). The link is assumed
to be full-duplex, so that the master node can send an input file
to one processor and concurrently receive the output file from
another processor. According to Yang and Casanova [22],
this corresponds to the one-port model, which is suitable
for simulating LAN connections.
For these experiments, different granularities were produced
by varying the time to transmit files, while all other parameters
(e.g. number of tasks, their execution times, total amount
of work of the application) were fixed. In this example, a
granularity equal to 2 means that the execution of a task takes
twice as long as the transmission of the files involved in that
task. The experiment involves 32,000 tasks which take 128
time units to execute, and the bandwidth of the communication
links is 1,000,000 bytes per second. File sizes started at
2,000,000 bytes, doubling up to 64,000,000 bytes, producing
tasks whose computation-to-communication ratio varies from
2 to 64. The metric adopted for this experiment is the makespan
[14], which is the time taken to execute all the application tasks
as well as the involved transmissions of input and output files.
As shown in Fig. 3, the reduction of the makespan is strongly
dependent on the granularity of the application tasks. These
results show that the coarser the granularity of the application
tasks, the more processors can be effectively added to the
system, leading to reductions in execution times.
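The plateau in Fig. 3 can also be reproduced with a back-of-the-envelope model of the one-port master: the master needs (number of tasks × TTIF) time just to push input files serially, so the makespan cannot drop below that bound no matter how many slaves are added. The sketch below is our own simplification of the simulated setting, not the SimGrid model itself.

def makespan_one_port(num_tasks, et, ttif, p):
    """Lower-bound makespan on a one-port master-slave platform: either the slaves
    are the bottleneck (num_tasks/p rounds of ET + TTIF) or the master's serial
    file transmissions are (num_tasks * TTIF)."""
    per_slave = (num_tasks / p) * (et + ttif)
    master_link = num_tasks * ttif
    return max(per_slave, master_link)

# 32,000 tasks of 128 time units with TTIF = 16 (granularity 8):
for p in (1, 4, 9, 16, 100):
    print(p, makespan_one_port(32000, 128.0, 16.0, p))
# Beyond roughly 9 slaves the makespan flattens at 32000 * 16 = 512,000 time units.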
Although performance limits are most critical for fine-granularity
applications, it is worth noting that the same problem may also
occur in the presence of coarse-grained applications, depending
upon the application characteristics. Our work aims at extending
such scalability limits, i.e., at increasing the effective number
of processors that can be added to a grid system for a given
application, so that its overall execution time can be reduced.
B. Grouping Tasks to Improve Scalability
As mentioned before, the main cause of poor scalability is
the bottleneck at the master computer. Despite the severe
limitations in scalability, only a small number of papers is devoted
to this problem. In [17], [18], Silva et al. propose a scheduling
algorithm that organizes a given application into groups of tasks
that share the same input files, thus minimizing file transfers.
For didactic purposes, this section presents details of the
grouping of tasks that share input files, because our proposal
also adopts the grouping technique.
At scheduling time, tasks are clustered to form jobs, so
that only one job is created per machine. The number of file
transfers is minimized because the tasks that comprise a given
job share the same input files, thus reducing the number of
file transmissions to be delivered by the master computer.
The principle behind this technique is the improvement of the
computation-to-communication ratio, i.e., the granularity of
the work units, since a job is composed of a group of tasks.
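A minimal sketch of the grouping step is shown below: tasks are bucketed by their (shared) set of input files and the buckets are then split into one job per machine. The data layout (a task as a pair of an identifier and its input files) is our own illustration and not the exact representation used in [17], [18].

from collections import defaultdict

def group_tasks_into_jobs(tasks, num_machines):
    """Cluster tasks that share the same input files, then spread the clusters
    over `num_machines` jobs so that each input file is sent once per machine."""
    by_files = defaultdict(list)
    for task_id, input_files in tasks:
        by_files[frozenset(input_files)].append(task_id)

    jobs = [[] for _ in range(num_machines)]
    # Place the largest clusters first, round-robin, for a rough balance.
    clusters = sorted(by_files.items(), key=lambda kv: -len(kv[1]))
    for i, (files, task_ids) in enumerate(clusters):
        jobs[i % num_machines].append((sorted(files), task_ids))
    return jobs

# Example: 6 tasks over 2 shared files, grouped into jobs for 2 machines.
tasks = [(t, ["F1"]) for t in range(3)] + [(t, ["F2"]) for t in range(3, 6)]
print(group_tasks_into_jobs(tasks, 2))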
Fig. 4. Makespan for simulated experiments using pure Workqueue and Workqueue with Grouping, for granularities of 2, 16, and 128 (makespan versus number of processors).
To illustrate the scalability gains obtained with the grouping
technique, we simulated a dedicated cluster comprised of
the master and a set of slave processors. The application
involves 32,000 fine-grained tasks which take 128 time units to
be executed by homogeneous slaves. We assume that (similarly
to our motivating application) each task requires one input file
and produces one output file, both transmitted through the
communication link. Such a situation also holds for many
other real applications mentioned in Section I. For the sake of
simplicity, we also consider that each task produces one output
file whose size is at least one order of magnitude smaller than
the input file, so that its transmission time is negligible.¹ Fig. 4
shows the simulation results as total execution times, for
an application executing on grids scaling from 1 to 1,000
slave nodes. Our simulation evaluates the Workqueue and
Workqueue with Grouping algorithms.
In summary, the Workqueue algorithm puts all tasks in a
queue and maps each of them to an idle processor. Whenever
an idle processor is detected, the next task in the queue is
assigned to it. When a task finishes, it is removed from the
queue. The algorithm runs while there remain tasks to
be completed. The original Workqueue does not consider
file sharing, and transmits input files to a slave processor
every time a task is scheduled to it. Workqueue with
Grouping extends the original algorithm by grouping tasks
that share input files to compose one job. The latter algorithm
creates only one job for each slave processor, so that each
input file is sent only once to each slave. As suggested
by the experiments shown in Fig. 4, there are scalability
limits for both scheduling algorithms. The pure Workqueue
performs very poorly, scaling up to only 9 slave processors.
Beyond this number of processors, no reduction in the execution
time can be achieved. As grouping tasks raises this limit to a few
hundred slave nodes, this technique will be applied in our
hierarchical strategy.
¹ Such an assumption is commonly adopted in simulation models found in the
literature [7], [15], and reflects a characteristic of our motivating application
mentioned in Section II-A, which may use very large input files and produces
output files of a few hundred bytes.
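The practical difference between the two policies lies in how often an input file crosses the master's link, which the sketch below counts while abstracting away all timing; the round-robin choice of slave is a stand-in for "next idle processor" and the data layout is hypothetical.

def workqueue_transfers(tasks):
    """Pure Workqueue: input files are retransmitted every time a task is dispatched."""
    return sum(len(input_files) for _task_id, input_files in tasks)

def grouping_transfers(tasks, num_slaves):
    """Workqueue with Grouping: each shared input file is sent at most once per slave."""
    sent = set()                                 # (slave, file) pairs already transmitted
    transfers = 0
    for i, (task_id, input_files) in enumerate(tasks):
        slave = i % num_slaves                   # stand-in for "next idle slave"
        for f in input_files:
            if (slave, f) not in sent:
                sent.add((slave, f))
                transfers += 1
    return transfers

tasks = [(t, ["dataset.arff"]) for t in range(32000)]
print(workqueue_transfers(tasks))      # 32,000 file transmissions
print(grouping_transfers(tasks, 10))   # 10 file transmissions (one per slave)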
Fig. 5. The master-supervisor-slave architecture (master M, supervisors S1, ..., Sm, links, and slaves).
Since each input file is transmitted only once to each slave
processor that executes tasks which need it, the maximum
number of slaves that can be effectively added to the system
without loss of performance may be estimated for schedules
that adopt task grouping. Under the same assumptions
considered for Equation 2, the expected response time for a
job (RTjob) can be computed as
RTjob = TTIF + NTjob × (ET + TTOF), (3)
where NTjob is the number of grouped tasks to form one
job, TTIF is the time required to transmit the input file
(which is the same for all tasks in this job), and TTOF is
the transmission time for output files produced by each task
in the job. Thus, the maximum number of slaves that can be
effectively controlled by the master node is

S′eff = RTjob / (TTIF + NTjob × TTOF). (4)
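Equations (3) and (4) translate into the small calculation below (an illustrative reading only, under the same assumptions as Equation 2).

def effective_slaves_with_grouping(et, ttif, ttof, nt_job):
    """S'eff = RTjob / (TTIF + NTjob * TTOF), with
    RTjob = TTIF + NTjob * (ET + TTOF)   (Equations 3 and 4)."""
    rt_job = ttif + nt_job * (et + ttof)
    return rt_job / (ttif + nt_job * ttof)

# With negligible output files (TTOF ~ 0), grouping 100 tasks per job lets the
# master feed roughly 100x more slaves than the ungrouped estimate of Equation 2.
print(effective_slaves_with_grouping(et=128.0, ttif=16.0, ttof=0.0, nt_job=100))  # 801.0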
As illustrated in Fig. 4, there are scalability limits for
both scheduling algorithms. The pure Workqueue performs
very poorly, while task clustering performs better.
Thus, although task clustering has been shown to reduce the
minimum makespan by about one order of magnitude, a
performance limit can still be observed because of the bottleneck
at the master processor. In the next section we investigate
whether the use of a hierarchical topology, as opposed to the
traditional master-slave architecture, can reduce the bottleneck
and improve the scalability of the system.
IV. ISOEFFICIENCY AND SCALABILITY
Scalability may be defined as "the system's ability to
increase speedup as the number of processors increase" [9].
Another definition, not based on the concept of speedup,
is the following: "An algorithm-machine combination is scal-
able if the achieved average speed of the algorithm on the
given machine can remain constant with increasing number of
processors, provided the problem size can be increased with
the system size" [19]. This last definition is important since it
relates the scalability to the combination of a machine and an
algorithm, instead of being a property of either the machine
or the algorithm. Based on those definitions, in this section
we introduce one scalability metric: the isoefficiency function
[9], based on the concept of parallel computing efficiency.
Grama, Gupta and Kumar proposed the isoefficiency concept
[9]. Isoefficiency fixes the efficiency and measures how much
work must be increased to keep the efficiency unchanged as
the machine scales up. An isoefficiency function f(P) relates
Fig. 6. Efficiency of the executions when all tasks share the same input file (efficiency versus number of processors, for ratios 2, 4, 8, 16 and 32).
Fig. 7. Isoefficiency function when all tasks share the same input file (number of tasks versus number of processors, for ratios 2, 4, 8, 16 and 32).
machine size (P) to the amount of work needed to maintain
the efficiency. Parallel computing efficiency is defined as

E = Tseq / (P × Tpar), (5)

where Tseq is the time for a sequential execution, and Tpar is
the time for a parallel execution with P processors.
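The following sketch shows how Equation (5) is applied and how an isoefficiency point can be obtained from it by growing the workload until a target efficiency is reached; parallel_time is a hypothetical placeholder for whatever model or simulator produces Tpar, and the toy model at the end is ours, not the simulated platform of this section.

def efficiency(t_seq, t_par, p):
    """Parallel efficiency E = Tseq / (P * Tpar), Equation (5)."""
    return t_seq / (p * t_par)

def isoefficiency_point(parallel_time, task_time, p, target=0.99, max_doublings=40):
    """Smallest workload (number of tasks) at which efficiency reaches `target`
    on P processors; `parallel_time(num_tasks, p)` models the parallel makespan."""
    num_tasks = p
    for _ in range(max_doublings):
        if efficiency(num_tasks * task_time, parallel_time(num_tasks, p), p) >= target:
            return num_tasks
        num_tasks *= 2
    return None  # target efficiency not reachable under this model

# Toy model: the master sends one shared file (16 time units) to each of the P
# slaves, then the slaves compute num_tasks/P tasks of 128 time units each.
toy = lambda num_tasks, p: p * 16 + (num_tasks / p) * 128
print(isoefficiency_point(toy, 128.0, 50))  # the required workload grows roughly with P^2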
In this section we present the isoefficiency functions for the
execution of independent tasks with shared files on a
master-slave platform. The simulated platform was composed
of up to 400 homogeneous and dedicated processors,
and the application was composed of a variable number of
Fig. 8. Isoefficiency function when each task has its own input file (number of tasks versus number of processors, for ratios 2, 4, 8, 16 and 32).
Fig. 9. Isoefficiency function for the hierarchical platform (number of tasks versus number of processors, for ratios 2, 4, 8, 16 and 32).
Fig. 10. Comparing the isoefficiency functions of the hierarchical and master-slave platforms (for ratios 2, 4, 8, 16 and 32).
tasks, depending on the amount of work necessary to keep the
efficiency constant. Each task takes 8 time units to complete
(ET), and the amount of time needed to send the input
files (TTIF) varies in order to obtain different ratios. In the
following experiments we simulated the following ratios: 2, 4,
8, 16 and 32. Figure 6 shows the efficiency of the experiments
when tasks share the same input file. It can be verified that the
efficiency is kept around 0.99 for all ratios. Figure 7 shows
the corresponding Isoefficiency functions. It is worth noting
the parabolic shape of all curves. We also executed the same
application when each task has its own input file. In this case a
round-robin strategy was used to map tasks to processors. The
round-robin strategy is optimal for master-slave platforms that
are homogeneous and dedicated [8]. For those executions the
input files are sent to slave nodes before each task execution.
Figure 8 shows the corresponding Isoefficiency functions, for
efficiencies around 0.99. It can be seen that the execution of
an application comprised of independent tasks which have
different input files is not scalable. A third set of experiments
is shown in Figure 9. We have simulated a hierarchical platform
composed of one master node and several supervisor and
slave nodes. The maximum number of slave nodes considered
is 200. It is possible to verify that the curves also have a
parabolic shape. However, the rate of growth of the parabolic-
like function is smaller when compared to the master-slave
platform, as the direct comparison of the curves of Figure 7
and Figure 9 shows (see Figure 10).
V. HIERARCHICAL SCHEDULING
As shown in Section IV, an architecture with a hierarchical
topology presents higher scalability than a master-slave
architecture for the execution of applications comprised of
present a hierarchical scheme for file distribution and task
scheduling that aims at improving the application scalability
in such a scenario. The problem of scheduling independent tasks
with shared input files stored in a centralized repository appears
for many applications mentioned in Section I, and has been
studied in [2], [3], [6], [8], [17], [18]. In this scenario, the master
node is implemented by the user's machine, which accesses
a centralized repository and launches the application tasks
to be executed by the slave processors. The main function
of the master node is to take actions for application
coordination (e.g., scheduling, control of completed tasks) and
file distribution. Our hierarchical scheme aims at alleviating
the bottleneck at the master node by reducing the number of file
transfers and control actions it must perform.
In order to reduce the workload at the master node, we
propose the addition of a number of supervisor nodes to the
master-slave architecture. The supervisor is responsible for
controlling the execution of the application tasks as well as
transmitting files to the slave nodes. The master groups tasks
together, to form execution units comprised by tasks that share
common files, namely the jobs that will be distributed among
the supervisor nodes. In turn, each supervisor is responsible
to control the execution of this job on a subset of slave
nodes. In this model, there is no direct interaction between the
master and slave nodes. Instead, the master delegates jobs to
supervisors which are responsible for communicating to slave
nodes and managing the execution of the application tasks.
A. A Strategy for Hierarchical Scheduling
First, consider a distributed architecture comprised of one
master node M and a collection of P processors placed in a
collection of C clusters interconnected by communication
links. The application is comprised of T tasks, where T is
at least one order of magnitude larger than P (T >> P).
The collection of processors may be partitioned into two sets:
the former comprised of S slave computers, and the latter
comprised of N supervisor computers. Under such conditions,
we propose the following steps to be carried out by the master
node for launching the application execution (a code sketch
follows the list):
1) Initially, the master obtains static information about the
available resources, e.g., the number and identification
of available computers in the system, processor speed,
and memory. By the end of this step, the parameter P
can be known.
2) For each known cluster, compute its number of supervisors
Ni, so that every cluster has at least one supervisor,
and clusters with a large number of processors
have additional supervisors according to the maximum
number of slaves a supervisor can efficiently control. More
precisely, the number Ni of supervisors for the i-th
cluster is computed as

Ni = ⌈Pi / S′eff⌉, (6)

where Pi is the number of processors in the i-th cluster,
and S′eff is the maximum number of processors a
supervisor can efficiently control when task grouping
is applied (see Equation 4 in Section III). Then obtain
N, the total number of supervisors in the system, as

N = Σi=1..C Ni. (7)
3) Compute the number of slaves S = P − N.
4) As proposed by Silva et al. in [17], group the application
tasks into S execution units, namely the jobs, so that
each job can be assigned to one supervisor and its slaves.
Each job groups together a number of tasks (in the range
[⌊T/S⌋, ⌈T/S⌉]) that share common input files.
The mapping of jobs to processors can be done at
random, or by means of some heuristic similar to those
presented in [6]–[8], [17]. For instance, the granularity
(i.e., the number of tasks) of each job could be adjusted
according to the slave processors' capacities. The result of
this step is a list of 2-tuples containing the identification
of a job and the slave processor it was assigned to.
5) Decompose the job list into N sub-lists, so that each sub-
list corresponds to one supervisor and contains the jobs
assigned to the slave processors under its control. This
step can be accomplished in time O(N) by selecting the
sub-list each job must be moved to. For each supervisor,
send the sub-list containing the information on the jobs it
has been assigned, and transmit the required input
files (only once).
6) Wait for tasks to be executed and results to be returned.
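A compact sketch of steps 2-5 is given below: it computes the supervisor counts per cluster, the totals N and S, and decomposes a job list into per-supervisor sub-lists. The names (cluster_sizes, s_eff) are illustrative, and the job-to-supervisor mapping here is simply round-robin rather than one of the heuristics of [6]-[8], [17].

import math

def plan_hierarchy(cluster_sizes, s_eff):
    """Steps 2-3: one supervisor per ceil(Pi / S'eff) processors in each cluster,
    then N = sum of the Ni (Equations 6 and 7) and S = P - N slaves."""
    supervisors_per_cluster = [max(1, math.ceil(p_i / s_eff)) for p_i in cluster_sizes]
    n = sum(supervisors_per_cluster)
    s = sum(cluster_sizes) - n
    return supervisors_per_cluster, n, s

def make_sublists(jobs, supervisor_ids):
    """Step 5: decompose the job list into one sub-list per supervisor."""
    sublists = {sup: [] for sup in supervisor_ids}
    for i, job in enumerate(jobs):
        sublists[supervisor_ids[i % len(supervisor_ids)]].append(job)
    return sublists

# Example: two clusters with 120 and 60 processors, S'eff = 100.
print(plan_hierarchy([120, 60], 100))           # ([2, 1], 3, 177)
print(make_sublists(["job0", "job1", "job2"], ["sup0", "sup1", "sup2"]))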
As soon as a supervisor receives a job list and input files,
it distributes the tasks to its slave processors. Every time a
supervisor is notified that a task has completed, it performs
the following steps (sketched in code after the list):
1) Set the task status as DONE.
2) Forward the notification and results back to the master.
3) While there remain tasks to execute (READY) whose
input files have already been transmitted to the idle slave,
the supervisor assigns the next task to the idle processor.
4) When no READY task can be found, the supervisor
looks for some uncompleted task (RUNNING) whose
input files have already been transmitted to the idle
slave, and creates a replica of such task in the idle slave
processor. Replication can improve the probability that
such a task will be concluded earlier, as well as provide
some level of fault tolerance.
5) When steps 3 and 4 are completed (all tasks have been
concluded), the supervisor asks the master node for
incomplete tasks from other processors.
6) If the master has no tasks to be executed, then it finishes.
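The supervisor-side behaviour described above can be sketched as the small state machine below (statuses READY, RUNNING, DONE). The callbacks notify_master and dispatch are hypothetical placeholders, input files are assumed already staged on the slaves, and replication (step 4) is shown in its simplest form.

READY, RUNNING, DONE = "READY", "RUNNING", "DONE"

class Supervisor:
    def __init__(self, task_ids, notify_master, dispatch):
        self.status = {t: READY for t in task_ids}
        self.notify_master = notify_master   # forwards results to the master (step 2)
        self.dispatch = dispatch             # starts (or replicates) a task on a slave

    def on_task_completed(self, task_id, result, idle_slave):
        self.status[task_id] = DONE                                   # step 1
        self.notify_master(task_id, result)                           # step 2
        ready = [t for t, s in self.status.items() if s == READY]
        running = [t for t, s in self.status.items() if s == RUNNING]
        if ready:                                                     # step 3
            nxt = ready[0]
        elif running:                                                 # step 4: replicate
            nxt = running[0]
        else:
            return  # step 5: the supervisor would now ask the master for more work
        self.status[nxt] = RUNNING
        self.dispatch(nxt, idle_slave)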
TABLE I
PERFORMANCE FOR WORKQUEUE (WQ), WORKQUEUE WITH GROUPING (WQ+G), AND HIERARCHICAL WORKQUEUE (WQ+H)

Processors (P) | WQ Makespan | WQ Eff. | WQ+G Makespan | WQ+G Eff. | WQ+H Makespan | WQ+H Eff.
     1         |   256000    |  1.00   |    224001     |   1.00    |    224002     |   1.00
     5         |    51204    |  0.87   |     44805     |   1.00    |     44806     |   1.00
     8         |    32007    |  0.87   |     28008     |   1.00    |     22410     |   1.00
    10         |    32007    |  0.70   |     22409     |   1.00    |     22410     |   1.00
    50         |    32007    |  0.14   |      4509     |   0.99    |      4510     |   0.99
   100         |    32007    |  0.07   |      2294     |   0.98    |      2270     |   0.99
   400         |    32007    |  0.02   |       764     |   0.73    |       589     |   0.95
   500         |    32007    |  0.01   |       702     |   0.64    |       482     |   0.93
  1000         |    32007    |  0.01   |       673     |   0.33    |       263     |   0.85
  2000         |    32007    |  0.00   |       674     |   0.17    |       161     |   0.70
  3000         |    32007    |  0.00   |       674     |   0.11    |       134     |   0.56
  5000         |    32007    |  0.00   |       675     |   0.07    |       123     |   0.36
 10000         |    32007    |  0.00   |       684     |   0.03    |       124     |   0.18
Every supervisor node maintains a list of the jobs and
tasks under its control, and the master maintains a list of
all jobs and tasks of the application. For each task, there
is a status (READY, RUNNING, DONE), which is updated
by means of messages exchanged among the processors upon
the corresponding events.
VI. SIMULATION RESULTS
The hierarchical scheduling strategy was evaluated by
means of simulation, using the same application and architecture
scenario described in Section III. For the simulations,
homogeneous, dedicated computers and communication links
were considered. Also, the simulated application is comprised
of homogeneous tasks. The results are shown in
Table I. For these experiments, two metrics were used: the total
makespan [14] and the computational efficiency (see Equation
5). The total makespan is computed as the time between the
transmission of the first input file and the return of the last
output file.
As shown in Table I, with the pure Workqueue algorithm,
the makespan can be reduced to a minimum of 32,007 time units,
using eight processors. Beyond this threshold, the makespan
cannot be shortened by adding processors. With the Workqueue
with Grouping (WQ+G) algorithm, the makespan can be reduced
to a minimum of 673 time units, by employing 1,000 nodes.
In this experiment, the execution with 500 nodes is concluded
in 702 time units and the computational efficiency is around
0.64. However, beyond this threshold the addition of processors
will not lead to a significant reduction in execution times, and
the efficiency drops fast. Finally, with the Hierarchical
(WQ+H) algorithm, the makespan can be reduced to a
minimum of 123 time units, which is achieved with
5,000 processors. In this experiment, the execution with 3,000
nodes is concluded in 134 time units and the computational
Fig. 11. Makespan for simulated experiments using Workqueue (WQ), Workqueue with Grouping (WQ+G), and Hierarchical Workqueue (WQ+H) scheduling, for a fixed granularity (makespan versus number of processors).
efficiency is 0.56. However, it is worth noting that after
this threshold the addition of processors will not lead to a
significant reduction in the makespan and the efficiency will
gracefully decrease.
VII. CONCLUSIONS AND FUTURE WORK
Independent tasks with shared files constitute an important
class of applications that can benefit from the computational
power delivered by computational grids and clusters. The
execution of such applications on master-slave architectures
creates a bottleneck at the master computer, which limits
system scalability. The bottleneck appears because the master
node is responsible for scheduling tasks and transmitting input
files to slave processors. Such a limitation is aggravated by the
existence of fine-grained application tasks, which increase both
the rate of file transmissions to slave processors and the rate of
scheduling actions taken by the master.
As a contribution, we propose and assess a strategy that
orchestrates file transfers and the mapping of tasks to
processors in an integrated manner. Our strategy works by
mapping groups of tasks onto a hierarchical arrangement of
processors, leading to significant improvements in application
scalability, mainly in the presence of fine-grained tasks. The basis
for such improvement comes from the fact that our strategy:
(i) groups tasks that share input files together, thus improving
the application granularity and reducing the number of file
transfers on the communication links; (ii) multiplies the system's
capacity to transmit input files to slave processors, by
means of adding supervisors to the master-slave model; and
(iii) reduces the number of scheduling actions to be taken by
the master, by delegating them to a set of supervisor nodes.
These results apply to the execution of applications comprised
of independent tasks with shared files, executing on top
of master-slave platforms.
Although in this paper we evaluated an instance of the master-
supervisor-slave hierarchy with only one level of supervisors,
a more general scheme should be evaluated in the future. Also,
future evaluations must emphasize analytical and theoretical
aspects of the scalability limits that can be achieved by means
of techniques discussed here.
REFERENCES
[1] Berman, F. High-performance schedulers. In: Foster, I., Kesselman, C.
(editors) The Grid: Blueprints for a New Computing Infrastructure, pp.
279-309. Morgan-Kaufmann, 1999.
[2] Casanova, H., Legrand, A., Zagorodnov, D., Berman, F.: Using Simulation
to Evaluate Scheduling Heuristics for a Class of Applications in Grid
Environments. Research Report 99-46, LIP-ENS, Lyon, France, 1999.
[3] Casanova, H., Legrand, A., Zagorodnov, D., Berman, F.: Heuristics for
Scheduling Parameter Sweep Applications in Grid Environments. In: 9th
Het. Computing Workshop, 2000, pp. 349-363. IEEE CS Press, 2000.
[4] Cirne, W., Paranhos, D., Costa, L., Santos-Neto, E., Brasileiro, F.,
Sauvé, J., Osthoff, C., Silva, F.A.B., Silveira, C.: Running Bag-of-Tasks
Applications on Computational Grids: The MyGrid Approach. In: Intl.
Conf. Parallel Processing - ICPP, 2003.
[5] Fayyad, U. M., Shapiro, G. P., Smyth, P.: From Data Mining to Knowl-
edge Discovery: An Overview. In: Advances in Knowledge Discovery
and Data Mining, Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthu-
rusamy, R., Editors, MIT Press, pp. 1-37, 1996.
[6] Giersch, A., Robert, Y., Vivien, F.: Scheduling Tasks Sharing Files on
heterogeneous clusters. Research Report RR-2003-28, LIP, ENS Lyon,
France, May, 2003.
[7] Giersch, A., Robert, Y., Vivien, F.: Scheduling Tasks Sharing Files from
Distributed Repositories. Technical Report N. 5214, INRIA, France,2004.
[8] Giersch, A., Robert, Y., Vivien, F.: Scheduling tasks sharing files on
heterogeneous master-slave platforms. In: 12th Euromicro Workshop in
Par., Dist. and Network-based Proc., pp. 364-371. IEEE CS Press, 2004.
[9] Grama, A., Gupta A., and Kumar, V. Isoefficiency: Measuring the
Scalability of Parallel Algorithms and Architectures. IEEE Parallel and
Distributed Technology, Vol 1, No 3, 1993.
[10] Hruschka, E. R., Ebecken, N.F.F.: A genetic algorithm for cluster
analysis, Intelligent Data Analysis (IDA), v.7, pp. 15-25, IOS Press, 2003.
[11] Legrand, A., Lerouge, J.: MetaSimGrid: Towards Realistic Scheduling
Simulation of Distributed Applications. Research Report N. 2002-28.
Laboratoire de l'Informatique du Parallélisme, ENS, Lyon, 2002.
[12] Maheswaran, M., Ali, S., Siegel, H.J., Hensgen, D., Freund, R.:
Dynamic Matching and Scheduling of a Class of Independent Tasks
onto Heterogeneous Computing Systems. 8th Heterogeneous Computing
Workshop (HCW’99), April, 1999.
[13] Merz, C.J., Murphy, P.M.: UCI Repository of Machine Learning Data-
bases, http://www.ics.uci.edu, Irvine, CA, University of California.
[14] Pinedo, M.: Scheduling: Theory, Algorithms and Systems. Prentice Hall,
Englewood Cliffs, NJ, 1995.
[15] Ranganathan, K., Foster, I.: Simulation Studies of Computation and Data
Scheduling Algorithms for Data Grids. Journal of Grid Computing 1(1),
2003, pp.53-62. Kluwer Academic Publishers, The Netherlands, 2003.
[16] Senger, H., Hruschka, E.R., Silva, F.A.B., Sato, L.M., Bianchini, C.P.,
Esperidio, M.D.: Inhambu: Data Mining Using Idle Cycles in Clusters
of PCs. In: Proc. IFIP Intl. Conf. on Network and Parallel Computing
(NPC’04), Wuhan, China, 2004. LNCS, Vol. 3222, pp.213-220. Springer-
Verlag, Berlin Heidelberg New York (2004).
[17] Silva, F.A.B., Carvalho, S., Senger, H., Hruschka, E.R., Farias, C.R.G.:
Running Data Mining Applications on the Grid: a Bag-of-Tasks Ap-
proach. In: Int. Conf. on Computational Science and its Applications
(ICCSA), Assisi, Italy. LNCS, Vol. 3044, pp.168-177. Springer-Verlag,
Berlin Heidelberg New York (2004).
[18] Silva, F.A.B., Carvalho, S., Hruschka, E.R.: A Scheduling Algorithm for
Running Bag-of-Tasks Data Mining Application on the Grid. In: Euro-Par
2004, Pisa, Italy, 2004. LNCS, Vol. 3419, pp.254-262. Springer-Verlag,
Berlin Heidelberg New York (2004).
[19] Sun, X. and Rover, D.T. Scalability of Parallel Algorithm-Machine
Combinations. IEEE Transactions on Parallel and Distributed Systems,
Vol 5, No 6, June 1994.
[20] D. Talia, ”Parallelism in Knowledge Discovery Techniques”, Proc. Sixth
Int. Conference on Applied Parallel Computing, Helsinki, LNCS 2367,
pp. 127-136, June 2002.
[21] Witten, I. H., Frank, E.: Data Mining: Practical machine learning tools
with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
[22] Yang, Y., van der Raadt, K., Casanova, H.: Multi-Round Algorithms for
Scheduling Divisible Workloads. in IEEE Transactions on Parallel and
Distributed Systems (TPDS), 16(11), 1092–1102, 2005.