Dynamic Multi Phase Scheduling
for Heterogeneous Clusters
Florina M. Ciorba1,2, Theodore Andronikos1, Ioannis Riakiotakis1,
Anthony T. Chronopoulos3,4 and George Papakonstantinou1
1Computing Systems Laboratory
Dept. of Electrical and Computer Engineering
National Technical University of Athens
Zografou Campus, 15773, Athens, Greece
2Student Member IEEE
3Dept. of Computer Science
University of Texas at San Antonio
6900 N. Loop 1604 West, San Antonio, TX 78249
4Senior Member IEEE
Email: {cflorina,tedandro,iriak}@cslab.ntua.gr, atc@cs.utsa.edu
Abstract
Distributed computing systems are a viable and less expensive alternative to parallel
computers. However, concurrent programming methods in distributed systems have not been
studied as extensively as for parallel computers. Some of the main research issues are how to
deal with scheduling and load balancing of such a system, which may consist of heterogeneous
computers. In the past, a variety of dynamic scheduling schemes suitable for parallel loops
(with independent iterations) on heterogeneous computer clusters have been obtained and
studied. However, no study of dynamic schemes for loops with iteration dependencies has been
reported so far. In this work we study the problem of scheduling loops with iteration
dependencies for heterogeneous (dedicated and non-dedicated) clusters. The presence of iteration
dependencies incurs an extra degree of difficulty and makes the development of such schemes
quite a challenge. We extend three well-known dynamic schemes (CS, TSS and DTSS) by
introducing synchronization points at certain intervals so that processors compute in pipelined
fashion. Our scheme is called dynamic multi-phase scheduling (DMPS) and we apply it to
loops with iteration dependencies. We implemented our new scheme on a network of
heterogeneous computers and studied its performance. Through extensive testing on two real-life
applications (the heat equation and the Floyd-Steinberg algorithm), we show that the proposed
method is efficient for parallelizing nested loops with dependencies on heterogeneous systems.
Index Terms - heterogeneous distributed systems, loop scheduling, dynamic algorithms,
dependence loops, pipelined execution.
1 Introduction
Loops are one of the largest sources of parallelism in scientific programs. The iterations within a
loop nest are either independent (called parallel loops) or precedence constrained (called depen-
dence loops). Furthermore, the precedence constraints can be uniform (constant) or non-uniform
throughout the execution of the program. A review of important parallel loop scheduling algo-
rithms is presented in [7] (and references therein) and some recent results are presented in [4].
Research results also exist on scheduling parallel loops on message passing parallel systems and
on heterogeneous systems [1],[2],[3],[5],[6],[8],[9],[10],[11],[15],[16]. Static scheduling schemes
for dependence loops have been studied extensively for shared memory and distributed memory
systems [18] (and references therein), [27],[26],[28],[25],[19],[20].
Loops can be scheduled statically at compile-time or dynamically at run-time. Static schedul-
ing is applicable to both parallel and dependence loops. It has the advantage of minimizing the
scheduling time overhead and achieving near-optimal load balancing when the execution
environment is homogeneous with uniform and constant workload. Examples of such schemes are
Block and Cyclic scheduling [29],[18]. However, most clusters nowadays are heterogeneous and
non-dedicated to specific users, yielding a system with variable workload. When static schemes
are applied to heterogeneous systems with variable workload, performance deteriorates severely.
Dynamic scheduling algorithms adapt the assigned number of iterations to match the workload
variation of both homogeneous and heterogeneous systems. An important class of dynamic
scheduling algorithms is the self-scheduling schemes (such as CS [23], GSS [24], TSS [13],
Factoring [1], and others [11]). On distributed systems these schemes are implemented using a
Master-Slave model. All dynamic schemes proposed so far apply only to parallel loops without
dependencies.
Another very important factor in achieving near-optimal execution time in distributed systems
is load balancing. Distributed systems are characterized by heterogeneity. To offer load
balancing, loop scheduling schemes must take into account the processing power of each computer
in the system. The processing power depends on CPU speed, memory, cache structure and even the
program type. Furthermore, processing power depends on the workload of the computer throughout
the execution of the problem. Therefore, load balancing methods adapted to distributed
environments take
into account the relative computing powers of the computers. These relative computing powers are
used as weights that scale the size of the sub-problem assigned to each processor. This significantly
improves the total execution time when a non-dedicated heterogeneous computing environment is
used. Such algorithms were presented in [4],[9]. A recent algorithm that improves TSS by taking
into account the processing powers of a non-dedicated heterogeneous system is DTSS (Distributed
TSS) [6].
When loops without dependencies are parallelized with dynamic schemes, the index space is
partitioned into chunks, and the master assigns these chunks to processors upon request. Through-
out the parallel execution, every slave works independently and upon chunk completion sends the
results back to the master. This approach is not suitable for dependence loops because,
due to dependencies, iterations in one chunk depend on iterations in other chunks. Hence slaves
need to communicate. Inter-processor communication is the foremost cause of performance
deterioration when parallelizing loops with dependencies. No study of dynamic algorithms
for loops with dependencies on homogeneous or heterogeneous clusters has been reported so far.
In this paper, we study the problem of dynamic scheduling of uniform dependence loops on
heterogeneous distributed systems. We extend three well-known dynamic schemes (CS, TSS and
DTSS) and apply them to dependence loops. After partitioning the index space into chunks (using
one of the three schemes), we introduce synchronization points at certain intervals so that proces-
sors compute chunks in pipelined fashion. Synchronization points are carefully placed so that the
volume of data exchange is reduced and the pipeline parallelism is improved. Our scheme is called
dynamic multi-phase scheduling (DMPS(x)), where (x) stands for one of the three algorithms,
considered as an input parameter to DMPS. We implement our new scheme on a network of
heterogeneous (dedicated and non-dedicated) computers and evaluate its performance through ex-
tensive simulation and empirical testing. Two case studies are examined: the Heat Equation and
the Floyd-Steinberg dithering algorithm. The experimental results validate the presented theory
and corroborate the efficiency of the parallel code.
Section 2 gives the algorithmic model and some notation. In Section 3, we present our
algorithm and its motivation. In Section 4, the implementation, the case studies and the
experimental results are presented. In Section 5, conclusions are drawn.
2 Notation
Parallel loops have no dependencies among iterations and, thus, the iterations can be executed in
any order or even simultaneously. In dependence loops the iterations depend on each other, which
imposes a certain execution order. The depth of the loop nest, n, determines the dimension of the
iteration index space $J = \{ j = (i_1,\dots,i_n) \in \mathbb{N}^n \mid l_r \le i_r \le u_r,\ 1 \le r \le n \}$.
Each point of this n-dimensional index space is a distinct iteration of the loop body.
$L = (l_1,\dots,l_n)$ and $U = (u_1,\dots,u_n)$ are the initial and terminal points of the index space.
for (i1=l1; i1<=u1; i1++) {
  ...
    for (in=ln; in<=un; in++) {
      /* loop body */
      S1(I);
      ...
      Sk(I);
    }
  ...
}
Figure 1: Algorithmic model.
Without loss of generality we assume that $L = (1,\dots,1)$ and that $u_1 \ge \dots \ge u_n$.
$DS = \{\vec{d}_1,\dots,\vec{d}_p\}$, $p \ge n$, is the set of the p dependence vectors, which are
uniform, i.e., constant throughout the index space. The index space of the dependence loop is
divided into chunks, using one of the three dynamic schemes, giving preference to the smallest
dimension (here $u_n$). The following notation is used throughout the paper:
• PE stands for processing element.
• $P_1,\dots,P_m$ are the slaves.
• N is the number of scheduling steps, $i = 1,\dots,N$.
• A few consecutive iterations of the loop are called a chunk; $C_i$ is the chunk size at the i-th
scheduling step.
• $V_i$ is the size (in number of iterations) of chunk i along dimension $u_n$.
• SP: In each chunk we introduce M synchronization points (SP) uniformly distributed along
$u_1$.
• H is the interval (number of iterations along dimension $u_1$) between two SPs (H is the same
for every chunk).
• The current slave is the slave assigned chunk i, whereas the previous slave is the
slave assigned chunk i − 1.
• $VP_k$ is the virtual computing power of slave $P_k$.
• $VP = \sum_{k=1}^{m} VP_k$ is the total virtual computing power of the cluster.
• $Q_k$ is the number of processes in the run-queue of $P_k$, reflecting the total load of $P_k$.
• ACP: $A_k = VP_k / Q_k$ is the available computing power (ACP) of $P_k$ (needed when the loop is
executed in non-dedicated mode).
• $A = \sum_{k=1}^{m} A_k$ is the total available computing power of the cluster.
• $SC_{i,j}$ is the set of iterations of chunk i, between $SP_{j-1}$ and $SP_j$.
Figure 3 below illustrates $C_i$, $V_i$ and H. Note that $C_i$ is the number of iterations in the
rectangular region, i.e. $C_i = V_i \times M \times H$.
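The available computing power defined in the list above can be illustrated with a small C sketch. It assumes that $A_k$ is the plain ratio $VP_k / Q_k$ with $Q_k \ge 1$; the function names `acp` and `total_acp` are hypothetical and not part of the paper's implementation.

```c
#include <stddef.h>

/* Available computing power of one slave: A_k = VP_k / Q_k.
 * Assumption: q (the run-queue length) is at least 1. */
double acp(double vp, int q) {
    return vp / (double)q;
}

/* Total available computing power A = sum of A_k over the m slaves. */
double total_acp(const double *vp, const int *q, size_t m) {
    double a = 0.0;
    for (size_t k = 0; k < m; k++) {
        a += acp(vp[k], q[k]);
    }
    return a;
}
```

In dedicated mode every $Q_k$ is 1, so the total equals the sum of the virtual powers; a second process in the run-queue of a slave halves that slave's contribution.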
3 A dynamic scheduling scheme for uniform dependence loops
3.1 Motivation
Existing dynamic scheduling algorithms cannot cope with uniform dependence loops. Consider,
for instance, the heat equation.
/* Heat equation */
for (i=1; i<width; i++){
  for (j=1; j<height; j++){ /* Dependence loop */
    /* 0.25 rather than 1/4, which is 0 under C integer division */
    A[i][j] = 0.25*(A[i-1][j] + A[i][j-1]
                  + A’[i+1][j] + A’[i][j+1]);
  }
}
When dynamic schemes are applied to parallelize this problem, the index space is partitioned
into chunks, which are assigned to slaves. These slaves then work independently. But due to the
presence of dependencies, the slaves have to communicate. However, existing dynamic schemes
do not provide for inter-slave communication, only for master-to-slaves communication. There-
fore, in order to apply dynamic schemes to dependence loops, one must provide an inter-slave
communication scheme, such that the problem’s dependencies are not violated.
In this work we bring dynamic scheduling schemes into the field of scheduling loops with
dependencies. We propose an inter-slave communication scheme for three well-known dynamic
methods: CS [23], TSS [13] and DTSS [6]. In all cases, after the master assigns chunks to slaves,
the slaves synchronize via synchronization points. This provides the slaves with a unified
communication scheme. This is depicted in Fig. 2 and 3, where chunks $i-1$, $i$, $i+1$ are assigned
to slaves $P_{k-1}$, $P_k$, $P_{k+1}$, respectively. The shaded areas denote sets of iterations that are
computed concurrently by different PEs. When $P_k$ reaches the synchronization point $SP_{j+1}$ (i.e.
after computing $SC_{i,j+1}$) it sends $P_{k+1}$ only the data $P_{k+1}$ requires to begin execution of
$SC_{i+1,j+1}$. The data sent to $P_{k+1}$ consists only of those iterations of $SC_{i,j+1}$, determined by
the dependence vectors, on which the iterations of $SC_{i+1,j+1}$ depend. Similarly, $P_k$ receives
from $P_{k-1}$ the data $P_k$ requires to proceed with the execution of $SC_{i,j+2}$. Note that slaves do
not reach a synchronization point at the same time. For instance, $P_k$ reaches $SP_{j+1}$ earlier than
$P_{k+1}$ and later than $P_{k-1}$. The existence of synchronization points leads to pipelined
execution, as shown in Fig. 2 by the shaded areas.
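The pipelining effect described above can be made concrete with an idealized timing model, sketched in C below. The model is an assumption made for illustration only: every sub-chunk costs one time step, communication is free, and each chunk runs on its own slave. The function name `pipeline_makespan` and the bound `MAX_SP` are hypothetical.

```c
#define MAX_SP 1024  /* hypothetical upper bound on sub-chunks per chunk */

/* Number of unit time steps needed to finish `chunks` chunks of `sp`
 * sub-chunks each. Sub-chunk SC_{c,j} may start only after SC_{c-1,j}
 * (computed by the previous slave) and SC_{c,j-1} (computed earlier by
 * the same slave) have finished. Returns -1 on invalid input. */
int pipeline_makespan(int chunks, int sp) {
    int finish[MAX_SP];  /* finish[j]: finish step of sub-chunk j of the previous chunk */
    if (chunks < 1 || sp < 1 || sp > MAX_SP) {
        return -1;
    }
    for (int j = 0; j < sp; j++) {
        finish[j] = 0;
    }
    for (int c = 0; c < chunks; c++) {
        int left = 0;                        /* finish step of SC_{c,j-1} */
        for (int j = 0; j < sp; j++) {
            int up = finish[j];              /* finish step of SC_{c-1,j} */
            int start = (up > left) ? up : left;
            finish[j] = start + 1;
            left = finish[j];
        }
    }
    return finish[sp - 1];
}
```

Under these assumptions, 3 chunks with 4 sub-chunks each finish after 6 steps, whereas executing the same 12 sub-chunks one after another would take 12 steps.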
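Figure 2: Synchronization points (chunks $C_{i-1}$, $C_i$, $C_{i+1}$ assigned to slaves $P_{k-1}$, $P_k$,
$P_{k+1}$; synchronization points $SP_j$, $SP_{j+1}$, $SP_{j+2}$; labeled sets $SC_{i-1,j+1}$ and $SC_{i,j+1}$)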
3.2 Dynamic scheduling for dependence loops
This section gives a brief description of the three dynamic algorithms we used. Chunk Scheduling
(CS) assigns a chunk that consists of a number of iterations (known as $C_i$) to a slave. A large
chunk size reduces scheduling overhead, but also increases the chance of load imbalance. The
Trapezoid Self-Scheduling (TSS) [13] scheme linearly decreases the chunk size $C_i$. With $|J|$
denoting the total number of iterations of the loop, in TSS the first and last (assigned) chunk
size pair (F,L) may be set by the programmer. In a conservative selection, the (F,L) pair is
determined as $F = \frac{|J|}{2 \times m}$ and $L = 1$. This ensures that the load of the first chunk is
less than 1/m of the total load in most loop distributions and reduces the chance of imbalance
due to the large size of the first chunk. Then, the proposed number of steps needed for the
scheduling process is $N = \frac{2 \times |J|}{F + L}$. Thus the decrease between consecutive chunks is
$D = (F - L)/(N - 1)$, and the chunk sizes in TSS are $C_1 = F$, $C_2 = F - D$, $C_3 = F - 2 \times D$, ... .
Distributed TSS (DTSS) [6] improves on TSS by selecting the chunk sizes according to the
computational power of the slaves. DTSS uses a model that includes the number of processes in the
run-queue of each PE. Every process running on a PE is assumed to take an equal share of its
computing resources. The programmer may determine the pair (F,L) according to TSS; or the
following formula may be used in the conservative selection approach (by default):
$F = \frac{|J|}{2 \times A}$ and $L = 1$. The total number of steps is $N = \frac{2 \times |J|}{F + L}$ and the
chunk decrement is $D = (F - L)/(N - 1)$. The size of a chunk in this case is
$C_i = A_k \times (F - D \times (S_{k-1} + (A_k - 1)/2))$, where $S_{k-1} = A_1 + \dots + A_{k-1}$. When all
PEs are dedicated to a single process then $A_k = VP_k$. Also, when all the PEs have the same speed
then $VP_k = 1$ and the tasks assigned in DTSS are the same as in TSS. The important difference
between DTSS and TSS is that in DTSS the next chunk is allocated according to a PE’s available
computing power, but in TSS all PEs are simply treated in the same way. Thus, faster PEs get more
iterations than slower ones in DTSS.
Table 1 shows the chunk sizes computed with CS, TSS and DTSS for an index space size of
5000 × 10000 and m = 10 slaves. CS and TSS obtain the same chunk sizes in dedicated clusters as
in non-dedicated clusters; DTSS adapts the chunk size to match the different computational
powers of slaves. These algorithms have been evaluated for parallel loops and it has been
established that the DTSS algorithm improves on the TSS, which in turn outperforms CS [6].
Figure 3: Chunks are formed along $u_2$ and SP are introduced along $u_1$
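The TSS parameters described in Section 3.2 can be sketched in C as follows. This is an illustration under stated assumptions rather than the authors' code: F and D are rounded down, N is rounded up, the final chunk takes whatever iterations remain, and `total` counts iterations along the chunked dimension only. The names `tss_params`, `tss_init` and `tss_chunks` are hypothetical.

```c
#include <stddef.h>

/* TSS parameters under the conservative selection F = total/(2m), L = 1. */
typedef struct {
    long F;  /* first chunk size */
    long L;  /* last chunk size */
    long N;  /* proposed number of scheduling steps */
    long D;  /* decrement between consecutive chunks */
} tss_params;

tss_params tss_init(long total, long m) {
    tss_params p;
    p.F = total / (2 * m);                            /* rounded down (assumption) */
    if (p.F < 1) p.F = 1;
    p.L = 1;
    p.N = (2 * total + p.F + p.L - 1) / (p.F + p.L);  /* rounded up (assumption) */
    p.D = (p.N > 1) ? (p.F - p.L) / (p.N - 1) : 0;    /* rounded down (assumption) */
    return p;
}

/* Writes the chunk sizes C_1 = F, C_2 = F - D, ... into out[] until all
 * iterations are assigned. Returns the number of chunks written (at most cap). */
size_t tss_chunks(long total, long m, long *out, size_t cap) {
    tss_params p = tss_init(total, m);
    long remaining = total;
    long size = p.F;
    size_t n = 0;
    while (remaining > 0 && n < cap) {
        long c = (size < p.L) ? p.L : size;  /* never drop below L */
        if (c > remaining) c = remaining;    /* last chunk takes the rest */
        out[n++] = c;
        remaining -= c;
        size -= p.D;
    }
    return n;
}
```

With `total = 5000` and `m = 9` (the number of slave machines in the experimental setup), these rounding choices give F = 277 and D = 7 and produce 27 chunks: 277, 270, 263, ..., 102, followed by a remainder chunk of 73. This coincides with the TSS row of Table 1.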
Table 1: Sample chunk sizes given for |J| = 5000 × 10000 and m = 10
Algorithm              Chunk sizes
CS                     300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 200
TSS                    277 270 263 256 249 242 235 228 221 214 207 200 193 186 179 172 165 158 151 144
                       137 130 123 116 109 102 73
DTSS (dedicated)       392 253 368 237 344 221 108 211 103 300 192 276 176 176 252 160 77 149 72 207
                       130 183 114 159 98 46 87 41 44
DTSS (non-dedicated)   263 383 369 355 229 112 219 107 209 203 293 279 265 169 33 96 46 89 86 83 80 77
                       74 24 69 66 31 59 56 53 50 47 44 20 39 20 33 30 27 24 21 20 20 20 20 20 20 20 20 8
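The DTSS chunk formula from Section 3.2, $C_i = A_k \times (F - D \times (S_{k-1} + (A_k - 1)/2))$, can be written as a one-line C function. The sketch below is only an illustration: it applies no rounding and does not clip the result to the remaining iterations, so it is not expected to reproduce the DTSS rows of Table 1. The function name `dtss_chunk` is hypothetical.

```c
/* DTSS chunk size for a request by a slave with available power a_k.
 * s_prev is S_{k-1}, the sum of the available powers served before it.
 * F and D are the first chunk size and the decrement, as in TSS. */
double dtss_chunk(double F, double D, double s_prev, double a_k) {
    return a_k * (F - D * (s_prev + (a_k - 1.0) / 2.0));
}
```

With $A_k = 1$ for every slave the formula reduces to the TSS sequence F, F − D, F − 2D, .... A slave with $A_k = 2$ receives the sum of two consecutive TSS chunks; for example, with the hypothetical values F = 277 and D = 7, `dtss_chunk(277, 7, 0, 2)` evaluates to 277 + 270 = 547.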
For the sake of simplicity we consider a 2D dependence loop with $U = (u_1,u_2)$ and $u_1 \ge u_2$.
The index space of this loop is divided into chunks along $u_2$ (using one of the three algorithms).
Along $u_1$, synchronization points are introduced at equal intervals. The interval size (H), chosen
by the programmer, determines the number of synchronization points.
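How the interval size H determines the number of synchronization points can be sketched in C as below. The sketch assumes H ≥ 1 and that, when H does not divide $u_1$ evenly, the last interval is simply shorter; the paper does not state how this case is handled. The names `sync_count` and `sync_point` are hypothetical.

```c
/* Number of synchronization intervals along u1 for interval size H.
 * Assumption: H >= 1; a final partial interval counts as one interval. */
long sync_count(long u1, long H) {
    return (u1 + H - 1) / H;
}

/* Last iteration index along u1 (1-based, inclusive) covered by
 * interval j, for j = 1..sync_count(u1, H). */
long sync_point(long u1, long H, long j) {
    long end = j * H;
    return (end < u1) ? end : u1;
}
```

For $u_1 = 10000$, an interval of H = 100 gives 100 intervals, while H = 150 gives 67 intervals, the last of which ends at iteration 10000.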
3.3 The DMPS(x) algorithm
The following notation is essential for the inter-slave communication scheme: the master always
names the slave assigned the latest chunk ($C_i$) as current and the slave assigned the chunk
$C_{i-1}$ as previous. Whenever a new chunk is computed and assigned, the current slave becomes the
(new) previous slave, whereas the new slave is named (new) current. Fig. 4 below shows the state
diagram related to the (new) current – (new) previous slaves. The state transitions are triggered by
new requests for chunks to the master.
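The current/previous bookkeeping performed by the master can be sketched as a small C state tracker. This is a minimal illustration of the role hand-over only; message passing is left out, and the names `roles`, `roles_init`, `roles_assign` and `NO_SLAVE` are hypothetical.

```c
#define NO_SLAVE (-1)  /* marks an undefined role */

/* Ids of the slaves holding the latest chunk and the one before it. */
typedef struct {
    int current;
    int previous;
} roles;

void roles_init(roles *r) {
    r->current = NO_SLAVE;
    r->previous = NO_SLAVE;
}

/* Called when the master assigns a new chunk to `requester`.
 * The old current slave becomes the new previous slave and the requester
 * becomes the new current slave. Returns the id of the slave that must be
 * told its send-to slave (the new previous), or NO_SLAVE for the first chunk. */
int roles_assign(roles *r, int requester) {
    r->previous = r->current;
    r->current = requester;
    return r->previous;
}
```

After each assignment the master would send the id of `r->current` to the slave returned by `roles_assign`, which corresponds to step 1(d) of the master in the pseudocode.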
Figure 4: State diagram of the slaves. Old state (scheduling $SC_{i,j}$): $P_{k-1}$ previous, $P_k$
current, $P_{k+1}$ undefined. When $P_{k+1}$ requests a new chunk from the master, the new state
(scheduling $SC_{i+1,j}$) is: $P_{k-1}$ (former) previous, $P_k$ (new) previous, $P_{k+1}$ (new) current.
When $P_k$ finishes chunk $C_i$ and $P_{k+1}$ is still undefined, the state is: $P_{k-1}$ (former)
previous, $P_k$ (new) current, $P_{k+1}$ undefined.
The DMPS(x) algorithm is described in the following pseudocode:
INPUT (a) An n-dimensional dependence nested loop, with terminal point U.
(b) The choice of algorithm CS, TSS or DTSS.
(c) If CS is chosen, then chunk size Ci.
(d) The synchronization interval H.
(e) The number of slaves m; in case of DTSS, the virtual power of every slave.
Master:
Initialization: (a) Register slaves. In case of DTSS, slaves report their ACP.
(b) Calculate F,L,N,D for TSS and DTSS. For CS use the given Ci.
1. While there are unassigned iterations do:
(a) If a request arrives, put it in the queue.
(b) Pick a request from the queue, and compute the next chunk using CS, TSS or DTSS.
(c) Update the current and previous slave ids.
(d) Send the id of the current slave to the previous one.
Slave Pk:
Initialization: (a) Register with the master; in case of DTSS report ACPk.
(b) Compute M according to the given H.
1. Send request to the master.
2. Wait for reply; if received chunk from master goto step 3 else goto OUTPUT.
3. While the next synchronization point is not reached, compute chunk i.
4. If id of the send-to slave is known goto step 7.
5. If the id of the send-to slave is in the receive buffer goto step 7.
6. Goto step 8.
7. Send computed data to send-to slave.
8. Receive data from the receive-from slave and goto step 3.
OUTPUT Master: If there are no more chunks to be assigned to slaves, terminate.
Slave Pk: If no more tasks come from master, terminate.
Remark: (1) Note that the synchronization intervals are the same for all chunks. For (2)–(5) below
refer to Fig. 4 for an illustration. (2) Upon completion of $SC_{i,0}$, slave $P_k$ requests from the
master the identity of the send-to slave. If no reply is received, then $P_k$ is still the current
slave, and it proceeds to receive data from the previous slave $P_{k-1}$, and then it begins
$SC_{i,1}$. (3) Slave $P_k$ keeps requesting the identity of the send-to slave at the end of every
$SC_{i,j}$ until either a (new) current slave has been appointed by the master or $P_k$ has finished
chunk i. (4) If slave $P_k$ has already executed $SC_{i,0},\dots,SC_{i,j}$ by the time it is informed
by the master about the identity of the send-to slave, it sends all computed data from
$SC_{i,0},\dots,SC_{i,j}$. (5) If no send-to slave has been appointed by the time slave $P_k$
finishes chunk i, then all computed data is kept in the local memory of slave $P_k$. Then $P_k$
makes a new request to the master to become the (new) current slave.
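Remarks (3) to (5) describe when a slave forwards its computed data. The C sketch below simulates that decision for one chunk without any real communication. It is an illustration under a simplifying assumption: the moment the send-to slave becomes known is given as an input instead of being learned from the master. The function name `slave_sends` is hypothetical.

```c
/* Simulates the sends of one slave over the M sub-chunks of its chunk.
 * known_after is the index j of the sub-chunk SC_{i,j} after which the
 * send-to slave is known (a value >= M means it never becomes known).
 * sent_at[j] receives the number of sub-chunks whose data is sent at the
 * synchronization point that follows SC_{i,j}.
 * Returns the number of sub-chunks whose data stays in local memory. */
int slave_sends(int M, int known_after, int *sent_at) {
    int buffered = 0;
    for (int j = 0; j < M; j++) {
        buffered++;                  /* SC_{i,j} has been computed */
        if (j >= known_after) {      /* send-to slave is known: flush */
            sent_at[j] = buffered;
            buffered = 0;
        } else {                     /* keep the data until it is known */
            sent_at[j] = 0;
        }
    }
    return buffered;
}
```

For M = 5 and `known_after = 2`, nothing is sent at the first two synchronization points, the data of three sub-chunks is sent at the third (remark 4), and one sub-chunk's data is sent at each later point. If the send-to slave never becomes known, all five sub-chunks stay in local memory (remark 5).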
4 Implementation and Test Results
Our implementation relies on the distributed programming framework offered by the mpich.1.2.6
implementation of the Message Passing Interface (MPI) [12], and the 1.2.6 version of the gcc
compiler.
We used a heterogeneous distributed system that consists of 10 computers, one of them being
the master. More precisely we used: (a) 4 Intel Pentium III 1266MHz machines with 1GB RAM (called
zealots), assumed to have $VP_k = 1.5$ (one of these was chosen to be the master); and (b) 6 Intel
Pentium III 500MHz machines with 512MB RAM (called kids), assumed to have $VP_k = 0.5$. The virtual
power for each machine type was determined as a ratio of processing times established by timing
a test program on each machine type. The machines are interconnected by Fast Ethernet, with a
bandwidth of 100 Mbits/sec.
We present two cases, dedicated and non-dedicated. In the first case, processors are dedicated to
running the program and no other loads are interposed during the execution. We take measurements
with up to 9 slaves. We use one of the fast machines as a master. In the second case, at the
beginning of the execution of the program, we start a resource-expensive process on some of the
slaves. Because scheduling algorithms for loops with uniform dependencies are usually static and
no other dynamic algorithms have been reported so far, we cannot compare with similar algorithms.
We ran three series of experiments for the dedicated and non-dedicated case: (1) DMPS(CS), (2)
DMPS(TSS), and (3) DMPS(DTSS), and compare the results for two real-life case studies. We
ran the above series for m = 3,4,5,6,7,8,9 slaves in order to compute the speedup. We compute
the speedup according to the following equation:
$$S_p = \frac{\min\{T_{P_1}, T_{P_2}, \dots, T_{P_m}\}}{T_{PAR}} \qquad (1)$$
where $T_{P_i}$ is the serial execution time on slave $P_i$, $1 \le i \le m$, and $T_{PAR}$ is the parallel
execution time (on m slaves). Note that in the plotting of $S_p$, we use VP instead of m on the x-axis.
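Equation (1) translates directly into a short C function. The sketch below is illustrative; the function name `speedup` is hypothetical and any timing values passed to it are placeholders, not measurements from the paper.

```c
#include <stddef.h>

/* Speedup per Eq. (1): the fastest serial time among the m slaves divided
 * by the parallel time. Returns 0.0 for invalid input. */
double speedup(const double *t_serial, size_t m, double t_par) {
    if (m == 0 || t_par <= 0.0) {
        return 0.0;
    }
    double best = t_serial[0];
    for (size_t i = 1; i < m; i++) {
        if (t_serial[i] < best) {
            best = t_serial[i];
        }
    }
    return best / t_par;
}
```

Dividing by the fastest serial time, rather than the average or the slowest, makes the reported speedup conservative on a heterogeneous cluster.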
4.1 Test Problems
We used the heat equation computation for a domain of 5000 × 10000 points, and the
Floyd-Steinberg error diffusion computation for an image of 10000 × 20000 pixels, on a system
consisting of 9 heterogeneous slave machines and one master, with the following configuration:
zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6. For instance, when
using 6 slaves, the machines used are: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4,
kid3. The slaves in italics are the ones loaded in the non-dedicated case. As mentioned
previously, by starting a resource-expensive process on these slaves, their ACP is halved.
4.2 Heat Equation
The heat equation computation is one of the most widely used case studies in the literature, and
its loop body is similar to the majority of the numerical methods used for solving partial
differential equations. It computes the temperature in each pixel of its domain based on two
values of the current time step (A[i-1][j], A[i][j-1]) and two values from the previous time step
(A’[i+1][j], A’[i][j+1]). The dependence vectors are: $\vec{d}_1 = (1,0)$ and $\vec{d}_2 = (0,1)$.
The pseudocode is given below:
/* Heat equation */
for (i=1; i<width; i++){
  for (j=1; j<height; j++){
    /* 0.25 rather than 1/4, which is 0 under C integer division */
    A[i][j] = 0.25*(A[i-1][j] + A[i][j-1]
                  + A’[i+1][j] + A’[i][j+1]);
  }
}
Figure 5: The dependence patterns for heat equation (left) and Floyd-Steinberg (right).
An illustration of the dependence patterns is given in Fig. 5. The iterations in a chunk are exe-
cuted in the order imposed by the dependencies of the heat equation. Whenever a synchronization
point is reached, data is exchanged between the processors executing neighboring chunks.
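The claim that executing the iterations sub-chunk by sub-chunk respects the heat equation dependencies can be checked on a small grid. The C sketch below is a sequential illustration, not the parallel implementation. It assumes a single array updated in place, so that not-yet-visited neighbours play the role of the previous time step, takes i as the $u_1$ direction and j as the $u_2$ direction, and uses hypothetical grid extents `W` and `HT`.

```c
#define W 9    /* extent along i (u1), hypothetical */
#define HT 7   /* extent along j (u2), hypothetical */

/* In-place update of point (i,j): the (i-1,j) and (i,j-1) neighbours
 * already hold new values, the (i+1,j) and (i,j+1) neighbours old ones. */
static void update(double a[W + 1][HT + 1], int i, int j) {
    a[i][j] = 0.25 * (a[i - 1][j] + a[i][j - 1] + a[i + 1][j] + a[i][j + 1]);
}

/* Reference order: the original loop nest. */
void sweep_sequential(double a[W + 1][HT + 1]) {
    for (int i = 1; i < W; i++)
        for (int j = 1; j < HT; j++)
            update(a, i, j);
}

/* Same sweep visited sub-chunk by sub-chunk: synchronization intervals of
 * h iterations along i, chunks of v iterations along j. */
void sweep_blocked(double a[W + 1][HT + 1], int h, int v) {
    for (int i0 = 1; i0 < W; i0 += h)            /* synchronization interval */
        for (int j0 = 1; j0 < HT; j0 += v)       /* chunk */
            for (int i = i0; i < i0 + h && i < W; i++)
                for (int j = j0; j < j0 + v && j < HT; j++)
                    update(a, i, j);
}
```

Both functions produce identical arrays for any positive h and v, because every point still sees updated values at (i-1, j) and (i, j-1) and old values at (i+1, j) and (i, j+1). In the parallel scheme the sub-chunks of one interval belong to different slaves, and the values crossing a chunk boundary are the data exchanged at the synchronization point.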
Table 2 shows comparative results we obtained for the heat equation, for the three series of
experiments: DMPS(CS), DMPS(TSS) and DMPS(DTSS), on a dedicated and a non-dedicated
heterogeneous cluster. The values represent the parallel times (in seconds) for different
numbers of slaves. Three synchronization intervals were chosen, and the total ACP ranged
according to the number of slaves from 3.5 to 7.5.
Fig. 6 presents the speedups for the heat equation on an index space of 5000 × 10000, for chunk
sizes computed with CS, TSS and DTSS and synchronization interval 150, on a dedicated cluster
(left) and a non-dedicated cluster (right).
Table 2: Parallel execution times (sec) for heat equation
[Plot: Heat Equation, non-dedicated heterogeneous cluster. x-axis: virtual powers (3.5, 4, 5.5, 6,
6.5, 7, 7.5); y-axis: speedup (0 to 4); series: DMPS(CS), DMPS(TSS), DMPS(DTSS).]
Figure 6: Speedups for the heat equation
4.3 Floyd-Steinberg
The Floyd-Steinberg computation [17] is an image processing algorithm used for the error-diffusion
dithering of a width by height grayscale image. The boundary conditions are ignored. The
dependencies are: $\vec{d}_1 = (1,0)$, $\vec{d}_2 = (1,1)$, $\vec{d}_3 = (0,1)$ and $\vec{d}_4 = (1,-1)$.
The pseudocode is given below:
/* Floyd-Steinberg */
for (i=1; i<width; i++){
  for (j=1; j<height; j++){
    I[i][j] = trunc(J[i][j]) + 0.5;
    err = J[i][j] - I[i][j]*255;
    /* floating-point weights; 7/16 etc. would be 0 under C integer division */
    J[i-1][j] += err*(7.0/16);
    J[i-1][j-1] += err*(3.0/16);
    J[i][j-1] += err*(5.0/16);
    J[i-1][j+1] += err*(1.0/16);
  }
}
An illustration of the dependence patterns is given in Fig. 5. The iterations in a chunk are
executed in the order imposed by the dependencies of the Floyd-Steinberg algorithm. Whenever a
synchronization point is reached, data is exchanged between the processors executing neighboring
chunks.
Comparative results for the Floyd-Steinberg case study on a dedicated and a non-dedicated
heterogeneous cluster are given in Table 3. The values represent the parallel times (in seconds) for
different number of slaves. Three synchronization intervals were chosen, and the total ACP ranged
according to the number of slaves from 3.5–7.5.
Table 3: Parallel execution times (sec) for Floyd-Steinberg
Fig. 7 presents the speedup results of the Floyd-Steinberg algorithm, for the three variations.
The size of the index space was 10000×20000. Chunk sizes were computed with CS, TSS and
DTSS and the synchronization interval was chosen to be 100, on a dedicated cluster (left) and a
non-dedicated cluster (right).
[Plot (left): Floyd-Steinberg, dedicated heterogeneous cluster. x-axis: virtual powers (3.5, 4,
5.5, 6, 6.5, 7, 7.5); y-axis: speedup (0 to 7); series: DMPS(CS), DMPS(TSS), DMPS(DTSS).]
[Plot (right): Floyd-Steinberg, non-dedicated heterogeneous cluster. x-axis: virtual power (3.5, 4,
5.5, 6, 6.5, 7, 7.5); y-axis: speedup (0 to 6); series: DMPS(CS), DMPS(TSS), DMPS(DTSS).]
Figure 7: Speedups for Floyd-Steinberg
4.4 Interpretation of the results
As expected, the results for the dedicated cluster are much better for both case studies. In
particular, DMPS(TSS) seems to perform slightly better than DMPS(CS). This was expected
since TSS provides better load balancing than CS for simple parallel loops without dependencies.
In addition, DMPS(DTSS) outperforms both algorithms. This is because it explicitly accounts
for the heterogeneity of the slaves. For the non-dedicated case, one can see that DMPS(CS)
and DMPS(TSS) cannot handle workload variations as effectively as DMPS(DTSS). This is
shown in Fig. 6. The speedup for DMPS(CS) and DMPS(TSS) decreases as loaded slaves are
added, whereas for DMPS(DTSS) it increases even when slaves are loaded. In the non-dedicated
approach, our choice was to load the slow processors, so as to incur large differences between the
processing power of the two machine types. Even in this case, DMPS(DTSS) achieved
good results.
Another factor affecting the performance is the selection of the synchronization interval. Notice
that the best synchronization interval for the heat equation was H = 150, whereas for
Floyd-Steinberg better results were obtained for H = 100. The performance differences for interval
sizes close to the ones depicted in Fig. 6 and 7 are small. When choosing a proper synchronization
interval, the system’s architecture (i.e. throughput, communication buffer size etc.) and the
communication/computation ratio of the problem under inspection must be taken into account.
However, if these parameters are not known, the programmer may select an interval of their choice.
Nonetheless, a poor choice may lead to considerable performance deterioration.
5 Conclusion
In this paper we presented a novel dynamic scheduling scheme for dependence loops on hetero-
geneous clusters. We tested three variations of our method on a heterogeneous cluster, both in
dedicated and non-dedicated mode. The main contribution of our work is extending three previ-
ous schemes by taking into account the existing iteration dependencies of the problem, and hence
providing a scheme for inter-slave communication. We tested our method on two real-life applica-
tions: heat equation and Floyd-Steinberg algorithm. The results demonstrate that our new scheme
is effective for distributed applications with dependence loops.
Future work will focus on establishing a model for predicting the optimal synchronization
interval (H) such that communication is minimized for every problem. We also intend to extend
other well-known dynamic algorithms so that they can be applied to dependence loops, and to
incorporate them in an automatic parallel code generation tool for heterogeneous systems.
Acknowledgments
The work of F. Ciorba was supported by the Greek State Scholarships Foundation. The work of Dr.
Chronopoulos was supported in part by the National Science Foundation under grant CCR-0312323.
References
[1] I. Banicescu and Z. Liu. Adaptive Factoring: A Dynamic Scheduling Method Tuned to the
Rate of Weight Changes, Proc. of the High Performance Computing Symposium 2000, Wash-
ington, USA, 2000, pp. 122–129.
[2] I. Banicescu, V. Velusamy and J. Devaprasad. On the Scalability of Dynamic Scheduling
Scientific Applications with Adaptive Weighted Factoring, Cluster Computing 6, 2003, pp.
215–226.
[3] J. Barbosa, J. Tavares and A. J. Padilha. Linear Algebra Algorithms in a Heterogeneous
Cluster of Personal Computers, Proc. of the 9th Heterogeneous Computing Workshop (HCW
2000), Cancun, Mexico, 2000, pp. 147–159.
[4] D.J. Hancock, J.M. Bull, R.W. Ford and T.L. Freeman. An Investigation of Feedback Guided
Dynamic Scheduling of Nested Loops, Proc. of the IEEE International Workshops on Parallel
Processing, 21-24 Aug. 2000, ed. P. Sadayappan, pp. 315–321.
[5] M. Cierniak, W. Li and M. J. Zaki. Loop Scheduling for Heterogeneity, Proc. of the 4th IEEE
Intl. Symp. on High Performance Distributed Computing, Washington, DC, 1995, pp. 78–85.
[6] A. T. Chronopoulos, R. Andonie, M. Benche and D. Grosu. A Class of Distributed Self-
Scheduling Schemes for Heterogeneous Clusters, Proc. of the 3rd IEEE International Con-
ference on Cluster Computing (CLUSTER 2001), Newport Beach, CA USA, 2001.
[7] Y.W. Fann, C.T. Yang, S.S. Tseng and C.J. Tsai. An Intelligent Parallel Loop Scheduling for
Parallelizing Compilers, Journal of Information Science and Engineering, 16:69–200, 2000.
[8] G. Goumas, N. Drosinos, M. Athanasaki and N. Koziris. Compiling Tiled Iteration Spaces for
Clusters, Proc. of the 4th IEEE International Conference on Cluster Computing (CLUSTER
2002), Chicago, IL USA, 2002, pp. 360–369.
[9] S.F. Hummel, J. Schmidt, R.N. Uma and J. Wein. Load-Sharing in Heterogeneous Systems
via Weighted Factoring, Proc. of 8th Annual Symp. on Parallel Algorithms and Architectures,
Padua, Italy, 1996, pp. 318–328.
[10] T.H. Kim and J.M. Purtilo. Load Balancing for Parallel Loops in Workstation Clusters, Proc.
of Intl. Conference on Parallel Processing, Bloomingdale, IL USA, 3:182–190, 1996.
[11] E.P. Markatos and T.J. LeBlanc. Using Processor Affinity in Loop Scheduling on Shared-
Memory Multiprocessors, IEEE Transactions on Parallel and Distributed systems, 5(4):379–
400, April 1994.
[12] Peter Pacheco. Parallel Programming with MPI, Morgan Kaufmann, 1997.
[13] T.H. Tzen and L.M. Ni. Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Par-
allel Compilers, IEEE Trans. on Parallel and Distributed Systems, 4(1):87–98, Jan. 1993.
[14] M. Wolfe. High Performance Compilers for Parallel Computing, Addison-Wesley Publication
Co., 1996.
[15] Y. Yan, C. Jin and X. Zhang. Adaptively Scheduling Parallel Loops in Distributed Shared-
Memory Systems, IEEE Trans. on Parallel and Distributed Systems, 8(1):70–81, Jan. 1997.
[16] C.T. Yang and S.C. Chang. A Parallel Loop Self-Scheduling on Extremely Heterogeneous
PC Clusters, Proc. of Intl Conf. on Computational Science, Melbourne, Australia and St.
Petersburg, Russia, 2003, pp. 1079–1088.
[17] R.W. Floyd and L. Steinberg. An adaptive algorithm for spatial grey scale. Proc. Soc. Inf.
Display, 17:75-77, 1976.
[18] N. Manjikian and T.S. Abdelrahman. Exploiting Wavefront Parallelism on Large-Scale
Shared-Memory Multiprocessors. IEEE Trans. on Parallel and Distributed Systems,
12(3):259–271, 2001.
[19] T. Andronikos, F.M. Ciorba, P. Theodoropoulos, D. Kamenopoulos and G. Papakonstantinou.
Code Generation for General Loops Using Methods from Computational Geometry. Proc.
of the IASTED Parallel and Distributed Computing and Systems Conference (PDCS 2004),
Cambridge, MA USA, November 9-11, 2004, pp. 348–353.
[20] F.M. Ciorba, T. Andronikos, I. Drositis and G. Papakonstantinou. Reducing Communica-
tion via Chain Pattern Scheduling. To be presented at the 4th IEEE International Symposium
on Network Computing and Applications (IEEE NCA05), Cambridge, MA USA, July 27-29
2005.
[21] M. Gonzalez, E. Ayguadé, X. Martorell and J. Labarta. Defining and Supporting Pipelined
Executions in OpenMP. Proc. of the Int’l Workshop on OpenMP Applications and Tools:
OpenMP Shared Memory Parallel Programming (WOMPAT 2001), 2001, pp. 155–169.
[22] M. Gonzalez, E. Ayguadé, X. Martorell and J. Labarta. Exploiting Pipelined Executions in
OpenMP. Proc. of the Int’l Conference on Parallel Processing (ICPP 2003), 2003, pp. 153–
160.
[23] C.P. Kruskal and A. Weiss. Allocating independent subtasks on parallel processors. IEEE
Trans. on Software Engineering, 11(10):1001–1016, 1985.
[24] C.D. Polychronopoulos and D.J. Kuck. Guided self-scheduling: A practical self-scheduling
scheme for parallel supercomputers. IEEE Trans. on Computer, C-36(12): 1425–1439, 1987.
[25] F. Rastello, A. Rao and S. Pande. Optimal Task Scheduling to minimize Inter-Tile Latencies.
Parallel Computing, 29(2): 209–239, 2003.
[26] J. Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, August 2000 (280 pages).
[27] P. Boulet, J. Dongarra, F. Rastello, Y. Robert and F. Vivien. Algorithmic Issues on Heteroge-
neous Computing Platforms. Parallel Processing Letters, 9(2): 197–213, 1998.
[28] T. Thanalapati and S. Dandamudi. An Efficient Adaptive Scheduling Scheme for Distributed
Memory Multicomputers. IEEE Transactions on Parallel and Distributed Systems, 12(7):
758–768, 2001.
[29] Y.K. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs
to Multiprocessors. ACM Computing Surveys, 31(4): 406–471, 1999.