
Dynamic Multi Phase Scheduling for Heterogeneous Clusters

Florina M. Ciorba1,2, Theodore Andronikos1, Ioannis Riakiotakis1,
Anthony T. Chronopoulos3,4 and George Papakonstantinou1

1 Computing Systems Laboratory
Dept. of Electrical and Computer Engineering
National Technical University of Athens
Zografou Campus, 15773, Athens, Greece

2 Student Member, IEEE

3 Dept. of Computer Science
University of Texas at San Antonio
6900 N. Loop 1604 West, San Antonio, TX 78249

4 Senior Member, IEEE

Email: {cflorina,tedandro,iriak}@cslab.ntua.gr, atc@cs.utsa.edu

Abstract

Distributed computing systems are a viable and less expensive alternative to parallel computers. However, concurrent programming methods in distributed systems have not been studied as extensively as for parallel computers. Among the main research issues is how to deal with scheduling and load balancing on such systems, which may consist of heterogeneous computers. In the past, a variety of dynamic scheduling schemes suitable for parallel loops (with independent iterations) on heterogeneous computer clusters have been obtained and studied. However, no study of dynamic schemes for loops with iteration dependencies has been reported so far. In this work we study the problem of scheduling loops with iteration dependencies for heterogeneous (dedicated and non-dedicated) clusters. The presence of iteration dependencies incurs an extra degree of difficulty and makes the development of such schemes quite a challenge. We extend three well-known dynamic schemes (CS, TSS and DTSS) by introducing synchronization points at certain intervals so that processors compute in pipelined fashion. Our scheme is called dynamic multi-phase scheduling (DMPS), and we apply it to loops with iteration dependencies. We implemented our new scheme on a network of heterogeneous computers and studied its performance. Through extensive testing on two real-life applications (the heat equation and the Floyd-Steinberg algorithm), we show that the proposed method is efficient for parallelizing nested loops with dependencies on heterogeneous systems.

Index Terms - heterogeneous distributed systems, loop scheduling, dynamic algorithms, dependence loops, pipelined execution.


1 Introduction

Loops are one of the largest sources of parallelism in scientific programs. The iterations within a loop nest are either independent (called parallel loops) or precedence constrained (called dependence loops). Furthermore, the precedence constraints can be uniform (constant) or non-uniform throughout the execution of the program. A review of important parallel loop scheduling algorithms is presented in [7] (and references therein), and some recent results are presented in [4]. Research results also exist on scheduling parallel loops on message passing parallel systems and on heterogeneous systems [1],[2],[3],[5],[6],[8],[9],[10],[11],[15],[16]. Static scheduling schemes for dependence loops have been studied extensively for shared memory and distributed memory systems [18] (and references therein), [27],[26],[28],[25],[19],[20].

Loops can be scheduled statically at compile-time or dynamically at run-time. Static scheduling is applicable to both parallel and dependence loops. It has the advantage of minimizing the scheduling overhead and of achieving near-optimal load balancing when the execution environment is homogeneous, with uniform and constant workload. Examples of such scheduling are Block and Cyclic [29],[18]. However, most clusters nowadays are heterogeneous and non-dedicated to specific users, yielding a system with variable workload. When static schemes are applied to heterogeneous systems with variable workload, performance deteriorates severely. Dynamic scheduling algorithms adapt the assigned number of iterations to match the workload variation of both homogeneous and heterogeneous systems. An important class of dynamic scheduling algorithms are the self-scheduling schemes (such as CS [23], GSS [24], TSS [13], Factoring [1], and others [11]). On distributed systems these schemes are implemented using a master-slave model. All dynamic schemes proposed so far apply only to parallel loops, without dependencies.

Another very important factor in achieving near-optimal execution time in distributed systems is load balancing. Distributed systems are characterized by heterogeneity. To offer load balancing, loop scheduling schemes must take into account the processing power of each computer in the system. The processing power depends on CPU speed, memory, cache structure and even the program type. Furthermore, it depends on the workload of the computer throughout the execution of the problem. Therefore, load balancing methods adapted to distributed environments take into account the relative computing powers of the computers. These relative computing powers are used as weights that scale the size of the sub-problem assigned to each processor. This significantly improves the total execution time when a non-dedicated heterogeneous computing environment is used. Such algorithms were presented in [4],[9]. A recent algorithm that improves TSS by taking into account the processing powers of a non-dedicated heterogeneous system is DTSS (Distributed TSS) [6].

When loops without dependencies are parallelized with dynamic schemes, the index space is partitioned into chunks, and the master assigns these chunks to processors upon request. Throughout the parallel execution, every slave works independently and, upon chunk completion, sends the results back to the master. Obviously, this approach is not suitable for dependence loops because, due to the dependencies, iterations in one chunk depend on iterations in other chunks, and hence slaves need to communicate. Inter-processor communication is the foremost reason for performance deterioration when parallelizing loops with dependencies. No study of dynamic algorithms for loops with dependencies on homogeneous or heterogeneous clusters has been reported so far.

In this paper, we study the problem of dynamic scheduling of uniform dependence loops on heterogeneous distributed systems. We extend three well-known dynamic schemes (Chunk, TSS, DTSS) and apply them to dependence loops. After partitioning the index space into chunks (using one of the three schemes), we introduce synchronization points at certain intervals so that processors compute chunks in pipelined fashion. Synchronization points are carefully placed so that the volume of data exchange is reduced and the pipeline parallelism is improved. Our scheme is called dynamic multi-phase scheduling (DMPS(x)), where x stands for one of the three algorithms, considered as an input parameter to DMPS. We implement our new scheme on a network of heterogeneous (dedicated and non-dedicated) computers and evaluate its performance through extensive simulation and empirical testing. Two case studies are examined: the heat equation and the Floyd-Steinberg dithering algorithm. The experimental results validate the presented theory and corroborate the efficiency of the parallel code.

Section 2 gives the algorithmic model and some notation. In Section 3, we present our motivation and describe our algorithm in detail. In Section 4, the implementation, the case studies and the experimental results are presented. In Section 5, conclusions are drawn.

2 Notation

Parallel loops have no dependencies among iterations and, thus, the iterations can be executed in any order or even simultaneously. In dependence loops the iterations depend on each other, which imposes a certain execution order. The depth of the loop nest, n, determines the dimension of the iteration index space J = {j ∈ N^n | lr ≤ jr ≤ ur, 1 ≤ r ≤ n}. Each point of this n-dimensional index space is a distinct iteration of the loop body. L = (l1,...,ln) and U = (u1,...,un) are the initial and terminal points of the index space.

for (i1=l1; i1<=u1; i1++) {
    ...
    for (in=ln; in<=un; in++) {
        S1(I);    /* loop body */
        ...
        Sk(I);
    }
    ...
}

Figure 1: Algorithmic model.

Without loss of generality we assume that L = (1,...,1) and that u1 ≥ ... ≥ un. DS = {d1,...,dp}, p ≥ n, is the set of the p dependence vectors, which are uniform, i.e., constant throughout the index space. The index space of the dependence loop is divided into chunks, using one of the three dynamic schemes, giving preference to the smallest dimension (here un). The following notation is used throughout the paper:

• PE stands for processing element.

• P1, ..., Pm are the slaves.

• N is the number of scheduling steps, i = 1, ..., N.

• A few consecutive iterations of the loop are called a chunk; Ci is the chunk size at the i-th scheduling step.

• Vi is the size (in number of iterations) of chunk i along dimension un.

• SP: in each chunk we introduce M synchronization points (SPs), uniformly distributed along u1.

• H is the interval (number of iterations along dimension u1) between two SPs; H is the same for every chunk.

• The current slave is the slave assigned chunk i, whereas the previous slave is the slave assigned chunk i − 1.

• VPk is the virtual computing power of slave Pk.

• VP = VP1 + ... + VPm is the total virtual computing power of the cluster.

• Qk is the number of processes in the run-queue of Pk, reflecting the total load of Pk.

• ACP: Ak = ⌊VPk/Qk⌋ is the available computing power (ACP) of Pk (needed when the loop is executed in non-dedicated mode).

• A = A1 + ... + Am is the total available computing power of the cluster.

• SCi,j is the set of iterations of chunk i between SPj−1 and SPj.

Figure 3 below illustrates Ci, Vi and H. Note that Ci is the number of iterations in the rectangular region, i.e. Ci = Vi × M × H.
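To make this bookkeeping concrete, the relation Ci = Vi × M × H can be checked in a few lines of code (a sketch of ours, not from the paper; it assumes H divides u1 exactly):

```python
# Sketch (ours): the notation Ci = Vi x M x H, where M = u1 / H is the
# number of synchronization points per chunk.
def chunk_iterations(u1, H, Vi):
    """Number of iterations in a chunk of width Vi with SP interval H."""
    assert u1 % H == 0, "assume H divides u1 for simplicity"
    M = u1 // H              # synchronization points per chunk
    return Vi * M * H        # equals Vi * u1

print(chunk_iterations(u1=5000, H=100, Vi=300))  # -> 1500000
```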

3 A dynamic scheduling scheme for uniform dependence loops

3.1 Motivation

Existing dynamic scheduling algorithms cannot cope with uniform dependence loops. Consider, for instance, the heat equation:

/* Heat equation */
for (i=1; i<width; i++){
    for (j=1; j<height; j++){ /* Dependence loop */
        A[i][j] = 0.25*(A[i-1][j] + A[i][j-1]
                      + A’[i+1][j] + A’[i][j+1]);
    }
}


When dynamic schemes are applied to parallelize this problem, the index space is partitioned into chunks, which are assigned to slaves. These slaves then work independently. But due to the presence of dependencies, the slaves have to communicate. However, existing dynamic schemes do not provide for inter-slave communication, only for master-to-slave communication. Therefore, in order to apply dynamic schemes to dependence loops, one must provide an inter-slave communication scheme such that the problem's dependencies are neither violated nor ignored.

In this work we bring dynamic scheduling schemes into the field of scheduling loops with dependencies. We propose an inter-slave communication scheme for three well-known dynamic methods: CS [23], TSS [13] and DTSS [6]. In all cases, after the master assigns chunks to slaves, the slaves synchronize via synchronization points; this provides the slaves with a unified communication scheme. This is depicted in Figs. 2 and 3, where chunks i−1, i, i+1 are assigned to slaves Pk−1, Pk, Pk+1, respectively. The shaded areas denote sets of iterations that are computed concurrently by different PEs. When Pk reaches the synchronization point SPj+1 (i.e. after computing SCi,j+1), it sends Pk+1 only the data Pk+1 requires to begin execution of SCi+1,j+1. The data sent to Pk+1 consists only of those iterations of SCi,j+1 on which, according to the dependence vectors, the iterations of SCi+1,j+1 depend. Similarly, Pk receives from Pk−1 the data Pk requires to proceed with the execution of SCi,j+2. Note that slaves do not reach a synchronization point at the same time; for instance, Pk reaches SPj+1 earlier than Pk+1 and later than Pk−1. The existence of synchronization points leads to pipelined execution, as shown in Fig. 2 by the shaded areas.

Figure 2: Synchronization points

3.2 Dynamic scheduling for dependence loops

This section gives a brief description of the three dynamic algorithms we used. Chunk Scheduling (CS) assigns a chunk of a fixed number of iterations (Ci) to a slave. A large chunk size reduces the scheduling overhead, but also increases the chance of load imbalance. The Trapezoid Self-Scheduling (TSS) [13] scheme linearly decreases the chunk size Ci. With |J| denoting the total number of iterations of the loop, in TSS the first and last (assigned) chunk size pair (F, L) may be set by the programmer. In a conservative selection, the (F, L) pair is determined as F = |J|/(2×m) and L = 1. This ensures that the load of the first chunk is less than 1/m of the total load in most loop distributions, and reduces the chance of imbalance due to a large first chunk. The proposed number of steps needed for the scheduling process is N = 2×|J|/(F+L), and the decrease between consecutive chunks is D = (F−L)/(N−1). The chunk sizes in TSS are thus C1 = F, C2 = F−D, C3 = F−2×D, ....

Figure 3: Chunks are formed along u2 and SPs are introduced along u1

Distributed TSS (DTSS) [6] improves on TSS by selecting the chunk sizes according to the computational power of the slaves. DTSS uses a model that includes the number of processes in the run-queue of each PE; every process running on a PE is assumed to take an equal share of its computing resources. The programmer may determine the pair (F, L) as in TSS, or the following conservative selection may be used by default: F = |J|/(2×A) and L = 1. The total number of steps is N = 2×|J|/(F+L) and the chunk decrement is D = (F−L)/(N−1). The size of a chunk in this case is Ci = Ak×(F − D×(Sk−1 + (Ak−1)/2)), where Sk−1 = A1 + ... + Ak−1. When all PEs are dedicated to a single process, then Ak = VPk; when, in addition, all PEs have the same speed, then VPk = 1 and the chunks assigned by DTSS are the same as in TSS. The important difference between DTSS and TSS is that in DTSS the next chunk is allocated according to a PE's available computing power, whereas in TSS all PEs are simply treated in the same way; thus, in DTSS faster PEs get more iterations than slower ones. Table 1 shows the chunk sizes computed with CS, TSS and DTSS for an index space of size 5000 × 10000 and m = 10 slaves. CS and TSS obtain the same chunk sizes in dedicated clusters as in non-dedicated clusters; DTSS adapts the chunk sizes to match the different computational powers of the slaves. These algorithms have been evaluated for parallel loops, and it has been established that DTSS improves on TSS, which in turn outperforms CS [6].

Table 1: Sample chunk sizes given for |J| = 5000 × 10000 and m = 10

CS: 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 200

TSS: 277 270 263 256 249 242 235 228 221 214 207 200 193 186 179 172 165 158 151 144 137 130 123 116 109 102 73

DTSS (dedicated): 392 253 368 237 344 221 108 211 103 300 192 276 176 176 252 160 77 149 72 207 130 183 114 159 98 46 87 41 44

DTSS (non-dedicated): 263 383 369 355 229 112 219 107 209 203 293 279 265 169 33 96 46 89 86 83 80 77 74 24 69 66 31 59 56 53 50 47 44 20 39 20 33 30 27 24 21 20 20 20 20 20 20 20 20 8
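The chunk-size formulas above can be sketched in a few lines of Python. This is our illustration of the conservative selection (F = |J|/(2×m), L = 1) and of the DTSS chunk formula, not the authors' code; the resulting sizes will not reproduce Table 1 exactly, since the table reflects the authors' implementation details:

```python
# Sketch (ours): TSS chunk sizes under the conservative selection, and
# the per-slave DTSS chunk formula Ci = Ak*(F - D*(S_{k-1} + (Ak-1)/2)).
import math

def tss_chunks(J, m):
    """Linearly decreasing TSS chunk sizes for J iterations, m slaves."""
    F, L = J / (2 * m), 1
    N = math.ceil(2 * J / (F + L))        # number of scheduling steps
    D = (F - L) / (N - 1)                 # decrement between chunks
    chunks, remaining, C = [], J, F
    while remaining > 0:
        size = min(max(int(C), L), remaining)
        chunks.append(size)
        remaining -= size
        C -= D
    return chunks

def dtss_chunk(F, D, A_k, S_prev):
    """DTSS chunk for a slave of available power A_k; S_prev = A1+..+A_{k-1}."""
    return A_k * (F - D * (S_prev + (A_k - 1) / 2))
```

With Ak = 1 for every slave, dtss_chunk reduces to the TSS sizes F, F−D, F−2×D, ..., matching the remark that DTSS degenerates to TSS on identical dedicated PEs.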

For the sake of simplicity we consider a 2D dependence loop with U = (u1, u2) and u1 ≥ u2. The index space of this loop is divided into chunks along u2 (using one of the three algorithms). Along u1, synchronization points are introduced at equal intervals. The interval size H, chosen by the programmer, determines the number of synchronization points.

3.3 The DMPS(x) algorithm

The following notation is essential for the inter-slave communication scheme: the master always names the slave assigned the latest chunk (Ci) as current and the slave assigned chunk Ci−1 as previous. Whenever a new chunk is computed and assigned, the current slave becomes the (new) previous slave, whereas the new slave is named the (new) current slave. Fig. 4 below shows the state diagram relating the (new) current and (new) previous slaves. The state transitions are triggered by new requests for chunks sent to the master.

Figure 4: State diagram of the slaves

The DMPS(x) algorithm is described in the following pseudocode:

INPUT: (a) An n-dimensional dependence nested loop with terminal point U.
(b) The choice of algorithm: CS, TSS or DTSS.
(c) If CS is chosen, the chunk size Ci.
(d) The synchronization interval H.
(e) The number of slaves m; in case of DTSS, the virtual power of every slave.

Master:
Initialization: (a) Register slaves. In case of DTSS, slaves report their ACP.
(b) Calculate F, L, N, D for TSS and DTSS. For CS use the given Ci.
1. While there are unassigned iterations do:
(a) If a request arrives, put it in the queue.
(b) Pick a request from the queue and compute the next chunk using CS, TSS or DTSS.
(c) Update the current and previous slave ids.
(d) Send the id of the current slave to the previous one.

Slave Pk:
Initialization: (a) Register with the master; in case of DTSS report ACPk.
(b) Compute M according to the given H.
1. Send a request to the master.
2. Wait for reply; if a chunk is received from the master goto step 3, else goto OUTPUT.
3. While the next synchronization point is not reached, compute chunk i.
4. If the id of the send-to slave is known, goto step 7.
5. If the id of the send-to slave is in the receive buffer, goto step 7.
6. Goto step 8.
7. Send computed data to the send-to slave.
8. Receive data from the receive-from slave and goto step 3.

OUTPUT: Master: If there are no more chunks to be assigned to slaves, terminate.
Slave Pk: If no more tasks come from the master, terminate.

Remarks: (1) The synchronization intervals are the same for all chunks. For (2)–(5) below, refer to Fig. 4 for an illustration. (2) Upon completion of SCi,0, slave Pk requests from the master the identity of the send-to slave. If no reply is received, then Pk is still the current slave, and it proceeds to receive data from the previous slave Pk−1 and then begins SCi,1. (3) Slave Pk keeps requesting the identity of the send-to slave at the end of every SCi,j, until either a (new) current slave has been appointed by the master or Pk has finished chunk i. (4) If slave Pk has already executed SCi,0, ..., SCi,j by the time it is informed by the master of the identity of the send-to slave, it sends all computed data from SCi,0, ..., SCi,j. (5) If no send-to slave has been appointed by the time slave Pk finishes chunk i, then all computed data is kept in the local memory of slave Pk; Pk then makes a new request to the master to become the (new) current slave.
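Steps 3–8 and remarks (2)–(5) amount to the following slave-side control flow (a schematic sketch of ours; the callables stand in for the actual MPI communication, which is not shown in the paper):

```python
# Schematic of the slave-side loop of DMPS(x) for one chunk of M
# synchronization intervals. compute_sc, query_send_to, send and
# recv_boundary are hypothetical stand-ins for the real MPI calls.
def run_chunk(M, compute_sc, query_send_to, send, recv_boundary):
    send_to_id = None      # identity of the (new) current slave, if known
    pending = []           # SC data buffered until the send-to slave is known
    sent = []              # record of everything actually sent (for clarity)
    for j in range(M):
        data = compute_sc(j)                  # step 3: compute SC_{i,j}
        if send_to_id is None:
            send_to_id = query_send_to()      # remarks (2)-(3): keep asking
        if send_to_id is not None:
            for d in pending:                 # remark (4): flush buffered SCs
                send(send_to_id, d)
                sent.append(d)
            pending.clear()
            send(send_to_id, data)            # step 7
            sent.append(data)
        else:
            pending.append(data)              # remark (5): keep data locally
        if j < M - 1:
            recv_boundary(j)                  # step 8: data from previous slave
    return sent, pending
```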


4 Implementation and Test Results

Our implementation relies on the distributed programming framework offered by the mpich 1.2.6 implementation of the Message Passing Interface (MPI) [12], and on the 1.2.6 version of the gcc compiler.

We used a heterogeneous distributed system consisting of 10 computers, one of them being the master. More precisely, we used: (a) 4 Intel Pentium III 1266MHz machines with 1GB RAM (called zealots), assumed to have VPk = 1.5 (one of these was chosen to be the master); and (b) 6 Intel Pentium III 500MHz machines with 512MB RAM (called kids), assumed to have VPk = 0.5. The virtual power of each machine type was determined as a ratio of processing times, established by timing a test program on each machine type. The machines are interconnected by Fast Ethernet, with a bandwidth of 100 Mbits/sec.

We present two cases, dedicated and non-dedicated. In the first case, the processors are dedicated to running the program and no other loads are interposed during the execution. We take measurements with up to 9 slaves, using one of the fast machines as the master. In the second case, at the beginning of the execution of the program, we start a resource-expensive process on some of the slaves. Since scheduling algorithms for loops with uniform dependencies are usually static, and no other dynamic algorithms have been reported so far, we cannot compare against similar algorithms. We ran three series of experiments for the dedicated and the non-dedicated case: (1) DMPS(CS), (2) DMPS(TSS), and (3) DMPS(DTSS), and compare the results for two real-life case studies. We ran the above series for m = 3, 4, 5, 6, 7, 8, 9 slaves in order to compute the speedup. We compute the speedup according to the following equation:

the speedup according to the following equation:

Sp=min{TP1,TP2,...,TPm}

TPAR

(1)

where TPiis the serial execution time on slave Pi, 1 ≤ i ≤ m, and TPARis the parallel execution

time (on m slaves). Note that in the plotting of Sp, we use V P instead of m on the x-axis.
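Eq. (1) is straightforward to compute; for example (illustrative numbers of ours, not measurements from the paper):

```python
# Speedup per Eq. (1): serial time of the fastest slave over parallel time.
def speedup(serial_times, t_par):
    return min(serial_times) / t_par

print(speedup([120.0, 300.0, 300.0], 60.0))  # -> 2.0
```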

4.1 Test Problems

We used the heat equation computation for a domain of 5000 × 10000 points, and the Floyd-Steinberg error diffusion computation for an image of 10000 × 20000 pixels, on a system consisting of 9 heterogeneous slave machines and one master, with the following configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6. For instance, when using 6 slaves, the machines used are: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3. The slaves in italics are the ones loaded in the non-dedicated case. As mentioned previously, by starting a resource-expensive process on these slaves, their ACP is halved.

4.2 Heat Equation

The heat equation computation is one of the most widely used case studies in the literature, and its loop body is similar to that of the majority of numerical methods used for solving partial differential equations. It computes the temperature at each point of its domain based on two values of the current time step (A[i-1][j], A[i][j-1]) and two values from the previous time step (A’[i+1][j], A’[i][j+1]). The dependence vectors are d1 = (1, 0) and d2 = (0, 1).


The pseudocode is given below:

/* Heat equation */
for (i=1; i<width; i++){
    for (j=1; j<height; j++){
        A[i][j] = 0.25*(A[i-1][j] + A[i][j-1]
                      + A’[i+1][j] + A’[i][j+1]);
    }
}
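For reference, here is a runnable Python transcription of this update (our sketch, with `A_prev` playing the role of the paper's A'; boundary handling simplified):

```python
# Sketch (ours): one sweep of the heat-equation update. A holds the
# current time step, A_prev the previous one (the paper's A').
def heat_sweep(A, A_prev):
    width, height = len(A), len(A[0])
    for i in range(1, width - 1):
        for j in range(1, height - 1):
            # (i, j) depends on (i-1, j) and (i, j-1) of the current
            # sweep: dependence vectors d1 = (1, 0), d2 = (0, 1)
            A[i][j] = 0.25 * (A[i - 1][j] + A[i][j - 1]
                              + A_prev[i + 1][j] + A_prev[i][j + 1])
    return A

A = [[1.0] * 3 for _ in range(3)]
A_prev = [[2.0] * 3 for _ in range(3)]
heat_sweep(A, A_prev)
print(A[1][1])  # -> 1.5
```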

Figure 5: The dependence patterns for the heat equation (left) and Floyd-Steinberg (right)

An illustration of the dependence patterns is given in Fig. 5. The iterations in a chunk are executed in the order imposed by the dependencies of the heat equation. Whenever a synchronization point is reached, data is exchanged between the processors executing neighboring chunks.

Table 2 shows comparative results we obtained for the heat equation for the three series of experiments, DMPS(CS), DMPS(TSS) and DMPS(DTSS), on a dedicated and a non-dedicated heterogeneous cluster. The values represent the parallel times (in seconds) for different numbers of slaves. Three synchronization intervals were chosen, and the total ACP ranged, according to the number of slaves, from 3.5 to 7.5.

Fig. 6 presents the speedups for the heat equation on an index space of 5000 × 10000, for chunk sizes computed with CS, TSS and DTSS and synchronization interval 150, on a dedicated cluster (left) and a non-dedicated cluster (right).


Table 2: Parallel execution times (sec) for heat equation

Figure 6: Speedups for the heat equation

4.3 Floyd-Steinberg

The Floyd-Steinberg computation [17] is an image processing algorithm used for the error-diffusion dithering of a width × height grayscale image. The boundary conditions are ignored. The dependence vectors are d1 = (1, 0), d2 = (1, 1), d3 = (0, 1) and d4 = (1, −1). The pseudocode is given below:

/* Floyd-Steinberg */
for (i=1; i<width; i++){
    for (j=1; j<height; j++){
        I[i][j] = trunc(J[i][j]) + 0.5;
        err = J[i][j] - I[i][j]*255;
        J[i-1][j] += err*(7.0/16);
        J[i-1][j-1] += err*(3.0/16);
        J[i][j-1] += err*(5.0/16);
        J[i-1][j+1] += err*(1.0/16);
    }
}
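For comparison, here is a runnable sketch of Floyd-Steinberg dithering in its conventional forward-diffusion form, which is consistent with the stated dependence vectors (1,0), (1,1), (0,1), (1,−1); this is our transcription for illustration, not the authors' code:

```python
# Sketch (ours): Floyd-Steinberg error diffusion in the conventional
# forward form; each pixel's quantization error is pushed to not-yet-
# visited neighbors, so (i, j) depends on (i-1, j), (i-1, j-1),
# (i, j-1) and (i-1, j+1).
def dither(J):
    """J: 2-D list of grayscale values in [0, 255]; returns a 0/255 image."""
    h, w = len(J), len(J[0])
    I = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            I[i][j] = 255 if J[i][j] >= 128 else 0
            err = J[i][j] - I[i][j]
            if j + 1 < w:
                J[i][j + 1] += err * 7 / 16
            if i + 1 < h and j > 0:
                J[i + 1][j - 1] += err * 3 / 16
            if i + 1 < h:
                J[i + 1][j] += err * 5 / 16
            if i + 1 < h and j + 1 < w:
                J[i + 1][j + 1] += err * 1 / 16
    return I
```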

An illustration of the dependence patterns is given in Fig. 5. The iterations in a chunk are

executed in the order imposed by the dependencies of the Floyd-Steinberg algorithm. Whenever a

synchronization point is reached, data is exchanged between the processors executing neighboring

chunks.


Comparative results for the Floyd-Steinberg case study on a dedicated and a non-dedicated heterogeneous cluster are given in Table 3. The values represent the parallel times (in seconds) for different numbers of slaves. Three synchronization intervals were chosen, and the total ACP ranged, according to the number of slaves, from 3.5 to 7.5.

Table 3: Parallel execution times (sec) for Floyd-Steinberg

Fig. 7 presents the speedup results of the Floyd-Steinberg algorithm for the three variations. The size of the index space was 10000 × 20000. Chunk sizes were computed with CS, TSS and DTSS, and the synchronization interval was chosen to be 100, on a dedicated cluster (left) and a non-dedicated cluster (right).

Figure 7: Speedups for Floyd-Steinberg

4.4 Interpretation of the results

As expected, the results for the dedicated cluster are much better for both case studies. In particular, DMPS(TSS) performs slightly better than DMPS(CS). This was expected, since TSS provides better load balancing than CS for simple parallel loops without dependencies. In addition, DMPS(DTSS) outperforms both algorithms, because it explicitly accounts for the heterogeneity of the slaves. For the non-dedicated case, one can see that DMPS(CS) and DMPS(TSS) cannot handle workload variations as effectively as DMPS(DTSS). This is shown in Fig. 6. The speedup for DMPS(CS) and DMPS(TSS) decreases as loaded slaves are


added, whereas for DMPS(DTSS) it increases even when slaves are loaded. In the non-dedicated case, our choice was to load the slow processors, so as to incur large differences between the processing powers of the two machine types. Even in this case, DMPS(DTSS) proved to achieve good results.

Another factor affecting performance is the selection of the synchronization interval. Notice that the best synchronization interval for the heat equation was H = 150, whereas for Floyd-Steinberg better results were obtained with H = 100. The performance differences for interval sizes close to the ones depicted in Figs. 6 and 7 are small. When choosing a proper synchronization interval, the system's architecture (i.e. throughput, communication buffer size, etc.) and the communication/computation ratio of the problem under inspection must be taken into account. However, if these parameters are not known, the programmer may select an interval of his choice; nonetheless, a poor choice may lead to considerable performance deterioration.

5 Conclusion

In this paper we presented a novel dynamic scheduling scheme for dependence loops on heterogeneous clusters. We tested three variations of our method on a heterogeneous cluster, in both dedicated and non-dedicated mode. The main contribution of our work is extending three previous schemes by taking into account the iteration dependencies of the problem, and hence providing a scheme for inter-slave communication. We tested our method on two real-life applications: the heat equation and the Floyd-Steinberg algorithm. The results demonstrate that our new scheme is effective for distributed applications with dependence loops.

Future work will focus on establishing a model for predicting the optimal synchronization interval H, such that communication is minimized for every problem. We also intend to extend other well-known dynamic algorithms to apply to dependence loops, and to incorporate them in an automatic parallel code generation tool for heterogeneous systems.

Acknowledgments

The work of F. Ciorba was supported by the Greek State Scholarships Foundation. The work of Dr. Chronopoulos was supported in part by the National Science Foundation under grant CCR-0312323.

References

[1] I. Banicescu and Z. Liu. Adaptive Factoring: A Dynamic Scheduling Method Tuned to the Rate of Weight Changes. Proc. of the High Performance Computing Symposium 2000, Washington, USA, 2000, pp. 122–129.

[2] I. Banicescu, V. Velusamy and J. Devaprasad. On the Scalability of Dynamic Scheduling Scientific Applications with Adaptive Weighted Factoring. Cluster Computing 6, 2003, pp. 215–226.

[3] J. Barbosa, J. Tavares and A.J. Padilha. Linear Algebra Algorithms in a Heterogeneous Cluster of Personal Computers. Proc. of the 9th Heterogeneous Computing Workshop (HCW 2000), Cancun, Mexico, 2000, pp. 147–159.

[4] D.J. Hancock, J.M. Bull, R.W. Ford and T.L. Freeman. An Investigation of Feedback Guided Dynamic Scheduling of Nested Loops. Proc. of the IEEE International Workshops on Parallel Processing, 21-24 Aug. 2000, ed. P. Sadayappan, pp. 315–321.

[5] M. Cierniak, W. Li and M.J. Zaki. Loop Scheduling for Heterogeneity. Proc. of the 4th IEEE Intl. Symp. on High Performance Distributed Computing, Washington, DC, 1995, pp. 78–85.

[6] A.T. Chronopoulos, R. Andonie, M. Benche and D. Grosu. A Class of Distributed Self-Scheduling Schemes for Heterogeneous Clusters. Proc. of the 3rd IEEE International Conference on Cluster Computing (CLUSTER 2001), Newport Beach, CA, USA, 2001.

[7] Y.W. Fann, C.T. Yang, S.S. Tseng and C.J. Tsai. An Intelligent Parallel Loop Scheduling for Parallelizing Compilers. Journal of Information Science and Engineering, 16:69–200, 2000.

[8] G. Goumas, N. Drosinos, M. Athanasaki and N. Koziris. Compiling Tiled Iteration Spaces for Clusters. Proc. of the 4th IEEE International Conference on Cluster Computing (CLUSTER 2002), Chicago, IL, USA, 2002, pp. 360–369.

[9] S.F. Hummel, J. Schmidt, R.N. Uma and J. Wein. Load-Sharing in Heterogeneous Systems via Weighted Factoring. Proc. of the 8th Annual Symp. on Parallel Algorithms and Architectures, Padua, Italy, 1996, pp. 318–328.

[10] T.H. Kim and J.M. Purtilo. Load Balancing for Parallel Loops in Workstation Clusters. Proc. of the Intl. Conference on Parallel Processing, Bloomingdale, IL, USA, 3:182–190, 1996.

[11] E.P. Markatos and T.J. LeBlanc. Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 5(4):379–400, April 1994.

[12] Peter Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997.

[13] T.H. Tzen and L.M. Ni. Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers. IEEE Trans. on Parallel and Distributed Systems, 4(1):87–98, Jan. 1993.

[14] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.

[15] Y. Yan, C. Jin and X. Zhang. Adaptively Scheduling Parallel Loops in Distributed Shared-Memory Systems. IEEE Trans. on Parallel and Distributed Systems, 8(1):70–81, Jan. 1997.

[16] C.T. Yang and S.C. Chang. A Parallel Loop Self-Scheduling on Extremely Heterogeneous PC Clusters. Proc. of the Intl. Conf. on Computational Science, Melbourne, Australia and St. Petersburg, Russia, 2003, pp. 1079–1088.

[17] R.W. Floyd and L. Steinberg. An Adaptive Algorithm for Spatial Grey Scale. Proc. Soc. Inf. Display, 17:75–77, 1976.

[18] N. Manjikian and T.S. Abdelrahman. Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 12(3):259–271, 2001.

[19] T. Andronikos, F.M. Ciorba, P. Theodoropoulos, D. Kamenopoulos and G. Papakonstantinou. Code Generation for General Loops Using Methods from Computational Geometry. Proc. of the IASTED Parallel and Distributed Computing and Systems Conference (PDCS 2004), Cambridge, MA, USA, November 9-11, 2004, pp. 348–353.

[20] F.M. Ciorba, T. Andronikos, I. Drositis and G. Papakonstantinou. Reducing Communication via Chain Pattern Scheduling. To be presented at the 4th IEEE International Symposium on Network Computing and Applications (IEEE NCA05), Cambridge, MA, USA, July 27-29, 2005.

[21] M. Gonzalez, E. Ayguade, X. Martorell and J. Labarta. Defining and Supporting Pipelined Executions in OpenMP. Proc. of the Int'l Workshop on OpenMP Applications and Tools (WOMPAT 2001), 2001, pp. 155–169.

[22] M. Gonzalez, E. Ayguade, X. Martorell and J. Labarta. Exploiting Pipelined Executions in OpenMP. Proc. of the Int'l Conference on Parallel Processing (ICPP 2003), 2003, pp. 153–160.

[23] C.P. Kruskal and A. Weiss. Allocating Independent Subtasks on Parallel Processors. IEEE Trans. on Software Engineering, 11(10):1001–1016, 1985.

[24] C.D. Polychronopoulos and D.J. Kuck. Guided Self-Scheduling: A Practical Self-Scheduling Scheme for Parallel Supercomputers. IEEE Trans. on Computers, C-36(12):1425–1439, 1987.

[25] F. Rastello, A. Rao and S. Pande. Optimal Task Scheduling to Minimize Inter-Tile Latencies. Parallel Computing, 29(2):209–239, 2003.

[26] J. Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, August 2000.

[27] P. Boulet, J. Dongarra, F. Rastello, Y. Robert and F. Vivien. Algorithmic Issues on Heterogeneous Computing Platforms. Parallel Processing Letters, 9(2):197–213, 1998.

[28] T. Thanalapati and S. Dandamudi. An Efficient Adaptive Scheduling Scheme for Distributed Memory Multicomputers. IEEE Transactions on Parallel and Distributed Systems, 12(7):758–768, 2001.

[29] Y.K. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Computing Surveys, 31(4):406–471, 1999.