
Toward a Theory for Scheduling Dags in Internet-Based Computing

Grzegorz Malewicz, Member, IEEE, Arnold L. Rosenberg, Fellow, IEEE, and Matthew Yurkewych

Abstract—Conceptual and algorithmic tools are developed as a foundation for a theory of scheduling complex computation-dags for Internet-based computing. The goal of the schedules produced is to render tasks eligible for allocation to remote clients (hence, for execution) at the maximum possible rate. This allows one to utilize remote clients well, as well as to lessen the likelihood of the “gridlock” that ensues when a computation stalls for lack of eligible tasks. Earlier work has introduced a formalism for studying this optimization problem and has identified optimal schedules for several significant families of structurally uniform dags. The current paper extends this work via a methodology for devising optimal schedules for a much broader class of complex dags, which are obtained via composition from a prespecified collection of simple building-block dags. The paper provides a suite of algorithms that decompose a given dag G to expose its building blocks and an execution-priority relation ▷ on building blocks. When the building blocks are appropriately interrelated under ▷, the algorithms specify an optimal schedule for G.

Index Terms—Internet-based computing, grid computing, global computing, Web computing, scheduling dags, dag decomposition, theory.


1 INTRODUCTION

Earlier work [15], [17] has developed the Internet-Computing (IC, for short) Pebble Game, which abstracts the problem of scheduling computations having intertask dependencies, for several modalities of Internet-based computing—including Grid computing (cf. [2], [6], [5]), global computing (cf. [3]), and Web computing (cf. [12]). (As is traditional—cf. [8], [9]—we model such a computation as a dag, or directed acyclic graph.) The quality metric for schedules produced using the Game is to maximize the rate at which tasks are rendered eligible for allocation to remote clients (hence, for execution), with the dual aim of: 1) enhancing the effective utilization of remote clients and 2) lessening the likelihood of the “gridlock” that can arise when a computation stalls pending computation of already allocated tasks.

A simple example should illustrate our scheduling objective. Consider the two-dimensional evolving mesh of Fig. 1. An optimal schedule for this dag executes tasks sequentially along each level [15] (as numbered in the figure). If just one client participates in the computation, then, after five tasks have been executed, we can allocate any of three eligible tasks to the client. If there are several clients, we could encounter a situation wherein two of these three eligible tasks (marked A in the figure) are allocated to clients who have not yet finished executing them. There is, then, only one task (marked E) that is eligible and unallocated. If two clients now request work, we may be able to satisfy only one request, thus wasting the computing resources of one client. Since an optimal schedule maximizes the number of eligible tasks, it minimizes the likelihood of this waste of resources (whose extreme case is the gridlock that arises when all eligible tasks have been allocated, but none has been executed).
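To make the bookkeeping in this example concrete, here is a minimal simulation (ours, not from the paper) of the level-sequential schedule on a finite prefix of the evolving mesh; the node representation and function names are our own. Running it reproduces the count of three eligible tasks after five executions.

```python
# Hedged illustration (ours, not from the paper): simulate the level-sequential
# schedule on a finite prefix of the 2D evolving mesh. Node (i, j) has arcs to
# (i + 1, j) and (i, j + 1); eligibility is counted within the prefix only.
def parents(node):
    i, j = node
    return [p for p in ((i - 1, j), (i, j - 1)) if p[0] >= 0 and p[1] >= 0]

def simulate(levels):
    nodes = [(i, l - i) for l in range(levels) for i in range(l + 1)]  # level order
    executed = set()
    for t, v in enumerate(nodes, start=1):
        executed.add(v)
        eligible = [u for u in nodes if u not in executed
                    and all(p in executed for p in parents(u))]
        print(f"step {t:2d}: executed {v}, {len(eligible)} eligible")

simulate(levels=4)   # after 5 executions: 3 eligible tasks, as in Fig. 1
```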

Many IC projects—cf. [2], [11], [18]—monitor the past histories of remote clients, their current computational capabilities, or both. While the resulting snapshots yield no guarantees of future performance, they at least afford the server a basis for estimating such performance. Our study proceeds under the idealized assumption that such monitoring yields sufficiently accurate predictions of clients' future performance that the server can allocate eligible tasks to clients in an order that makes it likely that tasks will be executed in the order of their allocation. We show how such information often allows us to craft schedules that produce maximally many eligible tasks after each task execution.

Our contributions. We develop the framework of a theory of Internet-based scheduling via three conceptual/algorithmic contributions. 1) We introduce a new “priority” relation, denoted ▷, on pairs of bipartite dags; the assertion “G_1 ▷ G_2” guarantees that one never sacrifices our quality metric (which rewards a schedule's rate of producing eligible tasks) by executing all sources of G_1, then all sources of G_2, then all sinks of both dags. We provide a repertoire of bipartite building-block dags, show how to schedule each optimally, and expose the ▷-interrelationships among them. 2) We specify a way of “composing” building blocks to obtain dags of possibly quite complex structure; cf. Fig. 2. If the building blocks used in the composition form a “relation-chain” under ▷, then the resulting composite dag


is guaranteed to admit an optimal schedule. 3) The framework developed thus far is descriptive rather than prescriptive. It says that if a dag G is constructed from bipartite building blocks via composition and if we can identify the “blueprint” used to construct G and if the underlying building blocks are interrelated in a certain way, then a prescribed strategy produces an optimal schedule for G. We next address the algorithmic challenge in the preceding ifs: Given a dag G, how does one apply the preceding framework to it? We develop a suite of algorithms that: a) reduce any dag G to its “transitive skeleton” G′, a simplified version of G that shares the same set of optimal schedules; b) decompose G′ to determine whether or not it is constructed from bipartite building blocks via composition, thereby exposing a “blueprint” for G′; c) specify an optimal schedule for any such G′ that is built from building blocks that form a “relation-chain” under ▷. For illustration, all of the dags in Fig. 2 yield to our algorithms.

The scheduling theory we develop here has the potential of improving efficiency and fault tolerance in existing Grid systems. As but one example, when Condor [19] executes computations with complex task dependencies, such as the Sloan Digital Sky Survey [1], it uses a “FIFO” regimen to sequence the allocation of eligible tasks. Given the temporal unpredictability of the remote clients, this scheduling may sometimes lead to an ineffective use of the clients' computing resources and, in the extreme case, to “gridlock.” Our scheduling algorithms have the potential of reducing the severity of these issues. Experimental work is underway to determine how to enhance this potential.

Related work. Most closely related to our study are its

immediate precursors and motivators, [15], [17]. The main

results of those sources demonstrate the necessity and

sufficiency of parent orientation for optimality in scheduling

the dags of Fig. 3. Notably, these dags yield to the

algorithms presented here, so our results both extend the

results in [15], [17] and explain their underlying principles

in a general setting. In a companion to this study, we are

pursuing an orthogonal direction for extending [15], [17].

Motivated by the demonstration in Section 3.4 of the limited

scope of the notion of optimal schedule that we study here,

we formulate, in [14], a scheduling paradigm in which a

server allocates batches of tasks periodically, rather than

allocating individual tasks as soon as they become eligible.

Optimality is always possible within this new framework,

but achieving it may entail a prohibitively complex

computation. An alternative direction of inquiry appears

in [7], [13], where a probabilistic pebble game is used to

study the execution of interdependent tasks on unreliable

clients. Finally, our study has been inspired by the many

exciting systems and/or application-oriented studies of

Internet-based computing, in sources such as [2], [3], [5], [6],

[11], [12], [18].

2 EXECUTING DAGS ON THE INTERNET

We review the basic graph-theoretic terms used in our study. We then introduce several bipartite “building blocks” to exemplify our theory. Finally, we present the pebble game that we use to model computations on dags.

2.1 Computation-Dags

2.1.1 Basic Definitions

A directed graph G is given by a set of nodes N_G and a set of arcs (or directed edges) A_G, each having the form (u → v), where u, v ∈ N_G. A path in G is a sequence of arcs that share adjacent endpoints, as in the following path from node u_1 to node u_n: (u_1 → u_2), (u_2 → u_3), ..., (u_{n−2} → u_{n−1}), (u_{n−1} → u_n). A dag (directed acyclic graph) G is a directed graph that has no cycles; i.e., in a dag, no path of the preceding form has u_1 = u_n.

When a dag G is used to model a computation, i.e., is a computation-dag:

• each v ∈ N_G represents a task in the computation;
• an arc (u → v) ∈ A_G represents the dependence of task v on task u: v cannot be executed until u is.

Given an arc (u → v) ∈ A_G, u is a parent of v and v is a child of u in G. Each parentless node of G is a source (node), and each childless node is a sink (node); all other nodes are internal. A dag G is bipartite if:

1. N_G can be partitioned into subsets X and Y such that, for every arc (u → v) ∈ A_G, u ∈ X and v ∈ Y;
2. each v ∈ N_G is incident to some arc of G, i.e., is either the node u or the node w of some arc (u → w) ∈ A_G. (Prohibiting “isolated” nodes avoids degeneracies.)


Fig. 1. An optimal schedule helps utilize clients well and reduce chances of gridlock.

Fig. 2. Dags with complex task dependencies that our algorithms can schedule optimally.

G is connected if, when arc-orientations are ignored, there is a

path connecting every pair of distinct nodes.

2.1.2 A Repertoire of Building Blocks

Our study applies to any repertoire of connected bipartite

building-block dags that one chooses to build complex dags

from. For illustration, we focus on the following specific

dags. The following descriptions proceed left to right along

successive rows of Fig. 4; we use the drawings to refer to

“left” and “right.”

The first three dags are named for the Latin letters suggested by their topologies. W-dags epitomize “expansive” computations and M-dags epitomize “reductive” computations.

W-dags. For each integer d > 1, the (1, d)-W-dag W_{1,d} has one source and d sinks; its d arcs connect the source to each sink. Inductively, for positive integers a, b, the (a + b, d)-W-dag W_{a+b,d} is obtained from the (a, d)-W-dag W_{a,d} and the (b, d)-W-dag W_{b,d} by identifying (or merging) the rightmost sink of the former dag with the leftmost sink of the latter.

M-dags. For each integer d > 1, the (1, d)-M-dag M_{1,d} has d sources and one sink; its d arcs connect each source to the sink. Inductively, for positive integers a, b, the (a + b, d)-M-dag M_{a+b,d} is obtained from the (a, d)-M-dag M_{a,d} and the (b, d)-M-dag M_{b,d} by identifying (or merging) the rightmost source of the former dag with the leftmost source of the latter.

N-dags. For each integer s > 0, the s-N-dag N_s has s sources and s sinks; its 2s − 1 arcs connect each source v to sink v, and to sink v + 1 if the latter exists. N_s is obtained from W_{s−1,2} by adding a new source on the right whose sole arc goes to the rightmost sink. The leftmost source of N_s—the dag's anchor—has a child that has no other parents.

(Bipartite) Cycle-dags. For each integer s > 1, the s-(Bipartite) Cycle-dag C_s is obtained from N_s by adding a new arc from the rightmost source to the leftmost sink—so that each source v has arcs to sinks v and (v + 1) mod s.

(Bipartite) Clique-dags. For each integer s > 1, the s-(Bipartite) Clique-dag Q_s has s sources and s sinks and an arc from each source to each sink.
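For concreteness, the five building blocks can be encoded compactly. The sketch below is our own rendering (not code from the paper): each block is a triple (number of sources, number of sinks, arcs), with sources and sinks indexed left to right as in Fig. 4.

```python
# Hedged sketch (ours): building blocks as (num_sources, num_sinks, arcs), where
# arcs holds pairs (source_index, sink_index), indexed left to right as in Fig. 4.
def W(s, d):  # (s,d)-W-dag: s sources, s(d-1)+1 sinks; adjacent sources share a sink
    return s, s*(d - 1) + 1, {(i, i*(d - 1) + k) for i in range(s) for k in range(d)}

def M(s, d):  # (s,d)-M-dag: the "mirror image" of W(s, d)
    return s*(d - 1) + 1, s, {(i*(d - 1) + k, i) for i in range(s) for k in range(d)}

def N(s):     # s-N-dag: source v feeds sink v, and sink v+1 when it exists
    return s, s, {(v, v) for v in range(s)} | {(v, v + 1) for v in range(s - 1)}

def C(s):     # s-Cycle-dag: source v feeds sinks v and (v+1) mod s
    return s, s, {(v, v) for v in range(s)} | {(v, (v + 1) % s) for v in range(s)}

def Q(s):     # s-Clique-dag: every source feeds every sink
    return s, s, {(u, v) for u in range(s) for v in range(s)}
```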

We choose the preceding building blocks because the dags of Fig. 3 can all be constructed using these blocks. Although details must await Section 4, it is intuitively clear from the figure that the evolving mesh is constructed from its source outward by “composing” (or “concatenating”) a (1, 2)-W-dag with a (2, 2)-W-dag, then a (3, 2)-W-dag, and so on; the reduction-mesh is constructed from its sources upward using (k, 2)-M-dags for successively decreasing values of k; the reduction-tree is constructed from its sources/leaves upward by “concatenating” collections of (1, 2)-M-dags; the FFT dag is constructed from its sources outward by “concatenating” collections of 2-cycles (which are identical to 2-cliques).


Fig. 3. (a) An evolving (two-dimensional) mesh, (b) a (binary) reduction-tree, (c) an FFT-dag, and (d) a (two-dimensional) reduction-mesh (or pyramid dag).

Fig. 4. The building blocks of semi-uniform dags.

2.2 The Internet-Computing Pebble Game

A number of so-called pebble games on dags have been shown, over the course of several decades, to yield elegant formal analogues of a variety of problems related to scheduling computation-dags. Such games use tokens, called pebbles, to model the progress of a computation on a dag: The placement or removal of the various available types of pebbles—which is constrained by the dependencies modeled by the dag's arcs—represents the changing (computational) status of the dag's task-nodes.

Our study is based on the Internet-Computing (IC, for short) Pebble Game of [15], whose structure derives from the “no recomputation allowed” pebble game of [16]. Arguments are presented in [15], [17] (q.v.) that justify studying a simplified form of the Game in which task-execution order follows task-allocation order. As we remark in the Introduction, while we recognize that this assumption will never be completely realized in practice, one hopes that careful monitoring of the clients' past behaviors and current capabilities, as prescribed in, say, [2], [11], [18], can enhance the likelihood, if not the certainty, of the desired order.

2.2.1 The Rules of the Game

The IC Pebble Game on a computation-dag G involves one player S, the Server, who has access to unlimited supplies of two types of pebbles: ELIGIBLE pebbles, whose presence indicates a task's eligibility for execution, and EXECUTED pebbles, whose presence indicates a task's having been executed. We now present the rules of our simplified version of the IC Pebble Game of [15], [17].

The Rules of the IC Pebble Game

• S begins by placing an ELIGIBLE pebble on each unpebbled source of G. /* Unexecuted sources are always eligible for execution, having no parents whose prior execution they depend on. */
• At each step, S
  - selects a node that contains an ELIGIBLE pebble,
  - replaces that pebble by an EXECUTED pebble,
  - places an ELIGIBLE pebble on each unpebbled node of G, all of whose parents contain EXECUTED pebbles.
• S's goal is to allocate nodes in such a way that every node v of G eventually contains an EXECUTED pebble. /* This modest goal is necessitated by the possibility that G is infinite. */
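The rules translate directly into a short event-driven loop. The following sketch is our own hedged rendering of a play of the Game on a finite dag: `parents` maps each node to its parent set, and `schedule` embodies the Server's selection rule.

```python
# Hedged rendering (ours) of a play of the simplified IC Pebble Game on a finite
# dag. Returns the trace of ELIGIBLE counts, i.e., the value of Ê at each step.
def play(parents, schedule):
    nodes = set(parents)
    executed = set()
    eligible = {v for v in nodes if not parents[v]}       # pebble every source
    trace = [len(eligible)]                               # count before any step
    while eligible:
        v = schedule(eligible, executed)                  # select an ELIGIBLE node
        eligible.remove(v)
        executed.add(v)                                   # ELIGIBLE -> EXECUTED
        for w in nodes - executed - eligible:             # newly enabled nodes
            if parents[w] <= executed:
                eligible.add(w)
        trace.append(len(eligible))
    return trace

dag = {"x": set(), "y": set(), "a": {"x"}, "b": {"x", "y"}}    # N_2; anchor x
order = ["x", "y", "a", "b"]                                   # sources first
print(play(dag, lambda elig, ex: min(elig, key=order.index)))  # [2, 2, 2, 1, 0]
```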

Note. The (idealized) IC Pebble Game on a dag G executes one task/node of G per step. The reader should not infer that we are assuming a repertoire of tasks that are uniformly computable in unit time. Once we adopt the simplifying assumption that task-execution order follows task-allocation order, we can begin to measure time in an event-driven way, i.e., per task, rather than chronologically, i.e., per unit time. Therefore, our model allows tasks to be quite heterogeneous in complexity, as long as the Server can match the tasks' complexities with the clients' resources (via the monitoring alluded to earlier).

A schedule for the IC Pebble Game on a dag G is a rule for selecting which ELIGIBLE pebble to execute at each step of a play of the Game. For brevity, we henceforth call a node ELIGIBLE (respectively, EXECUTED) when it contains an ELIGIBLE (respectively, an EXECUTED) pebble. For uniformity, we henceforth talk about executing nodes rather than tasks.

2.2.2 IC Quality

The goal in the IC Pebble Game is to play the Game in a way that maximizes the number of ELIGIBLE nodes at every step t. For each step t of a play of the Game on a dag G under a schedule Σ: Ê_Σ(t) denotes the number of nodes of G that are ELIGIBLE at step t, and E_Σ(t) the number of ELIGIBLE nonsource nodes. (Note that E_Σ(0) = 0.) We measure the IC quality of a play of the IC Pebble Game on a dag G by the size of Ê_Σ(t) at each step t of the play—the bigger Ê_Σ(t) is, the better. Our goal is an IC-optimal schedule Σ, in which, for all steps t, Ê_Σ(t) is as big as possible.

It is not a priori clear that IC-optimal schedules ever exist! The property demands that there be a single schedule Σ for dag G such that, at every step of the computation, Σ maximizes the number of ELIGIBLE nodes across all schedules for G. In principle, it could be that every schedule that maximizes the number of ELIGIBLE nodes at step t requires that a certain set of t nodes has been executed, while every analogous schedule for step t + 1 requires that a different set of t + 1 nodes has been executed. Indeed, we see in Section 3.4 that there exist dags that do not admit any IC-optimal schedule. Surprisingly, though, the strong requirement of IC optimality can be achieved for large families of dags—even ones of quite complex structure.

The significance of IC quality—hence, of IC optimality—stems from the following intuitive scenarios: 1) Schedules that produce ELIGIBLE nodes maximally fast may reduce the chance of a computation's “stalling” because no new tasks can be allocated pending the return of already assigned ones. 2) If the Server receives a batch of requests for tasks at (roughly) the same time, then an IC-optimal schedule ensures that maximally many tasks are ELIGIBLE at that time, so that maximally many requests can be satisfied. See [15], [17] for more elaborate discussions of IC quality.
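The quantified nature of IC optimality—one schedule must dominate at every step—can be tested exhaustively on very small dags. The brute-force sketch below is ours (not the paper's); it enumerates all legal execution orders and checks whether some eligibility trace dominates all others pointwise.

```python
# Hedged brute force (ours): on a tiny dag, enumerate every legal execution order
# and ask whether some eligibility trace dominates all others at every step.
from itertools import permutations

def trace(parents, order):
    executed, tr = set(), []
    for v in order:
        if not parents[v] <= executed:
            return None                      # order violates a dependency
        executed.add(v)
        tr.append(sum(1 for w in parents
                      if w not in executed and parents[w] <= executed))
    return tr

def has_ic_optimal(parents):
    traces = [t for p in permutations(parents) if (t := trace(parents, p))]
    return any(all(t[i] >= u[i] for u in traces for i in range(len(t)))
               for t in traces)

print(has_ic_optimal({"x": set(), "y": set(), "a": {"x"}, "b": {"x", "y"}}))  # True
```

For the incompatible sums exhibited in Section 3.4 (e.g., C_2 + C_3), this test returns False, in accord with Lemma 2 below.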

3 THE RUDIMENTS OF IC-OPTIMAL SCHEDULING

We now lay the groundwork for an algorithmic theory of how to devise IC-optimal schedules. Beginning with a result that simplifies the quest for such schedules, we expose IC-optimal schedules for the building blocks of Section 2.1.2. We then create a framework for scheduling disjoint collections of building blocks via a priority relation on dags, and we demonstrate the nonexistence of such schedules for certain other collections.

Executing a sink produces no ELIGIBLE nodes, while executing a nonsink may. This simple fact allows us to focus on schedules with the following simple structure:

Lemma 1. Let Σ be a schedule for a dag G. If Σ is altered to execute all of G's nonsinks before any of its sinks, then the IC quality of the resulting schedule is no less than Σ's.

When applied to a bipartite dag G, Lemma 1 says that we never diminish IC quality by executing all of G's sources before executing any of its sinks.

3.1 IC-Optimal Schedules for Individual Building Blocks

A schedule for any of the very uniform dags of Fig. 3 is IC optimal when it sequences task executions along each level of the dag [15]. While such an order is neither necessary nor sufficient for IC optimality with the “semi-uniform” dags studied later, it is important when scheduling the building-block dags of Section 2.1.2.

Theorem 1. Each of our building-block dags admits an IC-optimal schedule that executes sources from one end to the other; for N-dags, the execution must begin with the anchor.

Proof. The structures of the building blocks render the following bounds on E_Σ(t) obvious, as t ranges from 0 to the number of sources in the given dag (for any statement P, the indicator [P] equals 1 if P holds and 0 otherwise):

W_{s,d}:  E_Σ(t) ≤ (d − 1)t + [t = s];
N_s:      E_Σ(t) ≤ t;
M_{s,d}:  E_Σ(t) ≤ [t = 0] + ⌊(t − 1)/(d − 1)⌋;
C_s:      E_Σ(t) ≤ t − [t ≠ 0] + [t = s];
Q_s:      E_Σ(t) = s · [t = s].

The execution orders in the theorem convert each of these bounds to an equality. □
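As a sanity check, these closed forms can be confirmed numerically. The sketch below is ours (the W-dag encoding follows the constructor sketched in Section 2.1.2); it executes the sources of a small W-dag left to right and verifies that the ELIGIBLE-sink count attains Theorem 1's bound at every step.

```python
# Hedged numerical check (ours): execute a W-dag's sources left to right and
# confirm that the ELIGIBLE-sink count attains Theorem 1's bound with equality.
def W(s, d):
    return s, s*(d - 1) + 1, {(i, i*(d - 1) + k) for i in range(s) for k in range(d)}

def eligible_sinks(block, t):
    n_src, n_snk, arcs = block
    done = set(range(t))                       # sources 0..t-1 are EXECUTED
    return sum(1 for v in range(n_snk)
               if {u for (u, w) in arcs if w == v} <= done)

s, d = 3, 3
for t in range(s + 1):
    bound = (d - 1)*t + (1 if t == s else 0)   # Theorem 1's bound for W_{s,d}
    assert eligible_sinks(W(s, d), t) == bound
print("W(3,3): the left-to-right schedule meets its bound at every step")
```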

3.2 Execution Priorities for Bipartite Dags

We now define a relation on bipartite dags that often affords us an easy avenue toward IC-optimal schedules—for complex, as well as bipartite, dags.

Let the disjoint bipartite dags G_1 and G_2 have s_1 and s_2 sources and admit the IC-optimal schedules Σ_1 and Σ_2, respectively. If the following inequalities hold (here [a, b] denotes the set of integers {a, a + 1, ..., b}):

(∀x ∈ [0, s_1]) (∀y ∈ [0, s_2]):
E_{Σ_1}(x) + E_{Σ_2}(y) ≤ E_{Σ_1}(min{s_1, x + y}) + E_{Σ_2}((x + y) − min{s_1, x + y}),   (1)

then we say that G_1 has priority over G_2, denoted G_1 ▷ G_2. The inequalities in (1) say that one never decreases IC quality by executing a source of G_1, in preference to a source of G_2, whenever possible.
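Since the quantification in (1) is finite, priority between two blocks can be checked mechanically from their E-profiles. The sketch below is our own illustration, instantiated with Theorem 1's closed forms for W- and M-dags; its answer agrees with Theorem 3's claim that W-dags have priority over M-dags.

```python
# Hedged sketch (ours): decide G1 ▷ G2 by checking inequality (1) exhaustively,
# given functions E1, E2 mapping t to E_Sigma(t) under each block's IC-optimal
# schedule, and the source counts s1, s2.
def has_priority(E1, s1, E2, s2):
    return all(E1(x) + E2(y) <= E1(min(s1, x + y)) + E2((x + y) - min(s1, x + y))
               for x in range(s1 + 1) for y in range(s2 + 1))

# Example with Theorem 1's closed forms: W_{2,3} (2 sources) vs. M_{2,2} (3 sources).
E_W = lambda s, d: (lambda t: (d - 1)*t + (t == s))
E_M = lambda s, d: (lambda t: (t == 0) + (t - 1)//(d - 1))
print(has_priority(E_W(2, 3), 2, E_M(2, 2), 3))    # True, as Theorem 3 predicts
```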

The following result is quite important in our algorithmic framework:

Theorem 2. The relation ▷ on bipartite dags is transitive.

Proof. Let G_1, G_2, G_3 be arbitrary bipartite dags such that:

1. each G_i has s_i sources and admits an IC-optimal schedule Σ_i;
2. G_1 ▷ G_2 and G_2 ▷ G_3.

To see that G_1 ▷ G_3, focus on a moment when we have executed x_1 < s_1 sources of G_1 and x_3 ≤ s_3 sources of G_3 (so E_{Σ_1}(x_1) + E_{Σ_3}(x_3) sinks are ELIGIBLE). We consider two cases.

Case 1: s_1 − x_1 ≥ min{s_2, x_3}. In this case, we have

E_{Σ_1}(x_1) + E_{Σ_3}(x_3) ≤ E_{Σ_1}(x_1) + E_{Σ_2}(min{s_2, x_3}) + E_{Σ_3}(x_3 − min{s_2, x_3})
                           ≤ E_{Σ_1}(x_1 + min{s_2, x_3}) + E_{Σ_3}(x_3 − min{s_2, x_3});   (2)

the first inequality follows because G_2 ▷ G_3, the second because G_1 ▷ G_2. We can iterate these transfers until either all sources of G_1 are EXECUTED or no sources of G_3 are EXECUTED.

Case 2: s_1 − x_1 < min{s_2, x_3}. This case is a bit subtler than the preceding one. Let y = s_3 − x_3 and z = (s_1 − x_1) + (s_3 − x_3) = (s_1 − x_1) + y. Then, x_1 = s_1 − (z − y) and x_3 = s_3 − y. (This change of notation is useful because it relates x_1 and x_3 to the numbers of sources in G_1 and G_3.) We note the following useful facts about y and z:

• 0 ≤ y ≤ z, by definition;
• 0 ≤ z < s_3, because s_1 − x_1 < x_3;
• z − y ≤ s_1, because x_1 ≥ 0;
• s_1 − x_1 = z − y, by definition;
• z − y < s_2, because s_1 − x_1 < s_2.

Now, we apply these observations to the problem at hand. Because G_2 ▷ G_3 and z − y ∈ [0, s_2] and {y, z} ⊆ [0, s_3], we know that

E_{Σ_2}(s_2 − (z − y)) + E_{Σ_3}(s_3 − y) ≤ E_{Σ_2}(s_2) + E_{Σ_3}(s_3 − z),

so that

E_{Σ_3}(s_3 − y) − E_{Σ_3}(s_3 − z) ≤ E_{Σ_2}(s_2) − E_{Σ_2}(s_2 − (z − y)).   (3)

Intuitively, executing the last z − y sources of G_2 is no worse (in IC quality) than executing the “intermediate” sources s_3 − z through s_3 − y of G_3.

Similarly, because G_1 ▷ G_2 and z − y ∈ [0, min{s_1, s_2}], we know that

E_{Σ_1}(s_1 − (z − y)) + E_{Σ_2}(s_2) ≤ E_{Σ_1}(s_1) + E_{Σ_2}(s_2 − (z − y)),

so that

E_{Σ_2}(s_2) − E_{Σ_2}(s_2 − (z − y)) ≤ E_{Σ_1}(s_1) − E_{Σ_1}(s_1 − (z − y)).   (4)

Intuitively, executing the last z − y sources of G_1 is no worse (in IC quality) than executing the last z − y sources of G_2.

By transitivity (of ≤), inequalities (3), (4) imply that

E_{Σ_3}(s_3 − y) − E_{Σ_3}(s_3 − z) ≤ E_{Σ_1}(s_1) − E_{Σ_1}(s_1 − (z − y)),

so that

E_{Σ_1}(x_1) + E_{Σ_3}(x_3) = E_{Σ_1}(s_1 − (z − y)) + E_{Σ_3}(s_3 − y)
                           ≤ E_{Σ_1}(s_1) + E_{Σ_3}(s_3 − z)
                           = E_{Σ_1}(s_1) + E_{Σ_3}(x_3 − (s_1 − x_1)).   (5)

The preceding cases—particularly, the chains of inequalities (2), (5)—verify that system (1) always holds for G_1 and G_3, so that G_1 ▷ G_3, as was claimed. □

Theorem 2 has a corollary that further exposes the nature of ▷ and that tells us how to schedule pairwise ▷-comparable bipartite dags IC optimally. Specifically, we develop tools that extend Theorem 1 to disjoint unions—called sums—of building-block dags. Let G_1, ..., G_n be connected bipartite dags that are pairwise disjoint, in that N_{G_i} ∩ N_{G_j} = ∅ for all distinct i and j. The sum of these dags, denoted G_1 + ··· + G_n, is the bipartite dag whose node-set and arc-set are, respectively, the unions of the corresponding sets of G_1, ..., G_n.

Corollary 1. Let G_1, ..., G_n be pairwise disjoint bipartite dags, with each G_i admitting an IC-optimal schedule Σ_i. If G_1 ▷ ··· ▷ G_n, then the schedule Σ* for the sum G_1 + ··· + G_n that executes, in turn, all sources of G_1 according to Σ_1, all sources of G_2 according to Σ_2, and so on, for all i ∈ [1, n], and, finally, executes all sinks, is IC optimal.

Proof. By Lemma 1, we lose no generality by focusing on a step t when the only EXECUTED nodes are sources of the sum-dag. For any indices i and j > i, the transitivity of ▷ guarantees that G_i ▷ G_j. Suppose that some sources of G_i are not EXECUTED at step t, but at least one source of G_j is EXECUTED. Then, by the definition of ▷, in (1), we never decrease the number of ELIGIBLE sinks at step t by “transferring” as many source-executions as possible from G_j to G_i. By repeating such “transfers” a finite number of times, we end up with a “left-loaded” situation at step t, wherein there exists i ∈ [1, n] such that all sources of G_1, ..., G_{i−1} are EXECUTED, some sources of G_i are EXECUTED, and no sources of G_{i+1}, ..., G_n are EXECUTED. □

One can actually prove Corollary 1 without invoking the transitivity of ▷, by successively “transferring executions” from each G_i to G_{i−1}.
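Corollary 1's schedule is just a concatenation. A minimal sketch (ours), assuming the blocks are already listed in ▷-order and each block supplies its sources in the order its own IC-optimal schedule mandates:

```python
# Hedged sketch (ours) of Corollary 1's schedule for a sum: `blocks` is listed in
# priority order, and each entry is (sources_in_Sigma_i_order, sinks).
def sum_schedule(blocks):
    order = [v for sources, _ in blocks for v in sources]   # all sources, block by block
    order += [v for _, sinks in blocks for v in sinks]      # then all sinks, any order
    return order

# W_{1,2} + M_{1,2}: W-dags have priority over M-dags (Theorem 3, part 1b).
w = (["w_src"], ["w_snk0", "w_snk1"])
m = (["m_src0", "m_src1"], ["m_snk"])
print(sum_schedule([w, m]))   # sources of W, then of M, then all sinks
```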

3.3 Priorities among Our Building Blocks

We now determine the pairwise priorities among the building-block dags of Section 2.1.2.

Theorem 3. We observe the following pairwise priorities among our building-block dags:

1. For all s and d, W_{s,d} ▷ G for the following bipartite dags G:
   a. all W-dags W_{s′,d′} whenever d′ < d, or whenever d′ = d and s′ ≤ s;
   b. all M-dags, N-dags, and Cycle-dags; and
   c. Clique-dags Q_{s′} with s′ ≤ d.
2. For all s, N_s ▷ G for the following bipartite dags G:
   a. all N-dags N_{s′}, for all s′; and
   b. all M-dags.
3. For all s, C_s ▷ G for the following bipartite dags G:
   a. C_s and
   b. all M-dags.
4. For all s and d, M_{s,d} ▷ M_{s′,d′} whenever d′ > d, or whenever d′ = d and s′ ≤ s.
5. For all s, Q_s ▷ Q_s.

The proof of Theorem 3 is a long sequence of calculations paired with an invocation of the transitivity of ▷; we relegate it to the Appendix (Section A), which can be found on the Computer Society Digital Library at http://computer.org/tc/archives.htm.

3.4 Incompatible Sums of Building Blocks

Each of our building blocks admits an IC-optimal schedule, but some of their sums do not.

Lemma 2. The following sums of building-block dags admit no IC-optimal schedule:

1. all sums of the forms C_{s_1} + C_{s_2} or C_{s_1} + Q_{s_2} or Q_{s_1} + Q_{s_2}, where s_1 ≠ s_2;
2. all sums of the form N_{s_1} + C_{s_2} or N_{s_1} + Q_{s_2}; and
3. all sums of the form Q_{s_1} + M_{s_2,d}, where s_1 > s_2.

Proof.

1. Focus on schedules for the dag G = C_{s_1} + C_{s_2}, where s_1 ≠ s_2. There is a unique family F_1 of schedules for which E_Σ(s_1) = s_1; all of these execute sources of C_{s_1} for the first s_1 steps. For any other schedule Σ′, E_{Σ′}(s_1) < E_Σ(s_1). Similarly, there is a unique family F_2 of schedules for which E_Σ(s_2) = s_2; all of these execute sources of C_{s_2} for the first s_2 steps. For any other schedule Σ′, E_{Σ′}(s_2) < E_Σ(s_2). Since s_1 ≠ s_2, the families F_1 and F_2 are disjoint! Thus, no schedule for G maximizes IC quality at both steps s_1 and s_2; hence, G does not admit any IC-optimal schedule. Exactly the same argument works for the other indicated sum-dags of part 1.

2. Say, for contradiction, that there is an IC-optimal schedule Σ for a dag N_{s_1} + G_{s_2}, where G_{s_2} ∈ {C_{s_2}, Q_{s_2}}. The first node that Σ executes must be the anchor of N_{s_1}, for only this choice yields E_Σ(1) ≠ 0. It follows that Σ must execute all sources of N_{s_1} in the first s_1 steps, for this would yield E_Σ(t) = t for all t ≤ s_1, while any other choice would not maximize IC quality until step s_1. We claim that Σ does not maximize IC quality at some step s > 1 and, hence, cannot be IC optimal. To wit: If s_2 ≤ s_1, then Σ's deficiency is manifest at step s_1 + 1. A schedule Σ′ that executes all sources of G_{s_2} and then executes s_1 − s_2 + 1 sources of N_{s_1} has E_{Σ′}(s_1 + 1) = s_1 + 1. But Σ executes a source of G_{s_2} for the first time at step s_1 + 1 and, so, E_Σ(s_1 + 1) = s_1. If s_2 > s_1, then Σ's deficiency is manifest at step s_2. A schedule Σ′ that executes all sources of G_{s_2} during the first s_2 steps has E_{Σ′}(s_2) = s_2. However, during this period, Σ executes some x ≥ 1 sources of N_{s_1}, hence, only some y ≤ s_2 − 1 sources of G_{s_2}. (Note that x + y = s_2.) Since s_1 < s_2, it must be that y ≥ 1. But, then, by step s_2, Σ will have produced exactly x ELIGIBLE sinks on N_{s_1} and no more than y − 1 ELIGIBLE sinks on G_{s_2}, so that E_Σ(s_2) = x + y − 1 < s_2.

3. Assume, for contradiction, that there is an IC-optimal schedule Σ for Q_{s_1} + M_{s_2,d}, where s_1 > s_2. Focus on the numbers of ELIGIBLE sinks after s_1 and after s_2 steps. The first s_2 nodes that Σ executes must be nodes of M_{s_2,d} dictated by an IC-optimal schedule for that dag, for this is the only choice for which E_Σ(s_2) ≠ 0. A schedule Σ′ that executes all sources of Q_{s_1} during the first s_1 steps would have E_{Σ′}(s_1) = s_1. Consider what Σ can have produced by step s_1. Since Σ spends at least one step before step s_1 executing a node of M_{s_2,d}, it cannot have rendered any sink of Q_{s_1} ELIGIBLE by step s_1; hence, E_Σ(s_1) ≤ ⌊(s_1 − 1)/(d − 1)⌋ ≤ s_1 − 1. It follows that Σ cannot be IC optimal. □

We summarize our priority-related results about sums of

building blocks in Table 1.

4 ON SCHEDULING COMPOSITIONS OF BUILDING BLOCKS

We now show how to devise IC-optimal schedules for complex dags that are obtained via composition from any base set of connected bipartite dags that can be related by ▷. We illustrate the process using the building blocks of Section 2.1.2 as a base set.

We inductively define the operation of composition on dags.

• Start with a base set B of connected bipartite dags.
• Given G_1, G_2 ∈ B—which could be copies of the same dag with nodes renamed to achieve disjointness—one obtains a composite dag G as follows:
  - Let G begin as the sum G_1 + G_2. Rename nodes to ensure that N_G is disjoint from N_{G_1} and N_{G_2}.
  - Select some set S_1 of sinks from the copy of G_1 in the sum G_1 + G_2 and an equal-size set S_2 of sources from the copy of G_2.
  - Pairwise identify (i.e., merge) the nodes in S_1 and S_2 in some way. (When S_1 = S_2 = ∅, the composite dag is just a sum.) The resulting set of nodes is N_G; the induced set of arcs is A_G.
• Add the dag G thus obtained to the set B.

We denote composition by * and say that the dag G is a composite of type [G_1 * G_2].

Notes. 1) The roles of G_1 and G_2 in a composition are asymmetric: G_1 contributes sinks, while G_2 contributes sources. 2) G's type indicates only that sources of G_2 were merged with sinks of G_1; it does not identify which nodes were merged. 3) The dags G_1 and/or G_2 could themselves be composite.
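A single composition step is easy to realize concretely. In the hedged sketch below (our representation, not the paper's), a dag is a pair (nodes, arcs) and `merges` records which source of G_2 is identified with which sink of G_1; the example composes W_{1,2} with M_{1,2} into a four-node “diamond.”

```python
# Hedged sketch (ours): one composition step [G1 * G2]. A dag is (nodes, arcs);
# `merges` maps each selected source of G2 to the sink of G1 it is merged with.
# Node names are assumed to be disjoint already.
def compose(g1, g2, merges):
    nodes1, arcs1 = g1
    nodes2, arcs2 = g2
    ren = lambda v: merges.get(v, v)       # rename merged sources to their G1 sinks
    nodes = nodes1 | {ren(v) for v in nodes2}
    arcs = arcs1 | {(ren(u), ren(v)) for (u, v) in arcs2}
    return nodes, arcs

# W_{1,2} * M_{1,2}: merging both sinks of the W with both sources of the M
# yields a four-node "diamond" with one source and one sink.
w = ({"a", "b", "c"}, {("a", "b"), ("a", "c")})    # source a; sinks b, c
m = ({"p", "q", "r"}, {("p", "r"), ("q", "r")})    # sources p, q; sink r
print(compose(w, m, {"p": "b", "q": "c"}))
```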

Composition is associative, so we do not have to keep track of the order in which dags are incorporated into a composite dag. Fig. 5 illustrates this fact, which we verify now.

Lemma 3. The composition operation on dags is associative. That is, a dag G is a composite of type [[G_1 * G_2] * G_3] if, and only if, it is a composite of type [G_1 * [G_2 * G_3]].

Proof. For simplicity, we refer to sinks and sources that are merged in a composition by their names prior to the merge. Context should disambiguate each occurrence of a name.

Let G be composite of type [[G_1 * G_2] * G_3], i.e., of type [G′ * G_3], where G′ is composite of type [G_1 * G_2]. Let T_1 and S_2 comprise, respectively, the sinks of G_1 and the sources of G_2 that were merged to yield G′. Note that no node from T_1 is a sink of G′, because these nodes have become internal nodes of G′. Let T′ and S_3 comprise, respectively, the sinks of G′ and the sources of G_3 that were merged to yield G. Each sink of G′ corresponds either to a sink of G_1 that is not in T_1 or to a sink of G_2. Hence, T′ can be partitioned into the sets T′_1, whose nodes are sinks of G_1, and T′_2, whose nodes are sinks of G_2. Let S′_1 and S′_2 comprise the sources of G_3 that were merged with, respectively, nodes of T′_1 and nodes of T′_2. Now, G can be obtained by first merging the sources of S′_2 with the sinks of T′_2 and then merging the sources of the resulting dag, S′_1 ∪ S_2, with the sinks, T′_1 ∪ T_1, of G_1. Thus, G is also composite of type [G_1 * [G_2 * G_3]]. The converse yields to similar reasoning. □

TABLE 1
The Relation ▷ among Building-Block Dags
Entries either list conditions for priority or indicate (via “X”) the absence of any IC-optimal schedule for that pairing.

We can now illustrate the natural correspondence between the node-set of a composite dag and those of its building blocks, via Fig. 3:

• The evolving two-dimensional mesh is composite of type W_{1,2} * W_{2,2} * W_{3,2} * ···.
• A binary reduction-tree is obtained by pairwise composing many instances of M_{1,2} (seven instances in the figure).
• The 5-level two-dimensional reduction-mesh is a composite of type M_{5,2} * M_{4,2} * M_{3,2} * M_{2,2} * M_{1,2}.
• The FFT dag is obtained by pairwise composing many instances of C_2 = Q_2 (12 instances in the figure).

Dag G is a ▷-linear composition of the connected bipartite dags G_1, G_2, ..., G_n if:

1. G is a composite of type G_1 * G_2 * ··· * G_n;
2. each G_i ▷ G_{i+1}, for all i ∈ [1, n − 1].

Dags that are ▷-linear compositions admit simple IC-optimal schedules.

Theorem 4. Let G be a ▷-linear composition of G_1, G_2, ..., G_n, where each G_i admits an IC-optimal schedule Σ_i. The schedule Σ for G that proceeds as follows is IC optimal:

1. Σ executes the nodes of G that correspond to sources of G_1, in the order mandated by Σ_1, then the nodes that correspond to sources of G_2, in the order mandated by Σ_2, and so on, for all i ∈ [1, n].
2. Σ finally executes all sinks of G, in any order.

Proof. Let Σ′ be a schedule for G that has maximum E_{Σ′}(x) for some x, and let X comprise the first x nodes that Σ′ executed. By Lemma 1, we may assume that either

1. X contains all nonsinks of G (and perhaps some sinks) or
2. X is a proper subset of the nonsinks of G.

In situation 1, E_Σ(x) is maximal by hypothesis. We therefore assume that situation 2 holds and show that E_Σ(x) ≥ E_{Σ′}(x). When X contains only nonsinks of G, each node of X corresponds to a specific source of one specific G_i. Let us focus, for each i ∈ [1, n], on the set of sources of G_i that correspond to nodes in X; call this set X_i. We claim that:

The number of ELIGIBLE nodes in G at step x, denoted e(X), is |S| − |X| + Σ_{i=1}^{n} e_i(X_i), where S is the set of sources of G, and e_i(X_i) is the number of sinks of G_i that are ELIGIBLE when only the sources X_i of G_i are EXECUTED.

To verify this claim, imagine that we execute nodes of G and the corresponding nodes of its building blocks G_i in tandem, using the terminology of the IC Pebble Game for convenience. The main complication arises when we pebble an internal node v of G, since we then simultaneously pebble a sink v_i of some G_i and a source v_j of some G_j. At each step t of the Game: If node v of G becomes ELIGIBLE, then we place an ELIGIBLE pebble on v_i and leave v_j unpebbled; if v becomes EXECUTED, then we place an EXECUTED pebble on v_j and an ELIGIBLE pebble on v_i. An EXECUTED pebble on a sink of G is replaced with an ELIGIBLE pebble. No other pebbles change.

Focus on an arbitrary G_i. Note that the sources of G_i that are EXECUTED comprise precisely the set X_i. The sinks of G_i that are ELIGIBLE comprise precisely the set Y_i of sinks all of whose parents are EXECUTED; hence, |Y_i| = e_i(X_i). The cumulative number of sources of the dags G_i that are ELIGIBLE is |S| − p, where p is the number of sources of G that are EXECUTED. It follows that the cumulative number of ELIGIBLE pebbles on the dags G_i is e_1(X_1) + ··· + e_n(X_n) + |S| − p. We now calculate the surfeit of ELIGIBLE pebbles on the dags G_i over the ELIGIBLE pebbles on G. Extra ELIGIBLE pebbles get created when G is decomposed, in only two cases: 1) when an internal node of G becomes EXECUTED and 2) when we process a sink of G that is EXECUTED. The number of the former cases is |X_1| + ··· + |X_n| − p. Denoting the number of the latter cases by q, we note that q + |X_1| + ··· + |X_n| = |X|. The claim is thus verified, because the number of ELIGIBLE nodes in G is

e(X) = (e_1(X_1) + ··· + e_n(X_n) + |S| − p) − (|X_1| + ··· + |X_n| − p + q).

Because of the priority relations among the dags G_i, Corollary 1 implies that e(X) = Σ_{i=1}^{n} E_{Σ_i}(x′_i), where the x′_i describe a “low-index-loaded” execution of the G_i's. Because of the way the dags G_i are composed, the sources of each G_j could have been merged only with sinks of lower-index dags, namely, G_1, ..., G_{j−1}. Thus, a “low-index-loaded” execution corresponds to a set X′ of x EXECUTED nodes of G that satisfy precedence constraints. Thus, there is a schedule—namely, Σ—that executes nodes of G that correspond to the dags G_1, G_2, ..., G_n, in turn, and this schedule is IC optimal. □

Fig. 5. Dags of the following types: (a) [[W_{1,5} * W_{2,4}] * C_3]; (b) [[[W_{3,2} * M_{2,3}] * M_{1,2}] * M_{1,3}]; (c) [N_3 * [N_3 * N_2]] = [[N_3 * N_3] * N_2]. Each admits an IC-optimal schedule.

5 IC-OPTIMAL SCHEDULES VIA DAG-DECOMPOSITION

Section 4 describes how to build complex dags that admit IC-optimal schedules. Of course, the “real” problem is not to build a dag but rather to execute a given one. We now craft an algorithmic framework that converts the synthetic setting of Section 4 to an analytical setting. We present a suite of algorithms that take a given dag G and:

1. simplify G's structure in a way that preserves the IC quality of its schedules;
2. decompose (the simplified) G into its “constituents” (when it is, indeed, composite); and
3. determine when (the simplified) G is a ▷-linear composition of its “constituents.”

When this program succeeds, we invoke Theorem 4 to schedule G IC optimally, bottom-up, from the decomposition. We now develop the advertised algorithmic setting.

5.1 “Skeletonizing” a Complex Dag

The word “simplified” is needed in the preceding paragraph because a dag can fail to be composite just because it contains “shortcut” arcs that do not impact intertask dependencies. Often, removing all shortcuts renders a dag composite, hence susceptible to our scheduling strategy. (Easily, not every shortcut-free dag is composite.)

For any dag G and nodes u, v ∈ N_G, we write u ⇝_G v to indicate that there is a path from u to v in G. An arc (u → v) ∈ A_G is a shortcut if there is a path u ⇝_G v that does not include the arc. The reader can show easily that:

Lemma 4. Composite dags contain no shortcuts.

Fortunately, one can efficiently remove all shortcuts from a dag without changing its set of IC-optimal schedules. A (transitive) skeleton (or minimum equivalent digraph) G′ of dag G is a smallest subdag of G that shares G's node-set and transitive closure [4].

Lemma 5 ([10]). Every dag G has a unique transitive skeleton, σ(G), which can be found in polynomial time.

We can craft an IC-optimal schedule for a dag G automatically by crafting such a schedule for σ(G). A special case of the following result appears in [15].

Theorem 5. A schedule has the same IC quality when it executes a dag G as when it executes σ(G). In particular, if Σ is IC optimal for σ(G), then it is IC optimal for G.

Proof. Say that, under schedule Σ, a node u becomes ELIGIBLE at step t of the IC Pebble Game on σ(G). This means that, at step t, all of u's ancestors in σ(G)—its parents, its parents' parents, etc.—are EXECUTED. Because σ(G) and G have the same transitive closure, node u has precisely the same ancestors in G as it does in σ(G). Hence, under schedule Σ, u becomes ELIGIBLE at step t of the IC Pebble Game on G. □

By Lemma 4, a dag cannot be composite unless it is transitively skeletonized. By Theorem 5, once having scheduled σ(G) IC optimally, we have also scheduled G IC optimally. Therefore, this section paves the way for our decomposition-based scheduling strategy.
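Since a shortcut is exactly an arc (u → v) for which some other u-to-v path exists, σ(G) can be computed directly from that definition; for dags, checking each arc against the original arc set already yields the unique transitive reduction. The following is a simple quadratic-style sketch of ours ([10] gives an efficient algorithm):

```python
# Hedged sketch (ours): an arc (u, v) is a shortcut exactly when v is reachable
# from u without using that arc, so drop every such arc. For dags, this yields
# the unique transitive skeleton; [10] gives an efficient general algorithm.
def skeleton(nodes, arcs):
    def reachable(u, v, banned):
        stack, seen = [u], set()
        while stack:
            w = stack.pop()
            if w == v:
                return True
            if w in seen:
                continue
            seen.add(w)
            stack.extend(y for (x, y) in arcs if x == w and (x, y) != banned)
        return False
    return nodes, {(u, v) for (u, v) in arcs if not reachable(u, v, banned=(u, v))}

# The "diamond with a shortcut": arc (a, r) is implied by a -> b -> r, so it goes.
print(skeleton({"a", "b", "c", "r"},
               {("a", "b"), ("a", "c"), ("b", "r"), ("c", "r"), ("a", "r")}))
```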

5.2 Decomposing a Composite Dag

Every dag G that is composed from connected bipartite dags can be decomposed to expose the dags and how they combine to yield G. We describe this process in detail and illustrate it with the dags of Fig. 3.

A connected bipartite dag H is a constituent of G just when:

1. H is an induced subdag of G: N_H ⊆ N_G, and A_H comprises all arcs (u → v) ∈ A_G such that {u, v} ⊆ N_H.
2. H is maximal: The induced subdag of G on any superset of H's nodes—i.e., any set S such that N_H ⊊ S ⊆ N_G—is not connected and bipartite.

Selecting a constituent. We select any constituent of G all of whose sources are also sources of G, if possible; we call the selected constituent B_1 (the notation emphasizing that B_1 is bipartite).

In Fig. 3: Every candidate B_1 for the FFT dag is a copy of C_2 included in levels 2 and 3; every candidate for the reduction-tree is a copy of M_{1,2}; the unique candidate for the reduction-mesh is M_{4,2}.

Detaching a constituent. We “detach” B_1 from G by deleting the nodes of G that correspond to sources of B_1, all incident arcs, and all resulting isolated sinks. We thereby replace G with a pair of dags ⟨B_1, G′⟩, where G′ is the remnant of G after B_1 is detached.

If G′ is not empty, then the process of selection and detachment continues, producing a sequence of the form

G ⟹ ⟨B_1, G′⟩ ⟹ ⟨B_1, ⟨B_2, G″⟩⟩ ⟹ ⟨B_1, ⟨B_2, ⟨B_3, G‴⟩⟩⟩ ⟹ ···,

leading, ultimately, to a sequence of connected bipartite dags: B_1, B_2, ..., B_n.

We claim that the described process recognizes whether or not G is composite and, if so, produces the dags from which G is composed (possibly in a different order from the original composition). If G is not composite, then the process fails.
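The selection/detachment loop has the following top-level shape. This is only a hedged skeleton of ours: the helpers `empty`, `constituents`, `sources`, and `detach` are hypothetical stand-ins for the operations just described.

```python
# Hedged skeleton (ours) of the decomposition loop. The helpers `empty`,
# `constituents` (maximal induced connected bipartite subdags), `sources`
# (returning a node set), and `detach` are hypothetical stand-ins for the
# operations described in the text.
def decompose(g):
    blocks = []
    while not empty(g):
        candidates = [h for h in constituents(g) if sources(h) <= sources(g)]
        if not candidates:
            return None      # no constituent is detachable: G is not composite
        b = candidates[0]    # select a constituent B_i whose sources are sources of G
        blocks.append(b)
        g = detach(g, b)     # delete B_i's sources, incident arcs, isolated sinks
    return blocks            # B_1, ..., B_n
```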

Theorem 6. Let the dag G be composite of type G_1 * ··· * G_n. The decomposition process produces a sequence B_1, ..., B_n of connected bipartite dags such that:

• G is composite of type B_1 * ··· * B_n;
• {B_1, ..., B_n} = {G_1, ..., G_n}.

Proof. The result is trivial when n = 1, as G is then a connected bipartite dag. Assume, therefore, that the result holds for all n < m, and let G be a composite of type G_1 * ··· * G_m. In this case, G_1 is a constituent of G, all of whose sources are sources of G. (Other G_i's may share this property.) There is, therefore, a dag B_1 for our process to detach. Since any constituent of G all of whose sources are sources of G must be one of the G_i, we know that B_1 is one of these dags. It follows that G is a composite of type B_1 * (G_1 * ··· * G_{i−1} * G_{i+1} * ··· * G_m); moreover, the dag G′ resulting after detaching B_1 is composite of type G_1 * ··· * G_{i−1} * G_{i+1} * ··· * G_m, because the detachment process does not affect any sources of G other than those it shares with B_1. By the inductive hypothesis, then, G′ can be decomposed as indicated in the theorem. We now invoke Lemma 3. □

5.3 The Super-Dag Obtained by Decomposing G

The next step in our strategy is to abstract the structure of G exposed by its decomposition into B_1, ..., B_n in an algorithmically advantageous way. Therefore, we shift focus from the decomposition to G's associated super-dag S_G =def S(B_1 * ··· * B_n), which is constructed as follows: Each node of S_G—which we call a supernode to prevent ambiguity—is one of the B_i's. There is an arc in S_G from supernode u to supernode v just when some sink(s) of u are identified with some source(s) of v when one composes the B_i's to produce G. Fig. 6 and Fig. 7 present two examples; in both, supernodes appear in dashed boxes and are interconnected by dashed arcs.

Fig. 6. The composition of the dags of Fig. 5b and its associated superdag.

Fig. 7. The three-dimensional FFT dag and its associated superdag.

In terms of super-dags, the question of whether or not Theorem 4 applies to dag G reduces to the question of whether or not S_G admits a topological sort [4] whose linearization of supernodes is consistent with the relation ▷. For instance, one derives an IC-optimal schedule for the dag G of Fig. 5b (which is decomposed in Fig. 6) by noting that G is a composite of type W_{3,2} * M_{1,2} * M_{2,3} * M_{1,3} and that W_{3,2} ▷ M_{1,2} ▷ M_{2,3} ▷ M_{1,3}. Indeed, G points out the challenge in determining if Theorem 4 applies, since it is also a composite of type W_{3,2} * M_{2,3} * M_{1,2} * M_{1,3}, but M_{2,3} ⋫ M_{1,2}. We leave to the reader the easy verification that the linearization B_1, ..., B_n is a topological sort of S(B_1 * ··· * B_n).

5.4 On Exploiting Priorities among Constituents

Our remaining challenge is to devise a topological sort of S_G that linearizes the supernodes in an order that honors relation ▷. We now present sufficient conditions for this to occur, verified via a linearization algorithm:

Theorem 7. Say that the dag G is a composite of type B_1 * ··· * B_n and that, for each pair of constituents B_i, B_j with i ≠ j, either B_i ▷ B_j or B_j ▷ B_i. Then, G is a ▷-linear composition whenever the following holds:

Whenever B_j is a child of B_i in S(B_1 * ··· * B_n), we have B_i ▷ B_j.   (6)

Proof. We begin with an arbitrary topological sort, B̂ = B_{π(1)}, ..., B_{π(n)}, of the superdag S_G. We invoke the hypothesis that ▷ is a (weak) order on the B_i's to reorder B̂ according to ▷, using a stable comparison sort. (Stable: if B_i ▷ B_j and B_j ▷ B_i, then the sort maintains the original relative order of B_i and B_j.) Let B̃ = B_{π′(1)} ▷ ··· ▷ B_{π′(n)} be the linearization of S_G produced by the sort. We claim that B̃ is also a topological sort of S_G. To wit, pick any B_i and B_j such that B_j is B_i's child in S_G. By the definition of topological sort, B_i precedes B_j in B̂. We claim that, because B_i ▷ B_j (by (6)), B_i precedes B_j also in B̃. On the one hand, if B_j ⋫ B_i, then the sort necessarily places B_i before B_j in B̃. On the other hand, if B_j ▷ B_i, then, since the sort is stable, B_i precedes B_j in B̃ because it precedes B_j in B̂. Thus, B̃ is, indeed, a topological sort of S_G, so that G is composite of type B_{π′(1)} * ··· * B_{π′(n)}. In other words, G is the desired ▷-linear composition of B_{π′(1)}, ..., B_{π′(n)}. □
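The proof's construction is directly implementable: stably re-sort any topological order of S_G by the priority relation. In the sketch below (ours), `prio(a, b)` is an assumed predicate answering “a ▷ b?”; Python's built-in sort is stable, as the proof requires.

```python
# Hedged sketch (ours) of Theorem 7's construction: stably reorder a topological
# sort of the superdag by the priority relation. functools.cmp_to_key turns the
# pairwise comparison into a sort key; Python's sort is stable, as required.
from functools import cmp_to_key

def linearize(topo_order, prio):           # prio(a, b) is an assumed "a ▷ b?" test
    def cmp(a, b):
        if prio(a, b) and not prio(b, a):
            return -1                      # a strictly precedes b
        if prio(b, a) and not prio(a, b):
            return 1
        return 0                           # tie: stability keeps the topological order
    return sorted(topo_order, key=cmp_to_key(cmp))
```

Applied to the constituents of the dag in Fig. 5b, with priorities drawn from Table 1, this yields the ▷-linear order W_{3,2}, M_{1,2}, M_{2,3}, M_{1,3}, to which Theorem 4's schedule then applies.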

We can finally apply Theorem 4 to find an IC-optimal

schedule for the dag G.

6 CONCLUSIONS AND PROJECTIONS

We have developed three notions that form the basis for a theory of scheduling complex computation-dags for Internet-based computing: the priority relation ▷ on bipartite dags (Section 3.2), the operation of composition of dags (Section 4), and the operation of decomposition of dags (Section 5). We have established a way of combining these notions to produce schedules for a large class of complex computation-dags that maximize the number of tasks that are eligible for allocation to remote clients at every step of the schedule (Theorems 4 and 7). We have used our notions to progress beyond the structurally uniform computation-dags studied in [15], [17] to families that are built in structured, yet flexible, ways from a repertoire of bipartite building-block dags. The composite dags that we can now schedule optimally encompass not only those studied in [15], [17] but, as illustrated in Fig. 5, also dags that have rather complex structures, including nodes of varying degrees and nonleveled global structure.

One direction for future work is to extend the repertoire of building-block dags that form the raw material for our composite dags. In particular, we want building blocks of more complex structure than those of Section 2.1.2, including less-uniform bipartite dags and nonbipartite dags. We expect the computational complexity of our scheduling algorithms to increase with the structural complexity of our building blocks. Along these lines, we have thus far been unsuccessful in determining the complexity of the problem of deciding if a given computation-dag admits an IC-optimal schedule, but we continue to probe in this direction. (The scheduling problem could well be co-NP-complete because of its underlying universal quantification.) Finally, we are working to extend Theorems 4 and 7 to loosen the strict requirement that the composite dag be a ▷-linear composition.

ACKNOWLEDGMENTS

A portion of the research of G. Malewicz was done while he

was visiting the TAPADS Group at the University of

Massachusetts Amherst. The research of A. Rosenberg and

M. Yurkewych was supported in part by US National

Science Foundation Grant CCF-0342417. A portion of this

paper appeared in the Proceedings of the International Parallel

and Distributed Processing Symposium, 2005.

REFERENCES

[1] J. Annis, Y. Zhao, J. Voeckler, M. Wilde, S. Kent, and I. Foster, “Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey,” Proc. 15th Conf. High Performance Networking and Computing, p. 56, 2002.
[2] R. Buyya, D. Abramson, and J. Giddy, “A Case for Economy Grid Architecture for Service Oriented Grid Computing,” Proc. 10th Heterogeneous Computing Workshop, 2001.
[3] W. Cirne and K. Marzullo, “The Computational Co-Op: Gathering Clusters into a Metacomputer,” Proc. 13th Int'l Parallel Processing Symp., pp. 160-166, 1999.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, second ed. Cambridge, Mass.: MIT Press, 2001.
[5] The Grid: Blueprint for a New Computing Infrastructure, second ed., I. Foster and C. Kesselman, eds. San Francisco: Morgan Kaufmann, 2004.
[6] I. Foster, C. Kesselman, and S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations,” Int'l J. High Performance Computing Applications, vol. 15, pp. 200-222, 2001.
[7] L. Gao and G. Malewicz, “Internet Computing of Tasks with Dependencies Using Unreliable Workers,” Theory of Computing Systems, to appear.
[8] A. Gerasoulis and T. Yang, “A Comparison of Clustering Heuristics for Scheduling Dags on Multiprocessors,” J. Parallel and Distributed Computing, vol. 16, pp. 276-291, 1992.
[9] L. He, Z. Han, H. Jin, L. Pan, and S. Li, “DAG-Based Parallel Real Time Task Scheduling Algorithm on a Cluster,” Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 437-443, 2000.
[10] H.T. Hsu, “An Algorithm for Finding a Minimal Equivalent Graph of a Digraph,” J. ACM, vol. 22, pp. 11-16, 1975.
[11] D. Kondo, H. Casanova, E. Wing, and F. Berman, “Models and Scheduling Guidelines for Global Computing Applications,” Proc. Int'l Parallel and Distributed Processing Symp., p. 79, 2002.
[12] E. Korpela, D. Werthimer, D. Anderson, J. Cobb, and M. Lebofsky, “SETI@home: Massively Distributed Computing for SETI,” Computing in Science and Eng., P.F. Dubois, ed., Los Alamitos, Calif.: IEEE CS Press, 2000.
[13] G. Malewicz, “Parallel Scheduling of Complex Dags under Uncertainty,” Proc. 17th ACM Symp. Parallelism in Algorithms and Architectures, 2005.
[14] G. Malewicz and A.L. Rosenberg, “On Batch-Scheduling Dags for Internet-Based Computing,” Proc. 11th European Conf. Parallel Processing, 2005.
[15] A.L. Rosenberg, “On Scheduling Mesh-Structured Computations for Internet-Based Computing,” IEEE Trans. Computers, vol. 53, pp. 1176-1186, 2004.
[16] A.L. Rosenberg and I.H. Sudborough, “Bandwidth and Pebbling,” Computing, vol. 31, pp. 115-139, 1983.
[17] A.L. Rosenberg and M. Yurkewych, “Guidelines for Scheduling Some Common Computation-Dags for Internet-Based Computing,” IEEE Trans. Computers, vol. 54, pp. 428-438, 2005.
[18] X.-H. Sun and M. Wu, “Grid Harvest Service: A System for Long-Term, Application-Level Task Scheduling,” Proc. IEEE Int'l Parallel and Distributed Processing Symp., p. 25, 2003.
[19] D. Thain, T. Tannenbaum, and M. Livny, “Distributed Computing in Practice: The Condor Experience,” Concurrency and Computation: Practice and Experience, 2005.

Grzegorz Malewicz studied computer science and applied mathematics at the University of Warsaw from 1993 until 1998. He then joined the University of Connecticut and received the doctorate in 2003. He is a software engineer at Google, Inc. Prior to joining Google, he was an assistant professor of computer science at the University of Alabama (UA), where he taught computer science from 2003-2005. He has had internships at the AT&T Shannon Lab (summer 2001) and Microsoft Corporation (summer 2000 and fall 2001). He visited the Laboratory for Computer Science, MIT (AY 2002/2003) and was a visiting scientist at the University of Massachusetts Amherst (summer 2004) and Argonne National Laboratory (summer 2005). His research focuses on parallel and distributed computing, algorithms, combinatorial optimization, and scheduling. His research appears in top journals and conferences and includes a sole-authored SIAM Journal on Computing paper that solves a decade-old problem in distributed computing. He is a member of the IEEE.


Arnold L. Rosenberg is a Distinguished Uni-

versity Professor of Computer Science at the

University of Massachusetts Amherst, where he

codirects the Theoretical Aspects of Parallel and

Distributed Systems (TAPADS) Laboratory.

Prior to joining UMass, he was a professor of

computer science at Duke University from 1981

to 1986 and a research staff member at the IBM

T.J. Watson Research Center from 1965 to

1981. He has held visiting positions at Yale

University and the University of Toronto; he was a Lady Davis Visiting

Professor at the Technion (Israel Institute of Technology) in 1994, and a

Fulbright Research Scholar at the University of Paris-South in 2000. His

research focuses on developing algorithmic models and techniques to

deal with the new modalities of “collaborative computing” (the endeavor

of having several computers cooperate in the solution of a single

computational problem) that result from emerging technologies. He is

the author or coauthor of more than 150 technical papers on these and

other topics in theoretical computer science and discrete mathematics

and is the coauthor of the book Graph Separators, with Applications. He is a fellow of the ACM, a fellow of the IEEE, and a Golden Core member of the IEEE Computer Society.

Matthew Yurkewych received the BS degree from the Massachusetts Institute of Technology in 1998. He is a PhD student in computer science at the University of Massachusetts Amherst. Prior to entering graduate school, he worked at Akamai Technologies and CNet Networks as a software engineer.
