Toward a Theory for Scheduling Dags in Internet-Based Computing
Grzegorz Malewicz, Member, IEEE,
Arnold L. Rosenberg, Fellow, IEEE, and Matthew Yurkewych
Abstract—Conceptual and algorithmic tools are developed as a foundation for a theory of scheduling complex computation-dags for Internet-based computing. The goal of the schedules produced is to render tasks eligible for allocation to remote clients (hence, for execution) at the maximum possible rate. This allows one to utilize remote clients well, as well as to lessen the likelihood of the "gridlock" that ensues when a computation stalls for lack of eligible tasks. Earlier work has introduced a formalism for studying this optimization problem and has identified optimal schedules for several significant families of structurally uniform dags. The current paper extends this work via a methodology for devising optimal schedules for a much broader class of complex dags, which are obtained via composition from a prespecified collection of simple building-block dags. The paper provides a suite of algorithms that decompose a given dag G to expose its building blocks and an execution-priority relation ▷ on building blocks. When the building blocks are appropriately interrelated under ▷, the algorithms specify an optimal schedule for G.

Index Terms—Internet-based computing, grid computing, global computing, Web computing, scheduling dags, dag decomposition, theory.
1 INTRODUCTION

Earlier work [15], [17] has developed the Internet-Computing (IC, for short) Pebble Game, which abstracts the problem of scheduling computations having intertask dependencies (as is traditional—cf. [8], [9]—we model such a computation as a dag, or directed acyclic graph) for several modalities of Internet-based computing—including Grid computing (cf. [2], [6], [5]), global computing (cf. [3]), and Web computing (cf. [12]).
The quality metric for schedules produced using the Game is to maximize the rate at which tasks are rendered eligible for allocation to remote clients (hence, for execution), with the dual aim of: 1) enhancing the effective utilization of remote clients and 2) lessening the likelihood of the "gridlock" that can arise when a computation stalls pending the completion of already allocated tasks.
A simple example illustrates our scheduling objective. Consider the two-dimensional evolving mesh of Fig. 1. An optimal schedule for this dag executes tasks in sequence along each level [15] (as numbered in the figure). If just one client participates in the computation, then, after five tasks have been executed, we can allocate any of three eligible tasks to the client. If there are several clients, we could encounter a situation wherein two of these three eligible tasks (marked A in the figure) are allocated to clients who have not yet finished executing them. There is, then, only one task (marked E) that is eligible and unallocated. If two clients now request work, we may be able to satisfy only one request, thus wasting the computing resources of one client. Since an optimal schedule maximizes the number of eligible tasks, it minimizes the likelihood of this waste of resources (whose extreme case is the gridlock that arises when all eligible tasks have been allocated, but none has been executed).
Many IC projects—cf. [2], [11], [18]—monitor the past histories of remote clients, their current computational capabilities, or both. While the resulting snapshots yield no guarantees of future performance, they at least afford the server a basis for estimating such performance. Our study proceeds under the idealized assumption that such monitoring yields sufficiently accurate predictions of clients' future performance that the server can allocate eligible tasks to clients in an order that makes it likely that tasks will be executed in the order of their allocation. We show how such information often allows us to craft schedules that produce maximally many eligible tasks after each task execution.
Our contributions. We develop the framework of a theory of Internet-based scheduling via three conceptual/algorithmic contributions. 1) We introduce a new "priority" relation, denoted ▷, on pairs of bipartite dags; the assertion "G_1 ▷ G_2" guarantees that one never sacrifices our quality metric (which rewards a schedule's rate of producing eligible tasks) by executing all sources of G_1, then all sources of G_2, then all sinks of both dags. We provide a repertoire of bipartite building-block dags, show how to schedule each optimally, and expose the ▷-interrelationships among them. 2) We specify a way of "composing" building blocks to obtain dags of possibly quite complex structure; cf. Fig. 2. If the building blocks used in the composition form a "relation-chain" under ▷, then the resulting composite dag
is guaranteed to admit an optimal schedule. 3) The framework developed thus far is descriptive rather than prescriptive. It says that if a dag G is constructed from bipartite building blocks via composition and if we can identify the "blueprint" used to construct G and if the underlying building blocks are interrelated in a certain way, then a prescribed strategy produces an optimal schedule for G. We next address the algorithmic challenge in the preceding ifs: Given a dag G, how does one apply the preceding framework to it? We develop a suite of algorithms that: a) reduce any dag G to its "transitive skeleton" G', a simplified version of G that shares the same set of optimal schedules; b) decompose G' to determine whether or not it is constructed from bipartite building blocks via composition, thereby exposing a "blueprint" for G'; c) specify an optimal schedule for any such G' that is built from building blocks that form a "relation-chain" under ▷. For illustration, all of the dags in Fig. 2 yield to our algorithms.
The scheduling theory we develop here has the potential to improve efficiency and fault tolerance in existing Grid systems. As but one example, when Condor [19] executes computations with complex task dependencies, such as the Sloan Digital Sky Survey [1], it uses a "FIFO" regimen to sequence the allocation of eligible tasks. Given the temporal unpredictability of the remote clients, this scheduling may sometimes lead to an ineffective use of the clients' computing resources and, in the extreme case, to "gridlock." Our scheduling algorithms have the potential to reduce the severity of these issues. Experimental work is underway to determine how to enhance this potential.
Related work. Most closely related to our study are its
immediate precursors and motivators, [15], [17]. The main
results of those sources demonstrate the necessity and
sufficiency of parent orientation for optimality in scheduling
the dags of Fig. 3. Notably, these dags yield to the
algorithms presented here, so our results both extend the
results in [15], [17] and explain their underlying principles
in a general setting. In a companion to this study, we are
pursuing an orthogonal direction for extending [15], [17].
Motivated by the demonstration in Section 3.4 of the limited
scope of the notion of optimal schedule that we study here,
we formulate, in [14], a scheduling paradigm in which a
server allocates batches of tasks periodically, rather than
allocating individual tasks as soon as they become eligible.
Optimality is always possible within this new framework,
but achieving it may entail a prohibitively complex
computation. An alternative direction of inquiry appears
in [7], [13], where a probabilistic pebble game is used to
study the execution of interdependent tasks on unreliable
clients. Finally, our study has been inspired by the many
exciting systems and/or application-oriented studies of
Internet-based computing, in sources such as [2], [3], [5], [6],
[11], [12], [18].
2 EXECUTING DAGS ON THE INTERNET

We review the basic graph-theoretic terms used in our study. We then introduce several bipartite "building blocks" that exemplify our theory. Finally, we present the pebble game that we use to model computations on dags.
2.1 Computation-Dags
2.1.1 Basic Definitions
A directed graph G is given by a set of nodes N_G and a set of arcs (or, directed edges) A_G, each having the form (u → v), where u, v ∈ N_G. A path in G is a sequence of arcs that share adjacent endpoints, as in the following path from node u_1 to node u_n: (u_1 → u_2), (u_2 → u_3), ..., (u_{n−2} → u_{n−1}), (u_{n−1} → u_n). A dag (directed acyclic graph) G is a directed graph that has no cycles; i.e., in a dag, no path of the preceding form has u_1 = u_n.
When a dag G is used to model a computation, i.e., is a computation-dag:

. each v ∈ N_G represents a task in the computation;
. an arc (u → v) ∈ A_G represents the dependence of task v on task u: v cannot be executed until u is.

Given an arc (u → v) ∈ A_G, u is a parent of v and v is a child of u in G. Each parentless node of G is a source (node), and each childless node is a sink (node); all other nodes are internal. A dag G is bipartite if:

1. N_G can be partitioned into subsets X and Y such that, for every arc (u → v) ∈ A_G, u ∈ X and v ∈ Y;
2. each v ∈ N_G is incident to some arc of G, i.e., is either the node u or the node w of some arc (u → w) ∈ A_G. (Prohibiting "isolated" nodes avoids degeneracies.)
Fig. 1. An optimal schedule helps utilize clients well and reduce chances
of gridlock.
Fig. 2. Dags with complex task dependencies that our algorithms can
schedule optimally.
G is connected if, when arc-orientations are ignored, there is a
path connecting every pair of distinct nodes.
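To make these definitions concrete, here is a minimal sketch in Python (ours; the class and method names are illustrative assumptions, not the paper's notation):

```python
# A minimal encoding of the preceding definitions (illustrative only).
class Dag:
    def __init__(self, arcs):
        # arcs: iterable of pairs (u, v), one per arc u -> v.
        self.arcs = set(arcs)
        self.nodes = {u for u, v in self.arcs} | {v for u, v in self.arcs}

    def parents(self, v):
        return {u for u, w in self.arcs if w == v}

    def children(self, u):
        return {w for x, w in self.arcs if x == u}

    def sources(self):
        # Parentless nodes.
        return {v for v in self.nodes if not self.parents(v)}

    def sinks(self):
        # Childless nodes.
        return {v for v in self.nodes if not self.children(v)}

    def is_bipartite(self):
        # Every arc must run from a source to a sink (so X = sources
        # and Y = sinks); isolated nodes cannot arise in this encoding.
        return all(u in self.sources() and v in self.sinks()
                   for u, v in self.arcs)
```

For instance, Dag([('u', 'a'), ('u', 'b')]) encodes a bipartite dag with one source 'u' and two sinks 'a' and 'b'.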
2.1.2 A Repertoire of Building Blocks
Our study applies to any repertoire of connected bipartite
building-block dags that one chooses to build complex dags
from. For illustration, we focus on the following specific
dags. The following descriptions proceed left to right along
successive rows of Fig. 4; we use the drawings to refer to
“left” and “right.”
The first three dags are named for the Latin letters suggested by their topologies. W-dags epitomize "expansive" and M-dags epitomize "reductive" computations.
W-dags. For each integer d > 1, the (1, d)-W-dag W_{1,d} has one source and d sinks; its d arcs connect the source to each sink. Inductively, for positive integers a, b, the (a+b, d)-W-dag W_{a+b,d} is obtained from the (a, d)-W-dag W_{a,d} and the (b, d)-W-dag W_{b,d} by identifying (or merging) the rightmost sink of the former dag with the leftmost sink of the latter.
M-dags. For each integer d > 1, the (1, d)-M-dag M_{1,d} has d sources and one sink; its d arcs connect each source to the sink. Inductively, for positive integers a, b, the (a+b, d)-M-dag M_{a+b,d} is obtained from the (a, d)-M-dag M_{a,d} and the (b, d)-M-dag M_{b,d} by identifying (or merging) the rightmost source of the former dag with the leftmost source of the latter.
N-dags. For each integer s > 0, the s-N-dag N_s has s sources and s sinks; its 2s − 1 arcs connect each source v to sink v and to sink v + 1, if the latter exists. N_s is obtained from W_{s−1,2} by adding a new source on the right whose sole arc goes to the rightmost sink. The leftmost source of N_s—the dag's anchor—has a child that has no other parents.
(Bipartite) Cycle-dags. For each integer s > 1, the s-(Bipartite) Cycle-dag C_s is obtained from N_s by adding a new arc from the rightmost source to the leftmost sink—so that each source v has arcs to sinks v and v + 1 (mod s).
(Bipartite) Clique-dags. For each integer s > 1, the s-(Bipartite) Clique-dag Q_s has s sources and s sinks and an arc from each source to each sink.
We choose the preceding building blocks because the dags of Fig. 3 can all be constructed using these blocks. Although details must await Section 4, it is intuitively clear from the figure that the evolving mesh is constructed from its source outward by "composing" (or, "concatenating") a (1,2)-W-dag with a (2,2)-W-dag, then a (3,2)-W-dag, and so on; the reduction-mesh is constructed from its sources upward using (k,2)-M-dags for successively decreasing values of k; the reduction-tree is constructed from its sources/leaves upward by "concatenating" collections of (1,2)-M-dags; the FFT dag is constructed from its sources outward by "concatenating" collections of 2-cycles (which are identical to 2-cliques).
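For concreteness, the following sketch (ours; the arc-list encoding and function names are assumptions) generates each building block of the repertoire:

```python
# Constructors for the building-block dags of Section 2.1.2, as arc
# lists; ('u', i) are sources and ('v', j) are sinks.  (Our own
# illustrative encoding, not the paper's notation.)
def w_dag(s, d):
    # W_{s,d}: s sources, s*(d-1)+1 sinks; adjacent sources share a sink.
    return [(('u', i), ('v', i * (d - 1) + k))
            for i in range(s) for k in range(d)]

def m_dag(s, d):
    # M_{s,d}: the "dual" of W_{s,d} -- s*(d-1)+1 sources, s sinks.
    return [(('u', j * (d - 1) + k), ('v', j))
            for j in range(s) for k in range(d)]

def n_dag(s):
    # N_s: source i feeds sink i, and sink i+1 when it exists.
    # Source 0 is the anchor: its child, sink 0, has no other parent.
    return ([(('u', i), ('v', i)) for i in range(s)] +
            [(('u', i), ('v', i + 1)) for i in range(s - 1)])

def c_dag(s):
    # C_s: N_s plus a wraparound arc, so source i feeds sinks i
    # and (i + 1) mod s.
    return n_dag(s) + [(('u', s - 1), ('v', 0))]

def q_dag(s):
    # Q_s: an arc from each of s sources to each of s sinks.
    return [(('u', i), ('v', j)) for i in range(s) for j in range(s)]
```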
Fig. 3. (a) An evolving (two-dimensional) mesh, (b) a (binary) reduction-tree, (c) an FFT-dag, and (d) a (two-dimensional) reduction-mesh (or,
pyramid dag).
Fig. 4. The building blocks of semi-uniform dags.
2.2 The Internet-Computing Pebble Game

A number of so-called pebble games on dags have been shown, over the course of several decades, to yield elegant formal analogues of a variety of problems related to scheduling computation-dags. Such games use tokens, called pebbles, to model the progress of a computation on a dag: The placement or removal of the various available types of pebbles—which is constrained by the dependencies modeled by the dag's arcs—represents the changing (computational) status of the dag's task-nodes.

Our study is based on the Internet-Computing (IC, for short) Pebble Game of [15], whose structure derives from the "no recomputation allowed" pebble game of [16]. Arguments are presented in [15], [17] (q.v.) that justify studying a simplified form of the Game in which task-execution order follows task-allocation order. As we remark in the Introduction, while we recognize that this assumption will never be completely realized in practice, one hopes that careful monitoring of the clients' past behaviors and current capabilities, as prescribed in, say, [2], [11], [18], can enhance the likelihood, if not the certainty, of the desired order.
2.2.1 The Rules of the Game

The IC Pebble Game on a computation-dag G involves one player S, the Server, who has access to unlimited supplies of two types of pebbles: ELIGIBLE pebbles, whose presence indicates a task's eligibility for execution, and EXECUTED pebbles, whose presence indicates a task's having been executed. We now present the rules of our simplified version of the IC Pebble Game of [15], [17].

The Rules of the IC Pebble Game

. S begins by placing an ELIGIBLE pebble on each unpebbled source of G.
/* Unexecuted sources are always eligible for execution, having no parents whose prior execution they depend on. */
. At each step, S
- selects a node that contains an ELIGIBLE pebble,
- replaces that pebble by an EXECUTED pebble,
- places an ELIGIBLE pebble on each unpebbled node of G, all of whose parents contain EXECUTED pebbles.
. S's goal is to allocate nodes in such a way that every node v of G eventually contains an EXECUTED pebble.
/* This modest goal is necessitated by the possibility that G is infinite. */
Note. The (idealized) IC Pebble Game on a dag G executes one task/node of G per step. The reader should not infer that we are assuming a repertoire of tasks that are uniformly computable in unit time. Once we adopt the simplifying assumption that task-execution order follows task-allocation order, we can begin to measure time in an event-driven way, i.e., per task, rather than chronologically, i.e., per unit time. Therefore, our model allows tasks to be quite heterogeneous in complexity as long as the Server can match the tasks' complexities with the clients' resources (via the monitoring alluded to earlier).

A schedule for the IC Pebble Game on a dag G is a rule for selecting which ELIGIBLE pebble to execute at each step of a play of the Game. For brevity, we henceforth call a node ELIGIBLE (respectively, EXECUTED) when it contains an ELIGIBLE (respectively, an EXECUTED) pebble. For uniformity, we henceforth talk about executing nodes rather than tasks.
2.2.2 IC Quality

The goal in the IC Pebble Game is to play the Game in a way that maximizes the number of ELIGIBLE nodes at every step t. For each step t of a play of the Game on a dag G under a schedule σ: Ê_σ(t) denotes the number of nodes of G that are ELIGIBLE at step t, and E_σ(t) the number of ELIGIBLE nonsource nodes. (Note that E_σ(0) = 0.) We measure the IC quality of a play of the IC Pebble Game on a dag G by the size of Ê_σ(t) at each step t of the play—the bigger Ê_σ(t) is, the better. Our goal is an IC-optimal schedule σ, in which, for all steps t, Ê_σ(t) is as big as possible.
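To make the metric concrete, the following sketch (ours; the function name and the encoding of a schedule as a node list are assumptions) plays the simplified Game and records Ê_σ(t) after each step:

```python
# Play a schedule (a list of nodes in execution order) on a dag given
# as an arc list, recording E-hat(t), the number of ELIGIBLE nodes
# after t steps.  (Illustrative sketch, not from the paper.)
def play(arcs, schedule):
    nodes = {x for arc in arcs for x in arc}
    parents = {v: set() for v in nodes}
    for u, v in arcs:
        parents[v].add(u)
    executed = set()

    def eligible():
        # Unexecuted nodes all of whose parents are executed.
        return {v for v in nodes
                if v not in executed and parents[v] <= executed}

    history = [len(eligible())]          # E-hat(0): the sources
    for v in schedule:
        assert v in eligible(), "schedule executed a non-ELIGIBLE node"
        executed.add(v)
        history.append(len(eligible()))  # E-hat(t) after step t
    return history
```

On the arc list [('u', 'a'), ('u', 'b')] (a (1,2)-W-dag) with schedule ['u'], it returns [1, 2]: one ELIGIBLE node at step 0 and two after step 1.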
It is not a priori clear that IC-optimal schedules ever exist! The property demands that there be a single schedule σ for dag G such that, at every step of the computation, σ maximizes the number of ELIGIBLE nodes across all schedules for G. In principle, it could be that every schedule that maximizes the number of ELIGIBLE nodes at step t requires that a certain set of t nodes has been executed, while every analogous schedule for step t + 1 requires that a different set of t + 1 nodes has been executed. Indeed, we see in Section 3.4 that there exist dags that do not admit any IC-optimal schedule. Surprisingly, though, the strong requirement of IC optimality can be achieved for large families of dags—even ones of quite complex structure.
The significance of IC quality—hence, of IC optimality—stems from the following intuitive scenarios: 1) Schedules that produce ELIGIBLE nodes maximally fast may reduce the chance of a computation's "stalling" because no new tasks can be allocated pending the return of already assigned ones. 2) If the Server receives a batch of requests for tasks at (roughly) the same time, then an IC-optimal schedule ensures that maximally many tasks are ELIGIBLE at that time so that maximally many requests can be satisfied. See [15], [17] for more elaborate discussions of IC quality.
3 THE RUDIMENTS OF IC-OPTIMAL SCHEDULING

We now lay the groundwork for an algorithmic theory of how to devise IC-optimal schedules. Beginning with a result that simplifies the quest for such schedules, we expose IC-optimal schedules for the building blocks of Section 2.1.2. We then create a framework for scheduling disjoint collections of building blocks via a priority relation on dags, and we demonstrate the nonexistence of such schedules for certain other collections.
Executing a sink produces no ELIGIBLE nodes, while executing a nonsink may. This simple fact allows us to focus on schedules with the following simple structure:

Lemma 1. Let σ be a schedule for a dag G. If σ is altered to execute all of G's nonsinks before any of its sinks, then the IC quality of the resulting schedule is no less than σ's.
When applied to a bipartite dag G, Lemma 1 says that we
never diminish IC quality by executing all of G’s sources
before executing any of its sinks.
3.1 IC-Optimal Schedules for Individual Building Blocks

A schedule for any of the very uniform dags of Fig. 3 is IC optimal when it sequences task executions along each level of the dags [15]. While such an order is neither necessary nor sufficient for IC optimality with the "semi-uniform" dags studied later, it is important when scheduling the building-block dags of Section 2.1.2.

Theorem 1. Each of our building-block dags admits an IC-optimal schedule that executes sources from one end to the other; for N-dags, the execution must begin with the anchor.
Proof. The structures of the building blocks render the following bounds on E_σ(t) obvious, as t ranges from 0 to the number of sources in the given dag. (Here, for any statement P about t, [P(t)] = 1 if P(t) holds and [P(t)] = 0 otherwise.)

W_{s,d}: E_σ(t) ≤ (d − 1)t + [t = s];
N_s:     E_σ(t) ≤ t;
M_{s,d}: E_σ(t) ≤ [t = 0] + ⌊(t − 1)/(d − 1)⌋;
C_s:     E_σ(t) ≤ t − [t ≠ 0] + [t = s];
Q_s:     E_σ(t) = s · [t = s].

The execution orders in the theorem convert each of these bounds to an equality. □
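In code, the corresponding equalities for Theorem 1's schedules read as follows (a sketch; the function names are ours). Note that Python's floor division yields −1 at t = 0 in the M-dag formula, which the [t = 0] term then cancels:

```python
# E_sigma(t) for the IC-optimal schedules of Theorem 1 (ours, as a
# direct transcription of the closed forms above).
def E_W(t, s, d):  # W_{s,d}: t in 0..s
    return (d - 1) * t + (1 if t == s else 0)

def E_N(t, s):     # N_s: t in 0..s
    return t

def E_M(t, s, d):  # M_{s,d}: t in 0..s*(d-1)+1; at t = 0 the floor
                   # term is -1 and the bracket term cancels it.
    return (1 if t == 0 else 0) + (t - 1) // (d - 1)

def E_C(t, s):     # C_s: t in 0..s
    return t - (1 if t != 0 else 0) + (1 if t == s else 0)

def E_Q(t, s):     # Q_s: t in 0..s
    return s if t == s else 0
```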
3.2 Execution Priorities for Bipartite Dags

We now define a relation on bipartite dags that often affords us an easy avenue toward IC-optimal schedules—for complex, as well as bipartite, dags.

Let the disjoint bipartite dags G_1 and G_2 have s_1 and s_2 sources, respectively, and admit the IC-optimal schedules σ_1 and σ_2. If the following inequalities hold (where [a, b] denotes the set of integers {a, a+1, ..., b}),
(∀x ∈ [0, s_1]) (∀y ∈ [0, s_2]):
E_{σ1}(x) + E_{σ2}(y) ≤ E_{σ1}(min{s_1, x + y}) + E_{σ2}((x + y) − min{s_1, x + y}),   (1)

then we say that G_1 has priority over G_2, denoted G_1 ▷ G_2. The inequalities in (1) say that one never decreases IC quality by executing a source of G_1, in preference to a source of G_2, whenever possible.
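Since system (1) quantifies over finitely many pairs (x, y), priority can be verified by brute force, as in the following sketch (ours); the example checks one instance, W_{3,3} ▷ M_{2,2}, of the priorities catalogued in Section 3.3:

```python
# Does G_1 have priority over G_2?  E1, E2 are the E_sigma functions
# of IC-optimal schedules; s1, s2 are the source counts.  (Sketch.)
def has_priority(E1, s1, E2, s2):
    for x in range(s1 + 1):
        for y in range(s2 + 1):
            m = min(s1, x + y)
            if E1(x) + E2(y) > E1(m) + E2(x + y - m):
                return False
    return True

# W_{3,3} (3 sources) vs. M_{2,2} (3 sources), via Theorem 1's forms:
E_W = lambda t: 2 * t + (1 if t == 3 else 0)
E_M = lambda t: (1 if t == 0 else 0) + (t - 1) // 1
print(has_priority(E_W, 3, E_M, 3))   # True: W_{3,3} has priority
```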
The following result is quite important in our algorithmic framework:

Theorem 2. The relation ▷ on bipartite dags is transitive.
Proof. Let G_1, G_2, G_3 be arbitrary bipartite dags such that:

1. each G_i has s_i sources and admits an IC-optimal schedule σ_i;
2. G_1 ▷ G_2 and G_2 ▷ G_3.

To see that G_1 ▷ G_3, focus on a moment when we have executed x_1 < s_1 sources of G_1 and x_3 ≤ s_3 sources of G_3 (so E_{σ1}(x_1) + E_{σ3}(x_3) sinks are ELIGIBLE). We consider two cases.

Case 1: s_1 − x_1 ≥ min{s_2, x_3}. In this case, we have

E_{σ1}(x_1) + E_{σ3}(x_3)
  ≤ E_{σ1}(x_1) + E_{σ2}(min{s_2, x_3}) + E_{σ3}(x_3 − min{s_2, x_3})
  ≤ E_{σ1}(x_1 + min{s_2, x_3}) + E_{σ3}(x_3 − min{s_2, x_3});   (2)

the first inequality follows because G_2 ▷ G_3, the second because G_1 ▷ G_2. We can iterate these transfers until either all sources of G_1 are EXECUTED or no sources of G_3 are EXECUTED.

Case 2: s_1 − x_1 < min{s_2, x_3}. This case is a bit subtler than the preceding one. Let y = s_3 − x_3 and z = (s_1 − x_1) + (s_3 − x_3) = (s_1 − x_1) + y. Then x_1 = s_1 − (z − y) and x_3 = s_3 − y. (This change of notation is useful because it relates x_1 and x_3 to the numbers of sources in G_1 and G_3.) We note the following useful facts about y and z:

. 0 ≤ y ≤ z by definition,
. 0 ≤ z < s_3 because s_1 − x_1 < x_3,
. z − y ≤ s_1 because x_1 ≥ 0,
. s_1 − x_1 = z − y by definition,
. z − y < s_2 because s_1 − x_1 < s_2.

Now, we apply these observations to the problem at hand. Because G_2 ▷ G_3 and z − y ∈ [0, s_2] and {y, z} ⊆ [0, s_3], we know that

E_{σ2}(s_2 − (z − y)) + E_{σ3}(s_3 − y) ≤ E_{σ2}(s_2) + E_{σ3}(s_3 − z),

so that

E_{σ3}(s_3 − y) − E_{σ3}(s_3 − z) ≤ E_{σ2}(s_2) − E_{σ2}(s_2 − (z − y)).   (3)

Intuitively, executing the last z − y sources of G_2 is no worse (in IC quality) than executing the "intermediate" sources s_3 − z through s_3 − y of G_3.

Similarly, because G_1 ▷ G_2 and z − y ∈ [0, min{s_1, s_2}], we know that

E_{σ1}(s_1 − (z − y)) + E_{σ2}(s_2) ≤ E_{σ1}(s_1) + E_{σ2}(s_2 − (z − y)),

so that

E_{σ2}(s_2) − E_{σ2}(s_2 − (z − y)) ≤ E_{σ1}(s_1) − E_{σ1}(s_1 − (z − y)).   (4)

Intuitively, executing the last z − y sources of G_1 is no worse (in IC quality) than executing the last z − y sources of G_2.

By transitivity (of ≤), inequalities (3), (4) imply that

E_{σ3}(s_3 − y) − E_{σ3}(s_3 − z) ≤ E_{σ1}(s_1) − E_{σ1}(s_1 − (z − y)),

so that

E_{σ1}(x_1) + E_{σ3}(x_3) = E_{σ1}(s_1 − (z − y)) + E_{σ3}(s_3 − y)
  ≤ E_{σ1}(s_1) + E_{σ3}(s_3 − z)
  = E_{σ1}(s_1) + E_{σ3}(x_3 − (s_1 − x_1)).   (5)

The preceding cases—particularly, the chains of inequalities (2), (5)—verify that system (1) always holds for G_1 and G_3, so that G_1 ▷ G_3, as was claimed. □
Theorem 2 has a corollary that further exposes the nature of ▷ and that tells us how to schedule pairwise ▷-comparable bipartite dags IC optimally. Specifically, we develop tools that extend Theorem 1 to disjoint unions—called sums—of building-block dags. Let G_1, ..., G_n be connected bipartite dags that are pairwise disjoint, in that N_{G_i} ∩ N_{G_j} = ∅ for all distinct i and j. The sum of these dags, denoted G_1 + ⋯ + G_n, is the bipartite dag whose node-set and arc-set are, respectively, the unions of the corresponding sets of G_1, ..., G_n.
Corollary 1. Let G_1, ..., G_n be pairwise disjoint bipartite dags, with each G_i admitting an IC-optimal schedule σ_i. If G_1 ▷ ⋯ ▷ G_n, then the schedule σ* for the sum G_1 + ⋯ + G_n that executes, in turn, all sources of G_1 according to σ_1, all sources of G_2 according to σ_2, and so on, for all i ∈ [1, n], and, finally, executes all sinks, is IC optimal.
Proof. By Lemma 1, we lose no generality by focusing on a step t when the only EXECUTED nodes are sources of the sum-dag. For any indices i and j > i, the transitivity of ▷ guarantees that G_i ▷ G_j. Suppose that some sources of G_i are not EXECUTED at step t, but at least one source of G_j is EXECUTED. Then, by the definition of ▷, in (1), we never decrease the number of ELIGIBLE sinks at step t by "transferring" as many source-executions as possible from G_j to G_i. By repeating such "transfers" a finite number of times, we end up with a "left-loaded" situation at step t, wherein there exists i ∈ [1, n] such that all sources of G_1, ..., G_{i−1} are EXECUTED, some sources of G_i are EXECUTED, and no sources of G_{i+1}, ..., G_n are EXECUTED. □
One can actually prove Corollary 1 without invoking the transitivity of ▷ by successively "transferring executions" from each G_i to G_{i−1}.
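Realizing σ* in code is immediate; in the sketch below (ours), each block is given as its σ_i-ordered source list paired with its sink list, with the blocks already arranged so that G_1 ▷ ⋯ ▷ G_n:

```python
# sigma*: all sources of G_1 (in sigma_1's order), then all sources
# of G_2, ..., and finally all sinks.  (Illustrative sketch.)
def sum_schedule(blocks):
    # blocks: list of (ordered_sources_i, sinks_i), sorted by priority.
    order = []
    for ordered_sources, _ in blocks:
        order += ordered_sources      # sources of each G_i, in turn
    for _, sinks in blocks:
        order += sinks                # finally, all sinks
    return order
```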
3.3 Priorities among Our Building Blocks

We now determine the pairwise priorities among the building-block dags of Section 2.1.2.

Theorem 3. We observe the following pairwise priorities among our building-block dags:

1. For all s and d, W_{s,d} ▷ G for the following bipartite dags G:
a. all W-dags W_{s',d'} whenever d' < d or whenever d' = d and s' ≥ s;
b. all M-dags, N-dags, and Cycle-dags; and
c. Clique-dags Q_{s'} with s' ≤ d.
2. For all s, N_s ▷ G for the following bipartite dags G:
a. all N-dags N_{s'}, for all s', and
b. all M-dags.
3. For all s, C_s ▷ G for the following bipartite dags G:
a. C_s and
b. all M-dags.
4. For all s and d, M_{s,d} ▷ M_{s',d'} whenever d' > d or whenever d' = d and s' ≤ s.
5. For all s, Q_s ▷ Q_s.

The proof of Theorem 3 is a long sequence of calculations paired with an invocation of the transitivity of ▷; we relegate it to the Appendix (Section A), which can be found on the Computer Society Digital Library at http://computer.org/tc/archives.htm.
3.4 Incompatible Sums of Building Blocks

Each of our building blocks admits an IC-optimal schedule, but some of their sums do not.

Lemma 2. The following sums of building-block dags admit no IC-optimal schedule:

1. all sums of the forms C_{s_1} + C_{s_2} or C_{s_1} + Q_{s_2} or Q_{s_1} + Q_{s_2}, where s_1 ≠ s_2;
2. all sums of the form N_{s_1} + C_{s_2} or N_{s_1} + Q_{s_2}; and
3. all sums of the form Q_{s_1} + M_{s_2,d}, where s_1 > s_2.
Proof.

1. Focus on schedules for the dag G = C_{s_1} + C_{s_2}, where s_1 ≠ s_2. There is a unique family Σ_1 of schedules for which E_σ(s_1) = s_1; all of these execute sources of C_{s_1} for the first s_1 steps. For any other schedule σ', E_{σ'}(s_1) < E_σ(s_1). Similarly, there is a unique family Σ_2 of schedules for which E_σ(s_2) = s_2; all of these execute sources of C_{s_2} for the first s_2 steps. For any other schedule σ', E_{σ'}(s_2) < E_σ(s_2). Since s_1 ≠ s_2, the families Σ_1 and Σ_2 are disjoint! Thus, no schedule for G maximizes IC quality at both steps s_1 and s_2; hence, G does not admit any IC-optimal schedule. Exactly the same argument works for the other indicated sum-dags of part 1.

2. Say, for contradiction, that there is an IC-optimal schedule σ for a dag N_{s_1} + G_{s_2}, where G_{s_2} ∈ {C_{s_2}, Q_{s_2}}. The first node that σ executes must be the anchor of N_{s_1}, for only this choice yields E_σ(1) ≠ 0. It follows that σ must execute all sources of N_{s_1} in the first s_1 steps, for this yields E_σ(t) = t for all t ≤ s_1, while any other choice would not maximize IC quality until step s_1. We claim that σ does not maximize IC quality at some step s > 1 and, hence, cannot be IC optimal. To wit: If s_2 ≤ s_1, then σ's deficiency is manifest at step s_1 + 1. A schedule σ' that executes all sources of G_{s_2} and then executes s_1 − s_2 + 1 sources of N_{s_1} has E_{σ'}(s_1 + 1) = s_1 + 1. But σ executes a source of G_{s_2} for the first time at step s_1 + 1 and, so, E_σ(s_1 + 1) = s_1. If s_2 > s_1, then σ's deficiency is manifest at step s_2. A schedule σ' that executes all sources of G_{s_2} during the first s_2 steps has E_{σ'}(s_2) = s_2. However, during this period, σ executes some x ≥ 1 sources of N_{s_1}, hence, only some y ≤ s_2 − 1 sources of G_{s_2}. (Note that x + y = s_2.) Since s_1 < s_2, it must be that y ≥ 1. But, then, by step s_2, σ will have produced exactly x ELIGIBLE sinks on N_{s_1} and no more than y − 1 ELIGIBLE sinks on G_{s_2}, so that E_σ(s_2) ≤ x + y − 1 < s_2.

3. Assume, for contradiction, that there is an IC-optimal schedule σ for Q_{s_1} + M_{s_2,d}, where s_1 > s_2. Focus on the numbers of ELIGIBLE sinks after s_1 and after s_2 steps. The first s_2 nodes that σ executes must be nodes of M_{s_2,d}, dictated by an IC-optimal schedule for that dag, for this is the only choice for which E_σ(s_2) ≠ 0. A schedule σ' that executes all sources of Q_{s_1} during the first s_1 steps would have E_{σ'}(s_1) = s_1. Consider what σ can have produced by step s_1. Since σ spends at least one step before step s_1 executing a node of M_{s_2,d}, it cannot have rendered any sink of Q_{s_1} ELIGIBLE by step s_1; hence, E_σ(s_1) ≤ ⌊(s_1 − 1)/(d − 1)⌋ ≤ s_1 − 1. It follows that σ cannot be IC optimal. □
We summarize our priority-related results about sums of
building blocks in Table 1.
4 ON SCHEDULING COMPOSITIONS OF BUILDING BLOCKS

We now show how to devise IC-optimal schedules for complex dags that are obtained via composition from any base set of connected bipartite dags that can be related by ▷. We illustrate the process using the building blocks of Section 2.1.2 as a base set.
We inductively define the operation of composition on dags.

. Start with a base set B of connected bipartite dags.
. Given G_1, G_2 ∈ B—which could be copies of the same dag with nodes renamed to achieve disjointness—one obtains a composite dag G as follows:
- Let G begin as the sum G_1 + G_2. Rename nodes to ensure that N_G is disjoint from N_{G_1} and N_{G_2}.
- Select some set S_1 of sinks from the copy of G_1 in the sum G_1 + G_2 and an equal-size set S_2 of sources from the copy of G_2.
- Pairwise identify (i.e., merge) the nodes in S_1 and S_2 in some way. (When S_1 = S_2 = ∅, the composite dag is just a sum.) The resulting set of nodes is N_G; the induced set of arcs is A_G.
. Add the dag G thus obtained to the set B.

We denote composition by * and say that the dag G is a composite of type [G_1 * G_2].
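Operationally, a composition is just a renaming of arcs; the following sketch (ours; the dict-based merge map is an assumption) composes two disjoint arc lists:

```python
# Compose G_1 * G_2: 'merge' maps each selected source of G_2 to the
# sink of G_1 it is identified with; the two node sets must be
# disjoint, and nodes of G_2 outside 'merge' keep their own names.
def compose(arcs1, arcs2, merge):
    renamed = [(merge.get(u, u), merge.get(v, v)) for u, v in arcs2]
    return list(arcs1) + renamed
```

Each merged sink of G_1 thereby acquires the outgoing arcs of the corresponding source of G_2 and becomes an internal node of the composite.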
Notes. 1) The roles of G_1 and G_2 in a composition are asymmetric: G_1 contributes sinks, while G_2 contributes sources. 2) G's type indicates only that sources of G_2 were merged with sinks of G_1; it does not identify which nodes were merged. 3) The dags G_1 and/or G_2 could themselves be composite.

Composition is associative, so we do not have to keep track of the order in which dags are incorporated into a composite dag. Fig. 5 illustrates this fact, which we verify now.
Lemma 3. The composition operation on dags is associative. That is, a dag G is a composite of type [[G_1 * G_2] * G_3] if, and only if, it is a composite of type [G_1 * [G_2 * G_3]].
Proof. For simplicity, we refer to sinks and sources that are merged in a composition by their names prior to the merge. Context should disambiguate each occurrence of a name.

Let G be composite of type [[G_1 * G_2] * G_3], i.e., of type [G' * G_3], where G' is composite of type [G_1 * G_2]. Let T_1 and S_2 comprise, respectively, the sinks of G_1 and the sources of G_2 that were merged to yield G'. Note that no node from T_1 is a sink of G' because these nodes have become internal nodes of G'. Let T' and S_3 comprise, respectively, the sinks of G' and the sources of G_3 that were merged to yield G. Each sink of G' corresponds either to a sink of G_1 that is not in T_1 or to a sink of G_2. Hence, T' can be partitioned into the sets T'_1, whose nodes are sinks of G_1, and T'_2, whose nodes are sinks of G_2. Let S'_1 and S'_2 comprise the sources of G_3 that were merged with, respectively, nodes of T'_1 and nodes of T'_2. Now, G can be obtained by first merging the sources of S'_2 with the sinks of T'_2 and then merging the sources of the resulting dag, S'_1 ∪ S_2, with the sinks, T'_1 ∪ T_1, of G_1. Thus,
TABLE 1
The Relation ▷ among Building-Block Dags
Entries either list conditions for priority or indicate (via "X") the absence of any IC-optimal schedule for that pairing.
G is also composite of type [G_1 * [G_2 * G_3]]. The converse yields to similar reasoning. □
We can now illustrate the natural correspondence between the node-set of a composite dag and those of its building blocks, via Fig. 3:

. The evolving two-dimensional mesh is composite of type W_{1,2} * W_{2,2} * W_{3,2} * ⋯.
. A binary reduction-tree is obtained by pairwise composing many instances of M_{1,2} (seven instances in the figure).
. The 5-level two-dimensional reduction-mesh is a composite of type M_{5,2} * M_{4,2} * M_{3,2} * M_{2,2} * M_{1,2}.
. The FFT dag is obtained by pairwise composing many instances of C_2 = Q_2 (12 instances in the figure).
Dag G is a ▷-linear composition of the connected bipartite dags G_1, G_2, ..., G_n if:

1. G is a composite of type G_1 * G_2 * ⋯ * G_n;
2. G_i ▷ G_{i+1}, for all i ∈ [1, n − 1].

Dags that are ▷-linear compositions admit simple IC-optimal schedules.
Theorem 4. Let G be a ▷-linear composition of G_1, G_2, ..., G_n, where each G_i admits an IC-optimal schedule σ_i. The schedule σ for G that proceeds as follows is IC optimal:

1. σ executes the nodes of G that correspond to sources of G_1, in the order mandated by σ_1, then the nodes that correspond to sources of G_2, in the order mandated by σ_2, and so on, for all i ∈ [1, n];
2. σ finally executes all sinks of G in any order.
Proof. Let σ' be a schedule for G that has maximum E_{σ'}(x) for some x, and let X comprise the first x nodes that σ' executed. By Lemma 1, we may assume that either

1. X contains all nonsinks of G (and perhaps some sinks) or
2. X is a proper subset of the nonsinks of G.

In situation 1, E_σ(x) is maximal by hypothesis. We therefore assume that situation 2 holds and show that E_σ(x) ≥ E_{σ'}(x). When X contains only nonsinks of G, each node of X corresponds to a specific source of one specific G_i. Let us focus, for each i ∈ [1, n], on the set of sources of G_i that correspond to nodes in X; call this set X_i. We claim that:

The number of ELIGIBLE nodes in G at step x, denoted e(X), is |S| − |X| + e_1(X_1) + ⋯ + e_n(X_n), where S is the set of sources of G, and e_i(X_i) is the number of sinks of G_i that are ELIGIBLE when only the sources X_i of G_i are EXECUTED.

To verify this claim, imagine that we execute nodes of G and the corresponding nodes of its building blocks G_i in tandem, using the terminology of the IC Pebble Game for convenience. The main complication arises when we pebble an internal node v of G, since we then simultaneously pebble a sink v_i of some G_i and a source v_j of some G_j. At each step t of the Game: If node v of G becomes ELIGIBLE, then we place an ELIGIBLE pebble on v_i and leave v_j unpebbled; if v becomes EXECUTED, then we place an EXECUTED pebble on v_j and an ELIGIBLE pebble on v_i. An EXECUTED pebble on a sink of G is replaced with an ELIGIBLE pebble. No other pebbles change.

Focus on an arbitrary G_i. Note that the sources of G_i that are EXECUTED comprise precisely the set X_i. The sinks of G_i that are ELIGIBLE comprise precisely the set Y_i of sinks all of whose parents are EXECUTED; hence, |Y_i| = e_i(X_i). The cumulative number of sources of the dags G_i that are ELIGIBLE is |S| − p, where p is the number of sources of G that are EXECUTED. It follows that the cumulative number of ELIGIBLE pebbles on the dags G_i is e_1(X_1) + ⋯ + e_n(X_n) + |S| − p. We now calculate the surfeit of ELIGIBLE pebbles on the dags G_i over the ELIGIBLE pebbles on G. Extra ELIGIBLE pebbles get created when G is decomposed, in only two cases: 1) when an internal node of G becomes EXECUTED and 2) when we process a sink of G that is EXECUTED. The number of the former cases is |X_1| + ⋯ + |X_n| − p. Denoting the number of the latter cases by q, we note that q + |X_1| + ⋯ + |X_n| = |X|. The claim is thus verified because the number of ELIGIBLE nodes in G is

e(X) = (e_1(X_1) + ⋯ + e_n(X_n) + |S| − p) − (|X_1| + ⋯ + |X_n| − p + q).

Because of the priority relations among the dags G_i, Corollary 1 implies that e(X) ≤ E_{σ1}(x'_1) + ⋯ + E_{σn}(x'_n), where the x'_i describe a "low-index-loaded" execution of the G_i. Because of the way the dags G_i are composed, the sources of each G_j could have been merged only with sinks of lower-index dags, namely, G_1, ..., G_{j−1}. Thus, a "low-index-loaded" execution corresponds to a set X' of x EXECUTED nodes of G that satisfies precedence constraints. Thus, there is a
Fig. 5. Dags of the following types: (a) [[W_{1,5} * W_{2,4}] * C_3]; (b) [[[W_{3,2} * M_{2,3}] * M_{1,2}] * M_{1,3}]; (c) [N_3 * [N_3 * N_2]] = [[N_3 * N_3] * N_2]. Each admits an IC-optimal schedule.
schedule—namely, σ—that executes nodes of G that correspond to the dags G_1, G_2, ..., G_n, in turn, and this schedule is IC optimal. □
5 IC-OPTIMAL SCHEDULES VIA DAG-DECOMPOSITION

Section 4 describes how to build complex dags that admit IC-optimal schedules. Of course, the "real" problem is not to build a dag but rather to execute a given one. We now craft an algorithmic framework that converts the synthetic setting of Section 4 to an analytical setting. We present a suite of algorithms that take a given dag G and:

1. simplify G's structure in a way that preserves the IC quality of its schedules;
2. decompose (the simplified) G into its "constituents" (when it is, indeed, composite); and
3. determine when (the simplified) G is a ▷-linear composition of its "constituents."

When this program succeeds, we invoke Theorem 4 to schedule G IC optimally, bottom-up, from the decomposition. We now develop the advertised algorithmic setting.
5.1 "Skeletonizing" a Complex Dag

The word "simplified" is needed in the preceding paragraph because a dag can fail to be composite just because it contains "shortcut" arcs that do not impact intertask dependencies. Often, removing all shortcuts renders a dag composite, hence, susceptible to our scheduling strategy. (Easily, not every shortcut-free dag is composite.)

For any dag G and nodes u, v ∈ N_G, we write u ⇝_G v to indicate that there is a path from u to v in G. An arc (u → v) ∈ A_G is a shortcut if there is a path u ⇝_G v that does not include the arc. The reader can show easily that:

Lemma 4. Composite dags contain no shortcuts.
Fortunately, one can efficiently remove all shortcuts from a dag without changing its set of IC-optimal schedules. A (transitive) skeleton (or, minimum equivalent digraph) G' of a dag G is a smallest subdag of G that shares G's node-set and transitive closure [4].

Lemma 5 ([10]). Every dag G has a unique transitive skeleton, κ(G), which can be found in polynomial time.
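A simple, if unoptimized, way to compute the skeleton follows directly from the definition of a shortcut; the sketch below (ours, not the algorithm of [10]) tests each arc for an alternative path via depth-first search:

```python
# Compute the transitive skeleton of a dag given as an arc list:
# delete every shortcut, i.e., every arc (u, v) for which some other
# path leads from u to v.  For dags the result is unique (Lemma 5).
def skeleton(arcs):
    arcs = set(arcs)
    children = {}
    for u, v in arcs:
        children.setdefault(u, set()).add(v)

    def reachable(a, b, skip_arc):
        # Depth-first search from a to b, never traversing skip_arc.
        stack, seen = [a], set()
        while stack:
            x = stack.pop()
            for y in children.get(x, ()):
                if (x, y) == skip_arc or y in seen:
                    continue
                if y == b:
                    return True
                seen.add(y)
                stack.append(y)
        return False

    return {(u, v) for u, v in arcs if not reachable(u, v, (u, v))}
```

For example, skeleton([(1, 2), (2, 3), (1, 3)]) returns {(1, 2), (2, 3)}: the arc (1, 3) is a shortcut.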
We can craft an IC-optimal schedule for a dag G automatically by crafting such a schedule for κ(G). A special case of the following result appears in [15].

Theorem 5. A schedule σ has the same IC quality when it executes a dag G as when it executes κ(G). In particular, if σ is IC optimal for κ(G), then it is IC optimal for G.

Proof. Say that, under schedule σ, a node u becomes ELIGIBLE at step t of the IC Pebble Game on κ(G). This means that, at step t, all of u's ancestors in κ(G)—its parents, its parents' parents, etc.—are EXECUTED. Because κ(G) and G have the same transitive closure, node u has precisely the same ancestors in G as it does in κ(G). Hence, under schedule σ, u becomes ELIGIBLE at step t of the IC Pebble Game on G. □
By Lemma 4, a dag cannot be composite unless it is transitively skeletonized. By Theorem 5, once having scheduled κ(G) IC optimally, we have also scheduled G IC optimally. Therefore, this section paves the way for our decomposition-based scheduling strategy.
5.2 Decomposing a Composite Dag

Every dag G that is composed from connected bipartite dags can be decomposed to expose those dags and how they combine to yield G. We describe this process in detail and illustrate it with the dags of Fig. 3.

A connected bipartite dag H is a constituent of G just when:

1. H is an induced subdag of G: N_H ⊆ N_G, and A_H is comprised of all arcs (u → v) ∈ A_G such that {u, v} ⊆ N_H.
2. H is maximal: The induced subdag of G on any superset of H's nodes—i.e., any set S such that N_H ⊊ S ⊆ N_G—is not connected and bipartite.
Selecting a constituent. We select any constituent of G all of whose sources are also sources of G, if possible; we call the selected constituent B_1 (the notation emphasizing that B_1 is bipartite).

In Fig. 3: Every candidate B_1 for the FFT dag is a copy of C_2 included in levels 2 and 3; every candidate for the reduction-tree is a copy of M_{1,2}; the unique candidate for the reduction-mesh is M_{4,2}.
Detaching a constituent. We "detach" B_1 from G by deleting the nodes of G that correspond to sources of B_1, all incident arcs, and all resulting isolated sinks. We thereby replace G with a pair of dags ⟨B_1, G'⟩, where G' is the remnant of G after B_1 is detached.

If G' is not empty, then the process of selection and detachment continues, producing a sequence of the form

G ⟹ ⟨B_1, G'⟩ ⟹ ⟨B_1, ⟨B_2, G''⟩⟩ ⟹ ⟨B_1, ⟨B_2, ⟨B_3, G'''⟩⟩⟩ ⟹ ⋯,

leading, ultimately, to a sequence of connected bipartite dags: B_1, B_2, ..., B_n.

We claim that the described process recognizes whether or not G is composite and, if so, it produces the dags from which G is composed (possibly in a different order from the original composition). If G is not composite, then the process fails.
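The sketch below (ours) implements one selection-and-detachment step under a simplifying assumption: it grows B_1 as a connected block of G's sources and their children, which suffices when the chosen component's sources are all sources of G, as in Theorem 6's setting; a full implementation must also verify connectivity and bipartiteness and try other candidates on failure:

```python
# One selection-and-detachment step (simplified; see caveat above).
def detach_constituent(arcs):
    parents, children = {}, {}
    for u, v in arcs:
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)
    nodes = set(parents) | set(children)
    g_sources = {v for v in nodes if v not in parents}
    # Grow a connected block of sources and sinks, starting anywhere.
    src = {next(iter(g_sources))}
    while True:
        snk = {v for u in src for v in children[u]}
        grown = {u for v in snk for u in parents[v] if u in g_sources}
        if grown <= src:
            break
        src |= grown
    block = [(u, v) for u, v in arcs if u in src]
    # Detach: drop B_1's sources and their arcs; isolated sinks vanish
    # implicitly in this arcs-only representation.
    remnant = [(u, v) for u, v in arcs if u not in src]
    return block, remnant
```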
Theorem 6. Let the dag G be composite of type G_1 * ⋯ * G_n. The decomposition process produces a sequence B_1, ..., B_n of connected bipartite dags such that:

. G is composite of type B_1 * ⋯ * B_n;
. {B_1, ..., B_n} = {G_1, ..., G_n}.
Proof. The result is trivial when n = 1, as G is then a connected bipartite dag. Assume, therefore, that the result holds for all n < m, and let G be a composite of type G_1 * ⋯ * G_m. In this case, G_1 is a constituent of G, all of whose sources are sources of G. (Other G_i's may share this property.) There is, therefore, a dag B_1 for our process to detach. Since any constituent of G all of whose sources are sources of G must be one of the G_i, we know that B_1 is one of these dags, say G_i. It follows that G is a composite of type B_1 * (G_1 * ⋯ * G_{i−1} * G_{i+1} * ⋯ * G_m); moreover, the dag G' resulting after detaching B_1 is composite of type G_1 * ⋯ * G_{i−1} * G_{i+1} * ⋯ * G_m, because the detachment process does not affect any sources of G other than those it shares with B_1. By inductive hypothesis, then, G' can be decomposed as indicated in the theorem. We now invoke Lemma 3. □
5.3 The Super-Dag Obtained by Decomposing G

The next step in our strategy is to abstract the structure of G exposed by its decomposition into B_1, ..., B_n in an algorithmically advantageous way. Therefore, we shift focus from the decomposition to G's associated super-dag S_G =def S(B_1 * ⋯ * B_n), which is constructed as follows: Each node of S_G—which we call a supernode to prevent ambiguity—is one of the B_i's. There is an arc in S_G from supernode u to supernode v just when some sink(s) of u are identified with some source(s) of v when one composes the B_i's to produce G. Fig. 6 and Fig. 7 present two examples; in both, supernodes appear in dashed boxes and are interconnected by dashed arcs.

In terms of super-dags, the question of whether or not Theorem 4 applies to dag G reduces to the question of whether or not S_G admits a topological sort [4] whose linearization of supernodes is consistent with the relation ▷. For instance, one derives an IC-optimal schedule for the dag G of Fig. 5b (which is decomposed in Fig. 6) by noting that G is a composite of type W_{3,2} * M_{1,2} * M_{2,3} * M_{1,3} and that W_{3,2} ▷ M_{1,2} ▷ M_{2,3} ▷ M_{1,3}. Indeed, G points out the challenge in determining whether Theorem 4 applies, since it is also a composite of type W_{3,2} * M_{2,3} * M_{1,2} * M_{1,3}, but not M_{2,3} ▷ M_{1,2}. We leave to the reader the easy verification that the linearization B_1, ..., B_n is a topological sort of S(B_1 * ⋯ * B_n).
5.4 On Exploiting Priorities among Constituents

Our remaining challenge is to devise a topological sort of S_G that linearizes the supernodes in an order that honors the relation ▷. We now present sufficient conditions for this to occur, verified via a linearization algorithm:

Theorem 7. Say that the dag G is a composite of type B_1 * ⋯ * B_n and that, for each pair of constituents B_i, B_j with i ≠ j, either B_i ▷ B_j or B_j ▷ B_i. Then, G is a ▷-linear composition whenever the following holds:

Whenever B_j is a child of B_i in S(B_1 * ⋯ * B_n), we have B_i ▷ B_j.   (6)

Proof. We begin with an arbitrary topological sort, B̂ = B_{π(1)}, ..., B_{π(n)}, of the super-dag S_G. We invoke the hypothesis that ▷ is a (weak) order on the B_i's to reorder B̂ according to ▷, using a stable comparison sort. (Stable means that, if B_i ▷ B_j and B_j ▷ B_i, then the sort maintains the original relative order of B_i and B_j.) Let B̃ = B_{τ(1)} ▷ ⋯ ▷ B_{τ(n)} be the linearization of S_G produced by the sort. We claim that B̃ is also a topological sort of S_G. To wit, pick any B_i and B_j such that B_j is B_i's child in S_G. By definition of topological sort, B_i precedes B_j in B̂. We claim that, because B_i ▷ B_j (by (6)), B_i precedes B_j also in B̃. On the one hand, if it is not the case that B_j ▷ B_i, then the sort necessarily places B_i before B_j in B̃. On the other hand, if B_j ▷ B_i, then, since the sort is stable, B_i precedes B_j in B̃ because it precedes B_j in B̂. Thus, B̃ is, indeed, a topological sort of S_G, so that G is composite of type B_{τ(1)} * ⋯ * B_{τ(n)}. In other words, G is the desired ▷-linear composition of B_{τ(1)}, ..., B_{τ(n)}. □
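The proof is constructive. In the sketch below (ours), prio(a, b) is assumed to decide a ▷ b, say via the brute-force check of Section 3.2; Python's sort is stable, which is exactly the property the proof requires:

```python
# Theorem 7's linearization: stably reorder a topological sort of the
# super-dag by the priority relation.  (Illustrative sketch.)
from functools import cmp_to_key

def linearize(topo_order, prio):
    def cmp(a, b):
        if prio(a, b) and not prio(b, a):
            return -1                 # a must precede b
        if prio(b, a) and not prio(a, b):
            return 1                  # b must precede a
        return 0                      # tie: stability keeps topo order
    return sorted(topo_order, key=cmp_to_key(cmp))
```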
We can finally apply Theorem 4 to find an IC-optimal
schedule for the dag G.
6 CONCLUSIONS AND PROJECTIONS

We have developed three notions that form the basis for a theory of scheduling complex computation-dags for Internet-based computing: the priority relation ▷ on bipartite dags (Section 3.2), the operation of the composition of dags (Section 4), and the operation of the decomposition of dags (Section 5). We have established a way of combining these notions to produce schedules for a large class of complex computation-dags that maximize the number of tasks that are eligible for allocation to remote clients at every step of the schedule (Theorems 4 and 7). We have used our notions to progress beyond the structurally uniform computation-dags studied in [15], [17] to families that are built in structured, yet flexible, ways from a repertoire of bipartite building-block dags. The composite dags that we can now schedule optimally encompass not only those studied in [15], [17] but, as illustrated in Fig. 5, also dags that have rather complex structures, including nodes of varying degrees and nonleveled global structure.

One direction for future work is to extend the repertoire of building-block dags that form the raw material for our composite dags. In particular, we want building blocks of more complex structures than those of Section 2.1.2, including less-uniform bipartite dags and nonbipartite dags. We expect the computational complexity of our scheduling algorithms to increase with the structural complexity of our building blocks. Along these lines, we have thus far been unsuccessful in determining the complexity of the problem of deciding if a given computation-dag admits an IC-optimal schedule, but we continue to probe in this direction. (The scheduling problem could well be co-NP-complete because of its underlying universal quantification.) Finally, we are working
Fig. 6. The composition of the dags of Fig. 5b and its associated
superdag.
to extend Theorems 4 and 7 to loosen the strict requirement that the composite dag be a ▷-linear composition.
ACKNOWLEDGMENTS
A portion of the research of G. Malewicz was done while he
was visiting the TAPADS Group at the University of
Massachusetts Amherst. The research of A. Rosenberg and
M. Yurkewych was supported in part by US National
Science Foundation Grant CCF-0342417. A portion of this
paper appeared in the Proceedings of the International Parallel
and Distributed Processing Symposium, 2005.
REFERENCES
[1] J. Annis, Y. Zhao, J. Voeckler, M. Wilde, S. Kent, and I. Foster,
“Applying Chimera Virtual Data Concepts to Cluster Finding in
the Sloan Sky Survey,” Proc. 15th Conf. High Performance
Networking and Computing, p. 56, 2002.
[2] R. Buyya, D. Abramson, and J. Giddy, “A Case for Economy Grid
Architecture for Service Oriented Grid Computing,” Proc. 10th
Heterogeneous Computing Workshop, 2001.
[3] W. Cirne and K. Marzullo, “The Computational Co-Op: Gathering
Clusters into a Metacomputer,” Proc. 13th Int’l Parallel Processing
Symp., pp. 160-166, 1999.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction
to Algorithms, second ed. Cambridge, Mass.: MIT Press, 2001.
[5] The Grid: Blueprint for a New Computing Infrastructure, second ed.,
I. Foster and C. Kesselman, eds. San Francisco: Morgan
Kaufmann, 2004.
[6] I. Foster, C. Kesselman, and S. Tuecke, “The Anatomy of the Grid:
Enabling Scalable Virtual Organizations,” Int’l J. High Performance
Computing Applications, vol. 15, pp. 200-222, 2001.
[7] L. Gao and G. Malewicz, "Internet Computing of Tasks with Dependencies Using Unreliable Workers," Theory of Computing Systems, to appear.
[8] A. Gerasoulis and T. Yang, “A Comparison of Clustering
Heuristics for Scheduling Dags on Multiprocessors,” J. Parallel
and Distributed Computing, vol. 16, pp. 276-291, 1992.
[9] L. He, Z. Han, H. Jin, L. Pan, and S. Li, "DAG-Based Parallel Real Time Task Scheduling Algorithm on a Cluster," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 437-443, 2000.
[10] H.T. Hsu, “An Algorithm for Finding a Minimal Equivalent
Graph of a Digraph,” J. ACM, vol. 22, pp. 11-16, 1975.
[11] D. Kondo, H. Casanova, E. Wing, and F. Berman, "Models and Scheduling Guidelines for Global Computing Applications," Proc. Int'l Parallel and Distributed Processing Symp., p. 79, 2002.
[12] E. Korpela, D. Werthimer, D. Anderson, J. Cobb, and M. Lebofsky,
“SETI@home: Massively Distributed Computing for SETI,”
Computing in Science and Eng., P.F. Dubois, ed., Los Alamitos,
Calif.: IEEE CS Press, 2000.
[13] G. Malewicz, "Parallel Scheduling of Complex Dags under Uncertainty," Proc. 17th ACM Symp. Parallelism in Algorithms and Architectures, 2005.
[14] G. Malewicz and A.L. Rosenberg, “On Batch-Scheduling Dags for
Internet-Based Computing,” Proc. 11th European Conf. Parallel
Processing, 2005.
[15] A.L. Rosenberg, “On Scheduling Mesh-Structured Computations
for Internet-Based Computing,” IEEE Trans. Computers, vol. 53,
pp. 1176-1186, 2004.
[16] A.L. Rosenberg and I.H. Sudborough, “Bandwidth and Pebbling,”
Computing, vol. 31, pp. 115-139, 1983.
[17] A.L. Rosenberg and M. Yurkewych, “Guidelines for Scheduling
Some Common Computation-Dags for Internet-Based Comput-
ing,” IEEE Trans. Computers, vol. 54, pp. 428-438, 2005.
[18] X.-H. Sun and M. Wu, "Grid Harvest Service: A System for Long-Term, Application-Level Task Scheduling," Proc. IEEE Int'l Parallel and Distributed Processing Symp., p. 25, 2003.
[19] D. Thain, T. Tannenbaum, and M. Livny, “Distributed Computing
in Practice: The Condor Experience,” Concurrency and Computation:
Practice and Experience, 2005.
Grzegorz Malewicz studied computer science
and applied mathematics at the University of
Warsaw from 1993 until 1998. He then joined the
University of Connecticut and received the
doctorate in 2003. He is a software engineer at
Google, Inc. Prior to joining Google, he was an
assistant professor of computer science at the
University of Alabama (UA), where he taught
computer science from 2003-2005. He has had
internships at the AT&T Shannon Lab (summer
2001) and Microsoft Corporation (summer 2000 and fall 2001). He
visited the Laboratory for Computer Science, MIT (AY 2002/2003) and
was a visiting scientist at the University of Massachusetts Amherst
(summer 2004) and Argonne National Laboratory (summer 2005). His
research focuses on parallel and distributed computing, algorithms,
combinatorial optimization, and scheduling. His research appears in top journals and conferences and includes a sole-authored SIAM Journal on Computing paper that solves a decade-old problem in distributed computing. He is a member of the IEEE.
Fig. 7. The three-dimensional FFT dag and its associated superdag.
Arnold L. Rosenberg is a Distinguished University Professor of Computer Science at the University of Massachusetts Amherst, where he codirects the Theoretical Aspects of Parallel and Distributed Systems (TAPADS) Laboratory.
Prior to joining UMass, he was a professor of
computer science at Duke University from 1981
to 1986 and a research staff member at the IBM
T.J. Watson Research Center from 1965 to
1981. He has held visiting positions at Yale
University and the University of Toronto; he was a Lady Davis Visiting
Professor at the Technion (Israel Institute of Technology) in 1994, and a
Fulbright Research Scholar at the University of Paris-South in 2000. His
research focuses on developing algorithmic models and techniques to
deal with the new modalities of “collaborative computing” (the endeavor
of having several computers cooperate in the solution of a single
computational problem) that result from emerging technologies. He is
the author or coauthor of more than 150 technical papers on these and
other topics in theoretical computer science and discrete mathematics
and is the coauthor of the book Graph Separators, with Applications. He
is a fellow of the ACM, a fellow of the IEEE, and a Golden Core member
of the IEEE Computer Society.
Matthew Yurkewych received the BS degree from the Massachusetts Institute of Technology in 1998. He is a PhD student in computer science at the University of Massachusetts Amherst. Prior to entering graduate school, he worked at Akamai Technologies and CNet Networks as a software engineer.