On-the-Fly Pipeline Parallelism
I-Ting Angelina Lee*        Charles E. Leiserson*        Tao B. Schardl*        Jim Sukha†        Zhunping Zhang*

*MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139
†Intel Corporation, 25 Manchester Street, Suite 200, Merrimack, NH 03054

{angelee, cel, neboat, jzz}@mit.edu        jim.sukha@intel.com
ABSTRACT
Pipeline parallelism organizes a parallel program as a linear se-
quence of s stages. Each stage processes elements of a data stream,
passing each processed data element to the next stage, and then
taking on a new element before the subsequent stages have nec-
essarily completed their processing. Pipeline parallelism is used
especially in streaming applications that perform video, audio, and
digital signal processing. Three out of 13 benchmarks in PARSEC,
a popular software benchmark suite designed for shared-memory
multiprocessors, can be expressed as pipeline parallelism.
Whereas most concurrency platforms that support pipeline par-
allelism use a “construct-and-run” approach, this paper investi-
gates “on-the-fly” pipeline parallelism, where the structure of the
pipeline emerges as the program executes rather than being spec-
ified a priori. On-the-fly pipeline parallelism allows the num-
ber of stages to vary from iteration to iteration and dependencies
to be data dependent. We propose simple linguistics for speci-
fying on-the-fly pipeline parallelism and describe a provably ef-
ficient scheduling algorithm, the PIPER algorithm, which inte-
grates pipeline parallelism into a work-stealing scheduler, allow-
ing pipeline and fork-join parallelism to be arbitrarily nested. The
PIPER algorithm automatically throttles the parallelism, precluding
“runaway” pipelines. Given a pipeline computation with T_1 work
and T_∞ span (critical-path length), PIPER executes the computation
on P processors in T_P ≤ T_1/P + O(T_∞ + lg P) expected time. PIPER
also limits stack space, ensuring that it does not grow unboundedly
with running time.
We have incorporated on-the-fly pipeline parallelism into a Cilk-
based work-stealing runtime system. Our prototype Cilk-P imple-
mentation exploits optimizations such as lazy enabling and depen-
dency folding. We have ported the three PARSEC benchmarks
that exhibit pipeline parallelism to run on Cilk-P. One of these,
x264, cannot readily be executed by systems that support only
construct-and-run pipeline parallelism. Benchmark results indicate
that Cilk-P has low serial overhead and good scalability. On x264,
for example, Cilk-P exhibits a speedup of 13.87 over its respective
serial counterpart when running on 16 processors.
This work was supported in part by the National Science Foundation under
Grants CNS-1017058 and CCF-1162148. Tao B. Schardl is supported in
part by an NSF Graduate Research Fellowship.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
SPAA’13, June 23–25, 2013, Montréal, Québec, Canada.
Copyright 2013 ACM 978-1-4503-1572-2/13/07 ...$15.00.
Categories and Subject Descriptors
D.3.3 [Language Constructs and Features]: Concurrent pro-
gramming structures; D.3.4 [Programming Languages]: Proces-
sors—Run-time environments.
General Terms
Algorithms, Languages, Theory.
Keywords
Cilk, multicore, multithreading, parallel programming, pipeline
parallelism, on-the-fly pipelining, scheduling, work stealing.
1. INTRODUCTION
Pipeline parallelism¹ [6, 16, 17, 25, 27, 28, 31, 33, 35, 37] is a
well-known parallel-programming pattern that can be used to parallelize
a variety of applications, including streaming applications
from the domains of video, audio, and digital signal processing.
Many applications, including the ferret, dedup, and x264 benchmarks
from the PARSEC benchmark suite [4, 5], exhibit parallelism
in the form of a linear pipeline, where a linear sequence
S = ⟨S_0, ..., S_{s−1}⟩ of abstract functions, called stages, is executed
on an input stream I = ⟨a_0, a_1, ..., a_{n−1}⟩. Conceptually, a linear
pipeline can be thought of as a loop over the elements of I, where
each loop iteration i processes an element a_i of the input stream.
The loop body encodes the sequence S of stages through which
each element is processed. Parallelism arises in linear pipelines
because the execution of iterations can overlap in time, that is, iteration
i may start after the preceding iteration i − 1 has started, but
before i − 1 has necessarily completed.
Most systems that provide pipeline parallelism employ a
construct-and-run model, as exemplified by the pipeline model in
Intel Threading Building Blocks (TBB) [27], where the pipeline
stages and their dependencies are defined a priori before execution.
Systems that support construct-and-run pipeline parallelism
include the following: [1, 11, 17, 25–27, 29–32, 35, 37, 38].
We have extended the Cilk parallel-programming model [15,
20, 24] to augment its native fork-join parallelism with on-the-fly
pipeline parallelism, where the linear pipeline is constructed dy-
namically as the program executes. The Cilk-P system provides a
flexible linguistic model for pipelining that allows the structure of
the pipeline to be determined dynamically as a function of data in
the input stream. Cilk-P also admits a variable number of stages
across iterations, allowing the pipeline to take on shapes other than
simple rectangular grids.

¹Pipeline parallelism should not be confused with instruction pipelining in
hardware [34] or software pipelining [22].
!!!"
!!!"
0
iterations
stages
1 2 3 4 5 6 7 n–1
0
1
2
Figure 1: Modeling the execution of ferrets linear pipeline as a pipeline
dag. Each column contains nodes for a single iteration, and each row cor-
responds to a stage of the pipeline. Vertices in the dag correspond to nodes
of the linear pipeline, and edges denote dependencies between the nodes.
Throttling edges are not shown.
The Cilk-P programming model is flexible,
yet restrictive enough to allow provably efficient scheduling,
as Sections 5 through 8 will show. In particular, Cilk-P's scheduler
provides automatic "throttling" to ensure that the computation uses
bounded space. As a testament to the flexibility provided by Cilk-P,
we were able to parallelize the x264 benchmark from PARSEC, an
application that cannot be programmed easily using TBB [33].
Cilk-P’s support for defining linear pipelines on the fly is more
flexible than the ordered directive in OpenMP [29], which sup-
ports a limited form of on-the-fly pipelining, but it is less expres-
sive than other approaches. Blelloch and Reid-Miller [6] describe
a scheme for on-the-fly pipeline parallelism that employs futures
[3, 14] to coordinate the stages of the pipeline, allowing even non-
linear pipelines to be defined on the fly. Although futures permit
more complex, nonlinear pipelines to be expressed, this generality
can lead to unbounded space requirements to attain even modest
speedups [7].
To illustrate the ideas behind the Cilk-P model, consider a simple
3-stage linear pipeline such as in the ferret benchmark from PAR-
SEC [4, 5]. Figure 1 shows a pipeline dag (directed acyclic graph)
G = (V,E) representing the execution of the pipeline. Each of the
3 horizontal rows corresponds to a stage of the pipeline, and each
of the n vertical columns is an iteration. We define a pipeline node
(i, j) ∈ V, where i = 0, 1, ..., n − 1 and j = 0, 1, 2, to be the execution
of S_j(a_i), the jth stage in the ith iteration, represented as a
vertex in the dag. The edges between nodes denote dependencies.
A stage edge goes between two nodes (i, j) and (i, j′), where j < j′,
and indicates that (i, j′) cannot start until (i, j) completes. A cross
edge between nodes (i − 1, j) and (i, j) indicates that (i, j) can start
execution only after node (i − 1, j) completes. Cilk-P always executes
nodes of the same iteration in increasing order by stage number,
thereby creating a vertical chain of stage edges. Cross edges
between corresponding stages of adjacent iterations are optional.
We can categorize the stages of a Cilk-P pipeline. A stage is a
serial stage if all nodes belonging to the stage are connected by
cross edges, it is a parallel stage if none of the nodes belonging to
the stage are connected by cross edges, and it is a hybrid stage otherwise.
The ferret pipeline, for example, exhibits a static structure
often referred to as an "SPS" pipeline, since Stage 0 and Stage 2
are serial and Stage 1 is parallel. Cilk-P requires that pipelines be
linear, since iterations are totally ordered and dependencies go between
adjacent iterations, and in fact, Stage 0 of any Cilk-P pipeline
is always a serial stage. Later stages may be serial, parallel, or hybrid,
as we shall see in Sections 2 and 3.
To execute a linear pipeline, Cilk-P follows the lead of TBB
and adopts a bind-to-element approach [25, 27], where workers
(scheduling threads) execute pipeline iterations either to comple-
tion or until an unresolved dependency is encountered. In par-
ticular, Cilk-P and TBB both rely on “work-stealing” schedulers
(see, for example, [2, 8, 10, 13, 15, 21]) for load balancing. In con-
trast, many systems that support pipeline parallelism, including
typical Pthreaded implementations, execute linear pipelines using
a bind-to-stage approach, where each worker executes a distinct
stage and coordination between workers is handled using concur-
rent queues [17, 35, 38]. Some researchers report that the bind-
to-element approach generally outperforms bind-to-stage [28, 33],
since a work-stealing scheduler can do a better job of dynamically
load-balancing the computation, but our own experiments show
mixed results.
A natural theoretical question is, how much parallelism is inher-
ent in the ferret pipeline (or in any pipeline)? How much speedup
can one hope for? Since the computation is represented as a dag
G = (V,E), one can use a simple work/span analysis [12, Ch. 27]
to answer this question. In this analytical model, we assume that
each vertex v ∈ V executes in some time w(v). The work of the
computation, denoted T_1, is essentially the serial execution time,
that is, T_1 = Σ_{v∈V} w(v). The span of the computation, denoted
T_∞, is the length of a longest weighted path through G, which is
essentially the time of an infinite-processor execution. The parallelism
is the ratio T_1/T_∞, which is the maximum possible speedup
attainable on any number of processors, using any scheduler.
Unlike in some applications, in the ferret pipeline, each node
executes serially, that is, its work and span are the same. Let w(i, j)
be the execution time of node (i, j). Assume that the serial Stages
0 and 2 execute in unit time, that is, for all i, we have w(i, 0) =
w(i, 2) = 1, and that the parallel Stage 1 executes in time r ≥ 1,
that is, for all i, we have w(i, 1) = r. Because the pipeline dag is
grid-like, the span of this SPS pipeline can be realized by some
staircase walk through the dag from node (0, 0) to node (n − 1, 2).
The work of this pipeline is therefore T_1 = n(r + 2), and the span is

    T_∞ = max_{0≤x<n} { Σ_{i=0}^{x} w(i, 0) + w(x, 1) + Σ_{i=x}^{n−1} w(i, 2) } = n + r .

Consequently, the parallelism of this dag is T_1/T_∞ = n(r + 2)/(n + r),
which for 1 ≤ r ≤ n is at least r/2 + 1. Thus, if Stage 1 contains
much more work than the other two stages, the pipeline exhibits
good parallelism.
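For concreteness (illustrative numbers of our own, not from the benchmark): with n = 1000 iterations and r = 100, the work is T_1 = n(r + 2) = 102,000, the span is T_∞ = n + r = 1,100, and the parallelism is T_1/T_∞ ≈ 92.7, comfortably above the r/2 + 1 = 51 lower bound.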
Cilk-P guarantees to execute the ferret pipeline efficiently.
In particular, on an ideal shared-memory computer with up to
T_1/T_∞ = O(r) processors, Cilk-P guarantees linear speedup. Generally,
Cilk-P executes a pipeline with linear speedup as long as
the parallelism of the pipeline exceeds the number of processors
on which the computation is scheduled. Moreover, as Section 3
will describe, Cilk-P allows stages of the pipeline themselves to be
parallel using recursive pipelining or fork-join parallelism.
In practice, it is also important to limit the space used during
an execution. Unbounded space can cause thrashing of the memory
system, leading to slowdowns not predicted by simple execution
models. In particular, a bind-to-element scheduler must avoid
creating a runaway pipeline, a situation where the scheduler allows
many new iterations to be started before finishing old ones. In
Figure 1, a runaway pipeline might correspond to executing many
nodes in Stage 0 (the top row) without finishing the other stages
of the computation in the earlier iterations. Runaway pipelines can
cause space utilization to grow unboundedly, since every started
but incomplete iteration requires space to store local variables.
Cilk-P automatically throttles pipelines to avoid runaway
pipelines. On a system with P workers, Cilk-P inhibits the start
of iteration i + K until iteration i has completed, where K = Θ(P)
is the throttling limit. Throttling corresponds to putting throttling
edges from the last node in each iteration i to the first node in iteration
i + K. For the simple pipeline from Figure 1, throttling does
not adversely affect asymptotic scalability if stages are uniform,
but it can be a concern for more complex pipelines, as Section 11
will discuss. The Cilk-P scheduler guarantees efficient scheduling
of pipelines as a function of the parallelism of the dag in which
throttling edges are included in the calculation of span.
Contributions
Our prototype Cilk-P system adapts the Cilk-M [23] work-stealing
scheduler to support on-the-fly pipeline parallelism using a bind-to-element
approach. This paper makes the following contributions:

• We describe linguistics for Cilk-P that allow on-the-fly pipeline
parallelism to be incorporated into the Cilk fork-join parallel
programming model (Section 2).
• We illustrate how Cilk-P linguistics can be used to express the
x264 benchmark as a pipeline program (Section 3).
• We characterize the execution dag of a Cilk-P pipeline program
as an extension of a fork-join program (Section 4).
• We introduce the PIPER scheduling algorithm, a theoretically
sound randomized work-stealing scheduler (Section 5).
• We prove that PIPER is asymptotically efficient, executing
Cilk-P programs on P processors in T_P ≤ T_1/P + O(T_∞ + lg P)
expected time (Sections 6 and 7).
• We bound space usage, proving that PIPER on P processors uses
S_P ≤ P(S_1 + f DK) stack space for pipeline iterations, where S_1
is the serial stack space, f is the "frame size," D is the depth of
nested pipelines, and K is the throttling limit (Section 8).
• We describe our implementation of PIPER in the Cilk-P runtime
system, introducing two key optimizations: lazy enabling and
dependency folding (Section 9).
• We demonstrate that the ferret, dedup, and x264 benchmarks
from PARSEC, when hand-compiled for the Cilk-P runtime
system (we do not as yet have a compiler for the Cilk-P language),
run competitively with existing Pthreaded implementations
(Section 10).

We conclude in Section 11 with a discussion of the performance
implications of throttling.
2. ON-THE-FLY PIPELINE PROGRAMS
Cilk-P’s linguistic model supports both fork-join and pipeline
parallelism, which can be nested arbitrarily. For convenience, we
shall refer to programs containing nested fork-join and pipeline par-
allelism simply as pipeline programs. Cilk-P’s on-the-fly pipelin-
ing model allows the programmer to specify a pipeline whose struc-
ture is determined during the pipeline's execution. This section reviews
the basic Cilk model and shows how on-the-fly parallelism
is supported in Cilk-P using a pipe_while construct.
We first outline the basic semantics of Cilk without the pipelining
features of Cilk-P. We use the syntax of Cilk++ [24] and Cilk Plus
[20], which augments serial C/C++ code with two principal keywords:
cilk_spawn and cilk_sync.² When a function invocation
is preceded by the keyword cilk_spawn, the function is spawned
as a child subcomputation, but the runtime system may continue
to execute the statement after the cilk_spawn, called the continuation,
in parallel with the spawned subroutine without waiting for
the child to return. The complementary keyword to cilk_spawn is
cilk_sync, which acts as a local barrier and joins together all the
parallelism forked by cilk_spawn within a function. Every function
contains an implicit cilk_sync before the function returns.

²Cilk++ and Cilk Plus also include other features that are not relevant to the
discussion here.
To support on-the-fly pipeline parallelism, Cilk-P provides a
pipe_while keyword. A pipe_while loop is similar to a serial
while loop, except that loop iterations can execute in parallel in a
pipelined fashion. The body of the pipe_while can be subdivided
into stages, with stages named by user-specified integer values that
strictly increase as the iteration executes. Each stage can contain
nested fork-join and pipeline parallelism.
The boundaries of stages are denoted in the body of a
pipe_while using the special functions pipe_continue and
pipe_wait. These functions accept an integer stage argument,
which is the number of the next stage to execute and which must
strictly increase during the execution of an iteration. Every iteration
i begins executing Stage 0, represented by node (i, 0). While executing
a node (i, j′), if control flow encounters a pipe_wait(j) or
pipe_continue(j) statement, where j > j′, then node (i, j′) ends,
and control flow proceeds to node (i, j). A pipe_continue(j)
statement indicates that node (i, j) can start executing immediately,
whereas a pipe_wait(j) statement indicates that node (i, j) cannot
start until node (i − 1, j) completes. The pipe_wait(j) in iteration
i creates a cross edge from node (i − 1, j) to node (i, j) in the
pipeline dag. Thus, by design choice, Cilk-P imposes the restriction
that pipeline dependencies only go between adjacent iterations.
As we shall see in Section 9, this design choice facilitates the "lazy
enabling" and "dependency folding" runtime optimizations.
The pipe_continue and pipe_wait functions can be used with-
out an explicit stage argument. Omitting the stage argument while
executing stage j corresponds to an implicit stage argument of
j + 1, i.e., control moves onto the next stage.
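As a concrete illustration of these constructs (our own sketch, not taken from the paper; read_item, transform, and write_item are hypothetical helper functions), the following pipe_while expresses a simple SPS pipeline in the style of ferret:

    pipe_while (item_t *item = read_item()) {  // Stage 0: always serial
      pipe_continue(1);   // Stage 1: no cross edge, so iterations may overlap here
      transform(item);    // the expensive, parallel stage
      pipe_wait(2);       // Stage 2: cross edge from (i-1, 2) makes it serial
      write_item(item);   // results are emitted in input order
    }

Because the only pipe_wait targets Stage 2, Stage 1 of different iterations can run concurrently, while Stages 0 and 2 execute in iteration order.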
Cilk-P's semantics for pipe_continue and pipe_wait statements
allow for stage skipping, where execution in an iteration i
can jump stages from node (i, j′) to node (i, j), even if j > j′ + 1.
If control flow in iteration i + 1 enters node (i + 1, j″) after a
pipe_wait, where j′ < j″ < j, then we implicitly create a null
node (i, j″) in the pipeline dag, which has no associated work and
incurs no scheduling overhead, and insert stage edges from (i, j′) to
(i, j″) and from (i, j″) to (i, j), as well as a cross edge from (i, j″)
to (i + 1, j″).
3. ON-THE-FLY PIPELINING OF x264
To illustrate the use of Cilk-P's pipe_while loop, this section
describes how to parallelize the x264 video encoder [39].

We begin with a simplified description of x264. Given a stream
⟨f_0, f_1, ...⟩ of video frames to encode, x264 partitions each frame
into a two-dimensional array of "macroblocks" and encodes each
macroblock. A macroblock in frame f_i is encoded as a function of
the encodings of similar macroblocks within f_i and similar macroblocks
in frames "near" f_i. A frame f_j is near a frame f_i if
i − b ≤ j ≤ i + b for some constant b. In addition, we define a macroblock
(x′, y′) to be near a macroblock (x, y) if x − w ≤ x′ ≤ x + w
and y − w ≤ y′ ≤ y + w for some constant w.

The type of a frame f_i determines how a macroblock (x, y) in f_i
is encoded. If f_i is an I-frame, then macroblock (x, y) can be encoded
using only previous macroblocks within f_i, namely macroblocks
at positions (x′, y′) where y′ < y, or where y′ = y and x′ < x. If f_i is a
P-frame, then macroblock (x, y)'s encoding can also be based on
nearby macroblocks in nearby preceding frames, up to the most recent
preceding I-frame,³ if one exists within the nearby range. If
f_i is a B-frame, then macroblock (x, y)'s encoding can be based
also on nearby macroblocks in nearby frames, likewise up to the
most recently preceding I-frame and up to the next succeeding I- or
P-frame.

³To be precise, up to a particular type of I-frame called an IDR-frame.
 1 // Symbolic names for important stages
 2 const uint64_t PROCESS_IPFRAME = 1;
 3 const uint64_t PROCESS_BFRAMES = 1 << 40;
 4 const uint64_t END = PROCESS_BFRAMES + 1;
 5 int i = 0;
 6 int w = mv_range / pixel_per_row;

 8 pipe_while (frame_t *f = next_frame()) {
 9   vector<frame_t *> bframes;
10   f->type = decide_frame_type(f);
11   while (f->type == TYPE_B) {
12     bframes.push_back(f);
13     f = next_frame();
14     f->type = decide_frame_type(f);
15   }
16   int skip = w * i++;
17   pipe_wait(PROCESS_IPFRAME + skip);
18   while (mb_t *macroblocks = next_row(frame)) {
19     process_row(macroblocks);
20     if (f->type == TYPE_I) {
21       pipe_continue;
22     } else {
23       pipe_wait;
24     }
25   }
26   pipe_continue(PROCESS_BFRAMES);
27   cilk_for (int j = 0; j < bframes.size(); ++j) {
28     process_bframe(bframes[j]);
29   }
30   pipe_wait(END);
31   write_out_frames(frame, bframes);
32 }
Figure 2: Example C++-like pseudocode for the x264 linear pipeline. This
pseudocode uses Cilk-P’s linguistics to define hybrid pipeline stages on the
fly, specifically with the pipe_wait on line 17, the input-data dependent
pipe_wait or pipe_continue on lines 20–24, and the pipe_continue on
line 26.
Based on these frame types, an x264 encoder must ensure that
frames are processed in a valid order such that dependencies be-
tween encoded macroblocks are satisfied. A parallel x264 encoder
can pipeline the encoding of I- and P-frames in the input stream,
processing each set of intervening B-frames after encoding the lat-
est I- or P-frame on which the B-frame may depend.
Figure 2 shows pseudocode for an x264 linear pipeline. Conceptually,
the x264 pipeline begins with a serial stage (lines 8–17)
that reads frames from the input stream and determines the type of
each frame. This stage buffers all B-frames at the head of the input
stream until it encounters an I- or P-frame. After this initial stage, s
hybrid stages process this I- or P-frame row by row (lines 18–25),
where s is the number of rows in the video frame. After all rows of
this I- or P-frame have been processed, the PROCESS_BFRAMES stage
processes all B-frames in parallel (lines 27–29), and then the END
stage updates the output stream with the processed frames (line 31).
Two issues arise with this general pipelining strategy, both of
which can be handled using on-the-fly pipeline parallelism. First,
the encoding of a P-frame must wait for the encoding of rows
in the previous frame to be completed, whereas the encoding of
an I-frame need not. These conditional dependencies are implemented
in lines 20–24 of Figure 2 by executing a pipe_wait or
pipe_continue statement conditionally based on the frame's type.
In contrast, many construct-and-run pipeline mechanisms assume
that the dependencies on a stage are fixed for the entirety of a
pipeline's execution, making such dynamic dependencies more difficult
to handle. Second, the encoding of a macroblock in row x of
P-frame f_i may depend on the encoding of a macroblock in a later
row x + w in the preceding I- or P-frame f_{i−1}. The code in Figure 2
handles such offset dependencies on line 17 by skipping w
additional stages relative to the previous iteration. A similar stage-skipping
trick is used on line 26 to ensure that the processing of a
P-frame in iteration i depends only on the processing of the previous
I- or P-frame, and not on the processing of preceding B-frames.
Figure 3: The pipeline dag generated for x264. Each iteration processes
either an I- or P-frame, each consisting of s rows. As the iteration index i
increases, the number of initial stages skipped in the iteration also increases.
This stage skipping produces cross edges into an iteration i from null nodes
in iteration i − 1. Null nodes are represented as the intersection between
two edges.
Figure 3 illustrates the pipeline dag corresponding to the execution
of the code in Figure 2, assuming that w = 1. Skipping stages shifts
the nodes of an iteration down, adding null nodes to the pipeline,
which do not increase the work or span.
4. COMPUTATION-DAG MODEL
Although the pipeline-dag model provides intuition for programmers
to understand the execution of a pipeline program, it is not as
precise as we shall require. For example, a pipeline dag has no real
way of representing nested fork-join or pipeline parallelism within
a node. This section describes how to represent the execution of a
pipeline program as a more refined "computation dag."
Let us first review the notion of a computation dag for ordinary
fork-join Cilk programs [7, 8] without pipeline parallelism.
A fork-join computation dag G = (V,E) represents the execution
of a Cilk program, where the vertices belonging to V are unit-cost
instructions. Edges in E indicate ordering dependencies between
instructions. The normal serial execution of one instruction after
another creates a serial edge from the first instruction to the next.
A cilk_spawn of a function creates two dependency edges ema-
nating from the instruction immediately before the cilk_spawn:
the spawn edge goes to the first instruction of the spawned func-
tion, and the continue edge goes to the first instruction after the
spawned function. A cilk_sync creates a return edge from the
final instruction of each spawned function to the instruction imme-
diately after the cilk_sync (as well as an ordinary serial edge from
the instruction that executed immediately before the cilk_sync).
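As a tiny illustration (ours; f and g stand for arbitrary functions), consider the fragment

    cilk_spawn f();
    g();         // the continuation, which may run in parallel with f()
    cilk_sync;   // joins the parallelism forked above

In its computation dag, the instruction before the cilk_spawn has a spawn edge to the first instruction of f() and a continue edge to the first instruction of the continuation g(); the final instruction of f() has a return edge to the instruction immediately after the cilk_sync.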
To model an arbitrary pipeline-program execution as a (pipeline)
computation dag, we follow a three-step process. First, we trans-
late the code executed in each pipe_while loop into ordinary Cilk
code augmented with special functions to handle cross and throt-
tling dependencies. Second, we model the execution of this aug-
mented Cilk program as a fork-join computation dag. Third, we
show how to augment the fork-join computation dag with cross and
throttling edges using the special functions.
The first step of this process does not reflect how a Cilk-P compiler
would actually compile a pipe_while loop. Indeed, such
a code transformation is impossible for a compiler, because the
boundaries of nodes are determined on the fly. Instead, this code-transformation
step is simply a theoretical construct for the purpose
of describing how the PIPER algorithm works in a way that can be
analyzed.

 1 int fd_out = open_output_file();
 2 bool done = false;
 3 pipe_while (!done) {
 4   chunk_t *chunk = get_next_chunk();
 5   if (chunk == NULL) {
 6     done = true;
 7   } else {
 8     pipe_wait(1);
 9     bool isDuplicate = deduplicate(chunk);
10     pipe_continue(2);
11     if (!isDuplicate)
12       compress(chunk);
13     pipe_wait(3);
14     write_to_file(fd_out, chunk);
15   }
16 }
Figure 4: Cilk-P pseudocode for the parallelization of the dedup compression
program as an SSPS pipeline.
We shall illustrate this three-step process on a Cilk-P implementation
of the dedup compression program from PARSEC [4, 5].
The benchmark can be parallelized by using a pipe_while to implement
an SSPS pipeline. Figure 4 shows Cilk-P pseudocode
for dedup, which compresses the provided input file by removing
duplicated "chunks," as follows. Stage 0 (lines 4–6) of the program
reads data from the input file and breaks the data into chunks
(line 4). As part of Stage 0, it also checks the loop-termination
condition and sets the done flag to true (line 6) if the end of the
input file is reached. If there is more input to be processed, the
program begins Stage 1, which calculates the SHA1 signature of a
given chunk and queries a hash table whether this chunk has been
seen, using the SHA1 signature as key (line 9). Stage 1 is a serial
stage, as dictated by the pipe_wait on line 8. Stage 2, which the
pipe_continue on line 10 indicates is a parallel stage, compresses
the chunk if it has not been seen before (line 12). The final stage, a
serial stage, writes either the compressed chunk or its SHA1 signature
to the output file, depending on whether it is the first time the
chunk has been seen (line 14).
As the first step in building the computation dag for an execution
of this Cilk-P program, we transform the code executed from running
the code in Figure 4 into the ordinary Cilk program shown in
Figure 5. As shown in lines 3–32, a pipe_while is "lifted" using
a C++ lambda function [36, Sec. 11.4] and converted to an ordinary
while loop using the variable i to index iterations. The loop
body executes Stage 0 and spawns off a C++ lambda function that
executes the remainder of the iteration (line 12). As one can see
from this transformation, Stage 0 of a pipe_while loop is always a
serial stage, and the test condition of the pipe_while loop is considered
part of Stage 0. These constraints guarantee that the repeated
tests of the pipe_while loop-termination condition execute serially.
Each stage ends with a cilk_sync (lines 10, 16, 21, and 25).
The last statement in the loop (line 29) is a call to a special function
throttle, which implements the throttling dependency. The
cilk_sync immediately after the end of the while loop (line 31)
waits for completion of the spawned iterations.

The second step models the execution of this Cilk program as a
fork-join computation dag G = (V, E) as in [7, 8].

The third step is to produce the final pipeline computation
dag by augmenting the fork-join computation dag with cross
and throttling edges based on the special functions pipe_wait,
pipe_continue, and throttle.
 1 int fd_out = open_output_file();
 2 bool done = false;
 3 [&]() {
 4   int i = 0;  // iteration index
 5   while (!done) {  // pipe_while
 6     chunk_t *chunk = get_next_chunk();
 7     if (chunk == NULL) {
 8       done = true;
 9     } else {
10       cilk_sync;
11       // Additional stages of iteration i
12       cilk_spawn [i, chunk, fd_out]() {
13         pipe_wait(1);
14         // node (i,1) begins
15         bool isDuplicate = deduplicate(chunk);
16         cilk_sync;
17         pipe_continue(2);
18         // node (i,2) begins
19         if (!isDuplicate)
20           compress(chunk);
21         cilk_sync;
22         pipe_wait(3);
23         // node (i,3) begins
24         write_to_file(fd_out, chunk);
25         cilk_sync;
26       }();
27     }
28     i++;
29     throttle(i - K);
30   }
31   cilk_sync;
32 }();
Figure 5: The Cilk Plus pseudocode that results from transforming
the execution of the Cilk-P dedup implementation from Figure 4 into
fork-join code augmented by dependencies indicated by the pipe_wait,
pipe_continue, and throttle special functions. The unbound variable K
is the throttling limit.
For example, when iteration i executes the pipe_wait call in line 22,
it specifies the start of node (i, 3) and adds a cross edge from the last
instruction of node (i − 1, 3) to the first instruction of node (i, 3). If
node (i − 1, 3) is a null node, then the cross edge goes from the last
instruction of the last real node in iteration i − 1 before (i − 1, 3).
This "collapsing" of null nodes may cause multiple cross edges to be
generated from a single vertex in iteration i − 1 to different vertices
in iteration i. The pipe_continue call in line 17 simply indicates
the start of node (i, 2). The throttle call in line 29 changes the
normal return edge from the last instruction in iteration i − K (the
return represented by the closing brace in line 26) into a throttling
edge. Rather than going to the cilk_sync in line 31 as the return
would, the edge is redirected to the invocation of throttle in iteration i.
5. THE PIPER SCHEDULER
PIPER executes a pipeline program on a set of P workers using
work-stealing. For the most part, PIPER's execution model can be
viewed as a modification of the scheduler described by Arora, Blumofe,
and Plaxton [2] (henceforth referred to as the ABP model)
for computation dags arising from pipeline programs. PIPER deviates
from the ABP model in one significant way, however, in that it
performs a "tail-swap" operation.
We describe the operation of PIPER in terms of the pipeline computation
dag G = (V, E). Each worker p in PIPER maintains an assigned
vertex corresponding to the instruction that p executes on
the current time step. We say that a vertex x is ready if all its predecessors
have been executed. Executing an assigned vertex v may
enable a vertex x that is a direct successor of v in G by making
x ready. Each worker maintains a deque of ready vertices. Normally,
a worker pushes and pops vertices from the tail of its deque.
A "thief," however, may try to steal a vertex from the head of another
worker's deque. It is convenient to define the extended deque
⟨v_0, v_1, ..., v_r⟩ of a worker p, where v_0 ∈ V is p's assigned vertex
and v_1, v_2, ..., v_r ∈ V are the vertices in p's deque in order from tail
to head.
On each time step, each PIPER worker p follows a few simple
rules for execution based on the type of p's assigned vertex
v and how many direct successors are enabled by the execution
of v, which is at most 2. (Although v may have multiple immediate
successors in the next iteration due to the collapsing of null nodes,
executing v can enable at most one such vertex, since the stages in
the next iteration execute serially.) We assume that the rules are
executed atomically.

If the assigned vertex v is not the last vertex of an iteration and
its execution enables only one direct successor x, then p simply
changes its assigned vertex from v to x. Executing a vertex v can
enable two successors if v spawns a child x with continuation y or
if v is the last vertex in a node in iteration i, which enables both the
first vertex x of the next node in i and the first vertex y in a node in
iteration i + 1. In either case, p pushes y onto the tail of its deque
and changes its assigned vertex from v to x. If executing v enables
no successors, then p tries to pop an element z from the tail of its
deque, changing its assigned vertex from v to z. If p's deque is
empty, p becomes a thief. As a thief, p randomly picks another
worker to be its victim, tries to steal the vertex z at the head of the
victim's deque, and sets the assigned vertex of p to z if successful.
These cases are consistent with the normal ABP model.
PIPER handles the end of an iteration differently, however, due
to throttling edges. Suppose that a worker p has an assigned vertex
v representing the last vertex in a given iteration in a given
pipe_while loop, and suppose that the edge leaving v is a throttling
edge to a vertex z. When p executes v, two cases are possible.
In the first case, executing v does not enable z, in which case no new
vertices are enabled, and p acts accordingly. In the second case,
however, executing v does enable z, in which case p performs two
actions. First, p changes its assigned vertex from v to z. Second, if
p has a nonempty deque, then p performs a tail swap: it exchanges
its assigned vertex z with the vertex at the tail of its deque.

This tail-swap operation is designed to reduce PIPER's space usage.
Without the tail swap, in a normal ABP-style execution, when
a worker p finishes an iteration i that enables a vertex via a throttling
edge, p would conceptually choose to start a new iteration
i + K, even if iteration i + 1 were already suspended and on its
deque. With the tail swap, p resumes iteration i + 1, leaving i + K
available for stealing. The tail swap also enhances cache locality
by encouraging p to execute consecutive iterations.
It may seem, at first glance, that a tail-swap operation might significantly
reduce the parallelism, since the vertex z enabled by the
throttling edge is pushed onto the bottom of the deque. Intuitively,
if there were additional work above z in the deque, then a tail swap
could significantly delay the start of iteration i + K. Lemma 4 will
show, however, that a tail-swap operation only occurs on deques
with exactly 1 element. Thus, whenever a tail swap occurs, z is at
the top of the deque and is immediately available to be stolen.
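The following pseudocode (ours, not Cilk-P's actual implementation) restates these rules as a single scheduling step; Worker, Vertex, execute, enabled_by_throttling_edge, and steal_from_random_victim are hypothetical helpers, and we assume, as above, that each step executes atomically.

    // One PIPER scheduling step for worker p (a pseudocode sketch).
    void piper_step(Worker *p) {
      Vertex *v = p->assigned;
      std::vector<Vertex *> enabled = execute(v);   // at most 2 vertices become ready
      if (enabled.size() == 2) {
        // v spawned a child, or v ended a node and enabled both the next node in
        // this iteration (enabled[0]) and a node in the next iteration (enabled[1]).
        p->deque.push_tail(enabled[1]);
        p->assigned = enabled[0];
      } else if (enabled.size() == 1) {
        Vertex *z = enabled[0];
        if (enabled_by_throttling_edge(z) && !p->deque.empty()) {
          // Tail swap: resume the iteration waiting on the deque and leave the
          // newly enabled vertex exposed for stealing (by Lemma 4 the deque
          // holds exactly one vertex in this case).
          p->assigned = p->deque.pop_tail();
          p->deque.push_tail(z);
        } else {
          p->assigned = z;
        }
      } else {                                       // nothing enabled
        if (!p->deque.empty())
          p->assigned = p->deque.pop_tail();
        else
          p->assigned = steal_from_random_victim(p); // may fail; retried next step
      }
    }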
6. STRUCTURAL INVARIANTS
During the execution of a pipeline program by PIPER, the worker
deques satisfy two structural invariants, called the “contour” prop-
erty and the “depth” property. This section states and proves these
invariants.
Intuitively, we would like to describe the structure of the worker
deques in terms of frames, the activation records holding functions'
local variables, since the deques implement a "cactus stack" [18, 23].
A pipe_while loop would correspond to a parent frame with a
spawned child for each iteration. Although the actual Cilk-P implementation
manages frames in this fashion, the control of a
pipe_while really does follow the schema illustrated in Figure 5,
where Stage 0 of an iteration i executes in the same lambda function
as the parent, rather than in the child lambda function which
contains the rest of i. Consequently, we introduce "contours" to
represent this structure.
Consider a computation dag G = (V, E) that arises from the execution
of a pipeline program. A contour is a path in G composed
only of serial and continue edges. A contour must be a path, because
there can be at most one serial or continue edge entering or
leaving any vertex. We call the first vertex of a contour the root
of the contour, which (except for the initial instruction of the entire
computation) is the only vertex in the contour that has an incoming
spawn edge. Consequently, contours can be organized into a tree
hierarchy, where one contour is a parent of a second if the first contour
contains a vertex that spawns the root of the second. Given a
vertex v ∈ V, let c(v) denote the contour to which v belongs.
The following two lemmas describe two important properties exhibited
in the execution of a pipeline program.
LEMMA 1. Only one vertex in a contour can belong to any extended
deque at any time.

PROOF. The vertices in a contour form a chain and are, therefore,
enabled serially.

The structure of a pipe_while guarantees that the "top-level"
vertices of each iteration correspond to a contour, and that all iterations
of the pipe_while share a common parent in the contour
tree. These properties lead to the following lemma.

LEMMA 2. If an edge (x, y) is a cross edge, then c(x) and c(y)
are siblings in the contour tree and correspond to adjacent iterations
in a pipe_while loop. If an edge (x, y) is a throttling edge,
then c(y) is the parent of c(x) in the contour tree.
As PIPER executes a pipeline program, the deques of workers
are highly structured with respect to contours.
DEFINITION 3. At any time during an execution of a pipeline
program which produces a computation dag G = (V, E), consider
the extended deque ⟨v_0, v_1, ..., v_r⟩ of a worker p. This deque satisfies
the contour property if for all k = 0, 1, ..., r − 1, one of the
following two conditions holds:
1. c(v_{k+1}) is the parent of c(v_k).
2. The root of c(v_k) is the start of some iteration i, the root of
c(v_{k+1}) is the start of the next iteration i + 1, and if k + 2 ≤ r,
then c(v_{k+2}) is the common parent of both c(v_k) and c(v_{k+1}).
Contours allow us to prove an important property of the tail-swap
operation.

LEMMA 4. At any time during an execution of a pipeline program
which produces a computation dag G = (V, E), suppose that
worker p enables a vertex x via a throttling edge as a result of executing
its assigned vertex v_0. If p's deque satisfies the contour
property (Definition 3), then either
1. p's deque is empty and x becomes p's new assigned vertex, or
2. p's deque contains a single vertex v_1, which becomes p's new
assigned vertex, and x is pushed onto p's deque.

PROOF. Because x is enabled by a throttling edge, v_0 must be
the last node of some iteration i, which by Lemma 2 means that c(x)
is the parent of c(v_0). Because x is just being enabled, Lemma 1 implies
that no other vertex in c(x) can belong to p's deque. If p's extended
deque ⟨v_0, v_1, ..., v_r⟩ contained r ≥ 2 vertices, then by the contour
property either v_1 or v_2 would belong to contour c(x), which Lemma 1
rules out, and hence r = 0 or r = 1. If r = 0, then x becomes p's assigned
vertex. If r = 1, then the root of c(v_1) is the start of iteration i + 1.
Since x is enabled by a throttling edge, a tail swap occurs, making
v_1 the assigned vertex of p and putting x onto p's deque.
To analyze the time required for PIPER to execute a computation
dag G = (V, E), define the enabling tree G_T = (V, E_T) as the tree
containing an edge (x, y) ∈ E_T if x is the last predecessor of y to
execute. The enabling depth d(x) of x ∈ V is the depth of x in the
enabling tree G_T.

DEFINITION 5. At any time during an execution of a pipeline
program which produces a computation dag G = (V, E), consider
the extended deque ⟨v_0, v_1, ..., v_r⟩ of a worker p. The deque satisfies
the depth property if the following conditions hold:
1. For k = 1, 2, ..., r − 1, we have d(v_{k−1}) ≥ d(v_k).
2. For k = r, we have d(v_{k−1}) ≥ d(v_k) or v_k has an incoming throttling
edge.
3. The inequalities are strict for k > 1.
THEOREM 6. At all times during an execution of a pipeline
program by PIPER, all deques satisfy the contour and depth properties
(Definitions 3 and 5).

PROOF. The proof is similar to the inductive proof of Lemma 3
from [2]. Intuitively, we replace the "designated parents" discussed
in [2] with contours, which exhibit similar parent-child relationships.
Although most of the proof follows from this substitution,
we address the two most salient differences. The other minor cases
are either straightforward or similar to these two cases.

First, we consider the consequences of the tail-swap operation,
which may occur if the assigned vertex v_0 is the end of an iteration
and executing v_0 enables a vertex z via a throttling edge. Lemma 4
describes the structure of a worker p's extended deque in this case,
and in particular, states that the deque contains at most 1 vertex. If
r = 0, the deque is empty and the properties hold vacuously. Otherwise,
r = 1 and the deque contains one element v_1, in which case
the tail-swap operation assigns v_1 to p and puts z into p's deque.
The contour property holds, because c(z) is the parent of c(v_1).
The depth property holds, because z is enabled by a throttling edge.
Second, we must show that the contour and depth properties hold
when a worker p's assigned vertex v_0 belongs to some iteration i of
a pipe_while, and executing v_0 enables a vertex y belonging to
iteration i + 1 via a cross edge (v_0, y). Assume that executing v_0
also enables x, where x also belongs to iteration i. (The case where
x is not enabled is similar.) Since both v_0 and x belong to the same
iteration, c(v_0) = c(x), and by Lemma 2, c(x) is a sibling of c(y)
in the contour tree. Suppose that before v_0 executes, p's extended
deque is ⟨v_0, v_1, ..., v_r⟩, and thus after v_0 executes, p's extended
deque is ⟨x, y, v_1, ..., v_r⟩. For vertices v_2, v_3, ..., v_r, if they exist, the
conditions of the contour property continue to hold by induction.
Since c(x) and c(y) are adjacent siblings in the contour tree, we
need only show that c(v_1), if it exists, is their parent. But if c(v_1) is
not the parent of c(v_0) = c(x), then by induction it must be that c(x)
and c(v_1) are adjacent siblings. In this case c(v_1) = c(y), which is
impossible by Lemma 1. The depth property holds because d(x) =
d(y) = d(v_0) + 1 ≥ d(v_1) + 1 > d(v_1).
7. TIME ANALYSIS OF PIPER
This section bounds the completion time for PIPER, showing
that PIPER executes pipeline programs asymptotically efficiently.
Specifically, suppose that a pipeline program produces a computation
dag G = (V, E) with work T_1 and span T_∞ when executed
by PIPER on P processors. We show that for any ε > 0, the running
time is T_P ≤ T_1/P + O(T_∞ + lg P + lg(1/ε)) with probability
at least 1 − ε, which implies that the expected running time is
E[T_P] ≤ T_1/P + O(T_∞ + lg P). This bound is comparable to the
work-stealing bound for fork-join dags originally proved in [8].

We adapt the potential-function argument of Arora, Blumofe,
and Plaxton [2]. PIPER executes computation dags in a style similar
to their work-stealing scheduler, except for tail swapping. Although
Arora et al. ignore the issue of memory contention, we handle it using
the "recycling game" analysis from [8], which contributes the
additive O(lg P) term to the bounds.
As in [2], the crux of the proof is to bound the number of steal
attempts performed during the execution of a computation dag G in
terms of its span T_∞. We measure progress through the computation
by defining a potential function for a vertex in the computation
dag based on its depth in the enabling tree. Consider a particular
execution of a computation dag G = (V, E) by PIPER. For that execution,
we define the weight of a vertex v as w(v) = T_∞ − d(v), and
we define the potential of vertex v at a given time as

    φ(v) = 3^{2w(v)−1} if v is assigned ,
    φ(v) = 3^{2w(v)}   otherwise .

We define the potential of a worker p's extended deque
⟨v_0, v_1, ..., v_r⟩ as φ(p) = Σ_{k=0}^{r} φ(v_k).
Given this potential function, the proof of the time bound follows
the same overall structure as the proof in [2]. We sketch the proof.
First, we prove two properties of worker deques involving the
potential function.
LEMMA 7. At any time during an execution of a pipeline program
which produces a computation dag G = (V, E), the extended
deque ⟨v_0, v_1, ..., v_r⟩ of every worker p satisfies the following:
1. φ(v_r) + φ(v_{r−1}) ≥ 3φ(p)/4.
2. Let φ′ denote the potential after p executes v_0. Then we have
φ(p) − φ′(p) = 2(φ(v_0) + φ(v_1))/3 if p performs a tail swap,
and φ(p) − φ′(p) ≥ 5φ(v_0)/9 otherwise.

PROOF. Property 1 follows from the depth property of Theorem 6.
Property 2 follows from Lemma 4, if p performs a tail swap,
and the analysis in [2] otherwise.
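As a quick sanity check of the tail-swap equality in Property 2 (our arithmetic, not spelled out in the paper): by Lemma 4 the deque holds exactly one vertex v_1, the vertex z enabled by the throttling edge satisfies d(z) = d(v_0) + 1 and hence w(z) = w(v_0) − 1, and after the swap v_1 is assigned while z sits on the deque. Therefore

\[
\phi(p) - \phi'(p)
  = \bigl(3^{2w(v_0)-1} + 3^{2w(v_1)}\bigr) - \bigl(3^{2w(v_0)-2} + 3^{2w(v_1)-1}\bigr)
  = \tfrac{2}{3}\phi(v_0) + \tfrac{2}{3}\phi(v_1),
\]

which is the claimed drop of 2(φ(v_0) + φ(v_1))/3.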
As in [2], we analyze the behavior of workers randomly stealing
from each other using a balls-and-weighted-bins analog. We want
to analyze the case where the top 2 elements are stolen out of any
deque, however, not just the top element. To address this case, we
modify Lemma 7 of [2] to consider the probability that 2 out of 2P
balls land in the same bin.
LEMMA 8. Consider P bins, where for p = 1, 2, ..., P, bin p
has weight W_p. Suppose that 2P balls are thrown independently
and uniformly at random into the P bins. For bin p, define the
random variable X_p as

    X_p = W_p if at least 2 balls land in bin p ,
    X_p = 0   otherwise .

Let W = Σ_{p=1}^{P} W_p and X = Σ_{p=1}^{P} X_p. For any β in the range
0 < β < 1, we have Pr{X ≥ βW} > 1 − 3/((1 − β)e²).

PROOF. For each bin p, consider the random variable W_p − X_p.
It takes on the value W_p when 0 or 1 ball lands in bin p, and otherwise
it is 0. Thus, we have

    E[W_p − X_p] = W_p ((1 − 1/P)^{2P} + 2P (1 − 1/P)^{2P−1} (1/P))
                 = W_p (1 − 1/P)^{2P} (3P − 1)/(P − 1) .

Since (1 − 1/P)^P approaches 1/e and (3P − 1)/(P − 1) approaches
3, we have lim_{P→∞} E[W_p − X_p] = 3W_p/e². In fact, one can show
that E[W_p − X_p] is monotonically increasing, approaching the limit
from below, and thus E[W − X] ≤ 3W/e². By Markov's inequality,
we have Pr{(W − X) > (1 − β)W} < E[W − X]/((1 − β)W),
from which we conclude that Pr{X < βW} < 3/((1 − β)e²).
To use Lemma 8 to analyze PIPER, we divide the time steps of
the execution of G into a sequence of rounds, where each round
(except the first, which starts at time 0) starts at the time step after
the previous round ends and continues until the first time step such
that at least 2P steal attempts, and hence fewer than 3P steal attempts,
occur within the round. The following lemma shows that
a constant fraction of the total potential in all deques is lost in each
round, thereby demonstrating progress.
LEMMA 9. Consider a pipeline program executed by PIPER on
P processors. Suppose that a round starts at time step t and finishes
at time step t′. Let φ denote the potential at time t, let φ′ denote the
potential at time t′, let Φ = Σ_{p=1}^{P} φ(p), and let Φ′ = Σ_{p=1}^{P} φ′(p).
Then we have Pr{Φ − Φ′ ≥ Φ/4} > 1 − 6/e².
PROOF. We first show that stealing twice from a worker p's
deque contributes a potential drop of at least φ(p)/2. The proof
follows a similar case analysis to that in the proof of Lemma 8
in [2], with two main differences. First, we use the two properties
of φ in Lemma 7. Second, we must consider the case unique to
PIPER, where p performs a tail swap after executing its assigned
vertex v_0. In this case, p's deque contains a single ready vertex v_1,
and p may perform a tail swap if executing v_0 enables a vertex
via an outgoing throttling edge. If so, however, then by Lemma 7,
the potential drops by at least 2(φ(v_0) + φ(v_1))/3 > φ(p)/2, since
φ(p) = φ(v_0) + φ(v_1).

Now, suppose that we assign each worker p a weight of W_p =
φ(p)/2. These weights W_p sum to W = Φ/2. If we think of
steal attempts as ball tosses, then the random variable X from
Lemma 8 bounds from below the potential decrease due to actions
on p's deque. Specifically, if at least 2 steal attempts target
p's deque in a round (which corresponds conceptually to at
least 2 balls landing in bin p), then the potential drops by at
least W_p. Moreover, X is a lower bound on the potential decrease
within the round, i.e., X ≤ Φ − Φ′. By Lemma 8 with β = 1/2, we have
Pr{X ≥ W/2} > 1 − 6/e². Substituting for X and W, we conclude
that Pr{(Φ − Φ′) ≥ Φ/4} > 1 − 6/e².
We are now ready to prove the completion-time bound.

THEOREM 10. Consider an execution of a pipeline program by
PIPER on P processors which produces a computation dag with
work T_1 and span T_∞. For any ε > 0, the running time is T_P ≤
T_1/P + O(T_∞ + lg P + lg(1/ε)) with probability at least 1 − ε.
PROOF. On every time step, consider each worker as placing a
token in a bucket depending on its action. If a worker p executes an
assigned vertex, p places a token in the work bucket. Otherwise, p
is a thief and places a token in the steal bucket. There are exactly
T_1 tokens in the work bucket at the end of the computation. The
interesting part is bounding the size of the steal bucket.

Divide the execution of G into rounds. Recall that each round
contains at least 2P and less than 3P steal attempts. Call a round
successful if after that round finishes, the potential drops by at
least a 1/4 fraction. From Lemma 9, a round is successful with
probability at least 1 − 6/e² ≥ 1/6. Since the potential starts at
Φ_0 = 3^{2T_∞−1}, ends at 0, and is always an integer, the number of
successful rounds is at most (2T_∞ − 1) log_{4/3} 3 < 8T_∞. Consequently,
the expected number of rounds needed to obtain 8T_∞ successful
rounds is at most 48T_∞, and the expected number of tokens
in the steal bucket is therefore at most 3P · 48T_∞ = 144PT_∞.
For the high-probability bound, suppose that the execution takes
n = 48T_∞ + m rounds. Because each round succeeds with probability
at least p = 1/6, the expected number of successes is at least
np = 8T_∞ + m/6. We now compute the probability that the number
X of successes is less than 8T_∞. As in [2], we use the Chernoff
bound Pr{X < np − a} < e^{−a²/2np}, with a = m/6. Choosing
m = 48T_∞ + 21 ln(1/ε), we have

    Pr{X < 8T_∞} < e^{−(m/6)²/(16T_∞ + m/3)} < e^{−(m/6)²/(m/4 + m/3)} = e^{−m/21} ≤ ε .

Hence, the probability that the execution takes n = 96T_∞ +
21 ln(1/ε) rounds or more is less than ε, and the number of tokens
in the steal bucket is at most 288PT_∞ + 63P ln(1/ε).
The additional lg P term comes from the "recycling game" analysis
described in [8], which bounds any delay that might be incurred
when multiple processors try to access the same deque in the same
time step in randomized work-stealing.
8. SPACE ANALYSIS OF PIPER
This section derives bounds on the stack space required by PIPER
by extending the bounds in [8] for fully strict fork-join parallelism
to include pipeline parallelism. We show that PIPER on P processors
uses S_P ≤ P(S_1 + f DK) stack space for pipeline iterations,
where S_1 is the serial stack space, f is the "frame size," D is the
depth of nested linear pipelines, and K is the throttling limit.
To model PIPER's usage of stack space, we partition the vertices
of the computation dag G of the pipeline program into a tree of
contours, as described in Section 6. Each contour in this partition
is rooted at the start of a spawned subcomputation. The control for
each pipe_while loop, which corresponds to a while loop as in
line 5 of Figure 5, belongs to some contour in the contour tree with
its iterations as children. Define the pipe nesting depth D of G as
the maximum number of pipe_while contours on any path from
leaf to root in the contour tree.

We assume that every contour c of G has an associated frame
size representing the stack space consumed by c while it or any of
its descendant contours are executing. The space used by PIPER
on any time step is the sum of frame sizes of all contours c which
are either (1) associated with a vertex in some worker's extended
deque, or (2) suspended, meaning that the earliest unexecuted vertex
in the contour is not ready. Let S_P denote the maximum over all
time steps of the stack space used by PIPER during a P-worker execution
of G. Thus, S_1 is the stack space used by PIPER for a serial
execution. We now generalize the space bound S_P ≤ PS_1 from [8],
which deals only with fork-join parallelism, to pipeline programs.
THEOREM 11. Consider a pipeline program with pipe nesting
depth D executed on P processors by PIPER with throttling limit K.
The execution requires S_P ≤ P(S_1 + f DK) stack space, where f is
the maximum frame size of any contour of any pipe_while iteration
and S_1 is the serial stack space.
PROOF. We show that except for suspended contours that are pipe_while iterations, PIPER still satisfies the “busy-leaves property” [8]. More precisely, at any point during the execution, in the tree of active and suspended contours, each leaf contour either (1) is currently executing on some worker, or (2) is a suspended pipe_while iteration with a sibling iteration that is currently executing on some worker. In fact, one can show that for any pipe_while loop, the contour for the leftmost (smallest) iteration that has not completed is either active or has an active descendant in the contour tree. The bound of PS1 covers the space used by all contours that fall into Case (1).
To bound the space used by contours from Case (2), observe that any pipe_while loop uses at most fK space for iteration contours, since the throttling edge from the leftmost active iteration precludes having more than K active or suspended iterations in any one pipe_while loop. Thus, each worker p has at most DK iteration contours among the pipe_while loops that are ancestors of the contour containing p’s assigned vertex, occupying at most fDK space. Summing the space used over all workers gives PfDK additional stack-space usage.
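For intuition, consider a quick instantiation of this bound with illustrative values (our own example, not a measurement from the paper): an un-nested pipe_while (D = 1) executed by P = 16 workers with the throttling limit K = 4P = 64 used for dedup and x264 in Section 10. Then

    SP ≤ P(S1 + fDK) = 16(S1 + f · 1 · 64) = 16 S1 + 1024 f ,

so beyond the PS1 term inherited from fork-join work stealing [8], each worker accounts for at most DK = 64 extra iteration frames, each of size at most f.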
9. CILK-P RUNTIME DESIGN
This section describes the Cilk-P implementation of the PIPER
scheduler. We first introduce the data structures Cilk-P uses to
implement a pipe_while loop. Then we describe the two main
optimizations that the Cilk-P runtime exploits: lazy enabling and
dependency folding.
Data structures
Like the Cilk-M runtime [23] on which it is based, Cilk-P organizes runtime data into frames. Cilk-P executes a pipe_while loop in its own function, whose frame, called a control frame, handles the spawning and throttling of iterations. Furthermore, each iteration of a pipe_while loop executes as an independent child function, with its own iteration frame. This frame structure is similar to that of an ordinary while loop in Cilk-M, where each iteration spawns a function to execute the loop body. Cross and throttling edges, however, may cause the iteration and control frames to suspend.
Cilk-P’s runtime employs a simple mechanism to track progress of an iteration i. The frame of iteration i maintains a stage counter, which stores the stage number of the current node in i, and a status field, which indicates whether i is suspended due to an unsatisfied cross edge. Because executed nodes in an iteration i have strictly increasing stage numbers, checking whether a cross edge into iteration i is satisfied amounts to comparing the stage counters of iterations i and i − 1. Any iteration frame that is not suspended corresponds to either a currently executing or a completed iteration.
Cilk-P implements throttling using a join counter in the control frame. Normally in Cilk-M, a frame’s join counter simply stores the number of active child frames. Cilk-P also uses the join counter to limit the number of active iteration frames in a pipe_while loop to the throttling limit K. Starting an iteration increments the join counter, while returning from an iteration decrements it. If a worker tries to start a new iteration when the control frame’s join counter is K, the control frame suspends until a child iteration returns.
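To make these descriptions concrete, the following C-style sketch shows one plausible layout for the iteration and control frames together with the two checks just described. It is only a sketch under our own naming assumptions: the types, fields, and helper functions are hypothetical and are not the actual Cilk-P runtime interface.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical sketch of Cilk-P-style frames; all names are illustrative. */
    typedef struct iter_frame {
        atomic_int stage;           /* stage number of the current node in this iteration */
        atomic_bool suspended;      /* set when blocked on an unsatisfied cross edge      */
        struct iter_frame *left;    /* frame of iteration i-1, the source of cross edges  */
    } iter_frame;

    typedef struct control_frame {
        atomic_int join_counter;    /* number of active child iteration frames            */
        int        throttle_limit;  /* K: bound on simultaneously active iterations       */
    } control_frame;

    /* A cross edge into stage j of iteration i is satisfied once iteration i-1 has
     * advanced far enough; whether the test is >= or > depends on whether the stage
     * counter is advanced before or after a node executes. */
    static bool cross_edge_satisfied(const iter_frame *it, int j)
    {
        return atomic_load(&it->left->stage) >= j;
    }

    /* Throttling: a new iteration may start only while fewer than K children are active. */
    static bool may_start_iteration(control_frame *cf)
    {
        return atomic_load(&cf->join_counter) < cf->throttle_limit;
    }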
Using these data structures, one could implement PIPER directly, by pushing and popping the appropriate frames onto deques as specified by PIPER’s execution model. In particular, the normal THE protocol [15] could be used for pushing and popping frames from a deque, and frame locks could be used to update fields in the frames atomically. Although this approach directly matches the model analyzed in Sections 7 and 8, it incurs unnecessary overhead for every node in an iteration. Cilk-P implements lazy enabling and dependency folding to reduce this overhead.
Lazy enabling
In the PIPER algorithm, when a worker p finishes executing a node in iteration i, it may enable an instruction in iteration i + 1, in which case p pushes this instruction onto its deque. To implement this behavior, intuitively, p must “check right,” that is, read the stage counter and status of iteration i + 1, whenever it finishes executing a node. The work to check right at the end of every node could amount to substantial overhead in a pipeline with fine-grained stages.
Lazy enabling allows p’s execution of an iteration i to defer the check-right operation, as well as avoid any operations on its deque involving iteration i + 1. Conceptually, when p enables work in iteration i + 1, this work is kept on p’s deque implicitly. When a thief p′ tries to steal iteration i’s frame from p’s deque, p′ first checks right on behalf of p to see whether any work from iteration i + 1 is implicitly on the deque. If so, p′ resumes iteration i + 1 as if it had found it on p’s deque. In a similar vein, the Cilk-P runtime system also uses lazy enabling to optimize the check-parent operation: the enabling of a control frame suspended due to throttling.
Lazy enabling requires p to behave differently when it completes an iteration. When p finishes iteration i, it first checks right, and if that check fails (i.e., iteration i + 1 need not be resumed), it checks its parent. It turns out that these checks find work only if p’s deque is empty. Therefore, p can avoid performing these checks at the end of an iteration if its deque is not empty.
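The completion-time protocol can then be summarized by the following sketch. Again, this is a minimal sketch under our own naming assumptions: every type and helper below is a placeholder for whatever the Cilk-P runtime actually provides, and the join-counter handling is simplified.

    #include <stdbool.h>

    /* Hypothetical sketch of the lazy-enabling checks a worker performs when it
     * finishes an iteration; all types and helpers are placeholders. */
    typedef struct worker        worker;
    typedef struct iter_frame    iter_frame;
    typedef struct control_frame control_frame;

    extern bool        deque_is_empty(worker *p);
    extern iter_frame *right_sibling(iter_frame *it);      /* iteration i + 1, if any  */
    extern bool        is_suspended(iter_frame *f);
    extern bool        cross_edges_satisfied(iter_frame *f);
    extern bool        throttled(control_frame *cf);       /* suspended on the K limit */
    extern void        decrement_join_counter(control_frame *cf);
    extern void        resume_iteration(worker *p, iter_frame *f);
    extern void        resume_control(worker *p, control_frame *cf);

    void on_iteration_complete(worker *p, iter_frame *it, control_frame *parent)
    {
        decrement_join_counter(parent);   /* returning frees a throttling slot          */
        if (!deque_is_empty(p))
            return;                       /* work-first: the checks find work only when
                                             the deque is empty, so skip them otherwise */

        /* "Check right": was work in iteration i + 1 enabled implicitly on p's deque? */
        iter_frame *right = right_sibling(it);
        if (right && is_suspended(right) && cross_edges_satisfied(right)) {
            resume_iteration(p, right);
            return;
        }
        /* "Check parent": the control frame may be suspended due to throttling. */
        if (throttled(parent))
            resume_control(p, parent);
    }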
Lazy enabling is an application of the work-first principle [15]:
minimize the scheduling overheads borne by the work of a compu-
tation, and amortize them against the span. Requiring a worker to
check right every time it completes a node adds overhead propor-
tional to the work of the pipe_while in the worst case. With lazy
enabling, the overhead can be amortized against the span of the
computation. For programs with sufficient parallelism, the work
dominates the span, and the overhead becomes negligible.
Dependency folding
In dependency folding, the frame for iteration i stores a cached value of the stage counter of iteration i − 1, hoping to avoid the checking of already satisfied cross edges. In a straightforward implementation of PIPER, before a worker p executes each node in iteration i with an incoming cross edge, it reads the stage counter of iteration i − 1 to see if the cross edge is satisfied. Reading the stage counter of iteration i − 1, however, can be expensive. Besides the work involved, the access may contend with whatever worker p′ is executing iteration i − 1, because p′ may be constantly updating the stage counter of iteration i − 1.
Dependency folding mitigates this overhead by exploiting the fact that an iteration’s stage counter must strictly increase. By caching the most recently read stage-counter value from iteration i − 1, worker p can sometimes avoid reading this stage counter before each node with an incoming cross edge. For instance, once node (i − 1, j) has been executed, all cross edges from nodes (i − 1, 0) through (i − 1, j) are necessarily satisfied. Thus, if p reads j from iteration i − 1’s stage counter, p need not reread the stage counter of iteration i − 1 until it tries to execute a node with an incoming cross edge (i, j′), where j′ > j. This optimization is particularly useful for fine-grained stages that execute quickly.
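A minimal sketch of this caching scheme, again with hypothetical names: the worker keeps a local copy of the last value it read from iteration i − 1’s stage counter and touches the shared counter only when that cached value is too small.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical sketch of dependency folding.  left_stage points at the shared
     * stage counter of iteration i-1; cached is the worker-local copy of the last
     * value read from it.  Names are illustrative, not the Cilk-P implementation. */
    typedef struct {
        const atomic_int *left_stage;   /* stage counter of iteration i - 1 */
        int               cached;       /* last value read from *left_stage */
    } folded_dep;

    /* Called before executing a node (i, j) with an incoming cross edge.  Because
     * stage counters strictly increase, the shared counter need not be reread
     * while cached >= j (same >= versus > caveat as before). */
    static bool cross_edge_ready(folded_dep *d, int j)
    {
        if (d->cached >= j)
            return true;                          /* satisfied: no shared read needed    */
        d->cached = atomic_load(d->left_stage);   /* refresh the cached value            */
        return d->cached >= j;                    /* if still false, the caller suspends */
    }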
10. EVALUATION
This section presents empirical studies of the Cilk-P prototype system. We investigated the performance and scalability of Cilk-P using the three PARSEC [4, 5] benchmarks that we ported, namely ferret, dedup, and x264. The results show that Cilk-P’s implementation of pipeline parallelism has negligible overhead compared to its serial counterpart. We compared the Cilk-P implementations to TBB and Pthreaded implementations of these benchmarks. We found that the Cilk-P and TBB implementations perform comparably, as do the Cilk-P and Pthreaded implementations for ferret and x264. The Pthreaded version of dedup outperforms both Cilk-P and TBB, because the bind-to-element approaches of Cilk-P and TBB produce less parallelism than the Pthreaded bind-to-stage approach. Moreover, the Pthreading approach benefits more from “oversubscription.” We study the effectiveness of dependency folding on a synthetic benchmark called pipe-fib, demonstrating that this optimization can be effective for applications with fine-grained stages.
        Processing Time (TP)        Speedup (TS/TP)           Scalability (T1/TP)
 P     Cilk-P  Pthreads    TBB    Cilk-P  Pthreads    TBB    Cilk-P  Pthreads    TBB
 1      691.2     692.1  690.3      1.00      1.00   1.00      1.00      1.00   1.00
 2      356.7     343.8  351.5      1.94      2.01   1.97      1.94      2.01   1.96
 4      176.5     170.4  175.8      3.92      4.06   3.93      3.92      4.06   3.93
 8       89.1      86.8   89.2      7.76      7.96   7.75      7.76      7.97   7.74
12       60.3      59.1   60.8     11.46     11.70  11.36     11.46     11.71  11.35
16       46.2      46.3   46.9     14.98     14.93  14.75     14.98     14.95  14.74

Figure 6: Performance comparison of the three ferret implementations. The experiments were conducted using native, the largest input data set that comes with the PARSEC benchmark suite.^4 The left-most column shows the number of cores used (P). Subsequent columns show the running time (TP), speedup over serial running time (TS/TP), and scalability (T1/TP) for each system. The throttling limit was K = 10P.

^4 We dropped four out of the 3500 input images from the original native data set, because those images are black-and-white, which triggers an array-index-out-of-bounds error in the image library provided.

        Processing Time (TP)        Speedup (TS/TP)           Scalability (T1/TP)
 P     Cilk-P  Pthreads    TBB    Cilk-P  Pthreads    TBB    Cilk-P  Pthreads    TBB
 1       58.0      51.1   57.3      1.01      1.05   1.00      1.00      1.00   1.00
 2       29.8      23.3   29.4      1.96      2.51   1.99      1.94      2.19   1.95
 4       16.0      12.2   16.2      3.66      4.78   3.61      3.63      4.18   3.54
 8       10.4       8.2   10.3      5.62      7.08   5.70      5.57      6.20   5.58
12        9.0       6.6    9.0      6.53      8.83   6.57      6.47      7.72   6.44
16        8.6       6.0    8.6      6.77      9.72   6.78      6.71      8.50   6.65

Figure 7: Performance comparison of the three dedup implementations. The experiments were conducted using native, the largest input data set that comes with the PARSEC benchmark suite. The column headers are the same as in Figure 6. The throttling limit was K = 4P.

        Encoding Time (TP)      Speedup (TS/TP)      Scalability (T1/TP)
 P       Cilk-P  Pthreads       Cilk-P  Pthreads       Cilk-P  Pthreads
 1        217.1     223.0         1.02      0.99         1.00      1.00
 2         97.0     105.3         2.27      2.09         2.24      2.12
 4         47.7      53.3         4.63      4.14         4.55      4.19
 8         25.9      26.7         8.53      8.27         8.40      8.36
12         18.6      19.3        11.84     11.44        11.66     11.57
16         16.0      16.2        13.87     13.63        13.66     13.76

Figure 8: Performance comparison between the Cilk-P implementation and the Pthreaded implementation of x264 (encoding only). The experiments were conducted using native, the largest input data set that comes with the PARSEC benchmark suite. The column headers are the same as in Figure 6. The throttling limit was K = 4P.
We ran all experiments on an AMD Opteron system with four 2 GHz quad-core CPUs having a total of 8 GBytes of memory.
Each processor core has a 64-KByte private L1-data-cache and a
512-KByte private L2-cache. The 4 cores on each chip share the
same 2-MByte L3-cache. The benchmarks were compiled with
GCC (or G++ for TBB) 4.4.5 using -O3 optimization, except for
x264, which by default comes with -O4.
Performance evaluation on PARSEC benchmarks
We implemented the Cilk-P versions of the three PARSEC benchmarks by hand-compiling the relevant pipe_while loops using techniques similar to those described in [23]. We then compiled the hand-compiled benchmarks with GCC. The ferret and dedup applications can be parallelized as simple pipelines with a fixed number of stages and a static dependency structure. In particular, ferret uses the 3-stage SPS pipeline shown in Figure 1, while dedup uses a 4-stage SSPS pipeline as described in Figure 4.
For the Pthreaded versions, we used the code distributed with
PARSEC. The PARSEC Pthreaded implementations of ferret and
dedup employ the oversubscription method [33], a bind-to-stage
approach that creates more than one thread per pipeline stage and
utilizes the operating system for load balancing. For the Pthreaded implementations, when the user specifies an input parameter of Q, the code creates Q threads per stage, except for the first (input) and last (output) stages, which are serial and use only one thread each. To ensure a fair comparison, for all applications, we ran the Pthreaded implementation using taskset to limit the process to P cores (which corresponds to the number of workers used in Cilk-P and TBB), but experimented to find the best setting for Q.
We used the TBB version of ferret that came with the PARSEC benchmark, and implemented the TBB version of dedup, both using the same strategies as for Cilk-P. TBB’s construct-and-run approach proved inadequate for the on-the-fly nature of x264, however, and indeed, in their study of these three applications, Reed, Chen, and Johnson [33] say, “Implementing x264 in TBB is not impossible, but the TBB pipeline structure is not suitable.” Thus, we had no TBB benchmark for x264 to include in our comparisons.
For each benchmark, we throttled all versions similarly. For Cilk-P, a throttling limit of 4P, where P is the number of cores, seems to work well in general, although since ferret scales slightly better with less throttling, we used a throttling limit of 10P for our experiments. TBB supports a settable parameter that serves the same purpose as Cilk-P’s throttling limit. For the Pthreaded implementations, we throttled the computation by setting a size limit on the queues between stages, although we did not impose a queue-size limit on the last stage of dedup (the default limit is 2^20), since the program deadlocks otherwise.
Figures 6–8 show the performance results for the different imple-
mentations of the three benchmarks. Each data point in the study
was computed by averaging the results of 10 runs. The standard de-
viation of the numbers was typically just a few percent, indicating
that the numbers should be accurate to within better than 10 percent
with high confidence (2 or 3 standard deviations). We suspect that
the superlinear scalability obtained for some measurements is due
to the fact that more L1- and L2-cache is available when running
on multiple cores.
The three tables from Figures 6–8 show that the Cilk-P and TBB
implementations of ferret and dedup are comparable, indicating
that there is no performance penalty incurred by these applica-
tions for using the more general on-the-fly pipeline instead of a
construct-and-run pipeline. Both Cilk-P and TBB execute using a
bind-to-element approach.
The dedup performance results for Cilk-P and TBB are inferior to those for Pthreads, however. The Pthreaded implementa-
tion scales to about 8.5 on 16 cores, whereas Cilk-P and TBB seem
to plateau at around 6.7. There appear to be two reasons for this
discrepancy.
First, the dedup benchmark on the test input has limited paral-
lelism. We modified the Cilkview scalability analyzer [19] to mea-
sure the work and span of our hand-compiled Cilk-P dedup pro-
grams, observing a parallelism of merely 7.4. The bind-to-stage
Pthreaded implementation creates a pipeline with a different struc-
ture from the bind-to-element Cilk-P and TBB versions, which en-
joys slightly more parallelism.
Second, since file I/O is the main performance bottleneck for dedup, the Pthreaded implementation effectively benefits from oversubscription (using more threads than processing cores) and its strategic allocation of threads to stages. Specifically, since
the first and last stages perform file I/O, which is inherently se-
rial, the Pthreaded implementation dedicates one thread to each of
these stages, but dedicates multiple threads to the other compute-
intensive stages. While the writing thread is performing file I/O
(i.e., writing data out to the disk), the OS may deschedule it, allow-
ing the compute-intensive threads to be scheduled. This behavior
explains how the Pthreaded implementation scales by more than a factor of P for P = 2 and 4, even though the computation is restricted to only P cores using taskset. Moreover, when we ran the Pthreaded implementation without throttling on a single core, the computation ran about 20% faster than the serial implementation, which makes sense if computation and file I/O are effectively overlapped. With multiple threads per stage, throttling appears to inhibit threads working on stages that are further ahead, allowing threads working on heavier stages to obtain more processing resources, and thereby balancing the load.

                  Dependency                            Serial      Speedup    Scalability
Program           Folding        TS      T1      T16    Overhead    TS/T16      T1/T16
pipe-fib          no            20.8    22.3     3.8      1.07        5.15        5.82
pipe-fib-256      no            20.8    20.9     1.7      1.01       12.26       12.31
pipe-fib          yes           20.8    21.7     1.8      1.05       11.85       12.40
pipe-fib-256      yes           20.8    20.9     1.7      1.01       12.56       12.62

Figure 9: Performance evaluation using the pipe-fib benchmark. We tested the system with two different programs, the ordinary pipe-fib and pipe-fib-256, which is coarsened. Each program is tested with and without the dependency-folding optimization. For each program for a given setting, we show the running time of its serial counterpart (TS), the running time executing on a single worker (T1) and on 16 workers (T16), its serial overhead, scalability, and the speedup obtained running on 16 workers.
In summary, Cilk-P performs comparably to TBB while admit-
ting more expressive semantics for pipelines. Cilk-P also performs
comparably to the Pthreaded implementations of ferret and x264,
although its bind-to-element strategy seems to suffer on dedup
compared to the bind-to-stage strategy of the Pthreaded implemen-
tation. Despite losing the dedup “bake-off,” Cilk-P’s strategy has
the significant advantage that it allows pipelines to be expressed
as deterministic programs. Determinism greatly reduces the effort
for debugging, release engineering, and maintenance (see, for ex-
ample, [9]) compared with the inherently nondeterministic code re-
quired to set up Pthreaded pipelines.
Evaluation of dependency folding
We also studied the effectiveness of dependency folding. Since the
PARSEC benchmarks are too coarse grained to permit such a study,
we implemented a synthetic benchmark, called pipe-fib, to study this optimization technique. The pipe-fib benchmark computes the nth Fibonacci number Fn in binary. It uses a pipeline algorithm that operates in Θ(n²) work and Θ(n) span. To construct the base case, pipe-fib allocates three arrays of size Θ(n) and initializes the first two arrays with the binary representations of F1 and F2, both of which are 1. To compute F3, pipe-fib performs a ripple-carry addition on the two input arrays and stores the sum into the third output array. To compute Fn, pipe-fib repeats the addition by rotating through the arrays for inputs and output until it reaches Fn. In the pipeline for this computation, each iteration i computes Fi+2, and a stage j within the iteration computes the jth bit of Fi+2. Since the benchmark stops propagating the carry bit as soon as possible, it generates a triangular pipeline dag in which the number of stages increases with iteration number.
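As a concrete illustration of this structure, here is a sketch of the pipe-fib loop written with the pipe_while linguistics (our own reconstruction, not the benchmark’s source). The stage-boundary statement pipe_wait(j) is assumed here: it ends the current stage and begins stage j once the cross edge from stage j of the previous iteration is satisfied, and we assume a wait on stage 0 is permitted or is a no-op.

    /* Reconstruction of the pipe-fib pipeline (illustrative, not the actual
     * benchmark source).  Iteration i computes F_{i+2} in binary; stage j of
     * the iteration computes bit j.  The three arrays are reused in rotation,
     * so bit j of F_{i+1} is produced by stage j of iteration i-1, which is
     * exactly the cross edge that pipe_wait(j) waits on. */
    void pipe_fib(char *arr[3], int n, int nbits)
    {
        /* arr[0] and arr[1] hold the binary representations of F_1 and F_2. */
        int i = 1;
        pipe_while (i <= n - 2) {                /* iteration i computes F_{i+2} */
            char *in1 = arr[(i - 1) % 3];        /* bits of F_i                  */
            char *in2 = arr[ i      % 3];        /* bits of F_{i+1}              */
            char *out = arr[(i + 1) % 3];        /* receives the bits of F_{i+2} */
            char carry = 0;
            for (int j = 0; j < nbits; ++j) {    /* stage j computes bit j       */
                pipe_wait(j);                    /* cross edge from (i-1, j)     */
                char a = in1[j], b = in2[j];
                out[j] = a ^ b ^ carry;
                carry  = (char)((a & b) | (carry & (a | b)));
                /* The real benchmark ends the iteration once the carry can no
                 * longer propagate, producing the triangular dag; a fixed bit
                 * width is used here only to keep the sketch short. */
            }
            ++i;
        }
    }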
Figure 9 shows the performance results^5 obtained by running the ordinary pipe-fib with fine-grained stages, as well as pipe-fib-256, a coarsened version of pipe-fib in which each stage computes 256 bits instead of 1. As the data in the first row show, even though the serial overhead for pipe-fib without coarsening is merely 7%, it fails to scale and exhibits poor speedup. The reason is that checking for dependencies due to cross edges has a relatively high overhead compared to the little work in each fine-grained stage. As the data for pipe-fib-256 in the second row show, coarsening the stages improves both serial overhead and scalability. Ideally, one would like the system to coarsen automatically, which is what dependency folding effectively does.

^5 Figure 9 shows the results from a single run, but these data are representative of other runs with different input sizes.
Figure 10: Sketch of the pathological unthrottled linear pipeline dag, which can be used to prove Theorem 13. Small circles represent nodes with unit work, medium circles represent nodes with T1^(1/3) − 2 work, and large circles represent nodes with T1^(2/3) − 2 work. The number of iterations per cluster is T1^(1/3) + 1, and the total number of iterations is (T1^(2/3) + T1^(1/3))/2.
Further investigation revealed that the time spent checking for
cross edges increases noticeably when the number of workers increases from 1 to 2. It turns out that when iterations are run in parallel, each check for a cross-edge dependency necessarily incurs a true-sharing conflict between the two adjacent active iterations, an
overhead that occurs only during parallel execution. Dependency
folding eliminated much of this overhead for pipe-fib, as shown in
the third row of Figure 9, leading to scalability that exceeds the
coarsened version without the optimization, although a slight price
is still paid in speedup. Employing both optimizations, as shown
in the l ast row of the table, produces the best numbers for both
speedup and scalability.
11. CONCLUSION
What impact does throttling have on theoretical performance?
PIPER relies on throttling to achieve its provable space bound and
avoid runaway pipelines. Ideally, the user should not worry about
throttling, and the system should perform well automatically, and
indeed, PIPER’s throttling of a pipeline computation is encapsu-
lated in Cilk-P’s runtime system. But what price is paid?
We can pose this question theoretically in terms of a pipeline computation G’s unthrottled dag: the dag Ĝ = (V, Ê) with the same vertices and edges as G, except without throttling edges. How does adding throttling edges to an unthrottled dag affect span and parallelism?
The following two theorems provide two partial answers to this question. First, for uniform pipelines, where the cost of a node (i, j) is identical across all iterations i (all stages have the same cost), throttling does not affect the asymptotic performance of PIPER executing Ĝ.
THEOREM 12. Consider a uniform unthrottled linear pipeline Ĝ = (V, Ê) having n iterations and s stages. Suppose that PIPER throttles the execution of Ĝ on P processors using a window size of K = aP, for some constant a > 1. Then PIPER executes Ĝ in time TP ≤ (1 + c/a)T1/P + cT∞ for some sufficiently large constant c, where T1 is the total work in Ĝ and T∞ is the span of Ĝ.
Second, we consider nonuniform pipelines, where the cost of a node (i, j) may vary across iterations. It turns out that nonuniform pipelines can pose performance problems, not only for PIPER, but for any scheduler that throttles the computation. Figure 10 illustrates a pathological nonuniform pipeline for any scheduler that uses throttling. In this dag, T1 work is distributed across (T1^(1/3) + T1^(2/3))/2 iterations such that any T1^(1/3) + 1 consecutive iterations consist of 1 heavy iteration with T1^(2/3) work and T1^(1/3) light iterations of T1^(1/3) work each. Intuitively, achieving a speedup of 3 on this dag requires having at least 1 heavy iteration and Θ(T1^(1/3)) light iterations active simultaneously, which is impossible for any scheduler that uses a throttling limit of K = o(T1^(1/3)). The following theorem formalizes this intuition.
THEOREM 13. Let Ĝ = (V, Ê) denote the nonuniform unthrottled linear pipeline shown in Figure 10, with work T1 and span T∞ ≤ 2T1^(2/3). Let S1 denote the optimal stack-space usage when Ĝ is executed on 1 processor. Any P-processor execution of Ĝ that achieves TP ≤ T1/ρ, where ρ satisfies 3 ≤ ρ ≤ O(T1/T∞), uses space SP ≥ S1 + (ρ − 3)T1^(1/3)/2 − 1.
Intuitively, these two theorems present two extremes of the effect
of throttling on pipeline dags. One interesting avenue for research
is to determine what are the minimum restrictions on the structure
of an unthrottled linear pipeline G that would allow a scheduler to
achieve parallel speedup on P processors using a throttling limit of
only Θ(P).
12. ACKNOWLEDGMENTS
Thanks to Loren Merritt of x264 LLC and Hank Hoffman of Uni-
versity of Chicago (formerly of MIT CSAIL) for answering ques-
tions about x264. Thanks to Yungang Bao of Institute of Comput-
ing Technology, Chinese Academy of Sciences (formerly of Prince-
ton) for answering questions about the PARSEC benchmark suite.
Thanks to Bradley Kuszmaul of MIT CSAIL for tips and insights
on file I/O related performance issues. Thanks to Arch Robison
of Intel for providing constructive feedback on an early draft of
this paper. Thanks to Will Hasenplaugh of MIT CSAIL and Nasro
Min-Allah of COMSATS Institute of Information Technology in
Pakistan for helpful discussions. We especially thank the reviewers
for their thoughtful comments.
13. REFERENCES
[1] K. Agrawal, C. E. Leiserson, and J. Sukha. Executing task graphs
using work-stealing. In IPDPS, pp. 1–12. IEEE, 2010.
[2] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling
for multiprogrammed multiprocessors. Theory of Computing
Systems, pp. 115–144, 2001.
[3] H. C. Baker, Jr. and C. Hewitt. The incremental garbage collection of
processes. SIGPLAN Notices, 12(8):55–59, 1977.
[4] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark
suite: Characterization and architectural implications. In PACT, pp.
72–81. ACM, 2008.
[5] C. Bienia and K. Li. Characteristics of workloads using the pipeline
programming model. In ISCA, pp. 161–171. Springer-Verlag, 2010.
[6] G. E. Blelloch and M. Reid-Miller. Pipelining with futures. In SPAA,
pp. 249–259. ACM, 1997.
[7] R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of
multithreaded computations. SIAM Journal on Computing,
27(1):202–229, Feb. 1998.
[8] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded
computations by work stealing. JACM, 46(5):720–748, 1999.
[9] R. L. Bocchino, Jr., V. S. Adve, S . V. Adve, and M. Snir. Parallel
programming must be deterministic by default. In First USENIX
Conference on Hot Topics in Parallelism, 2009.
[10] F. W. Burton and M. R. Sleep. Executing functional programs on a
virtual tree of processors. In FPCA, pp. 187–194. ACM, 1981.
[11] C. Consel, H. Hamdi, L. Réveillère, L. Singaravelu, H. Yu, and
C. Pu. Spidle: a DSL approach to specifying streaming applications.
In GPCE, pp. 1–17. Springer-Verlag, 2003.
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.
Introduction to Algorithms. The MIT Press, third edition, 2009.
[13] R. Finkel and U. Manber. DIB – a distributed implementation of backtracking. ACM TOPLAS, 9(2):235–256, 1987.
[14] D. Friedman and D. Wise. Aspects of applicative programming for
parallel processing. IEEE Transactions on Computers,
C-27(4):289–296, 1978.
[15] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of
the Cilk-5 multithreaded language. In PLDI, pp. 212–223. ACM,
1998.
[16] J. Giacomoni, T. Moseley, and M. Vachharajani. FastForward for
efficient pipeline parallelism: A cache-optimized concurrent
lock-free queue. In PPoPP, pp. 43–52. ACM, 2008.
[17] M. I. Gordon, W. Thies, and S. Amarasinghe. Exploiting
coarse-grained task, data, and pipeline parallelism in stream
programs. In ASPLOS, pp. 151–162. ACM, 2006.
[18] E. A. Hauck and B. A. Dent. Burroughs’ B6500/B7500 stack
mechanism. Proceedings of the AFIPS Spring Joint Computer
Conference, pp. 245–251, 1968.
[19] Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview
scalability analyzer. In SPAA, pp. 145–156, 2010.
[20] Intel Corporation. Intel® Cilk
Plus Language Extension
Specification, Version 1.1, 2013. Document 324396-002US.
Available from http://cilkplus.org/sites/default/files/
open_specifications/Intel_Cilk_plus_lang_spec_2.htm.
[21] D. A. Kranz, R. H. Halstead, Jr., and E. Mohr. Mul-T: A
high-performance parallel Lisp. In PLDI, pp. 81–90. ACM, 1989.
[22] M. Lam. Software pipelining: an effective scheduling technique for
VLIW machines. In PLDI, pp. 318–328. ACM, 1988.
[23] I.-T. A. Lee, S. Boyd-Wickizer, Z. Huang, and C. E. Leiserson. Using
memory mapping to support cactus stacks in work-stealing runtime
systems. In PACT, pp. 411–420. ACM, 2010.
[24] C. E. Leiserson. The Cilk++ concurrency platform. J.
Supercomputing, 51(3):244–257, 2010.
[25] S. MacDonald, D. Szafron, and J. Schaeffer. Rethinking the pipeline
as object-oriented states with transformations. In HIPS, pp. 12–21.
IEEE, 2004.
[26] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a
system for programming graphics hardware in a C-like language. In
SIGGRAPH, pp. 896–907. ACM, 2003.
[27] M. McCool, A. D. Robison, and J. Reinders. Structured Parallel
Programming: Patterns for Efficient Computation. Elsevier, 2012.
[28] A. Navarro, R. Asenjo, S. Tabik, and C. Caşcaval. Analytical
modeling of pipeline parallelism. In PACT, pp. 281–290. IEEE, 2009.
[29] OpenMP Application Program Interface, Version 3.0, 2008.
Available from
http://www.openmp.org/mp-documents/spec30.pdf.
[30] G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread
extraction with decoupled software pipelining. In MICRO, pp.
105–118. IEEE, 2005.
[31] A. Pop and A. Cohen. A stream-computing extension to OpenMP. In
HiPEAC, pp. 5–14. ACM, 2011.
[32] R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August.
Decoupled software pipelining with the synchronization array. In
PACT, pp. 177–188. ACM, 2004.
[33] E. C. Reed, N. Chen, and R. E. Johnson. Expressing pipeline
parallelism using TBB constructs: a case study on what works and
what doesn’t. In SPLASH, pp. 133–138. ACM, 2011.
[34] R. Rojas. Konrad Zuse’s legacy: The architecture of the Z1 and Z3.
IEEE Annals of the History of Computing, 19(2):5–16, Apr. 1997.
[35] D. Sanchez, D. Lo, R. M. Yoo, J. Sugerman, and C. Kozyrakis.
Dynamic fine-grain scheduling of pipeline parallelism. In PACT, pp.
22–32. IEEE, 2011.
[36] B. Stroustrup. The C++ Programming Language. Addison-Wesley,
fourth edition, 2013.
[37] M. A. Suleman, M. K. Qureshi, Khubaib, and Y. N. Patt.
Feedback-directed pipeline parallelism. In PACT, pp. 147–156.
ACM, 2010.
[38] W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical
approach to exploiting coarse-grained pipeline parallelism in C
programs. In MICRO, pp. 356–369. IEEE, 2007.
[39] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra. Overview
of the H.264/AVC video coding standard. IEEE Transactions on
Circuits and Systems for Video Technology, 13(7):560–576, 2003.