IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 4, NO. 4, DECEMBER 2002 423
Staggered Push—A Linearly Scalable Architecture
for Push-Based Parallel Video Servers
Jack Y. B. Lee, Member, IEEE
Abstract—With the rapid performance improvements in
low-cost PCs, it becomes increasingly practical and cost-effective
to implement large-scale video-on-demand (VoD) systems around
parallel PC servers. This paper proposes a novel parallel video
server architecture where video data are striped across an array
of autonomous servers connected by an interconnection network.
To coordinate data transmissions from multiple autonomous
servers to a client station, a staggered push scheduling algorithm
is proposed. A system model is constructed to quantify the per-
formance of the architecture. Unlike most studies, this work does
not assume the existence of a global clock among the servers and
tackles two problems arising from server asynchrony: inconsistent
schedule assignment and traffic overlapping. The former problem
is solved by using an admission scheduler and the latter problem
is solved by an over-rate transmission scheme. Analytical results
prove a remarkable property of the staggered push architecture:
as long as the network has sufficient capacity, the system can
be scaled up linearly to an arbitrary number of servers. Design
examples and numerical results are used to evaluate the proposed
architecture under realistic assumptions and to compare it against
the concurrent push architecture.

Index Terms—Parallel video server, performance analysis,
scheduling algorithm, server push, server striping, staggered push.

I. INTRODUCTION

TRADITIONAL video-on-demand (VoD) systems commonly employ a
high-performance server for the storage and delivery of video
streams to multiple clients. This single-server approach is a
natural extension of networked file systems and works well for
small-scale systems with medium-quality videos. However, with the
emergence of high-definition video in terrestrial broadcasting,
consumers will increasingly demand similarly high-quality video
from VoD service providers. Coupled with the need to serve a large
number of concurrent users, the capacity of the single-server
approach is quickly becoming a severe limiting factor.

While server replication and partitioning can be used to scale up
the system capacity, the former approach would not be economical
due to the large storage required for high-quality videos, and the
latter approach is known to suffer from load-balancing problems.

Manuscript received March 24, 1999; revised August 13, 2002. This work
was supported in part by Research Grant CUHK6095/99E from the HKSAR
Research Grant Council and the Area of Excellence in Information Technology,
and a research grant from the HKSAR University Grants Council. The associate
editor coordinating the review of this paper and approving it for publication was
Prof. Chung-Yu Wu.
The author is with the Department of Information Engineering, Chinese
University of Hong Kong, Shatin, N.T., Hong Kong (e-mail: jacklee@com-).
Digital Object Identifier 10.1109/TMM.2002.806533
In this paper, we propose a parallel-server architecture for
designing scalable VoD systems. Unlike replication, we use
striping to achieve load sharing across multiple servers without
increasing storage requirement. Furthermore, by striping using
a small unit size, the system is insensitive to skewness in video
retrievals. This architecture allows one to incrementally scale
up the system capacity to more concurrent users by adding
(rather than replacing) more servers and redistributing (rather
than duplicating) video data among them.
The main contributions of this paper are as follows.
• We propose and analyze quantitatively a staggered push
architecture for scheduling disk retrieval and network
transmission in parallel video servers. We prove a remark-
able property of the staggered push architecture—the
system can be scaled up linearly to an arbitrary number
of servers as long as the network has sufficient capacity.
• We discover that for loosely coupled servers such as PC or
workstation clusters, server-clock asynchrony could lead
to inconsistent schedule assignments among the servers.
To tackle this problem, we propose adding an external
admission scheduler to centralize admission control and
perform schedule assignments.
• Apart from inconsistent schedule assignments, we dis-
cover that server-clock asynchrony can also cause over-
lapping between data transmitted from different servers.
This could induce network congestion, leading to video
packet loss. Worse still, the client may not be able to cope with the
aggregate data rate even if the network can successfully
deliver the data. To tackle this problem, we propose an
over-rate transmission scheme that can effectively prevent
traffic overlapping.
• To evaluate the strengths and weaknesses of the proposed
architecture, we compare and contrast staggered push with
another architecture, concurrent push, using the same system
parameters and assumptions.
The rest of the paper is organized as follows. Section II re-
views some related works and compares them with this study.
Section III presents the system architecture. Section IV studies
the inconsistent schedule assignment problem and proposes an
admission scheduler to solve it. Section V studies the traffic
overlapping problem and proposes an over-rate transmission
scheme. Section VI presents buffer management algorithms for
the server and client, and derives the respective buffer
requirements. Section VII evaluates the system performance
using numerical results and compares staggered push with
concurrent push. Section VIII discusses practicality and
reliability issues. Finally, Section IX draws a conclusion.

1520-9210/02$17.00 © 2002 IEEE
II. RELATED WORKS
Recently, a number of researchers have proposed and studied
various architectures for implementing parallel video servers.
For most studies, fair and meaningful quantitative comparisons
are not feasible due to the inherent differences in architecture,
assumptions, and the lack of compatible performance models.
Therefore, we first discuss the relevant literature in this section
and compare it qualitatively with the approach proposed in
this paper. A quantitative comparison with the concurrent push
architecture will be presented in Section VII.
First, the studies by Buddhikot et al., Lougher et al.,
Tewari et al., and Wu et al. are based on architec-
tures having two types of nodes, namely storage nodes and de-
livery nodes (also called independent proxies). In these studies, the
delivery nodes are independent hosts connected to all servers
via a high-speed interconnect. The delivery nodes merge video
data retrieved from the storage nodes and deliver them to the
clients. In the study by Buddhikot et al., the delivery node
is a proprietary ATM switch (called APIC), which is responsible
for delivering data blocks retrieved by the storage nodes in a
synchronous manner to a client. Effectively, the APIC provides
the hardware global clock needed to precisely synchronize the
storage nodes. This is fundamentally
different from the architecture proposed in this paper, where
servers directly transmit video data to a client without passing
through any intermediate node. Staggered push eliminates the
extra hardware needed to run the intermediate delivery nodes
as well as the extra high-speed interconnect linking up the
storage nodes with the delivery nodes.1 By incorporating the
effect of server clock jitter and compensating accordingly,
existing network hardware such as Fast Ethernet and ATM can
be used. Results (see Section VII) show that the staggered push
architecture is robust to server clock jitter and performs well
even if the servers are loosely synchronized using conventional
distributed clock-synchronization algorithms.
The studies by Freedman et al. and Lee et al. are
based on the client-pull service model, where the servers process
video-block-level requests as they arrive. This model differs
from the server-push model adopted in this paper.
A detailed comparison between the two service models is beyond
the scope of this paper. Briefly speaking, client-pull results in
a simpler server design and does not need server-clock synchro-
nization. On the other hand, server-push allows better optimiza-
tion of server efficiency, as periodic schedulers can be used.
Moreover, a full-fledged up-stream communication channel is
1The independent proxy can also be implemented within the storage
servers—called proxy-at-server. While this does not require additional hosts
dedicated to the proxies, transmission and processing overheads are still
incurred, as the embedded proxy module still has to receive video data
from all the other servers and then forward them to the clients.
not required, as there are no periodic requests traveling from a
client back to the servers. Interested readers are referred to Rao
et al.  for qualitative and simulation comparisons of the two
service models in single-server multimedia systems.
Finally, the studies by Biersack et al., Bolosky et
al., Lee, and Reddy are more closely related to the
architecture proposed in this paper. First, the servers in
these studies transmit video data directly to the clients (an
approach called proxy-at-client). Second, these studies all
employ some form of server-push service model.
This paper differs from the studies by Biersack et al.
in three ways. First, they proposed to stripe video across servers
in units of a fixed number of frames, while we propose striping using
fixed-size blocks. As compressed video frames vary widely
in size, the former approach clearly has potential storage and
load-balancing problems.2 By contrast, striping using fixed-size
blocks avoids the frame-level processing issues and guarantees
storage balance irrespective of the compression format.
Second, while they also suggested striping in fixed-size units,
their study did not consider variations in video-block-con-
sumption times and simply assumed a constant consumption
time. This assumption is not valid for most compression
formats, as each fixed-size block can contain a different number
of frames (even partial frames). Third, their study did
not consider scheduling issues at the servers and assumed that a
server transmits data at line speed to a client, which is clearly
impractical given the client's lower processing capability.
This study investigates the server-scheduling problem,
reveals the inconsistent schedule assignment problem and
traffic overlapping problem arising from server clock jitter, and
proposes solutions to them.
The study by Bolosky et al.  differs from this paper in
two ways. First, they focused on experimentation by means
of an implementation using the PC platform, and did not deal
with performance modeling in detail. Second, while their
design also employs a centralized controller for admission
control, they have not discovered the inconsistent schedule
assignment problem arising from server-clock asynchrony. By
contrast, we focus on modeling the system performance and on
quantifying the effect of server-clock asynchrony. We reveal
the inconsistent schedule assignment problem as well as the
traffic overlapping problem and tackle them with an admission
scheduler and over-rate transmission, respectively.
For the study by Reddy , while the author did point out
that potential problems could occur in case the server clocks
are not synchronized, no solution was suggested. Moreover,
the proposed architecture is designed for a specific intercon-
nection switch (Omega network). The study assumed that all
server clocks are precisely synchronized and transmission via
the switch can be exactly scheduled. By contrast, this paper
assumes a loosely-coupled system and derives a more realistic
performance model that incorporates server clock asynchrony,
delay jitters, and variable video consumption rates.
2While one can perform striping in units of a group of pictures (GOP) for
MPEG-compressed video, GOPs still vary slightly in size. Worse, the GOP struc-
ture (i.e., the sequence of I, P, and B frames) in MPEG videos can change dynam-
ically (e.g., to adapt to scene changes or increased motion). Hence, this ap-
proach still suffers from potential storage and load-balancing problems.
Fig. 1. Architecture of a (five-server) parallel video server.
Finally, a related study by Lee has proposed a sched-
uling algorithm called concurrent push for scheduling the servers'
retrieval and transmission. Briefly speaking, all servers transmit
to a client concurrently, and the transmission rate is reduced
proportionally so that the aggregate rate equals the video bit
rate. The author also proposed an Asynchronous Group Sweeping
Scheme (AGSS) to reduce the client buffer requirement and system
response time. Nonetheless, the client buffer requirement still
increases with system scale (i.e., the number of servers), and
the author proposed a Sub-schedule Striping Scheme (SSS) to
maintain a constant buffer requirement, at the expense of higher
processing overhead at the client.
This concurrent-push algorithm is designed to take advan-
tage of the quality-of-service (QoS) control available in today's
networks. As all servers transmit to a client at a constant rate,
one can easily integrate the constant-bit-rate (CBR) service
available in today's ATM networks so that end-to-end QoS can be
guaranteed. The tradeoff, however, is scalability—both the server
buffer requirement and the client processing overhead increase
with system scale. By contrast, the staggered push algorithm
proposed in this paper is linearly scalable, i.e., the server
buffer requirement and client buffer requirement remain constant
regardless of system scale. While staggered push cannot take
advantage of ATM's QoS control, we can still tackle network
congestion problems with the over-rate transmission scheme
proposed in Section V.
III. SYSTEM ARCHITECTURE
Fig. 1 shows the architecture of a parallel video server,
comprising multiple autonomous servers connected by an
interconnection network. We denote the number of servers in
the system by N_S and the number of clients by N_C. Hence the
client–server ratio, denoted by Ψ, is Ψ = N_C/N_S. Each server has a
separate CPU, memory, disk storage, and network interfaces. A
server's storage space is divided into fixed-size stripe units
of Q bytes each. Each video title is then striped into blocks of Q
bytes and stored on the servers in a round-robin manner as
shown in Fig. 1.
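The round-robin placement just described can be illustrated with a short sketch (the function name and parameters below are invented for this example, not taken from the paper):

```python
# Sketch (illustrative only): round-robin mapping of fixed-size stripe
# units to servers, as depicted in Fig. 1. Block i of a title is stored
# on server (i mod N_S).

def stripe_placement(num_blocks: int, num_servers: int):
    """Return, for each server, the list of block indices it stores."""
    placement = [[] for _ in range(num_servers)]
    for block in range(num_blocks):
        placement[block % num_servers].append(block)
    return placement

# With five servers (as in Fig. 1), block 7 lands on server 2:
servers = stripe_placement(num_blocks=10, num_servers=5)
assert 7 in servers[7 % 5]
```

Because consecutive blocks of every title visit every server in turn, each server stores an equal share of each title, which is what makes the load sharing insensitive to video popularity.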
Striping with fixed-size blocks simplifies the process of
striping video streams encoded using interframe compression
algorithms (e.g., MPEG), where frame size varies considerably
for different frame types. Since a stripe unit is significantly
smaller than a video title (kilobytes versus megabytes), this
enables fine-grain load sharing (as opposed to coarse-grain
load sharing in data partitioning) among servers. Moreover,
Fig. 2. Transmission scenario for the staggered push algorithm.
Fig. 3. Two-level scheduler for staggered push.
the loads are evenly distributed to all servers irrespective of
skewness in video popularity .
To schedule disk retrievals and network transmissions at
the servers, we propose a staggered push algorithm where
the servers transmit bursts of data to a client in a round-robin
manner at the average video bit rate. Let R_v denote the average
video bit rate, assumed to be the same for all clients. Then the
transmissions from the servers are staggered such that only one
of the servers transmits to a receiver at any given time, as depicted
in Fig. 2. In this way, there will be at most Ψ video
blocks being transmitted concurrently at a server. Note that
while one can potentially reduce the server buffer requirement by
transmitting at a rate higher than R_v, the client in turn would
have to be capable of receiving at such a high data rate. This
is less practical, as the client's network connection usually has lower
bandwidth and the client device (e.g., a set-top box) will likely
have limited processing capability.
To support staggered push, the server scheduler is divided
into two scheduling levels: micro-round and macro-round as
shown in Fig. 3. Video blocks retrieved in a micro-round will
be transmitted in the next micro-round. Let T_f denote the average
time needed to completely transmit a video block of Q bytes.
Since a video block is transmitted at a rate equal to the video
bit rate R_v, we can obtain T_f = 8Q/R_v, where Q is in bytes and
R_v in bits per second. For an N_S-server system, each macro-round
consists of N_S micro-rounds, and each micro-round transfers up to
Ψ video blocks. Hence, the disk will transfer up to N_SΨ = N_C video blocks
in one macro-round, with one block transmitted for each video
session.
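The two-level timing can be sketched as follows (a minimal illustration; the function names and the example parameters are this sketch's assumptions):

```python
# Sketch of the two-level scheduler timing: a block of Q bytes sent at
# the video bit rate R_v occupies one micro-round of T_f = 8Q/R_v
# seconds, and a macro-round spans N_S micro-rounds, one per server.

def micro_round_length(q_bytes: int, rv_bps: float) -> float:
    """Seconds to transmit one Q-byte block at the video bit rate."""
    return 8 * q_bytes / rv_bps

def macro_round_length(q_bytes: int, rv_bps: float, num_servers: int) -> float:
    """One macro-round = N_S micro-rounds."""
    return num_servers * micro_round_length(q_bytes, rv_bps)

# With 64-KB blocks and a 1.2-Mb/s video (the Table I parameters):
tf = micro_round_length(64 * 1024, 1.2e6)      # ~0.437 s per micro-round
tr = macro_round_length(64 * 1024, 1.2e6, 8)   # ~3.495 s per macro-round
```

Note that the micro-round length depends only on the block size and the video bit rate, not on the number of servers; adding servers lengthens the macro-round but leaves each server's per-round work unchanged.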
IV. SCHEDULE ASSIGNMENT
Unexpectedly, the two-level scheduling scheme may result in
inconsistent schedules among different servers if admission is
performed independently at each server. Specifically, as the servers
are loosely coupled, the internal clock of each server in the
system runs independently. We define clock jitter
Fig. 4. Inconsistent schedule assignment arising from server clock jitter.
Fig. 5. Micro-round overflow due to inconsistent schedule assignment.
as the difference between the internal real-time clocks of two
servers. Many algorithms for controlling clock jitter between
distributed computers have been proposed, and hence the topic
will not be pursued further here. We assume that the maximum
clock jitter between any two servers in the system is bounded,
and denote the bound by τ.
With the presence of clock jitter, one server could assign two
new video sessions to start within the same micro-round while an-
other server could assign them to two different micro-rounds, as
depicted in Fig. 4. This is because each server assigns new
sessions to micro-rounds according to its own internal clock,
which differs from the other servers' clocks due to clock jitter. As a single
micro-round can serve only up to Ψ video sessions,
one server could experience micro-round overflow although an-
other server can admit the new video session (Fig. 5). While one
can delay the new video session at the overflowed server until
the next available micro-round, the transmission schedule will
be delayed significantly and will result in severe traffic overlap-
ping with transmissions from other servers (see Section V).
To solve this inconsistent schedule assignment problem, we
propose adding an external admission scheduler between the
servers and the clients to centralize schedule assignment. To
initiate a new video session, a client will first send a request to
the admission scheduler. Using the same clock-synchronization
protocol, the admission scheduler maintains the same clock
jitter bound with the servers. As new sessions are assigned
solely according to the admission scheduler’s clock, the
scenario depicted in Figs. 4 and 5 will not occur. However, to
ensure that the assigned micro-round has not started in any of
the servers due to clock jitter, the admission scheduler must
add an extra delay to the assignment.
Theorem 1: If the admission scheduler delays the start of a
new video session by an extra ⌈τ/T_f⌉
micro-rounds, then it guarantees that the assigned micro-round
has not started in any of the N_S servers.

Fig. 6. Worst-case delay in the admission process.
Proof: See Appendix A.
For example, let t be the local time at which a new request arrives
at the admission scheduler. Then the admission scheduler will
attempt to admit the request to micro-round ⌈t/T_f⌉ + 1 + ⌈τ/T_f⌉.
Note that we need to add one to ⌈t/T_f⌉ because the request
cannot join the current micro-round (it has started already). If
the assigned micro-round is full, the admission scheduler will
sequentially check the subsequent micro-rounds until an avail-
able micro-round is found. In the worst case, shown in Fig. 6,
the transmission of the first video block is delayed the longest,
because a new request may find every intervening micro-round
fully occupied by existing sessions.
To better evaluate the delay incurred, we can derive the av-
erage scheduling delay under a given server load. Assume that
there are k (0 ≤ k < N_C) active video sessions; the average
scheduling delay can then be derived as shown in Appendix B.
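The admission procedure described in this section can be sketched as follows (an illustrative reconstruction: the jitter margin of ⌈τ/T_f⌉ micro-rounds, the linear probing, and all names are this example's assumptions, not code from the paper):

```python
import math

def assign_micro_round(t: float, tf: float, tau: float,
                       occupancy: dict, capacity: int) -> int:
    """Pick the first non-full micro-round that is safely in the future.

    t: request arrival time at the admission scheduler
    tf: micro-round length; tau: clock-jitter bound
    occupancy: micro-round index -> sessions already assigned to it
    capacity: maximum sessions a micro-round can serve
    """
    # Skip the current micro-round (+1) and add a jitter margin so the
    # chosen round has not yet started on any server's clock.
    r = math.ceil(t / tf) + 1 + math.ceil(tau / tf)
    while occupancy.get(r, 0) >= capacity:
        r += 1  # linear probe for the next available micro-round
    occupancy[r] = occupancy.get(r, 0) + 1
    return r
```

Because every assignment is made against the admission scheduler's single clock, two requests can never be mapped inconsistently across servers; the probing loop models the worst-case delay of Fig. 6.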
Fig. 7. Traffic overlapping due to server clock jitter.
Fig. 8. Preventing traffic overlap by over-rate transmission.
V. TRAFFIC OVERLAPPING
If the server clock jitter is larger than zero, then transmissions
from adjacent servers can overlap
and multiply the transmission rate in the overlapping interval
(Fig. 7). This could cause congestion at the network and the
client, resulting in packets being dropped.

To prevent overlapping, we propose transmitting each video block
at a rate higher than the video bit rate,
say R_T (Fig. 8). We call this scheme over-rate transmis-
sion (ORT) for obvious reasons. The transmission window will
then be reduced to a time interval of (T_f − τ).
We can guarantee that there will be no transmission overlapping
by ensuring that each block completes transmission within this
reduced window, which gives the minimum rate
needed to avoid traffic overlapping:

R_T = R_v T_f / (T_f − τ).

Since the transmission rate must be positive and less than in-
finity, we have the condition that τ < T_f.
In other words, the server clock jitter must be smaller than a
micro-round. Note that under this condition, traffic overlapping
involves at most two servers and the data rate is doubled to 2R_v
in the overlapping region. As the client must in any case cope
with 2R_v under uncontrolled overlapping, the usable
range for over-rate transmission is actually limited by R_T ≤ 2R_v;
otherwise R_T can become very
large as τ approaches T_f.
Substituting the rate equation into this limit and rearranging, we can then determine
the maximum clock jitter for which ORT is applicable:

τ ≤ T_f / 2.

Therefore, ORT can prevent traffic overlapping if the clock jitter
is less than half of a micro-round. With ORT, the maximum
network bandwidth needed at each server will be increased to ΨR_T.
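The ORT rate computation can be sketched as follows (an illustrative reconstruction consistent with the numbers reported in Section VII-E; the function name and parameters are this example's):

```python
# Sketch of the ORT rate: a Q-byte block must finish within the reduced
# window (T_f - tau), which requires rate R_T = R_v * T_f / (T_f - tau),
# applicable only while tau <= T_f / 2.

def ort_rate(q_bytes: int, rv_bps: float, tau: float) -> float:
    """Minimum over-rate transmission rate (bits/s) that avoids overlap."""
    tf = 8 * q_bytes / rv_bps          # micro-round length in seconds
    if tau > tf / 2:
        raise ValueError("ORT inapplicable: jitter exceeds half a micro-round")
    return 8 * q_bytes / (tf - tau)    # bits per second

# With the Table I parameters (1.2 Mb/s video, 100-ms jitter), a 64-KB
# block gives about 1.556 Mb/s, matching Section VII-E.
rate = ort_rate(64 * 1024, 1.2e6, 0.1)
```

Increasing the block size lengthens the micro-round, so the same absolute jitter eats a smaller fraction of the window and the overhead shrinks, which is the tradeoff explored in Fig. 14.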
VI. BUFFER MANAGEMENT
In this section, we present the buffer management schemes em-
ployed at the server and client, and derive the respective buffer
requirements. For simplicity, the derivations ignore network delay and delay
jitter. However, the effect of network delay and delay jitter can
be incorporated in the same way as clock jitter, and the same
derivations remain valid.
A. Server Buffer Requirement
Under the two-level scheduler, a retrieved video block is buffered for the
duration of a macro-round, denoted by T_R. As there are N_S
micro-rounds in a macro-round, the macro-round length
is given by T_R = N_S T_f.
As buffers are released after each micro-round, this scheduler
requires a fixed number of buffers for each server, regardless of the
number of servers and clients in the system. Therefore, existing
servers do not need any upgrade when one scales up a system by
adding more servers.
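As a concrete illustration, if each admitted session holds one block being retrieved and one being transmitted (a double-buffering assumption made for this sketch, not necessarily the paper's exact derivation), the per-server buffer depends only on the client–server ratio and the block size:

```python
def server_buffer_bytes(ratio: int, q_bytes: int, rounds: int = 2) -> int:
    """Per-server buffer: one Q-byte block per session per buffered round."""
    return rounds * ratio * q_bytes

# With the design-example parameters (ratio 50, 64-KB blocks), double
# buffering gives 6.25 MB per server, matching Section VII-A, and the
# figure is independent of the number of servers.
buf = server_buffer_bytes(50, 64 * 1024)
assert buf == 6_553_600  # 6.25 MB, with 1 MB = 2**20 bytes
```

The number of servers never appears in the formula, which is the constant-buffer property plotted in Fig. 9.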
B. Client Buffer Requirement
Ideally, video data are consumed periodically by the video
decoder. However, our experiments with real
video decoders reveal that the decoder consumes fixed-size
video blocks at irregular intervals. Given the video bit rate, R_v, and block size,
Q, the average time for a video
decoder to consume a single block is T_f. To model the variations,
we employ the consumption model proposed in earlier work, reproduced
below for the sake of completeness.
Definition 1: Let d_i be the time the video decoder starts de-
coding the ith video block; the decoding-time deviation of
block i is then defined as the difference between d_i and the
nominal decoding time iT_f. Decoding is late if the deviation is
positive and early if it is negative. The maximum lag in decoding
and the maximum advance in decoding are implementation dependent
and can be obtained empirically. Knowing these two bounds, the
playback instant for video block i is then bounded from both sides.
Buffers are used at the client to absorb these variations to pre-
vent buffer underflow (which leads to playback hiccups) and
buffer overflow (which leads to packet dropping). Let B_C
be the number of buffers (each of Q bytes) available at
the client, organized as a circular buffer. The client prefills
Y buffers before starting playback to prevent buffer under-
flow.

We first determine the lower bound for Y. Let t_0 be the time
(with respect to the admission scheduler's clock) when the first
video block begins transmission. Without loss
of generality, we can assume that the video title is striped with
block zero at server zero. The time for block i to be com-
pletely received by the client is then bounded by terms accounting
for server clock jitter and for the maximum deviations due to trans-
mission rate deviation, CPU scheduling, bus contention, etc.

Since the client begins video playback after filling the first Y
buffers, the playback time for video block 0 is simply equal to
the time buffer Y is completely filled, and the playback time
for video block i is bounded using the maximum decoding lag and
advance of Definition 1.

To guarantee video playback continuity, we must ensure that
all video blocks arrive before their respective playback dead-
lines. Therefore, we need to ensure that, for every video block,
the latest possible arrival time is smaller than the earliest playback
time. Combining the arrival-time and playback-time bounds with the
expressions for T_f and T_R, taking the worst case over all blocks,
and rearranging, we can solve for the minimum Y,
which is the number of buffers that must be prefilled before
beginning video playback.
Similarly, to guarantee that the client buffer will not be over-
flowed by incoming video data, we need to ensure that the ith
video block starts playback before the (i + B_C)th video
block is completely received. This is because the client buffers
are organized as a circular buffer. Again using the arrival-time
and playback-time bounds, rearranging, and taking the worst case,
we obtain the number of empty buffers, B_C − Y,
needed to avoid client buffer overflow.
C. System Response Time
Another key performance metric of a VoD service is system
response time, defined as the time from initiating a new request
to the time video playback starts. Ignoring system administra-
tive overheads, the response time consists
of two components: scheduling delay and prefill delay. Sched-
uling delay is the delay incurred at the admission scheduler plus
the delay incurred at the server scheduler, as derived in Sec-
tion IV. For prefill delay, we note that the client prefills the first
Y video blocks before starting video playback. Hence, the av-
erage prefill delay can be obtained from the average time needed
to receive Y blocks, and the system response time is simply the sum
of the two delays.
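The response-time computation is then a simple sum; the sketch below plugs in the design-example values from Section VII rather than deriving them:

```python
# Sketch: system response time = scheduling delay + prefill delay.
# The figures below are the Section VII design-example values
# (90% utilization, eight servers), not computed from first principles.

def response_time(scheduling_delay_s: float, prefill_delay_s: float) -> float:
    return scheduling_delay_s + prefill_delay_s

total = response_time(0.735, 1.41)  # ~2.15 s, cf. the 2.146 s example
```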
TABLE I
SYSTEM PARAMETERS USED IN PERFORMANCE EVALUATION
VII. PERFORMANCE EVALUATION
In this section, we evaluate the performance of the proposed
architecture using numerical results. All results are computed
using the derivations in Sections IV–VI with the system param-
eters listed in Table I. The decoding-time-deviation parameters
are determined experimentally by collecting video block consumption
times of a hardware MPEG-1 decoder (Sigma Designs RealMagic).
A. Design Example
To illustrate performance and resource requirements of the
architecture, we first consider a design example in this section.
We assume that there are eight servers in the system, with a
client–server ratio of 50 (i.e., up to 400 concurrent streams).3
Using the parameters in Table I, the server buffer requirement
is calculated to be 6.25 MB. Compared with the amount of
memory in today's PCs, this buffer requirement is relatively
small. Moreover, as conventional PCs can be expanded to 256
MB or more of memory, in theory a client–server ratio of over
2000 can be supported. Hence server buffer requirement will
not become a limiting factor to the system’s scalability.
Using the same parameters, the client buffer requirement is
calculated to be 256 KB. This translates into an average pre-
fill delay of 1.41 s. To determine the system response time, we
assume that the system is at 90% utilization. Then the corre-
sponding scheduling delay will be 0.735 s. Together with prefill
delay, the average system response time becomes 2.146 s, well
within acceptable limits. We perform a more detailed sensitivity
analysis with respect to key system parameters in the following
subsections.
B. Server Buffer Requirement
Fig. 9 plots server buffer requirement versus system scale
(i.e., number of servers) for both concurrent push and staggered
push. This graph clearly shows the remarkable property of stag-
gered push—constant server buffer requirement irrespective of
system scale. By contrast, server buffer requirement increases
with system scale under concurrent push, even with AGSS and
SSS. When concurrent push is scaled up to 12 servers, server
buffer requirement increases to 40.6 MB compared to just 6.25
MB under staggered push. Hence the ultimate scalability of the
concurrent push architecture will be limited by server buffer,
while the proposed staggered push architecture can be scaled
up without any upgrade to the existing servers.
3This particular client–server ratio is determined from past implementation
experience using Pentium Pro 200-MHz class machines.
Fig. 9. Server buffer requirement versus system scale.
C. Client Buffer Requirement
Fig. 10 plots the client buffer requirement versus system scale for
both concurrent push and staggered push. We observe that con-
current push is not scalable without SSS, while staggered push
has a constant client buffer requirement that will not limit scala-
bility. Although the client buffer requirement of concurrent
push can be controlled to a constant by SSS, the system scal-
ability is still limited, as client processing overhead due to SSS
increases with system scale. It is particularly important to main-
tain a constant client buffer requirement in practice, as it would
be very expensive (if not impossible) to upgrade every existing
client device (e.g., set-top boxes) whenever the system is scaled
up.
In Fig. 11, we analyze the sensitivity of client buffer require-
ment to server clock jitter. As the results indicate, the buffer
requirement is relatively insensitive to clock jitter, even if the
jitter is increased to one second. Hence one can safely employ
the existing software-based, distributed clock-synchronization
protocols in staggered push.
D. System Response Time
Fig. 12 plots the system response time versus system scale.
While the worst-case system response time increases linearly
with more servers, the average system response time remains
Fig. 10. Client buffer requirement versus system scale.
Fig. 11. Client buffer requirement versus server clock jitter.
low (about 2 s) for a utilization of 90%. This suggests that we can
maintain a low system response time simply by limiting the
system to, say, 90% utilization by means of admission control.
In Fig. 13, we study the sensitivity of system response time
to server clock jitter. As expected, the system response time in-
creases for larger clock jitter values (cf. Theorem 1). However,
given that server clock jitter can readily be controlled to within
100 ms by software-based synchronization algorithms, the response
time remains around 2 s for an eight-server system at 90% utilization.
Fig. 12. Average system response time versus system scale at 90% utilization.
Fig. 13. System response time versus server clock jitter.

E. Server Bandwidth Overhead
Fig. 14 plots the ORT transmission rate versus server clock
jitter for block sizes of 64 KB, 128 KB, and 256 KB.
As clock jitter can be readily controlled to within 100 ms by
distributed software algorithms, the results show that over-rate
transmission is applicable in all three cases. For example, with
a 64-KB block size, ORT will transmit at 1.556 Mbps instead of the
video bit rate of 1.2 Mbps, incurring a bandwidth overhead of
Fig. 14. Transmission rate versus server clock jitter.
29.7%. Increasing the block size to 256 KB reduces the ORT
transmission rate to 1.273 Mbps, or a bandwidth overhead of
only 6%. Thus the system designer can adjust the block size to
balance between bandwidth cost and memory cost. In any case,
compared to uncontrolled traffic overlapping, which results in a
doubled transmission rate of 2.4 Mbps, the bandwidth under ORT is
clearly substantially lower.
VIII. DISCUSSION

As the results in the previous section show, the proposed
staggered push architecture can be scaled up to any number
of servers, provided that the network has sufficient capacity.
Compared with the concurrent-push architecture, staggered push
achieves linear scalability at the expense of bursty network
traffic (and slightly larger delay and client buffer requirement).
In particular, if we consider the network traffic between a server
and a client, it is easy to see that the traffic will be in the form
of bursts with an average interburst interval of one macro-round.
By contrast, servers in concurrent push transmit to a client
continuously at a constant rate, allowing easy integration with
the QoS control offered by existing ATM networks. Staggered push will
not be able to make use of the QoS control available in today's ATM
networks directly.
In practice, if the VoD system is deployed in dedicated net-
works with a priori bandwidth planning, then staggered push
can still be used effectively. This is because the over-rate trans-
mission scheme guarantees that the rate doubling due
to traffic overlapping will not occur, and the aggregate traffic
going from the servers to a client will be close to constant bit
rate, with small gaps in between (due to over-rate transmission).
Future ATM networks may support such a many-to-one traffic
model with QoS similar to that available in the cur-
rent constant-bit-rate service.
Another issue in parallel video servers is reliability. Specifically,
as videos are striped across servers without data redundancy, any
single server failure will cripple the entire system. There are a
number of ways to tackle this reliability problem. One approach is
mirroring, which replicates the video data units and distributes
the replicas to the servers using declustering so that the
additional load after a server failure is evenly shared by the
remaining servers. The obvious tradeoff is the doubled storage
requirement. A subtler tradeoff is the need for declustering itself. As
no known algorithm can automatically produce a declustering
scheme (i.e., where to place each replicated unit) for an arbitrary
number of servers, this mirroring approach would require
more capacity planning when being scaled up.
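As a purely illustrative example of the declustered placement idea, the sketch below uses a simple rotational scheme (the function names and the scheme itself are this sketch's own assumptions, not a specific published algorithm) to spread the replicas of one server's units across all other servers so that a failure's extra load is shared evenly:

```python
def replica_server(primary: int, unit_index: int, num_servers: int) -> int:
    # Rotate each successive unit's replica across all other servers,
    # so the extra load after a single-server failure is shared evenly.
    offset = 1 + unit_index % (num_servers - 1)  # offsets cycle over 1..N-1
    return (primary + offset) % num_servers

def failover_load(failed: int, units_per_server: int, num_servers: int) -> dict:
    # After server `failed` dies, count how many of its units each
    # surviving server must take over.
    load = {s: 0 for s in range(num_servers) if s != failed}
    for u in range(units_per_server):
        load[replica_server(failed, u, num_servers)] += 1
    return load
```

For example, with five servers and eight units on the failed server, each of the four survivors takes over exactly two units, instead of one mirror server absorbing the entire load.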
Another approach is by means of parity units, as proposed by
Lee et al. Their system introduces redundant units, computed
from the video data units, into the servers, and uses a special
client that, with the redundant units and the surviving data units,
can compute the lost video units in real time. In the simplest form,
the redundant units are simply parity units, computed from the
exclusive-or of the video data units of the same stripe. This
parity-based striping scheme can protect against a single-server failure
and their protocol can maintain continuous video playback despite a server failure.
While their architecture differs from staggered push (pull-based
versus push-based), a similar redundant striping scheme can also
be introduced to staggered push to achieve fault tolerance. The
author is currently investigating the supportive system modules
(e.g., fault-detection protocol, recovery protocol, transmission
scheduling, etc.) that are needed to support fault tolerance in
the staggered push architecture.
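The parity-based recovery idea can be illustrated with a small sketch (unit contents and helper names here are hypothetical, not from the cited system): each stripe's redundant unit is the byte-wise exclusive-or of its data units, so any single lost unit can be rebuilt from the parity and the surviving units.

```python
def make_parity(stripe):
    # Parity unit = byte-wise XOR of all data units in the stripe.
    parity = bytearray(len(stripe[0]))
    for unit in stripe:
        for i, b in enumerate(unit):
            parity[i] ^= b
    return bytes(parity)

def recover(stripe, parity, lost_index):
    # Rebuild the lost unit by XOR-ing the parity with the surviving units;
    # the unit at `lost_index` is skipped, as it is unavailable after a failure.
    rebuilt = bytearray(parity)
    for idx, unit in enumerate(stripe):
        if idx == lost_index:
            continue
        for i, b in enumerate(unit):
            rebuilt[i] ^= b
    return bytes(rebuilt)
```

Because XOR is its own inverse, this recovers exactly one lost unit per stripe, which is why the scheme protects against a single-server failure only.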
In this paper, we propose and analyze a parallel video server
architecture for implementing linearly scalable video-on-demand
systems. The proposed architecture employs fixed-size
block striping for data storage and a staggered push scheduling
algorithm for coordinating transmissions among multiple
autonomous servers. We incorporate the effect of server clock
jitter and reveal the inconsistent schedule assignment problem
and the traffic overlapping problem. We tackle the former
problem by an external admission scheduler and the latter
problem by an over-rate transmission scheme. Our results
show that the over-rate transmission scheme can effectively
prevent traffic overlapping with a small bandwidth overhead
under clock jitter bounds achievable by existing software-based
synchronization algorithms. Moreover, we show that the server
buffer requirement, the client buffer requirement, and the server
bandwidth requirement are all independent of the number
of servers in the system. The average system response time,
though it increases slightly with more servers, remains acceptable
if we limit the system to less than full utilization. These results
demonstrate that the proposed architecture can be scaled up to serve
a large number of users without costly upgrades to the existing
servers and clients.
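As an illustration of how the admission scheduler avoids the inconsistent schedule assignment problem, the following sketch (the function name and rounding choices are assumptions consistent with the jitter-bound analysis, not code from the paper) defers each new request by enough extra micro-rounds to absorb the worst-case clock jitter:

```python
import math

def assign_micro_round(t_arrival: float, round_len: float,
                       jitter_bound: float) -> int:
    # t_arrival    -- arrival time by the admission scheduler's clock
    # round_len    -- duration of one micro-round
    # jitter_bound -- maximum deviation of any server's clock
    current = int(t_arrival // round_len)        # round in progress now
    extra = math.ceil(jitter_bound / round_len)  # absorb worst-case jitter
    return current + 1 + extra                   # first safely unstarted round
```

With a one-second micro-round and a 0.5 s jitter bound, a request arriving at t = 10.0 is deferred to micro-round 12 rather than 11, guaranteeing no server has already started the assigned round.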
A. Proof of Theorem 1
Let $t_i$ be the local time a new request arrives at server $i$
($0 \le i \le N_S - 1$), let $t_a$ be the local time the new request
arrives at the admission scheduler, and let $\delta$ be the extra
scheduling delay (in number of micro-rounds). Then the admission
scheduler will attempt to admit the request to micro-round
$$g = \left\lfloor \frac{t_a}{T_f} \right\rfloor + 1 + \delta$$
as given in (3), where $T_f$ is the length of a micro-round. For
server $i$, the new request arrives during micro-round
$\lfloor t_i/T_f \rfloor$. Hence the problem is to find $\delta$ such
that $g > \lfloor t_i/T_f \rfloor$ for all $i$, i.e., the assigned
micro-round has not been started in any of the servers. Using this
condition, we can then obtain the following inequality:
$$\delta > \left\lfloor \frac{t_i}{T_f} \right\rfloor
 - \left\lfloor \frac{t_a}{T_f} \right\rfloor - 1,
 \quad \forall i \eqno(36)$$
Applying the inequality $x - 1 < \lfloor x \rfloor \le x$ to the
right-hand side of (36), we then obtain the sufficient condition
$$\delta \ge \frac{t_i - t_a}{T_f}, \quad \forall i \eqno(37)$$
Since the clock jitter is bounded, i.e., $|t_i - t_a| \le \tau$, we
can rewrite (37) in terms of $\tau$: if
$$\delta \ge \frac{\tau}{T_f}, \quad \text{or at least} \quad
 \delta = \left\lceil \frac{\tau}{T_f} \right\rceil$$
then the assigned micro-round is guaranteed to have not started
in any of the servers.
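This condition can be checked numerically (parameters below are hypothetical): with the extra scheduling delay set to the ceiling of the jitter bound divided by the micro-round length, the assigned round should not yet have started at any server whose clock deviates from the scheduler's by at most the jitter bound.

```python
import math
import random

def round_started(server_clock: float, micro_round: int, round_len: float) -> bool:
    # Micro-round m begins at local time m * round_len at each server.
    return server_clock >= micro_round * round_len

round_len, tau = 1.0, 0.35  # micro-round length and clock jitter bound
random.seed(1)
for _ in range(1000):
    t_sched = random.uniform(0.0, 100.0)  # scheduler's local clock
    assigned = int(t_sched // round_len) + 1 + math.ceil(tau / round_len)
    for _ in range(8):  # eight servers with bounded clock jitter
        t_server = t_sched + random.uniform(-tau, tau)
        assert not round_started(t_server, assigned, round_len)
print("assigned micro-round never started early")
```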
B. Derivation of the Average Scheduling Delay
Assume that video sessions start independently and with equal
likelihood at any time. Then a video session can be assigned to
any one of the $N_f$ micro-rounds with equal probability. Assume
that there are $n$ active video sessions; then the number of ways
to distribute these $n$ video sessions among $N_f$ groups is a
variant of the urn-occupancy distribution problem and is given
by (41).
To obtain the probability of having $k$ fully-occupied micro-rounds,
we first notice that there are $\binom{N_f}{k}$ combinations of
having $k$ fully-occupied micro-rounds. Given that, the number of
ways to distribute the remaining $(n - kC)$ video sessions among
the other $(N_f - k)$ micro-rounds with none of those micro-rounds
fully occupied can be obtained from (41), where $C$ denotes the
capacity of a micro-round. Hence the total number of ways to have
exactly $k$ micro-rounds fully occupied is given by (42), and the
probability of having $k$ fully-occupied micro-rounds given $n$
active video sessions can be obtained from (43).
The probability for the assigned micro-round to be available (not
fully occupied) is given by (44); its complement is the probability
of the assigned micro-round being fully occupied. Now provided that
the assigned micro-round is fully occupied, the probability that the
next micro-round is available, Pr{next round available}, is given
by (45). This is also the probability for a client to wait one
additional micro-round provided the assigned micro-round is already
fully occupied. It can be shown that the probability for a client to
wait $j$ additional micro-rounds, given that the first $j$
micro-rounds are all fully occupied, is Pr{$(j+1)$th round
available}, given by (46), and that the probability of the first $j$
micro-rounds all being fully occupied is given by (47).
Given the number $k$ of micro-rounds that are fully occupied, the
average number of micro-rounds a client has to wait can be obtained
from (48)-(50), where the second term accounts for the additional
delay as described in Theorem 1. Similarly, given the number $n$ of
active video sessions, the average number of micro-rounds a client
has to wait can be obtained from (51).
Substituting (43), (48)-(50) into (51) gives the desired result.
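The average scheduling delay derived above can be sanity-checked by simulation. The sketch below is a rough Monte Carlo approximation of the model (parameter names are hypothetical, and the placement of existing sessions only approximates the urn-occupancy distribution), counting how many extra micro-rounds a new request slides forward before finding a round with spare capacity:

```python
import random

def avg_extra_wait(num_rounds: int, capacity: int, active: int,
                   trials: int = 5000) -> float:
    # Existing sessions occupy uniformly chosen micro-rounds (re-drawn if a
    # round is already full); the new request starts at a random round and
    # slides forward until one has spare capacity.
    # Requires active < num_rounds * capacity so placement terminates.
    total = 0
    for _ in range(trials):
        occupancy = [0] * num_rounds
        placed = 0
        while placed < active:
            r = random.randrange(num_rounds)
            if occupancy[r] < capacity:
                occupancy[r] += 1
                placed += 1
        start = random.randrange(num_rounds)
        wait = 0
        while occupancy[(start + wait) % num_rounds] >= capacity:
            wait += 1
        total += wait
    return total / trials
```

As expected from the analysis, the estimated delay stays small at low utilization and grows sharply as the active sessions approach the total capacity.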
The author would like to express his gratitude to the anonymous
reviewers for their insightful comments and suggestions, which
helped improve this paper.
 N. Venkatasubramanian and S. Ramanthan, “Load management in dis-
tributed video servers,” in Proc. 17th Int. Conf. Distributed Computing
Systems, Baltimore, MD, May 1997, pp. 528–535.
video-on-demand systems,” in Proc. ICC’95, Seattle, WA, June 1995,
 S. A. Barnett and G. J. Anido, “A cost comparison of distributed and
centralized approaches to video-on-demand,” IEEE J. Select. Areas
Commun., vol. 14, no. 5, pp. 1173–1183, 1996.
 T. D. C. Little and D. Venkatesh, “Popularity-based assignment
of movies to storage devices in a video-on-demand system,” ACM
Multimedia Syst., 1994.
 C. Griwodz, M. Bar, and L. C. Wolf, “Long-term movie popularity
models in video-on-demand systems or the life of an on-demand
movie,” in Proc. Multimedia’97, pp. 349–357.
 Y. B. Lee, “Concurrent push—A scheduling algorithm for push-based
parallel video servers,” IEEE Trans. Circuits Syst. Video Technol., vol.
9, no. 3, Apr. 1999.
, “Parallel video servers—A tutorial,” IEEE Multimedia, vol. 5, no.
2, pp. 20–28, June 1998.
 M. M. Buddhikot and G. M. Parulkar, “Efficient data layout, scheduling
and playout control in MARS,” in Proc. NOSSDAV’95, 1995.
 P. Lougher, D. Pegler, and D. Shepherd, “Scalable storage servers for
digital audio and video,” in Proc. IEE Int. Conf. Storage and Recording
Systems 1994, 1994, pp. 140–143.
R. Tewari, D. M. Dias, R. Mukherjee, and H. M. Vin, "Real-time issues
for clustered multimedia servers," IBM Res. Rep. RC20020, 1995.
mance tradeoffs in clustered video servers,” in Proc. 3rd IEEE Int. Conf.
Multimedia Computing and Systems, Hiroshima, Japan, June 1996, pp.
 M. Wu and W. Shu, “Scheduling for large-scale parallel video servers,”
in Proc. Sixth Symp. Frontiers of Massively Parallel Computation. Los
Alamitos, CA: IEEE Comput. Soc. Press, 1996, pp. 126–133.
 C. S. Freedman and D. J. DeWitt, “The SPIFFI scalable video-on-de-
mand system,” in Proc. ACM SIGMOD’95, June 1995, pp. 352–363.
service on local area networks," in IEEE INFOCOM'96, San Francisco,
CA, Mar. 1996.
 S. S. Rao, H. M. Vin, and A. Tarafdar, “Comparative evaluation of
server-push and client-pull architectures for multimedia servers,” in
Proc. 6th NOSSDAV, Zushi, Japan, Apr. 1996, pp. 45–48.
 E. Biersack, W. Geyer, and C. Bernhardt, “Intra- and inter-stream syn-
chronization for stored multimedia streams,” in Proc. IEEE Int. Conf.
 C. Bernhardt and E. Biersack, “The server array: A scalable video
server architecture," in High-Speed Networks for Multimedia
Applications. Dordrecht, The Netherlands: Kluwer, 1996.
 W. J. Bolosky, J. S. Barrera, III, R. P. Draves, R. P. Fitzgerald, G. A.
Gibson, M. B. Jones, S. P. Levi, N. P. Myhrvold, and R. F. Rashid, “The
tiger video fileserver,” in Proc. Sixth Int. Workshop on Network and Op-
erating System Support for Digital Audio and Video, Zushi, Japan, Apr.
 A. Reddy, “Scheduling and data distribution in a multiprocessor video
server,” in Proc. Second IEEE International Conference on Multimedia
Computing and Systems. Los Alamitos, CA: IEEE Comput. Soc.
Press, 1995, pp. 256–263.
 R. Gusella and S. Zatti, “The accuracy of the clock synchronization
achieved by TEMPO in Berkeley UNIX 4.3BSD,” IEEE Trans. Soft-
ware Eng., vol. 15, pp. 847–853, July 1989.
 D. Mills, “Internet time synchronization: The network time protocol,”
IEEE Trans. Commun., vol. 39, pp. 1482–1493, Oct. 1991.
Z. Yang and T. A. Marsland, Eds., Global States and Time in Distributed
Systems. Los Alamitos, CA: IEEE Comput. Soc. Press, 1994.
 R. Tewari, D. M. Dias, R. Mukherjee, and H. M. Vin, “High availability
in clustered multimedia servers,” in Proc. 12th Int. Conf. Data Engi-
neering, 1996, pp. 645–654.
 P. C. Wong and Y. B. Lee, “Redundant array of inexpensive servers
(RAIS) for on-demand multimedia services,” in Proc. ICC’97, Mon-
treal, QC, Canada, June 8–12, 1997.
York: John Wiley, 1997, pp. 125–126.
Jack Y. B. Lee (M'95) is an Associate Professor
with the Department of Information Engineering,
The Chinese University of Hong Kong, where he
directs the Multimedia Communications Laboratory. The laboratory conducts
research in distributed multimedia systems, fault-tol-