Distributed Schedule Management in the Tiger Video Fileserver
by William J. Bolosky, Robert P. Fitzgerald and John R. Douceur
Tiger is a scalable, fault-tolerant video file server constructed from a collection of computers
connected by a switched network. All content files are striped across all of the computers and disks
in a Tiger system. In order to prevent conflicts for a particular resource between two viewers,
Tiger schedules viewers so that they do not require access to the same resource at the same time. In
the abstract, there is a single, global schedule that describes all of the viewers in the system. In
practice, the schedule is distributed among all of the computers in the system, each of which has a
possibly partially inconsistent view of a subset of the schedule. By using such a relaxed
consistency model for the schedule, Tiger achieves scalability and fault tolerance while still
providing the consistent, coordinated service required by viewers.
1. Introduction
In the past few years, relatively inexpensive computers, disk
drives, network interfaces and network switches have become
capable of handling high quality video data.
Exploiting this capability requires solutions to a number of
different problems such as providing the requisite quality of
service in the network, handling time sensitive data at the clients
and building servers to handle the real-time, storage and aggregate
bandwidth requirements of a video server. Researchers have
attacked various facets of these problems, coming up with many
creative solutions. We built a video server, choosing to use a
distributed system structure. In building this server, we were
faced with having to control the system in a scalable, fault tolerant
manner. This paper describes our solution to this control problem.
Tiger [Bolosky96], the technology underlying the Microsoft
Netshow Professional Video Server, is a video fileserver
intended to supply digital video data on demand to up to tens of
thousands of users simultaneously. Tiger must supply each of
these viewers with a data stream that is independent of all other
viewers; multiplexing or “near video-on-demand” does not meet
Tiger’s requirements. The key observations driving the Tiger
design are that the data rate of a single video stream is small
relative to the I/O bandwidth of personal computers, and that I/O
and switching bandwidth is cheaper in personal computers and
network switches than in large computer memory systems and
backplanes. Tiger is organized as a collection of machines
connected together with an ATM (or other type of) switched
network. While this distributed organization reduces the
hardware cost per stream of video and improves scalability over
monolithic designs, it introduces a host of problems related to
controlling the system.
The data in a Tiger file is striped across all of the computers
and all of the disks within the system. When a client (viewer)
wants to play a particular file, Tiger must assure that the system
capacity exists to supply the data to the viewer without violating
guarantees made to viewers already receiving service. In order to
keep these commitments, Tiger maintains a schedule of viewers
that are playing files. The size of this schedule is proportional to
the total capacity of the system, and so central management of the
schedule would not arbitrarily scale. In order to remove a single
point of failure and to improve scalability, the schedule is
implemented in a distributed fashion across the computers
comprising the Tiger server. Each of the computers has a
potentially out-of-date view of part of the schedule (and no view
at all of the rest), and uses a fault- and latency-tolerant protocol to
update these views. Based on their views, the computers take
action to send the required data to the viewers at the proper time,
to add and remove viewers from the schedule, and to compensate
for the failure of system components.
Tiger behaves as if there is a single, consistent, global
schedule. For reasons of scalability and fault tolerance, the
schedule does not exist in that form. Rather, each of the
component computers acts as if the global schedule exists, but a
component computer only has a partial, possibly out-of-date view
of it. Because the component computers are acting based on a
non-existent global schedule, we call the global schedule a
hallucination. Because many component computers share a
common hallucination, we say that the hallucination is coherent.
The coherent hallucination model is a particularly powerful one
for distributed protocol design, because it allows the designer to
split the problem into two parts: generating a correct centralized
abstraction, and creating appropriate local views of that
abstraction.
The remainder of this paper is organized in four major
sections. The first describes the basic design of Tiger, including
the hardware organization, data layout and fault tolerance aspects
of the system; this is necessary background for understanding the
functioning of the schedule. The next two sections describe the
Tiger schedule, the first treating the schedule as a single,
centralized object, and the second describing its distributed
implementation. The final major section presents performance
results showing a modest sized Tiger system that scales linearly.
The paper wraps up with a related work section and conclusions.
Permission to make digital/hard copy of part or all this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage, the
copyright notice, the title of the publication and its date appear, and
notice is given that copying is by permission of ACM, Inc. To copy
otherwise, to republish, to post on servers, or to redistribute to lists,
requires prior specific permission and/or a fee.
SOSP-16 10/97 Saint-Malo, France
© 1997 ACM 0-89791-916-5/97/0010…$3.50
2. The Tiger Architecture
2.1 Tiger Hardware Organization
A Tiger system is made up of a single Tiger controller machine, a
set of identically configured machines – called “cubs” – to hold
content, and a switched network connecting the cubs and the
clients. The Tiger controller serves only as a contact point (i.e.,
an IP address) for clients, the system clock master, and a few
other low effort tasks; if it became an impediment to scalability,
distributing these tasks would not be difficult. Each of the cubs
hosts a number of disks, which are used to store the file content.
The cubs have one or more interfaces to the switched network to
send file data and to communicate with other cubs, and some sort
of connection to the controller, possibly using the switched
network or using a separate network such as an ethernet. The
switched network itself is typically an ATM network, but may be
built of any scalable networking infrastructure. The network
topology may be complex, but to simplify discussion we assume
that it is a single ATM switch of sufficient bandwidth to carry all
of the needed traffic.
For the purposes of this paper, the important property of
Tiger’s hardware arrangement is that the cubs are connected to
one another through the switched network, so the total bandwidth
available to communicate between cubs grows as the system
capacity grows (although the bandwidth to and from any
particular cub stays constant regardless of the system size).
2.2 File Data Layout
Every file is striped across every disk and every cub in a Tiger
system, provided that the file is large enough. Tiger numbers its
disks in cub-minor order: Disk 0 is on cub 0, disk 1 is on cub 1,
disk n is on cub 0, disk n+1 is on cub 1 and so forth, assuming that
there are n cubs in the system. Files are broken up into blocks,
which are pieces of equal duration. For each file, a starting disk is
selected in some manner, the first block of the file is placed on
that disk, the next block is placed on the succeeding disk and so
on, until the highest numbered disk is reached. At that point,
Tiger places the next block on disk 0, and the process continues
for the rest of the file. The duration of a block is called the “block
play time,” and is typically around one second for systems
configured to handle video rate (1-10Mbit/s) files. The block play
time is the same for every file in a particular Tiger system.
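The numbering and placement rules above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the example system (3 cubs with 2 disks each, a file starting on disk 4) are hypothetical and do not come from the Tiger implementation.

```python
def cub_of_disk(disk: int, num_cubs: int) -> int:
    """Cub hosting a disk under cub-minor numbering (disk 0 on cub 0, disk 1 on cub 1, ...)."""
    return disk % num_cubs


def disk_of_block(block: int, start_disk: int, num_disks: int) -> int:
    """Disk holding block number `block` of a file whose first block is on `start_disk`."""
    return (start_disk + block) % num_disks


# Hypothetical system: 3 cubs with 2 disks each, file starting on disk 4.
NUM_CUBS, NUM_DISKS = 3, 6
for block in range(8):
    disk = disk_of_block(block, start_disk=4, num_disks=NUM_DISKS)
    print(f"block {block}: disk {disk} on cub {cub_of_disk(disk, NUM_CUBS)}")
```

Because consecutive disk numbers live on consecutive cubs, consecutive blocks of a file also land on consecutive cubs, which is what spreads a single viewer's load across the whole system.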
Tiger uses this striping layout in order to handle imbalances
in demand for particular files. Because each file has blocks on
every disk and every server, over the course of playing a file the
load is distributed among all of the system components. Thus, the
system will not overload even if all of the viewers request the
same file, assuming that they are equitemporally spaced. If they
are not, Tiger will delay starting streams in order to enforce
equitemporal spacing.
One disadvantage of striping across all disks is that changing
the system configuration by adding or removing cubs and/or disks
requires changing the layout of all of the files and all of the disks.
Tiger includes software to update (or “re-stripe”) from one
configuration to another. Because of the switched network
between the cubs, the time to restripe a system does not depend
on the size of the system, but only on the size and speed of the
cubs and their disks.
This paper describes two different versions of the Tiger
system, called “single bit rate” and “multiple bit rate.” The single
bitrate version allocates all files as if they are the same bitrate,
and wastes capacity on files that are slower than this maximum
configured speed. The multiple bitrate version is more efficient in
dealing with files of differing speed. Because the block play time
of every file in a Tiger system is the same as that of any other file,
all blocks are of the same size in a single bitrate server (and files
of less than the configured maximum bitrate suffer internal
fragmentation in their blocks). In a multiple bitrate server block
sizes are proportional to the file bitrate. Tiger stores each block
contiguously on disk in order to minimize seeks and to have
predictable block read performance. Tiger DMA’s blocks directly
into pre-allocated buffers, avoiding any copies. Tiger’s network
code also DMA’s directly out of these buffers, resulting in a zero-
copy disk-to-network data path.
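The relationship between bitrate and block size can be made concrete with a small calculation. The sketch below assumes a 1 second block play time and a hypothetical configured maximum of 6 Mbit/s; the function names are illustrative only.

```python
def block_size_bytes(bitrate_bps: float, block_play_time_s: float) -> float:
    """Bytes needed to hold one block play time of data at the given bitrate."""
    return bitrate_bps * block_play_time_s / 8


def single_bitrate_waste(file_bps: float, max_bps: float) -> float:
    """Fraction of each fixed-size block left unused by a slower file
    in a single bitrate server (internal fragmentation)."""
    return 1 - file_bps / max_bps


# Hypothetical configuration: 6 Mbit/s maximum, 1 second blocks.
print(block_size_bytes(6_000_000, 1.0))          # block size used for every file
print(single_bitrate_waste(2_000_000, 6_000_000))  # waste for a 2 Mbit/s file
```

In a multiple bitrate server the first function would be applied per file, so the waste computed by the second function disappears.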
2.3 Fault Tolerance
One goal of Tiger is to tolerate the failure of any single disk or
cub within the system with no ongoing degradation of service.
Tiger does not tolerate complete failure of the switched network
or of the Tiger controller. There are several challenges involved
in providing this level of fault tolerance. The first is to be sure
that the file data remains available even when the hardware
holding it has failed. The second is assuring that the additional
load placed on the non-failed components does not cause them to
overload or hotspot. A final challenge is to detect failures and
reconfigure the responsibilities of the surviving components to
cope with the loss. This section describes our answers to the first
challenge. Maintaining the schedule across failures is covered in
section 4. Detecting faults is accomplished by a deadman
protocol that runs between the cubs.
While the Tiger controller is a single point of failure in the
current implementation, the distributed schedule work described
in this paper removes the major function that the controller in a
centralized Tiger system would have. The Netshow™ product
group plans on making the remaining functions of the controller
fault tolerant. When they have completed this task, the fault
tolerance aspects of the distributed schedule will have come to full
fruition. Until that task is complete, distributed scheduling is
interesting primarily for scalability and academic interest.
[Figure 1: Typical Tiger Hardware Organization. Cubs (Cub 0, Cub 1, . . .) connected by an ATM switching fabric.]
Tiger uses mirroring to achieve data availability. At first
glance, this might seem like an odd choice compared to using
RAID-like parity striping [Patterson88]. The combination of two
factors led us to choose mirroring. First, we expect bandwidth,
rather than storage capacity, to be the limiting factor in Tiger
systems. Second, the requirement to survive failures not only of
disks but also of entire machines means that if Tiger used parity
encoding it would need to move almost all of the file data for a
parity stripe between machines in order to reconstruct the lost
data. Furthermore, this movement would have to happen prior to
the time that the lost data would normally be sent. The cost of the
inter-machine bandwidth and buffer memory for such a solution is
large compared to the cost of mirroring.
While mirroring requires that each data bit be stored twice, it
does not necessarily mean that half of the bandwidth of a disk or
machine needs to be reserved for failed-mode operation. Tiger
declusters its mirror data. That is, for each block of primary data
stored on a cub, its mirror (secondary) copy is split into several
pieces and spread across different disks and machines. The
number of pieces into which the blocks are split is called the
decluster factor. Tiger always stores the secondary parts of a
block on the disks immediately following the disk holding the
primary copy of the block.
Because of declustering, when a single disk or machine fails
several other disks and machines combine to do its work. The
tradeoff in the choice of decluster factor is between reserving
bandwidth for failed mode operation and decreased fault
tolerance. With a decluster factor of 4, only a fifth of total disk
and network bandwidth needs to be reserved for failed mode
operation, but a second failure on any of 8 machines would result
in the loss of data.¹ Conversely, a decluster factor of 2 consumes
a third of system bandwidth for fault tolerance, but can survive
failures more than two cubs away from any other failure.
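The decluster tradeoff can be checked numerically. In failed mode each covering disk carries its own primary load plus 1/decluster of the failed disk's load, so a fraction 1/(decluster + 1) of its bandwidth must be held in reserve. The sketch below is an illustration with hypothetical function names; it counts disks rather than machines and assumes the system has more than 2 × decluster disks so that the neighbor ranges do not overlap.

```python
from fractions import Fraction


def reserved_fraction(decluster: int) -> Fraction:
    """Share of bandwidth held back for failed-mode operation."""
    return Fraction(1, decluster + 1)


def secondary_disks(primary_disk: int, decluster: int, num_disks: int) -> list:
    """Disks holding the secondary pieces: the ones immediately following the primary."""
    return [(primary_disk + i) % num_disks for i in range(1, decluster + 1)]


def vulnerable_disks(failed_disk: int, decluster: int, num_disks: int) -> list:
    """Disks whose additional failure would lose data while `failed_disk` is down."""
    before = [(failed_disk - i) % num_disks for i in range(1, decluster + 1)]
    after = secondary_disks(failed_disk, decluster, num_disks)
    return sorted(set(before + after))


print(reserved_fraction(4), len(vulnerable_disks(10, 4, 20)))
print(reserved_fraction(2), len(vulnerable_disks(10, 2, 20)))
```

A larger decluster factor shrinks the reserve but widens the window of neighbors that must stay alive, which is the tradeoff described above.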
Even if Tiger suffers the failure of two cubs near to one
another, it will attempt to continue to send streams, although these
streams will necessarily miss some blocks of data. If two or more
consecutive cubs are failed, the preceding living cub will send
scheduling information to the succeeding living cub, bridging the
gap.
¹ In a decluster 4 system the 4 disks before the failed disk need to
be alive because the failed disk’s mirror area holds some of the
secondary copy of their data, and the 4 disks after the failed disk
need to be alive because they hold the secondaries for the failed
disk.
Figure 2 illustrates Tiger’s data layout for a three disk,
decluster factor 2 system. The notation “Secondary m.n” means
that part n of each block in primary m is stored at the indicated
place. Because the outer tracks of a disk are longer than the inner
ones, modern disk drives store more sectors on these outer tracks.
Disks have constant angular velocity, so the outer tracks pass
under the drive head in the same amount of time as the inner
tracks. As a result, disks are faster on the outer tracks than on the
inner ones [Ruemmler94; Van Meter97]. Tiger takes advantage
of this fact in its data layout. Primaries are stored on the faster
portion of a disk, and secondaries are stored on the slower part.
At any one time a disk can be covering for at most one failed disk,
so for every primary read there will be at most one secondary
read. The primary reads are decluster times bigger than the
secondary reads, so Tiger can rely on the fact that at most 1 /
(decluster + 1) of the data will be read from the slower half of the
disk.
3. The Tiger Schedule
Over a sufficiently large period of time, a Tiger viewer’s load is
spread evenly across all components of a Tiger system. However,
in the short term Tiger needs to assure that there are no hotspots.
A hotspot occurs when a disk or cub is asked to do more work
than it is capable of doing over some small period of time.
Because the block play time is the same for all files and all files
are laid out in the same order, viewers move from cub to cub and
disk to disk in lockstep; alternately, this can be viewed as the
disks and cubs moving along the schedule in lockstep. A system
that has no hotspots at any particular time will continue to have no
hotspots unless another viewer starts playing. Thus, the problem
of preventing hotspots is reduced to not starting a viewer in such a
way as to create a new hotspot.
Tiger uses the schedule both for describing the needed work
to supply data to running viewers and for checking whether
starting a new viewer would create a hotspot. If there is a viewer
who requests service, and whose request would create a hotspot,
the system will delay starting the viewer until it can be done
safely. This scheme gives variable delays for initial service, but
guarantees that once a viewer is started there will be no resource
conflicts. [Bolosky96] discusses the duration of the delays
introduced, and concludes that for reasonable system parameters
and restricted to running at 80-90% of capacity the delays are
acceptable for most purposes. Section 5 contains measurements
that support this conclusion for the particular Tiger system
tested.
[Figure 2: Tiger Disk Data Layout]
[Figure 3: Example Disk Schedule. Slot/viewer entries shown: Slot 0/Viewer 4, Slot 1/Viewer 3, Slot 3/Viewer 0, Slot 4/Viewer 5, Slot 5/Viewer 2, Slot 7/Viewer 1]
3.1 The Disk Schedule
In a single bitrate Tiger, the system maintains a schedule
describing the work done by the disk drives. Because disks
perform best when doing large transfers (amortizing a seek over a
large amount of data to be read), Tiger reads each block in a
single chunk. The disk schedule is an array of slots, with one slot
for every stream of system capacity. One can think of the disk
schedule as being indexed by time rather than by slot number.
The time that it takes to process one block (the block play time
divided by the maximum number of streams per disk) is called the
block service time. This time is determined by either the speed of
the disks or the capacity of the network interface, whichever is the
bottleneck. So, each slot in the disk schedule is one block service
time long, and the entire schedule is the block play time times the
number of disks in the system. The schedule must be an integral
multiple of both the block play and block service times. If not,
the block service time is lengthened enough to make it so. This
requirement is equivalent to saying that a Tiger system (but not a
disk, cub or network card) must source an integral number of
streams, and that the actual hardware capacity of the system as a
whole is rounded down to the nearest stream.
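The sizing rules for the disk schedule can be written out as a short calculation. This is a sketch under stated assumptions: the function name is hypothetical, the 3 disk system is made up for illustration, and the 10.75 streams per disk figure is borrowed from section 3.3. Exact rationals are used so that the rounding step is visible.

```python
import math
from fractions import Fraction


def disk_schedule(block_play_time, streams_per_disk, num_disks):
    """Return (slot count, block service time, schedule length).

    The system as a whole is rounded down to an integral number of streams,
    and the block service time is lengthened so the slots fill the schedule exactly.
    """
    schedule_length = block_play_time * num_disks
    slots = math.floor(streams_per_disk * num_disks)
    block_service_time = schedule_length / slots
    return slots, block_service_time, schedule_length


# Hypothetical system: 1 second blocks, 10.75 streams per disk, 3 disks.
slots, service, length = disk_schedule(Fraction(1), Fraction(43, 4), 3)
print(slots, service, length)
```

Without the rounding, the nominal block service time would be 1/10.75 of a second; lengthening it slightly is what makes the schedule an integral multiple of both the block play time and the block service time.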
Each cub maintains a pointer into the schedule for each disk
on the cub. These pointers move along the schedule in real time.
When the pointer for a particular disk reaches the beginning of a
slot in the schedule, the cub will start sending to the network the
appropriate block for the viewer occupying the schedule slot. In
order to allow time for the disk operations to complete, the disks
run at least one block service time ahead of the schedule. Usually,
they run a little earlier, trading off buffer usage to cover for slight
variations in disk and I/O system performance. The pointer for
each disk is one block play time behind the pointer for its
predecessor. Because of the requirement that the total schedule
length is the block play time times the number of disks, the
distance between the last and the first disk is also one block play
time.
If a Tiger system is configured to be fault tolerant, the block
service time is increased to allow for processing the secondary
load that will be present in a failed state. If the disk rather than
the network is the limiting factor the inside/outside disk
optimization described in section 2.3 is taken into account when
determining how big to make the block service time.
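The pointer arrangement described earlier in this section (each disk's pointer one block play time behind its predecessor's, wrapping at the end of the schedule) can be sketched as follows. The function names and the numbers are hypothetical, and the sketch ignores the lead time by which disks run ahead of the schedule.

```python
from fractions import Fraction


def pointer_position(disk, now, play_time, num_disks):
    """Position of a disk's pointer in the schedule at time `now`.

    Disk k trails disk k-1 by one block play time; the schedule wraps
    after play_time * num_disks.
    """
    length = play_time * num_disks
    return (now - disk * play_time) % length


def current_slot(disk, now, play_time, service_time, num_disks):
    """Index of the schedule slot currently under the disk's pointer."""
    return pointer_position(disk, now, play_time, num_disks) // service_time


# Hypothetical system: 3 disks, 1 second blocks, 32 slots of 3/32 second.
play, service, disks = Fraction(1), Fraction(3, 32), 3
now = Fraction(5, 4)
for d in range(disks):
    print(d, pointer_position(d, now, play, disks), current_slot(d, now, play, service, disks))
```

A viewer occupying a given slot is therefore served by disk 0, then one block play time later by disk 1, and so on, which matches the order in which its blocks are laid out.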
3.2 The Network Schedule: Supporting Multiple Bitrates
This section describes support for scheduling streams of differing
bitrates on a Tiger system. The discussion is offered primarily
because it serves to illustrate a particular difficulty (and its
solution) in distributing the schedule. Multiple bitrate scheduling
is only partially implemented in today’s Tiger systems.
The concept of block service time as described in section 3.1
has a number of underlying assumptions. One is that the block
service time is the same for all blocks in all files, which is true
only in a single bitrate system. A slightly more subtle assumption
is that the ratio of disk usage to network usage is constant for all
blocks. This assumption is necessary because the block service
time is chosen so the most heavily used resource is not
overloaded. In a multiple bitrate system, blocks of different files
may have different sizes. The time to read a block from a disk
includes a constant seek overhead, while the time to send one to
the network does not, so small blocks use proportionally more
disk than network. Consequently, in a multiple bitrate Tiger
system whether the network or disk limits performance may
depend on the current set of playing files. Different parts of the
same schedule may have different limiting factors.
[Figure 4: Example Network Schedule. x-axis: time (units of block play time)]
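A rough cost model shows why small blocks lean more heavily on the disk than on the network. All of the numbers below (seek time, disk and NIC transfer rates) are hypothetical values chosen only to illustrate the shape of the effect; they are not measurements of any Tiger system.

```python
def disk_time(block_bytes: float, seek_s: float, disk_bytes_per_s: float) -> float:
    """Time to read one block: a constant seek overhead plus the transfer."""
    return seek_s + block_bytes / disk_bytes_per_s


def network_time(block_bytes: float, nic_bytes_per_s: float) -> float:
    """Time to send one block: no per-block constant in this simple model."""
    return block_bytes / nic_bytes_per_s


def disk_to_network_ratio(block_bytes, seek_s=0.010, disk_bytes_per_s=5e6, nic_bytes_per_s=10e6):
    """Disk time used per unit of network time for a block of the given size."""
    return disk_time(block_bytes, seek_s, disk_bytes_per_s) / network_time(block_bytes, nic_bytes_per_s)


small = disk_to_network_ratio(125_000)   # one second of a 1 Mbit/s file
large = disk_to_network_ratio(750_000)   # one second of a 6 Mbit/s file
print(small, large)
```

Because the ratio changes with block size, a mix of low bitrate streams can make the disks the bottleneck while a mix of high bitrate streams makes the NICs the bottleneck.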
Because a combined schedule cannot work for a system
where block sizes vary from stream to stream, multiple bitrate
Tiger systems implement a second schedule that describes the
activity on the network, called a network schedule. Unlike disks,
networks interleave all of the streams being sent. Because
networks process several streams simultaneously, the network
schedule is a two dimensional structure. The x-axis is time and
the y-axis bandwidth. The overall length of the schedule is the
block play time times the number of cubs,² while the height is the
bandwidth of a cub’s network interface cards (NICs). The length
of an entry in the network schedule is one block play time, and the
height is determined by the bitrate of the stream being serviced.
Figure 4 shows a network schedule constructed by assigning
bitrates to the viewers shown in the schedule in Figure 3. Each
viewer is represented by a block of a certain color. For example,
viewer 4 runs at 2 Mbit/s from time 0 to time 1, and viewer 0 runs
at 3 Mbit/s from time 1.125 to 2.125. A vertical slice up from the
pointer for a cub shows what that cub’s NIC is doing at the
current time, so cub 2 is most of the way through sending its block
for viewer 1, a little farther from the end of viewer 4’s block and
about a third of the way into viewer 3’s block. As time advances,
the cubs move from left to right through the schedule, wrapping
around at the end. In one block play time cub 0 will be at exactly
the same position that cub 2 occupies in the figure. The total
height of entries at any point in the schedule shows the
instantaneous load on the NICs when servicing that part of the
schedule.
A multiple bitrate Tiger system not only needs to assure that
its NICs aren’t overrun, it also has to assure that disk bandwidth
isn’t exceeded. Keeping a schedule similar to the one used for the
single bitrate system but with variable size slots is sufficient but
not necessary. The disk schedule in the single bitrate Tiger not
only avoids hotspots, it specifies the time at which each block
must be sent to the network. In the multiple bitrate system the
network schedule serves this function. Therefore, the specific
time ordering information in the disk schedule is not necessary in
the multiple bitrate system; entries in the disk schedule are free to
move around, as long as they’re completed before they’re due at
the network. Because of this reordering property, fragmentation
does not occur in the disk schedule.
Fragmentation can be a problem in the network schedule.
Consider the schedule shown in figure 4. The free bandwidth
below the 6 Mbit/s level between when viewer 4 finishes sending
and when viewer 2 starts is unusable, because any new entry
would be one block play time long, and the gap in the schedule is
slightly too short. In general, fragmentation can become fairly
severe if viewers are started at arbitrary points. We have found
that fragmentation is reduced to an acceptable level when viewers
are forced to start at times that are integral multiples of the block
play time divided by the decluster factor.
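The network schedule and the restricted start times can be modeled as a simple admission check. This is a minimal sketch, not Tiger's insertion algorithm: the function names are hypothetical, the 6 Mbit/s NIC capacity, 3 cubs and decluster factor 2 are assumed for the example, and only the two entries whose times are given in the Figure 4 description are included.

```python
from fractions import Fraction


def covers(start, point, length, play_time):
    """True if an entry beginning at `start` is still being sent at `point` (schedule wraps)."""
    return (point - start) % length < play_time


def load_at(entries, point, length, play_time):
    """Sum of the bitrates of all entries active at `point`."""
    return sum(rate for start, rate in entries if covers(start, point, length, play_time))


def fits(entries, start, rate, capacity, length, play_time):
    """Check that adding (start, rate) never pushes the load above `capacity`."""
    # The load is piecewise constant, so it is enough to test the new entry's
    # start plus every existing entry edge that falls inside the new entry.
    points = {start % length}
    for s, _ in entries:
        for edge in (s, s + play_time):
            if covers(start, edge, length, play_time):
                points.add(edge % length)
    return all(load_at(entries, p, length, play_time) + rate <= capacity for p in points)


def first_start(entries, rate, capacity, num_cubs, play_time, decluster):
    """Earliest allowed start (a multiple of play_time / decluster) that fits, or None."""
    length = play_time * num_cubs
    step = play_time / decluster
    for k in range(num_cubs * decluster):
        start = k * step
        if fits(entries, start, rate, capacity, length, play_time):
            return start
    return None


# (start time in block play times, Mbit/s): viewer 4 and viewer 0 from the text.
entries = [(Fraction(0), 2), (Fraction(9, 8), 3)]
print(first_start(entries, 4, 6, num_cubs=3, play_time=Fraction(1), decluster=2))
print(first_start(entries, 5, 6, num_cubs=3, play_time=Fraction(1), decluster=2))
```

Restricting candidate starts to a coarse grid means new entries tend to abut existing ones, which is one way to read the observation that the restriction keeps fragmentation acceptable.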
² This is different from the disk schedule, whose length is the
block play time times the number of disks. The difference is
because the output of all of the disks on a cub is sent to the
network through the same NICs.
3.3 Scalability Considerations
The question of whether it makes sense to distribute Tiger’s
schedule management depends on how large Tigers can grow and
how much work would be involved in central management of the
schedule. This section explores a limit on the size of Tigers
(which is probably not the limiting factor in the current
implementation), and considers the work involved in centrally
maintaining a schedule that large.
A fundamental limit of scalability in a Tiger system is the
number of different disks that hold a particular file. A typical
movie is about 100 minutes (6000 seconds) in length. If a block
is 1 second long and disks are the same speed as the ones used in
the experiment in section 5 (which can serve 10.75 streams each),
using 6000 disks to store a movie would mean that a single copy
of a movie could serve over 64,000 streams. In practice, we
would expect to not build systems quite this large, because
serving the full 64,000 viewers would require that they be evenly
spaced over 100 minutes. Still, with better disk technology it is
not hard to imagine Tiger systems with as many as 30,000 to
40,000 streams. Such a system would have on the order of 1000
cubs.
In a centrally scheduled system, the controller would have to
track the entire schedule. Even with 40,000 streams, just keeping
up with the schedule is quite possible with a reasonable computer.
However, the controller would also have to communicate the
schedule to the cubs. If the message that the controller sends
instructing a cub to deliver a block to a viewer is 100 bytes long
(which is about the size of the comparable message sent from cub
to cub in the distributed system), the controller would have to
maintain a send rate of 3-4 Mbytes/s of control traffic through the
TCP stack to the roughly 1000 cubs. Reliable and timely
transmission of this much data through TCP, particularly to that
many destinations, is probably beyond the capability of the class
of personal computers used to construct a Tiger system.
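The figures in this section follow from simple arithmetic, reproduced here as a check. The function names are illustrative, and the control traffic estimate assumes one 100 byte message per stream per block play time, which is how we read the description of the centralized design.

```python
def max_streams_per_copy(movie_seconds, block_play_time_s, streams_per_disk):
    """Streams one copy of a movie can serve if every block sits on its own disk."""
    disks = movie_seconds / block_play_time_s
    return disks * streams_per_disk


def control_bytes_per_second(streams, message_bytes, block_play_time_s):
    """Control traffic a central scheduler sends: one message per stream per block."""
    return streams * message_bytes / block_play_time_s


print(max_streams_per_copy(6000, 1, 10.75))        # 6000 disks at 10.75 streams each
print(control_bytes_per_second(30_000, 100, 1))    # lower end of the estimate
print(control_bytes_per_second(40_000, 100, 1))    # upper end of the estimate
```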
In addition to making control scalability easier, distributing
the schedule also eliminates the most complex aspect of having
the central controller as a single point of failure. Making the
remaining functions fault tolerant is a simple exercise, and will be
completed by the product team. We chose to distribute schedule
management because of the combination of the fault tolerance and
scalability benefits.
4. Distributed Schedule Management
Consider the descriptions of the Tiger schedules in section 3.
They are worded as if there is a single disk or network schedule
for the entire Tiger system. Conceptually, this is true. In practice,
the schedule management is distributed among the cubs. Each
cub has partial (and possibly incorrect) knowledge of the global
schedule, but behaves as if the entire schedule exists. The net
result is a system that as a whole acts as if there were a global
schedule, but which is scalable and fault tolerant.
We use the term coherent hallucination to mean a distributed
implementation of a shared object, when there is no physical
instantiation of the object. The Tiger schedule is a coherent
hallucination because no particular machine holds a copy of the
entire schedule, but yet each behaves as if there is a single,
coherent global schedule.
There are two major components to a coherent hallucination.
The first is the imaginary shared object, the
“hallucination.” Second is the concept of a view. A view is the
picture that a participant in a coherent hallucination-based system
has of the hallucination. Views may be incomplete or out-of-date