Distributed Schedule Management in the Tiger Video Fileserver
by William J. Bolosky, Robert P. Fitzgerald and John R. Douceur
Tiger is a scalable, fault-tolerant video file server constructed from a collection of computers
connected by a switched network. All content files are striped across all of the computers and disks
in a Tiger system. In order to prevent conflicts for a particular resource between two viewers,
Tiger schedules viewers so that they do not require access to the same resource at the same time. In
the abstract, there is a single, global schedule that describes all of the viewers in the system. In
practice, the schedule is distributed among all of the computers in the system, each of which has a
possibly partially inconsistent view of a subset of the schedule. By using such a relaxed
consistency model for the schedule, Tiger achieves scalability and fault tolerance while still
providing the consistent, coordinated service required by viewers.
1. Introduction
In the past few years, relatively inexpensive computers, disk drives, network interfaces and
network switches have become capable of handling high quality video data. Exploiting this
capability requires solutions to a number of different problems such as providing the requisite
quality of service in the network, handling time sensitive data at the clients and building
servers to handle the real-time, storage and aggregate bandwidth requirements of a video server.
Many researchers have attacked various facets of these problems, coming up with many creative
solutions. We built a video server, choosing to use a distributed system structure. In building
this server, we were faced with having to control the system in a scalable, fault tolerant
manner. This paper describes our solution to this control problem.
Tiger [Bolosky96], the technology underlying the Microsoft
Netshow Professional Video Server, is a video fileserver
intended to supply digital video data on demand to up to tens of
thousands of users simultaneously. Tiger must supply each of these viewers with a data stream
that is independent of all other viewers; multiplexing or “near video-on-demand” does not meet
Tiger’s requirements. The key observations driving the Tiger
design are that the data rate of a single video stream is small
relative to the I/O bandwidth of personal computers, and that I/O
and switching bandwidth is cheaper in personal computers and
network switches than in large computer memory systems and
backplanes. Tiger is organized as a collection of machines
connected together with an ATM (or other type of) switched
network. While this distributed organization reduces the
hardware cost per stream of video and improves scalability over
monolithic designs, it introduces a host of problems related to
controlling the system.
The data in a Tiger file is striped across all of the computers and all of the disks within the
system. When a client (viewer) wants to play a particular file, Tiger must assure that the system
capacity exists to supply the data to the viewer without violating
guarantees made to viewers already receiving service. In order to
keep these commitments, Tiger maintains a schedule of viewers
that are playing files. The size of this schedule is proportional to
the total capacity of the system, and so central management of the
schedule would not scale arbitrarily. In order to remove a single point of failure and to improve
scalability, the schedule is implemented in a distributed fashion across the computers comprising
the Tiger server. Each of the computers has a potentially out-of-date view of part of the schedule
(and no view at all of the rest), and uses a fault- and latency-tolerant protocol to update these
views. Based on their views, the computers take
action to send the required data to the viewers at the proper time,
to add and remove viewers from the schedule, and to compensate
for the failure of system components.
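To make the view-updating idea concrete, the following sketch (ours, not Tiger's actual protocol
or data structures) shows one way a cub could merge schedule updates into a partial view,
tolerating late, duplicated, or reordered messages by keeping only the newest information per slot.

    # Illustrative sketch of merging schedule-view updates; the slot/version
    # scheme and all names are hypothetical, not taken from Tiger.
    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class SlotEntry:
        viewer: Optional[str]   # viewer occupying this slot, or None if free
        version: int            # sequence number carried by the update

    class PartialScheduleView:
        """A cub's possibly stale view of a subset of the global schedule."""
        def __init__(self) -> None:
            self.slots: Dict[int, SlotEntry] = {}

        def apply_update(self, slot: int, viewer: Optional[str], version: int) -> bool:
            """Apply an update from another cub; older information is ignored,
            so messages may arrive late, duplicated, or out of order."""
            current = self.slots.get(slot)
            if current is not None and current.version >= version:
                return False
            self.slots[slot] = SlotEntry(viewer, version)
            return True

    view = PartialScheduleView()
    view.apply_update(slot=7, viewer="viewer-42", version=3)
    view.apply_update(slot=7, viewer=None, version=2)   # stale removal: ignored
    assert view.slots[7].viewer == "viewer-42"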
Tiger behaves as if there is a single, consistent, global
schedule. For reasons of scalability and fault tolerance, the
schedule does not exist in that form. Rather, each of the component computers acts as if the
global schedule exists, but a
component computer only has a partial, possibly out-of-date view
of it. Because the component computers are acting based on a
non-existent global schedule, we call the global schedule a
hallucination. Because many component computers share a
common hallucination, we say that the hallucination is coherent.
The coherent hallucination model is a particularly powerful one
for distributed protocol design, because it allows the designer to
split the problem into two parts: generating a correct centralized
abstraction, and creating appropriate local views of that abstraction.
The remainder of this paper is organized in four major
sections. The first describes the basic design of Tiger, including
the hardware organization, data layout and fault tolerance aspects
of the system; this is necessary background for the functioning of the schedule. The next two
sections describe the Tiger schedule, the first treating the schedule as a single, centralized
object, and the second its distributed implementation. The final major section presents performance
results showing that a modest-sized Tiger system scales linearly.
The paper wraps up with a related work section and conclusions.
2. The Tiger Architecture
2.1 Tiger Hardware Organization
A Tiger system is made up of a single Tiger controller machine, a
set of identically configured machines – called “cubs” – to hold
content, and a switched network connecting the cubs and the
clients. The Tiger controller serves only as a contact point (i.e.,
an IP address) for clients, the system clock master, and a few
other low effort tasks; if it became an impediment to scalability,
distributing these tasks would not be difficult. Each of the cubs
hosts a number of disks, which are used to store the file content.
The cubs have one or more interfaces to the switched network to
send file data and to communicate with other cubs, and some sort
of connection to the controller, possibly using the switched
network or using a separate network such as an ethernet. The
switched network itself is typically an ATM network, but may be
built of any scalable networking infrastructure.
The network topology may be complex, but to simplify discussion we assume that it is a single ATM
switch of sufficient bandwidth to carry all of the traffic presented to it.
For the purposes of this paper, the important property of
Tiger’s hardware arrangement is that the cubs are connected to
one another through the switched network, so the total bandwidth
available to communicate between cubs grows as the system
capacity grows (although the bandwidth to and from any
particular cub stays constant regardless of the system size).
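As a rough illustration of this property (the numbers below are ours, not measurements from the
paper), aggregate stream capacity grows linearly with the number of cubs because each cub
contributes a fixed amount of network bandwidth:

    # Back-of-the-envelope capacity sketch with illustrative numbers.
    def max_streams(num_cubs: int, cub_net_bandwidth_bps: float,
                    stream_rate_bps: float) -> int:
        """Upper bound on simultaneous streams, ignoring disk and CPU limits."""
        return int(num_cubs * cub_net_bandwidth_bps // stream_rate_bps)

    # e.g. 10 cubs with 100 Mbit/s of usable network bandwidth each, 6 Mbit/s streams
    print(max_streams(10, 100e6, 6e6))   # -> 166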
2.2 File Data Layout
Every file is striped across every disk and every cub in a Tiger
system, provided that the file is large enough. Tiger numbers its
disks in cub-minor order: Disk 0 is on cub 0, disk 1 is on cub 1,
disk n is on cub 0, disk n+1 is on cub 1 and so forth, assuming that
there are n cubs in the system. Files are broken up into blocks,
which are pieces of equal duration. For each file, a starting disk is
selected in some manner, the first block of the file is placed on
that disk, the next block is placed on the succeeding disk and so
on, until the highest numbered disk is reached. At that point,
Tiger places the next block on disk 0, and the process continues
for the rest of the file. The duration of a block is called the “block
play time,” and is typically around one second for systems
configured to handle video rate (1-10Mbit/s) files. The block play
time is the same for every file in a particular Tiger system.
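A minimal sketch of this layout (the helper names are ours) maps a file's block index to a disk
and then, by cub-minor numbering, to a cub:

    # Cub-minor striping: disk d lives on cub (d mod num_cubs); block i of a
    # file that starts on disk s is stored on disk (s + i) mod total_disks.
    def disk_for_block(start_disk: int, block_index: int, total_disks: int) -> int:
        return (start_disk + block_index) % total_disks

    def cub_for_disk(disk: int, num_cubs: int) -> int:
        return disk % num_cubs

    num_cubs, disks_per_cub = 4, 2
    total_disks = num_cubs * disks_per_cub
    for i in range(6):   # first six blocks of a file whose starting disk is 5
        d = disk_for_block(5, i, total_disks)
        print(f"block {i}: disk {d}, cub {cub_for_disk(d, num_cubs)}")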
Tiger uses this striping layout in order to handle imbalances
in demand for particular files. Because each file has blocks on
every disk and every server, over the course of playing a file the
load is distributed among all of the system components. Thus, the
system will not overload even if all of the viewers request the
same file, assuming that they are equitemporally spaced. If they are not, Tiger will delay
starting streams in order to enforce this spacing.
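The following simplified sketch (a hypothetical slot model, not the insertion algorithm Tiger
actually uses) illustrates why a start may be delayed: the new viewer waits until a schedule slot
that does not collide with existing viewers comes around.

    # Hypothetical illustration of start delay in a slotted schedule.
    def start_delay(occupied_slots: set, now_slot: int, total_slots: int) -> int:
        """Slots the new viewer must wait for the first free slot at or after
        now_slot; returns -1 if the schedule is full."""
        for wait in range(total_slots):
            if (now_slot + wait) % total_slots not in occupied_slots:
                return wait
        return -1

    print(start_delay({0, 1, 2}, now_slot=1, total_slots=8))   # -> 2 (slot 3 is free)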
One disadvantage of striping across all disks is that changing
the system configuration by adding or removing cubs and/or disks
requires changing the layout of all of the files and all of the disks.
Tiger includes software to update (or “re-stripe”) from one
configuration to another. Because of the switched network
between the cubs, the time to restripe a system does not depend
on the size of the system, but only on the size and speed of the
cubs and their disks.
This paper describes two different versions of the Tiger
system, called “single bit rate” and “multiple bit rate.” The single
bitrate version allocates all files as if they are the same bitrate,
and wastes capacity on files that are slower than this maximum
configured speed. The multiple bitrate version is more efficient in
dealing with files of differing speed. Because the block play time
of every file in a Tiger system is the same as that of any other file,
all blocks are of the same size in a single bitrate server (and files
of less than the configured maximum bitrate suffer internal
fragmentation in their blocks). In a multiple bitrate server block
sizes are proportional to the file bitrate. Tiger stores each block
contiguously on disk in order to minimize seeks and to have
predictable block read performance. Tiger DMA’s blocks directly
into pre-allocated buffers, avoiding any copies. Tiger’s network
code also DMA’s directly out of these buffers, resulting in a zero-
copy disk-to-network data path.
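A small sketch of the block-size arithmetic implied above (the helper names and example rates are
ours): in a single bit rate server every block is sized for the configured maximum rate, while in
a multiple bit rate server the block size is proportional to the file's own rate.

    BLOCK_PLAY_TIME_S = 1.0   # typical value cited in the text

    def single_bitrate_block_bytes(max_rate_bps: float) -> int:
        # Every block is sized for the configured maximum; slower files waste
        # the difference as internal fragmentation.
        return int(max_rate_bps * BLOCK_PLAY_TIME_S / 8)

    def multi_bitrate_block_bytes(file_rate_bps: float) -> int:
        # Block size proportional to the file's bitrate.
        return int(file_rate_bps * BLOCK_PLAY_TIME_S / 8)

    print(single_bitrate_block_bytes(6e6))   # 750000 bytes for a 6 Mbit/s maximum
    print(multi_bitrate_block_bytes(2e6))    # 250000 bytes for a 2 Mbit/s file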
2.3 Fault Tolerance
One goal of Tiger is to tolerate the failure of any single disk or
cub within the system with no ongoing degradation of service.
Tiger does not tolerate complete failure of the switched network
or of the Tiger controller. There are several challenges involved
in providing this level of fault tolerance. The first is to be sure
that the file data remains available even when the hardware
holding it has failed. The second is assuring that the additional
load placed on the non-failed components does not cause them to
overload or hotspot. A final challenge is to detect failures and
reconfigure the responsibilities of the surviving components to
cope with the loss. This section describes our answers to the first
challenge. Maintaining the schedule across failures is covered in
section 4. Detecting faults is accomplished by a deadman
protocol that runs between the cubs.
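As a generic illustration of deadman-style failure detection (the timeout, structure, and names
below are ours; this excerpt does not describe the actual protocol's messages or timing), each cub
could track when it last heard from each peer and declare peers dead after a silence threshold.

    import time

    DEADMAN_TIMEOUT_S = 2.0   # illustrative threshold, not Tiger's value

    class CubMonitor:
        """Declares a peer cub dead if no heartbeat arrives within the timeout."""
        def __init__(self, peers):
            self.last_heard = {p: time.monotonic() for p in peers}

        def heartbeat(self, peer: str) -> None:
            self.last_heard[peer] = time.monotonic()

        def dead_peers(self):
            now = time.monotonic()
            return [p for p, t in self.last_heard.items()
                    if now - t > DEADMAN_TIMEOUT_S]

    monitor = CubMonitor(["cub1", "cub2"])
    monitor.heartbeat("cub1")
    print(monitor.dead_peers())   # [] right after startup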
While the Tiger controller is a single point of failure in the
current implementation, the distributed schedule work described
in this paper removes the major function that the controller in a
centralized Tiger system would have. The Netshow™ product group plans on making the remaining
functions of the controller fault tolerant. When they have completed this task, the fault
tolerance aspects of the distributed schedule will have come to full fruition.
[Figure 1: Typical Tiger Hardware Organization. Cubs (Cub 0, Cub 1, ...) connected by an ATM
switching fabric.]
load. It looks lower than you would expect because most dots are
clustered at the lower loads and overwrite one another on the
graph. We did not show startup times for schedule loads lower
than 50%, but they were all clustered around 1.8 seconds, the
minimum startup time. 1 second of this time is due to the time to
transmit a 1 second Tiger block. The test client records the receive time of a block to be when
the last byte of the block
arrives rather than when the first byte arrives. Video rendering
(non-test) clients are free to begin rendering before the entire
block arrives, however, so they may mask some of this second.
The remaining 800ms is a combination of network latency and
scheduling lead (which includes time for the first block disk read).
Even at schedule loads of 95%, the mean time to start a
viewer is less than 5 seconds. However, there are a reasonable
number of outliers that took over 20 seconds. For that reason, we
do not recommend running Tiger systems at greater than 90%
load, and suggest limiting them to even lower loads. Tiger contains code to prevent schedule
insertions beyond a certain level, which we disabled for this test. At very high schedule loads,
some insertions took about as long as the entire 56s
schedule to complete, and in larger systems would take longer.
A final measurement was the time for the system to
reconfigure from a cub failure. We loaded the system to 50% of
capacity and cut the power to a cub. We inspected the clients’
logs and found about 8 seconds between the earliest and latest lost blocks.
6. Related Work
Tiger systems are typically built entirely of commodity hardware
components, allowing them to take advantage of commodity
hardware price curves. By contrast, other commercial video servers, such as those produced by
Silicon Graphics [Nelson95], tend to rely on super-computer backplanes or massively parallel
memory and I/O systems in
order to provide the needed bandwidth. These servers also tend to
allocate entire copies of movies at single servers, requiring that
content be replicated across a number of servers proportional to
the expected demand for the content. Tiger, by contrast, stripes all
content, eliminating the need for additional replicas to satisfy demand. [Berson94] describes an
independently developed single-machine disk striping algorithm with some similarities to that used
by Tiger.
SPIFFI [Freedman96] is a parallel file system implemented on an Intel
Paragon system that can stripe data across large numbers of disks
and can be used for multimedia files (as well as for more
traditional parallel filesystem tasks).
There is a certain similarity between the coherent hallucination model and distributed
[Li88; Nitzberg91] or tightly
coupled [Kuskin94; LaRowe91] shared memory multiprocessing.
Both types of systems have a notion of a global abstraction upon
which multiple participants act. Both require some attention by
the programmer to keep coherence between the participants. The
primary difference lies in that shared memory systems do not have
a hallucination, but rather directly implement the global
abstraction. They are usually more tightly coupled, and often lack
fault tolerance. In these systems, a view corresponds to the
portion of the shared data structure that is used by any particular
participant. Because the view is not explicit to the programmer, it
is often harder to judge the scalability and access patterns.
The implementations of some existing wide scale distributed
systems can be viewed as coherent hallucinations. For example,
the Domain Name System [Mockapetris88] can be viewed as a
simple form of coherent hallucination. A directory of the global namespace is the hallucination,
while each DNS server’s
authoritative knowledge and cached information make up the
views. Other examples include protocols such as RIP [Malkin94],
OSPF [Moy94] and BGP [Rekhter95] for IP routing. In these
protocols, the existence, up/down state and speed/load of all of the
routers and links in the network take the place of the
hallucination, and the current set of beliefs about them correspond
to views. These protocols differ from Tiger’s coherent
hallucination in that the views describe the entire system rather
than just a subset, but like a view in a coherent hallucination they
are allowed to be out of date. A further example is the portion of
the Network Time Protocol [Mills91] dealing with cascaded
synchronization. The synchronization tree is a hallucination; it describes propagation through a
synchronization subnet, yet it is not fully represented at any node in the system. Each node's
view is the peer selection process performed with respect to the node's immediate peers.
7. Summary and Conclusions
Tiger is a video server that is designed to scale to tens of
thousands of simultaneous streams of digital video. It stripes its
content files across a collection of personal computers and high
speed disks, and combines the file blocks into a stream through an
ATM switch. It uses a schedule to prevent resource conflicts
among viewers. In the abstract, the schedule is a data structure
whose size is proportional to that of the Tiger system. In practice, the machines comprising the
Tiger system see only part of the global schedule, and have only non-authoritative knowledge about
most of what they know, a technique we name “coherent hallucination.”
We found that:
- Tiger maintains its schedule in a manner that is fault tolerant, robust and scalable.
- Tiger is able to provide a number of streams of video data that is not limited by its schedule
  management algorithms, but rather by its hardware’s bandwidth.
Acknowledgements
We would like to thank Troy Batterberry, Akhlaq Khatri, Erik
Hedberg, and Steven Levi for the use of their equipment and
talents in collecting the data for the performance measurements.
We would also like to thank Bill Schiefelbein, Chih-Kan Wang,
Aamer Hydrie and the rest of the Netshow Pro Video Server
team for their help with the software and ideas we describe. We
owe a debt to the SOSP program committee and outside reviewers
for their suggestions on the organization and presentation of the
paper. We would like to thank Garth Gibson, Rick Rashid and
Nathan Myhrvold for their architectural suggestions during the
early phase of the Tiger project, and Fyeb for dex.
References
[Berson94] S. Berson, S. Ghandeharizadeh, R. Muntz, X. Ju.
Staggered Striping in Multimedia Information Systems.
In ACM SIGMOD ‘94, pages 79-90.
[Bolosky96] W. Bolosky, J. Barrera III, R. Draves, R. Fitzgerald,
G. Gibson, M. Jones, S. Levi, N. Myhrvold, R. Rashid.
The Tiger Video Fileserver. In Proceedings of the Sixth
International Workshop on Network and Operating
System Support for Digital Audio and Video. IEEE
Computer Society, Zushi, Japan, April 1996. Also
available from www.research.microsoft.com in the
operating systems area.
[Freedman96] C. S. Freedman, J. Burger and D. J. DeWitt.
SPIFFI – A Scalable Parallel File System for the Intel
Paragon. In IEEE Trans. on Parallel and Distributed
Systems, 7(11), pages 1185-1200, November 1996.
[Kuskin94] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R.
Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J.
Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J.
Hennessy. The Stanford FLASH Multiprocessor. In
Proceedings of the 21st International Symposium on
Computer Architecture, pages 302-313, April, 1994.
[LaRowe91] R. LaRowe, C. Ellis and L. Kaplan. The Robustness
of NUMA Memory Management. In SOSP 13, October, 1991.
[Laursen94] A. Laursen, J. Olkin, and M. Porter. Oracle Media
Server: Providing Consumer Based Interactive Access
to Multimedia Data. In ACM SIGMOD ‘94, pages 470-
[Li88] K. Li. IVY: A Shared Memory Virtual Memory System
for Parallel Computing. In Proceedings of the 1988
International Conference on Parallel Processing, pages
II-94 – II-101, 1988.
[Malkin94] G. Malkin. RIP Version 2 Protocol Analysis. RFC
1721. November, 1994.
[Mills91] D. L. Mills. Internet Time Synchronization: The
Network Time Protocol. In IEEE Transactions on
Communications, pages 1482-1493, Vol. 39, No. 10, October, 1991.
[Moy94] J. Moy. OSPF Version 2. RFC 1583. March, 1994.
[Nelson95] M. Nelson, M. Linton, and S. Owicki. A Highly
Available, Scalable ITV System. In SOSP 15, pages 54-
67. December, 1995.
[Mockapetris88] P. Mockapetris and K. Dunlap. Development of
the Domain Name System. In Proceedings of
SIGCOMM ’88, pages 123-133, April 1988.
[Nitzberg91] B. Nitzberg and V. Lo. Distributed Shared Memory:
A Survey of Issues and Algorithms. Computer,
24(8):52-60, August, 1991.
[Patterson88] D. Patterson, G. Gibson, R. Katz. A Case for
Redundant Arrays of Inexpensive Disks (RAID). In
ACM SIGMOD ‘88, pages 109-116.
[Rekhter95] Y. Rekhter, T. Li. A Border Gateway Protocol 4
(BGP-4). RFC 1771. March, 1995.
[Ruemmler94] C. Ruemmler and J. Wilkes. An Introduction to
Disk Drive Modeling. Computer, 27(2):17-28, March, 1994.
[Van Meter97] R. Van Meter. Observing the Effects of Multi-
Zone Disks. In Proceedings of the USENIX 1997
Annual Technical Conference, pages 19-30, January, 1997.
[Figure 8: Tiger Loads, No Cubs Failed. Series: Controller CPU, Cub CPU, Disk Load, Control Bytes
(Bytes/s for control bytes).]
[Figure 9: Tiger Loads, One Cub Failed.]
[Figure 10: Stream Startup Latency. Time to delivery of first block (s) versus schedule load
(55% to 100%).]