
High-Performance Web Site Design Techniques


Abstract

This article presents techniques for designing Web sites that need to handle large request volumes and provide high availability. The authors present new techniques they developed for keeping cached dynamic data current and synchronizing caches with underlying databases. Many of these techniques were deployed at the official Web site for the 1998 Olympic Winter Games.
IBM T.J. Watson Research Center
Performance and high availability are critical at Web sites that receive large numbers of requests. This article presents several techniques—including redundant hardware, load balancing, Web server acceleration, and efficient management of dynamic data—that can be used at popular sites to improve performance and availability. We describe how we deployed several of these techniques at the official Web site for the 1998 Olympic Winter Games in Nagano, Japan, which was one of the most popular sites up to that time. In fact, the Guinness Book of World Records recognized the site on 14 July 1998 for setting two records:
- “Most Popular Internet Event Ever Recorded,” based on the officially audited figure of 634.7 million requests over the 16 days of the Olympic Games; and
- “Most Hits on an Internet Site in One Minute,” based on the officially audited figure of 110,414 hits received in a single minute around the time of the women’s freestyle figure skating.
The site’s visibility, large number of requests, and amount of data made performance and high availability critical design issues. The site’s architecture was an outgrowth of our experience designing and implementing the 1996 Olympic Summer Games Web site. In this article, we first describe general techniques that can be used to improve performance and availability at popular Web sites. Then we describe how several of these were deployed at the official Web site in Nagano.
Load Balancing
To handle heavy traffic loads, Web sites must use multiple servers running on
different computers. The servers can share information through a shared file
system, such as Andrew File System (AFS) or Distributed File System (DFS),
or via a shared database; otherwise, data can be replicated across the servers.
The Round-Robin Domain Name Server (RR-DNS) approach, which NCSA used for its server, is one method for distributing requests to multiple servers. RR-DNS allows a single domain name to be associated with multiple IP addresses, each of which could represent a different Web server. Client
requests specifying the domain name are mapped to
servers in a round-robin fashion. Several problems
arise with RR-DNS, however.
Server-side caching. Caching name-to-IP address
mappings at name servers can cause load imbal-
ances. Typically, several name servers cache the
resolved name-to-IP-address mapping between
clients and the RR-DNS. To force a mapping to
different server IP addresses, RR-DNS can specify
a time-to-live (TTL) for a resolved name, such that
requests made after the specified TTL are not
resolved in the local name server. Instead, they are
forwarded to the authoritative RR-DNS to be
remapped to a different HTTP server’s IP address.
Multiple name requests made during the TTL peri-
od would be mapped to the same HTTP server.
Because a small TTL can significantly increase
network traffic for name resolution, name servers
will often ignore a very small TTL given by the RR-
DNS and impose their own minimum TTL
instead. Thus, there is no way to prevent interme-
diate name servers from caching the resolved name-
to-IP address mapping—even by using small TTLs.
Many clients, such as those served by the same
Internet service provider, may share a name server,
and may therefore be pointed to a single specific
Web server.
Client-side caching. Client-side caching of resolved
name-to-IP address mappings can also cause load
imbalances. The load on the HTTP servers cannot
be controlled, but rather will vary with client access
patterns. Furthermore, clients make requests in
bursts as each Web page typically involves fetching
several objects including text and images; each burst
is directed to a single server node, which increases
the skew. These effects can lead to significant imbal-
ances that may require the cluster to operate at
lower mean loads in order to handle peak loads (see Dias et al.).
Simplistic load balancing. Another problem with RR-DNS is that the
round-robin method is often too simplistic for pro-
viding good load balancing. We must consider fac-
tors such as the load on individual servers. For
instance, requests for dynamic data constructed
from many database accesses at a server node can
overload a particular Web server.
Node failures. Finally, client and name server
caching of resolved name-to-IP-address mappings
make it difficult to provide high availability in the
face of Web server node failures. Since clients and
name servers are unaware of the failures, they may
continue to make requests to failed Web servers.
Similarly, online maintenance may require bringing
down a specific Web server node in a cluster. Again,
giving out individual node IP addresses to the client
and name servers makes this difficult. While we can
configure backup servers and perform IP address
takeovers for detected Web server node failures, or
for maintenance, it is hard to manage these tasks.
Moreover, an active backup node may end up with
twice the load if a primary node fails.
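The TTL-caching problem described above can be sketched in a few lines of Python. This is an illustrative simulation, not any real resolver: the IP addresses, TTL values, and class names are made up. It shows how one caching resolver shared by many clients pins an entire TTL window of requests to a single server.

```python
# Minimal sketch: an authoritative RR-DNS rotates through server IPs,
# but a shared local resolver caches one answer and may impose its own
# minimum TTL, so every request it serves in that window hits one server.
from collections import Counter

class RoundRobinDNS:
    """Authoritative server that rotates through a list of IPs."""
    def __init__(self, ips, ttl):
        self.ips, self.ttl, self.next = ips, ttl, 0

    def resolve(self):
        ip = self.ips[self.next % len(self.ips)]
        self.next += 1
        return ip, self.ttl          # answer plus its time-to-live

class CachingResolver:
    """Local name server that honors (or inflates) the TTL it receives."""
    def __init__(self, authority, min_ttl=0):
        self.authority, self.min_ttl = authority, min_ttl
        self.cached_ip, self.expires = None, -1

    def resolve(self, now):
        if now >= self.expires:      # cache expired: go back to the RR-DNS
            ip, ttl = self.authority.resolve()
            self.cached_ip = ip
            self.expires = now + max(ttl, self.min_ttl)
        return self.cached_ip

dns = RoundRobinDNS(["10.0.0.1", "10.0.0.2", "10.0.0.3"], ttl=60)
isp = CachingResolver(dns, min_ttl=300)   # resolver imposes its own minimum
hits = Counter(isp.resolve(now=t) for t in range(300))
# Every request in the 300-second window lands on the same server.
```

The small TTL given by the RR-DNS (60 seconds) is overridden by the resolver's 300-second minimum, so all 300 requests in the window map to one server node.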
TCP Routing
Figure 1 illustrates a load-balancing method based
on routing at the TCP level (rather than standard
IP-level routing). A node of the cluster serves as a
so-called TCP router, forwarding client requests to
the Web server nodes in the cluster in a round-
robin (or other) order. The router’s name and IP
address are public, while the addresses of the other
nodes in the cluster are hidden from clients. (If
there is more than one TCP router node, RR-DNS
maps a single name to the multiple TCP routers.)
The client sends requests to the TCP router node,
which forwards all packets belonging to a particu-
lar TCP connection to one of the server nodes. The
TCP router can use different load-based algorithms
to select which node to route to, or it can use a sim-
ple round-robin scheme, which often is less effec-
tive than load-based algorithms. The server nodes
bypass the TCP router and respond directly to the
client. Note that because the response packets are
larger than the request packets, the TCP router
adds only small overhead.
The TCP router scheme allows for better load
balancing than DNS-based solutions and avoids
Figure 1. The TCP router routes requests to various Web servers. Server responses go directly to clients, bypassing the router.
the problem of client or name server caching. TCP routers can use sophisticated load-balancing algorithms that take individual server loads into
account. A TCP router can also detect Web serv-
er node failures and route user requests to only the
available Web server nodes. The system adminis-
trator can change the TCP router configuration to
remove or add Web server nodes, which facilitates
Web server cluster maintenance. By configuring a
backup TCP router, we can handle failure of the
TCP router node, and the backup router can oper-
ate as a Web server during normal operation. On
detecting the primary TCP router’s failure, the
backup would route client requests to the remain-
ing Web server nodes, possibly excluding itself.
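The forwarding decision described above can be sketched as follows. This is a hypothetical structure, not IBM Network Dispatcher's actual implementation: it shows a router that pins each TCP connection to one server, supports both round-robin and load-based selection, and skips nodes marked down.

```python
# Sketch of a TCP-router-style dispatcher: pick a server per new
# connection, pin all packets of that connection to it, and avoid
# servers that have been marked as failed.
import itertools

class TCPRouter:
    def __init__(self, servers):
        self.servers = list(servers)
        self.load = {s: 0 for s in servers}   # open connections per server
        self.down = set()
        self.conn_table = {}                  # connection id -> server
        self._rr = itertools.cycle(self.servers)

    def open_connection(self, conn_id, policy="least-load"):
        alive = [s for s in self.servers if s not in self.down]
        if policy == "least-load":
            server = min(alive, key=lambda s: self.load[s])
        else:                                 # simple round-robin
            server = next(s for s in self._rr if s not in self.down)
        self.conn_table[conn_id] = server
        self.load[server] += 1
        return server

    def forward(self, conn_id):
        # Every packet of an established connection goes to the same
        # server; responses bypass the router and flow to the client.
        return self.conn_table[conn_id]

    def mark_down(self, server):
        self.down.add(server)                 # future connections avoid it

router = TCPRouter(["web1", "web2", "web3"])
a = router.open_connection("c1")
router.mark_down("web1")
b = router.open_connection("c2")              # failed node is skipped
```

Existing connections keep their server assignment, while new connections are steered away from failed nodes, which is how the router provides both session continuity and availability.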
Commercially available TCP routers. Many TCP
routers are available commercially. For example,
IBM’s Network Dispatcher (ND) runs on stock
hardware under several operating systems (OS),
including Unix, Sun Solaris, Windows NT, and a
proprietary embedded OS.
An embedded OS improves router performance
by optimizing the TCP communications stack and
eliminating the scheduler and interrupt processing
overheads of a general-purpose OS. The ND can
route up to 10,000 HTTP requests per second
when running under an embedded OS on a
uniprocessor machine, which is a higher request
rate than most Web sites receive.
Other commercially available TCP routers include Radware’s Web Server Director and Resonate’s Central Dispatch. Cisco Systems’ LocalDirector differs from the TCP router approach because packets returned from servers go through the LocalDirector before being returned to clients. For a comparison of different load-balancing approaches, see Cardellini, Colajanni, and Yu.
Combining TCP routing and RR-DNS. If a single
TCP router has insufficient capacity to route
requests without becoming a bottleneck, the
TCP-router and DNS schemes can be combined
in various ways. For example, the RR-DNS
method can be used to map an IP address to mul-
tiple router nodes. This hybrid scheme can toler-
ate the load imbalance produced by RR-DNS
because the corresponding router will route any
burst of requests mapped by the RR-DNS to dif-
ferent server nodes. It achieves good scalability
because a long TTL can ensure that the node run-
ning the RR-DNS does not become a bottleneck,
and several router nodes can be used together.
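The hybrid scheme can be sketched as follows. The names and counts are illustrative: RR-DNS spreads clients across router IPs, and each router then spreads the connections it receives across all server nodes, so a DNS-induced burst pinned to one router is still balanced.

```python
# Sketch of the combined RR-DNS + TCP-router scheme: a DNS tier
# rotates over router IPs, and each router independently round-robins
# over the full set of server nodes.
import itertools

class HybridDispatcher:
    def __init__(self, router_ips, servers):
        self._dns = itertools.cycle(router_ips)         # RR-DNS tier
        self._routers = {ip: itertools.cycle(servers)   # per-router RR
                         for ip in router_ips}

    def resolve(self):
        """DNS maps the site name to one of the router IPs."""
        return next(self._dns)

    def connect(self, router_ip):
        """The chosen router forwards the connection to a server."""
        return next(self._routers[router_ip])

d = HybridDispatcher(["r1", "r2"], ["web1", "web2", "web3"])
router_ip = d.resolve()          # a burst of requests from one client
burst = [d.connect(router_ip) for _ in range(3)]
# DNS pinned the burst to one router, but the router spreads it out.
```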
Special requirements. One issue with TCP routing
is that it spreads requests from a single client across
Web server nodes. While this provides good load
balancing, some applications need their requests
routed to specific servers. To support such applica-
tions, ND allows requests to be routed with an
affinity toward specific servers. This is useful for
requests encrypted using Secure Sockets Layer
(SSL), which generates a session key to encrypt
information passed between a client and server. Ses-
sion keys are expensive to generate. To avoid regen-
erating one for every SSL request, they typically
have a lifetime of about 100 seconds. After a client
and server have established a session key, all
requests between the specific client and server with-
in that lifetime use the same session key.
In a system with multiple Web servers, however,
one server will not know about session keys gener-
ated by another. If a simple load-balancing scheme
like round-robin is used, it is highly probable that
multiple SSL requests from the same client will be
sent to different servers within the session key’s life-
time—resulting in unnecessary generation of addi-
tional keys. ND avoids this problem by routing two
SSL requests received from the same client within
100 seconds of each other to a single server.
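The affinity rule can be sketched as follows. The 100-second window matches the session-key lifetime discussed above, but the table shape and names are illustrative, not ND's actual data structures.

```python
# Sketch of SSL affinity routing: requests from the same client within
# the affinity window go to the same server, so the session key is not
# regenerated; after the window expires, a new server may be chosen.
import itertools

class AffinityRouter:
    WINDOW = 100  # seconds, roughly a session-key lifetime

    def __init__(self, servers):
        self._rr = itertools.cycle(servers)
        self.affinity = {}   # client ip -> (server, expiry time)

    def route(self, client_ip, now):
        server, expires = self.affinity.get(client_ip, (None, -1))
        if now >= expires:                    # no fresh affinity: pick anew
            server = next(self._rr)
        self.affinity[client_ip] = (server, now + self.WINDOW)
        return server

r = AffinityRouter(["web1", "web2"])
s1 = r.route("9.2.0.7", now=0)
s2 = r.route("9.2.0.7", now=50)    # within window: same server, same key
s3 = r.route("9.2.0.7", now=200)   # window expired: may pick a new server
```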
In satisfying a request, a Web server often copies data
several times across layers of software. For example,
it may copy from the file system to the application,
again to the operating system kernel during trans-
mission, and perhaps again at the device-driver level.
Other overheads, such as OS scheduler and interrupt
processing, can add further inefficiencies. One tech-
nique for improving Web site performance is to cache
data at the site so that frequently requested pages can
be served from a cache with significantly less over-
head than a Web server. Such caches are known as
HTTPD accelerators or Web server accelerators.
Our Accelerator
We have developed a Web server accelerator that
runs under an embedded operating system and can
serve up to 5,000 pages per second from its cache
on a uniprocessor 200-MHz PowerPC. This throughput is up to an order of magnitude higher
than typically attainable by a high-performance
Web server running on similar hardware under a
conventional OS. Both the Harvest and Squid caches include HTTP daemon (HTTPD) accelerators, but ours performs considerably better than either. Novell also sells an HTTPD accelerator as part of its BorderManager product.
Our system’s superior performance results largely from the embedded operating system. Buffer
copying is kept to a minimum. With its limited
functionality, the OS is unsuitable for general-pur-
pose software applications, such as databases or
online transaction processing. Its optimized sup-
port for communications, however, makes it well
suited to specialized network applications like Web
server acceleration.
A key difference between our accelerator and
others is that our API allows application developers
to explicitly add, delete, and update cached data,
which helps maximize hit rates and maintain cur-
rent caches. We allow caching of dynamic as well as
static Web pages because applications can explicit-
ly invalidate any page whenever it becomes obso-
lete. Dynamic Web page caching is important for
improving performance at Web sites with signifi-
cant dynamic content. We are not aware of other
accelerators that allow caching of dynamic pages.
As Figure 2 illustrates, the accelerator front-ends
a set of Web server nodes. A TCP router runs on
the same node as the accelerator (although it could
also run on a separate node). If the requested page
is in the cache, the accelerator returns the page to
the client. Otherwise, the TCP router selects a Web
server node and sends the request to it. Persistent
TCP connections can be maintained between the
cache and the Web server nodes, considerably
reducing the overhead of satisfying cache misses
from server nodes. Our accelerator can reduce the
number of Web servers needed at a site by handling
a large fraction of requests from the cache (93 per-
cent at the Wimbledon site, for example).
As the accelerator examines each request to see if
it can be satisfied from cache, it must terminate the
connection with the client. For cache misses, the
accelerator requests the information requested by
the client from a server and returns it to the client.
In the event of a miss, caching thus introduces some
overhead as compared to the TCP router in Figure 1
because the accelerator must function as a proxy for
the client. By keeping persistent TCP connections
between the cache and the server, however, this over-
head is significantly reduced. In fact, the overhead
on the server is lower than it would be if the server
handled client requests directly.
The cache operates in one of two modes: trans-
parent or dynamic. In transparent mode, data is
cached automatically after cache misses. The web-
master can also set cache policy parameters to deter-
mine which URLs get automatically cached. For
example, different parameters determine whether
to cache static image files, static nonimage files, and
dynamic pages, as well as their default lifetimes.
HTTP headers included in a server response can then override a specific URL’s default behavior as set by the cache policy parameters.
In dynamic mode, the cache contents are explic-
itly controlled by applications executed on either the
accelerator or a remote node. API functions allow
programs to cache, invalidate, query, and specify life-
times for URL contents. While dynamic mode complicates the application programmer’s task, it is often
required for optimal performance. Dynamic mode is
particularly useful for prefetching popular objects into
caches and for invalidating objects whose lifetimes
were not known at the time they were cached.
Because caching objects on disk would slow the accel-
erator too much, all cached data is stored in memory.
Figure 2. Web server acceleration. The cache significantly reduces server load.
Thus, cache sizes are limited by memory sizes. Our
accelerator uses the least recently used (LRU) algo-
rithm for cache replacement. Figure 3 shows the
throughput that a single accelerator can handle as a
function of requested object size. Our test system had
insufficient network bandwidth to drive the acceler-
ator to maximum throughput for requested objects
larger than 2,048 bytes. The graph shows the num-
bers we actually measured as well as the projected
numbers on a system where network throughput was
not a bottleneck. A detailed analysis of our accelera-
tor’s performance and how we obtained the projec-
tions in Figure 3 is available in Levy et al.
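A memory-bounded cache with LRU replacement and the explicit invalidation call that dynamic mode exposes can be sketched as follows. The class, method names, and URLs are hypothetical, not our accelerator's actual API.

```python
# Sketch of an accelerator-style in-memory cache: LRU replacement
# bounded by total bytes, plus explicit add/update/invalidate calls
# of the kind a dynamic-mode API would offer applications.
from collections import OrderedDict

class AcceleratorCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.pages = OrderedDict()   # URL -> page bytes, oldest first
        self.size = 0

    def get(self, url):
        if url not in self.pages:
            return None              # miss: router forwards to a Web server
        self.pages.move_to_end(url)  # mark most recently used
        return self.pages[url]

    def put(self, url, body):        # add or update (e.g., on a DUP event)
        self.invalidate(url)
        self.pages[url] = body
        self.size += len(body)
        while self.size > self.capacity:         # evict least recently used
            _, evicted = self.pages.popitem(last=False)
            self.size -= len(evicted)

    def invalidate(self, url):       # explicit removal of an obsolete page
        body = self.pages.pop(url, None)
        if body is not None:
            self.size -= len(body)

cache = AcceleratorCache(capacity_bytes=20)
cache.put("/medals", b"12-byte-page")   # 12 bytes cached
cache.put("/news", b"ten bytes!")       # total would be 22: evict /medals
```

Updating a hot page with `put` rather than merely invalidating it mirrors the prefetching idea: the new version replaces the old one in place, so the page never misses.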
Our accelerator can handle up to about 5,000
cache hits per second and 2,000 cache misses per sec-
ond for an 8-Kbyte page when persistent connections
are not maintained between the accelerator and back-
end servers. In the event of a cache miss, the acceler-
ator must request the information from a back-end
server before returning it to the client. Requesting the
information from a server requires considerably more
instructions than fetching the object from cache.
Cache miss performance can therefore be improved
by maintaining persistent connections between the
accelerator and back-end servers.
Managing Dynamic Data
High-performance uniprocessor Web servers can
typically deliver several hundred static files per sec-
ond. Dynamic pages, by contrast, are often deliv-
ered at rates orders of magnitude slower. It is not
uncommon for a program to consume over a sec-
ond of CPU time to generate a single dynamic page.
For Web sites with a high proportion of dynamic
pages, the performance bottleneck is often the CPU
overhead associated with generating them.
Dynamic pages are essential at sites that provide
frequently changing data. For example, the official
1998 U.S. Open tennis tournament Web site aver-
aged up to four updated pages per second over an
extended time period. A server program that gen-
erates pages dynamically can return the most recent
version of the data, but if the data is stored in files
and served from a file system, it may not be feasi-
ble to keep it current. This is particularly true when
numerous files need frequent updating.
Cache Management Using DUP
One of the most important techniques for improv-
ing performance with dynamic data is caching
pages the first time they are created. Subsequent
requests for an existing dynamic page can access it
from cache rather than repeatedly invoking a pro-
gram to generate the same page. A key problem
with this technique is determining which pages to
cache and when they become obsolete. Explicit
cache management by invoking API functions is
again essential for optimizing performance and
ensuring consistency.
We have developed a new algorithm called Data
Update Propagation (DUP) for precisely identify-
ing which cached pages have been obsoleted by
new information. DUP determines how cached
Web pages are affected by changes to underlying
data. If a set of cached pages is constructed from
tables belonging to a database, for example, the
cache must be synchronized with the database so
that pages do not contain stale data. Furthermore,
cached pages should be associated with parts of the
database as precisely as possible; otherwise, objects
whose values are unchanged can be mistakenly
invalidated or updated after a database change.
Such unnecessary cache updates can increase miss
rates and hurt performance.
DUP maintains correspondences between
objects—defined as items that might be cached—
and underlying data, which periodically change and
affect object values. A program known as a trigger
monitor maintains data dependence information
between the objects and underlying data and deter-
mines when data have changed. When the system
becomes aware of a change, it queries the stored
dependence information to determine which
Figure 3. Our accelerator achieved a maximum of 5,000 cache hits per second. Network bandwidth limitations in our test configuration restricted the accelerator’s maximum throughput for page sizes larger than 2,048 bytes.
cached objects are affected and should be invali-
dated or updated.
Dependencies are represented by a directed graph
known as an object dependence graph (ODG),
wherein a vertex usually represents an object or
underlying data. An edge from a vertex v to anoth-
er vertex u indicates that a change to v also affects
u. The trigger monitor constructed the ODG in
Figure 4 from its knowledge of the application. If
node go2 changed, for example, the trigger monitor
would detect it. The system uses graph traversal
algorithms to determine which objects are affected
by the change to go2, which in this case include go5,
go6, and go7. The system can then invalidate or
update cached objects it determines to be obsolete.
Weights can be associated with edges to help
determine how obsolete underlying data changes
have rendered an object. In Figure 4, the data
dependence from go1 to go5 is more important
than the dependence from go2 to go5 because the
former edge is five times the weight of the latter.
Therefore, a change to go1 would generally affect
go5 more than a change to go2. We have described
the DUP algorithm in detail in earlier work.
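The traversal at the heart of DUP can be sketched as follows, using the node names from Figure 4. The graph encoding is illustrative: edges read "a change to the key also affects each listed node," and a change to go2 yields exactly the affected set named above.

```python
# Sketch of the DUP idea: an object dependence graph (ODG) maps
# underlying data to the cached pages it affects; when data changes,
# a graph traversal finds exactly the objects to invalidate or update.
def affected_objects(odg, changed):
    """Depth-first traversal: everything reachable from the changed node."""
    affected, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for successor in odg.get(node, []):
            if successor not in affected:
                affected.add(successor)
                stack.append(successor)
    return affected

odg = {
    "go1": ["go5"],
    "go2": ["go5", "go6"],
    "go6": ["go7"],
}
# The trigger monitor detects a change to go2 and invalidates the result:
# go5 and go6 directly, and go7 transitively through go6.
result = affected_objects(odg, "go2")
```

Edge weights could be carried alongside each successor to decide between invalidating an object outright and merely refreshing it, per the weighting scheme described above.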
Interfaces for Creating Dynamic Data
The interface for invoking server programs that cre-
ate dynamic pages has a significant effect on perfor-
mance. The Common Gateway Interface (CGI)
works by creating a new process to handle each
request, which incurs considerable overhead. Once
the most widely used interface, CGI is being replaced
with better performing mechanisms. For instance,
Open Market’s FastCGI establishes long-running processes to which a Web server passes requests. This avoids the overhead of process creation but still requires some communication overhead between the Web server and the FastCGI process. FastCGI also limits the number of simultaneous connections a server can handle to the number of processes the machine can run.
Rather than forking off new processes each time
a server program is invoked, or communicating
with prespawned processes, Web servers such as
Apache, the IBM Go server, and those by Netscape
and Microsoft provide interfaces to invoke server
extensions as part of the Web server process itself.
Server tasks are run in separate threads within the
Web server and may be either dynamically loaded
or statically bound into the server. IBM’s Go Web server API (GWAPI), Netscape’s server application programming interface (NSAPI), and Microsoft’s Internet Server API (ISAPI), as well as Apache’s low-level “modules,” are all examples of this approach.
Also, careful thread management makes it possible
to handle many more connections than there are
servicing threads, reducing overall resource con-
sumption as well as increasing capacity. Unfortu-
nately, the interfaces presented by GWAPI, NSAPI,
ISAPI, and Apache modules can be rather tricky to
use in practice, with issues such as portability,
thread safety, and memory management compli-
cating the development process.
More recent approaches, such as Sun’s Java Server Pages (JSP), Microsoft’s Active Server Pages (ASP), Java servlets, and Apache’s mod_perl, hide those interfaces to simplify the Web author’s job through the use of Java, Visual Basic, and Perl. These services also hide many of the issues
of thread safety and provide built-in garbage col-
lection to relieve the programmer of the problems
of memory management. Although execution can
be slightly slower than extensions written directly
to the native interfaces, ease of program creation,
maintenance and portability join with increased
application reliability to more than make up for the
slight performance difference.
The Nagano Web Site
The 1998 Winter Games Web site’s architecture
was an outgrowth of our experience with the 1996
Olympic Summer Games Web site. The server logs
we collected in 1996 provided significant insight
that influenced the design of the 1998 site. We
determined that most users had spent too much
time looking for basic information, such as medal
standings, most recent results, and current news
stories. Clients had to make at least three Web
server requests to navigate to a result page. Brows-
ing patterns were similar for the news, photos, and
sports sections of the site. Furthermore, when a
Figure 4. Object dependence graph (ODG). Weights are correlated with the importance of data dependencies.
client reached a leaf page, there were no direct
links to pertinent information in other sections.
Given this hierarchy, intermediate navigation
pages were among the most frequently accessed.
Hit reduction was a key objective for the 1998
site because we estimated that using the design
from the 1996 Web site in conjunction with the
additional content provided by the 1998 site could
result in more than 200 million hits per day. We
thus redesigned the pages to allow clients to access
relevant information while examining fewer Web
pages. A detailed description of the 1998 Olympic
Games Web site is available in our earlier work.
The most significant changes were the generation of a new home page for each day and the addition of a top navigation level that allowed clients to view any previous day’s home page. We estimate that improved page design reduced hits to the site by at least threefold. Web server log analysis suggests that more than 25 percent of the users found the information they were looking for with a single hit by examining no more than the current day’s home page.
Site Architecture
The Web site utilized four IBM Scalable Power-
Parallel (SP2) systems at complexes scattered
around the globe, employing a total of 143 proces-
sors, 78 Gbytes of memory, and more than 2.4
Tbytes of disk space. We deployed this level of
hardware to ensure high performance and avail-
ability because not only was this a very popular site,
but the data it presented was constantly changing.
Whenever new content was entered into the sys-
tem, updated Web pages reflecting the changes
were available to the world within seconds. Clients
could thus rely on the Web site for the latest results,
news, photographs, and other information from
the games. The system served pages quickly even
during peak periods, and the site was available 100
percent of the time.
We achieved high availability by using replicat-
ed information and redundant hardware to serve
pages from four different geographic locations. If a
server failed, requests were automatically routed to
other servers, and if an entire complex failed,
requests could be routed to the other three. The
network contained redundant paths to eliminate
single points of failure. It was designed to handle
at least two to three times the expected bandwidth
in order to accommodate high data volumes if por-
tions of the network failed.
Dynamic pages were created via the FastCGI
interface, and the Web site cached dynamic pages
using the DUP algorithm. DUP was a critical com-
ponent in achieving cache hit rates of better than 97
percent. By contrast, the 1996 Web site used an ear-
lier version of the technologies and cached dynam-
ic pages without employing DUP. It was thus diffi-
cult to precisely identify which pages had changed
as a result of new information. Many current pages
were invalidated in the process of ensuring that all
stale pages were removed, which caused high miss
rates after the system received new information.
Cache hit rates for the 1996 Web site were around
80 percent.
Another key component in achieving near 100-percent hit rates was prefetching. When hot pages in the cache became obsolete, new versions were updated directly in the cache without being invalidated. Consequently, these pages incurred no cache misses.
Because the updates to underlying data were per-
formed on different processors from those serving
pages, response times were not adversely affected
during peak update times. By contrast, processors
functioning as Web servers at the 1996 Web site also
performed updates to the underlying data. This
design combined with high cache miss rates during
peak update periods to increase response times.
System Architecture
Web pages were served from four locations:
Schaumburg, Illinois; Columbus, Ohio; Bethesda,
Maryland; and Tokyo, Japan. Client requests to the site were routed from
Figure 5. Data flow from the master database to the Internet servers.
the local ISP into the IBM Global Services network,
and IGS routers forwarded the requests to the serv-
er location geographically nearest to the client.
Pages were created by, and served from, IBM SP2
systems at each site. Each SP2 was composed of
multiple frames, each containing ten RISC/6000
uniprocessors, and one RISC/6000 8-way symmet-
ric multiprocessor. Each uniprocessor had 512
Mbytes of memory and approximately 18 Gbytes
of disk space. Each multiprocessor had one Gbyte
of memory and approximately six Gbytes of disk
space. In all, we used 13 SP2 frames: four at the SP2
in Schaumburg and three at each of the other sites.
Numerous machines at each location were also ded-
icated to maintenance, support, file serving, net-
working, routing, and various other functions.
We had calculated the hardware requirements from forecasts based on 1996 site traffic, capacity, and performance data; the rate of increase in general Web traffic over the preceding year and a half; and the
Figure 6. Network diagram for each of the four complexes serving Web pages. (ND = Network Dispatcher; TR = Token Ring; DGW = Domino Go Webserver; DFS = Distributed File System; DCE = Distributed Computing Environment; CWS = Control Workstation for the RS/6000 SP complex; DS3 = Data Speed 3; NAP = Network Access Point.)
need for 100 percent site availability for the dura-
tion of the games. In retrospect, we deployed sig-
nificantly more hardware than required. This over-
estimate was partly due to the fact that our caching
algorithms reduced server CPU cycles by more
than we had expected.
Results data was collected directly from the tim-
ing equipment and information keyed into com-
puters at each venue. The scoring equipment was
connected via token ring LAN at each venue to a
local DB2 database that transferred its data to the
master (mainframe-attached) database in Nagano.
This master database served both the on-site scor-
ing system and the Internet.
Figure 5 shows how data was replicated from the
master database to the SP2 complexes in Tokyo and
Schaumburg. From Schaumburg, the data was
again replicated to the machines in Bethesda and
Columbus. For reliability and recovery purposes, the Tokyo site could also replicate the database to the other sites.
Local Load Balancing and
High Availability
Load balancers (LB) were an essential component
in implementing the complex network manage-
ment techniques we used. While we chose IBM
NDs, we could also have used other load balancers,
such as those from Radware or Resonate. The LBs
accepted traffic requests from the Internet, routed
them to a complex, and forwarded them to avail-
able Web servers for processing. As illustrated in
Figure 6 (previous page), four LB servers sat
between the routers and the front-end Web servers
at each complex in the U.S. Each of these four LB
servers was the primary source of three of the 12
addresses and the secondary source for two other
addresses. Secondary addresses for a particular LB
were given a higher routing cost.
The LB servers ran the gated routing daemon,
which was configured to advertise IP addresses as
routes to the routers.
We assigned each LB a dif-
ferent cost based on whether it was the primary or
secondary server for an IP address. The routers then
redistributed these routes into the network. With
knowledge of the routes being advertised by each
of the complexes, routers could decide where to
deliver incoming requests to the LB with the lowest
cost. This was typically the LB that was the prima-
ry source for the address assigned to incoming
requests at the closest complex.
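The cost-based route selection can be illustrated with a small sketch. The addresses, LB names, and cost values below are invented for illustration; the article does not give the actual gated costs.

```python
# Illustrative sketch of cost-based route selection (not an actual
# gated configuration).  Primaries advertise a lower cost than
# secondaries; routers pick the live advertiser with the lowest cost.

PRIMARY_COST, SECONDARY_COST = 1, 10  # assumed values

# address -> list of (lb_name, cost, is_up) advertisements
advertisements = {
    "vip-1": [("lb-a", PRIMARY_COST, True), ("lb-b", SECONDARY_COST, True)],
    "vip-2": [("lb-b", PRIMARY_COST, False), ("lb-c", SECONDARY_COST, True)],
}

def select_route(address: str):
    """Pick the live advertiser with the lowest cost, as a router would."""
    live = [(cost, lb) for lb, cost, up in advertisements[address] if up]
    return min(live)[1] if live else None

print(select_route("vip-1"))  # lb-a: the primary is up
print(select_route("vip-2"))  # lb-c: primary down, secondary takes over
```

Because failover is driven entirely by which routes are advertised, no router reconfiguration is needed when an LB fails, which matches the article's point that the routers required no changes.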
A request went to the secondary LB for a given address only if the primary LB was down. If the secondary LB also failed, traffic was routed to the primary LB in a different
complex. This design gave LB server operators con-
trol of load balancing across complexes. Moreover,
the routers required no changes to support the
design because they learned routes from the LB
servers via a dynamic routing protocol.
Each LB server was connected to a pool of front-
end Web servers dispersed among the SP2 frames
at each site. Traffic was distributed among Web
servers based on load information provided by so-
called advisors running at the Web server node. If
a Web node went down, the advisors immediately
pulled it from the distribution list.
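A load-aware server selection of the kind the advisors enabled might be sketched as follows. The load metric, node names, and inverse-load weighting are our assumptions; the actual Network Dispatcher algorithm is described in reference 5.

```python
# Hypothetical advisor-driven server selection; the load values, node
# names, and weighting formula are invented for illustration.
import random

def pick_server(servers: dict) -> str:
    """servers maps node name -> load reported by its advisor
    (None means the advisor marked the node down).
    Live nodes are weighted inversely to their load."""
    live = {name: load for name, load in servers.items() if load is not None}
    if not live:
        raise RuntimeError("no live Web servers in the pool")
    weights = [1.0 / (load + 0.01) for load in live.values()]
    return random.choices(list(live), weights=weights, k=1)[0]

pool = {"web-1": 0.2, "web-2": 0.9, "web-3": None}  # advisor pulled web-3
print(pick_server(pool))  # "web-1" or "web-2", never "web-3"
```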
This approach helped ensure high availability by
avoiding any single failure point in a site. In the
event of a Web server failure, the LB would auto-
matically route requests to the other servers in its
pool, or if an SP2 frame went down, the LB would
route requests to the other frames at the site. The
router would send requests to the backup if an LB
server went down, and if an entire complex failed,
traffic was automatically routed to a backup site.
In this way we achieved what we call elegant degra-
dation, in which various failure points within a
complex are immediately accounted for, and traf-
fic is smoothly redistributed to system elements
that are still functioning.
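The failover cascade just described can be sketched as an ordered fallback list. The element names and their ordering are illustrative, not taken from the deployed configuration.

```python
# Sketch of the failure-handling cascade the article calls "elegant
# degradation."  Names and ordering are illustrative.

FALLBACK_ORDER = [
    "same-pool-server",    # another Web server behind the same LB
    "other-frame-server",  # a server on a different SP2 frame
    "backup-lb",           # the secondary LB for the address
    "backup-complex",      # an entirely different complex
]

def route_request(available: set) -> str:
    """Return the first still-functioning element in the fallback chain."""
    for element in FALLBACK_ORDER:
        if element in available:
            return element
    raise RuntimeError("total site failure")

# The local pool and frames are down, but a backup LB still answers:
print(route_request({"backup-lb", "backup-complex"}))  # backup-lb
```

The key property is that each failure is absorbed at the lowest level that still has capacity, so clients see at most a brief disruption rather than an outage.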
Since the 1998 Olympics, our technology has been
deployed at several other highly accessed sites
including the official sites for the 1998 and 1999
Wimbledon tennis tournaments. Both of these sites
received considerably higher request rates than the 1998 Olympic Games site due to continuous growth in Web usage. The 1999 Wimbledon site made extensive use of Web server acceleration technology that was not ready for the 1998 Olympic Games. The site received 942 million hits over 14 days, corresponding to 71.2 million page views and 8.7 million visits. Peak rates of 430,000 hits per minute and 125 million hits per day were observed.
By contrast, the 1998 Olympic Games Web site
received 634.7 million requests over 16 days, with peak hit rates of 110,000 per minute and 57 million per day.
We have developed a sophisticated publishing
system for dynamic Web content for the 2000
Olympic Games Web site, which will also make
extensive use of Web server acceleration. A descrip-
tion of this publishing system is scheduled to
appear in the Proceedings of InfoCom 2000.
Acknowledgments
Many people made valuable contributions to the 1998 Olympic Games' site. Cameron Ferstat, Paul Reed, and Kent Rankin contributed to the network design, and John Thompson, Jerry Spivak, Michael Maturo, Kip Hansen, Brian Taylor, Elin Stilwell, and John Chiavelli, to the Web site design. Kim Benson provided us with useful information. Glen Druce contributed to the news, photos, and Lotus Notes design. Eric Levy and Junehwa Song made key contributions to the Web server accelerator described in this article.
References
1. T. Brisco, “DNS Support for Load Balancing,” Tech. Report RFC 1794, Rutgers Univ., N.J., Apr. 1995.
2. P. Mockapetris, “Domain Names—Implementation and
Specification,” Tech. Report RFC 1035, USC Information
Sciences Inst., Los Angeles, Calif., Nov. 1987.
3. T.T. Kwan, R.E. McGrath, and D.A. Reed, “NCSA’s World
Wide Web Server: Design and Performance,” Computer,
Vol. 28, No.11, Nov. 1995, pp. 68-74.
4. D. Dias et al., “A Scalable and Highly Available Web Serv-
er,” Proc. 1996 IEEE Computer Conf. (CompCon), IEEE
Computer Soc. Press, Los Alamitos, Calif., 1996.
5. G. Hunt et al., “Network Dispatcher: A Connection Router
for Scalable Internet Services,” Proc. 7th Int’l World Wide
Web Conf., 1998; available online at http://www7.scu.
6. V. Cardellini, M. Colajanni, and P. Yu, “Dynamic Load Bal-
ancing on Web-Server Systems,” IEEE Internet Computing,
Vol. 3, No. 3, May/June 1999, pp. 28-39.
7. A. Chankhunthod et al., “A Hierarchical Internet Object
Cache,” Proc. 1996 Usenix Tech. Conf., Usenix Assoc.,
Berkeley, Calif., 1996, pp. 153-163.
8. E. Levy, et al., “Design and Performance of a Web Server
Accelerator,” Proc. IEEE InfoCom 99, IEEE Press, Piscat-
away, N.J., 1999, pp. 135-143.
9. R. Lee, “A Quick Guide to Web Server Acceleration,” white
paper, Novell Research, 1997; available at http://www.nov-
10. J. Challenger, A. Iyengar, and P. Dantzig, “A Scalable Sys-
tem for Consistently Caching Dynamic Web Data,” Proc.
IEEE InfoCom 99, IEEE Press, Piscataway, N.J., 1999.
11. J. Challenger, P. Dantzig, and A. Iyengar, “A Scalable and
Highly Available System for Serving Dynamic Data at Fre-
quently Accessed Web Sites,” Proc. ACM/IEEE Supercom-
puting 98 (SC 98), ACM Press, N.Y., 1998.
12. A. Iyengar and J. Challenger, “Improving Web Server Per-
formance by Caching Dynamic Data,” Proc. Usenix Symp.
Internet Tech. and Systems, Usenix Assoc., Berkeley, Calif.,
1997, pp. 49-60.
13. D.E. Comer, Internetworking with TCP/IP, 2nd ed., Prentice Hall, Englewood Cliffs, N.J., 1991.
14. J. Challenger et al., “A Publishing System for Efficiently
Creating Dynamic Web Content,” to be published in Proc.
InfoCom 2000, IEEE Press, Piscataway, N.J., Mar. 2000.
Arun Iyengar is a research staff member at IBM’s T.J. Watson
Research Center. His research interests include Web per-
formance, caching, parallel processing, and electronic com-
merce. He has a BA in chemistry from the University of
Pennsylvania, and an MS and PhD in computer science
from the Massachusetts Institute of Technology.
Jim Challenger is a senior programmer at the T.J. Watson
Research Center. His research interests include develop-
ment and deployment of highly scaled distributed com-
puting systems, caching, and highly scaled Web servers. He
has a BS in mathematics and an MS in computer science
from the University of New Mexico.
Daniel Dias manages the parallel commercial systems depart-
ment at the T.J. Watson Research Center. His research
interests include scalable and highly available clustered sys-
tems including Web servers, caches, and video servers; Web
collaboratory systems; frameworks for business-to-business
e-commerce; and performance analysis. He received the
B.Tech. degree from the Indian Institute of Technology,
Bombay, and an MS and a PhD from Rice University, all
in electrical engineering.
Paul Dantzig is manager of high-volume Web serving at the T.J.
Watson Research Center and adjunct professor of comput-
er science at Pace University, New York. His research inter-
ests include Web serving, digital library and search, speech
application software, content management, Internet pro-
tocols, and electronic commerce. He has a BS interdepart-
mental degree in mathematics, statistics, operations
research, and computer science from Stanford University
and an MS in computer science and computer engineering
from Stanford University.
Readers can contact the authors at {aruni, challngr, dias,
This paper describes the system and key techniques used for achieving performance and high availability at the official Web site for the 1998 Olympic Winter Games which was one of the most popular Web sites for the duration of the Olympic Games. The Web site utilized thirteen SP2 systems scattered around the globe containing a total of 143 processors. A key feature of the Web site was that the data being presented to clients was constantly changing. Whenever new results were entered into the system, updated Web pages reflecting the changes were made available to the rest of the world within seconds. One technique we used to serve dynamic data efficiently to clients was to cache dynamic pages so that they only had to be generated once. We developed and implemented a new algorithm we call Data Update Propagation (DUP) which identifies the cached pages that have become stale as a result of changes to underlying data on which the cached pages depend, such as databases. For the Olympic Games Web site, we were able to update stale pages directly in the cache which obviated the need to invalidate them. This allowed us to achieve cache hit rates of close to 100%. Our system was able to serve pages to clients quickly during the entire Olympic Games even during peak periods. In addition, the site was available 100% of the time. We describe the keyfeatures employed by our site for high availability. We also describe how the Web site was structured to provide useful information while requiring clients to examine only a small number of pages.