A Publishing System for Efficiently Creating
Dynamic Web Content
Jim Challenger, Arun Iyengar, Karen Witting
IBM Research, T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598

Cameron Ferstat, Paul Reed
IBM Global Services, 17 Skyline Drive, Hawthorne, NY 10532
Abstract: This paper presents a publishing system for efficiently creating
dynamic Web content. Complex Web pages are constructed from simpler
fragments. Fragments may recursively embed other fragments. Relationships
between Web pages and fragments are represented by object dependence
graphs. We present algorithms for efficiently detecting and updating Web
pages affected after one or more fragments change. We also present
algorithms for publishing sets of Web pages consistently; different
algorithms are used depending upon the consistency requirements.
Our publishing system provides an easy method for Web site designers
to specify and modify inclusion relationships among Web pages and
fragments. Users can update content on multiple Web pages by modifying a
template. The system then automatically updates all Web pages affected by
the change. Our system accommodates both content that must be proofread
before publication and is typically from humans as well as content that has
to be published immediately and is typically from automated feeds.
Our system is being deployed at several popular Web sites including the
2000 Olympic Games Web site. We discuss some of our experiences with
real deployments of our system as well as its performance.
I. INTRODUCTION
Many Web sites need to provide dynamic content. Examples
include sport sites [2], stock market sites, and virtual stores or
auction sites where information on available products is con-
stantly changing.
There are several problems with providing dynamic data to
clients efficiently and consistently. A key problem with dynamic
data is that it can be expensive to create; a typical dynamic page
may require several orders of magnitude more CPU time to serve
than a typical static page of comparable size. The overhead
for dynamic data is a major problem for Web sites which re-
ceive substantial request volumes. Significant hardware may be
needed for such Web sites.
A key requirement for many Web sites providing dynamic
data is to completely and consistently update pages which have
changed. In other words, if a change to underlying data af-
fects multiple pages, all such pages should be correctly updated.
In addition, a bundle of several changed pages may have to be
made visible to clients at the same time. For example, publish-
ing pages in bundles instead of individually may prevent situa-
tions where a client views a first page, clicks on a hypertext link
to view a second page, and sees information on the second page
which is older and not consistent with the information on the
first page.
Depending upon the way in which dynamic data are being
served, achieving complete and consistent updates can be diffi-
cult or inefficient. Many Web sites cache dynamic data in mem-
ory or a file system in order to reduce the overhead of recal-
culating Web pages every time they are requested [7]. In these
systems, it is often difficult to identify which cached pages are
affected by a change to underlying data which modifies several
dynamic Web pages. In making sure that all obsolete data are
invalidated, deleting some current data from cache may be un-
avoidable. Consequently, cache miss rates after an update may
be high, adversely affecting performance. In addition, multiple
cache invalidations from a single update must be made consis-
tently.
This paper presents a system for efficiently and consistently
publishing dynamic Web content. In order to reduce the over-
head of generating dynamic pages from scratch, our system
composes dynamic pages from simpler entities known as frag-
ments. Fragments typically represent parts of Web pages which
change together; when a change to underlying data occurs which
affects several Web pages, the fragments affected by the change
can easily be identified. It is possible for a fragment to recur-
sively embed another fragment.
Our system provides a user-friendly method for managing
complex Web pages composed of fragments. Users specify how
Web pages are composed from fragments by creating templates
in a markup language. Templates are parsed to determine in-
clusion relationships among fragments and Web pages. These
inclusion relationships are represented by a graph known as an
object dependence graph (ODG). Graph traversal algorithms are
applied to ODG's in order to determine how changes should be
propagated throughout the Web site after one or more fragments
change.
Our system allows multiple independent authors to provide
content as well as multiple independent proofreaders to approve
some pages for publication and reject others. Publication may
proceed in multiple stages in which a set of pages must be ap-
proved in one stage before it is passed to the next stage. Our
system can also include a link checker which verifies that a Web
page has no broken hypertext links at the time the page is pub-
lished.
A key feature of our system is that it is scalable to handle high
request rates. We are deploying our system at several popular
Web sites including the 2000 Olympic Games Web site.
The remainder of the paper is organized as follows. Section II
describes the architecture of our system in detail. Section III
discusses some of our experiences with deploying our system
at real Web sites. Section IV describes the performance of our
system. Section V discusses related work. Finally, Section VI
summarizes our main results and conclusions.
II. SYSTEM ARCHITECTURE
A. Constructing Web Pages from Fragments
A.1 Overview
A key feature of our system is that it composes complex Web
pages from simpler fragments (Figure 8). A page is a complete
entity which may be served to a client. We say that a fragment
or page is atomic if it doesn’t include any other fragments and
complex if it includes other fragments. An object is either a page
or a fragment.
Our approach is efficient because the overhead for compos-
ing an object from simpler fragments is usually minor. By con-
trast, the overhead for constructing the object from scratch as
an atomic fragment is generally much higher. Using the frag-
ment approach, it is possible to achieve significant performance
improvements without caching dynamic pages and dealing with
the difficulties of keeping caches consistent. For optimal per-
formance, our system has the ability to cache dynamic pages.
Caching capabilities are integrated with fragment management.
The fragment-based approach for generating Web pages
makes it easier to design Web sites in addition to improving per-
formance. It is easy to design a set of Web pages with a common
look and feel. It is also easy to embed common information into
several Web pages. Sets of Web pages containing similar infor-
mation can be managed together. For example, it is easy to up-
date common information represented by a single fragment but
embedded within multiple pages; in order to update the common
information everywhere, only the fragment needs to be changed.
By contrast, if the Web pages are stored statically in a file sys-
tem, identifying and updating all pages affected by a change can
be difficult. Once all changed pages have been identified, care
must be taken to update all changed pages in order to preserve
consistency.
Dynamic Web pages which embed fragments are implicitly
updated any time an embedded fragment changes, so consis-
tency is automatically achieved. Consistency becomes an is-
sue with the fragment-based approach when the pages are being
published to a cache or file system. Our system provides several
different methods for consistently publishing Web pages in these
situations; each method provides a different level of consistency.
A.2 Object Dependence Graphs
When pages are constructed from fragments, it is important
to construct a fragment before any object containing it is
constructed. In order to construct objects in an efficient order,
our system represents relationships between fragments and Web
pages by graphs known as object dependence graphs (ODG's)
(Figures 1 and 2).
Object dependence graphs may have several different edge
types. An inclusion edge indicates that an object embeds a frag-
ment. A link edge indicates that an object contains a hypertext
link to another object.
In the ODG in Figure 2, all but one of the edges are inclusion
edges. An inclusion edge from a fragment to an object indicates that
the object contains the fragment; thus, when the fragment changes,
the fragment should be updated before the containing object is
updated. The graph resulting from only inclusion edges is
a directed acyclic graph.
Fig. 1. A set of Web pages containing fragments.
Fig. 2. The object dependence graph (ODG) corresponding to Figure 1.
The remaining edge in Figure 2 is a link edge, recording that one
page contains a hypertext link to another. A key reason for maintain-
ing link edges is to prevent dangling or inconsistent hypertext
links. In this example, the link edge indicates that publishing the
linking page before the page it links to will result in a broken
hypertext link. Similarly, when both pages change, publishing a
current version of the linking page before a current version of the
linked-to page could present inconsistent information to clients
who view an updated version of the first page, click on the
hypertext link to an outdated version of the second page, and then
see information which is obsolete relative to the referring
page. Link edges can form cycles within an ODG. This would occur,
for example, if two pages both contain hypertext links to each other.
There are two methods for creating and modifying ODG’s.
Using one approach, users specify how Web pages are com-
posed from fragments by creating templates in a markup lan-
guage. Templates are parsed to determine inclusion relation-
ships among fragments and Web pages. Using the second ap-
proach, a program may directly manipulate edges and vertices
of an ODG by using an API.
Our system allows an arbitrary number of edge types to exist
in ODG's. So far, we have only found practical use for inclusion
and link edges. We suspect that there may be other types of
important relationships which can be represented by other edge
types.
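To make the data structure concrete, the following sketch shows one way an ODG with typed edges might be represented and manipulated programmatically. It is our illustration rather than the system's actual API; the class and method names are hypothetical, and inclusion edges are oriented here from a fragment to each object that embeds it, so that traversal from a changed fragment reaches every affected object.

import java.util.*;

// Hypothetical sketch of an object dependence graph with typed edges.
class ObjectDependenceGraph {
    enum EdgeType { INCLUSION, LINK }

    // for each edge type: source object -> set of target objects
    private final Map<EdgeType, Map<String, Set<String>>> edges = new EnumMap<>(EdgeType.class);

    ObjectDependenceGraph() {
        for (EdgeType t : EdgeType.values()) edges.put(t, new HashMap<>());
    }

    // an inclusion edge runs from a fragment to an object that embeds it;
    // a link edge records a hypertext relationship between two objects
    void addEdge(EdgeType type, String from, String to) {
        edges.get(type).computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    void removeEdge(EdgeType type, String from, String to) {
        Set<String> targets = edges.get(type).get(from);
        if (targets != null) targets.remove(to);
    }

    // objects directly reachable from 'from' over edges of the given type
    Set<String> successors(EdgeType type, String from) {
        return edges.get(type).getOrDefault(from, Collections.emptySet());
    }
}

A template parser, or a program using the API, might call addEdge(EdgeType.INCLUSION, "graf_score.frg", "graf.html") after discovering that a player page embeds a score fragment, to take an example from Section II-C.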
When our system becomes aware of changes to a set of
one or more objects, it performs a depth-first graph traversal with
topological sort [4] to determine all vertices reachable from the
changed objects by following inclusion edges. The topological sort
orders vertices such that whenever there is an edge from one vertex
to another, the first vertex appears before the second in the
topological sort; link edges are ignored. For the graph in Figure 2,
this means that every changed fragment appears before any page
which embeds it.
Objects are updated in an order consistent with the topological
sort. Our system updates objects in parallel when possible: objects
whose embedded fragments have already been brought up to date may
be constructed at the same time, in a manner consistent with the
inclusion edges of the ODG.
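The following sketch illustrates how such a traversal and ordering might look in code. It is our illustration, under the assumption that inclusion edges run from a fragment to each object embedding it; the names are hypothetical.

import java.util.*;

class UpdateScheduler {
    // includedIn.get(o) = objects that directly embed o (inclusion edges)
    static List<String> updateOrder(Map<String, List<String>> includedIn, Set<String> changed) {
        // 1. every object reachable from a changed object over inclusion edges is affected
        Set<String> affected = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(changed);
        while (!stack.isEmpty()) {
            String o = stack.pop();
            if (affected.add(o))
                for (String c : includedIn.getOrDefault(o, List.of())) stack.push(c);
        }
        // 2. Kahn-style topological sort of the affected subgraph: an object is
        //    rebuilt only after every affected object it embeds has been rebuilt
        Map<String, Integer> indegree = new HashMap<>();
        for (String o : affected) indegree.put(o, 0);
        for (String o : affected)
            for (String c : includedIn.getOrDefault(o, List.of()))
                if (affected.contains(c)) indegree.merge(c, 1, Integer::sum);
        Deque<String> ready = new ArrayDeque<>();        // everything here may be rebuilt in parallel
        for (String o : affected) if (indegree.get(o) == 0) ready.add(o);
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String o = ready.poll();
            order.add(o);
            for (String c : includedIn.getOrDefault(o, List.of()))
                if (affected.contains(c) && indegree.merge(c, -1, Integer::sum) == 0) ready.add(c);
        }
        return order;    // a valid topological order; link edges are ignored here
    }
}

Objects that sit in the ready queue at the same time have no remaining dependencies on one another, which is where the parallel construction mentioned above comes from.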
After a set of pages has been updated (or generated for
the first time), the pages are published so that they can be
viewed by clients. In some cases, the pages are published to file
systems. In other cases, they are published to caches. Pages may
be published either locally on the system generating them or to
a remote system. It is often a requirement for a set of multiple
pages to be published consistently. Consistency can be guar-
anteed by publishing all changed (or newly generated) pages in
a single atomic action. One potential drawback to this method
of publication is that the publication process may be relatively
long. For example, pages may have to be proofread before pub-
lication. If everything is published together in a single atomic
action, there can be considerable delay before any information
is made available.
Therefore, incremental publication, wherein information is
published in stages instead of together, is often desirable. The
disadvantage to incremental publication is that consistency guar-
antees are not as strong. Our system provides three different
methods for incremental publication, each providing different
levels of consistency.
The first incremental publishing method guarantees that a
freshly published page will not contain a hypertext link to ei-
ther an obsolete or unpublished page. This consistency guar-
antee applies to pages reached by following several hypertext
links. More specifically, if a client views an updated version of
one published page and follows one or more hypertext links to
view a second page, then the client is guaranteed to see a version
of the second page which is not obsolete with respect to the
version of the first page which the client viewed (one version is
obsolete with respect to another if it was already outdated at the
time the other version became current, regardless of whether the
two pages have any fragments in common).
For example, consider the Web pages in Figure 3, which are
connected by a chain of hypertext links. A client can access the
last page in the chain by starting at the first page, following a
hypertext link to the middle page, and then following a second
hypertext link to the last page. Suppose that both the first and
last pages change. The first incremental publishing method guar-
antees that the new version of the first page will not be published
before the new version of the last page, regardless of whether the
middle page has changed.
Fig. 3. A set of Web pages connected by hypertext links.
This incremental publishing method is implemented by first
determining the set of all pages which can be reached by fol-
lowing hypertext links from a changed page. This set includes all
of the changed pages; it may also include previously published
pages which haven't changed. It is determined by traversing link
edges of the ODG in reverse order starting from the changed pages.
Consider the subgraph of the ODG consisting of these pages and
the link edges connecting them. The subgraph is topologically
sorted, and its strongly connected components are determined. A
strongly connected component of a directed graph is a maximal
subset of vertices such that every vertex in the subset has a
directed path to every other vertex in the subset. A good algorithm
for finding strongly connected components in directed graphs is
contained in [4].
Vertices are then examined in an order consistent with the
topological sort of the subgraph. Each time a page is examined for
which the updated version hasn't been published yet, the page
is published together with all other pages in the subgraph belonging
to the same strongly connected component. Each set of pages which
are published together in an atomic action is known as a bundle.
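A sketch of this bundling procedure appears below. It reflects our reading of the description above rather than the authors' code; the link graph is supplied in hyperlink direction (each page maps to the pages it links to), and all names are hypothetical.

import java.util.*;

class LinkConsistentPublisher {

    // linksTo.get(p) = pages that page p links to
    static List<Set<String>> bundles(Map<String, Set<String>> linksTo, Set<String> changed) {
        // pages reachable from a changed page by following hypertext links
        Set<String> reachable = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(changed);
        while (!work.isEmpty()) {
            String p = work.pop();
            if (reachable.add(p)) work.addAll(linksTo.getOrDefault(p, Set.of()));
        }
        // Kosaraju's algorithm on the link subgraph induced on the reachable pages
        Map<String, Set<String>> linkedFrom = reverse(linksTo);
        Deque<String> finish = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        for (String p : reachable) dfs(p, linksTo, reachable, seen, finish);
        seen.clear();
        List<Set<String>> components = new ArrayList<>();
        while (!finish.isEmpty()) {
            String p = finish.pop();
            if (!seen.contains(p)) {
                Deque<String> comp = new ArrayDeque<>();
                dfs(p, linkedFrom, reachable, seen, comp);
                components.add(new HashSet<>(comp));
            }
        }
        // Kosaraju yields components with linking pages first; publish in the
        // opposite order so linked-to pages are always released no later than
        // the pages that link to them.
        Collections.reverse(components);
        List<Set<String>> bundles = new ArrayList<>();
        for (Set<String> c : components)
            if (!Collections.disjoint(c, changed)) bundles.add(c);  // only bundles with a changed page need writing
        return bundles;        // each bundle is published in one atomic action
    }

    private static void dfs(String p, Map<String, Set<String>> g, Set<String> allowed,
                            Set<String> seen, Deque<String> out) {
        if (!allowed.contains(p) || !seen.add(p)) return;
        for (String q : g.getOrDefault(p, Set.of())) dfs(q, g, allowed, seen, out);
        out.push(p);
    }

    private static Map<String, Set<String>> reverse(Map<String, Set<String>> g) {
        Map<String, Set<String>> rev = new HashMap<>();
        g.forEach((p, qs) -> qs.forEach(q -> rev.computeIfAbsent(q, k -> new HashSet<>()).add(p)));
        return rev;
    }
}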
The second incremental publishing method guarantees that
any two pages which both contain a common changed frag-
ment are published in the same bundle. For example, consider
the Web pages in Figure 4, in which P1 embeds f1, P2 embeds
both f1 and f2, and P3 embeds f2. Suppose that both f1 and f2
change. Since P1 and P2 both embed f1, their updated versions
must be published together. Since P2 and P3 both embed f2, their
updated versions must be published together. Thus, updated ver-
sions of all three Web pages must be published together. Note
that updated versions of P1 and P3 must be published together,
even though the two pages don't embed a common fragment.
Fig. 4. A set of Web pages containing common fragments.
In order to implement this approach, the set of all changed
fragments contained within each changed object is deter-
mined. We call this set the changed fragment set for the object.
All changed objects are constructed in topological sorting order.
When a changed object is constructed, its changed fragment set
is calculated as the union, over each fragment the object embeds
via an inclusion edge in the ODG, of that fragment's changed
fragment set, together with the fragment itself if it changed.
After all changed fragment sets have been determined, an
undirected graph is constructed whose vertices are the changed
pages. An edge exists between two pages if their changed fragment
sets have at least one fragment in common. This graph is examined
to determine its connected components (two vertices are part of
the same connected component if and only if there is a path
between the vertices in the graph). All pages belonging to the
same connected component are published in the same bundle.
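The following sketch (ours; names hypothetical) groups changed pages into bundles with a union-find structure, which is equivalent to computing the connected components of the graph just described.

import java.util.*;

class SharedFragmentBundler {

    // changedFragments.get(page) = changed fragments embedded, directly or
    // transitively, in that page, as computed while the page is rebuilt
    static Collection<Set<String>> bundles(Map<String, Set<String>> changedFragments) {
        Map<String, String> parent = new HashMap<>();                 // union-find over pages
        changedFragments.keySet().forEach(p -> parent.put(p, p));
        Map<String, String> firstPageWithFragment = new HashMap<>();
        changedFragments.forEach((page, frags) -> {
            for (String f : frags) {
                String other = firstPageWithFragment.putIfAbsent(f, page);
                if (other != null) union(parent, page, other);        // pages share a changed fragment
            }
        });
        Map<String, Set<String>> byRoot = new HashMap<>();
        for (String page : changedFragments.keySet())
            byRoot.computeIfAbsent(find(parent, page), k -> new HashSet<>()).add(page);
        return byRoot.values();                                       // each set is one bundle
    }

    private static String find(Map<String, String> parent, String x) {
        while (!parent.get(x).equals(x)) {
            parent.put(x, parent.get(parent.get(x)));                 // path halving
            x = parent.get(x);
        }
        return x;
    }

    private static void union(Map<String, String> parent, String a, String b) {
        parent.put(find(parent, a), find(parent, b));
    }
}

For the Figure 4 example, P1 and P2 are unioned through f1 and P2 and P3 through f2, so all three pages end up in a single bundle.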
The third incremental publishing method satisfies the consis-
tency guarantees of both the first and second method. In other
words,
1. A freshly published page will not contain a hypertext link to
either an obsolete or unpublished page. More specifically, if a
client views an updated version of one published page and follows
one or more hypertext links to view a second page, then the
client is guaranteed to see a version of the second page which is
not obsolete with respect to the version of the first page which
the client viewed.
2. Any two changed pages which both contain a common
changed fragment are published together.
This method generally results in publishing fewer bundles but
of larger sizes than the first two approaches.
For example, consider the Web pages in Figure 5, where two of
the pages embed a common fragment f1 and the third page contains
a hypertext link to one of them. Suppose that both f1 and the
linking page change. Updated versions of the two pages that embed
f1 must be published together. Since the third page contains a
hypertext link to one of them, its updated version cannot be
published before the bundle containing the updated versions of
the two pages that embed f1.
Fig. 5. Another set of related Web pages.
If, instead, the first incremental publishing method were used
to publish the Web pages in Figure 5, the updated version of the
linking page could not be published before the updated version of
the page it links to. However, the two pages that embed f1 would
not have to be published in the same bundle. If the second
incremental publishing method were used, updated versions of the
two pages embedding f1 would have to be published together in the
same bundle, but publication of the updated version of the linking
page would be allowed to precede publication of that bundle.
The third incremental publishing method is implemented by
constructing the link-edge subgraph as in the first incremental
publishing method and changed fragment sets as in the second
incremental publishing method. Additional edges are then added
to the subgraph: for every pair of pages whose changed fragment
sets have a fragment in common, directed edges are added between
the two pages in both directions, which places such pages in the
same strongly connected component. The same procedure as in the
first method is then applied to the augmented subgraph to publish
pages in bundles.
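A minimal sketch of the graph augmentation is shown below (our interpretation; names hypothetical). The resulting graph would then be handed to the strongly-connected-component bundling of the first method.

import java.util.*;

class CombinedConsistencyGraph {

    static Map<String, Set<String>> augment(Map<String, Set<String>> linksTo,
                                            Map<String, Set<String>> changedFragments) {
        Map<String, Set<String>> g = new HashMap<>();
        linksTo.forEach((p, qs) -> g.put(p, new HashSet<>(qs)));       // copy the original link edges
        List<String> pages = new ArrayList<>(changedFragments.keySet());
        for (int i = 0; i < pages.size(); i++)
            for (int j = i + 1; j < pages.size(); j++) {
                String a = pages.get(i), b = pages.get(j);
                if (!Collections.disjoint(changedFragments.get(a), changedFragments.get(b))) {
                    g.computeIfAbsent(a, k -> new HashSet<>()).add(b); // edges in both directions force
                    g.computeIfAbsent(b, k -> new HashSet<>()).add(a); // the pages into one component
                }
            }
        return g;    // feed this graph to the SCC-based bundling of the first method
    }
}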
Incremental publishing methods can be designed for other
consistency requirements as well. For example, consider Fig-
ure 3 again and suppose that both the first and last pages in the
chain change. It may be desirable to publish updated versions of
these two pages in the same bundle. This would avoid the follow-
ing situation, which could occur using the first incremental
publishing method.
A client views an old version of the first page. After following
hypertext links, the client arrives at a new version of the last
page. The browser's cache is then used to go back to the old
version of the first page. The client reloads the first page in
order to obtain a version consistent with the last page, but
still sees the old version because the new version of the first
page has not yet been published.
It is straightforward to implement an incremental publishing
method which would publish both pages in the same bundle us-
ing techniques similar to the ones just described.
B. The Publishing System
B.1 Combined Content Pages
Many Web sites contain information that is fed from multi-
ple sources. Some of the information, such as the latest scores
from a sporting event, is generated automatically by a computer.
Other information, such as news stories, is generated by hu-
mans. Both types of information are subject to change. A page
containing both human and computer-generated information is
known as a combined content page.
A key problem with serving combined content pages is the
different rates at which sources produce content. Computer-
generated content tends to be produced at a relatively high rate,
often as fast as the most sophisticated timing technology per-
mits. Human-generated content is produced at a much lower
rate. Thus, it is difficult for humans to keep pace with auto-
mated feeds. By the time an editor has finished with a page, the
actual results on the page may have changed. If the editor takes
time to update the page, the results may have changed yet again.
A requirement for many of the Web sites we have helped de-
sign is that computer-generated content should not be delayed
by humans. Computer-generated results, such as the latest re-
sults from a sporting event, are often extremely important and
should be published as soon as possible. If computer-generated
results are combined with human-edited content using conven-
tional Web publishing systems, publication of the computer-
generated results can be delayed significantly. What is needed
is a scheme to combine data feeds of differing speeds so that
information arriving at high rates is not unnecessarily delayed.
In order to provide combined content pages, our system di-
vides fragments into two categories. Immediate fragments are
fragments which contain vital information which should be pub-
lished quickly with minimal proofreading. For the sports Web
sites that our system is being used for, the latest results in a
sporting event would be published as an immediate fragment.
Quality controlled fragments are fragments which don't have to
be published as quickly as immediate fragments but have con-
tent which must be examined in order to determine whether the
fragments are suitable to be published. Background stories on
athletes are typically published as quality controlled fragments
at the sports sites which use our system. Combined content Web
pages consist of a mixture of immediate and quality controlled
fragments.
When one or more immediate fragments change, the Web
pages affected by the changes are updated and published with-
out proofreading. If both immediate and quality controlled frag-
ments change, the system first performs updates resulting from
the immediate fragments and publishes the updated Web pages
immediately. It subsequently performs updates resulting from
quality controlled fragments and only publishes these updated
Web pages after they have been proofread. Multiple versions of
a combined content page may be published using this approach.
The first version would be the page before any updates. The sec-
ond version might contain updates to all immediate fragments
but not to any quality controlled fragments. The third version
might contain updates to all fragments.
It is possible for an update to an immediate fragment to
be published before an update to a quality controlled fragment,
even though the quality controlled fragment changed first. This
might occur if the changes to the quality controlled fragment are
delayed in publication due to proofreading.
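The sketch below illustrates one way this two-pass behavior could be organized. It is our illustration, not the authors' implementation; the class and method names are hypothetical.

import java.util.*;
import java.util.function.Consumer;

class CombinedContentHandler {
    private final Set<String> immediateFragments;                // e.g. live result fragments
    private final Set<String> pagesAwaitingProofreading = new HashSet<>();

    CombinedContentHandler(Set<String> immediateFragments) {
        this.immediateFragments = immediateFragments;
    }

    // affectedPages maps each changed fragment to the pages that embed it
    void onFragmentsChanged(Map<String, Set<String>> affectedPages, Consumer<Set<String>> publish) {
        Set<String> publishNow = new HashSet<>();
        affectedPages.forEach((fragment, pages) -> {
            if (immediateFragments.contains(fragment)) publishNow.addAll(pages);
            else pagesAwaitingProofreading.addAll(pages);         // held for quality control
        });
        if (!publishNow.isEmpty()) publish.accept(publishNow);    // first pass: immediate updates only
    }

    // invoked once a proofreader approves the pending quality controlled changes
    void onProofreadingApproved(Consumer<Set<String>> publish) {
        if (!pagesAwaitingProofreading.isEmpty()) {
            publish.accept(new HashSet<>(pagesAwaitingProofreading));   // second, proofread pass
            pagesAwaitingProofreading.clear();
        }
    }
}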
B.2 System Description
Web pages produced by our system typically consist of mul-
tiple fragments. Each fragment may originate from a different
source and may be produced at a different rate than other frag-
ments. Fragments may be nested, permitting the construction of
complex and sophisticated pages. Completed pages are written
to sinks, which may be file systems, Web server accelerators [9],
or even other HTTP servers.
The Trigger Monitor is the software which takes objects from
one or more sources, constructs pages, and writes the con-
structed pages to one or more sinks (Figure 6). Relationships
between fragments are maintained in a persistent ODG which
preserves state information in the event of a system crash. Our
new Trigger Monitor has significantly enhanced functionality
compared with the Trigger Monitor used for the 1998 Olympic
Games Web site [2], [3].
Fig. 6. Schematic of the publish process: sources feed changed objects to the Trigger Monitor and its page assembler, which consults the ODG and writes assembled pages to sinks such as NetCache caches and distributed file systems.
Whenever the Trigger Monitor is notified of a modification,
addition, or deletion of one or more objects, it fetches new
copies of the changed objects from one or more sources. The
ODG is updated by parsing changed objects. The graph traver-
sal algorithms described in Section II-A.2 are then applied to
determine all Web pages which need to be updated and an ef-
ficient order for updating them. Finally, bundles of published
pages are written to the sinks.
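A structural sketch of this trigger handling path is given below. It is our illustration; the Source and Sink interfaces and all names are hypothetical rather than the product's API, and ODG maintenance is only indicated in comments.

import java.util.*;

class TriggerMonitorSketch {
    interface Source { String fetch(String objectId); }                 // content feed or file store
    interface Sink   { void write(String objectId, String content); }   // file system, cache, accelerator

    private final Source source;
    private final List<Sink> sinks;

    TriggerMonitorSketch(Source source, List<Sink> sinks) {
        this.source = source;
        this.sinks = sinks;
    }

    // called whenever the monitor is notified that some objects changed
    void onTrigger(Set<String> changedIds) {
        Map<String, String> fresh = new HashMap<>();
        for (String id : changedIds) fresh.put(id, source.fetch(id));   // 1. fetch new copies
        // 2. parse the fetched objects and update the persistent ODG (omitted);
        // 3. traverse the ODG to find every affected page and an efficient
        //    update order, as in the scheduling sketch of Section II-A.2
        List<String> updateOrder = new ArrayList<>(changedIds);         // placeholder ordering
        for (String id : updateOrder) {
            String assembled = assemble(id, fresh);                      // 4. rebuild from fragments
            for (Sink sink : sinks) sink.write(id, assembled);           // 5. publish to the sinks
        }
    }

    private String assemble(String id, Map<String, String> fresh) {
        // a real assembler would expand embedded fragments recursively
        return fresh.getOrDefault(id, "");
    }
}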
Since the Trigger Monitor is aware of all fragments and pages,
synchronization is possible to prevent corruption of the pages.
The ODG is used as the synchronization object to keep the
fragment space consistent. Many “trigger handlers”, each with
their own sources and sinks, may be configured to use a com-
mon ODG. This design permits, for example, a slow-moving,
carefully edited human-generated set of pages and fragments
to be integrated with a high-speed, automated, database-driven
content source. Because the ODG is aware of the entire frag-
ment space and the interrelationship of the objects within that
space, synchronization points can be chosen to ensure that mul-
tiple, differently-sourced, differently-paced content streams re-
main consistent.
Multiple Trigger Monitor instances may be chained, the sinks
of earlier instances becoming the sources for later ones. This
allows publication to take place in multiple stages. For example,
the publishing system for the 2000 Summer Olympic Games
Web site consists of the following stages (Figure 7):
Development is the first step in the process. Fragments which
appear on many Web pages (such as generic headers and footers)
as well as overall site design occur here. The output of develop-
ment may be structurally complete but lacking in content.
Staging takes as its input, or source, the output, or sink, of De-
velopment. Editors polish pages and combine content from var-
ious sources. Finished pages are the result.
Quality Assurance takes as its source the sink of Staging. Pages
are examined here for correctness and appropriateness.
Automated Results are produced when a database trigger is gen-
erated as the result of an update. The trigger causes programs
to be executed that extract current results and compose relevant
updated pages and fragments. Unlike the previous stages, no
human intervention occurs in this stage.
Production is where pages are served from. Its source is the
sink of QA, and its sinks are the serving directories and caches.
Note how one stage can use the sink of another stage as
its source. The automated feed updates each source at the
same time, but independently of the human-driven stages. This
achieves the dual goals of keeping the entire site consistent
while publishing content immediately from automated feeds.
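To illustrate the chaining idea only, here is a toy sketch in which the output of each stage becomes the input of the next, mirroring how one Trigger Monitor's sink is the next one's source. Only the stage names come from the pipeline above; everything else is hypothetical.

import java.util.*;
import java.util.function.UnaryOperator;

class StagedPublication {

    static String publishThrough(LinkedHashMap<String, UnaryOperator<String>> stages, String draft) {
        String content = draft;
        for (Map.Entry<String, UnaryOperator<String>> stage : stages.entrySet())
            content = stage.getValue().apply(content);   // sink of one stage = source of the next
        return content;                                  // what finally reaches Production
    }

    public static void main(String[] args) {
        LinkedHashMap<String, UnaryOperator<String>> pipeline = new LinkedHashMap<>();
        pipeline.put("Development", c -> c + " [site structure]");
        pipeline.put("Staging", c -> c + " [edited]");
        pipeline.put("Quality Assurance", c -> c + " [approved]");
        System.out.println(publishThrough(pipeline, "graf.html"));
    }
}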
A similar organization was used for the 1998 Winter Olympic
Games Web site. The primary difference was that the process of
moving pages from one stage to the next was purely manual. In
other words, authors had to keep track of all the pages that were
affected by their changes and move them down to Staging, edi-
tors had to move their material to Q/A, and so on. This required
the authors to know something about the editing process and the
editors to know about Q/A. Learning the process was difficult
enough; changing it was even worse.
Our new system eliminates most of the procedural difficulties
which were experienced at the 1998 Olympic Games Web site.
Stages can be added and deleted easily. Data sources can be
added and deleted with little or no disruption to the flow. The
new system adapts much more easily to changing conditions and
requires people working on specific stages of the system to know
less about what is required for other stages.
C. Example
To demonstrate how a site might be built from fragments, we
present a real example from the official Web site for the 1999
French Open Tennis Tournament. A site architect views the
player page for Steffi Graf (shown in Figure 8) as consisting of
a standard header, sidebar, and footer, with biographical infor-
mation and recent results thrown in. The site architect composes
HTML similar to the following, establishing a general layout for
the site:
Fig. 7. Schematic of the publish process for the 2000 Olympic Games Web site: Development, Staging (Editors), Quality Assurance, and Automated Results stages, connected by Trigger Monitors, publish toward the production Web server's htdocs directory.

<html>
<!-- %include(header.frg) -->
<table>
<tr>
<td><!-- %include(sidebr.frg) --></td>
<td><table>
<tr><!-- %fragment(graf_bio.frg) --></tr>
<tr><!-- %fragment(graf_score.frg) --></tr>
</table></td>
</tr>
</table>
<!-- %include(footer.frg) -->
</html>
where “footer.frg” consists of
<!-- %fragment(factoid.frg) -->
<!-- %fragment(copyr.frg) -->
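The %include and %fragment comments in these templates are what the system parses to discover inclusion relationships. As a rough illustration (our sketch, not the product's parser), such directives could be extracted like this:

import java.util.*;
import java.util.regex.*;

class TemplateScanner {

    // matches directives such as <!-- %include(header.frg) --> or <!-- %fragment(copyr.frg) -->
    private static final Pattern DIRECTIVE =
        Pattern.compile("<!--\\s*%(?:include|fragment)\\(([^)]+)\\)\\s*-->");

    // returns the fragments directly embedded by the given template text
    static List<String> embeddedFragments(String template) {
        List<String> found = new ArrayList<>();
        Matcher m = DIRECTIVE.matcher(template);
        while (m.find()) found.add(m.group(1).trim());
        return found;
    }
}

Applied to the layout above, this would yield header.frg, sidebr.frg, graf_bio.frg, graf_score.frg and footer.frg, each of which becomes an inclusion relationship recorded in the ODG.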
Prior to the beginning of play, the contents of "graf_score.frg"
will be empty, since no matches have commenced. This means
the part of the page outlined by the dashed box in Figure 8 will,
at first, be empty. The first publication of this fragment will
result in the ODG seen to the right of Steffi Graf's player page
in Figure 8. Again, the objects and edges within the dashed box
will not yet be within the ODG, since no match play has yet
occurred.
Using fragments in this way permits many architects, editors,
and even automated systems to modify the page simultaneously.
Our system ensures that all changes are properly included in
the final page that is seen by the user. An architect updating
the structure of the page does not need to know anything about
copyrights, trademarks, the size of the sponsor's logos, the look-
and-feel of the site, or any of the data that will be included on
the page. Similarly, an editor wishing to change the look-and-
feel of a site does not need to understand the structure of any
particular page.
Major site changes, like changing the look-and-feel of a site,
are as simple as changing a single page. For example, changing
the sidebar to reflect the end of a long event is as simple as up-
dating “sidebr.frg”. To change the look-and-feel of a site, an ed-
itor only needs to change “header.frg” and “footer.frg”. For both
these kinds of changes, the system will use the ODG from Fig-
ure 8 to determine that Steffi Graf's page must be rebuilt (along
with many others). Once all pages have been rebuilt, they will
be republished. The user will see the changes on every page, al-
though the vast majority of underlying fragments will not have
changed.
More static information, like player biographies, can be kept
up-to-date in one place but used on many pages. For exam-
ple, "graf_bio.frg" is used on our example page, but may also be
used in many other places. To include a new photo or update
the information included in the biography, the editors need only
concern themselves with updating "graf_bio.frg". The system
ensures that all pages which include "graf_bio.frg" will auto-
matically be rebuilt.
Since scoring information will change frequently once a ten-
nis match is in progress, updating that aspect of a page can
be handled by an automated process. As a match begins,
"graf_score.frg" is updated to include the match in progress.
This means that once the final has begun, the "graf_score.frg"
fragment will consist of HTML similar to
<!-- %fragment(final.frg) -->
<!-- %fragment(semi.frg) -->
When the updated "graf_score.frg" is published, the system
will detect that it now includes “final.frg” and “semi.frg” and
will update the ODG as shown in the dashed box within Fig-
ure 8. Now, as the final match progresses, only “final.frg” needs
to be updated and published through our system. As part of the
publication process, the system will detect that “final.frg” is in-
cluded in "graf_score.frg", causing "graf_score.frg" to be rebuilt
using the updated score. Likewise, the system will detect that
Steffi Graf's page must be rebuilt as well, and a new page will
be built including the updated scoring information. Eventually,
when the match completes, the complete page shown in the ex-
ample is produced.
The score for the final match will be displayed on many
pages other than Steffi Graf's player page. For instance, Martina
Hingis's player page will also include these results, as will the
scoreboard page while the match is in progress. A page listing
matchups between different players will also contain the score.
To update all of these pages, the automated system only updates
one fragment. This keeps the automated system independent of
the site design.
III. DEPLOYMENT EXPERIENCES
One of the key things our publishing system enables is sep-
aration of the creative process from the mechanical process of
building a Web site. Previously, the content, look, and feel of
large sites we were involved with had to be carefully planned
well in advance of the creation of the first page. Changes to the
original plans were quite difficult to execute, even in the best
of circumstances. Last-minute changes tended to be impossible,
resulting in a choice between delayed or flawed site publication.
Fig. 8. Sample screen shot from the official Web site for the 1999 French Open Tennis Tournament, together with the ODG representation of the page.
With our publishing system, the entire look and feel of a site
can be changed and republished within minutes. Aside from the
cost savings, this has allowed tremendous creativity on the part
of designers. Entire site designs can be created, experimented
with, changed, discarded, and replaced several times a day dur-
ing the construction of the site. This can take place in parallel
with and independently of the creation of site content.
A specific example of this was demonstrated just before a
new site look for the 2000 Sydney Olympic Games Web site
(http://www.olympics.com) was made public. One day before
the site was to go live before the public, it was decided that the
search facility was not working sufficiently well and must be
removed. This change affected thousands of pages, and would
previously have delayed publication of the site by as much as
several days. Using our system, the site authors simply removed
the search button from the appropriate fragment and republished
the fragment. Ten minutes later, the change was complete, every
page had been rebuilt, and the site went live on schedule.
Figures 9-12 characterize the objects and ODG’s at the 2000
Olympic Games Web site in early November of 1999. Recall
that an object is either a page or a fragment. Figure 9 shows
the distribution of object sizes. Figure 10 shows the distribution
of the number of incoming edges for ODG nodes. Figure 11
shows the distribution of the number of outgoing edges for ODG
nodes. Finally, Figure 12 shows the distribution of maximum
levels at which objects are recursively embedded. The embed
depth of an object is the maximum length of any path in the
ODG originating from the object.
The number of objects at the Web site will increase as the
start date for the 2000 Olympic Games approaches. Once the
Olympic Games are in full swing, the number of objects at the
site will likely exceed the number corresponding to Figures 9-12
by a factor of more than ten.
IV. SYSTEM PERFORMANCE
This section describes the performance of a Java implementa-
tion of our system running on an IBM Intellistation containing
a 333 MHz Pentium II processor with 256 Mbytes of memory
and the Windows NT (version 4.0) operating system.
Fig. 9. The distribution of object sizes at the 2000 Olympic Games Web site. Each bar represents the number of objects contained in the size range whose upper limit is shown on the X-axis.
Fig. 10. The distribution of the number of incoming edges for nodes of ODG's at the 2000 Olympic Games Web site.
The distribution of Web page sizes is similar to the one for the 1998
Olympic Games Web site [8] as well as more recent Web sites
deploying our system; the average Web page size is around 10
Kbytes. Fragment sizes are typically several hundred bytes but
usually less than 1 Kbyte. The distribution of fragment sizes is
also representative of real Web sites deploying our system.
Figure 13 shows the CPU time in milliseconds required for
constructing and publishing bundles of various sizes. Times are
averaged over 100 runs. All 100 runs were submitted simulta-
neously, so the times in the figure reflect the ability of the runs
to be executed in parallel. The solid curve depicts times when
all objects which need to be constructed are explicitly triggered.
The dotted line depicts times when a single fragment which is
included in multiple pages is triggered; the pages which need to
be built as a result of the change to the fragment are determined
from the ODG.
Fig. 11. The distribution of the number of outgoing edges for nodes of ODG's at the 2000 Olympic Games Web site.
Fig. 12. The distribution of the degree to which objects are recursively embedded (embed depth) at the 2000 Olympic Games Web site.
Graph traversal algorithms applied to the ODG
have relatively low overhead. By contrast, each object which is
triggered has to be read from disk and parsed; these operations
consume considerable CPU overhead. As the graph indicates,
it is more desirable to trigger a few objects, which are included
in multiple pages, than to trigger all objects which need to be
constructed.
Our implementation allows multiple complex objects to be
constructed in parallel. As a result, we are able to achieve near
100% CPU utilization, even when construction of an object was
blocked due to I/O, by concurrently constructing other objects.
Fig. 13. The CPU time in milliseconds required to construct and publish bundles of various sizes (one curve for triggering all objects, one for triggering a single object).
The breakdown as to where CPU time is consumed is shown
in Figure 14. CPU time is divided into the following categories:
Retrieve, parse: time to read all triggered objects from disk
and parse them for determining included fragments.
ODG update: time for updating the ODG based on the in-
formation obtained from parsing objects and for analyzing the
ODG to determine all objects which need to be updated and an
efficient order for updating the objects.
Assembly: time to update all objects.
Save data: time to save all updated objects on disk.
Send ack: time to send an acknowledgment message via
HTTP that publication is complete.
Fig. 14. The breakdown in CPU time required to construct and publish a typical complex Web page, in milliseconds per page:

                 Retrieve, parse   ODG update   Assembly   Save data   Send ack
    1 to 100           0.03            3.80       12.75        9.02       0.26
    100 to 100        17.21            3.33       12.25       11.43       0.29
In the bars marked 1 to 100, one fragment included in 100
others was triggered. The 100 pages which needed to be con-
structed were determined from the ODG. In the bars marked 100
to 100, the 100 pages which needed to be constructed were all
triggered. The times shown in Figure 14 are the average times
for a single page. The total average time for constructing and
publishing a page in the 1 to 100 case is 25.86 milliseconds (rep-
resented by the aggregate of all bars); the corresponding time for
the 100 to 100 case is 44.51 milliseconds.
The retrieve and parse time is significantly higher for the 100
to 100 case because the system is reading and parsing 100 ob-
jects compared with 1 in the 1 to 100 case. Since the source for
every object that is triggered must be saved, the time it takes to
save the data is somewhat longer when 100 objects are triggered
than when only one object is triggered.
Figure 15 shows how the average construction and publica-
tion time varies with the number of embedded fragments within
a Web page. Figure 16 shows how the average construction and
publication time varies with the number of fragments which are
triggered for a Web page containing 20 fragments. Both graphs
are averaged over 100 runs.
Fig. 15. The average CPU time in milliseconds required to construct and publish a complex Web page as a function of the number of embedded fragments. In each case, one fragment in the page was triggered.
Fig. 16. The average CPU time in milliseconds required to construct and publish a complex Web page as a function of the number of fragments triggered.
V. RELATED WORK
There are a number of Web content management tools on
the marketplace today such as NetObjects Fusion [11], Allaire's
ColdFusion and HomeSite [1], FutureTense's Internet Publish-
ing System [5], Eventus Software's Control [6], Wallop Soft-
ware's Build-It (now owned by IBM) [13], Site Technologies'
SiteMaster [12], and Microsoft's Visual InterDev [10].
As far as we know, none of these products allow nested frag-
ments to the degree which we do. Most of them don't allow
any type of embedded fragments. They are also not designed to
publish content in multiple stages as ours is.
A key problem with many products such as Fusion and
SiteMaster is that they only work well when all of the Web
content is designed using the product. They don’t provide rich
programmatic interfaces which can deal with or import content
from external sources or feeds. These products thus lack the
ability to treat external data with the same level of control and
consistency as the sources of data the application owns.
By contrast, our system allows Web pages to come from mul-
tiple external sources. This is a key requirement for many of the
Web sites we have encountered. Build-It is similar to our system
in that it works with Web content created by other sources. How-
ever, we found that Build-It was not able to handle Web sites as
large as the 1998 Olympic Games Web site, for example. Our
system is scalable to handle extremely large Web sites.
Our system uses many ideas from the system used for the
1998 Olympic Games Web site [2], [3]. That system used an ear-
lier version of the Trigger Monitor to maintain updated caches of
dynamic data. The original Trigger Monitor maintained updated
caches by reacting to database triggers. When a database change
occurred, a database trigger invoked a UDF (User Defined Func-
tion) that sent a message to the Trigger Monitor containing an
encoded summary of the change. The Trigger Monitor decoded
the message, consulted an ODG to determine which pages were
affected, requested pages from a non-caching HTTP server, and
finally replaced the updated pages in the caches of the servers
connected to the Web.
While the 1998 Olympic Games system worked extremely
well for maintaining updated caches, it lacked the automated
features of our new system for automatically and consistently
publishing dynamic content. While the earlier system used ob-
ject dependence graphs for determining how changes to under-
lying data affected cached objects, it didn’t have capabilities for
automatically constructing pages and fragments in an optimal
order. The earlier system also couldn’t publish combined con-
tent pages efficiently and had fewer options for bundling Web
pages for consistent publication.
VI. SUMMARY AND CONCLUSIONS
We have presented a publishing system for efficiently creating
dynamic Web content. Our publishing system constructs com-
plex objects from fragments which may recursively embed other
fragments. Relationships between Web pages and fragments are
represented by object dependence graphs. We presented algo-
rithms for efficiently detecting and updating all affected Web
pages after one or more fragments change.
After a set of multiple Web pages change or are created for
the first time, the Web pages must be published to an audience.
Publishing all changed Web pages in a single atomic action
avoids consistency problems but may cause delays in publica-
tion, particularly if the newly constructed pages must be proof-
read before publication. Incremental publication can provide
information faster but may also result in inconsistencies across
published Web pages. We presented three algorithms for in-
cremental publication designed to handle different consistency
requirements.
Our publishing system provides an easy method for Web site
designers to specify and modify inclusion relationships among
Web pages and fragments. Users can update content on multiple
Web pages by modifying a template. The system then automat-
ically updates all Web pages affected by the change. It is easy
to change the look and feel of an entire Web site as well as to
consistently update common information on many Web pages.
Our system accommodates both quality controlled fragments
that must be proofread before publication and are typically from
humans as well as immediate fragments that have to be pub-
lished immediately and are typically from automated feeds. A
Web page can combine both quality controlled and immediate
fragments and still be updated in a timely fashion.
Our publishing system has been implemented in both Java
and C++ and is being deployed at several popular Web sites in-
cluding the 2000 Olympic Games Web site. We discussed some
of our experiences with real deployments of our system as well
as its performance.
ACKNOWLEDGMENTS
Several people have contributed to this work including Paul
Dantzig, Peter Davis, Daniel Dias, Glenn Druce, Sara Elo, Grant
Emery, Peter Fiorese, Kip Hansen, Brenden O’Sullivan, Kent
Rankin, and Jerry Spivak.
REFERENCES
[1] Allaire's ColdFusion and HomeSite. http://www.allaire.com/.
[2] Jim Challenger, Paul Dantzig, and Arun Iyengar. A Scalable and Highly Available System for Serving Dynamic Data at Frequently Accessed Web Sites. In Proceedings of ACM/IEEE SC98, November 1998.
[3] Jim Challenger, Arun Iyengar, and Paul Dantzig. A Scalable System for Consistently Caching Dynamic Web Data. In Proceedings of IEEE INFOCOM '99, March 1999.
[4] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.
[5] FutureTense's Internet Publishing System. http://www.futuretense.com/.
[6] Eventus Software's Control. http://www.eventus.com/.
[7] Arun Iyengar and Jim Challenger. Improving Web Server Performance by Caching Dynamic Data. In Proceedings of the 1997 USENIX Symposium on Internet Technologies and Systems, December 1997.
[8] Arun Iyengar, Mark Squillante, and Li Zhang. Analysis and Characterization of Large-scale Web Server Access Patterns and Performance. In World Wide Web, June 1999.
[9] Eric Levy-Abegnoli, Arun Iyengar, Junehwa Song, and Daniel Dias. Design and Performance of a Web Server Accelerator. In Proceedings of IEEE INFOCOM '99, March 1999.
[10] Microsoft's Visual InterDev. http://www.microsoft.com/.
[11] NetObjects Fusion. http://www.netobjects.com/.
[12] Site Technologies' SiteMaster. http://www.sitetech.com/.
[13] Wallop Software's Build-It. http://www.wallop.com/.
... Several papers and articles are concerned with the basic issue of DIWS, which is performance. Most of the papers approach the performance problem through the caching of dynamic data as presented in Challenger (2001Challenger ( , 2000 and Candan (2001). The papers provide different methodologies for caching dynamic Web data on Web servers. ...
... A lot of important work has also been done in the evaluation of the performance and other metrics of real-world Web sites. The first case study presented in Challenger (2001Challenger ( , 2000 is related to Web sites of Olympic games. The IBM team that was responsible for the design and maintenance of consecutive Olympic games Web sites, has reported on many important issues related to DIWS. ...
Chapter
Full-text available
This chapter presents a step-by-step approach to the design, implementation and management of a Data-Intensive Web Site (DIWS). The approach introduces five data formulation and manipulation graphs that are presented analytically. The core concept behind the modeling approach is that of “Web fragments,” that is an information decomposition technique that aids design, implementation and management of DIWS. We then present the steps that must be followed in order to “build” a DIWS based on Web fragments. Finally, we show how our approach can be used to ensure the basic DIWS user requirements of personalization, integrity and performance.
... In this paper we describe the system designed and implemented to monitor flows within the publishing and content distribution systems for the Sydney 2000 Olympic Website [1] [3] and the IBM sponsored Special Events websites [2]. or more stages, to its final destination. ...
Chapter
To support complex, rapidly changing, high-volume websites many components contribute to keeping the content current. Monitoring the workflow through all these components is a challenging task. This paper describes a system in which monitoring objects created by the various heterogeneous, distributed components are distributed to any application choosing to present monitoring information.
Article
Full-text available
Knowledge in web documents, Relevance ranking of webpages and so on are some of the under-researched areas in web content mining (WCM). Apart from the general data mining tools used for knowledge discovery in web, there have been few attempts at reviewing WCM and these were from the perspective of the methods used and the problems solved but not in sufficient depth. This existing literature review attempts does not also reveal which problems have been under-researched and which application area has the most attention when it gets to WCM. The goal of this systematic review is to make available a comprehensive and semi-structured overview of WCM methods, problems and solutions proffered. To provide a comprehensive literature review on this subject, 57 publications which include journals, conferences proceeding, and workshops were considered between the periods of 1999-2018. The findings reveal that updating dynamic content, efficient content extraction, eliminating noise blocks etc remain the most prominent challenges associated with WCM with a very high attention on solving these problems in a more efficient manner. Also, most of the solutions proffered to the problems still come with their various limitations which make this area of research fertile for future research. Caching dynamic web data. With regard to content, the techniques used for content extraction in WCM consist of used Data Update Propagation (DUP), Association rule, Object Dependence Graphs, classification techniques, Document Object Model, Vision-Based Segmentation, Hyperlink-Induced Topic Search and so on. Finally, the study revealed that WCM has been mostly applied to general websites which include random webpages seeking to extract specific parameters. The review was able to identify the limitations of the current research on the subject matter and identify future research opportunities in WCM.
Thesis
Full-text available
İş ve iş süreçlerinin hızla internet ortamına aktarılması sonucu internet ortamındaki bilgi miktarı da hızla artmaktadır. İnternet ortamındaki bilginin yönetilmesi de bilgi miktarının artması sonucu zorlaşmıştır. İnternet ortamındaki bilginin kolay ve hızlı bir şekilde yönetilmesini sağlamak amacıyla otomasyon sistemleri geliştirilmeye başlanmıştır. Web'deki bilginin ya da içeriğin yönetim süreçlerini gerçekleştirmek için geliştirilen otomasyon sistemlerine içerik yönetim sistemleri denmektedir. İçerik yönetim sistemleri uygulama alanlarına göre farklı türlere ayrılmıştır. İçerik yönetim sistemi kavramının oluşmasında ise en çok web sitesi geliştirmek için kullanılan web içerik yönetim sistemleri etkili olmuştur. Web sitesi geliştirmeyi bilmeyen kullanıcıların yetkilerine göre, kendi web içeriklerini yönetmelerini sağlamak, web içeriği çok fazla olan web siteleri için içeriğin kolay ve hızlı bir şekilde yönetilmesini sağlamak gibi bazı özellikler web sitesi içerik yönetim sistemlerinin en önemli amaçları arasında yer almaktadır.�Bu tez çalışmasında, GaziWEB adı verilen yeni bir web sitesi içerik yönetim sistemi geliştirilmiş ve problemin çözümüne yeni bir yaklaşım sunulmaya çalışılmıştır. GaziWEB içerik yönetim sistemiyle kurumsal yapıların ve birimlerinin kendi web içeriklerini birbirlerinden bağımsız olarak yönetebilmelerini sağlayan bir yaklaşım sunulmaktadır. GaziWEB içerik yönetim sisteminde içerik ve içeriğin sunum özellikleri birbirinden ayırmak amacıyla şablon sistemleri desteklenmektedir.�GaziWEB İYS; çoklu dil desteği, kişiselleştirilebilir şablonlara izin verme, sınırsız sayıda rol ve yetki tanımlayarak kullanıcılar ekleme, içeriğin ziyaretçilere sunumunu esnek bir şekilde gerçekleştirebilmek amacıyla her bir menü içeriğinin üye gruplarına göre erişim düzeylerinin belirlenebilmesi ve menü içerik yerleşiminin bölgelere ayrılmasıyla bir menü içeriğinin içerik bileşenleri kullanılarak ayrı ayrı tasarlanabilir olması gibi özelliklerinden dolayı, sistem yeni bir içerik yönetim ve yayınlama örneği ortaya koymaktadır.
Article
Segregating the web page content into logical chunks is one of the popular techniques for modular organization of web page. While chunk-based approach works well for public web scenarios, in case of mobile-first personalization cases, chunking strategy would not be as effective for performance optimization due to dynamic nature of the Web content and due to the nature of content granularity. In this paper, the authors propose a novel framework Micro chunk based Web Delivery Framework which proposes and uses a novel concept of "micro chunk". The micro chunk based Web Delivery framework aims to address the performance challenges posed by regular chunk in a personalized web scenario. The authors will look at the methods for creating micro chunk and they will discuss the advantages of micro chunk when compared to a regular chunk for a personalized mobile web scenario. They have created a prototype application implementing the Micro chunk based Web Delivery Framework and benchmarked it against a regular personalized web application to quantify the performance improvements achieved by micro chunk design.
Article
Dividing the web site page content or web portal page into logical chunks is one of the prominent methods for better management of web site content and for improving web site's performance. While this works well for public web page scenarios, personalized pages have challenges with dynamic data, data caching, privacy and security concerns which pose challenges in creating and caching content chunks. Web portals has huge dependence on personalized data. In this paper the authors have introduced a novel concept called “personalized content chunk” and “personalized content spot” that can be used for segregating and efficiently managing the personalized web scenarios. The authors' experiments show that performance can be improved by 30% due to the personalized content chunk framework.
Article
The performance of web applications is of paramount importance as it can impact end-user experience and the business revenue. Web Performance Optimization (WPO) deals with front-end performance engineering. Web performance would impact customer loyalty, SEO, web search ranking, SEO, site traffic, repeat visitors and overall online revenue. In this paper we have conducted the survey of state of the art tools, techniques, methodologies of various aspects of web performance optimization. We have identified key web performance patterns and proposed novel web performance driven development framework. We have elaborated on various techniques related to different phases of web performance driven development framework.
Chapter
It is generally agreed that the Internet has already become an important medium in our daily lives. With its interactive and multimedia capabilities, the Internet has even greater potential than current media such as television or the telephone. It is no longer just a medium for personal communication and information dissemination; it is also a platform for education, business, and entertainment. This is reflected in the fact that, despite the fluctuation of e-commerce applications, the numbers of Internet users and web pages published on the Internet keep increasing.
Chapter
Full-text available
The World Wide Web provides a means for sharing data and applications among users. However, its performance, and in particular providing fast response times, is still an issue. Caching is one of the key techniques that addresses some of the performance issues in today's Web-enabled applications. Serving dynamic data, especially in an emerging class of Web applications called Web portals, makes caching even more interesting. In this chapter, we study Web caching techniques with a focus on dynamic content. We also discuss the limitations of caching in Web portals and study a solution that addresses these limitations.
Thesis
Full-text available
PHP, MySQL AND XML BASED TURKISH DYNAMIC WEB SITE CONTENT MANAGEMENT SYSTEM: DyNA Keywords: Website Content Management System, Dynamic Template, Web Site Design, XML, PHP, MySQL Abstract: The growth and widespread use of the Internet are no longer served well by web sites that were originally prepared and used as static sites, and many studies in the literature attempt to overcome this problem. In this thesis, a new dynamic web site content management system, called DyNA, is developed and introduced to provide solutions through new approaches. DyNA offers many facilities to authors, administrators, and developers in preparing and publishing web sites. It provides a management environment in which basic operations can be performed easily, without requiring third-party software, so it can be used even by inexperienced people according to their authority level. Templates are used in web sites to maintain visual and structural integrity and to protect the institutional appearance. The DyNA web site content management system also uses templates, so content can be produced and managed easily while the integrity of the web site is preserved. The presented web site content management system has many advantageous features. It provides multi-language support and customized, multiple appearances for different publication environments. An unlimited number of actors can be added by defining roles and authorities. Even if not specified during the planning and publishing phase, a new page type, a new language, and even a new publishing platform can be developed and incorporated.
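As a rough illustration of the template-based separation of content and presentation that systems such as DyNA and GaziWEB describe, the following Python sketch renders pages through a shared template whose placeholders are filled from independently managed fragments; the template, placeholder names, and fragment values are hypothetical.

# A minimal sketch of template-based content/presentation separation.
# The template and fragment names below are illustrative only.

from string import Template

# Presentation: a site-wide template with named placeholders.
PAGE_TEMPLATE = Template(
    "<html><body>\n"
    "  <header>$site_header</header>\n"
    "  <main>$body</main>\n"
    "  <footer>$site_footer</footer>\n"
    "</body></html>"
)

# Content: fragments managed independently of the template.
fragments = {
    "site_header": "Example University",
    "site_footer": "Contact: webmaster@example.edu",
}

def publish(body_html):
    # Editing a fragment (e.g. the footer) changes every page rendered
    # through the template, without touching individual pages.
    return PAGE_TEMPLATE.substitute(body=body_html, **fragments)

if __name__ == "__main__":
    print(publish("<p>Department news ...</p>"))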
Article
Full-text available
In this paper we develop a general methodology for characterizing the access patterns of Web server requests based on a time‐series analysis of finite collections of observed data from real systems. Our approach is used together with the access logs from the IBM Web site for the Olympic Games to demonstrate some of its advantages over previous methods and to construct a particular class of benchmarks for large‐scale heavily‐accessed Web server environments. We then apply an instance of this class of benchmarks to analyze aspects of large‐scale Web server performance, demonstrating some additional problems with methods commonly used to evaluate Web server performance at different request traffic intensities.
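A simplified illustration of deriving a request-rate time series from access-log timestamps, which a benchmark driver could then replay, is sketched below in Python; this is a deliberately reduced stand-in for the time-series methodology described above, and the function name and bucket size are assumptions.

# A minimal sketch of workload characterization from an access log:
# bucket request timestamps into fixed intervals to obtain a request-rate
# time series. This is an illustrative simplification, not the cited method.

from collections import Counter

def request_rate_series(timestamps, bucket_seconds=60):
    """Return {bucket_start: request_count} from raw request timestamps."""
    buckets = Counter((int(t) // bucket_seconds) * bucket_seconds
                      for t in timestamps)
    return dict(sorted(buckets.items()))

if __name__ == "__main__":
    # Synthetic timestamps (seconds); a real study would parse server logs.
    ts = [0, 5, 30, 61, 62, 150, 155, 158, 190]
    print(request_rate_series(ts, bucket_seconds=60))
    # Prints {0: 3, 60: 2, 120: 3, 180: 1}; empty buckets simply do not appear.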
Conference Paper
Full-text available
This paper presents a new approach for consistently caching dynamic Web data in order to improve performance. Our algorithm, which we call data update propagation (DUP), maintains data dependence information between cached objects and the underlying data which affect their values in a graph. When the system becomes aware of a change to underlying data, graph traversal algorithms are applied to determine which cached objects are affected by the change. Cached objects which are found to be highly obsolete are then either invalidated or updated. The DUP was a critical component at the official Web site for the 1998 Olympic Winter Games. By using DUP, we were able to achieve cache hit rates close to 100% compared with 80% for an earlier version of our system which did not employ DUP. As a result of the high cache hit rates, the Olympic Games Web site was able to serve data quickly even during peak request periods
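The following Python sketch illustrates the general idea of dependence-graph traversal that DUP-style caching relies on: edges lead from underlying data to the cached objects built from them, and a traversal starting at the changed data yields the objects to invalidate or update. Class names, method names, and the example keys are illustrative, not taken from the cited implementation.

# A minimal sketch of data-update-propagation style dependence tracking.

from collections import defaultdict, deque

class DependenceGraph:
    def __init__(self):
        self._edges = defaultdict(set)   # node -> nodes that depend on it

    def add_dependency(self, underlying, dependent):
        """Record that `dependent` is constructed from `underlying`."""
        self._edges[underlying].add(dependent)

    def affected_by(self, changed_items):
        """Graph traversal: all objects reachable from the changed data."""
        affected, queue = set(), deque(changed_items)
        while queue:
            node = queue.popleft()
            for dep in self._edges[node]:
                if dep not in affected:
                    affected.add(dep)
                    queue.append(dep)
        return affected

if __name__ == "__main__":
    g = DependenceGraph()
    g.add_dependency("db:event42", "fragment:medal_table")
    g.add_dependency("fragment:medal_table", "page:/results/day3.html")
    g.add_dependency("fragment:medal_table", "page:/index.html")
    # When event 42 changes, the affected cached pages can be invalidated
    # or regenerated and pushed back into the cache.
    print(g.affected_by({"db:event42"}))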
Conference Paper
Full-text available
We describe the design, implementation, and performance of a Web server accelerator which runs on an embedded operating system and improves Web server performance by caching data. The accelerator resides in front of one or more Web servers. Our accelerator can serve up to 5000 pages/second from its cache on a 200 MHz PowerPC 604. This throughput is an order of magnitude higher than that which would be achieved by a high-performance Web server running on similar hardware under a conventional operating system such as Unix or NT. The superior performance of our system results in part from its highly optimized communications stack. In order to maximize hit rates and maintain updated caches, our accelerator provides an API which allows application programs to explicitly add, delete, and update cached data; the API allows our accelerator to cache dynamic as well as static data. We analyze the SPECweb96 benchmark and show that the accelerator can provide high hit ratios and excellent performance for workloads similar to this benchmark.
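A minimal sketch of an explicit cache-management API of the kind described, in which applications add, update, and delete cached entries rather than relying only on expiration times, is shown below in Python; the class and method names are hypothetical and do not reflect the accelerator's actual interface.

# A minimal sketch of an explicit cache-management API: applications
# manage entries themselves. Names are illustrative only.

class AcceleratorCache:
    def __init__(self):
        self._entries = {}

    def add(self, url, body):
        """Publish a (possibly dynamically generated) page into the cache."""
        self._entries[url] = body

    def update(self, url, body):
        """Replace a stale entry in place, keeping it servable."""
        self._entries[url] = body

    def delete(self, url):
        """Invalidate an entry when it can no longer be kept current."""
        self._entries.pop(url, None)

    def serve(self, url):
        """A cache hit returns the stored body; a miss returns None."""
        return self._entries.get(url)

if __name__ == "__main__":
    cache = AcceleratorCache()
    cache.add("/results/day3.html", "<html>day 3 results</html>")
    print(cache.serve("/results/day3.html"))
    cache.update("/results/day3.html", "<html>day 3 results (rev 2)</html>")
    cache.delete("/results/day3.html")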
Article
Full-text available
Dynamic Web pages can seriously reduce the performance of Web servers. One technique for improving performance is to cache dynamic Web pages. We have developed the DynamicWeb cache, which is particularly well-suited for dynamic pages. Our cache has improved performance significantly at several commercial Web sites. This paper analyzes the design and performance of the DynamicWeb cache. It also presents a model for analyzing overall system performance in the presence of caching. Our cache can satisfy several hundred requests per second. On systems which invoke server programs via CGI, the DynamicWeb cache results in near-optimal performance, where optimal performance is that which would be achieved by a hypothetical cache which consumed no CPU cycles. On a system we tested which invoked server programs via ICAPI, which has significantly less overhead than CGI, the DynamicWeb cache resulted in near-optimal performance for many cases and 58% of optimal performance in the worst case. The DynamicWeb cache ...
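The kind of analysis mentioned above can be illustrated with a standard weighted-cost estimate of per-request service cost as a function of the cache hit rate; the Python sketch below is a generic model with illustrative parameter values, not necessarily the exact model of the cited paper.

# A minimal sketch of a hit-rate performance model: a standard
# weighted-cost estimate with illustrative (assumed) parameter values.

def mean_service_cost(hit_rate, cost_hit, cost_miss):
    """Average CPU cost per request given a cache hit rate."""
    return hit_rate * cost_hit + (1.0 - hit_rate) * cost_miss

if __name__ == "__main__":
    cost_hit = 1.0      # ms to serve from cache (illustrative)
    cost_miss = 100.0   # ms to regenerate a dynamic page (illustrative)
    for h in (0.0, 0.8, 0.95, 1.0):
        cost = mean_service_cost(h, cost_hit, cost_miss)
        print(f"hit rate {h:.2f}: {cost:6.1f} ms/request")
    # A hypothetical cache that consumes no CPU cycles corresponds to the
    # hit_rate = 1.0 line, which bounds the best ("optimal") performance.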
Conference Paper
This paper describes the system and key techniques used for achieving performance and high availability at the official Web site for the 1998 Olympic Winter Games which was one of the most popular Web sites for the duration of the Olympic Games. The Web site utilized thirteen SP2 systems scattered around the globe containing a total of 143 processors. A key feature of the Web site was that the data being presented to clients was constantly changing. Whenever new results were entered into the system, updated Web pages reflecting the changes were made available to the rest of the world within seconds. One technique we used to serve dynamic data efficiently to clients was to cache dynamic pages so that they only had to be generated once. We developed and implemented a new algorithm we call Data Update Propagation (DUP) which identifies the cached pages that have become stale as a result of changes to underlying data on which the cached pages depend, such as databases. For the Olympic Games Web site, we were able to update stale pages directly in the cache which obviated the need to invalidate them. This allowed us to achieve cache hit rates of close to 100%. Our system was able to serve pages to clients quickly during the entire Olympic Games even during peak periods. In addition, the site was available 100% of the time. We describe the key features employed by our site for high availability. We also describe how the Web site was structured to provide useful information while requiring clients to examine only a small number of pages.
T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms.
Site Technologies' SiteMaster. http://www.sitetech.com/.
Wallop Software's Build-It. http://www.wallop.com/.