Techniques for Efficient Fragment Detection in Web Pages
Lakshmish Ramaswamy¹, Arun Iyengar², Ling Liu¹, Fred Douglis²
¹ College of Computing, Georgia Tech, 801 Atlantic Drive, Atlanta GA 30332
² IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598
{laks, lingliu} {aruni, fdouglis}
The existing approaches to fragment-based publishing, delivery and caching of web pages assume that the web pages are manually fragmented at their respective web sites. However, manual fragmentation of web pages is expensive, error prone, and not scalable. This paper proposes a novel scheme to automatically detect and flag possible fragments in a web site. Our approach is based on an analysis of the web pages dynamically generated at given web sites with respect to their information sharing behavior, personalization characteristics and change patterns.
Categories and Subject Descriptors: H.3.3 [Information Systems - Information storage and retrieval]: Information search and retrieval
General Terms: Design, Algorithms
Keywords: Fragment-based publishing, Fragment caching, Fragment detection
Dynamic content on the web has posed serious challenges to the scalability of the web in general and the performance of individual web sites in particular. There has been considerable research towards alleviating this problem. One promising research direction that has been pursued and successfully commercialized in recent years is fragment-based publishing, delivery and caching of web pages [2, 7, 8].
Research on fragment-based publishing and caching has been prompted by the following observations:
• Web pages don’t always have a single theme or functionality. Often web pages have several pieces of information, whose themes and functions are independent.
• Generally, web pages aren’t completely dynamic or personalized. Often the dynamic and personalized content is embedded in relatively static web pages.
• Web pages from the same web site tend to share information among themselves.
Most of this work was done while Lakshmish was an intern at IBM Research in the summers of 2002 and 2003.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM 2003 November 3–8, 2003, New Orleans, Louisiana, USA.
Copyright 2003 ACM 1-58113-723-0/03/0011 ...$5.00.
Conceptually, a fragment can be defined as a part of a web page (or more generally part of another fragment), which has a distinct theme or functionality associated with it and which is distinguishable from the other parts of the web page. In fragment-based publishing, the cacheability and the lifetime are specified at the fragment level rather than at the page level.
The advantages of fragment-based schemes are apparent and have been conclusively demonstrated in the literature [7, 8]. By separating the non-personalized content from the personalized content and marking each as such, fragment-based publishing increases the cacheable content of web sites. Furthermore, with fragment-based publishing, the amount of data that gets invalidated at the caches is reduced. In addition, the information that is shared across web pages needs to be stored only once, which improves the disk space utilization at the caches.
Though there have been considerable research efforts on the performance and benefits of fragment-based publishing and caching, there has been little research on detecting such fragments in existing web sites. Most of the research efforts on fragment caching rely upon the web administrator or the web page designer to manually fragment the pages on the web site. However, manual fragment detection is both very costly and error prone.
In this paper, we propose a novel scheme to automatically detect and flag possible fragments in a web site. We analyze web pages with respect to their information sharing behavior, personalization characteristics, and the changes occurring to them over time. Based on this analysis, our system detects and flags the “interesting” fragments in a web site. The research contributions of this paper are along three dimensions. First, we formally define the concept of a Candidate Fragment, which forms the basis for fragment detection schemes. Second, we provide an infrastructure for detecting fragments within a web site, including an efficient document model and a fast string encoding algorithm for fragment detection. Third, we present two algorithms for detecting fragments that are shared among M documents or that have different lifetime characteristics, which we call the Shared Fragment Detection Algorithm and the Lifetime-Personalization based (L-P) Fragment Detection Algorithm respectively.
The web documents considered in this paper are HTML documents. We assume that all HTML documents are well-formed [6]. Documents that are not well formed can be converted to well-formed documents. We refer to such a transformation as document normalization. HTML Tidy [3] is a well-known Internet tool for transforming an arbitrary HTML document into a well-formed one.
2.1 Candidate Fragments
We introduce the notion of candidate fragments as follows:
• Each web page of a web site is a candidate fragment.
• A part of a candidate fragment is itself a candidate fragment if either of the following two conditions is satisfied:
  1. The part is shared among “M” already existing candidate fragments, where M > 1.
  2. The part has different personalization and lifetime characteristics than those of its encompassing (parent or ancestor) candidate fragment.
It is evident from the definition that the two conditions
are independent and these conditions define fragments that
benefit caching from two different and independent perspec-
tives. We call the fragments satisfying Condition 1 Shared
fragments, and the fragments satisfying Condition 2 L-P
fragments (denoting Lifetime-Personalization based frag-
ments). Lifetime characteristics of a fragment govern the
time duration for which the fragment, if cached, would stay
fresh (in tune with the value at the server). The personal-
ization characteristics of a fragment correspond to the vari-
ations of the fragment in relation to cookies or parameters
of the URL.
It can be observed that the two independent conditions
in the candidate fragment definition correspond well to two
key aims of fragment caching. By identifying and creating
fragments out of the parts that are shared across more than
one fragment, we avoid unnecessary duplication of informa-
tion at the caches. By creating fragments that have different
lifetime and personalization properties we not only improve
the cacheable content but also minimize the amount and
frequency of the information that needs to be invalidated.
The primary goal of our system is to detect and flag candidate fragments from pages of a given web site. The fragment detection process is divided into three steps. First, the system constructs an Augmented Fragment Tree (see Section 3.1) for each page of a web site. Second, the system applies the fragment detection algorithms to the augmented fragment trees of the web pages to detect the candidate fragments. Third, the system collects statistics about the fragments, such as their size, how many pages share each fragment, and access rates. These statistics help the administrator decide whether to turn on the fragmentation. Figure 1 gives a sketch of the architecture of our fragment detection system.
Our fragment detection framework has two independent
schemes: a scheme to detect Shared fragments and another
scheme to detect L-P fragments. Both of the schemes are
located in one large framework, probably collocated with a
server-side cache, and work on the web page dumps from
the web site.
Figure 1: Fragment Detection System Architecture
The scheme to detect Shared fragments works on various
pages from the same web site, whereas the L-P fragments
approach works on different versions of each web page (if
the web page has more than one version). For example, in
order to detect L-P fragments, we have to locate parts of
a fragment that have different lifetime and personalization
characteristics. This can be done by comparing different
versions of the web page and detecting the parts that have
changed and the parts that have remained constant. How-
ever, to detect Shared fragments, we are looking for parts of
a fragment that are shared by other fragments, which natu-
rally leads us to work with a collection of different web pages
from the same web site.
While the inputs to the L-P fragment detection scheme and the Shared fragment detection approach differ, both schemes rely upon the Augmented Fragment Tree representation of their input web pages, which is described in the next subsection. The output of our fragment detection system is a set of fragments that are shared among a given number of documents or that have different lifetime characteristics. This information serves as a set of recommendations to the fragment caching policy manager or the respective web administrator (see Figure 1).
3.1 Augmented Fragment Tree
A good model to represent the web pages is one of the keys
to efficient and accurate fragment detection. We introduce
the concept of an Augmented Fragment Tree as a model to
represent web pages.
An augmented fragment (AF) tree is a hierarchical rep-
resentation of the structure of an HTML document. It is
a compact DOM tree [1] with all the text-formatting tags
removed, and each node augmented with additional infor-
mation for efficient comparison of different documents and
different fragments of documents. Each node in the tree is
annotated with the following fields:
• Node Identifier (Node-ID): A vector indicating the location of the node in the tree.
• Node-Value: A string indicating the value of the node. The value of a leaf node is the text itself and the value of an internal node is NULL (empty string).
• Subtree-Value: A string that is defined recursively. For a leaf node, the Subtree-Value is equal to its Node-Value. For all internal nodes, the Subtree-Value is a concatenation of the Subtree-Values of all its children nodes and its own Node-Value. The Subtree-Value of a node can be perceived as the fragment (content region) of a web document anchored at this subtree node.
• Subtree-Size: An integer whose value is the length of Subtree-Value in bytes. This represents the size of the structure in the document being represented by this node.
• Subtree-Signature: An encoding of the subtree value for fast comparison. We choose shingles [5, 9] as the encoding mechanism (see the discussion below). Therefore we also refer to the Subtree-Signature as Subtree-Shingle.
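The node annotations listed above can be sketched as a small data structure. The Python below is only an illustration under stated assumptions: the class name `AFNode`, its field names, and the use of UTF-8 for the byte count are hypothetical choices, not details taken from the system described here, and the Subtree-Signature field is left out of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AFNode:
    # Node-ID: path of child indexes from the root (hypothetical encoding).
    node_id: Tuple[int, ...]
    # Node-Value: the text of a leaf, or "" for an internal node.
    node_value: str = ""
    children: List["AFNode"] = field(default_factory=list)

    @property
    def subtree_value(self) -> str:
        """Children's Subtree-Values concatenated with this node's own value."""
        return "".join(c.subtree_value for c in self.children) + self.node_value

    @property
    def subtree_size(self) -> int:
        """Length of Subtree-Value in bytes (UTF-8 is an assumption)."""
        return len(self.subtree_value.encode("utf-8"))


# A table row with two text cells.
row = AFNode((0,), children=[AFNode((0, 0), "Top story"), AFNode((0, 1), "Weather")])
```

Here `row.subtree_value` is the content region anchored at the row, and `row.subtree_size` is the quantity that a size threshold such as MinFragSize would be compared against.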
Shingles [5, 9] are essentially fingerprints of the document (or equivalently a string). But unlike other fingerprints like MD5, if the document changes by a small amount, its shingles also change by a small amount.
Figure 2 illustrates this property by giving examples of the MD5 hash and the shingles of two strings. The first and the second strings in the figure are essentially the same string with small perturbations. It can be seen that the MD5 hashes of the two strings are totally different, whereas the shingles of the two strings vary by just a single value out of the 8 values in the shingles set (shingle values appearing in both sets are underlined in the diagram). This property has made shingles very popular for estimating the resemblance and containment of documents [5, 4].
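The contrast between MD5 and shingles can be reproduced with a few lines of Python. This is a minimal sketch, not the encoding used in our system: word-level shingles of width 3, the helper names `word_shingles`, `sketch` and `overlap`, and the use of truncated MD5 values as shingle hashes are all assumptions made for illustration. The signature size of 8 mirrors the example in Figure 2.

```python
import hashlib


def word_shingles(text, w=3):
    """Return the set of all runs of w consecutive words in text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def sketch(text, w=3, k=8):
    """Fixed-size signature: the k smallest hash values over all shingles."""
    hashes = {
        int.from_bytes(hashlib.md5(" ".join(s).encode("utf-8")).digest()[:8], "big")
        for s in word_shingles(text, w)
    }
    return set(sorted(hashes)[:k])


def overlap(sig_a, sig_b):
    """Fraction of values shared by two signatures (Jaccard resemblance)."""
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"

# One changed word gives two unrelated MD5 digests ...
md5_a = hashlib.md5(a.encode("utf-8")).hexdigest()
md5_b = hashlib.md5(b.encode("utf-8")).hexdigest()

# ... but 4 of the 10 distinct shingles are still common to both strings.
shared = overlap(word_shingles(a), word_shingles(b))
```

Longer strings with the same one-word change share a larger fraction of their shingles, which is why a threshold on signature overlap is a workable test for approximate equality.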
Figure 2: Example of Shingle Vs MD5
3.2 AF Tree Construction
The first step of our fragment detection process is to convert web pages to their corresponding AF trees. The AF tree can be constructed in two steps. The first step is to transform a web document to its DOM tree and prune the fragment tree by eliminating the text formatting nodes. The result of the first step is a specialized DOM tree that contains only the content structure tags (e.g., <TABLE>, <TR>, <P>). The second step is to annotate the fragment tree obtained in the first step with Node-ID, Node-Value, Subtree-Value, and Subtree-Shingle.
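The two construction steps can be sketched with Python's standard `html.parser` module. The sketch makes several assumptions that are not specified in this paper: the set `FORMATTING_TAGS`, the nested-dictionary node layout, the `#root` and `#text` pseudo-tags, and the function names are all hypothetical, the input is taken to be well-formed after normalization, and the Subtree-Shingle annotation is omitted.

```python
from html.parser import HTMLParser

# Assumption: the exact list of text-formatting tags is not given in the paper.
FORMATTING_TAGS = {"b", "i", "u", "em", "strong", "font", "span", "small", "big"}


class AFTreeBuilder(HTMLParser):
    """Step 1: keep content-structure tags and text leaves, drop formatting tags."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "value": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in FORMATTING_TAGS:
            return  # pruned: its text attaches to the enclosing structure node
        node = {"tag": tag, "value": "", "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in FORMATTING_TAGS:
            return
        # Well-formed input is assumed, so the closing tag matches the stack top.
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            leaf = {"tag": "#text", "value": text, "children": []}
            self.stack[-1]["children"].append(leaf)


def annotate(node, node_id=()):
    """Step 2: add Node-ID, Subtree-Value and Subtree-Size, bottom-up."""
    node["id"] = node_id
    for i, child in enumerate(node["children"]):
        annotate(child, node_id + (i,))
    node["subtree_value"] = (
        "".join(c["subtree_value"] for c in node["children"]) + node["value"]
    )
    node["subtree_size"] = len(node["subtree_value"].encode("utf-8"))
    return node


def build_af_tree(html):
    builder = AFTreeBuilder()
    builder.feed(html)
    builder.close()
    return annotate(builder.root)


tree = build_af_tree(
    "<table><tr><td><b>Top story</b></td><td>Weather</td></tr></table>"
)
```

In this example the `<b>` element disappears from the tree and its text becomes a leaf directly under the first `<td>` node.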
Having described the structure of the AF tree and the methodology for constructing it, we now describe the algorithms to detect Shared and L-P fragments.
4.1 Detecting Shared Fragments
The Shared fragment detection algorithm operates on various web pages from the same web site and detects candidate fragments that are “approximately” shared. In our framework for Shared fragment detection, we add three additional parameters to define the appropriateness of such approximately shared fragments. These parameters can be configured based on the needs of the particular web site. The accuracy and the performance of the algorithm are dependent on the values of these parameters.
• Minimum Fragment Size (MinFragSize): This parameter specifies the minimum size of the detected fragment.
• Sharing Factor (ShareFactor): This indicates the minimum number of pages that should share a segment in order for it to be declared a fragment.
• Minimum Matching Factor (MinMatchFactor): This parameter specifies the minimum overlap between the Subtree-Shingles to be considered as a shared fragment.
The shared fragment detection algorithm performs the detection in three steps. First, it creates a pool of all the nodes belonging to the AF trees of every web page of the web site, removing all nodes that do not meet the MinFragSize threshold. Second, the algorithm processes the nodes in the pool in decreasing order of their sizes. Each node is compared against the other nodes in the pool, and similar nodes are grouped together. The similarity between nodes is measured by comparing their Subtree-Shingles. If the number of nodes in a group is equal to or higher than ShareFactor, then that group of nodes is flagged as a candidate fragment. The third step ensures that the scheme detects only the fragments that are maximally shared by eliminating those that are not. This is done by checking whether a larger fragment has already been detected that contains the ancestors of all the nodes in the current group and no other nodes. If such a fragment has already been detected, then the current group is not a maximally shared fragment and hence is not declared a candidate fragment. Otherwise it is declared a fragment, and all the nodes in the group are removed from the node pool.
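The three steps can be illustrated with the following Python sketch. It is a simplified reading of the description above rather than our implementation: the record type `PoolNode`, the function names, and the Jaccard-style overlap measure are assumptions, and the sketch removes every processed group from the pool (declared or not) so that each node is examined only once.

```python
from typing import FrozenSet, NamedTuple, Tuple


class PoolNode(NamedTuple):
    page: str                  # page whose AF tree holds the node
    node_id: Tuple[int, ...]   # Node-ID path inside that tree
    size: int                  # Subtree-Size in bytes
    shingles: FrozenSet[int]   # Subtree-Shingles


def overlap(a, b):
    """Shared fraction of two shingle sets (assumed similarity measure)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def is_ancestor(anc, node):
    """True if anc is a proper ancestor of node in the same page."""
    return (anc.page == node.page
            and len(anc.node_id) < len(node.node_id)
            and node.node_id[:len(anc.node_id)] == anc.node_id)


def covered_by_larger(group, fragments):
    """Step 3: an earlier fragment holds ancestors of all group nodes, and nothing else."""
    for frag in fragments:
        group_inside = all(any(is_ancestor(f, n) for f in frag) for n in group)
        frag_exact = all(any(is_ancestor(f, n) for n in group) for f in frag)
        if group_inside and frag_exact:
            return True
    return False


def detect_shared(nodes, min_frag_size, share_factor, min_match_factor):
    # Step 1: pool of nodes meeting MinFragSize, largest first.
    pool = sorted((n for n in nodes if n.size >= min_frag_size),
                  key=lambda n: n.size, reverse=True)
    fragments = []
    while pool:
        seed = pool[0]
        # Step 2: group the seed with every similar node left in the pool.
        group = [seed] + [n for n in pool[1:]
                          if overlap(seed.shingles, n.shingles) >= min_match_factor]
        if len(group) >= share_factor and not covered_by_larger(group, fragments):
            fragments.append(group)
        # Simplification of this sketch: drop the group whether or not it was declared.
        pool = [n for n in pool if n not in group]
    return fragments
```

For three pages that share an identical header containing a smaller logo node, this sketch reports the header group once and suppresses the logo group, since the header fragment already contains the ancestors of all logo nodes and nothing else.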
4.2 Detecting L-P Fragments
We detect the L-P fragments by comparing different versions of already existing candidate fragments (web pages) and identifying the changes occurring among the different versions.
The input to this algorithm is a set of AF trees corresponding to different versions of web pages. These versions may be time-spaced or versions generated with different cookies.
The nodes of the AF trees in this algorithm have an additional field termed the NodeStatus, which can take any value from {UnChanged, ValueChanged, PositionChanged}.
The scheme compares two versions of a web page at each step and detects L-P candidate fragments. Each step outputs a set of candidate fragments, which are merged to obtain the Object Dependency Graph (ODG) of the entire document. An Object Dependency Graph [7] is a graphical representation of the containment relationship between the fragments of a web site. Each step of the algorithm executes in two passes. In the first pass the algorithm marks the nodes that have changed in value or in position between the two versions of the AF tree. This is done by recursively comparing each node of the AF tree of one version with the most similar node from the AF tree of the second version. The content and the position of the two nodes are compared, and the NodeStatus of each node is accordingly marked as ValueChanged, PositionChanged or UnChanged. The second pass of the algorithm detects the candidate fragments and merges them to obtain an Object Dependency Graph.
Similar to the Shared fragment detection algorithm, we have a few configurable parameters in this algorithm:
• Minimum Fragment Size (MinFragSize): This parameter indicates the minimum size of the detected fragment.
• Child Change Threshold (ChildChangeThreshold): This parameter indicates the minimum fraction of children of a node that should change in value before the parent node itself can be declared as ValueChanged.
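The first (marking) pass and the role of ChildChangeThreshold can be illustrated with a short Python sketch. It is deliberately simplified and rests on assumptions not made in this paper: leaves are matched by exact text equality instead of a search for the most similar node, the dictionary node layout and the helper names are hypothetical, and the second pass that builds the Object Dependency Graph is not shown.

```python
from enum import Enum


class NodeStatus(Enum):
    UNCHANGED = "UnChanged"
    VALUE_CHANGED = "ValueChanged"
    POSITION_CHANGED = "PositionChanged"


def leaf(value):
    return {"value": value, "children": []}


def node(*children):
    return {"value": "", "children": list(children)}


def leaf_positions(tree, node_id=(), out=None):
    """Map each leaf value of a tree to the set of Node-IDs where it occurs."""
    out = {} if out is None else out
    if not tree["children"]:
        out.setdefault(tree["value"], set()).add(node_id)
    for i, child in enumerate(tree["children"]):
        leaf_positions(child, node_id + (i,), out)
    return out


def mark(new, old_leaves, threshold, node_id=()):
    """Set new['status'] on every node of the newer version and return it."""
    if not new["children"]:
        ids = old_leaves.get(new["value"])
        if ids is None:
            new["status"] = NodeStatus.VALUE_CHANGED
        elif node_id in ids:
            new["status"] = NodeStatus.UNCHANGED
        else:
            new["status"] = NodeStatus.POSITION_CHANGED
        return new["status"]
    statuses = [mark(c, old_leaves, threshold, node_id + (i,))
                for i, c in enumerate(new["children"])]
    changed = sum(s is NodeStatus.VALUE_CHANGED for s in statuses)
    # A parent is ValueChanged only if enough of its children changed in value.
    if changed / len(statuses) >= threshold:
        new["status"] = NodeStatus.VALUE_CHANGED
    else:
        new["status"] = NodeStatus.UNCHANGED
    return new["status"]


old = node(node(leaf("Logo"), leaf("Menu")),
           node(leaf("Story A"), leaf("Story B"), leaf("Ad 1")))
new = node(node(leaf("Logo"), leaf("Menu")),
           node(leaf("Story C"), leaf("Story D"), leaf("Ad 1")))
mark(new, leaf_positions(old), threshold=0.5)
```

In this example two of the three children of the second subtree change, so that subtree is marked ValueChanged at a threshold of 0.5 but stays UnChanged at 0.7, where the change is instead attributed to the individual leaves.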
We have performed a range of experiments to evaluate
our automatic fragment detection scheme. In this section we
give a brief overview of two sets of experiments. The first set
of experiments tests the two fragment detection algorithms,
showing the benefits and effectiveness of the algorithms. The
second set studies the impact of the fragments detected by
our system on improving the caching efficiency.
The input to the schemes is a collection of web pages including different versions of each page. Therefore we periodically fetched web pages from different web sites such as BBC, Internetnews and Slashdot, and created a web ‘dump’ for each web site.
In the first set of experiments we evaluated our shared fragment detection scheme and the impact of the parameters MinMatchFactor and MinFragSize on the detected shared fragments. For example, on a dataset of 75 web pages from the BBC web site, which were collected on the 14th of July 2002, our shared fragment detection algorithm detected 350 fragments when MinFragSize was set to 30 bytes and MinMatchFactor was set to 70%. In all our experiments we noticed that our algorithm detected a larger number of smaller sized fragments when MinMatchFactor was set to higher values. For the same BBC dataset, the number of fragments increased to 358 when the MinMatchFactor was increased to 90%. In our experiments we also noticed that a large percentage of detected fragments are shared by 2 pages and only a few fragments are shared by more than 50% of the web pages.
Our second experiment was aimed at studying the performance of the L-P fragment detection algorithm. Though we experimented with a number of web sites, due to space limitations, we briefly discuss our experiments on the web site from Slashdot. A total of 79 fragments were detected when the ChildChangeThreshold was set to 0.50, and 285 fragments were detected when ChildChangeThreshold was set to 0.70. We observed higher numbers of fragments being detected when ChildChangeThreshold is set to higher values in all our experiments.
In our final set of experiments we study the impact of fragment caching on the performance of the cache, the server and the network when the web sites incorporate the fragments detected by our system into their respective web pages.
Incorporating the detected fragments improves the performance of the caches and the web servers in at least two ways. First, as the information that is shared among web pages is stored only once, the disk space required to store the data from the web site is reduced. For example, our experiments on the BBC web site show that disk space requirements are reduced by around 22% by using the fragments detected by our algorithm.
Secondly, incorporating fragments into web sites reduces the amount of data invalidated in the caches. This in turn causes a reduction in the traffic between the origin servers and the cache. Our experiments show that for each web site this reduction is closely related to the average number of fragments in the web pages, their invalidation rates and the request rates to the web pages in the web site. However, the amount of data transferred between server and cache in a page-level caching scheme is higher than the data transferred between server and cache in a fragment-level caching scheme.
This paper addresses the problem of automatic fragment detection in web pages. We provided a formal definition of a fragment and proposed an infrastructure to detect fragments in web sites. We developed two algorithms for automatic detection of fragments that are shared across web pages and fragments that have distinct lifetime and personalization characteristics. We reported the evaluation of the proposed scheme through a series of experiments, showing the effectiveness of the proposed algorithms.
[1] Document Object Model - W3C Recommendation.
[2] Edge Side Includes Standard Specification.
[3] HTML Tidy.
[4] Z. Bar-Yossef and S. Rajagopalan. Template Detection via Data Mining and its Applications. In Proceedings of WWW-2002, May 2002.
[5] A. Broder. On Resemblance and Containment of Documents. In Proceedings of SEQUENCES-97, 1997.
[6] D. Buttler and L. Liu. A Fully Automated Object Extraction System for the World Wide Web. In Proceedings of ICDCS-2001, 2001.
[7] J. Challenger, A. Iyengar, K. Witting, C. Ferstat, and P. Reed. Publishing System for Efficiently Creating Dynamic Web Content. In Proceedings of IEEE INFOCOM 2000, May 2000.
[8] A. Datta, K. Dutta, H. Thomas, D. VanderMeer, Suresha, and K. Ramamritham. Proxy-Based Acceleration of Dynamically Generated Content on the World Wide Web: An Approach and Implementation. In Proceedings of SIGMOD-2002, June 2002.
[9] U. Manber. Finding Similar Files in a Large File System. In Proceedings of USENIX-1994, January 1994.
... However, automatic fragmentation has been investigated. Related studies determine the fragments by periodically comparing different HTML output versions generated by the same URL ([15] and [12]). The main disadvantages of this approach are the overhead of fragmenting a page online and the redundant online processing. ...
In order to accelerate access to Web applications, content providers are increasingly relying on Content Delivery Networks. Currently, CDNs serve the dynamic content from the edge in two major ways : page assembly or edge computing. Page assembly assumes that the proportion of cacheable content is high, that the cached fragments are reusable and that they do not change very often, while edge computing generally assumes that the whole application is replicated on the edge, which is not always suitable. Besides, current CDNs do not provide a means of scaling the database component of a Web application. In this study we propose a hybrid CDN called FRACS which combines both page assembly and edge computing in order to address the needs of relatively static applications as well as more dynamic applications. FRACS automatically determines the replicable fragments of a Web application, then it modies the application's code so as to generate fragmented pages in ESI format and to enable the server to serve the fragments separately. Moreover, FRACS maintains the consistency of all the manipulated fragments. Using the TPC-W benchmark we were able to achieve up to 60% savings in bandwidth and more than an 80% reduction in response time.
... , ranging from 0.2% to 100%. 1998; Chen et al., 2000; Haveliwala et al., 2000; Mitzenmacher and Owen, 2001; Haveliwala et al., 2002; Ramaswamy et al., 2003; Poon and Chang, 2003) in data mining and information retrieval. Border's sketches was designed for estimating the resemblance between sets P . ...
Full-text available
We propose a sketch-based sampling algorithm, which effectively exploits the data sparsity. Sam-pling methods have become popular in large-scale data mining and information retrieval, where high data sparsity is a norm. A distinct feature of our algorithm is that it combines the advan-tages of both conventional random sampling and more modern randomized algorithms such as local sensitive hashing (LSH). While most sketch-based algorithms are designed for specific sum-mary statistics, our proposed algorithm is a general purpose technique, useful for estimating any summary statistics including two-way and multi-way distances and joint histograms.
... Though there have been considerable efforts to exploit the potential of fragment-based schemes, there has been little research on detecting the fragments automatically on existing Web sites. To the best of our knowledge, only two research studies gone more deeply into the automation of the fragmentation ([7] and [6]). Both studies rely on the generation of a modified HTML tree 1 but differ in the selection criteria of the fragments. ...
Conference Paper
Full-text available
In this paper we propose a fragment-based caching system that aims at improving the performance of Web- based applications. The system fragments the dynamic pages automatically. Our approach consists in statically analyzing the programs that generate the dynamic pages rather than their output. This approach has the considerable advantage of optimizing the overhead due to fragmentation. Furthermore, we propose a mechanism that increases the reuse rate of the stored fragments, so that the site response time can be improved among other benets. We validate our approach by using TPC-W as a benchmark.
... Compared to our work, they do not attempt to detect individual news items on the pages. Other approaches with similar goals are proposed by Bar-Yossef and Rajagopalan in [1] and by Ramaswamy et al. in [15]. In [10] and [13] Kao et al. and Lin and Ho present methods to extract informative information from web page tables (<TABLE> in HTML). ...
Conference Paper
Web newspapers provide a valuable resource for information. In order to benefit more from the available information, text mining techniques can be applied. However, because each newspaper page often covers a lot of unrelated topics, page-based data mining will not always give useful results. In order to improve on complete-page mining, we present an approach based on extracting the individual news items from the web pages and mining these separately. Automatic news item extraction is a difficult problem, and in this paper we also provide strategies solving that task. We study the quality of the news item extraction, and also provide results from clustering the extracted news items.
Conference Paper
Full-text available
In the democratic state of Brazil, the growing desire for information transparency exposes Government and Society in complementary roles. Society is anxious for data increasingly free and democratic, on the other hand, Government has to ensure not only the transparency, but also consistency and reliability of information provided. In this context, this paper aims at presenting a System to Integrate Government News -an environment that allows searching of articles published by government agencies and also provides ways for the Government assess whether their communication actions comply with reporting criteria of appropriateness to public messages.. Resumo. No estado democrático brasileiro, o crescente desejo de transparência de informações expõe Governo e Sociedade a papéis complementares. Se por um lado a Sociedade anseia por dados cada vez mais democráticos e livres, o Governo necessita de meios que garantam não somente a transparência, mas também a coerência e idoneidade das informações disponibilizadas. Neste contexto, este artigo apresenta o Sistema Integrador de Notícias de Governo -um ambiente que permite a busca de notícias publicadas em órgãos públicos e ainda provê subsídios para o Governo avaliar se suas ações de comunicação obedeçam a critérios de sobriedade e adequação das mensagens ao público.
Conference Paper
Full-text available
The efficacy of a fragment-based caching system fundamen- tally depends on the fragments' definition and the bringing into play of mechanisms that improve reuse and guarantee the consistency of the cache content (notably "purification" and invalidation mechanisms). Ex- isting caching systems assume that the administrator provides the re- quired configuration data manually, which is likely to be a heavy, time- consuming task and one that is prone to human error. This paper pro- poses a tool that helps the administrator to cope with these issues, by automating the systematic tasks and proposing a default fragmentation with the prerequisite reuse and invalidation directives, that may either be augmented or overwritten if necessary.
Advantages of cache cooperation on edge cache networks serving dynamic web content were studied. Design of cooperative edge cache grid a large-scale cooperative edge cache network for delivering highly dynamic web content with varying server update frequencies was presented. A cache clouds-based architecture was proposed to promote low-cost cache cooperation in cooperative edge cache grid. An Internet landmarks-based scheme, called selective landmarks-based server-distance sensitive clustering scheme, for grouping edge caches into cooperative clouds was presented. Dynamic hashing technique for efficient, load-balanced, and reliable documents lookups and updates was presented. Utility-based scheme for cooperative document placement in cache clouds was proposed. The proposed architecture and techniques were evaluated through trace-based simulations using both real-world and synthetic traces. Results showed that the proposed techniques provide significant performance benefits. A framework for automatically detecting cache-effective fragments in dynamic web pages was presented. Two types of fragments in web pages, namely, shared fragments and lifetime-personalization fragments were identified and formally defined. A hierarchical fragment-aware web page model called the augmented-fragment tree model was proposed. An efficient algorithm to detect maximal fragments that are shared among multiple documents was proposed. A practical algorithm for detecting fragments based on their lifetime and personalization characteristics was designed. The proposed framework and algorithms were evaluated through experiments on real web sites. The effect of adopting the detected fragments on web-caches and origin-servers is experimentally studied. Ph.D. Committee Chair: Dr. Ling Liu; Committee Member: Dr. Arun Iyengar; Committee Member: Dr. Calton Pu; Committee Member: Dr. H. Venkateswaran; Committee Member: Dr. Mustaque Ahamad
Full-text available
Proc. of Natl' Conf. on Advanced Database Systems (SEBD '01), Venezia (June 2001), Italy, LCM Selecta, Milan, pp. 215--222, 2001.
Conference Paper
Full-text available
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of “roughly the same” and “roughly contained.” The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints
Conference Paper
Full-text available
As Internet traffic continues to grow and web sites become increasingly complex, performance and scalability are major issues for web sites. Web sites are increasingly relying on dynamic content generation applications to provide web site visitors with dynamic, interactive, and personalized experiences. However, dynamic content generation comes at a cost --- each request requires computation as well as communication across multiple components. To address these issues, various dynamic content caching approaches have been proposed. Proxy-based caching approaches store content at various locations outside the site infrastructure and can improve Web site performance by reducing content generation delays, firewall processing delays, and bandwidth requirements. However, existing proxy-based caching approaches either (a) cache at the page level, which does not guarantee that correct pages are served and provides very limited reusability, or (b) cache at the fragment level, which requires the use of pre-defined page layouts. To address these issues, several back end caching approaches have been proposed, including query result caching and fragment level caching. While back end approaches guarantee the correctness of results and offer the advantages of fine-grained caching, they neither address firewall delays nor reduce bandwidth requirements. In this paper, we present an approach and an implementation of a dynamic proxy caching technique which combines the benefits of both proxy-based and back end caching approaches, yet does not suffer from their above-mentioned limitations. Our dynamic proxy caching technique allows granular, proxy-based caching where both the content and layout can be dynamic. Our analysis of the performance of our approach indicates that it is capable of providing significant reductions in bandwidth. We have also deployed our proposed dynamic proxy caching technique at a major financial institution. The results of this implementation indicate that our technique is capable of providing order-of-magnitude reductions in bandwidth and response times in real-world dynamic Web applications.
Conference Paper
This paper presents a fully automated object extraction system Omini. A distinct feature of Omini is the suite of algorithms and the automatically learned information extraction rules for discovering and extracting objects from dynamic Web pages or static Web pages that contain multiple object instances. We evaluated the system using more than 2,000 Web pages over 40 sites. It achieves 100% precision (returns only correct objects) and excellent recall (between 99% and 98%, with very few significant objects left out). The object boundary identification algorithms are fast, about 0.1 second per page with a simple optimization.
Conference Paper
This paper presents a publishing system for efficiently creating dynamic Web content. Complex Web pages are constructed from simpler fragments. Fragments may recursively embed other fragments. Relationships between Web pages and fragments are represented by object dependence graphs. We present algorithms for efficiently detecting and updating Web pages affected after one or more fragments change. We also present algorithms for publishing sets of Web pages consistently; different algorithms are used depending upon the consistency requirements. Our publishing system provides an easy method for Web site designers to specify and modify inclusion relationships among Web pages and fragments. Users can update content on multiple Web pages by modifying a template. The system then automatically updates all Web pages affected by the change. Our system accommodates both content that must be proofread before publication and is typically from humans as well as content that has to be published immediately and is typically from automated feeds. Our system is deployed at several popular Web sites including the 2000 Olympic Games Web site. We discuss some of our experiences with real deployments of our system as well as its performance.
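The idea of using an object dependence graph to find the pages affected by a fragment change can be sketched as follows. This is a minimal illustration, not the publishing system's actual algorithm: the class name, method names, and the breadth-first traversal are assumptions made for the example. Edges point from an embedded fragment to each object that embeds it, so walking the edges from a changed fragment reaches every page that directly or recursively includes it.

```python
from collections import defaultdict, deque


class ObjectDependenceGraph:
    """Hypothetical sketch of a page/fragment inclusion graph."""

    def __init__(self):
        # Maps a fragment to the set of objects that embed it directly.
        self._embedders = defaultdict(set)
        # Names of top-level pages (as opposed to fragments).
        self._pages = set()

    def add_page(self, name):
        self._pages.add(name)

    def embed(self, container, fragment):
        """Record that `container` (page or fragment) embeds `fragment`."""
        self._embedders[fragment].add(container)

    def affected_pages(self, changed):
        """Return the pages that embed any changed fragment, at any depth."""
        seen = set(changed)
        queue = deque(changed)
        while queue:
            node = queue.popleft()
            for parent in self._embedders.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen & self._pages


graph = ObjectDependenceGraph()
graph.add_page("home")
graph.add_page("sports")
graph.embed("home", "header")
graph.embed("home", "headline")
graph.embed("sports", "header")
graph.embed("sports", "scores")
graph.embed("scores", "ticker")  # a fragment embedded in another fragment
```

With this example graph, a change to `ticker` propagates through `scores` to the `sports` page only, while a change to `header` affects both pages.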
Conference Paper
We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.
We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The running time for finding all groups of similar files, even for as little as 25% similarity, is on the order of 500MB to 1GB an hour. The amount of similarity and several other customized parameters can be determined by the user at a post-processing stage, which is very fast. Sif can also be used to very quickly identify all similar files to a query file using a preprocessed index. Applications of sif can be found in file management, information collecting (to remove duplicates), program reuse, file synchronization, data compression, and maybe even plagiarism detection. 1. Introduction Our goal is to identify files that came from the same source ...