Conference Paper

What’s Changed? Measuring Document Change in Web Crawling for Search Engines

Authors: Ali and Williams

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes — such as in images, advertisements, and headers — are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
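
The content-based measure described above can be sketched in a few lines: compare the word sequences of the stored and freshly fetched copies of a page and flag the page as meaningfully changed when more than a fixed number of words differ (the twenty-word threshold mentioned in the abstract). The tokenisation, the use of Python's difflib, and the function names below are illustrative assumptions rather than the authors' implementation.

    import difflib
    import re

    def words(text):
        # Crude tokenisation into lowercase alphanumeric words; markup handling is omitted.
        return re.findall(r"[a-z0-9]+", text.lower())

    def changed_word_count(old_text, new_text):
        # Number of words inserted or deleted between the two versions.
        matcher = difflib.SequenceMatcher(a=words(old_text), b=words(new_text))
        changed = 0
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                changed += (i2 - i1) + (j2 - j1)
        return changed

    def needs_refresh(old_text, new_text, threshold=20):
        # Treat the page as changed only when more than `threshold` words differ.
        return changed_word_count(old_text, new_text) > threshold

In a crawler, such a measure is applied after a fetch; its value lies in deciding how soon a page should be revisited, since past change significance is used as an estimate of whether the document will change again.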


... Ali and Williams have discussed in [1] that the significance of the past changes in document content is an effective measure for estimating whether the document will change again and should be re-crawled. We extend this idea to some of the aforementioned static features as follows. ...
... The distance between a webpage and the centroid can generally be represented using the cosine function given by (3). This measure ranges in [0, 1] and becomes larger when the webpage is more similar to the centroid. A sufficient sample for a cluster is defined as the sample which produces a valid mean score for the cluster. ...
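
The excerpt's equation (3) is not reproduced on this page. Under the usual vector-space reading, the cosine measure between a page's term-weight vector p and the cluster centroid c would be

    \cos(p, c) \;=\; \frac{p \cdot c}{\|p\|\,\|c\|}
               \;=\; \frac{\sum_i p_i c_i}{\sqrt{\sum_i p_i^2}\,\sqrt{\sum_i c_i^2}}

which lies in [0, 1] when term weights are non-negative and grows as the page becomes more similar to the centroid, matching the description in the excerpt. This is a plausible reconstruction, not the cited paper's exact formula.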
Conference Paper
Due to resource constraints, search engines usually have difficulties keeping the local database completely synchronized with the Web. To detect as many changes as possible, the crawler used by a search engine should be able to predict the change behavior of webpages so that it can use the limited resource to download those webpages that are most likely to change. Towards this goal, we propose using a sampling approach at the level of a cluster. We first group all the local webpages into different clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster, and the cluster containing webpages with higher change frequency will be revisited more often by our crawler. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results show that by applying our clustering algorithm, pages with similar change patterns are effectively clustered together. Our proposal significantly outperforms the comparators by improving the average freshness of the local database.
... Ali and Williams [Ali and Williams 2003] have discussed that the significance of the past changes in document content is an effective measure for estimating whether the document will change again and should be re-crawled. We extend this idea to some of the aforementioned static features as follows. ...
Article
When crawling resources (for example, the number of machines, crawl time, and so on) are limited, a crawler has to decide an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated with their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate with their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster and, depending upon whether a significant number of these Web pages have changed in the last crawl cycle, decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters the pages with similar change patterns, and our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.
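
The cluster-sampling policy described in this and the neighbouring abstracts reduces to a simple decision rule: sample a few pages from each cluster of pages with similar change history, and recrawl the whole cluster only when a significant share of the sample has changed since the last crawl. A minimal sketch follows; the sample size, threshold, and data structures are assumptions for illustration, not the published algorithm.

    import random

    def clusters_to_recrawl(clusters, has_changed, sample_size=10, change_threshold=0.5):
        """Decide, per cluster, whether to recrawl all of its pages.

        clusters    -- mapping of cluster id to a list of URLs with similar change history
        has_changed -- callable(url) -> bool, e.g. a cheap fetch-and-compare of a sampled page
        """
        selected = []
        for cluster_id, urls in clusters.items():
            if not urls:
                continue
            sample = random.sample(urls, min(sample_size, len(urls)))
            changed = sum(1 for url in sample if has_changed(url))
            # Recrawl the entire cluster only if a significant share of the sample changed.
            if changed / len(sample) >= change_threshold:
                selected.append(cluster_id)
        return selected
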
Conference Paper
Resource constraints, such as time and network bandwidth, prevent modern search engine providers from keeping the local database completely synchronized with the Web. In this paper, we propose an adaptive clustering-based change prediction approach to refresh the local web repository. Specifically, we first group the existing web pages in the current repository into web clusters based on their similar change characteristics. We then sample and examine some pages in each cluster to estimate their change patterns. Selected clusters of web pages with higher change probability are then downloaded to update the current repository. Finally, the effectiveness of the current download cycle is examined; either an auxiliary (non-downloaded), reward (correct change prediction), or penalty (wrong change prediction) score is assigned to each web page. This score is later used to reinforce the subsequent web clustering and change prediction processes. To evaluate the performance of the proposed approach, we run extensive experiments on snapshots of a real Web dataset of about 282,000 distinct URLs belonging to more than 12,500 websites. The results clearly show that the proposed approach outperforms the existing state-of-the-art clustering-based web crawling policy in that it can provide a fresher local web repository with limited resources.
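
The scoring step described above can be sketched as a simple per-cycle update: pages downloaded on a correct change prediction earn a reward, pages downloaded on a wrong prediction receive a penalty, and pages that were not downloaded get an auxiliary score; the accumulated scores then feed back into the next round of clustering and prediction. The numeric score values and the bookkeeping below are assumptions for illustration.

    REWARD, PENALTY, AUXILIARY = 1.0, -1.0, 0.0  # assumed score values

    def score_download_cycle(pages, downloaded, changed, scores):
        """Accumulate per-page scores after one download cycle.

        pages      -- all URLs in the local repository
        downloaded -- set of URLs fetched this cycle (i.e. predicted to have changed)
        changed    -- subset of downloaded URLs that actually differed from the stored copy
        scores     -- dict accumulating each URL's score across cycles
        """
        for url in pages:
            if url not in downloaded:
                delta = AUXILIARY   # not downloaded this cycle
            elif url in changed:
                delta = REWARD      # correct change prediction
            else:
                delta = PENALTY     # wasted download: wrong prediction
            scores[url] = scores.get(url, 0.0) + delta
        return scores
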
Conference Paper
Accurate prediction of changing web page content improves a variety of retrieval and web related components. For example, given such a prediction algorithm one can both design a better crawling strategy that only recrawls pages when necessary as well as a proactive mechanism for personalization that pushes content associated with user revisitation directly to the user. While many techniques for modeling change have focused simply on past change frequency, our work goes beyond that by additionally studying the usefulness in page change prediction of: the page's content; the degree and relationship among the prediction page's observed changes; the relatedness to other pages and the similarity in the types of changes they undergo. We present an expert prediction framework that incorporates the information from these other signals more effectively than standard ensemble or basic relational learning techniques. In an empirical analysis, we find that using page content as well as related pages significantly improves prediction accuracy and compare it to common approaches. We present numerous similarity metrics to identify related pages and focus specifically on measures of temporal content similarity. We observe that the different metrics yield related pages that are qualitatively different in nature and have different effects on the prediction performance.
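
The paper's expert framework is not spelled out in this abstract; a generic experts-style combination of change-prediction signals (past change frequency, content features, related pages) looks roughly like the sketch below, where each "expert" outputs a change probability, predictions are combined by weight, and weights are updated multiplicatively according to each expert's error. This is a standard technique used for illustration, not the specific method of the paper.

    def combine_experts(predictions, weights):
        # predictions: dict expert_name -> probability that the page will change
        # weights:     dict expert_name -> current weight of that expert
        total = sum(weights[e] for e in predictions)
        return sum(weights[e] * predictions[e] for e in predictions) / total

    def update_weights(predictions, weights, page_changed, eta=0.5):
        # Multiplicative update: experts whose prediction was far from the outcome lose weight.
        outcome = 1.0 if page_changed else 0.0
        for e in predictions:
            weights[e] *= 1.0 - eta * abs(predictions[e] - outcome)
        return weights
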
Conference Paper
The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulties keeping the local database completely synchronized with the Web. In this paper, we study how to make good use of the limited system resources and detect as many changes as possible. Towards this goal, a crawler for the Web search engine should be able to predict the change behavior of the webpages. We propose applying a clustering-based sampling approach. Specifically, we first group all the local webpages into different clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster. Finally, we let the crawler re-visit the cluster containing webpages with higher change frequency with a higher probability. To evaluate the performance of an incremental crawler for a Web search engine, we measure both the freshness and the quality of the query results provided by the search engine. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that our clustering algorithm effectively clusters the pages with similar change patterns, and our solution significantly outperforms the existing methods in that it can detect more changed webpages and improve the quality of the user experience for those who query the search engine.
Article
Full-text available
WebCQ is a prototype system for large-scale Web information monitoring and delivery. It makes heavy use of the structure present in hypertext and the concept of continual queries. In this paper we discuss both the mechanisms that WebCQ uses to discover and detect changes to World Wide Web (Web) pages efficiently, and the methods used to notify users of interesting changes with personalized customization. The WebCQ system consists of four main components: a change detection robot that discovers and detects changes, a proxy cache service that reduces communication traffic to the original information servers, a personalized presentation tool that highlights changes detected by WebCQ sentinels, and a change notification service that delivers fresh information to the right users at the right time. A salient feature of our change detection robot is its ability to support various types of web page sentinels for detecting, presenting, and delivering interesting changes to web pages. This paper describes the WebCQ system with an emphasis on general issues in designing and engineering a large-scale information change monitoring system on the Web.
Article
Full-text available
This paper outlines the design of a web crawler implemented for IBM Almaden's WebFountain project and describes an optimization model for controlling the crawl strategy. This crawler is scalable and incremental. The model makes no assumptions about the statistical behaviour of web page changes, but rather uses an adaptive approach to maintain data on actual change rates, which are in turn used as inputs for the optimization. Computational results with simulated but realistic data show that there is no 'magic bullet': different, but equally plausible, objectives lead to conflicting 'optimal' strategies. However, we find that there are compromise objectives which lead to good strategies that are robust against a number of criteria.
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Article
Recent experiments and analysis suggest that there are about 800 million publicly-indexable Web pages. However, unlike books in a traditional library, Web pages continue to change even after they are initially published by their authors and indexed by search engines. This paper describes preliminary data on and statistical analysis of the frequency and nature of Web page modifications. Using empirical models and a novel analytic metric of "up-to-dateness", we estimate the rate at which Web search engines must re-index the Web to remain current.
Article
This work focuses on characterizing information about Web resources and server responses that is relevant to Web caching. The approach is to study a set of URLs at a variety of sites and gather statistics about the rate and nature of changes compared with the resource type. In addition, we gather response header information reported by the servers with each retrieved resource. Results from the work indicate that there is potential to reuse more cached resources than is currently being realized due to inaccurate and nonexistent cache directives. In terms of implications for caching, the relationships between resources used to compose a page must be considered. Embedded images are often reused, even in pages that change frequently. This result both points to the need to cache such images and to discard them when they are no longer included as part of any page. Finally, while the results show that HTML resources frequently change, these changes can be in a predictable and localized manner. Separating out the dynamic portions of a page into their own resources allows relatively static portions to be cached, while retrieval of the dynamic resources can trigger retrieval of new resources along with any invalidation of already cached resources.
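
The header-based signals discussed here (and the HTTP-header refresh policy the main paper compares against) are usually exercised through conditional requests: the crawler or cache sends back the validators it stored and treats a 304 Not Modified response as "unchanged". A minimal sketch using Python's standard library follows; as the study above notes, how reliably servers honour these headers varies.

    import urllib.error
    import urllib.request

    def fetch_if_modified(url, last_modified=None, etag=None):
        """Conditional GET: returns (status, body); (304, None) means 'unchanged' per the server."""
        request = urllib.request.Request(url)
        if last_modified:
            request.add_header("If-Modified-Since", last_modified)
        if etag:
            request.add_header("If-None-Match", etag)
        try:
            with urllib.request.urlopen(request) as response:
                return response.status, response.read()
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return 304, None  # the server reports the cached copy is still fresh
            raise
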
Article
We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a “Lost and Found” service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.
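
The syntactic-similarity mechanism summarised above is conventionally implemented with shingling: slide a window of w consecutive words over each document, hash each window ("shingle"), and measure the resemblance of two documents as the Jaccard overlap of their shingle sets. The sketch below computes the exact resemblance rather than the sampled (min-hash) estimate needed at Web scale; the tokenisation and window size are assumptions.

    import re

    def shingles(text, w=4):
        # Set of hashed w-word shingles of the document.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {hash(tuple(tokens[i:i + w])) for i in range(len(tokens) - w + 1)}

    def resemblance(doc_a, doc_b, w=4):
        # Jaccard overlap of the two shingle sets, in [0, 1].
        a, b = shingles(doc_a, w), shingles(doc_b, w)
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)
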
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
Many online data sources are updated autonomously and independently. In this paper, we make the case for estimating the change frequency of the data to improve web crawlers and web caches, and to help data mining. We first identify various scenarios, where different applications have different requirements on the accuracy of the estimated frequency. Then we develop several "frequency estimators" for the identified scenarios. In developing the estimators, we analytically show how precise/effective the estimators are, and we show that the estimators that we propose can improve precision significantly.
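
One concrete instance of the estimation problem described above, under the memoryless (Poisson) change model commonly used in this line of work: if a page is visited every I days and a change is detected on X of n visits, the per-visit change probability is p = 1 - e^(-lambda*I), giving the estimator lambda_hat = -ln(1 - X/n) / I. The sketch below implements that simple estimator; it is an illustration, not the paper's bias-corrected estimators.

    import math

    def estimate_change_rate(detected_changes, visits, interval_days):
        """Estimate changes per day from periodic checks under a Poisson change model."""
        if detected_changes >= visits:
            return float("inf")  # changed at every visit: the rate cannot be resolved
        p_hat = detected_changes / visits  # fraction of visits with a detected change
        return -math.log(1.0 - p_hat) / interval_days

    # Example: checked daily for 30 days with changes detected on 10 visits,
    # the estimate is about 0.41 changes/day, higher than the naive 10/30,
    # because several changes between consecutive visits show up as only one.
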
Article
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
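
A toy sketch of the vocabulary-accumulation measurement described above: stream over the collection, keep the set of words seen so far, and count occurrences. The number of distinct words is exactly the count of occurrences that were new, so the ratio distinct/total gives figures like those reported in the abstract (9.74 million distinct words over roughly two billion occurrences is about 1 in 200). The word definition used below (lowercase alphanumeric strings) is only one of the possible definitions the paper considers, and is an assumption here.

    import re

    def vocabulary_growth(documents):
        """Yield (total_occurrences, distinct_words) after each document is processed."""
        seen = set()
        occurrences = 0
        for doc in documents:
            for word in re.findall(r"[a-z0-9]+", doc.lower()):
                occurrences += 1
                seen.add(word)
            yield occurrences, len(seen)
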
Article
In this paper we study how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode. The incremental crawler can improve the "freshness" of the collection significantly and bring in new pages in a more timely manner. We first present results from an experiment conducted on more than half million web pages over 4 months, to estimate how web pages evolve over time. Based on these experimental results, we compare various design choices for an incremental crawler and discuss their trade-offs. We propose an architecture for the incremental crawler, which combines the best design choices.
D. Hawking, N. Craswell, and P. Thistlewaite. Overview of TREC-7 Very Large Collection Track. In: The Eighth Text REtrieval Conference (TREC-8), National Institute of Standards and Technology Special Publication 500-246.