To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by
crawling the Web. One problem in crawling is deciding when to revisit resources that may have changed: stale documents
contribute to poor search results, while unnecessary refreshing is expensive. However, some changes, such as those in images,
advertisements, and headers, are unlikely to affect query results. In this paper, we investigate measures for determining
whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional
approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users
do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change
recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective
component of a web crawling strategy.
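
As an illustration only, a content-based rule of the kind described above (refresh when more than twenty words change) might be sketched as follows. The abstract does not specify how changed words are counted, so this sketch assumes a bag-of-words comparison in which the change count is the number of word occurrences added plus the number removed. The function names `count_changed_words` and `should_recrawl`, the tokenisation rule, and the default threshold parameter are hypothetical choices made for the example, not the measures evaluated in the paper.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of alphanumeric characters,
# compared case-insensitively. The paper may tokenise differently.
WORD_PATTERN = re.compile(r"\w+")


def tokenize(text: str) -> Counter:
    """Return a bag of lower-cased words with their occurrence counts."""
    return Counter(WORD_PATTERN.findall(text.lower()))


def count_changed_words(old_text: str, new_text: str) -> int:
    """Count word occurrences removed from old_text plus those added in new_text.

    Assumption: word order is ignored, so a pure reordering counts as no change.
    """
    old_words = tokenize(old_text)
    new_words = tokenize(new_text)
    removed = sum((old_words - new_words).values())
    added = sum((new_words - old_words).values())
    return removed + added


def should_recrawl(old_text: str, new_text: str, threshold: int = 20) -> bool:
    """Flag a document for refreshing when more than `threshold` words changed."""
    return count_changed_words(old_text, new_text) > threshold


if __name__ == "__main__":
    stored = "breaking news about web crawling and search engines"
    fetched = "breaking news about web crawling and search quality"
    print(count_changed_words(stored, fetched))
    print(should_recrawl(stored, fetched))
```

A rule of this shape ignores small edits, such as a rotated advertisement or an updated date string, while still catching substantial rewrites of the text. Because it needs both the stored copy and a newly fetched copy, it is a way to decide after the fact whether a page changed enough to matter, which is a different kind of evidence from HTTP header information.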