Article · PDF available

WEBTracker: A Web Crawler for Maximizing Bandwidth Utilization


Abstract

The most challenging part of a web crawler is to download content at the fastest possible rate, so that the available bandwidth is fully utilized, and to process the downloaded data quickly enough that the downloader is never starved. Our scalable web crawling system, named WEBTracker, has been designed to meet this challenge and can be used efficiently in a distributed environment to maximize downloading. WEBTracker has a Central Crawler Server that administers all the crawler nodes. At each crawler node, a Crawler Manager runs the downloader and manages the downloaded content. The Central Crawler Server and its Crawler Managers are members of a Distributed File System, which ensures synchronized distributed operation of the system. In this paper, we concentrate only on the architecture of a web crawling node, which is owned by the Crawler Manager. We show that our crawler architecture makes efficient use of the allocated bandwidth, keeps the processor less occupied with the processing of downloaded content, and makes efficient use of run-time memory.
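As a concrete illustration of the node architecture the abstract outlines, the following minimal sketch (in Python, with invented names such as FRONTIER, FETCHED, and LinkExtractor; it is not the WEBTracker implementation) separates downloading from processing through a bounded queue, so downloader threads keep the bandwidth busy while a parser thread extracts new URLs fast enough that the downloaders are not starved.

```python
# Sketch of a single crawler node: downloader threads feed a bounded queue
# that a parser thread drains, so downloading and processing overlap.
# All names (FRONTIER, FETCHED, LinkExtractor, ...) are illustrative only.
import queue
import threading
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

FRONTIER = queue.Queue()            # URLs waiting to be fetched
FETCHED = queue.Queue(maxsize=256)  # bounded buffer of (url, html) pairs

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def downloader():
    """Fetch URLs as fast as the link allows; never parse in this thread."""
    while True:
        url = FRONTIER.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                FETCHED.put((url, resp.read().decode("utf-8", "replace")))
        except OSError:
            pass  # skip unreachable pages in this sketch
        finally:
            FRONTIER.task_done()

def processor():
    """Parse downloaded pages and push discovered links back to the frontier."""
    while True:
        url, html = FETCHED.get()
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith(("http://", "https://")):
                FRONTIER.put(absolute)
        FETCHED.task_done()

if __name__ == "__main__":
    FRONTIER.put("http://example.com/")
    for _ in range(8):  # several downloaders keep the allocated bandwidth busy
        threading.Thread(target=downloader, daemon=True).start()
    threading.Thread(target=processor, daemon=True).start()
    FRONTIER.join()     # runs until the frontier drains
```

The bounded FETCHED queue is the design point that matches the abstract's memory claim: if parsing ever falls behind, downloaders block briefly instead of letting fetched pages pile up in run-time memory.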
... In this crawler, a new assignment function is used for partitioning the domain between the crawlers. Amin et al. (2012) proposed a scalable web crawling system, named WEBTracker, to increase the number of visited pages by making use of a distributed environment. Topic-specific crawlers, also known as focused crawlers, have been developed in Mukhopadhyay and Sinha (2008) and Yang et al. (2009). ...
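The snippet above mentions an assignment function that partitions the URL space among crawlers. As a hedged baseline (a generic scheme, not the specific function proposed in that work and not WEBTracker's), hashing the host name keeps every URL of a given site on the same crawler node:

```python
# Baseline assignment function: hash the host name so that every URL of a
# given site is handled by the same crawler node. This is a generic scheme,
# not the specific assignment function proposed in the cited work.
import hashlib
from urllib.parse import urlparse

def assign_node(url: str, num_nodes: int) -> int:
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

print(assign_node("http://example.com/page.html", 4))  # deterministic node id
```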
Article
Purpose – The purpose of this paper is to design a watcher-based crawler (WBC) that has the ability to crawl static and dynamic web sites and to download only the updated and newly added web pages. Design/methodology/approach – In the proposed WBC, a watcher file, which can be uploaded to the web sites' servers, prepares a report that contains the addresses of the updated and newly added web pages. In addition, the WBC is split into five units, where each unit is responsible for performing a specific crawling process. Findings – Several experiments have been conducted, and it has been observed that the proposed WBC increases the number of uniquely visited static and dynamic web sites compared with existing crawling techniques. In addition, the proposed watcher file not only allows the crawlers to visit the updated and newly added web pages, but also solves the crawlers' overlapping and communication problems. Originality/value – The proposed WBC performs all crawling processes in the sense that it detects all updated and newly added pages automatically, without explicit human intervention and without downloading the entire web site.
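A rough sketch of how a crawler might consume such a watcher report follows; the JSON layout ({"updated": [...], "added": [...]}) and all function names are assumptions for illustration, since the abstract does not specify a concrete format.

```python
# Hypothetical consumption of a watcher report. The JSON layout below and
# all names are assumptions; the paper does not publish a concrete format.
import json
import urllib.request

def fetch_report(report_url: str) -> dict:
    with urllib.request.urlopen(report_url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

def crawl_changes(report_url: str):
    """Yield (url, body) only for pages the watcher file marked as changed."""
    report = fetch_report(report_url)
    for url in report.get("updated", []) + report.get("added", []):
        with urllib.request.urlopen(url, timeout=10) as resp:
            yield url, resp.read()

# Example report the watcher file might produce on the server:
# {"updated": ["http://example.com/a.html"], "added": ["http://example.com/b.html"]}
```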
Article
Full-text available
Robots are deployed by a Web search engine for collecting information from different Web servers in order to maintain the currency of its data base of Web pages. In this paper, we investigate the number of robots to be used by a search engine so as to maximize the currency of the data base without putting an unnecessary load on the network. We adopt a finite-buffer queueing model to represent the system. The arrivals to the queueing system are Web pages brought by the robots; service corresponds to the indexing of these pages. Good performance requires that the number of robots, and thus the arrival rate of the queueing system, be chosen so that the indexing queue is rarely starved or saturated. Thus, we formulate a multi-criteria stochastic optimization problem with the loss rate and empty-buffer probability being the criteria. We take the common approach of reducing the problem to one with a single objective that is a linear function of the given criteria. Both static and dynamic policies can be considered. In the static setting the number of robots is held fixed; in the dynamic setting robots may be re-activated/de-activated as a function of the state. Under the assumption that arrivals form a Poisson process and that service times are independent and exponentially distributed random variables, we determine an optimal decision rule for the dynamic setting, i.e., a rule that varies the number of robots in such a way as to minimize a given linear function of the loss rate and empty-buffer probability. Our results are compared with known results for the static case. A numerical study indicates that substantial gains can be achieved by dynamically controlling the activity of the robots.
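For reference, in the simplest instance of such a model, a single indexing server with buffer size K fed by robots at Poisson rate λ and serving at exponential rate μ (an M/M/1/K queue; the weights c_1 and c_2 below are placeholders, not the paper's notation), both criteria have closed forms and the single-objective reduction reads:

```latex
% M/M/1/K stationary probabilities, with \rho = \lambda/\mu and \rho \neq 1
\pi_n = \frac{(1-\rho)\,\rho^{\,n}}{1-\rho^{\,K+1}}, \qquad n = 0,1,\dots,K
% the two criteria: empty-buffer probability and loss
P_{\mathrm{empty}} = \pi_0, \qquad P_{\mathrm{loss}} = \pi_K, \qquad \textrm{loss rate} = \lambda\,\pi_K
% single-objective reduction with placeholder weights c_1, c_2
\min_{\lambda}\; c_1\,\lambda\,\pi_K + c_2\,\pi_0
```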
Article
Full-text available
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more “important” pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
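A minimal sketch of importance-ordered crawling (the backlink-count metric and the class below are illustrative choices, not the paper's exact ordering schemes): the frontier is a priority queue keyed by the current importance estimate, so the most promising URLs are fetched first.

```python
# Frontier ordered by an importance estimate (here: backlinks seen so far).
# heapq is a min-heap, so priorities are stored negated.
import heapq
from collections import defaultdict

class OrderedFrontier:
    def __init__(self):
        self._heap = []
        self._backlinks = defaultdict(int)

    def add_link(self, target_url: str):
        """Record one more observed link pointing at target_url."""
        self._backlinks[target_url] += 1
        heapq.heappush(self._heap, (-self._backlinks[target_url], target_url))

    def next_url(self) -> str:
        """Pop the URL with the highest current backlink count."""
        while self._heap:
            neg_count, url = heapq.heappop(self._heap)
            if -neg_count == self._backlinks[url]:  # skip stale heap entries
                return url
        raise IndexError("frontier is empty")

frontier = OrderedFrontier()
for link in ["http://a.example/", "http://b.example/", "http://a.example/"]:
    frontier.add_link(link)
print(frontier.next_url())  # http://a.example/ -- it has two backlinks
```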
Article
We present a random walk as an efficient and accurate approach to approximating certain aggregate queries about web pages. Our method uses a novel random walk to produce an almost uniformly distributed sample of web pages. The walk traverses a dynamically built regular undirected graph. Queries we have estimated using this method include the coverage of search engines, the proportion of pages belonging to .com and other domains, and the average size of web pages. Strong experimental evidence suggests that our walk produces accurate results quickly using very limited resources.
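The core trick can be sketched as follows: pad every node to a common degree with self-loops so that a simple random walk has a uniform stationary distribution over the real nodes. The toy graph and target degree below are assumptions for illustration; the actual walk builds its graph dynamically while crawling.

```python
# Random walk on a graph made regular by padding every node to a common
# degree with self-loops; the stationary distribution over real nodes is
# then uniform.
import random

def regular_walk_sample(graph: dict, target_degree: int, steps: int, start: str) -> str:
    """Walk `steps` steps and return the final node as an (almost) uniform sample."""
    node = start
    for _ in range(steps):
        neighbours = graph[node]
        # With probability deg/target_degree follow a uniform neighbour,
        # otherwise take one of the implicit self-loops and stay put.
        if random.random() < len(neighbours) / target_degree:
            node = random.choice(neighbours)
    return node

toy_web = {          # undirected: each edge appears in both adjacency lists
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}
print(regular_walk_sample(toy_web, target_degree=4, steps=1000, start="a"))
```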
Article
A new principles framework is presented for retrieval evaluation of ranked outputs. It applies decision theory to model relevance decision preferences and shows that the Probability Ranking Principle (PRP) specifies optimal ranking. It has two new components, ...
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google.
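The link-structure signal at the heart of this system is PageRank, which can be computed with the standard power iteration; the toy graph and damping factor below are illustrative inputs, not Google's data.

```python
# Standard PageRank power iteration on a toy link graph (damping d = 0.85).
def pagerank(links: dict, d: float = 0.85, iterations: int = 50) -> dict:
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:             # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += d * rank[page] / n
            else:
                for target in outlinks:
                    new_rank[target] += d * rank[page] / len(outlinks)
        rank = new_rank
    return rank

toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
print(pagerank(toy_graph))   # "a" ends up with the largest score
```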
Conference Paper
Samet introduced a notion of hypothetical knowledge and showed how it could be used to capture the type of counterfactual reasoning necessary to force the backwards induction solution in a game of perfect information. He argued that while ...
Conference Paper
Broad Web search engines as well as many more specialized search tools rely on Web crawlers to acquire large collections of pages for indexing and analysis. Such a Web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost. In this paper, we describe the design and implementation of a distributed Web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.
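One ingredient of the robustness and manageability such crawlers need is serializing and rate-limiting requests per host. The sketch below shows that rule in isolation; the class name and the 2-second delay are assumptions, not the paper's implementation.

```python
# Per-host politeness: one request per host at a time, with a minimum delay
# between requests to the same host. Class name and the 2-second delay are
# illustrative, not taken from the paper.
import threading
import time
from collections import defaultdict
from urllib.parse import urlparse

class PolitenessGate:
    def __init__(self, min_delay: float = 2.0):
        self._min_delay = min_delay
        self._table_lock = threading.Lock()            # guards the two dicts
        self._host_locks = defaultdict(threading.Lock)
        self._last_hit = defaultdict(float)

    def fetch_slot(self, url: str):
        """Block until it is polite to contact this URL's host, then reserve it."""
        host = urlparse(url).netloc.lower()
        with self._table_lock:
            host_lock = self._host_locks[host]
        with host_lock:
            wait = self._min_delay - (time.monotonic() - self._last_hit[host])
            if wait > 0:
                time.sleep(wait)
            self._last_hit[host] = time.monotonic()

gate = PolitenessGate()
gate.fetch_slot("http://example.com/a.html")
gate.fetch_slot("http://example.com/b.html")  # returns only after ~2 s
```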