Geographic ranking for a local search engine.
ABSTRACT Traditional schemes for ranking the relevance of a Web page to a user query in a search engine are less appropriate when the query contains geographic information. Geographic entities, such as addresses, city names, and location names, often appear only once or twice in a Web page, and are typically not in a heading or a larger font. Consequently, an alternative to the traditional weighted tf*idf relevance ranking is needed. Further, if a Web site contains a geographic entity, its in- and out-neighbours often do not refer to the same entity, although they may refer to other geographic entities. We present a local search engine that applies a novel ranking algorithm suitable for ranking Web pages with geographic content. We describe its major components: geographic ranking, focused crawling, the geographic extractor, and the related-web-sites feature.
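The core observation above — that a single mention of a geographic entity matters more than its term frequency — can be illustrated with a minimal sketch. This is a hypothetical scorer for illustration only, not the paper's actual ranking algorithm; the `inlinks` tie-breaker is an assumed field.

```python
def geo_rank(pages, query_entity):
    """Rank pages for a geographic query: presence of the entity
    outweighs term frequency (tf*idf would under-weight a page that
    mentions an address only once)."""
    def score(page):
        text = page["text"].lower()
        # Binary presence signal instead of a frequency-based weight.
        present = 1.0 if query_entity.lower() in text else 0.0
        # Tie-break with a weak popularity prior (hypothetical field).
        return (present, page.get("inlinks", 0))
    return sorted(pages, key=score, reverse=True)

pages = [
    {"url": "a", "text": "Pizza pizza pizza best pizza", "inlinks": 50},
    {"url": "b", "text": "Pizza place at 12 Main St, Springfield", "inlinks": 3},
]
ranked = geo_rank(pages, "Springfield")
# Page "b" ranks first despite fewer inlinks and lower term frequency.
```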
- Source available from: Mark Manasse
Conference Proceeding: Detecting Spam Web Pages through Content Analysis
ABSTRACT: In this paper, we continue our investigations of "web spam": the injection of artificially created pages into the web in order to influence the results from search engines and to drive traffic to certain pages for fun or profit. This paper considers some previously undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the spam pages. 05/2006
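The approach the abstract describes — content heuristics whose outputs are aggregated by a classifier — can be sketched as follows. The specific features, weights, and threshold below are assumptions for illustration; the paper's actual heuristics and classifiers are not reproduced here.

```python
def spam_features(page_text):
    """Hypothetical content heuristics in the spirit of the paper:
    each returns a numeric feature a classifier could aggregate."""
    words = page_text.split()
    n = max(len(words), 1)
    return {
        # High repetition of a single word is a common spam signal.
        "repetition": max(words.count(w) for w in set(words)) / n if words else 0.0,
        # Fraction of words drawn from a (toy) list of spam-prone terms.
        "spammy_terms": sum(w.lower() in {"free", "cheap", "viagra"} for w in words) / n,
    }

def classify(features, threshold=0.3):
    """Toy linear aggregation standing in for a trained classifier
    (assumed weights, for illustration only)."""
    score = 0.5 * features["repetition"] + 0.5 * features["spammy_terms"]
    return score > threshold

# A page dominated by repeated spam-prone words scores high; ordinary
# prose scores low.
```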
Conference Proceeding: Geographically Focused Collaborative Crawling
ABSTRACT: A collaborative crawler is a group of crawling nodes in which each crawling node is responsible for a specific portion of the web. We study the problem of collecting geographically aware pages using collaborative crawling strategies. We first propose several collaborative crawling strategies for geographically focused crawling, whose goal is to collect web pages about specified geographic locations, by considering features such as the URL address of a page, the content of a page, the extended anchor text of a link, and others. We then propose various evaluation criteria to qualify the performance of such crawling strategies. Finally, we experimentally study our crawling strategies by crawling real web data, showing that some of our strategies greatly outperform the simple URL-hash based partition collaborative crawling, in which crawling assignments are determined by a hash-value computation over URLs. More precisely, features like the URL address of a page and the extended anchor text of a link are shown to yield the best overall performance for geographically focused crawling.
Proceedings of the 15th International Conference on World Wide Web, WWW 2006, Edinburgh, Scotland, UK, May 23-26, 2006; 01/2006
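The URL-hash based partitioning used as the baseline in the abstract can be sketched in a few lines. The choice of MD5 is an assumption — the abstract only specifies a hash-value computation over URLs, not a particular hash function.

```python
import hashlib

def assign_node(url, num_nodes):
    """Baseline URL-hash partition crawling assignment: each URL is
    mapped to a crawling node by hashing the URL and taking the
    result modulo the number of nodes."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Pages about the same location can scatter across nodes under this
# scheme, which is why the geography-aware strategies (URL address,
# extended anchor text) outperform it for geographically focused crawling.
```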