Applying Web analysis in Web page filtering
ABSTRACT: Vertical search engines offer Web users an alternative way to find information by providing customized search within particular domains. However, two issues must be addressed when developing such search engines: how to locate relevant documents on the Web, and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports research addressing the second issue. It proposes a machine learning-based approach that combines Web content analysis and Web structure analysis.
ABSTRACT: The World Wide Web is growing rapidly and has become a very important source of information for a range of academic and commercial applications. However, because of the large number of documents online, it is becoming increasingly difficult to search for useful information on the Web. General-purpose Web search engines, such as Google and AltaVista, present search results as ranked lists. Such ranked lists can show users only the first few documents of the search results and fail to give them a quick overview of the retrieved document set. To address this problem, clustering techniques are often used to group documents into different topics. While traditional clustering algorithms have been applied to Web page clustering, they do not make use of the unique characteristics of the Web, such as its hyperlink structure. In this study, we propose to incorporate hyperlink analysis into the traditional vector space model used in document clustering. Specifically, we introduce a new metric, HFIDF, based on link analysis, to be used alongside the traditional TFIDF (term frequency multiplied by inverse document frequency) in the similarity measure of clustering algorithms. The proposed study will investigate whether the use of Web structure analysis techniques improves the performance of document clustering in presenting Web search results.
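For readers unfamiliar with the baseline, the following is a minimal Python sketch of the traditional TFIDF weighting and the cosine similarity that vector-space clustering commonly relies on. The function names, whitespace tokenization, unsmoothed logarithmic IDF, and toy documents are illustrative assumptions, not details from the study. HFIDF is not defined in this abstract, so it is deliberately not sketched here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * idf.

    Assumptions for illustration: whitespace tokenization, raw term counts
    as tf, and idf = log(N / df) with no smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "search results" used only to exercise the functions.
docs = [
    "web search engine ranking",
    "web page clustering hyperlink",
    "search engine ranking results",
]
vecs = tfidf_vectors(docs)
```

A clustering algorithm would use `cosine(vecs[i], vecs[j])` as its pairwise similarity. In this toy set, the first and third documents share more terms and so score as more similar than the first and second.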