Article

A probabilistic relational approach for web document clustering

Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Italy; Consorzio Milano Ricerche, Via Cozzi 53, 20126 Milano, Italy
Information Processing & Management, pp. 117-130. DOI: 10.1016/j.ipm.2009.08.003

ABSTRACT The exponential growth of information available on the World Wide Web and retrievable by search engines has made it necessary to develop efficient and effective methods for organizing relevant content. Document clustering plays an important role here and remains an interesting and challenging problem in web computing. In this paper we present a document clustering method that takes into account both the content information and the hyperlink structure of a web page collection, where each document is viewed as a set of semantic units. We exploit this representation to determine the strength of the relation between two linked pages and to define a relational clustering algorithm based on a probabilistic graph representation. The experimental results show that the proposed approach, called RED-clustering, outperforms two of the best-known clustering algorithms, k-Means and Expectation Maximization.

Keywords

clustering algorithm
 
contents information
 
document clustering method
 
effective methods
 
efficient
 
Expectation Maximization
 
field document clustering
 
hyperlink structure
 
information available
 
pages
 
probabilistic graph representation
 
proposed approach
 
RED-clustering
 
relational clustering algorithm
 
relevant contents
 
search engines
 
web page collection
 
World Wide Web