Conference Paper

Improving Text Clustering with Social Tagging.

Conference: Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Catalonia, Spain, July 17-21, 2011
Source: DBLP
  • Source
    ABSTRACT: Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
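The assignment step described in this abstract (each element goes to its most similar committee) can be sketched as follows. This is a minimal illustration, not the CBC implementation: it assumes committees are already discovered and represented by centroid vectors, and it assumes cosine similarity as the similarity measure, neither of which the abstract specifies. The names `cosine` and `assign_to_committees` and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_to_committees(elements, committees):
    """Map each element id to the id of its most similar committee centroid.

    Assumption: committees are given as centroid vectors; committee
    discovery itself is not shown here.
    """
    return {
        eid: max(committees, key=lambda cid: cosine(vec, committees[cid]))
        for eid, vec in elements.items()
    }


# Toy data (hypothetical): two committees and two documents.
committees = {
    "sports": {"game": 1.0, "team": 1.0},
    "finance": {"stock": 1.0, "market": 1.0},
}
elements = {
    "d1": {"team": 2.0, "score": 1.0},
    "d2": {"market": 1.0, "bank": 3.0},
}
assignment = assign_to_committees(elements, committees)
```

Because the committees cover only a subset of the elements, every remaining element is attached to exactly one committee in this sketch; a soft or thresholded assignment would be a different design choice.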
  • Source
    ABSTRACT: Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
    Proceedings of the Second International Conference on Web Search and Web Data Mining, WSDM 2009, Barcelona, Spain, February 9-11, 2009; 01/2009
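One way to picture the extended vector space model from this abstract is to keep page words and tags in separate dimensions of a single document vector. The sketch below is only an illustration under stated assumptions: L2-normalizing each part separately and scaling tags by `tag_weight` are assumptions about what a combination might look like, not the weighting scheme used in the paper, and `l2_normalize` and `combined_vector` are hypothetical names.

```python
import math
from collections import Counter


def l2_normalize(counts):
    """Scale a count dict to unit Euclidean length (empty dict if all zero)."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {k: c / norm for k, c in counts.items()} if norm else {}


def combined_vector(words, tags, tag_weight=1.0):
    """Build one sparse vector with separate word and tag dimensions.

    Assumption: each part is normalized on its own so that a long page
    text does not drown out a short tag list; tag_weight is a
    hypothetical knob controlling the relative influence of tags.
    """
    word_part = l2_normalize(Counter(words))
    tag_part = l2_normalize(Counter(tags))
    vec = {("word", w): x for w, x in word_part.items()}
    vec.update({("tag", t): tag_weight * x for t, x in tag_part.items()})
    return vec


# Toy page (hypothetical): "python" appears both as a word and as a tag,
# and the two occurrences stay in distinct dimensions.
vec = combined_vector(["python", "tutorial", "python"], ["programming", "python"])
```

Vectors built this way can be passed to any standard K-means implementation that accepts sparse features.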
  • Source
    ABSTRACT: Clustering is traditionally viewed as an unsupervised method for data analysis. However, in some cases information about the problem domain is available in addition to the data instances themselves. In this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. In experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. We also apply this method to the real-world...
    Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001; 01/2001
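A constrained assignment pass for k-means might look like the sketch below. The truncated abstract does not say what form the constraints take, so this assumes pairwise must-link and cannot-link constraints; that form, the greedy nearest-feasible-centroid rule, and all names (`violates`, `assign_with_constraints`) are assumptions for illustration. Only one assignment pass is shown, not the full iterative algorithm.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def partner(pair, point):
    """Return the other member of a constraint pair, or None if not involved."""
    a, b = pair
    if a == point:
        return b
    if b == point:
        return a
    return None


def violates(point, cluster, assignment, must_link, cannot_link):
    """Check whether putting `point` in `cluster` breaks any constraint
    against points that are already assigned."""
    for pair in must_link:
        other = partner(pair, point)
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for pair in cannot_link:
        other = partner(pair, point)
        if other is not None and assignment.get(other) == cluster:
            return True
    return False


def assign_with_constraints(points, centroids, must_link, cannot_link):
    """Assign each point to its nearest centroid that violates no constraint."""
    assignment = {}
    for pid, x in points.items():
        ranked = sorted(range(len(centroids)), key=lambda c: sq_dist(x, centroids[c]))
        for c in ranked:
            if not violates(pid, c, assignment, must_link, cannot_link):
                assignment[pid] = c
                break
        else:
            raise ValueError(f"no feasible cluster for point {pid!r}")
    return assignment


# Toy data (hypothetical): "a" and "b" are close, but a cannot-link
# constraint forces them into different clusters.
points = {"a": (0.0, 0.0), "b": (0.2, 0.0), "c": (5.0, 5.0)}
centroids = [(0.0, 0.0), (5.0, 5.0)]
assignment = assign_with_constraints(points, centroids, [], [("a", "b")])
```

In this sketch the result depends on the order in which points are visited, since constraints are only checked against points assigned earlier in the pass.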

