Conference Paper

Improving Text Clustering with Social Tagging.

Conference: Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Catalonia, Spain, July 17-21, 2011
Source: DBLP
  • ABSTRACT: Breadcrumbs is a folksonomy of news clips, where users can aggregate fragments of text taken from online news. Besides the textual content, each news clip carries a set of metadata fields, of which user-defined tags are among the most important. Based on a small data set of news clips, we build a tag co-occurrence network and use it to improve text clustering. We do this by defining a weighted cosine similarity measure that takes into account both the clip vectors and the tag vectors. The tag weight is computed using the related tags that are present in the discovered community. We then use the resulting vectors together with the new distance metric, which allows us to identify socially biased document clusters. Our study indicates that using the structural features of the tag network has a positive impact on the clustering process.
  • ABSTRACT: In recent years the field of Constrained Clustering has emerged, proposing clustering algorithms that can accommodate domain information to obtain a better final grouping. This information is usually provided as pairwise constraints, whose acquisition from humans can be costly. In this paper we propose a novel method based on word n-grams to automatically extract positive constraints from text collections. Clustering experiments on text collections composed of different types of documents show that the constraints created with our method attain statistically significant improvements over the results obtained with constraints created using named entities and over the results of a high-performing non-constrained algorithm.
  • ABSTRACT: Recently a new family of semi-supervised clustering algorithms, known as constrained clustering, has emerged. These algorithms can incorporate a priori domain knowledge into the clustering process, allowing the user to guide the method. The vast majority of studies of the effectiveness of these approaches have used information, in the form of constraints, that was completely accurate. This would be the ideal case, but such a situation will be impossible in most realistic settings, due to errors in the constraint creation process, misjudgements of the user, inconsistent information, etc. Hence, the robustness of constrained clustering algorithms when dealing with erroneous constraints is bound to play an important role in their final effectiveness. In this paper we study the behaviour of four constrained clustering algorithms (Constrained k-Means, Soft Constrained k-Means, Constrained Normalised Cut and Normalised Cut with Imposed Constraints) when not all the information supplied to them is accurate. The experiments on text and numeric datasets using two different noise models, one of them an original approach based on similarities, highlight the strengths and weaknesses of each method when working with positive and negative constraints, indicating the scenarios in which each algorithm is more appropriate.
    Information Processing & Management 05/2012; 48(3). DOI:10.1016/j.ipm.2011.08.006 · 1.07 Impact Factor
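
The Breadcrumbs abstract above mentions a weighted cosine similarity that combines clip vectors and tag vectors, but it does not give the formula. The following is only a minimal sketch of one plausible form, assuming a convex combination of the two cosine similarities. The function names and the `alpha` parameter are hypothetical and are not taken from the paper, and the way the paper derives tag weights from the discovered community is not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(text_a, tags_a, text_b, tags_b, alpha=0.5):
    """Blend text similarity and tag similarity.

    ASSUMPTION: a convex combination weighted by `alpha` (hypothetical
    parameter); the paper's actual weighting scheme may differ.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    text_sim = cosine(text_a, text_b)
    tag_sim = cosine(tags_a, tags_b)
    return (1.0 - alpha) * text_sim + alpha * tag_sim


def combined_distance(text_a, tags_a, text_b, tags_b, alpha=0.5):
    """Turn the blended similarity into a dissimilarity for clustering."""
    return 1.0 - combined_similarity(text_a, tags_a, text_b, tags_b, alpha)


if __name__ == "__main__":
    # Two clips with identical text vectors but disjoint tags.
    clip_a = ([1.0, 2.0, 0.0], [1.0, 0.0])
    clip_b = ([1.0, 2.0, 0.0], [0.0, 1.0])
    print(combined_similarity(*clip_a, *clip_b, alpha=0.5))
```

With `alpha=0` the measure reduces to plain text cosine similarity, and raising `alpha` lets shared tags pull clips closer together, which is one way to obtain the "socially biased" clusters the abstract describes.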
