Conference Paper

Cluster-based retrieval using language models

DOI: 10.1145/1008992.1009026 Conference: SIGIR 2004: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, July 25-29, 2004
Source: DBLP


Previous research on cluster-based retrieval has been inconclusive as to whether it brings improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. We show that cluster-based retrieval can perform consistently across collections of realistic size, and that significant improvements over document-based retrieval can be obtained in a fully automatic manner and without relevance information provided by humans.
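
A minimal sketch of the general idea, for illustration only: a document language model is interpolated with the language model of its cluster and of the whole collection, and documents are ranked by query likelihood. The interpolation weights and all function names below are illustrative assumptions, not the paper's exact formulation (Python):

    from collections import Counter

    def p_ml(term, counts, total):
        # Maximum-likelihood estimate of P(term | model).
        return counts[term] / total if total > 0 else 0.0

    def query_likelihood(query, doc, cluster, collection,
                         w_doc=0.6, w_clu=0.3, w_col=0.1):
        # Rank score: product over query terms of the interpolated
        # probability; the weights here are illustrative, not the paper's.
        dc, dn = Counter(doc), len(doc)
        cc, cn = Counter(cluster), len(cluster)
        kc, kn = Counter(collection), len(collection)
        score = 1.0
        for t in query:
            score *= (w_doc * p_ml(t, dc, dn)
                      + w_clu * p_ml(t, cc, cn)
                      + w_col * p_ml(t, kc, kn))
        return score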

Cited by:
    • "We opt for partitioning tweets just on the basis of their timestamps: this implies each index contains all tweets generated during a certain period of time. In our case this strategy is more convenient than others [2],[4],[9],[11],[15], since it is suitable in presence of an unbounded stream of tweets delivered in chronological order; moreover, it enables the optimization of the query process when a time-based constraint is specified for the query. "
    The 8th International Workshop on Information Filtering and Retrieval (DART 2014), Pisa, Italy; 12/2014
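
    A hedged sketch of the partitioning strategy described in the quote above: each index covers a fixed time window, so a time-constrained query only has to visit the partitions whose windows overlap the constraint. The class name and default window size are illustrative (Python):

        class TimePartitionedIndex:
            # Assigns each tweet to a fixed-width time window; a query
            # with a time constraint only scans overlapping windows.
            def __init__(self, window_seconds=3600):
                self.window = window_seconds
                self.partitions = {}  # window start -> [(timestamp, tweet)]

            def add(self, timestamp, tweet):
                start = timestamp - (timestamp % self.window)
                self.partitions.setdefault(start, []).append((timestamp, tweet))

            def query(self, t_from, t_to):
                # Yield tweets whose timestamps fall inside [t_from, t_to].
                for start in sorted(self.partitions):
                    if start + self.window <= t_from or start > t_to:
                        continue  # this partition cannot contain matches
                    for ts, tweet in self.partitions[start]:
                        if t_from <= ts <= t_to:
                            yield tweet
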
    • "Furthermore, the model would be prone to numerical effects, especially for cases where f / ∈ Fe, i.e., the feature f does not occur in the entity profile e. A possible solution to this problem is the extension of the feature set of e with features from the documents similar to the actual entity [13] [21] [32]. We implemented models that extended the feature sets with features from the top-k documents that are " closest " to e (e.g., in terms of cosine similarity, Jaccard distance, etc.) and experimented with different values for k. "
    ABSTRACT: Some of the main ranking features of today's search engines reflect result popularity and are based on ranking models, such as PageRank, implicit feedback aggregation, and more. While such features yield satisfactory results for a wide range of queries, they aggravate the problem of search for ambiguous entities: Searching for a person yields satisfactory results only if the person we are looking for is represented by a high-ranked Web page and all the required information is contained in this page. Otherwise, the user has to either reformulate/refine the query or manually inspect low-ranked results to find the person in question. A possible approach to this problem is to cluster the results so that each cluster represents one of the persons occurring in the answer set. However, clustering search results has proven to be a difficult endeavor by itself, and the resulting clusters are typically of moderate quality. A wealth of useful information about persons occurs in Web 2.0 platforms such as LinkedIn, Wikipedia, Facebook, etc. Being human-generated, the information on these platforms is clean, focused, and already disambiguated. We show that when searching for ambiguous person names, the information from such platforms can be bootstrapped to group the results according to the individuals occurring in them. We have evaluated our methods on a hand-labeled dataset of around 5,000 Web pages retrieved from Google queries on 50 ambiguous person names.
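
    A small sketch of the feature-set extension described in the quote above, assuming cosine similarity over bag-of-words vectors; the function names and the default k are illustrative (Python):

        import math
        from collections import Counter

        def cosine(a, b):
            # Cosine similarity between two bag-of-words Counters.
            dot = sum(a[t] * b[t] for t in set(a) & set(b))
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def extend_profile(entity_features, documents, k=5):
            # Add features from the k documents closest to the entity
            # profile, so unseen features no longer get zero probability.
            profile = Counter(entity_features)
            ranked = sorted(documents,
                            key=lambda d: cosine(profile, Counter(d)),
                            reverse=True)
            extended = Counter(profile)
            for doc in ranked[:k]:
                extended.update(doc)
            return extended
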
    • "Hierarchical smoothing is a common strategy in text modeling for text classification [28], information filtering [31] and retrieval [15]. For example, cluster-based document models for document retrieval [15] smooth document models hierarchically, but do not combine the conditional probabilities from documents for each cluster. Unlike earlier work in text modeling, TDM combines these two types of mixtures. "
    ABSTRACT: Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment classification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.
    Proceedings of the 18th Australasian Document Computing Symposium; 12/2013
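
    A hedged sketch of the two mixtures this quote contrasts: each document model is smoothed hierarchically (document, then class, then collection), and the class likelihood additionally mixes over the documents of the class, kernel-density style. The back-off weights and the uniform P(d|c) are simplifying assumptions, not TDM's exact estimators (Python):

        from collections import Counter

        def p_ml(term, counts, total):
            # Maximum-likelihood estimate of P(term | model).
            return counts[term] / total if total > 0 else 0.0

        def doc_model(term, doc, cls_tokens, coll_tokens, a=0.5, b=0.5):
            # Hierarchically smoothed document model: back off from the
            # document to its class, then to the collection.
            p_doc = p_ml(term, Counter(doc), len(doc))
            p_cls = p_ml(term, Counter(cls_tokens), len(cls_tokens))
            p_col = p_ml(term, Counter(coll_tokens), len(coll_tokens))
            return a * p_doc + (1 - a) * (b * p_cls + (1 - b) * p_col)

        def class_likelihood(term, class_docs, coll_tokens):
            # Second mixture: P(term | class) = sum_d P(d | class) * P(term | d),
            # here with a uniform P(d | class) = 1 / |class|.
            cls_tokens = [t for d in class_docs for t in d]
            return sum(doc_model(term, d, cls_tokens, coll_tokens)
                       for d in class_docs) / len(class_docs)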

