Conference Paper

Cluster-based retrieval using language models.

DOI: 10.1145/1008992.1009026 Conference: SIGIR 2004: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, July 25-29, 2004
Source: DBLP

ABSTRACT Previous research on cluster-based retrieval has been inconclusive as to whether it does bring improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. We show that cluster-based retrieval can perform consistently across collections of realistic size, and significant improvements over document-based retrieval can be obtained in a fully automatic manner and without relevance information provided by human.

  • [Show abstract] [Hide abstract]
    ABSTRACT: Modeling text with topics is currently a popular research area in both Machine Learning and Information Retrieval (IR). Most of this research has focused on automatic methods though there are many hand-crafted topic resources available online. In this paper we investigate retrieval performance with topic models constructed manually based on a hand-crafted directory resource. The original query is smoothed on the manually selected topic model, which can also be viewed as an "ideal" user context model. Experiments with these topic models on the TREC retrieval tasks show that this type of topic model alone provides little benefit, and the overall performance is not as good as relevance modeling (which is an automatic query modification model). However, smoothing the query with topic models outperforms relevance models for a subset of the queries and automatic selection from these two models for particular queries gives better results overall than relevance models. We further demonstrate some improvements over relevance models with automatically built topic models based on the directory resource.
    Large Scale Semantic Access to Content (Text, Image, Video, and Sound); 05/2007
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: We present a study of the cluster hypothesis, and of the performance of cluster-based retrieval methods, performed over large scale Web collections. Among the findings we present are (i) the cluster hypothesis can hold, as determined by a specific test, for large scale Web corpora to the same extent it does for newswire corpora; (ii) while spam documents do not affect the extent to which the cluster hypothesis holds, they considerably affect the performance of cluster based, as well as that of document-based, retrieval methods; and, (iii) as is the case for newswire corpora, cluster-based methods can yield better performance than document-based methods for Web corpora.
    Proceedings of the 21st ACM international conference on Information and knowledge management; 10/2012
  • [Show abstract] [Hide abstract]
    ABSTRACT: Query expansion methods using pseudo-relevance feedback have been shown effective for microblog search because they can solve vocabulary mismatch problems often seen in searching short documents such as Twitter messages (tweets), which are limited to 140 characters. Pseudo-relevance feedback assumes that the top ranked documents in the initial search results are relevant and that they contain topic-related words appropriate for relevance feedback. However, those assumptions do not always hold in reality because the initial search results often contain many irrelevant documents. In such a case, only a few of the suggested expansion words may be useful with many others being useless or even harmful. To overcome the limitation of pseudo-relevance feedback for microblog search, we propose a novel query expansion method based on two-stage relevance feedback that models search interests by manual tweet selection and integration of lexical and temporal evidence into its relevance model. Our experiments using a corpus of microblog data (the Tweets2011 corpus) demonstrate that the proposed two-stage relevance feedback approaches considerably improve search result relevance over almost all topics.
    Proceedings of the 22nd ACM international conference on Conference on information & knowledge management; 10/2013


Available from