Conference Paper

Cluster-based retrieval using language models.

DOI: 10.1145/1008992.1009026. In: SIGIR 2004: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, July 25-29, 2004
Source: DBLP

ABSTRACT: Previous research on cluster-based retrieval has been inconclusive as to whether it brings improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. We show that cluster-based retrieval can perform consistently across collections of realistic size, and that significant improvements over document-based retrieval can be obtained in a fully automatic manner and without relevance information provided by humans.
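The core idea behind cluster-based retrieval in the language modeling framework is to let a document's cluster contribute evidence when scoring it against a query. A minimal query-likelihood sketch of that idea follows; the interpolation weights, the `cluster_smoothed_score` name, and the flat token-list inputs are illustrative assumptions, not the paper's exact models or parameter settings:

```python
import math
from collections import Counter

def cluster_smoothed_score(query_terms, doc, cluster, collection,
                           lam_d=0.6, lam_c=0.3):
    """Query-likelihood score where the document language model is
    interpolated with its cluster's model and the collection model.
    All three inputs are plain token lists; weights are illustrative."""
    d, c, coll = Counter(doc), Counter(cluster), Counter(collection)
    nd, nc, ncoll = len(doc), len(cluster), len(collection)
    lam_coll = 1.0 - lam_d - lam_c  # remaining mass goes to the collection model
    score = 0.0
    for w in query_terms:
        p = (lam_d * d[w] / nd          # document evidence
             + lam_c * c[w] / nc        # cluster evidence
             + lam_coll * coll[w] / ncoll)  # collection smoothing
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

A document whose cluster is rich in query terms is rewarded even when the document itself mentions them rarely, which is the intuition the cluster-based models build on.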

  • ABSTRACT: In this work, we present a novel approach that allows a robot to improve its own navigation performance through introspection and then targeted data retrieval. It is a step in the direction of life-long learning and adaptation and is motivated by the desire to build robots that have plastic competencies which are not baked in. They should react to and benefit from use. We consider a particular instantiation of this problem in the context of place recognition. Based on a topic-based probabilistic representation for images, we use a measure of perplexity to evaluate how well a working set of background images explain the robot's online view of the world. Offline, the robot then searches an external resource to seek out additional background images that bolster its ability to localize in its environment when used next. In this way the robot adapts and improves performance through use. We demonstrate this approach using data collected from a mobile robot operating in outdoor workspaces.
    The International Journal of Robotics Research 12/2013; 32(14):1742-1766.
  • ABSTRACT: Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment classification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.
    Proceedings of the 18th Australasian Document Computing Symposium; 12/2013
  • ABSTRACT: Hierarchical text classification of a Web taxonomy is challenging because it is a very large-scale problem with hundreds of thousands of categories and associated documents. Furthermore, the conceptual levels and training data availabilities of categories vary widely. The narrow-down approach is the state of the art; it utilizes a search engine for generating candidates from the taxonomy and builds a classifier for the final category selection. In this paper, we take the same approach but address the issue of using global information in a language modelling framework to improve effectiveness. We propose three methods of using non-local information for the task: a passive way of utilizing global information for smoothing; an aggressive way where a top-level classifier is built and integrated with a local model; and a method of using label terms associated with the path from a category to the root, which is based on our systematic observation that they are underrepresented in the documents. For evaluation, we constructed a document collection from Web pages in the Open Directory Project. A series of experiments and their results show the superiority of our methods and reveal the role of global information in hierarchical text classification.
    Journal of Information Science. 04/2014; 40(2):127-145.
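The Tied Document Mixture abstract above extends Multinomial Naive Bayes with smoothed per-document models. As background for that comparison, here is a minimal MNB classifier with Laplace smoothing; this is the standard class-based baseline, not the TDM model itself, and the class name and toy interface are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Minimal Multinomial Naive Bayes with Laplace (add-one) smoothing.
    Documents are token lists; per-class term counts are pooled, so
    individual training instances are discarded after fitting -- the
    class-based paradigm the TDM abstract contrasts with."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # class priors (raw counts)
        self.term_counts = defaultdict(Counter)      # per-class term frequencies
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.term_counts[y].update(doc)
            self.vocab.update(doc)
        self.totals = {y: sum(c.values()) for y, c in self.term_counts.items()}
        self.n = len(labels)
        return self

    def predict(self, doc):
        v = len(self.vocab)
        best, best_lp = None, float("-inf")
        for y in self.class_counts:
            lp = math.log(self.class_counts[y] / self.n)   # log prior
            for w in doc:
                # add-one smoothed class-conditional term probability
                lp += math.log((self.term_counts[y][w] + 1)
                               / (self.totals[y] + v))
            if lp > best_lp:
                best, best_lp = y, lp
        return best
```

TDM replaces the single pooled class model here with a mixture of smoothed per-document models, recovering instance-level detail while keeping linear-time estimation.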

