Conference Paper

Improve precategorized collection retrieval by using supervised term weighting schemes

Dept. of Comput. Sci., Minnesota Univ., Minneapolis, MN, USA
DOI: 10.1109/ITCC.2002.1000353
Conference: Information Technology: Coding and Computing, 2002. Proceedings. International Conference on
Source: IEEE Xplore

ABSTRACT The emergence of the World Wide Web has led to an increased interest in methods for searching for information. A key characteristic of many online document collections is that the documents carry pre-defined category information. Examples include the scientific articles accessible via digital libraries (e.g., ACM, IEEE), medical articles, news-wires and various directories (e.g., Yahoo, OpenDirectory Project). However, most previous information retrieval systems have not taken this pre-existing category information into account. In this paper, we present weight adjustment schemes for the vector-space model that are based upon the category information and are able to select the most content-specific and discriminating features. Our experimental results on TREC data sets show that the pre-existing category information does provide additional information that improves retrieval. The proposed weight adjustment schemes perform better than the vector-space model with the inverse document frequency (IDF) weighting scheme when queries are less specific. The proposed weighting schemes can also benefit retrieval when clusters are used as an approximation of categories.
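The abstract contrasts unsupervised IDF weighting with weights derived from category labels but does not give the formulas. As a minimal sketch only, the Python below computes classic IDF next to a hypothetical category-based weight (one minus the normalized entropy of a term's spread across categories). The entropy formula, the function names and the toy documents are all assumptions made for illustration; they are not the schemes proposed in the paper.

```python
import math
from collections import Counter, defaultdict


def idf_weights(docs):
    """Classic IDF: log(N / df) for each term over tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def category_weights(docs, labels):
    """Hypothetical supervised weight (an assumption, not the paper's scheme):
    1 - normalized entropy of a term's document counts across categories.
    A term confined to one category scores 1; an evenly spread term scores 0."""
    categories = set(labels)
    per_cat = defaultdict(Counter)  # term -> category -> document count
    for doc, label in zip(docs, labels):
        for term in set(doc):
            per_cat[term][label] += 1
    # Guard against a single-category collection, where log(1) would be 0.
    max_entropy = math.log(len(categories)) if len(categories) > 1 else 1.0
    weights = {}
    for term, counts in per_cat.items():
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        weights[term] = 1.0 - entropy / max_entropy
    return weights


# Toy pre-categorized collection (invented for illustration).
docs = [
    ["protein", "gene", "study"],
    ["gene", "expression", "study"],
    ["market", "stock", "study"],
    ["stock", "price", "market"],
]
labels = ["bio", "bio", "finance", "finance"]

idf = idf_weights(docs)
cat = category_weights(docs, labels)
for term in ("gene", "study"):
    print(term, round(idf[term], 3), round(cat[term], 3))
```

In this toy collection "study" occurs in both categories, so the category-based weight pushes it close to zero, while "gene" occurs only in "bio" and keeps full weight. IDF alone separates the two terms much less sharply.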

  • Source
    ABSTRACT: A distributed memory parallel version of the group average Hierarchical Agglomerative Clustering algorithm is proposed to enable scaling the document clustering problem to large collections. Using standard message passing operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using a subset of a standard TREC test collection, our parallel hierarchical clustering algorithm is shown to be scalable in terms of processors efficiently used and the collection size. Results show that our algorithm performs close to the expected O(n^2/p) time on p processors, rather than the worst-case O(n^3/p) time. Furthermore, the O(n^2/p) memory complexity per node allows larger collections to be clustered as the number of nodes increases. While partitioning algorithms such as k-means are trivially parallelizable, our results confirm those of other studies showing that hierarchical algorithms produce significantly tighter clusters in the document clustering task. Finally, we show how our parallel hierarchical agglomerative clustering algorithm can be used as the clustering subroutine for a parallel version of the Buckshot algorithm to cluster the complete TREC collection at near theoretical runtime expectations.
    Journal of the American Society for Information Science and Technology 06/2007; 58:1207-1221. DOI:10.1002/asi.20596 · 2.01 Impact Factor
  • Source
    Fuzzy Logic - Algorithms, Techniques and Implementations, 03/2012; ISBN: 978-953-51-0393-6
  • Source
    ABSTRACT: This paper proposes a method for Information Extraction (IE) over a set of knowledge, in order to answer user queries posed in natural language. The system is based on a fuzzy logic engine, whose flexibility is used to manage sets of accumulated knowledge. These sets can be organized in hierarchical levels as a tree structure. A query method is also proposed, based on a fuzzy logic application with an interface that users can interact with in natural language. The eventual aim of this system is the implementation of an intelligent agent to manage the information contained in an internet portal.
    Computational Science and Its Applications - ICCSA 2007, International Conference, Kuala Lumpur, Malaysia, August 26-29, 2007. Proceedings. Part III; 01/2007
