Conference Paper

A Comparative Experimental Assessment of a Threshold Selection Algorithm in Hierarchical Text Categorization.

DOI: 10.1007/978-3-642-20161-5_6
Conference: Advances in Information Retrieval - 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings
Source: DBLP

ABSTRACT Most of the research on text categorization has focused on mapping text documents to a set of categories among which structural
relationships hold, i.e., on hierarchical text categorization. For solutions of a hierarchical problem that make use of an
ensemble of classifiers, the behavior of each classifier typically depends on an acceptance threshold, which turns a degree
of membership into a dichotomous decision. In principle, the problem of finding the best acceptance thresholds for a set of
classifiers related by taxonomic relationships is a hard problem. Hence, devising effective ways for finding suboptimal
solutions to this problem may have great importance. In this paper, we assess a greedy threshold selection algorithm aimed
at finding a suboptimal combination of thresholds in a hierarchical text categorization setting. Comparative experiments,
performed on Reuters, report the performance of the proposed threshold selection algorithm against a relaxed brute-force algorithm
and against two state-of-the-art algorithms. Results highlight the effectiveness of the approach.
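
As a rough illustration of the setting the abstract describes, the sketch below shows how per-category acceptance thresholds turn membership scores into accept/reject decisions in a taxonomy, and how a generic greedy search can fix one threshold at a time. This is not the algorithm assessed in the paper, whose details are not given in this record. The two-category taxonomy, the scores, the candidate grid, the use of micro-averaged F1 as the objective, and all function names are assumptions made up for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical taxonomy (assumption): child -> parent, None for a top-level category.
PARENT = {"sports": None, "tennis": "sports"}
ORDER = ["sports", "tennis"]  # parents are listed before their children

Doc = Tuple[Dict[str, float], Set[str]]  # (membership scores, true categories)


def predict(scores: Dict[str, float], thresholds: Dict[str, float]) -> Set[str]:
    """Accept a category if its score reaches its threshold and its parent was accepted."""
    accepted: Set[str] = set()
    for cat in ORDER:
        parent = PARENT[cat]
        parent_ok = parent is None or parent in accepted
        if parent_ok and scores[cat] >= thresholds[cat]:
            accepted.add(cat)
    return accepted


def micro_f1(docs: List[Doc], thresholds: Dict[str, float]) -> float:
    """Micro-averaged F1 over all (document, category) decisions."""
    tp = fp = fn = 0
    for scores, truth in docs:
        pred = predict(scores, thresholds)
        tp += len(pred & truth)
        fp += len(pred - truth)
        fn += len(truth - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_thresholds(docs: List[Doc], grid: List[float]) -> Dict[str, float]:
    """Generic greedy search: fix one threshold at a time, top-down, never revisiting."""
    thresholds = {cat: grid[0] for cat in ORDER}
    for cat in ORDER:
        thresholds[cat] = max(
            grid, key=lambda t: micro_f1(docs, {**thresholds, cat: t})
        )
    return thresholds


# Toy validation data (made up for illustration).
docs: List[Doc] = [
    ({"sports": 0.9, "tennis": 0.8}, {"sports", "tennis"}),
    ({"sports": 0.7, "tennis": 0.2}, {"sports"}),
    ({"sports": 0.3, "tennis": 0.6}, set()),
    ({"sports": 0.4, "tennis": 0.1}, set()),
]
grid = [0.1, 0.5, 0.9]

best = greedy_thresholds(docs, grid)
print(best, micro_f1(docs, best))
```

A search of this kind evaluates each candidate value once per category, so its cost grows with the number of categories times the grid size, whereas trying every combination of thresholds grows exponentially with the number of categories. The price is that the result is only a suboptimal combination in general.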

  • Source
    ABSTRACT: Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interest. The consideration of the hierarchical relationship among categories opens several additional issues in the development of methods for automated document classification. Questions concern the representation of documents, the learning process, the classification process and the evaluation criteria of experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework where the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document to a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naive Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
    Journal of Intelligent Information Systems 02/2007; 28(1):37-78. DOI:10.1007/s10844-006-0003-2 · 0.63 Impact Factor
  • Source
    Computer-Assisted Information Retrieval (Recherche d'Information et ses Applications) - RIAO 2000, 6th International Conference, College de France, France, April 12-14, 2000. Proceedings; 01/2000