A Comparative Experimental Assessment of a Threshold Selection Algorithm in Hierarchical Text Categorization.
ABSTRACT Most of the research on text categorization has focused on mapping text documents to a set of categories among which structural
relationships hold, i.e., on hierarchical text categorization. For solutions of a hierarchical problem that make use of an
ensemble of classifiers, the behavior of each classifier typically depends on an acceptance threshold, which turns a degree
of membership into a dichotomous decision. In principle, finding the best acceptance thresholds for a set of
classifiers linked by taxonomic relationships is a hard problem. Hence, devising effective ways of finding suboptimal
solutions to this problem can be of great importance. In this paper, we assess a greedy threshold selection algorithm aimed
at finding a suboptimal combination of thresholds in a hierarchical text categorization setting. Comparative experiments,
performed on Reuters, report the performance of the proposed threshold selection algorithm against a relaxed brute-force algorithm
and against two state-of-the-art algorithms. Results highlight the effectiveness of the approach.
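As a rough illustration of the setting described in the abstract, the sketch below shows how acceptance thresholds turn degrees of membership into yes/no decisions along a root-to-leaf path of a taxonomy, and how a simple greedy, one-threshold-at-a-time search could pick a suboptimal combination. It is not the algorithm assessed in the paper, whose details are not given here; the function names, the F1 objective, the threshold grid, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Sequence


def accepts(scores: Sequence[float], thresholds: Sequence[float]) -> bool:
    """A document is accepted only if every classifier on the path accepts it,
    i.e. each degree of membership reaches that classifier's threshold."""
    return all(s >= t for s, t in zip(scores, thresholds))


def f1(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """F1 measure computed from true positives, false positives, false negatives."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_thresholds(
    doc_scores: Sequence[Sequence[float]],
    labels: Sequence[bool],
    grid: Sequence[float],
) -> List[float]:
    """Hypothetical greedy search: tune one threshold at a time, root first,
    keeping the other thresholds fixed. Ties keep the lowest candidate."""
    thresholds = [min(grid)] * len(doc_scores[0])
    for level in range(len(thresholds)):
        best_t, best_f = thresholds[level], -1.0
        for t in grid:
            trial = thresholds[:level] + [t] + thresholds[level + 1:]
            score = f1([accepts(s, trial) for s in doc_scores], labels)
            if score > best_f:
                best_t, best_f = t, score
        thresholds[level] = best_t
    return thresholds


# Toy data (assumed): one row per document, one score per classifier on the path.
doc_scores = [[0.9, 0.8], [0.8, 0.3], [0.4, 0.9], [0.7, 0.7]]
labels = [True, False, False, True]
grid = [0.1, 0.3, 0.5, 0.7, 0.9]

chosen = greedy_thresholds(doc_scores, labels, grid)
print(chosen, f1([accepts(s, chosen) for s in doc_scores], labels))
```

With n classifiers and k candidate values per threshold, this greedy pass evaluates n·k combinations, whereas an exhaustive search would evaluate k^n; the price is that the result is only guaranteed to be locally good.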
- Source available from: Michelangelo Ceci
ABSTRACT: Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interest. Taking the hierarchical relationships among categories into account raises several additional issues in the development of methods for automated document classification. These issues concern the representation of documents, the learning process, the classification process and the evaluation criteria for experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework in which the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document to a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naive Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
Journal of Intelligent Information Systems 02/2007; 28(1):37-78. DOI:10.1007/s10844-006-0003-2 · 0.63 Impact Factor
Conference Paper: The Effect of Using Hierarchical Classifiers in Text Categorization
ABSTRACT: Given a set of categories, with or without a preexisting hierarchy among them, we consider the problem of assigning documents to one or more of these categories from the point of view of a hierarchy with more or less depth. We can choose to make use of none, part or all of the hierarchical structure to improve the categorization effectiveness and efficiency. It is possible to create additional hierarchy among the categories. We describe a procedure for generating a hierarchy of classifiers that models the hierarchy structure. We report on computational experience using this procedure. We show that judicious use of a hierarchy can significantly improve both the speed and effectiveness of the categorization process. Using the Reuters-21578 corpus, we obtain an improvement in running time of over a factor of three and a 5% improvement in F-measure. 1. Introduction and Background The document categorization problem is one of assigning newly arriving documents to one or more preexisting c...
Computer-Assisted Information Retrieval (Recherche d'Information et ses Applications) - RIAO 2000, 6th International Conference, College de France, France, April 12-14, 2000. Proceedings; 05/2000