Most of the research on text categorization has focused on mapping text documents to a set of categories among which structural
relationships hold, i.e., on hierarchical text categorization. For solutions of a hierarchical problem that make use of an
ensemble of classifiers, the behavior of each classifier typically depends on an acceptance threshold, which turns a degree
of membership into a dichotomous decision. In principle, finding the best acceptance thresholds for a set of
classifiers related by taxonomic relationships is a hard problem. Hence, devising effective ways to find suboptimal
solutions to this problem may be of great importance. In this paper, we assess a greedy threshold selection algorithm aimed
at finding a suboptimal combination of thresholds in a hierarchical text categorization setting. Comparative experiments,
performed on Reuters, report the performance of the proposed threshold selection algorithm against a relaxed brute-force algorithm
and against two state-of-the-art algorithms. Results highlight the effectiveness of the approach.
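As a rough illustration of the idea described in the abstract above, the following Python sketch contrasts a greedy, level-by-level threshold search with an exhaustive search over a small grid. It is not the algorithm assessed in the paper: the function names, the F1 objective, the candidate grid, and the assumption that a document is accepted only when every classifier on a root-to-leaf path accepts it are all hypothetical choices made for this example.

```python
from itertools import product


def f1_score(predicted, actual):
    """F1 of boolean predictions against boolean ground truth."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def pipeline_accepts(scores, thresholds):
    """Assumption: a document reaches the leaf category only if every
    classifier on the path scores it at or above its threshold."""
    return all(s >= t for s, t in zip(scores, thresholds))


def evaluate(thresholds, score_rows, labels):
    """F1 of the whole pipeline for one combination of thresholds."""
    predicted = [pipeline_accepts(row, thresholds) for row in score_rows]
    return f1_score(predicted, labels)


def greedy_thresholds(score_rows, labels, candidates):
    """Fix one threshold at a time, from the root downwards, keeping the
    candidate that maximizes pipeline F1 given the thresholds chosen so far."""
    depth = len(score_rows[0])
    thresholds = [min(candidates)] * depth  # start fully permissive
    for level in range(depth):
        best_t, best_f1 = thresholds[level], -1.0
        for t in candidates:
            trial = thresholds[:level] + [t] + thresholds[level + 1:]
            score = evaluate(trial, score_rows, labels)
            if score > best_f1:
                best_t, best_f1 = t, score
        thresholds[level] = best_t
    return thresholds, evaluate(thresholds, score_rows, labels)


def brute_force_thresholds(score_rows, labels, candidates):
    """Try every combination of candidate thresholds (exponential in depth)."""
    depth = len(score_rows[0])
    best = max(
        product(candidates, repeat=depth),
        key=lambda ts: evaluate(list(ts), score_rows, labels),
    )
    return list(best), evaluate(list(best), score_rows, labels)
```

With `k` candidate values and a path of depth `d`, the greedy search in this sketch evaluates `d * k` combinations, whereas the exhaustive search evaluates `k ** d`; the greedy result can never score higher than the exhaustive one on the same grid, which is why it is described as a suboptimal solution.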
ABSTRACT: Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural
relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy
of categories to support a thematic search by browsing topics of interest. The consideration of the hierarchical relationship
among categories opens several additional issues in the development of methods for automated document classification. Questions
concern the representation of documents, the learning process, the classification process and the evaluation criteria of experimental
results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization
framework where the hierarchy of categories is involved in all phases of automated document classification, namely feature
selection, learning and classification of a new document. An automated threshold determination method for classification scores
is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document
to a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based,
naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets
(Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported
and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison
of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings
vs. our findings.
Journal of Intelligent Information Systems 02/2007; 28(1):37-78. DOI:10.1007/s10844-006-0003-2 · 0.89 Impact Factor
ABSTRACT: Given a set of categories, with or without a preexisting hierarchy among them, we consider the problem of assigning documents to one or more of these categories from the point of view of a hierarchy with more or less depth. We can choose to make use of none, part or all of the hierarchical structure to improve the categorization effectiveness and efficiency. It is possible to create additional hierarchy among the categories. We describe a procedure for generating a hierarchy of classifiers that models the hierarchy structure. We report on computational experience using this procedure. We show that judicious use of a hierarchy can significantly improve both the speed and effectiveness of the categorization process. Using the Reuters-21578 corpus, we obtain an improvement in running time of over a factor of three and a 5% improvement in F-measure. 1. Introduction and Background The document categorization problem is one of assigning newly arriving documents to one or more preexisting c...
Computer-Assisted Information Retrieval (Recherche d'Information et ses Applications) - RIAO 2000, 6th International Conference, College de France, France, April 12-14, 2000. Proceedings; 05/2000