Article

On using hierarchies for document classification

... A priori, one may entertain one of two clear intuitions here. One is that classification into broader classes is more effective than into narrow categories due to more training examples [9]. The competing intuition is that classification into more narrow categories is more effective because the terms associated with such categories tend to be more discriminating. ...
... Our findings refute claims by Wibowo and Williams [9] that classification into broader categories is more accurate than into narrow categories. We explain the different findings in terms of the fact that 80 of narrow categories used by Wibowo and Williams [9] had only one training example. In our study the number of positive examples for the narrow categories ranged from 981 to 8360. ...
Conference Paper
We examine the impact on classification effectiveness of semantic differences in categories. Specifically, we measure broadness and narrowness of categories in terms of their distance to the root of a hierarchically organized thesaurus. Using categories of four different degrees of broadness, we show that classifying documents into narrow categories gives better scores than classifying them into broad categories, which we attribute to the fact that more specific categories are associated with terms with a higher discriminatory power.
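The broadness measure described in this abstract, distance to the root of a hierarchical thesaurus, can be illustrated with a minimal sketch. The toy thesaurus `PARENT` and the helper names `depth` and `is_narrower` are assumptions made for illustration; the abstract does not say how the authors computed the distance or which thesaurus entries they used.

```python
# Hypothetical toy thesaurus: each category maps to its parent,
# and the root maps to None. These entries are illustrative only.
PARENT = {
    "root": None,
    "diseases": "root",
    "neoplasms": "diseases",
    "carcinoma": "neoplasms",
}


def depth(category, parent=PARENT):
    """Return the number of edges between `category` and the root."""
    steps = 0
    while parent[category] is not None:
        category = parent[category]
        steps += 1
    return steps


def is_narrower(a, b, parent=PARENT):
    """Treat `a` as narrower than `b` when it lies deeper in the hierarchy."""
    return depth(a, parent) > depth(b, parent)
```

Under this reading, a larger depth means a narrower category, so "carcinoma" in the toy data is narrower than "diseases".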
... goal is to evaluate the broadness of a given corpus. We assume (see for instance [2]) that it is easier to classify documents belonging to very different categories, for instance "sports" and "seeds", than those belonging to very similar ones, e.g. "barley" and "corn" (Reuters-21578). ...
... In this approach, we assume (see for instance [2]) that it is easier to classify documents belonging to very different categories, for instance "sports" and "seeds", than those belonging to very similar ones, e.g. "barley" and "corn" (Reuters-21578). ...
Conference Paper
Classifier-independent measures are important to assess the quality of corpora. In this paper we present supervised and unsupervised measures in order to analyse several data collections for studying the following features: domain broadness, shortness, class imbalance, and stylometry. We found that the investigated assessment measures may make it possible to evaluate the quality of gold standards. Moreover, they could also be useful for classification systems in order to take strategic decisions when tackling some specific text collections.
... By considering the relationship between categories, better assignment decisions are made. Choosing the correct categories in higher levels of the hierarchy has been shown to be more reliable—for the reason that the categories are broader and more distinct than lower-level categories—and this aids assignment at the leaf nodes [17]. Perhaps surprisingly, these improvements are usually small: in recent work, Dumais and Chen [5] showed that a hierarchical approach is around 4% more accurate than a flat approach, a result that is consistent with those reported elsewhere [3, 4, 16]. ...
Conference Paper
On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo! and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual categorisation by experts is costly, subjective, and not scalable with the increasing volumes of data that must be processed. Several methods have been investigated for effective automatic text categorisation. These include selection of categorisation methods, selection of pre-categorised training samples, use of hierarchies, and selection of document fragments or features. In this paper, we further investigate categorisation into Web hierarchies and the role of hierarchical information in improving categorisation effectiveness. We introduce new strategies to reduce errors in hierarchical categorisation. In particular, we propose novel techniques that shift the assignment into higher level categories when lower level assignment is uncertain. Our results show that absolute error rates can be reduced by over 2%.
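The idea of shifting an assignment to a higher-level category when the lower-level assignment is uncertain can be sketched as a simple confidence back-off. This is not the authors' technique, which the abstract does not specify; the confidence scores, the `threshold` value, and the function name `assign_with_backoff` are all assumptions made for illustration.

```python
def assign_with_backoff(scores, parent, threshold=0.5):
    """Pick the best-scoring category, backing off to its parent if unsure.

    scores:    dict mapping candidate category -> confidence in [0, 1]
               (hypothetical classifier output)
    parent:    dict mapping category -> parent category, or None at the top
    threshold: assumed cut-off below which an assignment counts as uncertain
    """
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best
    # Uncertain: shift the assignment one level up. If there is no
    # parent, keep the original category.
    return parent.get(best) or best
```

A possible usage, with a hypothetical two-level hierarchy: given `parent = {"barley": "grain", "corn": "grain", "grain": None}`, the scores `{"barley": 0.4, "corn": 0.35}` fall below the threshold, so the document is assigned to "grain" rather than to either leaf.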
... For compactness, we present only selected results in this section. Table 1 shows the typical results we found when categorising into the RTSC collection; in this table we show the result of using Rocchio categorisers from each of our three classes. Hierarchical categorisation using all terms is around 3% more accurate than child level (flat) categorisation with all terms, a result that is consistent with those found in other hierarchical experiments [3, 4, 5, 21, 22]. Our results also show that the stoplist feature selection techniques have small positive or negative effects on performance: in all cases, the stopping results are within 1% of the performance of using all terms. ...
Conference Paper
Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models, which typically require different feature selection parameters for categories at different hierarchical levels, our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.
Article
Clustering narrow-domain short texts is considered to be a complex task because of the intrinsic features of the corpus to be clustered: (i) the low frequencies of vocabulary terms in short texts, and (ii) the high vocabulary overlap associated with narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods when dealing with corpora of this kind. This methodology allows raw textual data to be enriched by adding co-related terms from an automatically constructed lexical knowledge resource obtained from the same target data set (and not from an external resource). We also propose a set of supervised and unsupervised text assessment measures for evaluating different corpus features, such as shortness, stylometry and domain broadness. With the help of these measures, we may determine beforehand whether or not to use the methodology proposed in this paper. Finally, we integrate all these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which may be used by computer scientists in order to evaluate the different features associated with a given textual corpus.
Article
Thesis (Ph.D.)--University of Ottawa, 2006. Includes bibliographies.