DOI: 10.1007/978-3-540-45224-9_113
Conference: Knowledge-Based Intelligent Information and Engineering Systems, 7th International Conference, KES 2003, Oxford, UK, September 3-5, 2003, Proceedings, Part I
Centroid-based categorization is one of the most popular algorithms in text classification. Normalization is an important
factor in improving the performance of a centroid-based classifier when the documents in a text collection differ considerably in size.
In the past, normalization involved only document-length or class-length normalization. In this paper, we propose a new type
of normalization, called term-length normalization, which considers the term distribution in a class. The performance of this normalization
is investigated in three environments of a standard centroid-based classifier (TFIDF): (1) without class-length normalization,
(2) with cosine class-length normalization, and (3) with summing-weight normalization. The results suggest that our term-length
normalization is useful for improving classification accuracy in all cases.
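As an illustration of the baseline this abstract builds on, the following is a minimal sketch of a TFIDF centroid-based classifier with cosine class-length normalization (environment 2 above). The abstract does not define term-length normalization, so it is deliberately not implemented here. The function names, the `log(N/df)` IDF variant, and the toy data layout are assumptions made for this sketch, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (sparse TFIDF vectors, idf table)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}  # assumed IDF variant
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine_normalize(vec):
    """Scale a sparse vector to unit Euclidean length (cosine normalization)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def train_centroids(docs, labels):
    """Average the TFIDF vectors of each class, then cosine-normalize the centroid."""
    vectors, idf = tfidf_vectors(docs)
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        for t, w in vec.items():
            sums[label][t] += w
    centroids = {
        c: cosine_normalize({t: w / counts[c] for t, w in s.items()})
        for c, s in sums.items()
    }
    return centroids, idf


def classify(tokens, centroids, idf):
    """Assign the class whose centroid has the largest dot product with the document."""
    query = cosine_normalize(
        {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()}
    )

    def score(c):
        return sum(w * centroids[c].get(t, 0.0) for t, w in query.items())

    return max(centroids, key=score)
```

Because both the document vector and the centroids are scaled to unit length, the dot product in `classify` is the cosine similarity, so a class with many or long training documents does not win merely by having a larger centroid.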
ABSTRACT: Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to
build accurate models. Unfortunately, in several tasks training sets are extremely small and expensive to generate.
To tackle this problem, we propose a new text classification method that takes advantage of the information
embedded in the test set itself. The method rests on the idea that similar documents must belong to the same category.
In particular, it classifies documents by considering not only their own content but also the categories assigned
to other similar documents from the same test set. Experimental results on four data sets of different sizes are
encouraging. They indicate that the proposed method is appropriate for small training sets, where it could significantly
outperform traditional approaches such as Naive Bayes and Support Vector Machines.
Computational Linguistics and Intelligent Text Processing, 11th International Conference, CICLing 2010, Iasi, Romania, March 21-27, 2010. Proceedings; 01/2010
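The idea in the abstract above, that a document's category should agree with the categories of similar documents in the same test set, can be sketched as follows. This is not the authors' algorithm, which the abstract does not specify: the single refinement pass, the similarity-weighted averaging, and the parameters `k` and `alpha` are all assumptions chosen for illustration. The base scores may come from any classifier trained on the small labeled set.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def refine_with_neighbors(vectors, base_scores, k=2, alpha=0.5):
    """Combine each test document's own class scores with those of its
    k most similar test documents (hypothetical scheme).

    vectors:     sparse vectors of the test documents
    base_scores: one {class: score} dict per test document
    alpha:       weight given to the document's own scores
    """
    labels = []
    for i, vec in enumerate(vectors):
        # k nearest neighbors of document i within the same test set
        sims = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )[:k]
        total_sim = sum(sim for sim, _ in sims)
        combined = {}
        for c, own in base_scores[i].items():
            # similarity-weighted average of the neighbors' scores for class c
            neighbor = (
                sum(sim * base_scores[j][c] for sim, j in sims) / total_sim
                if total_sim
                else 0.0
            )
            combined[c] = alpha * own + (1 - alpha) * neighbor
        labels.append(max(combined, key=combined.get))
    return labels
```

In this sketch, a document whose own scores are nearly tied can be pulled toward the category of a confidently classified neighbor, which is the kind of effect the abstract describes for small training sets.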