Conference Paper

Term-length Normalization for Centroid-based Text Categorization

Thammasat University, Bangkok, Thailand
DOI: 10.1007/978-3-540-45224-9_113 Conference: Knowledge-Based Intelligent Information and Engineering Systems, 7th International Conference, KES 2003, Oxford, UK, September 3-5, 2003, Proceedings, Part I
Source: DBLP


Centroid-based categorization is one of the most popular algorithms in text classification. Normalization is an important
factor in improving the performance of a centroid-based classifier when the documents in a text collection have quite different sizes.
In the past, normalization involved only document-length or class-length normalization. In this paper, we propose a new type
of normalization, called term-length normalization, which considers the distribution of a term within a class. The performance of this normalization
is investigated in three environments of a standard centroid-based classifier (TFIDF): (1) without class-length normalization,
(2) with cosine class-length normalization, and (3) with summing-weight normalization. The results suggest that our term-length
normalization improves classification accuracy in all cases.
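As background, environment (2) above, a TFIDF centroid classifier with cosine class-length normalization, can be sketched as follows. This is a minimal illustration under standard TFIDF assumptions, not the authors' implementation; the proposed term-length normalization factor itself is defined in the paper and is not reproduced here.

```python
import math
from collections import Counter, defaultdict

def train_centroids(docs, labels):
    """Build per-class centroid vectors from TF-IDF weights.

    docs: list of token lists; labels: parallel list of class names.
    Each centroid is the class-average TF-IDF vector, cosine-normalized
    (class-length normalization, environment (2) in the paper).
    """
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    idf = {t: math.log(n / df[t]) for t in df}

    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for d, c in zip(docs, labels):
        for t, f in Counter(d).items():
            sums[c][t] += f * idf[t]

    centroids = {}
    for c, vec in sums.items():
        avg = {t: w / counts[c] for t, w in vec.items()}
        norm = math.sqrt(sum(w * w for w in avg.values())) or 1.0
        centroids[c] = {t: w / norm for t, w in avg.items()}
    return centroids, idf

def classify(doc, centroids, idf):
    """Assign the class whose centroid has the highest cosine similarity."""
    vec = {t: f * idf.get(t, 0.0) for t, f in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    best, best_sim = None, -1.0
    for c, cen in centroids.items():
        sim = sum((w / norm) * cen.get(t, 0.0) for t, w in vec.items())
        if sim > best_sim:
            best, best_sim = c, sim
    return best
```

A term-length normalization factor, as proposed here, would rescale the per-term weights inside each class centroid before the cosine step, based on how the term is distributed within that class.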

  • ABSTRACT: Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks training sets are extremely small and their generation is very expensive. In order to tackle this problem, in this paper we propose a new text classification method that takes advantage of the information embedded in the test set itself. This method rests on the idea that similar documents should belong to the same category. In particular, it classifies documents by considering not only their own content but also the categories assigned to other similar documents from the same test set. Experimental results on four data sets of different sizes are encouraging. They indicate that the proposed method is well suited to small training sets, where it can significantly outperform traditional approaches such as Naive Bayes and Support Vector Machines.
    Full-text · Conference Paper · Mar 2010
  • ABSTRACT: In this paper, we present an effective way of combining character-based (N-gram) and word-based approaches for Chinese text classification. Uni-gram and bi-gram features are considered as the baseline model, which are then combined with word features of length greater than or equal to 3. A weight coefficient that can be used to give higher weights to word features is also introduced. We further employ a serial approach based on feature transformation and dimension reduction techniques. The results of McNemar's test indicate that the performance is significantly improved by our proposed method. © 2014 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
    No preview · Article · Dec 2014 · IEEJ Transactions on Electrical and Electronic Engineering
  • ABSTRACT: The high dimensionality of text data hinders the performance of classifiers, making it necessary to apply feature selection for dimensionality reduction. Most feature ranking metrics for text classification are based on the document frequencies (df) of a term in the positive and negative classes. Ranking features by document frequencies alone favors terms that occur frequently in the larger classes of unbalanced datasets. In this paper we introduce a new feature ranking metric, termed the relative discrimination criterion (RDC), which takes the document frequencies for each term count of a term into account while estimating the usefulness of the term. The performance of RDC is compared with four well-known feature ranking metrics, information gain (IG), CHI squared (CHI), odds ratio (OR) and distinguishing feature selector (DFS), using support vector machine (SVM) and multinomial naive Bayes (MNB) classifiers on four benchmark datasets, namely Reuters, 20 Newsgroups and two subsets of the Ohsumed dataset. Our results, based on macro and micro F1 measures, show that RDC outperforms the other four metrics in 65% of our experimental trials. RDC also attains the highest macro and micro F1 values in 69% of the cases.
    No preview · Article · May 2015 · Expert Systems with Applications
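For context on the document-frequency-based baselines mentioned in the last abstract, the CHI-squared metric can be computed from a 2x2 contingency table of document counts. The sketch below is a standard formulation of CHI for binary (one-vs-rest) feature ranking; the RDC metric itself additionally uses per-term-count frequencies and is not reproduced here.

```python
def chi_squared(tp, fp, fn, tn):
    """CHI-squared feature-ranking score from document counts.

    tp: positive-class docs containing the term
    fp: negative-class docs containing the term
    fn: positive-class docs without the term
    tn: negative-class docs without the term
    """
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0
```

A perfectly discriminative term (present in every positive document and no negative one) scores N, the total document count, while a term distributed independently of the class scores 0; as the abstract notes, metrics built only on such df counts can favor terms frequent in the larger classes of unbalanced data.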