Conference Paper

Two Odds-Radio-Based Text Classification Algorithms.

DOI: 10.1109/WISEW.2002.1177866 Conference: 3rd International Conference on Web Information Systems Engineering Workshops (WISE 2002 Workshops), 11 December 2002, Singapore, Proceedings
Source: DBLP

ABSTRACT Since 1990's, the exponential growth of these Web documents has led to a great deal of interest in developing efficient tools and software to assist users in finding relevant information. Text classification has been proved to be useful in helping organize and search text information on the Web. Although there have been existed a number of text classification algorithms, most of them are either inefficient or too complex. In this paper we present two Odds-Radio-Based text classification algorithms, which are called OR and TF*OR respectively. We have evaluated our algorithm on two text collections and compared it against k-NN and SVM. Experimental results show that OR and TF*OR are competitive with k-NN and SVM. Furthermore, OR and TF*OR is much simpler and faster than them. The results also indicate that it is not TF but relevance factors derived from Odds Radio that play the decisive role in document categorization.

  • Source
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Textual Feature Selection (TFS) is an important phase in the process of text classification. It aims to identify the most significant textual features (i.e. key words and/or phrases), in a textual dataset, that serve to distinguish between text categories. In TFS, basic techniques can be divided into two groups: linguistic vs. statistical. For the purpose of building a language-independent text classifier, the study reported here is concerned with statistical TFS only. In this paper, we propose a novel statistical TFS approach that hybridizes the ideas of two existing techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and RS (Relevancy Score). With respect to associative (text) classification, the experimental results demonstrate that the proposed approach can produce greater classification accuracy than other alternative approaches. KeywordsAssociative Classification-(Language-independent) Text Classification-Text Mining-Textual Feature Selection
    06/2010: pages 222-236;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The recently introduced Gini-Index Text (GIT) feature-selection algorithm for text classification, through incorporating an improved Gini Index for better feature-selection performance, has some drawbacks. Specifically, the algorithm, under real-world experimental conditions, concentrates feature values to one point and be inadequate for selecting representative features. As such, good representative features cannot be estimated, and neither, moreover, can good performance be achieved in unbalanced text classification. Therefore, we suggest a new complete GIT feature-selection algorithm for text classification. The new algorithm, according to experimental results, could obtain unbiased feature values, and could eliminate many irrelevant and redundant features from feature subsets while retaining many representative features. Furthermore, the new algorithm, compared with the original version, demonstrated a notably improved overall classification performance.
    Software Engineering and Data Mining (SEDM), 2010 2nd International Conference on; 07/2010