Two Odds-Ratio-Based Text Classification Algorithms.
ABSTRACT Since the 1990s, the exponential growth of Web documents has generated great interest in developing efficient tools and software to help users find relevant information. Text classification has proved useful in organizing and searching text information on the Web. Although a number of text classification algorithms already exist, most of them are either inefficient or overly complex. In this paper we present two odds-ratio-based text classification algorithms, called OR and TF*OR respectively. We have evaluated both algorithms on two text collections and compared them against k-NN and SVM. Experimental results show that OR and TF*OR are competitive with k-NN and SVM, while being much simpler and faster. The results also indicate that it is not TF but the relevance factors derived from the Odds Ratio that play the decisive role in document categorization.
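The abstract does not reproduce the scoring formula, but the standard Odds Ratio used for term scoring in text classification is OR(t, c) = [P(t|c) · (1 − P(t|¬c))] / [(1 − P(t|c)) · P(t|¬c)]. A minimal sketch of such scoring, assuming document-frequency estimates with Laplace smoothing (the function name and toy data below are illustrative, not the paper's code):

```python
def odds_ratio(docs, labels, category, smoothing=1.0):
    """Score every term by its odds ratio for `category`:
    OR(t, c) = [P(t|c) * (1 - P(t|~c))] / [(1 - P(t|c)) * P(t|~c)],
    with probabilities estimated from smoothed document frequencies."""
    pos = [set(d) for d, y in zip(docs, labels) if y == category]
    neg = [set(d) for d, y in zip(docs, labels) if y != category]
    vocab = set().union(*pos, *neg)
    scores = {}
    for t in vocab:
        # Smoothed probability that a document of each class contains t.
        p_pos = (sum(t in d for d in pos) + smoothing) / (len(pos) + 2 * smoothing)
        p_neg = (sum(t in d for d in neg) + smoothing) / (len(neg) + 2 * smoothing)
        scores[t] = (p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg)
    return scores

# Toy two-class corpus: terms that appear only in one class get extreme scores.
docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "agenda"]]
labels = ["spam", "spam", "ham"]
scores = odds_ratio(docs, labels, "spam")
```

With this data, "cheap" (present in every spam document, absent from ham) scores well above "meeting" (ham-only), which is the ranking behavior a feature-scoring measure is meant to produce.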
- Available from: Frans Coenen
ABSTRACT: Textual Feature Selection (TFS) is an important phase in the process of text classification. It aims to identify the most significant textual features (i.e. key words and/or phrases) in a textual dataset that serve to distinguish between text categories. Basic TFS techniques can be divided into two groups: linguistic vs. statistical. For the purpose of building a language-independent text classifier, the study reported here is concerned with statistical TFS only. In this paper, we propose a novel statistical TFS approach that hybridizes the ideas of two existing techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and RS (Relevancy Score). With respect to associative (text) classification, the experimental results demonstrate that the proposed approach can produce greater classification accuracy than alternative approaches.
Keywords: Associative Classification; (Language-independent) Text Classification; Text Mining; Textual Feature Selection. 06/2010: pages 222-236.
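The abstract names the two base measures but not their formulas. In the literature, DIAAF is commonly read as an estimate of P(c|t), and the Relevancy Score as log((P(t|c) + d) / (P(t|¬c) + d)) for a small constant d. A sketch under those assumptions (function names, the value of d, and the toy data are illustrative; the abstract does not specify how the hybrid combines the two measures):

```python
import math

def diaaf(docs, labels, term, category):
    """DIAAF read as P(c|t): the fraction of documents containing
    `term` that belong to `category`."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    return sum(y == category for y in with_t) / len(with_t) if with_t else 0.0

def relevancy_score(docs, labels, term, category, d=0.5):
    """RS: log((P(t|c) + d) / (P(t|~c) + d)); the constant d avoids log(0)."""
    pos = [doc for doc, y in zip(docs, labels) if y == category]
    neg = [doc for doc, y in zip(docs, labels) if y != category]
    p_pos = sum(term in doc for doc in pos) / len(pos)
    p_neg = sum(term in doc for doc in neg) / len(neg)
    return math.log((p_pos + d) / (p_neg + d))

# Toy two-class corpus for illustration.
docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "agenda"]]
labels = ["spam", "spam", "ham"]
```

Both measures rank class-indicative terms high for their class: a term that occurs only in spam documents gets DIAAF = 1.0 and a positive RS for the spam category.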
ABSTRACT: Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data into a data-mining-ready structure, in which the most significant text features that serve to differentiate between text categories are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups: (i) linguistic, (ii) statistical, and (iii) hybrids of (i) and (ii). With regard to language-independent TC, our study relates to the statistical aspect only. Textual data pre-processing comprises Document-base Representation (DR) and Feature Selection (FS). In this paper, we propose a hybrid statistical FS approach that integrates two existing statistical FS techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti-Sebastiani-Simi Coefficient). Our proposed approach is presented under a statistical "bag of phrases" DR setting. The experimental results, based on the well-established associative text classification approach, demonstrate that our proposed technique outperforms existing mechanisms with respect to classification accuracy.
Advanced Data Mining and Applications, 5th International Conference, ADMA 2009, Beijing, China, August 17-19, 2009, Proceedings; 01/2009
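The GSS coefficient (after Galavotti, Sebastiani, and Simi) is usually defined as GSS(t, c) = P(t, c) · P(¬t, ¬c) − P(t, ¬c) · P(¬t, c), with each joint probability estimated as a fraction of the document collection. A minimal sketch assuming that definition (toy data are illustrative, not the paper's experiments):

```python
def gss_coefficient(docs, labels, term, category):
    """GSS(t, c) = P(t,c) * P(~t,~c) - P(t,~c) * P(~t,c),
    with joint probabilities estimated as document fractions."""
    n = len(docs)
    p_t_c   = sum((term in d) and (y == category) for d, y in zip(docs, labels)) / n
    p_nt_nc = sum((term not in d) and (y != category) for d, y in zip(docs, labels)) / n
    p_t_nc  = sum((term in d) and (y != category) for d, y in zip(docs, labels)) / n
    p_nt_c  = sum((term not in d) and (y == category) for d, y in zip(docs, labels)) / n
    return p_t_c * p_nt_nc - p_t_nc * p_nt_c

# Toy two-class corpus for illustration.
docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "agenda"]]
labels = ["spam", "spam", "ham"]
```

The sign of the score encodes the direction of association: terms co-occurring with the category score positive, terms anti-correlated with it score negative.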
ABSTRACT: The recently introduced Gini-Index Text (GIT) feature-selection algorithm for text classification, although it incorporates an improved Gini Index for better feature-selection performance, has some drawbacks. Specifically, under real-world experimental conditions the algorithm concentrates feature values at one point and is inadequate for selecting representative features. As a result, good representative features cannot be estimated, and good performance cannot be achieved in unbalanced text classification. We therefore propose a new, complete GIT feature-selection algorithm for text classification. According to the experimental results, the new algorithm obtains unbiased feature values and eliminates many irrelevant and redundant features from feature subsets while retaining many representative ones. Furthermore, compared with the original version, the new algorithm demonstrates a notably improved overall classification performance.
Software Engineering and Data Mining (SEDM), 2010 2nd International Conference on; 07/2010
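The abstract does not give the formula, but the improved Gini Index typically cited in this line of work scores a term t as GI(t) = Σ_c P(t|c)² · P(c|t)². A minimal sketch under that assumption (function name and toy data are illustrative, not the paper's new algorithm):

```python
def gini_index(docs, labels, term):
    """Improved Gini Index: GI(t) = sum over categories c of
    P(t|c)^2 * P(c|t)^2, estimated from document frequencies."""
    df_t = sum(term in d for d in docs)  # number of documents containing the term
    score = 0.0
    for c in set(labels):
        in_c = [d for d, y in zip(docs, labels) if y == c]
        df_tc = sum(term in d for d in in_c)
        p_t_given_c = df_tc / len(in_c)
        p_c_given_t = df_tc / df_t if df_t else 0.0
        score += p_t_given_c ** 2 * p_c_given_t ** 2
    return score

# Toy two-class corpus for illustration.
docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "agenda"]]
labels = ["spam", "spam", "ham"]
```

A term that appears in every document of exactly one class reaches the maximum score of 1.0; terms that cover a class only partially, or straddle classes, score lower.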