Conference Paper

Two Odds-Radio-Based Text Classification Algorithms.

DOI: 10.1109/WISEW.2002.1177866 Conference: 3rd International Conference on Web Information Systems Engineering Workshops (WISE 2002 Workshops), 11 December 2002, Singapore, Proceedings
Source: DBLP

ABSTRACT Since the 1990s, the exponential growth of Web documents has led to great interest in developing efficient tools and software to assist users in finding relevant information. Text classification has proved useful in helping to organize and search text information on the Web. Although a number of text classification algorithms already exist, most of them are either inefficient or too complex. In this paper we present two Odds-Ratio-based text classification algorithms, called OR and TF*OR respectively. We have evaluated the two algorithms on two text collections and compared them against k-NN and SVM. Experimental results show that OR and TF*OR are competitive with k-NN and SVM, while being much simpler and faster. The results also indicate that it is not TF but the relevance factors derived from the Odds Ratio that play the decisive role in document categorization.
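
The abstract does not include an implementation, but the general idea of odds-ratio-based term weighting for classification can be sketched as follows. This is a minimal illustration under assumed details, not the authors' code: the smoothing scheme and the exact OR vs. TF*OR scoring rules are guesses consistent with the abstract.

# Hypothetical sketch of odds-ratio-based text classification (OR / TF*OR style).
# Not the authors' implementation; smoothing and scoring details are assumptions.
from collections import Counter, defaultdict
import math

def train_odds_ratio(docs, labels, smoothing=0.5):
    """Estimate an odds-ratio relevance weight for every (term, class) pair.

    docs   : list of token lists
    labels : list of class labels, parallel to docs
    """
    classes = set(labels)
    df_in = defaultdict(Counter)   # per-class document frequency of each term
    n_in = Counter(labels)         # number of documents per class
    n_total = len(docs)

    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            df_in[c][t] += 1

    vocab = {t for c in classes for t in df_in[c]}
    weights = defaultdict(dict)    # weights[c][t] = log odds ratio of t for class c
    for c in classes:
        n_out = n_total - n_in[c]
        for t in vocab:
            p = (df_in[c][t] + smoothing) / (n_in[c] + 2 * smoothing)      # P(t | c)
            q = (sum(df_in[d][t] for d in classes if d != c) + smoothing) \
                / (n_out + 2 * smoothing)                                  # P(t | not c)
            weights[c][t] = math.log((p * (1 - q)) / ((1 - p) * q))
    return weights

def classify(tokens, weights, use_tf=False):
    """Score a document per class: OR sums weights once per distinct term,
    TF*OR multiplies each weight by the term frequency in the document."""
    tf = Counter(tokens)
    scores = {}
    for c, w in weights.items():
        if use_tf:
            scores[c] = sum(tf[t] * w.get(t, 0.0) for t in tf)   # TF*OR
        else:
            scores[c] = sum(w.get(t, 0.0) for t in tf)           # OR
    return max(scores, key=scores.get)
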

Cited in:
    • "EXISTING FEATURE SELECTION AND GINI-INDEX The typical feature-selection methods for text classifications, Ȥ 2 , Information Gain and Odds Ratio have been used widely and been proven in many experimental studies. The Ȥ 2 , Information Gain and Odds Ratio formulae are as follows [5][12][13][14]. "
    ABSTRACT: The recently introduced Gini-Index Text (GIT) feature-selection algorithm for text classification, which incorporates an improved Gini Index for better feature-selection performance, has some drawbacks. Under real-world experimental conditions the algorithm concentrates feature values at one point and is inadequate for selecting representative features; as a result, good representative features cannot be estimated and good performance cannot be achieved in unbalanced text classification. We therefore propose a new, complete GIT feature-selection algorithm for text classification. According to experimental results, the new algorithm obtains unbiased feature values and eliminates many irrelevant and redundant features from feature subsets while retaining representative ones. Compared with the original version, it also demonstrates notably improved overall classification performance.
    2nd International Conference on Software Engineering and Data Mining (SEDM), 07/2010
  • ABSTRACT: This paper presents an approach to the recovery of damaged disks that measures the similarity among sectors using the Instance-based Feature Filtering algorithm and classification. After a hard disk is destroyed, maliciously or accidentally, it can usually be repaired with recovery programs. However, some sectors can never be reconnected to their original file after recovery; typically, attempts are made to reconnect them manually, or, when those attempts prove unsuccessful, the effort is abandoned. An automatic process for finding the original file related to unconnected sectors is therefore required. Typical methods assess the similarity among sectors and recommend relevant candidate sectors. We propose an algorithm and process that automatically finds relevant sectors with the Extended Relief-F algorithm and classifiers: we reformulated the Relief-F algorithm by updating its difference functions and feature-weight computation, apply the selected features to sectors, classify unconnected sectors, and recommend relevant candidate sectors. In the experiments we also tested Information Gain, Odds Ratio and Relief-F for feature selection and compared them with the Extended Relief-F algorithm, and we used the KNN and SVM classifiers for classification and estimation of relevant sectors. In the experimental results, the Extended Relief-F algorithm performed best on all of the datasets.
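
For context, the standard Relief-F weight update that the cited work reformulates can be sketched as follows. This is a rough sketch of the classical Relief-F of Kononenko, shown only for orientation; it is not the Extended Relief-F proposed in that paper, and the neighbour count, sampling scheme and normalization are assumptions.

# Standard Relief-F weight update (Kononenko-style), NOT the Extended Relief-F
# of the cited paper; shown only to illustrate what is being reformulated.
import numpy as np

def relief_f(X, y, n_iter=100, k=5, rng=None):
    """Return one weight per feature; higher means more relevant.

    X : (n_samples, n_features) numeric array
    y : (n_samples,) class labels
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # normalizes feature differences to [0, 1]

    for _ in range(n_iter):
        i = rng.integers(n)
        xi, ci = X[i], y[i]
        for c in classes:
            mask = (y == c)
            mask[i] = False                        # exclude the sampled instance itself
            cand = X[mask]
            if len(cand) == 0:
                continue
            # k nearest neighbours of xi within class c (Manhattan distance)
            dist = np.abs(cand - xi).sum(axis=1)
            near = cand[np.argsort(dist)[:k]]
            diff = (np.abs(near - xi) / span).mean(axis=0)
            if c == ci:                            # nearest hits decrease the weight
                w -= diff / n_iter
            else:                                  # nearest misses increase it
                w += prior[c] / (1 - prior[ci]) * diff / n_iter
    return w
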
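As noted in the first citation context above, the χ², Information Gain and Odds Ratio criteria are usually defined as follows. These are the standard forms from the feature-selection literature, not reproduced from the cited paper; A, B, C and D denote the document counts for term present/absent crossed with class c / not c, and N = A + B + C + D.

\[
\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}
\]

\[
IG(t) = -\sum_{c} P(c)\log P(c) + P(t)\sum_{c} P(c \mid t)\log P(c \mid t) + P(\bar{t})\sum_{c} P(c \mid \bar{t})\log P(c \mid \bar{t})
\]

\[
OR(t, c) = \log\frac{P(t \mid c)\,\bigl(1 - P(t \mid \bar{c})\bigr)}{\bigl(1 - P(t \mid c)\bigr)\,P(t \mid \bar{c})}
\]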