Conference Paper

Two Odds-Radio-Based Text Classification Algorithms.

DOI: 10.1109/WISEW.2002.1177866 Conference: 3rd International Conference on Web Information Systems Engineering Workshops (WISE 2002 Workshops), 11 December 2002, Singapore, Proceedings
Source: DBLP

ABSTRACT Since the 1990s, the exponential growth of Web documents has led to a great deal of interest in developing efficient tools and software to assist users in finding relevant information. Text classification has proved useful in helping to organize and search text information on the Web. Although a number of text classification algorithms exist, most of them are either inefficient or too complex. In this paper we present two odds-ratio-based text classification algorithms, called OR and TF*OR. We have evaluated our algorithms on two text collections and compared them against k-NN and SVM. Experimental results show that OR and TF*OR are competitive with k-NN and SVM. Furthermore, OR and TF*OR are much simpler and faster than both. The results also indicate that it is not TF but the relevance factors derived from the odds ratio that play the decisive role in document categorization.

  • ABSTRACT: Many text mining applications, especially when investigating Text Classification (TC), require experiments to be performed using common text-collections, such that results can be compared with alternative approaches. With regard to single-label TC, most text-collections (textual data-sources) in their original form have at least one of the following limitations: the overall volume of textual data is too large for ease of experimentation; there are many predefined classes; most of the classes consist of only a very few documents; some documents are labeled with a single class whereas others have multiple classes; and there are documents found with little or no actual text-content. In this paper, we propose a standard approach to automatically extract “qualified” document-bases from a given textual data-source that can be used more effectively and reliably in single-label TC experiments. The experimental results demonstrate that document-bases extracted based on our approach can be used effectively in single-label TC experiments.
    08/2008: pages 357-367;
  • ABSTRACT: This paper presents an approach to the recovery of damaged disks that measures the similarity among sectors using the Instance-based Feature Filtering algorithm and classification. After a hard disk is destroyed, maliciously or accidentally, it can simply be repaired using recovery programs. However, there are always some sectors that cannot be connected with their original file after recovery; typically, attempts are made to connect them with the original file manually, and if those attempts prove unsuccessful, the effort is abandoned. Therefore, an automatic process for finding the original file related to unconnected sectors is required. Typical methods assess the similarity among sectors and recommend relevant candidate sectors. We therefore propose an algorithm and process that can automatically find relevant sectors using the Extended Relief-F algorithm and classifiers. We reformulated the Relief-F algorithm to select features by updating the difference functions and the computation of feature weights, apply those features to sectors, classify unconnected sectors, and recommend relevant candidate sectors. In the experiments, we also tested Information Gain, Odds Ratio and Relief-F for feature selection and compared them with the Extended Relief-F algorithm; additionally, we used the KNN and SVM classifiers for classification and estimation of relevant sectors. In the experimental results, the Extended Relief-F algorithm performed best on all of the datasets.