Difference-Similitude Matrix in Text Classification

DOI: 10.1007/11540007_3
Source: DBLP


Text classification can greatly improve the performance of information retrieval and information filtering, but the high dimensionality of documents hinders the application of most classification approaches. This paper proposes a Difference-Similitude Matrix (DSM) based method to address the problem. The method represents a pre-classified collection as an item-document matrix, in which documents in the same category are described by their similarities, while documents in different categories are described by their differences. Using the DSM reduction algorithm, which is simpler and more efficient than rough set reduction, we reduce the dimensionality of the document space and generate rules for text classification.
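The abstract does not spell out how the matrix is built; a minimal sketch of one plausible reading, assuming documents are binary term sets (the `dsm` function and the toy documents below are illustrative, not the authors' code):

```python
# Illustrative sketch (not the authors' exact algorithm): documents are term
# sets. For each document pair, record shared terms if the pair belongs to the
# same category (similitude), or differing terms otherwise (difference).

def dsm(docs):
    """docs: list of (term_set, category) pairs -> {(i, j): (kind, terms)}."""
    matrix = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            terms_i, cat_i = docs[i]
            terms_j, cat_j = docs[j]
            if cat_i == cat_j:
                matrix[(i, j)] = ("sim", terms_i & terms_j)   # shared terms
            else:
                matrix[(i, j)] = ("diff", terms_i ^ terms_j)  # distinguishing terms
    return matrix

docs = [({"ball", "team", "win"}, "sports"),
        ({"team", "score", "win"}, "sports"),
        ({"stock", "market", "win"}, "finance")]
matrix = dsm(docs)
# Same-category pair (0, 1) keeps shared terms {"team", "win"}; cross-category
# pair (0, 2) keeps only distinguishing terms, so the common term "win" drops out.
```

A reduction step would then keep only terms appearing in the "diff" entries, since those suffice to separate the categories.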

  • ABSTRACT: We propose a new method of text classification using stochastic decision lists. A stochastic decision list is an ordered sequence of IF-THEN-ELSE rules, and our method can be viewed as a rule-based method for text classification with the advantages of readability and refinability of the acquired knowledge. Our method is unique in that decision lists are automatically constructed on the basis of the principle of minimizing extended stochastic complexity (ESC), which lets us construct decision lists with fewer classification errors. The classification accuracy achieved with our method appears better than or comparable to that of existing rule-based methods. We have empirically demonstrated that rule-based methods like ours yield high classification accuracy when the categories to which texts are assigned are relatively specific and the texts tend to be short. We have also empirically verified the advantages of rule-based methods over non-rule-based ones.
    Information Processing & Management, 38(3):343-361, 05/2002. DOI:10.1016/S0306-4573(01)00038-3
  • ABSTRACT: In this paper we point out that the methods used in current vector-based systems are in conflict with the premises of the vector space model. These considerations naturally lead to how things might have been done differently. More importantly, we expect this investigation to lead to a clearer understanding of the issues and problems in using the vector space model in information retrieval.
    Proceedings of the 7th annual international ACM SIGIR conference on Research and development in information retrieval; 01/1984
  • ABSTRACT: Existing rough set based methods are not applicable to large data sets because of their high time and space complexity and lack of scalability. We present a classification method that is equivalent to rough set based classification methods but is scalable and applicable to large data sets. The proposed method is based on the lazy learning idea [2] and the Apriori algorithm for frequent item-set mining [1]. In this method, the set of decision rules matching a new object is generated directly from the training set. Beyond the classification task, the method can be used in adaptive rule generation systems where data grows over time.
    Rough Sets and Current Trends in Computing, Third International Conference, RSCTC 2002, Malvern, PA, USA, October 14-16, 2002, Proceedings; 01/2002
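The IF-THEN-ELSE structure described in the decision-list abstract above can be sketched as follows; the rule list here is hand-written for illustration, whereas the paper learns it by minimizing extended stochastic complexity:

```python
# Illustrative decision-list classifier: rules are tried in order, like a
# chain of IF-THEN-ELSE clauses, with a default category as the final ELSE.

def classify(words, rules, default):
    for term, category in rules:
        if term in words:          # IF the term occurs, THEN assign its category
            return category
    return default                 # ELSE fall through to the default

rules = [("goal", "sports"), ("election", "politics")]  # hypothetical rules
print(classify({"late", "goal", "decides"}, rules, "other"))  # sports
print(classify({"quiet", "news", "day"}, rules, "other"))     # other
```

Note that ordering matters: a document containing both "goal" and "election" is assigned "sports" because that rule is tried first.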