ABSTRACT: The rapid growth of data available as categorical sequences, such as biological sequences, natural language texts, and network and retail transactions, makes the classification of categorical sequences increasingly important. The main challenge is to identify the significant features hidden behind the chronological and structural dependencies that characterize their intrinsic properties. Almost all existing algorithms designed for this task match patterns in chronological order, yet categorical sequences often share similar features in non-chronological order. In addition, these algorithms rarely outperform domain-specific algorithms. In this paper we propose CLASS, a general approach for the classification of categorical sequences. Using an effective matching scheme called SPM (Significant Patterns Matching), CLASS captures the intrinsic properties of categorical sequences. Furthermore, Latent Semantic Analysis captures semantic relations using global information extracted from a large number of sequences, rather than merely comparing pairs of sequences. Moreover, CLASS employs a classifier called SNN (Significant Nearest Neighbours), inspired by the K Nearest Neighbours approach but with a dynamic estimation of K, which reduces both false positives and false negatives in the classification. Extensive tests on a range of datasets from different fields show that CLASS is often competitive with domain-specific approaches.
Canadian Journal of Electrical and Computer Engineering, 02/2009 · 0.24 Impact Factor
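The abstract does not specify how SPM selects patterns, how Latent Semantic Analysis is applied, or how SNN estimates K. The sketch below therefore only illustrates the general idea under loudly stated assumptions: order-independent bigram counts stand in for pattern matching, cosine similarity compares sequences directly (the LSA step is omitted), and K is chosen by cutting the ranked neighbour list at its largest similarity drop. The function names `ngram_profile`, `cosine`, and `classify`, the toy data, and the gap rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def ngram_profile(seq, n=2):
    """Count every length-n substring of seq, regardless of where it occurs."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(query, labelled, n=2):
    """Vote among the neighbours ranked before the largest similarity gap."""
    q = ngram_profile(query, n)
    ranked = sorted(
        ((cosine(q, ngram_profile(seq, n)), label) for seq, label in labelled),
        reverse=True,
    )
    sims = [s for s, _ in ranked]
    # Assumed rule for a dynamic K (not from the paper): cut the ranked
    # list at its largest drop in similarity.
    gaps = [sims[i] - sims[i + 1] for i in range(len(sims) - 1)]
    k = gaps.index(max(gaps)) + 1 if gaps else len(sims)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], k


# Toy labelled sequences, invented for illustration only.
training = [
    ("ABABAB", "x"),
    ("BABABA", "x"),
    ("CDCDCD", "y"),
    ("DCDCDC", "y"),
]
label, k = classify("BABAAB", training)
```

In this toy run the query shares its `AB` and `BA` bigrams with both class `x` sequences even though they occur in a different order, so the gap rule keeps exactly those two neighbours and the vote returns `x`. A fixed K of 3 would have pulled in an unrelated `y` sequence, which is the kind of false positive a dynamic K is meant to avoid.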