Conference Paper

REBMEC: Repeat Based Maximum Entropy Classifier for Biological Sequences

Conference: Proceedings of the 14th International Conference on Management of Data, December 17-19, 2008, IIT Bombay, Mumbai, India
Source: DBLP


An important problem in biological data analysis is to predict the family of a newly discovered sequence like a protein or DNA sequence, using the collection of available sequences. In this paper we tackle this problem and present REBMEC, a Repeat Based Maximum Entropy Classifier of biological sequences. Maximum entropy models are known to be theoretically robust and yield high accuracy, but are slow. This makes them useful as benchmarks to evaluate other classifiers. Specifically, REBMEC is based on the classical Generalized Iterative Scaling (GIS) algorithm and incorporates repeated occurrences of subsequences within each sequence. REBMEC uses maximal frequent subsequences as features but can support other types of features as well. Our extensive experiments on two collections of protein families show that REBMEC performs as well as existing state-of-the-art probabilistic classifiers for biological sequences without using domain-specific background knowledge such as multiple alignment, data transformation and complex feature extraction methods. The design of REBMEC is based on generic ideas that can apply to other domains where data is organized as collections of sequences.
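
To make the combination of GIS and repeat counts concrete, the following is a minimal sketch of a conditional maximum entropy classifier trained with textbook Generalized Iterative Scaling, where each feature is the number of times a given subsequence occurs in a sequence. It is an illustration under stated assumptions, not REBMEC itself: the subsequences are supplied by hand rather than mined as maximal frequent subsequences, features with zero empirical expectation are simply skipped, and all function names (count_overlapping, train_gis, posterior, classify) are hypothetical.

```python
import math


def count_overlapping(seq, sub):
    """Count occurrences of sub in seq, allowing overlaps (the 'repeats')."""
    count, start = 0, seq.find(sub)
    while start != -1:
        count += 1
        start = seq.find(sub, start + 1)
    return count


def featurize(seq, subsequences, total):
    """Repeat counts plus one slack feature that pads the sum up to `total`."""
    counts = [count_overlapping(seq, s) for s in subsequences]
    # GIS needs every feature vector to sum to the same constant.
    # Clamp at 0 for unseen sequences whose counts exceed the training maximum.
    counts.append(max(0, total - sum(counts)))
    return counts


def posterior(x, weights):
    """p(class | x) proportional to exp(sum_j weight[class][j] * x[j])."""
    scores = {c: sum(w * v for w, v in zip(ws, x)) for c, ws in weights.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def train_gis(seqs, labels, subsequences, iterations=200):
    """Fit per-class weights with Generalized Iterative Scaling."""
    classes = sorted(set(labels))
    raw_sums = [sum(count_overlapping(s, sub) for sub in subsequences) for s in seqs]
    total = max(raw_sums) or 1  # the GIS constant C
    X = [featurize(s, subsequences, total) for s in seqs]
    n, d = len(X), len(subsequences) + 1
    weights = {c: [0.0] * d for c in classes}
    # Empirical expectation of each (class, feature) pair.
    empirical = {
        c: [sum(x[j] for x, y in zip(X, labels) if y == c) / n for j in range(d)]
        for c in classes
    }
    for _ in range(iterations):
        # Expectation of each (class, feature) pair under the current model.
        expected = {c: [0.0] * d for c in classes}
        for x in X:
            p = posterior(x, weights)
            for c in classes:
                for j in range(d):
                    expected[c][j] += p[c] * x[j] / n
        for c in classes:
            for j in range(d):
                # Simplification: leave the weight untouched when the feature
                # never co-occurs with the class in the training data.
                if empirical[c][j] > 0 and expected[c][j] > 0:
                    weights[c][j] += math.log(empirical[c][j] / expected[c][j]) / total
    return weights, total


def classify(seq, weights, subsequences, total):
    """Return the most probable class for a new sequence."""
    p = posterior(featurize(seq, subsequences, total), weights)
    return max(p, key=p.get)
```

Each GIS pass recomputes the model expectations over the whole training set, which reflects why maximum entropy training is described above as slow. The slack feature is the standard device for satisfying the constant-sum requirement of GIS.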


Available from: Vikram Pudi, Feb 13, 2014