Conference Paper

REBMEC: Repeat Based Maximum Entropy Classifier for Biological Sequences

Conference: Proceedings of the 14th International Conference on Management of Data, December 17-19, 2008, IIT Bombay, Mumbai, India
Source: DBLP


An important problem in biological data analysis is to predict the family of a newly discovered sequence like a protein or DNA sequence, using the collection of available sequences. In this paper we tackle this problem and present REBMEC, a Repeat Based Maximum Entropy Classifier of biological sequences. Maximum entropy models are known to be theoretically robust and yield high accuracy, but are slow. This makes them useful as benchmarks to evaluate other classifiers. Specifically, REBMEC is based on the classical Generalized Iterative Scaling (GIS) algorithm and incorporates repeated occurrences of subsequences within each sequence. REBMEC uses maximal frequent subsequences as features but can support other types of features as well. Our extensive experiments on two collections of protein families show that REBMEC performs as well as existing state-of-the-art probabilistic classifiers for biological sequences without using domain-specific background knowledge such as multiple alignment, data transformation and complex feature extraction methods. The design of REBMEC is based on generic ideas that can apply to other domains where data is organized as collections of sequences.
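To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a generic conditional maximum entropy classifier trained with Generalized Iterative Scaling, using repeat counts of subsequences as feature values. It is not REBMEC's actual formulation, which the abstract does not spell out. The function names (`count_features`, `train_gis`, `predict`), the hand-picked motifs standing in for maximal frequent subsequences, the toy sequences, and the choice to skip features never observed with a class are all assumptions made for illustration.

```python
import math

def count_features(seq, motifs):
    """Count occurrences (overlaps included) of each motif in seq."""
    return [sum(seq.startswith(m, i) for i in range(len(seq) - len(m) + 1))
            for m in motifs]

def class_probs(lam, x):
    """Softmax over per-class linear scores, shifted for numerical stability."""
    scores = [sum(l * v for l, v in zip(row, x)) for row in lam]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_gis(X, y, n_classes, iters=200):
    """Fit weights lam[c][j] with GIS; X holds non-negative count vectors."""
    # GIS needs every feature vector to sum to the same constant C,
    # so a slack feature is appended (the +1 keeps it strictly positive).
    C = max(sum(x) for x in X) + 1
    X = [list(x) + [C - sum(x)] for x in X]
    n_feat = len(X[0])
    # Empirical feature totals per class.
    emp = [[0.0] * n_feat for _ in range(n_classes)]
    for x, c in zip(X, y):
        for j, v in enumerate(x):
            emp[c][j] += v
    lam = [[0.0] * n_feat for _ in range(n_classes)]
    for _ in range(iters):
        # Feature totals expected under the current model.
        model = [[0.0] * n_feat for _ in range(n_classes)]
        for x in X:
            p = class_probs(lam, x)
            for c in range(n_classes):
                for j, v in enumerate(x):
                    model[c][j] += p[c] * v
        # GIS update: lam += (1/C) * log(empirical / expected).
        # Assumption: features never seen with a class are left at 0.
        for c in range(n_classes):
            for j in range(n_feat):
                if emp[c][j] > 0 and model[c][j] > 0:
                    lam[c][j] += math.log(emp[c][j] / model[c][j]) / C
    return lam, C

def predict(lam, C, x):
    """Return (most probable class, class probabilities) for count vector x."""
    x = list(x) + [max(0, C - sum(x))]  # slack clamped for unseen long inputs
    probs = class_probs(lam, x)
    return max(range(len(probs)), key=probs.__getitem__), probs

# Toy data: hypothetical motifs and sequences, two families labeled 0 and 1.
motifs = ["AB", "CD"]
train = [("ABABAB", 0), ("ABCAB", 0), ("CDCDCD", 1), ("CDACD", 1)]
X = [count_features(s, motifs) for s, _ in train]
y = [c for _, c in train]
lam, C = train_gis(X, y, n_classes=2)
label, probs = predict(lam, C, count_features("ABAB", motifs))
```

Because each iteration recomputes model expectations over all training sequences and the step size shrinks with 1/C, GIS tends to converge slowly, which is consistent with the abstract's remark that maximum entropy models are accurate but slow.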



Available from: Vikram Pudi, Feb 13, 2014
  • ABSTRACT: Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multi-class problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure clas-
    Journal of Machine Learning Research 07/2007; 8:1557-1581. · 2.47 Impact Factor
  • ABSTRACT: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Incorporating basic biological considerations such as amino acid background probabilities, and amino acids substitution probabilities can improve the performance in some cases. The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on one of the state of the art databases of protein families, namely, the Pfam database of HMMs, with satisfactory performance.
  • ABSTRACT: In many machine learning applications that deal with sequences, there is a need for learning algorithms that can effectively utilize the hierarchical grouping of words. We introduce Word Taxonomy guided Naive Bayes Learner for the Multinomial Event Model (WTNBL-MN) that exploits word taxonomy to generate compact classifiers, and Word Taxonomy Learner (WTL) for automated construction of word taxonomy from sequence data. WTNBL-MN is a generalization of the Naive Bayes learner for the Multinomial Event Model for learning classifiers from data using word taxonomy. WTL uses hierarchical agglomerative clustering to cluster words based on the distribution of class labels that co-occur with the words. Our experimental results on protein localization sequences and Reuters text show that the proposed algorithms can generate Naive Bayes classifiers that are more compact and often more accurate than those produced by standard Naive Bayes learner for the Multinomial Model.
    Abstraction, Reformulation and Approximation, 6th International Symposium, SARA 2005, Airth Castle, Scotland, UK, July 26-29, 2005, Proceedings; 01/2005