Learning cellular sorting pathways using protein interactions and sequence motifs.

Language Technology Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Journal of computational biology: a journal of computational molecular cell biology (Impact Factor: 1.67). 11/2011; 18(11):1709-22. DOI: 10.1089/cmb.2011.0193
Source: PubMed

ABSTRACT Proper subcellular localization is critical for proteins to perform their roles in cellular functions. Proteins are transported by different cellular sorting pathways, some of which take a protein through several intermediate locations until reaching its final destination. The pathway a protein is transported through is determined by carrier proteins that bind to specific sequence motifs. In this article, we present a new method that integrates protein interaction and sequence motif data to model how proteins are sorted through these sorting pathways. We use a hidden Markov model (HMM) to represent protein sorting pathways. The model is able to determine intermediate sorting states and to assign carrier proteins and motifs to the sorting pathways. In simulation studies, we show that the method can accurately recover an underlying sorting model. Using data for yeast, we show that our model leads to accurate prediction of subcellular localization. We also show that the pathways learned by our model recover many known sorting pathways and correctly assign proteins to the path they utilize. The learned model identified new pathways and their putative carriers and motifs and these may represent novel protein sorting mechanisms. Supplementary results and software implementation are available from
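To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding over a toy sorting-pathway model. It is not the authors' model or software: the compartment names, the observed features (`signal_peptide`, `carrier_A`, `carrier_B`), and every probability below are hypothetical values chosen only for illustration. Hidden states stand in for intermediate sorting locations, and each observation stands in for a motif or carrier interaction detected for a protein.

```python
import math

# Hypothetical toy model: states, features, and probabilities are invented
# for illustration and are NOT taken from the paper.
STATES = ["cytosol", "ER", "Golgi", "vacuole"]
START = {"cytosol": 1.0, "ER": 0.0, "Golgi": 0.0, "vacuole": 0.0}
TRANS = {
    "cytosol": {"cytosol": 0.5, "ER": 0.5, "Golgi": 0.0, "vacuole": 0.0},
    "ER": {"cytosol": 0.0, "ER": 0.3, "Golgi": 0.7, "vacuole": 0.0},
    "Golgi": {"cytosol": 0.0, "ER": 0.0, "Golgi": 0.4, "vacuole": 0.6},
    "vacuole": {"cytosol": 0.0, "ER": 0.0, "Golgi": 0.0, "vacuole": 1.0},
}
EMIT = {
    "cytosol": {"none": 0.7, "signal_peptide": 0.2, "carrier_A": 0.05, "carrier_B": 0.05},
    "ER": {"none": 0.1, "signal_peptide": 0.7, "carrier_A": 0.1, "carrier_B": 0.1},
    "Golgi": {"none": 0.1, "signal_peptide": 0.1, "carrier_A": 0.7, "carrier_B": 0.1},
    "vacuole": {"none": 0.1, "signal_peptide": 0.05, "carrier_A": 0.15, "carrier_B": 0.7},
}


def _log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations):
    """Return the most likely state path and its log-probability."""
    # best[s] holds the log-probability of the best path ending in state s.
    best = {s: _log(START[s]) + _log(EMIT[s][observations[0]]) for s in STATES}
    back = []  # one dict of back-pointers per step after the first
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            score, arg = max((prev[r] + _log(TRANS[r][s]), r) for r in STATES)
            best[s] = score + _log(EMIT[s][obs])
            pointers[s] = arg
        back.append(pointers)
    # Trace back from the best final state to recover the whole path.
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


path, logp = viterbi(["none", "signal_peptide", "carrier_A", "carrier_B"])
print(path, math.exp(logp))
```

In this toy setting the decoded path plays the role of a sorting pathway and its last state the predicted final location. Learning the transition and emission tables from interaction and motif data, which is the harder part described in the abstract, is not shown.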

  • ABSTRACT: Conserved motifs in biological sequences are closely related to their structure and functions. Recently, discriminative motif discovery methods have attracted increasing attention. However, little attention has been devoted to the data imbalance problem, which is one of the main factors degrading the performance of discriminative models. In this article, a simulated evolution method is applied to solve the multi-class imbalance problem at the data preprocessing stage, and a random under-sampling method is introduced at the Hidden Markov Model (HMM) training stage to address the imbalance between the positive and negative datasets. In the task of discovering targeting motifs of nine subcellular compartments, the motifs found by our method are more conserved than those found by methods that do not consider the data imbalance problem, and they recover the most known targeting motifs from Minimotif Miner and InterPro. We also use the discovered motifs to predict protein subcellular localization and achieve higher prediction precision and recall for the minority classes.
    PLoS ONE 02/2014; 9(2):e87670. DOI:10.1371/journal.pone.0087670 · 3.53 Impact Factor
  • ABSTRACT: Chloroplasts are crucial organelles of green plants and eukaryotic algae because they conduct photosynthesis. Predicting the subchloroplast location of a protein can provide important insights into its biological functions. The performance of subchloroplast location prediction algorithms often depends on deriving predictive and succinct features from genomic and proteomic data. In this work, a novel weighted Gene Ontology (GO) transfer model is proposed to generate discriminating features from sequence data and GO categories. The model has two components. First, we transfer the GO terms of homologous proteins and assign the bit-scores as weights to the GO features. Second, we employ term-selection methods to determine weights for GO terms. The model improves prediction accuracy because it tolerates the noise introduced by homolog knowledge transfer. The proposed weighted GO transfer method based on bit-score and a logarithmic transformation of CHI-square (WS-LCHI) performs better than the baseline models and also outperforms four off-the-shelf subchloroplast prediction methods.
    Journal of Theoretical Biology 01/2014; 347. DOI:10.1016/j.jtbi.2014.01.003 · 2.30 Impact Factor
  • ABSTRACT: Nuclear receptors (NRs) are members of a large superfamily of evolutionarily related DNA-binding transcription factors. They regulate diverse functions, such as homeostasis, reproduction, development and metabolism. Because nuclear receptors bind small molecules that can easily be modified by drug design, and control functions associated with major diseases (e.g. cancer, osteoporosis and diabetes), they are promising pharmacological targets. According to their action mechanisms or functions, the NR superfamily has been classified into seven families: NR1 (thyroid hormone like), NR2 (HNF4-like), NR3 (estrogen like), NR4 (nerve growth factor IB-like), NR5 (fushi tarazu-F1 like), NR6 (germ cell nuclear factor like), and NR0 (knirps or DAX like). With the avalanche of protein sequences generated in the postgenomic age, scientists face the following challenging problems. Given an uncharacterized protein sequence, how can we identify whether it is a nuclear receptor? If it is, which family, or even subfamily, does it belong to? To address these problems, many cheminformatics tools have been developed for nuclear receptor prediction. The current review focuses on this field, including the functions, computational methods and limitations of these tools.
    Current topics in medicinal chemistry 05/2013; · 3.45 Impact Factor


Available from
May 16, 2014