Hong-Bin Shen

Shanghai Jiao Tong University, Shanghai, Shanghai Shi, China

Publications (63) · 146.45 Total impact

  • Article: Predicting Protein N-Terminal Signal Peptides Using Position-Specific Amino Acid Propensities and Conditional Random Fields
    ABSTRACT: Protein signal peptides play a vital role in targeting and translocation of most secreted proteins and many integral membrane proteins in both prokaryotes and eukaryotes. Consequently, accurate prediction of signal peptides and their cleavage sites is an important task in molecular biology. In the present study, firstly, we develop a novel discriminative scoring method for classifying proteins with or without signal peptides. This method successfully captures the characteristics of signal peptides and non-signal peptides by integrating hydrophobicity alignment and position-specific amino acid propensities based on the highest average positions. As a result, this method is capable of discriminating proteins with signal peptides at overall accuracies of 96.3%, 97.0% and 97.2% by leave-one-out jackknife tests on the constructed benchmark datasets for three different organisms, i.e., eukaryotic, Gram-negative, and Gram-positive, respectively. Secondly, we consider the prediction task of signal peptide cleavage sites as a sequence labeling problem and apply the Conditional Random Fields (CRFs) algorithm to solve it. Experimental results demonstrate that the proposed CRFs-based cleavage site finding approach can achieve prediction success rates of 80.8%, 89.4%, and 74.0%, respectively, for the secretory proteins from three different organisms. An online tool, LnSignal, is established for labeling the N-terminal signal cleavage sites and is freely available for academic use.
    Current Bioinformatics 02/2013; 8. · 0.90 Impact Factor
  • Article: LabCaS: Labeling calpain substrate cleavage sites from amino acid sequence using conditional random fields.
    Yong-Xian Fan, Yang Zhang, Hong-Bin Shen
    ABSTRACT: The calpain family of Ca2+-dependent cysteine proteases plays a vital role in many important biological processes that are closely related to a variety of pathological states. Activated calpains selectively cleave relevant substrates at specific cleavage sites, yielding multiple fragments that can have different functions from the intact substrate protein. Until now, our knowledge about the calpain functions and their substrate cleavage mechanisms has been limited because the experimental determination and validation of calpain binding are usually laborious and expensive. In this work, we aim to develop a new computational approach (LabCaS) for accurate prediction of the calpain substrate cleavage sites from amino acid sequences. To overcome the imbalance of negative and positive samples in machine-learning training, from which most former approaches have suffered when splitting sequences into short peptides, we designed a conditional random field algorithm that can label the potential cleavage sites directly from the entire sequences. By integrating the multiple amino acid features and those derived from sequences, LabCaS achieves an accurate recognition of the cleavage sites for most calpain proteins. In a jackknife test on a set of 129 benchmark proteins, LabCaS generates an AUC score of 0.862. The LabCaS program is freely available at: http://www.csbio.sjtu.edu.cn/bioinf/LabCaS.
    Proteins Structure Function and Bioinformatics 11/2012; · 3.39 Impact Factor
  • Article: Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features.
    ABSTRACT: Adenosine-5'-triphosphate (ATP) is a multifunctional nucleotide and plays an important role in cell biology as a coenzyme interacting with proteins. Revealing the binding sites between protein and ATP is significantly important to understand the functionality of the proteins and the mechanisms of protein-ATP complex. In this paper, we propose a novel framework for predicting the proteins' functional residues, through which they can bind with ATP molecules. The new prediction protocol is achieved by combination of sequence evolutional information and bi-profile sampling of multi-view sequential features and the sequence derived structural features. The hypothesis for this strategy is that a single-view feature can represent only partial knowledge of the target, whereas multiple sources of descriptors can be complementary. Prediction performances evaluated by both 5-fold and leave-one-out jackknife cross-validation tests on two benchmark datasets consisting of 168 and 227 non-homologous ATP binding proteins respectively demonstrate the efficacy of the proposed protocol. Our experimental results also reveal that the residue structural characteristics of real protein-ATP binding sites are significantly different from those of normal ones; for example, the binding residues do not show high solvent accessibility propensities, and the bindings prefer to occur at the conjoint points between different secondary structure segments. Furthermore, results also show that performance is affected by the imbalanced training datasets by testing multiple ratios between positive and negative samples in the experiments. Increasing the dataset scale is also shown to be useful for improving the prediction performances.
    BMC Bioinformatics 05/2012; 13:118. · 2.75 Impact Factor
  • Article: Conotoxin superfamily prediction using diffusion maps dimensionality reduction and subspace classifier.
    Jiang-Bo Yin, Yong-Xian Fan, Hong-Bin Shen
    ABSTRACT: Conotoxins are disulfide-rich small peptides that are invaluable channel-targeted peptides and target neuronal receptors, which have been demonstrated to be potent pharmaceuticals in the treatment of Alzheimer's disease, Parkinson's disease, and epilepsy. Accurate prediction of conotoxin superfamily would have many important applications towards the understanding of its biological and pharmacological functions. In this study, a novel method, named dHKNN, is developed to predict conotoxin superfamily. Firstly, we extract the protein's sequential features composed of physicochemical properties, evolutionary information, predicted secondary structures and amino acid composition. Secondly, we use the diffusion maps for dimensionality reduction, which interpret the eigenfunctions of Markov matrices as a system of coordinates on the original data set in order to obtain efficient representation of data geometric descriptions. Finally, an improved K-local hyperplane distance nearest neighbor subspace classifier method called dHKNN is proposed for predicting conotoxin superfamilies by considering the local density information in the diffusion space. The overall accuracy of 91.90% is obtained through the jackknife cross-validation test on a benchmark dataset, indicating the proposed dHKNN is promising.
    Current Protein and Peptide Science 07/2011; 12(6):580-8. · 2.89 Impact Factor
  • Article: Adaptive compressive learning for prediction of protein-protein interactions from primary sequence.
    ABSTRACT: Protein-protein interactions (PPIs) play an important role in biological processes. Although much effort has been devoted to the identification of novel PPIs by integrating experimental biological knowledge, there are still many difficulties because of lacking enough protein structural and functional information. It is highly desired to develop methods based only on amino acid sequences for predicting PPIs. However, sequence-based predictors are often struggling with the high-dimensionality causing over-fitting and high computational complexity problems, as well as the redundancy of sequential feature vectors. In this paper, a novel computational approach based on compressed sensing theory is proposed to predict yeast Saccharomyces cerevisiae PPIs from primary sequence and has achieved promising results. The key advantage of the proposed compressed sensing algorithm is that it can compress the original high-dimensional protein sequential feature vector into a much lower but more condensed space taking the sparsity property of the original signal into account. What makes compressed sensing much more attractive in protein sequence analysis is its compressed signal can be reconstructed from far fewer measurements than what is usually considered necessary in traditional Nyquist sampling theory. Experimental results demonstrate that proposed compressed sensing method is powerful for analyzing noisy biological data and reducing redundancy in feature vectors. The proposed method represents a new strategy of dealing with high-dimensional protein discrete model and has great potentiality to be extended to deal with many other complicated biological systems.
    Journal of Theoretical Biology 05/2011; 283(1):44-52. · 2.21 Impact Factor
  • Article: Towards better accuracy for missing value estimation of epistatic miniarray profiling data by a novel ensemble approach.
    Xiao-Yong Pan, Ye Tian, Yan Huang, Hong-Bin Shen
    ABSTRACT: Epistatic miniarray profiling (E-MAP) is a powerful tool for analyzing gene functions and their biological relevance. However, E-MAP data suffers from large proportion of missing values, which often results in misleading and biased analysis results. It is urgent to develop effective missing value estimation methods for E-MAP. Although several independent algorithms can be applied to achieve this goal, their performance varies significantly on different datasets, indicating different algorithms having their own advantages and disadvantages. In this paper, we propose a novel ensemble approach EMDI based on the high-level diversity to impute missing values that consists of two global and four local base estimators. Experimental results on five E-MAP datasets show that EMDI outperforms all single base algorithms, demonstrating an appropriate combination providing complementarity among different methods. Comparison results between several fusion strategies also demonstrate that the proposed high-level diversity scheme is superior to others. EMDI is freely available at www.csbio.sjtu.edu.cn/bioinf/EMDI/.
    Genomics 03/2011; 97(5):257-64. · 3.02 Impact Factor
  • Article: Large-scale mining co-expressed genes in Arabidopsis anther: from pair to group.
    Qing-Ju Jiao, Yan Huang, Hong-Bin Shen
    ABSTRACT: Research on gene co-expression not only plays an important role in understanding the complex regulatory relationships, but also contributes to our understanding of gene regulatory network. Beyond the co-expression knowledge between two genes, investigating the co-expression relationships among multiple target genes is more informative for understanding the basic working mechanisms in a cell. In this paper, all the Arabidopsis anther genes and every gene's potential co-expressed partners are collected by cross-database search. By combining simple pair gene co-expression networks, a complex Arabidopsis anther co-expression network is constructed. Maximum-clique algorithm is then applied to mine the groups reflecting co-expression relationships among multiple Arabidopsis anther genes that are represented by completely connected graphs. As a result, 254 Arabidopsis anther complete co-expression groups are obtained and our analysis shows that all the genes in the same group have high propensity to be functionally related and co-expressed together. We also demonstrate the efficacy of the proposed maximum-clique algorithm by comparing its results with the known Arabidopsis genome pathways, K-means clustering algorithm derived results and randomized data. It is expected that the 254 Arabidopsis anther complete co-expression groups generated in this paper can be a valuable knowledge source for further studies of molecular mechanisms of anther and its transcription regulations.
    Computational biology and chemistry 02/2011; 35(2):62-8. · 1.37 Impact Factor
  • Article: Gaussian kernel optimization: Complex problem and a simple solution.
    Jiang-Bo Yin, Tao Li, Hong-Bin Shen
    Neurocomputing 01/2011; 74:3816-3822.
  • Article: BinTree seeking: a novel approach to mine both bi-sparse and cohesive modules in protein interaction networks.
    ABSTRACT: Modern science of networks has brought significant advances to our understanding of complex systems biology. As a representative model of systems biology, Protein Interaction Networks (PINs) are characterized by remarkable modular structures, reflecting functional associations between their components. Many methods were proposed to capture cohesive modules so that there is a higher density of edges within modules than those across them. Recent studies reveal that cohesively interacting modules of proteins are not a universal organizing principle in PINs, which has opened up new avenues for revisiting functional modules in PINs. In this paper, functional clusters in PINs are found to be able to form unorthodox structures defined as bi-sparse modules. In contrast to the traditional cohesive module, the nodes in the bi-sparse module are sparsely connected internally and densely connected with other bi-sparse or cohesive modules. We present a novel protocol called the BinTree Seeking (BTS) for mining both bi-sparse and cohesive modules in PINs based on Edge Density of Module (EDM) and matrix theory. BTS detects modules by depicting links and nodes rather than nodes alone and its derivation procedure is performed entirely on the adjacency matrix of networks. The number of modules in a PIN can be automatically determined in the proposed BTS approach. BTS is tested on three real PINs and the results demonstrate that functional modules in PINs are not dominantly cohesive but can be sparse. BTS software and the supporting information are available at: www.csbio.sjtu.edu.cn/bioinf/BTS/.
    PLoS ONE 01/2011; 6(11):e27646. · 4.09 Impact Factor
  • Conference Proceeding: Feature Fusion and Selection for Recognizing Cancer-Related Mutations from Common Polymorphisms
    Jian-Bo Lei, Jiang-Bo Yin, Hong-Bin Shen
    ABSTRACT: Single nucleotide polymorphisms (SNPs) are the most common form of genetic variant in humans, which can be generally classified into disease related mutations and common ones. It has been generally accepted that SNPs caused amino acid substitutions are of particular interest as candidates for affecting susceptibility to complex diseases, such as cancer, which is a serious public issue affecting millions of people worldwide each year. In this study, we have developed an automated and robust method to distinguish cancer-related mutations from common polymorphisms from amino acid sequence, which has a significant meaning for the cancer diagnosis, prognosis and treatment. Multiple different sequential features are extracted and the most important features are finally selected for constructing the prediction model. Experimental results show that an overall 81.07% success rate has been obtained, indicating the proposed method is very promising in the clinical cancer research studies.
    Pattern Recognition (CCPR), 2010 Chinese Conference on; 11/2010
  • Article: Large-scale prediction of human protein-protein interactions from amino acid sequence based on latent topic features.
    Xiao-Yong Pan, Ya-Nan Zhang, Hong-Bin Shen
    ABSTRACT: Protein-protein interaction (PPI) is at the core of the entire interactomic system of any living organism. Although there are many human protein-protein interaction links being experimentally determined, the number is still relatively very few compared to the estimation that there are ∼300,000 protein-protein interactions in human beings. Hence, it is still urgent and challenging to develop automated computational methods to accurately and efficiently predict protein-protein interactions. In this paper, we propose a novel hierarchical LDA-RF (latent dirichlet allocation-random forest) model to predict human protein-protein interactions from protein primary sequences directly, which is featured by a high success rate and strong ability for handling large-scale data sets by digging the hidden internal structures buried into the noisy amino acid sequences in low dimensional latent semantic space. First, the local sequential features represented by conjoint triads are constructed from sequences. Then the generative LDA model is used to project the original feature space into the latent semantic space to obtain low dimensional latent topic features, which reflect the hidden structures between proteins. Finally, the powerful random forest model is used to predict the probability for interaction of two proteins. Our results show that the proposed latent topic feature is very promising for PPI prediction and could also become a powerful strategy to deal with many other bioinformatics problems. As a web server, LDA-RF is freely available at http://www.csbio.sjtu.edu.cn/bioinf/LR_PPI for academic use.
    Journal of Proteome Research 10/2010; 9(10):4992-5001. · 5.11 Impact Factor
  • Article: Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites.
    Hong-Bin Shen, Kuo-Chen Chou
    ABSTRACT: Knowledge of the subcellular localization of viral proteins in a host cell or virus-infected cell is very important because it is closely related to their destructive tendencies and consequences. Facing the avalanche of new protein sequences discovered in the post genomic era, we are challenged to develop automated methods for quickly and accurately predicting the location sites of viral proteins in a host cell; the information thus acquired is particularly important for medical science and antiviral drug design. In view of this, a new fusion classifier called "Virus-mPLoc" was established by hybridizing the gene ontology information, functional domain information, and sequential evolutionary information. The new predictor not only can more accurately predict the location sites of viral proteins in a host cell, but also have the capacity to identify the multiple-location virus proteins, which is beyond the reach of any existing predictors specialized for viral proteins. For reader's convenience, a user-friendly web-server for Virus-mPLoc was designed that is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/.
    Journal of biomolecular structure & dynamics 10/2010; 28(2):175-86. · 4.99 Impact Factor
  • Article: Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins.
    Hong-Bin Shen, Kuo-Chen Chou
    ABSTRACT: By incorporating the information of gene ontology, functional domain, and sequential evolution, a new predictor called Gneg-mPLoc was developed. It can be used to identify Gram-negative bacterial proteins among the following eight locations: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. It can also be used to deal with the case when a query protein may simultaneously exist in more than one location. Compared with the original predictor called Gneg-PLoc, the new predictor is much more powerful and flexible. For a newly constructed stringent benchmark dataset in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset (location), the overall jackknife success rate achieved by Gneg-mPLoc was 85.5%, which was more than 14% higher than the corresponding rate by Gneg-PLoc. As a user-friendly web-server, Gneg-mPLoc is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/.
    Journal of Theoretical Biology 05/2010; 264(2):326-33. · 2.21 Impact Factor
  • Article: Improving the accuracy of predicting disulfide connectivity by feature selection.
    ABSTRACT: Disulfide bonds are primary covalent cross-links formed between two cysteine residues in the same or different protein polypeptide chains, which play important roles in the folding and stability of proteins. However, computational prediction of disulfide connectivity directly from protein primary sequences is challenging due to the nonlocal nature of disulfide bonds in the context of sequences, and the number of possible disulfide patterns grows exponentially when the number of cysteine residues increases. In the previous studies, disulfide connectivity prediction was usually performed in high-dimensional feature space, which can cause a variety of problems in statistical learning, such as the dimension disaster, overfitting, and feature redundancy. In this study, we propose an efficient feature selection technique for analyzing the importance of each feature component. On the basis of this approach, we selected the most important features for predicting the connectivity pattern of intra-chain disulfide bonds. Our results have shown that the high-dimensional features contain redundant information, and the prediction performance can be further improved when these high-dimensional features are reduced to a lower but more compact dimensional space. Our results also indicate that the global protein features contribute little to the formation and prediction of disulfide bonds, while the local sequential and structural information play important roles. All these findings provide important insights for structural studies of disulfide-rich proteins.
    Journal of Computational Chemistry 05/2010; 31(7):1478-85. · 4.58 Impact Factor
  • Article: Computational prediction of DNA-protein interactions: a review.
    ABSTRACT: The interaction between DNA and proteins plays a pivotal role in almost every cellular process, including gene regulation and DNA replication. Given a novel protein, it is very important to know whether it is a DNA-binding protein or not and where the binding sites are. Over the last three decades, since the discovery that the lac operon was regulated by a protein, knowledge of DNA-protein interactions has soared. However, it is very difficult to use experimental techniques to identify DNA-binding proteins because these experiments can be prohibitively labor-intensive in studying all the possible mutations of the residues on the molecular surface. Hence, it has been generally recognized that the ability to automatically identify DNA-binding proteins and their binding sites can significantly speed up our understanding of cellular activities and contribute to advances in drug discovery. The main goal of the present paper is to review the recent progress in the development of computational approaches to predict DNA-protein bindings. We will show a historical roadmap of the improvements, and how the modifications promote better performance.
    Current Computer-Aided Drug Design 05/2010; 6(3):197-206. · 1.76 Impact Factor
  • Article: Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization.
    Kuo-Chen Chou, Hong-Bin Shen
    ABSTRACT: One of the fundamental goals in proteomics and cell biology is to identify the functions of proteins in various cellular organelles and pathways. Information of subcellular locations of proteins can provide useful insights for revealing their functions and understanding how they interact with each other in cellular network systems. Most of the existing methods in predicting plant protein subcellular localization can only cover three or four location sites, and none of them can be used to deal with multiplex plant proteins that can simultaneously exist at, or move between, two or more different location sites. Actually, such multiplex proteins might have special biological functions worthy of particular notice. The present study was devoted to improving the existing plant protein subcellular location predictors from the aforementioned two aspects. A new predictor called "Plant-mPLoc" is developed by integrating the gene ontology information, functional domain information, and sequential evolutionary information through three different modes of pseudo amino acid composition. It can be used to identify plant proteins among the following 12 location sites: (1) cell membrane, (2) cell wall, (3) chloroplast, (4) cytoplasm, (5) endoplasmic reticulum, (6) extracellular, (7) Golgi apparatus, (8) mitochondrion, (9) nucleus, (10) peroxisome, (11) plastid, and (12) vacuole. Compared with the existing methods for predicting plant protein subcellular localization, the new predictor is much more powerful and flexible. Particularly, it also has the capacity to deal with multiple-location proteins, which is beyond the reach of any existing predictors specialized for identifying plant protein subcellular localization. As a user-friendly web-server, Plant-mPLoc is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results. It is anticipated that the Plant-mPLoc predictor as presented in this paper will become a very useful tool in plant science as well as all the relevant areas.
    PLoS ONE 01/2010; 5(6):e11335. · 4.09 Impact Factor
  • Article: A new method for predicting the subcellular localization of eukaryotic proteins with both single and multiple sites: Euk-mPLoc 2.0.
    Kuo-Chen Chou, Hong-Bin Shen
    ABSTRACT: Information of subcellular locations of proteins is important for in-depth studies of cell biology. It is very useful for proteomics, system biology and drug development as well. However, most existing methods for predicting protein subcellular location can only cover 5 to 12 location sites. Also, they are limited to dealing with single-location proteins and hence fail to work for multiplex proteins, which can simultaneously exist at, or move between, two or more location sites. Actually, multiplex proteins of this kind usually possess some important biological functions worthy of our special notice. A new predictor called "Euk-mPLoc 2.0" is developed by hybridizing the gene ontology information, functional domain information, and sequential evolutionary information through three different modes of pseudo amino acid composition. It can be used to identify eukaryotic proteins among the following 22 locations: (1) acrosome, (2) cell wall, (3) centriole, (4) chloroplast, (5) cyanelle, (6) cytoplasm, (7) cytoskeleton, (8) endoplasmic reticulum, (9) endosome, (10) extracell, (11) Golgi apparatus, (12) hydrogenosome, (13) lysosome, (14) melanosome, (15) microsome, (16) mitochondria, (17) nucleus, (18) peroxisome, (19) plasma membrane, (20) plastid, (21) spindle pole body, and (22) vacuole. Compared with the existing methods for predicting eukaryotic protein subcellular localization, the new predictor is much more powerful and flexible, particularly in dealing with proteins with multiple locations and proteins without available accession numbers. For a newly-constructed stringent benchmark dataset which contains both single- and multiple-location proteins and in which none of proteins has pairwise sequence identity to any other in a same location, the overall jackknife success rate achieved by Euk-mPLoc 2.0 is more than 24% higher than those by any of the existing predictors. As a user-friendly web-server, Euk-mPLoc 2.0 is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/euk-multi-2/. For a query protein sequence of 400 amino acids, it will take about 15 seconds for the web-server to yield the predicted result; the longer the sequence is, the more time it may usually need. It is anticipated that the novel approach and the powerful predictor as presented in this paper will have a significant impact on Molecular Cell Biology, System Biology, Proteomics, Bioinformatics, and Drug Development.
    PLoS ONE 01/2010; 5(4):e9931. · 4.09 Impact Factor
  • Article: Multi label learning for prediction of human protein subcellular localizations.
    Lin Zhu, Jie Yang, Hong-Bin Shen
    ABSTRACT: Predicting protein subcellular locations has attracted much attention in the past decade. However, one of the most challenging problems is that many proteins were found simultaneously existing in, or moving between, two or more different cell components in a eukaryotic cell. Seldom previous predictors were able to deal with such multiplex proteins although they have extremely important implications in future drug discovery in terms of their specific subcellular targeting. Approximately 20% of the human proteome consists of such multiplex proteins with multiple sample labels. In order to efficiently handle such multiplex human proteins, we have developed a novel multi-label (ML) learning and prediction framework called ML-PLoc, which decomposes the multi-label prediction problem into multiple independent binary classification problems. ML-PLoc is constructed based on support vector machine (SVM) and sequential evolution information. Experimental results show that ML-PLoc can achieve an overall accuracy 64.6% and recall ratio 67.2% on a benchmark dataset consisting of 14 human subcellular locations, and is very powerful for dealing with multiplex proteins. The current approach represents a new strategy to deal with the multi-label biological problems. ML-PLoc software is freely available for academic use at: http://www.csbio.sjtu.edu.cn/bioinf/ML-PLoc .
    The Protein Journal 10/2009; 28(9-10):384-90. · 1.04 Impact Factor
  • Chapter: Recent Progress of Bioinformatics in Membrane Protein Structural Studies
    03/2009: pages 293-308; ISBN: 9780470741191
  • Article: QuatIdent: a web server for identifying protein quaternary structural attribute by fusing functional domain and sequential evolution information.
    Hong-Bin Shen, Kuo-Chen Chou
    ABSTRACT: Many proteins exist in vivo as oligomers with various different quaternary structural attributes rather than as single individual chains. They are the structural bases of various marvelous biological functions such as cooperative effects, allosteric mechanism, and ion-channel gating. Therefore, with the avalanche of protein sequences generated in the postgenomic era, it is very important for both basic research and drug discovery to identify their quaternary structural attributes in a timely manner. In view of this, a powerful ensemble identifier, called QuatIdent, is developed by fusing the functional domain and sequential evolution information. QuatIdent is a 2-layer predictor. The 1st layer identifies which one of the following 10 main quaternary structural attributes a query protein belongs to: (1) monomer, (2) dimer, (3) trimer, (4) tetramer, (5) pentamer, (6) hexamer, (7) heptamer, (8) octamer, (9) decamer, and (10) dodecamer. If the result thus obtained turns out to be anything but monomer, the process will be automatically continued to further identify it as belonging to a homo-oligomer or hetero-oligomer. The overall success rate by QuatIdent for the 1st layer identification was 71.1% and that for the 2nd layer ranged from 84 to 96%. These rates were derived by the jackknife cross-validation tests on the stringent benchmark data sets where none of the proteins has ≥60% pairwise sequence identity to any other in the same subset. QuatIdent is freely accessible to the public as a web server via the site at http://www.csbio.sjtu.edu.cn/bioinf/Quaternary/, by which one can get the desired 2-level results for a query protein sequence in around 25 seconds. The longer the sequence is, the more time is needed.
    Journal of Proteome Research 03/2009; 8(3):1577-84. · 5.11 Impact Factor

Institutions

  • 2007–2012
    • Shanghai Jiao Tong University
      • Department of Automation
      Shanghai, Shanghai Shi, China
  • 1970–2010
    • Shanghai University
      Shanghai, Shanghai Shi, China
  • 2009
    • Massachusetts Institute of Technology
      Cambridge, MA, USA
  • 2007–2008
    • Harvard University
      • Department of Biological Chemistry and Molecular Pharmacology
      Boston, MA, USA