HSEpred: predict Half-Sphere Exposure from protein sequences

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan.
Bioinformatics (Impact Factor: 4.62). 08/2008; 24(13):1489-97. DOI: 10.1093/bioinformatics/btn222
Source: PubMed


Half-sphere exposure (HSE) is a newly developed two-dimensional measure of solvent exposure. By conceptually dividing the sphere around an amino acid in a protein structure into two half spheres, which represent its distinct spatial neighborhoods in the upward and downward directions, the HSE-up and HSE-down measures show superior performance compared with other measures such as accessible surface area, residue depth and contact number. However, no method currently exists for predicting HSE measures from sequence data.
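The half-sphere split described above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the "up" direction is taken as the vector from each residue's C-alpha atom to its C-beta atom, neighbors are counted by their C-alpha positions, and the sphere radius defaults to 13 Å. The function name `half_sphere_exposure` and its parameters are hypothetical and are not taken from HSEpred.

```python
import math


def half_sphere_exposure(ca_coords, cb_coords, radius=13.0):
    """Return one (hse_up, hse_down) pair of neighbor counts per residue.

    ca_coords, cb_coords: lists of (x, y, z) tuples, one per residue.
    radius: sphere radius in angstroms (assumed value, not from the source).
    """
    results = []
    for i, (ca_i, cb_i) in enumerate(zip(ca_coords, cb_coords)):
        # Assumed "up" axis: from this residue's C-alpha towards its C-beta.
        up_axis = [b - a for a, b in zip(ca_i, cb_i)]
        up = 0
        down = 0
        for j, ca_j in enumerate(ca_coords):
            if j == i:
                continue
            offset = [c - a for a, c in zip(ca_i, ca_j)]
            distance = math.sqrt(sum(x * x for x in offset))
            if distance > radius:
                continue  # outside the sphere, not a neighbor
            # The sign of the dot product picks the half sphere.
            dot = sum(u * o for u, o in zip(up_axis, offset))
            if dot >= 0:
                up += 1
            else:
                down += 1
        results.append((up, down))
    return results
```

Under these assumptions, the sum of the two counts for a residue equals the number of C-alpha atoms within the sphere, which is one common way to define a contact number.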
In this article, we propose a novel approach to predict the HSE measures and to infer residue contact numbers from the predicted HSE values, based on a well-prepared non-homologous protein structure dataset. In particular, we employ support vector regression (SVR) to quantify the relationship between HSE measures and protein sequences and evaluate its prediction performance. We extensively explore five sequence-encoding schemes to examine their effects on the prediction performance. Our method achieves correlation coefficients of 0.72 and 0.68 between the predicted and observed HSE-up and HSE-down measures, respectively. Moreover, contact number can be accurately predicted by summing the predicted HSE-up and HSE-down values, which further broadens the applicability of this method. The successful application of the SVR approach in this study suggests that it should be useful more generally for quantifying the protein sequence-structure relationship and for predicting structural property profiles from protein sequences.
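The surrounding steps of such a pipeline can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the one-hot sliding-window encoding is a generic scheme and is not claimed to be one of the five schemes explored in the paper, the correlation is assumed to be Pearson's, and the helper names `encode_windows`, `pearson` and `contact_numbers` are hypothetical. The regression step itself is omitted; in practice the feature vectors would be passed to an SVR implementation such as scikit-learn's `sklearn.svm.SVR`.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_windows(sequence, half_window=2):
    """One-hot encode a window centered on each residue.

    Window positions beyond the sequence ends, and unknown residue
    letters, are left as all-zero blocks (an assumed convention).
    """
    features = []
    for center in range(len(sequence)):
        vector = []
        for pos in range(center - half_window, center + half_window + 1):
            one_hot = [0.0] * len(AMINO_ACIDS)
            if 0 <= pos < len(sequence):
                index = AMINO_ACIDS.find(sequence[pos])
                if index >= 0:
                    one_hot[index] = 1.0
            vector.extend(one_hot)
        features.append(vector)
    return features


def pearson(xs, ys):
    """Pearson correlation between predicted and observed values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def contact_numbers(pred_up, pred_down):
    """Infer per-residue contact numbers as HSE-up plus HSE-down."""
    return [up + down for up, down in zip(pred_up, pred_down)]
```

With a window of `2 * half_window + 1` positions and 20 amino acid types, each residue is described by a vector of `20 * (2 * half_window + 1)` values. Two regressors, one for HSE-up and one for HSE-down, would be trained on these vectors, and `pearson` would then compare their predictions with the observed values.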
The prediction webserver and supplementary materials are accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/hse/.
Supplementary data are available at Bioinformatics online.



Available from: Jiangning Song
  • Source
    • "The performance of SVM is largely dependent on the quality of the features (Liu et al., 2008). Although plenty of feature representation and selection methods have been proposed for protein sequences (Fang et al., 2008; Chou, 2011; Yuan et al., 2010; Song et al., 2008) and these methods have been systematically surveyed (Nanni et al., 2010; Zhang et al., 2005), the underlying principle of protein-DNA interaction is still largely unknown. To this end, we propose a comprehensive feature representation, including sequence information, evolutionary profiles, predicted secondary structure, predicted relative solvent accessibility (RSA) information, physicochemical properties, and biological function information. "
    ABSTRACT: Identification of DNA-binding proteins is essential in studying cellular activities, as DNA-binding proteins play a pivotal role in gene regulation. In this study, we propose newDNA-Prot, a DNA-binding protein predictor that employs a support vector machine classifier and a comprehensive feature representation. The sequence representations are categorized into 6 groups: primary sequence based, evolutionary profile based, predicted secondary structure based, predicted relative solvent accessibility based, physicochemical property based and biological function based features. The mRMR, wrapper and two-stage feature selection methods are employed to remove irrelevant features and reduce redundant features. Experiments demonstrate that the two-stage method performs better than the mRMR and wrapper methods. We also perform a statistical analysis on the selected features, and the results show that more than 95% of the selected features are statistically significant and that they cover all 6 feature groups. The newDNA-Prot method is compared with several state-of-the-art algorithms, including iDNA-Prot, DNAbinder and DNA-Prot. The results demonstrate that newDNA-Prot outperforms all three. More specifically, newDNA-Prot improves on the runner-up method, DNA-Prot, by around 10% on several evaluation measures. The proposed newDNA-Prot method is available at http://sourceforge.net/projects/newdnaprot/
    Computational Biology and Chemistry 10/2014; 52. DOI:10.1016/j.compbiolchem.2014.09.002 · 1.12 Impact Factor
  • Source
    • "The most important challenge for SVM-based prediction is to find a suitable way to fully describe the information implied in protein-DNA interactions [26]. There are several different protein features and feature extraction methods that can be used [8,27-29] and a comprehensive survey of these methods can be found in related research work [30,31]. However, the underlying principle of protein-DNA interactions is still largely unknown. "
    ABSTRACT: Background: DNA-binding proteins (DNA-BPs) play a pivotal role in both eukaryotic and prokaryotic proteomes. Several computational methods have been proposed in the literature to deal with DNA-BPs, and many informative features and properties have been used and shown to have a significant impact on this problem. However, the ultimate goal of bioinformatics is to be able to predict DNA-BPs directly from primary sequence. Results: In this work, the focus is on how to transform these informative features into a uniform numeric representation appropriately and improve the prediction accuracy of our SVM-based classifier for DNA-BPs. A systematic representation of some selected features known to perform well is investigated here. First, four kinds of protein properties are obtained and used to describe the protein sequence. Second, three different feature transformation methods (OCTD, AC and SAA) are adopted to obtain numeric feature vectors from three main levels of the protein sequence (global, nonlocal and local), and their performances are exhaustively investigated. Finally, the mRMR-IFS feature selection method and an ensemble learning approach are utilized to determine the best prediction model. In addition, the optimal features selected by mRMR-IFS are illustrated based on the observed results, which may provide useful insights for revealing the mechanisms of protein-DNA interactions. For five-fold cross-validation over the DNAdset and DNAaset, we obtained overall accuracies of 0.940 and 0.811 and MCCs of 0.881 and 0.614, respectively. Conclusions: The good results suggest that an entirely sequence-based protocol can be developed efficiently that transforms and integrates informative features from different scales for use by an SVM to predict DNA-BPs accurately. Moreover, a novel systematic framework for sequence descriptor-based protein function prediction is proposed here.
    BMC Bioinformatics 03/2013; 14(1):90. DOI:10.1186/1471-2105-14-90 · 2.58 Impact Factor
  • Source
    • "Then this information can be represented as a two dimensional matrix which is known as the PSSM of the protein. PSSM has been widely used to predict protein fold pattern [25], protein quaternary structural attribute [26], disulfide connectivity [27,28], half-sphere exposure [29], protein fold recognition and superfamily discrimination [30], ATP binding residues of a protein [31], and catalytic residues [32]. As a result, we also use it to predict bioluminescent proteins. "
    ABSTRACT: Bioluminescent proteins are important for various cellular processes, such as gene expression analysis, drug discovery, bioluminescent imaging, toxicity determination, and DNA sequencing studies. Hence, the correct identification of bioluminescent proteins is of great importance both for helping genome annotation and providing a supplementary role to experimental research to obtain insight into bioluminescent proteins' functions. However, few computational methods are available for identifying bioluminescent proteins. Therefore, in this paper we develop a new method to predict bioluminescent proteins using a model based on position specific scoring matrix and auto covariance. Tested by 10-fold cross-validation and independent test, the accuracy of the proposed model reaches 85.17% for the training dataset and 90.71% for the testing dataset respectively. These results indicate that our predictor is a useful tool to predict bioluminescent proteins. This is the first study in which evolutionary information and local sequence environment information have been successfully integrated for predicting bioluminescent proteins. A web server (BLPre) that implements the proposed predictor is freely available.
    International Journal of Molecular Sciences 12/2012; 13(3):3650-60. DOI:10.3390/ijms13033650 · 2.86 Impact Factor