Real value prediction of solvent accessibility from amino acid sequence
ABSTRACT: The solvent accessibility of amino acid residues has been predicted in the past by classifying them into exposure states with varying thresholds. This classification provides only a wide range of values for the accessible surface area (ASA) within which a residue may fall. Thus far, no attempt has been made to predict real values of ASA from sequence information without a priori classification into exposure states. Here, we present a new method with which to predict real value ASAs for residues, based on neighborhood information. Our real value prediction neural network could estimate the ASA for four different nonhomologous, nonredundant data sets of varying size, with 18.0-19.5% mean absolute error, defined as the per-residue absolute difference between the predicted and experimental values of relative ASA. Correlation between the predicted and experimental values ranged from 0.47 to 0.50. It was observed that the ASA of a residue could be predicted within a 23.7% mean absolute error, even when no information about its neighbors is included. Prediction of real values addresses the issue of the arbitrary choice of ASA state thresholds, and carries more information than category prediction. Prediction error for each residue type strongly correlates with the variability in its experimental ASA values.
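The two evaluation measures named in the abstract, mean absolute error over per-residue relative ASA values and the correlation between predicted and experimental values, can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the correlation is assumed to be the Pearson coefficient, and the example values are invented rather than taken from the paper.

```python
import math


def mean_absolute_error(predicted, experimental):
    """Average per-residue absolute difference between two lists of relative ASA (%)."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


def pearson_correlation(predicted, experimental):
    """Pearson correlation coefficient (assumed here; the abstract only says 'correlation')."""
    if len(predicted) != len(experimental) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_e)


# Invented relative ASA values (%) for three residues, for illustration only.
predicted_asa = [10.0, 50.0, 90.0]
experimental_asa = [20.0, 40.0, 70.0]
print(mean_absolute_error(predicted_asa, experimental_asa))
print(pearson_correlation(predicted_asa, experimental_asa))
```

Because both inputs are relative ASA percentages, the mean absolute error is itself expressed in percentage points, which is how the 18.0-19.5% figure above should be read.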
ABSTRACT: Protein O-GlcNAcylation involves the attachment of a single N-acetylglucosamine (GlcNAc) to the hydroxyl group of serine or threonine residues. Elucidation of O-GlcNAcylation sites on proteins is required in order to decipher its crucial roles in regulating cellular processes and to aid in drug design. With an increasing number of O-GlcNAcylation sites identified by mass spectrometry (MS)-based proteomics, several methods have been proposed for the computational identification of O-GlcNAcylation sites. However, no existing method focuses on the investigation of O-GlcNAcylated substrate motifs. Thus, we were motivated to design a new method for the identification of protein O-GlcNAcylation sites that takes substrate site specificity into account. In this study, 375 experimentally verified O-GlcNAcylation sites were collected from dbOGAP, which is an integrated resource for protein O-GlcNAcylation. Due to the difficulty in characterizing the substrate motifs by conventional sequence logo analysis, a recursive statistical method was applied to obtain significant conserved motifs. To construct the predictive models learned from the identified substrate motifs, we adopted Support Vector Machines (SVMs). Five-fold cross validation was used to evaluate the predictive model, achieving sensitivity, specificity, and accuracy of 0.76, 0.80, and 0.78, respectively. Additionally, an independent testing set, which was truly blind to the training data of the predictive model, was used to demonstrate that the proposed method could provide a promising accuracy (0.94) and outperform three other O-GlcNAcylation site prediction tools. This work proposed a computational method to identify informative substrate motifs for O-GlcNAcylation sites. The evaluation by cross validation and independent testing indicated that the identified motifs were effective in the identification of O-GlcNAcylation sites.
A case study demonstrated that the proposed method could be a feasible means of conducting preliminary analyses of protein O-GlcNAcylation. We also anticipate that the revealed substrate motifs may facilitate the study of the extensive crosstalk between O-GlcNAcylation and phosphorylation. This method may help unravel their mechanisms and roles in signaling, transcription, chronic disease, and cancer.
BMC Bioinformatics 12/2014; 15(Suppl 16):S1. DOI:10.1186/1471-2105-15-S16-S1 · 2.67 Impact Factor
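The sensitivity, specificity, and accuracy reported for the cross validation follow the standard confusion-matrix definitions, sketched below. The function name and the counts are hypothetical: the counts were chosen only so that the ratios reproduce 0.76, 0.80, and 0.78 on a balanced set, and they are not the paper's data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)  # fraction of true sites that are recovered
    specificity = tn / (tn + fp)  # fraction of non-sites that are rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for a balanced set of 100 positives and 100 negatives.
sens, spec, acc = classification_metrics(tp=76, fn=24, tn=80, fp=20)
print(sens, spec, acc)
```

With equal numbers of positives and negatives, accuracy is the mean of sensitivity and specificity, which is consistent with the three reported values.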
ABSTRACT: We present a new approach for predicting the Accessible Surface Area (ASA) using a General Neural Network (GENN). The novelty of the new approach lies in not using residue mutation profiles generated by multiple sequence alignments as descriptive inputs. Instead, we use solely sequential window information and global features such as single-residue and two-residue compositions of the chain. The resulting predictor is both considerably more efficient than sequence alignment based predictors and of comparable accuracy to them. Introduction of the global inputs significantly helps achieve this comparable accuracy. The predictor, termed ASAquick, is tested on predicting the ASA of globular proteins and found to perform similarly well for so-called easy and hard cases, indicating generalizability and possible usability for de novo protein structure prediction. The source code and Linux executables for GENN and ASAquick are available from Research and Information Systems at http://mamiris.com, from the SPARKS Lab at http://sparks-lab.org, and from the Battelle Center for Mathematical Medicine at http://mathmed.org. © 2014 Wiley Periodicals, Inc.
Proteins Structure Function and Bioinformatics 11/2014; 82(11). DOI:10.1002/prot.24682 · 2.92 Impact Factor
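The two kinds of alignment-free input mentioned in this abstract, a sequential window around each residue and a global single-residue composition of the chain, can be illustrated with a short sketch. This is not the ASAquick implementation: the function names, the window half-width, and the padding character "X" are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def residue_composition(sequence):
    """Global feature: fraction of the chain made up by each standard residue."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    return {aa: sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS}


def sequence_windows(sequence, half_width=2, pad="X"):
    """Local feature: one window per residue, padded at the chain termini."""
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


# Toy sequence, for illustration only.
chain = "AACD"
print(sequence_windows(chain, half_width=1))
print(residue_composition(chain)["A"])
```

In a predictor of this kind, each window would be encoded numerically and concatenated with the chain-level composition vector before being passed to the network; a two-residue composition could be built the same way by counting adjacent pairs.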
ABSTRACT: Developing systems to accurately predict protein structure and function from sequence is of fundamental importance to biology and medicine. An effective solution would enable computational drug design and would lead to an increased understanding of complex and devastating disease processes such as cancer. While progress continues to be made on the prediction of structure from sequence, knowledge of a protein’s structure may not be sufficient to discern its function. For example, most proteins undergo some form of post-translational modification (PTM) following initial synthesis, which may have a profound impact on protein function. Sumoylation is an important form of reversible PTM which involves the addition of one or more SUMO proteins to a substrate protein. Here we present a novel approach for the prediction of sumoylation sites on proteins, making use of parallel cascade identification (PCI), a powerful method from the field of nonlinear system identification. No assumptions are made regarding subcellular localization or the presence of a sequence motif, which makes this a broadly applicable method. Classifier accuracy is compared to SUMOPlot, the only other currently available and generally applicable method. PCI sensitivity is greatly improved at 94%, while specificity is on par with SUMOPlot at 17%. Specificity is expected to improve greatly with the release of new experimentally verified training data. The method has been implemented as a web service and is available at: http://www.sce.carleton.ca/faculty/green/green.php?page=webservers.
29th Conference of the Canadian Medical and Biological Engineering Society, Vancouver, BC, Canada; 06/2006