Computational prediction of human proteins that can be secreted into the bloodstream

Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA.
Bioinformatics (Impact Factor: 4.62). 09/2008; 24(20):2370-5. DOI: 10.1093/bioinformatics/btn418
Source: PubMed

ABSTRACT We present a novel computational method for predicting which proteins from highly and abnormally expressed genes in diseased human tissues, such as cancers, can be secreted into the bloodstream, suggesting possible marker proteins for follow-up serum proteomic studies. A major challenge in tackling this problem is that our understanding of the downstream localization of proteins after they are secreted from cells is very limited and insufficient to provide useful hints about secretion to the bloodstream. To bypass this difficulty, we have taken a data mining approach: we first collected, through extensive literature searches, human proteins known from previous proteomic studies to be secreted into the bloodstream under various pathological conditions, and then asked: 'What do these secreted proteins have in common, in terms of their physical and chemical properties, amino acid sequence and structural features, that can be used to predict them?' We have identified a list of features, such as signal peptides, transmembrane domains, glycosylation sites, disordered regions, secondary structural content, hydrophobicity and polarity measures, that show relevance to protein secretion. Using these features, we have trained a support vector machine-based classifier to predict protein secretion to the bloodstream. On a large test set containing 98 secretory and 6601 non-secretory human proteins, our classifier achieved approximately 90% prediction sensitivity and approximately 98% prediction specificity. Several additional datasets were used to further assess the performance of our classifier. On a set of 122 proteins that were found to be of abnormally high abundance in human blood due to various cancers, our program predicted 62 as blood-secreted proteins. By applying our program to abnormally highly expressed genes in gastric cancer and lung cancer tissues detected through microarray gene expression studies, we predicted 13 and 31 as blood secreted, respectively, suggesting that they could serve as potential biomarkers for these two cancers. Our study demonstrated that our method can provide highly useful information to link genomic and proteomic studies for disease biomarker discovery. Our software can be accessed at
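
To make the reported sensitivity and specificity concrete, the following minimal Python sketch computes both measures, plus precision, from confusion-matrix counts. The counts are hypothetical: they are chosen only to be consistent with the test-set sizes (98 secretory, 6601 non-secretory) and the approximate rates quoted in the abstract, and are not taken from the paper. The function name `confusion_rates` is likewise an illustrative assumption.

```python
def confusion_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, precision) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true secretory proteins recovered
    specificity = tn / (tn + fp)  # fraction of non-secretory proteins rejected
    precision = tp / (tp + fp)    # fraction of positive predictions that are correct
    return sensitivity, specificity, precision


# Hypothetical counts (NOT from the paper): 98 positives and 6601 negatives,
# split so that the rates land near the quoted ~90% and ~98%.
tp, fn = 88, 10
tn, fp = 6469, 132

sens, spec, prec = confusion_rates(tp, fn, tn, fp)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} precision={prec:.3f}")
```

Under these assumed counts, a specificity near 98% still yields more false positives than true positives, because the negative class is roughly 67 times larger than the positive class. This is why precision is worth reporting alongside sensitivity and specificity on a test set this imbalanced.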

Available from: Juan Cui, Sep 02, 2014
    • "A reliable prediction capability for proteins that can travel from circulation to saliva will represent a highly useful tool as it can provide a candidate list of biomarkers specific to a particular disease. This will allow targeted searches for effective biomarkers in saliva using antibody-based techniques, in comparison with the traditional search strategies by direct comparisons among proteomic data collected from saliva samples of multiple patients and healthy controls, which have proved to be ineffective in searches for biomarkers in blood [8,35] and urine [36]. Here we demonstrated that it is possible to develop one such tool, which by no means represents the possibly most reliable tool for such a prediction. "
    ABSTRACT: Proteins can move from blood circulation into salivary glands through active transport, passive diffusion or ultrafiltration; some of these proteins are then released into saliva and hence can potentially serve as biomarkers for diseases if accurately identified. We present a novel computational method for predicting salivary proteins that come from circulation. The basis for the prediction is a set of physicochemical and sequence features that we found to discriminate between human proteins known to move from circulation to saliva and proteins deemed not to be in saliva. A classifier was trained on these features using a support vector machine to predict protein secretion into saliva. The classifier achieved 88.56% average recall and 90.76% average precision in 10-fold cross-validation on the training data, indicating that the selected features are informative. Considering the possibility that our negative training data (i.e., proteins predicted not to be in saliva) may not be highly reliable, we have also trained a ranking method, based on the same features, that aims to rank the known salivary proteins from circulation highest among the proteins in the general background. This prediction capability can be used to predict potential biomarker proteins for specific human diseases when coupled with information on differentially expressed proteins in diseased versus healthy control tissues and a prediction capability for blood-secretory proteins. Using such integrated information, we predicted 31 candidate biomarker proteins in saliva for breast cancer.
    PLoS ONE 11/2013; 8(11):e80211. DOI:10.1371/journal.pone.0080211 · 3.23 Impact Factor
    • "These results engender confidence in proposing some of them as potential molecular markers for ovarian epithelial carcinoma cells versus normal HOSE cells. Using a prediction method that we recently developed and validated (Cui et al., 2008), 103 of these genes were predicted to have their protein products secreted into circulation, thus providing another important pool of potential serum markers for ovarian cancer (Cui et al., 2011c). "
    Ovarian Cancer - Basic Science Perspective, 02/2012; ISBN: 978-953-307-812-0
    • "According to Han et al. [23], prediction accuracy may be improved by incorporating this function, as it can test whether the BPC property of an amino acid is dependent on that of its neighbours, and it has been used in protein structural and functional classification studies. However, this was not effective in the prediction of membrane proteins [24]. The Moreau-Broto auto-correlation function Fv of an amino acid index is calculated within a window, as: "
    ABSTRACT: Prediction of short stretches in protein sequences capable of forming amyloid-like fibrils is important in understanding the underlying cause of amyloid illnesses, thereby aiding in the discovery of sequence-targeted anti-aggregation pharmaceuticals. Due to the constraints of experimental molecular techniques in identifying such motif segments, it is highly desirable to develop computational methods to provide better and affordable in silico predictions. Accurate in silico prediction of amyloidogenic peptide regions relies on the cooperation between informative features and classifier design. In this research article, we propose one such efficient fibril prediction implementation exploiting heterogeneous features based on bio-physio-chemical (BPC) properties, the auto-correlation function of carefully selected amino acid indices, and atomic composition within a window of amino acids in a protein fragment. To obtain an optimal number of BPC features, an evolutionary Support Vector Machine (SVM), integrating an SVM with a novel implementation of a hybrid Genetic Algorithm termed the Memetic Algorithm, is utilized. Five prediction modules designed using Artificial Neural Network (ANN) models are trained with independent and integrated features in order to validate the fibril-forming motifs. The results provide evidence that incorporating a new feature, the auto-correlation function, alongside BPC properties helps strengthen the sequence interaction effect in the feature vector, thereby yielding better prediction quality in terms of sensitivity, specificity, Matthews Correlation Coefficient and Area under the Receiver Operating Characteristic curve. A significant improvement in performance is observed when features such as the auto-correlation function, which maintains the sequence order effect, are introduced in addition to the conventional BPC properties selected through a novel optimization strategy to predict the peptide status: amyloidogenic or non-amyloidogenic. The proposed approach achieves acceptable results, comparable to most online predictors. In addition, it compensates for the lacuna in existing amyloid fibril prediction tools by maintaining equilibrium between sensitivity and specificity.
    BMC Bioinformatics 11/2011; 12 Suppl 13(Suppl 13):S21. DOI:10.1186/1471-2105-12-S13-S21 · 2.67 Impact Factor
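
The excerpt above breaks off before giving the windowed formula, so the exact form used in that article is not shown here. As a rough illustration only, the Python sketch below implements the commonly used normalized Moreau-Broto auto-correlation at lag d, AC(d) = (1/(N-d)) * sum over i of P(i) * P(i+d), over a window of N residues. The Kyte-Doolittle hydropathy scale as the amino acid index, the absence of index standardization, and the function name `moreau_broto` are all assumptions made for this sketch, not details from the cited work.

```python
# Kyte-Doolittle hydropathy values, used here as an example amino acid index
# (an assumption; the cited article selects its own indices).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moreau_broto(window, lag, index=KYTE_DOOLITTLE):
    """Normalized Moreau-Broto auto-correlation of `index` over `window` at `lag`."""
    values = [index[aa] for aa in window]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(window)")
    # Sum products of index values for residues `lag` positions apart,
    # then average over the number of such pairs.
    pair_sum = sum(values[i] * values[i + lag] for i in range(n - lag))
    return pair_sum / (n - lag)


# One feature per lag for a hypothetical six-residue window.
features = [moreau_broto("AVILAG", d) for d in (1, 2)]
print(features)
```

Each lag contributes one number to the feature vector, which is how this kind of descriptor carries sequence-order information that a plain composition count cannot.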