Using the concept of Chou’s pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies

College of Automation, Northwestern Polytechnical University, No. 127 Youyi West Road, Xi'an 710072, China.
Amino Acids (Impact Factor: 3.65). 06/2008; 34(4):565-72. DOI: 10.1007/s00726-007-0010-9
Source: PubMed

ABSTRACT The rapidly increasing number of sequences entering the genome databanks has created a need for automated methods to analyze them. Information on the subcellular localization of newly found protein sequences is important for revealing their functions in a timely manner and for conducting systems biology studies at the cellular level. Based on the concept of Chou's pseudo-amino acid composition, several sources of information and techniques, such as residue conservation scores, von Neumann entropies, multi-scale energy, and the weighted auto-correlation function, were used to generate the pseudo-amino acid components that represent the protein samples. On this basis, a hybridization predictor was developed for identifying uncharacterized proteins among the following 12 subcellular localizations: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. Higher success rates were obtained than those reported by previous investigators, suggesting that the current approach is promising and may become a useful high-throughput tool in the relevant areas.
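
To make the idea of pseudo-amino acid components concrete, the following is a minimal sketch of the classic form of Chou's pseudo-amino acid composition: 20 amino acid frequencies followed by `lam` sequence-order correlation factors. It is an illustration under stated assumptions, not the feature set of this paper: it uses a single physicochemical property (the Kyte-Doolittle hydropathy scale) in place of the conservation scores, von Neumann entropies, multi-scale energy, and weighted auto-correlation terms named in the abstract. The function names, the `lam` and `weight` parameters, and their default values are hypothetical choices made for this example.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values; an assumed stand-in for the
# richer features described in the abstract.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale a property to zero mean and unit standard deviation
    over the 20 native amino acids."""
    values = [scale[a] for a in AMINO_ACIDS]
    mu, sigma = mean(values), pstdev(values)
    return {a: (scale[a] - mu) / sigma for a in AMINO_ACIDS}


def pseaac(sequence, lam=3, weight=0.05, scale=KYTE_DOOLITTLE):
    """Return a (20 + lam)-dimensional pseudo-amino acid composition vector."""
    # Keep only the 20 native residues; other symbols are skipped.
    seq = [r for r in sequence.upper() if r in scale]
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")

    h = standardize(scale)

    # Conventional amino acid composition (frequencies sum to 1).
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]

    # k-th tier correlation factor: mean squared property difference
    # between residues that are k positions apart.
    thetas = []
    for k in range(1, lam + 1):
        diffs = [(h[a] - h[b]) ** 2 for a, b in zip(seq, seq[k:])]
        thetas.append(sum(diffs) / len(diffs))

    # Joint normalization so that all 20 + lam components sum to 1.
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

Calling `pseaac("MKTAYIAKQRQISFVKSHFSRQ", lam=3)` on this made-up sequence yields a 23-dimensional vector. The first 20 components reflect composition and the last 3 carry sequence-order information, which is what the plain amino acid composition discards.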

  • Source
    ABSTRACT: A cell contains thousands of proteins. Many important functions of the cell are carried out by the proteins therein. Proteins rarely function alone. Most of their functions essential to life are associated with various types of protein-protein interactions (PPIs). Therefore, knowledge of PPIs is fundamental for both basic research and drug development. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop computational methods for acquiring this kind of knowledge in a timely manner. Here, a new predictor, called "iPPI-Emsl", is developed. In the predictor, a protein sample is formulated by incorporating the following two types of information into the general form of PseAAC (pseudo amino acid composition): (1) the physicochemical properties derived from the constituent amino acids of a protein; (2) the wavelet transforms derived from the numerical series along a protein chain. The operation engine of the predictor is an ensemble classifier formed by fusing seven individual random forest engines via a voting system. It is demonstrated with the benchmark dataset from S. cerevisiae as well as the dataset from H. pylori that the new predictor achieves remarkably higher success rates than any of the existing predictors in this area. The new predictor's web server has been established. For the convenience of most experimental scientists, we have further provided a step-by-step guide, by which users can easily get their desired results without the need to follow the complicated mathematics involved in its development. Copyright © 2015. Published by Elsevier Ltd.
    Journal of Theoretical Biology 04/2015; 377. DOI:10.1016/j.jtbi.2015.04.011 · 2.30 Impact Factor
  • Source
    ABSTRACT: Facing the explosive growth of biological sequence data, such as those of protein/peptide and DNA/RNA, generated in the post-genomic age, many bioinformatical and mathematical approaches as well as physicochemical concepts have been introduced to derive useful information from these biological sequences in a timely manner, in order to stimulate the development of medical science and drug design. Meanwhile, because of the rapid penetration of these disciplines, medicinal chemistry is currently undergoing an unprecedented revolution. In this minireview, we summarize the progress by focusing on the following six aspects. (1) Using the pseudo amino acid composition or PseAAC to predict various attributes of protein/peptide sequences that are useful for drug development. (2) Using the pseudo oligonucleotide composition or PseKNC to do the same for DNA/RNA sequences. (3) Introducing the multi-label approach to study those systems where the constituent elements bear multiple characters and functions. (4) Utilizing the graphical rules and "wenxiang" diagrams to analyze complicated biomedical systems. (5) Recent developments in identifying the interactions of drugs with their various types of target proteins in cellular networking. (6) The distorted key theory and its application in developing peptide drugs.
    Medicinal Chemistry 12/2014; 11(3). DOI:10.2174/1573406411666141229162834 · 1.39 Impact Factor
  • Source
    ABSTRACT: Proteins must be located in the appropriate cellular compartments to exert their biological functions. Prediction of protein subcellular localization by computational methods is required in the post-genomic era. Recent studies have focused on predicting not only single-location proteins, but also multi-location proteins. However, most of the existing predictors are far from effective at tackling the challenges of multi-label proteins. This paper proposes an efficient multi-label predictor (namely mPLR-Loc) based on penalized logistic regression and adaptive decisions for predicting both single- and multi-location proteins. Specifically, for each query protein, mPLR-Loc exploits the information from the gene ontology (GO) database by using its accession number (AC) or the ACs of its homologs obtained via BLAST. The frequencies of GO occurrences are used to construct feature vectors, which are then classified by an adaptive-decision based multi-label penalized logistic regression classifier. Experimental results based on two recent stringent benchmark datasets (virus and plant) show that mPLR-Loc remarkably outperforms existing state-of-the-art multi-label predictors. In addition to being able to rapidly and accurately predict subcellular localization of single- and multi-label proteins, mPLR-Loc can also provide probabilistic confidence scores for the prediction decisions. For readers' convenience, the mPLR-Loc server is available online. Copyright © 2014 Elsevier Inc. All rights reserved.
    Analytical Biochemistry 10/2014; 473. DOI:10.1016/j.ab.2014.10.014 · 2.31 Impact Factor
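
The mPLR-Loc abstract above states that the frequencies of GO occurrences are used to construct feature vectors. A minimal sketch of that step, assuming the GO terms for each protein have already been retrieved, might look as follows. The function names and the toy term lists are hypothetical and used only for illustration. The retrieval via accession numbers and BLAST, and the penalized logistic regression classifier itself, are not shown.

```python
from collections import Counter


def build_vocabulary(training_term_lists):
    """Collect the distinct GO terms seen in the training set, in a fixed order."""
    return sorted({term for terms in training_term_lists for term in terms})


def go_frequency_vector(go_terms, vocabulary):
    """Count how often each vocabulary GO term occurs for one protein.

    Terms outside the vocabulary are ignored, so every protein maps
    to a vector of the same length.
    """
    counts = Counter(go_terms)
    return [counts[term] for term in vocabulary]


# Toy GO term lists for two hypothetical training proteins (assumed data).
training = [
    ["GO:0005634", "GO:0005634", "GO:0005737"],
    ["GO:0005739", "GO:0005737"],
]
vocabulary = build_vocabulary(training)
features = [go_frequency_vector(terms, vocabulary) for terms in training]
```

Each row of `features` has one entry per vocabulary term. A multi-label classifier would then be trained on these rows, with one output per subcellular location.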