Conference Paper

Prediction of Protein Subcellular Locations by Combining K-Local Hyperplane Distance Nearest Neighbor


Abstract

A huge number of protein sequences have been generated and collected. However, the functions of most of them are still unknown. Protein subcellular localization is important for elucidating protein function. It would be worthwhile to develop a method that predicts the subcellular location of a given protein when only its amino acid sequence is known. Although many efforts have been made to accomplish this task, further research is needed to improve prediction accuracy. In this paper, with the K-local Hyperplane Distance Nearest Neighbor algorithm (HKNN) as the base classifier, an ensemble classifier is proposed to predict the subcellular locations of proteins in eukaryotic cells. Each base HKNN classifier is constructed from a separate feature set, and the classifiers are finally combined with a majority voting scheme. Results obtained through a 5-fold cross-validation test on the same protein dataset showed an improvement in prediction accuracy over existing algorithms.
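The combination step described in the abstract can be illustrated with a minimal sketch. It assumes each base classifier has already produced one location label for a protein. The function name `majority_vote` and the tie-breaking rule (the earliest classifier in the list wins) are assumptions for illustration; the abstract does not state how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most base classifiers.

    predictions: non-empty list of labels, one per base classifier.
    Ties go to the label that appears first in the list (an assumption).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical labels from three base classifiers built on different feature sets.
votes = ["nucleus", "cytoplasm", "nucleus"]
print(majority_vote(votes))  # prints "nucleus"
```

Using an odd number of base classifiers reduces, but does not eliminate, ties when there are more than two locations.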

Article
4.5%Y2O3-ZrO2 and 0.6%Sc2O3-YSZ composite ceramic nanopowders were prepared by a chemical co-precipitation method in ethanol/water (alcohol-to-water volume ratio of 5:1) using ZrOCl2·8H2O, Y(NO3)3·6H2O and Sc2O3 as raw materials and polyethylene glycol (PEG) as dispersant. The synthesized powders were characterized by TEM and XRD analysis. The high-temperature stability of the composite ceramic nanopowders was also studied. The results show that, after calcination at 600°C for 2 h, the presynthesized powders have a pure nonequilibrium tetragonal structure; the size of the 0.6%Sc2O3-YSZ particles prepared by the reverse titration method is ~20 nm, compared with ~30 nm for those prepared by the straight titration method. After sintering at 1200°C for 100 h, YSZ shows an equilibrium tetragonal structure together with a cubic phase; after sintering at 1300°C for 100 h, the equilibrium tetragonal structure partly transforms to the cubic phase and a small amount of monoclinic phase; after sintering at 1400°C for 100 h, the tetragonal structure transforms entirely to cubic and monoclinic phases. In contrast, 0.6%Sc2O3-YSZ keeps the pure nonequilibrium tetragonal structure after sintering at 1200, 1300 or 1400°C for 100 h. Therefore, the high-temperature phase stability of YSZ can be modified by adding a small fraction of Sc2O3.
Article
Molybdenum-alumina (Mo/Al2O3) composites with different volume fractions of ceria (CeO2) were prepared by pressureless sintering. The microstructure and wear resistance were studied by XRD and electron microprobe. The results show that a new phase, CeAl11O18, forms with the addition of CeO2. The amount of CeAl11O18 increases and that of Al2O3 decreases with increasing CeO2 content. Al2O3 disappears completely and is replaced by CeAl11O18 when the volume fraction of CeO2 is 6%. The interdiffusion of Mo, Al and O at the boundary between the Al2O3 and CeAl11O18 phases is promoted by CeO2. The study of the worn microstructure shows that tribolayers can be found in Mo/Al2O3 composites sintered at 1730°C. Massive tribolayers appear when the CeO2 content is 4% (volume fraction), which increases the wear resistance.
Article
Y2O3:Eu3+ powders were synthesized by the combustion method, and the influence of dispersant was investigated. XRD analysis indicated that the particle size first increased with a small amount of dispersant and then decreased with a further increase of dispersant. The morphologies of the powders were studied by scanning electron microscopy (SEM) and high-resolution transmission electron microscopy (HRTEM). SEM images revealed that an appropriate amount of dispersant could reduce the agglomeration significantly. Due to the increase of specific surface area, the charge transfer bands in the excitation spectra showed a red shift with decreasing particle size. The emission spectra showed that the reduction of agglomeration could enhance the luminescent intensity greatly. The decay curves of the powders were single exponential, and the lifetime of the 612 nm emission increased with decreasing particle size. The two deexcitation processes were discussed, and the increase of lifetime with decreased particle size resulted from the combined effects of surface defects and effective refractive index.
Conference Paper
Most proteins function only when folded into a particular 3D configuration. Recently, a class of proteins has been discovered that do not fold into any particular configuration; these are known as Intrinsically Unstructured (IU) proteins. We construct a classifier to identify IU regions in proteins based on features derived from protein sequence information alone, and evaluate it on out-of-sample data. Our results indicate that the resulting classifier represents a viable alternative to existing IU classifiers.
Article
Subcellular localization is a key functional characteristic of proteins. A fully automatic and reliable prediction system for protein subcellular localization is needed, especially for the analysis of large-scale genome sequences. In this paper, Support Vector Machine has been introduced to predict the subcellular localization of proteins from their amino acid compositions. The total prediction accuracies reach 91.4% for three subcellular locations in prokaryotic organisms and 79.4% for four locations in eukaryotic organisms. Predictions by our approach are robust to errors in the protein N-terminal sequences. This new approach provides superior prediction performance compared with existing algorithms based on amino acid composition and can be a complementary method to other existing methods based on sorting signals. A web server implementing the prediction method is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/. Supplementary material is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/.
Article
Motivation: Identifying the destination or localization of proteins is key to understanding their function and facilitating their purification. A number of existing computational prediction methods are based on sequence analysis. However, these methods are limited in scope, accuracy and most particularly breadth of coverage. Rather than using sequence information alone, we have explored the use of database text annotations from homologs and machine learning to substantially improve the prediction of subcellular location. Results: We have constructed five machine-learning classifiers for predicting subcellular localization of proteins from animals, plants, fungi, Gram-negative bacteria and Gram-positive bacteria, which are 81% accurate for fungi and 92–94% accurate for the other four categories. These are the most accurate subcellular predictors across the widest set of organisms ever published. Our predictors are part of the Proteome Analyst web-service. Availability: http://www.cs.ualberta.ca/~bioinfo/PA/Sub, http://www.cs.ualberta.ca/~bioinfo/PA Supplementary information: http://www.cs.ualberta.ca/~bioinfo/PA/Subcellular
Article
Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs.
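To make the idea of a modified KNN more concrete, here is a hedged sketch of one common formulation of a local hyperplane distance classifier. For each class, it takes the K nearest training points of that class, fits a regularized affine hull through them, and assigns the query to the class whose hull is closest. The function names, the regularization parameter `lam`, and the small Gaussian elimination helper are assumptions for illustration and are not taken from the abstract.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def local_hyperplane_distance(x, neighbors, lam=1.0):
    """Regularized squared distance from x to the hyperplane through neighbors."""
    k, d = len(neighbors), len(x)
    centroid = [sum(p[j] for p in neighbors) / k for j in range(d)]
    # Directions from the centroid to each neighbor span the local hyperplane.
    V = [[p[j] - centroid[j] for j in range(d)] for p in neighbors]
    r = [x[j] - centroid[j] for j in range(d)]
    # Ridge-regularized normal equations: (V V^T + lam I) alpha = V r.
    A = [[dot(V[i], V[j]) + (lam if i == j else 0.0) for j in range(k)]
         for i in range(k)]
    alpha = solve_linear(A, [dot(V[i], r) for i in range(k)])
    residual = [r[j] - sum(alpha[i] * V[i][j] for i in range(k)) for j in range(d)]
    return dot(residual, residual) + lam * dot(alpha, alpha)


def hknn_predict(x, X_train, y_train, k=3, lam=1.0):
    """Assign x to the class with the nearest local hyperplane."""
    best_label, best_dist = None, math.inf
    for label in sorted(set(y_train)):
        members = [p for p, y in zip(X_train, y_train) if y == label]
        nearest = sorted(members, key=lambda p: math.dist(p, x))[:k]
        dist = local_hyperplane_distance(x, nearest, lam)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy 2D data with hypothetical labels.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
y = ["cytoplasm"] * 3 + ["nucleus"] * 3
print(hknn_predict([0.2, 0.1], X, y))
```

The penalty `lam` keeps the fitted point from drifting far from the neighbors along the hyperplane; with `lam > 0` the linear system is always solvable, even when the neighbors are linearly dependent.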
Article
Sequences of intracellular and extracellular soluble proteins were analyzed statistically in terms of amino acid composition and residue-pair frequencies. Residue-pair frequencies were calculated for sequential separations from (n, n + 1) to (n, n + 5), and converted into scoring parameters. Then, for each test protein, the single-residue and residue-pair parameters were applied to calculate a total score. According to our definition, a protein which yields a positive score is indicative of an intracellular protein, whereas a negative score implies an extracellular one. The parameter set was derived from 894 sequences constituting different protein families in the PIR database, and assessed by application to a test of 379 proteins. The results showed that 88% of intracellular and 84% of extracellular proteins were correctly assigned. The discrimination power was improved by about 8% in comparison with the previous study, which used composition data alone. Segregation of intra/extracellular proteins is also observed by other criteria, such as structural class (intracellular proteins prefer alpha and alpha/beta types and extracellular proteins prefer beta and alpha + beta types). Segregation by sequence was found to be a more reliable procedure for distinguishing intra/extracellular proteins than methods using structural class. Possible causes for this segregation by sequence are discussed.
Article
A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. From a TargetP analysis of the recently sequenced Arabidopsis thaliana chromosomes 2 and 4 and the Ensembl Homo sapiens protein set, we estimate that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%. TargetP also predicts cleavage sites with levels of correctly predicted sites ranging from approximately 40% to 50% (chloroplastic and mitochondrial presequences) to above 70% (secretory signal peptides). TargetP is available as a web-server at http://www.cbs.dtu.dk/services/TargetP/.
Article
The cellular attributes of a protein, such as which compartment of a cell it belongs to and how it is associated with the lipid bilayer of an organelle, are closely correlated with its biological functions. The success of the Human Genome Project and the rapid increase in the number of protein sequences entering data banks have opened a challenging frontier: how to develop a fast and accurate method to predict the cellular attributes of a protein based on its amino acid sequence? The existing algorithms for predicting these attributes were all based on the amino acid composition, in which no sequence order effect is taken into account. To improve the prediction quality, it is necessary to incorporate such an effect. However, the number of possible patterns for protein sequences is extremely large, which has posed a formidable difficulty for realizing this goal. To deal with this difficulty, the pseudo-amino acid composition is introduced. It is a combination of a set of discrete sequence correlation factors and the 20 components of the conventional amino acid composition. A remarkable improvement in prediction quality has been observed by using the pseudo-amino acid composition. The success rates of prediction thus obtained are so far the highest for the same classification schemes and same data sets. It has not escaped our notice that the concept of pseudo-amino acid composition, as well as its mathematical framework and biochemical implication, may also have a notable impact on improving the prediction quality of other protein features.
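The combination of correlation factors with the 20 conventional components can be sketched as follows. This is a simplified illustration, not the published parameterization: it uses a single residue property (the Kyte-Doolittle hydropathy scale as a stand-in), and the names `pseudo_aa_composition`, `max_lag` and `weight`, along with their default values, are assumptions. The sequence is assumed to contain only the 20 standard one-letter codes.

```python
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here only as a stand-in property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def pseudo_aa_composition(seq, scale=HYDROPATHY, max_lag=3, weight=0.05):
    """Return 20 composition terms followed by max_lag correlation terms."""
    if len(seq) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    # Standardize the property over the 20 amino acids (zero mean, unit spread).
    mean = statistics.fmean(scale.values())
    std = statistics.pstdev(scale.values())
    h = {a: (v - mean) / std for a, v in scale.items()}

    n = len(seq)
    freqs = [seq.count(a) / n for a in AMINO_ACIDS]
    # Correlation factor for lag j: mean squared property difference between
    # residues that are j positions apart along the sequence.
    thetas = [
        sum((h[seq[i]] - h[seq[i + j]]) ** 2 for i in range(n - j)) / (n - j)
        for j in range(1, max_lag + 1)
    ]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


features = pseudo_aa_composition("MKVLAAGIVGLCAKQ")
print(len(features))  # prints 23
```

The first 20 entries reduce to the plain amino acid composition when all correlation factors are zero, so `weight` controls how much sequence-order information enters the feature vector.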
Article
The function of a protein is closely correlated to its subcellular location. Is it possible to utilize a bioinformatics method to predict the protein subcellular location? To explore this problem, proteins are classified into 12 groups (Protein Eng. 12 (1999) 107-118) according to their subcellular location: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracellular, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane and (12) vacuole. In this paper, a neural network method is proposed to predict the subcellular location of a protein according to its amino acid composition. The success rates obtained through self-consistency, cross-validation and independent dataset tests are quite high. Accordingly, the present method can serve as a complementary tool for the existing prediction methods in this area.
Article
The structural class and subcellular location are two important features of proteins that are closely related to their biological functions. With the rapid increase in new protein sequences entering data banks, it is highly desirable to develop a fast and accurate method for predicting these features. This can expedite the functionality determination of new proteins and the process of prioritizing genes and proteins identified by genomics efforts as potential molecular targets for drug design. Various prediction methods have been developed during the last two decades. This review is devoted to presenting a systematic introduction and comparison of the existing methods with respect to the prediction algorithm and classification scheme. The attention is focused on the state of the art, which is represented by the covariant-discriminant algorithm developed very recently, as well as some new classification schemes for protein structural classes and subcellular locations. The physical chemistry foundation of the existing prediction methods, and the reason why the covariant-discriminant algorithm is so powerful, are also addressed.
Article
Motivation: The subcellular location of a protein is closely correlated to its function. Thus, computational prediction of subcellular locations from the amino acid sequence information would help annotation and functional prediction of protein coding genes in complete genomes. We have developed a method based on support vector machines (SVMs). Results: We considered 12 subcellular locations in eukaryotic cells: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracellular medium, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane, and vacuole. We constructed a data set of proteins with known locations from the SWISS-PROT database. A set of SVMs was trained to predict the subcellular location of a given protein based on its amino acid, amino acid pair, and gapped amino acid pair compositions. The predictors based on these different compositions were then combined using a voting scheme. Results obtained through 5-fold cross-validation tests showed an improvement in prediction accuracy over the algorithm based on the amino acid composition only. This prediction method is available via the Internet.
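The three kinds of composition named in this abstract can be computed along the following lines. This is a minimal sketch: the function names, the `max_gap` parameter and the choice to skip pairs containing non-standard residues are assumptions, not details from the abstract. Each returned feature set could then feed its own predictor before voting.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues in seq."""
    counts = {a: 0 for a in AMINO_ACIDS}
    for residue in seq:
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def gapped_pair_composition(seq, gap=0):
    """Fraction of each ordered residue pair with `gap` residues in between.

    gap=0 gives adjacent pairs; pairs with non-standard letters are skipped.
    """
    counts = {p: 0 for p in PAIRS}
    step = gap + 1
    for first, second in zip(seq, seq[step:]):
        pair = first + second
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def feature_sets(seq, max_gap=2):
    """One feature vector per composition type, keyed by a descriptive name."""
    sets = {"aa": amino_acid_composition(seq)}
    for gap in range(max_gap + 1):
        sets[f"pair_gap{gap}"] = gapped_pair_composition(seq, gap)
    return sets


sets = feature_sets("MKVLAAGIVGLCAKQ")
print({name: len(vec) for name, vec in sets.items()})
```

Keeping the feature sets separate, rather than concatenating them, is what allows one predictor per set and a vote across predictors.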
Article
The native sub-cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all trafficking appears to be regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structures taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method, certainly in combination with similar tools, may be valuable for target selection in structural genomics.
Article
As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful to predict the subcellular location.
Article
Membrane proteins are generally classified into the following five types: 1), type I membrane protein; 2), type II membrane protein; 3), multipass transmembrane proteins; 4), lipid chain-anchored membrane proteins; and 5), GPI-anchored membrane proteins. In this article, based on the concept of using the functional domain composition to define a protein, the Support Vector Machine algorithm is developed for predicting the membrane protein type. High success rates are obtained by both the self-consistency and jackknife tests. The current approach, complemented with the powerful covariant discriminant algorithm based on the pseudo-amino acid composition that has incorporated quasi-sequence-order effect as recently proposed by K. C. Chou (2001), may become a very useful high-throughput tool in the area of bioinformatics and proteomics.
Article
We present an algorithm for improving the accuracy of algorithms for learning binary concepts. The improvement is achieved by combining a large number of hypotheses, each of which is generated by training the given learning algorithm on a different set of examples. Our algorithm is based on ideas presented by Schapire in his paper "The strength of weak learnability", and represents an improvement over his results. The analysis of our algorithm provides general upper bounds on the resources required for learning in Valiant's polynomial PAC learning framework, which are the best general upper bounds known today. We show that the number of hypotheses that are combined by our algorithm is the smallest number possible. Other outcomes of our analysis are results regarding the representational power of threshold circuits, the relation between learnability and compression, and a method for parallelizing PAC learning algorithms. We provide extensions of our algorithms to cases in which the conc...
Review: prediction of protein structural classes and subcellular locations
  • K C Chou