Article

Proximity based GPCRs prediction in transform domain

Abstract

In this work, we predict G-protein coupled receptors (GPCRs) using the hydrophobicity of amino acid sequences and the Fast Fourier Transform for feature generation. We analyze whether the GPCR classification strategy depends on the way the feature space is exploited. We show that sequence pattern based information can be exploited more easily in the frequency domain by using proximity than by increasing the margin of separation between the classes. We thus develop a simple proximity based approach, the nearest neighbor (NN) classifier, for classifying the 17 GPCR subfamilies. The NN classifier outperforms the one-against-all implementation of the support vector machine on both the jackknife and independent dataset tests. The results validate the importance of understanding and efficiently exploiting the feature space. They also show that simple classification strategies may outperform complex ones when the feature space is exploited efficiently.
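The pipeline described in the abstract (hydrophobicity signal, frequency-domain features, nearest neighbor) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Kyte-Doolittle scale, the fixed signal length `n_points`, the skipping of unknown residues, and the function names are all assumptions, and a plain discrete Fourier transform stands in for the FFT (an FFT routine yields the same coefficients, only faster).

```python
import cmath
import math

# Kyte-Doolittle hydropathy values (assumed scale; the abstract does not name one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_signal(seq):
    """Map a protein sequence to a numeric signal; unknown residues are skipped."""
    return [KD[a] for a in seq.upper() if a in KD]


def spectrum_features(seq, n_points=16):
    """Magnitudes of the first n_points // 2 DFT coefficients.

    The signal is truncated or zero-padded to n_points so that every
    sequence yields a feature vector of the same length (an assumption).
    """
    signal = hydrophobicity_signal(seq)[:n_points]
    signal += [0.0] * (n_points - len(signal))
    feats = []
    for k in range(n_points // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_points)
            for n, x in enumerate(signal)
        )
        feats.append(abs(coeff))
    return feats


def nearest_neighbor(query, train):
    """Return the label of the training sample closest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    return min(train, key=lambda item: math.dist(query, item[0]))[1]
```

A query sequence would be classified with `nearest_neighbor(spectrum_features(query_seq), train)`, where `train` holds the spectra of sequences with known subfamily labels.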

... SVM is a machine learning algorithm based on statistical learning theory first introduced by Vapnik in 1995 [53][54][55][56][57][58][59]. It was updated by Vapnik in 1998 [60]. ...
... In this work, jackknife and independent dataset tests have been employed. During the process of jackknifing, each protein is singled out for testing and the remaining proteins are merged for training [58,61,62]. In the case of the independent dataset test, the model is trained on one dataset and tested on another dataset independently. ...
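The jackknife procedure described in this excerpt can be sketched as below. It is a minimal illustration under stated assumptions: the 1-nearest-neighbor predictor, the Euclidean distance, and the names `nn_predict` and `jackknife_accuracy` are hypothetical choices, not taken from the cited work.

```python
import math


def nn_predict(query, train):
    """Label of the training sample nearest to query (Euclidean distance)."""
    return min(train, key=lambda s: math.dist(query, s[0]))[1]


def jackknife_accuracy(samples, predict=nn_predict):
    """Leave-one-out accuracy over a list of (feature_vector, label) pairs.

    Each sample is singled out for testing in turn, and the predictor
    is given all remaining samples as its training set.
    """
    correct = 0
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if predict(x, rest) == y:
            correct += 1
    return correct / len(samples)
```

An independent dataset test, by contrast, would call `predict(x, train_set)` once for every `x` in a separate test set, with no sample ever shared between the two sets.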
... In the training process, C and s were optimized by parameter optimization (Khan et al. 2008a). ...
... The sample under question X is then assigned to the category that is found in the majority among the k samples. The kNN is considered a simple classifier based on instance-based learning and has been commonly employed in protein prediction problems (Khan et al. 2008a). ...
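The majority-vote rule in this excerpt can be sketched as follows. The Euclidean distance, the default `k=3`, and the function name are assumptions made for illustration only.

```python
import math
from collections import Counter


def knn_predict(query, train, k=3):
    """Assign query to the majority class among its k nearest samples.

    train is a list of (feature_vector, label) pairs. Ties are broken in
    favor of the label that appears first among the nearest neighbors,
    which is a property of this sketch rather than of kNN in general.
    """
    neighbors = sorted(train, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

With `k=1` this reduces to the plain nearest neighbor rule.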
Article
Mitochondria are all-important organelles of eukaryotic cells since they are involved in processes associated with cellular mortality and human diseases. Therefore, trustworthy techniques are highly required for the identification of new mitochondrial proteins. We propose Mito-GSAAC system for prediction of mitochondrial proteins. The aim of this work is to investigate an effective feature extraction strategy and to develop an ensemble approach that can better exploit the advantages of this feature extraction strategy for mitochondria classification. We investigate four kinds of protein representations for prediction of mitochondrial proteins: amino acid composition, dipeptide composition, pseudo amino acid composition, and split amino acid composition (SAAC). Individual classifiers such as support vector machine (SVM), k-nearest neighbor, multilayer perceptron, random forest, AdaBoost, and bagging are first trained. An ensemble classifier is then built using genetic programming (GP) for evolving a complex but effective decision space from the individual decision spaces of the trained classifiers. The highest prediction performance for Jackknife test is 92.62% using GP-based ensemble classifier on SAAC features, which is the highest accuracy, reported so far on the Mitochondria dataset being used. While on the Malaria Parasite Mitochondria dataset, the highest accuracy is obtained by SVM using SAAC and it is further enhanced to 93.21% using GP-based ensemble. It is observed that SAAC has better discrimination power for mitochondria prediction over the rest of the feature extraction strategies. Thus, the improved prediction performance is largely due to the better capability of SAAC for discriminating between mitochondria and non-mitochondria proteins at the N and C terminus and the effective combination capability of GP. Mito-GSAAC can be accessed at http://111.68.99.218/Mito-GSAAC . 
It is expected that the novel approach and the accompanied predictor will have a major impact to Molecular Cell Biology, Proteomics, Bioinformatics, System Biology, and Drug Development.
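Split amino acid composition, which the abstract credits for capturing differences at the N and C terminus, can be sketched as below. This is a hedged illustration: the terminal segment length `term_len=25`, the residue ordering, and the function names are assumptions, not details given in the abstract.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Fraction of each of the 20 standard amino acids in segment."""
    n = len(segment)
    return [segment.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def saac(seq, term_len=25):
    """Concatenate the compositions of the N-terminal, middle and C-terminal parts.

    The result has 60 entries (3 segments x 20 amino acids). For sequences
    shorter than 2 * term_len the terminal segments overlap and the middle
    segment is empty, so its composition is all zeros.
    """
    seq = seq.upper()
    n_term = seq[:term_len]
    middle = seq[term_len:-term_len]
    c_term = seq[-term_len:]
    return composition(n_term) + composition(middle) + composition(c_term)
```

Computing the composition separately per segment keeps terminal signals from being averaged away by the much longer middle region.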
... Chou's PseAAC-based methods achieved an increase of about 20 percent in prediction accuracy over amino acid composition-based methods; (3) the hybrid methods allow features from multiple views to be integrated, which usually increases prediction accuracy [8][9][10]. After the sequence feature was constructed, various classifiers including covariant discriminant (CDC) [10,11], nearest neighbor (NN) [12,13], support vector machine (SVM) [14], deep learning [15] and ensemble classifier [16,17] were adopted to predict protein subcellular localization. ...
Article
The prediction of protein subcellular localization is critical for inferring protein functions, gene regulations and protein-protein interactions. With the advances of high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which enables us to computationally predict yeast protein subcellular localization. However, widely used protein sequence representation techniques, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), have difficulty extracting adequate information about the interactions between residues and the position distribution of each residue. Therefore, novel sequence representations are still urgently needed. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distributions of the residues in the protein primary sequence, and novel statistics and information theory (NSI), reflecting local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is simply represented by a 5-dimensional feature vector, while other popular methods like PseAAC and dipeptide composition adopt features of hundreds of dimensions. In practice, the feature representation is highly efficient in predicting protein subcellular localization. Even without using machine learning-based classifiers, a simple model based on the feature vector can achieve prediction accuracies of 0.8825 and 0.7736 respectively for the CL317 and ZW225 datasets. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view features-based method to combine the two above-mentioned features with other well-known features including PseAAC and dipeptide composition, and use support vector machine as the classifier to predict protein subcellular localization.
This novel model achieves prediction accuracies of 0.927 and 0.871 respectively for the CL317 and ZW225 datasets, better than other existing methods in the jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations with evidence from published articles in authoritative journals and books.
... It is also called an instance-based learner that has no prior knowledge regarding the distribution of data in the feature vector (Keller et al., 1985). It uses the concept of the proximity of the feature vector (Khan et al., 2008a). KNN calculates the distance from the query point to its training neighborhood instances. ...
... This classification algorithm is extensively deployed in the area of bioinformatics (Feng et al., 2014a,b). Support Vector Machine is a supervised learning algorithm founded on statistical learning theory (Khan et al., 2008). SVM transforms the data into a high-dimensional feature space to determine the best margin for the hyper-plane. The border between the dividing line and the support vectors in the training set is shown by hyper-plane, one vs. ...
Article
This study examines an accurate and efficient computational method for identifying 5-methylcytosine sites in RNA. The occurrence of 5-methylcytosine (m5C) plays a vital role in a number of biological processes. To better understand its biological functions and mechanisms, m5C sites in RNA must be recognized precisely. Laboratory techniques are available to identify m5C sites in RNA, but these procedures require a lot of time and resources. This study develops a new computational method for extracting the features of RNA sequences. In this method, the RNA sequence is first encoded as a composite feature vector, and the minimum-redundancy-maximum-relevance algorithm is then used to select discriminative features. Classification is then performed with a support vector machine evaluated by the jackknife cross-validation test. The suggested method efficiently distinguishes m5C sites from non-m5C sites, and the outcome of the suggested algorithm is 93.33%, with a sensitivity of 90.0 and a specificity of 96.66, on benchmark datasets. The results show that the proposed algorithm achieves significant identification performance compared to existing computational techniques. This study extends knowledge about the sites of RNA modification, which paves the way for a better understanding of its biological uses and mechanisms.
... For the investigation of the output of these encoding schemes, different classifiers have been applied. Support vector machine (SVM) (Chou, 2011; Ding and Dubchak, 2001; Lin, 2008; Kumar et al., 2011; Chen et al., 2009; Du et al., 2014; Ding et al., 2014; Hayat and Iqbal, 2014), random forest (RF) (Breiman, 2001), probabilistic neural network (PNN) (Hayat and Khan, 2012; Khan et al., 2008) and k-nearest neighbour (KNN) (Chou, 2011) were utilized as operational engines. Although previous research has shown that machine learning algorithms play a vital role in the prediction of mycobacterium membrane proteins and their types, they are still applied very little to mycobacterium prediction. ...
Article
Mycobacterium is a pathogenic bacterium, which is a causative agent of tuberculosis (TB) and leprosy. These diseases are very serious and cause the death of millions of people every year worldwide. Characterizing the structure of the membrane proteins of this bacterium therefore plays a vital role in drug discovery because, without knowledge of Mycobacterium's membrane proteins and their types, scientists are unable to treat this pathogen. An accurate and competitive computational model is thus needed to characterize this uncharacterized structure of mycobacterium. A series of attempts has been made in this connection. Split amino acid compositions, unbiased dipeptide compositions (Unb-DPC), over-represented tri-peptide compositions, and compositions & translation are a few of the recent encoding techniques used by different researchers in their publications. Although considerable results have been achieved by these models, there is still a gap, which is filled in this study. In this study, an evolutionary feature extraction technique, the position specific scoring matrix (PSSM), is applied in order to extract evolutionary information from protein sequences. Consequently, 99.6% accuracy was achieved by the learning algorithms. The experimental results demonstrate that the proposed computational model can lead to the development of a powerful tool for anti-mycobacterium drugs and can play a promising role in proteomics and bioinformatics.
... It is also called an instance-based classifier. The K-NN classifier first calculates the Euclidean distance between the testing sample and all the training samples (Khan et al., 2008). After that, the K samples that have the minimum distance from the test sample are selected from the feature space, and the test sample is assigned the class that occurs most frequently among them. ...
Article
Outer Membrane Proteins (OMPs) play an essential part in cell biology. The discrimination of OMPs from genomic sequences is a challenging task because of short membrane-spanning regions with high variation in properties. Consequently, an automated and high-throughput computational model for discriminating OMPs from their primary sequences is required. In this paper, we have used K-nearest Neighbor in combination with amino acid composition. The performance of K-nearest Neighbor is evaluated on two datasets using 5-fold cross-validation. After the test, we have observed that K-nearest Neighbor achieves the highest success rate of 96.0% accuracy for discriminating OMPs from non-OMPs and 96.3% and 96.5% accuracies from α-helix membrane and globular proteins, respectively, on dataset1. On dataset2, K-nearest Neighbor obtains 96.4% accuracy for discriminating OMPs from non-OMPs.
... In this paper, first KNN and PNN based ensemble classifiers are developed. KNN is a learning algorithm that is based on the concept of proximity in the feature space [19], while PNN is based on Bayes theory to estimate the likelihood of a sample being part of a learned category [20]. Then these two ensembles are combined to form a composite ensemble. ...
... SVM was first introduced by Cortes and Vapnik in 1995 (Vapnik 2000). It is a very effective method used for the classification of supervised pattern recognition process (Asifullah and Tahir 2008; Khan et al. 2008). Later on it was updated by Vapnik (1998). ...
Article
The mitochondrion is the key organelle of the eukaryotic cell, which provides energy for cellular activities. Submitochondrial locations of proteins play a crucial role in understanding different biological processes such as energy metabolism, programmed cell death, and ionic homeostasis. Prediction of submitochondrial locations through conventional methods is expensive and time consuming because of the large number of protein sequences generated in the last few decades. Therefore, it is highly desirable to establish an automated model for identification of submitochondrial locations of proteins. In this regard, the current study is initiated to develop a fast, reliable, and accurate computational model. Various feature extraction methods such as dipeptide composition (DPC), Split Amino Acid Composition, and Composition and Translation were utilized. In order to overcome the issue of bias, the oversampling technique SMOTE was applied to balance the datasets. Several classification learners including K-Nearest Neighbor, Probabilistic Neural Network, and support vector machine (SVM) are used. The jackknife test is applied to assess the performance of classification algorithms using two benchmark datasets. Among the various classification algorithms, SVM achieved the highest success rates in conjunction with the condensed feature space of DPC, which are 95.20% accuracy on dataset SML3-317 and 95.11% on dataset SML3-983. The empirical results revealed that our proposed model obtained the highest results so far in the literature. It is anticipated that our proposed model might be useful for future studies.
... Customer churn prediction is a binary classification problem but the large dimensionality of the telecommunication dataset makes it difficult for conventional binary classifiers like Support Vector Machines to show desired performance [2]. Similarly, the simplest of the classifiers, KNN, shows good performance on various classification problems [3][4][5] and its hybridized form with Logistic Regression [6] also claims competitive performance for churn prediction. However, this performance is constrained to the application domains where datasets do not possess high dimensionality and imbalanced distribution. ...
Conference Paper
Ensemble classifiers have received increasing attention in recent times for attaining higher classification performance. In this paper, we present the comparative performance of various tree based ensemble classifiers in collaboration with maximum relevancy and minimum redundancy (mRMR), Fisher's ratio and F-score based feature selection schemes for the challenging problem of churn prediction in telecommunication. The large size of telecommunication datasets has been the main hurdle in achieving the desired classification performance in contemporary churn prediction models. Although tree based ensemble classifiers are considered suitable for larger datasets, we have found rotation forest and RotBoost, which employ boosting through feature selection and increase diversity by incorporating a linear feature extraction method such as Principal Component Analysis, to be effective techniques compared to random forest. In addition to the feature selection performed by the ensembles used, we have also incorporated mRMR, Fisher's ratio and F-score techniques for feature selection. Compared to Fisher's ratio and F-score, mRMR returns a coherent and well-discriminating feature set, which significantly reduces the computations and helps the classifier attain improved performance. The performance evaluation is conducted using area under curve, sensitivity and specificity, where RotBoost, an ensemble of rotation forest and AdaBoost, in collaboration with mRMR has shown competitive results for churn prediction in telecommunication as compared to other ensemble methods.
... MCC is a discrete version of Pearson's correlation coefficient, which returns values in the interval [-1, 1]. A value of 1 means the classifier never makes any mistake and a value of -1 means the classifier always makes mistakes ...
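The Matthews correlation coefficient mentioned in this excerpt can be computed from the four confusion-matrix counts of a binary classifier. The sketch below uses the standard formula; the function name and the convention of returning 0.0 when the denominator vanishes are choices made for this illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (every prediction wrong) through 0 (no better than
    chance) to 1 (every prediction right). Returns 0.0 when any marginal
    total is zero, a common convention for the undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike plain accuracy, the coefficient uses all four counts, so it stays informative when one class is much rarer than the other.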
Article
Membrane proteins are fundamental elements of a cell that play essential roles in nearly all cellular processes. Prediction of membrane protein types using biological experiments is often complicated and time consuming. Therefore it is highly desirable to develop a robust, reliable and high-throughput in silico method to predict membrane protein types. In this study, the authors have used two feature extraction strategies known as dipeptide and pseudo amino acid (PseAA) compositions for classification of membrane protein types. In addition, a composite model is also developed by concatenating dipeptide and PseAA composition based features. Further, two feature selection methods, neighbourhood preserving embedding and locally linear embedding (LLE), are applied to reduce the dimensionality of the composite model. The performance of these feature extraction strategies is evaluated using four different classifiers: K-nearest neighbour, probabilistic neural network (PNN), support vector machine (SVM) and grey incidence degree. The highest success rates have been observed using the LLE-based reduced features. SVM has yielded the best accuracy of 88.2% in the jackknife test, while in the independent dataset test, PNN has obtained the highest accuracy of 98.4%. Performance measures other than accuracy are also used, such as the Matthews correlation coefficient, sensitivity and precision. The authors' simulated results show that the composite model has significantly discriminated the types of membrane protein and might be useful for future research and drug discovery.
... Customer churn prediction is a binary classification problem but the large dimensionality and rare share of the minority class in the telecom datasets emerge as major hurdles for conventional classifiers to show desired performance [2]. KNN, a widely used classifier, shows good performance on various classification problems [3][4][5][6] and its hybridized form with Logistic Regression [7] also claims competitive performance for churn prediction. However, this performance is constrained to application domains where datasets do not possess high dimensionality and imbalanced distribution. ...
Article
Churn prediction in telecom has recently gained substantial interest from stakeholders because of the associated revenue losses. Predicting telecom churners is a challenging problem due to the enormous size of the telecom datasets. In this regard, we propose an intelligent churn prediction system for telecom by employing an efficient feature extraction technique and an ensemble method. We have used Random Forest, Rotation Forest, RotBoost and DECORATE ensembles in combination with minimum redundancy and maximum relevance (mRMR), Fisher’s ratio and F-score methods to model the telecom churn prediction problem. We have observed that the mRMR method returns the most explanatory features compared to Fisher’s ratio and F-score, which significantly reduces the computations and helps the ensembles attain improved performance. In comparison to Random Forest, Rotation Forest and DECORATE, RotBoost in combination with mRMR features attains better prediction performance on the standard telecom datasets. The better performance of the RotBoost ensemble is largely attributed to the rotation of the feature space, which enables the base classifier to learn different aspects of the churners and non-churners. Moreover, the AdaBoost process in RotBoost also contributes to achieving higher prediction accuracy by handling hard instances. The performance evaluation is conducted on standard telecom datasets using AUC, sensitivity and specificity based measures. Simulation results reveal that the proposed approach based on RotBoost in combination with mRMR features (CP-MRB) is effective in handling the high dimensionality of the telecom datasets. CP-MRB offers higher accuracy in predicting churners and is thus quite promising for modeling the challenging problem of customer churn prediction in telecom.
... Therefore, computational methods coupled with machine learning techniques are required to determine protein subcellular localizations. In this regard, researchers have endeavoured to develop numerous bioinformatics based prediction systems coupled with machine learning methods to localize a range of proteins (Chebira et al., 2007;Hamilton et al., 2007;Khan et al., 2008;Khan et al., 2011;Lin et al., 2007;Murphy et al., 2003;Nanni and Lumini, 2008;Nanni et al., 2010a;Zhang et al., 2009). Researchers have confirmed that many proteins have been found to be the part of a multi-label system in which they are able to reside in two or more subcellular locations simultaneously or travel across two or more subcellular location sites. ...
Article
A discriminative feature extraction technique is always required for the development of accurate and efficient prediction systems for protein subcellular localization so that effective drugs can be developed. In this work, we showed that Local Ternary Patterns (LTPs) effectively exploit small variations in pixel intensities present in fluorescence microscopy based protein images of human and hamster cell lines. Further, the Synthetic Minority Oversampling Technique is applied to balance the feature space for the classification stage. We observed that LTPs coupled with a data balancing technique could enable a classifier, in this case Support Vector Machine, to yield good performance. The proposed ensemble based prediction system, using 10-fold cross-validation, has yielded better performance compared to existing techniques in predicting various subcellular compartments for both 2D HeLa and CHO datasets. The proposed predictor is available online at: http://111.68.99.218/Protein_SubLoc/, which is freely accessible to the public.
... It is also called an instance-based learner. The K-NN classifier first calculates the Euclidean distance between the testing instance and all the training instances [51]. Subsequently, only the K instances that have the minimum distance from the testing instance are selected from the feature space. ...
Article
Outer membrane proteins (OMPs) play important roles in cell biology. In addition, OMPs are targeted by multiple drugs. The identification of OMPs from genomic sequences and successful prediction of their secondary and tertiary structures is a challenging task due to short membrane-spanning regions with high variation in properties. Therefore, an effective and accurate in silico method for discrimination of OMPs from their primary sequences is needed. In this paper, we have analyzed the performance of various machine learning mechanisms for discriminating OMPs, such as Genetic Programming, K-nearest Neighbor, and Fuzzy K-nearest Neighbor (Fuzzy K-NN), in conjunction with discrete methods such as amino acid composition, amphiphilic pseudo amino acid composition, split amino acid composition (SAAC), and hybrid versions of these methods. The performance of the classifiers is evaluated on two datasets using 5-fold cross-validation. After the simulation, we have observed that Fuzzy K-NN using SAAC-based features is quite effective in discriminating OMPs. Fuzzy K-NN achieves the highest success rates of 99.00% accuracy for discriminating OMPs from non-OMPs and 98.77% and 98.28% accuracies from α-helix membrane and globular proteins, respectively, on dataset1. On dataset2, Fuzzy K-NN achieves 99.55%, 99.90%, and 99.81% accuracies for discriminating OMPs from non-OMPs, α-helix membrane, and globular proteins, respectively. It is observed that the classification performance of our proposed method is satisfactory and is better than the existing methods. Thus, it might be an effective tool for high throughput innovation of OMPs.
... In statistical prediction, three important cross-validation methods (sub-sampling (K-fold), independent dataset, and leave-one-out cross-validation) are often used to check the effectiveness of predictors. Among these three methods, the leave-one-out cross-validation method is considered to be the most rigorous and effective one (Chou and Shen, 2006a; Khan et al., 2008c, 2010; Naveed and Khan, 2011; Rehman and Khan, 2011). Since in the independent dataset test the testing dataset is independent from the training dataset, the selection of independent proteins to test the predictor could be quite arbitrary unless the number of independent proteins is sufficiently large (Chou and Zhang, 1995). ...
Article
About 50% of available drugs are targeted against membrane proteins. Knowledge of membrane protein's structure and function has great importance in biological and pharmacological research. Therefore, an automated method is exceedingly advantageous, which can help in identifying the new membrane protein types based on their primary sequence. In this paper, we tackle the interesting problem of classifying membrane protein types using their sequence information. We consider both evolutionary and physicochemical features and provide them to our classification system based on support vector machine (SVM) with error correction code. We employ a powerful sequence encoding scheme by fusing position specific scoring matrix and split amino acid composition to effectively discriminate membrane protein types. Linear, polynomial, and RBF based-SVM with Bose, Chaudhuri, Hocquenghem coding are trained and tested. The highest success rate of 91.1% and 93.4% on two datasets is obtained by RBF-SVM using leave-one-out cross-validation. Thus, our proposed approach is an effective tool for the discrimination of membrane protein types and might be helpful to researchers/academicians working in the field of Drug Discovery, Cell Biology, and Bioinformatics. The web server for the proposed MemHyb-SVM is accessible at http://111.68.99.218/MemHyb-SVM.
... In this paper, first KNN and PNN based ensemble classifiers are developed. KNN is a learning algorithm that is based on the concept of proximity in the feature space [19], while PNN is based on Bayes theory to estimate the likelihood of a sample being part of a learned category [20]. Then these two ensembles are combined to form a composite ensemble. ...
Article
Knowledge of the types of membrane protein provides useful clues in deducing the functions of uncharacterized membrane proteins. An automatic method for efficiently identifying uncharacterized proteins is thus highly desirable. In this work, we have developed a novel method for predicting membrane protein types by exploiting the discrimination capability of the difference in amino acid composition at the N and C terminus through split amino acid composition (SAAC). We also show that the ensemble classification can better exploit this discriminating capability of SAAC. In this study, membrane protein types are classified using three feature extraction and several classification strategies. An ensemble classifier Mem-EnsSAAC is then developed using the best feature extraction strategy. Pseudo amino acid (PseAA) composition, discrete wavelet analysis (DWT), SAAC, and a hybrid model are employed for feature extraction. The nearest neighbor, probabilistic neural network, support vector machine, random forest, and Adaboost are used as individual classifiers. The predicted results of the individual learners are combined using genetic algorithm to form an ensemble classifier, Mem-EnsSAAC yielding an accuracy of 92.4 and 92.2% for the Jackknife and independent dataset test, respectively. Performance measures such as MCC, sensitivity, specificity, F-measure, and Q-statistics show that SAAC-based prediction yields significantly higher performance compared to PseAA- and DWT-based systems, and is also the best reported so far. The proposed Mem-EnsSAAC is able to predict the membrane protein types with high accuracy and consequently, can be very helpful in drug discovery. It can be accessed at http://111.68.99.218/membrane.
... In addition, in the case of GPCRs, the function-similarity relationship is still unclear. Other methods include the use of covariant discriminant algorithm (Elrod and Chou 2002), support vector machine (SVM) (Karchin et al. 2002), k-nearest neighbors (KNN) (Gao and Wang 2006; Khan et al. 2008a), statistical analysis method (Chou and Elrod 2002), Hidden Markov Models (Qian et al. 2003), and binary topology pattern (Inoue et al. 2004). Ensemble approaches have also been used for protein identification (Huang et al. 2004;). ...
Article
G protein-coupled receptors (GPCRs) are transmembrane proteins, which transduce signals from extracellular ligands to intracellular G protein. Automatic classification of GPCRs can provide important information for the development of novel drugs in pharmaceutical industry. In this paper, we propose an evolutionary approach, GPCR-MPredictor, which combines individual classifiers for predicting GPCRs. GPCR-MPredictor is a web predictor that can efficiently predict GPCRs at five levels. The first level determines whether a protein sequence is a GPCR or a non-GPCR. If the predicted sequence is a GPCR, then it is further classified into family, subfamily, sub-subfamily, and subtype levels. In this work, our aim is to analyze the discriminative power of different feature extraction and classification strategies in case of GPCRs prediction and then to use an evolutionary ensemble approach for enhanced prediction performance. Features are extracted using amino acid composition, pseudo amino acid composition, and dipeptide composition of protein sequences. Different classification approaches, such as k-nearest neighbor (KNN), support vector machine (SVM), probabilistic neural networks (PNN), J48, Adaboost, and Naives Bayes, have been used to classify GPCRs. The proposed hierarchical GA-based ensemble classifier exploits the prediction results of SVM, KNN, PNN, and J48 at each level. The GA-based ensemble yields an accuracy of 99.75, 92.45, 87.80, 83.57, and 96.17% at the five levels, on the first dataset. We further perform predictions on a dataset consisting of 8,000 GPCRs at the family, subfamily, and sub-subfamily level, and on two other datasets of 365 and 167 GPCRs at the second and fourth levels, respectively. In comparison with the existing methods, the results demonstrate the effectiveness of our proposed GPCR-MPredictor in classifying GPCRs families. It is accessible at http://111.68.99.218/gpcr-mpredictor/.
... The prediction of CDC is found by exploiting the variation in the PseAA features of the protein sequence [13]. NN is reported to perform well on classification tasks regarding protein sequences [13,17,18]. The PNN classifier is based on Bayes theory and estimates the likelihood of a sample being part of a learned class [19]. ...
Article
Predicting subcellular localizations of human proteins becomes crucial when new, unknown protein sequences do not have significant homology to proteins of known subcellular locations. In this paper, we present a novel approach to develop the CE-Hum-PLoc system. Individual classifiers are created by selecting a fixed learning algorithm from a pool of base learners and then trained by varying feature dimensions of Amphiphilic Pseudo Amino Acid Composition. The output of the combined ensemble is obtained by fusing the predictions of individual classifiers. Our approach is based on the utilization of diversity in feature and decision spaces. As a demonstration, the predictive performance was evaluated for a benchmark dataset of 12 human protein subcellular locations. The overall accuracies reach up to 80.83% and 86.69% in jackknife and independent dataset tests, respectively. Our method has given an improved prediction as compared to existing methods for this dataset. Our CE-Hum-PLoc system can also be used as a tool for prediction of other subcellular locations.
... However, they do not seem to be sufficiently successful for comprehensive functional identification of GPCRs, since GPCRs make up a highly divergent family, and even when they are grouped according to similarity of function, their sequences share strikingly little homology or similarity to each other [13]. The third one is based on statistical and machine learning method, including support vector machines (SVM) [8,14-17], hidden Markov models (HMMs) [1,3,6,18], covariant discriminant (CD) [7,11,19,20], nearest neighbor (NN) [2,21] and other techniques [13,22-24]. ...
Article
Because a priori knowledge about the function of G protein-coupled receptors (GPCRs) can provide useful information to pharmaceutical research, the determination of their function is a meaningful topic in protein science. However, with the rapid increase of GPCR sequences entering databanks, the gap between the number of known sequences and the number of known functions is widening rapidly, and it is both time-consuming and expensive to determine their function based only on experimental techniques. Therefore, it is vitally significant to develop a computational method for quick and accurate classification of GPCRs. In this study, a novel three-layer predictor based on support vector machine (SVM) and feature selection is developed for predicting and classifying GPCRs directly from amino acid sequence data. The maximum relevance minimum redundancy (mRMR) method is applied to pre-evaluate features with discriminative information, while a genetic algorithm (GA) is utilized to find the optimized feature subsets. SVM is used for the construction of classification models. The overall accuracies of the three-layer predictor at the levels of superfamily, family and subfamily are obtained by cross-validation tests on two non-redundant datasets. The results are about 0.5% to 16% higher than those of GPCR-CA and GPCRPred. The results with high success rates indicate that the proposed predictor is a useful automated tool in predicting GPCRs. GPCR-SVMFS, a corresponding executable program for GPCR prediction and classification, can be acquired freely on request from the authors.
Article
This study investigates an efficient and accurate computational method for predicting mycobacterial membrane proteins. Mycobacterium is a pathogenic bacterium which is the causative agent of tuberculosis (TB) and leprosy. The existing feature encoding algorithms for protein sequence representation, such as composition and translation, and split amino acid composition, cannot suitably express the mycobacterium membrane proteins and their types due to bias among different types. Therefore, in this study a novel un-biased dipeptide composition (Unb-DPC) method is proposed. The proposed encoding scheme has two advantages. First, it avoids the bias among the different mycobacterium membrane proteins and their types. Second, the method is fast and preserves protein sequence structure information. The experimental results yield an SVM based classification accuracy of 97.1% for membrane protein types and 95.0% for discriminating mycobacterium membrane and non-membrane proteins using the jackknife cross-validation test. The results show that the proposed model achieved significant predictive performance compared to the existing algorithms and may lead to a powerful tool for developing anti-mycobacterium drugs.
Conference Paper
Membrane proteins play an important role in many biological processes and are attractive drug targets. In this study, membrane proteins are classified using two feature extraction and several classification strategies. The first feature extraction strategy is pseudo amino acid (PseAA) composition, which uses hydrophobicity and hydrophilicity to reflect the sequence order effects, while the second is discrete wavelet analysis (DWT), which analyzes the different components of a signal localized in both the space and scale domains. The nearest neighbor, probabilistic neural network, support vector machine, random forest, and Adaboost are used as basic learning mechanisms. The predicted results of the base learners are combined using majority voting to form an ensemble classifier. The best accuracy obtained for the jackknife and independent dataset tests is 85.4% and 95.3%, respectively. Using performance measures such as MCC, sensitivity, specificity, and F-measure, it has been observed that the performance of PseAA based prediction is significantly higher than that of DWT, and is also the best reported so far.
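The majority-voting step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the authors' implementation; the function names `majority_vote` and `ensemble_predict` and the tie-breaking rule (first label seen wins) are assumptions made here for illustration only.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among the base learners' votes.

    Assumption: ties are broken in favour of the label seen first.
    """
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier_predictions):
    """Combine predictions from several base learners.

    per_classifier_predictions: one list of labels per classifier,
    all of the same length (one label per sample).
    """
    # zip(*...) regroups the labels so that each tuple holds all votes
    # cast for a single sample.
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]
```

For example, three hypothetical base learners that each label the same three samples would be passed as three lists, and the result holds one fused label per sample.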
Article
Precise information about protein locations in a cell facilitates in the understanding of the function of a protein and its interaction in the cellular environment. This information further helps in the study of the specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on fusion of individual classifiers. The proposed approach utilizes features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of the individual learning mechanisms such as support vector machine, nearest neighbor, probabilistic neural network, covariant discriminant, which are trained using PseAAC based features is first analyzed. Classifiers are developed using same learning mechanism but trained on PseAAC based feature spaces of varying dimensions. These classifiers are combined through voting strategy and an improvement in prediction performance is achieved. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both DC based feature space and PseAAC based feature spaces of varying dimensions. The predictive performance of proposed CE-PLoc is evaluated for two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47 and 83.99% are obtained for 12 and 14 subcellular locations datasets, respectively. In case of independent dataset test, prediction accuracies are 87.04 and 87.33% for 12 and 14 class datasets, respectively.
Article
G-protein coupled receptors (GPCRs) are a protein family found only in eukaryotes. They interface the cell with the outside world and are involved in many physiological processes. Their role in drug development is evident; hence, the prediction of GPCRs is in high demand. Because 3D structures are unavailable for most GPCRs, statistical and machine learning based prediction of GPCRs is particularly needed. The GPCRs are classified into family, subfamily and sub-subfamily levels in the proposed approach. We have extracted features using the hybrid combination of pseudo amino acid, Fast Fourier Transform and split amino acid techniques. The overall feature vector is then reduced using principal component analysis. Mostly, GPCRs are composed of two or more subunits. The arrangement and number of subunits forming a GPCR are referred to as quaternary structure. The functions of GPCRs are closely related to their quaternary structure. The classification in the present research is performed using the grey incidence degree (GID) measure, which can efficiently analyze the numerical relation between various components of GPCRs. The GID measure based classification has shown remarkable improvement in predicting GPCRs.
Article
G-protein-coupled receptors (GPCRs) are the largest family of cell surface receptors that, via trimeric guanine nucleotide-binding proteins (G-proteins), initiate some signaling pathways in the eukaryotic cell. Many diseases involve malfunction of GPCRs, making their role evident in drug discovery. Thus, the automatic prediction of GPCRs can be very helpful in the pharmaceutical industry. However, prediction of GPCRs, their families, and their subfamilies is a challenging task. In this article, GPCRs are classified into families, subfamilies, and sub-subfamilies using pseudo-amino-acid composition and multiscale energy representation of different physiochemical properties of amino acids. The aim of the current research is to assess different feature extraction strategies and to develop a hybrid feature extraction strategy that can exploit the discrimination capability in both the spatial and transform domains for GPCR classification. Support vector machine, nearest neighbor, and probabilistic neural network are used for classification purposes. The overall performance of each classifier is computed individually for each feature extraction strategy. It is observed that, using the jackknife test, the proposed GPCR-hybrid method provides the best results reported so far. The GPCR-hybrid web predictor, intended to help researchers working on GPCRs in the field of biochemistry and bioinformatics, is available at http://111.68.99.218/GPCR.
Article
A novel approach CE-Ploc is proposed for predicting protein subcellular locations by exploiting diversity both in feature and decision spaces. The diversity in a sequence of feature spaces is exploited using hydrophobicity and hydrophilicity of amphiphilic pseudo amino acid composition and a specific learning mechanism. Diversity in learning mechanisms is exploited by fusion of classifiers that are based on different learning mechanisms. Significant improvement in prediction performance is observed using jackknife and independent dataset tests.
Article
In this paper, we have investigated the problem of gender classification using frontal facial images. Four different classifiers, namely K-means, k-nearest neighbors, Linear Discriminant Analysis and Mahalanobis Distance Based classifiers are compared. The receiver operating characteristics (ROC) curve along with the area under the convex hull (AUCH) have been utilized as the performance measures of the classifiers at different feature subsets. To measure the overall performance of a classifier with a single scalar value, a new scheme of finding the area under the convex hull of AUCH of ROC curves (AUCH of AUCHS) is proposed. It has been observed that, when the number of macro features is increased beyond 5, the AUCH saturates and even decreases for some classifiers, illustrating the curse of dimensionality. We then used genetic programming to combine classifiers and thus evolved an optimum combined classifier (OCC), producing better performance than the individual classifiers. We found that using only two features, the OCC has comparable performance to that of the original classifier using 20 macro features. It produces true positive rate values as high as 0.94 corresponding to a false positive rate as low as 0.15 for a 1:3 training-to-testing ratio. We also observed that a heterogeneous combination of classifiers is more promising than a homogeneous combination.
Article
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
Article
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
Article
Pfam is a large collection of protein families and domains. Over the past 2 years the number of families in Pfam has doubled and now stands at 6190 (version 10.0). Methodology improvements for searching the Pfam collection locally as well as via the web are described. Other recent innovations include modelling of discontinuous domains allowing Pfam domain definitions to be closer to those found in structure databases. Pfam is available on the web in the UK (http://www.sanger.ac.uk/Software/Pfam/), the USA (http://pfam.wustl.edu/), France (http://pfam.jouy.inra.fr/) and Sweden (http://Pfam.cgb.ki.se/).
Article
The vast cell-surface receptor family of G-protein coupled receptors (GPCRs) is the focus of both academic and pharmaceutical research due to their key role in cell physiology along with their amenability to drug intervention. As the data flow rate from the various genome and proteome projects continues to grow, so does the need for fast, automated and reliable screening for new members of the various GPCR families. PRED-GPCR is a free Internet service for GPCR recognition and classification at the family level. A submitted sequence or set of sequences, is queried against the PRED-GPCR library, housing 265 signature profile HMMs corresponding to 67 well-characterized GPCR families. Users query the server through a web interface and results are presented in HTML output format. The server returns all single-motif matches along with the combined results for the corresponding families. The service is available online since October 2003 at http://bioinformatics.biol.uoa.gr/PRED-GPCR.
Article
The receptors of the amine subfamily are major drug targets for therapy of nervous disorders and psychiatric diseases. The recognition of novel amine-type receptors and their cognate ligands is of paramount interest for pharmaceutical companies. In the past, Chou and co-workers have shown that different types of amine receptors are correlated with their amino acid composition and are predictable on its basis with considerable accuracy [Elrod and Chou (2002) Protein Eng., 15, 713-715]. This motivated us to develop a better method for the recognition of novel amine receptors and for their further classification. The method was developed on the basis of amino acid composition and dipeptide composition of proteins using support vector machine. The method was trained and tested on 167 proteins of the amine subfamily of G-protein-coupled receptors (GPCRs). The method discriminated the amine subfamily of GPCRs from globular proteins with a Matthews correlation coefficient of 0.98 and 0.99 using amino acid composition and dipeptide composition, respectively. In classifying different types of amine receptors using amino acid composition and dipeptide composition, the method achieved an accuracy of 89.8 and 96.4%, respectively. The performance of the method was evaluated using 5-fold cross-validation. The dipeptide composition based method predicted 67.6% of protein sequences with an accuracy of 100% with a reliability index ≥5. A web server, GPCRsclass, has been developed for predicting amine-binding receptors from their amino acid sequences [http://www.imtech.res.in/raghava/gpcrsclass/ and http://bioinformatics.uams.edu/raghava/gpersclass/ (mirror site)].
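The Matthews correlation coefficient reported in this abstract is computed from the four cells of a binary confusion matrix. The following is a minimal sketch of the standard formula; the function name and the convention of returning 0.0 when the denominator vanishes are assumptions made here, not details taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    tp, tn, fp, fn: counts of true positives, true negatives,
    false positives and false negatives.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct), which makes it more informative than accuracy when the two classes differ in size.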
Article
G-protein coupled receptors (GPCRs) are transmembrane proteins which, via G-proteins, initiate some of the important signaling pathways in a cell and are involved in various physiological processes. Thus, computational prediction and classification of GPCRs can supply significant information for the development of novel drugs in the pharmaceutical industry. In this paper, a nearest neighbor method has been introduced to discriminate GPCRs from non-GPCRs and subsequently classify GPCRs at four levels on the basis of amino acid composition and dipeptide composition of proteins. Its performance is evaluated on a non-redundant dataset consisting of 1406 GPCRs for six families and 1406 globular proteins using the jackknife test. The present method based on amino acid composition achieved an overall accuracy of 96.4% and a Matthews correlation coefficient (MCC) of 0.930 for correctly picking out the GPCRs from globular proteins. The overall accuracy and MCC were further enhanced to 99.8% and 0.996 by the dipeptide composition-based method. On the other hand, the present method has successfully classified 1406 GPCRs into six families with an overall accuracy of 89.6 and 98.8% using amino acid composition and dipeptide composition, respectively. For the subfamily prediction of 1181 GPCRs of the rhodopsin-like family, the present method achieved an overall accuracy of 76.7 and 94.5% based on the amino acid composition and dipeptide composition, respectively. Finally, GPCRs belonging to the amine subfamily and olfactory subfamily of the rhodopsin-like family were further analyzed at the type level. The overall accuracy of the dipeptide composition-based method for the classification of amine type and olfactory type of GPCRs reached 94.5 and 86.9%, respectively, while the overall accuracy of the amino acid composition-based method was very low for both subfamilies. In comparison with existing methods in the literature, the present method also displayed great competitiveness. 
These results demonstrate the effectiveness of our method on identifying and classifying GPCRs correctly. GPCRsIdentifier, a corresponding stand-alone executable program for GPCR identification and classification was also developed, which can be acquired freely on request from the authors for academic purposes.
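The pipeline described in this abstract (dipeptide composition features, a nearest neighbor rule, and the jackknife test) can be sketched in a few lines. This is an illustrative sketch only: the Euclidean distance, the handling of non-standard residues, and all function names are assumptions made here and are not taken from GPCRsIdentifier.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the 400-dimensional dipeptide frequency vector of a sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        pair = a + b
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def euclidean(u, v):
    """Euclidean distance (assumed metric; the paper may use another)."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


def nearest_neighbor(query, vectors, labels):
    """Return the label of the training vector closest to the query."""
    best = min(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return labels[best]


def jackknife_accuracy(sequences, labels):
    """Jackknife test: hold out each sequence once and predict it from the rest."""
    vectors = [dipeptide_composition(s) for s in sequences]
    correct = 0
    for i in range(len(vectors)):
        train_vectors = vectors[:i] + vectors[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_neighbor(vectors[i], train_vectors, train_labels) == labels[i]:
            correct += 1
    return correct / len(vectors)
```

Because the nearest neighbor rule has no training phase, the jackknife test costs only one distance scan per held-out protein, which is part of why it pairs naturally with this classifier.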
Article
Assigning subcellular localization (SL) to proteins is one of the major tasks of functional proteomics. Despite the impressive technical advances of the past decades, it is still time-consuming and laborious to experimentally determine SL on a high throughput scale. Thus, computational predictions are the preferred method for large-scale assignment of protein SL, and if appropriate, followed up by experimental studies. In this report, using a machine learning approach, the Nearest Neighbor Algorithm (NNA), we developed a prediction system for protein SL in which we incorporated a protein functional domain profile. The overall accuracy achieved by this system is 93.96%. Furthermore, comparisons with other methods have been conducted to demonstrate the validity and efficiency of our prediction system. We also provide an implementation of our Subcellular Location Prediction System (SLPS), which is available at http://pcal.biosino.org.
Article
Motivation: The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-Protein Coupled Receptors (GPCRs), a superfamily of cell membrane proteins. GPCRs are found in a wide range of organisms and are central to a cellular signalling network that regulates many basic physiological processes. They are the focus of a significant amount of current pharmaceutical research because they play a key role in many diseases. However, their tertiary structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile Hidden Markov Model (HMM), and methods, including Support Vector Machines (SVMs), that transform protein sequences into fixed-length feature vectors. Results: The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the Minimum Error Point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vector Kernel Nearest Neighbor (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% for kernNN.
Article
We have developed an alignment-independent method for classification of G-protein coupled receptors (GPCRs) according to the principal chemical properties of their amino acid sequences. The method relies on a multivariate approach where the primary amino acid sequences are translated into vectors based on the principal physicochemical properties of the amino acids and transformation of the data into a uniform matrix by applying a modified autocross-covariance transform. The application of principal component analysis to a data set of 929 class A GPCRs showed a clear separation of the major classes of GPCRs. The application of partial least squares projection to latent structures created a highly valid model (cross-validated correlation coefficient, Q(2) = 0.895) that gave unambiguous classification of the GPCRs in the training set according to their ligand binding class. The model was further validated by external prediction of 535 novel GPCRs not included in the training set. Of the latter, only 14 sequences, confined in rapidly expanding GPCR classes, were mispredicted. Moreover, 90 orphan GPCRs out of 165 were tentatively identified to GPCR ligand binding class. The alignment-independent method could be used to assess the importance of the principal chemical properties of every single amino acid in the protein sequences for their contributions in explaining GPCR family membership. It was then revealed that all amino acids in the unaligned sequences contributed to the classifications, albeit to varying extent; the most important amino acids being those that could also be determined to be conserved by using traditional alignment-based methods.
Article
In this paper, based on the approach by combining the "functional domain composition" [K.C. Chou, Y. D. Cai, J. Biol. Chem. 277 (2002) 45765] and the pseudo-amino acid composition [K.C. Chou, Proteins Struct. Funct. Genet. 43 (2001) 246; Correction Proteins Struct. Funct. Genet. 2044 (2001) 2060], the Nearest Neighbour Algorithm (NNA) was developed for predicting the protein subcellular location. Very high success rates were observed, suggesting that such a hybrid approach may become a useful high-throughput tool in the area of bioinformatics and proteomics.
Article
The functional domain composition is introduced to predict the structural class of a protein or domain according to the following classification: all-alpha, all-beta, alpha/beta, alpha+beta, micro (multi-domain), sigma (small protein), and rho (peptide). The advantage by doing so is that both the sequence-order-related features and the function-related features are naturally incorporated in the predictor. As a demonstration, the jackknife cross-validation test was performed on a dataset that consists of proteins and domains with only less than 20% sequence identity to each other in order to get rid of any homologous bias. The overall success rate thus obtained was 98%. In contrast to this, the corresponding rates obtained by the simple geometry approaches based on the amino acid composition were only 36-39%. This indicates that using the functional domain composition to represent the sample of a protein for statistical prediction is very promising, and that the functional type of a domain is closely correlated with its structural class.
Article
Although the sequence information on G-protein coupled receptors (GPCRs) continues to grow, many GPCRs remain orphaned (i.e. ligand specificity unknown) or poorly characterized with little structural information available, so an automated and reliable method is badly needed to facilitate the identification of novel receptors. In this study, a fast Fourier transform-based support vector machine method has been developed for predicting GPCR subfamilies according to the protein's hydrophobicity. In classifying Class B, C, D and F subfamilies, the method achieved an overall Matthews correlation coefficient and accuracy of 0.95 and 93.3%, respectively, when evaluated using the jackknife test. The method achieved an accuracy of 100% on the Class B independent dataset. The results show that this method can classify GPCR subfamilies as well as their functional classification with high accuracy. A web server implementing the prediction is available at http://chem.scu.edu.cn/blast/Pred-GPCR.
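The feature-generation idea in this abstract, mapping residues to hydrophobicity values and moving to the frequency domain, can be sketched as follows. Several assumptions are made here: the Kyte-Doolittle hydropathy index stands in for whatever scale the authors used, a plain discrete Fourier transform replaces the FFT algorithm (it yields the same coefficients, only more slowly), the mean is subtracted before transforming, and the function names are hypothetical.

```python
import cmath

# Kyte-Doolittle hydropathy index (assumed scale; the paper may use another).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_signal(seq):
    """Map a protein sequence to a numeric signal, skipping unknown residues."""
    return [HYDROPATHY[a] for a in seq if a in HYDROPATHY]


def power_spectrum(signal):
    """Return |X_k|^2 for k = 0..n//2 using a naive O(n^2) DFT.

    The mean is removed first (an assumption) so that the k = 0 term
    does not dominate the spectrum.
    """
    n = len(signal)
    if n == 0:
        return []
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum
```

Since sequences differ in length, the spectra do too; a classifier would need them reduced to a fixed-length vector first, a step the abstract does not describe and this sketch leaves out.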
Article
Proteins are classified mainly on the basis of alignments of amino acid sequences. Drug discovery processes based on pharmacologically important proteins such as G-protein-coupled receptors (GPCRs) may be facilitated if more information is extracted directly from the primary sequences. Here, we investigate an alignment-free approach to protein classification using self-organizing maps (SOMs), a kind of artificial neural network, which needs only primary sequences of proteins and determines their relative locations in a two-dimensional lattice of neurons through an adaptive process. We first showed that a set of 1397 aligned samples of Class A GPCRs can be classified by our SOM program into 15 conventional categories with 99.2% accuracy. Similarly, a nonaligned raw sequence data set of 4116 samples was categorized into 15 conventional families with 97.8% accuracy in a cross-validation test. Orphan GPCRs were also classified appropriately using the result of the SOM learning. A supposedly diverse family of olfactory receptors formed the most distinctive cluster in the map, whereas amine and peptide families exhibited diffuse distributions. A feature of this kind in the map can be interpreted to reflect hierarchical family composition. Interestingly, some orphan receptors that were categorized as olfactory were somatosensory chemoreceptors. These results suggest the applicability and potential of the SOM program to classification prediction and knowledge discovery from protein sequences.
Article
The G-protein coupled receptor (GPCR) superfamily fulfils various metabolic functions and interacts with a diverse range of ligands. There is a lack of sequence similarity between the six classes that comprise the GPCR superfamily. Moreover, most novel GPCRs found have low sequence similarity to other family members which makes it difficult to infer properties from related receptors. Many different approaches have been taken towards developing efficient and accurate methods for GPCR classification, ranging from motif-based systems to machine learning as well as a variety of alignment-free techniques based on the physiochemical properties of their amino acid sequences. This review describes the inherent difficulties in developing a GPCR classification algorithm and includes techniques previously employed in this area.
Article
Functioning as an "address tag" that directs nascent proteins to their proper cellular and extracellular locations, signal peptides have become a crucial tool in finding new drugs or reprogramming cells for gene therapy. To effectively and timely use such a tool, however, the first important thing is to develop an automated method for rapidly and accurately identifying the signal peptide for a given nascent protein. With the avalanche of new protein sequences generated in the post-genomic era, the challenge has become even more urgent and critical. In this paper, we have developed a novel method for predicting signal peptide sequences and their cleavage sites in human, plant, animal, eukaryotic, Gram-positive, and Gram-negative protein sequences, respectively. The new predictor is called Signal-3L that consists of three prediction engines working, respectively, for the following three progressively deepening layers: (1) identifying a query protein as secretory or non-secretory by an ensemble classifier formed by fusing many individual OET-KNN (optimized evidence-theoretic K nearest neighbor) classifiers operated in various dimensions of PseAA (pseudo amino acid) composition spaces; (2) selecting a set of candidates for the possible signal peptide cleavage sites of a query secretory protein by a subsite-coupled discrimination algorithm; (3) determining the final cleavage site by fusing the global sequence alignment outcome for each of the aforementioned candidates through a voting system. Signal-3L is featured by high success prediction rates with short computational time, and hence is particularly useful for the analysis of large-scale datasets. 
Signal-3L is freely available as a web-server at http://chou.med.harvard.edu/bioinf/Signal-3L/ or http://202.120.37.186/bioinf/Signal-3L, where, to further support the demand of the related areas, the signal peptides identified by Signal-3L for all the protein entries in Swiss-Prot databank that do not have signal peptide annotations or are annotated with uncertain terms but are classified by Signal-3L as secretory proteins are provided in a downloadable file. The large-scale file is prepared with Microsoft Excel and named "Tab-Signal-3L.xls", and will be updated once a year to include new protein entries and reflect the continuous development of Signal-3L.
Article
Given a protein sequence, how can we identify whether it is an enzyme or non-enzyme? If it is, which main functional class does it belong to? What about its sub-functional class? It is important to address these problems because they are closely correlated with the biological function of an uncharacterized protein and its acting object and process. Particularly, with the avalanche of protein sequences generated in the Post Genomic Age and the much slower progress in determining their functions by experiments, it is highly desirable to develop an automated method by which one can get a fast and accurate answer to these questions. Here, a top-down predictor, called EzyPred, is developed by fusing the results derived from the functional domain and evolution information. EzyPred is a 3-layer predictor: the 1st layer prediction engine is for identifying a query protein as enzyme or non-enzyme; the 2nd layer for the main functional class; and the 3rd layer for the sub-functional class. The overall success rates for all three layers are higher than 90%; they were obtained through rigorous cross-validation tests on very stringent benchmark datasets in which none of the proteins has ≥40% sequence identity to any other in the same class or subclass. EzyPred is freely accessible at http://chou.med.harvard.edu/bioinf/EzyPred/, by which one can get the desired 3-level results for a query protein sequence in less than 90 s.
Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389–3402.