
Prediction of Protein Function based on Machine Learning Methods: An Overview

Kiran Kadam, Sangeeta Sawant, Urmila Kulkarni-Kale
Bioinformatics Centre
University of Pune, Pune, India
Valadi K. Jayaraman
Center for Development of Advanced Computing (C-DAC)
University of Pune Campus, Pune, India
Shiv Nadar University, Chithera (Gautam Budh Nagar), India
1 Introduction
Proteins represent the most important class of biomolecules in living organisms. They carry out the majority
of the cellular processes and act as structural constituents, catalytic agents, signaling molecules and
molecular machines of every biological system (Eisenberg et al., 2000). An understanding of protein
function is, thus, essential for studying any biological process. In addition to experimental
approaches and methods, several bioinformatics approaches have been developed and are being used to
assign and predict functions on the basis of sequences and structures of proteins. The availability of an
overwhelming amount of genomic data from a large number of genome sequencing projects has further
intensified the need for function prediction. Attempts are being made to experimentally solve structures
of a large number of proteins under the structural genomics initiatives (Terwilliger et al., 2009).
Computational methods have been developed to expand the structural repertoire of proteins by predicting
structures of homologous proteins. As a result, sequences and structures of a very large number of
proteins are becoming available but their functions are not known. While classical experimental
approaches have proved to be extremely useful, there are practical limitations in scaling them up to the genome
level (Saghatelian & Cravatt, 2005). Due to the obstacles faced by the experimental approaches,
computational analysis for prediction of protein function has become necessary. Biological
functions of proteins are described at various levels of biocomplexity such as biochemical, cellular,
physiological and phenotypic levels. Gene Ontology (GO) terms provide a basis for precise description
and understanding of various levels of protein function (Ashburner et al., 2000). It is imperative to
understand that it is the molecular or biochemical function of a protein that is illustrated using sequence
and/or structure data and hence in silico approaches help to predict molecular function of a protein
(Friedberg, 2006). The current status of sequence- and structure-based approaches for protein function
prediction is briefly explained below. These methods have been extensively reviewed elsewhere
(Sleator, 2012).
1.1 Sequence-Based Approaches for Prediction of Protein Function
The amino acid sequence of a protein is known as the primary structure of a protein and it is the most
fundamental form of information available about the protein. It plays the most critical role in determining
various characteristics of the protein such as its structure, function and sub-cellular localization. Because
of this, amino acid sequence has tremendous potential to be used extensively for functional annotation of
proteins (Bork & Koonin, 1998).
The sequence-to-function approach, which is based on homology, is very common in function prediction.
Two main strategies are used in this approach. The first strategy includes
methods based on global and local sequence alignments (Needleman & Wunsch, 1970; Smith &
Waterman, 1981; Altschul et al., 1990; Pearson, 1996; Sturrock & Collins, 1993) and the second includes
methods based on sequence motifs (Bairoch et al., 1995; Henikoff & Henikoff, 1994; Attwood et al.,
1994). Identification of protein function by alignment-based sequence similarity search using BLAST
(Altschul et al., 1990) or FASTA (Pearson & Lipman, 1988) to find similar proteins in public domain
databases is the most popular approach. The annotations of significant hits are used for prediction of
protein function. However, it has been found that proteins that have diverged from a common ancestral
gene may have the same function but no detectable sequence similarity (Benner et al., 2000). Methods
based on sequence profiles such as PSI-BLAST (Altschul et al., 1997) have been developed which
provide high sensitivity for detecting remote homologs. Sequence motifs and patterns are also used for
detection of close and distant homologues. This approach yields a much higher sensitivity as well as
specificity of function prediction as compared to alignment-based methods since many functional motifs
and patterns have been identified for a large number of protein families. In spite of this, sequence
similarity-based approaches may not be always adequate for function identification of novel proteins.
1.2 Structure-Based Approaches for Prediction of Protein Function
Computational methods utilizing three-dimensional (3D) structures of proteins can be employed when
sequence-based function prediction cannot be achieved with a high level of confidence. This is due to the fact
that protein structures are more conserved than sequences during evolution. Various aspects of structural
data such as the overall fold, active site residues and their conformation, interactions with ligands and
other biomolecules can provide insights into functions of proteins. Consequently, various categories of
methods are available for structure-based function prediction. The methods that utilize the fold
information depend on global and local structural alignments algorithms (Holm & Sander, 1993; Madej
et al., 1995; Orengo & Taylor, 1996; Harrison et al., 2003). Global and local conformational similarities
between proteins indicate functional similarities and are useful for inferring functions of novel proteins.
Several methods have been developed to identify surface pockets and cavities in protein structures as
well, that help in identification of potential active/binding sites and amino acid residues therein (Capra et
al., 2009; Najmanovich et al., 2008; Gold & Jackson, 2006; Chang et al., 2006; Wass et al., 2010). This
approach is especially useful for prediction of enzymatic functions. Detection of similar local geometries
of functionally important residues implies similar functions even in distantly related proteins (Torrance et
al., 2005). The availability of co-crystal structures of protein-ligand, protein-protein and protein-DNA/RNA
complexes has enabled characterization of detailed atomic interactions. Analyses of these structures
have provided valuable insights into the principles that govern intermolecular interactions that are
important for the functions of proteins and have led to development of approaches for prediction of
function (Kinoshita et al., 2008). Further, the application of molecular dynamics and
docking simulations has opened new vistas in the understanding of molecular motions and interactions
involved in function. Simulations also offer rich data to understand the detailed atomic-level mechanisms
of function (Glazer et al., 2009; Dodson et al., 2008; Pierri et al., 2010; Favia & Nobeli, 2011; Chang et
al., 2005). A combination of various structure-based approaches viz. structural alignments, active site
identification, simulations and characterization of molecular interactions provides a useful methodology
towards function prediction. Availability of high-resolution structural data of the target proteins or their
homologues, however, remains the major limitation of this methodology.
1.3 Machine Learning-Based Approaches for Prediction of Protein Function
Until recently, sequence- and structure-based methods that utilize homology relationships among proteins
have dominated the prediction of functions. However, these methods are limited by the
availability of adequate data on homologous proteins. They also fail when homology relationships cannot
be established for target proteins (Whisstock & Lesk, 2003). Even though sequence similarity is
correlated to functional similarity, exceptions are observed on both ends of the similarity scale (Galperin
et al., 1998; Rost, 2002). Protein function prediction based on the structure has been restricted in scope
because of the availability of limited number of structures and folds in the databases. All of these factors
have contributed to the development of approaches for computational function prediction, which can
utilize several other important features in sequences and structures along with similarity measures.
Among these, machine learning-based approaches have been found particularly useful in
predicting various functional aspects in proteins. The major advantage of these methods is their ability to
map the problem of function prediction to the problem of generating classification models (Tan et al.,
2005). Machine learning based methods utilize protein sequence and/or structure data, represented by
transformed and more meaningful information in the form of feature vectors. It has been shown that
using machine learning classifiers, it is possible to predict the function of hypothetical proteins based on
features of amino acid sequences of well characterized proteins, without using homology information
(Han et al., 2006). Similarly, more and more machine learning based methods are being developed which
use 3D structure and function data of well characterized proteins to predict functions of unknown
proteins (Al-Shahib et al., 2007). The efficacy of these methods has been demonstrated through several
recent studies and it is therefore important that bioinformatics researchers are aware of and have a basic
understanding of these methods. In the succeeding sections of the chapter, various machine learning
methods and their applications to the prediction of protein function are described.
2 Machine Learning Algorithms
The essence of machine learning algorithms lies in development of models from the existing data and
subsequently, classification and/or prediction using novel data. Methods based on machine learning
algorithms are grouped into two classes, namely supervised and unsupervised. In supervised learning,
predefined class labels are available for all the training examples. This labeled training data is used to
build a model, which is used to predict the class of new input data. In unsupervised learning, predefined
class labels for the training examples are not available. The basic aim of unsupervised methods is to
discover the patterns hidden in the input data and group (cluster) the data appropriately.
Methods based on machine learning algorithms have been used extensively for various
applications in the field of biology (Tarca et al., 2007). These methods have been utilized in diverse
domains like genomics, proteomics and systems biology (Larrañaga et al., 2006). Specifically, supervised
machine learning approaches have found immense importance in numerous bioinformatics prediction
methods. A brief review of methodologies for prediction of protein function with special emphasis on
machine learning methods is available (Zhao et al., 2008). The aim of the present article is to provide an
overview of the machine learning algorithms as well as application methods based on these algorithms.
Artificial Neural Network (ANN)
This algorithm is based on the concept of biological neurons. In biological systems, the learning process is
based on minor adjustments to the synaptic connections between neurons, while in ANNs, learning
is carried out by adjusting the weights of the interconnections between the processing elements that constitute the
network topology. Typically, an ANN consists of three layers, viz. an input layer, a hidden layer and an output layer.
The network containing the hidden layer is trained and its connected structure is then used for pattern recognition
and classification (Wang & Larder, 2003; Drăghici & Potter, 2003). In the bioinformatics applications of
ANNs, many different types of architectures are employed. Perceptron and multi-layer perceptron (MLP)
are the simplest type of architectures. Radial Basis function networks and Kohonen self-organizing maps
have also been found to be very useful architectures.
The major steps involved in an ANN algorithm are as follows:
1. Generation of training and test datasets by processing of the available data
2. Encoding of the data into digital format using various encoding systems (e.g. binary system)
3. Design and development of an ANN architecture consisting of 3 layers for prediction
4. Training of the ANN using appropriate parameters and input data
5. Selection of the ANN model that gives valid output
6. Validation of the ANN model using the test dataset to estimate its efficacy for prediction
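The training and validation steps above can be illustrated with the simplest architecture named earlier, a single perceptron. This is a minimal sketch rather than a practical ANN: the binary-encoded toy data (the logical AND function), the learning rate, the epoch limit and the function names `predict` and `train_perceptron` are all assumptions made for illustration.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum exceeds zero, else 0."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, epochs=100):
    """Train a single perceptron with the classical error-correction rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            if error != 0:
                errors += 1
                # Move the decision boundary towards the misclassified example.
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if errors == 0:  # every training example is classified correctly
            break
    return weights, bias


# Toy binary-encoded training set (logical AND), used only for illustration.
train_x = [(0, 0), (0, 1), (1, 0), (1, 1)]
train_y = [0, 0, 0, 1]
weights, bias = train_perceptron(train_x, train_y)
predictions = [predict(weights, bias, x) for x in train_x]
```

A single perceptron can only separate linearly separable classes; multi-layer networks with a hidden layer are needed for the non-linear relationships discussed in this section.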
The major advantage of ANNs is their ability to process and analyze large, complex datasets containing
non-linear relationships. The method has other benefits, such as the ability to handle noisy data and the capability
of generalization. An apparent limitation of the method is the time taken for processing complex datasets
(Lancashire et al., 2009). In bioinformatics, ANNs have been extensively used for tasks like gene
prediction (Xu & Uberbacher, 1997), signal peptide prediction (Nielsen et al., 1997), protein secondary
structure prediction (Chae et al., 2010) and sequence feature analysis (Blekas et al., 2005).
Hidden Markov Models (HMM)
Hidden Markov Models (HMMs) are very popular machine learning approaches in bioinformatics. These
are probabilistic models which are generally applicable to time series or linear sequences. They can be
used to describe the evolution of observable events that depend on internal factors, which are not directly
observable. The observed event is referred to as a ‘symbol’ and the invisible factor underlying the
observation is called a ‘state’ (Yoon, 2009). An HMM consists of several states, connected by means
of the transition probabilities, thus forming a Markov process. Each of these states has an observable
symbol attached to it. An HMM consists of a visible process of observable events and a hidden process
of internal states moving in tandem. The aim is to find the optimal path through the states, which
maximizes the probability of occurrence of the observed sequence (of symbols).
The major steps involved in the algorithm for generation of an HMM are as follows:
1. Development of an HMM architecture using various states to represent the given set of features
2. Assignment of hidden states to the features and construction of the HMM model
3. Training of the HMM using either a supervised or an unsupervised technique so that the model sufficiently fits the problem under study
4. Derivation of emission probabilities that govern the distribution of the observed symbols, i.e. the probability that a symbol will be observed given that the HMM is in a particular state
5. Decoding of the HMM for the prediction of hidden states from the data
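The decoding step, i.e. finding the path through the hidden states that maximizes the probability of the observed symbols, is commonly carried out with the Viterbi algorithm. The following is a minimal sketch of that algorithm only (training is not shown). The two-state model, its probabilities and the names `S1`, `S2`, `a`, `b`, `c` are hypothetical values chosen for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # best[t][s]: probability of the best path that ends in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = (best[t - 1][prev] * trans_p[prev][s]
                          * emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path, best[-1][last]


# Hypothetical two-state model over a three-symbol alphabet.
states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.5, "b": 0.4, "c": 0.1},
          "S2": {"a": 0.1, "b": 0.3, "c": 0.6}}

path, probability = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
```

For long sequences, practical implementations usually work with log-probabilities to avoid numerical underflow; plain products are used here to keep the sketch short.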
The major benefits of HMMs are their ease of use, their requirement of smaller datasets
and the precise understanding of the process that they offer. Among the main drawbacks of HMMs is their greater
computational cost. HMMs are among the most effective methods for biological sequence analysis and
therefore they are routinely applied for multiple sequence alignments (Finn et al., 2011), gene
finding (Lukashin et al., 1998) as well as protein secondary structure and function prediction
(Majoros et al., 2005; De Fonzo et al., 2007).
Support Vector Machine (SVM)
SVM is a supervised methodology rigorously based on the statistical learning theory (Vapnik, 1995). For
linearly separable examples, SVM constructs a maximum margin hyperplane separating the data points
into two different classes. This hyperplane acts as a decision surface between the two classes. For
nonlinearly separable data, SVM first transforms the data into a higher dimensional feature space and
subsequently employs a linear maximum margin hyperplane. An explicit transformation to a high dimensional
space may, however, be computationally intractable. SVM handles this by defining
appropriate kernel functions, by virtue of which the computations can be carried out in the original space
itself. Three popular kernel functions are Linear, Polynomial and Radial Basis Function (RBF). In
bioinformatics, many domain specific kernel functions are also available like graph kernel and string
kernel. The concept can be extended to multiclass classification. Two popular multiclass classification
methods are employed viz., one against all and one against one.
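The three popular kernels named above can be written in a few lines. The sketch below is illustrative only: the parameter names `degree`, `coef0` and `gamma` and their default values are assumptions, not settings taken from this chapter. The helper `quadratic_feature_map` is a hypothetical explicit mapping for two-dimensional inputs, included to show that a kernel evaluated in the original space gives the same value as an inner product in the transformed space.

```python
import math


def linear_kernel(x, y):
    """Inner product in the original input space."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree"""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def quadratic_feature_map(x):
    """Explicit feature space of the degree-2 kernel with coef0 = 0 (2-D input)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
in_original_space = polynomial_kernel(x, y, degree=2, coef0=0.0)
in_feature_space = linear_kernel(quadratic_feature_map(x), quadratic_feature_map(y))
```

Both `in_original_space` and `in_feature_space` evaluate to 121 (up to rounding), which is the point of the kernel approach: the higher dimensional mapping never has to be computed explicitly.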
The general steps involved in the SVM algorithm are as follows:
1. Construction of a feature vector representing the positive and negative datasets: this feature vector consists of the properties of the input data like amino acid and/or dipeptide composition, physico-chemical properties etc.
2. Choice of an appropriate kernel function suitable for the prediction task using the classifier
3. Training of the SVM classifier by selecting optimum kernel parameters so as to achieve the highest accuracy
4. Selection of the model with the best performance to perform predictions
5. Application of the selected model for performing predictions on the unknown input dataset
SVM is a robust classifier and generalizes well to unseen data as compared
to many other methods. It is one of the most commonly used machine learning methods in bioinformatics and
computational biology. It has been employed for secondary structure prediction, fold recognition, binding
site prediction as well as for gene finding (Yang, 2004; Tong et al., 2008; Kadam et al., 2012).
k-nearest-neighbor (KNN) Classifiers
KNN classifiers are based on finding the k nearest examples in a reference set, and taking a majority vote
among the classes of these k examples (Johnson & Wichern, 1982) to assign a class to the query.
Decision boundaries for assigning classes are implicitly derived in KNN.
The following are the important steps involved in the development of a KNN classifier:
1. Construction of a feature set and choice of a distance metric to compute distances between feature vectors
2. Determination of the number of nearest neighbors (parameter k) for the training set
3. Calculation of the Euclidean distances (or any other distance measure like the Mahalanobis distance) between the query instance and all the training samples
4. Sorting of the distances and determination of the nearest neighbors based on the k smallest distances
5. Prediction of the class label for the new/unknown instance using the class labels of the nearest neighbors
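The steps above translate almost directly into code. The following minimal sketch uses the Euclidean distance and an unweighted majority vote; the two-dimensional toy feature vectors, the class labels and the choice k = 3 are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Assign the majority class among the k training samples nearest to query."""
    # Compute the distance from the query to every training sample, then sort.
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training set with two classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["positive", "positive", "positive", "negative", "negative", "negative"]

label_a = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
label_b = knn_predict(train_x, train_y, (3.0, 3.0), k=3)
```

Because every query is compared with all training samples, the cost of a prediction grows with the size of the reference set.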
The most prominent advantages of the KNN method are its effectiveness on larger datasets and its robustness
when processing noisy data. Its disadvantage is the high computational cost at prediction time, which often reduces
its speed. In bioinformatics, KNN has been employed successfully for performing various protein
function prediction tasks (Huang & Yanda, 2004; Shen et al., 2006).
Decision Tree (DT)
Decision tree refers to a branch-test-based classifier (Quinlan, 1993). Construction of decision trees
involves analysis of a set of training examples for which the class labels are known. This information is
then used to classify new and unseen examples. Every branch corresponds to a group of classes and a leaf
denotes a specific class. A decision node specifies a test on a single attribute value, with one branch and
its subsequent classes for each possible outcome of the test.
The major steps involved in the DT algorithm are as follows:
1. Preparation of the training dataset in the appropriate form for the classifier by feature extraction from the input data
2. Construction of a decision tree by placing the instances in the training set at an initial node
3. Division of the instances into two distinct classes (child nodes) based on the chosen test value
4. Recursive application of the last step until fulfillment of the termination (pre-pruning) condition
5. Pruning of the resultant tree and its application to perform predictions
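The division step, in which the instances at a node are split into two child nodes on a chosen test value, can be sketched for a single numeric attribute as follows. The Gini impurity is used here as the split criterion, which is an assumption (it is the criterion associated with CART; other algorithms use entropy-based measures), and the toy values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best test 'value <= threshold'."""
    n = len(values)
    best = (None, gini(labels))  # no split: impurity of the parent node
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical attribute values and class labels at one node.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(values, labels)
```

A full tree is obtained by applying `best_split` recursively to each child node over all attributes until the termination condition is met.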
DTs are very simple classifiers and hence have better interpretability than other machine learning
methods (Kingsford & Salzberg, 2008). They have been widely used in bioinformatics for prediction of
genetic interactions and similar applications (Wong et al., 2004; Che et al., 2011).
Random Forests (RF)
Random Forests (RF) is an ensemble of randomly constructed independent classification and decision
trees (Breiman, 2001). It normally exhibits substantial performance improvements over single-tree
classifiers such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993). Randomness may be
introduced into the RF algorithm in two ways.
1. Bootstrapping: A bootstrap set is constructed from the original training data set using random
sampling with replacement to generate each tree.
2. Node splitting: It is carried out by selecting a subset of attributes. While splitting a node, if there
are M input attributes, then a number m, where m << M, is specified such that at each node, m
attributes are selected at random and the best split on these is considered. A good value of m,
which is selected as the default by many implementations, is sqrt(M) for classification
(Liaw & Wiener, 2002).
The classification tree is thus induced using the ‘in-bag’ data based on the CART algorithm. Later, the
out-of-bag (OOB) data, formed by leaving out the in-bag samples from the original data, are used for
cross validation.
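The two sources of randomness described above can be sketched with the standard library alone. The function names, the dataset size (20 samples, 16 attributes) and the random seed are hypothetical; growing the trees themselves is not shown.

```python
import math
import random


def bootstrap_split(n_samples, rng):
    """Draw an in-bag set with replacement; the unused indices form the OOB set."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def random_attribute_subset(n_attributes, rng):
    """Pick m = sqrt(M) attributes at random, without replacement, for one node."""
    m = max(1, int(math.sqrt(n_attributes)))
    return rng.sample(range(n_attributes), m)


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
in_bag, out_of_bag = bootstrap_split(20, rng)
attributes = random_attribute_subset(16, rng)
```

Each tree of the forest would be grown on its own `in_bag` indices, calling `random_attribute_subset` afresh at every node, and would then vote on the samples in its own `out_of_bag` list.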
The major steps of the RF algorithm are as follows:
1. Employment of the CART algorithm on the data to grow random classification trees
2. Use of bootstrap data, known as the in-bag set, to train the CART algorithm
3. Node splitting based on the best condition over a random subset of m attributes
4. Use of a majority voting strategy to decide the class affiliation of each OOB sample
In addition, RF provides a Variable Importance (VI) ranking, which can be used later to retrain the RF using a smaller
subset of the most important variables, and it is resistant to over-fitting of data.
RF and its variants have been applied to solve a variety of bioinformatics problems, such as gene
expression classification, analysis of mass spectroscopy data from protein expression, sequence
annotation and prediction of protein-protein interactions (Qi, 2012).
Ensemble Classifiers
In ensemble classifiers, individual decisions of a set of classifiers are combined either by weighted or
unweighted voting for classification of new instances. These are also known as multi-classifier systems.
Ensemble classifiers are more effective for prediction tasks due to the fact that they use a combination of
classifiers and can capture features that cannot be captured from any single model alone. Ensemble
methods have been applied in different bioinformatics problems due to high prediction accuracy (Yang et
al., 2010). Table 1 briefly summarizes the important advantages and limitations of major machine
learning algorithms.
Artificial Neural Networks (ANN)
  Advantages: Good approximation of nonlinear relationships; capacity to handle noisy data
  Limitations: Greater computational burden; prone to over-fitting
Hidden Markov Models (HMM)
  Advantages: Precise understanding of the background; powerful and easy to use
  Limitations: Computationally intensive; relatively slower than other methods; prone to over-fitting
Support Vector Machine (SVM)
  Advantages: Provides the best generalization ability; robustness to noisy datasets; less susceptible to over-fitting
  Limitations: Computationally expensive in some cases, such as in the case of nonlinearly separable data
k-nearest neighbor (KNN)
  Advantages: Simple and easy to learn; effective when training data is large; training is very fast
  Limitations: Computational complexity; inconsistent performance when the number of attributes increases
Decision Tree
  Advantages: Able to handle both continuous and discrete attributes; better interpretability; good results for redundant attributes
  Limitations: Error-prone in the case of data with a large number of classes; sensitive to small variations in the data
Random Forest
  Advantages: High accuracy and speed; less prone to over-fitting; ability to evaluate each attribute for prediction
  Limitations: Tendency to over-fitting when data is noisy
Ensemble Classifiers
  Advantages: Greater efficiency of prediction; better utilization of the data
  Limitations: Greater computational complexity
Table 1: Major machine learning algorithms with their advantages and limitations.
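The weighted and unweighted voting used by ensemble classifiers to combine individual decisions can be sketched as follows. The function name `vote`, the class labels and the weights are hypothetical; in practice the weights might, for example, reflect the validation accuracy of each member classifier.

```python
from collections import defaultdict


def vote(predictions, weights=None):
    """Combine the class labels predicted by several classifiers.

    With weights=None every classifier counts equally (unweighted voting);
    otherwise each prediction contributes its classifier's weight.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Labels predicted for one query protein by three hypothetical classifiers.
member_predictions = ["enzyme", "non-enzyme", "enzyme"]
unweighted = vote(member_predictions)
weighted = vote(member_predictions, weights=[0.2, 0.9, 0.3])
```

Here the unweighted vote follows the two-to-one majority, whereas the weighted vote follows the single classifier that carries the largest weight.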
A generalized protocol for bioinformatics applications based on machine learning algorithms
Every machine learning algorithm has some uniqueness with respect to the model of learning and
parameter optimization. But there are some common steps involved like preparation of datasets, feature
selection methods and performance evaluation approaches. These steps are discussed in the succeeding
sections. Figure 1 illustrates a schematic representation of the general protocol for
classification/prediction using a machine learning approach.
Figure 1: Schematic representation of prediction methodology by a machine learning approach.
2.1 Preparation of Datasets
Successful applications of the machine learning methods demand identification of discriminatory
features, which is dependent on the quality of datasets used for training. Thus, selection and/or curation
of discriminative datasets for training, testing and validation determines the practical effectiveness of the
method.
It is necessary to use non-redundant datasets for generating predictive models, irrespective of the
type of machine learning algorithm as well as the type of input data being used. This is due to the fact
that redundancy in the data leads to bias in the statistical analysis, which might result in overestimation of
predictive performance (Nielsen et al., 1996). Curation of the data to remove repetitive features is an
essential pre-requisite. The criteria used for generating non-redundant datasets, however, should not be
too rigorous as it can cause omission of valuable information from the dataset.
For an efficient classification algorithm, well-defined annotation classes and training dataset
containing positive and negative data for each class are very critical. The positive dataset should
represent the members of a particular annotation class whereas the negative dataset should represent non-
members. The number of instances for each class should be balanced, as some classifiers like SVM tend
to produce reduced accuracies for imbalanced datasets. Appropriate representation of the informative
experimental data available and its conversion into datasets relevant to machine learning denotes the
critical step of generating an efficient classifier (Juncker et al., 2009).
2.2 Feature Extraction and Selection
In most of the machine learning classifiers, the input is represented by parameters that provide
information for prediction. These parameters, denoted as features or attributes, are present in large
numbers. Many of these features are redundant in nature and are not needed for efficient prediction of
labels. The dimensionality of the original data can be reduced by feature extraction and feature selection
methods (Jain et al., 2000).
Feature Extraction:
It involves the production of a new set of features from the original features in the data, through the
application of a mapping method. Well-known unsupervised feature extraction methods include Principal
Component Analysis (PCA) and spectral clustering. PCA finds a linear projection of high dimensional
data into a lower dimensional subspace, which leads to the maximization of the variance and
minimization of the least square reconstruction error. Because of this, PCA has been found very effective
for performing feature extraction (Li et al., 2008). It has also been used extensively in the studies
involving analysis of spectral data, with considerable efficiency (Grill & Rush, 2008).
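The idea of projecting the data onto the direction of maximum variance can be sketched without external libraries. The example below finds only the first principal component, using power iteration on the covariance matrix; power iteration is a choice made here for brevity, not a method prescribed by the chapter, and the toy data are hypothetical. The sketch assumes that the starting vector is not orthogonal to the leading component.

```python
import math


def first_principal_component(data, iterations=100):
    """Return (component, means): the unit direction of maximum variance."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec, means


def project(row, means, component):
    """Coordinate of one sample along the principal component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Hypothetical two-feature data lying exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component, means = first_principal_component(data)
reduced = [project(row, means, component) for row in data]
```

Each two-dimensional sample is thus replaced by a single coordinate in `reduced`, which for these data loses no information because all points lie on one line.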
Feature Selection:
Feature selection (also known as subset selection, attribute selection or variable selection) is the process
of choosing a small subset of features that is sufficient to predict the target classes accurately. There are
numerous advantages of applying feature selection techniques in prediction methods. The most notable
are: (i) to avoid over-fitting and improve the prediction performance of the model generated; (ii) to reduce the
computational complexity of learning and prediction algorithms; (iii) to provide faster and cost-effective
models; and (iv) to gain a deeper insight into the underlying processes that generated the data (Saeys et
al., 2007). Considering these benefits that feature selection techniques offer, they have been widely
applied in development of prediction methods. Based on the context of classification (prediction), feature
selection techniques are grouped into three categories.
2.2.1 Filter Methods
In the filter method, a predictive subset of features is determined based on simple statistics (scores)
calculated from the empirical distributions. These scores represent the relevance of each feature and
hence the filter method reflects intrinsic properties of the data to be classified. The most prominent
advantage of the filter method is its independence from the classification algorithm. It is a simple and fast
method that can also be applied to high dimensional datasets. Prominent examples of filter approach
include Information gain, Chi-square test, Correlation based feature selection (CFS) and Mutual
Information based feature selection (Saeys et al., 2007).
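As an illustration of a filter score, information gain can be computed for discrete features directly from the data, without reference to any classifier. The sketch below is a minimal version for categorical feature values; the feature names `f1` and `f2` and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by partitioning on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Return feature names sorted from the highest to the lowest gain."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


labels = [1, 1, 0, 0]
columns = {
    "f1": ["x", "x", "y", "y"],  # perfectly predicts the label
    "f2": ["x", "y", "x", "y"],  # carries no information about the label
}
ranking = rank_features(columns, labels)
```

A filter method would then retain the top-ranked features and pass only those to the classifier.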
2.2.2 Wrapper Methods
These methods employ a search through the available feature space to identify a subset of features. They
use the estimated accuracy from an induction algorithm as the measure of efficiency for this purpose.
Thus, in these methods, different subsets of features are generated and evaluated. The evaluation of a
particular subset is coupled to the classification algorithm, making this approach algorithm specific.
These methods have the advantage of utilizing feature dependencies and taking into account the
interdependencies of feature subset search and final model selection. Algorithms based on the wrapper
approach include Sequential Forward Selection and SVM-RFE feature selection. Genetic
Algorithms have also been found very efficient for feature selection, with respect to both the size of the
selected feature set and the performance of the feature selection algorithm (Li et al., 2004).
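Sequential Forward Selection, one of the wrapper algorithms named above, can be sketched as a greedy search that repeatedly adds the feature giving the largest improvement in the score. In a real wrapper the score would be the estimated accuracy of the chosen induction algorithm on each candidate subset; here `toy_score` is a hypothetical stand-in with made-up values, and the feature names are likewise illustrative.

```python
def sequential_forward_selection(features, score, max_features):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    best_score = score(selected)
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        trial_scores = {f: score(selected + [f]) for f in candidates}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] <= best_score:
            break  # no candidate improves the current subset
        selected.append(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score


def toy_score(subset):
    """Hypothetical stand-in for the estimated accuracy of a classifier."""
    useful = {"hydrophobicity": 0.20, "charge": 0.10}
    # Features that are not useful slightly reduce the score.
    return 0.5 + sum(useful.get(f, -0.01) for f in subset)


selected, final_score = sequential_forward_selection(
    ["charge", "length", "hydrophobicity"], toy_score, max_features=3
)
```

Because the classifier has to be re-evaluated for every candidate subset, the search is specific to the chosen algorithm and more expensive than a filter score.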
2.2.3 Embedded Methods
Like wrapper methods, embedded methods are specific to the given algorithm. In contrast to filter and
wrapper approaches, the learning part and the feature selection part cannot be separated in embedded
methods. These methods are computationally less intensive than wrapper methods. They make better use
of all the available data as splitting of the data into training and validation datasets is not required.
Feature selection using decision trees and feature selection using the weight vector of an SVM are examples of the embedded
approach (Guyon et al., 2002).
2.3 Performance Evaluation (Estimation of Accuracy)
For any statistical prediction method, it is of paramount importance to determine the efficacy of the
method. Cross validation tests, along with benchmark datasets are employed for this purpose (Chou,
2011). Three cross-validation methods that are commonly used to estimate the prediction efficiency are
briefly described below.
2.3.1 Independent Dataset Test
In the independent dataset test, instances in the test dataset are selected such that the examples used in the
training of the classifier are not included. This ensures that there is no inherent memory bias in making
predictions. Application of this test produces a reliable estimate of accuracy only when the test dataset is
adequately large. In case of insufficient data, the construction of an independent dataset becomes
inconsistent and variable results are obtained.
2.3.2 The Sub-sampling (n-fold Cross Validation) Test
In n-fold cross validation, the original dataset is divided into n subsamples. Out of these n subsamples,
one is used as the test dataset while the other n-1 subsamples are used as the training dataset. The cross validation
process is carried out n times, with each of the n subsamples used once as the test dataset. For the
sub-sampling test, 5-fold, 7-fold or 10-fold cross-validation is commonly applied. The most apparent
advantage of this procedure is the requirement of less computational time. In any sub-sampling method,
only a particular fraction of all the possible subsample selections is considered. Due to this, inconsistencies like
biased predictions are observed when this method is applied to the same datasets or the same predictors. In cases
where the available test dataset is small, this method can be utilized since it is known to produce reliable
results.
2.3.3 The Jackknife Test
In the jackknife test, each instance is used in turn as the test data, while the remaining instances are
used for training the predictor. This facilitates use of each instance in the input data set for training as
well as testing and the memory bias, if any, is avoided. Due to the consistent results obtained for a
particular predictor, the problem of arbitrariness observed in case of independent test datasets and
subsampling is not encountered in this case. The jackknife test, therefore, has been widely used to
estimate efficiencies of different classifiers.
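The splitting schemes described in Sections 2.3.2 and 2.3.3 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular package: the function names are hypothetical, instances are identified only by their index, and no shuffling or class stratification is performed (both are common in practice and would be added before real use). The jackknife is written as the special case in which the number of folds equals the number of instances.

```python
def n_fold_splits(num_instances, n_folds):
    """Yield (train_indices, test_indices) for each of the n folds."""
    if not 2 <= n_folds <= num_instances:
        raise ValueError("n_folds must be between 2 and the number of instances")
    indices = list(range(num_instances))
    # Spread instances as evenly as possible: the first `extra` folds get one more.
    base, extra = divmod(num_instances, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def jackknife_splits(num_instances):
    """Jackknife (leave-one-out): each instance is the test set exactly once."""
    return n_fold_splits(num_instances, num_instances)
```

Each instance appears in exactly one test fold, so averaging the per-fold accuracies uses every instance for testing once and for training n-1 times.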
2.4 A Case Study: Prediction of Antigens and Non-antigens from Sequence
Prediction and identification of the antigenic proteins of pathogens is one of the most important steps in
design and development of vaccines using reverse vaccinology approach (Rappuoli, 2001; Kulkarni-Kale
et al., 2012). In this section, a case study of prediction of antigenic and non-antigenic proteins is
presented using SVM as a classification and prediction method. Figure 2 shows protocol of the method as
a flow chart. Compositional features viz. single amino acid and dipeptide frequencies are used for
training and testing purposes.
Figure 2: Flow chart depicting methodology of simple classifier generation for predicting
antigens and non-antigens.
2.4.1 Dataset
The protein sequence dataset for this study is taken from the work of Doytchinova (Doytchinova &
Flower, 2007). This training dataset comprises 75 sequences each of antigens (true positives)
and non-antigens (true negatives). The antigenic protein sequences are known bacterial proteins that are
used to design candidate vaccines. The sequences of non-antigenic proteins (negative dataset) are taken
from the same bacterial species, which are used to derive the positive dataset of antigens.
2.4.2 Feature Extraction
For every sequence in the true positive and true negative datasets, single amino acid and dipeptide
frequencies are calculated and are used as features. Amino acid composition is represented by 20 features
while 400 features denote dipeptide composition. The feature vector is constructed for SVM-based
prediction, with total of 420 features for each protein sequence.
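The construction of the 420-dimensional feature vector can be sketched as follows. This is an illustrative sketch rather than the code used in the case study: the function name is hypothetical, frequencies are expressed as fractions (the chapter does not state whether fractions or percentages were used), and residues outside the 20 standard amino acids are simply not counted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def composition_features(sequence):
    """Return 420 values: 20 amino acid frequencies, then 400 dipeptide frequencies."""
    seq = sequence.upper()
    aa_counts = dict.fromkeys(AMINO_ACIDS, 0)
    dp_counts = dict.fromkeys(DIPEPTIDES, 0)
    for aa in seq:
        if aa in aa_counts:
            aa_counts[aa] += 1
    # Dipeptides are overlapping pairs of consecutive residues.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in dp_counts:
            dp_counts[pair] += 1
    n_aa = sum(aa_counts.values()) or 1  # guard against empty input
    n_dp = sum(dp_counts.values()) or 1
    return ([aa_counts[a] / n_aa for a in AMINO_ACIDS]
            + [dp_counts[d] / n_dp for d in DIPEPTIDES])
```

For the toy sequence "ACAC", the amino acid block gives 0.5 for both A and C, and the dipeptide block gives 2/3 for AC and 1/3 for CA.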
2.4.3 Feature Selection
Feature selection is carried out using Waikato Environment for Knowledge Analysis (WEKA) software
package (Hall et al., 2009), to find the feature subset that gives the best results. Information Gain
(InfoGain) is used as an attribute evaluator. InfoGain acts as a filter and it assigns a specific rank to each
feature based on its effectiveness in classification.
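The idea behind ranking by information gain can be illustrated with a small sketch. It is not a reimplementation of the WEKA evaluator: WEKA applies its own discretization to numeric attributes, whereas the code below assumes a single split of each feature at its (lower) median, and all function names are hypothetical. Information gain is the reduction in class-label entropy obtained by splitting the data on a feature.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric feature at `threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Return (gain, feature_index) pairs sorted from most to least informative."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        threshold = sorted(column)[(len(column) - 1) // 2]  # lower median (assumption)
        scores.append((information_gain(column, labels, threshold), j))
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

Selecting the top-ranked 10, 20 or 37 features then amounts to keeping the first entries of the returned list.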
2.4.4 Classification Method
SVM is employed for the classification purpose using LIBSVM (version 3.0) software package (Chang
& Lin, 2011). Three kernel functions viz. Linear, Polynomial and Radial Basis Function (RBF) are used
and 10 fold cross-validation is done to determine prediction accuracy, for the whole feature set as well as
for subsets of selected features.
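For reference, the three kernel functions can be written out explicitly. The sketch below follows the parametrization documented for LIBSVM (polynomial: (gamma * u.v + coef0)^degree; RBF: exp(-gamma * |u - v|^2)), but it is a stand-alone illustration with hypothetical function names, and the default parameter values shown are placeholders rather than the settings used in the case study.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    # Similarity decays with squared Euclidean distance; gamma sets how fast.
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

A larger gamma makes the RBF kernel more local, which increases the nonlinearity of the resulting decision boundary; this is the parameter tuned together with the cost during cross-validation.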
2.4.5 Validation
The RBF kernel is observed to provide the best cross-validation accuracies. The results obtained for all
the feature subsets are listed in Table 2. The feature subset comprising 20 features is observed to give
the maximum cross-validation accuracy of 83.33%. Both single amino acid and dipeptide composition
features are found to contribute to this feature set. Thus a very compact and efficient SVM classifier is
generated for prediction of antigens and non-antigens using combination of single amino acid and
dipeptide compositions.
Feature Set
Cross-validation Accuracy
420 features (whole feature set)
37 ranked features selected by InfoGain
20 top ranked features selected by InfoGain
10 top ranked features selected by InfoGain
Gamma: Kernel parameter that regulates the nonlinearity of the classifier
Cost: Regularization parameter (C) of the SVM that penalizes training errors and thereby controls over-fitting of the classifier
Table 2: Cross-validation results for different feature sets.
3 Machine Learning Approaches Using Sequence-based Features
Information present in the amino acid sequences of proteins has been very useful for prediction of protein
function by machine learning methods (Han et al., 2006). In this section, a detailed account of sequence-
based machine learning methods is presented. These methods utilize the sequence information in
different forms of features like amino acid and dipeptide frequencies (compositional features), sequence
order information in the form of pseudo amino acid composition (Chou, 2001), position specific scoring
matrix (PSSM) and/or profiles and functional domain compositions. The individual informative features
and their combinations have been utilized very successfully for different function prediction applications
like prediction of structural class, prediction of protein subcellular localization, prediction of protein-
protein interactions and many others. A brief summary of the methods that have been developed for these
applications is given below.
3.1 Compositional Features
Compositional features of proteins refer to the features derived directly from the amino acid sequence of
the protein. Compositional features of proteins are mainly derived in the form of either single amino acid
residues frequencies (monopeptide composition) or as frequencies of pairs of consecutive amino acid
residues (dipeptide composition). Monopeptide composition is expressed as a vector of 20 numerical
attributes, in which each numerical attribute is the occurrence of a specific amino acid residue (Yang,
2004). Similarly, dipeptide composition is represented as a vector of 400 numerical attributes (Fang et
al., 2008).
It has been shown that the structural class and overall fold of a protein are determined by its amino
acid composition (Chou & Zhang, 1995). Compositional features have been, therefore, routinely used to
predict structural classes of proteins using ANN (Rost & Sander, 1994; Chandonia & Karplus, 1995) and
SVM (Cai et al., 2002) based methods. Amino acid compositional features have been shown to be
characteristic features for subcellular localization of proteins (Cedano et al., 1997). Consequently, these
features have been used in combination with machine learning algorithms to predict subcellular
localization of novel/hypothetical proteins. The methods employed for this purpose include KNN (Nakai
& Horton, 1999; Huang & Yanda, 2004) and SVM (Hua & Sun, 2001; Yu et al., 2006a; Park &
Kanehisa, 2003). SVM based methods have also been developed for prediction of subcellular localization
of proteins from specific organisms such as prokaryotes (Bhasin et al., 2005), Gram-negative bacteria
(Yu et al., 2004), plants (Tamura & Akutsu, 2007), and humans (Garg et al., 2005).
Amino acid composition combined with ANN and SVM based classifiers has been successfully
applied for prediction of DNA-binding residues (Wang & Brown, 2006a; Ofran et al., 2007; Hwang et
al., 2007). Cai and Lin have presented a detailed account of SVM-based methods for predicting rRNA-,
mRNA-, and DNA-binding proteins from amino acid sequence (Cai & Lin, 2003). A web server,
ProteDNA, is available to predict DNA-binding residues in
transcriptional factors using SVM (Chu et al., 2009). Prediction of RNA-binding sites using amino acid
sequence has also been performed (Terribilini et al., 2006). BindN
is an efficient web-based tool for prediction of both DNA and RNA binding proteins (Wang & Brown,
Prediction of protein-protein interactions has emerged as an important problem in recent
bioinformatics research. SVM has been efficiently used for this task (Shen et al., 2007; Yu et al., 2010).
Chang and co-workers utilized accessible surface area of amino acids derived from sequence-based
predictions, to develop a novel and efficient method for inferring protein-protein interactions using kernel
density estimation algorithm (Chang et al., 2010).
Bhasin et al. used single amino acid and dipeptide composition for accurate classification of
nuclear receptors (Bhasin & Raghava, 2004a) and only dipeptide composition for predicting G-protein
coupled receptors (Bhasin & Raghava, 2004b). CTKPred (
CTKPred/) is an SVM-based online prediction server for the prediction and classification of cytokine
superfamilies, which is based on the dipeptide composition (Huang et al., 2005).
Prediction of secretory proteins is performed successfully using amino acid composition and
similarity search (Garg & Raghava, 2008). ANN-based method is employed for prediction of novel
archaeal enzymes using properties of sequences (Jensen et al., 2002). An RF-based method for predicting
protein motions from amino acid sequences has also been developed (Hirose et al., 2010). PrDOS
is a server that predicts the disordered regions of a protein using SVM from its
amino acid sequence (Ishida & Kinoshita, 2007).
3.2 Pseudo Amino Acid Composition
The pseudo amino acid (PseAA) composition is a unique way to represent a protein sequence in a
discrete model without losing its sequence-order information completely. The PseAA composition of a
given protein is denoted by a set of more than 20 discrete factors, where the first 20 factors represent the
components of its conventional amino acid composition while the additional factors incorporate its
sequence order information via various modes (Chou, 1999; Chou, 2001). This unique representation is
generated by sequence correlation factors, which implicitly incorporate the effect of sequence order of
the proteins. These correlation factors are discrete numbers that are derived from physico-chemical
properties between amino acid pairs of the type (i, i+1), (i, i+2) and (i, i+3). The physicochemical
properties incorporate the local sequence order effects that are independent of size, contiguity, and global
order of the sequence. The PseAA composition greatly enhances the efficiency of protein function
prediction and hence has been widely applied for functional annotation of variety of proteins.
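A simplified sketch of a PseAA-style descriptor is given below, following the general form described by Chou (2001): 20 normalized composition terms followed by weighted sequence-order correlation factors for residue pairs (i, i+1), (i, i+2) and (i, i+3). Several choices here are assumptions made for illustration only: a single property (the Kyte-Doolittle hydropathy scale) stands in for the set of physico-chemical properties used in the original formulation, the weight of 0.05 is a commonly quoted default, the function names are hypothetical, and the input is assumed to contain only the 20 standard residues.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here only as a stand-in property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize(scale):
    """Standardize a property scale to zero mean and unit variance over 20 residues."""
    mean = sum(scale.values()) / len(scale)
    sd = math.sqrt(sum((v - mean) ** 2 for v in scale.values()) / len(scale))
    return {aa: (v - mean) / sd for aa, v in scale.items()}


def pseaa_composition(sequence, max_lag=3, weight=0.05, scale=HYDROPATHY):
    """Return 20 + max_lag PseAA-style components for a protein sequence."""
    seq = sequence.upper()
    if len(seq) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    prop = normalize(scale)
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # thetas[k-1]: mean squared property difference between residues k apart.
    thetas = []
    for lag in range(1, max_lag + 1):
        pairs = zip(seq, seq[lag:])
        thetas.append(sum((prop[a] - prop[b]) ** 2 for a, b in pairs) / (len(seq) - lag))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

The components sum to one by construction, and two sequences with identical amino acid composition but different residue order generally differ in the last `max_lag` components, which is how the descriptor retains part of the sequence-order information.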
SVM has been extensively utilized in prediction of protein structural class using the PseAA
composition (Chen et al., 2006a; Ding et al., 2007; Chen et al., 2006b), while fuzzy K nearest neighbors
(FKNN) classifier is adopted as the prediction engine developed by Zhang and co-workers (Zhang et al.,
SVM has also been used for predicting subcellular localization (Shi et al., 2007; Li & Li, 2008)
using pseudo amino acid composition. A similar approach has been also applied specifically for
prediction of subcellular localizations of proteins involved in apoptosis (Chen & Li, 2007). For prediction
of subcellular localization of gram-negative bacterial proteins, a KNN based method (Wang & Yang,
2010) as well as a method based on multiple SVMs is available (Wang et al., 2005). In a recent approach,
Ma & Gu (2010) used the Elman Recurrent Neural Network (RNN) classifier to accurately predict
subcellular localization of proteins. RNN is an enhanced ANN model in which an extra context layer is
present, which makes the classifier more adaptable to time-varying properties of the network itself.
Membrane proteins are attractive targets for basic research and drug design. These proteins are
grouped into five types viz. (i) type-1 (ii) type-2 (iii) multipass (iv) lipid chain-anchored and (v) GPI-
anchored membrane proteins. Owing to the importance of these proteins as potential drug targets, it has
become imperative to develop computational methods for prediction of the types of membrane proteins.
Classifiers based on KNN (Shen et al, 2006) and weighted SVM (Wang et al, 2004) have been developed
to predict types of membrane proteins.
Enzymes represent one of the most important biomolecules as they act as catalysts in almost all
the cellular processes. Identification of a functional class of a newly found enzyme is very important for
determining its biochemical function. SVM is the algorithm of choice for this purpose and it has been
used with Discrete Wavelet Transform (Qiu et al., 2010) and with amphiphilic pseudo-amino acid
composition (Zhou et al., 2007) and is known to yield high accuracy of predictions.
3.3 PSSM Profiles
The position-specific scoring matrix (PSSM) is a very useful quantitative representation of multiple
sequence alignments (Henikoff, 1996). It facilitates integration of evolutionary information to predict
protein function. PSSM profiles of a protein family are generated using methods such as PSI-BLAST
search against nonredundant (nr) database of protein sequences. The PSSM profiles are reliable and
quantitative measures of residue conservation at a given location and are being used as features by
machine learning methods for predicting various properties and functional annotation of proteins.
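The notion of a position-specific score can be illustrated with a deliberately simplified sketch. It does not reproduce the PSI-BLAST procedure, which uses sequence weighting and substitution-matrix-based pseudocounts; here a uniform background frequency and a constant pseudocount are assumed, the function name is hypothetical, and the input is a small pre-aligned set of equal-length sequences in which gap characters are ignored. Each column yields a log-odds score per residue, so conserved residues score high at their positions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simple_pssm(aligned_sequences, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for col in range(length):
        # Keep only standard residues; gaps and unknown symbols are skipped.
        column = [s[col] for s in aligned_sequences if s[col] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm
```

For a query of length L, the resulting L x 20 matrix (or a window of rows around each residue) can be flattened into a feature vector for a classifier.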
There are a number of methods available in which evolutionary profiles have been utilized for
detecting subcellular locations of proteins. These include prediction of mitochondrial proteins in malarial
parasite (Verma et al., 2010), prediction of subcellular localization of mycobacterial proteins (Rashid et
al., 2007) and prediction of protein subnuclear localization (Mundra et al., 2007). Web servers such as
LOCSVMPSI (Xie et al., 2005; URL:
php) and TSSub (Guo & Lin, 2006) are available
that implement SVM-based methods for prediction of subcellular localization of eukaryotic proteins.
An ANN-based algorithm reports efficient prediction of DNA-binding sites using PSSM profiles
(Ahmad & Sarai, 2005). The results observed are claimed to be far better in comparison with the methods
based on sequence information alone. Similarly, in an SVM-based approach for predicting RNA-binding
sites, greater accuracy is achieved when PSSM profiles are used than that of single sequence-based
methods (Kumar et al., 2008; Cheng et al., 2008).
For prediction of fungal adhesins and adhesin-like proteins using SVM, the best classifier
performance is obtained when PSI-BLAST derived PSSM matrices are used as features (Ramana &
Gupta, 2010). SVM has also been applied to predict proteins secreted by malaria parasite into host
erythrocyte (Verma et al., 2008), allergenic proteins (Kumar & Shelokar, 2008) and proline cis/trans
isomerization in proteins (Song et al., 2006). ANN-based methods that use PSSM profiles have also been
developed like PPRODO (Sim et al., 2005a), which predicts domain boundaries of proteins from
sequence information while another method predicts beta turn types in proteins (Kaur & Raghava, 2004).
A method based on FKNN that uses PSI-BLAST profiles as feature vectors has been developed for
prediction of solvent accessibility (Sim et al, 2005b).
3.4 Sequence Motifs and Patterns
Sequence motifs or patterns are defined as sub-sequences that are conserved across a set of protein
sequences, most often belonging to a family of proteins (Bork & Koonin, 1996). Being associated with a
characteristic function, presence of motif/pattern in a protein sequence serves as a diagnostic for function
prediction. Several machine learning methods have been developed which perform encoding of the
protein sequence in terms of features that determine whether a certain motif is present in a sequence.
Sequence motifs, thus, can be utilized for feature extraction to predict important functional aspects of
TargetP is an ANN-based tool that uses only N-terminal
sequence information, to predict large-scale subcellular localization of proteins (Emanuelsson et al.,
2000). ANN is also used to predict mitochondrial transit peptides in malarial parasite (Bender et al.,
2003). An automated motif-finding algorithm combined with ANN has resulted in a prediction server
ChloroP for identifying chloroplast transit peptides and their
cleavage sites (Emanuelsson et al., 1999). Based on similar features, SVM has been applied for
prediction of protein-protein interactions (Martin et al., 2005), alpha-turn types in proteins (Cai et al.,
2003a) and prediction of phosphorylation sites (Kim et al., 2004). An SVM-based prediction server to
predict aminoacyl tRNA synthetases using PROSITE domains has also been developed (Panwar &
Raghava, 2010). There are many ANN-based methods that are available for prediction of kinase-specific
phosphorylation sites (Blom et al., 2004) and to predict mucin type O-glycosylation sites in mammalian
proteins (Hansen et al., 1998). Motif-based approach has been used for functional classification of the
enzymes by an SVM classifier (Kunik et al., 2005).
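Encoding a sequence by the presence or absence of motifs can be sketched as follows. The motif library here is a small hypothetical example, not the feature set of any method cited above: it contains the PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P} and the P-loop pattern [AG]-x(4)-G-K-[ST], each translated by hand into a regular expression.

```python
import re

# Hypothetical motif library: motif name -> compiled regular expression.
MOTIFS = {
    "N_glycosylation": re.compile(r"N[^P][ST][^P]"),  # N-{P}-[ST]-{P}
    "P_loop": re.compile(r"[AG].{4}GK[ST]"),          # [AG]-x(4)-G-K-[ST]
}


def motif_features(sequence, motifs=MOTIFS):
    """Return one 0/1 value per motif: 1 if the motif occurs anywhere in the sequence."""
    seq = sequence.upper()
    return [1 if pattern.search(seq) else 0 for pattern in motifs.values()]
```

The resulting binary vector can be used on its own or appended to compositional features before training a classifier.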
3.4.1 Analysis of Signal Peptides
The processes through which proteins are routed to their final destination within a cell are referred to as
subcellular protein sorting. These are mediated through specialized “sorting signals” in the sequences of
proteins. These signals are called signal peptides and they control entry of almost all proteins to the
secretory pathway both in prokaryotes and eukaryotes (von Heijne, 1990; Rapoport, 1992).
Computational identification of signal peptides and prediction of their cleavage sites is very important
not only for genome analysis and automated annotations (Nielsen, 1999) but also for production of
proteins in recombinant systems (Nielsen, 1997).
Machine-learning techniques, being well suited for pattern recognition tasks, have been widely
applied for prediction of signal peptides. ANNs were utilized extensively for predicting signal peptides in
the early 1990s (Ladunga et al., 1991; Schneider and Wrede, 1993). The method SignalP was developed
for identification of signal peptides in prokaryotes and eukaryotes and their cleavage sites based on ANN
(Nielsen, 1997). An updated version, SignalP 3.0 is based on ANN and HMM algorithms (Bendtsen et
al., 2004a). As compared to earlier versions of the method, SignalP 3.0 offers improved accuracy for
identification of signal peptides in the proteins from eukaryotes, Gram positive and Gram negative
bacteria. SPEPlip is an ANN-based method for prediction of lipoprotein cleavage sites (Fariselli et al.,
2003). There are other approaches available too, like, Phobius, which is a combined transmembrane
protein topology and signal peptide predictor based on HMMs (Käll et al., 2004). The method Signal-3L
predicts signal peptides and their cleavage sites in human, plant, animal, eukaryotic, Gram-positive and
Gram-negative bacterial protein sequences based on KNN (Shen & Chou, 2007a). HMM-based method
is available for prediction of signal peptides in archaeal proteins (Bagos et al., 2009).
Lipoproteins in bacteria are characterized by the presence of a unique signal sequence at the N-
terminal end (Hayashi & Wu, 1990). This signal sequence is referred to as a “lipobox” and is used
extensively for computational identification and analysis of bacterial lipoproteins. Machine learning-
based approaches for this purpose include HMM-based methods for prediction of lipoprotein signal
peptides in gram-negative bacteria (Juncker et al., 2003) and in gram-positive bacteria (Bagos et al.,
2008). An HMM-based algorithm has been used for functional assignment of predicted lipoproteins in
addition to prediction of signal sequences (Babu et al., 2006). Recently, an SVM-based server
LIPOPREDICT has also been developed to predict bacterial lipoproteins (Kumari et al., 2012).
3.4.2 Prediction of MHC-binding Peptides and B-cell Epitopes
The Major Histocompatibility Complex (MHC) constitutes a very important component of the human
immune system because of its role in both cell-mediated immune response as well as in the humoral
immune response. Antigenic proteins are processed inside a cell so as to present only a specific part of a
protein (T-cell epitope) bound to an MHC molecule, to evoke an immune response. Specificity of
binding of peptides to various MHC alleles is a well-known characteristic feature that makes precise
identification of MHC-binding peptides necessary for the prevention, diagnosis and treatment of different
types of diseases. Hence automated prediction of MHC binding peptides is an important part of
computational immunology (immunoinformatics).
It has been discovered that peptides binding to specific MHC alleles are of short lengths ranging
from 9-12 amino acid residues and share specific conserved positions where amino acids with similar
properties are present, making them functionally important (Falk et al., 1991; Roetzschke et al., 1991).
This led to the conclusion that specific “peptide motifs” are involved in the binding of peptides to both
MHC class I and class II molecules. Machine learning algorithms have been used in combination with
the peptide motifs extensively for prediction of MHC-binding peptides.
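The first step common to most such predictors, enumerating candidate peptides and checking them against a peptide motif, can be sketched as follows. The anchor motif used here (L or M at position 2, V or L at position 9 of a 9-mer) is an illustrative assumption chosen for the example and should not be read as the definition used by any of the methods cited; the function names are likewise hypothetical. In a machine learning setting, the enumerated peptides would be encoded as feature vectors and scored by a trained classifier instead of a fixed motif.

```python
def peptide_windows(sequence, length=9):
    """Return (1-based start, peptide) for every overlapping window of `length`."""
    seq = sequence.upper()
    return [(i + 1, seq[i:i + length]) for i in range(len(seq) - length + 1)]


# Illustrative anchor motif (an assumption): 1-based position -> allowed residues.
ANCHORS = {2: "LM", 9: "VL"}


def matches_anchors(peptide, anchors=ANCHORS):
    """True if every anchor position of the peptide holds an allowed residue."""
    return all(pos <= len(peptide) and peptide[pos - 1] in allowed
               for pos, allowed in anchors.items())


def candidate_binders(sequence, length=9, anchors=ANCHORS):
    """Return the windows of a protein sequence that satisfy the anchor motif."""
    return [(start, pep) for start, pep in peptide_windows(sequence, length)
            if matches_anchors(pep, anchors)]
```

A protein of N residues yields N - 8 overlapping 9-mers, so the motif check (or a classifier) acts as a filter over a list that grows linearly with sequence length.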
ANN-based classifiers have been used for prediction of peptide binding to both class I (Gulukota
et al., 1997; Brusic et al., 1994) and class II alleles (Brusic et al., 1998). In a comparative study it has
been shown that ANN-based method offers high specificity when sufficient experimental data for
peptide-MHC binding is available while HMM-based method would be preferred for high sensitivity
with increasing peptide data (Yu et al., 2002). Earlier, Mamitsuka developed an HMM-based method to
predict peptides binding to MHC (Mamitsuka, 1989). Several groups have utilized SVM for predicting
class I and class II binding peptides (Donnes & Elofsson, 2002; Bhasin & Raghava, 2004c; Bozic et al.,
2005). SVM has been shown to outperform ANN and DT when smaller training datasets are used for
performing predictions (Zhao et al., 2003). Recently, a competition of machine learning methods was
held, to assess their accuracies for predicting T-cell epitopes. The results show that machine learning
methodologies perform exceptionally well for the prediction task (Zhang et al., 2011a).
Machine learning algorithms have also been widely employed for prediction of linear B-cell
epitopes with high efficacies. ABCPred is a method that
uses recurrent ANNs for predicting linear B-cell epitopes (Saha & Raghava, 2006). BCPred and
FBCPred predict linear B-cell epitopes and flexible length linear B-cell epitopes respectively, using SVM
classifiers (El Manzalawy et al., 2008a; El Manzalawy et al., 2008b). COBEpro
is based on a two-step procedure involving SVM for predicting
linear B-cell epitopes (Sweredoski & Baldi, 2009).
3.5 Prediction Methods Using Functional Domain Composition
Functional Domain composition is a recent concept introduced by Chou and co-workers (Chou & Cai,
2002) to represent a protein in an effective way to improve the quality of statistical prediction of protein
function. In this system, a protein is represented by a specific set of discrete numbers, instead of using 20
amino acid components or pseudo amino acid components. In the studies carried out by Murvai et al.
(2001), native functional domains are used as a vector base to define a protein.
Functional Domain composition combined with SVM is employed for a variety of prediction
problems. These include prediction of protein subcellular location (Chou & Cai, 2002) and membrane
protein types (Cai et al., 2003b). KNN predictors are routinely applied for different prediction tasks viz.
prediction of subcellular localization (Jia et al., 2007), protein quaternary structures (Yu et al., 2006b),
enzyme family class (Cai et al., 2005a) and peptidase identification and classification (Xu et al., 2008).
In a study that uses functional domain composition for prediction of functional class of proteins from
Saccharomyces cerevisiae, KNN algorithm is found to produce better results than SVM (Cai & Doig,
2004). The web-based software Enzyme Classification System (ECS) performs
efficient identification as well as classification of enzymes, on the basis of functional domain
composition (Lu et al., 2007).
3.6 Combinations of Different Types of Features
In addition to several methods as described so far, there are methods that combine two or more types of
features in an attempt to enhance the accuracy of function prediction. Traditional approaches for
sequence analysis mentioned above derive information from the sequences in their raw form, i.e. as a
string of amino acids. As discussed above, it is possible to transform this raw sequence information into
biologically more meaningful and dependable form. Thus, different sets of features are derived from the
sequence, which represent the protein in a more comprehensive manner. These discriminative feature sets
can be combined and advantageously utilized for determining functions of uncharacterized protein using
standard machine learning methodologies.
3.6.1 Combination of Compositional Features and Physico-chemical Properties
Several sequence-derived physico-chemical properties are used in combination with amino acid
compositions. These physico-chemical properties are denoted by descriptors that denote numerous
features like distribution of hydrophobicity and hydrophilicity, polarity, charge, secondary structures,
solvent accessibility, polarizability, surface tension, normalized van der Waals volumes and many others.
A database, AAindex, archives various physico-chemical and biochemical properties of amino acids and
pairs of amino acids by numerical values (Kawashima et al., 2000). These values can be utilized for the
purpose of effectively discriminating proteins into functional classes.
SVM is widely used for different prediction problems like protein structural class (Kurgan et al.,
2008), protein subcellular locations (Sarda et al., 2005; Huang et al., 2007), using combination of
features. MultiLoc is a web-based
server that uses SVM for predicting subcellular localization by integrating N-terminal targeting
sequences, amino acid composition and protein sequence motifs (Hoglund et al., 2006). In a method
developed by Wang et al., SVM is combined with conjoint triad features for prediction of enzyme
subfamily class (Wang et al., 2010). A highly efficient web server based on SVM, SVM-Prot is
developed for functional classification of proteins (Cai et al., 2003c; URL:
bin/svmprot.cgi). SVM has also been used to predict proteins involved in bacterial secretion systems
(Pundhir & Kumar, 2011), protein stability changes (Teng et al., 2010), functional class of metal binding
proteins (Lin et al., 2006a), lipid binding proteins (Lin et al., 2006b) and allergens (Cui et al., 2007). A
recent review provides insights into the potential of SVM-based method in prediction of druggable
proteins (Han et al., 2007). Prediction of protein-protein interactions has also been achieved using SVM
and combination of sequence derived features (Bock & Gough, 2001; Guo et al., 2008).
KNN algorithm is used for prediction of protein subcellular location using amino acid and
dipeptide composition along with physico-chemical properties (Gao et al., 2005). Various RF-based
methods have been developed for various tasks using combination of descriptors. These include
algorithms for prediction of glycosylation sites (Hamby & Hirst, 2008), DNA-binding residues (Wu et
al., 2009), antifreeze proteins (Kandaswamy et al., 2011) and protein fold (Dehzangi et al., 2010). ANNs
have been employed to predict protein folding class (Dubchak et al., 1995), antigenic activity in hepatitis
C virus protein (Lara et al., 2008), mammalian secretory proteins targeted to the non-classical secretory
pathways (Bendtsen et al., 2004b) and membrane spanning amino acid sequences (Lohmann et al.,
3.6.2 Combination of Pseudo Amino Acid Composition with Other Features
The pseudo amino acid composition (PseAA) of a protein is combined with PSSM derived protein
profiles to create a new type of descriptor, Pseudo Position-Specific Scoring Matrix; PsePSSM (Chou &
Shen, 2007). This descriptor has been used to predict transmembrane proteins and their types, using a 2-
layered method involving ensemble classifier that is a combination of Optimized Evidence-Theoretic K-
Nearest Neighbor (OET-KNN) classifiers. An ensemble classifier method for predicting protein
subnuclear location is developed that is based on combination of PseAA and PsePSSM (Shen & Chou,
2007b). PseAA has also been employed in prediction of protein structural class using continuous wavelet
transform and principal component analysis (Li et al., 2009). SVM-based methods are found to be
effective in predicting protein submitochondrial location using PseAA and physico-chemical features
(Du & Li, 2006) as well as combining different descriptors like amino acid and dipeptide composition,
gene ontology and evolutionary information into Chou’s PseAA (Fan & Li, 2012). SVM is also applied
for prediction of DNA-binding proteins, based on combinations of autocross-covariance transform,
pseudo-amino acid composition and dipeptide composition (Fang et al., 2008). Cheng and co-workers
have developed an algorithm to predict protein folding rates by applying PseAA along with sliding
window method (Cheng et al., 2012). RF is used for development of web server iDNA-Prot, by
incorporation of PseAA-based grey model (Lin et al., 2011).
3.6.3 Combination of PSSM Profiles with Other Features
The position-specific scoring matrices (PSSM) have also found reasonable application when used in
combination with other features. SVM is used to develop web servers for different application purposes,
which are based on combinations of PSI-BLAST generated PSSM with amino acid composition;
dipeptide composition, secondary structure composition etc. These servers include CyclinPred (Kalita et
al., 2008) for predicting cyclin proteins and ATPsite (Chen
et al., 2011a) for prediction of ATP-binding residues.
ANN and SVM-based models are generated for prediction of GTP interacting residues, dipeptides and
tripeptides using similar feature combinations (Chauhan et al., 2010). In the method developed by Mishra
& Raghava, PSSM profile information is derived for a fixed window length in a sequence to predict FAD
interacting residues via SVM (Mishra & Raghava, 2010). A classifier based on SVM and ANN performs
prediction of histidines and cysteines that participate in binding of several transition metals and iron
complexes (Passerini et al., 2006). VirulentPred, a freely accessible web server based on bi-layer cascade
SVM predicts virulent proteins in bacterial pathogens using PSSM and compositional features (Garg &
Gupta, 2008).
3.6.4 Combination of Functional Domain Composition with Other Features
Methods based on this type of combination include a predictor for enzyme subclass (Cai & Chou,
2005b). A web server, ProtIdent, has been developed for
identifying proteases and their types by fusing functional domain composition and sequential evolution
information (Chou & Shen, 2008) while another algorithm predicts substrate-enzyme-product triads by
combining compound similarity and functional domain composition (Chen et al., 2010). There are also
some methods that utilize combination of PseAA and functional domain composition, such as ANN
classifier for predicting protein subcellular location (Chou & Cai, 2004).
4 Machine Learning Approaches Using Structure-Based Features
The 3D structure of a protein determines several of its important functional features. Therefore numerous
structure-based machine learning methods have been developed to determine various aspects of protein
function. This section describes the approaches that employ structure and shape-based (geometric)
features to identify protein function and local active sites in protein structure. Algorithms that are based
on combinations of sequence derived and structural features have also been described.
4.1 Structure and Shape Based Properties
Local 3D structural patterns, such as the surface cavities of proteins (e.g. the clefts and pockets) represent
important functional sites because of their conserved structural features (Liu, 2008). Hence identification
of these local patterns (pockets) constitutes a very important aspect of structure-based functional
annotation. The relationship between local surface patterns and function is particularly critical for
enzymes, because the active site of an enzyme consists of several catalytic residues in a specific
spatial arrangement. Machine-learning methodologies have been applied successfully for determining
enzyme active sites and the residues therein.
The SVM algorithm has been used to predict active sites from 3D structure alone, both in methods
applicable to any enzyme family (e.g. Tong et al., 2008) and in methods designed for
family-specific prediction of active sites. The method developed by Cai and co-workers, for example,
predicts the catalytic triad of the serine hydrolase family (Cai et al., 2004).
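To illustrate how a specific spatial arrangement of catalytic residues can be turned into numerical input for a classifier, the sketch below computes pairwise distances between three candidate atoms. It is only an assumed example of a geometric feature: the atom labels and the function name are hypothetical, and the features actually used by the cited methods are not described in this section.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def triad_distance_features(atoms: Dict[str, Coord]) -> List[float]:
    """Return pairwise Euclidean distances between candidate atoms.

    Atom labels are sorted alphabetically first so that the order of the
    distances is the same for every candidate site.
    """
    names = sorted(atoms)
    return [math.dist(atoms[a], atoms[b]) for a, b in combinations(names, 2)]


# Hypothetical coordinates (in angstroms) for one candidate Ser-His-Asp site.
candidate = {
    "SER_OG": (0.0, 0.0, 0.0),
    "HIS_NE2": (3.0, 0.0, 0.0),
    "ASP_OD1": (3.0, 4.0, 0.0),
}
features = triad_distance_features(candidate)
```

Distance vectors of this kind, computed for known triads and for decoy residue triplets, could serve as training examples for an SVM.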
SitePredict is an RF-based method for predicting binding sites for specific metal ions or small
molecules in protein structures (Bordner, 2008). The method SCREEN uses the
RF algorithm for the accurate characterization of protein surface cavities and prediction of drug-binding
cavities (Nayal & Honig, 2006).
More recently, an SVM classifier has been developed for prediction of ligand-binding sites in bacterial
lipoproteins, using combinations of structure and shape-based properties (Kadam et al., 2012).
For identification of DNA-binding proteins having the helix-turn-helix structural motif, methods
based on ANNs, DT models and an SVM-based kernel protocol are available (Ferrer-Costa et al., 2005;
McLaughlin & Berman, 2003; Bhardwaj et al., 2005). SVM has been employed to predict protein-protein
interaction surfaces by using surface patch analysis (Bradford & Westhead, 2004) and local surface
properties (Bordner & Abagyan, 2005). Subsequently, Bradford and co-workers combined surface patch
analysis with a Bayesian network to predict protein-protein binding sites (Bradford et al., 2006).
Methods for prediction of functional sites in proteins using properties such as shape and geometry
of the protein surfaces also constitute an important approach for prediction of function. These methods
model the continuous surface of the 3D structure of a protein to represent its shape and geometry at high
resolution and provide useful features that help in the identification and analysis of functionally
important sites such as voids or pockets on these surfaces. 3D Zernike Descriptors (3DZD) are moment-
based descriptors that provide an effective and popular technique to describe molecular surfaces and
hence can be used to represent protein surfaces (Novotni & Klein, 2004). Zernike
descriptors have considerable potential in shape-based prediction methods because they offer a simple
representation and allow highly efficient comparison of protein shapes.
Computational identification of conformational epitopes is critical because it has been shown
that the majority of B-cell epitopes are conformational epitopes (Walter, 1986). Because the number of
experimentally determined 3D structures of antigen-antibody complexes is limited, structure-based
methods for epitope prediction are comparatively few (Kulkarni-Kale et al., 2005;
Sweredoski et al., 2008). An RF predictor has been employed to identify conformational B-cell epitopes
using 3D structures (Zhang et al., 2011b). Liang et al. have used support vector regression based on
structures for prediction of antigenic epitopes (Liang et al., 2010).
Metal atoms are often essential for maintenance of protein structure, enzyme catalysis and
regulatory roles. Prediction of metal binding sites from structural data is therefore of immense value for
functional annotation of newly solved protein structures. DT and SVM classifiers have been successfully
used for detecting metal binding sites in protein structures (Babor et al., 2008; Levy et al., 2009). A
machine-learning based method, FEATURE, performs accurate prediction of calcium binding sites (Liu
& Altman, 2009). A Bayesian classifier has been employed for prediction of zinc binding sites (Ebert &
Altman, 2008).
4.2 Combination of Sequence and Structural Features
As discussed so far, both sequence information and 3D structures can be utilized very
efficiently by machine learning methods for determining various functional aspects of proteins. Both
approaches, however, suffer from limitations: sequence-based methods are less reliable in
assigning a common function to remote homologues, and the structural data available are limited.
Newer algorithms are being developed that combine sequence and structure
features to overcome these limitations and yield more reliable predictions. An example of this approach is the
web server ProFunc, which facilitates comprehensive prediction of protein function (Laskowski et al., 2005). Different types of features derived from the amino
acid sequence of a protein can be combined with important structure-based attributes. Feature vectors
derived from such combinations provide highly informative clues for specific function of a protein.
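A combined feature vector of the kind described above can be sketched in a few lines. The example is an assumption-laden illustration rather than the encoding of any cited method: it concatenates the 20-dimensional amino acid composition of a sequence with whatever structure-derived values the caller supplies (the two values in the usage line are hypothetical placeholders).

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence: str) -> List[float]:
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def combined_vector(sequence: str, structural: Sequence[float]) -> List[float]:
    """Concatenate sequence-derived and structure-derived features."""
    return composition(sequence) + list(structural)


# Hypothetical structural attributes, e.g. a mean relative solvent
# accessibility and a pocket volume; the values are placeholders.
vector = combined_vector("AAC", [0.5, 12.0])
```

In practice the two feature groups usually lie on different scales, so some form of normalization before training is a common design choice.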
A machine learning application has been created that combines computed
electrostatics, evolutionary information derived from sequences and pocket geometric features for high-
performance prediction of catalytic residues (Somarowthu et al., 2011). SVM has been applied to predict
catalytic residues in proteins using structure and sequence features (Petrova & Wu, 2006; Pugalenthi et
al., 2008). Ota et al. developed a KNN predictor for catalytic residues (Ota et al., 2003). ANN classifiers
have been successfully employed for prediction of nucleic acid (specifically, DNA-) binding proteins
based on the structural and sequence properties of electrostatic patches (Stawiski et al., 2003). An ANN-
based method has been developed for prediction of DNA-binding proteins along with prediction of DNA-
binding residues using sequence composition and structural information (Ahmad et al., 2004). A method
based on the RF algorithm has been developed for prediction of protein-RNA binding sites (Liu et al., 2010). An
online web server, RNABindR, that uses a Naive Bayes classifier is also available for predicting RNA-
binding sites using both sequence and structure features (Terribilini et al., 2007).
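As an illustration of the KNN idea mentioned in this paragraph, the following sketch classifies a residue by majority vote among its nearest neighbours in feature space. It is a generic textbook version under stated assumptions (Euclidean distance, unweighted votes, toy two-dimensional features) and does not reproduce the predictor of Ota et al.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, class label)


def knn_predict(train: Sequence[Example], query: Sequence[float], k: int = 3) -> str:
    """Return the majority label among the k training examples closest to query."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training examples")
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: hypothetical per-residue features with two classes.
training = [
    ([0.0, 0.0], "non-catalytic"),
    ([0.0, 1.0], "non-catalytic"),
    ([5.0, 5.0], "catalytic"),
    ([5.0, 6.0], "catalytic"),
    ([6.0, 5.0], "catalytic"),
]
label = knn_predict(training, [5.2, 5.1], k=3)
```

Using an odd k with two classes avoids tied votes; with more classes or an even k, a tie-breaking rule would have to be chosen explicitly.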
Prediction of protein-protein interactions has become an important step in the roadmap to gain
insights into systems biology. With the increasing repertoire of protein structural data, the need for accurate
methods that can use surface features and sequence features to predict potential interacting protein
partners is growing. ANNs have been successfully applied in various methods for prediction of protein-
protein interacting sites. These include prediction of protein-protein interaction sites in heterocomplexes
(Fariselli et al., 2002) and prediction of interface residues in protein-protein complexes (Chen & Zhou,
2005). SVM-based algorithms are available for identification of protein-protein binding (interaction) sites
that employ (i) sequence-based properties along with protein interaction sites ratios (Koike & Takagi,
2004) and (ii) information relating to sequence and structural complementarities across protein interfaces
(Chung et al., 2007). Sequence profile and accessible surface area information, combined with a
structure-based conservation score and SVM, have also been applied for the same purpose (Chung et al.,
2006). An RF predictor has been developed to predict protein interaction sites from sequence and structure-
derived parameters (Sikic et al., 2009). Using both structure and sequence data, a machine learning based
classifier for predicting B-cell epitopes has been developed (Rubinstein et al., 2009a). A web server,
Epitopia, provides an approach for prediction of B-cell epitopes (Rubinstein et al., 2009b).
A server for prediction of leucine-rich nuclear export signals in proteins, based on
ANNs and HMM, is available online (La Cour et al., 2004). Sequence and structure information in combination
with SVM is utilized in a method to predict specificities within enzyme families (Rottig et al., 2010) as
well as in the web server Mupro to predict stability
changes in proteins due to single amino acid mutations (Cheng et al., 2006). ANNs have been used in prediction of
phosphorylation sites in eukaryotic proteins (Blom et al., 1999). Both ANNs and SVM have been
employed to predict protein backbone torsion angles (Kuang et al., 2004). It has been shown that SVM
performs very efficiently in the task of inferring gene functional annotations from a combination of
protein sequence and structure data (Lewis et al., 2006). SVM-based approaches have been developed for
predicting transmembrane helix packing arrangements (Nugent & Jones, 2010) and also for prediction of
membrane-binding proteins (Bhardwaj et al., 2006).
4.3 Combination of Evolutionary Profiles and Structural Features
PSSMs / sequence profiles can be advantageously combined with useful structure derived properties.
This combination provides reliable clues for the function of a protein and can be utilized for efficient
functional annotation.
An ANN has been used to identify the catalytic residues in enzymes, based on an analysis of structure
and sequence conservation (Gutteridge et al., 2003), while Tang et al. integrated an ANN with a genetic
algorithm for prediction of catalytic residues in enzymes using their structures (Tang et al., 2008). SVM
has proven to be very effective in predicting catalytic residues when structural features along with
sequence profiles are used (Youn et al., 2007).
An ANN-based predictor that uses PSI-BLAST derived sequence profiles and solvent
accessibilities of each surface residue has been developed to predict DNA-binding sites on protein
surfaces (Tjong & Zhou, 2007). An SVM-based algorithm that derives features from composition,
evolutionary conservation and structural parameters is applied for characterization and prediction of the
binding sites in DNA-binding proteins (Kuznetsov et al., 2006; Dey et al., 2012). An SVM classifier is also
employed in the method PRINTR to predict RNA-binding sites in proteins using PSSM profiles and
structural information (Wang et al., 2008).
A combination of different types of structure and sequence features along with SVM has been used to
predict hotspots in protein interfaces (Xia et al., 2010; Chen et al., 2011b). An ensemble method
consisting of SVMs has been employed for predicting protein-protein interaction sites using profiles and other
informative features (Deng et al., 2009). A machine learning algorithm based on ANNs has been utilized for
prediction of protein-protein interaction hotspots (Ofran & Rost, 2007).
An ANN classifier is shown to be very efficient for prediction of mammalian mucin-type O-
glycosylation sites using substitution matrix profiles and structural information (Julenius et al., 2005).
ANN has also been employed for prediction of protein solvent accessibility, combining PSSM and
structural profiles (Bondugula & Xu, 2008) and for prediction of carbohydrate binding sites using
structure and profiles (Malik & Ahmad, 2007).
5 Conclusions
As discussed in the previous sections, machine learning methods have been used extensively in
the field of protein function prediction and have contributed significantly to the transformation of a huge
volume of data into useful knowledge. This review has attempted to provide a glimpse of the
vast and ever-expanding realm of machine learning based methods in the area of Bioinformatics.
The distinction of machine learning methods lies in the fact that they do not require explicit knowledge of
homology and homology-derived parameters to be incorporated for the purpose of function prediction.
This distinction makes these classes of methods especially promising for novel targets for
which homologs are not available. As the amount of genomic data continues to grow at an exponential
rate, the requirement for accurate methods for prediction of protein function remains high. There is also a
need to develop meta-servers that make the methods developed for a specific purpose or objective,
such as prediction of protein-protein interaction sites, available through a single portal. Such meta-servers would benefit
developers and users alike by making the training and testing datasets as well as benchmarking results
available in the public domain. While there is a need to refine the methods to achieve higher accuracy in
each class/group, there is also a growing need to bring various aspects of protein function prediction
under the realm of machine learning methods. To understand and model complex biological functions of
proteins, algorithms employing diverse and novel features will be developed and are expected to play
critical roles in functional annotations. Hybrid machine learning approaches, which utilize combinations
of different features along with newer feature selection strategies, will constitute an important part of
future protein function prediction methods.
Acknowledgements
KK is a recipient of a Junior Research Fellowship of the University Grants Commission (UGC) and gratefully
acknowledges funding from UGC, Govt. of India. SVS and UKK acknowledge financial support under
the Center of Excellence programs from the Department of Biotechnology (DBT), Govt. of India as well
as the Department of Information Technology (DIT), Ministry of Communications and Information
Technology (MCIT), Govt. of India. VKJ acknowledges Council of Scientific and Industrial Research
(CSIR), New Delhi, India, for emeritus scientist grant. All the authors acknowledge infrastructural
facilities at the Bioinformatics Centre, University of Pune, Pune, India.
References
Ahmad, S., Gromiha, M.M., Sarai, A. (2004). Analysis and prediction of DNA-binding proteins and their binding residues
based on composition, sequence and structural information. Bioinformatics, 20, 477-486.
Ahmad, S., Sarai, A. (2005). PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33.
Al-Shahib, A., Breitling, R., Gilbert, D. R. (2007). Predicting protein function by machine learning on amino acid
sequences - a critical evaluation. BMC Genomics, 8, 78.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997). Gapped BLAST and
PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Altschul, S.F., et al. (1990). Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Ashburner, M., Ball, C. A., Blake J. A. et al. (2000). Gene ontology: tool for the unification of biology. The gene ontology
consortium. Nat Genet, 25, 25-9.
Attwood, T. K., et al. (1994). PRINTS - a database of protein motif fingerprints. Nucleic Acids Res., 22, 3590-3596.
Babor, M., Gerzon, S., Raveh, B., Sobolev, V. and Edelman, M. (2008). Prediction of transition metal-binding sites from
apo protein structures. Proteins, 70, 208-217.
Babu, M. M., Priya, M. L., Selvan, A. T., Madera, M., Gough, J., Aravind, L., Sankaran, K. (2006). A database of
bacterial lipoproteins (DOLOP) with functional assignments to predicted lipoproteins. J. Bacteriol., 188, 2761
Bagos, P. G., Tsirigos, K. D., Plessas, S. K., Liakopoulos, T. D., Hamodrakas, S. J. (2009). Prediction of signal peptides in
archaea. Protein Eng Des Sel, 22, 27-35.
Bagos, P. G., Tsirigos, K. D., Liakopoulos, T. D. and Hamodrakas, S. J. (2008). Prediction of Lipoprotein Signal Peptides
in Gram-Positive Bacteria with a Hidden Markov Model. J Proteome Res., 7(12), 5082-93.
Bairoch, A. et al. (1995). The PROSITE database, its status in 1995. Nucleic Acids Res., 24, 189-196.
Bender, A., Van Dooren G.G., Ralph, S.A., McFadden, G.I., Schneider, G. (2003). Properties and prediction of
mitochondrial transit peptides from Plasmodium falciparum. Mol Biochem Parasitol, 132 (2), 59-66.
Bendtsen, J. D., Nielsen, H., Von Heijne, G., Brunak, S. (2004a). Improved prediction of signal peptides: SignalP 3.0. J
Mol Biol, 340 (4), 783-795.
Bendtsen, J. D., Jensen, L. J., Blom, N., Von Heijne, G., Brunak, S. (2004b). Feature-based prediction of non-classical and
leaderless protein secretion. Protein Eng. Des. Sel., 17, 349-356.
Benner, S. A., Chamberlin, S. G., Liberles, D. A., Govindarajan, S., Knecht, L. (2000). Functional inferences from
reconstructed evolutionary biology involving rectified databases - an evolutionarily grounded approach to functional
genomics. Res Microbiol, 151, 97-106.
Bhardwaj, N., Langlois, R. E., Zhao, G., Lu, H. (2005). Kernel-based machine learning protocol for predicting DNA-
binding proteins. Nucleic Acids Res., 33, 6486-6493.
Bhardwaj, N., Stahelin, R.V., Langlois, R.E., Cho, W., Lu, H. (2006). Structural Bioinformatics Prediction of Membrane-
binding Proteins. J Mol Biol, 359 (2), 486-495.
Bhasin, M. and Raghava, G.P. (2004a). Classification of nuclear receptors based on amino acid composition and dipeptide
composition. J. Biol. Chem., 279, 23262-23266.
Bhasin, M., Garg A., and Raghava, G. P. S. (2005). PSLpred: prediction of subcellular localization of bacterial proteins.
Bioinformatics, 21(10), 2522-2524.
Bhasin, M., Raghava, G. P. (2004b). GPCRpred: an SVM-based method for prediction of families and subfamilies of G-
protein coupled receptors. Nucleic Acids Res., 32, W383-W389.
Bhasin, M., Raghava, G. P. S. (2004c). SVM based method for predicting HLA-DRB1*0401 binding peptides in an
antigen sequence. Bioinformatics, 20, 421-3.
Blekas, K., Fotiadis, DI., Likas, A. (2005). Motif-based protein sequence classification using neural networks. J Comput
Biol, 12, 64-82.
Blom, N., Gammeltoft, S. and Brunak, S. (1999). Sequence and structure-based prediction of eukaryotic protein
phosphorylation sites. J. Mol. Biol., 294, 1351-1362.
Blom, N., Sicheritz-Pontén, T., Gupta, R., Gammeltoft, S. and Brunak, S. (2004). Prediction of post-translational
glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics, 4, 1633-1649.
Bock, J. R. and Gough, D.A. (2001). Predicting protein-protein interactions from primary structure. Bioinformatics, 17,
Bondugula, R., Xu, D. (2008). Combining sequence and structural profiles for protein solvent accessibility prediction.
Proc LSS Comput Syst Bioinform Conf., 7, 195-202.
Bordner, A. J. (2008). Predicting small ligand binding sites in proteins using backbone structure.
Bioinformatics, 24 (24), 2865-2871.
Bordner, A. J. and Abagyan, R. (2005). Statistical analysis and prediction of protein-protein interfaces. Proteins, 60, 353
Bork, P. and Koonin, E. V. (1996). Protein sequence motifs. Curr Opin Struct Biol., 6 (3), 366-376.
Bork, P. and Koonin, E.V. (1998). Predicting functions from protein sequences - where are the bottlenecks? Nat Genet, 18,
Bozic, I., Zhang, G., Brusic, V. (2005). Predictive vaccinology: optimisation of predictions using support vector machine
classifiers. Lecture Notes in Computer Science, 3578, 375-81.
Bradford, J. R. and Westhead, D. R. (2004). Improved prediction of protein-protein binding sites using a support vector
machines approach. Bioinformatics, 21 (8), 1487-1494.
Bradford, J. R., Needham, C. J., Bulpitt, A. J., Westhead D. R. (2006). Insights into Protein-Protein Interfaces using a
Bayesian Network Prediction Method. J Mol Biol, 362 (2), 365-386.
Breiman, L. (2001). Random Forests. Mach. Learn., 45, 5-32.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and regression trees. Chapman & Hall, New York.
Brusic, V., Rudy, G., Harrison, L. C. (1994). Prediction of MHC binding peptides using artificial neural networks. In:
Stonier RJ, Yu XS (eds). Complex Systems: Mechanism of Adaptation. Amsterdam: IOS Press. 253-60.
Brusic, V., Rudy, G., Honeyman, M. et al. (1998). Prediction of MHC class II-binding peptides using an evolutionary
algorithm and artificial neural network. Bioinformatics, 14, 121-30.
Cai, Y. D., Chou, K. C. (2005a). Using functional domain composition to predict enzyme family classes. J Proteome
Res, 4 (1), 109-111.
Cai, Y. D. and Chou, K. C. (2005b). Predicting enzyme subclass by functional domain composition and pseudo amino acid
composition. J Proteome Res, 4, 967-971.
Cai, Y. D., Hu, J., Liu, X. J., and Chou, K. C. (2002). Prediction of protein structural classes by neural network method.
Internet Electron J Mol Des, 1, 332-338.
Cai, Y. D., Lin, S. L. (2003). Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from
amino acid sequence. Biochimica et Biophysica Acta - Proteins and Proteomics, 1648 (1-2), 127-133.
Cai, Y. D., Feng, K. Y., Li, Y. X., Chou, K. C. (2003a). Support Vector Machine for predicting α-turn
types. Peptides, 24 (4), 629-630.
Cai, Y. D., Zhou, G. P., Chou, K. C. (2003b). Support vector machines for predicting membrane protein types by using
functional domain composition. Biophys. J., 84 (5), 3257-3263.
Cai, C. Z., Han, L. Y., Ji, Z. L., Chen, X., Chen, Y. Z. (2003c). SVM-Prot: Web-based support vector machine software for
functional classification of a protein from its primary sequence. Nucleic Acids Res, 31 (13), 3692-3697.
Cai, Y. D., Zhou, G. P., Jen, C. H., Lin, S. L., and Chou, K.C. (2004). Identify catalytic triads of serine hydrolases by
support vector machines. J Theor Biol, 228, 551-557.
Cai, Y.D. and Doig, A.J. (2004). Prediction of Saccharomyces cerevisiae protein functional class from functional domain
composition. Bioinformatics, 20, 1292-1300.
Capra, JA., Laskowski, RA., Thornton, JM., Singh, M., Funkhouser, TA. (2009). Predicting Protein Ligand Binding Sites
by Combining Evolutionary Sequence Conservation and 3D Structure. PLoS Comput Biol, 5(12), e1000585.
Cedano, J., Aloy, P., Perez-Pons, J. A., Querol E. (1997). Relation between amino acid composition and cellular location
of proteins. J Mol Biol, 266 (3), 594-600.
Chae, M.H., Krull, F., Lorenzen, S., Knapp, E.W. (2010). Predicting protein complex geometries with a neural network.
Proteins, 78, 1026-1039.
Chandonia, J. M., Karplus, M. (1995). Neural networks for secondary structure and structural class prediction. Protein
Sci., 4, 275-285.
Chang, C.-C., Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent
Systems and Technology, 2, 27:1-27:27.
Chang, DTH., Weng, YZ., Lin, JH. et al. (2006). Protemot: prediction of protein binding sites with automatically extracted
geometrical templates. Nucleic Acids Res, 34, W303-9.
Chang, DT., Oyang, YJ., Lin, JH. (2005). MEDock: a web server for efficient prediction of ligand binding sites based on a
novel optimization algorithm. Nucleic Acids Res, 33, W233-W238.
Chang, DT., Syu, YT., Lin, PC. (2010). Predicting the protein-protein interactions using primary structures with predicted
protein surface. BMC Bioinformatics, 11 (Suppl 1), S3.
Chauhan, J. S., Mishra, N. K. and Raghava, G. P. S. (2010). Prediction of GTP interacting residues, dipeptides and
tripeptides in a protein from its evolutionary information. BMC Bioinformatics, 11, 301.
Che, D., Liu, Q., Rasheed, K., Tao, X. (2011). Decision tree and ensemble learning algorithms with their applications in
bioinformatics. Adv Exp Med Biol, 696, 191-199.
Chen, C., Tian, Y. X., Zou, X. Y., Cai, P. X., Mo, J.-Y. (2006a). Using pseudo-amino acid composition and support vector
machine to predict protein structural class. J Theor Biol, 243, 444-448.
Chen, C., Zhou, X., Tian, Y., Zou, X., Cai, P. (2006b). Predicting protein structural class with pseudo-amino acid
composition and support vector machine fusion network. Anal Biochem, 357 (1), 116-121.
Chen, H. and Zhou, H.-X. (2005). Prediction of interface residues in protein-protein complexes by a consensus neural
network method: Test against NMR data. Proteins, 61, 21-35.
Chen, K., Mizianty, M. J., Kurgan, L. (2011a). ATPsite: sequence-based prediction of ATP-binding residues. Proteome
Sci, 9(Suppl 1), S4.
Chen, R., Chen, W., Yang, S., Wu, D., Wang, Y., Tian, Y., Shi, Y. (2011b). Rigorous assessment and integration of the
sequence and structure based features to predict hot spots. BMC Bioinformatics, 12, 311.
Chen, L., Feng, K. Y., Cai, Y. D., Chou, K.C., & Li, H.P. (2010). Predicting the network of substrate-enzyme-product
triads by combining compound similarity and functional domain composition. BMC Bioinformatics, 11, 293.
Chen, Y. L., Li, Q. Z. (2007). Prediction of the subcellular location of apoptosis proteins. J Theor Biol, 245 (4), 775-783.
Cheng, C. W., Su, E. C., Hwang, J. K., Sung, T. Y., Hsu, W. L. (2008). Predicting RNA-binding sites of proteins using
support vector machines and evolutionary information. BMC Bioinformatics, 9, S6.
Cheng, J., Randall, A., Baldi, P. (2006). Prediction of protein stability changes for single-site mutations using support
vector machines. Proteins, 62, 1125-1132.
Cheng, X., Xiao, X., Wu, ZC., Wang, P., Lin, WZ. (2012). Swfoldrate: Predicting protein folding rates from amino acid
sequence with sliding window method. Proteins, Aug. 29 (Epub ahead of print).
Chou, K. C. (1999). Using pair-coupled amino acid composition to predict protein secondary structure content. J. Protein
Chem., 18, 473-480.
Chou, K. C. (2001). Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins, 43, 246-255.
Chou, K. C. and Cai, Y. D. (2004). Predicting subcellular localization of proteins by hybridizing functional domain
composition and pseudo-amino acid composition. J Cell Biochem, 91, 1197-1203.
Chou, K. C. and Shen, H.B. (2008). ProtIdent: A web server for identifying proteases and their types by fusing functional
domain and sequential evolution information. Biochem Biophys Res Comm, 376, 321-325.
Chou, K. C. and Zhang, C.T. (1995). Review: Prediction of protein structural classes. Crit Rev Biochem Mol Biol, 30, 275-
Chou, K. C., Shen, H. B. (2007). MemType-2L: A Web server for predicting membrane proteins and their types by
incorporating evolution information through Pse-PSSM. Biochem Biophys Res Commun, 360 (2), pp. 339-345.
Chou, K., Cai, Y. (2002). Using functional domain composition and support vector machines for prediction of protein
subcellular location. J. Biol. Chem. 277, 45765-45769.
Chou, KC. (2011). Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary
Year Review). J Theor Biol, 273, 236-247.
Chu, W., Huang, Y., Huang, C., Cheng, Y., Huang, C., Oyang, Y. (2009). ProteDNA: a sequence-based predictor of
sequence-specific DNA-binding residues in transcription factors. Nucleic Acids Res., 37, W396-W401.
Chung, J. L., Wang, W., Bourne, P. E. (2006). Exploiting sequence and structure homologs to identify protein-protein
binding sites. Proteins, 62(3), 630-640.
Chung, J. L., Wang, W., Bourne, P. E. (2007). High-Throughput Identification of Interacting Protein-Protein Binding
Sites. BMC Bioinformatics, 8, 223.
Cui, J., Han, L. Y., Li, H., Ung, C. Y., Tang, Z. Q., Zheng, C. J., Cao, Z. W., Chen, Y. Z. (2007). Computer prediction of
allergen proteins from sequence-derived protein structural and physicochemical properties. Mol
Immunol, 44 (4), 514-520.
De Fonzo, V., Aluffi-Pentini, F., Parisi, V. (2007). Hidden Markov models in bioinformatics. Curr Bioinform, 2, 49-61.
Dehzangi, A., Amnuaisuk, S. P., Dehzangi, O. (2010). Using Random Forest for Protein Fold Prediction Problem: An
Empirical Study. J. Inf. Sci. Eng., 26(6), 1941-1956.
Deng, L., Guan, J., Dong, Q., Zhou, S. (2009). Prediction of protein-protein interaction sites using an ensemble method.
BMC Bioinformatics, 10, 426.
Dey, S., Pal, A., Guharoy, M., Sonavane, S., and Chakrabarti, P. (2012). Characterization and prediction of the binding site
in DNA-binding proteins: improvement of accuracy by combining residue composition, evolutionary conservation
and structural parameters. Nucleic Acids Res., May 27.
Ding, Y. S., Zhang, T. L. & Chou, K.C. (2007). Prediction of protein structure classes with pseudo amino acid composition
and fuzzy support vector machine network. Protein Pept Lett, 14, 811-815.
Dodson, GG., Lane, DP., Verma, CS. (2008). Molecular simulations of protein dynamics: new windows on mechanisms in
biology. EMBO Rep, 9, 144-150.
Donnes, P., Elofsson, A. (2002). Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics, 3, 25.
Doytchinova, I.A., Flower, D.R. (2007). Identifying candidate subunit vaccines using an alignment-independent method
based on principal amino acid properties. Vaccine, 25 (5), 856-866.
Drăghici, S. and Potter, R. B. (2003). Predicting HIV drug resistance with neural networks. Bioinformatics, 19(1), 98-107.
Du, P., Li, Y. (2006). Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with
various physicochemical features of segmented sequence. BMC Bioinformatics, 7, 518.
Dubchak, I., Muchnik, I., Holbrook, S., Kim, S. (1995). Prediction of protein folding class using global description of
amino acid sequence. Proc Natl Acad Sci USA, 92, 8700-8704.
EL-Manzalawy, Y., Dobbs, D., Honavar, V. (2008a). Predicting linear B-cell epitopes using string kernels. J. Mol.
Recognit., 21, 243-255.
EL-Manzalawy, Y., Dobbs, D., Honavar, V. (2008b). Predicting flexible length linear B-cell epitopes. 7th International
Conference on Computational Systems Bioinformatics, 121-131.
Ebert, J., Altman, R. (2008). Robust recognition of zinc binding sites in proteins. Prot Sci, 17, 54-65.
Eisenberg, D., Marcotte, E. M., Xenarios, I. & Yeates, T. (2000). Protein function in the post-genomic era. Nature, 405,
Emanuelsson, O., Nielsen, H., Brunak, S., Von Heijne, G. (2000). Predicting subcellular localization of proteins based on
their N-terminal amino acid sequence. J Mol Biol, 300 (4), 1005-1016.
Emanuelsson, O., Nielsen, H., Von Heijne, G. (1999). ChloroP, a neural network-based method for predicting chloroplast
transit peptides and their cleavage sites. Protein Sci., 8, 978-984.
Falk, K., Rötzschke, O., Stevanovic, S. et al. (1991). Allele-specific motifs revealed by sequencing of self-peptides eluted
from MHC molecules. Nature, 351, 290-6.
Fan, GL. and Li, QZ. (2012). Predicting protein submitochondria locations by combining different descriptors into the
general form of Chou’s pseudo amino acid composition. Amino Acids, 43, 2, 545-55.
Fang, Y., Guo, Y., Feng, Y., Li, M. (2008). Predicting DNA-binding proteins: approached from Chou's pseudo amino acid
composition and other specific sequence features. Amino Acids, 34, 103-109.
Fariselli, P., Finocchiaro, G., Casadio, R. (2003). SPEPlip: the detection of signal peptide and lipoprotein cleavage sites.
Bioinformatics, 19, 2498-2499.
Fariselli, P., Pazos, F., Valencia, A. and Casadio, R. (2002). Prediction of protein-protein interaction sites in
heterocomplexes with neural networks. Eur J Biochem, 269, 1356-1361.
Favia, A. D. and Nobeli, I. (2011). Using Chemical Structure to Infer Biological Function, In: Computational Approaches
in Cheminformatics and Bioinformatics (eds R. Guha and A. Bender), John Wiley & Sons, Inc., Hoboken, NJ, USA.
Ferrer-Costa, C., Shanahan, H. P., Jones, S., Thornton, J. M. (2005). HTHquery: A method for detecting DNA-binding
proteins with a helix-turn-helix structural motif. Bioinformatics, 21 (18), 3679-3680.
Finn, R., Clements, J., Eddy, S. (2011). HMMER Web Server: Interactive Sequence Similarity Searching. Nucleic Acids
Res, 39, W29-37.
Friedberg, I. (2006). Automated protein function prediction--the genomic challenge. Brief Bioinform, 7, 225-242.
Galperin, MY., Walker, DR., Koonin, EV. (1998). Analogous enzymes: independent inventions in enzyme evolution.
Genome Res, 8, 779-90.
Gao, Q. B., Wang, Z. Z., Yan, C., Du, Y. H. (2005). Prediction of protein subcellular location using a combined feature of
sequence. FEBS Letters, 579 (16), 3444-3448.
Garg, A., Bhasin, M., Raghava, G. P. (2005). SVM-based method for subcellular localization of human proteins using
amino acid compositions, their order and similarity search. J. Biol. Chem., 280, 14427-14432.
Garg, A., Gupta, D. (2008). VirulentPred: a SVM based prediction method for virulent proteins in bacterial pathogens.
BMC Bioinformatics, 9, 62.
Garg, A., Raghava, G. P.S. (2008). A Machine Learning Based Method for the Prediction of Secretory Proteins Using
Amino Acid Composition, Their Order and Similarity-Search. In Silico Biol, 8, 129-140.
Glazer, DS., Radmer, RJ., Altman, RB. (2009). Improving structure-based function prediction using molecular dynamics.
Structure, 17, 919-929.
Gold, ND., Jackson, RM. (2006). Fold independent structural comparisons of protein-ligand binding sites for exploring
functional relationships. J Mol Biol., 3, 355(5), 1112-24.
Grill, C. P. & Rush, V. N. (2000). Analysing spectral data: comparison and application of two techniques. Biol J Linn Soc,
69, 121-138.
Gulukota, K., Sidney, J., Sette, A., DeLisi, C. (1997). Two complementary methods for predicting peptides binding major
histocompatibility complex molecules. J Mol Biol. 267, 1258-67.
Guo, J., Lin, Y. (2006). TSSub: eukaryotic protein subcellular localization by extracting features from profiles.
Bioinformatics, 22, 1784-5.
Guo, Y., Yu, L., Wen, Z., Li, M. (2008). Using support vector machine combined with auto covariance to predict protein-
protein interactions from protein sequences. Nucleic Acids Res., 36, 3025-3030.
Gutteridge, A., Bartlett, G.J., and Thornton, J.M. (2003). Using a neural network and spatial clustering to predict the
location of active sites in enzymes. J. Mol. Biol., 330, 719-734.
Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002). Gene selection for cancer classification using support vector
machines. Mach. Learn., 46, 389-422.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H. (2009). The WEKA data mining software: an
update. SIGKDD Explor. Newsl., 11, 10-18.
Hamby, S. E., Hirst, J. D. (2008). Prediction of glycosylation sites using random forests. BMC Bioinformatics, 9, 500.
Han, L. Y., Zheng, C. J., Xie, B., Jia, J., Ma, X. H., Zhu, F., Lin, H. H., Chen, Y. Z. (2007). Support vector machines
approach for predicting druggable proteins: recent progress in its exploration and investigation of its
usefulness. Drug Discov Today, 12 (7-8), 304-313.
Han, L., Cui, J., Lin, H., Ji, Z., Cao, Z., Li, Y., Chen, Y. (2006). Recent progresses in the application of machine learning
approach for predicting protein functional class independent of sequence similarity. Proteomics, 6, 4023-4037.
Hansen, J. E., Lund, O., Tolstrup, N., Gooley, A. A., et al. (1998). NetOglyc: Prediction of mucin type O-glycosylation
sites based on sequence context and surface accessibility. Glycoconj J, 15, 115-130.
Harrison A., Pearl F., Sillitoe I., Slidel T., Mott R., Thornton J. M., Orengo C. (2003). Recognising the fold of a protein
structure. Bioinformatics, 19, 1748-1759.
Hayashi, S. and Wu, H.C. (1990). Lipoproteins in bacteria. J. Bioenerg. Biomembr. 22, 451-471.
Henikoff, S. (1996). Scores for sequence searches and alignments. Curr Opin Struct Biol, 6 (3), 353-360.
Henikoff, S. and Henikoff, J. G. (1994). Protein family classification based on searching a database of blocks. Genomics,
19, 97-107.
Hirose, S., Yokota, K., Kuroda, Y., Wako, H., Endo, S., Kanai, S., Noguchi, T. (2010). Prediction of protein motions from
amino acid sequence and its application to protein-protein interaction. BMC Struct. Biol., 10, 20.
Hoglund, A., Donnes, P., Blum, T., Adolph, H. W., Kohlbacher, O. (2006). MultiLoc: Prediction of protein subcellular
localization using N-terminal targeting sequences, sequence motifs and amino acid composition.
Bioinformatics, 22 (10), 1158-1165.
Holm L., Sander C. (1993). Protein structure comparison by alignment of distance matrices. J Mol Biol, 233, 123-138.
Hua, S. and Sun, Z. (2001). Support vector machine approach for protein subcellular localization prediction.
Bioinformatics, 17(8), 721-728.
Huang, N., Chen, H., and Sun, Z. (2005). CTKPred: an SVM-based method for the prediction and classification of the
cytokine superfamily. Protein Eng Des Sel, 18(8), 365-368.
Huang, W. L., Tung, C. W., Huang, H. L., Hwang, S. F., Ho, S. Y. (2007). ProLoc: Prediction of protein subnuclear
localization using SVM with automatic selection from physicochemical composition
features. BioSystems, 90 (2), 573-581.
Huang, Y. and Yanda, Li. (2004). Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics,
20(1), 21-28.
Hwang, S., Gou, Z. and Igor, B. (2007). DP-Bind: a web server for sequence-based prediction of DNA-binding residues in
DNA-binding proteins. Bioinformatics, 23, 634-636.
Ishida, T., Kinoshita, K. (2007). PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids
Res., 35, W460-464.
Jain, AK., Duin, RPW., Mao, J. (2000). Statistical Pattern Recognition: A Review. IEEE Trans Pattern Anal Mach Intell,
22, 4–37.
Jensen, L. J., Skovgaard, M., Brunak, S. (2002). Prediction of novel archaeal enzymes from sequence-derived features.
Protein Sci., 3, 2894-2898.
Jia, P., Qian, Z., Zeng, Z., Cai, Y., Li Y. (2007). Prediction of subcellular protein localization based on functional domain
composition. Biochem Biophys Res Commun, 357 (2), 366-370.
Johnson, R. and Wichern, D. (1982). Applied Multivariate Statistical Analysis. Prentice-Hall, Inc.: Englewood Cliffs, NJ.
Julenius, K., Molgaard, A., Gupta, R., Brunak, S. (2005). Prediction, conservation analysis, and structural characterization
of mammalian mucin-type O-glycosylation sites. Glycobiology, 15, 153-164.
Juncker, A.S., Willenbrock, H., Von Heijne, G, Brunak, S., Nielsen, H., et al. (2003). Prediction of lipoprotein signal
peptides in Gram-negative bacteria. Protein Sci. 12, 1652-1662.
Juncker, AS., Jensen, LJ., Pierleoni, A., Bernsel, A., Tress, ML., Bork, P., Heijne, Gv., Valencia, A., Ouzounis, CA.,
Casadio, R., Brunak, S. (2009). Sequence-based feature prediction and annotation of proteins. Genome Biol., 10,
Kadam, K., Prabhakar, P., and Jayaraman, V.K. (2012). SVM prediction of ligand-binding sites in bacterial lipoproteins
employing shape and physio-chemical descriptors. Protein Pept Lett, 19, 1155-1162.
Kalita, M. K., Nandal, U. K., Pattnaik, A., Sivalingam, A., Ramasamy, G., Kumar, M., Raghava, G. P. S. and Gupta, D.
(2008). CyclinPred: a SVM-based method for predicting cyclin protein sequences. PLoS ONE, 3(7), e2605.
Kandaswamy, K. K., Chou, K. C., Martinetz, T., Moller, S., Suganthan, P. N., Sridharan, S., Pugalenthi, G. (2011). AFP-
Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties. J Theor
Biol, 270 (1), 56-62.
Kaur, H. and Raghava, G. P. S. (2004). A neural network method for prediction of β-turn types in proteins using
evolutionary information. Bioinformatics, 20(16), 2751-2758.
Kawashima, S., Ogata, H., Kanehisa, M. (2000). AAindex: amino acid index database. Nucleic Acids Res, 28, 374.
Kim, J. H., Lee, J., Oh, B., Kimm, K., Koh, I. (2004). Prediction of phosphorylation sites using SVMs. Bioinformatics, 20,
Kingsford, C., Salzberg, SL. (2008). What are decision trees? Nat Biotechnol., 26, 1011-1013.
Kinoshita, K., Kono, H. and Yura, K. (2008). Prediction of Molecular Interactions from 3D-Structures: From Small
Ligands to Large Protein Complexes, In: Prediction of Protein Structures, Functions, and Interactions (ed J. M.
Bujnicki), John Wiley & Sons, Ltd, Chichester, UK.
Koike, A. and Takagi, T. (2004). Prediction of protein-protein interaction sites using support vector machines. Protein Eng
Des Sel, 17, 165-173.
Kuang, R., Leslie, CS., Yang, A.-S. (2004). Protein backbone angle prediction with machine learning approaches.
Bioinformatics, 20, 1612-1621.
Kulkarni-Kale, U., Waman, V., Raskar, S., Mehta S., Saxena, S. (2012). Genome To Vaccinome: Role of Bioinformatics,
Immunoinformatics & Comparative Genomics. Curr Bioinform, 7, 4, 451-463.
Kulkarni-Kale, U., Bhosle, S., and Kolaskar, A.S. (2005). CEP: A conformational epitope prediction server. Nucleic Acids
Res., 33, W168-W171.
Kumar, K. K., Shelokar, P. S. (2008). An SVM method using evolutionary information for the identification of allergenic
proteins. Bioinformation, 2(6), 253-256.
Kumar, M., Gromiha, M. M. and Raghava, G. P. S. (2008). Prediction of RNA binding sites in a protein using SVM and
PSSM profile. Proteins, 71, 189-194.
Kumari, R. S., Kadam, K., Badwaik, R., Jayaraman, V. K. (2012). Lipopredict: Bacterial lipoprotein prediction server.
Bioinformation, 8(8), 394-398.
Kunik, V., Solan, Z., Edelman, S. et al. (2005). Motif extraction and protein classification. Proc IEEE Comput Syst
Bioinform Conf., 80, 5.
Kurgan, L., Cios, K., Chen, K. (2008). SCPRED: Accurate prediction of protein structural class for sequences of twilight-
zone similarity with predicting sequences. BMC Bioinformatics, 9, 226.
Kuznetsov, I. B., Gou, Z., Li, R. and Hwang, S. (2006). Using evolutionary and structural information to predict DNA-
binding sites on DNA-binding proteins. Proteins, 64, 19-27.
Käll, L., Krogh, A., Sonnhammer, E. L. (2004). A combined transmembrane topology and signal peptide prediction
method. J. Mol. Biol., 338, 1027-1036.
La Cour, T., L. Kiemer, A. Mølgaard, R. Gupta, K. Skriver, and S. Brunak. (2004). Analysis and prediction of leucine-rich
nuclear export signals. Protein Eng. Des. Sel., 17, 527-536.
Ladunga, I., Czakó, F., Csabai, I., Geszti, T. (1991). Improving signal peptide prediction accuracy by simulated neural
network. Comput Appl Biosci., 7(4), 485-7.
Lancashire, LJ., Lemetre, C., Ball, GR. (2009). An introduction to artificial neural networks in bioinformatics--application
to complex microarray and mass spectrometry datasets in cancer studies. Brief Bioinform., 10, 315-329.
Lara, J., Wohlhueter, R. M., Dimitrova, Z., Khudyakov, Y. E. (2008). Artificial neural network for prediction of antigenic
activity for a major conformational epitope in the hepatitis C virus NS3 protein. Bioinformatics, 24 (17), 1858-1864.
Larrañaga, P., Calvo, B., Robles, V., et al. (2006). Machine learning in bioinformatics. Brief Bioinform, 7(1), 86-112.
Laskowski, R. A., Watson, J. D., Thornton, J. M. (2005). ProFunc: a server for predicting protein function from 3D
structure. Nucleic Acids Res, 33, W89-93.
Levy, R., Edelman, M. and Sobolev, V. (2009). Prediction of 3D metal binding sites from translated gene sequences based
on remote-homology templates. Proteins, 76, 365-374.
Lewis, DP., Jebara, T., Noble, WS. (2006). Support vector machine learning from heterogeneous data: an empirical
analysis using protein sequence and structure. Bioinformatics, 22, 2753-60.
Li, F. M., Li, Q. Z. (2008). Predicting Protein Subcellular Location Using Chou's Pseudo Amino Acid Composition and
Improved Hybrid Approach. Protein Pept Lett, 15, 612-616.
Li, L., Umbach, DM., Terry, P., Taylor, JA. (2004). Application of the GA/KNN method to SELDI proteomics data.
Bioinformatics, 20, 1638-1640.
Li, ZC., Zhou, XB., Dai, Z., Zou, XY. (2009). Prediction of protein structural classes by Chou's pseudo amino acid
composition: approached using continuous wavelet transform and principal component analysis. Amino acids, 37, 2,
Li, GZ., Bu, HL., Yang, MQ., Zeng, XQ., Yang, JY. (2008). Selecting subsets of newly extracted features from PCA and
PLS in microarray data analysis. BMC Genomics, 9, S24.
Liang, S., Zheng, D., Standley, D. M., Yao, B., Zacharias, M., et al. (2010). EPSVR and EPMeta: prediction of antigenic
epitopes using support vector regression and multiple server results. BMC Bioinformatics, 11, 381.
Liaw, A., Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18-22.
Lin, H. H., Han, L. Y., Zhang, H. L., Zheng, C. J., Xie, B., et al. (2006a). Prediction of the functional class of metal-
binding proteins from sequence derived physicochemical properties by support vector machine approach. BMC
Bioinformatics, 7, S13.
Lin, H. H., Han, L. Y., Zhang, H. L., Zheng, C. J., Xie, B., et al. (2006b). Prediction of the functional class of lipid binding
proteins from sequence-derived properties irrespective of sequence similarity. J Lipid Res, 47, 824-831.
Lin, W. Z., Fang, J. A., Xiao, X., Chou, K. C. (2011). iDNA-Prot: Identification of DNA Binding Proteins Using Random
Forest with Grey Model. PLoS ONE, 6(9), e24756.
Liu, T., Altman, R. B. (2009). Prediction of calcium-binding sites by combining loop-modeling with machine learning.
BMC Struct Biol, 9, 72.
Liu, Z. P., Wu, L. Y., Wang, Y., Zhang, X. S., Chen, L. N. (2010). Prediction of protein-RNA binding sites by a random
forest method with combined features. Bioinformatics, 26, 1616-1622.
Liu, Z. P., et al. (2008). Bridging protein local structures and protein functions. Amino Acids, 35, 627-650.
Lohmann, R., Schneider, G., Behrens, D., Wrede, P. (1994). A neural network model for the prediction of membrane-
spanning amino acid sequences. Protein Sci, 1597-1601.
Lu, L., Qian, Z., Cai, Y. D., Li, Y. (2007). ECS: An automatic enzyme classifier based on functional domain composition.
Comput Biol Chem, 31 (3), 226-232.
Lukashin, A.V., Borodovsky, M. (1998). GeneMark.hmm: new solutions for gene finding, Nucleic Acids Res, 26, 1107-
Ma, J., Gu, H. (2010). A novel method for predicting protein subcellular localization based on pseudo amino acid
composition. BMB Rep., 43(10), 670-6.
Madej, T., Gibrat, JF., Bryant, SH. (1995). Threading a database of protein cores. Proteins, 23, 356-69.
Majoros, WH., Pertea, M., Salzberg, SL. (2005). Efficient implementation of a generalized pair hidden Markov model for
comparative gene finding. Bioinformatics, 21, 1782-8.
Malik, A., Ahmad, S. (2007). Sequence and structural features of carbohydrate binding in proteins and assessment of
predictability using a neural network. BMC Struct Biol, 7, 1.
Mamitsuka, H. (1989). Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov
models. Proteins, 33, 460-74.
Martin, S., Roe, D., Faulon, J. L. (2005). Predicting protein-protein interactions using signature products. Bioinformatics,
21, 218-226.
McLaughlin, W. A., Berman, H. M. (2003). Statistical models for discerning protein structures containing the DNA-
binding helix-turn-helix motif. J Mol Biol, 330 (1), 43-55.
Mishra, N.K. and Raghava, G. P. S. (2010). Prediction of FAD interacting residues in a protein from its primary sequence
using evolutionary information. BMC Bioinformatics, 11, S48.
Mundra, P., Kumar, M., Kumar, K. K., Jayaraman, V. K., Kulkarni, B. D. (2007). Using pseudo amino acid composition to
predict protein subnuclear localization: Approached with PSSM. Pattern Recognit Lett, 28 (13), 1610-1615.
Murvai, J., Vlahovicek, K., Barta, E., Pongor, S. (2001). The SBASE protein domain library, release 8.0: a collection of
annotated protein sequence segments. Nucleic Acids Res. 29, 58-60.
Najmanovich, R., Kurbatova, N., Thornton, J. (2008). Detection of 3D atomic similarities and their use in the
discrimination of small molecule protein-binding sites. Bioinformatics, 24(16), 105-11.
Nakai, K., Horton, P. (1999). PSORT: A program for detecting sorting signals in proteins and predicting their subcellular
localization. Trends Biochem Sci., 24 (1), 34-35.
Nayal, M., Honig, B. (2006). On the nature of cavities on protein surfaces: application to the identification of drug-binding
sites. Proteins, 63, 892-906.
Needleman, SB., Wunsch, CD. (1970). A general method applicable to the search for similarities in the amino acid
sequences of two proteins. J. Mol. Biol., 48, 443-453.
Nielsen, H., Brunak, S., Von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and
other protein sorting signals. Protein Eng., 12, 3-9.
Nielsen, H., Engelbrecht, J., Brunak, S., Von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal
peptides and prediction of their cleavage sites. Protein Eng., 10, 1-6.
Nielsen, H., Engelbrecht, J., von Heijne, G. and Brunak, S. (1996). Defining a similarity threshold for a functional protein
sequence pattern: The signal peptide cleavage site. Proteins, 24, 165-177.
Novotni, M., Klein, R. (2004). Shape retrieval using 3D Zernike descriptors. Comput Aided Des, 36, 1047-1062.
Nugent, T., Jones, DT. (2010). Predicting Transmembrane Helix Packing Arrangements using Residue Contacts and a
Force-Directed Algorithm. PLoS Comput Biol, 6(3), e1000714.
Ofran, Y., Mysore, V. and Rost, B. (2007). Prediction of DNA-binding residues from sequence. Bioinformatics, 23(13),
Ofran, Y., Rost, B. (2007). Protein-Protein Interaction Hotspots Carved into Sequences. PLoS Comput Biol, 3(7), e119.
Orengo, CA., Taylor, WR. (1996). SSAP: sequential structure alignment program for protein structure comparison.
Methods Enzymol, 266, 617-635.
Ota, M., Kinoshita, K., Nishikawa, K. (2003). Prediction of catalytic residues in enzymes based on known tertiary
structure, stability profile, and sequence conservation. J Mol Biol, 327 (5), 1053-1064.
Panwar, B., Raghava, G. P. (2010). Prediction and classification of aminoacyl tRNA synthetases using PROSITE domains.
BMC Genomics, 11, 507.
Park, K. J. and Kanehisa, M. (2003). Prediction of protein subcellular locations by support vector machines using
compositions of amino acids and amino acid pairs. Bioinformatics, 19(13), 1656-1663.
Passerini, A., Punta, M., Ceroni, A., Rost, B. and Frasconi, P. (2006). Identifying cysteines and histidines in transition-
metal-binding sites using support vector machines and neural networks. Proteins, 65, 305-316.
Pearson, W. R. (1996). Effective protein sequence comparison. Methods Enzymol., 266, 227-258.
Pearson, W. R., Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci USA, 85,
Petrova, N. V., Wu, C. H. (2006). Prediction of catalytic residues using Support Vector Machine with selected protein
sequence and structural properties. BMC Bioinformatics, 7, 312.
Pierri, C.L., Parisi, G., Porcelli, V. (2010). Computational approaches for protein function prediction: A combined strategy
from multiple sequence alignment to molecular docking-based virtual screening. Biochimica et Biophysica Acta -
Proteins and Proteomics, 1804 (9), 1695-1712.
Pugalenthi, G., Kumar, K. K., Suganthan, P. N., Gangal, R. (2008). Identification of catalytic residues from protein
structure using support vector machine with sequence and structural features. Biochem Biophys Res
Commun, 367 (3), 630-634.
Pundhir, S. & Kumar, A. (2011). SSPred: A prediction server based on SVM for the identification and classification of
proteins involved in bacterial secretion systems. Bioinformation, 6(10), 380-382.
Qi, Yanjun. (2012). Random Forest for Bioinformatics. In: Ensemble Machine Learning: Methods and Applications. Eds.
Zhang, C. and Ma, Y. Springer-Verlag New York Inc, 307.
Qiu, J. D., Huang, J. H., Shi, S. P., Liang, R. P. (2010). Using the Concept of Chou's Pseudo Amino Acid Composition to
Predict Enzyme Family Classes: An Approach with Support Vector Machine Based on Discrete Wavelet Transform.
Protein Pept Lett., 17(6), 715-22.
Quinlan, J. (1993). C4.5: programs for machine learning, Morgan Kaufmann, San Mateo.
Ramana, J., Gupta, D. (2010). FaaPred: a SVM-based prediction method for fungal adhesins and adhesin-like proteins.
PLoS One., 5, e9695.
Rapoport, T. A. (1992). Transport of proteins across the endoplasmic reticulum membrane. Science, 258, 931-936.
Rappuoli, R. (2001). Reverse vaccinology, a genome-based approach to vaccine development. Vaccine, 19, 2688-2691.
Rashid, M., Saha, S., Raghava, G. P. S. (2007). Support Vector Machine-based method for predicting subcellular
localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics, 8(1), 337.
Roetzschke, O., Falk, K., Stefanovic, S. et al. (1991). Exact prediction of a natural T cell epitope. Eur J Immunol, 21,
Rost, B., Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary
structure. Proteins, 19, 55-72.
Rost, B. (2002). Enzyme function less conserved than anticipated. J Mol Biol, 318, 595-608.
Rubinstein, N. D., Mayrose, I., Pupko, T. (2009a). A machine-learning approach for predicting B-cell epitopes. Mol
Immunol, 46 (5), 840-847.
Rubinstein, N. D., Mayrose, I., Martz, E., Pupko, T. (2009b). Epitopia: a web-server for predicting B-cell epitopes. BMC
Bioinformatics, 10, 287.
Röttig, M., Rausch, C., Kohlbacher, O. (2010). Combining Structure and Sequence Information Allows Automated
Prediction of Substrate Specificities within Enzyme Families. PLoS Comput Biol, 6(1), e1000636.
Saeys, Y., Inza, I., Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23
(19), 2507-2517.
Saghatelian, A. & Cravatt, B. F. (2005). Assignment of the protein function in the post genomic era. Nat Chem Biol, 1,
Saha, S., Raghava, G. (2006). Prediction of continuous B-cell epitopes in an antigen using recurrent neural network.
Proteins, 65, 40-48.
Sarda, D., Chua, G. H., Li, K. B., Krishnan, A. (2005). pSLIP: SVM based protein subcellular localization prediction using
multiple physicochemical properties. BMC Bioinformatics, 6, 152.
Schneider, G., Wrede, P. (1993). Development of artificial neural filters for pattern recognition in protein sequences. J Mol
Evol., 36(6), 586-95.
Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y., Jiang, H. (2007). Predicting protein-protein interactions
based only on sequences information. Proc. Natl Acad. Sci. USA., 104, 4337-4341.
Shen, H. B., Chou, K. C. (2007a). Signal-3L: A 3-layer approach for predicting signal peptides. Biochem Biophys Res
Commun, 363 (2), 297-303.
Shen, H. B. & Chou, K.C. (2007b). Nuc-PLoc: A new web-server for predicting protein subnuclear localization by fusing
PseAA composition and PsePSSM. Protein Eng Des Sel, 20, 561-567.
Shen, H. B., Yang, J., Chou, K. C. (2006). Fuzzy KNN for predicting membrane protein types from pseudo-amino acid
composition. J Theor Biol, 240 (1), 9-13.
Shi, J. Y., Zhang S. W., Pan Q., Cheng Y. M. and Xie, J. (2007). Prediction of protein subcellular localization by support
vector machines using multi-scale energy and pseudo amino acid composition. Amino Acids, 33, 69-74.
Šikić, M., Tomić, S., Vlahoviček, K. (2009). Prediction of Protein-Protein Interaction Sites in Sequences and 3D
Structures by Random Forests. PLoS Comput Biol, 5(1), e1000278.
Sim, J., Kim, S.-Y. and Lee, J. (2005a). PPRODO: Prediction of protein domain boundaries using neural networks.
Proteins, 59, 627-632.
Sim, J., Kim, S. Y., Lee, J. (2005b). Prediction of protein solvent accessibility using fuzzy k-nearest neighbor method.
Bioinformatics, 21 (12), 2844-2849.
Sleator, RD. (2012). Prediction of protein functions. Methods Mol Biol, 815, 15-24.
Smith, TF., Waterman, MS. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.
Somarowthu, S., Yang, H., Hildebrand, D. G.C. and Ondrechen, M. J. (2011). High-performance prediction of functional
residues in proteins with machine learning and computed input features. Biopolymers, 95, 390-400.
Song, J., Burrage, K., Yuan, Z., Huber, T. (2006). Prediction of cis/trans isomerization in proteins using PSI-BLAST
profiles and secondary structure information. BMC Bioinformatics, 7, 124.
Stawiski, E. W., Gregoret, L. M., Mandel-Gutfreund, Y. (2003). Annotating nucleic acid-binding function based on protein
structure. J Mol Biol, 326 (4), 1065-1079.
Sturrock, S. S. and Collins, J. F. (1993). MPsrch version 1.3. Biocomputing Research Unit, University of Edinburgh,
Edinburgh, UK.
Sweredoski, MJ., Baldi, P. (2008). PEPITO: improved discontinuous B-cell epitope prediction using multiple distance
thresholds and half sphere exposure. Bioinformatics, 24, 1459-1460.
Sweredoski, M., Baldi, P. (2009). COBEpro: a novel system for predicting continuous B-cell epitopes. Protein Eng Des
Sel., 22(3), 113-120.
Tamura, T., Akutsu, T. (2007). Subcellular location prediction of proteins using support vector machines with alignment of
block sequences utilizing amino acid composition. BMC Bioinformatics, 8, 466.
Tan, P.N., Steinbach, M., Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley.
Tang, YR., Sheng, ZY., Chen, YZ., Zhang, Z. (2008). An improved prediction of catalytic residues in enzyme structures.
Protein Eng Des Sel, 21, 295-302.
Tarca, A. L., Carey, V. J., Chen, X-w, Romero R., Drăghici S. (2007). Machine Learning and Its Applications to Biology.
PLoS Comput Biol, 3(6), e116.
Teng, S., Srivastava, A. K., Wang, L. (2010). Sequence feature-based prediction of protein stability changes upon amino
acid substitutions. BMC Genomics, (Suppl 2), S5.
Terribilini, M., Lee, J. H., Yan, C., Jernigan, R. L., Honavar, V., Dobbs, D. (2006). Prediction of RNA binding sites in
proteins from amino acid sequence. RNA, 12, 1450-1462.
Terribilini, M., Sander, J. D., Lee, J. H., Zaback, P., Jernigan, R. L., et al. (2007). RNABindR: a server for analyzing and
predicting RNA-binding sites in proteins. Nucleic Acids Res, 35, W578-W584.
Terwilliger, TC., Stuart, D., Yokoyama, S. (2009). Lessons from structural genomics. Annu Rev Biophys., 38, 371-383.
Tjong, H., Zhou, H-X. (2007). DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces.
Nucleic Acids Res, 35, 1465-1477.
Tong, W., Williams, R. J., Wei, Y., Murga, L. F., Ko, J., Ondrechen, M. J. (2008). Enhanced performance in prediction of
protein active sites with THEMATICS and support vector machines. Protein Sci. 17, 333-341.
Torrance, J.W., Bartlett, G.J., Porter, C.T., Thornton, J.M. (2005). Using a library of structural templates to recognise
catalytic sites and explore their evolution in homologous families. J Mol Biol, 347, 565-581.
Vapnik, V. (1995). The Nature of Statistical Learning Theory, Springer, New York.
Verma, R., Tiwari, A., Kaur, S., Varshney, G. C., Raghava, G. P. S. (2008). Identification of Proteins Secreted by Malaria
Parasite into Erythrocyte using SVM and PSSM profiles. BMC Bioinformatics, 9, 201.
Verma, R., Varshney, G., Raghava, G. (2010). Prediction of mitochondrial proteins of malaria parasite using split amino
acid composition and PSSM profile. Amino Acids, 39, 101-110.
von Heijne, G. (1990). The signal peptide. J. Membrane Biol., 115, 195-201.
Walter, G. (1986). Production and use of antibodies against synthetic peptides. J. Immunol. Methods, 88, 149-61.
Wang, D. and Larder, B. (2003). Enhanced Prediction of Lopinavir Resistance from Genotype by Use of Artificial Neural
Networks. J. Infect. Dis., 188, 5, 653-660.
Wang, J., Sung, W. K., Krishnan, A., Li, K. B. (2005). Protein subcellular localization prediction for Gram-negative
bacteria using amino acid subalphabets and a combination of multiple support vector machines. BMC
Bioinformatics, 6, 174.
Wang, L., Brown, S. J. (2006a). Prediction Of DNA-Binding Residues from Sequence Features. J Bioinform Comput Biol,
4, 6, 1141-1158.
Wang, L. J., Brown, S. J. (2006b). BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in
amino acid sequences. Nucleic Acids Res., 34, W243-W248.
Wang, M., Yang, J., Liu, G. P., Xu, Z. J., Chou, K. C. (2004). Weighted-support vector machines for predicting membrane
protein types based on pseudo-amino acid composition. Protein Eng Des Sel, 17 (6), 509-516.
Wang, T., Yang J. (2010). Predicting subcellular localization of gram-negative bacterial proteins by linear dimensionality
reduction method. Protein Pept Lett., 17(1), 32-7.
Wang, Y. C., Wang, X. B., Yang, Z. X., Deng, N. Y. (2010). Prediction of enzyme subfamily class via pseudo amino acid
composition by incorporating the conjoint triad feature. Protein Pept. Lett., 17, 1441-1449.
Wang, Y., Xue, Z., Shen, G., and Xu, J. (2008). PRINTR: Prediction of RNA binding sites in proteins using SVM and
profiles. Amino Acids, 35, 2, 295-302.
Wass, MN., Kelley, LA., Sternberg, MJE. (2010). 3DLigandSite: predicting ligand-binding sites using similar structures.
Nucleic Acids Res, 38, W469-W473.
Whisstock, JC., Lesk, AM. (2003). Prediction of protein function from protein sequence and structure. Q Rev Biophys, 36,
Wong, S. L., Zhang, L. V., Tong, A. H., Li, Z., et al. (2004). Combining biological networks to predict genetic
interactions. Proc Natl Acad Sci USA, 101, 15682-15687.
Wu, J., Liu, H., Duan, X., Ding, Y., Wu, H., Bai, Y. and Sun, X. (2009). Prediction of DNA-binding residues in proteins
from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics, 25, 1, 30-35.
Xia, JF., Zhao, XM., Song, J., Huang, DS. (2010). APIS: accurate prediction of hot spots in protein interfaces by
combining protrusion index with solvent accessibility. BMC Bioinformatics, 11, 174.
Xie, D., Li, A., Wang, M., Fan, Z., Feng, H. (2005). LOCSVMPSI: a web server for subcellular localization of eukaryotic
proteins using SVM and profile of PSI-BLAST. Nucleic Acids Res, 33, W105-W110.
Xu, Y., Uberbacher, E. (1997). Computational Gene Identification Using Neural Networks and Similarity Search. Machine
Learning and Sequence Pattern Analysis, In: Computational Biology, Eds. Steven Salzberg, David Searls, Simon
Kasif, Elsevier Publishing Company.
Xu, X., Yu, D., Fang, W., Cheng, Y., Qian, Z., Lu, W., Cai, Y., Feng, K. (2008). Prediction of peptidase category based
on functional domain composition. J Proteome Res, 7(10), 4521-4524.
Yang, Z.R. (2004). Biological applications of support vector machines. Brief. Bioinform., 5 (4), 328-338.
Yang, P., Hwa Yang, Y., Zhou, B., Zomaya, Y. (2010). A review of ensemble methods in bioinformatics. Curr Bioinform,
5, 296-308.
Yoon, BJ. (2009). Hidden Markov Models and their Applications in Biological Sequence Analysis. Curr Genomics, 10,
Youn, E., Peters, B., Radivojac, P., Mooney, S. (2007). Evaluation of features for catalytic residue prediction in novel
folds. Prot Sci, 16, 216-226.
Yu, C. S., Chen, Y. C., Lu, C. H. and Hwang, J. K. (2006a). Prediction of protein subcellular localization. Proteins,
64, 643-651.
Yu, X., Wang, C., Li, Y. (2006b). Classification of protein quaternary structure by functional domain composition. BMC
Bioinformatics, 7, 187.
Yu, C. S., Lin, C. J., Hwang, J. K. (2004). Predicting subcellular localization of proteins for gram-negative bacteria by
support vector machines based on n-peptide compositions. Protein Sci., 13, 1402-1406.
Yu, C. Y., Chou, L. C., Chang, D. T. (2010). Predicting protein-protein interactions in unbalanced data using the primary
structure of proteins. BMC Bioinformatics, 11, 167.
Yu, K., Petrovsky, N., Schonbach, C., et al. (2002). Methods for prediction of peptide binding to MHC molecules: a
comparative study. Mol Med, 8, 137-48.
Zhang, T. L., Ding, Y. S., Chou, K. C. (2008). Prediction protein structural classes with pseudo-amino acid composition:
Approximate entropy and hydrophobicity pattern. J Theor Biol, 250 (1), 186-193.
Zhang, G. L., Ansari, H.R., Bradley, P., Cawley, G.C., Hertz, T., Hu, X., Jojic, N., Brusic, V. (2011a). Machine learning
competition in immunology - Prediction of HLA class I binding peptides. J Immunol Methods, 374 (1-2), 1-4.
Zhang, W., Xiong, Y., Zhao, M., Zou, H., Ye, X., et al. (2011b). Prediction of conformational B-cell epitopes from 3D
structures by random forests with a distance-based feature. BMC Bioinformatics, 12, 341.
Zhao, Y., Pinilla, C., Valmori, D., et al. (2003). Application of support vector machines for T-cell epitopes prediction.
Bioinformatics, 19, 1978-84.
Zhao, XM., Chen, L., Aihara K. (2008). Protein function prediction with high-throughput data. Amino Acids, 35 (3), 517
Zhou, X. B., Chen, C., Li, Z. C., Zou, X. Y. (2007). Using Chou's amphiphilic pseudo-amino acid composition and support
vector machine for prediction of enzyme subfamily classes. J Theor Biol, 248 (3), 546-551.