FEPS: Feature Extraction from Protein Sequences webserver
Hamid D. Ismail, Mary Smith and Dukka B KC
Abstract
Protein sequence-driven features are numeric vectors extracted from the amino acid residues of protein sequences; they
capture information that can be used for knowledge discovery in both supervised and unsupervised machine learning.
Extracting features from protein sequences is a challenge for many researchers who need features to develop a learning
model or for statistical purposes but do not want to deal with the mathematical and programming details. We developed
FEPS, a web application for protein feature extraction that computes the most common sequence-driven features of
proteins from one or more fasta-formatted files, each containing multiple protein sequences, and outputs user-friendly,
ready-to-use feature files. The application uses 48 published feature extraction methods, of which 6 can use any one of
the 544 physicochemical properties and 4 can accept user-defined amino acid indices. The total number of features
calculated by FEPS is 2765, which is far more than the number of features that can be computed by any other peer
application. A simple tutorial and guidelines are provided to walk the user through the different steps.
FEPS is available online at http://bcb.ncat.edu/Features/.
Index Terms: protein feature extraction, protein descriptors, machine learning.
I. INTRODUCTION
Protein sequence features are numeric values derived
from the amino acid residues of protein sequences and
are intended to be informative, facilitating subsequent
knowledge discovery from the data with machine learning
tools [1]. Feature extraction is a transformation process
that derives numeric values from textual sequences.
Numeric values are the natural data type for measurement
and are readily processed and analyzed by computers.
The 20 amino acids, which are the building blocks of
proteins and are encoded directly by the genetic code, are
represented by single-letter symbols in bioinformatics [2].
Yet these textual symbols acquire biological meaning only
through the properties of the amino acids they stand for.
The physicochemical properties of amino acids have been
the subject of intense laboratory research spanning decades.
The number of reported amino acid properties is 544 as
of today [3]. Each property was found to be an important
factor in one or more biological aspects of proteins such as
structure and function [4]. Empirical laboratory
observations yielded a numeric measurement for each amino
acid for each property [3]. For example, hydrophobicity is
one of the physicochemical properties; each amino acid
has a numeric value, called an index, that represents the
measurement of that amino acid for that particular
property. The indices of the amino acids for the various
properties were deposited and made available online in the
AAindex database [3].
Most of the published protein sequence feature
extraction methods use one or more of these properties;
they are usually deemed biologically meaningful and are
widely accepted by biologists [4, 5]. Other protein feature
extraction methods rely on statistics, on the view that
information can be captured from the frequencies of
certain arrangements of amino acids [6]. Still other
features rely on the spatial correlation between residues in
the same protein sequence. All of these kinds of features
have proven successful in research and can be used
individually or in combination [7, 8].
In feature extraction, the optimal representation of a
protein sequence is the amino acid sequence itself, as it
contains all the information [9, 10]. Such a representation
is known as the sequential model because it relies on the
order of the amino acid residues in the sequence. Despite
the optimal information obtained from the fully sequential
model, this representation is impractical in many uses. One
major flaw is that each sequence may yield a different,
incomparable set of columns of values, so the features
fail to capture patterns across nonhomologous or distantly
related protein sequences. Another feature extraction
model, known as the discrete model, was proposed to
overcome the flaws of the sequential one [11]. The discrete
model relies on discrete values rather than on the order of
the amino acid residues in the sequence. A hybrid model,
comprising both sequential and discrete models, was
proposed as well [5].
Numeric features are required as input for machine
learning algorithms in any study that aims to build a
computational model from protein sequences [12, 13].
With the accumulation of protein sequences in databases,
researchers who want to build a quick model for a
preliminary study, or to discover a general trend in the
data, seldom find bioinformatics tools that let them obtain
sequence-driven features without programming and
interdisciplinary effort. A few web applications were
developed to achieve this goal, including PROFEAT
[14, 15], BioBayesNet [16], SPiCE [17], and PseAAC [18],
but those web server-based applications either do not
include most of the feature extraction types or are parts of
a larger service, which makes it difficult to obtain a feature
file for use with other tools of the user's choice. FEPS, on
the other hand, emerged from a real need for a protein
sequence-driven feature extraction tool, and it was
developed with researchers in the biological and
bioinformatics fields in mind. It includes the most
commonly used published feature extraction methods
and makes all reported physicochemical properties of
amino acids available to users to customize their
features. The feature output file formats include
comma-separated values (CSV), the attribute-relation file
format (ARFF) for the Weka software, the svm-light format
for SVM-light, which is commonly used SVM software, and
tab-delimited text. Tutorials and step-by-step guidelines are
provided to make the entire process easy. The features can
then be used to develop a sequence-based model for tasks
such as protein classification [19], protein
post-translational modification site prediction [20],
protein structure prediction [21], protein function
prediction [22], and protein localization prediction [23],
or for statistical purposes.
II. MATERIAL AND METHODS
FEPS input files
The FEPS server can be accessed at
http://bcb.ncat.edu/Features/. In a typical case a user may
need protein sequence-driven features as input for
supervised or unsupervised learning or for another kind of
analysis. If the features are needed for unsupervised
learning or statistical analysis, in which the class label is
ignored, then only one plain-text fasta-formatted file
containing all protein sequences is required. If the features
are needed as input for supervised learning, where the
sequences are pre-classified, then each sequence group can
be stored in a separate fasta-formatted file, and each file is
given the group name without spaces. For example, if the
sequences belong to three known groups, we can store the
sequences of each group in one file and name the files
group1.fasta, group2.fasta, and group3.fasta,
respectively. When the form is submitted, the input files
are validated to ensure that they contain valid protein
sequences in valid fasta format.
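The input check described above can be sketched as follows. This is a minimal illustration and not the FEPS server code: the function names `parse_fasta` and `validate_records` are hypothetical, and the rule that only the 20 standard one-letter residue symbols are accepted is an assumption, since the exact validation rules of the server are not specified here.

```python
# Minimal sketch of FASTA parsing and validation (assumed rules, not FEPS code).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard residues only


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def validate_records(records):
    """Raise ValueError for empty input, empty sequences, or unknown symbols."""
    if not records:
        raise ValueError("no FASTA records found")
    for header, seq in records:
        if not seq:
            raise ValueError(f"empty sequence for record {header!r}")
        bad = set(seq) - VALID_RESIDUES
        if bad:
            raise ValueError(f"invalid residue(s) {sorted(bad)} in record {header!r}")
    return True
```

A file would be read with `open(path).read()` and passed to `parse_fasta`; a stricter or looser alphabet (for example one that allows ambiguity codes) only requires changing `VALID_RESIDUES`.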
Feature types
FEPS can compute numerous feature types; choosing the
right ones is therefore almost an art, but the choice can be
guided by a well-founded assumption that the chosen
method explains a relevant biological aspect of the
subject under study. Selecting the right feature type
may require some tweaking and fine adjustment.
The number of elements (n) in the feature vector
varies with the feature type. Generally, the feature vector
(V) of the protein sequence with order index i can be
represented as:
$$V_i = (f_1, f_2, \ldots, f_j, \ldots, f_n)$$
where $f_j$ is the value of feature element j. A feature
element may also be called an attribute, variable, or column;
these terms are used interchangeably.
Typically, features are extracted from more than
one sequence. If the number of sequences is denoted by N
and i = 1 to N is the order index of a sequence, the extracted
features of a set of protein sequences can be
represented by a matrix as follows:
$$\begin{pmatrix}
f_{1,1} & f_{1,2} & \cdots & f_{1,n} \\
f_{2,1} & f_{2,2} & \cdots & f_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
f_{N,1} & f_{N,2} & \cdots & f_{N,n}
\end{pmatrix}$$
where $f_{i,j}$ is the value of feature element j for sequence i.
In other words, this matrix is a table with N rows and n
columns. While N is determined by the number of
sequences, n depends on the feature extraction method.
Generally, the feature types can be divided into seven
groups: amino acid composition; autocorrelation;
composition, transition and distribution; quasi-sequence
order; pseudo-amino acid composition [14, 24]; Shannon
entropy descriptors; and others such as conjoint triad. The
computational cost of feature extraction varies with
the feature extraction method, the average length of the
sequences, and the number of sequences. Autocorrelation
and pseudo-amino acid composition tend to take more
computational time than the other methods.
In the following paragraphs we review some of
the feature extraction methods that can be computed
with FEPS (Table 2).
1- Amino acid composition (AAC)
AAC relies on the frequency of an amino acid, or of a
set of amino acids, in the sequence, which may capture
information that helps predict the subject of interest. AAC
is a family of discrete features that do not depend on the
order of the residues; it includes uni-amino acid,
dipeptide, and tripeptide composition, or any other
arrangement of amino acids. AAC has been shown to be
successful in predictions related to homologous groups of
sequences [25-27]. It may therefore be useful where
homology is involved, such as in protein sequence
classification and in protein function and structure
prediction. Dipeptide and tripeptide composition may
capture more local information than uni-amino acid
composition.
i- Uni-amino Acid Composition (UAAC)
Uni-amino acid composition is defined as the
frequency of each of the 20 amino acids in the protein
sequence. It is the simplest and most frequently used form
of amino acid composition. The number of elements in the
feature vector is 20. Each element represents the relative
frequency of an amino acid. The general formula of the
UAAC of each protein sequence is as follows [19, 24]:
$$\mathrm{UAAC}(j) = \frac{N_j}{N}, \quad j = 1, 2, \ldots, 20$$
where amino acid j represents any one of the twenty
amino acids, $N_j$ is the number of occurrences of amino
acid j, and N is the length of the sequence.
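The relative-frequency computation can be sketched as follows. This is an illustrative sketch, not the FEPS implementation: the function name `uaac` and the alphabetical ordering of the 20 one-letter codes are assumptions, and the values are returned as fractions (multiply by 100 if percentages are preferred).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the 20 features


def uaac(sequence):
    """Return the 20 relative amino acid frequencies of a protein sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)
    # One feature element per amino acid: occurrences divided by sequence length.
    return [counts[aa] / length for aa in AMINO_ACIDS]
```

Because the output always has 20 elements regardless of sequence length, vectors from different proteins can be stacked directly into one feature table.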
ii- Dipeptide Composition (DPC)
With DPC, a feature value represents the frequency of
a dipeptide, that is, the number of occurrences of a given
pair of amino acids in two adjacent positions in a protein
sequence. For example, in the sequence MALMACC the
frequencies of the dipeptides MA, AL, LM, AC, and CC are
2, 1, 1, 1, and 1, respectively. The number of possible
dipeptides is 400, which is the number of feature elements.
The DPC features are normalized by dividing the
frequencies by (N-1), where N is the length of the sequence,
and multiplying by 100 [24]:
$$\mathrm{DPC}(j) = \frac{N_j}{N-1} \times 100, \quad j = 1, 2, \ldots, 400$$
where $N_j$ is the number of occurrences of dipeptide j.
Dipeptide composition adds new meaning to the amino acid
composition, as the frequency of any two contiguous amino
acids may capture some local-order information [28].
Dipeptide composition is therefore suitable for cases
where localized information is required, such as
homology information [29].
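The MALMACC example and the (N-1) normalization can be reproduced with a short sketch. The function names and the alphabetical ordering of the 400 dipeptides are assumptions made for illustration; only the counting rule and the normalization come from the text.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in an assumed alphabetical column order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_counts(sequence):
    """Count every overlapping pair of adjacent residues."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs with non-standard symbols are skipped
            counts[pair] += 1
    return counts


def dpc(sequence):
    """Return the 400 dipeptide frequencies, normalized by (N-1) and scaled by 100."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dipeptide_counts(sequence)
    denom = len(sequence) - 1
    return [100.0 * counts[d] / denom for d in DIPEPTIDES]
```

For MALMACC, `dipeptide_counts` gives MA = 2 and AL = LM = AC = CC = 1, and the 400 normalized values sum to 100 because the sequence contains exactly N-1 = 6 overlapping pairs.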
iii- Tripeptide Composition (TPC)
Tripeptide composition, also known as the 3-mer
spectrum [24], extends the notion of the frequency of
adjacent amino acids used in dipeptide composition,
adding more local-order information because the frequency
of each possible run of three amino acids is considered.
The number of feature elements in this case is 8000, which
is substantially large. This is considered a drawback, as
most machine learning algorithms do not favor
high-dimensional data because of the ill-conditioning
problems that may arise during calculation and lead to poor
convergence or no convergence at all. Despite this flaw,
TPC has proven suitable with some algorithms that
implement feature selection to choose only the feature
columns that have predictive capability and ignore the
others. The tripeptide composition can be calculated as:
$$\mathrm{TPC}(j) = \frac{N_j}{N-2} \times 100, \quad j = 1, 2, \ldots, 8000$$
where $N_j$ is the number of occurrences of tripeptide j and
N is the length of the sequence.
As with dipeptide and tripeptide composition,
one could continue to consider longer runs of contiguous
amino acids, such as tetrapeptide and pentapeptide
composition. However, the number of features grows
rapidly: for tetrapeptide composition the number of
features becomes 20^4 = 160,000, which is
extremely high.
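The growth of the feature space with peptide length can be checked with a small sketch. The sparse dictionary output and the normalization by the number of overlapping k-mers, scaled by 100 by analogy with the dipeptide case, are assumptions for illustration rather than the FEPS output layout.

```python
from collections import Counter


def kmer_space_size(k, alphabet_size=20):
    """Number of possible k-mers, i.e. the dense feature vector length."""
    return alphabet_size ** k


def kmer_composition(sequence, k):
    """Return {k-mer: percent frequency} for the k-mers present in the sequence."""
    total = len(sequence) - k + 1  # number of overlapping k-mers
    if total < 1:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {kmer: 100.0 * count / total for kmer, count in counts.items()}
```

`kmer_space_size(3)` is 8000 and `kmer_space_size(4)` is 160,000, while a sequence of length N can contain at most N-k+1 distinct k-mers, which is why a sparse representation or feature selection is usually needed for k of 3 or more.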
iv- Composite Amino acid composition
The above amino acid compositions are combined
to form a composite amino acid composition [24].
The feature vector is represented as follows:
$$V = (V_{\mathrm{UAAC}}, V_{\mathrm{DPC}}, V_{\mathrm{TPC}})$$
where $V_{\mathrm{UAAC}}$ is the UAAC, $V_{\mathrm{DPC}}$ is the DPC, and
$V_{\mathrm{TPC}}$ is the TPC.
Note that amino acid composition does not use any of
the physicochemical properties of the amino acids; it
depends solely on the discrete values of the frequencies of
the amino acids.
2- Autocorrelation
As the name implies, autocorrelation is the
correlation among values of a single variable, in contrast
to conventional correlation, which measures the
relationship between two variables. Autocorrelation is part
of spatial statistics and was first used to detect
non-randomness in time series data [30], geographic spatial
dependency, or the co-variation of properties within
geographic space. Its application as a protein descriptor
arises from the fact that a protein sequence can be
conceived as a space, and the values of an amino acid
property at two different positions in the sequence can be
assessed as positively or negatively correlated. With this
classic notion of correlation, autocorrelation can be
expressed mathematically in terms of the Pearson
correlation coefficient formula [31].
Given the property values P_1, P_2, P_3, ..., P_n for the
sequence aa_1, aa_2, aa_3, ..., aa_n, where aa is an amino
acid residue, the lag-k autocorrelation function can be
defined as:
$$r_k = \frac{\sum_{i=1}^{n-k}(P_i - \bar{P})(P_{i+k} - \bar{P})}{\sum_{i=1}^{n}(P_i - \bar{P})^2}$$
where $\bar{P}$ is the mean of the property indices.
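A lag-k autocorrelation of this kind can be sketched in a few lines. The sketch uses the standard sample autocorrelation (covariance of values k positions apart divided by the total sum of squared deviations); the function name and the error handling are illustrative assumptions.

```python
def lag_autocorrelation(values, k):
    """Sample lag-k autocorrelation of a list of per-residue property values."""
    n = len(values)
    if not 0 < k < n:
        raise ValueError("lag k must satisfy 0 < k < len(values)")
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("all values are identical; autocorrelation is undefined")
    # Products of deviations for every pair of positions that are k apart.
    num = sum((values[i] - mean) * (values[i + k] - mean) for i in range(n - k))
    return num / denom
```

A steadily increasing profile such as 1, 2, 3, 4, 5 gives a positive value at lag 1, while a strictly alternating profile gives a negative one.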
A number of autocorrelation descriptors have been
proposed for protein sequences [24, 32]. As autocorrelation
measures, all of them seek a correlation between two values
in the same sequence. The values are those of a
physicochemical property of the amino acids, which can be
selected from a dropdown menu in FEPS. In a typical case,
each amino acid residue is substituted by its value for the
chosen physicochemical property. Usually, different amino
acids have different values. The same physicochemical
property is used for all residues of the sequence at a time.
Before the autocorrelation is calculated, the values of the
property P for each amino acid must be standardized using
the mean and standard deviation of the property values
[14, 33]:
$$P'_i = \frac{P_i - \bar{P}}{\sigma}$$
where $P_i$ is the property value of the amino acid i = 1 to 20,
the mean is $\bar{P} = \frac{1}{20}\sum_{i=1}^{20} P_i$, and the
standard deviation is
$\sigma = \sqrt{\frac{1}{20}\sum_{i=1}^{20}(P_i - \bar{P})^2}$.
The most common autocorrelation protein descriptors
include Geary autocorrelation, Moran autocorrelation, and
Moreau-Broto autocorrelation.
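The standardization step that precedes all three descriptors can be sketched as follows. The Kyte-Doolittle hydropathy scale is used here only as an example property; the function names are hypothetical, and dividing by 20 (population standard deviation) rather than 19 is an assumption of this sketch.

```python
import math

# Kyte-Doolittle hydropathy values, used as an example amino acid index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(index):
    """Standardize an amino acid index to zero mean and unit standard deviation."""
    values = list(index.values())
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        raise ValueError("index has zero variance")
    return {aa: (v - mean) / sd for aa, v in index.items()}


def encode(sequence, index):
    """Replace each residue of a sequence by its (standardized) property value."""
    return [index[aa] for aa in sequence]
```

A sequence is then turned into a numeric profile with `encode(sequence, standardize(KYTE_DOOLITTLE))`, and that profile is the input to the autocorrelation descriptors.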
i- Moran autocorrelation
Moran autocorrelation was the first measure of spatial
autocorrelation developed to study stochastic phenomena
distributed in space [34]. Like the Pearson correlation
coefficient, its values range from +1 (strong positive
autocorrelation) through 0 (a random pattern) to -1 (strong
negative autocorrelation).
$$I(k) = \frac{\frac{1}{n-k}\sum_{i=1}^{n-k}(P_i - \bar{P})(P_{i+k} - \bar{P})}{\frac{1}{n}\sum_{i=1}^{n}(P_i - \bar{P})^2}$$
where k is the lag, n is the sequence length, $P_i$ is the
standardized amino acid property of the protein sequence
residue i, $\bar{P}$ is the mean of the standardized
properties over all residues of the sequence, and $P_{i+k}$
is the standardized residue property at position i+k
[14, 24, 33].
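A minimal sketch of the Moran descriptor follows, using the form commonly published for protein descriptors [14]: the average product of deviations at lag k divided by the variance of the whole profile. The function name is an assumption, and the input is a list of per-residue standardized property values.

```python
def moran(values, k):
    """Moran autocorrelation at lag k for a per-residue property profile."""
    n = len(values)
    if not 0 < k < n:
        raise ValueError("lag k must satisfy 0 < k < len(values)")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        raise ValueError("profile has zero variance")
    # Average product of deviations over the n-k residue pairs that are k apart.
    covariance = sum(
        (values[i] - mean) * (values[i + k] - mean) for i in range(n - k)
    ) / (n - k)
    return covariance / variance
```

Computing `moran(profile, k)` for k = 1 up to a chosen maximum lag yields one feature element per lag.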
ii- Geary autocorrelation
Geary autocorrelation [35] ranges between 0 and 2.
There is no autocorrelation if the coefficient is 1, positive
autocorrelation for values between 0 and 1, and
negative autocorrelation for values between 1 and 2.
However, values greater than 2 are sometimes found [36].
$$C(k) = \frac{\frac{1}{2(n-k)}\sum_{i=1}^{n-k}(P_i - P_{i+k})^2}{\frac{1}{n-1}\sum_{i=1}^{n}(P_i - \bar{P})^2}$$
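The interpretation of the Geary coefficient can be checked numerically with a short sketch. It uses the form commonly published for protein descriptors [14], which compares squared differences between residues k apart with the sample variance of the profile; the function name is an assumption.

```python
def geary(values, k):
    """Geary autocorrelation at lag k for a per-residue property profile."""
    n = len(values)
    if not 0 < k < n:
        raise ValueError("lag k must satisfy 0 < k < len(values)")
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values) / (n - 1)
    if denom == 0:
        raise ValueError("profile has zero variance")
    # Half the mean squared difference between residues that are k apart.
    num = sum((values[i] - values[i + k]) ** 2 for i in range(n - k)) / (2 * (n - k))
    return num / denom
```

A smoothly increasing profile such as 1, 2, 3, 4, 5 gives a value below 1 at lag 1 (positive autocorrelation), whereas a strictly alternating profile gives a value above 1 (negative autocorrelation).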




 
iii- Moreau-Broto autocorrelation descriptors [37]
$$AC(k) = \sum_{i=1}^{n-k} P_i \, P_{i+k}$$
where n, k, $P_i$, and $P_{i+k}$ are as above.
The normalized version of Moreau-Broto autocorrelation
is the above equation divided by (n-k) [14, 24, 33]:
$$ATS(k) = \frac{AC(k)}{n-k}$$
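Both the raw and the normalized Moreau-Broto descriptor can be sketched as follows. The raw form is the sum of products of property values k positions apart, as published for protein descriptors [14]; the function names are illustrative assumptions.

```python
def moreau_broto(values, k):
    """Raw Moreau-Broto autocorrelation: sum of products of values k apart."""
    n = len(values)
    if not 0 < k < n:
        raise ValueError("lag k must satisfy 0 < k < len(values)")
    return sum(values[i] * values[i + k] for i in range(n - k))


def normalized_moreau_broto(values, k):
    """Raw Moreau-Broto autocorrelation divided by the number of pairs (n-k)."""
    return moreau_broto(values, k) / (len(values) - k)
```

The normalization makes values comparable across sequences of different lengths, since the raw sum grows with the number of residue pairs.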
 



3- Composition, Transition and distribution
The composition, transition and distribution (CTD)
[38] features are based on seven physicochemical
properties: hydrophobicity, normalized van der Waals
volume, polarity, polarizability, charge, secondary
structure, and solvent accessibility. For each
physicochemical property, the amino acids are split into
three groups. Amino acids in the same group are considered
to share the same property (Table 1).
i- Composition
Composition is defined as the number of amino acid
residues with a particular property divided by the total
number of amino acids in a protein sequence. There are 21
feature elements (3 for each of the seven
physicochemical properties).
ii- Transition
Transition is defined as the percent frequency with
which amino acid residues with a particular property are
followed by residues with a different property. The number
of feature elements is 21 (3 for each of the seven
physicochemical properties).
iii- Distribution
Distribution is defined as the chain length within
which the first, 25%, 50%, 75%, and 100% of the amino
acids with a particular property are located, respectively.
The number of feature elements for the distribution is 105
(15 for each of the seven physicochemical properties).
TABLE 1
AMINO ACID GROUPING FOR CTD
FEPS computes different flavors of composition,
transition and distribution: all three together for the seven
physicochemical properties (147 feature elements);
composition only for the seven properties (21 elements);
transition only for the seven properties (21 elements);
distribution only for the seven properties (105 elements);
and each of the three for each individual property (3
elements per property for composition and for transition,
and 15 elements per property for distribution).
CTD features have been shown to be successful in the
prediction of the subcellular localization of proteins [38].
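The three CTD descriptors can be sketched for a single property. The sketch uses the hydrophobicity grouping of Table 1 (polar, neutral, hydrophobic). The function names, the use of fractions instead of percentages, the counting of transitions in both directions for each pair of groups, and the rank rule used for the 25%, 50%, and 75% positions are all assumptions of this illustration.

```python
# Hydrophobicity groups from Table 1: 1 = polar, 2 = neutral, 3 = hydrophobic.
HYDROPHOBICITY_GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def encode_groups(sequence, groups):
    """Translate a sequence into a string of group labels."""
    lookup = {aa: label for label, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence)


def composition(encoded, labels="123"):
    """Fraction of residues falling in each group (3 elements)."""
    return [encoded.count(label) / len(encoded) for label in labels]


def transition(encoded):
    """Fraction of adjacent pairs that switch between two groups (3 elements)."""
    n_pairs = len(encoded) - 1
    result = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        switches = sum(
            1 for i in range(n_pairs) if {encoded[i], encoded[i + 1]} == {a, b}
        )
        result.append(switches / n_pairs)
    return result


def distribution(encoded, labels="123"):
    """Relative positions of the first, 25%, 50%, 75%, 100% residues per group."""
    result = []
    for label in labels:
        positions = [i + 1 for i, c in enumerate(encoded) if c == label]
        if not positions:
            result.extend([0.0] * 5)  # group absent from this sequence
            continue
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            rank = max(1, int(fraction * len(positions)))  # assumed rank rule
            result.append(positions[rank - 1] / len(encoded))
    return result
```

For one property this yields 3 + 3 + 15 = 21 elements; repeating it for the seven properties gives the 147 elements mentioned above.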
4- Sequence-order-coupling numbers (SOCN)
SOCN is an example of a hybrid model that comprises
both the discrete and the sequential model [14, 24, 33]. The
first 20 features are the amino acid composition, while the
features from 21 upward reflect the sequence order using
four physicochemical properties: hydrophobicity,
hydrophilicity, polarity, and side-chain volume. The
features are derived from a distance matrix created by
computing the distance between each pair of the 20 amino
acids, using the Schneider-Wrede physicochemical distance
matrix [39] or the Grantham chemical distance matrix [40].
The first 20 features are given by:
$$X_i = \frac{f_i}{\sum_{r=1}^{20} f_r + w \sum_{l=1}^{m} \tau_l}$$
where i = 1, 2, ..., 20, $f_i$ is the normalized frequency of
amino acid i, w is a weighting factor (the default is 0.1),
and $\tau_d$ is the dth-rank sequence-order-coupling
number, given by
$$\tau_d = \sum_{i=1}^{N-d} (d_{i,i+d})^2, \quad d = 1, 2, \ldots, m$$
where m is the maximum lag value, N is the length of the
sequence, and $d_{i,i+d}$ is the distance between the amino
acids at positions i and i+d.
The features from 21 and above are given by:
$$X_{20+d} = \frac{w \, \tau_d}{\sum_{r=1}^{20} f_r + w \sum_{l=1}^{m} \tau_l}, \quad d = 1, 2, \ldots, m$$
The number of feature elements is determined by the
maximum lag value (m).
SOCN has been used for the prediction of the subcellular
localization of proteins [41].
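The coupling numbers and the resulting 20 + m features can be sketched as follows. The distance function is passed in by the caller, because the Schneider-Wrede and Grantham matrices are not reproduced here; `toy_distance` is a hypothetical stand-in used only to make the example runnable, and the function names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_distance(a, b):
    """Hypothetical stand-in for a real amino acid distance matrix."""
    return 0.0 if a == b else 1.0


def coupling_numbers(sequence, distance, max_lag):
    """Sum of squared distances between residues d apart, for d = 1..max_lag."""
    n = len(sequence)
    if not 0 < max_lag < n:
        raise ValueError("max_lag must satisfy 0 < max_lag < len(sequence)")
    return [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, max_lag + 1)
    ]


def quasi_sequence_order(sequence, distance, max_lag, weight=0.1):
    """Return 20 composition-based features followed by max_lag order features."""
    n = len(sequence)
    freqs = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    taus = coupling_numbers(sequence, distance, max_lag)
    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]
```

With a real distance matrix stored as a nested dictionary `D`, the call becomes `quasi_sequence_order(seq, lambda a, b: D[a][b], max_lag)`; by construction the 20 + m features sum to 1.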
5- Pseudo-amino acid composition (PsAAC)
Pseudo-amino acid composition [11, 18] is
another example of hybrid features that combine both
discrete and sequential features. It is also an improvement
on the sequence-order-coupling numbers. The first 20
features represent the discrete amino acid composition, and
the others represent sequence-order features computed
using three physicochemical properties: hydrophobicity
(H1), hydrophilicity (H2), and side-chain mass (M) of the
amino acids. The three properties H1, H2, and M of the 20
amino acids are standardized as follows:
$$P(i) = \frac{P^0(i) - \frac{1}{20}\sum_{j=1}^{20} P^0(j)}{\sqrt{\frac{1}{20}\sum_{j=1}^{20}\left[P^0(j) - \frac{1}{20}\sum_{l=1}^{20} P^0(l)\right]^2}}, \quad P \in \{H_1, H_2, M\}$$
where $P^0(i)$ is the property and $P(i)$ is the standardized
property of an amino acid.

TABLE 1 (AMINO ACID GROUPING FOR CTD)
| Property | Group 1 | Group 2 | Group 3 |
|---|---|---|---|
| Hydrophobicity | Polar: R,K,E,D,Q,N | Neutral: G,A,S,T,P,H,Y | Hydrophobic: C,L,V,I,M,F,W |
| Normalized van der Waals volume | 0-2.78: G,A,S,T,P,D | 2.95-4.0: N,V,E,Q,I,L | 4.03-8.08: M,H,K,F,R,Y,W |
| Polarity | 4.9-6.2: L,I,F,W,C,M,V,Y | 8.0-9.2: P,A,T,G,S | 10.4-13.0: H,Q,R,K,N,E,D |
| Polarizability | 0-0.108: G,A,S,D,T | 0.128-0.186: C,P,N,V,E,Q,I,L | 0.219-0.409: K,M,H,F,R,Y,W |
| Charge | Positive: K,R | Neutral: A,N,C,Q,G,H,I,L,M,F,P,S,T,W,Y,V | Negative: D,E |
| Secondary structure | Helix: E,A,L,M,Q,K,R,H | Strand: V,I,Y,C,W,F,T | Coil: G,N,P,S,D |
| Solvent accessibility | Buried: A,L,F,C,G,I,V,W | Exposed: R,K,Q,E,N,D | Intermediate: M,S,P,T,H,Y |
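The features are then computed from a correlation factor
that is given by:
$$\theta_\lambda = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda} \Theta(R_i, R_{i+\lambda})$$
where $\theta_\lambda$ is the λ-tier correlation factor
($\theta_1$ being the first-tier correlation factor), which
reflects the sequence order between all of the λ-most
adjacent amino acid residues in the protein sequence
(λ = 1, 2, ..., m), m is the maximum λ value, N is the
number of amino acids in the sequence, and
$\Theta(R_i, R_j)$ is the correlation factor between two
residues, given by
$$\Theta(R_i, R_j) = \frac{1}{3}\left\{[H_1(R_j) - H_1(R_i)]^2 + [H_2(R_j) - H_2(R_i)]^2 + [M(R_j) - M(R_i)]^2\right\}$$
where $H_1(R_i)$, $H_2(R_i)$, and $M(R_i)$ are the standardized
hydrophobicity, hydrophilicity, and side-chain mass of the
amino acid $R_i$.
The first 20 features are amino acid composition and are
given by:
$$X_i = \frac{f_i}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{m} \theta_j}, \quad i = 1, 2, \ldots, 20$$
where $f_i$ is the relative frequency of amino acid type i, w
is a weighting factor (the default is 0.1), and $\theta_j$ is
the j-tier correlation factor.
The features from 21 and above are given by:
$$X_{20+\lambda} = \frac{w \, \theta_\lambda}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{m} \theta_j}, \quad \lambda = 1, 2, \ldots, m$$
where m is the maximum λ value [14, 24].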
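A sketch of the pseudo-amino acid composition under stated assumptions follows. The property tables are supplied by the caller as dictionaries of already standardized values (one dictionary per property, for example hydrophobicity, hydrophilicity, and side-chain mass); no real property values are included here, and all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_correlation(a, b, properties):
    """Mean squared difference of the standardized properties of two residues."""
    return sum((p[b] - p[a]) ** 2 for p in properties) / len(properties)


def tier_factor(sequence, lam, properties):
    """Average residue correlation over all residue pairs that are lam apart."""
    n = len(sequence)
    total = sum(
        residue_correlation(sequence[i], sequence[i + lam], properties)
        for i in range(n - lam)
    )
    return total / (n - lam)


def pseudo_aac(sequence, properties, max_lambda, weight=0.1):
    """Return 20 composition features followed by max_lambda order features."""
    n = len(sequence)
    if not 0 < max_lambda < n:
        raise ValueError("max_lambda must satisfy 0 < max_lambda < len(sequence)")
    freqs = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    thetas = [tier_factor(sequence, lam, properties) for lam in range(1, max_lambda + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

The constraint `max_lambda < len(sequence)` matters in practice: the shortest sequence in a data set limits the largest λ that can be used for all sequences.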
User defined options
Some feature types are provided with user-defined
options that can be set to values selected from a dropdown
menu or entered from the keyboard. Geary, Moran, and
normalized Moreau-Broto autocorrelation, and pseudo-amino
acid composition can be redefined by choosing any one
of the 544 physicochemical properties each time (Table
3). For easy access, the 544 physicochemical properties of
the amino acids, as reported in the AAindex database, are
available from a dropdown list, which the user can scroll
to locate the physicochemical property of interest.
Other feature types accept a varying number of
options besides the default ones, such as pseudo-amino
acid composition (lambda and weight), sequence order
correlation factor (lambda), quasi-sequence order (weight
and maximum lag), and sequence order coupling (maximum
lag).
Such user-defined options provide flexibility to the
user and considerable extensibility to FEPS, allowing it to
compute a large number of feature types.
III. DISCUSSION
FEPS is the most up-to-date web-based feature
extraction application compared with the existing ones. It
calculates the most commonly used published features; in
addition, the user can redefine some of the features by
choosing one of the 544 physicochemical properties or by
entering user-defined amino acid indices, which provides a
large number of feature choices. Another aspect not found
in peer applications is multiple-file input. FEPS accepts as
input either a single fasta-formatted file containing
multiple protein sequences or multiple files, each
containing multiple sequences. In the case of multiple
files, FEPS assumes that the files contain sequences
belonging to different groups; it then creates features for
all input files in one output file, with the last column
indicating the group label of each sequence.
Moreover, FEPS provides different output file formats:
CSV, ARFF, svm-light, and tab-delimited. The other peer
web-based applications compute far fewer features:
PROFEAT has 51 and SPiCE has 17.
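The labeled output layout can be illustrated with a short sketch that derives a group label from a file name and writes a feature table as CSV and in the SVM-light text format. This is not the FEPS writer: the function names, the `class` column header, and the mapping from group names to numeric SVM-light targets are assumptions.

```python
import csv
import io
import os


def label_from_filename(path):
    """Use the file name without directory or extension as the group label."""
    return os.path.splitext(os.path.basename(path))[0]


def to_csv(rows, labels, feature_names):
    """CSV text with one row per sequence and the group label as last column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(list(feature_names) + ["class"])
    for row, label in zip(rows, labels):
        writer.writerow(list(row) + [label])
    return buffer.getvalue()


def to_svmlight(rows, labels, class_ids):
    """SVM-light text: '<target> <index>:<value> ...' with 1-based indices."""
    lines = []
    for row, label in zip(rows, labels):
        # Zero-valued features may be omitted in this sparse format.
        pairs = " ".join(f"{j}:{v}" for j, v in enumerate(row, start=1) if v != 0)
        lines.append(f"{class_ids[label]} {pairs}".rstrip())
    return "\n".join(lines) + "\n"
```

For example, sequences read from `group1.fasta` receive the label `group1`, and `class_ids` such as `{"group1": 1, "group2": -1}` maps two groups to the targets used for binary classification.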
FEPS in action
Given one or more fasta-formatted files as
input, FEPS performs all the complicated computation on
the server side and provides the user with
ready-to-use output files.
The input can be a single file or multiple fasta-formatted
files of protein sequences (Fig. 1). It is recommended that
the input file names be meaningful, as they will be
used as the names of the groups the sequences belong to.
Fig. 1. A snapshot shows multiple files as input
After providing the input files, the user will select one
of the seven feature groups (Fig. 2). Each feature group
has several feature types. The dropdown list can be used to
scroll down or up to the feature type of interest.
Fig. 2. A snapshot shows the five feature groups
Once a feature group and a feature type have been
selected, the options fields valid for that particular feature
type will be enabled to accept data entry (Fig. 3).
Fig. 3. A snapshot shows the feature options
Default values are provided for some options. Some
features require a physicochemical property, which can be
selected from the dropdown list (Fig. 4); the aaindex ID
field will be filled automatically with the right ID number
or the user can enter any valid aaindex ID.
Fig. 4. A snapshot shows the physicochemical properties
of the amino acids
Some features require user-defined indices or the indices
of any physicochemical property (Fig. 5). User-defined
indices can be developed by the researcher based on a
certain criterion, for example by giving each amino acid a
numeric index based on its probability of being found in
a secondary structure zone or a disordered protein region.
There are numerous ways in which a researcher may
generate custom indices.
Fig. 5. A snapshot shows the fields of the user-defined
amino acid indices
The email field is optional and is needed only if the user
wants the output files sent as attachments. The output file
format can be any one of four choices: comma-separated
values (CSV) file, SVM-light file, Weka (ARFF) file, or
tab-delimited text file (Fig. 6). These are the most
commonly used file formats in machine learning.
Fig. 6. A snapshot shows the output options
Once all fields have been filled in and the form has been
submitted, the entries are validated and any error found is
reported. The output files are presented on the page for
download and are also emailed if an email address is
provided.
The output consists of five files: sequence feature file,
sequence class names, sequence feature name, feature
options, and feature element names (Fig. 7).
Fig. 7. A snapshot shows the feature statistics output files
FEPS is expected to work fast but the speed is
governed by different factors including the web traffic,
internet speed, the size of the input files and the number
and lengths of the sequences submitted for processing.
IV. CONCLUSION
FEPS was developed to solve a real problem: the
author was searching the web for a state-of-the-art
tool to extract protein sequence-driven features for a
protein classification problem, but in vain. It was
designed with researchers in mind who do not want to be
overwhelmed by programming details. It is a
handy tool for those who want to extract protein features
for statistical or machine learning purposes.
FEPS is equipped with the most commonly used
published feature extraction methods and makes all
reported physicochemical properties of amino acids
available to give the user different choices. It accepts
multiple input files and outputs ready-to-use files that are
suitable for common machine learning software and
packages such as Weka, SVM-light, MATLAB, and Python.
V. REFERENCES
1. Karlin, S. and S.F. Altschul, Methods for assessing the
statistical significance of molecular sequence features by
using general scoring schemes. Proc Natl Acad Sci U S A,
1990. 87(6): p. 2264-8.
2. in Pure and Applied Chemistry1972. p. 639.
3. Kawashima, S., et al., AAindex: amino acid index database,
progress report 2008. Nucleic Acids Res, 2008.
36(Database issue): p. D202-5.
4. Zhang, Y., Progress and challenges in protein structure
prediction. Curr Opin Struct Biol, 2008. 18(3): p. 342-8.
5. Esmaeili, M., H. Mohabatkar, and S. Mohsenzadeh, Using
the concept of Chou's pseudo amino acid composition for
risk type prediction of human papillomaviruses. J Theor
Biol, 2010. 263(2): p. 203-9.
6. Feng, Y.E., Identify Secretory Protein of Malaria Parasite
with Modified Quadratic Discriminant Algorithm and Amino
Acid Composition. Interdiscip Sci, 2015.
7. Chou, K.C., Prediction of protein cellular attributes using
pseudo-amino acid composition. Proteins, 2001. 43(3): p.
246-55.
8. Nakashima, H., K. Nishikawa, and T. Ooi, The folding type
of a protein is relevant to the amino acid composition. J
Biochem, 1986. 99(1): p. 153-62.
9. Yang, W.-Y., B.-L. Lu, and Y. Yang. A comparative study
on feature extraction from protein sequences for subcellular
localization prediction. in Computational Intelligence and
Bioinformatics and Computational Biology, 2006.
CIBCB'06. 2006 IEEE Symposium on. 2006. IEEE.
10. Nanni, L., A. Lumini, and S. Brahnam, An empirical study
of different approaches for protein classification.
ScientificWorldJournal, 2014. 2014: p. 236717.
11. Chou, K.C., Some remarks on protein attribute prediction
and pseudo amino acid composition. J Theor Biol, 2011.
273(1): p. 236-47.
12. Nanuwa, S.S. and H. Seker. Investigation into the role of
sequence-driven-features for prediction of protein structural
classes. in BioInformatics and BioEngineering, 2008. BIBE
2008. 8th IEEE International Conference on. 2008.
13. Bengio, Y., A. Courville, and P. Vincent, Representation
learning: a review and new perspectives. IEEE Trans
Pattern Anal Mach Intell, 2013. 35(8): p. 1798-828.
14. Li, Z.R., et al., PROFEAT: a web server for computing
structural and physicochemical features of proteins and
peptides from amino acid sequence. Nucleic Acids Res,
2006. 34(Web Server issue): p. W32-7.
15. Rao, H.B., et al., Update of PROFEAT: a web server for
computing structural and physicochemical features of
proteins and peptides from amino acid sequence. Nucleic
Acids Res, 2011. 39(Web Server issue): p. W385-90.
16. Nikolajewa, S., et al., BioBayesNet: a web server for feature
extraction and Bayesian network modeling of biological
sequence data. Nucleic Acids Res, 2007. 35(Web Server
issue): p. W688-93.
17. van den Berg, B.A., et al., SPiCE: a web-based tool for
sequence-based protein classification and exploration. BMC
Bioinformatics, 2014. 15: p. 93.
18. Shen, H.B. and K.C. Chou, PseAAC: a flexible web server
for generating various kinds of protein pseudo amino acid
composition. Anal Biochem, 2008. 373(2): p. 386-8.
19. Bhasin, M. and G.P. Raghava, Classification of nuclear
receptors based on amino acid composition and dipeptide
composition. J Biol Chem, 2004. 279(22): p. 23262-6.
20. Blom, N., S. Gammeltoft, and S. Brunak, Sequence and
structure-based prediction of eukaryotic protein
phosphorylation sites. J Mol Biol, 1999. 294(5): p. 1351-62.
21. McGuffin, L.J., K. Bryson, and D.T. Jones, The PSIPRED
protein structure prediction server. Bioinformatics, 2000.
16(4): p. 404-5.
22. Lee, D., O. Redfern, and C. Orengo, Predicting protein
function from sequence and structure. Nat Rev Mol Cell
Biol, 2007. 8(12): p. 995-1005.
23. Nakai, K. and M. Kanehisa, A knowledge base for
predicting protein localization sites in eukaryotic cells.
Genomics, 1992. 14(4): p. 897-911.
24. Cao, D.S., Q.S. Xu, and Y.Z. Liang, propy: a tool to
generate various modes of Chou's PseAAC. Bioinformatics,
2013. 29(7): p. 960-2.
25. Harris, C.E. and D.C. Teller, Estimation of primary
sequence homology from amino acid composition of
evolutionary related proteins. J Theor Biol, 1973. 38(2): p.
347-62.
26. Dedman, J.R., R.W. Gracy, and B.G. Harris, A method for
estimating sequence homology from amino acid
compositions. The evolution of Ascaris employing aldolase
and glyceraldehyde-3-phosphate dehydrogenase. Comp
Biochem Physiol B, 1974. 49(4): p. 715-31.
27. Cornish-Bowden, A., How reliably do amino acid
composition comparisons predict sequence similarities
between proteins? J Theor Biol, 1979. 76(4): p. 369-86.
28. Bhasin, M. and G.P. Raghava, ESLpred: SVM-based method
for subcellular localization of eukaryotic proteins using
dipeptide composition and PSI-BLAST. Nucleic Acids Res,
2004. 32(Web Server issue): p. W414-9.
29. Petrilli, P., Classification of protein sequences by their
dipeptide composition. Comput Appl Biosci, 1993. 9(2): p.
205-9.
30. Bartholomew, D.J., Operational Research Quarterly (1970-
1977), 1971. 22(2): p. 199-201.
31. Pearson, K., Note on Regression and Inheritance in the Case
of Two Parents. Proceedings of the Royal Society of
London, 1895. 58: p. 240-242.
32. Ren, X.-M. and J.-F. Xia, Prediction of Protein-Protein
Interaction Sites by Using Autocorrelation Descriptor and
Support Vector Machine, in Advanced Intelligent Computing
Theories and Applications. With Aspects of Artificial
Intelligence, D.-S. Huang, et al., Editors. 2010, Springer
Berlin Heidelberg. p. 76-82.
33. Ong, S.A., et al., Efficacy of different protein descriptors in
predicting protein functional families. BMC Bioinformatics,
2007. 8: p. 300.
34. Moran, P.A., Notes on continuous stochastic phenomena.
Biometrika, 1950. 37(1-2): p. 17-23.
35. Geary, R.C., The Contiguity Ratio and Statistical Mapping.
The Incorporated Statistician, 1954. 5(3): p. 115-146.
36. Haining, R.P., Geography, 1989. 74(1): p. 81.
37. Broto, P., G. Moreau, and C. Vandycke, Molecular
Structures Perception, Auto-correlation Descriptor. Eur. J.
Med. Chem., 1984. 19: p. 71-78.
38. Govindan, G. and A.S. Nair. Composition, Transition and
Distribution (CTD) - A dynamic feature for
predictions based on hierarchical structure of cellular
sorting. in India Conference (INDICON), 2011 Annual
IEEE. 2011.
39. Schneider, G. and P. Wrede, The rational design of amino
acid sequences by artificial neural networks and simulated
molecular evolution: de novo design of an idealized leader
peptidase cleavage site. Biophys J, 1994. 66(2 Pt 1): p. 335-
44.
40. Grantham, R., Amino acid difference formula to help explain
protein evolution. Science, 1974. 185(4154): p. 862-4.
41. Chou, K.C., Prediction of protein subcellular locations by
incorporating quasi-sequence-order effect. Biochem
Biophys Res Commun, 2000. 278(2): p. 477-83.