
Prediction of Subcellular Localization of Apoptosis Protein Using Chou’s Pseudo Amino Acid Composition

Hao Lin · Hao Wang · Hui Ding · Ying-Li Chen · Qian-Zhong Li

H. Lin (✉) · H. Wang
Center for Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, 610054 Chengdu, China
e-mail: hlin@uestc.edu.cn

H. Ding · Y.-L. Chen · Q.-Z. Li
Laboratory of Theoretical Biophysics, School of Physics Sciences and Technology, Inner Mongolia University, 010021 Hohhot, China

Received: 8 July 2008 / Accepted: 16 December 2008 / Published online: 24 January 2009
© Springer Science+Business Media B.V. 2009
Acta Biotheor (2009) 57:321–330
DOI 10.1007/s10441-008-9067-4

Abstract Apoptosis proteins play an essential role in regulating the balance between cell proliferation and death. The successful prediction of the subcellular localization of apoptosis proteins directly from primary sequence greatly benefits the understanding of programmed cell death and drug discovery. In this paper, using Chou’s pseudo amino acid composition (PseAAC), the subcellular localization of a total of 317 apoptosis proteins is predicted by a support vector machine (SVM). Jackknife cross-validation is applied to test the predictive capability of the proposed method. The results show that the overall prediction accuracy is 91.1%, which is higher than that of previous methods. Furthermore, another dataset containing 98 apoptosis proteins is examined by the proposed method; the overall success rate is 92.9%.

Keywords Apoptosis protein · Subcellular localization · Pseudo amino acid composition · Support vector machine

1 Introduction
Apoptosis is a type of cell death that regulates growth, development and immune response, and clears redundant or abnormal cells in organisms (Raff 1998; Steller 1995). It plays a key role in development and tissue homeostasis (Chou et al. 1998, 1999). Malfunctions of apoptosis lead to a variety of formidable diseases. For example, blocking apoptosis is associated with cancer (Adams and Cory 1998; Evan and Littlewood 1998) and autoimmune disease, whereas unwanted apoptosis can possibly lead to ischemic damage (Reed and Paternostro 1999) or neurodegenerative disease (Schulz et al. 1999). Because the localization of a protein within the cell is closely associated with its function, the study of the subcellular localization of apoptosis proteins is very important for elucidating their functions in various cellular processes (Schulz et al. 1999; Suzuki et al. 2000) and for drug development (Chou et al. 1997, 2000; Chou 2004).
Computational approaches, such as structural bioinformatics (Chou 2004), molecular docking (Chou et al. 2003; Li et al. 2007; Wang et al. 2008; Zheng et al. 2007), molecular packing (Chou et al. 1984, 1988), pharmacophore modeling (Sirois et al. 2004; Chou et al. 2006), the Monte Carlo simulation approach (Chou 1992), diffusion-controlled reaction simulation (Chou and Zhou 1982), bio-macromolecular internal collective motion simulation (Chou 1988), QSAR (Du et al. 2008), protein subcellular location prediction (Chou and Shen 2007a, 2008a), identification of membrane proteins and their types (Chou and Shen 2007b), identification of enzymes and their functional classes (Shen and Chou 2007), identification of GPCRs and their types (Chou 2005), identification of proteases and their types (Chou and Shen 2008b), protein cleavage site prediction (Shen and Chou 2008b) and signal peptide prediction (Chou and Shen 2007c), can provide timely and very useful information and insights for both basic research and drug design, and hence are widely welcomed by the scientific community. The present study attempts to develop a computational approach for predicting the subcellular localization of apoptosis proteins, in the hope of stimulating the development of the relevant areas.
In the past 5 years, several algorithms, such as the covariant discriminant function (Zhou and Doctor 2003), support vector machine (SVM) (Huang and Shi 2005; Zhang et al. 2006; Zhou et al. 2008; Shi et al. 2008), Bayesian classifier (Bulashevska and Eils 2006), increment of diversity (ID) (Chen and Li 2007a), increment of diversity combined with support vector machine (ID_SVM) (Chen and Li 2007b) and fuzzy K-nearest neighbor (FKNN) (Jiang et al. 2008; Ding and Zhang 2008), have been proposed to predict the subcellular localization of apoptosis proteins based on various forms of amino acid composition or pseudo amino acid composition. The pseudo amino acid composition (PseAAC) was first proposed by Chou to improve the prediction quality of protein subcellular localization (Chou 2001; Chou and Shen 2007a). PseAAC represents a protein sequence with a discrete model without completely losing its sequence-order information.
In this paper, based on the concept of Chou’s PseAAC, SVM is applied to the latest dataset of 317 apoptosis proteins. Jackknife cross-validation is applied to examine the predictive ability of the method. Moreover, another dataset of 98 apoptosis proteins constructed by Zhou and Doctor (2003) is examined by the proposed method. The results show that the proposed method can improve the predictive success rate, and hence it may play a complementary role to other existing methods for predicting the subcellular localization of apoptosis proteins.
2 Materials and Methods
2.1 Data Sets
The 317 apoptosis proteins extracted from Swiss-Prot 49.0 can be classified into six subcellular locations: 112 cytoplasmic proteins, 55 membrane proteins, 34 mitochondrial proteins, 17 secreted proteins, 52 nuclear proteins and 47 endoplasmic reticulum proteins. The distribution of sequence identity is as follows: 40.1% with ≤40% sequence identity, 15.5% with sequence identity from 41% to 80%, 18.9% with sequence identity from 81% to 90% and 25.6% with ≥91% sequence identity (Chen and Li 2007a, b).
In addition, a dataset of 98 apoptosis proteins, containing 43 cytoplasmic proteins, 30 plasma membrane-bound proteins, 13 mitochondrial proteins and 12 other proteins (Zhou and Doctor 2003), is also used to estimate the effectiveness of the method.
2.2 Pseudo Amino Acid Composition
The choice of appropriate parameters is one of the most important aspects of a prediction problem. PseAAC includes not only the main feature of amino acid composition but also the sequence-order correlation (Chou 2001; Chou and Shen 2007a; Shen and Chou 2008a). Consider a protein chain X with a length of $L$ amino acid residues:

$$R_1 R_2 R_3 \ldots R_L \tag{1}$$

Then the protein may be denoted as a $(20 + \lambda)$-dimensional vector defined by $20 + \lambda$ discrete numbers, i.e.

$$X = \left[ x_1 \ldots x_{20} \; x_{20+1} \ldots x_{20+\lambda} \right]^{T} \tag{2}$$

where

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & (1 \le u \le 20) \\[2ex] \dfrac{\omega\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & (21 \le u \le 20 + \lambda) \end{cases} \tag{3}$$

In Eq. 3, $f_i$ is the normalized frequency of the $i$-th of the 20 amino acids in protein X, and $\omega$ is the weight factor for the sequence-order effect. $\theta_j$ is the $j$-tier sequence correlation factor, computed by the following formula:

$$\theta_j = \frac{1}{L-j} \sum_{i=1}^{L-j} \Theta(R_i, R_{i+j}), \quad (j < L) \tag{4}$$

where $\Theta(R_i, R_{i+j})$ is the correlation function, given by

$$\Theta(R_i, R_{i+j}) = \frac{1}{k} \sum_{l=1}^{k} \left[ H_l(R_{i+j}) - H_l(R_i) \right]^2 \tag{5}$$
In Eq. 5, $k$ is the number of factors, and $H_l(R_i)$ is the value of the $l$-th physico-chemical characteristic of the amino acid $R_i$. These physico-chemical characteristics mainly include hydrophobicity, hydrophilicity, side chain mass, pK of the α-COOH group, pK of the α-NH₃⁺ group and pI at 25 °C. Hydrophobicity, hydrophilicity and side chain mass are used in the current study. The physico-chemical characteristic values must be converted to standard form by the following equation:

$$H_l(R_i) = \frac{H_l^0(i) - \sum_{m=1}^{20} H_l^0(m)/20}{\sqrt{\dfrac{\sum_{n=1}^{20} \left[ H_l^0(n) - \sum_{m=1}^{20} H_l^0(m)/20 \right]^2}{20}}} \tag{6}$$

where $H_l^0(i)$ is the original value of the $l$-th physico-chemical characteristic for the $i$-th amino acid. We use the numerical indices 1, 2, 3, …, 20 to represent the 20 native amino acids according to the alphabetical order of their single-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. The data obtained by this standard conversion have a zero mean value and remain unchanged if they go through the same conversion procedure again.
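The construction in Eqs. 2 to 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `standardize` and `pseaac` are hypothetical, and `toy_scale` is a placeholder property table (alphabetical rank of each residue), not real hydrophobicity, hydrophilicity or side chain mass data. A real run would pass the three published scales instead.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical single-letter codes, as in the text


def standardize(raw):
    """Eq. 6: zero-mean, unit-variance conversion over the 20 amino acids."""
    values = [raw[a] for a in AMINO_ACIDS]
    mean = sum(values) / 20
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / 20)
    return {a: (raw[a] - mean) / sd for a in AMINO_ACIDS}


def pseaac(sequence, scales, lam=3, omega=0.1):
    """Return the (20 + lam)-dimensional vector of Eqs. 2-5.

    `sequence` may only contain the 20 standard residues;
    `scales` is a list of raw property tables (one dict per factor).
    """
    L = len(sequence)
    if lam >= L:
        raise ValueError("lam must be smaller than the sequence length")
    props = [standardize(s) for s in scales]               # the k factors H_l
    freqs = [sequence.count(a) / L for a in AMINO_ACIDS]   # f_i
    thetas = []
    for j in range(1, lam + 1):                            # Eq. 4
        total = 0.0
        for i in range(L - j):
            a, b = sequence[i], sequence[i + j]
            # Eq. 5: mean squared difference over the k factors
            total += sum((h[b] - h[a]) ** 2 for h in props) / len(props)
        thetas.append(total / (L - j))
    denom = sum(freqs) + omega * sum(thetas)               # shared denominator of Eq. 3
    return [f / denom for f in freqs] + [omega * t / denom for t in thetas]


# Toy scale (NOT real physico-chemical data): each residue's alphabetical rank.
toy_scale = {a: float(i) for i, a in enumerate(AMINO_ACIDS)}
vector = pseaac("MKVLAAGIVGLCAKYLLEQ", [toy_scale], lam=3, omega=0.1)
print(len(vector), sum(vector))
```

With `lam=3` the vector has 23 components, and because both branches of Eq. 3 share one denominator the components sum to 1 (up to floating-point error).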
2.3 Support Vector Machine
SVM is a machine learning method based on statistical learning theory (Vapnik 1998). As a supervised learning technique, it has been successfully used in many areas of bioinformatics; it transforms the input vectors into a high-dimensional Hilbert space and seeks a separating hyperplane in that space. We now briefly explain the basic idea of the SVM. Consider a two-class classification problem with a series of training vectors $\vec{X}_i \in R^d$ ($i = 1, 2, \ldots, N$) and corresponding labels $y_i \in \{+1, -1\}$ ($i = 1, 2, \ldots, N$), where +1 and −1 indicate the two classes. The SVM maps the input vectors $\vec{X}_i \in R^d$ into a high-dimensional feature space and constructs an optimal separating hyperplane with the largest distance between the two classes, measured along a line perpendicular to this hyperplane. The decision function implemented by the SVM can be written as:
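$$f(\vec{X}) = \operatorname{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\vec{X}, \vec{X}_i) + b \right) \tag{7}$$

where $K(\vec{X}, \vec{X}_i)$ is a kernel function which defines an inner product in a high-dimensional feature space. Three kinds of kernel functions may be defined as:

Polynomial function:

$$K(\vec{X}_i, \vec{X}_j) = \left( \vec{X}_i \cdot \vec{X}_j + 1 \right)^{d} \tag{8}$$

Radial basis function (RBF):

$$K(\vec{X}_i, \vec{X}_j) = \exp\left( -\gamma \lVert \vec{X}_i - \vec{X}_j \rVert^2 \right) \tag{9}$$

Sigmoid function:

$$K(\vec{X}_i, \vec{X}_j) = \tanh\left[ b \left( \vec{X}_i \cdot \vec{X}_j \right) + c \right] \tag{10}$$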
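The coefficients $\alpha_i$ can be obtained by solving the following convex quadratic programming (QP) problem:

$$\text{Maximize} \quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\vec{X}_i, \vec{X}_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C \tag{11}$$

where $\sum_{i=1}^{N} \alpha_i y_i = 0$, $i = 1, 2, \ldots, N$. The regularization parameter $C$ controls the trade-off between margin and misclassification error. The vectors $\vec{X}_i$ are called support vectors only if the corresponding $\alpha_i > 0$.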
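To make Eq. 7 and Eq. 9 concrete, the following minimal sketch evaluates the RBF kernel and the resulting decision function for a query vector. The names `rbf_kernel` and `decide` are hypothetical, and the support vectors, coefficients and bias are hand-picked illustrative values; in practice they come from solving Eq. 11. Returning +1 when the sum is exactly zero is a convention chosen here, not something stated in the text.

```python
import math


def rbf_kernel(x, z, gamma):
    """Eq. 9: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decide(x, support_vectors, labels, alphas, b, gamma):
    """Eq. 7: sign of the kernel expansion over the support vectors."""
    s = sum(y * a * rbf_kernel(x, sv, gamma)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if s + b >= 0 else -1


# Hypothetical, hand-picked values; a real model obtains alphas and b from Eq. 11.
svs = [(0.0, 0.0), (1.0, 1.0)]
ys = [+1, -1]
alphas = [1.0, 1.0]  # satisfies sum(alpha_i * y_i) = 0 and 0 <= alpha_i <= C for C >= 1

print(decide((0.1, 0.0), svs, ys, alphas, b=0.0, gamma=0.5))  # query near the +1 support vector
print(decide((0.9, 1.0), svs, ys, alphas, b=0.0, gamma=0.5))  # query near the -1 support vector
```

Only vectors with non-zero coefficients contribute to the sum, which is why the decision function needs the support vectors alone rather than the full training set.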
In general, One-Versus-Rest (OVR) and One-Versus-One (OVO) are the most commonly used approaches for solving multi-class problems; both reduce a single multi-class problem to multiple binary problems. This paper uses the OVO strategy. The software used to implement the SVM is LibSVM 2.83, written by Lin’s lab, which can be freely downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm (Chang and Lin 2001). The RBF kernel is used for all our calculations. The regularization parameter $C$ and the kernel parameter $\gamma$ of the RBF must be determined in advance.
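The OVO reduction can be illustrated with a short voting sketch: one binary classifier is trained per pair of classes, and the class collecting the most pairwise votes wins. The function `ovo_predict` and the stub classifiers below are hypothetical stand-ins for trained pairwise SVMs, used only to show the bookkeeping.

```python
from collections import Counter
from itertools import combinations


def ovo_predict(x, classes, binary_classifiers):
    """Majority vote over the k(k-1)/2 pairwise classifiers.

    binary_classifiers[(a, b)](x) must return either a or b.
    """
    votes = Counter(binary_classifiers[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]


classes = ["Cyto", "Memb", "Mito", "Secr", "Nucl", "Endo"]

# Hypothetical stand-ins for trained pairwise SVMs: each one answers "Mito"
# whenever "Mito" is one of its two classes, otherwise the first class of its pair.
stubs = {pair: (lambda x, p=pair: "Mito" if "Mito" in p else p[0])
         for pair in combinations(classes, 2)}

print(len(stubs), ovo_predict(None, classes, stubs))
```

For the six locations of the 317-protein dataset this gives 15 pairwise classifiers, and in this toy setup "Mito" wins with five votes.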
2.4 The Criteria Definitions
The predictive capability of the algorithm is estimated by sensitivity ($S_n$), specificity ($S_p$) and correlation coefficient (CC), defined as follows (Chen and Li 2007a, b):

$$S_n = TP/(TP + FN) \tag{12}$$

$$S_p = TP/(TP + FP) \tag{13}$$

$$CC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TN + FN)(TP + FN)(TN + FP)}} \tag{14}$$

where TP denotes the number of correctly recognized positives, FN denotes the number of positives recognized as negatives, FP denotes the number of negatives recognized as positives, and TN denotes the number of correctly recognized negatives.
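Eqs. 12 to 14 translate directly into code. The sketch below uses hypothetical function names and hypothetical counts for a single location treated as the positive class; none of the numbers come from the paper's tables. Returning 0 when the denominator of Eq. 14 vanishes is an assumption made here to keep the function total.

```python
import math


def sensitivity(tp, fn):
    """Eq. 12."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Eq. 13, as defined in this paper: TP / (TP + FP)."""
    return tp / (tp + fp)


def correlation_coefficient(tp, tn, fp, fn):
    """Eq. 14; returns 0.0 if the denominator is zero (an assumption of this sketch)."""
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for one location versus all others.
tp, fn, fp, tn = 40, 10, 5, 145
print(round(sensitivity(tp, fn), 3),
      round(specificity(tp, fp), 3),
      round(correlation_coefficient(tp, tn, fp, fn), 3))
```

In a multi-class setting these counts are computed once per location, with that location as the positive class and all remaining locations as negatives.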
3 Results and Discussion
In statistical prediction, the following three cross-validation tests are often used to examine the power of a predictor: the independent dataset test, the sub-sampling (such as fivefold or tenfold sub-sampling) test, and the jackknife test. Of these three methods, the jackknife test is deemed the most objective and rigorous one (Chou and Zhang 1995), because it always yields a unique outcome, as demonstrated by a penetrating analysis in a recent comprehensive review (Chou and Shen 2007a), and it has been widely and increasingly adopted by investigators to test the power of various prediction methods (Lin and Li 2007a, b; Lin 2008; Li and Li 2008a, b; Jia et al. 2008; Jin et al. 2008; Zhang and Fang 2008; Munteanu et al. 2008; Niu et al. 2008; Lin et al. 2008; Gao et al. 2008). In jackknife cross-validation, each protein in the dataset is in turn singled out as an independent test sample, and all the rule parameters are calculated from the remaining proteins, without including the one being identified. Therefore, we also use jackknife cross-validation to examine the proposed method.
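The jackknife procedure described above amounts to a leave-one-out loop. The sketch below is illustrative only: it uses a 1-nearest-neighbour rule as a stand-in classifier (the paper trains an SVM at this step), and the feature vectors and labels are hypothetical.

```python
import math


def nearest_neighbor(train, query):
    """Stand-in classifier (1-nearest neighbour); the paper uses an SVM instead."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def jackknife_accuracy(samples, classify=nearest_neighbor):
    """Leave each sample out in turn, train on the rest, test on the held-out one."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]   # the held-out sample never enters training
        correct += classify(rest, features) == label
    return correct / len(samples)


# Hypothetical two-dimensional feature vectors with location labels.
data = [((0.0, 0.1), "Cyto"), ((0.1, 0.0), "Cyto"), ((0.2, 0.1), "Cyto"),
        ((1.0, 0.9), "Memb"), ((0.9, 1.0), "Memb"), ((0.4, 0.3), "Memb")]
print(jackknife_accuracy(data))
```

Because every sample is held out exactly once, the result does not depend on any random split, which is the sense in which the test yields a unique outcome for a given dataset.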
The weight factor $\omega$ and the correlation factor $\lambda$ in Chou’s PseAAC are two important parameters. Usually, the larger $\lambda$ is, the more information the representation bears. However, if the PseAAC contains too many components, the cluster-tolerant capacity is reduced (Chou 1999), which lowers the jackknife success rate. We examined a large number of parameter combinations for PseAAC ($\omega$ and $\lambda$) and SVM ($C$ and $\gamma$) by jackknife cross-validation. For the current study, we found that the predicted success rate is highest when $\omega = 0.1$, $\lambda = 3$, $C$ = 1,000 and $\gamma = 0.04$. The results for the 317 apoptosis proteins are listed in Table 1. They show that the sensitivity, specificity and CC of endoplasmic reticulum proteins are 0.957, 0.957 and 0.949, respectively, which are higher than those of the other subcellular locations.
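The parameter scan described above can be organized as an exhaustive grid search. The sketch below is an assumption about how such a scan might be structured, not the authors' procedure: the candidate values are hypothetical (the paper does not list the ranges it examined), and `toy_score` is an artificial placeholder for "jackknife accuracy of an SVM trained with these settings", constructed to peak at the setting reported above so that the search has something to find.

```python
from itertools import product


def grid_search(score, grid):
    """Return (best_score, best_params) over the Cartesian product of `grid`."""
    names = list(grid)
    best = None
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        s = score(**params)
        if best is None or s > best[0]:
            best = (s, params)
    return best


# Hypothetical candidate values for the four parameters.
grid = {
    "omega": [0.05, 0.1, 0.3],
    "lam": [1, 3, 5],
    "C": [10, 100, 1000],
    "gamma": [0.01, 0.04, 0.08],
}


def toy_score(omega, lam, C, gamma):
    # Artificial placeholder, NOT a measured accuracy.
    return -abs(omega - 0.1) - abs(lam - 3) - abs(C - 1000) / 1000 - abs(gamma - 0.04)


best_score, best_params = grid_search(toy_score, grid)
print(best_params)
```

In a real run, `score` would rebuild the PseAAC vectors for the given $\omega$ and $\lambda$, train the SVM with the given $C$ and $\gamma$, and return the jackknife success rate.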
A comparison with other methods is shown in Table 2. The sensitivities of SVM combined with PseAAC equal or exceed those of the other methods for cytoplasmic, mitochondrial and endoplasmic reticulum proteins, whereas for membrane, secreted and nuclear proteins they are lower than the best values obtained by ID or FKNN. The overall predictive success rate of the proposed method is the highest among the compared methods.

Table 1 The predictive results of jackknife cross-validation for 317 apoptosis proteins

Location                  Sn      Sp      CC
Cyto                      0.938   0.921   0.890
Memb                      0.909   0.893   0.880
Mito                      0.853   0.935   0.881
Secr                      0.765   0.813   0.777
Nucl                      0.904   0.887   0.874
Endo                      0.957   0.957   0.949
Overall prediction rate   0.911

Table 2 The predictive results of different methods by the jackknife test for 317 apoptosis proteins (Sn × 100%)

Method                      Cyto   Memb   Mito   Secr   Nucl   Endo   Overall
ID (a)                      81.3   81.8   85.3   88.2   82.7   83.0   82.7
ID_SVM (b)                  91.1   89.1   79.4   58.8   73.1   87.2   84.2
FKNN (c)                    92.0   89.1   85.3   76.5   92.3   93.7   90.2
FKNN (d)                    93.8   92.7   82.4   76.5   90.4   93.6   90.9
SVM + PseAAC (this paper)   93.8   90.9   85.3   76.5   90.4   95.7   91.1

(a) Comes from Chen and Li (2007a). (b) Comes from Chen and Li (2007b). (c) Comes from Jiang et al. (2008). (d) Comes from Ding and Zhang (2008).
Table 3 compares the results with those of other methods for the 98 apoptosis proteins. After extensive testing, we selected $\omega = 0.3$, $\lambda = 3$, $C$ = 1,000 and $\gamma = 0.08$ for this prediction. The results show that the predictive success rate of the proposed method is 92.9%.

Table 3 The predictive results of different methods by the jackknife test for 98 apoptosis proteins (Sn × 100%)

Method                                      Cyto   Memb   Mito    Others   Overall
Covariant (a)                               97.7   73.3   30.8    25.0     72.5
SVM + 20 sqrt-amino acid composition (b)    86.0   90.0   100.0   100.0    90.8
EBGW_SVM (c)                                97.7   90.0   92.3    83.3     92.9
HensBC-approach (d)                         95.3   90.0   92.3    66.7     89.8
Dual-layer SVM (e)                          95.4   96.7   92.3    91.7     94.9
ID (f)                                      90.7   90.0   92.3    91.7     90.8
ID_SVM (g)                                  95.3   93.3   84.6    58.3     88.8
Hilbert Huang_SVM (h)                       95.3   96.7   96.7    75.7     92.9
FKNN (i)                                    95.3   96.7   100     91.7     95.9
SVM + PseAAC (this paper)                   95.3   93.3   92.3    83.3     92.9

(a) Comes from Zhou and Doctor (2003). (b) Comes from Huang and Shi (2005). (c) Comes from Zhang et al. (2006). (d) Comes from Bulashevska and Eils (2006). (e) Comes from Zhou et al. (2008). (f) Comes from Chen and Li (2007a). (g) Comes from Chen and Li (2007b). (h) Comes from Shi et al. (2008). (i) Comes from Ding and Zhang (2008).

These success rates indicate that SVM combined with PseAAC is a promising approach. We hope that novel descriptors or more appropriate parameters will further improve the performance of subcellular localization prediction for apoptosis proteins. High prediction accuracy is helpful for further drug development.

Acknowledgments This study was supported in part by the Scientific Research Startup Foundation of UESTC and the National Natural Science Foundation of China (30560039).

References
Adams JM, Cory S (1998) The Bcl-2 protein family: arbiters of cell survival. Science 281:1322–1326. doi:10.1126/science.281.5381.1322
Bulashevska A, Eils R (2006) Predicting protein subcellular locations using hierarchical ensemble of
Bayesian classifiers based on Markov chains. BMC Bioinformatics 7:298.
doi:10.1186/1471-2105-7-298
Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available at
(http://www.csie.ntu.edu.tw/~cjlin/libsvm)
Chen YL, Li QZ (2007a) Prediction of the subcellular location of apoptosis proteins. J Theor Biol
245:775–783. doi:10.1016/j.jtbi.2006.11.010
Chen YL, Li QZ (2007b) Prediction of apoptosis proteins subcellular location using improved hybrid
approach and pseudo-amino acid composition. J Theor Biol 248:377–381.
doi:10.1016/j.jtbi.2007.05.019
Chou KC (1988) Review: low-frequency collective motion in biomacromolecules and its biological
functions. Biophys Chem 30:3–48. doi:10.1016/0301-4622(88)85002-6
Chou KC (1992) Energy-optimized structure of antifreeze protein and its binding mechanism. J Mol Biol
223:509–517. doi:10.1016/0022-2836(92)90666-8
Chou KC (1999) A key driving force in determination of protein structural classes. Biochem Biophys Res
Commun 264:216–224. doi:10.1006/bbrc.1999.1325
Chou KC (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins
43:246–255. doi:10.1002/prot.1035
Chou KC (2004) Review: structural bioinformatics and its impact to biomedical science. Curr Med Chem
11:2105–2134
Chou KC (2005) Prediction of G-protein-coupled receptor classes. J Proteome Res 4:1413–1418. doi:
10.1021/pr050087t
Chou KC, Shen HB (2007a) Recent progress in protein subcellular location prediction. Anal Biochem
370:1–16. doi:10.1016/j.ab.2007.07.006
Chou KC, Shen HB (2007b) MemType-2L: a web server for predicting membrane proteins and their
types by incorporating evolution information through Pse-PSSM. Biochem Biophys Res Commun
360:339–345. doi:10.1016/j.bbrc.2007.06.027
Chou KC, Shen HB (2007c) Signal-CF: a subsite-coupled and window-fusing approach for predicting
signal peptides. Biochem Biophys Res Commun 357:633–640. doi:10.1016/j.bbrc.2007.03.162
Chou KC, Shen HB (2008a) Cell-Ploc: a package of web servers for predicting subcellular localization of
proteins in various organisms. Nat Protocols 3:153–162. doi:10.1038/nprot.2007.494
Chou KC, Shen HB (2008b) ProtIdent: a web server for identifying proteases and their types by fusing
functional domain and sequential evolution information. Biochem Biophys Res Commun
376(2):321–325. doi:10.1016/j.bbrc.2008.1008.1125
Chou KC, Zhang CT (1995) Review: prediction of protein structural classes. Crit Rev Biochem Mol Biol
30:275–349. doi:10.3109/10409239509083488
Chou KC, Zhou GP (1982) Role of the protein outside active site on the diffusion-controlled reaction of
enzyme. J Am Chem Soc 104:1409–1413. doi:10.1021/ja00369a043
Chou KC, Nemethy G, Scheraga HA (1984) Energetic approach to packing of α-helices: 2. General
treatment of nonequivalent and nonregular helices. J Am Chem Soc 106:3161–3170.
doi:10.1021/ja00323a017
Chou KC, Maggiora GM, Nemethy G, Scheraga HA (1988) Energetics of the structure of the four-alpha-
helix bundle in proteins. Proc Natl Acad Sci USA 85:4295–4299. doi:10.1073/pnas.85.12.4295
Chou KC, Jones D, Heinrikson RL (1997) Prediction of the tertiary structure and substrate binding site of
caspase-8. FEBS Lett 419:49–54. doi:10.1016/S0014-5793(97)01246-5
Chou JJ, Matsuo H, Duan H, Wagner G (1998) Solution structure of the RAIDD CARD and model for
CARD/CARD interaction in caspase-2 and caspase-9 recruitment. Cell 94:171–180.
doi:10.1016/S0092-8674(00)81417-8
Chou JJ, Li H, Salvessen GS, Yuan J, Wagner G (1999) Solution structure of BID, an intracellular
amplifier of apoptotic signalling. Cell 96:615–624. doi:10.1016/S0092-8674(00)80572-3
Chou KC, Tomasselli AG, Heinrikson RL (2000) Prediction of the tertiary structure of a caspase-9/
inhibitor complex. FEBS Lett 470:249–256. doi:10.1016/S0014-5793(00)01333-8
Chou KC, Wei DQ, Zhong WZ (2003) Binding mechanism of coronavirus main proteinase with ligands
and its implication to drug design against SARS. (Erratum: ibid., 2003, Vol.310, 675). Biochem
Biophys Res Commun 308:148–151
Chou KC, Wei DQ, Du QS, Sirois S, Zhong WZ (2006) Review: progress in computational approach to
drug development against SARS. Curr Med Chem 13:3263–3270.
doi:10.2174/092986706778773077
Ding YS, Zhang TL (2008) Using Chou’s pseudo amino acid composition to predict subcellular
localization of apoptosis proteins: an approach with immune genetic algorithm-based ensemble
classifier. Pattern Recognit Lett 29:1887–1892. doi:10.1016/j.patrec.2008.06.007
Du QS, Huang RB, Chou KC (2008) Review: recent advances in QSAR and their applications in
predicting the activities of chemical molecules, peptides and proteins for drug design. Curr Protein
Pept Sci 9:248–259. doi:10.2174/138920308784534005
Evan G, Littlewood T (1998) A matter of life and cell death. Science 281:1317–1322. doi:
10.1126/science.281.5381.1317
Gao QB, Wu CH, Ma XQ, Lu J, He J (2008) Classification of amine type G-protein coupled receptors
with feature selection. Protein Pept Lett 15:834–842. doi:10.2174/092986608785203755
Huang J, Shi F (2005) Support vector machines for predicting apoptosis proteins types. Acta Biotheor
53:39–47. doi:10.1007/s10441-005-7002-5
Jia P, Qian Z, Feng K, Lu W, Li Y, Cai Y (2008) Prediction of membrane protein types in a hybrid space.
J Proteome Res 7:1131–1137. doi:10.1021/pr700715c
Jiang X, Wei R, Zhang T, Gu Q (2008) Using the concept of Chou’s pseudo amino acid composition to
predict apoptosis proteins subcellular location: an approach by approximate entropy. Protein Pept
Lett 15:392–396. doi:10.2174/092986608784246443
Jin YH, Niu B, Feng KY, Lu WC, Cai YD, Li GZ (2008) Predicting subcellular localization with
AdaBoost Learner. Protein Pept Lett 15:286–289. doi:10.2174/092986608783744234
Li FM, Li QZ (2008a) Using pseudo amino acid composition to predict protein subnuclear location with
improved hybrid approach. Amino Acids 34:119–125. doi:10.1007/s00726-007-0545-9
Li FM, Li QZ (2008b) Predicting protein subcellular location using Chou’s pseudo amino acid
composition and improved hybrid approach. Protein Pept Lett 15:612–616.
doi:10.2174/092986608784966930
Li Y, Wei DQ, Gao WN, Gao H, Liu BN, Huang CJ, Xu WR, Liu DK, Chen HF, Chou KC (2007)
Computational approach to drug design for oxazolidinones as antibacterial agents. Med Chem
3:576–582. doi:10.2174/157340607782360362
Lin H (2008) The modified Mahalanobis discriminant for predicting outer membrane proteins by using
Chou’s pseudo amino acid composition. J Theor Biol 252:350–356. doi:10.1016/j.jtbi.2008.02.004
Lin H, Li QZ (2007a) Using pseudo amino acid composition to predict protein structural class:
approached by incorporating 400 dipeptide components. J Comput Chem 28:1463–1466. doi:
10.1002/jcc.20554
Lin H, Li QZ (2007b) Predicting conotoxin superfamily and family by using pseudo amino acid
composition and modified Mahalanobis discriminant. Biochem Biophys Res Commun 354:548–551.
doi:10.1016/j.bbrc.2007.01.011
Lin H, Ding H, Guo FB, Zhang AY, Huang J (2008) Predicting subcellular localization of mycobacterial
proteins by using Chou’s pseudo amino acid composition. Protein Pept Lett 15:739–744. doi:
10.2174/092986608785133681
Munteanu CB, Gonzalez-Diaz H, Magalhaes AL (2008) Enzymes/non-enzymes classification model
complexity based on composition, sequence, 3D and topological indices. J Theor Biol 254:476–482.
doi:10.1016/j.jtbi.2008.06.003
Niu B, Jin YH, Feng KY, Liu L, Lu WC, Cai YD, Li GZ (2008) Predicting membrane protein types with
bagging learner. Protein Pept Lett 15:590–594. doi:10.2174/092986608784966921
Raff M (1998) Cell suicide for beginners. Nature 396:119–122. doi:10.1038/24055
Reed JC, Paternostro G (1999) Postmitochondrial regulation of apoptosis during heart failure. Proc Natl
Acad Sci USA 96:7614–7616. doi:10.1073/pnas.96.14.7614
Schulz JB, Weller M, Moskowitz MA (1999) Caspases as treatment targets in stroke and
neurodegenerative diseases. Ann Neurol 45:421–429.
doi:10.1002/1531-8249(199904)45:4<421::AID-ANA2>3.0.CO;2-Q
Shen HB, Chou KC (2007) EzyPred: a top-down approach for predicting enzyme functional classes and
subclasses. Biochem Biophys Res Commun 364:53–59. doi:10.1016/j.bbrc.2007.09.098
Shen HB, Chou KC (2008a) PseAAC: a flexible web server for generating various kinds of protein pseudo
amino acid composition. Anal Biochem 373:386–388. doi:10.1016/j.ab.2007.10.012
Shen HB, Chou KC (2008b) HIVcleave: a web-server for predicting HIV protease cleavage sites in
proteins. Anal Biochem 375:388–390. doi:10.1016/j.ab.2008.01.012
Shi F, Chen QJ, Li NN (2008) Hilbert Huang transform for predicting proteins subcellular location.
J. Biomed Sci Eng 1:59–63
Sirois S, Wei DQ, Du QS, Chou KC (2004) Virtual screening for SARS-CoV protease based on KZ7088
pharmacophore points. J Chem Inf Comput Sci 44:1111–1122. doi:10.1021/ci034270n
Steller H (1995) Mechanisms and genes of cellular suicide. Science 267:1445–1449.
doi:10.1126/science.7878463
Suzuki M, Youle RJ, Tjandra N (2000) Structure of Bax: coregulation of dimer formation and
intracellular location. Cell 103:645–654. doi:10.1016/S0092-8674(00)00167-7
Vapnik V (1998) Statistical learning theory. Wiley-Interscience, New York
Wang JF, Wei DQ, Chen C, Li Y, Chou KC (2008) Molecular modeling of two CYP2C19 SNPs and its
implications for personalized drug design. Protein Pept Lett 15:27–32.
doi:10.2174/092986608783330305
Zhang GY, Fang BS (2008) Predicting the cofactors of oxidoreductases based on amino acid composition
distribution and Chou’s amphiphilic pseudo amino acid composition. J Theor Biol 253:310–315.
doi:10.1016/j.jtbi.2008.03.015
Zhang ZH, Wang ZH, Zhang ZR, Wang YX (2006) A novel method for apoptosis protein subcellular
localization prediction combining encoding based on grouped weight and support vector machine.
FEBS Lett 580:6169–6174. doi:10.1016/j.febslet.2006.10.017
Zheng H, Wei DQ, Zhang R, Wang C, Wei H, Chou KC (2007) Screening for new agonists against
Alzheimer’s disease. Med Chem 3:488–493. doi:10.2174/157340607781745492
Zhou GP, Doctor K (2003) Subcellular location prediction of apoptosis proteins. Proteins 50:44–48. doi:
10.1002/prot.10251
Zhou XB, Chen C, Li ZC, Zou XY (2008) Improved prediction of subcellular location for apoptosis
proteins by the dual-layer support vector machine. Amino Acids 35:383–388.
doi:10.1007/s00726-007-0608-y
... Different types of PseAAC are employed to predict protein structural class [32], bacterial secreted proteins [33], cyclins [34], risk type of human papillomaviruses [35], enzyme subfamily classes [24,36,37], G-protein coupled receptor classes [38][39][40], cell wall lytic enzymes [41], subcellular localization of apoptosis proteins [42,43], lipase types [44], subcellular localization of mycobacterial proteins [45], cofactors of oxidoreductases [46], DNAbinding proteins [47], quaternary structural attributes [48], proteases and their types [49] GABAA receptors [50] and Glutathione S-transferases [51][52][53]. ...
Article
Full-text available
Phospholipases, as important lipolytic enzymes, have diverse industrial applications. Regarding the stability of extremophilic archaea's proteins in harsh conditions, analyses of unusual features of their proteins are significantly important for their utilization. This research was accomplished to in silico study of archaeal phospholipases' properties and to develop a pioneering method for distinguishing these enzymes from other archaeal enzymes via machine learning algorithms and Chou's pseudo-amino acid composition concept. The non-redundant sequences of archaeal phospholipases were collected. BioSeq-Analysis sever was used with Support Vector Machine (SVM), Random Forests (RF), Covariance Discrimination (CD), and Optimized Evidence-Theoretic K-nearest Neighbor (OET-KNN) as powerful machine learnings algorithms. Also, different Chou's pseudo-amino acid composition modes were performed and then, 5-fold cross-validation was applied to the sequences. Based on our results, the OET-KNN predictor, with 96% accuracy, yields the best performance in SC-PseAAC mode by 5-fold cross-validation. This predictor also achieved very high values of specificity (95%), sensitivity (96%), Matthews's correlation coefficient (0.92), and accuracy (96%). The present investigation yielded a robust anticipatory model for the archaeal phospholipase prediction utilizing the tenets PseAAC and OET-KNN machine learning algorithm.
... It provides numerical vectors of 20 components, with each reflecting the occurrence frequency for the 20 amino acids (sequence order information); This method was developed by [8,9]to formulate an amino acid sequence of arbitrary length, such as a digital vector. A peptide sequence with length L amino acid residues(Position-Specific-Scoring-Matrix) Sequence composition. ...
Preprint
Full-text available
Background An emerging type of cancer treatment, known as cell immunotherapy, is gaining popularity over chemotherapy or other radiation therapy that causes mass destruction to our body. One favourable approach in cell immunotherapy is the use of neoantigens as targets that help our body immune system identify the cancer cells from healthy cells. Neoantigens, which are non-autologous proteins with individual specificity, are generated by non-synonymous mutations in the tumor cell genome. Owing to its strong immunogenicity and lack of expression in normal tissues, it is now an important target for tumor immunotherapy. Neoantigens are some form of special protein fragments excreted as a by-product on the surface of cancer cells during the DNA mutation at the tumour. In cancer immunotherapies, certain neoantigens which exist only on cancer cells elicit our white blood cells (body’s defender, anti-cancer T-cell) responses that fight the cancer cells while leaving healthy cells alone. Personalized cancer vaccines therefore can be designed de novo for each individual patient, when the specific neoantigens are found to be relevant to his/her tumour. The vaccine which is usually coded in synthetic long peptides, RNA or DNA representing the neoantigens trigger an immune response in the body to destroy the cancer cells (tumour). The specific neoantigens can be found by a complex process of biopsy and genome sequencing. Alternatively, modern technologies nowadays tap on AI to predict the right neoantigen candidates using algorithms. However, determining the binding and non-binding of neoantigens on T-cell receptors (TCR) is a challenging computational task due to its very large search space. Objective To enhance the efficiency and accuracy of traditional deep learning tools, for serving the same purpose of finding potential responsiveness to immunotherapy through correctly predicted neoantigens. 
It is known that deep learning can explore which novel neoantigens bind to T-cell receptors and which ones do not. The exploration may be technically expensive and time-consuming since deep learning is an inherently computational method. One can use putative neoantigen peptide sequences to guide personalized cancer vaccine design. Methods These models all proceed through complex feature engineering, including feature extraction, dimension reduction and so on. In this study, we derived 4 features to facilitate prediction and classification of 4 HLA-peptide binding, namely AAC and DC from the global sequence, and LAAC and LDC from the local sequence information. Based on the patterns of sequence formation, a nested structure of bidirectional long short-term memory neural network called the local information module is used to extract context-based features around every residue. Another BiLSTM network layer called the global information module is introduced above the local information module layer to integrate context-based features of all residues in the same HLA-peptide binding chain, thereby involving inter-residue relationships in the training process. Results Finally, a more effective model is obtained by fusing the above two modules and the 4 feature matrices; the method performs significantly better than previous prediction schemes, whose overall r-square increased to 0.0125 and 0.1064 on train and increased to 0.0782 and 0.2926 on test datasets. The RMSE for our proposed models trained decreased to approximately 0.0745 and 1.1034, respectively, and decreased to 0.6712 and 1.6506 on test dataset. Conclusion Our work has been actively refining a machine-learning model to improve neoantigen identification and predictions with the determinants for neoantigen identification.
The final experimental results show that our method is more effective than existing methods for predicting peptide types, which can help laboratory researchers to identify the type of novel HLA-peptide binding.
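The AAC and DC features named in the abstract above can be illustrated with a minimal sketch. It assumes the usual definitions (AAC as the 20 residue frequencies, DC as the 400 adjacent-pair frequencies) and is not the authors' implementation; the function names `aac` and `dc` and the choice to skip non-standard residues are assumptions made here for illustration.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (alphabetical order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides, in the same alphabetical order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aac(sequence):
    """Amino acid composition: 20 frequencies that sum to 1.

    Assumption: residues outside the 20 standard letters are ignored.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dc(sequence):
    """Dipeptide composition: 400 frequencies of adjacent residue pairs.

    Assumption: a pair is counted only if both residues are standard.
    """
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    return [pairs.count(d) / len(pairs) for d in DIPEPTIDES]


if __name__ == "__main__":
    peptide = "AVLQSGFR"  # octapeptide used purely as sample input
    print(len(aac(peptide)), len(dc(peptide)))
```

Both vectors have a fixed length regardless of sequence length, which is what lets sequences of arbitrary length be fed to a classifier.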
... Computational methods can be good alternative to supplement the biochemical experiments. However, this has been done mostly for protein subcellular localization prediction [27][28][29][30]. In addition, few attempts have also been made towards RNA molecules. ...
Article
MicroRNAs (miRNAs), one kind of non-coding RNA, play a vital role in regulating several physiological and developmental processes. Subcellular localization of miRNAs and their abundance in the native cell are central for maintaining physiological homeostasis. Besides, the RNA silencing activity of miRNAs is also influenced by their localization and stability. Thus, development of a computational method for subcellular localization prediction of miRNAs is desired. In this work, we have proposed a computational method for predicting subcellular localizations of miRNAs based on principal component scores of thermodynamic, structural properties and pseudo compositions of di-nucleotides. Prediction accuracy was analyzed following fivefold cross validation, where ~63–71% of AUC-ROC and ~69–76% of AUC-PR were observed. When evaluated with an independent test set, > 50% of localizations were found to be correctly predicted. Besides, the developed computational model achieved higher accuracy than the existing methods. A user-friendly prediction server “miRNALoc” is freely accessible at https://cabgrid.res.in:8080/mirnaloc/, by which the user can predict localizations of miRNAs.
Preprint
Predicting the subcellular localization of proteins is an important and challenging problem. Traditional experimental approaches are often expensive and time-consuming. Consequently, a growing number of research efforts employ a series of machine learning approaches to predict the subcellular location of proteins. There are two main challenges among the state-of-the-art prediction methods. First, most of the existing techniques are designed to deal with multi-class rather than multi-label classification, which ignores connections between multiple labels. In reality, multiple locations of particular proteins implies that there are vital and unique biological significances that deserve special focus and cannot be ignored. Second, techniques for handling imbalanced data in multi-label classification problems are necessary, but never employed. For solving these two issues, we have developed an ensemble multi-label classifier called HPSLPred, which can be applied for multi-label classification with an imbalanced protein source. For convenience, a user-friendly webserver has been established at http://server.malab.cn/HPSLPred.
Article
Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death, and their function is related to their types. The apoptosis proteins are categorized into the following four types: (1) Cytoplasmic protein; (2) Plasma membrane-bound protein; (3) Mitochondrial inner and outer proteins; (4) Other proteins. A novel method, the Hilbert-Huang transform, is applied for predicting the type of a given apoptosis protein with support vector machine. High success rates were obtained by the re-substitution test (98/98 = 100%) and the jackknife test (91/98 = 92.9%).
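The jackknife test reported above is a leave-one-out procedure: each protein is held out in turn, the classifier is trained on the rest, and the held-out protein is predicted. The sketch below shows that loop under stated assumptions. It swaps in a 1-nearest-neighbour classifier instead of the support vector machine used in the abstract, and the feature vectors and labels in `toy_samples` are invented for illustration only.

```python
def nearest_neighbor_label(train, query):
    """Return the label of the training sample closest to `query`.

    Stand-in classifier (squared Euclidean distance); the abstract uses an SVM.
    """
    closest = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    return closest[1]


def jackknife_accuracy(samples, classify):
    """Leave-one-out accuracy: hold out each sample once, predict it from the rest."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # every sample except the i-th
        if classify(rest, features) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical two-dimensional feature vectors with made-up location labels.
toy_samples = [
    ((0.0, 0.1), "cytoplasmic"),
    ((0.1, 0.0), "cytoplasmic"),
    ((0.9, 1.0), "mitochondrial"),
    ((1.0, 0.9), "mitochondrial"),
    ((0.8, 0.8), "cytoplasmic"),  # sits among the mitochondrial points
]

if __name__ == "__main__":
    print(jackknife_accuracy(toy_samples, nearest_neighbor_label))
    # The rate quoted in the abstract is the same ratio of correct to total:
    print(round(91 / 98 * 100, 1))
```

Because every sample is tested exactly once and never appears in its own training set, the jackknife rate is deterministic for a given dataset, unlike a random k-fold split.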
Article
Enzymes possess extremely high catalytic rates, but the catalytic reactions occur only when substrate molecules contact with the active site, a quite small part in comparison with the corresponding major protein. Therefore, it is interesting from different perspectives to discuss the functions of the major protein outside the active site. In this paper, from the viewpoint of diffusion-controlled reactions, the role of the major protein is discussed, and on such a basis, it is pointed out in which case the major protein will act like a "hard wall", hindering some part of the substrate molecules from diffusing into the active site, and in which case the major protein will behave as a "promoter", accelerating the flow of substrate molecules around into the active site so as to increase the rate of diffusion-controlled reactions significantly. Calculated results show that these two extremely opposite cases will markedly depend on the size of van der Waals binding energy between substrate molecules and the enzyme protein outside active site.
Article
In order to stimulate the development of drugs against severe acute respiratory syndrome (SARS), based on the atomic coordinates of the SARS coronavirus main proteinase determined recently [Science 13 (May) (2003) (online)], studies of docking KZ7088 (a derivative of AG7088) and the AVLQSGFR octapeptide to the enzyme were conducted. It has been observed that both the above compounds interact with the active site of the SARS enzyme through six hydrogen bonds. Also, a clear definition of the binding pocket for KZ7088 has been presented. These findings may provide a solid basis for subsite analysis and mutagenesis relative to rational design of highly selective inhibitors for therapeutic application. Meanwhile, the idea of how to develop inhibitors of the SARS enzyme based on the knowledge of its own peptide substrates (the so-called “distorted key” approach) was also briefly elucidated.
Article
Conformational energy calculations have been carried out in order to determine energetically favorable ways of packing two α-helices. A generalized mathematical formulation of the selection of helical coordinate systems and of coordinate transformations used in packing calculations has been developed. It is suitable for the description of any interacting helical assembly in proteins and any other macromolecules. Comparison of the two pairs of helices indicates the effect of introducing a bulky side chain.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Article
It is crucial to develop powerful tools to predict apoptosis protein locations because of the rapidly increasing gap between the number of known structural proteins and the number of known sequences in the protein databank. In this study, based on the concept of pseudo amino acid (PseAA) composition originally introduced by Chou, a novel approximate entropy (ApEn) based PseAA composition is proposed to represent apoptosis protein sequences. An ensemble classifier, whose basic classifier is the FKNN (fuzzy K-nearest neighbor) classifier, is introduced as the prediction engine. Each basic classifier is trained in different dimensions of the PseAA composition of protein sequences. Because the weight factors are crucial in the PseAA composition, the immune genetic algorithm (IGA) is used to search for the optimal weight factors in generating it. The results obtained by the jackknife test are quite encouraging, indicating that the proposed method might become a potentially useful tool for protein function prediction, or at least can play a complementary role to the existing methods in the relevant areas.