Classification of Protein Localisation Patterns
via Supervised Neural Network Learning
Aristoklis D. Anastasiadis, George D. Magoulas, and Xiaohui Liu
Department of Information Systems and Computing, Brunel University,
UB8 3PH, Uxbridge, United Kingdom
{Aristoklis.Anastasiadis,George.Magoulas,Xiaohui.Liu}@brunel.ac.uk
http://www.brunel.ac.uk/~csstxhl/IDA/
Abstract. There are many existing classification methods from diverse fields including statistics, machine learning and pattern recognition. New methods are constantly being proposed that claim superior performance over classical methods, and it has become increasingly difficult for practitioners to choose the right kind of method for their applications. This paper is therefore not about suggesting another classification algorithm, but rather about conveying the message that some existing algorithms, if properly used, can lead to better solutions to some of the challenging real-world problems. The paper looks at important problems in bioinformatics for which the best solutions were known and shows that improvements over those solutions can be achieved with a form of feed-forward neural networks by applying more advanced schemes for supervised network learning. The results are evaluated against those of other commonly used classifiers, such as the K nearest neighbours, using cross validation, and their statistical significance is assessed using the nonparametric Wilcoxon test.
1 Introduction
An area of protein characterization that is considered particularly useful in the
post-genomics era is the study of protein localization. In order to function prop-
erly, proteins must be transported to various localization sites within a particular
cell. Description of protein localization provides information about each protein
that is complementary to the protein sequence and structure data. Automated
analysis of protein localization may be more complex than the automated anal-
ysis of DNA sequences; nevertheless, the benefits to be derived are of the same im-
portance [1]. The ability to identify known proteins with similar sequence and
similar localization is becoming increasingly important, as we need structural,
functional and localization information to accompany the raw sequences. Among
the various applications developed so far, the classification of protein localiza-
tion patterns into known categories has attracted significant interest. The first
approach for predicting the localization sites of proteins from their amino acid
sequences was an expert system developed by Nakai and Kanehisa [8,9]. Later,
expert identified features were combined with a probabilistic model, which could
learn its parameters from a set of training data [4].
Better prediction accuracy has been achieved by using standard classification
algorithms such as K nearest neighbours (KNN), the binary decision tree and a
naïve Bayesian classifier. The KNN achieved the best classification accuracy compared to these methods on two drastically imbalanced datasets, namely E.coli and Yeast [5]. E.coli proteins were classified into 8 classes with an average accu-
racy of 86%, while Yeast proteins were classified into 10 classes with an average
accuracy of 60%. Recently, genetic algorithms, growing cell structures, expand-
ing range rules and feed-forward neural networks were comparatively evaluated
for this problem, but no improvements over the KNN algorithm were reported
[2].
This paper is not about proposing another classification algorithm, but rather
about conveying the message that some existing algorithms, if properly used, can
lead to better solutions to this challenging real-world problem. The paper ad-
vocates the neural network-based approach for classifying localization sites of
proteins. We investigate the use of several supervised learning schemes [11] to
improve the classification success of neural networks. The paper is organized as
follows. First, we briefly describe the methods that we use to predict the localisa-
tion sites. Next, we introduce the datasets and the evaluation methodology that
are used in the paper, and present experimental results of the neural network approach compared with the best existing method.
2 Classification Methods
2.1 The K Nearest Neighbours Algorithm
Let X = {x^i = (x^i_1, . . . , x^i_p), i = 1, . . . , N} be a collection of p-dimensional training samples and C = {C_1, . . . , C_M} be a set of M classes. Each sample x^i will first be assumed to possess a class label L^i ∈ {1, . . . , M} indicating with certainty its membership to a class in C. Assume also that x^S is an incoming sample to be classified. Classifying x^S corresponds to assigning it to one of the classes in C, i.e. deciding among a set of M hypotheses: x^S ∈ C_q, q = 1, . . . , M. Let Φ^S be the set of the K nearest neighbours of x^S in X. For any x^i ∈ Φ^S, the knowledge that L^i = q can be regarded as evidence that increases our belief that x^S also belongs to C_q. However, this piece of evidence does not by itself provide 100% certainty. The K nearest neighbours classifier (KNN), as suggested by Duda and Hart [3], stores the training data, i.e. the pair (X, L). The examples are classified by choosing the majority class among the K closest examples in the training data, according to the Euclidean distance measure. In our experiments we set K = 7 for the E.coli proteins and K = 21 for the Yeast dataset; these values have been found empirically to give the best performance [5].
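For illustration, a minimal Python sketch of the classification rule just described is given below; it is not the authors' implementation, and the array names X_train, y_train and x_s are assumptions.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_s, k=7):
    """Assign x_s to the majority class among its k nearest training samples."""
    # Euclidean distances from the incoming sample to every training sample
    distances = np.linalg.norm(X_train - x_s, axis=1)
    # Indices of the k closest samples (the set Phi_S)
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of the nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# k = 7 was used for the E.coli proteins and k = 21 for the Yeast dataset.
```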
2.2 Multilayer Feed-forward Neural Networks and Supervised
Learning
In a multilayer Feed-forward Neural Network (FNN) nodes are organised in layers
and connections are from input nodes to hidden nodes and from hidden nodes to
output nodes. In our experiments we have used FNNs with sigmoid hidden and
output nodes. The notation I − H − O is used to denote a network architecture
with I inputs, H hidden layer nodes and O output nodes. The most popular training algorithm of this category is batch Back-Propagation, a first-order method that minimizes the error function using steepest descent.
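As a rough illustration of the I − H − O notation, the sketch below computes the forward pass of such a network with sigmoid hidden and output nodes; the weight shapes and names are our own assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of an I-H-O network: x has length I, W1 is (H, I), W2 is (O, H)."""
    hidden = sigmoid(W1 @ x + b1)      # activations of the H hidden nodes
    return sigmoid(W2 @ hidden + b2)   # activations of the O output nodes
```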
Adaptive gradient-based algorithms with individual step-sizes try to over-
come the inherent difficulty of choosing the appropriate learning rates. This is
done by controlling the weight update for every single connection during the
learning process in order to minimize oscillations and maximize the length of
the step-size. One of the best of these techniques, in terms of convergence speed,
accuracy and robustness with respect to its parameters, is the Rprop algorithm
[11]. The basic principle of Rprop is to eliminate the harmful influence of the
size of the partial derivative on the weight step, considering only the sign of the
derivative to indicate the direction of the weight update. The Rprop algorithm
requires setting the following parameters: (i) the learning rate increase factor η+ = 1.2; (ii) the learning rate decrease factor η− = 0.5; (iii) the initial update-value ∆_0 = 0.1; and (iv) the maximum weight step, which is used to prevent the weights from becoming too large, ∆_max = 50 [11].
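The sketch below illustrates the Rprop update rule with the parameter values quoted above; it is a simplified variant without the weight-backtracking step of the original algorithm [11], and the function and variable names are our own.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """One Rprop iteration for a weight array w; step holds the per-weight update-values."""
    sign_change = grad * prev_grad
    # Same gradient sign: grow the individual step-size; sign flip: shrink it
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # After a sign flip, skip the update for that weight on this iteration
    grad = np.where(sign_change < 0, 0.0, grad)
    # Only the sign of the derivative determines the direction of the update
    w = w - np.sign(grad) * step
    return w, grad, step   # grad becomes prev_grad for the next iteration

# step would be initialised to 0.1 (the initial update-value Delta_0) for every weight.
```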
2.3 Ensemble-Based Methods
Methods for creating ensembles focus on creating classifiers that disagree on
their decisions. In general terms, these methods alter the training process in
an attempt to produce classifiers that will generate different classifications. In
the neural network context, these methods include techniques for training with
different network topologies, different initial weights, different learning parame-
ters, and learning different portions of the training set (see [10] for reviews and
comparisons).
We have investigated the use of two different methods for creating ensem-
bles. The first one consists of creating a simple neural network ensemble with
five networks where each network uses the full training set and differs only in
its random initial weight settings. Similar approaches often produce results as
good as Bagging [10]. The second approach is based on the notion of diversity,
where networks belonging to the ensemble can be said to be diverse with respect
to a test set if they make different generalisation errors on that test set [12].
Different patterns of generalisation can be produced when networks are trained on different training sets, from different initial conditions, with different numbers of hidden nodes, or with different algorithms [12]. Our implementation used four networks and belongs to so-called level 3 diversity [12], which allows the outputs of the networks to be weighted in such a way that either the correct answer is obtained, or at least the correct output is obtained often enough that generalisation is improved (see [12] for details).
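As a sketch of how the outputs of the member networks might be combined (the actual weighting scheme of [12] is more elaborate than a plain or weighted average), consider the following; the predict method and the weights are hypothetical.

```python
import numpy as np

def ensemble_predict(networks, x, weights=None):
    """Combine the output vectors of several trained networks for one input pattern."""
    outputs = np.array([net.predict(x) for net in networks])     # shape: (n_networks, n_classes)
    if weights is None:
        combined = outputs.mean(axis=0)                          # simple ensemble: plain average
    else:
        combined = np.average(outputs, axis=0, weights=weights)  # diversity-weighted combination
    return int(np.argmax(combined))                              # index of the winning class
```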
3 Experimental Study
3.1 Description of Datasets
The datasets used have been submitted to the UCI Machine Learning Data
Repository by Murphy and Aha [7] and are described in [4,8,9]. They include an
E.coli dataset with 336 proteins sequences labelled according to 8 localization
sites and a Yeast data set with 1484 sequences labelled according to 10 sites.
In particular, protein patterns in the E.coli data set are organized as follows:
143 patterns of cytoplasm (cp), 77 of inner membrane without signal sequence
(im), 52 of periplasm (pp), 35 of inner membrane with uncleavable signal se-
quence (imU), 20 of outer membrane without lipoprotein (om), 5 of outer mem-
brane with lipoprotein (omL), 2 of inner membrane with lipoprotein (imL) and
2 patterns of inner membrane with cleavable signal sequence (imS).
Yeast proteins are organized as follows: there are 463 patterns of cytoplasm
(CYT), 429 of nucleus (NUC), 244 of mitochondria (MIT), 163 of membrane pro-
tein without N-terminal signal (ME3), 51 of membrane protein with uncleavable
signal (ME2), 44 of membrane protein with cleavable signal (ME1), 35 of extra-
cellular (EXC), 30 of vacuole (VAC), 20 of peroxisome (POX) and 5 patterns of
endoplasmic reticulum (ERL).
3.2 Evaluation Methods
Cross Validation In k-fold cross-validation a dataset D is randomly split into k mutually exclusive subsets D_1, . . . , D_k of approximately equal size. The classifier is trained and tested k times; each time t ∈ {1, 2, . . . , k} it is trained on all D_i, i = 1, . . . , k, with i ≠ t, and tested on D_t. The cross-validation estimate of accuracy is the overall number of correct classifications divided by the number of instances in the dataset.
We used cross-validation to estimate the accuracy of the classification meth-
ods. Both data sets were randomly partitioned into equally sized subsets. The
proportion of the classes is kept approximately equal in each partition (stratification), as this procedure provides more accurate results than plain cross-validation does [6]. In our experiments
we have partitioned the data into 4 equally sized subsets for the E.coli dataset
and into 10 equally sized subsets for the Yeast dataset, as proposed in previous
works [4,5].
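A minimal sketch of this evaluation procedure is given below, with scikit-learn's StratifiedKFold standing in for the class-proportion-preserving partitioning described above; make_classifier, X and y are assumed inputs.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_accuracy(make_classifier, X, y, n_splits=4):
    """k-fold cross-validation accuracy with class proportions preserved in each fold."""
    correct = 0
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        clf = make_classifier()
        clf.fit(X[train_idx], y[train_idx])
        correct += np.sum(clf.predict(X[test_idx]) == y[test_idx])
    # Overall number of correct classifications divided by the number of instances
    return correct / len(y)

# n_splits = 4 was used for the E.coli dataset and n_splits = 10 for the Yeast dataset.
```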
The Wilcoxon Test of Statistical Significance The Wilcoxon signed rank test is a nonparametric alternative to the paired t-test. The test assumes that there is information in the magnitudes of the differences between paired observations, as well as in their signs. First, the differences between the paired observations are calculated and ranked from smallest to largest by absolute value. The ranks associated with positive and negative differences are then summed, giving the T+ and T− statistics respectively. Finally, the probability value associated with this statistic is found from the appropriate table. In our experiments, we analyzed statistical significance using the Wilcoxon signed rank test as described in [13]. All statements refer to a significance level of 10%, which corresponds to T+ < 8 for the E.coli dataset and T+ < 14 for the Yeast dataset.
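For illustration, the paired per-partition accuracies can be fed to scipy's implementation of the signed rank test, which stands in for the critical-value tables of [13]; the values below are the Yeast cross-validation accuracies reported later in Table 6.

```python
from scipy.stats import wilcoxon

acc_knn = [55.8, 59.2, 61.0, 65.8, 48.6, 62.3, 68.5, 58.9, 56.9, 58.2]  # KNN, per partition
acc_fnn = [68.5, 69.1, 69.2, 69.0, 69.6, 69.5, 69.8, 69.8, 69.7, 70.5]  # FNN, per partition

result = wilcoxon(acc_fnn, acc_knn)      # paired, two-sided by default
print(result.statistic, result.pvalue)   # statistic 0: the FNN is better on every partition
```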
4 Results
4.1 Classifying E.coli Patterns Using a Feed-forward Neural
Network
We conducted a set of preliminary experiments to find the most suitable FNN
architecture in terms of training speed. We trained several networks with one
and two hidden layers using the Rprop algorithm, trying various numbers of hidden nodes, i.e. 8, 12, 14, 16, 24, 32, 64 and 120. Each FNN architecture was trained 10 times with different initial weights. The best architecture found was a 7-16-8 FNN, which was used in our further experiments. 100 independent trials were performed with two different termination criteria, E_MSE < 0.05 and E_MSE < 0.015, in order to explore the effect of the network error on the accuracy of the classifications. We have conducted two experiments as recommended in previous works [2,4,5,8,9]. In the first experiment the neural networks are tested on the entire dataset. Table 1 shows the classification success achieved on each class.
Table 1. The accuracy of classification of E.coli proteins for each class.
No of Patterns Class KNN (%) FNN E_MSE < 0.05 (%) FNN E_MSE < 0.015 (%)
77 im 75.3 82.5 89
143 cp 98.6 98.1 99
2 imL 0.0 85.0 91.8
5 omL 80.0 100.0 100
35 imU 65.7 68.0 88.3
2 imS 0.0 6.5 49
20 om 90.0 86.6 93.2
52 pp 90.3 88.7 93.6
Mean 62.5 76.9 88.0
The results of the FNN show the mean value over 100 trials. As we can observe, the FNNs outperform the KNN in every class. It is very important to highlight the neural network classification success on the inner membrane with cleavable signal sequence (imS) and inner membrane with lipoprotein (imL) classes. In the first case, FNNs trained with E_MSE < 0.05 exhibit a 6% success rate while the other methods have 0%. In the second case, the FNNs have 85% success while the other two methods have only 50%. Results for FNNs become even better when trained with E_MSE < 0.015. This obviously increases the average number of epochs required to converge (approximately 2000 epochs are needed) but leads to a significant improvement: the average of overall successful predictions out of 100 runs was 93% with a standard deviation of 0.33. All FNN results satisfy the conditions of the Wilcoxon test, showing that the improvements achieved by the FNNs are statistically significant (e.g. comparing FNN performance when trained with E_MSE < 0.05 against the KNN gives T+ = 7).
In order to identify common misclassifications, we calculated the confusion matrix for the FNN that exhibited the best training time for E_MSE < 0.015. These results are shown in Table 2. The neural network achieved a high percentage of correct classifications compared to other methods (cf. Table 5 in [5]). These results also show that the fast convergence achieved by the FNN by no means affects its classification success.
Table 2. Confusion matrix for E.coli proteins with FNN.
No of Patterns Class cp imL imS imU im omL om pp
143 cp 142 0 0 0 0 0 0 1
2 imL 0 2 0 0 0 0 0 0
2 imS 0 0 1 1 0 0 0 0
35 imU 0 0 0 31 4 0 0 0
77 im 2 0 0 6 69 0 0 0
5 omL 0 0 0 0 0 5 0 0
20 om 0 0 0 0 0 0 19 0
52 pp 3 0 0 0 0 0 0 49
In the second experiment, we performed a leave-one-out cross validation
test by randomly partitioning the dataset into 4 equally sized subsets, as sug-
gested in [4,5]. Each time three subsets are used for training while the remaining
one is used for testing. Table 3 gives the best results for each method using the leave-one-out cross-validation, showing that the FNN trained with E_MSE < 0.015 exhibits better performance. Lastly, it is important to mention that the performance of the KNN algorithm is significantly improved when leave-one-out cross-validation is used, but it still falls short of the best FNN.
4.2 Classifying Yeast Patterns Using a Feed-forward Neural
Network
We conducted a set of preliminary experiments as with the E.coli dataset in
order to find the most suitable architecture. An 8-16-10 FNN architecture ex-
hibited the best performance. 100 FNNs were trained with the Rprop algorithm
using different initial weights to achieve E_MSE < 0.045. As previously, we conducted two experiments following the guidelines of [4,5].
Table 3. Best performance for each method with leave-one-out cross-validation for E.coli proteins.
Cross Validation Partition KNN (%) FNN E_MSE < 0.015 (%)
0 89.3 91.7
1 95.2 88.1
2 88.1 84.5
3 69.1 88.1
Mean 82.4 88.1
The first experiment concerns testing using the whole dataset. The FNN significantly outperforms the other methods. The average neural network performance out of 100 runs is 67%; the worst-case performance was 64% (which is still an improvement over the other methods) and the best was 69%. 10000 epochs were required on average. The result of the KNN is 59.5%.
In an attempt to explore the influence of the error goal on the success of the classifications, we slightly increased the error goal to a value of E_MSE < 0.05. This results in a significant reduction in the average number of epochs; 3500 epochs are needed in order to reach convergence. Table 4 shows the classification success achieved on each class. The results of the FNN represent the average of 100 trials.
The neural network outperforms the other methods in almost every class. It is
very important to highlight the neural network classification success in the POX
and ERL classes. With regard to the results of the Wilcoxon test of statistical significance, the improvements achieved by the neural networks are statistically significant, giving T+ = 7 when compared against the KNN. The overall average classification of the FNNs is 64%, which shows that a small variation in the value of the error goal might affect the classification success. This might provide an explanation for the unsatisfactory performance that FNNs exhibited in the experiments reported in [2].
To identify the misclassifications in the Yeast dataset we have created the confusion matrix for the FNN that exhibited the fastest convergence to E_MSE < 0.045. The results are shown in Table 5. It is important to highlight the significant improvement in classifying the localisation sites of each class (cf. Table 6 in [5]).
The second experiment involves the use of the leave-one-out cross-validation method with 10 equally sized partitions. The results are shown in Table 6. The KNN algorithm improves its generalization success. The performance of the FNNs is also improved, achieving an average classification success of 70% (E_MSE < 0.045 was used and an average of 3500 epochs). The Wilcoxon test gives T+ = 0, showing that the improved performance achieved by the FNNs is statistically significant in all partitions when compared against the results of the KNN algorithm.
Table 4. The accuracy of classification of Yeast proteins for each class.
No of Patterns Class KNN (%) FNN E_MSE < 0.05 (%)
463 cyt 70.7 66.7
5 erl 0.0 99.6
35 exc 62.9 62.7
44 me1 75.0 82.9
51 me2 21.6 47.8
163 me3 74.9 85.6
244 mit 57.8 61.3
429 nuc 50.7 57.7
20 pox 55.0 54.6
30 vac 0.0 4.1
Mean 46.8 62.3
Table 5. The confusion matrix of Yeast proteins for each class using a neural network.
No of Patterns Class cyt erl exc me1 me2 me3 mit nuc pox vac
463 cyt 335 0 0 1 0 7 24 96 0 0
5 erl 0 5 0 0 0 0 0 0 0 0
35 exc 4 0 23 2 2 0 2 2 0 0
44 me1 2 0 1 0 0 1 1 0 0 0
51 me2 5 0 2 3 28 4 7 2 0 0
163 me3 8 0 0 0 0 145 1 8 0 1
244 mit 46 0 0 2 4 10 155 27 0 0
429 nuc 121 0 0 0 48 13 24 223 0 0
20 pox 6 0 0 0 0 0 2 1 11 0
30 vac 11 0 2 0 1 6 2 7 0 1
4.3 Classifying Protein Patterns Using Ensemble-based Techniques
In this section we report results using the Simple Network Ensemble (SNE)
method and the Diverse Neural Networks (DNN) method. Table 7 shows the
results of the two ensemble-based methods on the E.coli dataset. The overall
classification success of the SNE method was 93.5%, while the overall classifica-
tion success of the DNN method was 96.8%.
We decided to concentrate on the DNN method and created an ensemble
for the Yeast dataset. The results are shown in Table 8. The DNN ensemble
significantly outperforms all other methods tested so far [2,4,5].
Table 6. Best performance for each method using leave-one-out cross-validation for
Yeast proteins.
Cross Validation Partition KNN (%) FNN E_MSE < 0.045 (%)
0 55.8 68.5
1 59.2 69.1
2 61.0 69.2
3 65.8 69.0
4 48.6 69.6
5 62.3 69.5
6 68.5 69.8
7 58.9 69.8
8 56.9 69.7
9 58.2 70.5
Mean 59.5 69.5
Std. Dev 5.49 0.56
Table 7. Accuracy of classification for E.coli proteins using ensemble-based techniques.
No of Patterns Class SNE E_MSE < 0.015 (%) DNN E_MSE < 0.015 (%)
77 im 87.0 92.22
143 cp 98.0 100
2 imL 100 100
5 omL 100 100
35 imU 77.2 94.3
2 imS 50.0 50
20 om 80.0 100
52 pp 91.4 96.15
Mean 85.45 91.7
Table 8. Accuracy of classification for Yeast proteins using diverse neural networks.
No of Patterns Class DNN E_MSE < 0.045 (%)
463 cyt 80.5
5 erl 100
35 exc 74.3
44 me1 100
51 me2 70.6
163 me3 92.65
244 mit 71.4
429 nuc 70.1
20 pox 55.0
30 vac 20.0
Mean 73.5
5 Conclusions
In this paper we have explored the use of a neural network approach for the prediction of localisation sites of proteins in the E.coli and Yeast datasets. Numerical results using supervised learning schemes, with and without the use of cross validation, are better than the best previous attempts. This is mainly because of the use of more advanced training schemes. We also investigated the use of network ensembles exhibiting level 3 diversity. Our future work focuses on exploring the use of network ensembles with higher degrees of diversity, taking into account the drastically imbalanced nature of these datasets.
References
1. Boland, M.V. and Murphy, R.F.: After sequencing: quantitative analysis of protein localization. IEEE Engineering in Medicine and Biology (1999) 115-119
2. Cairns, P., Huyck, C., Mitchell, I., Wu, W.: A Comparison of Categorisation Algorithms for Predicting the Cellular Localization Sites of Proteins. IEEE Engineering in Medicine and Biology (2001) 296-300
3. Duda, R.O. and Hart, P.E.: Pattern Classification and Scene Analysis. John Wiley and Sons (1973)
4. Horton, P. and Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (1996) 109-115
5. Horton, P. and Nakai, K.: Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier. Proceedings of Intelligent Systems in Molecular Biology (1997) 368-383
6. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (1995) 223-228
7. Murphy, P.M. and Aha, D.W.: UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn (1996)
8. Nakai, K. and Kanehisa, M.: Expert system for predicting protein localization sites in gram-negative bacteria. PROTEINS 11 (1991) 95-110
9. Nakai, K. and Kanehisa, M.: A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14 (1992) 897-911
10. Opitz, D. and Maclin, R.: Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research 11 (1999) 169-198
11. Riedmiller, M. and Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. Proceedings of the International Conference on Neural Networks (1993) 586-591
12. Sharkey, A.J.C. and Sharkey, N.E.: Combining diverse neural nets. The Knowledge Engineering Review 12 (1997) 231-247
13. Snedecor, G. and Cochran, W.: Statistical Methods. 8th edn. Iowa State University Press (1989)