Classification of Protein Localisation Patterns
via Supervised Neural Network Learning
Aristoklis D. Anastasiadis, George D. Magoulas, and Xiaohui Liu
Department of Information Systems and Computing, Brunel University,
UB8 3PH, Uxbridge, United Kingdom
{Aristoklis.Anastasiadis,George.Magoulas,Xiaohui.Liu}@brunel.ac.uk
http://www.brunel.ac.uk/~csstxhl/IDA/
Abstract. There are many existing classification methods from diverse fields, including statistics, machine learning and pattern recognition. New methods that claim superior performance over classical methods are constantly being invented, and it has become increasingly difficult for practitioners to choose the right kind of method for their applications. This paper is therefore not about suggesting another classification algorithm, but rather about conveying the message that some existing algorithms, if properly used, can lead to better solutions to some challenging real-world problems. It looks at important problems in bioinformatics for which the best solutions were known, and shows that improvements over those solutions can be achieved with a form of feed-forward neural network by applying more advanced schemes for supervised network learning. The results are evaluated against those from other commonly used classifiers, such as the K nearest neighbours classifier, using cross-validation, and their statistical significance is assessed using the nonparametric Wilcoxon test.
1 Introduction
An area of protein characterization that is considered particularly useful in the
post-genomics era is the study of protein localization. In order to function prop-
erly, proteins must be transported to various localization sites within a particular
cell. Description of protein localization provides information about each protein
that is complementary to the protein sequence and structure data. Automated
analysis of protein localization may be more complex than the automated anal-
ysis of DNA sequences; nevertheless, the benefits to be derived are of the same importance [1]. The ability to identify known proteins with similar sequence and
similar localization is becoming increasingly important, as we need structural,
functional and localization information to accompany the raw sequences. Among
the various applications developed so far, the classification of protein localiza-
tion patterns into known categories has attracted significant interest. The first
approach for predicting the localization sites of proteins from their amino acid
sequences was an expert system developed by Nakai and Kanehisa [8,9]. Later,
expert identified features were combined with a probabilistic model, which could
learn its parameters from a set of training data [4].
Better prediction accuracy has been achieved by using standard classification
algorithms such as K nearest neighbours (KNN), the binary decision tree and a
naïve Bayesian classifier. The KNN achieved the best classification accuracy compared to these methods on two drastically imbalanced datasets, namely E.coli and Yeast [5]. E.coli proteins were classified into 8 classes with an average accu-
racy of 86%, while Yeast proteins were classified into 10 classes with an average
accuracy of 60%. Recently, genetic algorithms, growing cell structures, expand-
ing range rules and feed-forward neural networks were comparatively evaluated
for this problem, but no improvements over the KNN algorithm were reported
[2].
This paper is not about proposing another classification algorithm, but rather
about conveying the message that some existing algorithms, if properly used, can
lead to better solutions to this challenging real-world problem. The paper ad-
vocates the neural network-based approach for classifying localization sites of
proteins. We investigate the use of several supervised learning schemes [11] to
improve the classification success of neural networks. The paper is organized as
follows. First, we briefly describe the methods that we use to predict the localisa-
tion sites. Next, we introduce the datasets and the evaluation methodology used in the paper, and present experimental results from the neural network approach compared with the best existing method.
2 Classification Methods
2.1 The K Nearest Neighbours Algorithm
Let $X = \{x^i = (x^i_1, \ldots, x^i_p),\; i = 1, \ldots, N\}$ be a collection of $p$-dimensional training samples and $C = \{C_1, \ldots, C_M\}$ be a set of $M$ classes. Each sample $x^i$ will first be assumed to possess a class label $L_i \in \{1, \ldots, M\}$ indicating with certainty its membership of a class in $C$. Assume also that $x^S$ is an incoming sample to be classified. Classifying $x^S$ corresponds to assigning it to one of the classes in $C$, i.e. deciding among a set of $M$ hypotheses: $x^S \in C_q$, $q = 1, \ldots, M$. Let $\Phi^S$ be the set of the $K$ nearest neighbours of $x^S$ in $X$. For any $x^i \in \Phi^S$, the knowledge that $L_i = q$ can be regarded as evidence that increases our belief that $x^S$ also belongs to $C_q$. However, this piece of evidence does not by itself provide 100% certainty. The K nearest neighbours classifier (KNN), as suggested by Duda and Hart [3], stores the training data, the pair $(X, L)$. The examples are classified by choosing the majority class among the $K$ closest examples in the training data, according to the Euclidean distance measure. In our experiments we set $K = 7$ for the E.coli proteins and $K = 21$ for the Yeast dataset; these values have been found empirically to give the best performance [5].
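To make the decision rule above concrete, the following minimal Python sketch (our illustration, not code from the paper; the array names are hypothetical) implements the majority vote over the K nearest neighbours under the Euclidean distance:

```python
import numpy as np

def knn_classify(X_train, y_train, x_s, k):
    """Assign the incoming sample x_s to the majority class among its
    k nearest neighbours in the training set (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_s, axis=1)  # distance to every training sample
    nearest = np.argsort(dists)[:k]                # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # majority vote

# In the paper's experiments: k = 7 for E.coli, k = 21 for Yeast.
```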
2.2 Multilayer Feed-forward Neural Networks and Supervised
Learning
In a multilayer Feed-forward Neural Network (FNN) nodes are organised in layers, and connections run from input nodes to hidden nodes and from hidden nodes to output nodes. In our experiments we have used FNNs with sigmoid hidden and output nodes. The notation I-H-O is used to denote a network architecture with I inputs, H hidden-layer nodes and O output nodes. The most popular training algorithm of this category is batch Back-Propagation, a first-order method that minimizes the error function using steepest descent.
Adaptive gradient-based algorithms with individual step-sizes try to over-
come the inherent difficulty of choosing the appropriate learning rates. This is
done by controlling the weight update for every single connection during the
learning process in order to minimize oscillations and maximize the length of
the step-size. One of the best of these techniques, in terms of convergence speed,
accuracy and robustness with respect to its parameters, is the Rprop algorithm
[11]. The basic principle of Rprop is to eliminate the harmful influence of the
size of the partial derivative on the weight step, considering only the sign of the
derivative to indicate the direction of the weight update. The Rprop algorithm requires setting the following parameters: (i) the learning rate increase factor $\eta^+ = 1.2$; (ii) the learning rate decrease factor $\eta^- = 0.5$; (iii) the initial update-value $\Delta_0 = 0.1$; (iv) the maximum weight step, which is used in order to prevent the weights from becoming too large, $\Delta_{\max} = 50$ [11].
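A minimal sketch of this update scheme is given below, written as the simpler Rprop- variant in which a sign change suppresses the update rather than backtracking the previous weight step; the function name and the minimum step size step_min are our assumptions, while the remaining parameter values are those quoted above [11]:

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """One Rprop iteration: adapt each weight's individual step size from
    the sign of successive partial derivatives, then step against the
    gradient sign (the derivative's magnitude is deliberately ignored)."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(sign_change < 0, 0.0, grad)  # Rprop-: skip update after a sign flip
    w = w - np.sign(grad) * step
    return w, grad, step  # the returned grad becomes prev_grad next iteration

# Every per-weight step size is initialised to Delta_0 = 0.1, as in the paper.
```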
2.3 Ensemble-Based Methods
Methods for creating ensembles focus on creating classifiers that disagree on
their decisions. In general terms, these methods alter the training process in
an attempt to produce classifiers that will generate different classifications. In
the neural network context, these methods include techniques for training with
different network topologies, different initial weights, different learning parame-
ters, and learning different portions of the training set (see [10] for reviews and
comparisons).
We have investigated the use of two different methods for creating ensem-
bles. The first one consists of creating a simple neural network ensemble with
five networks where each network uses the full training set and differs only in
its random initial weight settings. Similar approaches often produce results as
good as Bagging [10]. The second approach is based on the notion of diversity,
where networks belonging to the ensemble can be said to be diverse with respect
to a test set if they make different generalisation errors on that test set [12].
Different patterns of generalisations can be produced when networks are trained
on different training sets, or from different initial conditions, or with different
numbers of hidden nodes, or using different algorithms [12]. Our implementation consisted of four networks and belongs to so-called level 3 diversity [12], which allows the outputs of the networks to be weighted in such a way that either the correct answer is obtained, or at least the correct output is obtained often enough that generalisation is improved (see [12] for details).
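As an illustration of how the member outputs can be combined, consider the following sketch; the predict interface and the choice of weights are our assumptions rather than the paper's implementation:

```python
import numpy as np

def ensemble_predict(nets, x, weights=None):
    """Combine the output vectors of several trained networks; with equal
    weights (weights=None) this is the simple ensemble, while unequal
    weights chosen on held-out data can exploit diversity among members."""
    outputs = np.stack([net.predict(x) for net in nets])  # shape: (n_nets, n_classes)
    combined = np.average(outputs, axis=0, weights=weights)
    return int(np.argmax(combined))                       # index of the winning class
```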
3 Experimental Study
3.1 Description of Datasets
The datasets used have been submitted to the UCI Machine Learning Data
Repository by Murphy and Aha [7] and are described in [4,8,9]. They include an
E.coli dataset with 336 protein sequences labelled according to 8 localization
sites and a Yeast data set with 1484 sequences labelled according to 10 sites.
In particular, protein patterns in the E.coli data set are organized as follows:
143 patterns of cytoplasm (cp), 77 of inner membrane without signal sequence
(im), 52 of periplasm (pp), 35 of inner membrane with uncleavable signal se-
quence (imU), 20 of outer membrane without lipoprotein (om), 5 of outer mem-
brane with lipoprotein (omL), 2 of inner membrane with lipoprotein (imL) and
2 patterns of inner membrane with cleavable signal sequence (imS).
Yeast proteins are organized as follows: there are 463 patterns of cytoplasm
(CYT), 429 of nucleus (NUC), 244 of mitochondria (MIT), 163 of membrane pro-
tein without N-terminal signal (ME3), 51 of membrane protein with uncleavable
signal (ME2), 44 of membrane protein with cleavable signal (ME1), 35 of extra-
cellular (EXC), 30 of vacuole (VAC), 20 of peroxisome (POX) and 5 patterns of
endoplasmic reticulum (ERL).
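For reference, both datasets are available from the UCI repository [7]; the sketch below loads the E.coli file, assuming a whitespace-separated layout in which each row holds a sequence name, the numeric features (7 for E.coli, 8 for Yeast) and the localisation-site label:

```python
import pandas as pd

# Hypothetical local copies of the UCI files ecoli.data / yeast.data.
ecoli = pd.read_csv("ecoli.data", sep=r"\s+", header=None)
X = ecoli.iloc[:, 1:-1].to_numpy()  # the 7 feature values per protein
y = ecoli.iloc[:, -1].to_numpy()    # class labels: cp, im, pp, imU, om, omL, imL, imS

# The class distribution should match the counts listed above (cp 143, im 77, ...).
print(ecoli.iloc[:, -1].value_counts())
```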
3.2 Evaluation Methods
Cross Validation In k-fold cross-validation a dataset $D$ is randomly split into $k$ mutually exclusive subsets $D_1, \ldots, D_k$ of approximately equal size. The classifier is trained and tested $k$ times; each time $t \in \{1, 2, \ldots, k\}$ it is trained on all $D_i$, $i = 1, \ldots, k$ with $i \neq t$, and tested on $D_t$. The cross-validation estimate of accuracy is the overall number of correct classifications divided by the number of instances in the dataset.
We used cross-validation to estimate the accuracy of the classification meth-
ods. Both data sets were randomly partitioned into equally sized subsets. The class proportions are kept the same in each partition, as this stratified procedure provides more accurate results than plain cross-validation does [6]. In our experiments
we have partitioned the data into 4 equally sized subsets for the E.coli dataset
and into 10 equally sized subsets for the Yeast dataset, as proposed in previous
works [4,5].
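A sketch of this stratified evaluation is shown below, implemented here with scikit-learn's StratifiedKFold for convenience (the paper's own partitioning predates this library); note that classes with fewer patterns than folds cannot be fully stratified:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_accuracy(make_classifier, X, y, k):
    """Stratified k-fold cross-validation: each partition keeps the class
    proportions, which gives more accurate estimates than a plain split [6].
    Returns the overall fraction of correct classifications."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    correct = 0
    for train_idx, test_idx in skf.split(X, y):
        clf = make_classifier()                   # fresh classifier per fold
        clf.fit(X[train_idx], y[train_idx])
        correct += np.sum(clf.predict(X[test_idx]) == y[test_idx])
    return correct / len(y)   # k = 4 for E.coli, k = 10 for Yeast
```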
The Wilcoxon Test of Statistical Significance The Wilcoxon signed rank test is a nonparametric alternative to the paired t-test. The test assumes that there is information in the magnitudes of the differences between paired observations, as well as in their signs. First, we take the paired observations, calculate the differences, and rank them from smallest to largest by absolute value. The ranks associated with the positive and negative differences are then summed, giving the $T^+$ and $T^-$ statistics respectively. Finally, the probability value associated with this statistic is found from the appropriate table. In our experiments, we analyzed statistical significance by implementing the Wilcoxon signed rank test as proposed in [13]. All statements refer to a significance level of 10%, which corresponds to $T^+ < 8$ for the E.coli dataset and $T^+ < 14$ for the Yeast dataset.
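As an illustration, the paired per-partition accuracies of two methods can be compared with SciPy's implementation of the signed rank test; the paper looked up the T+ statistic in a table rather than computing a software p-value:

```python
from scipy.stats import wilcoxon

# Paired per-partition accuracies, here the four cross-validation folds of Table 3.
knn_acc = [89.3, 95.2, 88.1, 69.1]
fnn_acc = [91.7, 88.1, 84.5, 88.1]

# The test ranks the absolute paired differences and sums the ranks of the
# positive and negative differences (T+ and T-); scipy reports a rank-sum
# statistic together with the associated probability value.
stat, p = wilcoxon(fnn_acc, knn_acc)
print(stat, p)
```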
4 Results
4.1 Classifying E.coli Patterns Using a Feed-forward Neural
Network
We conducted a set of preliminary experiments to find the most suitable FNN architecture in terms of training speed. We trained several networks with one and two hidden layers using the Rprop algorithm, trying various numbers of hidden nodes, i.e. 8, 12, 14, 16, 24, 32, 64 and 120. Each FNN architecture was trained 10 times with different initial weights. The best architecture found was a 7-16-8 FNN, which was used in our subsequent experiments. 100 independent trials were performed with two different termination criteria, $E_{MSE} < 0.05$ and $E_{MSE} < 0.015$, in order to explore the effect of the network error on the accuracy of the classifications. We conducted two experiments as recommended in previous works [2,4,5,8,9]. In the first experiment the neural networks are tested on the entire dataset. Table 1 shows the classification success achieved on each class.
Table 1. The accuracy of classification of E.coli proteins for each class.

No of Patterns  Class  KNN (%)  FNN $E_{MSE} < 0.05$ (%)  FNN $E_{MSE} < 0.015$ (%)
77    im    75.3   82.5   89.0
143   cp    98.6   98.1   99.0
2     imL    0.0   85.0   91.8
5     omL   80.0  100.0  100.0
35    imU   65.7   68.0   88.3
2     imS    0.0    6.5   49.0
20    om    90.0   86.6   93.2
52    pp    90.3   88.7   93.6
Mean        62.5   76.9   88.0
The results of the FNN show the mean value over 100 trials. As we can observe, the FNN outperforms the KNN in every class. It is very important to highlight the neural network classification success on the inner membrane with cleavable signal sequence (imS) and inner membrane with lipoprotein (imL) classes. In the first case, FNNs trained with $E_{MSE} < 0.05$ exhibit a 6.5% success rate while the other methods have 0%. In the second case, the FNNs have 85% success while the other two methods have only 50%. Results for FNNs become even better when trained with an $E_{MSE} < 0.015$. This obviously increases the average number of epochs required to converge (approximately 2000 epochs are needed) but leads to significant improvement: the average of overall successful predictions over 100 runs was 93% with a standard deviation of 0.33. All FNN results satisfy the conditions of the Wilcoxon test, showing that the improvements achieved by the FNNs are statistically significant (e.g. comparing FNN performance when trained with an $E_{MSE} < 0.05$ against the KNN gives $T^+ = 7$).
In order to identify common misclassifications, we calculated the confusion matrix for the FNN that exhibited the best training time for an $E_{MSE} < 0.015$. These results are shown in Table 2. The neural network achieved a high classification accuracy compared to other methods (cf. Table 5 in [5]). These results also show that the fast convergence achieved by the FNN by no means affects its classification success.
Table 2. Confusion matrix for E.coli proteins with FNN.
No of Patterns Class cp imL imS imU im omL om pp
143 cp 142 0 0 0 0 0 0 1
2 imL 0 2 0 0 0 0 0 0
2 imS 0 0 1 1 0 0 0 0
35 imU 0 0 0 31 4 0 0 0
77 im 2 0 0 6 69 0 0 0
5 omL 0 0 0 0 0 5 0 0
20 om 0 0 0 0 0 0 19 0
52 pp 3 0 0 0 0 0 0 49
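A confusion matrix such as Table 2 can be reproduced directly from a trained classifier's predictions, e.g. with scikit-learn; a minimal sketch with toy stand-in labels (in the experiment, y_true and y_pred would cover all 336 E.coli proteins):

```python
from sklearn.metrics import confusion_matrix

classes = ["cp", "imL", "imS", "imU", "im", "omL", "om", "pp"]
y_true = ["cp", "cp", "pp", "im", "imU"]   # toy ground-truth labels
y_pred = ["cp", "pp", "pp", "im", "im"]    # toy network predictions

cm = confusion_matrix(y_true, y_pred, labels=classes)   # rows: true, columns: predicted
per_class = cm.diagonal() / cm.sum(axis=1).clip(min=1)  # per-class accuracy, e.g. 142/143 for cp in Table 2
print(cm, per_class)
```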
In the second experiment, we performed a 4-fold cross-validation test by randomly partitioning the dataset into 4 equally sized subsets, as suggested in [4,5]; each time three subsets are used for training while the remaining one is used for testing. Table 3 gives the best results for each method using this cross-validation, showing that the FNN trained with an $E_{MSE} < 0.015$ exhibits better performance. Lastly, it is important to mention that the performance of the KNN algorithm is significantly improved when cross-validation is used, but it still falls short of the best FNN.
4.2 Classifying Yeast Patterns Using a Feed-forward Neural
Network
We conducted a set of preliminary experiments as with the E.coli dataset in
order to find the most suitable architecture. An 8-16-10 FNN architecture ex-
hibited the best performance. 100 FNNs were trained with the Rprop algorithm
using different initial weights to achieve an $E_{MSE} < 0.045$. As previously, we conducted two experiments following the guidelines of [4,5]. The first experiment
Table 3. Best performance for each method with 4-fold cross-validation for E.coli proteins.

Cross-Validation Partition  KNN (%)  FNN $E_{MSE} < 0.015$ (%)
0 89.3 91.7
1 95.2 88.1
2 88.1 84.5
3 69.1 88.1
Mean 82.4 88.1
concerns testing on the whole dataset. The FNN significantly outperforms the other methods. The average neural network performance over 100 runs is 67%; the worst-case performance was 64% (still an improvement over the other methods) and the best was 69%. 10000 epochs were required on average. The result of the KNN is 59.5%.
In an attempt to explore the influence of the error goal on the success of the classifications, we slightly increased the error goal to a value of $E_{MSE} < 0.05$. This results in a significant reduction in the average number of epochs; 3500 epochs are needed in order to reach convergence. Table 4 shows the classification success achieved on each class. The results of the FNN represent the average of 100 trials. The neural network outperforms the other methods in almost every class. It is very important to highlight the neural network classification success on the POX and ERL classes. With regard to the Wilcoxon test of statistical significance, the improvements achieved by the neural networks are statistically significant, giving $T^+ = 7$ when compared against the KNN. The overall average classification success of the FNNs is 64%, which shows that a small variation in the value of the error goal might affect the classification success. This might provide an explanation for the unsatisfactory performance that FNNs exhibited in the experiments reported in [2].
To identify the misclassifications in the Yeast dataset, we created the confusion matrix for the FNN that exhibited the fastest convergence to an $E_{MSE} < 0.045$. The results are shown in Table 5. It is important to highlight the significant improvement in classifying the localisation sites in each class (cf. Table 6 in [5]).
The second experiment involves 10-fold cross-validation with 10 equally sized partitions. The results are shown in Table 6. The KNN algorithm improves its generalization success. The performance of the FNNs is also improved, achieving an average classification success of 70% (an $E_{MSE} < 0.045$ was used, with an average of 3500 epochs). The Wilcoxon test gives $T^+ = 0$, showing that the improved performance achieved by the FNNs is statistically significant in all partitions when compared against the results of the KNN algorithm.
Table 4. The accuracy of classification of Yeast proteins for each class.

No of Patterns  Class  KNN (%)  FNN $E_{MSE} < 0.05$ (%)
463   cyt  70.7  66.7
5     erl   0.0  99.6
35    exc  62.9  62.7
44    me1  75.0  82.9
51    me2  21.6  47.8
163   me3  74.9  85.6
244   mit  57.8  61.3
429   nuc  50.7  57.7
20    pox  55.0  54.6
30    vac   0.0   4.1
Mean        46.8  62.3
Table 5. The confusion matrix of Yeast proteins for each class using a neural network.
No of Patterns Class cyt erl exc me1 me2 me3 mit nuc pox vac
463 cyt 335 0 0 1 0 7 24 96 0 0
5 erl 0 5 0 0 0 0 0 0 0 0
35 exc 4 0 23 2 2 0 2 2 0 0
44 me1 2 0 1 0 0 1 1 0 0 0
51 me2 5 0 2 3 28 4 7 2 0 0
163 me3 8 0 0 0 0 145 1 8 0 1
244 mit 46 0 0 2 4 10 155 27 0 0
429 nuc 121 0 0 0 48 13 24 223 0 0
20 pox 6 0 0 0 0 0 2 1 11 0
30 vac 11 0 2 0 1 6 2 7 0 1
4.3 Classifying Protein Patterns Using Ensemble-based Techniques
In this section we report results using the Simple Network Ensemble (SNE)
method and the Diverse Neural Networks (DNN) method. Table 7 shows the
results of the two ensemble-based methods on the E.coli dataset. The overall
classification success of the SNE method was 93.5%, while the overall classifica-
tion success of the DNN method was 96.8%.
We decided to concentrate on the DNN method and created an ensemble
for the Yeast dataset. The results are shown in Table 8. The DNN ensemble
significantly outperforms all other methods tested so far [2,4,5].
5 Conclusions
In this paper we have explored the use of a neural network approach for the
prediction of localisation sites of proteins in the E.coli and Yeast datasets. Nu-
merical results using supervised learning schemes, with and without the use of
Table 6. Best performance for each method using 10-fold cross-validation for Yeast proteins.

Cross-Validation Partition  KNN (%)  FNN $E_{MSE} < 0.045$ (%)
0 55.8 68.5
1 59.2 69.1
2 61.0 69.2
3 65.8 69.0
4 48.6 69.6
5 62.3 69.5
6 68.5 69.8
7 58.9 69.8
8 56.9 69.7
9 58.2 70.5
Mean 59.5 69.5
Std. Dev 5.49 0.56
Table 7. Accuracy of classification for E.coli proteins using ensemble-based techniques.
No of Patterns  Class  SNE $E_{MSE} < 0.015$ (%)  DNN $E_{MSE} < 0.015$ (%)
77 im 87.0 92.22
143 cp 98.0 100
2 imL 100 100
5 omL 100 100
35 imU 77.2 94.3
2 imS 50.0 50
20 om 80.0 100
52 pp 91.4 96.15
Mean 85.45 91.7
Table 8. Accuracy of classification for Yeast proteins using diverse neural networks.

No of Patterns  Class  DNN $E_{MSE} < 0.045$ (%)
463 cyt 80.5
5 erl 100
35 exc 74.3
44 me1 100
51 me2 70.6
163 me3 92.65
244 mit 71.4
429 nuc 70.1
20 pox 55.0
30 vac 20.0
Mean 73.5
cross validation, are better than the best previous attempts. This is mainly be-
cause of the use of more advanced training schemes. We investigated the use
of network ensembles exhibiting level 3 diversity. Our future work focuses on
exploring the use of network ensembles with higher degrees of diversity, taking
into account the drastically imbalanced nature of these datasets.
References
1. Boland, M.V. and Murphy, R.F.: After sequencing: quantitative analysis of protein localization. IEEE Engineering in Medicine and Biology (1999) 115-119
2. Cairns, P., Huyck, C., Mitchell, I. and Wu, W.: A Comparison of Categorisation Algorithms for Predicting the Cellular Localization Sites of Proteins. IEEE Engineering in Medicine and Biology (2001) 296-300
3. Duda, R.O. and Hart, P.E.: Pattern Classification and Scene Analysis. John Wiley and Sons (1973)
4. Horton, P. and Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (1996) 109-115
5. Horton, P. and Nakai, K.: Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier. Proceedings of Intelligent Systems in Molecular Biology (1997) 368-383
6. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (1995) 223-228
7. Murphy, P.M. and Aha, D.W.: UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn (1996)
8. Nakai, K. and Kanehisa, M.: Expert system for predicting protein localization sites in gram-negative bacteria. PROTEINS 11 (1991) 95-110
9. Nakai, K. and Kanehisa, M.: A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14 (1992) 897-911
10. Opitz, D. and Maclin, R.: Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research 11 (1999) 169-198
11. Riedmiller, M. and Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. Proceedings of the International Conference on Neural Networks (1993) 586-591
12. Sharkey, A.J.C. and Sharkey, N.E.: Combining diverse neural nets. The Knowledge Engineering Review 12 (1997) 231-247
13. Snedecor, G. and Cochran, W.: Statistical Methods. 8th edn. Iowa State University Press (1989)