
Significantly improved prediction of subcellular localization by integrating text and protein sequence data

Authors:
  • SciCross AB

Abstract

Computational prediction of protein subcellular localization is a challenging problem. Several approaches have been presented during the past few years; some attempt to cover a wide variety of localizations, while others focus on a small number of localizations and on specific organisms. We present a comprehensive system, integrating protein sequence-derived data and text-based information. It is tested on three large data sets, previously used by leading prediction methods. The results demonstrate that our system performs significantly better than previously reported results, for a wide range of eukaryotic subcellular localizations.
Significantly Improved Prediction of Subcellular Localization by Integrating Text and
Protein Sequence Data
Annette Hoglund, Torsten Blum, Scott Brady, Pierre Donnes, John San Miguel, Matthew
Rocheford, Oliver Kohlbacher, and Hagit Shatkay
Pacific Symposium on Biocomputing 11:16-27(2006)
September 17, 2005 23:39 Proceedings Trim Size: 9in x 6in PSBHoeglundShatkay
ANNETTE HÖGLUND, TORSTEN BLUM, SCOTT BRADY, PIERRE DÖNNES, JOHN SAN MIGUEL, MATTHEW ROCHEFORD, OLIVER KOHLBACHER, HAGIT SHATKAY
Div. for Simulation of Biological Systems, ZBIT/WSI,
University of Tübingen, Sand 14, D-72076 Tübingen, Germany
School of Computing, Queen’s University,
Kingston, Ontario, Canada K7L 3N6
1. Introduction
In this paper we introduce a new system for computationally assigning proteins to their subcellular localization. By integrating several types of sequence-derived features and text-based information, the achieved performance is the best reported so far, in terms of sensitivity, specificity, and overall accuracy. Unlike several recent systems which focus on a few subcellular localizations or on a specific organism, our system is applicable to, and retains its good performance across, a wide variety of organisms and subcellular localizations. Moreover, we show that the integrated system, which combines sequence and text, performs significantly better than its individual components, each based on one data source alone.
The task of protein subcellular localization prediction is important and well-studied. Knowing a protein’s localization helps elucidate its function, its role in both healthy processes and in the onset of disease, and its potential use as a drug target. Experimental methods for protein localization range from immunolocalization to tagging of proteins using green fluorescent protein (GFP) and isotopes. Such methods are accurate but, even at their best, are slow and labor-intensive compared with large-scale computational methods. Computational tools for predicting localization are useful for a large-scale initial “triage”, especially for proteins whose amino acid sequence may be determined from the genomic sequence, but are hard to produce, isolate, or locate experimentally.
Footnote: To whom correspondence should be addressed: shatkay@cs.queensu.ca. HS is supported by NSERC Discovery grant 298292-04.
The past decade, and most notably the last five years, has seen much progress in computational prediction of protein localization from sequence data. Nakai and Kanehisa introduced PSort, a rule-based expert system, which was later improved upon by a probabilistic classifier and by a K-nearest neighbor classifier. Another pair of prominent systems, TargetP and ChloroP, based on artificial neural networks, demonstrated a significantly higher accuracy when applied to a limited set of subcellular localizations in plant and animal cells. Other recent systems use a variety of machine learning techniques. Most of them focus on a few subcellular localizations and improve upon or just meet the state of the art on those.
Several recent publications have examined the possibility of using text to support subcellular localization. Specifically, Stapley et al. represented yeast proteins as vectors of weighted terms from all the PubMed articles mentioning their respective genes. They then trained a support vector machine (SVM) on these protein text vectors, to distinguish among subcellular localizations. The performance was favorable when compared to a classifier trained on amino acid composition alone, but it was not compared against any state-of-the-art localization system, and the reported results do not suggest an improvement over earlier systems. Moreover, while their text-based classifier performed better than an amino acid composition classifier, combining the two forms of data did not significantly improve performance with respect to the text-based classifier alone.
Nair and Rost used the text taken from Swiss-Prot annotations of proteins to represent these proteins, and trained a subcellular classifier using this representation. They concentrate on a few subcellular localizations, and report results that are comparable to, but do not improve upon, the state of the art at that time. Their work was elaborated upon by Eskin and Agichtein, who added subsequences from the protein’s amino acid sequence as part of the terms considered in the text representation. The system was not tested against existing systems or data sets, and the reported results do not indicate improvement over previous systems.
The best performing comprehensive systems reported so far, which were tested on a large set of proteins, are PLOC and, more recently, MultiLoc. While they report the best accuracy until now, on a broad range of organisms and localizations, there is still room for improvement.
The work reported here, similarly to that reported by Nair and Rost, uses Swiss-Prot as a text source. Unlike them though, we use the PubMed abstracts referenced by Swiss-Prot, rather than the annotation text placed by Swiss-Prot curators. Furthermore, unlike Stapley et al., who use all abstracts that contain the gene name for the protein, we use only abstracts that are referenced by Swiss-Prot. Moreover, rather than use all the terms in them with a standard (TF*IDF) weighting, as done by Stapley et al., we select terms based on a distinguishing criterion described in Section 2, and apply a probability-based weighting scheme. We train an SVM as a text-based classifier, and combine it with a sequence-based classifier, to produce a comprehensive subcellular categorizer. Our integrated system is tested on a number of publicly available, extensive, homology-reduced data sets which were used for evaluating earlier systems (TargetP, PLOC, and MultiLoc). For each system, we first conduct a comparison using the same data and the same subcellular localizations as reported in the paper published about that system. We then conduct a test using all the proteins in Swiss-Prot for which a subcellular annotation is assigned, among the 11 localizations: chloroplast, cytoplasm, endoplasmic reticulum, extracellular space, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. On each of the data sets our system performs better than the state-of-the-art systems in terms of overall prediction accuracy, and other standard measures.
The next section outlines the methods used, while in Section 3 we demonstrate
the performance of our system. Section 4 concludes and outlines future work.
2. Methods
Our system combines five separate classifiers, four sequence-based and one text-based. Their output is integrated through a sixth classifier to produce an improved prediction of protein subcellular localization. The sequence-based classifiers have been successfully used before by the MultiLoc system and are briefly described below. Section 2.2 then presents the novel text-based method, while Section 2.3 explains how all these classifiers are combined to form an integrated prediction system. Four of the five classifiers are based on support vector machines (SVMs), using the LIBSVM implementation. The latter supports soft, probabilistic categorization for $n$-class tasks, assigning to each classified item an $n$-dimensional vector denoting its probability to belong to each of the $n$ classes. Radial Basis Function kernels were used throughout this study. Further details are given below.
2.1. Sequence-based methods
Each of the sequence-based classifiers utilizes a different approach to derive biologically informative features that can be used to predict localization, and classifies the input protein sequence to its respective localization using these features. Three of these classifiers are SVM-based. The fourth scans the protein sequences for short sequence motifs indicative of structure and function. The four classifiers are briefly described below (see the MultiLoc paper for further details).
Footnote on TF*IDF (Section 1): An acronym for Term Frequency, Times Inverse Document Frequency.
SVMTarget: This classifier uses the N-terminal targeting peptide (TP) to predict a few subcellular categories. It distinguishes among four plant (chloroplast (ch), mitochondria (mi), secretory pathway (SP), and other (OT)) and three non-plant (mi, SP, OT) localizations. The targeting peptides are represented by their partial amino acid composition, motivated by the observation that TPs for specific localizations have a similar amino acid composition while their actual sequence may differ. Given an input protein, the classifier outputs a three-dimensional vector (four-dimensional for plant) of class probabilities. SVMTarget alone demonstrated a slightly better performance than TargetP in a comparative study.
SVMSA: Some proteins of the secretory pathway carry a signal anchor (SA) that, unlike the targeting peptide, is usually located further away from the N-terminus and contains a longer hydrophobic component. SVMSA can predict secretory pathway (SP) proteins that are hard to detect using SVMTarget. It is a binary classifier, trained to distinguish proteins carrying SA from those that do not. Given an input sequence, it outputs the probability that the sequence contains a signal anchor.
SVMaac: This method uses the whole protein amino acid composition (aac), and categorizes proteins into any of the possible localizations. It combines a collection of binary classifiers, each trained to distinguish one class from all others, although one classifier in the collection was especially trained to distinguish cytosolic (cy) from nuclear (nu) proteins, as these are hard to separate using the one-against-all approach. Given an input protein, $p$, with $n$ possible localizations, the classifier outputs an $n$-dimensional probability vector containing $p$’s probability to belong to each localization.
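The amino acid composition feature that SVMaac relies on can be sketched as follows. This is a minimal illustration, assuming a 20-dimensional vector of relative residue frequencies over the standard amino acids; the function name and the handling of non-standard residues are assumptions, and the SVM training itself is not shown.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every protein
# maps to a vector with the same layout (order chosen for illustration).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid."""
    counts = Counter(sequence.upper())
    # Assumption: residues outside the 20 standard letters are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

composition = amino_acid_composition("AACD")
print(composition[:3])
```

A vector of this kind would be the input to each binary SVM in the collection; a partial composition, as used by SVMTarget, could be obtained by passing only the N-terminal part of the sequence.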
MotifSearch: Proteins from several subcellular localizations can be characterized by a few types of short sequence motifs, such as Nuclear Localization Signals and DNA-binding domains. The motifs were obtained from the PROSITE and from the NLSdb databases. This classifier outputs a discrete, binary vector, representing the presence (1) or the absence (0) of each type of motif in the query protein sequence.
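The binary presence/absence output can be illustrated with a small sketch. The two patterns below are hypothetical, simplified stand-ins chosen for illustration; the actual motifs come from PROSITE and NLSdb and are not reproduced here.

```python
import re

# Hypothetical, simplified patterns for illustration only.
MOTIFS = {
    # A run of four or more basic residues (K or R).
    "basic_cluster": re.compile(r"[KR]{4,}"),
    # The C-terminal tetrapeptide KDEL, a known ER retention signal.
    "kdel_cterm": re.compile(r"KDEL$"),
}

def motif_vector(sequence):
    """Return 1 or 0 per motif type, in the order of MOTIFS."""
    seq = sequence.upper()
    return [1 if pattern.search(seq) else 0 for pattern in MOTIFS.values()]

print(motif_vector("MAKRKRKAAA"))
print(motif_vector("MSTAKDEL"))
```

Each position of the vector corresponds to one motif type, so the vector length equals the number of motif types scanned.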
2.2. Text-based method
The idea underlying the text-based classifier is the representation of each protein as a vector of weighted text features. While text-based localization has been presented before, the key differences between the current work and previous ones are in the text source used, the feature selection, and the term weighting scheme.
First, for each protein the text comes from the abstracts curated for the protein in its Swiss-Prot entry. We used a script that scanned each protein in Swiss-Prot for all the PubMed identifiers occurring in its Swiss-Prot entry, and obtained the respective title and abstract from PubMed. Each protein is thus assigned a set of PubMed abstracts, based on Swiss-Prot. This choice of abstracts is different from that of Stapley et al., who used all the PubMed abstracts mentioning the gene’s name, and from that of Nair and Rost, who use Swiss-Prot annotation text rather than PubMed abstracts. The assigned abstracts are then tokenized into a set of terms, consisting of single words and pairs of consecutive words, with a list of standard stop words excluded from consideration. The results reported here also include the application of Porter stemming to all the words in the terms.
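The tokenization step can be sketched as below. This is an illustration under stated assumptions: the stop list is a tiny hypothetical one, a pair is kept only when both of its words are non-stop words and adjacent in the original text, and Porter stemming is omitted because it is not available in the Python standard library.

```python
import re

# Tiny illustrative stop list; the list actually used is not given in the text.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "to", "that"}

def extract_terms(abstract):
    """Return single words followed by pairs of consecutive words."""
    words = re.findall(r"[a-z0-9]+", abstract.lower())
    singles = [w for w in words if w not in STOP_WORDS]
    # Assumption: a pair is kept only if neither word is a stop word.
    pairs = [
        first + " " + second
        for first, second in zip(words, words[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    ]
    return singles + pairs

print(extract_terms("The catalytic domain of the Golgi enzyme"))
```

With stemming applied to each word, a pair such as "catalytic domain" would become a stemmed term of the kind listed in Table 1.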
Second, from all the extracted terms, we select a subset of distinguishing terms. This is done by scoring each term with respect to each subcellular localization, where the score reflects the probability of the term to occur in abstracts that are associated with proteins of this certain localization. Intuitively, a term $t$ is distinguishing for a localization $L$, if it is much more likely to occur in abstracts associated with localization $L$ than in abstracts associated with all other localizations. We formalize this idea in the following paragraphs.
Let $t$ be a term, $L$ a localization, and $p$ a protein. If protein $p$ is known to be localized in $L$, we denote this $p \in L$. We also define the following sets:
  • The set of all PubMed abstracts associated with protein $p$ according to Swiss-Prot, denoted $D_p$;
  • The set of all proteins known to be localized at $L$, denoted $P_L$;
  • The set of abstracts that are associated with a localization $L$, denoted $D_L$, is defined as: $D_L = \bigcup_{p \in P_L} D_p$. It is the set of all the abstracts associated with the proteins that are in localization $L$. The number of documents in this set is denoted $|D_L|$.
The probability of a term $t$ to be associated with a localization $L$, denoted $\Pr^t_L$, is defined as the conditional probability of the term to appear in a document, given that the document is associated with the localization: $\Pr^t_L = \Pr(t \in d \mid d \in D_L)$. A maximum likelihood estimate for this probability is simply the proportion of documents containing $t$ among all those associated with the localization:
$$\Pr^t_L \approx \frac{\#\text{ of documents } d \in D_L \text{ such that } t \in d}{|D_L|}.$$
For each term $t$ and each localization $L$, the estimate for the probability $\Pr^t_L$ is calculated.
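The pooling of abstracts per localization and the maximum likelihood estimate can be sketched as follows. The data structures (dictionaries keyed by protein and abstract identifiers) and all identifiers in the toy example are hypothetical.

```python
def abstracts_for_localization(loc, localization_of, abstracts_of):
    """Union of the abstract ids of all proteins annotated with `loc`."""
    pooled = set()
    for protein, protein_loc in localization_of.items():
        if protein_loc == loc:
            pooled |= abstracts_of.get(protein, set())
    return pooled

def term_probability(term, abstract_ids, terms_in):
    """Fraction of the given abstracts whose term set contains `term`."""
    if not abstract_ids:
        return 0.0
    hits = sum(1 for abstract_id in abstract_ids if term in terms_in[abstract_id])
    return hits / len(abstract_ids)

# Toy data: three proteins, four abstracts (all identifiers are made up).
localization_of = {"P1": "nu", "P2": "nu", "P3": "mi"}
abstracts_of = {"P1": {"a1", "a2"}, "P2": {"a2", "a3"}, "P3": {"a4"}}
terms_in = {"a1": {"dna", "bind"}, "a2": {"dna"}, "a3": {"histon"}, "a4": {"oxidas"}}

nuclear_abstracts = abstracts_for_localization("nu", localization_of, abstracts_of)
print(term_probability("dna", nuclear_abstracts, terms_in))
```

Using a set for the pooled abstracts means an abstract shared by two proteins of the same localization is counted once, which matches the set-union definition.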
Based on this probability, a term $t$ is called distinguishing for localization $L$ if and only if its probability to occur in localization $L$, $\Pr^t_L$, is significantly different from its probability to occur in any other localization $L'$, $\Pr^t_{L'}$. The statistical test applied uses the Z-score, which evaluates the difference between two binomial probabilities, $\Pr^t_L$ and $\Pr^t_{L'}$, as follows:
$$Z^t_{L,L'} = \frac{\Pr^t_L - \Pr^t_{L'}}{\sqrt{\bar{P}\,(1-\bar{P})\left(\frac{1}{|D_L|} + \frac{1}{|D_{L'}|}\right)}}, \quad \text{where} \quad \bar{P} = \frac{|D_L| \cdot \Pr^t_L + |D_{L'}| \cdot \Pr^t_{L'}}{|D_L| + |D_{L'}|},$$
and $|D_L|$, $|D_{L'}|$ are the numbers of abstracts associated with $L$ and $L'$. When the Z-score exceeds the critical value for the chosen confidence level, the hypothesis that the two probabilities $\Pr^t_L$, $\Pr^t_{L'}$ are different is accepted at that confidence level. Therefore, if the term $t$ has a localization $L$ such that the test is passed for any other localization $L'$, $t$ is considered distinguishing for localization $L$, and is included in the set of distinguishing terms. In our representation of proteins as term vectors, we use only distinguishing terms. In the experiments described in Section 3, using several different protein sets, the average number of PubMed abstracts is on the order of , while that of distinguishing terms is about . Some examples of distinguishing terms for several localizations are shown in Table 1.
Footnote: The PubMed titles and abstracts are used without any of the MeSH terms.

Table 1. Examples of distinguishing stemmed terms for several localizations.
Localization            Example Terms
Nucleus                 bind, control, dna, histon, nuclear, promot, transcript
Mitochondria            coa (CoA), complex, cytochrom, dehydrogenas, mitochondri, oxidas, respiratori
Golgi Apparatus         acceptor, catalyt domain, fucosyltransferas, galactos, glycosyltransferas, golgi, transferas
Endoplasmic Reticulum   calcium, chaperon, disulfid isomeras, endoplasm, lumen, microsom, transmembran
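The selection test can be sketched with a pooled two-proportion Z-score, which is the standard form of a test for the difference between two binomial probabilities. Two points are assumptions rather than details taken from the text: the comparison is two-sided (absolute value of Z), and the default critical value of 1.96, the usual two-sided 5% point of the standard normal distribution, is only an illustrative choice.

```python
import math

def z_score(p1, n1, p2, n2):
    """Pooled two-proportion Z-score for estimates p1 (n1 docs) and p2 (n2 docs)."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    variance = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if variance == 0.0:
        return 0.0
    return (p1 - p2) / math.sqrt(variance)

def is_distinguishing(term_prob, n_docs, loc, critical_value=1.96):
    """True if the term differs significantly between `loc` and every other localization."""
    return all(
        abs(z_score(term_prob[loc], n_docs[loc], term_prob[other], n_docs[other]))
        > critical_value
        for other in term_prob
        if other != loc
    )

# Toy estimates for one term across three localizations (values are made up).
term_prob = {"nu": 0.60, "mi": 0.10, "go": 0.15}
n_docs = {"nu": 200, "mi": 150, "go": 100}
print(is_distinguishing(term_prob, n_docs, "nu"))
print(is_distinguishing(term_prob, n_docs, "mi"))
```

In the toy data the term passes the test for "nu" against both other localizations, while for "mi" it fails against "go", so it would be kept as a distinguishing term for the nucleus only.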
Finally, once the collection of $N$ distinguishing terms, denoted as $T_N$, was established, each protein $p$ is represented as an $N$-dimensional vector, where the weight $w^p_{t_i}$ at position $i$ (where $1 \le i \le N$) is the conditional probability of the term $t_i$ to appear in the abstracts associated with the protein $p$, given all the PubMed abstracts related to the protein (the set $D_p$). This probability is estimated as the ratio between the total number of times the term $t_i$ occurs in the abstracts associated with the protein and the total number of all the occurrences of distinguishing terms in these same abstracts. Formally it is calculated as:
$$w^p_{t_i} = \frac{\sum_{d \in D_p} (\#\text{ of times } t_i \text{ occurs in } d)}{\sum_{d \in D_p} \sum_{t_j \in T_N} (\#\text{ of times } t_j \text{ occurs in } d)},$$
where the sums are taken over all the abstracts $d$ in the set of abstracts associated with the protein $p$, $D_p$.
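The weighting scheme can be sketched as follows: occurrences of each distinguishing term are summed over all abstracts of a protein and divided by the total number of occurrences of all distinguishing terms in those abstracts. The function name and the toy terms are hypothetical; returning an all-zero vector when no distinguishing term occurs is an assumption.

```python
from collections import Counter

def protein_term_vector(abstract_terms, distinguishing_terms):
    """Weight vector over `distinguishing_terms` for one protein.

    `abstract_terms` is a list of term lists, one per abstract of the protein.
    """
    vocabulary = set(distinguishing_terms)
    counts = Counter()
    for terms in abstract_terms:
        counts.update(term for term in terms if term in vocabulary)
    total = sum(counts.values())
    if total == 0:
        # Assumption: no distinguishing term occurs, so all weights are zero.
        return [0.0] * len(distinguishing_terms)
    return [counts[term] / total for term in distinguishing_terms]

abstracts = [["dna", "bind", "dna", "cell"], ["bind", "nuclear"]]
vector = protein_term_vector(abstracts, ["bind", "dna", "nuclear", "golgi"])
print(vector)
```

By construction the non-zero weights of a protein sum to one, and terms outside the distinguishing set (here "cell") do not contribute to the denominator.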
The set of proteins, represented as weighted term vectors, is then partitioned into training and test sets for each subcellular localization, and as before, an SVM is trained to classify these protein vectors into their respective localization. This classifier, like SVMaac described above, produces an $n$-dimensional probability vector denoting the probability of the protein to be in each of the $n$ localizations.
2.3. Integrated method
The output from the five classifiers above is a set of four probability vectors and one binary-valued vector (resulting from MotifSearch). These are all concatenated to form one integrated feature vector for each protein. Again, an SVM classifier is trained on these feature vectors to produce a prediction. This classifier consists of a set of one-against-one classifiers (each of which distinguishes between a pair of localizations) and its output, yet again, is a probabilistic vector, holding for each localization the probability of the protein to belong to it. Based on this final classification step, a protein is assigned to the localization with the highest probability value in the last output vector. The training and evaluation procedure uses strict five-fold cross-validation, where no test protein was used to train any of the classifiers comprising the system.
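The two mechanical steps of the integration, concatenating the component outputs and picking the most probable localization, can be sketched as below. The training of the final SVM is not shown; the function names, the order of concatenation, and the numbers in the example are assumptions made for illustration.

```python
def integrate_features(svm_target, svm_sa, svm_aac, motif_bits, text_probs):
    """Concatenate the five component outputs into one feature vector.

    `svm_sa` is a single probability; the other arguments are sequences.
    """
    return (list(svm_target) + [svm_sa] + list(svm_aac)
            + list(motif_bits) + list(text_probs))

def assign_localization(class_probabilities):
    """Return the localization with the highest probability."""
    return max(class_probabilities, key=class_probabilities.get)

features = integrate_features([0.1, 0.2, 0.7], 0.05, [0.5, 0.5], [1, 0], [0.9, 0.1])
print(len(features))
print(assign_localization({"nu": 0.7, "cy": 0.2, "mi": 0.1}))
```

In the integrated system the dictionary passed to `assign_localization` would be the probabilistic output of the final classifier applied to the concatenated vector.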
3. Experiments and Results
To train and to evaluate our integrated system, we used three different data sets, namely those used for training and testing TargetP, MultiLoc, and PLOC. These sets provide the basis for an extensive and sound comparison. The data sets, the evaluation procedure, and the results are described throughout this section.
3.1. Experimental setting
The data sets used in our experiments are the following:
TargetP: This data set contains a total of 3,415 distinct proteins representing four plant (ch, mi, SP, and OT) and three non-plant (mi, SP, and OT) localizations. Homologs were removed from it by the TargetP authors. The SP category includes proteins from several localizations in the secretory pathway: endoplasmic reticulum (er), extracellular space (ex), Golgi apparatus (go), lysosome (ly), plasma membrane (pm), and vacuole (va). The OT category includes cy and nu proteins.
MultiLoc: The MultiLoc data set contains a total of 5,959 protein sequences, which were extracted from the Swiss-Prot database release 42.0. Animal, fungal, and plant proteins with an annotated subcellular localization (excluding proteins whose annotation was commented “by similarity” or “potential”) were grouped into eleven eukaryotic localizations: cy, ch, er, ex, go, ly, mi, nu, peroxisome (pe), pm, and va. In the experiments reported here, homologous proteins with identity higher than 80% (the same threshold used by PLOC) were excluded from the set, to avoid the occurrence of highly similar sequences in both the training and the test sets. Further details about the data set extraction and the implications of homology reduction are available in the MultiLoc publication.
Footnote: We also conducted experiments with a more lenient and a more stringent homology constraint, of and identity, respectively (data not shown).
PLOC: The PLOC data set was used by Park and Kanehisa and consists of proteins extracted from Swiss-Prot release 39.0, covering 12 localizations. In contrast to MultiLoc (aside from the older Swiss-Prot version), this data set introduces an additional category within the cy proteins, namely, the cytoskeleton (cs). There are 41 cs proteins, compared to 1,245 cy proteins. The total number of sequences is 7,579 (max. sequence identity 80%). This set is larger than the MultiLoc data set due to a less restrictive data extraction, assigning proteins to a localization even when the localization annotation includes the words “potential” or “by similarity”.
Using these three data sets, the performance of our integrated system is compared to that of TargetP, PLOC, and MultiLoc. In addition, we also compare the performance of the integrated system to that of an SVM classifier applied to the text data alone. Following previous evaluations, we consistently employ five-fold cross-validation. For the comparison on the PLOC data set we use the same split as the one used by Park and Kanehisa. For the TargetP data, as the split used by Emanuelsson et al. was not provided, we ran the five-fold cross-validation procedure five times, each using a different randomized five-way split, to ensure robustness. The reported results are averaged over all the 5 folds, and over the 5 randomized splits when those are used.
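The repeated five-fold procedure can be sketched as below. This is an illustration only: the splits are plain random partitions (whether the original splits were stratified by localization is not stated), the seeds are arbitrary, and `evaluate` is a hypothetical callback that trains on one part and returns a score on the other.

```python
import random

def five_fold_splits(items, n_folds=5, seed=0):
    """Yield (train, test) pairs from one randomized n-way split."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def repeated_cross_validation(items, evaluate, n_repeats=5, n_folds=5):
    """Average `evaluate(train, test)` over all folds of all randomized splits."""
    scores = [
        evaluate(train, test)
        for seed in range(n_repeats)
        for train, test in five_fold_splits(items, n_folds, seed)
    ]
    return sum(scores) / len(scores)

# Dummy evaluation: the fraction of items held out in each fold.
proteins = list(range(23))
print(repeated_cross_validation(proteins, lambda train, test: len(test) / 23))
```

Because every item lands in exactly one test fold per split, no test item is ever part of the training data for the fold in which it is evaluated.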
Since the performance of previous systems was evaluated using several different metrics, for a fair comparison we calculated these same performance measures. Thus, for each system and data set the performance is measured, for each localization, in terms of the sensitivity (Sens), specificity (Spec), and Matthews correlation coefficient (MCC). These are defined as:
$$\mathrm{Sens} = \frac{TP}{TP+FN}, \qquad \mathrm{Spec} = \frac{TP}{TP+FP},$$
and
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}},$$
where $TP$, $TN$, $FP$, $FN$ denote the number of true positives, true negatives, false positives, and false negatives, respectively, with respect to a given localization. Like Park and Kanehisa we also measure the overall accuracy, namely, $C/N$, where $C$ is the number of correctly classified proteins over all the localizations, and $N$ is the total number of classified proteins. They also measured the average sensitivity, over all the localizations, a metric they call local accuracy, which we calculate as well. This last measure, which we denote as Avg, gives an equal weight to the categorization performance on each localization, regardless of the number of proteins known to be associated with it.
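The per-localization and overall measures can be computed as in the following sketch. One assumption is stated loudly: specificity is taken here as TP/(TP+FP), the definition used in the TargetP line of work (elsewhere called precision); if the TN/(TN+FP) definition is intended instead, only the `spec` line changes. Returning 0.0 when a denominator is zero is also an assumption.

```python
import math

def per_class_metrics(true_labels, predicted_labels, loc):
    """Return (Sens, Spec, MCC) for one localization."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == loc and p == loc)
    tn = sum(1 for t, p in pairs if t != loc and p != loc)
    fp = sum(1 for t, p in pairs if t != loc and p == loc)
    fn = sum(1 for t, p in pairs if t == loc and p != loc)
    sens = tp / (tp + fn) if tp + fn else 0.0
    # Assumed definition: Spec = TP / (TP + FP).
    spec = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc

def overall_and_average_accuracy(true_labels, predicted_labels):
    """Return (Acc, Avg): overall accuracy and mean per-localization sensitivity."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    acc = correct / len(true_labels)
    locs = sorted(set(true_labels))
    avg = sum(per_class_metrics(true_labels, predicted_labels, loc)[0]
              for loc in locs) / len(locs)
    return acc, avg

true = ["nu", "nu", "nu", "mi"]
pred = ["nu", "nu", "mi", "mi"]
print(per_class_metrics(true, pred, "nu"))
print(overall_and_average_accuracy(true, pred))
```

In the toy example Acc is 0.75 while Avg is higher, because the single mitochondrial protein is predicted correctly and counts as much as the three nuclear ones.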
Footnote: Comparison to PSort is not included here, since MultiLoc has already demonstrated a higher prediction accuracy compared to this method.
3.2. Results
We present the results of running the sequence-based system, MultiLoc, the text-based classifier alone (denoted Text), and the integrated system (denoted MultiLocText), on all three data sets. For completeness, we also present the results reported by the authors of PLOC and of TargetP on the respective data sets. These numbers were taken directly from the respective publications.
Table 2 summarizes the results, showing the overall accuracy (Acc) and the
average local accuracy (Avg) for both the TargetP and the PLOC data sets. For
TargetP the results are shown for plant and non-plant proteins, while for PLOC
results are shown for plant, animal, and fungal proteins. Table 3 compares the
performance of TargetP and PLOC with our integrated system, with respect to the
individual subcellular localizations.
Table 2. An overview of the prediction results using the TargetP and PLOC data sets. Both the total (Acc) and the average (Avg) prediction accuracies are shown for all the methods, as Acc [%] (± standard deviation) / Avg [%] (± standard deviation). The highest value in each column is obtained by MultiLocText. Standard deviations (denoted ±) are provided where available.

TargetP data set
Method         Plant                        Non-Plant
TargetP        85.3 (±3.5) / 85.6 (n/a)     90.0 (±0.7) / 90.7 (n/a)
MultiLoc       89.7 (±1.6) / 90.2 (±2.0)    92.5 (±1.2) / 92.8 (±1.1)
Text           81.2 (±2.6) / 78.1 (±3.2)    88.7 (±1.1) / 89.8 (±1.6)
MultiLocText   94.7 (±1.5) / 94.4 (±1.6)    96.2 (±0.8) / 96.7 (±0.9)

PLOC data set
Method         Plant                        Animal                       Fungal
PLOC           78.2 (±0.9) / 57.9 (±2.1)    79.6 (±0.9) / 59.9 (±3.3)    79.5 (±0.9) / 56.8 (±1.9)
MultiLoc       73.6 (±0.7) / 71.3 (±2.8)    76.0 (±0.7) / 73.6 (±3.9)    75.8 (±0.8) / 72.5 (±2.5)
Text           68.7 (±0.7) / 73.5 (±1.8)    70.2 (±0.7) / 75.5 (±2.7)    67.8 (±0.5) / 72.4 (±2.6)
MultiLocText   85.3 (±1.2) / 84.2 (±2.4)    86.4 (±0.8) / 84.5 (±3.6)    85.4 (±0.8) / 83.8 (±2.8)
Table 3. Localization-specific results using the TargetP (left) and the PLOC (right) data sets. For both sets, the results reported in the respective papers are compared to results of our integrated system (MultiLocText). As PLOC localization-specific results are averaged over all three organisms, we show such averaged results for our system as well. Specificity and MCC values were not available for PLOC, hence only its Sensitivity is listed and compared with our sensitivity values. The highest compared values for each data set are shown in bold.
[Table body not recoverable. Columns, TargetP data set: Loc, TargetP (Sens, Spec, MCC), MultiLocText (Sens, Spec, MCC); rows: plant ch, mi, OT, SP and non-plant mi, OT, SP. Columns, PLOC data set: Loc, PLOC (Avg. Sens), MultiLocText (Avg. Sens, Spec, MCC); rows: ch, mi, cs, cy, er, ex, go, nu, pe, pm, va, ly.]
A comparison of the performance of our three systems (MultiLoc alone, Text alone, and the integrated MultiLocText) using five-fold cross-validation over the 5,959 proteins of the MultiLoc data set is presented in Table 4. The sensitivity (Sens), specificity (Spec), and Matthews correlation coefficient (MCC) values for the plant and animal versions are listed. (Similar results were obtained for the fungal version, and are not shown here due to space limitations.)
The results in Tables 2, 3, and 4 clearly show that the combined classifier, which integrates text and sequence data, outperforms earlier prediction methods. It also outperforms its own text-based (Text) and sequence-based (MultiLoc) components, if taken separately. A significance test was performed to evaluate the differences between the values obtained from MultiLocText and those obtained from each of MultiLoc and Text alone (Table 4). The improved performance values of MultiLocText are highly statistically significant ( ), for almost all the subcellular localizations. The only exceptions are the Golgi (go, animal and plant), where there is no significant difference in sensitivity with respect to text-alone, as well as the peroxisome predictions (pe, animal and plant), where MultiLocText does not outperform the text-alone system.
4. Discussion and Conclusion
The methods, experiments, and results presented here clearly demonstrate a significant improvement in the prediction of protein subcellular localization through the integration of sequence- and text-based methods. Table 4 shows that the two types of methods distinctly complement each other. MultiLoc, which is based on sequence data, typically performs well predicting protein localizations that are directed by N-terminal signals such as the mitochondria and the chloroplast. The use of text information complements and significantly boosts its performance for localizations whose sequence-based signal is not as overt, including the peroxisome and localizations related to the secretory pathway such as the Golgi apparatus and the endoplasmic reticulum.

Table 4. Prediction performance of MultiLoc, Text, and MultiLocText on the MultiLoc data set. Both localization-specific values (Sens, Spec, MCC) and overall results (Acc and Avg) are shown. Highest values appear in bold.
[Table body largely not recoverable. Columns: Loc, MultiLoc, Text, MultiLocText, each with Sens, Spec, MCC. Rows: plant ch, cy, er, ex, go, mi, nu, pe, pm, va, Acc [%], Avg [%]; animal cy, er, ex, go, ly, mi, nu, pe, pm, Acc [%], Avg [%]. Surviving plant entries, in column order: Acc [%] 74.6, 73.1; Avg [%] 75.2, 76.0.]
In this work we have demonstrated, using five-fold cross-validation, that our system can reproduce, with unprecedented sensitivity and specificity, localizations of proteins which were already annotated in Swiss-Prot. A natural next step is to apply the method to yet un-localized proteins. We are developing the means to predict subcellular localization of proteins for which PubMed references exist in Swiss-Prot but no localization is assigned, as well as for those with no curated PubMed reference. Our current use of “raw text” from PubMed abstracts (in contrast, for instance, to the use of Swiss-Prot annotation text as was done before) is expected to make our approach amenable to such extensions. We are also investigating methods for the localization of proteins with no PubMed references, through the use of alternative data sources.
References
1. Emanuelsson, O., Nielsen, H., Brunak, S., von Heijne, G.: Predicting subcellular lo-
calization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 300
(2000) 1005–1016
2. Nair, R., Rost, B.: Inferring sub-cellular localization through automated lexical anal-
ysis. Bioinformatics 18 (2002) S78–S86
3. Gardy, J.L., Spencer, C., Wang, K. et al.: PSORT-B: Improving protein subcellular
localization prediction for gram-negative bacteria. Nucleic Acids Research 31 (2003)
137–140
4. Cai, Y.D., Chou, K.C.: Predicting 22 protein localizations in budding yeast. Biochem
Biophys Res Commun. 323 (2004) 425–428
5. Schneider, G., Fechner, U.: Advances in the prediction of protein targeting signals.
Proteomics 4(2004) 1571–1580
6. Dönnes, P., Höglund, A.: Predicting Protein Subcellular Localization: Past, Present, and Future. Genomics, Proteomics, and Bioinformatics 2 (2004)
7. Burns, N., Grimwade, B., Ross-Macdonald, P., Choi, E., Finberg, K., Roeder, G.S., Snyder, M.: Large-scale analysis of gene expression, protein localization and gene disruption in Saccharomyces cerevisiae. Genes and Development 8 (1994) 1087–1105
8. Hanson, M.R., Köhler, R.H.: GFP imaging: Methodology and application to investigate cellular compartmentation in plants. Journal of Experimental Botany 52 (2001)
9. Dunkley, T., Watson, R., Griffin, J., Dupree, P., Lilley, K.: Localization of organelle proteins by isotope tagging (LOPIT). Molecular and Cellular Proteomics 3 (2004)
10. Nakai, K., Kanehisa, M.: Expert system for predicting protein localization sites in
gram-negative bacteria. Proteins: Structure, Function and Genetics 11 (1991) 95–110
11. Nakai, K., Kanehisa, M.: A knowledge base for predicting protein localization sites in
eukaryotic cells. Genomics. 14 (1992) 897–911
12. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization of proteins. In: Proc. of the Int. Conf. on Intelligent Systems for Molecular Biology (ISMB). (1996)
13. Horton, P., Nakai, K.: Better prediction of protein cellular localization sites with the k nearest neighbors classifier. In: Proc. of the Int. Conf. on Intelligent Systems for Molecular Biology (ISMB). (1997)
14. Emanuelsson, O., Nielsen, H., von Heijne, G.: ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Science 8 (1999) 978–984
15. Bannai, H., Tamada, Y., Maruyama, O., Nakai, K., Miyano, S.: Extensive feature
detection of N-terminal protein sorting signals. Bioinformatics. 18 (2002) 298–305
16. Nair, R., Rost, B.: Mimicking cellular sorting improves prediction of subcellular lo-
calization. J Mol Biol. 348 (2005) 85–100
17. Stapley, B.J., Kelley, L.A., Sternberg, M.J.E.: Predicting the subcellular location of proteins from text using support vector machines. In: Proc. of the Pacific Symposium on Biocomputing (PSB). (2002) 374–385
18. Eskin, E., Agichtein, E.: Combining text mining and sequence analysis to discover protein functional regions. In: Proc. of the 9th Pacific Symposium on Biocomputing (PSB). (2004) 288–299
19. Park, K.J., Kanehisa, M.: Prediction of protein subcellular location by support vector
machines using compositions of amino acids and amino acid pairs. Bioinformatics. 19
(2003) 1656–1663
20. Höglund, A., Dönnes, P., Blum, T., Adolph, H., Kohlbacher, O.: Using N-terminal
targeting sequences, amino acid composition, and sequence motifs for predicting protein
subcellular localization. German Conference on Bioinformatics (GCB) 2005.
21. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines (2003)
http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
22. Wu, T.F., Lin, C.J., Weng, R.C.: Probability Estimates for Multi-class Classification
by Pairwise Coupling. Journal of Machine Learning Research 5 (2004) 975–1005
23. Bairoch, A., Bucher, P.: PROSITE: recent developments. Nucleic Acids Res. 22
(1994) 3583–3589
24. Cokol, M., Nair, R., Rost, B.: Finding nuclear localization signals. EMBO Rep. 1
(2000) 411–415
25. Nair, R., Carter, P., Rost, B.: NLSdb: database of nuclear localization signals. Nucleic
Acids Res. 31 (2003) 397–399
26. Porter, M.F.: An Algorithm for Suffix Stripping (Reprint). In: Readings in Information
Retrieval. Morgan Kaufmann (1997)
http://www.tartarus.org/~martin/PorterStemmer/.
27. Walpole, R.E., Myers, R.H., Myers, S.L. In: One- and Two-Sample Tests of Hypotheses.
(1998) 235–335
28. Bairoch, A., Apweiler, R.: The SWISS-PROT protein sequence database and its supplement
in TrEMBL in 2000. Nucleic Acids Res. 28 (2000) 45–48
29. Matthews, B.W.: Comparison of predicted and observed secondary structure of T4
phage lysozyme. Biochim Biophys Acta. 405 (1975) 442–451
... This was one of the first applications of SVMs to biological journal abstracts and it showed that text and amino acid composition together yield better results than either alone. Properties of proteins themselves were again used to improve text categorization for animal, plant and fungi subcellular localization data sets (Höglund et al, 2006). The authors' text classifiers were based on the most distinguishing terms of documents, and they included the output of four protein sequence classifiers in their training data. ...
... The text only classifier reported by Höglund et al. has F-measure in the range [0.449,0.851] (Höglund et al, 2006) and the text only classifiers presented by Stapley et al. begin with a baseline classifier with F-measure in the range [0.31,0.80] (Stapley et al, 2002). ...
... Though we use different data sets than Höglund et al. (2006), we compare our results to theirs on a class by class basis. For those 7 localization classes for which we both make predictions, the F-measures of our classifiers trained on Data Set 3 exceed the F-measures of the Höglund et al. text only classifiers in all cases, and our Data Set 3 classifier beats the F-measure of the MultiLocText classifier for 5 classes (see supplementary material http://www.cs. ...
... Hagit's expertise made her a 'go to person' for tackling a variety of real-life problems. For example, developing methods for prediction of protein localization of eukaryotic proteins which integrated several types of sequence and text-based features in collaboration with Dr Kohlbacher's lab (University of Tübingen, Germany) (Briesemeister et al., 2009; Höglund et al., 2006; Shatkay et al., 2007; Simha et al., 2015); applying machine learning to improve the information that can be gleaned from 'fuzzy' EKG and echocardiogram images to help predict which hypertrophic cardiomyopathy patients would develop arrhythmias (Bhattacharya et al., 2019, 2021; Rahman et al., 2014, 2015) in collaboration with Dr Maria Roselle Abraham, MD (Johns Hopkins Medicine Heart and Vascular Institute); and more recently, developing computational methods to analyze data collected through noisy sensors to accelerate data-intensive discovery in astro-particle physics (Stewart, 2019), in an NSF funded project codirected with Dr Christopher Tunnell (assistant professor of physics, astronomy and computer science at Rice University) and Dr Waheed Bajwa (associate professor of electrical and computer engineering and of statistics at Rutgers University). ...
...
• Extraction of keywords and functional annotation of proteins (Andrade and Valencia, 1998)
• Generating gene summaries (e.g. sequence information, phenotypes, interactions) (Ling et al., 2006)
• Predicting the subcellular localization of proteins (Stapley et al., 2002), in conjunction with protein sequence based features (Hoglund et al., 2006)
• Annotation of enzyme classes with disease-related information (Hofmann and Schomburg, 2005)
• Finding protein-protein interactions (see Chapter 5)
...
A concise introduction to fundamental methods for finding and extracting relevant information from the ever-increasing amounts of biomedical text available. © 2012 Massachusetts Institute of Technology. All rights reserved.
Recognized as an essential component of Chinese culture, Traditional Chinese Medicine (TCM) is both an ancient medical system and one still used widely in China today. TCM's independently evolved knowledge system is expressed mainly in the Chinese language and the information is frequently only available through ancient classics and confidential family records, making it difficult to utilize. The major concern in TCM is how to consolidate and integrate the data, enabling efficient retrieval and discovery of novel knowledge from the dispersed data. Computational approaches such as data mining, semantic reasoning and computational intelligence have emerged as innovative approaches for the reservation and utilization of this knowledge system. Typically, this requires an inter-disciplinary approach involving Chinese culture, computer science, modern healthcare and life sciences. This book examines the computerization of TCM information and knowledge to provide intelligent resources and supporting evidences for clinical decision-making, drug discovery, and education. Recent research results from the Traditional Chinese Medicine Informatics Group of Zhejiang University are presented, gathering in one resource systematic approaches for massive data processing in TCM. These include the utilization of modern Semantic Web and data mining methods for more advanced data integration, data analysis and integrative knowledge discovery. This book will appeal to medical professionals, life sciences students, computer scientists, and those interested in integrative, complementary, and alternative medicine. Interdisciplinary book bringing together Traditional Chinese Medicine and computer scientists. Introduces novel network technologies to Traditional Chinese Medicine informatics. Provides theory and practical examples and case studies of new techniques. © 2012 Zhejiang University Press Co., Ltd. Published by Elsevier Inc. All rights reserved.
The fungus Rhizopus is frequently used for fermentative production of lactic acid, but little is known about the mechanisms or proteins for transporting this carboxylic acid. Since transport of the lactate anion across the plasma membrane is critical to prevent acidification of the cytoplasm, we evaluated the functionality of two lactate-proton symport paralogs, LacA and LacB, from Rhizopus delemar. Both of these proteins showed significant ancestral homology to bacterial lactate permease with 46–50% identity to similar homologs from the genus Burkholderia. Based on qPCR, the highest level of expression in Rhizopus for the lacA gene was on complex medium containing pyruvate, while lacB transcript was barely detected with all of the tested culture conditions. A Saccharomyces cerevisiae jen1 deletion strain lacking the ability to transport monocarboxylates was restored for growth on lactate and pyruvate with the expression of LacA. Expression of the LacB in this same strain did not confer the ability to grow on either carbon source. LacA expression also allowed active transport of L-[14C(U)]-lactate into yeast cells and this accumulation was inhibited by the proton uncoupler carbonyl cyanide p-(trifluoromethoxy) phenylhydrazone. Translation fusions with GFP showed that LacA accumulates primarily in the plasma and vacuolar membrane, while LacB is dispersed throughout the cytoplasm. These results indicate that the Rhizopus LacA is a functional lactate symport that is probably involved in uptake of pyruvate or lactate, while the physiological role of LacB is unknown.
Membrane bioreactor (MBR) technology is advancing rapidly around the world both in research and commercial applications. Despite the increasing number of studies and full-scale applications of MBR systems, directions and trends in academic research as well as commercial developments require further analysis. This paper aims to critically characterize and review worldwide academic research efforts in the area of MBRs as well as focus attention to commercial MBR applications in North America. A total of 339 research papers published in peer-reviewed international journals from 1991 to 2004 and a total of 258 full-scale MBR installations in North America were used as the database for the analysis provided in this paper. After a surge of MBR publications in 2002, research appears to have reached a plateau in the last 3 years using both submerged and external MBR units. Although much of the pioneering research occurred in Japan, France and the UK, countries such as South Korea, China and Germany have significantly contributed to the research pool in the last 5 years. The primary research focus has been on water filtration MBRs with limited growth in extractive and gas diffusion MBRs which still hold un-tapped potential. Fundamental aspects studied in academic research predominantly involve issues related to fouling, microbial characterization and optimizing operational performance. Research in North America presents a unique picture as a higher ratio of industrial wastewater treatment and side-stream MBR applications have been studied compared to other parts of the world. For MBR commercial application, the North America installations constitute about 11% of worldwide installations. Zenon occupies the majority of the MBR market in North America, whereas Kubota and Mitsubishi-Rayon have a larger number of installations in other parts of the world. 
Due to more stringent regulations and water reuse strategies, it is expected that a significant increase in MBR plant capacity and widening of application areas will occur in the future. Potential application areas include nitrate removal in drinking water treatment, removal of endocrine disrupting compounds from water and wastewater streams, enhancing bio-fuels production via membrane assisted fermentation and gas extraction and purification MBRs.
Functional annotation of unknown proteins is a major goal in proteomics. A key step in this annotation process is the definition of a protein's subcellular localization. As a consequence, numerous prediction techniques for localization have been developed over the years. These methods typically focus on a single underlying biological aspect or predict a subset of all possible subcellular localizations. There is a clear need for new methods that utilize and represent available protein specific biological knowledge from several sources, in order to improve accuracy and localization coverage for a wide range of organisms. Here we present a novel Support Vector Machine (SVM)-based approach for predicting protein subcellular localization, which integrates information about N-terminal targeting sequences, amino acid composition, and protein sequence motifs. An important step is taken towards emulating the protein sorting process by capturing and bringing together biologically relevant information. Our novel approach has been used to develop two new prediction methods, TargetLoc and MultiLoc. TargetLoc is restricted to analysis of proteins containing N-terminal targeting sequences, whereas MultiLoc covers all major eukaryotic subcellular localizations for animal, plant, and fungal proteins. Compared to similar methods, TargetLoc performs better than these. MultiLoc performs considerably better than comparable prediction methods predicting all major eukaryotic subcellular localizations, and shows better or comparable results to methods that are specialized on fewer localizations or for one organism.
SWISS-PROT (http://www.expasy.ch/) is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to two additional databases; a variety of new documentation files and improvements to TrEMBL, a computer annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT.
Automated prediction of bacterial protein subcellular localization is an important tool for genome annotation and drug discovery. PSORT has been one of the most widely used computational methods for such bacterial protein analysis; however, it has not been updated since it was introduced in 1991. In addition, neither PSORT nor any of the other computational methods available make predictions for all five of the localization sites characteristic of Gram-negative bacteria. Here we present PSORT-B, an updated version of PSORT for Gram-negative bacteria, which is available as a web-based application at http://www.psort.org. PSORT-B examines a given protein sequence for amino acid composition, similarity to proteins of known localization, presence of a signal peptide, transmembrane alpha-helices and motifs corresponding to specific localizations. A probabilistic method integrates these analyses, returning a list of five possible localization sites with associated probability scores. PSORT-B, designed to favor high precision (specificity) over high recall (sensitivity), attained an overall precision of 97% and recall of 75% in 5-fold cross-validation tests, using a dataset we developed of 1443 proteins of experimentally known localization. This dataset, the largest of its kind, is freely available, along with the PSORT-B source code (under GNU General Public License).
Predictions of the secondary structure of T4 phage lysozyme, made by a number of investigators on the basis of the amino acid sequence, are compared with the structure of the protein determined experimentally by X-ray crystallography. Within the amino terminal half of the molecule the locations of helices predicted by a number of methods agree moderately well with the observed structure, however within the carboxyl half of the molecule the overall agreement is poor. For eleven different helix predictions, the coefficients giving the correlation between prediction and observation range from 0.14 to 0.42. The accuracy of the predictions for both beta-sheet regions and for turns are generally lower than for the helices, and in a number of instances the agreement between prediction and observation is no better than would be expected for a random selection of residues. The structural predictions for T4 phage lysozyme are much less successful than was the case for adenylate kinase (Schulz et al. (1974) Nature 250, 140-142). No one method of prediction is clearly superior to all others, and although empirical predictions based on larger numbers of known protein structure tend to be more accurate than those based on a limited sample, the improvement in accuracy is not dramatic, suggesting that the accuracy of current empirical predictive methods will not be substantially increased simply by the inclusion of more data from additional protein structure determinations.
To automate examination of massive amounts of sequence data for biological function, it is important to computerize interpretation based on empirical knowledge of sequence-function relationships. For this purpose, we have been constructing a knowledge base by organizing various experimental and computational observations as a collection of if-then rules. Here we report an expert system, which utilizes this knowledge base, for predicting localization sites of proteins only from the information on the amino acid sequence and the source origin. We collected data for 401 eukaryotic proteins with known localization sites (subcellular and extracellular) and divided them into training data and testing data. Fourteen localization sites were distinguished for animal cells and 17 for plant cells. When sorting signals were not well characterized experimentally, various sequence features were computationally derived from the training data. It was found that 66% of the training data and 59% of the testing data were correctly predicted by our expert system. This artificial intelligence approach is powerful and flexible enough to be used in genome analyses.
We have developed an expert system that makes use of various kinds of knowledge organized as "if-then" rules for predicting protein localization sites in Gram-negative bacteria, given the amino acid sequence information alone. We considered four localization sites: the cytoplasm, the inner (cytoplasmic) membrane, the periplasm, and the outer membrane. Most rules were derived from experimental observations. For example, the rule to recognize an inner membrane protein is the presence of either a hydrophobic stretch in the predicted mature protein or an uncleavable N-terminal signal sequence. Lipoproteins are first recognized by a consensus pattern and then assumed present at either the inner or outer membrane. These two possibilities are further discriminated by examining an acidic residue in the mature N-terminal portion. Furthermore, we found an empirical rule that periplasmic and outer membrane proteins were successfully discriminated by their different amino acid composition. Overall, our system could predict 83% of the localization sites of proteins in our database.
We have developed a large-scale screen to identify genes expressed at different times during the life cycle of Saccharomyces cerevisiae and to determine the subcellular locations of many of the encoded gene products. Diploid yeast strains containing random lacZ insertions throughout the genome have been constructed by transformation with a mutagenized genomic library. Twenty-eight hundred transformants containing fusion genes expressed during vegetative growth and 55 transformants containing meiotically induced fusion genes have been identified. Based on the frequency of transformed strains producing beta-galactosidase, we estimate that 80-86% of the yeast genome (excluding the rDNA) contains open reading frames expressed in vegetative cells and that there are 93-135 meiotically induced genes. Indirect immunofluorescence analysis of 2373 strains carrying fusion genes expressed in vegetative cells has identified 245 fusion proteins that localize to discrete locations in the cell, including the nucleus, mitochondria, endoplasmic reticulum, cytoplasmic dots, spindle pole body, and microtubules. The DNA sequence adjacent to the lacZ gene has been determined for 91 vegetative fusion genes whose products have been localized and for 43 meiotically induced fusions. Although most fusions represent genes unidentified previously, many correspond to known genes, including some whose expression has not been studied previously and whose products have not been localized. For example, Sec21-beta-gal fusion proteins yield a Golgi-like staining pattern, Ty1-beta-gal fusion proteins localize to cytoplasmic dots, and the meiosis-specific Mek1/Mre4-beta-gal and Spo11-beta-gal fusion proteins reside in the nucleus. The phenotypes in haploid cells have been analyzed for 59 strains containing chromosomal fusion genes expressed during vegetative growth; 9 strains fail to form colonies indicating that the disrupted genes are essential. 
Fifteen additional strains display slow growth or are impaired for growth on specific media or in the presence of inhibitors. Of 39 meiotically induced fusion genes examined, 14 disruptions confer defects in spore formation or spore viability in homozygous diploids. Our results will allow researchers who identify a yeast gene to determine immediately whether that gene is expressed at a specific time during the life cycle and whether its gene product localizes to a specific subcellular location.