
Significantly improved prediction of subcellular localization by integrating text and protein sequence data

February 2006
Pacific Symposium on Biocomputing 11:16-27


DOI:10.1142/9789812701626_0003


Authors:

Torsten Blum

University of Tuebingen

Pierre Dönnes

SciCross AB


Computational prediction of protein subcellular localization is a challenging problem. Several approaches have been presented during the past few years; some attempt to cover a wide variety of localizations, while others focus on a small number of localizations and on specific organisms. We present a comprehensive system, integrating protein sequence-derived data and text-based information. It is tested on three large data sets, previously used by leading prediction methods. The results demonstrate that our system performs significantly better than previously reported results, for a wide range of eukaryotic subcellular localizations.




Annette Höglund, Torsten Blum, Scott Brady, Pierre Dönnes, John San Miguel, Matthew Rocheford, Oliver Kohlbacher, and Hagit Shatkay

Pacific Symposium on Biocomputing 11:16-27 (2006)

September 17, 2005 23:39 Proceedings Trim Size: 9in x 6in PSBHoeglundShatkay

SIGNIFICANTLY IMPROVED PREDICTION OF SUBCELLULAR LOCALIZATION BY INTEGRATING TEXT AND PROTEIN SEQUENCE DATA

ANNETTE HÖGLUND, TORSTEN BLUM, SCOTT BRADY, PIERRE DÖNNES, JOHN SAN MIGUEL, MATTHEW ROCHEFORD, OLIVER KOHLBACHER, HAGIT SHATKAY

Div. for Simulation of Biological Systems, ZBIT/WSI,

University of Tübingen, Sand 14, D-72076 Tübingen, Germany

School of Computing, Queen’s University,

Kingston, Ontario, Canada K7L 3N6


1. Introduction

In this paper we introduce a new system for computationally assigning proteins to their subcellular localization. By integrating several types of sequence-derived features and text-based information, the system achieves the best performance reported so far in terms of sensitivity, specificity, and overall accuracy. Unlike several recent systems, which focus on a few subcellular localizations or on a specific organism, our system is applicable to, and retains its good performance across, a wide variety of organisms and subcellular localizations. Moreover, we show that the integrated system, which combines sequence and text, performs significantly better than its individual components, each based on one data source alone.

The task of protein subcellular localization prediction is important and well-studied. Knowing a protein’s localization helps elucidate its function, its role in both healthy processes and in the onset of disease, and its potential use as a drug target. Experimental methods for protein localization range from immunolocalization to tagging of proteins using green fluorescent protein (GFP) and isotopes. Such methods are accurate but, even at their best, are slow and labor-intensive compared with large-scale computational methods. Computational tools for predicting localization are useful for a large-scale initial “triage”, especially for proteins whose amino acid sequence may be determined from the genomic sequence, but which are hard to produce, isolate, or locate experimentally.

To whom correspondence should be addressed: shatkay@cs.queensu.ca. HS is supported by NSERC Discovery grant 298292-04.

The past decade, and most notably the last five years, has seen much progress in computational prediction of protein localization from sequence data. Nakai and Kanehisa introduced PSort, a rule-based expert system, which was later improved upon by a probabilistic classifier and by a K-nearest neighbor classifier. Another pair of prominent systems, TargetP and ChloroP, based on artificial neural networks, demonstrated a significantly higher accuracy when applied to a limited set of subcellular localizations in plant and animal cells. Other recent systems use a variety of machine learning techniques. Most of them focus on a few subcellular localizations and improve upon, or just meet, the state of the art on those.

Several recent publications have examined the possibility of using text to support subcellular localization. Specifically, Stapley et al. represented yeast proteins as vectors of weighted terms from all the PubMed articles mentioning their respective genes. They then trained a support vector machine (SVM) on these protein-text vectors to distinguish among subcellular localizations. The performance was favorable when compared to a classifier trained on amino acid composition alone, but it was not compared against any state-of-the-art localization system, and the reported results do not suggest an improvement over earlier systems. Moreover, while their text-based classifier performed better than an amino acid composition classifier, combining the two forms of data did not significantly improve performance with respect to the text-based classifier alone.

Nair and Rost used the text taken from Swiss-Prot annotations of proteins to represent these proteins, and trained a subcellular classifier using this representation. They concentrate on a few subcellular localizations, and report results that are compatible with, but do not improve upon, the state of the art at that time. Their work was elaborated upon by Eskin and Agichtein, who added subsequences from the protein’s amino acid sequence as part of the terms considered in the text representation. The system was not tested against existing systems or data sets, and the reported results do not indicate improvement over previous systems.

The best performing comprehensive systems reported so far, which were tested on a large set of proteins, are PLOC and, more recently, MultiLoc. While they report the best accuracy until now on a broad range of organisms and localizations, there is still room for improvement.

The work reported here, similarly to that reported by Nair and Rost, uses Swiss-Prot as a text source. Unlike them, though, we use the PubMed abstracts referenced by Swiss-Prot, rather than the annotation text placed by Swiss-Prot curators. Furthermore, unlike Stapley et al., who use all abstracts that contain the gene name for the protein, we use only abstracts that are referenced by Swiss-Prot. Moreover, rather than use all the terms in them with a standard (TF*IDF) weighting, as done by Stapley et al., we select terms based on a distinguishing criterion described in Section 2, and apply a probability-based weighting scheme. We train an SVM as a text-based classifier, and combine it with a sequence-based classifier, to produce a comprehensive subcellular categorizer. Our integrated system is tested on a number of publicly available, extensive, homology-reduced data sets which were used for evaluating earlier systems (TargetP, PLOC, and MultiLoc). For each system, we first conduct a comparison using the same data and the same subcellular localizations as reported in the paper published about that system. We then conduct a test using all the proteins in Swiss-Prot for which a subcellular annotation is assigned, among the 11 localizations: chloroplast, cytoplasm, endoplasmic reticulum, extracellular space, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. On each of the data sets our system performs better than the state-of-the-art systems in terms of overall prediction accuracy and other standard measures.

The next section outlines the methods used, while in Section 3 we demonstrate

the performance of our system. Section 4 concludes and outlines future work.

2. Methods

Our system combines five separate classifiers, four sequence-based and one text-based. Their output is integrated through a sixth classifier to produce an improved prediction of protein subcellular localization. The sequence-based classifiers have been successfully used before by the MultiLoc system and are briefly described below. Section 2.2 then presents the novel text-based method, while Section 2.3 explains how all these classifiers are combined to form an integrated prediction system. Four of the five classifiers are based on support vector machines (SVMs), using the LIBSVM implementation. The latter supports soft, probabilistic categorization for $n$-class tasks, assigning to each classified item an $n$-dimensional vector denoting its probability to belong to each of the $n$ classes. Radial Basis Function kernels were used throughout this study. Further details are given below.

2.1. Sequence-based methods

Each of the sequence-based classifiers utilizes a different approach to derive biologically informative features that can be used to predict localization, and classifies the input protein sequence to its respective localization using these features. Three of these classifiers are SVM-based. The fourth scans the protein sequences for short sequence motifs indicative of structure and function. The four classifiers are briefly described below (see the MultiLoc paper for further details).

TF*IDF is an acronym for Term Frequency times Inverse Document Frequency.

SVMTarget – This classifier uses the N-terminal targeting peptide (TP) to predict a few subcellular categories. It distinguishes among four plant localizations (chloroplast (ch), mitochondria (mi), secretory pathway (SP), and other (OT)) and three non-plant localizations (mi, SP, OT). The targeting peptides are represented by their partial amino acid composition, motivated by the observation that TPs for specific localizations have a similar amino acid composition while their actual sequence may differ. Given an input protein, the classifier outputs a three-dimensional vector (four-dimensional for plant) of class probabilities. SVMTarget alone demonstrated a slightly better performance than TargetP in a comparative study.

SVMSA – Some proteins of the secretory pathway carry a signal anchor (SA) that, unlike the targeting peptide, is usually located further away from the N-terminus and contains a longer hydrophobic component. SVMSA can predict secretory pathway (SP) proteins that are hard to detect using SVMTarget. It is a binary classifier, trained to distinguish proteins carrying an SA from those that do not. Given an input sequence, it outputs the probability that the sequence contains a signal anchor.

SVMaac – This method uses the whole protein amino acid composition (aac), and categorizes proteins into any of the possible localizations. It combines a collection of binary classifiers, each trained to distinguish one class from all others, although one classifier in the collection was especially trained to distinguish cytosolic (cy) from nuclear (nu) proteins, as these are hard to separate using the one-against-all approach. Given an input protein, $p$, with $n$ possible localizations, the classifier outputs an $n$-dimensional probability vector containing $p$’s probability to belong to each localization.
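The whole-protein amino acid composition used as input by this kind of classifier can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the restriction to the 20 standard residues, and the toy sequence are not taken from the paper, and the SVM that consumes the vector is not shown.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (assumed feature order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence for illustration only, not a real protein.
composition = amino_acid_composition("MKKLLA")
```

The resulting 20-dimensional vector sums to one, so proteins of different lengths are represented on the same scale.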

MotifSearch – Proteins from several subcellular localizations can be characterized by a few types of short sequence motifs, such as Nuclear Localization Signals and DNA-binding domains. The motifs were obtained from the PROSITE and NLSdb databases. This classifier outputs a discrete, binary vector, representing the presence (1) or the absence (0) of each type of motif in the query protein sequence.
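A presence/absence vector of this kind might be computed as in the sketch below. The two patterns are simplified, hypothetical stand-ins chosen for illustration (a run of basic residues and a C-terminal KDEL tail); they are not actual PROSITE or NLSdb entries, and the names `MOTIF_PATTERNS` and `motif_vector` are assumptions.

```python
import re

# Hypothetical, simplified motif patterns for illustration only;
# they are NOT the PROSITE or NLSdb patterns used by MotifSearch.
MOTIF_PATTERNS = {
    "basic_stretch": re.compile(r"[KR]{4,}"),  # four or more consecutive K/R residues
    "kdel_tail": re.compile(r"KDEL$"),         # sequence ends with KDEL
}


def motif_vector(sequence: str, patterns=MOTIF_PATTERNS) -> list[int]:
    """Return 1 for each motif type found in the sequence, 0 otherwise."""
    return [1 if pattern.search(sequence) else 0 for pattern in patterns.values()]


nuclear_like = motif_vector("MAPKKKRKVLL")  # toy sequence with a basic stretch
er_like = motif_vector("MSTAAKDEL")         # toy sequence ending in KDEL
```

Because the output is binary rather than probabilistic, it can be appended directly to the probability vectors produced by the other classifiers.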

2.2. Text-based method

The idea underlying the text-based classifier is the representation of each protein as a vector of weighted text features. While text-based localization has been presented before, the key differences between the current work and previous ones are in the text source used, the feature selection, and the term weighting scheme.

First, for each protein the text comes from the abstracts curated for the protein in its Swiss-Prot entry. We used a script that scanned each protein in Swiss-Prot for all the PubMed identifiers occurring in its Swiss-Prot entry, and obtained the respective title and abstract from PubMed. Each protein is thus assigned a set of PubMed abstracts, based on Swiss-Prot. This choice of abstracts is different from that of Stapley et al., who used all the PubMed abstracts mentioning the gene’s name, and from that of Nair and Rost, who use Swiss-Prot annotation text rather than PubMed abstracts. The assigned abstracts are then tokenized into a set of terms, consisting of single words and pairs of consecutive words, with a list of standard stop words excluded from consideration. The results reported here also include the application of Porter stemming to all the words in the terms.
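The tokenization step can be illustrated with a minimal sketch. Several details are assumptions rather than the authors' procedure: the tiny stop-word list is a placeholder for a standard one, word pairs are formed after stop words have been removed, and Porter stemming is omitted because it needs a separate stemmer implementation.

```python
import re

# Placeholder stop-word list; a standard, much longer list is assumed in practice.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "to", "by", "for", "with"}


def extract_terms(abstract: str) -> list[str]:
    """Return single words and pairs of consecutive words from an abstract."""
    words = re.findall(r"[a-z0-9]+", abstract.lower())
    # Assumption: stop words are dropped before consecutive pairs are formed.
    words = [word for word in words if word not in STOP_WORDS]
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs


terms = extract_terms("Histone binding in the nucleus")
```

With stemming applied, words such as "histone" would be reduced to stems such as "histon", matching the stemmed terms the paper lists as examples.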

Second, from all the extracted terms, we select a subset of distinguishing terms. This is done by scoring each term with respect to each subcellular localization, where the score reflects the probability of the term to occur in abstracts that are associated with proteins of this localization. Intuitively, a term $t$ is distinguishing for a localization $L$ if it is much more likely to occur in abstracts associated with localization $L$ than in abstracts associated with all other localizations. We formalize this idea in the following paragraphs.

Let $t$ be a term, $L$ a localization, and $p$ a protein. If protein $p$ is known to be localized in $L$, we denote this $p \in L$. We also define the following sets:

- The set of all PubMed abstracts associated with protein $p$ according to Swiss-Prot, denoted $D_p$;

- The set of all proteins known to be localized at $L$, denoted $P_L$;

- The set of abstracts that are associated with a localization $L$, denoted $D_L$, is defined as: $D_L = \bigcup_{p \in P_L} D_p$. It is the set of all the abstracts associated with the proteins that are in localization $L$. The number of documents in this set is denoted $|D_L|$.

The probability of a term $t$ to be associated with a localization $L$, denoted $p_t^L$, is defined as the conditional probability of the term to appear in a document $d$, given that the document is associated with the localization: $p_t^L = \Pr(t \in d \mid d \in D_L)$. A maximum likelihood estimate for this probability is simply the proportion of documents containing $t$ among all those associated with the localization:

$$p_t^L \approx \frac{\#\text{ of documents } d \in D_L \text{ such that } t \in d}{|D_L|}.$$

For each term $t$ and each localization $L$, the estimate for the probability $p_t^L$ is calculated.
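The maximum likelihood estimate described above is a plain document count, as the following sketch shows. The function name, the representation of each abstract as a set of terms, and the toy data are illustrative assumptions.

```python
def term_probability(term: str, abstracts: list[set[str]]) -> float:
    """Estimate the probability that `term` appears in an abstract of one localization.

    `abstracts` holds one set of terms per abstract associated with the localization.
    """
    if not abstracts:
        return 0.0
    containing = sum(1 for abstract_terms in abstracts if term in abstract_terms)
    return containing / len(abstracts)


# Toy abstracts associated with the nucleus (illustrative data only).
nuclear_abstracts = [{"dna", "histon"}, {"dna", "bind"}, {"promot"}, {"dna"}]
p_dna_nucleus = term_probability("dna", nuclear_abstracts)  # 3 of 4 abstracts contain "dna"
```

Each abstract counts once per term regardless of how often the term occurs in it, which is what makes the estimate a proportion of documents.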

Based on this probability, a term $t$ is called distinguishing for localization $L$ if and only if its probability to occur in localization $L$, $p_t^L$, is significantly different from its probability to occur in any other localization $L'$, $p_t^{L'}$. The statistical test applied uses the Z-score, which evaluates the difference between two binomial probabilities, $p_t^L$ and $p_t^{L'}$, as follows:

$$Z_t^{L,L'} = \frac{p_t^L - p_t^{L'}}{\sqrt{\bar{p}\,(1-\bar{p})\left(\frac{1}{|D_L|} + \frac{1}{|D_{L'}|}\right)}},$$

where

$$\bar{p} = \frac{|D_L|\, p_t^L + |D_{L'}|\, p_t^{L'}}{|D_L| + |D_{L'}|}.$$

When $Z_t^{L,L'}$ exceeds the critical value associated with a chosen confidence level (for instance, 1.96 for 95%), the hypothesis that the two probabilities $p_t^L$ and $p_t^{L'}$ are different is accepted with a confidence greater than that level. Therefore, if the term $t$ has a localization $L$ such that $Z_t^{L,L'}$ exceeds this critical value for any other localization $L'$, $t$ is considered distinguishing for localization $L$, and is included in the set of distinguishing terms. In our representation of proteins as term vectors, we use only distinguishing terms. In the experiments described in Section 3, using several different protein sets, the average number of PubMed abstracts is on the order of , while that of distinguishing terms is about . Some examples of distinguishing terms for several localizations are shown in Table 1.

Table 1. Examples of distinguishing stemmed terms for several localizations

| Localization | Example Terms |
|---|---|
| Nucleus | bind, control, dna, histon, nuclear, promot, transcript |
| Mitochondria | coa (CoA), complex, cytochrom, dehydrogenas, mitochondri, oxidas, respiratori |
| Golgi Apparatus | acceptor, catalyt domain, fucosyltransferas, galactos, glycosyltransferas, golgi, transferas |
| Endoplasmic Reticulum | calcium, chaperon, disulfid isomeras, endoplasm, lumen, microsom, transmembran |

The PubMed titles and abstracts were used without any of the MeSH terms.
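The selection of distinguishing terms can be sketched as a pooled two-proportion Z-test applied to every pair of localizations. The sketch makes several assumptions that should be checked against the paper: the pooled-variance form of the Z-score, the one-directional comparison (the term must be more likely in the candidate localization), and the default threshold of 1.96, which is the conventional critical value for 95% confidence and not necessarily the value the authors used. All names and the toy numbers are illustrative.

```python
import math


def z_score(p1: float, n1: int, p2: float, n2: int) -> float:
    """Pooled two-proportion Z-score for probabilities p1, p2 estimated from n1, n2 documents."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    variance = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if variance == 0:
        return 0.0
    return (p1 - p2) / math.sqrt(variance)


def is_distinguishing(term_probs, doc_counts, loc, threshold=1.96):
    """True if the term's Z-score for `loc` exceeds `threshold` against every other localization.

    threshold=1.96 is an assumed default; use abs(z) instead of z for a two-sided test.
    """
    return all(
        z_score(term_probs[loc], doc_counts[loc], term_probs[other], doc_counts[other]) > threshold
        for other in term_probs
        if other != loc
    )


# Toy estimates for one term across three localizations (illustrative numbers only).
term_probs = {"nu": 0.30, "mi": 0.05, "cy": 0.10}
doc_counts = {"nu": 1000, "mi": 800, "cy": 1200}
distinguishing_for_nucleus = is_distinguishing(term_probs, doc_counts, "nu")
```

In this toy case the term is far more frequent in nuclear abstracts than elsewhere, so it is selected for the nucleus and for no other localization.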

Finally, once the collection of distinguishing terms, denoted as $T = \{t_1, \ldots, t_N\}$, was established, each protein $p$ is represented as an $N$-dimensional vector, where the weight $W_{t_i}^p$ at position $i$ (where $1 \le i \le N$) is the conditional probability of the term $t_i$ to appear in the abstracts associated with the protein $p$, given all the PubMed abstracts related to the protein (the set $D_p$). This probability is estimated as the ratio between the total number of times the term $t_i$ occurs in the abstracts associated with the protein and the total number of all the occurrences of distinguishing terms in these same abstracts. Formally it is calculated as:

$$W_{t_i}^p = \frac{\sum_{d \in D_p} \left(\#\text{ of times } t_i \text{ occurs in } d\right)}{\sum_{d \in D_p} \sum_{t_j \in T} \left(\#\text{ of times } t_j \text{ occurs in } d\right)},$$

where the sums are taken over all the abstracts $d$ in the set of abstracts associated with the protein $p$, $D_p$.
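The weighting scheme, a ratio of term occurrences to all occurrences of distinguishing terms in a protein's abstracts, can be sketched as follows. The function name, the input layout (one list of terms per abstract), and the toy data are assumptions for illustration; the list of distinguishing terms is assumed to contain no duplicates.

```python
from collections import Counter


def term_weights(protein_abstracts: list[list[str]], distinguishing_terms: list[str]) -> list[float]:
    """Weight each distinguishing term by its share of all distinguishing-term occurrences."""
    wanted = set(distinguishing_terms)
    counts = Counter()
    for abstract_terms in protein_abstracts:
        # Only occurrences of distinguishing terms contribute to the counts.
        counts.update(term for term in abstract_terms if term in wanted)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(distinguishing_terms)
    return [counts[term] / total for term in distinguishing_terms]


# Two toy abstracts for one protein; "bind" and "cell" are not distinguishing here.
weights = term_weights(
    [["dna", "bind", "dna", "cell"], ["histon", "dna"]],
    ["dna", "histon", "golgi"],
)
```

Unlike the document-level estimate used for term selection, this weight counts every occurrence of a term, so repeated mentions within one abstract increase it.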

The proteins, represented as weighted term vectors, are then partitioned into training and test sets for each subcellular localization, and as before, an SVM is trained to classify these protein vectors into their respective localizations. This classifier, like SVMaac described above, produces an $n$-dimensional probability vector denoting the probability of the protein to be in each of the $n$ localizations.


2.3. Integrated method

The output from the five classifiers above is a set of four probability vectors and one binary-valued vector (resulting from MotifSearch). These are all concatenated to form one integrated feature vector for each protein. Again, an SVM classifier is trained on these feature vectors to produce a prediction. This classifier consists of a set of one-against-one classifiers (each of which distinguishes between a pair of localizations) and its output, yet again, is a probabilistic vector, holding for each localization the probability of the protein to belong to it. Based on this final classification step, a protein is assigned to the localization with the highest probability value in the last output vector. The training and evaluation procedure uses strict five-fold cross-validation, where no test protein was used to train any of the classifiers comprising the system.
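The two mechanical steps around the final classifier, building the integrated feature vector and picking the most probable localization, can be sketched as below. The final SVM itself is not shown. The function names, the argument order, the treatment of the signal-anchor probability as a one-element vector, and all numbers are illustrative assumptions.

```python
def integrate(svm_target, svm_sa, svm_aac, motif_bits, text_probs):
    """Concatenate the five classifier outputs into one feature vector.

    svm_sa is a single probability and is treated as a one-element vector.
    """
    return list(svm_target) + [svm_sa] + list(svm_aac) + list(motif_bits) + list(text_probs)


def assign_localization(final_probs, localizations):
    """Return the localization with the highest probability in the final output vector."""
    best = max(range(len(final_probs)), key=final_probs.__getitem__)
    return localizations[best]


# Toy outputs for one non-plant protein (illustrative numbers only).
features = integrate(
    svm_target=[0.1, 0.7, 0.2],
    svm_sa=0.05,
    svm_aac=[0.2, 0.5, 0.3],
    motif_bits=[1, 0],
    text_probs=[0.1, 0.6, 0.3],
)
predicted = assign_localization([0.2, 0.7, 0.1], ["cy", "mi", "nu"])
```

In a strict cross-validation setting, the component classifiers that produce these inputs must themselves be trained without the test proteins, as the text above requires.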

3. Experiments and Results

To train and to evaluate our integrated system, we used three different data sets, namely those used for training and testing TargetP, MultiLoc, and PLOC. These sets provide the basis for an extensive and sound comparison. The data sets, the evaluation procedure, and the results are described throughout this section.

3.1. Experimental setting

The data sets used in our experiments are the following:

TargetP – This data set contains a total of 3,415 distinct proteins representing four plant (ch, mi, SP, and OT) and three non-plant (mi, SP, and OT) localizations. Homologs were removed from it by the TargetP authors. The SP category includes proteins from several localizations in the secretory pathway: endoplasmic reticulum (er), extracellular space (ex), Golgi apparatus (go), lysosome (ly), plasma membrane (pm), and vacuole (va). The OT category includes cy and nu proteins.

MultiLoc – The MultiLoc data set contains a total of 5,959 protein sequences, which were extracted from the Swiss-Prot database release 42.0. Animal, fungal, and plant proteins with an annotated subcellular localization (excluding proteins whose annotation was commented “by similarity” or “potential”) were grouped into eleven eukaryotic localizations: cy, ch, er, ex, go, ly, mi, nu, peroxisome (pe), pm, and va. In the experiments reported here homologous proteins with identity higher than 80% (the same threshold used by PLOC) were excluded from the set, to avoid the occurrence of highly similar sequences in both the training and the test sets. Further details about the data set extraction and the implications of homology reduction are available in the MultiLoc publication.

We also conducted experiments with a more lenient and more stringent homology constraints, of

and identity, respectively (data not shown).


PLOC – The PLOC data set was used by Park and Kanehisa and consists of proteins extracted from Swiss-Prot release 39.0, covering 12 localizations. In contrast to MultiLoc (aside from the older Swiss-Prot version), this data set introduces an additional category within the cy proteins, namely, the cytoskeleton (cs). There are 41 cs proteins, compared to 1,245 cy proteins. The total number of sequences is 7,579 (max. sequence identity 80%). This set is larger than the MultiLoc data set due to a less restrictive data extraction, which assigns proteins to a localization even when the localization annotation includes the words “potential” or “by similarity”.

Using these three data sets, the performance of our integrated system is compared to that of TargetP, PLOC, and MultiLoc. In addition, we also compare the performance of the integrated system to that of an SVM classifier applied to the text data alone. Following previous evaluations, we consistently employ five-fold cross-validation. For comparison on the PLOC data set we use the same split as the one used by Park and Kanehisa. For the TargetP data, as the split used by Emanuelsson et al. was not provided, we ran the five-fold cross-validation procedure five times, each using a different randomized five-way split, to ensure robustness. The reported results are averaged over all the 5 folds, and over the 5 randomized splits when those are used.
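The repeated five-fold procedure used for the TargetP data can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: the splits here are plain random partitions (no stratification by localization is assumed), the seeds are arbitrary, and `evaluate` is a hypothetical callback that trains on one index list and returns a score for the other.

```python
import random


def five_fold_splits(n_items: int, seed: int):
    """Yield (train, test) index lists for one randomized five-way split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]  # five nearly equal folds
    for k in range(5):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def repeated_cv(n_items: int, evaluate, repeats: int = 5) -> float:
    """Average `evaluate(train, test)` over all folds of `repeats` randomized splits."""
    scores = [
        evaluate(train, test)
        for seed in range(repeats)
        for train, test in five_fold_splits(n_items, seed)
    ]
    return sum(scores) / len(scores)


# Dummy evaluation that only reports the test-set fraction, to show the mechanics.
mean_test_fraction = repeated_cv(23, lambda train, test: len(test) / 23)
```

Averaging over several randomized splits reduces the dependence of the reported accuracy on any single partition of the data.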

Since the performance of previous systems was evaluated using several different metrics, for a fair comparison we calculated these same performance measures. Thus, for each system and data set the performance is measured, for each localization, in terms of the sensitivity (Sens), specificity (Spec), and Matthews correlation coefficient (MCC). These are defined as:

$$\mathrm{Sens} = \frac{TP}{TP + FN}, \qquad \mathrm{Spec} = \frac{TP}{TP + FP},$$

and

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}},$$

where $TP$, $TN$, $FP$, and $FN$ denote the number of true positives, true negatives, false positives, and false negatives, respectively, with respect to a given localization. Like Park and Kanehisa, we also measure the overall accuracy, namely, $\mathrm{Acc} = C/N$, where $C$ is the number of correctly classified proteins over all the localizations, and $N$ is the total number of classified proteins. They also measured the average sensitivity over all the localizations, a metric they call local accuracy, which we calculate as well. This last measure, which we denote as Avg, gives an equal weight to the categorization performance on each localization, regardless of the number of proteins known to be associated with it.
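The evaluation measures can be computed from true and predicted labels as in the sketch below. One point is an assumption to verify against the paper: Spec is taken here as TP/(TP+FP), the convention common in the localization-prediction literature, rather than TN/(TN+FP). The function names and the toy labels are illustrative.

```python
import math


def per_class_metrics(true_labels, predicted_labels, loc):
    """Return (Sens, Spec, MCC) for one localization; Spec is assumed to be TP/(TP+FP)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == loc and p == loc)
    tn = sum(1 for t, p in pairs if t != loc and p != loc)
    fp = sum(1 for t, p in pairs if t != loc and p == loc)
    fn = sum(1 for t, p in pairs if t == loc and p != loc)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc


def overall_metrics(true_labels, predicted_labels):
    """Return (Acc, Avg): overall accuracy and the mean per-localization sensitivity."""
    locs = sorted(set(true_labels))
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    acc = correct / len(true_labels)
    avg = sum(per_class_metrics(true_labels, predicted_labels, loc)[0] for loc in locs) / len(locs)
    return acc, avg


# Toy labels: four nuclear and two mitochondrial proteins (illustrative only).
truth = ["nu", "nu", "nu", "nu", "mi", "mi"]
predicted = ["nu", "nu", "nu", "mi", "mi", "nu"]
nu_metrics = per_class_metrics(truth, predicted, "nu")
acc, avg = overall_metrics(truth, predicted)
```

In the toy example Acc (4 of 6 correct) is higher than Avg (the mean of 0.75 and 0.5), which shows how Avg gives the smaller class the same weight as the larger one.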

Comparison to PSort is not included here, since MultiLoc has already demonstrated a higher prediction accuracy compared to this method.


3.2. Results

We present the results of running the sequence-based system, MultiLoc, the text-based classifier alone (denoted Text), and the integrated system (denoted MultiLocText), on all three data sets. For completeness, we also present the results reported by the authors of PLOC and of TargetP on the respective data sets. These numbers were taken directly from the respective publications.

Table 2 summarizes the results, showing the overall accuracy (Acc) and the

average local accuracy (Avg) for both the TargetP and the PLOC data sets. For

TargetP the results are shown for plant and non-plant proteins, while for PLOC

results are shown for plant, animal, and fungal proteins. Table 3 compares the

performance of TargetP and PLOC with our integrated system, with respect to the

individual subcellular localizations.

Table 2. An overview of the prediction results using the TargetP and PLOC data sets. Both the total (Acc) and the average (Avg) prediction accuracies are shown for all the methods. The highest values appear in bold. Standard deviations (denoted ±) are provided where available. Each cell reads Acc [%] (± standard deviation) / Avg [%] (± standard deviation).

TargetP data set

| Method | Plant | Non-Plant |
|---|---|---|
| TargetP | 85.3 (±3.5) / 85.6 (n/a) | 90.0 (±0.7) / 90.7 (n/a) |
| MultiLoc | 89.7 (±1.6) / 90.2 (±2.0) | 92.5 (±1.2) / 92.8 (±1.1) |
| Text | 81.2 (±2.6) / 78.1 (±3.2) | 88.7 (±1.1) / 89.8 (±1.6) |
| MultiLocText | **94.7** (±1.5) / **94.4** (±1.6) | **96.2** (±0.8) / **96.7** (±0.9) |

PLOC data set

| Method | Plant | Animal | Fungal |
|---|---|---|---|
| PLOC | 78.2 (±0.9) / 57.9 (±2.1) | 79.6 (±0.9) / 59.9 (±3.3) | 79.5 (±0.9) / 56.8 (±1.9) |
| MultiLoc | 73.6 (±0.7) / 71.3 (±2.8) | 76.0 (±0.7) / 73.6 (±3.9) | 75.8 (±0.8) / 72.5 (±2.5) |
| Text | 68.7 (±0.7) / 73.5 (±1.8) | 70.2 (±0.7) / 75.5 (±2.7) | 67.8 (±0.5) / 72.4 (±2.6) |
| MultiLocText | **85.3** (±1.2) / **84.2** (±2.4) | **86.4** (±0.8) / **84.5** (±3.6) | **85.4** (±0.8) / **83.8** (±2.8) |

Table 3. Localization-specific results using the TargetP (left) and the PLOC (right) data sets. For both sets, the results reported in the respective papers are compared to results of our integrated system (MultiLocText). As PLOC localization-specific results are averaged over all three organisms, we show such averaged results for our system as well. Specificity and MCC values were not available for PLOC, hence only its sensitivity is listed and compared with our sensitivity values. The highest compared values for each data set are shown in bold.

[Table 3 body. TargetP data set: columns Loc, TargetP (Sens, Spec, MCC), MultiLocText (Sens, Spec, MCC); plant rows ch, mi, OT, SP; non-plant rows mi, OT, SP. PLOC data set: columns Loc, PLOC (Avg. Sens), MultiLocText (Avg. Sens, Spec, MCC); rows ch, mi, cs, cy, er, ex, go, nu.]

A comparison of the performance of our three systems (MultiLoc alone, Text alone, and the integrated MultiLocText) using five-fold cross-validation over the 5,959 proteins of the MultiLoc data set is presented in Table 4. The sensitivity (Sens), specificity (Spec), and Matthews correlation coefficient (MCC) values for the plant and animal versions are listed. (Similar results were obtained for the fungal version, and are not shown here due to space limitations.)

The results in Tables 2, 3, and 4 clearly show that the combined classifier, which integrates text and sequence data, outperforms earlier prediction methods. It also outperforms its own text-based (Text) and sequence-based (MultiLoc) components, if taken separately. A significance test was performed to evaluate the differences between the values obtained from MultiLocText and those obtained from each of MultiLoc and Text alone (Table 4). The improved performance values of MultiLocText are highly statistically significant for almost all the subcellular localizations. The only exceptions are the Golgi (animal and plant), where there is no significant difference in sensitivity with respect to text alone, as well as the peroxisome predictions (animal and plant), where MultiLocText does not outperform the text-alone system.

4. Discussion and Conclusion

The methods, experiments, and results presented here clearly demonstrate a significant improvement in the prediction of protein subcellular localization through the integration of sequence- and text-based methods. Table 4 shows that the two types of methods distinctly complement each other. MultiLoc, which is based on sequence data, typically performs well predicting protein localizations that are directed by N-terminal signals such as the mitochondria and the chloroplast. The use of text information complements and significantly boosts its performance for localizations whose sequence-based signal is not as overt, including the peroxisome and localizations related to the secretory pathway such as the Golgi apparatus and the endoplasmic reticulum.

Table 4. Prediction performance of MultiLoc, Text, and MultiLocText on the MultiLoc data set. Both localization-specific values (Sens, Spec, MCC) and overall results (Acc and Avg) are shown. Highest values appear in bold.

[Table 4 body. Columns: Loc, MultiLoc, Text, MultiLocText, each with Sens, Spec, MCC per localization, for the plant and animal versions. Recoverable values, plant version: Acc [%] 74.6, 73.1; Avg [%] 75.2, 76.0.]

In this work we have demonstrated, using five-fold cross-validation, that our system can reproduce, with unprecedented sensitivity and specificity, localizations of proteins which were already annotated in Swiss-Prot. A natural next step is to apply the method to as yet unlocalized proteins. We are developing the means to predict subcellular localization of proteins for which PubMed references exist in Swiss-Prot but no localization is assigned, as well as for those with no curated PubMed reference. Our current use of “raw text” from PubMed abstracts (in contrast, for instance, to the use of Swiss-Prot annotation text as was done before) is expected to make our approach amenable to such extensions. We are also investigating methods for the localization of proteins with no PubMed references, through the use of alternative data sources.

References

1. Emanuelsson, O., Nielsen, H., Brunak, S., von Heijne, G.: Predicting subcellular lo-

calization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 300

(2000) 1005–1016

2. Nair, R., Rost, B.: Inferring sub-cellular localization through automated lexical anal-

ysis. Bioinformatics 18 (2002) S78–S86

3. Gardy, J.L., Spencer, C., Wang, K. el al.: PSORT-B: Improving protein subcellular

localization prediction for gram-negative bacteria. Nucleic Acids Research 31 (2003)

137–140

4. Cai, Y.D., Chou, K.C.: Predicting 22 protein localizations in budding yeast. Biochem

Biophys Res Commun. 323 (2004) 425–428

5. Schneider, G., Fechner, U.: Advances in the prediction of protein targeting signals.

Proteomics 4(2004) 1571–1580

6. D¨onnes, P., H¨oglund, A.: Predicting Protein Subcellular Localization: Past, Present,

and Future. Genomics, Proteomics, and Bioinformatics 2(2004)

7. Burns, N., Grimwade, B., Ross-Macdonald, P., Choi, E., Finberg, K., GS, R., M, S.:

Large-scale analysis of gene expression, protein localization and gene disruption in

Saccharomyces cerevisiae. Genes and Development 8(1994) 1087–1105

8. Hanson, M.R., Köhler, R.H.: GFP imaging: Methodology and application to investigate cellular compartmentation in plants. Journal of Experimental Botany 52 (2001)

9. Dunkley, T., Watson, R., Griffin, J., Dupree, P., Lilley, K.: Localization of organelle proteins by isotope tagging (LOPIT). Molecular and Cellular Proteomics 3 (2004)

10. Nakai, K., Kanehisa, M.: Expert system for predicting protein localization sites in gram-negative bacteria. Proteins: Structure, Function and Genetics 11 (1991) 95–110


11. Nakai, K., Kanehisa, M.: A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics. 14 (1992) 897–911

12. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization of proteins. In: Proc. of the Int. Conf. on Intelligent Systems for Molecular Biology (ISMB). (1996)

13. Horton, P., Nakai, K.: Better prediction of protein cellular localization sites with the k nearest neighbors classifier. In: Proc. of the Int. Conf. on Intelligent Systems for Molecular Biology (ISMB). (1997)

14. Emanuelsson, O., Nielsen, H., von Heijne, G.: ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Science 8 (1999) 978–984

15. Bannai, H., Tamada, Y., Maruyama, O., Nakai, K., Miyano, S.: Extensive feature detection of N-terminal protein sorting signals. Bioinformatics. 18 (2002) 298–305

16. Nair, R., Rost, B.: Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol. 348 (2005) 85–100

17. Stapley, B.J., Kelley, L.A., Sternberg, M.J.E.: Predicting the subcellular location of proteins from text using support vector machines. In: Proc. of the Pacific Symposium on Biocomputing (PSB). (2002) 374–385

18. Eskin, E., Agichtein, E.: Combining text mining and sequence analysis to discover protein functional regions. In: Proc. of the 9th Pacific Symposium on Biocomputing (PSB). (2004) 288–299

19. Park, K.J., Kanehisa, M.: Prediction of protein subcellular location by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 19 (2003) 1656–1663

20. Höglund, A., Dönnes, P., Blum, T., Adolph, H., Kohlbacher, O.: Using N-terminal targeting sequences, amino acid composition, and sequence motifs for predicting protein subcellular localization. German Conference on Bioinformatics (GCB) 2005.

21. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines (2003) http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

22. Wu, T.F., Lin, C.J., Weng, R.C.: Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research 5 (2004) 975–1005

23. Bairoch, A., Bucher, P.: PROSITE: recent developments. Nucleic Acids Res. 22 (1994) 3583–3589

24. Cokol, M., Nair, R., Rost, B.: Finding nuclear localization signals. EMBO Rep. 1 (2000) 411–415

25. Nair, R., Carter, P., Rost, B.: NLSdb: database of nuclear localization signals. Nucleic Acids Res. 31 (2003) 397–399

26. Porter, M.F.: An Algorithm for Suffix Stripping (Reprint). In: Readings in Information Retrieval. Morgan Kaufmann (1997) http://www.tartarus.org/~martin/PorterStemmer/.

27. Walpole, R.E., Myers, R.H., Myers, S.L. In: One- and Two-Sample Tests of Hypotheses. (1998) 235–335

28. Bairoch, A., Apweiler, R.: The SWISS-PROT protein sequence database and its supplement in TrEMBL in 2000. Nucleic Acids Res. 28 (2000) 45–48

29. Matthews, B.W.: Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 405 (1975) 442–451
