Reconstruction of Protein-Protein Interaction Pathways
by Mining Subject-Verb-Objects Intermediates
Maurice HT Ling1,2, Christophe Lefevre3,
Kevin R. Nicholas2, Feng Lin1
1 BioInformatics Research Centre, Nanyang Technological University, Singapore
2 CRC for Innovative Dairy Products, Department of Zoology,
The University of Melbourne, Australia
3 Victorian Bioinformatics Consortium, Monash University, Australia
mauriceling@acm.org, k.nicholas@zoology.unimelb.edu.au,
Chris.Lefevre@med.monash.edu.au, ASFLIN@ntu.edu.sg
Abstract. The exponential increase in the publication rate of new articles is
limiting researchers' access to relevant literature. This has prompted the use
of text mining tools to extract key biological information. Previous studies have
reported extensive modification of existing generic text processors to process
biological text. However, the need for this modification has not been
examined. In this study, we constructed Muscorian using MontyLingua, a
generic text processor. It follows a previously proposed two-layered
generalization-specialization paradigm, in which text is generically processed
into a suitable intermediate format before domain-specific data extraction
techniques are applied at the specialization layer. Evaluation using a corpus and
experts indicated 86-90% precision and approximately 30% recall in extracting
protein-protein interactions, which is comparable to previous studies using
either specialized biological text processing tools or modified existing tools.
Our study also demonstrates the flexibility of the two-layered
generalization-specialization paradigm by using the same generalization layer
for two specialized information extraction tasks.
Keywords: biomedical literature analysis, protein-protein interaction,
montylingua
1 Introduction
PubMed currently indexes more than 16 million papers, with about one million
and 1.2 million papers added in 2005 and 2006 respectively. A simple keyword
search in PubMed showed that nearly 900 thousand papers on mouse research and more
than 1.3 million papers on rat research have been indexed to date, and in the last
four years, more than 150 thousand papers have been published on each of mouse and
rat research. This increase in the volume of research papers indexed in PubMed
over the last 10 years makes it difficult for researchers to maintain an active and
productive assessment of relevant literature. Information extraction (IE) has been
used as a tool to analyze biological text to derive assertions in specific biological
domains [30], such as protein phosphorylation [19] or entity interactions [1].
IE tools used for mining information from biological text can be classified
into two groups: tools with capacity for general application, and tools that treat
biological text as specialized text requiring domain-specific tools to process it.
The latter view has led to the development of specialized part-of-speech (POS) tag
sets (such as SPECIALIST [28]), POS taggers (such as MedPost [33]), ontologies [11],
text processors (such as MedLEE [15]), and full IE systems, such as GENIES [16],
MedScan [29], MeKE [4], Arizona Relation Parser [10], and GIS [5]. On the other
hand, an alternative approach assumes that biological text is not specialized enough
to warrant re-development of tools and that adaptation of existing or generic tools
will suffice. To this end, BioRAT [12] modified GATE [8], MedTAKMI [36]
modified TAKMI [27], which was originally used in call centres, and Santos [31]
used the Link Grammar parser [32].
Although both approaches demonstrated similar performance, both developing
specialized systems and modifying existing systems were time-consuming [20]. Work by
Grover [17] suggested that native generic tools may be used for biological text, and a
recent review highlighted successful uses of a generic text processing system,
MontyLingua [14, 23], for a number of purposes [22]. For example, MontyLingua has
been used to process published economics papers for concept extraction [35]. However,
the need to modify generic text processors has not been formally examined, and the
question of whether an unmodified, generic text processor can be used in biological
text analysis with comparable performance remains to be assessed.
In this study, we evaluated a native, generic text processing system, MontyLingua
[23], in a two-layered generalization-specialization architecture [29] where the
generalization layer processes biological text into an intermediate knowledge
representation for the specialization layer to extract genic or entity-entity interactions.
This system demonstrated 86.1% precision using the Learning Language in Logic 2005
evaluation data [9], and 88.1% and 90.7% precision in extracting protein-protein
binding and activation interactions respectively. Our results were comparable to
previous work that modified generic text processing systems and reported precision
ranging from 53% [24] to 84% [5], suggesting that such modification may not improve
the efficiency of information retrieval.
2 System Description
We have developed a biological text mining system, known as Muscorian, for mining
protein-protein inter-relationships in the form of subject-relation-object (for example,
protein X bind protein Y) assertions. Muscorian is implemented as a 3-module
sequential system of entity normalization, text analysis, and protein-protein binding
finding, as shown in Figure 1. It is available for academic and non-profit users
through http://ib-dwb.sf.net/Muscorian.html.
Fig 1. Schematic Diagram Illustrating the Operations of Muscorian
2.1 Entity Normalization
Entity normalization is the substitution of the long form of either a biological or
chemical term with its abbreviated form. This is essential to correct part-of-speech
tagging errors, which are common in biological text due to multi-worded nouns. For
example, the protein name “phosphatase and tensin homolog deleted on chromosome
10” has to be recognized as a single noun and not a phrase. In addition, this study
attempts to mine protein-protein interactions and consolidate this knowledge to
produce a map. Therefore, the naming convention of the protein entities must be
standardized to allow for matching. However, this is not the case in biological text,
where synonymous protein names exist for virtually every protein. For example,
“MAP kinase kinase”, “MAPKK”, “MEK” and “MAPK/Erk kinase” refer to the same
protein. Both of these problems could be either resolved or minimized by reducing
multi-worded nouns into their abbreviated forms.
A dictionary-based approach was used for entity normalization to achieve a high
level of accuracy and consistency. The dictionary was assembled as follows: firstly,
a set of 25000 abstracts from PubMed was used to interrogate Stanford University's
BioNLP server [3] to obtain a list of long forms with their abbreviations and a
calculated score. Secondly, only results with a score of more than 0.88 were retained,
as this is an inflection point of the ROC graph [3] and gives a good balance between
obtaining the most information and reducing curation effort. Lastly, the set of long
forms and their abbreviations was manually curated with the help of domain experts.
The expert-curated dictionary of long forms and their abbreviated terms was
used to construct a regular expression engine that recognizes the
long form of a biological or chemical term and substitutes it with its corresponding
abbreviated form.
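The dictionary assembly and substitution steps described above can be sketched as follows. This is a minimal illustration and not Muscorian's actual code: the candidate list, its scores, the abbreviation "PTEN" and the function names build_dictionary and build_normalizer are assumptions introduced here, and the manual curation step is not modelled.

```python
import re

# Hypothetical BioNLP-style output: (long form, abbreviation, score).
# The entries and scores are made up for illustration only.
CANDIDATES = [
    ("phosphatase and tensin homolog deleted on chromosome 10", "PTEN", 0.95),
    ("MAP kinase kinase", "MAPKK", 0.91),
    ("was performed", "WP", 0.40),
]
SCORE_THRESHOLD = 0.88  # cut-off stated in the text


def build_dictionary(candidates, threshold=SCORE_THRESHOLD):
    """Keep only pairs scoring above the threshold (manual curation not modelled)."""
    return {long_form: short for long_form, short, score in candidates
            if score > threshold}


def build_normalizer(dictionary):
    """Compile one regular expression that replaces each long form with its abbreviation."""
    # Try longer forms first so that a long name wins over a shorter name nested in it.
    forms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(form) for form in forms) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {form.lower(): short for form, short in dictionary.items()}

    def normalize(text):
        # Replace every recognized long form with its abbreviated form.
        return pattern.sub(lambda match: lookup[match.group(1).lower()], text)

    return normalize


normalize = build_normalizer(build_dictionary(CANDIDATES))
print(normalize("phosphatase and tensin homolog deleted on chromosome 10 "
                "and MAP kinase kinase were assayed."))
```

After normalization, each multi-worded protein name is a single token-like unit, which is what the subsequent part-of-speech tagging relies on.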
2.2 Text Analysis
Entity-normalized abstracts were then analyzed textually by an unmodified text
processing engine, MontyLingua [14], where they were tokenized, part-of-speech
tagged, chunked, stemmed and processed into a set of assertions in the form of
3-element subject-verb-object(s) (SVO) tuples, or more generally,
subject-relation-object(s) tuples. Therefore, the sequential pattern of words that
formed an abstract was transformed through a series of pattern recognition steps
into a set of structurally definable assertions.
Before part-of-speech tagging is possible, an abstract made up of one or more
sentences has to be separated into individual sentences. This is done by regular
expression recognition of sentence delimiters, such as full stop, ellipsis, exclamation
mark and question mark, at the end of a word (regular expression: ([?!]+|[.][.]+)$),
with the exception of acronyms. Acronyms that are commonly written with a
full stop, for example “Dr.”, do not denote the end of a sentence; such false sentence
breaks were generally prevented by an enumeration of common acronyms.
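A minimal sketch of this sentence separation step is given below. It reuses the regular expression quoted above for runs of "?" or "!" and for ellipses. The handling of a single full stop and the short acronym list are assumptions made for illustration, and they are not taken from MontyLingua's source.

```python
import re

# Regular expression quoted in the text: "?"/"!" runs or an ellipsis at the end of a word.
MULTI_DELIMITER = re.compile(r"([?!]+|[.][.]+)$")

# Hypothetical enumeration of acronyms written with a full stop.
KNOWN_ACRONYMS = {"dr.", "prof.", "fig.", "e.g.", "i.e."}


def ends_sentence(word):
    """Return True if this word closes a sentence."""
    if word.lower() in KNOWN_ACRONYMS:
        return False  # "Dr." and similar never end a sentence in this sketch
    # Assumption: a single trailing full stop also ends a sentence.
    return bool(MULTI_DELIMITER.search(word)) or word.endswith(".")


def split_sentences(abstract):
    """Split an abstract into sentences by inspecting the end of each word."""
    sentences, current = [], []
    for word in abstract.split():
        current.append(word)
        if ends_sentence(word):
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a delimiter
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Lee studied PTEN. Is it active? Yes!"))
```

Because the check is applied per word, a decimal number such as "4.20" is not treated as a sentence boundary, since its full stop is not at the end of the word.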
Individual sentences were then separated into constituent words and punctuation
marks by a process known as tokenization. Tokenization, which is essential to reduce
a sentence into atomic syntactic building blocks, is generally a simple process of
splitting an English sentence into words at the whitespaces in the sentence, resulting
in a list of tokens (words). However, there were three problems which were corrected
by examining each token. Firstly, punctuation is crucial in understanding a written
English sentence, but typographically a punctuation mark is usually joined to the
preceding word. Hence, separation of punctuation from the preceding word is necessary.
Secondly, this separation resulted in incorrect tokenization of acronyms and decimal
numbers. For example, “... an appt. for ...” would be tokenized to “... an appt . for ...”
and “$4.20” to “$ 4 . 20”. This problem was prevented by pre-defining acronyms and
using regular expressions, such as “^[$][0-9]{1,3}[.][0-9][0-9](?[.]?)$”. Lastly, common
contractions, such as “don't”, were expanded into two tokens, “do” and “n't”.
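The three corrections can be illustrated with the sketch below. It is a simplified stand-in and not MontyLingua's tokenizer: the protected acronym list is hypothetical, the currency pattern is a reduced form of the expression quoted above, and the protect flag exists only to show the uncorrected behaviour.

```python
import re

# Hypothetical list of pre-defined acronyms that must keep their full stop.
PROTECTED_ACRONYMS = {"appt.", "dr."}
# Simplified currency pattern (an assumption, reduced from the quoted expression).
CURRENCY = re.compile(r"^[$][0-9]{1,3}[.][0-9][0-9]$")
# A word piece is a run of letters, digits and apostrophes, or one other symbol.
PIECES = re.compile(r"[A-Za-z0-9']+|[^A-Za-z0-9'\s]")
# Contractions ending in "n't" are split into stem and "n't".
CONTRACTION = re.compile(r"^(.+)(n't)$", re.IGNORECASE)


def tokenize(sentence, protect=True):
    """Split on whitespace, then separate punctuation and expand contractions."""
    tokens = []
    for word in sentence.split():
        # Correction 2: leave acronyms and currency amounts untouched.
        if protect and (word.lower() in PROTECTED_ACRONYMS or CURRENCY.match(word)):
            tokens.append(word)
            continue
        # Correction 1: separate punctuation from the word it is joined to.
        for piece in PIECES.findall(word):
            # Correction 3: expand contractions such as "don't" into "do" and "n't".
            contraction = CONTRACTION.match(piece)
            if contraction:
                tokens.extend([contraction.group(1), contraction.group(2)])
            else:
                tokens.append(piece)
    return tokens


print(tokenize("I don't have an appt. for $4.20 today."))
print(tokenize("an appt. for $4.20", protect=False))
```

With protect=False the sketch reproduces the faulty splitting of "appt." and "$4.20" described above. Note that this simple piece pattern also splits hyphenated words, which a production tokenizer would need to handle.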
Despite the above error correction measures, certain text such as mathematical
equations, which might be used to describe enzyme kinetics in biological text, will not
be tokenized correctly. In spite of this limitation, the described tokenization scheme is
still appropriate, as extraction of enzyme kinetics or mathematical representations is
not an aim of this study.
Each of the tokens (words and punctuation marks) in a tokenized sentence is then
tagged using the Penn Treebank Tag Set [25] by a Brill tagger, trained on the Wall
Street Journal and Brown corpora, which operates in two phases. In the first phase,
each word is tagged using a lexicon containing the likely tag for each word. This is
followed by a phase of correction using lexical and contextual rules, which were
learnt by training on tagged corpora, in this case, the Wall Street Journal and Brown
corpora. Lexical rules use a combination of the preceding tag and the prefix or suffix
of the token (word) in question. For example, the rule “NN ing fhassuf 3 VBG” defines
that if the current token is tagged as a noun (NN) and has a 3-character suffix of
“ing”, then the tag should be changed to a gerund or present participle verb (VBG).
On the other hand, contextual rules use only the preceding or following tags and
hence must be applied after lexical rules to be effective. The contextual rule
“RB JJ NEXTTAG NN” defines that an adverbial tag (RB) should be changed to an
adjective (JJ) if the next token is tagged as a noun (NN). The Penn Treebank Tag
Set [25] without punctuation tags is given in Table 1.
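The two tagging phases can be sketched as follows, using only the two example rules described above. The mini-lexicon, the default tag for unknown words and the function names are assumptions for illustration. A real Brill tagger applies a long, ordered list of learnt rules.

```python
# Hypothetical mini-lexicon: the most likely Penn Treebank tag for each known word.
LEXICON = {"the": "DT", "early": "RB", "response": "NN", "involves": "VBZ"}
DEFAULT_TAG = "NN"  # assumption: words missing from the lexicon start as nouns


def initial_tags(tokens):
    """Phase 1: assign each token its most likely tag from the lexicon."""
    return [LEXICON.get(token.lower(), DEFAULT_TAG) for token in tokens]


def lexical_rule(tokens, tags, old_tag, suffix, new_tag):
    """Retag old_tag as new_tag when the token ends with the given suffix."""
    return [new_tag if tag == old_tag and token.endswith(suffix) else tag
            for token, tag in zip(tokens, tags)]


def contextual_rule(tags, old_tag, new_tag, next_tag):
    """Retag old_tag as new_tag when the following token carries next_tag."""
    corrected = list(tags)
    for i in range(len(corrected) - 1):
        if corrected[i] == old_tag and corrected[i + 1] == next_tag:
            corrected[i] = new_tag
    return corrected


def tag(tokens):
    """Phase 1 followed by phase 2 (lexical rules first, then contextual rules)."""
    tags = initial_tags(tokens)
    tags = lexical_rule(tokens, tags, "NN", "ing", "VBG")  # noun ending in "ing"
    tags = contextual_rule(tags, "RB", "JJ", "NN")         # adverb before a noun
    return list(zip(tokens, tags))


print(tag(["the", "early", "response", "involves", "signalling"]))
```

Here "signalling" is first tagged NN by default and then corrected to VBG by the lexical rule, while "early" is corrected from RB to JJ because the next tag is NN.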
Tag   Description                               Tag   Description
CC    Coordinating conjunction                  PRP$  Possessive pronoun
CD    Cardinal number                           RB    Adverb
DT    Determiner                                RBR   Adverb, comparative
EX    Existential there                         RBS   Adverb, superlative
FW    Foreign word                              RP    Particle
IN    Preposition or subordinating conjunction  SYM   Symbol
JJ    Adjective                                 TO    to
JJR   Adjective, comparative                    UH    Interjection
JJS   Adjective, superlative                    VB    Verb, base form
LS    List item marker                          VBD   Verb, past tense
MD    Modal                                     VBN   Verb, past participle
NN    Noun, singular or mass                    VBG   Verb, gerund or present participle
NNS   Noun, plural                              VBP   Verb, non-3rd person singular present
NNP   Proper noun, singular                     VBZ   Verb, 3rd person singular present
NNPS  Proper noun, plural                       WDT   Wh-determiner
PDT   Predeterminer                             WP    Wh-pronoun
POS   Possessive ending                         WP$   Possessive wh-pronoun
PRP   Personal pronoun                          WRB   Wh-adverb
Table 1. Penn Treebank Tag Set without Punctuation Tags (Adapted from [25])
By tagging, the complexity of an English sentence (i.e., the number of ways an
English sentence can be grammatically constructed with virtually unlimited words
and unlimited ideas) was collapsed into a sequence of part-of-speech tags, in this
case from the Penn Treebank Tag Set [25], which has only about 40 tags. Therefore,
tagging reduced the large number of English words to about 40 “words” or tags.
Generally, an English sentence is composed of a noun phrase, a verb, and a verb
phrase, where the verb phrase may be reduced into more noun phrases, verbs, and verb
phrases. More precisely, the English language is an example of the subject-verb-object
typology, which accounts for 75% of all languages in the world [7]. This
concept of English sentence structure is used to process a tagged sentence into
higher-order structures of phrases by a process of chunking, which is a precursor to
the extraction of semantic relationships of nouns into SVO structure. Using only the
sequence of tags, chunking was performed as a recursive 4-step process: protecting