Parts-of-Speech Tagger Errors Do Not Necessarily
Degrade Accuracy in Extracting Information from
Biomedical Text
Maurice HT Ling1, Christophe Lefevre1,2, Kevin R Nicholas1
1CRC for Innovative Dairy Products, Department of Zoology, The University of
Melbourne, Australia
2Victorian Bioinformatics Consortium, Monash University, Australia
Corresponding email: mauriceling@acm.org
Abstract
Background: Keeping up with the literature is difficult given the rapidly
increasing volume of research publications and the limited number of effective
information extraction tools that identify entity relationships from text. A recent
study reported the development of Muscorian, a generic text processing tool for
extracting protein-protein interactions from text that achieved performance
comparable to biomedical-specific text processing tools. This result was unexpected,
since potential errors from a series of text analysis processes are likely to adversely
affect the outcome of the entire process. Most biomedical entity relationship
extraction tools have used a biomedical-specific parts-of-speech (POS) tagger, as
errors in POS tagging are likely to affect subsequent semantic analysis of the text,
such as shallow parsing. This study aims to evaluate the POS tagging accuracy and to
explore whether comparable performance is obtained when a generic POS tagger,
MontyTagger, is used in place of MedPost, a tagger trained on biomedical text.
Results: Our results demonstrated that MontyTagger, Muscorian's POS tagger, has a
POS tagging accuracy of 83.1% when tested on biomedical text. Replacing
MontyTagger with MedPost did not result in a significant improvement in entity
relationship extraction from text: precision was 55.6% with MontyTagger versus
56.8% with MedPost on directional relationships, and 86.1% with MontyTagger versus
81.8% with MedPost on nondirectional relationships. This is unexpected, as the
potential for poor POS tagging by MontyTagger is likely to affect the outcome of the
information extraction. An analysis of POS tagging errors demonstrated that 78.5% of
tagging errors are compensated for by shallow parsing. Thus, despite 83.1% tagging
accuracy, MontyTagger has a functional tagging accuracy of 94.6%.
Conclusions: POS tagging errors do not adversely affect the information extraction
task if the errors are resolved in shallow parsing through alternate POS tag use.
1. Introduction
PubMed currently indexes more than 17.5 million papers, including 1 million
papers added in 2006 and the first half of 2007. This trend of increasing volume
of research papers makes it difficult for researchers to maintain a productive
assessment of relevant literature. Information extraction (IE) has been used as a tool
to analyze biological text to derive assertions, such as entity interactions (Abulaish
and Dey, 2007). To date, there have been a number of IE tools to extract entity
interactions from published text, such as MedScan (Novichkova et al., 2003), Arizona
Relation Parser (Daniel et al., 2004), BioRAT (David et al., 2004) and Santos et al.
(2005).
A recent article by Ling et al. (2007) classified entity interaction IE tools by
whether they were developed with biological text in mind or were generic tools
adapted for biological text. Ling et al. (2007) developed Muscorian, a tool to extract
protein-protein interactions from text. They also demonstrated that a generic text
analysis tool chain, MontyLingua (Liu and Singh, 2004; Ling, 2006), incorporated into
a two-layered generic-specialized architecture as explained in MedScan (Novichkova
et al., 2003), can give rise to performance in entity interaction extraction comparable
to that of IE systems that modified existing systems, such as BioRAT (David et al.,
2004), Chilibot (Chen and Sharp, 2004) and Santos et al. (2005). One of the
common features of both classes of tools defined by Ling et al. (2007) is the
specialization of the part-of-speech (POS) tagger. For example, Arizona Relation
Parser (Daniel et al., 2004) re-trained the Brill tagger (Brill, 1995) and Chilibot
(Chen and Sharp, 2004) re-trained the TnT tagger (Brants, 2000). POS tagging is the
process of assigning a grammatical role to each word and punctuation mark in the
source sentence. This plays a critical role in subsequent text processing tasks, such as
shallow parsing, where the sequence of POS tags is used instead of the original
sequence of words. At the same time, it is known that errors in POS tagging often
result in misunderstanding of the sentence (Kodratoff et al., 2005; Amrani et al., 2005).
Muscorian (Ling et al., 2007) makes use of a generic POS tagger as part of
MontyLingua (Ling, 2006; Liu and Singh, 2004) and performs at a comparable level
to IE tools using POS taggers trained on biomedical text. This contradicts the
common view that “error propagation through cascades of processors may in
aggregate severely degrade performance on the final task” as stated in the Call for
Papers for the Tenth Conference on Computational Natural Language Learning 2006
(CoNLL-X).
Tateisi and Tsujii (2004) have demonstrated that generic POS taggers are only about
83% accurate when used to tag biomedical text. This suggests that MontyTagger, the
generic POS tagger in MontyLingua, is unlikely to perform as well as taggers trained
on biomedical text, such as MedPost (Smith et al., 2004). Therefore, it is likely that
the above-mentioned contradiction is resolved at the step immediately downstream of
POS tagging, namely shallow parsing. In MontyLingua shallow parsing (Ling et al.,
2007), the input sentence is broken into noun phrases and verb phrases. The process
of shallow parsing can be seen as a collapse of a sequence of POS tags into two
groups; hence, we expect a high level of permissible substitution of POS tags within
related classes. We term this permissible substitution “alternate POS tag use”.
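The idea of alternate POS tag use can be illustrated with a minimal sketch. The noun-phrase pattern, the example tag sequences and the function name chunk_noun_phrases below are hypothetical and are not taken from MontyLingua; they only show how a tagging error within a related tag class may leave chunk boundaries unchanged, while an error across classes may not.

```python
import re

# Hypothetical noun-phrase pattern over a string of space-terminated POS tags:
# an optional determiner, any number of adjectives, then one or more nouns.
# This is an illustrative assumption, not the expression used by MontyLingua.
NOUN_PHRASE = re.compile(r"(?<!\S)(?:DT )?(?:JJ[RS]? )*(?:NN[PS]* )+")


def chunk_noun_phrases(tags):
    """Return (start, end) token index spans of noun phrases in a tag list."""
    tag_string = "".join(tag + " " for tag in tags)
    spans = []
    for match in NOUN_PHRASE.finditer(tag_string):
        # Convert character offsets back to token indices by counting spaces.
        start = tag_string[:match.start()].count(" ")
        end = tag_string[:match.end()].count(" ")
        spans.append((start, end))
    return spans


# Toy sentence "the kinase phosphorylates the protein".
reference = ["DT", "NN", "VBZ", "DT", "NN"]
same_class_error = ["DT", "NNP", "VBZ", "DT", "NN"]   # NN wrongly tagged NNP
cross_class_error = ["DT", "VB", "VBZ", "DT", "NN"]   # NN wrongly tagged VB

# The NN -> NNP error is absorbed: both taggings give the same noun phrases.
print(chunk_noun_phrases(reference) == chunk_noun_phrases(same_class_error))
# The NN -> VB error is not absorbed: the first noun phrase is lost.
print(chunk_noun_phrases(reference) == chunk_noun_phrases(cross_class_error))
```

In this toy setting the chunker only asks whether a tag belongs to the noun class, so any substitution among noun tags is permissible.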
This study compares the performance of MedPost (Smith et al., 2004) with the
generic POS tagger, MontyTagger (Liu and Singh, 2004), in Muscorian (Ling et al.,
2007) and illustrates a case whereby POS tagging errors do not adversely affect the
final information extraction task if the errors are resolved in shallow parsing through
alternate POS tag use.
2. Methods
2.1. Evaluating POS Tagging and Information Extraction Performance
MontyTagger was evaluated on its own using the MedPost corpus (Smith et al., 2004),
and its accuracy was calculated as the percentage of correctly tagged tokens (words
and punctuation marks) in the total number of tokens (n=182399). The MedPost tagger
was swapped in place of MontyTagger by modifying MontyLingua's jist() and
jist_predicates()
functions to mpjist() and mpjist_predicates(), giving MedPost-MontyLingua
Muscorian:
    def jist(self, text):
        # Native pipeline: split, tokenize, tag (MontyTagger), chunk, extract.
        sentences = self.split_sentences(text)
        tokenized = map(self.tokenize, sentences)
        tagged = map(self.tag_tokenized, tokenized)
        chunked = map(self.chunk_tagged, tagged)
        extracted = map(self.extract_info, chunked)
        return extracted

    def jist_predicates(self, text):
        infos = self.jist(text)
        svoos_list = []
        for info in infos:
            svoos = info['verb_arg_structures_concise']
            svoos_list.append(svoos)
        return svoos_list

to

    import os
    import random

    def mpjist(self, text):
        sentences = self.split_sentences(text)
        tokenized = map(self.tokenize, sentences)
        # Randomly named input and output files in the existing 'temp' directory.
        sourcefilename = random.random() * 1000000000
        outfilename = random.random() * 1000000000000
        source = open('temp' + os.sep + str(sourcefilename), 'w')
        source.writelines(tokenized)
        source.close()
        # Tag with the external MedPost program instead of MontyTagger.
        command = (os.getcwd() + os.sep + 'medpost/medpost -text -token -penn < temp'
                   + os.sep + str(sourcefilename) + '> temp' + os.sep + str(outfilename))
        # close() waits for MedPost to finish before its output is read.
        os.popen(command).close()
        mpout = open('temp' + os.sep + str(outfilename), 'r')
        tagged = mpout.readlines()
        mpout.close()
        # Chunking and extraction are unchanged from jist().
        chunked = map(self.chunk_tagged, tagged)
        extracted = map(self.extract_info, chunked)
        return extracted

    def mpjist_predicates(self, text):
        infos = self.mpjist(text)
        svoos_list = []
        for info in infos:
            svoos = info['verb_arg_structures_concise']
            svoos_list.append(svoos)
        return svoos_list
Figure 1. Flowchart of evaluation procedure for Muscorian with native MontyLingua
and MedPost-MontyLingua. LLL05 test data was processed for abbreviations before
feeding into each system and the extracted genic interactions (output) were evaluated
for precision and recall.
MedPost-MontyLingua Muscorian's IE performance was evaluated using Learning
Languages in Logic 2005 test data (Cussens and Nedellec, 2005) in the same manner
as Muscorian (Ling et al., 2007) and the performances were compared (Figure 1).
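The tagging accuracy measure used for MontyTagger in this section, the percentage of tokens whose assigned tag matches the reference tag, can be sketched as follows. The function name tagging_accuracy and the toy tag sequences are illustrative assumptions and are not part of MontyLingua or the MedPost corpus.

```python
def tagging_accuracy(reference_tags, predicted_tags):
    """Percentage of tokens whose predicted tag equals the reference tag."""
    if len(reference_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not reference_tags:
        raise ValueError("tag sequences must not be empty")
    correct = sum(1 for ref, pred in zip(reference_tags, predicted_tags)
                  if ref == pred)
    return 100.0 * correct / len(reference_tags)


# Toy example: 5 of the 6 tokens carry the reference tag.
reference = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
predicted = ["DT", "NN", "VBZ", "DT", "NN", "NN"]
print(round(tagging_accuracy(reference, predicted), 1))  # 83.3
```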
2.2. Analysis of POS Tagging Errors
Wrongly tagged tokens from MontyTagger's output were first grouped by their
original tags in the MedPost corpus (Smith et al., 2004), then sub-grouped by
MontyTagger's assigned tags (the wrong tags), and arranged in decreasing order of
the number of tags in both the main groups and the sub-groups. The main groups
covering the first 80% of the wrongly tagged tokens and, within these, the sub-groups
covering the first 90% of the wrongly assigned tags were chosen for further error
analysis. Each pair of original tag and wrongly assigned tag was analysed with
respect to the regular expressions in MontyREChunker (Ling et al., 2007), the
shallow parser in MontyLingua, for the effects of the wrongly assigned tag on the
operations of the shallow parser.
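The grouping and selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy tag lists, and the reading of the 80% and 90% cut-offs as cumulative shares of errors are ours and may differ from the scripts actually used.

```python
from collections import Counter, defaultdict


def group_errors(reference_tags, assigned_tags):
    """Group wrongly tagged tokens by reference tag, then by assigned tag."""
    groups = defaultdict(Counter)
    for ref, assigned in zip(reference_tags, assigned_tags):
        if ref != assigned:
            groups[ref][assigned] += 1
    return groups


def leading_items(counts, coverage):
    """Most frequent items, taken until their cumulative share reaches coverage."""
    total = sum(counts.values())
    selected, running = [], 0
    for item, count in counts.most_common():
        if running >= coverage * total:
            break
        selected.append(item)
        running += count
    return selected


def select_pairs(groups, main_coverage=0.8, sub_coverage=0.9):
    """Pick (reference tag, wrong tag) pairs for manual error analysis."""
    main_counts = Counter({ref: sum(sub.values()) for ref, sub in groups.items()})
    pairs = []
    for ref in leading_items(main_counts, main_coverage):
        for wrong in leading_items(groups[ref], sub_coverage):
            pairs.append((ref, wrong))
    return pairs


# Toy data: 10 errors among 13 tokens (the three DT tokens are correct).
reference = ["NN"] * 7 + ["JJ"] * 2 + ["VBD"] + ["DT"] * 3
assigned = ["NNP"] * 6 + ["JJ"] + ["NN"] * 2 + ["VBN"] + ["DT"] * 3

pairs = select_pairs(group_errors(reference, assigned))
print(pairs)  # [('NN', 'NNP'), ('NN', 'JJ'), ('JJ', 'NN')]
```

Each selected pair would then be checked by hand against the chunker's regular expressions, which is the step this sketch does not attempt to automate.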
3. Results
3.1. Evaluating POS Tagging and Information Extraction Performance
Evaluating MontyTagger on the MedTag corpus demonstrated correct tagging of 151663
of the 182399 tokens, representing 83.1% tagging accuracy. Using the LLL05
evaluation corpus, Muscorian with MedPost-MontyLingua was found to be 56.8%
precise with 24.8% recall on directional relationships, and 81.8% precise with 35.6%
recall on nondirectional relationships (Table 1).
             Directional Relationships      Nondirectional Relationships
             MontyLingua   mpMontyLingua    MontyLingua   mpMontyLingua
Precision    55.6%         56.8%            86.1%         81.8%
Recall       19.8%         24.8%            30.7%         35.6%
F-Score      0.292         0.345            0.453         0.496
Table 1. Summary of Muscorian's performances evaluated using Learning Languages
in Logic 2005 data (Cussens and Nedellec, 2005).
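The F-Scores in Table 1 are consistent with the balanced F-measure, the harmonic mean of precision and recall. The following sketch recomputes them from the reported precision and recall; the assumption that this is the formula used, and the function name f_score, are ours.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from Table 1, expressed as fractions.
table_1 = {
    ("directional", "MontyLingua"): (0.556, 0.198),
    ("directional", "mpMontyLingua"): (0.568, 0.248),
    ("nondirectional", "MontyLingua"): (0.861, 0.307),
    ("nondirectional", "mpMontyLingua"): (0.818, 0.356),
}

for (relationship, system), (precision, recall) in table_1.items():
    print(relationship, system, round(f_score(precision, recall), 3))
```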
3.2. Analysis of POS Tagging Errors
Comparison of the reference tags (MedPost corpus) with the wrongly assigned tags
from MontyTagger showed that, of the 30736 wrongly assigned tags, 52.3% (n=16067)
should have been tagged as nouns (tag: 'NN'), 15.8% (n=4865) should have been
tagged as 'JJ' (adjectives), and the next four most common reference tags among the
errors were 'NNS' (n=1987, 6.5%), 'SYM' (n=1496, 4.9%), 'VBP' (n=1470, 4.8%),
and 'VBD' (n=745, 2.4%). These six reference tags (NN, JJ, NNS, SYM, VBP, VBD)
accounted for 26630 (86.6%) of the wrongly assigned tags, while the rest of the
errors (n=4106) were distributed across 25 tags. Six tags (TO, :, (, ), WP, ,) were
correctly assigned in every instance in this evaluation. A tabulation of errors is shown
in Table 2 and the definition of each POS tag is given in Table 3. The confusion
matrix can be found at http://ib-dwb.sf.net/Muscorian/MedPost_confuse.txt.
Tag    % Corpus    % Error in   % Error    Tag    % Corpus    % Error in   % Error
                  Total Error    in Tag                      Total Error    in Tag
NN        28.56         52.27     30.84    VBG        0.64          0.06      1.59
IN        13.49          1.08      1.33    :          0.54          0.00      0.00
JJ        10.47         15.81     25.44    MD         0.43          0.01      0.20
DT         7.77          0.56      1.16    WDT        0.45          0.19      6.70
NNS        7.75          6.45     14.03    ,          0.39          0.00      0.00
CC         6.66          1.30      3.29    PRP$       0.28          0.01      0.40
.          3.67          0.01      0.03    FW         0.26          0.96     61.39
CD         3.13          2.02     10.84    WRB        0.23          0.59     43.33
VBN        3.05          1.70     10.13    JJR        0.17          0.17     17.74
VBD        2.81          2.42     14.56    NNP        0.14          0.03      3.53
RB         2.57          1.72      9.49    EX         0.08          0.01      1.38
)          1.89          0.00      0.00    POS        0.06          0.06     15.31
(          1.88          0.00      0.00    WP         0.06          0.00      0.00
VBP        1.98          4.78     41.26    JJS        0.05          0.02      6.60
TO         1.55          0.00      0.00    RBS        0.05          0.01      4.40
VBZ        1.54          0.45      5.20    “          0.03          0.19    100.00
SYM        1.07          4.87     76.43    ``         0.03          0.19    100.00
PRP        0.88          1.61     30.59    PDT        0.02          0.11    100.00
VB         0.74          0.05      1.11    RBR        0.01          0.03     44.44
Table 2. Percentage breakdown of POS tags in the MedTag corpus and MontyTagger's
errors as a percentage of POS tag assignments. This table tabulates the POS tagging
errors made by MontyTagger on the MedTag corpus, ordered by the abundance of each
tag in the MedTag corpus. For example, 'NN' is the most abundant tag, accounting for
28.56% (52093) of the 182399 tokens in the MedTag corpus. Of these, 30.84% (16067
of 52093) of the 'NN' tokens were wrongly assigned a different POS tag by
MontyTagger, which accounted for 52.27% of the 30736 wrongly assigned POS tags.
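The three percentage columns of Table 2 can be reproduced from raw counts. The sketch below does so for 'NN' using the counts stated in the caption; the function name tag_error_profile and its dictionary keys are illustrative assumptions.

```python
def tag_error_profile(tag_count, tag_errors, corpus_size, total_errors):
    """Return the three percentages reported for each tag in Table 2."""
    return {
        "pct_corpus": 100.0 * tag_count / corpus_size,
        "pct_error_in_total_error": 100.0 * tag_errors / total_errors,
        "pct_error_in_tag": 100.0 * tag_errors / tag_count,
    }


# Counts for 'NN' as stated in the caption of Table 2.
nn = tag_error_profile(tag_count=52093, tag_errors=16067,
                       corpus_size=182399, total_errors=30736)
for name, value in nn.items():
    print(name, round(value, 2))
```

Rounded to two decimal places, the three values are 28.56, 52.27 and 30.84, matching the 'NN' row of the table.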