BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature.

Cheng-Ju Kuo, Maurice H T Ling, Kuan-Ting Lin, Chun-Nan Hsu

Institute of Information Science, Academia Sinica, Taipei 115, Taiwan, Republic of China.

Journal Article: BMC Bioinformatics (impact factor: 3.43). 01/2009; 10 Suppl 15:S7. DOI: 10.1186/1471-2105-10-S15-S7


BMC Bioinformatics
Proceedings
BIOADI: a machine learning approach to identifying abbreviations
and definitions in biological literature
Cheng-Ju Kuo1, Maurice HT Ling2,4, Kuan-Ting Lin1,3 and Chun-Nan Hsu*1
Addresses: 1Institute of Information Science, Academia Sinica, Taipei 115, Taiwan, Republic of China, 2School of Chemical and Life Sciences,
Singapore Polytechnic, Republic of Singapore, 3Institute of Biomedical Informatics, National Yang-Ming University, Taipei 112, Taiwan,
Republic of China and 4Department of Zoology, The University of Melbourne, Parkville, Victoria, Australia
E-mail: Cheng-Ju Kuo - clarkkuo@iis.sinica.edu.tw; Maurice HT Ling - mauriceling@acm.org; Kuan-Ting Lin - woody@iis.sinica.edu.tw;
Chun-Nan Hsu* - chunnan.hsu@iis.sinica.edu.tw
*Corresponding author
from Asia Pacific Bioinformatics Network (APBioNet) Eighth International Conference on Bioinformatics (InCoB2009)
Singapore 7-11 September 2009
Published: 3 December 2009
BMC Bioinformatics 2009, 10(Suppl 15):S7 doi: 10.1186/1471-2105-10-S15-S7
This article is available from: http://www.biomedcentral.com/1471-2105/10/S15/S7
© 2009 Kuo et al; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: To automatically process large quantities of biological literature for knowledge
discovery and information curation, text mining tools are becoming essential. Abbreviation
recognition is related to named entity recognition (NER) and can be considered as the task of
recognizing pairs of a term and its corresponding abbreviation in free text. The successful
identification of an abbreviation and its corresponding definition is not only a prerequisite for
indexing the terms of text databases to retrieve articles of related interest, but also a building
block for improving existing gene mention tagging and gene normalization tools.
Results: Our approach to abbreviation recognition (AR) is based on machine learning and
exploits a novel set of rich features to learn rules from training data. Tested on the AB3P corpus,
our system demonstrated an F-score of 89.90% with 95.86% precision at 84.64% recall, higher than
the result achieved by the best-performing existing AR system. We also annotated a new corpus
of 1200 PubMed abstracts derived from the BioCreative II gene normalization corpus. On
our annotated corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95%
recall, which also outperforms all tested systems.
Conclusion: By applying our system to extract all short form-long form pairs from all available
PubMed abstracts, we have constructed BIOADI. Mining BIOADI reveals many interesting trends
in biomedical research. We also provide off-line AR software in the download section
at http://bioagent.iis.sinica.edu.tw/BIOADI/.
Background
Protein/gene name recognition (NR) [1,2] is one of the
most challenging tasks in biomedical text mining [3].
Solving the problem of NR will allow for more complex
text mining tasks to be addressed [4] as it is a prerequisite
for information extraction and advanced text mining
[3,5,6]. One of the main reasons for the difficulty is the high
variation of terms that are not explicitly reflected in
biomedical ontologies [7]. It is common for biological
entities to have several names. For example, PTEN and
MMAC1 refer to the same entity [8]. It was estimated that
one-third of biological terms are variants [9].
A number of important studies in this area include
GAPSCORE [10], which examines the appearance,
morphology and context of named entities before
applying a classifier trained using these features (59%
precision and 50% recall). ABNER [11] employed a
conditional random field model and achieved precision
between 58.2% and 85.4% and recall between 53.9% and
79.8% for different target entities. Other groups have
attempted combinations of approaches to improve
precision [12-16].
Abbreviation recognition (AR) is related to NR and can
be considered as a pair recognition task of a terminology
(may be a phrase or an entity) and its corresponding
abbreviation from free text. In this manuscript, we
use “LF” to denote the long form of a term and
“SF” to denote its abbreviation or short form. Since
most protein and gene names are rather lengthy,
researchers tend to abbreviate them in published
manuscripts. As a result, AR can serve as a precursor
to a number of applications, for example, building a
term index of a text database to retrieve articles of
related interest [17] or linking text-mined protein
interaction networks [18-20]. Hence, it
seems plausible to use AR as a first pass in NER. In the
simplest sense, AR may be used to help determine the term
boundaries of entity names in free text, as reported in [21,22].
AR is generally considered a simpler problem than
NER, as shown by the performance of AR
systems [8]. For example, Stanford University’s Abbreviation
Server [23,24] demonstrated 97% precision at
22% recall and 95% precision at 75% recall. AbbRE [25]
and the system by Schwartz et al. [26] achieved 96%
precision with 70% recall, and 96% precision with 82%
recall, respectively, while SaRAD system [27] reported
95% precision with 85% recall. More recently, Sohn et al.
[28] used a LF to SF matching algorithm similar to Yu
et al. [25] and reported 96.5% precision with 83.2%
recall. However, these performance measures are hardly
comparable because each system was tested on different
corpora [29]. Although both Chang et al. [23] and
Schwartz et al. [26] used the Medstract Gold Standard
Evaluation Corpus [30], each had made undisclosed
modifications to their test corpus [29], resulting in
difficulty in comparison. Nevertheless, Torii et al. [31]
performed a meta-study to compare the results of a
number of AR systems and found that the SF-LF
identified by each system is generally consistent with
previous reports. In general, these systems can achieve
excellent precision but still have plenty of room for
improvement in terms of recall. Currently, the systems of
Schwartz et al. [26] and Sohn et al. [28] demonstrate the
best AR performance among existing systems.
Schwartz et al. [26] used a 2-step algorithm for AR under
the assumption that the SF-LF must exist in the same
sentence. In the first step, identification of a possible SF-
LF pair is initiated by the presence of a pair of brackets. It
considered two cases - the LF is in the brackets or the SF
is in the brackets. If it is likely that the SF is in the
brackets, the second step is to search for the LF word
boundaries in the sentence by morphological features.
Sohn et al. also used brackets to initiate the process of AR
but ignored a list of common bracket-delimited struc-
tures, such as “(p < 0.05)”. This is followed by filtering
the potential SF-LF pairs using a set of pre-defined rules.
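The second step of such a bracket-driven approach can be illustrated with a small sketch. The code below is a Python rendering of a right-to-left character matching of the kind commonly associated with Schwartz et al. [26]; it is not taken from this article, and the function name find_best_long_form and the exact matching conditions are assumptions made for illustration only.

```python
def find_best_long_form(sf, text):
    """Return the shortest suffix of `text` that can define `sf`, or None.

    Assumed rule: scan the SF from its last character to its first, and for
    each alphanumeric SF character move left in `text` until the same
    character (case-insensitive) is found. The first SF character must
    additionally match at the start of a word.
    """
    l = len(text) - 1
    for s in range(len(sf) - 1, -1, -1):
        c = sf[s].lower()
        if not c.isalnum():
            continue  # punctuation in the SF is skipped
        while (l >= 0 and text[l].lower() != c) or (
            s == 0 and l > 0 and text[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None  # ran out of text before matching every SF character
        l -= 1
    # Extend the match back to the beginning of the word it starts in.
    start = text.rfind(" ", 0, l + 1) + 1
    return text[start:]


print(find_best_long_form("HSP", "We studied heat shock protein"))
```

Here `text` stands for the words preceding the bracketed SF in the same sentence. A rule-based matcher like this requires every alphanumeric SF character to be found in the LF, which is why a pair such as “CA5” and “CA V gene” cannot be matched by it.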
Our approach to AR is based on machine-learning and
exploits a novel set of rich features to describe properties
of a potential SF-LF pair. In addition, unlike the
systems of [26,28], our system can identify pairs in
which some characters of the SF are unused, for
example, “CA5” and “CA V gene”. Our system also
outputs the prediction probability to indicate the
confidence of each identified SF-LF pair. Tested on the
AB3P corpus [28], our system demonstrated an F-score of
89.90% with 95.86% precision at 84.64% recall. We also
annotated a corpus of 1200 PubMed abstracts derived
from the BioCreative II gene normalization dataset.
On our corpus, our system achieved an F-score of 86.20%
with 93.52% precision at 79.95% recall. Compared with
existing available AR systems [26,28], our system achieved
a higher F-score on both corpora and runs about 14
times faster than the best-performing AR system [28].
All resources can be found on our website. By applying
our system to extract all short form-long form pairs from
all available PubMed abstracts, we have constructed
BIOADI, the most comprehensive dictionary of biologi-
cal abbreviations online. Mining BIOADI reveals many
interesting trends of bio-medical research.
Methods
Preparation of training data
We annotated a corpus of 1200 abstracts from BioCrea-
tive II gene normalization dataset [32] by a single person
for consistency and used it to develop an AR system.
We refer to this annotated corpus as the “BIOADI
corpus.” We followed the style and the annotation
guidelines of the AB3P corpus [28], in which the SF and LF of a
pair are separated by “|” (for example, “HSP” and “heat shock
protein” form “HSP|heat shock protein”), to annotate
each abstract. We focus on the following forms of SF-LF
pairs:
1. LF is in front of SF, and SF is in brackets or square
brackets, e.g. “heat shock protein (HSP)”;
2. SF is in front of LF, and LF is in brackets or square
brackets, e.g. “HSP (heat shock protein)”;
3. Both SF and LF are in brackets or square brackets
and separated by a comma or semi-colon, e.g. “(HSP,
heat shock protein)”.
SF-LF pairs adhering to one of these forms were
annotated as potential SF-LF pairs. The BIOADI corpus
includes 1668 true SF-LF pairs and 145 synonym pairs,
which are marked with “//” at the beginning of each pair.
The synonym pairs were not considered valid SF-LF
pairs and were ignored in the following experiments. We
also used the AB3P corpus for performance
evaluation. It contains 1221 true SF-LF pairs, some of
which are synonym pairs.
Both positive and negative instances were required for
model training. In this study, annotated SF-LF pairs were
used as positive instances in training data, and negative
instances were automatically extracted from text. The
extraction of potential SF-LF pairs was similar to the
previous work [28]. However, no constraints were set on the
character lengths or word lengths of SFs; instead,
numbered list indicators (e.g., (a), (b), (1a), (1b), (I),
(II), ...) and common strings (“e.g.”, “and”, ...) were
filtered out. Potential SFs that do not contain any
alphabetic character or that contain certain symbols (“=”,
“%”, “>” and “<”) were excluded. A potential LF can be
composed of up to ten consecutive words preceding a
potential SF in the same sentence, or can appear in brackets
or square brackets following a potential SF. Hence
there are at most ten potential LFs for a potential SF,
of which one is correct. Each abstract was split into
sentences by the “sentence and paragraph breaker” [33] before
the automatic AR process. All potential SF-LF pairs were
checked against the list of positive instances; those not
found in the list acted as negative instances in the training data.
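The candidate generation described above can be sketched as follows. This is a minimal Python illustration that covers only the form in which the LF precedes a bracketed SF; the regular expressions, the list of common strings and the function names (is_potential_sf, candidate_pairs, label_pairs) are assumptions made for this example, not the implementation used in the study.

```python
import re

# Text inside one pair of brackets or square brackets (no nesting).
BRACKETED = re.compile(r"[(\[]([^()\[\]]+)[)\]]")
# Hypothetical pattern for list indicators such as a, b, 1a, 1b, I, II.
LIST_INDICATOR = re.compile(r"^(?:[a-z]|\d+[a-z]?|[IVXivx]+)$")
EXCLUDED_SYMBOLS = set("=%><")
COMMON_STRINGS = {"e.g.", "and"}  # illustrative, not the full list


def is_potential_sf(text):
    """Apply the SF exclusion rules described in the paragraph above."""
    if not any(ch.isalpha() for ch in text):
        return False  # must contain an alphabetic character
    if any(ch in EXCLUDED_SYMBOLS for ch in text):
        return False  # e.g. "p < 0.05"
    if text in COMMON_STRINGS or LIST_INDICATOR.match(text):
        return False
    return True


def candidate_pairs(sentence, max_words=10):
    """Pair each bracketed SF with up to `max_words` preceding-word LFs."""
    pairs = []
    for m in BRACKETED.finditer(sentence):
        sf = m.group(1).strip()
        if not is_potential_sf(sf):
            continue
        preceding = sentence[: m.start()].split()
        for n in range(1, min(max_words, len(preceding)) + 1):
            pairs.append((sf, " ".join(preceding[-n:])))
    return pairs


def label_pairs(pairs, positives):
    """Annotated pairs become positive instances, all others negative."""
    return [(sf, lf, (sf, lf) in positives) for sf, lf in pairs]


pairs = candidate_pairs("We studied heat shock protein (HSP) levels.")
labeled = label_pairs(pairs, {("HSP", "heat shock protein")})
```

For the example sentence, five candidate LFs are generated for “HSP”, one of which is the annotated pair; the other four become negative instances.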
Feature extraction of SF-LF pairs
Before training and testing the model, each pair
must be transformed into a feature
vector. To construct features from the raw data
(potential SF-LF pairs extracted in the previous step),
we defined four sets of features. The design of these
features originated from [16], was inspired by
previous works [29,34], and was refined through our
tests. The details are as follows:
 String morphological features of SF and LF
We selected the following binary features to
describe the string morphology, in order to extract
and represent the literal information and character
properties of each SF and LF. We also used some
features to capture the position and number of
stop words in LFs. For example,
1. Is the first letter of the string uppercase?
2. Is the first letter of the string lowercase?
3. Are all characters of the string uppercase?
4. Are all characters of the string lowercase?
5. Does the first word of the LF use the first letter
of the SF (case-sensitive and case-insensitive)?
6. Is the first word of the LF a stop word (case-insensitive)?
7. Is the first word of the SF a stop word (case-sensitive)?
8. Does the string contain numbers?
9. Does the LF share the same numbers as the SF?
10. Does the string contain Greek letters?
11. Does the LF start with the SF?
12. Do the brackets or square brackets of the
string pair up correctly?
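A few of these binary features can be sketched in Python as follows. The feature names, the tiny stop-word list and the treatment of non-letter characters are assumptions made for illustration; the article does not specify them.

```python
# Hypothetical, deliberately small stop-word list (not given in the article).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def string_features(s):
    """Binary features computed on a single string (an SF or an LF)."""
    return {
        "first_upper": s[:1].isupper(),
        "first_lower": s[:1].islower(),
        # str.isupper/islower ignore digits and punctuation (an assumption
        # about how "all characters" should be read).
        "all_upper": s.isupper(),
        "all_lower": s.islower(),
        "has_digit": any(c.isdigit() for c in s),
        # Greek and Coptic Unicode block.
        "has_greek": any("\u0370" <= c <= "\u03ff" for c in s),
    }


def pair_features(sf, lf):
    """Binary features that relate the SF to the LF."""
    words = lf.split()
    first_word = words[0] if words else ""
    return {
        "first_letter_match_cs": bool(sf) and sf[:1] == lf[:1],
        "first_letter_match_ci": bool(sf) and sf[:1].lower() == lf[:1].lower(),
        "lf_first_word_is_stop_word": first_word.lower() in STOP_WORDS,
        "lf_starts_with_sf": bool(sf) and lf.startswith(sf),
    }


sf_feats = string_features("Rb1")
pair_feats = pair_features("HSP", "heat shock protein")
```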
We also applied discrete binary features to
characterize the composition of each SF and LF,
including:
1. Is the number of stop words in the LF = 1, 2, 3, 4...?
2. Is the length (in tokens) of the string = 1, 2, 3, 4...?
3. Does the string contain certain punctuation
symbols?
4. The character pattern of the SF: first, each run of
consecutive uppercase or lowercase characters is
converted to “A” or “a”, respectively. Second, each
run of consecutive digits is converted to “1”. Third,
all other characters are pruned. The converted
string is then matched against a specified pattern. For
example, the SF “Rb1” matches the pattern “Aa1”.
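The character pattern conversion can be written in a few lines. The following Python sketch assumes that only ASCII letters and digits take part in the pattern and that everything else, including Greek letters, is pruned; the function name char_pattern is hypothetical.

```python
import re

# One match per run of uppercase letters, lowercase letters, or digits.
RUN = re.compile(r"[A-Z]+|[a-z]+|[0-9]+")


def char_pattern(sf):
    """Collapse each run to 'A', 'a' or '1' and drop all other characters."""
    out = []
    for m in RUN.finditer(sf):
        first = m.group()[0]
        if first.isupper():
            out.append("A")
        elif first.islower():
            out.append("a")
        else:
            out.append("1")
    return "".join(out)


print(char_pattern("Rb1"))  # the example from the text, pattern "Aa1"
```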
 LF tokens
We used spaces and punctuation marks as delimiters to
tokenize each potential LF. Each token
acted as a binary feature to represent token information
of the potential LF. We also applied token bi-grams as
binary contextual features of the potential LF.
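As an illustration, token and bi-gram features of an LF could be produced as shown below. The feature-name prefixes ("tok=", "bi=") and the use of ASCII punctuation as the delimiter set are assumptions of this sketch.

```python
import string

# Map every ASCII punctuation character to a space before splitting.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text):
    """Split on whitespace and punctuation, dropping empty tokens."""
    return text.translate(_PUNCT_TO_SPACE).split()


def lf_token_features(lf):
    """Set of active binary features: one per token and one per bi-gram."""
    tokens = tokenize(lf)
    features = {"tok=" + t for t in tokens}
    features |= {"bi=" + a + " " + b for a, b in zip(tokens, tokens[1:])}
    return features


print(sorted(lf_token_features("heat shock protein")))
```

Because every distinct token and bi-gram becomes its own feature, this set grows with the vocabulary of the corpus, which is consistent with the large feature counts reported for the LF token set.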
 Numeric features between SF and LF
We used this set of features to describe how SF
letters map to LF letters and how many characters
the SF and LF share.
1. The length (in characters) of the longest common
subsequence of the SF and the LF, divided by the SF
length (in characters) [35];
2. Same as 1, but with the LF replaced by the string
consisting of the first character of each LF token
(e.g. “protein kinase C” forms “PKC”);
3. The size of the character set shared by the SF
and the LF, divided by the size of the character set of
the SF;
4. The size of the character set of the SF divided by the
SF length (in characters);
5. Whether the shortest LF extracted for the SF by
Schwartz’s AR system [26] is equal to the LF;
6. Same as 5 but ignoring numbers in the SF and LF
(e.g. “CA 5 gene” is transformed into “CA gene”);
7. Same as 5 but reversing both the SF and LF (e.g.
“CA 5 gene” is transformed into “eneg 5 AC”);
8. Same as 7 but ignoring numbers in the SF and LF
(e.g. “CA 5 gene” is transformed into “eneg AC”).
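The first four numeric features can be computed as in the following Python sketch. Case-insensitive comparison and whitespace tokenization are assumptions of this example, and the function and key names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def numeric_features(sf, lf):
    """Features 1 to 4 of the list above for a non-empty SF."""
    s, l = sf.lower(), lf.lower()  # assumption: ignore case
    initials = "".join(tok[0] for tok in l.split())
    sf_chars = set(s)
    return {
        "lcs_ratio": lcs_length(s, l) / len(s),
        "initials_lcs_ratio": lcs_length(s, initials) / len(s),
        "shared_charset_ratio": len(sf_chars & set(l)) / len(sf_chars),
        "charset_per_length": len(sf_chars) / len(s),
    }


pkc = numeric_features("PKC", "protein kinase C")
ca5 = numeric_features("CA5", "CA V gene")
```

For “PKC” and “protein kinase C” all four ratios equal 1. For “CA5” and “CA V gene” the first ratio drops to 2/3 because the “5” has no counterpart in the LF, which shows how these features can quantify unused SF characters.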
 Contextual features of SF-LF pair
We generated contextual information for each potential
SF-LF pair from the tokens that precede the
pair, limited to at most two tokens. Each of these
tokens acted as a binary feature.
We also applied token bi-grams as binary contextual
features of the SF-LF pair.
The number of features in each set and the total
number of all features are listed in Table 1.
Model training and testing
To test the performance of different learning algorithms
on our feature set, we used four learning
algorithms: Support Vector Machine (SVM), Naïve
Bayes, Logistic Regression and Monte-Carlo Sampling
Logistic Regression. We used MALLET [36]
for Naïve Bayes, Logistic Regression and
Monte-Carlo Sampling Logistic Regression, and LIBSVM
[37] for SVM. In this study, LIBSVM was incorporated
into MALLET to simplify the pipeline of experiments on
various learning algorithms.
We also set a rule-based filter in the post-processing
step to clean up some easily fixed mistakes and improve
the precision. The output SF-LF pairs were filtered by the
following rules, generalized from our internal tests:
1. If the length of the SF (in characters) is equal to
one, the length of the LF (in words) must not be larger
than one;
2. If the SF is equal to “s”, the first letter of the LF
must not be “S” or “s” (e.g. “substrate(s)”);
3. The brackets and parentheses of the SF and LF
must pair up correctly;
4. The LF cannot contain a semi-colon followed by a
space;
5. The number of punctuation marks in the SF normalized
by the length of the SF (in characters) must not be
larger than 0.5;
6. There are at most two pairs of brackets or
parentheses;
7. The LF must not start with the SF;
8. The SF must not be a sequence or list indicator
(e.g., (a), (b), (1a), (1b), (I), (II), ...).
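The filter rules above translate naturally into a single predicate. The Python sketch below is one possible reading of them; in particular, counting opening brackets over the SF and LF together for rule 6 and the regular expression used for rule 8 are assumptions, and the function names are hypothetical.

```python
import re
import string

OPEN_TO_CLOSE = {"(": ")", "[": "]"}
# Hypothetical pattern for list indicators such as a, b, 1a, 1b, I, II.
LIST_INDICATOR = re.compile(r"^(?:[a-z]|\d+[a-z]?|[IVXivx]+)$")


def brackets_balanced(text):
    """True if every bracket or parenthesis in `text` is properly paired."""
    stack = []
    for ch in text:
        if ch in OPEN_TO_CLOSE:
            stack.append(OPEN_TO_CLOSE[ch])
        elif ch in OPEN_TO_CLOSE.values():
            if not stack or stack.pop() != ch:
                return False
    return not stack


def passes_filter(sf, lf):
    """Return True if the SF-LF pair survives all eight rules."""
    if not sf:
        return False
    if len(sf) == 1 and len(lf.split()) > 1:                     # rule 1
        return False
    if sf == "s" and lf[:1] in ("S", "s"):                       # rule 2
        return False
    if not (brackets_balanced(sf) and brackets_balanced(lf)):    # rule 3
        return False
    if "; " in lf:                                               # rule 4
        return False
    punct = sum(ch in string.punctuation for ch in sf)
    if punct / len(sf) > 0.5:                                    # rule 5
        return False
    if sum(ch in OPEN_TO_CLOSE for ch in sf + lf) > 2:           # rule 6
        return False
    if lf.startswith(sf):                                        # rule 7
        return False
    if LIST_INDICATOR.match(sf):                                 # rule 8
        return False
    return True


print(passes_filter("HSP", "heat shock protein"))
```

Note that the assumed list-indicator pattern also rejects genuine abbreviations that look like Roman numerals, so a production filter would need a more careful definition.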
Since Sohn’s and Schwartz’s AR systems are available
online, we were able to run them at our
local site. We used them without
modification throughout system evaluation
and comparison, except for a necessary change
to the input and output of Schwartz’s system
to handle the format of the AB3P corpus.
Results and discussion
In this study, we used a machine learning approach to
SF-LF pair recognition instead of a rule-based approach
[26-28,38]. Four learning algorithms, Logistic Regres-
sion, Monte-Carlo Sampling Maximum Entropy, Sup-
port Vector Machine and Naïve Bayes, were tested.
Learning algorithms and feature sets analysis
We evaluated the performance of the learning algorithms
on both corpora (BIOADI and AB3P, as described in
Methods). Each algorithm was evaluated
using one corpus for model training and
the other for model testing, and vice versa. As tabulated in
Table 2, our results showed that the F-scores of different
learning algorithms tested on the BIOADI corpus were
Table 1: Number of features of each feature set and total number
of all features generated in feature extraction

Feature Set(s)   BIOADI corpus   AB3P corpus
M                251             239
L                23601           25993
N                8               8
C                19264           20231
M + L + N + C    43124           46471

M, String morphological features; L, LF tokens; N, Numeric features; C,
Contextual features.
between 64.54% and 86.22%. The F-scores on the
AB3P corpus were between 85.03% and 89.90%. The
F-score difference among the learning algorithms trained
using the AB3P corpus was larger than using the BIOADI
corpus, suggesting that pairs of SF and LF were more
irregular in the AB3P corpus (containing synonyms)
than in the BIOADI corpus. Our results also indicate that
logistic regression and support vector machine with RBF
kernel achieved the highest recall on both corpora with
our feature set, although Naïve Bayes and SVM with linear
kernel reached higher precision (Table 2). Because the
precision of logistic regression was higher than that of SVM
with RBF kernel, the logistic regression algorithm was
used to develop our AR system.
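The reported F-scores are consistent with the balanced F-measure, that is, the harmonic mean of precision and recall. The short Python check below assumes this definition, which the text does not state explicitly, and recomputes the logistic regression F-scores from the precision and recall values in Table 2.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Logistic regression, values taken from Table 2.
f_ab3p = f_score(0.9586, 0.8464)    # trained on BIOADI, tested on AB3P
f_bioadi = f_score(0.9352, 0.7995)  # trained on AB3P, tested on BIOADI
print(round(f_ab3p, 4), round(f_bioadi, 4))
```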
We evaluated the four sets of features on the two corpora
with logistic regression. Table 3 presents the performance
of four trials on different combinations of the four
sets of features on both corpora. The F-scores of these
trials range from 80.87% to 86.20% for the BIOADI
corpus and from 86.88% to 89.90% for the AB3P
corpus. Comparing the trial with the highest F-score
with the lowest one on each corpus, the trial with all
features scored three to five percentage points higher than
the one with only the morphological set of features. It also
performed the best in both precision and recall on
both corpora. That suggests our feature set is robust and
reliable.
Comparison with previous works
We compared our system to Schwartz’s and Sohn’s
systems. Each system was trained with the AB3P corpus
before being tested with the BIOADI corpus, and vice
versa. The Medstract Gold Standard Evaluation Corpus
[30] was not used for evaluation because the past results
reported on it are all based on versions of the corpus
modified differently by each team [29].
The results are shown in Table 4. The F-scores of the
systems on the BIOADI corpus were between 85.12%
and 86.20%, while those on the AB3P corpus were between
86.13% and 89.90%. The highest precision on both
corpora was achieved by Sohn’s system, but the highest
F-score and the highest recall on both corpora were
Table 2: Performance of various learning algorithms tested on the BIOADI corpus and the AB3P corpus

                       Training: AB3P corpus           Training: BIOADI corpus
                       Test: BIOADI corpus             Test: AB3P corpus
Learning Algorithm     Precision  Recall   F-score     Precision  Recall   F-score
Naïve Bayes            0.9733     0.4828   0.6454      0.9784     0.7518   0.8503
Logistic Regression    0.9352     0.7995   0.8620      0.9586     0.8464   0.8990
MC Maximum Entropy     0.9320     0.7013   0.8004      0.9301     0.8066   0.8640
SVM (linear kernel)    0.9446     0.7808   0.8549      0.9619     0.8398   0.8967
SVM (RBF kernel)       0.9212     0.8103   0.8622      0.9256     0.8580   0.8906

Table 3: Performance of logistic regression classifier trained with different feature sets and tested on the BIOADI corpus and the AB3P
corpus

                       Training: AB3P corpus           Training: BIOADI corpus
                       Test: BIOADI corpus             Test: AB3P corpus
Feature Set(s)         Precision  Recall   F-score     Precision  Recall   F-score
M                      0.9155     0.7242   0.8087      0.9392     0.8082   0.8688
M + L                  0.9153     0.7489   0.8238      0.9401     0.8207   0.8763
M + L + N              0.9260     0.7995   0.8581      0.9556     0.8398   0.8939
M + L + N + C          0.9352     0.7995   0.8620      0.9586     0.8464   0.8990

M, String morphological features; L, LF tokens; N, Numeric features; C, Contextual features.

Table 4: Performance of the AR systems tested on the BIOADI corpus and the AB3P corpus

                       Training: AB3P corpus           Training: BIOADI corpus
                       Test: BIOADI corpus             Test: AB3P corpus
System                 Precision  Recall   F-score     Precision  Recall   F-score
This study             0.9352     0.7995   0.8620      0.9586     0.8464   0.8990
Sohn et al. [28]       0.9482     0.7832   0.8578      0.9701     0.8356   0.8979
Schwartz et al. [26]   0.9416     0.7766   0.8512      0.9500     0.7883   0.8613
