MER: a Shell Script and Annotation Server for Minimal Named Entity Recognition and Linking

Preprint · November 2018 · DOI: 10.13140/RG.2.2.29900.46727
Couto and Lamurias
RESEARCH
MER: a Shell Script and Annotation Server for
Minimal Named Entity Recognition and Linking
Francisco M Couto1* and Andre Lamurias1,2
*Correspondence:
fcouto@di.fc.ul.pt
1LASIGE, Faculdade de Ciências,
Universidade de Lisboa, 1749-016,
Lisboa, Portugal
Full list of author information is
available at the end of the article
Abstract
Named-Entity Recognition aims at identifying the fragments of text that mention
entities of interest, which can afterwards be linked to a knowledge base where
those entities are described.
This manuscript presents our Minimal Named-Entity Recognition and Linking
tool (MER), designed with flexibility, autonomy and efficiency in mind. To
annotate a given text, MER only requires: i) a lexicon (text file) with the list of
terms representing the entities of interest; ii) optionally a tab-separated values
file with a link for each term; iii) and a Unix shell. Alternatively, the user can
provide an ontology from where MER will automatically generate the lexicon and
links files. The efficiency of MER derives from exploiting the high performance
and reliability of the text processing command-line tools grep and awk, and a
novel inverted recognition technique.
MER was deployed in a cloud infrastructure using multiple Virtual Machines to
work as an annotation server and participate in the Technical Interoperability and
Performance of annotation Servers (TIPS) task of BioCreative V.5. The results
show that our solution processed each document (text retrieval and annotation)
in less than 3 seconds on average without using any type of cache. MER was also
compared to a state-of-the-art dictionary lookup solution obtaining competitive
results not only in computational performance but also in precision and recall.
MER is publicly available in a GitHub repository
(https://github.com/lasigeBioTM/MER) and through a RESTful Web service
(http://labs.fc.ul.pt/mer/).
Keywords: Named-Entity Recognition; Entity Linking; Annotation Server; Text
Mining; Biomedical Ontologies; Lexicon
Introduction
Text has been and continues to be for humans the traditional and natural means of
representing and sharing knowledge. However, the information encoded in free text
is not easily attainable by computer applications. Usually, the first step to untangle
this information is to perform Named-Entity Recognition (NER), a text mining task
for identifying mentions of entities in a given text [1–3]. The second step is linking
these mentions to the most appropriate entry in a knowledge base. This last step is
usually referred to as the Named-Entity Linking (NEL) task but is also referred to
as entity disambiguation, resolution, mapping, matching or even grounding [4].
State-of-the-art NER and NEL solutions are mostly based on machine learning
techniques, such as Conditional Random Fields and/or Deep Learning [5–14]. These
solutions usually require as input a training corpus, which consists of a set of texts
and the entities mentioned in them, including their exact location (annotations),
Preprint accepted for publication in
Journal of Cheminformatics, 2018
and the entries in a knowledge base that represent these entities [15]. The training
corpus is used to generate a model, which will then be used to recognize and link
entities in new texts. Their effectiveness strongly depends on the availability of a
large training corpus with an accurate and comprehensive set of annotations, which
is usually arduous to create, maintain and extend. On the other hand, dictionary
lookup solutions usually only require as input a lexicon consisting of a list of terms
within some domain [16–21], for example, a list of names of chemical compounds.
The input text is then matched against the terms in the lexicon, mainly using string
matching techniques. A comprehensive lexicon is normally much easier to find or
to create and update than a training corpus; however, dictionary lookup solutions
are generally less effective than machine learning solutions.
Searching, filtering and recognizing relevant information in the vast amount of
literature being published is an almost daily task for researchers working in Life
and Health Sciences [22]. Most of them use web tools, such as PubMed [23], but
many times to perform repetitive tasks that could be automated. However, these
repetitive tasks are sometimes sporadic and highly specific, depending on the project
the researcher is currently working on. Therefore, in these cases, researchers are
reluctant to spend resources creating a large training corpus or learning how to
adapt highly complex text mining systems. They are not interested in getting the
most accurate solution, just a good enough tool that they can use, understand and
adapt with minimal effort. Dictionary lookup solutions are normally less complex
than machine learning solutions, and a specialized lexicon is usually easier to find
than an appropriate training corpus. Moreover, dictionary lookup solutions are still
competitive when the problem is limited to a set of well-known entities. For these
reasons, dictionary lookup solutions are usually the appropriate option when good
enough is what the user requires.
This manuscript proposes a novel dictionary lookup solution, dubbed Minimal
named-Entity Recognizer (MER), which was designed with flexibility, autonomy,
and efficiency in mind. MER only requires as input a lexicon in the form of a text
file, in which each line contains a term representing a named-entity to recognize.
If the user also wants to perform entity linking, a text file containing the terms
and their respective Uniform Resource Identifiers (URIs) can also be given as input.
Adding a new lexicon to MER could therefore not be easier. MER also
accepts as input an ontology in Web Ontology Language (OWL) format, which it
converts to a lexicon.
MER is not only minimal in terms of the input but also in its implementation,
which was reduced to a minimal set of components and software dependencies.
MER is thus composed of just two components, one to process the lexicon (offline)
and another to produce the annotations (online). Both were implemented as Unix
shell scripts [24], mainly for two reasons: i) efficiency, due to direct access to
high-performance text and file processing tools, such as grep and awk, and a novel
inverted recognition technique; and ii) portability, since terminal applications that
execute Unix shell scripts are nowadays available in most computers running Linux,
macOS or Windows operating systems. MER was tested using the Bourne-Again
shell (bash) [25] since it is the most widely available. However, we expect MER to
work in other Unix shells with minimal or even without any modifications.
Figure 1 Workflow of MER depicting its inverted recognition technique.
We deployed MER in a cloud infrastructure to work as an annotation server and
participate in the Technical Interoperability and Performance of annotation Servers
(TIPS) task of BioCreative V.5 [26]. This participation allowed us to assess the
flexibility, autonomy, and efficiency of MER in a realistic scenario. Our annotation
server responded to the maximum number of requests (319k documents) and
generated the second highest number of total predictions (7130k annotations), with an
average of 2.9 seconds per request.
To analyze the statistical accuracy of MER's results we compared it against a
popular dictionary lookup solution, the BioPortal Annotator [27], using a Human
Phenotype Ontology (HPO) gold-standard corpus [28]. MER obtained the highest
precision in both NER and NEL tasks, the highest recall in NER, and a lower
processing time. Additionally, we compared MER with Aho–Corasick [29], a well-known
string search algorithm. MER obtained a lower processing time and higher
evaluation scores on the same corpus.
MER is publicly available in a GitHub repository [30], along with the code used
to run the comparisons to other systems.
The repository contains a small tutorial to help the user start using the program
and test it. The remainder of this article details the components of MER
and how it was incorporated in the annotation server. We end by analyzing and
discussing the evaluation results and presenting future directions.
Figure 2 Example of the contents of a lexicon file representing four compounds.

α-maltose
nicotinic acid
nicotinic acid D-ribonucleotide
nicotinic acid adenine dinucleotide phosphate
Figure 3 A snippet of the contents of the links file generated with ChEBI

zygadenine  http://purl.obolibrary.org/obo/CHEBI_10130
zymosterol  http://purl.obolibrary.org/obo/CHEBI_18252
zymosterol ester  http://purl.obolibrary.org/obo/CHEBI_52322
zymosterol intermediate 1a  http://purl.obolibrary.org/obo/CHEBI_52388
zymosterol intermediate 1b  http://purl.obolibrary.org/obo/CHEBI_52615
Figure 4 A snippet of the contents of the links file generated with the Human
Phenotype Ontology

yellow nails  http://purl.obolibrary.org/obo/HP_0011367
yellow nodule  http://purl.obolibrary.org/obo/HP_0025554
yellow papule  http://purl.obolibrary.org/obo/HP_0025507
yellow skin  http://purl.obolibrary.org/obo/HP_0000952
yellow skin plaque  http://purl.obolibrary.org/obo/HP_0031360
Figure 5 A snippet of the contents of the links file generated with the Disease
Ontology

zebrafish allergy  http://purl.obolibrary.org/obo/DOID_0060517
zellweger syndrome  http://purl.obolibrary.org/obo/DOID_905
zika fever  http://purl.obolibrary.org/obo/DOID_0060478
zika virus congenital syndrome  http://purl.obolibrary.org/obo/DOID_0080180
zika virus disease  http://purl.obolibrary.org/obo/DOID_0060478
Figure 6 Example of the contents of the links file representing compounds
CHEBI:18167, CHEBI:15940, CHEBI:15763 and CHEBI:76072.

α-maltose  http://purl.obolibrary.org/obo/CHEBI_18167
nicotinic acid  http://purl.obolibrary.org/obo/CHEBI_15940
nicotinic acid d-ribonucleotide  http://purl.obolibrary.org/obo/CHEBI_15763
nicotinic acid adenine dinucleotide phosphate  http://purl.obolibrary.org/obo/CHEBI_76072
MER
Figure 1 shows the generic workflow of MER. The offline and online modules
were implemented in two shell script files, namely produce_data_files.sh and
get_entities.sh, respectively. Both scripts are available in the GitHub repository.
The remainder of this section explains their methodology in detail.
Input
Before being able to annotate any text, MER requires as input a lexicon containing
the list of terms to match. The user can provide the lexicon as a text file (.txt)
where each line represents a term to be recognized. Additionally, to perform NEL
Figure 7 Each block represents the content of each of the four files created after
pre-processing the input file shown in Figure 2.

== one-word (...word1.txt) =====================
α.maltose
== two-word (...word2.txt) =====================
nicotinic acid
== more-words (...words.txt) ===================
nicotinic acid d.ribonucleotide
nicotinic acid.adenine dinucleotide phosphate
== first-two-words (...words2.txt) ============
nicotinic acid
nicotinic acid.adenine
a tab-separated values file (.tsv) is required. This links file has to contain two data
elements per line: the term and the link. Alternatively, the user can provide an
ontology (.owl) and MER will automatically parse it to create the lexicon and links
files. If, for example, we want to recognize terms that are present in ChEBI [31],
the user can provide the whole ontology (chebi.owl) or just collect the relevant
labels and store them in a text file, one label per line. Figure 2 presents an example
where four ChEBI compounds are represented by a list of terms based on their
ChEBI name.
If the user provides an ontology, MER starts by retrieving all the values of the
tags rdfs:label, oboInOwl:hasRelatedSynonym and oboInOwl:hasExactSynonym inside
each top-level owl:Class. The values are then stored in two files: a regular
lexicon with a label (term) per line; and a tab-separated values file with a pair of
term and respective identifier (URI) per line. The links file is then sorted and will be
used by MER to perform NEL. Figures 3, 4 and 5 show snippets of the links files
generated for the ChEBI ontology [32], HPO [33, 34], and the Human Disease Ontology
(DOID) [35, 36], respectively.
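As a sketch of this extraction step (the miniature OWL snippet and file names below are illustrative, not MER's exact implementation), label values can be pulled out with grep and sed:

```shell
#!/bin/sh
# Hypothetical miniature OWL file with one class and one label.
cat > mini.owl <<'EOF'
<owl:Class rdf:about="http://purl.obolibrary.org/obo/CHEBI_15940">
  <rdfs:label>Nicotinic Acid</rdfs:label>
</owl:Class>
EOF

# Extract the rdfs:label values, strip the tags, lowercase the terms.
grep -o '<rdfs:label>[^<]*</rdfs:label>' mini.owl \
  | sed 's/<[^>]*>//g' \
  | tr 'A-Z' 'a-z' > lexicon.txt
```

The same pattern extends to the synonym tags by adding them as further grep alternatives.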
The links file can also be created manually for a specific lexicon not generated from
an ontology. Figure 6 presents the links file created for the lexicon file of Figure 2.
Inverted Recognition
To recognize the terms, a standard solution would be to apply grep directly to the
input text. However, the execution time is proportional to the size of the lexicon,
since each term of the lexicon corresponds to an independent pattern to match.
To optimize the execution time, we developed the inverted recognition technique.
The inverted recognition uses the words in the processed input text as patterns to
be matched against the lexicon file. Since the number of words in the input text
is much smaller than the number of terms in the lexicon, grep has far fewer
patterns to match. For example, finding the pattern nicotinic acid in the two-word
chemical lexicon created for TIPS is more than 100 times faster than using the
standard solution.
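A minimal sketch of the two directions of matching, with a toy lexicon and illustrative file names (not MER's exact code):

```shell
#!/bin/sh
# Toy two-word lexicon, one term per line.
printf '%s\n' 'nicotinic acid' 'ascorbic acid' 'folic acid' > word2.txt

text='nicotinic acid was found'

# Standard direction: every lexicon term is a grep pattern matched
# against the text, so cost grows with the lexicon size.
standard=$(printf '%s\n' "$text" | grep -oFf word2.txt)

# Inverted recognition: the consecutive word pairs of the text become
# the patterns, matched against the lexicon (far fewer patterns).
printf '%s\n' "$text" \
  | awk '{for (i = 1; i < NF; i++) print $i " " $(i+1)}' > pairs.txt
inverted=$(grep -xFf pairs.txt word2.txt)
```

Both variables end up holding nicotinic acid, but the inverted grep runs with three patterns regardless of how many terms the lexicon holds.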
To perform the inverted recognition technique, MER splits the lexicon into three
files containing the terms composed of one (one-word), two (two-word), and three
or more words (more-words). The second step creates a fourth file containing the
first two words (first-two-words) of all the terms in the more-words file. During the
above steps, MER makes the following minor modifications to the terms: all text
is converted to lowercase; contiguous white spaces are replaced by one white space; full
Figure 8 Example of a given sentence to be annotated (first line), and its
one-word and two-word patterns created by MER.

α-maltose and nicotinic acid was found, but not nicotinic acid D-ribonucleotide

α.maltose|nicotinic|acid|found|nicotinic|acid|d.ribonucleotide

α.maltose nicotinic|acid found|nicotinic acid|nicotinic acid|found nicotinic|acid d.ribonucleotide
Figure 9 Output example of MER for the sentence in Figure 8 and the lexicon in
Figure 2 without any links file

0  9  α-maltose
14  28  nicotinic acid
48  62  nicotinic acid
48  79  nicotinic acid D-ribonucleotide
Figure 10 Output example of MER for the sentence in Figure 8, the lexicon in
Figure 2, and the links file of Figure 6

0  9  α-maltose  http://purl.obolibrary.org/obo/CHEBI_18167
14  28  nicotinic acid  http://purl.obolibrary.org/obo/CHEBI_15940
48  62  nicotinic acid  http://purl.obolibrary.org/obo/CHEBI_15940
48  79  nicotinic acid D-ribonucleotide  http://purl.obolibrary.org/obo/CHEBI_15763
Figure 11 Output example of MER for the abstracts with PubMed identifiers
29490421 and 29490060, and the Human Disease Ontology

acne  http://purl.obolibrary.org/obo/DOID_6543
asthma  http://purl.obolibrary.org/obo/DOID_2841
bronchitis  http://purl.obolibrary.org/obo/DOID_6132
chronic obstructive pulmonary disease  http://purl.obolibrary.org/obo/DOID_3083
COPD  http://purl.obolibrary.org/obo/DOID_3083
disease  http://purl.obolibrary.org/obo/DOID_4
gastroenteritis  http://purl.obolibrary.org/obo/DOID_2326
impetigo  http://purl.obolibrary.org/obo/DOID_8504
otitis media  http://purl.obolibrary.org/obo/DOID_10754
urinary tract infection  http://purl.obolibrary.org/obo/DOID_13148
stops are removed; leading and trailing white spaces are removed; and all special
characters are replaced by a full stop. Since some special characters may cause
matching problems, MER assumes that all the special characters (characters that
are not alphanumeric or a whitespace, for example, hyphens) can be matched by
any other character, so these characters are replaced by a full stop, as in regular
expressions. Figure 7 presents the contents of each of the four files created using
the terms shown in Figure 2. Note that the word acid-adenine was replaced by
acid.adenine, and the last file presents the first two words of each entry in the
third file. Note also that all the above steps are performed offline and only once per
lexicon.
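The splitting itself is a natural fit for awk's built-in word count (NF); a sketch under illustrative file names:

```shell
#!/bin/sh
# Toy lexicon, already lowercased and normalized.
printf '%s\n' 'aspirin' 'nicotinic acid' \
  'nicotinic acid d.ribonucleotide' > lexicon.txt

awk 'NF == 1' lexicon.txt > word1.txt   # one-word terms
awk 'NF == 2' lexicon.txt > word2.txt   # two-word terms
awk 'NF  > 2' lexicon.txt > words.txt   # three-or-more-word terms
awk 'NF  > 2 {print $1 " " $2}' lexicon.txt > words2.txt  # first two words
```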
The online module of MER starts when the user provides a new input text to
be annotated with a lexicon already pre-processed. The goal is to identify which
terms of the lexicon are mentioned in the text. The first step of MER is to apply
the same minor modifications to the input text as described above, but also remove
stop-words and words with fewer than a given number of characters. The file with
the list of stop-words and the minimum entity length are parameters that the user
can easily modify in the scripts. The list of stop-words used in this study is in
the stopwords.txt file of the GitHub repository. For this study, we selected 3 as the
minimum entity length because two-character acronyms are not so common, and
we empirically found that most of the two-character matches were errors.
This results in a processed input text derived from the original one. Note
that MER only recognizes direct matches; if lexical variations of the terms are
needed, they have to be added to the lexicon, for example by using a stemming
algorithm. MER then creates two alternation patterns: i) a one-word pattern, with
all the words in the input text; and ii) a two-word pattern, with all the consecutive
pairs of words in the input text. Figure 8 shows an example of these two patterns.
Next, MER creates three background jobs to match the terms composed of: i)
one word, ii) two words, and iii) three or more words. The one-word job uses the
one-word pattern to find matching terms in the one-word file. Similarly, the
two-word job uses the two-word pattern and file. The last job uses the two-word
pattern to find matches in the first-two-words file, and the resulting matches
are then used as a pattern to find terms in the more-words file. The last job is less
efficient since it executes grep twice; however, the resulting list of matches with the
first-two-words file is usually small, so the cost of the second execution is negligible.
In the end, each job will create a list of matching terms that are mentioned in the
input text.
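The three jobs can be sketched as follows, assuming the four pre-processed files of Figure 7 and hard-coding the two alternation patterns of Figure 8 (file names and patterns are illustrative, not MER's exact code):

```shell
#!/bin/sh
# Pre-processed lexicon files, as in Figure 7.
printf '%s\n' 'aspirin'                         > word1.txt
printf '%s\n' 'nicotinic acid'                  > word2.txt
printf '%s\n' 'nicotinic acid d.ribonucleotide' > words.txt
printf '%s\n' 'nicotinic acid'                  > words2.txt

one='nicotinic|acid|found|d.ribonucleotide'   # one-word pattern
two='nicotinic acid|acid found|found nicotinic|acid d.ribonucleotide'

# Three background jobs, one per term length; the third greps twice.
grep -E "^($one)$" word1.txt > m1.txt &
grep -E "^($two)$" word2.txt > m2.txt &
( grep -E "^($two)$" words2.txt | grep -F -f - words.txt > m3.txt ) &
wait
```

Here m1.txt ends up empty (aspirin is not in the text), m2.txt holds nicotinic acid, and m3.txt holds nicotinic acid d.ribonucleotide.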
Since the processed input text cannot be used to find the exact position of the
terms, MER uses the list of matching terms to find their exact positions in the original
input text. MER uses awk to find the multiple instances of each term in the original
input text. The awk tool has the advantage of working well with UTF-8 characters
that use more than one byte, unlike grep, which just counts bytes to
find the position of a match. MER provides partial overlaps, i.e. a shorter term
may occur at the same position as a longer one, but not fully overlapping matches
(same term in the same position). We also developed a test suite to refactor the
algorithm with more confidence that nothing is being done incorrectly. The test
suite is available in the GitHub repository branch dedicated to development [37].
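A sketch of this lookup with awk's index and length functions (in a UTF-8-aware awk and locale these count characters, so the α costs one position rather than two bytes; the invocation below is illustrative, not MER's exact code):

```shell
#!/bin/sh
text='α-maltose and nicotinic acid was found'

# Print the 0-based start and end character offsets of a term.
pos=$(printf '%s\n' "$text" | awk -v t='nicotinic acid' '{
  s = index($0, t)                    # 1-based offset of the match
  if (s) print s - 1, s - 1 + length(t)
}')
echo "$pos"
```

With a character-aware awk this prints the offsets shown in Figure 9; a byte-counting tool would shift them by one because of the two-byte α.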
Figure 9 shows the output of MER when using as input text the sentence in
Figure 8 and the lexicon of Figure 2. Note that nicotinic acid appears twice, at
positions 14 and 48, as expected, without affecting the match of nicotinic acid
D-ribonucleotide.
Linking
If the links file is provided, then MER will try to find each recognized term in that
file. This step is basically a grep anchored at the beginning of each line of the file, and
it only returns the first exact match of each term. Figure 10 shows the output of MER
when using the links file of Figure 6, which was missing in Figure 9. Figure 11 shows
the output of MER for two abstracts using the Human Disease Ontology. Note that
this functionality was implemented after our TIPS participation [38].
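A sketch of the lookup, with an illustrative two-line links file (tab-separated term and URI pairs, as in Figure 6):

```shell
#!/bin/sh
# Illustrative links file: term<TAB>URI, one pair per line.
printf 'nicotinic acid\thttp://purl.obolibrary.org/obo/CHEBI_15940\n' > links.tsv
printf 'nicotinic acid d-ribonucleotide\thttp://purl.obolibrary.org/obo/CHEBI_15763\n' >> links.tsv

term='nicotinic acid'
tab=$(printf '\t')
# First exact match anchored at the beginning of a line.
uri=$(grep -m 1 "^$term$tab" links.tsv | cut -f 2)
echo "$uri"
```

The trailing tab in the pattern keeps nicotinic acid from also matching the longer d-ribonucleotide entry.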
Figure 12 Number of terms, words, and characters in the lexicons used in TIPS,
obtained by using the following shell command: wc -lmw *.txt.

  #terms   #words     #char  #filename
  116616   137702   1027369  CELL_LINE_AND_CELL_TYPE.txt
  332167   446423  10397574  CHEMICAL.txt
   26216    92688    808366  DISEASE.txt
   73954    73954    991012  MIRNA.txt
  597867  1372326  11863642  PROTEIN.txt
    8146    26117    228167  SUBCELLULAR_STRUCTURE.txt
    5238    16283    126024  TISSUE_AND_ORGAN.txt
 1160204  2165493  25442154  total
Figure 13 Output example of MER using BeCalm TSV format for the sentence in
Figure 8 and the lexicon in Figure 2

1  A  0  9  0.54488  α-maltose  lexicon  1
1  A  14  28  0.621077  nicotinic acid  lexicon  1
1  A  48  62  0.621077  nicotinic acid  lexicon  1
1  A  48  79  0.708793  nicotinic acid D-ribonucleotide  lexicon  1
Annotation Server
TIPS is a novel task in BioCreative aiming at evaluating the performance of
NER web servers, based on reliability and performance metrics. The entities to be
recognized in TIPS were not restricted to a particular domain.
The web servers had to respond to single document annotation requests. The
servers had to be able to retrieve the text of documents from the patent server, the
abstract server and PubMed, without using any kind of cache for the text or for the
annotations. The annotations had to be provided in at least one of the following
formats: BeCalm JSON, BeCalm TSV, BioC XML or BioC JSON.
Lexicons
The first step to participate in TIPS was to select the data sources from which
we could collect terms related to the following accepted categories: Cell line and
cell type: Cellosaurus [39]; Chemical: HMDB [40], ChEBI [32] and ChEMBL [41];
Disease: Human Disease Ontology [35]; miRNA: miRBase [42]; Protein: Protein
Ontology [43]; Subcellular structure: cellular component aspect of Gene Ontology [44];
Tissue and organ: tissue and organ subsets of UBERON [45].
A post-extraction processing step was applied to these data files, which consisted of
lowercasing all terms, deleting leading and trailing white spaces, and removing
repeated terms. Since repeated annotations of different types were not allowed, we
created another lexicon containing terms that appeared in more than one of the
other lexicons. The terms matched to this lexicon were categorized as Unknown,
as suggested by the organization. The software to extract the lists of terms from
the above data sources can be found in the GitHub repository branch dedicated to
TIPS [37].
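This post-extraction processing amounts to a short shell pipeline (a sketch; the input file name and contents are illustrative):

```shell
#!/bin/sh
# Illustrative raw term list with case, padding, and a duplicate.
printf '%s\n' '  Aspirin ' 'aspirin' 'Nicotinic Acid' > raw_terms.txt

# Lowercase, trim leading/trailing whitespace, and de-duplicate.
tr 'A-Z' 'a-z' < raw_terms.txt \
  | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
  | sort -u > CHEMICAL.txt
```

The three raw lines above collapse to two normalized terms, aspirin and nicotinic acid.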
Figure 12 shows the number of terms, number of words, and number of characters
of each lexicon created. Our annotation server was thus able to recognize more than
1M terms composed of more than 2M words and more than 25M characters. All
lexicons are available for reuse as a zip file in the TIPS branch of our repository [37].
Input and Output
We adapted MER to provide the annotations in the BeCalm TSV format. Thus,
besides the input text and the lexicon, MER also had to receive the document
identifier and the section as input. In Figure 13, the document identifier is 1 and
the section is A. The score column is calculated as 1 - 1/ln(nc), where nc represents
the number of characters of the recognized term. This assumes that longer terms
are less ambiguous, and in that case the match should have a higher confidence
score. Note that MER only recognizes terms with three or more characters, so the
minimum score is 0.08976 and the score is always lower than 1.
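The score can be computed with awk's natural logarithm; a small sketch reproducing the two boundary values mentioned above:

```shell
#!/bin/sh
# Confidence score 1 - 1/ln(nc), where nc is the term length in characters.
score() { awk -v n="${#1}" 'BEGIN { printf "%.5f\n", 1 - 1 / log(n) }'; }

score 'abc'              # 3 characters: minimum score, 0.08976
score 'nicotinic acid'   # 14 characters: 0.62108
```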
We used jq [46], a command-line JSON processor, to parse the requests. The
retrieval of each document was implemented using the popular curl tool, and we
developed a specific parser for each data source to extract the text to be annotated.
The parsers are also available in the TIPS branch [37].
Infrastructure
Our annotation server was deployed in a cloud infrastructure composed of three
Virtual Machines (VMs). Each VM had 8 GB of RAM and 4 Intel Core CPUs at 1.7
GHz, using CentOS Linux release 7.3.1611 as the operating system. We selected
one VM (primary) to process the requests, distribute the jobs, and execute MER.
The other two VMs (secondary) just execute MER. We installed the NGINX HTTP
server running CGI scripts, given its high performance when compared with other
web servers [47]. We also used the Task Spooler tool [48] to manage and distribute
the jobs to be processed within the VMs.
The server is configured to receive the REST API requests defined in the BeCalm
API documentation. Each request is distributed to one of the three VMs according
to the least-connected method of NGINX. When a getAnnotations request is
received, the server first downloads the documents from the respective sources and
then processes the title and abstract of each document in the same VM. Two jobs
are spawned in the background, corresponding to the title and the abstract. Each
annotation server handles all the entity types mentioned in Figure 12, spawning a
separate job for each entity type. The name of the entity type is added as another
column to the output of Figure 9. These jobs can run in parallel since they are
independent from each other, and the output of each job can be easily merged into a
final TSV output file. When a job finishes processing, a script checks whether the
other jobs associated with the same request have also finished processing. If that is
the case, then the results of every job are concatenated and sent back to BeCalm
using the saveAnnotations method.
To test MER outside the scope of the TIPS competition, we implemented a
different REST API which accepts as input raw text and the name of a lexicon.
This way, the document does not have to be retrieved from external sources, and
we can evaluate the performance of MER independently. This alternative API can
be accessed online, along with a simple user interface shown in Figure 14 [49].
Results and Discussion
Computational Performance
Table 1 shows the official TIPS evaluation data of our system [50]. These results
refer to the whole period of the competition, from February 5, 2017 to March
Figure 14 Screenshot of the MER web graphical user interface.
Table 1 Official evaluation results of the TIPS task (time values are in seconds).
MER Best
# Requests 3.19E+05 3.19E+05
# Predictions 7.13E+06 2.74E+07
Mean time seek annotations (MTSA) 1.29E-01 s 1.37E-02 s
Mean time per document volume (MTDV) 2.38E-03 bytes/s 8.58E-04 bytes/s
Mean annotations per document (MAD) 2.25E+01 1.01E+02
Average response time (ART) 2.90E+00 s 1.07E+00 s
Mean time between failures (MTBF) 4.58E+06 s 4.58E+06 s
Mean time to repair (MTTR) 0.00E+00 s 0.00E+00 s
30, 2017. The evaluation process and metrics used are described in the workshop
article [26]. Each request consisted of one document that the server had to retrieve
either from PubMed or a repository hosted by the organization. Our server was
able to handle all 319k requests received during the evaluation period, generating
a total of 7.13M annotations (second best) with an average of 22.5 predictions
per document (MAD) (third best). On average, each prediction was generated
in 0.129 s (MTSA). Our average response time (ART) was 2.9 s, and the
processing time per document volume (MTDV) was 0.00238 bytes/s. The Mean time
between failures (MTBF) and Mean time to repair (MTTR) metrics were associated
with the reliability of the server, and our team obtained the maximum scores on those
metrics.
MER was able to efficiently process the documents, taking less than 3 seconds on
average without using any type of cache. We note that all documents, irrespective
of the source, were annotated using all the entity types presented in the previous
Lexicons section. Furthermore, the time to process each document is affected by the
external sources used to retrieve the document text. If the text is provided with the
request, then the processing time should be considerably shorter. Another factor is
the latency between our server and the TIPS server. As we were not able to measure
this latency, it is difficult to measure its impact on the response times, and it was
not taken into consideration for the evaluation metrics.
We compared the time necessary to process the same sentence on the same
hardware using MER and a more complex machine learning system, IBEnt [11],
using the sentence of Figure 8. While IBEnt took 8.25 seconds to process the
sentence, MER took only 0.098 seconds. Although IBEnt is optimized for batch
processing, which reduces the time per document as the number of documents
increases, MER was still 84 times faster in this experiment. Thus, besides being
easy to install and configure, MER is also a highly efficient and scalable NER and
NEL tool.
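The reported speed-up factor follows directly from the two measured times; a one-line awk check:

```shell
# 8.25 s (IBEnt) divided by 0.098 s (MER) gives the speed-up factor.
awk 'BEGIN { printf "%.0f\n", 8.25 / 0.098 }'
```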
Part of the optimization of MER is due to the four files that are generated by
the offline module. These files are generated from the lexicon file, which contains
one entity per line. For NEL, there is an additional step, which consists in
converting an OWL file into a lexicon. This process took around 15 minutes for
each ontology. In contrast, processing a lexicon file is much faster, taking 0.746
and 3.671 seconds for the HPO and ChEBI ontologies, respectively.
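The OWL-to-lexicon conversion can be illustrated with a minimal sketch (not MER's actual offline module; it assumes each rdfs:label element fits on a single line, and a tiny hand-made sample stands in for a real ontology download):

```shell
# Hypothetical sketch: extract a lexicon (one term per line) from an OWL file
# by pulling the text content of every rdfs:label element.
cat > sample.owl <<'EOF'
<owl:Class rdf:about="http://purl.obolibrary.org/obo/CHEBI_15377">
  <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">water</rdfs:label>
</owl:Class>
EOF
grep -o '<rdfs:label[^>]*>[^<]*</rdfs:label>' sample.owl \
  | sed 's/<[^>]*>//g' > lexicon.txt   # strip the XML tags, keep the term
cat lexicon.txt
```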
Precision and Recall
Table 2 Comparison between MER and BioPortal on the HPO gold-standard corpus.
NER NER+NEL
P R F P R F ART MTSA
BioPortal 0.6862 0.4463 0.5408 0.6118 0.3979 0.4822 1.15E+00 s 1.45E-01 s
MER 0.7184 0.4514 0.5544 0.6155 0.3868 0.4751 7.32E-01 s 9.59E-02 s
We compared the performance of MER with the BioPortal annotator, which is a
popular dictionary-lookup NER solution. To perform this comparison, we adapted
our server to directly receive free text as input, instead of requiring another request
to retrieve the documents. We used the HPO corpus to compare the two tools.
This corpus is composed of 228 scientific abstracts annotated with human
phenotypes, associated with the HPO. We used an updated version of this corpus,
which aimed at improving the consistency of the annotations [51]. A total of 2773
textual named entities were annotated in this corpus, corresponding to 2170 unique
entity mentions. We compared the quality of the results produced by each tool using
the standard precision, recall and F1-score measures, as well as the average time
necessary to process each document (ART) and the time per annotation (MTSA).
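As a sanity check on Table 2, the F1-score is the harmonic mean of precision and recall; plugging in MER's NER row reproduces the reported value:

```shell
# F1 = 2PR/(P+R); MER's NER precision and recall from Table 2 yield the
# reported F1-score of 0.5544.
awk 'BEGIN { p = 0.7184; r = 0.4514; printf "%.4f\n", 2 * p * r / (p + r) }'
```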
Table 2 shows the results of this comparison, where NER refers to matching the
offsets of the automatic annotations with the gold standard, and NEL refers to
matching the automatically annotated URIs with the gold standard. As expected,
combining both tasks (NER+NEL) results in lower scores than performing only
NER. On the NER task, MER obtained an F1-score of 0.5544, while BioPortal
obtained 0.5408. Considering the NEL task as well, BioPortal obtained a better
F1-score than MER, indicating that MER linked some entities to incorrect URIs.
The BioPortal annotator employs a semantic expansion technique that could lead
to more accurate URIs, using the relations defined in the ontology [52]. An
approach to improve the results would be to incorporate semantic similarity
measures, so that MER could also consider related classes in the NEL task [53].
However, MER obtained lower response times than BioPortal, both in time per
document and in time per annotation. To account for the difference in latency
between the two servers, we used the ping tool to calculate the round-trip time of
each server, averaged over 10 packets. MER obtained a round-trip time of 6.72E-03 s
while BioPortal obtained 1.86E-01 s, a difference of 1.79E-01 s. This means that
MER had a better connection to the machine we used to run the experiments, but
this had minimal impact when compared to the difference of 4.18E-01 s between
the two response times (ART).
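The round-trip measurement described above can be reproduced with ping and awk; here a sample of the standard Linux summary line stands in for a live "ping -c 10 &lt;server&gt;" call:

```shell
# Parse the "rtt min/avg/max/mdev" summary that Linux ping prints after
# "ping -c 10 <host>"; splitting on "/", field 5 is the average RTT.
echo 'rtt min/avg/max/mdev = 6.612/6.720/6.901/0.123 ms' \
  | awk -F'/' '/^rtt/ { printf "avg RTT: %s ms\n", $5 }'
```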
Table 3 Comparison between MER and Aho-Corasick on the HPO gold-standard corpus.
NER
P R F ART MTSA
Aho-Corasick 0.2282 0.2665 0.2459 0.8596 s 0.0786 s
MER 0.7184 0.4514 0.5544 0.5088 s 0.0667 s
We also compared MER with a well-known string search algorithm, Aho-Corasick
[29], using the HPO corpus. In this experiment, we did not attempt to match entities
to ontology concepts, as this would require additional enhancements to the Aho-
Corasick algorithm. We used the same HPO lexicon for both methods, as well as
the same documents. Unlike the comparison with BioPortal, this experiment was
done using local installations of MER and of the MultiFast tool [54], which provides
an implementation of the Aho-Corasick algorithm. Table 3 shows the results of this
comparison. MER obtained higher precision, recall and F1-score, as well as a lower
processing time per document and per annotation. MER obtained better evaluation
scores because it was developed specifically for NER, while Aho-Corasick is a generic
string search algorithm. Its processing time was also shorter, due to the lexicon
pre-processing done by the offline module of MER. However, this pre-processing is
quick (3.671 s for the HPO ontology) and only has to be done once.
Conclusions
We presented MER, a minimal named entity recognition and linking tool that was
developed with the concepts of flexibility, autonomy, and efficiency in mind. MER
is flexible since it can be extended with any lexicon composed of a simple list of
terms and their identifiers (if available). MER is autonomous since it only requires
a Unix shell with the awk and grep command-line tools, which are nowadays
available in all mainstream operating systems. MER is efficient since it takes
advantage of the high-performance pattern matching of grep, and of a novel
inverted recognition technique.
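A toy example of the kind of lexicon matching that grep enables (illustrative only; MER's actual pipeline adds tokenization, case handling, multi-word terms, and the inverted recognition step):

```shell
# Report every lexicon term that appears as a whole word in a document.
printf '%s\n' 'microcephaly' 'scoliosis' 'seizure' > lexicon.txt
echo 'The patient presented with microcephaly and recurrent seizure episodes.' > doc.txt
# -F: fixed strings, -f: patterns from file, -w: whole words, -o: print matches
grep -owF -f lexicon.txt doc.txt
```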
MER was integrated in an annotation server deployed in a cloud infrastructure for
participating in the TIPS task of BioCreative V.5. Our server was fully developed
in-house with minimal software dependencies and is open-source. Without using
any kind of cache, our server was able to process each document in less than 3
seconds on average. Performance and quality results show that MER is competitive
with state-of-the-art dictionary lookup solutions.
Availability of data and materials
The data and software used in this study are available at https://github.com/lasigeBioTM/MER.
List of abbreviations
NER - Named Entity Recognition
NEL - Named Entity Linking
URI - Uniform Resource Identifier
OWL - Web Ontology Language
ChEBI - Chemical Entities of Biological Interest
HPO - Human Phenotype Ontology
TIPS - Technical Interoperability and Performance of annotation Servers
MTSA - Mean Time Seek Annotations
MTDV - Mean Time per Document Volume
MAD - Mean Annotations per Document
ART - Average Response Time
MTBF - Mean Time Between Failures
MTTR - Mean Time To Repair
VM - Virtual Machine
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Conceptualization and methodology: FC and AL. Funding acquisition, project administration, and supervision: FC.
Investigation, validation, writing, review, and editing: FC and AL. Software: FC and AL. Visualization and writing
original draft: FC and AL.
Acknowledgements
This work was supported by FCT through funding of the DeST: Deep Semantic Tagger project, ref.
PTDC/CCI-BIO/28685/2017, LASIGE Research Unit, ref. UID/CEC/00408/2013 and BioISI, ref.
ID/MULTI/04046/2013. AL is recipient of a fellowship from BioSys PhD programme (ref PD/BD/106083/2015)
from FCT (Portugal). This work was produced with the support of the Portuguese National Distributed Computing
Infrastructure (http://www.incd.pt).
Author details
1LASIGE, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal. 2University of Lisboa,
Faculty of Sciences, BioISI - Biosystems & Integrative Sciences Institute, Campo Grande, C8 bdg, 1749-016, Lisboa,
Portugal.
References
1. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes
30(1), 3–26 (2007)
2. Krallinger, M., Rabal, O., Lourenço, A., Oyarzabal, J., Valencia, A.: Information retrieval and text mining
technologies for chemistry. Chemical reviews 117(12), 7673–7761 (2017)
3. Lamurias, A., Couto, F.: Text mining for bioinformatics using biomedical literature. In: Ranganathan, S., Nakai,
K., Schönbach, C., Gribskov, M. (eds.) Encyclopedia of Bioinformatics and Computational Biology vol. 1.
Oxford, Elsevier (2019). doi:10.1016/B978-0-12-809633-8.20409-3
4. MacDonald, M.C., Pearlmutter, N.J., Seidenberg, M.S.: The lexical nature of syntactic ambiguity resolution.
Psychological review 101(4), 676 (1994)
5. Wang, C.-K., Dai, H.-J., Jonnagaddala, J., Su, E.C.-Y.: An ensemble algorithm for sequential labelling: A case
study in chemical named entity recognition. In: Proceedings of the BioCreative V. 5 Challenge Evaluation
Workshop (2017)
6. Colón-Ruiz, C., Segura-Bedmar, I., Martínez, P.: Combining the BANNER tool with the DINTO ontology for
the CEMP task of BioCreative V.5. In: Proceedings of the BioCreative V.5 Challenge Evaluation Workshop (2017)
7. Leaman, R., Lu, Z.: Towards robust chemical recognition with TaggerOne at the BioCreative V.5 CEMP task. In:
Proceedings of the BioCreative V.5 Challenge Evaluation Workshop (2017)
8. Guo, Y., Zhao, S., Qu, C., Li, L.: Recognition of chemical entity mention in patents using feature-rich CRF. In:
Proceedings of the BioCreative V.5 Challenge Evaluation Workshop (2017)
9. Santos, A., Matos, S.: Neji: Recognition of chemical and gene mentions in patent texts. In: Proceedings of the
BioCreative V.5 Challenge Evaluation Workshop (2017)
10. Liu, Z., Wang, X., Tang, B., Chen, Q., Shi, X., Hou, J.: HITextracter system for chemical and gene/protein
entity mention recognition in patents. In: Proceedings of the BioCreative V.5 Challenge Evaluation Workshop
(2017)
11. Lamurias, A., Campos, L.F., Couto, F.M.: IBEnt: Chemical entity mentions in patents using ChEBI. In:
Proceedings of the BioCreative V.5 Challenge Evaluation Workshop (2017)
12. Luo, L., Yang, P., Yang, Z., Lin, H., Wang, J.: DUTIR at the BioCreative V.5 BeCalm tasks: A BLSTM-CRF
approach for biomedical entity recognition in patents. In: Proceedings of the BioCreative V.5 Challenge
Evaluation Workshop (2017)
13. Corbett, P., Boyle, J.: Chemlistem: chemical named entity recognition using recurrent neural networks. In:
Proceedings of the BioCreative V.5 Challenge Evaluation Workshop (2017)
14. Dai, H.-J., Lai, P.-T., Chang, Y.-C., Tsai, R.T.-H.: Enhancing of chemical compound and drug name
recognition using representative tag scheme and fine-grained tokenization. Journal of cheminformatics 7(S1),
14 (2015)
15. Krallinger, M., Rabal, O., Leitner, F., Vazquez, M., Salgado, D., Lu, Z., Leaman, R., Lu, Y., Ji, D., Lowe,
D.M., et al.: The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of
cheminformatics 7(1), 2 (2015)
16. Pafilis, E., Buttigieg, P.L., Ferrell, B., Pereira, E., Schnetzer, J., Arvanitidis, C., Jensen, L.J.: EXTRACT:
interactive extraction of environment metadata and term suggestion for metagenomic sample annotation.
Database 2016 (2016)
17. Kirschnick, J., Thomas, P.: SIA: Scalable interoperable annotation server (2017)
18. Jonnagaddala, J., Jue, T.R., Chang, N.-W., Dai, H.-J.: Improving the dictionary lookup approach for disease
normalization using enhanced dictionary and query expansion. Database 2016 (2016)
19. Kraus, M., Niedermeier, J., Jankrift, M., Tietböhl, S., Stachewicz, T., Folkerts, H., Uflacker, M., Neves, M.:
Olelo: a web application for intuitive exploration of biomedical literature. Nucleic acids research 45(W1),
478–483 (2017)
20. Rinaldi, F., Clematide, S., Marques, H., Ellendorff, T., Romacker, M., Rodriguez-Esteban, R.: OntoGene web
services for biomedical text mining. BMC bioinformatics 15(14), 6 (2014)
21. MacKinlay, A., Verspoor, K.: A web service annotation framework for ctd using the uima concept mapper. In:
BioCreative Challenge Evaluation Workshop Vol, vol. 1 (2013)
22. Tenopir, C., King, D.W.: Reading behaviour and electronic journals. Learned Publishing 15(4), 259–265 (2002)
23. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio,
M., Edgar, R., Federhen, S., et al.: Database resources of the national center for biotechnology information.
Nucleic acids research 35(suppl 1), 5–12 (2006)
24. Newham, C., Rosenblatt, B.: Learning the Bash Shell: Unix Shell Programming. O’Reilly Media, Inc. (2005)
25. Bash download page. https://ftp.gnu.org/gnu/bash/. (Accessed on 11/06/2018)
26. Pérez-Pérez, M., Pérez-Rodríguez, G., Blanco-Míguez, A., Fdez-Riverola, F., Valencia, A., Krallinger, M.,
Lourenço, A.: Benchmarking biomedical text mining web servers at BioCreative V.5: the technical
Interoperability and Performance of annotation Servers - TIPS track. In: Proceedings of the BioCreative V.5
Challenge Evaluation Workshop (2017)
27. Whetzel, P.L., Noy, N.F., Shah, N.H., Alexander, P.R., Nyulas, C., Tudorache, T., Musen, M.A.: BioPortal:
enhanced functionality via new web services from the national center for biomedical ontology to access and use
ontologies in software applications. Nucleic acids research 39(suppl 2), 541–545 (2011)
28. Groza, T., Köhler, S., Doelken, S., Collier, N., Oellrich, A., Smedley, D., Couto, F.M., Baynam, G., Zankl, A.,
Robinson, P.N.: Automatic concept recognition using the Human Phenotype Ontology reference and test suite
corpora. Database 2015, 1–13 (2015). doi:10.1093/database/bav005
29. Aho, A.V., Corasick, M.J.: Efficient string matching: An aid to bibliographic search. Commun. ACM 18(6),
333–340 (1975). doi:10.1145/360825.360855
30. MER Source code. https://github.com/lasigeBioTM/MER. (Accessed on 11/06/2018)
31. Degtyarenko, K., De Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A., Alcántara, R., Darsow,
M., Guedj, M., Ashburner, M.: ChEBI: a database and ontology for chemical entities of biological interest.
Nucleic acids research 36(suppl 1), 344–350 (2007)
32. ChEBI ontology. ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi_lite.owl. (Accessed on
11/06/2018)
33. Human Phenotype Ontology.
https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.owl.
(Accessed on 11/06/2018)
34. Köhler, S., Doelken, S.C., Mungall, C.J., Bauer, S., Firth, H.V., Bailleul-Forestier, I., Black, G.C., Brown, D.L.,
Brudno, M., Campbell, J., et al.: The human phenotype ontology project: linking molecular biology and disease
through phenotype data. Nucleic acids research 42(D1), 966–974 (2013)
35. Disease Ontology. https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/
master/src/ontology/doid.owl. (Accessed on 11/06/2018)
36. Kibbe, W.A., Arze, C., Felix, V., Mitraka, E., Bolton, E., Fu, G., Mungall, C.J., Binder, J.X., Malone, J.,
Vasant, D., et al.: Disease ontology 2015 update: an expanded and updated database of human diseases for
linking biomedical knowledge through disease data. Nucleic acids research 43(D1), 1071–1078 (2014)
37. MER Source code for BioCreative V.5 2017. https://github.com/lasigeBioTM/MER/tree/biocreative2017.
(Accessed on 11/06/2018)
38. Couto, F.M., Campos, L.F., Lamurias, A.: Mer: a minimal named-entity recognition tagger and annotation
server (2017)
39. ExPASy - Cellosaurus. https://web.expasy.org/cellosaurus/. (Accessed on 11/06/2018)
40. Wishart, D.S., Feunang, Y.D., Marcu, A., Guo, A.C., Liang, K., Vázquez-Fresno, R., Sajed, T., Johnson, D., Li,
C., Karu, N., Sayeeda, Z., Lo, E., Assempour, N., Berjanskii, M., Singhal, S., Arndt, D., Liang, Y., Badran, H.,
Grant, J., Serra-Cayuela, A., Liu, Y., Mandal, R., Neveu, V., Pon, A., Knox, C., Wilson, M., Manach, C.,
Scalbert, A.: HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Research 46(D1), 608–617
(2018). doi:10.1093/nar/gkx1089
41. Gaulton, A., Hersey, A., Nowotka, M., Bento, A.P., Chambers, J., Mendez, D., Mutowo, P., Atkinson, F.,
Bellis, L.J., Cibrián-Uhalte, E., Davies, M., Dedman, N., Karlsson, A., Magariños, M.P., Overington, J.P.,
Papadatos, G., Smit, I., Leach, A.R.: The ChEMBL database in 2017. Nucleic Acids Research 45(D1), 945–954
(2017). doi:10.1093/nar/gkw1074
42. Kozomara, A., Griffiths-Jones, S.: miRBase: annotating high confidence microRNAs using deep sequencing
data. Nucleic Acids Research 42(D1), 68–73 (2014). doi:10.1093/nar/gkt1181
43. PRotein Ontology (PRO). http://www.obofoundry.org/ontology/pr.html. (Accessed on 11/06/2018)
44. Consortium, G.O.: Expansion of the gene ontology knowledgebase and resources. Nucleic acids research
45(D1), 331–338 (2016)
45. Haendel, M.A., Balhoff, J.P., Bastian, F.B., Blackburn, D.C., Blake, J.A., Bradford, Y., Comte, A., Dahdul,
W.M., Dececchi, T.A., Druzinsky, R.E., et al.: Unification of multi-species vertebrate anatomy ontologies for
comparative biology in Uberon. Journal of biomedical semantics 5(1), 21 (2014)
46. jq. https://stedolan.github.io/jq/. (Accessed on 11/06/2018)
47. Reese, W.: Nginx: the high-performance web server and reverse proxy. Linux Journal 2008(173), 2 (2008)
48. i Rossell, L.B.: Task Spooler - batch is back! http://vicerveza.homeunix.net/~viric/soft/ts/. (Accessed
on 11/06/2018)
49. MER. http://labs.rd.ciencias.ulisboa.pt/mer/. (Accessed on 11/06/2018)
50. Pérez-Pérez, M., Pérez-Rodríguez, G., Blanco-Míguez, A., Fdez-Riverola, F., Valencia, A., Krallinger, M.,
Lourenço, A.: Next generation community assessment of biomedical entity recognition web servers: metrics,
performance, interoperability aspects of BeCalm. Journal of Cheminformatics (2018)
51. Lobo, M., Lamurias, A., Couto, F.: Identifying human phenotype terms by combining machine learning and
validation rules. BioMed Research International 2017 (2017). doi:10.1155/2017/8565739
52. Shah, N.H., Bhatia, N., Jonquet, C., Rubin, D., Chiang, A.P., Musen, M.A.: Comparison of concept recognizers
for building the open biomedical annotator. In: BMC Bioinformatics, vol. 10, p. 14 (2009). BioMed Central
53. Couto, F., Lamurias, A.: Semantic similarity definition. In: Ranganathan, S., Nakai, K., Schönbach, C.,
Gribskov, M. (eds.) Encyclopedia of Bioinformatics and Computational Biology vol. 1. Oxford, Elsevier (2019).
doi:10.1016/B978-0-12-809633-8.20401-9
54. MultiFast 2.0.0. http://multifast.sourceforge.net/. (Accessed on 11/06/2018)