Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature

Arjun Magge1,2 and Davy Weissenbacher3 and Abeed Sarker3 and Matthew Scotch1,2 and
Graciela Gonzalez-Hernandez3
1College of Health Solutions, 2Biodesign Center for Environmental Health Engineering,
Arizona State University, Tempe, AZ 85281, USA
3Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine,
University of Pennsylvania, Philadelphia, PA 19104, USA.
E-mail: Matthew.Scotch@asu.edu
Phylogeography research involving virus spread and tree reconstruction relies on accurate
geographic locations of infected hosts. Insufficient level of geographic information in nu-
cleotide sequence repositories such as GenBank motivates the use of natural language pro-
cessing methods for extracting geographic location names (toponyms) in the scientific article
associated with the sequence, and disambiguating the locations to their co-ordinates. In this
paper, we present an extensive study of multiple recurrent neural network architectures for
the task of extracting geographic locations and their effective contribution to the
disambiguation task using population heuristics. The methods presented in this paper achieve a
strict detection F1 score of 0.94, disambiguation accuracy of 91% and an overall resolution
F1 score of 0.88, which are significantly higher than those of previously developed methods,
improving our capability to find the location of infected hosts and enrich metadata information.
Keywords: Named Entity Recognition; Toponym Detection; Toponym Disambiguation; To-
ponym Resolution; Natural Language Processing; Deep Learning;
1. Introduction
Nucleotide sequence repositories like GenBank contain millions of records from various
organisms collected around the world, which enables researchers to perform phylogenetic tree and
spread reconstruction. However, a vast majority of the records (65-80%)1,2 contain geographic
information that is deemed to be at an insufficient level of granularity; information that is
often present in the associated published article. This motivates the use of natural language
processing (NLP) methods to find the geographic location (or toponym) of infected hosts
in the full text. In NLP, this task of detecting toponyms from unstructured text, and then
disambiguating the locations to their co-ordinates is formally known as toponym resolution.
Toponym resolution in scientific articles can be used to obtain precise geospatial metadata
of infected hosts which is highly beneficial in building transmission models in phylogeography
that could enable public health agencies to target high-risk areas. Improvement in geospatial
metadata also enriches other scientific studies that utilize GenBank data, such as those in
population genetics, environmental health, and epidemiology in general, as geographic location
© 2018 The Authors. Open Access chapter published by World Scientific Publishing Company and
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC)
4.0 License.
Pacific Symposium on Biocomputing 2019
is often used in addition to or as a proxy for other demographic data. Toponym resolution
is typically accomplished in two stages: (1) toponym detection (geotagging), a named entity
recognition (NER) task in NLP, and (2) toponym disambiguation (geocoding).
For instance, given the sentence “Our study mainly focused on pediatric cases with differ-
ent outcomes from the most populated city in Argentina and one of the hospitals in Buenos
Aires where patients are most often referred.”, the detection stage deals with extracting the
locations “Argentina” and “Buenos Aires”.3 The disambiguation stage deals with assigning
the most likely unique identifiers from gazetteer resources like Geonames^a to each location
detected, e.g. “3865483:Argentina” from 145 candidate entries containing the same name and
“3435910:Buenos Aires” from 943 candidate entries with variations of the same name. Both
tasks bring forth interesting NLP challenges with applications in a wide number of areas.
In this work, we present a system for toponym detection and disambiguation that improves
substantially over previously published systems for this task, including our own.4–6 Since
detection is the first step in the process, its impact on the overall performance of the combined
task is multiplied, as locations not detected can never be disambiguated. We use recurrent
neural network (RNN) architectures that use word embeddings, character embeddings and case
features as input for performing the detection task. In addition to these, we also experiment
with the use of conditional random fields (CRF) on the output layer as they are known to
improve performance. We perform ablation studies/leave-one-out analysis with repeated runs
with different seed values to draw strong conclusions about the use of deep recurrent neural
networks, their architectural variations and common features. We evaluate the impact of the
results from the detection task on the downstream disambiguation task, performed using the
commonly assumed population heuristic,7 whereby the location with the greatest population
is chosen as the correct match.
The rest of the document is structured as follows. In Section 2, we summarize research
efforts in the area of toponym detection and disambiguation and list the contributions of this
paper in light of previous work. We describe the RNN architectures used for evaluation
along with the population heuristic used for disambiguation in Section 3. Finally, we present
and discuss the results of the toponym detection and disambiguation experiments in Section
4 and discuss limitations and scope for improvement in Section 5.
2. Related Work
Toponym detection and toponym disambiguation have been widely researched by the NLP
community, with numerous publications on both detection and disambiguation tasks.8–10 To-
ponym detection is commonly tackled as a NER challenge where toponyms are recognized
among other named entities like organization names and people’s names. Previous studies11
have identified the performance of the NER as an important source of errors in enhancing
geospatial metadata in GenBank, motivating the development of tools for performing detec-
tion and resolution of named entities such as infected hosts and geographical locations.12,13
The annotated dataset used in this work4,11 includes both span and normalized Geonames ID
a http://www.geonames.org/ Accessed: Sept 30 2018
annotations. Since the performance of the overall resolution task is deeply influenced by the
NER, some of the previous works using this dataset have looked specifically at improving the
NER’s performance. Our previous research on toponym detection has used rule-based
methods,4 traditional machine learning sequence taggers using conditional random fields (CRF)5
and deep learning methods using feedforward neural networks.6 NER performance since the
introduction of the dataset has increased from an F1-score of 0.70 to 0.91, closing in on the
human-level annotation agreement of 0.97. In the previous baseline for toponym resolution,4
a rule-based extraction system was used to detect toponyms. In subsequent work, traditional
machine learning algorithms such as conditional random fields (CRFs)5 and feedforward
neural nets6 were introduced for improving the NER’s performance. There exist some studies
that explore the use of RNN architectures for sequence tagging
tasks in the generic domain.14,15 While these studies measure performance on specific tasks,
the effect of optimal performance has not been measured in downstream tasks.
On the other hand, toponym disambiguation has been commonly tackled as an information
retrieval challenge by creating an inverted index of Geonames entries.4,16 Given a toponym,
candidate locations are first retrieved based on words used in the toponym and subsequently
heuristics are used to pick the most appropriate location. Popular techniques use metrics such
as entity co-occurrences, similarity measures, distance metrics, context features and topic
modeling.7,16–20 This approach is largely adopted due to the large number of Geonames entries
(about 12 million) to choose from. We also find that the most common baseline used for
measuring the disambiguation performance is the population heuristic, where the place with the
largest population is chosen as the correct match. Most research articles that focus specifically
on the disambiguation problem use the Stanford-NER or the Apache-NER tools20–22 for detection,
which have been trained on datasets like CoNLL-2003, ACE-2005 and MUC. Some studies
assume gold standard labels and proceed with the task of disambiguation which makes it
difficult to assess the strength of the overall system. It is also important to note that a
majority of efforts have been focused on texts from a general domain like Wikipedia or news
articles.20–22 Only a handful of publications deal with the problem in other domains like
biomedical scientific articles4,23 which contain a different and broader vocabulary. Similar to
the previous disambiguation method developed for this dataset,4 we build an inverted index
using Geonames entries but use term expansion techniques to improve the performance and
usability of the system in various contexts.
In light of previous work, the main contributions of this work can be summarized as follows:
(i) We perform a comprehensive and systematic evaluation of multiple RNN architectures
from over 400 individual runs for the task of toponym detection in scientific articles and
arrive at state-of-the-art results compared to previous methods.
(ii) We discuss the impact of significant performance improvement in toponym detection on
the downstream task of toponym resolution.
3. Methods
We tackle the detection and disambiguation of geographic locations independently,
as described in the following subsections. For the purposes of training and evaluation,
we use the publicly available human annotated corpus of 60 full-text PMC articles containing
1881 toponyms.4 Of the 60, the standard test set for the corpus includes only 12 articles
containing a total of 285 toponyms, a large majority of which are countries and major locations.
The annotated dataset contains both span annotations and gazetteer ID annotations linking
ISO-3166-1 codes for countries and GeonameIDs for the remaining toponyms. For uniformity,
we converted all ISO-3166-1 codes to equivalent GeonameIDs.
3.1. Toponym Detection
The task of toponym detection typically involves identifying the spans of the toponyms in
an NER task where the sequence of actions is illustrated in Fig 1. As input features, we
use publicly available pre-trained word embeddings that were trained on Wikipedia, PubMed
abstracts and PubMed Central full text articles.24 In addition to word embeddings, we
experiment with orthogonal features such as (1) a case feature to explicitly distinguish all-uppercase,
all-lowercase and camel-case words, encoded as one-hot vectors that are appended to the word
embedding, and (2) fixed length character embeddings. Character embeddings have been shown to
improve the performance of deep neural networks and are employed in a few different ways. One
popular method applies a CNN layer25 or an LSTM layer26 to vectors from
randomly initialized character embeddings that are fine-tuned during training, and appends the
result to the input word embedding layer. During initial experiments we found that
this architecture added significantly to the training time, and hence we employ
a simpler model where character embeddings are pre-trained using word2vec and appended
directly to the input layer along with word embeddings and case features.
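The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the case classes, the zero-padding scheme used to obtain a fixed-length character feature, and the names `case_one_hot`, `char_features` and `token_input` are all assumptions made for the example, and the embedding tables are plain dictionaries standing in for pre-trained vectors.

```python
from typing import Dict, List

# Hypothetical case classes; "other" covers tokens without letters (e.g. numbers).
CASE_CLASSES = ["all_upper", "all_lower", "camel", "other"]


def case_one_hot(token: str) -> List[float]:
    """Return a one-hot case feature for a single token."""
    if token.isupper():
        label = "all_upper"
    elif token.islower():
        label = "all_lower"
    elif any(c.isupper() for c in token) and any(c.islower() for c in token):
        label = "camel"
    else:
        label = "other"
    return [1.0 if label == name else 0.0 for name in CASE_CLASSES]


def char_features(token: str, char_vecs: Dict[str, List[float]],
                  char_dim: int, max_chars: int) -> List[float]:
    """Assumed scheme: concatenate the first max_chars character vectors, zero-padded."""
    feats: List[float] = []
    for ch in token[:max_chars]:
        feats.extend(char_vecs.get(ch, [0.0] * char_dim))
    feats.extend([0.0] * (char_dim * max_chars - len(feats)))
    return feats


def token_input(token: str, word_vecs: Dict[str, List[float]], word_dim: int,
                char_vecs: Dict[str, List[float]], char_dim: int,
                max_chars: int = 8) -> List[float]:
    """Concatenate word embedding, case feature and character feature."""
    word = word_vecs.get(token.lower(), [0.0] * word_dim)
    return list(word) + case_one_hot(token) + char_features(
        token, char_vecs, char_dim, max_chars)
```

Each token is thus mapped to a single vector whose length is the word dimension plus the number of case classes plus `char_dim * max_chars`, which is what the recurrent layer would consume at each time step.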
The proposed RNN units and their variations can be used on their own for NER purposes.
However, bidirectional architectures are popularly employed for NER as they have the com-
bined capability of processing input sentences in both directions and making tagging decisions
collectively using an output layer as illustrated in Fig. 1. In this paper, we specifically look
at bi-directional recurrent architectures. It is also common to observe the use of a CRF output
layer on top of the output layer of a bidirectional RNN architecture. CRFs are known to add
consistency in making final tagging decisions using IOB or IOBES styled annotations. We
experiment with combinations of the RNN variants along with the optional features in an
ablation study to identify the impact of these additive layers on the NER’s performance as
well as their impact on the downstream resolution task.
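To make the IOB tagging scheme concrete, the sketch below converts a sequence of per-token tags into toponym spans. It is an illustrative helper only: the tag names `B`, `I` and `O` without an entity-type suffix, the half-open token offsets, and the function name `iob_to_spans` are assumptions, not details taken from the system described here.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert IOB tags into (start, end) token spans, with end exclusive."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new entity begins; close any entity that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


tokens = ["hospitals", "in", "Buenos", "Aires", "and", "Argentina"]
tags = ["O", "O", "B", "I", "O", "B"]
spans = iob_to_spans(tags)
toponyms = [" ".join(tokens[s:e]) for s, e in spans]
```

An `I` tag that does not follow a `B` is treated here as the start of a new span; inconsistent sequences of this kind are what a CRF output layer is intended to discourage.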
3.1.1. Recurrent Neural Networks
RNN architectures have been widely used for auto-encoders and sequence labeling tasks such
as part-of-speech tagging, NER, chunking among others.27 RNNs are variants of feedforward
neural networks that are equipped with recurrent units to carry signals from the previous
output $y_{t-1}$ for making decisions at time $t$ (output $y_t$), as shown in equation 1.
$$y_t = \sigma(W \cdot x_t + U \cdot y_{t-1} + b) \qquad (1)$$
Here, $W$ and $U$ are the weight matrices and $b$ is the bias term that are randomly
initialized and updated during training. $\sigma$ represents the sigmoid activation function. In practice
Fig. 1. A schematic representation of the sequence of actions performed in the NER equipped with
bi-directional RNN layers and an output CRF layer. RNN variants discussed in this paper involve
replacing RNN units with LSTM, LSTM-Peepholes, GRU and UG-RNN units.
other activation functions such as tanh and rectified linear units (ReLU) are also used. This
characteristic recurrent feature simulates a memory function that makes it ideal for tasks
involving sequential predictions dependent on previous decisions. However, learning the necessary
long term dependencies has been found to be difficult using RNN units alone.28
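Equation 1 can be written out directly as a small pure-Python sketch. The helper names (`matvec`, `rnn_step`, `run_rnn`) and the zero initial output are assumptions made for illustration; a practical implementation would use a tensor library rather than nested lists.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M: Matrix, v: Vector) -> Vector:
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_step(W: Matrix, U: Matrix, b: Vector, x_t: Vector, y_prev: Vector) -> Vector:
    """One recurrent step: y_t = sigmoid(W.x_t + U.y_prev + b)."""
    wx = matvec(W, x_t)
    uy = matvec(U, y_prev)
    return [sigmoid(a + c + d) for a, c, d in zip(wx, uy, b)]


def run_rnn(W: Matrix, U: Matrix, b: Vector, xs: List[Vector]) -> List[Vector]:
    """Apply rnn_step over a sequence, starting from an all-zero previous output."""
    y = [0.0] * len(b)
    outputs = []
    for x_t in xs:
        y = rnn_step(W, U, b, x_t, y)
        outputs.append(y)
    return outputs
```

A bi-directional layer would run `run_rnn` once over the sentence and once over its reverse with separate weights, then combine the two outputs for each token.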
3.1.2. LSTM
LSTM networks29 are variants of RNN that have proven to be fairly successful at learning long
term dependencies. A candidate output $g$ is calculated using an equation similar to equation
1 and further manipulated based on previous and current states of a cell that retains signals
simulating long-term memory. The LSTM cell’s state is controlled by forget ($f$), input ($i$) and
output ($o$) gates that control how much information flows from the input to the state and
from the state to the output. The gates themselves depend on the current input and previous outputs.
$$g = \tanh(W_g \cdot x_t + U_g \cdot y_{t-1} + b_g) \qquad (2)$$
$$f = \sigma(W_f \cdot x_t + U_f \cdot y_{t-1} + b_f) \qquad (3)$$
$$i = \sigma(W_i \cdot x_t + U_i \cdot y_{t-1} + b_i) \qquad (4)$$
$$o = \sigma(W_o \cdot x_t + U_o \cdot y_{t-1} + b_o) \qquad (5)$$
The future state of the cell $c_t$ is calculated as a combination of (1) signals from the forget gate
$f$ and the previous state of the cell $c_{t-1}$, which determine the information to forget (or retain)
in the cell, and (2) signals from the input gate $i$ and the candidate output $g$ that determine
the information from the input to be stored in the cell. Eventually the output $y_t$ is calculated
using signals from the output gate $o$ and the current state of the cell $c_t$.
$$c_t = f \odot c_{t-1} + i \odot g \qquad (6)$$
$$y_t = o \odot \tanh(c_t) \qquad (7)$$
In the above equations, $\odot$ indicates the pointwise multiplication operation. While the above
equations represent LSTM in its most basic form, many variations of the architecture have been
introduced to simulate retention of long-term signals, a few of which have been summarized
in the following subsections and subsequently evaluated in the results section. For reasons of
brevity, we do not include the formulas used for calculating the output $y_t$ but they can be
inferred from the works cited.
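Equations 2 to 7 translate into a single update step, sketched below in pure Python. The parameter layout (a dictionary mapping each gate name to a `(W, U, b)` triple) and the function names are assumptions chosen for readability, not the authors' TensorFlow code.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GateParams = Tuple[Matrix, Matrix, Vector]  # (W, U, b) for one gate


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W: Matrix, U: Matrix, b: Vector, x_t: Vector, y_prev: Vector) -> Vector:
    """Compute W.x_t + U.y_prev + b row by row."""
    return [
        sum(w * x for w, x in zip(w_row, x_t))
        + sum(u * y for u, y in zip(u_row, y_prev))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(params: Dict[str, GateParams], x_t: Vector,
              y_prev: Vector, c_prev: Vector) -> Tuple[Vector, Vector]:
    """One LSTM step following equations 2-7; returns (y_t, c_t)."""
    g = [math.tanh(v) for v in affine(*params["g"], x_t, y_prev)]  # eq. 2
    f = [sigmoid(v) for v in affine(*params["f"], x_t, y_prev)]    # eq. 3
    i = [sigmoid(v) for v in affine(*params["i"], x_t, y_prev)]    # eq. 4
    o = [sigmoid(v) for v in affine(*params["o"], x_t, y_prev)]    # eq. 5
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]  # eq. 6
    y_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]                  # eq. 7
    return y_t, c_t
```

With all weights and biases at zero, every gate evaluates to 0.5 and the candidate to 0, so the cell simply retains half of its previous state, which is a convenient sanity check for the step.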
3.1.3. Other Gated RNN Architectures
We evaluate in our experiments one of the LSTM variations introduced for speech processing30
that introduced the notion of peepholes (LSTM-Peep), where the idea is that the state of the cell
influences the input, forget and output gates. Here, signals for the input and forget gates $i$
and $f$ depend not only on the previous output $y_{t-1}$ and current input $x_t$ but also on the previous
state of the cell $c_{t-1}$, and the output gate $o$ depends on the current state of the cell $c_t$.
Gated Recurrent Unit (GRU),31 also known as coupled input and forget gate LSTM
(CIFG-LSTM),15 is a simpler variation of LSTM with only two gates: update $z$ and reset $r$. Their
signals are determined based on the current input $x_t$ and previous output $y_{t-1}$, similar to the
gates in LSTMs. The update gate $z$ attempts to combine the functionality of the input and forget
gates of LSTMs, $i$ and $f$, and eliminates the need for an output gate as well as an explicit
cell state. A single update gate signal $z$ controls the information flow to the output value.
Although it appears far simpler, GRU has gained a lot of popularity in recent years
in a variety of NLP tasks.32,33
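For reference, one common formulation of the GRU update, written in the notation of equations 1 to 7, is given below. It is a sketch of the standard formulation from the literature rather than a formula stated in this paper; the candidate symbol $g$ is reused by analogy with the LSTM, and some references swap the roles of $z$ and $1 - z$.

```latex
\begin{align}
z   &= \sigma(W_z \cdot x_t + U_z \cdot y_{t-1} + b_z) \\
r   &= \sigma(W_r \cdot x_t + U_r \cdot y_{t-1} + b_r) \\
g   &= \tanh(W_g \cdot x_t + U_g \cdot (r \odot y_{t-1}) + b_g) \\
y_t &= (1 - z) \odot y_{t-1} + z \odot g
\end{align}
```

The single gate $z$ plays both roles here: $z$ weights the new candidate (as the input gate does in an LSTM) while $1 - z$ weights the retained previous output (as the forget gate does).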
Update gate RNN (UG-RNN),34 a much simpler variation of the LSTM and GRU architectures
containing only an update gate $z$, is also included in our experiments. The importance
of the update gate is often highlighted in RNN based architectures.15 Hence, we include this
model to perform a gate based ablation study to understand the gates' contributions to the overall
resolution task.
3.1.4. Hyperparameter search and optimization
The performance of deep neural networks relies greatly on the optimization of their
hyperparameters, and the performance of the models has been found to be sensitive to changes in the seed
values used for initializing the weight matrices.27 We first performed a grid search over the
previously recommended optimal range of hyperparameter space for NER tasks27 to arrive
at potential candidates for optimal configurations. We then performed up to 5 repetitions of
experiments at the optimal setting for the model with different seed values to obtain the median
performance scores. All models were developed using the TensorFlow framework and trained
on NVIDIA Titan Xp GPUs in a machine equipped with an Intel Xeon CPU (E5-2687W v4).
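The search-then-repeat procedure can be sketched as below. Everything model-specific is a placeholder: `train_and_evaluate` is a hypothetical stand-in that returns a synthetic score instead of training a network, and the hyperparameter names and values are invented for illustration only.

```python
import itertools
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def train_and_evaluate(config: Dict[str, float], seed: int) -> float:
    """Hypothetical stand-in for training a model; returns a synthetic F1 score."""
    rng = random.Random(seed)
    base = 0.90 + 0.01 * (config["units"] == 100)  # arbitrary toy response surface
    return base + rng.uniform(-0.005, 0.005)       # seed-dependent noise


def expand_grid(space: Dict[str, Sequence[float]]) -> List[Dict[str, float]]:
    """Enumerate every combination of hyperparameter values."""
    keys = sorted(space)
    return [dict(zip(keys, values))
            for values in itertools.product(*(space[k] for k in keys))]


def select_and_repeat(space: Dict[str, Sequence[float]], search_seed: int,
                      repeat_seeds: Sequence[int]) -> Tuple[Dict[str, float], float]:
    """Grid search with one seed, then report the median over repeated seeds."""
    candidates = expand_grid(space)
    best = max(candidates, key=lambda c: train_and_evaluate(c, search_seed))
    scores = [train_and_evaluate(best, s) for s in repeat_seeds]
    return best, statistics.median(scores)


space = {"units": [50, 100], "dropout": [0.25, 0.5]}
best_config, median_f1 = select_and_repeat(space, search_seed=0, repeat_seeds=range(5))
```

Reporting the median over seeds rather than the single best run reduces the influence of a lucky initialization on the comparison between architectures.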
3.2. Toponym Disambiguation
For toponym disambiguation, we use the Geonames gazetteer data to build an inverted index
using Apache Lucene^b and search for the toponym terms extracted in the toponym detection
step in the index.
3.2.1. Building Geonames Index
Individual Geonames entries in the index are documents with common fields such as
GeonameID, LocationName, Latitude, Longitude, LocationClass, LocationCode, Population,
Continent and AncestorNames. Here, LocationName contains the common name of the place. For
countries, we expand this field by using official names, ISO and ISO3 abbreviations (e.g. United
States of America, US and USA, respectively, for United States). For ADM1 (Administrative
Level 1) entries that have available abbreviations (e.g. AZ for Arizona, and CA for
California), we add such alternate names to the LocationName field. In addition to the above fields
we add the County, State and Country fields depending on the type of Geonames entry. Fields
such as LocationName, County, State, Country and AncestorNames are chosen to be reverse
indexed such that partial matches of names offer the possibility of being matched with the
right disambiguated toponym on a search.
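The idea of a reverse (inverted) index over expanded name fields can be illustrated with a small in-memory sketch. This is not Apache Lucene and not the authors' code: the tokenizer, the choice of indexed fields, the helper names and the three toy entries (with made-up IDs) are assumptions for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

INDEXED_FIELDS = ("LocationName", "AncestorNames")

# Toy entries with made-up IDs; LocationName already includes alternate names.
TOY_ENTRIES = [
    {"GeonameID": 1,
     "LocationName": ["United States", "United States of America", "US", "USA"],
     "AncestorNames": []},
    {"GeonameID": 2,
     "LocationName": ["Arizona", "AZ"],
     "AncestorNames": ["United States", "USA"]},
    {"GeonameID": 3,
     "LocationName": ["Tempe"],
     "AncestorNames": ["Arizona", "AZ", "United States", "USA"]},
]


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(entries: List[dict]) -> Dict[str, Dict[str, Set[int]]]:
    """Map field -> token -> set of positions in `entries` containing that token."""
    index: Dict[str, Dict[str, Set[int]]] = {f: defaultdict(set) for f in INDEXED_FIELDS}
    for doc_id, entry in enumerate(entries):
        for field in INDEXED_FIELDS:
            for value in entry.get(field, []):
                for token in tokenize(value):
                    index[field][token].add(doc_id)
    return index


def lookup(index: Dict[str, Dict[str, Set[int]]], field: str, text: str) -> Set[int]:
    """Return positions of entries whose `field` contains every token of `text`."""
    postings = [index[field].get(tok, set()) for tok in tokenize(text)]
    return set.intersection(*postings) if postings else set()


index = build_index(TOY_ENTRIES)
```

Because the index is keyed by token, a query such as `lookup(index, "LocationName", "United")` still reaches the United States entry, which is the partial-match behaviour the paragraph above describes.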
3.2.2. Searching Geonames Index
Most cities and locations commonly have their parent locations listed as comma separated val-
ues (e.g. Philadelphia, PA, USA). In such cases, the index provides the capability to perform
compound searches (e.g. LocationName:“Philadelphia” AND AncestorNames:“PA, USA”).
We find that this method offers the best scalable framework for toponym disambiguation
among approximately 12 million entries. Efficient search capabilities aside, the solution
natively allows documents to be sorted by a particular field. In this case, we choose the
Population field as the default sorting heuristic such that search results are sorted by highest
population first. An additional motivation for the implementation of this solution is the
flexibility of using external information to narrow down search results. For example, when Country
information is available in the GenBank record, we can use queries like LocationName:“Paris”
AND Country:“France” to narrow down the location of infected hosts.
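The population heuristic with an optional country filter can be sketched as a filter followed by a sort. The gazetteer below is a toy list with made-up IDs and round placeholder population figures, and the function names are hypothetical; the actual system issues Lucene queries instead of scanning a list.

```python
from typing import List, Optional

# Toy gazetteer: IDs and population figures are placeholders, not Geonames data.
TOY_GAZETTEER = [
    {"GeonameID": 101, "LocationName": "Paris", "Country": "France", "Population": 2000000},
    {"GeonameID": 102, "LocationName": "Paris", "Country": "United States", "Population": 25000},
    {"GeonameID": 103, "LocationName": "Lyon", "Country": "France", "Population": 500000},
]


def search(entries: List[dict], location_name: str,
           country: Optional[str] = None) -> List[dict]:
    """Return matching entries sorted by population, largest first."""
    name = location_name.lower()
    hits = [
        e for e in entries
        if e["LocationName"].lower() == name
        and (country is None or e["Country"].lower() == country.lower())
    ]
    return sorted(hits, key=lambda e: e["Population"], reverse=True)


def disambiguate(entries: List[dict], location_name: str,
                 country: Optional[str] = None) -> Optional[int]:
    """Population heuristic: pick the most populous candidate, if any."""
    hits = search(entries, location_name, country)
    return hits[0]["GeonameID"] if hits else None
```

Without the country filter the most populous "Paris" always wins; supplying the country from external metadata is what allows a less populous candidate to be selected.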
4. Results and Discussion
For the NER task, we use the standard metric scores of precision, recall, and F1-score for
toponym entities across two modes of evaluation: (1) Strict, where the predicted spans of the
toponym have to match exactly with the gold standard spans to be counted as a true positive,
and (2) Overlapping, where predicted spans are true positives as long as at least one of their tokens
overlaps with gold standard annotations. For toponym disambiguation, we compare the predicted and
gold standard GeonameIDs to measure precision, recall and F1-score as long as the spans
overlap. We compare our scores with the previous systems that were trained and tested on the
b http://lucene.apache.org/ Accessed: Sept 30 2018
same dataset. To evaluate the performance of the overall resolution task, it is important to
examine the performance of the individual systems to assess the cause of errors and identify
areas for improvement.
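The two NER evaluation modes can be sketched as follows. Spans are represented as half-open `(start, end)` token offsets and gold spans are assumed to be unique; these representation choices and the function names are assumptions, and the official evaluation script may differ in its handling of edge cases.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def evaluate(gold: List[Span], pred: List[Span],
             mode: str = "strict") -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict or overlapping matching."""
    if mode == "strict":
        matched_pred = sum(1 for p in pred if p in gold)
        matched_gold = matched_pred
    else:
        matched_pred = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
        matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold_spans = [(2, 4), (5, 6)]
pred_spans = [(2, 3), (5, 6), (8, 9)]
strict_scores = evaluate(gold_spans, pred_spans, mode="strict")
overlap_scores = evaluate(gold_spans, pred_spans, mode="overlapping")
```

In this toy case the prediction `(2, 3)` covers only part of the gold span `(2, 4)`, so it counts as an error under strict matching but as a true positive under overlapping matching.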
4.1. Toponym Disambiguation
Our toponym disambiguation system is unsupervised, giving us the capability to test its per-
formance on the entire dataset assuming gold standard toponym terms to be available. Under
this assumption, the accuracy of the disambiguation system was found to be 91.6% and 90.5%
on the training and test sets, respectively. Analyzing the errors, we found that comparing IDs
directly is a very strict mode of evaluation for the purposes of phylogeography as Geonames
contains duplicate entries for many locations that belong to two or more classes of locations
such as administrative division (ADM) and populated area or city (PPLA, PPLC) but refer
to the same geographical location. For instance, when we look at the test set alone, which had
27 errors from a total of 285 locations, 19 appeared to be roughly the same location. These
included locations like Auckland, Lagos, St. Louis, Cleveland, Shantou, Nanchang, Shanghai,
and Beijing which were assigned the ID of the administrative unit by the system, while the
annotated locations were assigned the ID of the populated area or city or vice versa. Given
these reasons, we find that the performance of the resolution step exceeds the reported scores
by 5% to arrive at an approximate accuracy of 95-96%. However, for the purposes of compar-
ison with previous systems we report the overall resolution performance in Table 1 without
making such approximations. We did, however, observe 8 errors where the system-assigned
GeonameIDs were drastically different from their original locations due to the population
heuristic. For example, a toponym of Madison was incorrectly assigned the ID of Madison
County, Alabama, which had a higher population than the gold standard annotation Madison,
Dane County, Wisconsin (WI).
4.2. Toponym Resolution
Analyzing the errors across the architectures, we find that 80-90% of the erroneous instances
repeat across the RNN architectures, making it challenging to use ensemble methods
for reducing errors. These included false negative toponyms such as Plateau, Borno, Ga,
Gurjev, Sokoto etc. which appear in tables and structured contexts making it difficult to
recognize them. However, as discussed in our previous work,6 we plan to handle table structures
differently by employing alternative methods of conversion from PDF to text. Almost all false
positives appeared to be geographic locations, however in the text they were found to be
referring to other named entities like virus strains and isolates rather than toponyms.
We found that the LSTM-Peep based architecture appeared to have marginally better
performance scores on the NER task and hence the overall resolution task. The feature ablation
analysis shown in Fig. 2 indicates that inclusion of the character embedding feature
contributed to an increase in the overall performance of the RNN models. However, inclusion of the case
feature in combination with the character embeddings appeared to be redundant. Inclusion of
the CRF output layer seemed to have a positive impact on most models while additive layers
seemed to have more effect on GRU, LSTM and LSTM-Peep architectures.
Table 1. Median Precision (P), Recall (R) and F1 scores for NER and Resolution. Bold-styled
scores indicate highest performance. All recurrent neural network units were used in a
bidirectional setup with inputs containing pre-trained word embeddings, character embeddings and
case features, and an output layer with an additional CRF layer.

Method          NER-Strict             NER-Overlapping        Resolution
                P      R      F1       P      R      F1       P      R      F1
Rule-based^4    0.58   0.876  0.698    0.599  0.904  0.72     0.547  0.897  0.697
CRF-All^5       0.85   0.76   0.80     0.86   0.77   0.81     -      -      -
FFNN + DS^6     0.90   0.93   0.91     -      -      -        -      -      -
RNN             0.910  0.891  0.901    0.931  0.912  0.922    0.896  0.817  0.855
UG-RNN          0.948  0.902  0.924    0.959  0.912  0.935    0.903  0.824  0.862
GRU             0.952  0.919  0.935    0.967  0.930  0.948    0.888  0.835  0.860
LSTM            0.932  0.926  0.929    0.954  0.947  0.950    0.892  0.842  0.866
LSTM-Peep       0.934  0.944  0.939    0.951  0.961  0.956    0.907  0.863  0.884
Fig. 2. (Left) Ablation/leave-one-out analysis showing the contribution of individual features to the
NER performance across the RNN models. (Right) Impact of additive layers on the performance of
the NER across the RNN models. Here, RNN layers refer to respective variants of RNN architectures.
Y-axis shows strict F1 scores.
5. Limitations and Future Work
In this work, we find that utilizing state-of-the-art NER architectures helps us obtain
performance that is inching close to human performance. However, we do find that the articles
in the test set may be relatively easier than the average article for the detection
task when we compare test performance to randomly selected validation/development set performance. As
discussed in our previous work,6 distant supervision datasets can contain toponym spans in
close proximity to each other, generating noisy training examples. This makes it challenging to
use distant supervision techniques to increase the size of training data for training sequence
tagging models based on RNN architectures. Hence, to address this issue, we are in the process
of expanding the annotation dataset from 60 articles to 150 articles for a more comprehensive
training and evaluation of the system.
Irrespective of the ease of detection in the test set, there appear to be false negative
toponyms (discussed in the previous section) that could possibly be the location of infected
hosts (LOIH). While there are chances that toponyms that are LOIH appear repeatedly in the
scientific article in varying contexts, thus increasing the chances of them being detected, in
our following work we wish to evaluate the impact of these false negatives on the overall task
of identifying the LOIH. To reduce false positives where locations in fact refer to other
named entities like virus strains and isolates rather than toponyms themselves, we intend to explore
approaches from metonymy resolution35 for filtering out such false positives.
6. Conclusion
Phylogeography research relies on accurate geographical metadata information from nucleotide
repositories like GenBank. In records that contain insufficient metadata information, there is
a motivation to extract the geographical location from the associated articles to determine the
location of the infected hosts. In this work we present and evaluate methods built on recurrent
neural networks that extract geographical locations from scientific articles with a substantial
increase in performance, reaching an overall resolution F1 score of 0.88, which improves
significantly over the previous toponym resolution system's F1 of 0.69. Our implementations of the
toponym detection and toponym disambiguation^c systems along with the updated version of the
annotations containing GeonameIDs^d are available online.
Acknowledgments
AM designed and trained the neural networks, ran the experiments, performed the error anal-
ysis, and wrote most of the manuscript. DW and AS reviewed, restructured and contributed
many sections and revisions of the manuscript. MS and GG provided overall guidance on
the work and edited the final manuscript. The authors would also like to acknowledge Karen
O'Connor, Megan Rorison and Briana Trevino for their efforts in the annotation processes.
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation
of the Titan Xp GPU used for this research. The authors are also grateful to ASU-BMI’s
computing resources used for conducting the experiments in the paper.
Funding
Research reported in this publication was supported by the National Institute of Allergy and
Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under grant number
R01AI117011 to MS and GG. The content is solely the responsibility of the authors and does
not necessarily represent the official views of the NIH.
c https://bitbucket.org/pennhlp/toponym-resolution-using-rnns Accessed: 30 Sept 2018
d https://healthlanguageprocessing.org/software-and-downloads/ Accessed: 30 Sept 2018
References
1. M. Scotch, I. N. Sarkar, C. Mei, R. Leaman, K.-H. Cheung, P. Ortiz, A. Singraur and G. Gonzalez,
Enhancing phylogeography by improving geographical information from genbank Journal of
biomedical informatics 44 (Elsevier, 2011).
2. T. Tahsin, R. Beard, R. Rivera, R. Lauder, G. Wallstrom, M. Scotch and G. Gonzalez, Natural
language processing methods for enhancing geographic metadata for phylogeography of zoonotic
viruses AMIA Summits on Translational Science Proceedings 2014 (American Medical Infor-
matics Association, 2014).
3. P. Barrero, M. Viegas, L. Valinotto and A. Mistchenko, Genetic and phylogenetic analyses of
influenza a h1n1pdm virus in buenos aires, argentina Journal of virology 85 (Am Soc Microbiol,
2011).
4. D. Weissenbacher, T. Tahsin, R. Beard, M. Figaro, R. Rivera, M. Scotch and G. Gonzalez,
Knowledge-driven geospatial location resolution for phylogeographic models of virus migration
Bioinformatics 31 (Oxford University Press, 2015).
5. D. Weissenbacher, A. Sarker, T. Tahsin, M. Scotch and G. Gonzalez, Extracting geographic lo-
cations from the literature for virus phylogeography using supervised and distant supervision
methods AMIA Summits on Translational Science Proceedings 2017 (American Medical Infor-
matics Association, 2017).
6. A. Magge, D. Weissenbacher, A. Sarker, M. Scotch and G. Gonzalez-Hernandez, Deep neu-
ral networks and distant supervision for geographic location mention extraction Bioinformatics
34 (2018).
7. J. L. Leidner, Toponym resolution in text: annotation, evaluation and applications of spatial
grounding, in ACM SIGIR Forum, (2) 2007.
8. M. Gritta, M. T. Pilehvar, N. Limsopatham and N. Collier, What's missing in geographical
parsing? Language Resources and Evaluation 52 (Springer, 2018).
9. J. L. Leidner and M. D. Lieberman, Detecting geographical references in the form of place names
and associated spatial natural language SIGSPATIAL Special 3 (ACM, 2011).
10. R. Tobin, C. Grover, K. Byrne, J. Reid and J. Walsh, Evaluation of georeferencing, in proceedings
of the 6th workshop on geographic information retrieval, 2010.
11. T. Tahsin, D. Weissenbacher, R. Rivera, R. Beard, M. Firago, G. Wallstrom, M. Scotch and
G. Gonzalez, A high-precision rule-based extraction system for expanding geospatial metadata in
genbank records Journal of the American Medical Informatics Association 23 (Oxford University
Press, 2016).
12. T. Tahsin, D. Weissenbacher, D. Jones-Shargani, D. Magee, M. Vaiente, G. Gonzalez and
M. Scotch, Named entity linking of geospatial and host metadata in GenBank for advancing
biomedical research Database 2017 (Oxford University Press, 2017).
13. T. Tahsin, D. Weissenbacher, K. O'Connor, A. Magge, M. Scotch and G. Gonzalez-Hernandez,
GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records
Bioinformatics 34 (Oxford University Press, 2017).
14. R. Jozefowicz, W. Zaremba and I. Sutskever, An empirical exploration of recurrent network
architectures, in International Conference on Machine Learning, 2015.
15. K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink and J. Schmidhuber, LSTM: A search
space odyssey IEEE Transactions on Neural Networks and Learning Systems 28 (IEEE, 2017).
16. S. Overell and S. Rüger, Using co-occurrence models for placename disambiguation International
Journal of Geographical Information Science 22 (Taylor & Francis, 2008).
17. A. Spitz, J. Geiß and M. Gertz, So far away and yet so close: augmenting toponym disambiguation
and similarity with text-based networks, in Proceedings of the third international ACM SIGMOD
workshop on managing and mining enriched geo-spatial data, 2016.
18. Y. Ju, B. Adams, K. Janowicz, Y. Hu, B. Yan and G. McKenzie, Things and strings:
improving place name disambiguation from short texts by combining entity co-occurrence with topic
modeling, in European Knowledge Acquisition Workshop, 2016.
19. M. D. Lieberman and H. Samet, Adaptive context features for toponym resolution in streaming
news, in Proceedings of the 35th international ACM SIGIR conference on Research and
development in information retrieval, 2012.
20. E. Kamalloo and D. Rafiei, A coherent unsupervised model for toponym resolution, in Proceedings
of the 2018 World Wide Web Conference on World Wide Web, 2018.
21. M. D. Lieberman and H. Samet, Multifaceted toponym recognition for streaming news, in
Proceedings of the 34th international ACM SIGIR conference on Research and development in
Information Retrieval, 2011.
22. J. Hoffart, Discovering and disambiguating named entities in text, in Proceedings of the 2013
SIGMOD/PODS Ph. D. symposium, 2013.
23. J. Tamames and V. de Lorenzo, EnvMine: A text-mining system for the automatic extraction of
contextual information BMC Bioinformatics 11 (BioMed Central, 2010).
24. S. Pyysalo, F. Ginter, H. Moen, T. Salakoski and S. Ananiadou, Distributional semantics
resources for biomedical text processing (2013).
25. X. Ma and E. Hovy, End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF, in Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), 2016.
26. G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami and C. Dyer, Neural architectures for
named entity recognition, in Proceedings of NAACL-HLT, 2016.
27. N. Reimers and I. Gurevych, Reporting Score Distributions Makes a Difference: Performance
Study of LSTM-networks for Sequence Tagging, in Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing (EMNLP), (Copenhagen, Denmark, 2017).
28. Y. Bengio, P. Simard and P. Frasconi, Learning long-term dependencies with gradient descent
is difficult IEEE Transactions on Neural Networks 5 (1994).
29. S. Hochreiter and J. Schmidhuber, Long short-term memory Neural Computation 9 (MIT Press,
1997).
30. H. Sak, A. Senior and F. Beaufays, Long short-term memory recurrent neural network
architectures for large scale acoustic modeling, in Fifteenth annual conference of the international speech
communication association, 2014.
31. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio,
Learning phrase representations using RNN encoder–decoder for statistical machine translation,
in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2014.
32. Z. Che, S. Purushotham, K. Cho, D. Sontag and Y. Liu, Recurrent neural networks for
multivariate time series with missing values Scientific Reports 8 (Nature Publishing Group, 2018).
33. Y. Luo, Recurrent neural networks for classifying relations in clinical notes Journal of Biomedical
Informatics 72 (Elsevier, 2017).
34. J. Collins, J. Sohl-Dickstein and D. Sussillo, Capacity and trainability in recurrent neural
networks, in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
35. M. Gritta, M. T. Pilehvar, N. Limsopatham and N. Collier, Vancouver welcomes you! minimalist
location metonymy resolution, in Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), 2017.