Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus
Ludovic Moncla, Walter Renteria-Agualimpia, Javier Nogueras-Iso, Mauro Gaio
To cite this version:
Ludovic Moncla, Walter Renteria-Agualimpia, Javier Nogueras-Iso, Mauro Gaio. Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2014), Nov 2014, Dallas, Texas, United States. <hal-01069625v2>
HAL Id: hal-01069625
https://hal.archives-ouvertes.fr/hal-01069625v2
Submitted on 12 Nov 2014
Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus

Ludovic Moncla
Université de Pau et des Pays de l'Adour, LIUPPA
Pau, France
lmoncla@univ-pau.fr

Walter Renteria-Agualimpia
Universidad de Zaragoza
C/ María de Luna, 1, Zaragoza, Spain
walterra@unizar.es

Javier Nogueras-Iso
Universidad de Zaragoza
C/ María de Luna, 1, Zaragoza, Spain
jnog@unizar.es

Mauro Gaio
Université de Pau et des Pays de l'Adour, LIUPPA
Pau, France
mauro.gaio@univ-pau.fr
ABSTRACT
Geoparsing and geocoding are two essential middleware services to facilitate final-user applications such as location-aware searching or different types of location-based services. The objective of this work is to propose a method for establishing a processing chain to support the geoparsing and geocoding of text documents describing events strongly linked with space and making frequent use of fine-grain toponyms. The geoparsing part is a Natural Language Processing approach which combines the use of part-of-speech tagging and syntactico-semantic combined patterns (a cascade of transducers). The real novelty of this work, however, lies in the geocoding method. The geocoding algorithm is unsupervised and takes advantage of clustering techniques to disambiguate the toponyms found in gazetteers while, at the same time, estimating the spatial footprint of those fine-grain toponyms not found in gazetteers. The feasibility of the proposal has been tested on a corpus of hiking descriptions in French, Spanish and Italian.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

General Terms
Algorithms, Experimentation

Keywords
Geocoding, Toponym disambiguation, Spatio-textual searching, Geoparsing, Location-based services
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from Permissions@acm.org.
SIGSPATIAL’14, November 04 - 07 2014, Dallas/Fort Worth, TX, USA
Copyright 2014 ACM 978-1-4503-3131-9/14/11 ...$15.00
http://dx.doi.org/10.1145/2666310.2666386
1. INTRODUCTION
Geoparsing and geocoding are complementary services that facilitate, respectively, the recognition of spatial language in text documents and the mapping of such language to explicit georeferences (e.g., lat/long values) [19]. In the last decade there has been a significant research effort addressing the problems behind these two services, because they are essential for providing the back-end data exploited later by multiple applications such as location-aware searching [26, 33] or different types of location-based services (e.g., mobile applications for finding the nearest points of interest, or route planning for emergency response services).
The problem of geoparsing, i.e. the recognition of toponyms (place names) in text, can be seen as a particular category of named entity recognition and classification (NERC). Two categories of approaches have been proposed: those that use learning techniques and those based on natural language processing (NLP), in particular syntactico-semantic rules. Both categories use external lexical resources, and the two can be used in a complementary manner in hybrid systems [18]. Amongst the rule-based approaches, several use finite-state transducers [25], which can also be applied in cascade [22].
With respect to the problem of geocoding, also known as toponym resolution, the objective is to associate a toponym with its spatial footprint. The first issue to solve for this resolution is the ambiguity contained in place names expressed in text (the problem of solving these ambiguities is known as toponym disambiguation [7]). According to Smith and Mann [30] there are three main types of ambiguity: the same name is used for several places (referent ambiguity); the same place can have several names (reference ambiguity); and a place name can be used in a non-geographical context, such as organizations or names of people (referent class ambiguity). Another type of ambiguity, called structural ambiguity, arises when the structure of the words constituting the place name in the text is ambiguous (e.g., is the word Lake part of the toponym Lake Grattaleu or not?) [31]. We consider that this type of ambiguity can be seen as a subset of reference ambiguity. The second issue is to have adequate resources to associate a footprint with a disambiguated place name. The use of resources like gazetteers is thus unavoidable.
In an open-data context, the availability of gazetteers is expanding, and we may mention global resources such as Geonames, OpenGeoData, OpenStreetMap and Wikimapia, or national resources 1 such as BD NYME (France) or Nomenclátor Geográfico Básico (Spain). However, whatever resource is selected for geocoding, the problem that usually arises is its completeness. The public Web geocoding market is currently dominated by geocoding services for average users, i.e. users that can accept a low quality of service in terms of resolution [13]: the geocoding of administrative units, street names in urban areas, or names of well-known touristic sites covers their main needs. But there are contexts, even for public citizens or casual users, where the completeness of resources is crucial, in particular as regards the geocoding of fine-grain toponyms. For instance, in a corpus of narrative descriptions of places in a small area, it is common to find toponyms referring to geographical entities of varying size. Additionally, there are frequent occurrences of micro-toponyms, which are not usually found in gazetteers designed for broader audiences.
1. http://unstats.un.org/unsd/geoinfo/ungegn/geonames.html
The objective of this work is to propose a method establishing a processing chain to support the geoparsing and geocoding of text documents describing events strongly linked with space and making frequent use of fine-grain toponyms. The geoparsing part of the workflow is an NLP approach based on previous work of the authors [23], which combines the use of a part-of-speech tagger and a cascade of transducers. The real novelty of this work lies in the geocoding part of the method. The geocoding algorithm is unsupervised and takes advantage of clustering techniques to disambiguate those toponyms found in gazetteers while, at the same time, estimating the spatial footprint of those fine-grain toponyms not found in gazetteer resources. This additional contribution about the estimation of the spatial footprint is closely related to the step of spatial inference proposed by Leidner and Lieberman [19] in their "Reference model for processing textual geographic references".
Additionally, for this contribution the method has been evaluated on a corpus of hiking descriptions in three different languages: French, Spanish and Italian.
The remainder of this paper is structured as follows. Section 2 discusses the related work on toponym disambiguation and on the inference of location information associated with toponyms. Section 3 describes the processing chain proposed for geoparsing and geocoding. Section 4 describes our implementation and reports the early results of our experiments. Finally, Section 5 provides conclusions and an outlook on future work.
2. RELATED WORK
As mentioned in the introduction, geocoding involves two important and related issues: toponym disambiguation, and the adequate selection of a resource for assigning a footprint to a disambiguated toponym. These two activities are usually combined thanks to the use of a gazetteer with an appropriate coverage of the georeferences which might be associated with ambiguous toponyms. Subsection 2.1 provides an overview of approaches for toponym disambiguation.
However, gazetteers with appropriate coverage are sometimes not available for the processing of text documents. In those cases, we need to explore alternative methods to infer the spatial location of toponyms. Although this last issue has received little attention in the research literature, Subsection 2.2 analyzes some approaches for defining new toponyms or improving the spatial information associated with existing toponyms.
2.1 Related work about toponym disambiguation
Buscaldi [5] provides an overview of different ways of disambiguating toponyms. According to this work, the approaches can be classified into three categories: map-based, knowledge-based, and supervised or data-driven approaches. Map-based approaches use other unambiguous and georeferenced toponyms found in the same document as the context for disambiguation. These approaches assign a score to each of the possible locations according to its distance to the unambiguous toponyms. Knowledge-based approaches make use of knowledge sources (gazetteers, ontologies, and so on) to check whether other related toponyms in the knowledge source are also referred to in the document [26, 6, 20]; additional information about the toponyms (e.g., population) or the document creators (e.g., documents of social media [17]) can also be exploited. Finally, data-driven approaches [2, 30] are based on machine learning algorithms. The main drawback of this last category is the lack of classified collections.
As the application domain of our proposal is textual descriptions of itineraries, and the texts are short, this subsection mainly focuses on works related to map-based disambiguation, i.e. where the main information used to solve the disambiguation is the explicit georeference of detected toponyms.
Buscaldi and Rosso [7] describe a basic implementation of a map-based method that analyzes the distance from the centroid of all possible locations of the toponyms cited in the text to a candidate location, in order to choose the most appropriate disambiguated toponym. However, this basic implementation can be refined with multiple improvements related to the definition of the disambiguation context, the computation of the distance to the candidate location, or the consideration of additional physical properties.
With respect to the definition of a more refined disambiguation context, Zhao et al. [33] propose, for instance, a GeoRank algorithm inspired by Google's PageRank algorithm for the disambiguation of toponyms in Web resources. The idea is that other toponyms identified in the same resource vote for the alternative locations of the ambiguous toponym according to their distance in the text and their geographical distance to the tentative location.
Regarding the diversity of techniques for computing the closeness to a candidate location, Habib and van Keulen [14] propose the use of clustering techniques to resolve ambiguous toponyms in holiday descriptions. Their clustering approach is an unsupervised disambiguation approach based on the assumption that toponyms appearing in the same document are likely to refer to locations close to each other distance-wise. Another alternative for taking distance closeness into account is the one proposed by Zhang et al. [32]. They use an Exact-All-Hop Shortest Path (EAHSP) algorithm for disambiguating road names in textual route descriptions; road name disambiguation belongs to the scope of toponym disambiguation. Their EAHSP algorithm aims at finding a path that maximizes the number of crossed roads in a proximity area. Besides, it must also be acknowledged that this work faces additional difficulties, since databases for road names are not so common and their associated geometry is not point-based.
Finally, with regard to the consideration of additional physical properties, Derungs and Purves [8] propose a map-based disambiguation algorithm for toponyms found in landscape descriptions, as part of a mechanism to assign a general geospatial footprint to documents. Apart from using the Euclidean distance from ambiguous toponyms to already identified unambiguous toponyms, the proposed disambiguation algorithm exploits the topographic similarity between toponyms. The performance of their disambiguation algorithm is closely tied to the availability of gazetteers with topographic information, which may be missing for some regions. Additionally, this work tries to identify the terms used to describe natural features in a region and compares the different vocabularies used for natural features in different regions.
2.2 Related work about the definition of new toponyms
Some works in the literature aim at the creation of new databases of geographic locations. For instance, Lieberman et al. [21] present a method for generating a local spatial lexicon by processing a corpus of local newspapers, which is later used for the geocoding of documents. In principle, the method only aims at including in this spatial lexicon the disambiguated toponyms extracted from newspaper articles and found in an existing gazetteer. The method used for disambiguating toponyms with multiple possible interpretations is based on the definition of a convex hull covering nearby possible locations of toponyms. However, the method could easily be extended to assign the geographic extent of this convex hull to all those names recognized as place names but not found in the gazetteer.
There are also some relevant works investigating the definition of toponyms using social media sources. For instance, Rattenbury and Naaman [27] present different methods based on burst-analysis techniques for extracting place semantics from Flickr tags. By training these methods with Flickr data containing high-resolution location metadata (i.e., longitude and latitude), it is possible to derive automatic associations between place-name tags and explicit georeferences. As the authors claim, these methods could automate the creation of place gazetteer data. In the same line, Serdyukov et al. [28] propose a statistical language model to predict the most likely geocoding corresponding to the set of location tags associated with a Flickr photo.
Additionally, there are some works applied in contexts similar to that of our experiments, i.e. narrative descriptions of places in small areas, whose objective is also to infer the spatial location of complex place names not directly found in gazetteers. In this area, Scheider and Purves [1] have proposed the use of semantic technologies to process narrative descriptions of mountain itineraries or historic places, in order to infer the location of complex names in terms of spatial relations with respect to well-recognized landmarks.
Finally, there are some works aiming at the enrichment of existing geographic databases. An example in this area is the work of Smart et al. [29]. They present a mediation framework to access and integrate distributed gazetteer resources in order to build a meta-gazetteer that generates augmented versions of place-name information. This mediation framework includes a geofeature augmentation module: during the merging of features from different sources, matching features are detected and more complete and consistent information is created, e.g. by adding hierarchical information about administrative units. Another work about the enrichment of existing toponyms is the one proposed by Hao et al. [15] for enriching destinations in travelogues with knowledge (e.g. local topics such as beach, mountain or other features related to the toponym) mined from a large corpus using probabilistic models.
3. OUR PROPOSAL
Our proposed approach is a hybrid solution that combines map-based disambiguation with the assignment of georeferences to new toponyms.
The first step is the geoparsing (Fig. 1a): we annotate geographic entities thanks to a well-established process [23] based on the use of POS taggers and the application of syntactico-semantic combined patterns (a cascade of transducers). Then we proceed with the geocoding part of the proposal. The second step of our proposal is the georeferencing of toponyms by querying well-known gazetteers (e.g., Geonames, BD NYME, OpenStreetMap) (Fig. 1b). To solve the ambiguities raised by the georeferencing step, we propose to use a clustering algorithm based on spatial density.
With respect to the map-based disambiguation part, our work is similar to the proposal of Habib and van Keulen [14], as we also use a clustering technique. The difference is that they do not face the problem of toponyms missing from their gazetteer. Additionally, the granularity of their spatial footprint is much coarser than ours: their objective is to identify to which country a set of toponyms belongs. The disambiguation part also shares some similarities with the work of Derungs and Purves [8], as the types of input documents to be processed are similar: descriptions of natural landscapes can be considered a superset of hiking descriptions. However, their aim is not to geolocate toponyms not found in a backend gazetteer. In any case, the vocabulary for natural features that they analyze could probably intersect with the toponyms (or parts of the place names) that we try to geolocate in this work.
In our proposal the clustering algorithm is also used to define the geographic extent of all geographic entities in the input document and to propose a geocoding for the toponyms not found in gazetteers (Fig. 1c).
With respect to the creation of new toponyms, our work also has some similarities with the one proposed by Lieberman et al. [21]. Our purpose is also to create a gazetteer or spatial lexicon for an intended audience in small areas. Additionally, the proximity of the locations associated with ambiguous toponyms is the main criterion to discard alternatives. Furthermore, some ideas from the works of Scheider and Purves [1] and Hao et al. [15] could be used to attach additional information to discovered fine-grain toponyms: the annotation obtained after the toponym extraction process could be used to inform a more precise geolocation, or the topics associated with the toponym.
Figure 1: Block diagram of our processing chain
3.1 Corpus
In order to build a corpus of narrative descriptions of places in a small area, we chose hiking descriptions. Hiking descriptions are specific documents describing displacements using geographical information, such as toponyms, spatial relations, and natural features or landscapes. It must be noted that this corpus of documents deals with a homogeneous theme and that each document describes a specific, small and single geographic area.
To build our corpus, thousands of hiking descriptions were automatically extracted from specialized websites in French 2, Spanish 3, and Italian 4. Each collected hiking description is associated with a GPS track useful for our experiments. Then, with the aim of evaluating our proposal, we built a body of reference consisting of 30 hiking descriptions manually annotated (ground truth) for each language.
2. http://www.visorando.com (fr)
3. http://senderos.turismodearagon.com (es)
4. http://www.parks.it/parco.alpi.marittime/ (it)
Table 1 shows some features of each body of reference: number of documents, total number of words, average number of words per document, total number of toponyms, and average number of toponyms per document. These 90 documents are representative of the whole corpus: over the whole corpus each document has an average of 263 words (269 in the body of reference) with a standard deviation of 188 (242 for the body of reference) and 15 toponyms (14 in the body of reference) with a standard deviation of 9 (12 for the body of reference).
                 French   Spanish  Italian
#Documents           30        30       30
#Words            11297      5626     6856
Avg. #Words         376       187      228
#Toponyms           583       376      409
Avg. #Toponyms     19.4      12.5    13.63
Table 1: Document sets
Table 2 shows the ten most popular terms associated with toponyms in our body of reference (French, Spanish and Italian). These terms are annotated by our processing chain as sub-types, and we consider that they are part of the name of the spatial entity. The spatial entities are most of the time fine-grain (hamlet, cottage, bridge, church, lake), and 46% of the toponyms in our French body of reference have an associated sub-type (24% for Spanish and 36% for Italian).
French        Spanish        Italian
col      20   puente    17   rifugio   20
village  20   rio       17   monte     19
hameau   20   pueblo    12   villaggio 17
route    17   iglesia   10   masi      15
sentier  15   camino     9   castello   9
chalet   13   barranco   8   lago       8
refuge   11   parque     5   passo      7
pont     11   castillo   3   foce       7
lac       8   barrio     3   chiesa     6
chapelle  8   casa       2   via        5
Table 2: Most frequent terms associated with toponyms
3.2 Geoparsing
Our approach uses a geoparsing system built to extract not only spatial named entities but also spatial relations. This annotation system was initially designed for French corpora only and was described in a previous work [23]. For the current work we adapted our processing chain to deal with Spanish and Italian documents. The first step (Fig. 1a) of our approach is a system in which the spatial expressions described in textual documents are automatically annotated. The method combines the marking and extraction of named entities through the use of local grammars and external resources (lexicons).
The first stage of this annotation tool is a part-of-speech (POS) analyser. Its output is processed by syntactico-semantic combined patterns in a cascade of transducers to mark toponyms, spatial relations and expanded spatial named entities (ESNE). We define an ESNE as an entity built from a proper name attributed to a place (a toponym), which can be associated with a sub-type and with one or more concepts relating to the expression of location in language (spatial relations).
The rules to annotate toponyms are built around proper nouns. There are also rules describing spatial relations, which use POS tags (e.g., prepositions or common nouns) and lexicons.
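By way of illustration, the following minimal Python sketch mimics a single stage of such a cascade with a regular expression over a toy sub-type lexicon. Our actual transducers are considerably richer; the lexicon entries and the pattern below are illustrative assumptions only.

import re

# Illustrative sub-type lexicon (English glosses of terms like those in Table 2).
SUBTYPES = ["hamlet", "village", "lake", "refuge", "bridge", "pass", "chapel"]

# Pattern for a "full name": a sub-type followed by "of" and a proper name,
# e.g. "hamlet of Fontanettes".
FULL_NAME = re.compile(r"\b(" + "|".join(SUBTYPES) + r")\s+of\s+([A-Z][\w'-]+)")

def annotate_toponyms(sentence: str) -> str:
    """One stage of a cascade: mark full-name toponyms with sub-type and proper name."""
    return FULL_NAME.sub(
        r"<toponym><subType>\1</subType> of <subToponym>\2</subToponym></toponym>",
        sentence)

print(annotate_toponyms("Walk to the refuge south of hamlet of Fontanettes."))
# -> Walk to the refuge south of <toponym><subType>hamlet</subType> of
#    <subToponym>Fontanettes</subToponym></toponym>.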
Listing 1 shows the result of the annotation for the following sentence: Walk to the refuge south of hamlet of Fontanettes. We can notice that in this sentence the verb of motion (to walk), the spatial relation (south of) and the toponym (hamlet of Fontanettes) have all been annotated. Verbs of motion or perception are considered a special kind of spatial relation called VT [24]. The VT annotation establishes the link between a verb of motion or perception and a spatial entity (a toponym or an ESNE).
At the end of the geoparsing process, toponyms, spatial relations and ESNE are annotated.
<VT>
  <verb type="motion" polarity="median">
    <token>Walk</token>
  </verb>
  <ESNE focus="final">
    <token>to</token>
    <commonNoun>
      <token>the</token>
      <token>refuge</token>
    </commonNoun>
    <indirection type="orientation">
      <token>south</token>
      <token>of</token>
    </indirection>
    <toponym>
      <subType>
        <token>hamlet</token>
      </subType>
      <token>of</token>
      <subToponym>
        <token>Fontanettes</token>
      </subToponym>
    </toponym>
  </ESNE>
</VT>
Listing 1: Example of an XML result of our geoparsing process (translated from French)
3.3 Geocoding
To find a geocoded representation for the toponyms extracted from the textual descriptions (toponym resolution), we query several well-known gazetteers.
During this process many ambiguities may arise. One place can have several names (reference ambiguity). For example, this happens when the name has changed over time, or when the name commonly used by people differs from the official name. In order to handle this ambiguity, our processing chain looks for the toponym name in all available fields of the gazetteers' backend databases containing official or alternative names. Additionally, querying several different gazetteers also increases the probability that the method finds a match with one of the possible names of a place. Apart from these clear cases of reference ambiguity, our method focuses on the problem of the inclusion or non-inclusion of sub-types within the official name of a toponym (structural ambiguity).
Most of the named entities used in this corpus are spatial named entities. This circumstance allows us to disregard referent class ambiguity. But, as far as we know, another kind of ambiguity, untreated in the literature, appears: the incompleteness of the gazetteers. Not all existing toponyms are stored in databases. We call this problem unreferenced toponym ambiguity.
3.3.1 Dealing with structural ambiguity
Sometimes toponyms or ESNE are stored in gazetteers with the so-called full name, which means that the name consists of a sub-type and a proper name. For example, the toponym hamlet of Fontanettes consists of the sub-type hamlet and the proper name Fontanettes. But in many cases toponyms are stored in gazetteers using only the proper name, and in such cases they are usually associated with metadata describing the type of feature (e.g., hamlet, city or stream).
The approach to decide which result matches our query is to compare the metadata type of the results with the sub-type of the toponym extracted from the text. In some cases, ambiguities can be solved using the sub-type [24] when it is available in the textual description.
For instance, the expression hamlet of Fontanettes is not found in gazetteers. Instead, we can only find it under the name Fontanettes. Several results exist for this query, and one of them has the metadata type hamlet.
However, if none of the possible names of a place is stored in the gazetteers, then we have a case of unreferenced toponym ambiguity, which is specifically considered in Section 3.3.3.
3.3.2 Dealing with referent ambiguity
In many cases the disambiguation approach presented in Section 3.3.1 is not enough, because gazetteers may contain several toponyms with the same name and type. In these cases we need a mechanism to identify the right group of toponyms, the one associated with the real trajectory of the hiking description.
Similar to the works of Feuerhake and Sester [12] or Intagorn et al. [16], we propose to use clustering algorithms to find collections sharing a spatial property; in our case, these collections enable us to find clusters of the geospatial points most likely to belong to the hiking trail. In particular, we use the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm [11]. It uses the concept of density to determine the neighborhood of a point, that is, what constitutes a cluster. DBSCAN uses two parameters to define the density concept: Eps and MinPts. Eps (the epsilon radius) determines the area of a neighborhood, and MinPts determines the minimum number of points that must be contained in that neighborhood to deem it a cluster. In our current methodology, the values of the DBSCAN parameters have been empirically adjusted according to the features of the hiking dataset used in the experiments.
DBSCAN can deal with noisy data, i.e. it has the ability to detect outliers. In our context, an outlier is a point that does not belong to the hiking trail cluster. Additionally, since hiking trails may have many points describing trajectories of different shapes, DBSCAN can find arbitrarily shaped clusters. The output of DBSCAN is a set of clusters of toponyms whose footprints are close to each other. Every cluster represents a possible set of points describing the hiking trail. We then need a way to identify the cluster that best matches the set of points in the hiking trail. The heuristic is defined as follows: given a set of clusters C1, C2, ..., Cn generated by the clustering algorithm, the best cluster Cb is the one containing the largest number of distinct toponyms. In other words, the best cluster identifies the area with the largest co-occurrence of toponyms.
The proposed method considerably reduces the structural and referent ambiguities. The remaining problem is how to deal with unreferenced toponym ambiguity, i.e. how to find spatial locations for toponyms that are not found in gazetteers.
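As an illustration of this cluster-based disambiguation step, the following minimal Python sketch runs scikit-learn's DBSCAN on gazetteer candidates and applies the best-cluster heuristic. The coordinates and the Eps/MinPts values are illustrative assumptions, not the settings used in our experiments.

from collections import defaultdict
import numpy as np
from sklearn.cluster import DBSCAN

# Candidate locations returned by the gazetteers: possibly several per name.
candidates = [
    ("Fontanettes", 45.32, 6.72), ("Fontanettes", 48.91, 2.41),
    ("Lac de la Rocheure", 45.33, 6.74), ("Termignon", 45.28, 6.81),
    ("Termignon", 44.10, 5.20),
]
coords = np.array([[lat, lon] for _, lat, lon in candidates])

# Eps and MinPts are empirically adjusted in our methodology; these values are
# illustrative only (and real data would call for geodesic distances).
labels = DBSCAN(eps=0.2, min_samples=2).fit_predict(coords)

# Best-cluster heuristic: the cluster containing the largest number of
# distinct toponym names (label -1 marks DBSCAN noise/outliers).
names_per_cluster = defaultdict(set)
for (name, _, _), label in zip(candidates, labels):
    if label != -1:
        names_per_cluster[label].add(name)
best = max(names_per_cluster, key=lambda l: len(names_per_cluster[l]))

print([c for c, l in zip(candidates, labels) if l == best])
# -> the three Alpine candidates; the two distant homonyms are left as noise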
3.3.3 Dealing with unreferenced toponym ambiguity
Our proposal is to infer locations from the locations of previously disambiguated toponyms. However, these spatial inferences cannot be as precise as points with geographical coordinates (latitude/longitude). They are instead represented by a geographical area, which can be refined depending on the various pieces of spatial information contained in the textual descriptions.
For example, Figure 2 shows three different cases of inference. In the first case (Fig. 2a) there is no other explicit spatial information in the text linked with the unreferenced toponym. In this case, when there is no information concerning the context of a toponym, we define a geographical area that contains all well-located toponyms, thanks to the clustering method previously described. Indeed, as we are working with hiking descriptions, the toponyms are related to each other and located in the same area. The second case is when explicit spatial relations are associated with the unreferenced toponym. For example, if we know that the unreferenced toponym is somewhere south of A (Fig. 2b), then we can define a new area smaller than the previous one. A third case arises when even more information is available in the textual description, in which case we can define a much smaller area. For example, if we know that the unreferenced toponym is somewhere between two other toponyms, we can define a small area between these two toponyms (Fig. 2c). Spatial relations are thus very important to determine the context of unreferenced toponyms. We have developed specific transducers in our cascade, using lexicons and patterns, to annotate different categories of spatial relations: distances, topological relations, directional relations and displacements. Currently our geoparsing process is able to annotate these categories of spatial relations for the three languages (French, Spanish and Italian).
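The following minimal Python sketch, assuming Shapely geometries in projected coordinates, illustrates cases (a) and (b); the half-plane clipping used for south of is our illustrative choice of construction, not the exact one implemented in the processing chain.

from shapely.geometry import MultiPoint, Point, box

# Well-located toponyms of the best cluster (illustrative x/y coordinates).
located = MultiPoint([(0, 0), (4, 1), (2, 5), (5, 4)])
default_area = located.convex_hull           # case (a): no extra context in the text

def refine_south_of(area, anchor):
    """Case (b): keep only the part of the area lying south of a located anchor."""
    minx, miny, maxx, maxy = area.bounds
    south_half = box(minx, miny, maxx, anchor.y)   # everything below the anchor's y
    return area.intersection(south_half)

a = Point(5, 4)                               # toponym A, already well located
refined = refine_south_of(default_area, a)
print(default_area.area, refined.area)        # the refined area is strictly smaller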
4. EXPERIMENTS
4.1 Geoparsing
As mentioned in Section 3.1, to evaluate our proposal we built a body of reference consisting of 30 hiking descriptions manually annotated (ground truth) for each language.
For each document set (French, Spanish, and Italian) of hiking descriptions, we evaluated the precision and recall obtained with our geoparsing method (Table 3). The precision is the ratio of the number of relevant toponyms annotated (y) to the total number of toponyms annotated (z), and the recall is the ratio of the number of relevant toponyms annotated (y) to the number of relevant toponyms manually annotated (x).
In the context of hiking descriptions almost all named entities are spatial named entities, and the analysis of the results showed that the remaining errors come from geo/non-geo ambiguities (referent class ambiguity).

          (x)   (y)   (z)   Precision  Recall
French    583   581   595   98.29%     99.42%
Spanish   376   374   376   99.26%     99.33%
Italian   409   407   410   99.10%     99.68%
Table 3: Evaluation of our geoparsing method

It can be observed that our processing chain yields very high precision and recall on a specific corpus dealing with spatial named entities.
The first step of our annotation tool is an automatic part-of-speech annotation. Its efficiency varies depending on the language (French, Spanish or Italian) and the analyser used. To compare the same results across all languages, and in order to avoid adding errors before the application of the transducer cascade, we manually corrected the output of the part-of-speech analysers. Without this manual correction, the results are between 5 and 15% lower depending on the language and the analyser used.
4.2 Geocoding
For the geocoding experiments we use gazetteers provided by national mapping institutes: BD NYME 5 (France), Nomenclátor Geográfico Básico de España 6 (Spain), and Toponimi d'Italia IGM 7 (Italy). The Spanish and Italian gazetteers are accessible through the Web Feature Service specification defined by the Open Geospatial Consortium 8. Additionally, we also use some well-known general-purpose gazetteers: Geonames 9 and OpenStreetMap 10.
4.2.1 Dealing with structural ambiguity
For each toponym extracted from the text, we first query the databases with its full name (exact matching). If there is no result, we query the databases again with the sub-toponym (see Section 3.3.1). Finally, we try to disambiguate the retrieved candidate toponyms using the metadata type.
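The following minimal Python sketch illustrates this query cascade; query_gazetteer is a hypothetical stand-in for the actual gazetteer APIs (national WFS endpoints, Geonames, OpenStreetMap), and its name and record format are assumptions for illustration.

from typing import Callable

def resolve(full_name: str, sub_toponym: str, sub_type: str | None,
            query_gazetteer: Callable[[str], list[dict]]) -> list[dict]:
    """Query cascade: full name first, then sub-toponym, then metadata-type filter.

    query_gazetteer is a hypothetical stand-in returning candidate records such
    as {"name": ..., "type": ..., "lat": ..., "lon": ...}.
    """
    # 1. Exact match on the full name, e.g. "hamlet of Fontanettes".
    candidates = query_gazetteer(full_name)
    if not candidates:
        # 2. Fall back to the proper name alone, e.g. "Fontanettes".
        candidates = query_gazetteer(sub_toponym)
    # 3. If the text provided a sub-type, keep candidates whose metadata type matches.
    if sub_type is not None:
        typed = [c for c in candidates if c.get("type") == sub_type]
        if typed:
            candidates = typed
    return candidates  # may still hold several results: referent ambiguity remains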
Table 4 shows the percentage of toponyms (or sub-toponyms) found in gazetteers. For French texts, 13.78% of the toponyms are not found (5.59% in the case of Spanish texts and 23.17% in the case of Italian texts). Furthermore, we can notice that the gazetteers complement each other.
                      French   Spanish  Italian
Full name query       54.79%   73.14%   43.90%
  National Gazetteer  37.82%   62.77%   17.32%
  Geonames            29.24%   53.99%   29.02%
  OpenStreetMap       40.17%   63.56%   34.63%
Sub-toponym query     31.43%   21.28%   32.93%
  National Gazetteer  23.03%   18.88%   17.32%
  Geonames            23.19%   18.09%   22.20%
  OpenStreetMap       24.87%   19.41%   25.85%
Total                 86.22%   94.41%   76.83%
Table 4: Percentage of toponyms found in gazetteers
5. http://www.geoportail.gouv.fr
6. http://www.ign.es
7. http://www.pcn.minambiente.it/GN/
8. http://www.opengeospatial.org/standards/wfs
9. http://www.geonames.org
10. http://www.openstreetmap.org
Figure 2: Refining spatial inferences according to the context
The way we query the gazetteers is not flexible, and it is possible to improve these results. For example, some toponyms are not retrieved because their stored names contain articles and determiners that are not specified in the text. For instance, the toponym Lac de Rocheure extracted from a hiking description is stored in gazetteers as Lac de la Rocheure. Applying more flexible queries would obtain better results in terms of retrieved toponyms but, on the other hand, the number of ambiguous toponyms would increase. The results could also be improved by handling misspellings: one can imagine using algorithms like the edit distance (commonly used in natural language processing) to find gazetteer results for misspelled toponyms.
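As a sketch of that idea (not part of the evaluated system), the following Python snippet ranks gazetteer entries against a possibly misspelled toponym with the classic Levenshtein edit distance:

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

entries = ["Lac de la Rocheure", "Lac de Roselend", "La Rochelle"]
query = "Lac de Rocheure"
print(min(entries, key=lambda e: edit_distance(query, e)))  # -> Lac de la Rocheure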
4.2.2 Dealing with referent ambiguity
As said before, even when toponyms are found in gazetteers, some ambiguities remain. For instance, in the case of French hiking descriptions, and depending on the gazetteer used, between 45 and 70% of the found toponyms are ambiguous. This means that for these toponyms the gazetteers return more than one result. For technical reasons, and in order to compare the results coming from the three gazetteers, we set a maximum limit of results per query. The limit was set to 100 results so that fewer than 2% of the toponyms have a truncated number of results. Table 5 shows that the gazetteers return an average of 15-20 locations for each toponym, depending on the language.
                        French  Spanish  Italian
Total # of toponyms        595      376      409
# Retrieved toponyms       513      355      315
# Results                 6850     6359     5722
Avg. # Results           13.35    17.91    18.16
Table 5: Number of toponyms (and results) found in gazetteers
In order to resolve this ambiguity, the clustering method explained in Section 3.3.2 was applied. The validity of the best cluster proposed by our algorithm for every document was verified by comparing the similarity between the point set of each generated cluster and the original point set of the trajectory described in a GPX file. To measure the similarity, we computed the convex polygon of the original point set of the trajectory and of every cluster with the ST_ConvexHull PostGIS function, and then calculated the distance between these point sets using the ST_Distance PostGIS function [9, 3].
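The following Python sketch reproduces this check with Shapely instead of PostGIS, under illustrative coordinates; convex_hull and distance play the roles of ST_ConvexHull and ST_Distance.

from shapely.geometry import MultiPoint

# Original GPX trajectory points and one candidate cluster (illustrative coordinates).
gpx_track = MultiPoint([(6.70, 45.30), (6.74, 45.31), (6.78, 45.33)])
cluster = MultiPoint([(6.71, 45.30), (6.75, 45.32)])

# Shapely equivalents of the PostGIS ST_ConvexHull and ST_Distance calls:
hull_track = gpx_track.convex_hull
hull_cluster = cluster.convex_hull
print(hull_track.distance(hull_cluster))  # 0.0 when the hulls touch or overlap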
In 29 out of the 30 French cases, 30 out of the 30 Spanish cases, and 29 out of the 30 Italian cases, the cluster suggested by our method is indeed the best one, that is, the cluster that best matches the real points of the trajectory. In general, for the French cases, 65.63% of the toponyms found in a gazetteer are in the best cluster (65.35% for Spanish and 59.68% for Italian). Additionally, thanks to the comparison with the real trajectory, we analyzed the missing points in every best cluster and found that none of the missing points were included within the set of points associated with the toponyms retrieved from the gazetteers. This means that only the points included in the best clusters are well located. Table 6 shows the comparison between the number of toponyms retrieved from gazetteers and the number of well-located toponyms.
                        French     Spanish    Italian
Well-located BC         29 of 30   30 of 30   29 of 30
Retrieved toponyms      513        355        315
                        (86.22%)   (94.41%)   (77.01%)
Well-located toponyms   340        232        188
                        (57.14%)   (61.7%)    (45.96%)
Table 6: Number of best clusters (BC) and well-located toponyms
This points out the problems derived from the lack of coverage in gazetteers, and the need to assign a geographic reference to those toponyms that are not found. In every case analyzed, the missing points were associated with fine-grain toponyms.
Our approach failed in 2 out of 90 cases. In these cases the problem was caused by the small number of results retrieved. For example, in one of these cases only two toponyms were retrieved from the gazetteer and the rest were not found (they were missing points). Moreover, the retrieved toponyms were referencing places not related to the real hiking description (i.e., bigger locations in other countries or regions).
4.2.3 Dealing with unreferenced toponym ambiguity
As mentioned in Section 3.3.3, our proposal is to infer locations for the unreferenced toponyms. We implemented two approaches to define a geographic area where the unreferenced toponyms are supposed to be, in order to evaluate which one performs best. The first approach takes into account the geometric outline of the displacement by computing the convex hull (in red in Figure 3) of all the toponyms included in the best cluster. The second approach computes the circle circumscribed around the rectangle of the bounding box (in blue in Figure 3) and does not take into account the geometric outline of the displacement.
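The following minimal Python sketch, assuming Shapely and projected coordinates, shows how the two candidate areas can be derived from the best cluster:

import math
from shapely.geometry import MultiPoint, Point

best_cluster = MultiPoint([(1, 1), (5, 2), (3, 6), (6, 5)])  # illustrative points

# Approach 1: convex hull of the well-located toponyms (follows the outline).
hull = best_cluster.convex_hull

# Approach 2: circle circumscribed around the bounding-box rectangle
# (ignores the outline of the displacement).
minx, miny, maxx, maxy = best_cluster.bounds
center = Point((minx + maxx) / 2, (miny + maxy) / 2)
radius = math.hypot(maxx - minx, maxy - miny) / 2   # half the box diagonal
circle = center.buffer(radius)                      # polygonal approximation

print(hull.area, circle.area)   # the circle always contains the convex hull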
Figure 3: Refining spatial inferences according to the context (Source: Google Maps)
We manually reviewed all the unreferenced toponyms of each document in the corpus. We searched different resources, such as web pages and detailed geographical maps, to find the real locations of the unreferenced toponyms. When a toponym was impossible to find, we also used the GPS track available with each document of our corpus.
After running the experiments, we identified different cases of spatial inference. The first one is the perfect case, that is, when all the toponyms cited in the textual description are found and well located (Fig. 3a). The second case is when there are some unreferenced toponyms, but their real locations lie inside the convex hull. The third case happens when there are several unreferenced toponyms and their real locations are not included in the convex hull but are included in the circumscribed circle (see points A, B and C in Fig. 3b). Finally, there are also some cases where the real locations of unreferenced toponyms are located neither in the convex hull nor in the circumscribed circle.
Table 7 shows the number of unreferenced toponyms that are actually located inside the area defined by the convex hull or by the circumscribed circle. We can notice that with the circumscribed circle we are able to propose a good location for more unreferenced toponyms than with the convex hull.
We removed from the total number of automatically annotated toponyms those which were not toponyms (referent class ambiguity) and also those associated with an expression of perception. Such toponyms can be far from the real trajectory described, and our method is not adapted to locating them.
                        French    Spanish    Italian
Unreferenced toponyms   200       123        162
Convex hull             140       75         52
                        (70.0%)   (53.65%)   (32.1%)
Circumscribed circle    180       102        120
                        (90.0%)   (77.23%)   (74.07%)
Table 7: Number of unreferenced toponyms found in the convex hull or in the circumscribed circle
Furthermore, our experiments show that we need at least 4 or 5 well-located toponyms in order to find the best cluster and to propose a good geographic area for the unreferenced toponyms.
To summarize the experiments on the full processing chain, Table 8 shows some global results: the initial numbers of toponyms manually and automatically annotated (excluding toponyms associated with expressions of perception or errors); the number of toponyms located through gazetteers after the cluster-based disambiguation; the number of toponyms located by our spatial inference method; and the number of toponyms still unlocated at the end of the process. Additionally, the table shows the percentages of located (and unlocated) toponyms for each body of reference.
Toponyms                  French     Spanish    Italian
Manually annotated        542        362        350
Automatically annotated   540        360        349
Located by gazetteers     340        232        188
                          (62.96%)   (64.44%)   (53.86%)
Located by inferences     180        102        120
                          (33.33%)   (28.33%)   (34.38%)
Unlocated                 20         26         41
                          (3.71%)    (7.22%)    (11.74%)
Table 8: Global results of our processing chain
Finally, Table 9 shows the global accuracy of our proposal for locating toponyms. The accuracy takes into account the total number of toponyms manually annotated, including those associated with expressions of perception. We can notice that adding the spatial inference step for the location of toponyms increases the accuracy significantly. On average, considering the whole corpus, our method is able to identify and correctly locate 55% of the toponyms without spatial inference. Using the spatial inference method, this percentage increases to 84% of toponyms correctly identified and located.
                      French   Spanish  Italian
Total # of toponyms   595      376      409
Accuracy without SI   57.14%   61.70%   45.96%
Accuracy with SI      87.39%   88.82%   75.3%
Table 9: Accuracy without and with spatial inference (SI)
5. CONCLUSIONS
This work has proposed a processing chain for the geoparsing and geocoding of texts containing travel descriptions, with special emphasis on two main problems related to the geocoding part of the method: the existence of ambiguous toponyms, and the lack of gazetteers with enough coverage for fine-grain toponyms. The solution proposed for addressing these two problems is based on the use of clustering techniques. On the one hand, the clusters provide a map-based disambiguation approach: the best cluster is the one gathering, distance-wise, the highest number of candidate toponyms. On the other hand, the bounding polygon of this cluster can be used as an estimate of the location of those fine-grain toponyms not found in gazetteers.
Additionally, the feasibility of the proposed method has been tested for the geoparsing and geocoding of toponyms in a corpus of hiking descriptions, obtaining good results in terms of accuracy, precision and recall. Moreover, it must be noted that the method performs well in a multilingual environment: the results obtained for the three tested languages (French, Italian and Spanish) are comparable. The only requirements for applying the proposed method to a set of travel descriptions in a new language are the customization of the geoparsing part and the availability of gazetteers for the geographic area covered by the texts in this new language.
However, some refinements could be included in the proposed method to increase its performance. On the one hand, additional heuristics could be taken into account to address the problem of reference ambiguity (i.e., several place names for the same place). In the current work we have considered the problem of structural ambiguity (i.e., finding or not finding sub-types within the toponym names in gazetteers), but other problems could be taken into account: variants of toponym names and types in other languages, abbreviations, etc. On the other hand, with respect to the adjustment of parameters in the application of the DBSCAN clustering method to deal with the problem of referent ambiguity, a possible refinement would be the automatic definition of parameter values by means of machine learning techniques, as proposed in other works using clustering techniques [4, 10].
Finally, it must be noted that the proposed processing chain for geoparsing and geocoding could be applied to other types of text corpora. The proposed method is general enough to be applied to any kind of narrative description of a small area. In the future we plan to test this method with different corpora referring to travel descriptions in other countries and with a higher level of granularity (e.g., travel tours across different countries).
Acknowledgments
This work has been partially supported by: the Communauté d'Agglomération Pau Pyrénées (CDAPP) and the National Geographic Institute of France through the PERDIDO project; the Spanish Ministry of Education through the International Excellence Campus Program (Campus Iberus mobility grants); the Aragon Government through grant ref. B181/11 and the Pyrenees Working Community mobility grant ref. CTPM2/13; and the Keystone COST Action IC1302.
6. REFERENCES
[1] S. Scheider and R. S. Purves. Semantic place localization from narratives. In Proceedings of the First ACM SIGSPATIAL International Workshop on Computational Models of Place, COMP '13, pages 16:16-16:19, New York, NY, USA, 2013. ACM.
[2] R. J. Agrawal and J. G. Shanahan. Location disambiguation in local searches using gradient boosted decision trees. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '10, pages 129-136, 2010.
[3] A. Aji, X. Sun, H. Vo, Q. Liu, R. Lee, X. Zhang, J. H. Saltz, and F. Wang. Demonstration of Hadoop-GIS: a spatial data warehousing system over MapReduce. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 518-521, 2013.
[4] K.-H. Anders and M. Sester. Parameter-free cluster detection in spatial databases and its application to typification. International Archives of Photogrammetry and Remote Sensing, 33(B4/1, Part 4):75-83, 2000.
[5] D. Buscaldi. Approaches to disambiguating toponyms. SIGSPATIAL Special, 3(2):16-19, July 2011.
[6] D. Buscaldi and P. Rosso. A conceptual density-based approach for the disambiguation of toponyms. International Journal of Geographical Information Science, 22(3):301-313, Jan. 2008.
[7] D. Buscaldi and P. Rosso. Map-based vs. knowledge-based toponym disambiguation. In Proceedings of the 2nd International Workshop on Geographic Information Retrieval, GIR '08, pages 19-22, New York, NY, USA, 2008. ACM.
[8] C. Derungs and R. S. Purves. From text to landscape: locating, identifying and mapping the use of landscape features in a Swiss Alpine corpus. International Journal of Geographical Information Science, 28(6):1272-1293, 2013.
[9] A. Eldawy, Y. Li, M. F. Mokbel, and R. Janardan. CG_Hadoop: computational geometry in MapReduce. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 284-293, 2013.
[10] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. In VLDB, volume 98, pages 323-333, 1998.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226-231, 1996.
[12] U. Feuerhake and M. Sester. Mining group movement patterns. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 510-513, 2013.
[13] A. J. Florczyk, F. J. Lopez-Pellicer, P. R. Muro-Medrano, J. Nogueras-Iso, and F. J. Zarazaga-Soria. Semantic selection of georeferencing services for urban management. Journal of Information Technology in Construction, 15 (Special Issue: Bringing urban ontologies into practice):111-121, 2010.
[14] M. Habib and M. van Keulen. Improving toponym disambiguation by iteratively enhancing certainty of extraction. In KDIR 2012 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, 2012.
[15] Q. Hao, R. Cai, C. Wang, R. Xiao, J.-M. Yang, Y. Pang, and L. Zhang. Equip tourists with knowledge mined from travelogues. In Proceedings of the 19th International Conference on World Wide Web, pages 401-410. ACM, 2010.
[16] S. Intagorn and K. Lerman. Learning boundaries of vague places from noisy annotations. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 425-428. ACM, 2011.
[17] N. Ireson and F. Ciravegna. Toponym resolution in social media. In The Semantic Web - ISWC 2010, pages 370-385. Springer, 2010.
[18] J. L. Leidner. Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names. Universal-Publishers, Jan. 2008.
[19] J. L. Leidner and M. D. Lieberman. Detecting geographical references in the form of place names and associated spatial natural language. SIGSPATIAL Special, 3(2):5-11, July 2011.
[20] M. D. Lieberman and H. Samet. Adaptive context features for toponym resolution in streaming news. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 731-740. ACM, 2012.
[21] M. D. Lieberman, H. Samet, and J. Sankaranarayanan. Geotagging with local lexicons to build indexes for textually-specified spatial data. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 201-212. IEEE, 2010.
[22] D. Maurel and N. Friburger. Finite-state transducer cascades to extract named entities in texts. Theoretical Computer Science, 313:93-104, 2004.
[23] L. Moncla, M. Gaio, and S. Mustière. Automatic itinerary reconstruction from texts. In Eighth International Conference on Geographic Information Science, GIScience 2014, Vienna, September 23-26, 2014.
[24] V. T. Nguyen, M. Gaio, and L. Moncla. Topographic subtyping of place named entities: a linguistic approach. In The 16th AGILE International Conference on Geographic Information Science, Leuven, Belgium, 2013.
[25] T. Poibeau. Extraction automatique d'information (du texte brut au web sémantique). 2003.
[26] T. Qin, R. Xiao, L. Fang, X. Xie, and L. Zhang. An efficient location extraction algorithm by leveraging web contextual information. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '10, pages 53-60, 2010.
[27] T. Rattenbury and M. Naaman. Methods for extracting place semantics from Flickr tags. ACM Transactions on the Web (TWEB), 3(1):1, 2009.
[28] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 484-491. ACM, 2009.
[29] P. D. Smart, C. Jones, and F. Twaroch. Multi-source toponym data integration and mediation for a meta-gazetteer service. In S. Fabrikant, T. Reichenbacher, M. van Kreveld, and C. Schlieder, editors, Geographic Information Science, volume 6292 of Lecture Notes in Computer Science, pages 234-248. Springer Berlin Heidelberg, 2010.
[30] D. Smith and G. Mann. Bootstrapping toponym classifiers. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, Volume 1, pages 45-49. Association for Computational Linguistics, 2003.
[31] N. Wacholder, Y. Ravin, and M. Choi. Disambiguation of proper names in text. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 202-208. Association for Computational Linguistics, 1997.
[32] X. Zhang, B. Qiu, P. Mitra, S. Xu, A. Klippel, and A. M. MacEachren. Disambiguating road names in text route descriptions using Exact-All-Hop Shortest Path algorithm. In ECAI, pages 876-881, 2012.
[33] J. Zhao, P. Jin, Q. Zhang, and R. Wen. Exploiting location information for web search. Computers in Human Behavior, 30:378-388, 2014.
... From a cognitive point of view, landmarks are remarkable real-world elements that people use to understand space and better orient themselves (Lynch, 1960). From a space description perspective, landmarks are considered as references to describe the space (Zhou et al., 2017), whereas, in the field of information retrieval, landmarks represent placenames considered as anchors to geocode proper names (Moncla et al., 2014). Based on these definitions, we define a landmark as a landscape reference feature, named or unnamed, natural or built, which can be seen, known or used to practice an outdoor activity and used by victims in mountainous areas for locating themselves. ...
Article
Full-text available
When people are injured or lost in mountains during outdoor activities and when web-based locations are not available, they locate themselves by describing their environment, routes and activities. The description of their location is done using landmarks and spatial locations (e.g., “I am located in front of Punay Lake”, “I am near a protected area”). Landmarks used can be named (e.g., “Punay Lake”) or unnamed if the landmark has no name or if the victim does not know it (e.g., "area lake"). Landmarks are represented in geographic databases by name (if possible), type and geometry. To reduce the heterogeneity of landmark types present in oral language and geographic databases representing landmarks, and thus improve locating victims, our goal is to define a controlled vocabulary for landmarks. In this research, we present a lightweight ontology (i.e. ontology having generally less complexity and does not express formal constraints) of landmarks, named Landmark Ontology (OOR), describing landmark types. It is an application ontology, i.e. it is designed to support mountain rescue operations. The ontology construction is adapted from the SAMOD methodology for engineering ontology development and involves researchers and experts from mountain rescue teams. The construction of OOR is composed of four main phases: knowledge acquisition, conceptual formalisation, implementation, and testing. The implementation phase is carried out by an iterative and collaborative approach and using four formalised sources of knowledge (a landmark and a landform ontologies, and two other domain vocabularies), an un-formalised taxonomy of outdoor activities, and five authoritative and volunteered geographic information sources representing geographic data. The landmarks ontology contains 543 classes associated with 1739 labels: 1086 prefLabel (preferential label) in French and English, 321 altLabel (alternative label) in French, and 332 altLabel in English. The depth of the ontology varies from four for land cover, hydrological and land subdivision landmark types), to six for landform types, and eight for building types. Although the use of ontology is broader, in this paper we illustrate and test its use through three applications in the context of mountain rescue operations: semantic mapping, data instantiation and data matching.
... It is apparent that this requires more than just building formal FoR models (Clementini, 2013), extracting parts-of-speech (PoS), spatial relation words without context (Stock and Yousaf, 2018), or the recognition of named entities (NER). What is needed includes, to the very least: While some research has recently been done to address the latter two challenges (Chen, Vasardani, and Winter, 2018;Scheider et al., 2018;Stock and Yousaf, 2018), the first two challenges about geoparsing are seldomly taken into focus (Moncla et al., 2014;Vasardani et al., 2012;Stock and Yousaf, 2018). In particular, it is still unclear which kinds of reference strategies need to be distinguished for environmental narratives, and to which degree they can be extracted from texts based on state-of-the-art geoparsing methods. ...
Chapter
Understanding the role of humans in environmental change is one of the most pressing challenges of the 21st century. Environmental narratives – written texts with a focus on the environment – offer rich material capturing relationships between people and surroundings. We take advantage of two key opportunities for their computational analysis: massive growth in the availability of digitised contemporary and historical sources, and parallel advances in the computational analysis of natural language. We open by introducing interdisciplinary research questions related to the environment and amenable to analysis through written sources. The reader is then introduced to potential collections of narratives including newspapers, travel diaries, policy documents, scientific proposals and even fiction. We demonstrate the application of a range of approaches to analysing natural language computationally, introducing key ideas through worked examples, and providing access to the sources analysed and accompanying code. The second part of the book is centred around case studies, each applying computational analysis to some aspect of environmental narrative. Themes include the use of language to describe narratives about glaciers, urban gentrification, diversity and writing about nature and ways in which locations are conceptualised and described in nature writing. We close by reviewing the approaches taken, and presenting an interdisciplinary research agenda for future work. The book is designed to be of interest to newcomers to the field and experienced researchers, and set out in a way that it can be used as an accompanying text for graduate level courses in, for example, geography, environmental history or the digital humanities.
... It is apparent that this requires more than just building formal FoR models (Clementini, 2013), extracting parts-of-speech (PoS), spatial relation words without context (Stock and Yousaf, 2018), or the recognition of named entities (NER). What is needed includes, to the very least: While some research has recently been done to address the latter two challenges (Chen, Vasardani, and Winter, 2018;Scheider et al., 2018;Stock and Yousaf, 2018), the first two challenges about geoparsing are seldomly taken into focus (Moncla et al., 2014;Vasardani et al., 2012;Stock and Yousaf, 2018). In particular, it is still unclear which kinds of reference strategies need to be distinguished for environmental narratives, and to which degree they can be extracted from texts based on state-of-the-art geoparsing methods. ...
Book
Full-text available
Understanding the role of humans in environmental change is one of the most pressing challenges of the 21st century. Environmental narratives – written texts with a focus on the environment – offer rich material capturing relationships between people and surroundings. We take advantage of two key opportunities for their computational analysis: massive growth in the availability of digitised contemporary and historical sources, and parallel advances in the computational analysis of natural language. We open by introducing interdisciplinary research questions related to the environment and amenable to analysis through written sources. The reader is then introduced to potential collections of narratives including newspapers, travel diaries, policy documents, scientific proposals and even fiction. We demonstrate the application of a range of approaches to analysing natural language computationally, introducing key ideas through worked examples, and providing access to the sources analysed and accompanying code. The second part of the book is centred around case studies, each applying computational analysis to some aspect of environmental narrative. Themes include the use of language to describe narratives about glaciers, urban gentrification, diversity and writing about nature and ways in which locations are conceptualised and described in nature writing. We close by reviewing the approaches taken, and presenting an interdisciplinary research agenda for future work. The book is designed to be of interest to newcomers to the field and experienced researchers, and set out in a way that it can be used as an accompanying text for graduate level courses in, for example, geography, environmental history or the digital humanities.
... Fusing rule and gazetteer: Many studies [118,122,131,132,143,182,186] English tweets and 500 Spanish tweets to test LORE. ...
Article
Full-text available
A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to the process of recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of the specific applications is still missing. Further, there lacks a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and a core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching-based, statistical learning-based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references across the world. Results from this thorough evaluation can help inform future methodological developments for location reference recognition, and can help guide the selection of proper approaches based on application needs.
Article
This paper explores cognitive place associations; conceptualised as a place‐based mental model that derives subconscious links between geographic locations. Utilising a large corpus of online discussion data from the social media website Reddit, we experiment on the extraction of such geographic knowledge from unstructured text. First we construct a system to identify place names found in Reddit comments, disambiguating each to a set of coordinates where possible. Following this, we build a collective picture of cognitive place associations in the United Kingdom, linking locations that co‐occur in user comments and evaluating the effect of distance on the strength of these associations. Exploring these geographies nationally, associations were shown to be typically weaker over greater distances. This distance decay is also highly regional, rural areas typically have greater levels of distance decay, particularly in Wales and Scotland. When comparing major cities across the UK, we observe distinct distance decay patterns, influenced primarily by proximity to other cities.
Article
Human and natural processes such as navigation and natural calamities are intrinsically linked to the geographic space and described using place names. Extraction and subsequent geocoding of place names from text are critical for understanding the onset, progression, and end of these processes. Geocoding place names extracted from text requires using an external knowledge base such as a gazetteer. However, a standard gazetteer is typically incomplete. Additionally, widely used place name geocoding—also known as toponym resolution—approaches generally focus on geocoding ambiguous but known gazetteer place names. Hence there is a need for an approach to automatically geocode non -gazetteer place names. In this research, we demonstrate that patterns in place names are not spatially random. Places are often named based on people, geography, and history of the area and thus exhibit a degree of similarity. Similarly, places that co-occur in text are likely to be spatially proximate as they provide geographic reference to common events. We propose a novel data-driven spatially-aware algorithm, Bhugol , that leverages the spatial patterns and the spatial context of place names to automatically geocode the non-gazetteer place names. The efficacy of Bhugol is demonstrated using two diverse geographic areas – USA and India. The results show that Bhugol outperforms well-known state-of-the-art geocoders.
Conference Paper
Full-text available
The aim of this work is to find sub-types for Place Named Entities, from the analysis of relations between Place Names and a nominal group within a specific phrasal context. The proposed method combines the use of specific intra-sentential lexico-syntactic relations and external resources like gazetteers, thesauri, or ontologies. It relies on expanded spatial named entities recognition transcribed into a symbolic representation expressed in terms of semantic features. This symbolic representation will then be associated with a geo-coded representation, depending on the available resources. Our method is completely implemented and has been tested on a corpus of travelogues.
Conference Paper
Full-text available
This paper proposes an approach for the reconstruction of itineraries extracted from narrative texts. This approach is divided into two main tasks. The first extracts geographical information with natural language processing. Its outputs are annotations of so called expanded entities and expressions of displacement or perception from hiking descriptions. In order to reconstruct a plausible footprint of an itinerary described in the text, the second task uses the outputs of the first task to compute a minimum spanning tree.
Conference Paper
Full-text available
The proliferation of GPS-enabled devices, and the rapid improvement of scientific instruments have resulted in massive amounts of spatial data in the last decade. Support of high performance spatial queries on large volumes data has become increasingly important in numerous fields, which requires a scalable and efficient spatial data warehousing solution as existing approaches exhibit scalability limitations and efficiency bottlenecks for large scale spatial applications. In this demonstration, we present Hadoop-GIS -- a scalable and high performance spatial query system over MapReduce. Hadoop-GIS provides an efficient spatial query engine to process spatial queries, data and space based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. Hadoop-GIS also provides an expressive, SQL-like spatial query language for work-load specification. We will demonstrate how spatial queries are expressed in spatially extended SQL queries, and submitted through a command line/web interface for execution. Parallel to our system demonstration, we explain the system architecture and details on how queries are translated to MapReduce operators, optimized, and executed on Hadoop. In addition, we will showcase how the system can be used to support two representative real world use cases: large scale pathology analytical imaging, and geo-spatial data warehousing.
Article
Place narratives provide a rich resource of learning how humans localize places. Place localization can be done in various ways, relative to other spatial referents, and relative to agents and their activities in which these referents may be involved. How can we describe places based on their spatial and semantic relationships to objects, qualities, and activities? How can these relations help us improve automated localization of places implicit in textual descriptions? In this paper, we motivate research on extraction of semantic place localization statements from text corpora which can be used for improving document retrieval and for reconstructing locations. The idea is to combine SemanticWeb reasoning with existing geographic information retrieval (GIR) and structural text extraction for this purpose. GIR and Semantic Web technology have matured during the last years, but still largely exist in parallel. Current localization approaches have been focusing on the extraction of unstructured word lists from texts, including toponyms and geographic features, not on human place descriptions on a sentence level.
Article
Automatic extraction and understanding of human-generated route descriptions have been critical to research aiming at understanding human cognition of geospatial information. Among all research issues involved, road name disambiguation is the most important, because one road name can refer to more than one road. Compared with traditional toponym (place name) disambiguation, the challenges of disambiguating road names in human-generated route description are three-fold: (1) the authors may use a wrong or obsolete road name and the gazetteer may have incomplete or out-of-date information; (2) geographic ontologies often used to disambiguate cities or counties do not exist for roads, due to their linear nature and large spatial extent; (3) knowledge of the co-occurrence of road names and other toponyms are difficult to learn due to the difficulty in automatic processing of natural language and lack of external information source of road entities. In this paper, we solve the problem of road name disambiguation in human-generated route descriptions with noise, i.e. in the presence of wrong names and incomplete gazetteer. We model the problem as an Exact-All-Hop Shortest Path problem on a semi-complete directed k-partite graph, and design an efficient algorithm to solve it. Our disambiguation algorithm successfully handles the noisy data and does not require any extra information sources other than the gazetteer. We compared our algorithm with an existing map-based method. Experiment results show that our algorithm significantly outperforms the existing method.
Article
In this paper, we demonstrate how a large corpus, consisting of about 10 000 articles describing Swiss alpine landscapes and activities and dating back to 1864, can be used to explore the use of language in space. In a first step, we link landscape descriptions to geospatial footprints, which requires new methods to disambiguating toponyms referring to natural features. Secondly, we identify natural features used to describe landscapes, which are compared and discussed in the light of previous work based on controlled participant experiments in laboratory settings and more exploratory ethnographic studies. Finally, we use natural features in combination with geospatial footprints to investigate variations in landscape descriptions across space. Our contributions are threefold. Firstly, we show how a corpus composed of detailed descriptions of natural landscapes can be georeferenced and mapped using density surfaces and an adaptive grid linking footprints to articles. Secondly, 95 natural features are identified in the corpus, forming a vocabulary of terms reflecting known basic levels and their relationships to other more specific landscape features. Thirdly, we can explore the use of natural features in broader spatial and temporal contexts than is possible in typical ethnographic work, by exploring when and where particular terms are used within Switzerland with respect to our corpus. On the one hand, this enables us to characterize individual regions and, on the other hand, to measure similarity between regions, on the basis of associated natural features. Our methods could be adapted to different types of corpus, for instance, referring to fine granularity entities in urban landscapes. Our results are potential building blocks for attaching place-related descriptions to automatically generated sensor data such as photographs or satellite images.
Conference Paper
In this paper we aim to recognize a priori unknown group movement patterns. We propose a constellation-based approach to extract repetitive relative movements of a constant group, which are allowed to be rotated, translated or differently scaled. To this end, we record a sequence of constellations, which are used for describing the movements relatively. We deal with uncertainties, and similarities of constellations respectively, by clustering the constellations. Further, we have developed a sequence mining algorithm, which uses the clustering results and tree-like data structures to extract the requested patterns from the sequence. Finally, this approach is applied to different datasets containing real trajectory data provided by different tracking devices. By this way, we want to show its portability to different use cases.
Conference Paper
Hadoop, employing the MapReduce programming paradigm, has been widely accepted as the standard framework for analyzing big data in distributed environments. Unfortunately, this rich framework was not truly exploited towards processing large-scale computational geometry operations. This paper introduces CG_Hadoop; a suite of scalable and efficient MapReduce algorithms for various fundamental computational geometry problems, namely, polygon union, skyline, convex hull, farthest pair, and closest pair, which present a set of key components for other geometric algorithms. For each computational geometry operation, CG_Hadoop has two versions, one for the Apache Hadoop system and one for the SpatialHadoop system; a Hadoop-based system that is more suited for spatial operations. These proposed algorithms form a nucleus of a comprehensive MapReduce library of computational geometry operations. Extensive experimental results on a cluster of 25 machines of datasets up to 128GB show that CG_Hadoop achieves up to 29x and 260x better performance than traditional algorithms when using Hadoop and SpatialHadoop systems, respectively.
Article
Named entity extraction (NEE) and disambiguation (NED) have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and semantic web. This paper addresses two problems with toponym extraction and disambiguation (as a representative example of named entities). First, almost no existing works examine the extraction and disambiguation interdependency. Second, existing disambiguation techniques mostly take as input extracted named entities without considering the uncertainty and imperfection of the extraction process. It is the aim of this paper to investigate both avenues and to show that explicit handling of the uncertainty of annotation has much potential for making both extraction and disambiguation more robust. We conducted experiments with a set of holiday home descriptions with the aim to extract and disambiguate toponyms. We show that the extraction confidence probabilities are useful in enhancing the effectiveness of disambiguation. Reciprocally, retraining the extraction models with information automatically derived from the disambiguation results, improves the extraction models. This mutual reinforcement is shown to even have an effect after several automatic iterations.