International Journal of Geographical Information Science
ISSN: 1365-8816 (Print) 1362-3087 (Online) Journal homepage: http://www.tandfonline.com/loi/tgis20
Aggregation-based information retrieval system for geospatial data catalogs
Javier Lacasta, F. Javier Lopez-Pellicer, Borja Espejo-García, Javier Nogueras-Iso & F. Javier Zarazaga-Soria
To cite this article: Javier Lacasta, F. Javier Lopez-Pellicer, Borja Espejo-García, Javier
Nogueras-Iso & F. Javier Zarazaga-Soria (2017): Aggregation-based information retrieval system
for geospatial data catalogs, International Journal of Geographical Information Science,
DOI: 10.1080/13658816.2017.1319949
To link to this article: http://dx.doi.org/10.1080/13658816.2017.1319949
Published online: 02 May 2017.
RESEARCH ARTICLE
Aragon Institute of Engineering Research (I3A), Universidad de Zaragoza, Zaragoza, Spain
ABSTRACT
Geospatial data catalogs enable users to discover and access geographical
information. Prevailing solutions are document oriented and fragment the spatial
continuum of the geospatial data into independent and disconnected resources
described through metadata. As a result, the complete answer to a query may be
scattered across multiple resources, making its discovery and access more
difficult. This paper proposes an improved information retrieval process for
geospatial data catalogs that aggregates the search results by identifying the
implicit spatial/thematic relations between the metadata records of the
resources. These aggregations are constructed in such a way that they match the
user query better than each resource individually.
ARTICLE HISTORY
Received 24 August 2016
Accepted 12 April 2017
KEYWORDS
Geospatial data catalog;
information retrieval;
catalog service for the web;
spatial data infrastructure
1. Introduction
Geographical information is commonly used by organizations, institutions and ordinary
citizens for daily work and leisure activities. In recent years, the number, variety and
goals of geographical data creators and users have increased, thanks to the progressive
cost reduction of the technologies needed for acquiring, processing, analyzing, accessing,
presenting and transferring geographical information (Anderson and Gaston 2013,
Paneque-Gálvez et al. 2014). Part of this cost reduction is the result of more than two
decades of work by public and private initiatives to promote the generation and use of
geographical information through spatial data infrastructures (SDI).
Geospatial data catalogs are responsible for facilitating the location of and access to
spatial resources in SDIs (OGC 2007a). The creation of international standards has
facilitated their adoption. Some examples are the International Organization for
Standardization metadata standards for geographical data (ISO/TC 211 2014, 2016),
the Open Geospatial Consortium (OGC) standards for discovering geographical information
on the web (OGC 2007b) and the implementing rules of the European INSPIRE
Directive that ensure the interoperability of European SDIs (Nogueras-Iso et al. 2009).
Technologically, geospatial data catalogs are similar to digital libraries that provide
access to textual documents, images or any other kind of resource described with a
metadata record (Smith 1996). The most basic geospatial data catalogs provide a
text/keyword-based search system and a location-based search component to filter and sort
the resources by their spatial features (Göbel and Klein 2002). This kind of query is
usually named a 'concept at location' query in the literature (Hübner et al. 2004). In the
geographical context, the concepts in a 'concept at location' query are the themes of the
resources. The text-based search usually provides free-text queries on the metadata
records, and the selection of terms from controlled vocabularies. The location-based
search component usually allows constraining the user query to an area defined by
coordinates or by geographic identifiers. The answer always consists of a metadata
record list with the resources that partially fulfill the query restrictions, sorted by some
similarity criteria.
CONTACT Javier Lacasta jlacasta@unizar.es
© 2017 Informa UK Limited, trading as Taylor & Francis Group
These approaches are valid in many situations but they have limitations. From a
spatial perspective, geographical information forms a continuum that covers the Earth's
surface. However, data creators divide it into independent resources that cover
different spatial and thematic extents. This division is usually done to fulfill the
producers' goals and ignores the nature of the stored information. For example, governments
develop geographical resources that cover their country's surface, but several
geographical features, such as river basins, are shared between countries. In the case
of rivers, Wolf et al. (1999) estimated that 45.3% of the land surface corresponds to
river basins covering more than one country. In this context, the organization of the
information into datasets spatially delimited by the boundaries of countries and other
kinds of administrative regions conflicts with the needs of users who require continuous
data (e.g. a pan-European provider of road maps, or a flood risk manager covering different
autonomous regions in a country). Moreover, data belonging to related topics in the
same area may be scattered across different datasets. In short, this producer-oriented
approach may conflict with users who need continuous data.
The pan-European INSPIRE geoportal¹ is an example of a system with these
limitations. This geospatial data catalog was created to support the
implementation of the INSPIRE Directive. It is a state-focused geospatial data catalog
where each member state describes its geospatial resources. This focus may lead to
incomplete answers for some kinds of queries. For example, Figure 1 shows the answer to
a query about 'road networks' in an area between Italy and France. This kind of query
will always produce incomplete results because there are no resources in the catalog
covering both countries. In this example, the first result describes exclusively the road
networks in France and the second one describes only those in Italy. This data partition
makes ranking an unhelpful feature. Each result only provides a part of the required
information, and the user is forced to review all of them to compose a set of suitable
road resources that fulfill their needs. This review task is challenging because the lack of
feedback about how the search parameters define the search results makes the
comparison and interpretation of results difficult (Göbel and Klein 2002). Moreover, in the
same way as there is spatial fragmentation, there is also thematic fragmentation. As each
resource may contain only a small set of the themes of the information available about
an area, in multi-theme queries, the results will also be thematically fragmented.
Nowadays, the geospatial data catalogs of public institutions (such as the INSPIRE
geoportal) are the most technologically advanced, but all of them have similar problems
in terms of results interpretation.
Figure 1. Example of query answer from the INSPIRE geospatial data catalog.
The main contribution of this paper is an improved information retrieval (IR) process
for geospatial data catalogs that aggregates the search results by identifying the implicit
spatial/thematic relations between the metadata records of the resources. These
aggregations are constructed in such a way that they match the user query better than each
resource individually. The returned aggregations are composed of metadata records that
describe resources that complement each other and fill the spatial gaps that each
individual resource has for each queried theme. This paper focuses on analyzing
the suitability of aggregating the metadata records provided as query results in
the geospatial data catalog context. To isolate the result composition issues, other IR
issues such as terminological heterogeneity or the use of imprecise spatial references in
user queries are left aside. The system performance is evaluated by comparing the behavior
of the proposed IR system with another one that is similar to those used in prevalent
geospatial data catalogs.
The rest of the paper is organized as follows. Section 2 reviews other works related to
IR systems for geospatial data catalogs. Section 3 explains the IR issues that we analyze
in this paper. Section 4 describes the proposed IR system, which is evaluated through a
series of experiments described in Section 5. Section 6 discusses the obtained results.
Finally, the paper ends with some conclusions and an outlook on future work.
2. State of the art
There are many works that have proposed IR improvements through better similarity
measures for result ranking, and through the increase of the metadata and query
description quality. This section reviews a selection of these works.
Regarding the definition of a spatial similarity measure between resources for ranking
purposes, in the web context, Watters and Amoudi (2003) propose as a ranking factor
for queries the distance between the place where the user is located and the place
where the web server with the relevant data is located. They translate URLs into the
spatial coordinates of the place where the web domain is situated and they use the
linear distance from these coordinates to the user's spatial location as a ranking factor of
the results. Asadi et al. (2005) analyze the different types of textual queries that involve
spatial information and describe how to adjust their ranking formulas. They review direct
queries about facts, local queries that restrict the relevant results to those describing an
area and location-based queries where the objective is to locate specific entities in an
area (e.g. train stations).
Focusing on geospatial data catalogs, Larson and Frontiera (2004) describe a statistically
based ranking formula for geometry-based spatial queries. They review
previous spatial-based ranking formulas and propose a statistical measure that includes
a corrective factor to deal with the problems caused by the imprecise definition of
bounding boxes in border areas such as coasts. Lanfear (2006) suggests another ranking
method for spatial features in geospatial data catalogs that takes into account the
overlap between the query area and the resource, and the dimensions of the area
outside the overlap. More recently, Renteria-Agualimpia et al. (2016) detect incoherencies
in metadata collections by comparing the explicit geographical extent defined
by coordinates and an implicit one defined by geographic identifiers found in metadata
records. Their use of the Hausdorff distance to detect how similar two geometries are
can be directly used in the IR context to determine a spatial similarity measure of a
resource with the user query.
In addition to the previous works, there are ranking proposals that focus on the
integration of different relevance measures. Göbel and Klein (2002) propose a linear
ranking formula that, in addition to the similarity with the spatial feature of the query
(both coordinate and gazetteer based), includes the degree of thematic coincidence
and temporal overlap. Martins et al. (2005) compare different approaches to generate a
combined ranking value from individual spatial and thematic distances. This comparison
includes the use of a linear combination, the product, the maximum similarity and a
step-linear function. Finally, Megler and Maier (2011) present a ranking method for
integrating spatial and temporal query features based on the mean between the spatial
and temporal distances to the center of the selected period or selected area.
Another approach frequently used to obtain better IR systems for geospatial data
catalogs has been to improve the resource and the query descriptions. Lieberman (2006)
describes an SDI architecture where online resources are able to describe themselves.
This solution requires a semantic facade on top of OGC standard services that
describes their content through the use of ontologies. Lutz et al. (2009) propose instead
a semantic catalog where geospatial resources are described using roles and concepts
from a domain ontology. In a related vein, Janowicz et al. (2010) propose a transparent
semantic layer for SDI. They annotate the resource descriptions using ontologies and
they relate these ontologies through a reasoning service implemented as a profile of an
OGC catalog and a processing service, respectively. Finally, Florczyk et al. (2010) add
semantics into an SDI catalog with a linked data administrative geography ontology that
is used for data integration and referencing geographic themes.
Regarding the improvement of query descriptions, other works have focused on the
identification of textual patterns describing locations or location-based references (e.g.
'north of X'). Works such as Sallaberry (2013), Ferrés and Rodríguez (2015) and Kim et al.
(2017) show that the identification of textual patterns describing locations can greatly
improve the quality of the results when the spatial description in metadata records is
textual. However, in geospatial data catalogs, these solutions are less relevant because,
by design, the metadata records of spatial resources specify their spatial limits as
coordinates.
Our process focuses on automatically providing improved aggregated search results
for geospatial data catalogs by using raw metadata. There have been other proposals
that perform aggregations of resources at the data level but they require either human
intervention or an extra layer of complexity such as adding domain knowledge in the
form of ontologies. For example, Hübner et al. (2004) describe an ontology-based
reasoning system that integrates heterogeneous geographical information in 'concept at
location in time' queries. The user employs provided ontologies to define a query and
the system returns a list of resources sorted by relevance. Then, the system facilitates
their visualization and integration. Similarly, Lutz and Klien (2006) propose a retrieval
system in which features published at Web Feature Services (WFSs) are described in terms
of a shared domain ontology. This system offers a user interface that allows formulating
queries using such an ontology. A different approach can be found in Latre et al. (2009).
This work describes a retrieval system that identifies non-explicit relations between
hydrologic feature types published at WFSs and uses this knowledge to expand results.
Finally, Zhu et al. (2015) describe a user-focused spatial data analysis service that unifies
the access to heterogeneous data by creating linked layers after parsing user requests.
3. Spatial and thematic issues in geospatial data catalogs
This section reviews the IR systems used in a representative set of geospatial data
catalogs and describes their features and issues. We have analyzed the pan-European
INSPIRE catalog, and the national catalogs in the USA² (GeoPlatform), Spain³ (IDEE),
the United Kingdom⁴ (Data.Gov) and Canada⁵ (GeoDiscovery). Below, we present an
analysis of their user query interfaces and how they answer when the queries include
spatial and thematic constraints. Then, we describe the issues that the process described
in this paper tries to correct.
3.1. Prevalent approaches
Table 1 shows the type of search and ranking provided by each analyzed system. All the
analyzed systems provide free-text search and some kind of controlled topic list or
faceted solution. Additionally, their advanced search components focus on specific
metadata fields, such as the resource type or data format. Among them, only the
Spanish and Canadian systems offer temporal search (periods of time). Regarding the
spatial search, the queries in the UK catalog cannot include textual and spatial features
simultaneously. The remaining systems provide a bounding box-based spatial search
component that can be combined with other query elements.
In order to determine how the search process is performed, we have analyzed the
results of queries using controlled fields, queries using free-text fields, queries using a
spatial bounding box and queries with the three restrictions. The query terms have been
selected so that they return multiple resources (e.g. 'road network' in the INSPIRE
geoportal), and we have counted the occurrences of the query terms in each of the obtained
metadata records and the percentage of the query area that they cover.
Table 1. Search and ranking features.
Catalog      | Country        | Type of search                         | Type of ranking
INSPIRE      | EU             | Spatial, free text, term cloud, topics | Relevance
GeoPlatform  | USA            | Spatial, free text, facets             | Relevance, popularity
IDEE         | Spain          | Free text, topics, date                | Rating, relevance, popularity
Data.gov.uk  | United Kingdom | Spatial or free text, facets           | Relevance, popularity
GeoDiscovery | Canada         | Spatial, free text, date, topics       | Relevance
Through this analysis, we have found that the systems perform 'AND'-style queries when
the queries involve two or more types of constraints (spatial, controlled or free text).
That is, the responses only contain records that match all the constraints. The results can
be sorted according to a relevance rank (some of them also include popularity and user
rating) or alphabetically by different fields (e.g. title). Regarding the relevance rank, the
number of occurrences of the query term in any part of a metadata record is used as a
ranking factor (the metadata records with more query term occurrences come first).
However, when the query involves controlled fields, only the existence of the term is taken
into account as a ranking factor. We also tried to identify whether any of the systems uses
ontologies, or any other kind of formal model, for query expansion or refinement. However,
since all the results in the tested systems contain the query terms used, it seems that
queries have not been expanded with additional terms such as synonyms, hypernyms or
hyponyms. In the systems supporting spatial restrictions, the more the query area and
the geographical extent of a resource overlap, the better its rank is.
We have been unable to identify the exact ranking formula used for combining the
spatial and textual rankings in the systems that support 'concept at location' style
queries (INSPIRE, GeoPlatform and GeoDiscovery). However, we have detected that, in
these systems, a spatially closer resource is ranked first even if it has far fewer
occurrences of the textual query terms. This indicates that the ranking weight given to the
spatial similarity is higher than the one used for the textual similarity.
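The observed behavior can be illustrated with a minimal sketch. It assumes, hypothetically, that each record carries a free-text field and a bounding box, that the 'AND' filter requires at least one term occurrence and a positive spatial overlap, and that overlap area always outranks the term count. The `Record` class, the function names and the lexicographic ordering are illustrative assumptions consistent with the observations above, not the formula used by any of the reviewed catalogs.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    text: str    # free-text metadata content (hypothetical field)
    bbox: tuple  # (min_x, min_y, max_x, max_y)


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def and_style_search(records, term, query_bbox):
    """Keep records matching both constraints; sort by overlap, then term count."""
    hits = []
    for rec in records:
        occurrences = rec.text.lower().count(term.lower())
        overlap = overlap_area(rec.bbox, query_bbox)
        if occurrences > 0 and overlap > 0:  # 'AND' of both constraints
            hits.append((overlap, occurrences, rec))
    # Assumption: spatial overlap dominates, term count only breaks ties.
    hits.sort(key=lambda h: (h[0], h[1]), reverse=True)
    return [rec for _, _, rec in hits]
```

With this ordering, a record that mentions the term once but overlaps most of the query area is listed before a record that mentions it many times but only touches a corner of the query.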
The functionalities offered by these systems seem suitable in many situations but
they are problematic when performing queries about multiple themes in an area crossing
multiple countries or regions. In these cases, the results obtained are similar to those
shown in Figure 1. That is, the results are partially relevant and none can be
considered a complete answer.
3.2. Data fragmentation issues
The search features identified are very common and they can be found in digital libraries
outside the spatial field (e.g. Europeana⁶). It is important to note that the reviewed
systems manage the resources as independent entities. However, in the geospatial
context, the resources are related by the themes they cover and by spatial proximity (all
this is indicated in their metadata records). If these relations are not taken into account,
when a user query does not fit the artificial divisions (spatial and thematic) of the data
continuum performed by the data creators, the catalogs will return incomplete results.
Depending on the user query and the spatial and thematic data fragmentation, the answer
may suffer from under-coverage, over-coverage and partial coverage of the results with
respect to the query. Below, we characterize each of these issues.
The first issue is the under-coverage of the results with respect to the query. At the
spatial level, this happens when the results include resources that only slightly
intersect with the query bounding box. At the thematic level, it happens when the result
includes resources about a small subset of the query themes. This is a problem
because resources that only slightly fulfill the query may be considered very relevant
results. As an example of spatial under-coverage, Figure 2(A1) shows the bounding
box of a query focused on the 'Castilla la Mancha' region in Spain (continuous line) and
a resource focused on the 'Valencia' region that only slightly intersects with the query
(discontinuous line). Figure 2(A2) shows a thematic under-coverage example. It
contains a query about many subjects ('roads', 'rivers' and others) and a result (R) detailing
only 'roads'. In both cases, the amount of information provided with respect to the
requested query is small. Thus, although they fulfill the query, they have little
relevance.
The second issue is the over-coverage of the results with respect to the query. At the
spatial level, it happens when the results include resources that cover an area much bigger
than the requested one. At the thematic level, it occurs when a result contains information
about many more themes than the requested ones. Over-coverage is a problem in the sense
that the amount of irrelevant information in the results makes it difficult to identify the
desired content. Any other result more adjusted to the query is probably a better option for
the user. The spatial over-coverage example is shown in Figure 2(B1). It depicts the same
'Castilla la Mancha' query, but in this case, the result covers the desired region and the
rest of the Iberian Peninsula. Figure 2(B2) shows a thematic over-coverage example with a
query about 'roads' and a result (R) containing information about 'roads', 'rivers' and many
other themes. In both cases, the result contains not only relevant information but also a
disproportionate amount of irrelevant data.
The last issue happens when there are many partial results to a query. All of them are
partially relevant, but none of them completely fulfills the query specification. Existing
systems sort these results according to a spatial/theme similarity criterion (usually some
spatial/theme overlap variant). However, when results are presented in this way, it is
difficult to distinguish which areas/themes of the query are described by each resource
and how they complement each other. An example of the spatial aspect of the partial
coverage is shown in Figure 2(C1). It shows the same 'Castilla la Mancha' query with
multiple partially relevant results. Figure 2(C2) continues with the previous 'roads' and
'rivers' query but provides as answer a resource about 'roads' and a different one about
'rivers' to show the thematic partial coverage. In both cases, none of the results is a
complete answer to the query.
Figure 2. Spatial and thematic issues in the IR system of geospatial data catalogs. (a) under
coverage, (b) over coverage, (c) resource composition.
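The three situations can be told apart with two simple ratios, sketched below under stated assumptions: geometries are non-degenerate axis-aligned bounding boxes given as `(min_x, min_y, max_x, max_y)`, themes are plain sets of keywords, and the function names are hypothetical. A low `covered` value signals under-coverage, a high `excess` value signals over-coverage, and several results with intermediate `covered` values correspond to the partial coverage case.

```python
def box_area(box):
    """Area of an axis-aligned box (min_x, min_y, max_x, max_y)."""
    return (box[2] - box[0]) * (box[3] - box[1])


def intersection_area(a, b):
    """Area shared by two axis-aligned boxes (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def spatial_coverage(query_box, result_box):
    """Return (covered, excess) for a result with respect to a query.

    covered: share of the query area described by the result.
    excess:  share of the result area lying outside the query.
    """
    shared = intersection_area(query_box, result_box)
    covered = shared / box_area(query_box)
    excess = 1.0 - shared / box_area(result_box)
    return covered, excess


def thematic_coverage(query_themes, result_themes):
    """Same two ratios, computed on sets of themes instead of areas."""
    shared = len(query_themes & result_themes)
    covered = shared / len(query_themes)
    excess = 1.0 - shared / len(result_themes)
    return covered, excess
```

For example, a result that overlaps only a thin strip of the query box yields a small `covered` value, whereas a result spanning a whole peninsula for a regional query yields `covered` equal to 1 together with an `excess` close to 1.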
These spatial and thematic issues can happen in any combination in a 'concept at
location' style query, especially when resources only cover part of the spatial area and
part of the requested themes. In this case, as previously indicated, the results generated
by the reviewed systems are not able to completely fulfill the query restrictions. The next
section proposes an IR system able to deal with these issues by aggregating the
metadata records in the result list into collections of compatible records that, as a set,
are a better answer to the query. The construction of these aggregations helps to solve
the composition issue and mitigates the under- and over-coverage problems.
4. Generating thematically and spatially aggregated results
In a classic IR system, when performing the intersection between the metadata of a
resource and the query to determine if it is relevant, only a subset of the themes may
intersect, and only a part of the query area may be covered. This means that each
retrieved resource only provides a partial result. However, by combining it with others, a
more complete result could be obtained. For example, in a query about 'highways' in
'Spain', we can find a resource about highways in the south of Spain. This result is
incomplete but, if combined with another one covering the north, we can compose a good
result. The same happens with respect to the themes. For example, in a query about
'highways' and 'motorways' in Spain, a resource about Spanish motorways may be the
perfect complement to another one depicting the Spanish highways.
Figure 3 shows the main steps of the IR process created to generate aggregations of
metadata records as results of 'concept at location' queries. The query analysis step is a
simple decomposition process where the query is processed to separate the spatial
(bounding box) and thematic (keywords) requirements. The next step is to obtain the metadata
records describing resources that are partially relevant to the query. This is done using a
spatial index and an inverted textual index. Only those records that intersect spatially with
the query area and contain at least one of the query keywords are returned. Then, the
obtained metadata records are sorted according to their relevance degree. Finally, the IR
process aggregates the records in suitable groups. This section focuses on describing how
the ranking and aggregation process is performed.
Figure 3. Process for aggregation of query results.
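The first two steps of the process (query analysis and retrieval of partially relevant records) can be sketched as follows. This is a minimal illustration under assumptions that are not part of the paper: records and queries are plain dictionaries with hypothetical `bbox`, `themes` and `keywords` entries, the inverted index is an in-memory dictionary, and a linear scan over the thematic hits stands in for a real spatial index.

```python
from collections import defaultdict


def boxes_overlap(a, b):
    """True if two axis-aligned boxes (min_x, min_y, max_x, max_y) share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def build_inverted_index(records):
    """Map each keyword to the ids of the records that are tagged with it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for keyword in rec["themes"]:
            index[keyword].add(rec_id)
    return index


def retrieve_candidates(records, index, query):
    """Return ids of records matching at least one keyword and overlapping the query box."""
    # Query analysis: separate the spatial and the thematic requirements.
    query_box, keywords = query["bbox"], set(query["keywords"])
    # Thematic step: union of the posting lists of all query keywords.
    thematic_hits = set()
    for keyword in keywords:
        thematic_hits |= index.get(keyword, set())
    # Spatial step: a linear scan used here in place of a spatial index.
    return sorted(r for r in thematic_hits
                  if boxes_overlap(records[r]["bbox"], query_box))
```

Note that the keyword match is a union rather than an intersection: a record about only one of the requested themes is still a candidate, because it may later be complemented by other records.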
Equation (1) shows the similarity formula used for result ranking. It represents the
similarity with a value between 0 and 1, 0 being 'irrelevant' and 1 'perfect match'. In the
formula, we use the following symbols: $d_H(G_Q, G_R)$ represents the Hausdorff distance
between the query geometry ($G_Q$) and a metadata record geometry ($G_R$); $\mathit{MaxDH}$
is the biggest Hausdorff distance of all the partially relevant resources with respect to the
query; $\mathrm{size}(T_Q \cap T_R)$ indicates the number of themes in common between the
metadata record ($T_R$) and the query ($T_Q$); and $\mathrm{size}(T_Q)$ indicates the number
of themes in the query.

$$\mathrm{Similarity}(Q, R) = \frac{\mathit{MaxDH} - d_H(G_Q, G_R)}{\mathit{MaxDH}} \times \frac{\mathrm{size}(T_Q \cap T_R)}{\mathrm{size}(T_Q)} \qquad (1)$$

The Hausdorff distance is the greatest of all the distances between any point in a
geometry and the closest point in another geometry. Since the Hausdorff distance
from geometry A to B may be different from the Hausdorff distance from
geometry B to A, the maximum is used. The Hausdorff distance of overlapping
geometries of similar size is smaller than the equivalent one between overlapping
geometries with very different dimensions. Therefore, it is very appropriate for ordering
resources of different administrative levels (e.g. region vs. country size). Additionally, the
Hausdorff distance can be used with complex geometries. Thus, replacing the metadata
bounding boxes with approximate geometries would directly increase the quality of
results without having to modify the IR system. This is also valid for resources with
multiple disjoint geometries (e.g. Iberian Peninsula and Canary Islands) that can be
represented as a single multi-polygon geometry for distance measurement purposes. A
problem of the Hausdorff distance is that it can give small distances to nonoverlapping
resources (if they are similar in size and they are spatially close). However, this issue is
not relevant for our system because the ranking is only applied to resources that
overlap.
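A minimal sketch of the ranking value follows. It assumes that geometries are filled axis-aligned bounding boxes given as `(min_x, min_y, max_x, max_y)`, for which the directed Hausdorff distance is reached at one of the four corners because both shapes are convex. It also assumes that Equation (1) multiplies the normalized spatial term by the fraction of shared themes. The function names and the handling of `max_dh == 0` are illustrative choices, not taken from the paper.

```python
import math


def point_to_box(point, box):
    """Euclidean distance from a point to a filled box (0 if the point is inside)."""
    dx = max(box[0] - point[0], 0.0, point[0] - box[2])
    dy = max(box[1] - point[1], 0.0, point[1] - box[3])
    return math.hypot(dx, dy)


def corners(box):
    min_x, min_y, max_x, max_y = box
    return [(min_x, min_y), (min_x, max_y), (max_x, min_y), (max_x, max_y)]


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    a_to_b = max(point_to_box(p, b) for p in corners(a))
    b_to_a = max(point_to_box(p, a) for p in corners(b))
    return max(a_to_b, b_to_a)


def similarity(query_box, query_themes, rec_box, rec_themes, max_dh):
    """Ranking value in [0, 1]; max_dh is the largest distance among the candidates."""
    # Assumption: if every candidate is at distance 0, the spatial term is 1.
    spatial = 1.0 if max_dh == 0 else (max_dh - hausdorff(query_box, rec_box)) / max_dh
    thematic = len(query_themes & rec_themes) / len(query_themes)
    return spatial * thematic
```

In this sketch, a regional query compared with a region-sized record of the same extent gives a distance of 0, while the same query compared with a country-sized record that contains it gives a large distance, which is the ordering behavior described above for resources of different administrative levels.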
Once the ranking of the results has been performed, the last step aggregates the
results that are spatially and thematically compatible. We consider that two metadata
records (and therefore the resources they describe) are spatially compatible if the
combined area for all their themes is significantly closer to the query area than each
record individually. Regarding thematic compatibility, we only consider compatible
those records that are thematically disjoint (they do not have any common query themes),
and those that share one query theme and also half of the rest of the keywords. This avoids
aggregating records of resources that describe a theme in very different ways. For
example, if we are making a query about 'river basins', we may find a resource that
focuses on the 'water flow' and another one describing the 'geology' of the basin. In this
case, they are too different to be in the same aggregation. When the records do not
share a query theme, they can always be aggregated because they fulfill disjoint
restrictions in the query.
Algorithm 1 describes the method used to aggregate the list of ranked results
obtained with the previous steps. For each metadata record obtained as a result, the
process searches for other results that complement it. The aggregation process is
performed only if there is a relevant part of the query area that is not covered in all the
themes. The coverageFactor indicates the percentage of area in the query (summing up
all themes) that can be left uncovered. The smaller it is, the more complete the results
are. However, it is better not to cover the query bounding box completely, in order to
deal with imprecisions in the definition of the query or the resources. For example, it is
counterproductive to complement a resource with 99% query coverage just to
cover a small gap at a border. The value selected in the experiment section is a
compromise between a complete coverage of the query and the management of
deficiencies in the query formulation and the resources.
Algorithm 1. Spatio-thematic aggregation of results.
function AGGREGATIONSTEP(results, query)
  aggregationList ← ∅
  for result ∈ results do
    aggregation ← result
    reducedQuery ← query − aggregation
    resultExtended ← true
    while area(reducedQuery) > coverageFactor × area(query) & resultExtended do
      resultExtended ← false
      possibleAggregated ← getBestResult(results, reducedQuery, aggregation)
      if possibleAggregated ≠ ∅ then
        reducedQuery ← reducedQuery − possibleAggregated
        aggregation ← aggregation ∪ possibleAggregated
        resultExtended ← true
      end if
    end while
    aggregationList ← aggregationList ∪ aggregation
  end for
  return duplicateRemoval(aggregationList)
end function
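The loop of Algorithm 1 can be sketched in Python under a simplifying model that is an assumption of this example, not of the paper: the query is a set of `(theme, cell)` pairs over a grid, so that its size plays the role of the area summed over all themes, and each record is a dictionary with hypothetical `id`, `themes` and `cells` entries. The function `greedy_best` is only a stand-in for getBestResult (it applies no spatial or thematic filter), and the default `coverage_factor` is an arbitrary illustrative value.

```python
def cover(record, pairs):
    """(theme, cell) pairs from `pairs` that the record describes."""
    return {(t, c) for (t, c) in pairs
            if t in record["themes"] and c in record["cells"]}


def greedy_best(results, reduced_query, aggregation):
    """Stand-in for getBestResult: the record covering most of what is missing."""
    candidates = [r for r in results if r not in aggregation]
    best = max(candidates, key=lambda r: len(cover(r, reduced_query)), default=None)
    if best is None or not cover(best, reduced_query):
        return None
    return best


def remove_duplicates(aggregations):
    """Drop aggregations made of the same records in a different order."""
    seen, unique = set(), []
    for aggregation in aggregations:
        key = frozenset(r["id"] for r in aggregation)
        if key not in seen:
            seen.add(key)
            unique.append(aggregation)
    return unique


def aggregation_step(results, query, get_best_result=greedy_best, coverage_factor=0.1):
    """Build one aggregation per ranked result, extending it while gaps remain."""
    aggregations = []
    for result in results:
        aggregation = [result]
        reduced_query = query - cover(result, query)
        extended = True
        # Stop when the uncovered share is small enough or nothing can be added.
        while len(reduced_query) > coverage_factor * len(query) and extended:
            extended = False
            candidate = get_best_result(results, reduced_query, aggregation)
            if candidate is not None:
                reduced_query = reduced_query - cover(candidate, reduced_query)
                aggregation.append(candidate)
                extended = True
        aggregations.append(aggregation)
    return remove_duplicates(aggregations)
```

With two records that each cover half of the query area for the same theme, both starting points lead to the same pair of records, and the duplicate removal step returns a single aggregation.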
The search for results that complement a given one is done by the function
getBestResult depicted in Algorithm 2. This process is repeated until no more suitable
resources for the aggregation are found. The identification of suitable complementary
results is done by removing those that are spatially and thematically incompatible with
the current aggregation (thematicFilter and spatialFilter functions). Then, the rest are
sorted according to the dimension of the uncovered part of the query, and the closest
one is selected.
Algorithm 2 Function to obtain a new element for an aggregation.
function GETBESTRESULT(results, reducedQuery, aggregation)
    filteredResults ← spatialFilter(results, reducedQuery, infoFactor)
    filteredResults ← thematicFilter(filteredResults, reducedQuery, aggregation)
    sortedResult ← rankResults(filteredResults, reducedQuery)
    if sortedResult ≠ ∅ then
        return sortedResult[0]
    else
        return ∅
    end if
end function
function SPATIALFILTER(results, reducedQuery, infoFactor)
    filteredResult ← ∅
    for result ∈ results do
        if area(intersection(result, reducedQuery)) > area(reducedQuery) · infoFactor then
            filteredResult ← filteredResult ∪ result
        end if
    end for
    return filteredResult
end function
function THEMATICFILTER(results, reducedQuery, aggregation)
    filteredResult ← ∅
    for result ∈ results do
        themesInQuery = themes(result) ∩ themes(reducedQuery)
        themesInAggr = themesInQuery ∩ themes(aggregation)
        commonThemes = themes(result) ∩ themes(aggregation)
        if result ∉ aggregation ∧ ((themesInQuery ≠ ∅ ∧ themesInAggr == ∅) ∨
                (themesInAggr ≠ ∅ ∧ size(commonThemes) ≥ size(themes(aggregation))/2)) then
            filteredResult ← filteredResult ∪ result
        end if
    end for
    return filteredResult
end function
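The two filters of Algorithm 2 can be illustrated with a small Python sketch. It is not the authors' implementation: extents are again approximated by sets of grid cells, the `themes` field is assumed to hold all the keywords of a record, and the record names, footprints and keyword sets are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Cell = Tuple[int, int]  # grid cell used as a stand-in for real geometry (assumption)


@dataclass(frozen=True)
class Record:
    name: str
    cells: FrozenSet[Cell]
    themes: FrozenSet[str]   # all keywords of the metadata record (assumption)


def spatial_filter(results: List[Record], reduced_cells: FrozenSet[Cell],
                   info_factor: float = 0.1) -> List[Record]:
    """Keep records covering more than info_factor of the uncovered area."""
    threshold = len(reduced_cells) * info_factor
    return [r for r in results if len(r.cells & reduced_cells) > threshold]


def thematic_filter(results: List[Record], query_themes: FrozenSet[str],
                    aggregation: List[Record]) -> List[Record]:
    """Keep records that add a new query theme or resemble the aggregation."""
    aggr_themes = frozenset().union(*(m.themes for m in aggregation))
    kept = []
    for r in results:
        in_query = r.themes & query_themes
        in_aggr = in_query & aggr_themes
        common = r.themes & aggr_themes
        adds_new_theme = bool(in_query) and not in_aggr
        shares_half = bool(in_aggr) and len(common) >= len(aggr_themes) / 2
        if r not in aggregation and (adds_new_theme or shares_half):
            kept.append(r)
    return kept


def row(x0: int, x1: int) -> FrozenSet[Cell]:
    """Horizontal strip of cells from x0 (inclusive) to x1 (exclusive)."""
    return frozenset((x, 0) for x in range(x0, x1))


# Hypothetical records; A is already in the aggregation.
a = Record("A", row(10, 20), frozenset({"road network", "transport", "galicia"}))
b = Record("B", row(0, 5), frozenset({"soil use", "land cover"}))
c = Record("C", row(5, 10), frozenset({"road network", "transport", "cyl"}))
d = Record("D", row(0, 10), frozenset({"road network", "aerial imagery", "thermal"}))
e = Record("E", row(0, 1), frozenset({"soil use"}))
records = [a, b, c, d, e]
reduced_cells = row(0, 10)                       # part of the query still uncovered
query_themes = frozenset({"road network", "soil use"})
aggregation = [a]

spatial = spatial_filter(records, reduced_cells)
both = thematic_filter(spatial, query_themes, aggregation)
print([r.name for r in spatial], [r.name for r in both])
```

In this toy data, E is discarded spatially because it contributes a single cell, and D is discarded thematically because it shares only one of the three keywords of the aggregation.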
In the process of identifying the best record to add to an aggregation (Algorithm 2),
the spatial and thematic filters prevent two problems: adding resources that improve
the results only by a negligible amount, and creating thematically heterogeneous
aggregations. The spatial filter removes the metadata records of resources that do not
cover a relevant part of the query area left uncovered by the aggregation. This behavior is
adjusted by the infoFactor parameter, which represents the amount of new information a
resource has to provide to be included in the aggregation. A low value generates more
complete aggregations, but some of their components may provide very little new
information. A high value creates aggregations with more relevant elements, but it
may leave important parts of the query uncovered. The thematic filter removes the
results already in the aggregation and those that share a theme with the query and the
aggregation but do not have at least half of the keywords in common. This is done
because results with fewer keywords in common are likely to be too different from each
other to be integrated, even if they share a queried theme. After applying both filters,
the remaining results are ranked according to the similarity formula shown in Equation
(2), and the most similar one is selected as a new element in the aggregation.
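$$\mathrm{Similarity}(U_q, R) = \frac{\sum_{T \in U_q} \left( \mathrm{MaxD}_{H_T} - d_H\left(G_{T_{U_q}}, G_{T_R}\right) \right) / \mathrm{MaxD}_{H_T}}{\mathrm{Size}\left(T_{U_q}\right)} \tag{2}$$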
Equation (2) is a generalization of Equation (1). Since we try to find the metadata record of
the resource that is the most similar to the area of the themes that is not covered by the
current members of an aggregation, the geometry of each theme in the query is different.
For example, in a query about ‘highways’ and ‘motorways’ in Spain, we may have
constructed an aggregation with a resource with the highways in the south of Spain, and
another one covering the motorways in the east. In this context, the extension that still
needs to be covered with additional resources is different for the theme ‘motorways’ and the
theme ‘highways’. In the equation, we calculate the similarity of a metadata record ($R$) that
is a candidate for the aggregation with respect to the area of the query themes not covered
by the aggregation ($U_q$). It is calculated as the sum, over each theme of the query that has
some spatial extension uncovered, of the spatial similarity between that uncovered extension
and the metadata record extension for the theme, divided by the number of query themes
that have a spatial part uncovered ($\mathrm{Size}(T_{U_q})$). The spatial similarity for each
theme is obtained in a way analogous to Equation (1). In Equation (2), the following symbols
are used: $\mathrm{MaxD}_{H_T}$ represents the maximum Hausdorff distance between the theme
extension of all the candidates and the uncovered extension of the query for this theme, and
$d_H(G_{T_{U_q}}, G_{T_R})$ is the Hausdorff distance of the theme of the metadata record that
is being analyzed ($G_{T_R}$) with respect to the uncovered part of the query for the theme
($G_{T_{U_q}}$).
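A worked sketch of Equation (2) in Python may help to fix ideas. It makes assumptions beyond the text: geometries are reduced to the corner points of their bounding boxes, the Hausdorff distance is the discrete one between those point sets, a candidate contributes zero for a theme it does not provide, and a theme whose maximum distance is zero counts as a perfect match. The boxes and the theme name are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Geometry = Sequence[Point]            # corner points standing in for a real extent
ThemeGeometries = Dict[str, Geometry]


def hausdorff(a: Geometry, b: Geometry) -> float:
    """Discrete Hausdorff distance between two finite point sets."""
    def directed(p: Geometry, q: Geometry) -> float:
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


def similarity(uncovered: ThemeGeometries, candidate: ThemeGeometries,
               all_candidates: List[ThemeGeometries]) -> float:
    """Mean, over uncovered query themes, of (MaxD - d_H) / MaxD."""
    if not uncovered:
        return 0.0
    total = 0.0
    for theme, query_geom in uncovered.items():
        if theme not in candidate:
            continue  # assumption: a missing theme adds nothing
        max_d = max(hausdorff(query_geom, c[theme])
                    for c in all_candidates if theme in c)
        d = hausdorff(query_geom, candidate[theme])
        total += 1.0 if max_d == 0 else (max_d - d) / max_d
    return total / len(uncovered)


def box(x0: float, y0: float, x1: float, y1: float) -> List[Point]:
    """Corner points of an axis-aligned bounding box."""
    return [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]


# Hypothetical uncovered extent for one theme and three candidate records.
uncovered = {"highways": box(0, 0, 2, 2)}
r1 = {"highways": box(0, 0, 2, 2)}   # same extent as the uncovered part
r2 = {"highways": box(0, 0, 4, 4)}   # much larger extent
r3 = {"highways": box(0, 0, 3, 3)}   # somewhat larger extent
candidates = [r1, r2, r3]
for r in candidates:
    print(round(similarity(uncovered, r, candidates), 3))
```

The candidate whose extent coincides with the uncovered part obtains the highest similarity and the farthest one obtains zero, because the distances are normalised by the maximum over the candidate set.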
This process may generate redundant aggregations with the same elements in different
order (e.g. it can aggregate the 1st result with the 10th one and then aggregate the 10th
result with the 1st one) and aggregations that are a superset of another one (in this case,
some elements in the superset are not relevant). The last step removes these redundancies.
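One possible reading of this last step is sketched below in Python. It assumes that aggregations are lists of record identifiers, that two aggregations with the same members are duplicates regardless of order, and that an aggregation is dropped when another one is a strict subset of it. The identifiers are hypothetical and the paper does not give the details of its own procedure.

```python
from typing import Dict, FrozenSet, List


def duplicate_removal(aggregations: List[List[str]]) -> List[List[str]]:
    """Drop reordered duplicates, then aggregations that strictly contain another."""
    unique: Dict[FrozenSet[str], List[str]] = {}
    for aggregation in aggregations:
        # The first occurrence of each member set is kept, in its original order.
        unique.setdefault(frozenset(aggregation), list(aggregation))
    keys = list(unique)
    return [unique[k] for k in keys if not any(other < k for other in keys)]


# Hypothetical identifiers: r1+r10 appears twice, and r2+r5 contains r5.
example = [["r1", "r10"], ["r10", "r1"], ["r5"], ["r2", "r5"]]
print(duplicate_removal(example))
```

Using frozensets as dictionary keys makes the order of the members irrelevant for the comparison while the dictionary keeps the first-seen ordering of the surviving aggregations.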
5. Experiments
This section compares the performance of the proposed IR process (Aggregated IR
System) with a basic IR system (Basic IR System) similar to the ones used in the geospatial
data catalogs described in Section 3.1.
Our Aggregated IR System has been tuned to try to create aggregations with at least 90%
query coverage (coverageFactor = 0.1) and to add elements to the aggregation even if
they only provide a small improvement of the result (infoFactor = 0.1). The Basic IR System
used for comparison applies a similarity measure that behaves similarly to the geospatial data
catalogs described in Section 2 (see Equation (3)). This measure combines the spatial
intersection of the query and the metadata record of the resource with the theme
intersection. The spatial similarity is obtained as the area of the intersection
between the query ($A_Q$) and the record ($A_R$), divided by the maximum area of intersection
between all the resources and the query. For the thematic similarity, we have directly used the
Jaccard coefficient, which is calculated as the number of themes in common between the
query ($T_Q$) and the record ($T_R$) divided by the total number of themes. Finally, these
similarity values are weighted with α and β factors to be able to adjust the weight of the
spatial aspect (α) of the query with respect to the thematic ones (β). To avoid giving any
advantage to our proposal, the experiments with the Basic IR System have been performed
multiple times with different α and β values, and the best results obtained have been the
ones used in the comparison.
$$\mathrm{Similarity}(Q, R) = \alpha \left( \frac{\cap(A_Q, A_R)}{\mathit{maxAreaIntersection}} \right) + \beta \left( \mathrm{Jaccard}(T_Q, T_R) \right) \tag{3}$$
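The baseline measure of Equation (3) can be sketched in Python as follows. Bounding boxes are assumed to be `(min_x, min_y, max_x, max_y)` tuples, and the weights `alpha = beta = 0.5` as well as the three records are hypothetical, since the text does not report the values finally used.

```python
from typing import FrozenSet, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y), an assumption


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def basic_similarity(query_box: Box, query_themes: FrozenSet[str],
                     record_box: Box, record_themes: FrozenSet[str],
                     max_area_intersection: float,
                     alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted sum of normalised intersection area and Jaccard coefficient."""
    spatial = 0.0
    if max_area_intersection > 0:
        spatial = intersection_area(query_box, record_box) / max_area_intersection
    return alpha * spatial + beta * jaccard(query_themes, record_themes)


# Hypothetical query and records.
q_box: Box = (0, 0, 2, 2)
q_themes = frozenset({"road network", "soil use"})
records = {
    "regional": ((0, 0, 2, 2), frozenset({"road network"})),
    "national": ((-10, -10, 10, 10), frozenset({"soil use"})),
    "partial": ((1, 0, 3, 2), frozenset({"road network"})),
}
max_area = max(intersection_area(q_box, box) for box, _ in records.values())
scores = {name: basic_similarity(q_box, q_themes, box, themes, max_area)
          for name, (box, themes) in records.items()}
print(scores)
```

In this toy case the record that fits the query exactly and the one that covers a far larger area receive the same score, because the intersection area alone cannot tell an exact match from over-coverage.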
5.1. Evaluation methodology
For this experiment, we have used the metadata records provided through the
Geoportal of the Spanish National Spatial Data Infrastructure (IDEE; see Note 7) in 2015.
This collection contains 97,867 records describing geographical resources created by
different Spanish governmental institutions. It includes themes as different as topology,
environment, mineral resources, industry and infrastructures, among others.
The performance of an IR system to solve the issues described in Section 3 cannot be
simply described in terms of classical precision/recall measures. These measures are
often based on a binary classification of results as relevant or nonrelevant as a whole
(Baeza-Yates and Ribeiro-Neto 2011). In our system, only the metadata records of
resources that contain part of the selected area and some of the query themes are
returned. Therefore, all the results contain at least a bit of relevant information. The
problem here is to measure the degree of relevance for result ordering.
The proposed system is focused on improving the results of ‘concept at location’
queries. The objective for this type of query is to return first the results that have an
exact match with the query restriction, second those that cover the selected area but
include much additional information (over-coverage), third those that have only a partial
coverage, and finally results that only slightly fulfill the query restrictions
(under-coverage).
To evaluate the ranking of the two systems, we have used the discounted cumulative
gain (DCG) measure shown in Equation (4) (Baeza-Yates and Ribeiro-Neto 2011). This
measure calculates the gain of adding each document to the result set based on its
position in the result list. To obtain this measure, the gain that each result adds to the
result list ($G_i$) must be defined. For this task, we have used the gain criteria
described in Table 2. In these criteria, higher values indicate that the result fits the
spatial or thematic restrictions of the query more closely, and lower values indicate
less similarity. The spatial and thematic content of each result of the analyzed
queries has been classified according to these criteria. The final gain of each result is
calculated as the mean of the spatial and the thematic gains.
Table 2. Criteria values used to determine the quality of a result.
Gain value | Meaning          | Description
3          | Exact match      | The spatial or thematic features of the result are approximately equal to the query
2          | Over-coverage    | The spatial or thematic features of the result approximately cover the query but they are much more extensive
1          | Partial coverage | The spatial or thematic features of the result just cover a part of the query
           | Under-coverage   | The spatial or thematic features of the result just slightly cover a part of the query
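$$\mathrm{DCG}[i] = \begin{cases} G_1 & \text{if } i = 1 \\ G_i / \log_2(i) + \mathrm{DCG}[i-1] & \text{if } i \neq 1 \end{cases} \tag{4}$$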
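The recurrence of Equation (4) is short enough to check by hand. The following Python sketch computes it for the gains of the first three results of Q1 as listed in Table 5 (2.5, 2.5, 2.5 for the basic system and 3.0, 2.5, 2.5 for the aggregated one); the function name `dcg` is only illustrative.

```python
import math
from typing import List, Sequence


def dcg(gains: Sequence[float]) -> List[float]:
    """Discounted cumulative gain at every position of a ranked result list."""
    values: List[float] = []
    for i, gain in enumerate(gains, start=1):
        if i == 1:
            values.append(gain)                      # DCG[1] = G_1
        else:
            values.append(values[-1] + gain / math.log2(i))
    return values


basic_q1 = dcg([2.5, 2.5, 2.5])
aggregated_q1 = dcg([3.0, 2.5, 2.5])
print([round(v, 3) for v in basic_q1])
print([round(v, 3) for v in aggregated_q1])
```

Because log2(2) = 1, the second result is not discounted at all. The half-point advantage gained at the first position is therefore carried unchanged through the rest of the list.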
5.2. Description of the experiments and results obtained
The main advantage of our system with respect to a basic one is that it is able to identify
subsets of low-gain results and transform them into higher-gain aggregations. For
example, it can combine several results with partial coverage of the query to obtain
an exact match. To evaluate how the system performs, we have selected four themes
commonly used in fields such as hydrology, ecology, infrastructure planning, industry or
agriculture [some recent examples of the interest in these areas can be found in Graser
et al. (2015) and Pereira et al. (2015)] that have a high presence in the collection, and four
spatial areas that contain information about these themes. The themes are ‘elevation
(model)’, ‘road network’, ‘soil use’ and ‘hydrography’. Using these themes and areas, we
have generated all the possible queries that include one or two of the themes and one
spatial area. This makes a total of 40 different queries (combinations without repetition
of 2 themes selected from the 4 original ones plus the ‘empty’ theme, and the 4 different
areas, i.e. $\binom{5}{2} \cdot 4$). For these queries, we have obtained the DCG for the first 10
positions of their result lists, and we have calculated the mean DCG at each position.
This mean takes into account that some queries return fewer than 10 results. Figure 4
compares the mean DCG of the two systems at each position of their result list (number
of query result). It can be observed that the aggregated system always has a higher
mean gain. This means that the resources obtained and the way they are positioned in
the result list are better in the aggregated system than in the basic one.
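The count of 40 queries can be reproduced with a few lines of Python. The area labels below are hypothetical placeholders, and the ‘empty’ theme is modelled as `None`, so that a pair containing it yields a one-theme query.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple

THEMES: List[str] = ["elevation", "road network", "soil use", "hydrography"]
AREAS: List[str] = ["area-1", "area-2", "area-3", "area-4"]  # hypothetical labels
EMPTY: Optional[str] = None                                  # the 'empty' theme

Query = Tuple[Tuple[str, ...], str]


def build_queries(themes: Sequence[str] = THEMES,
                  areas: Sequence[str] = AREAS) -> List[Query]:
    """All queries with one or two themes and one spatial area."""
    queries: List[Query] = []
    for pair in combinations(list(themes) + [EMPTY], 2):
        query_themes = tuple(t for t in pair if t is not EMPTY)
        for area in areas:
            queries.append((query_themes, area))
    return queries


queries = build_queries()
one_theme = [q for q in queries if len(q[0]) == 1]
two_themes = [q for q in queries if len(q[0]) == 2]
print(len(queries), len(one_theme), len(two_themes))
```

Of the ten theme combinations per area, four contain the empty theme and become single-theme queries, and the remaining six are two-theme queries.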
Figure 4. Mean DCG comparison.
To explicitly show how the system behaves with respect to the undesired effects
described in Section 3, we analyze in detail the results of a small set of the selected
queries. These queries show how our system behaves in two main scenarios: when there
are results that perfectly match the spatial and/or thematic query restriction, and when
there are no close matches. Table 3 summarizes the selected queries (Q1–4). The table
shows the query bounding box (min and max longitude, latitude), a toponymical
reference (Location) of the Spanish region containing the bounding box (for illustrative
purposes) and the themes in the query (Themes). In the case of the first query (Q1), there
are results that perfectly match the query restrictions. The detailed analysis of the result
ordering illustrates the difference in behavior between our system and the basic
solution. The rest of the queries have result sets that do not contain any result that
perfectly matches the query. Specifically, Q2 focuses on the spatial features. It requests
information about a theme (‘elevation’) in a region (a part of Andalucía) where there are
no resources that fit well with the selected area for the selected theme. However, there
are resources that have over-coverage and others with partial coverage. Q3 focuses on
the thematic features. It includes multiple themes (‘road networks’ and ‘soil use’) and it
selects a region (Galicia) that contains resources that match the spatial aspects of the
query but only partially match the thematic aspects. Q4 describes the more general case
where none of the resources match well the query area (a part of Galicia and Castilla y
León) and the selected themes (‘road networks’ and ‘soil use’ again).
Table 4 shows a summary of the performance of each system, measured as the spatial
and thematic similarity of the results with respect to the query specification. It includes
the number of results obtained from each query (Num Res), the mean spatial coverage of
the results with respect to the spatial restriction in the query (Mean SpCov), the mean
thematic overlap between the results and the thematic restrictions in the query (Mean
ThCov) and the size of the biggest aggregation obtained (Max AggSize). The spatial and
thematic mean coverage show the degree of fulfillment of the user needs, while the
size of the biggest aggregation indicates the number of individual results that are
needed to fulfill the query in the worst case. Table 5 details the first three results of
each query. It includes the title of the result in each position (result order), the
percentage of spatial (SpCov) and thematic (ThCov) coverage of each result with respect to the
query and the gain value (Gain) obtained according to the criteria indicated in Table 2.
The results that aggregate several resources to compose a better result are marked with
(A). Table 5 shows the difference between the results of the basic
system, most of which have over-coverage or partial coverage, and the ones
obtained in the aggregated system, which generates aggregations closer to the query
constraints.
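Table 4. Comparison of system results.
Query | Basic IR: Num Res | Basic IR: Mean SpCov (%) | Basic IR: Mean ThCov (%) | Aggregated IR: Num Res | Aggregated IR: Mean SpCov (%) | Aggregated IR: Mean ThCov (%) | Aggregated IR: Max AggSize
Q1    | 12 | 51.37 | 100 | 6  | 99.98 | 100 | 1
Q2    | 9  | 55.55 | 100 | 6  | 100   | 100 | 4
Q3    | 25 | 100   | 50  | 24 | 100   | 100 | 2
Q4    | 31 | 81.15 | 50  | 29 | 97.5  | 100 | 4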
Table 3. Queries selected for the evaluation.
Number | Bounding box (min; max)    | Location                 | Themes
Q1     | −6.02, 37.37; −5.93, 37.41 | Andalucía                | Elevation
Q2     | −6, 37.35; −5.9, 37.41     | Andalucía                | Elevation
Q3     | −8.38, 42.25; −7.50, 43.08 | Galicia                  | Road network, soil use
Q4     | −7.8, 42.21; −5.9, 43.29   | Galicia, Castilla y León | Road network, soil use
Table 5. Detailed comparison of system results.
Query | Order | Basic IR: Result title | SpCov (%) | ThCov (%) | Gain | Aggregated IR: Result title | SpCov (%) | ThCov (%) | Gain
Q1 | 1 | Orography of Andalucía | 100* | 100 | 2.5 | Andalucía EDM 98433 | 99 | 100 | 3.0
Q1 | 2 | Contour lines | 100* | 100 | 2.5 | Orography of Andalucía | 100* | 100 | 2.5
Q1 | 3 | Digital Terrain Model | 100* | 100 | 2.5 | Contour lines | 100* | 100 | 2.5
Q2 | 1 | Orography of Andalucía | 100* | 100 | 2.5 | (A) Andalucía EDM 98433/34/43/44 | 100 | 100 | 3.0
Q2 | 2 | Contour lines | 100* | 100 | 2.5 | Orography of Andalucía | 100* | 100 | 2.5
Q2 | 3 | Digital Terrain Model | 100* | 100 | 2.5 | Contour lines | 100* | 100 | 2.5
Q3 | 1 | Topographic Base of Galicia | 100 | 50 | 2.0 | (A) Topographic Base of Galicia/Map of Coverages and Soil Uses | 100 | 100 | 3.0
Q3 | 2 | Map of Coverages and Soil Uses | 100 | 50 | 2.0 | (A) Map of Coverages and Soil Uses/Basic Cartography of Galicia | 100 | 100 | 3.0
Q3 | 3 | Basic Cartography of Galicia | 100 | 50 | 2.0 | (A) Topographic Base of Galicia/Soil Uses, Polygons | 100 | 100 | 3.0
Q4 | 1 | CORINE Land Cover 1990 | 100* | 50 | 1.5 | (A) Topographic Base of Galicia/Transport Network CyL/Map of Coverages and Soil Uses/Land Cover CyL | 97 | 100 | 3.0
Q4 | 2 | CORINE Land Cover 2000 | 100* | 50 | 1.5 | (A) Basic Cartography of Galicia/Transport Network CyL/Map of Coverages and Soil Uses/Land Cover CyL | 97 | 100 | 3.0
Q4 | 3 | CORINE Land Cover changes 1990–2000 | 100* | 50 | 1.5 | (A) Soil Uses. Polygons/Topographic Base of Galicia/Transport Network CyL/Land Cover CyL | 96 | 100 | 3.0
Spatial coverage percentages marked with a star (*) indicate a big spatial over-coverage (they are resources at country level for queries about a small region).
Q1 is representative of the situation in which a query has a perfect match in the
collection. It has been selected because there is a resource in the collection that matches
99% of the query bounding box (tile 98433 of the Andalucía Elevation Digital Model).
Additionally, there are five resources relevant for the query theme that cover all of
Andalucía or Spain (they have spatial over-coverage). Finally, there are six other
thematically relevant resources with spatial under-coverage. In the basic system, the most
relevant resource is provided as the sixth result, and the previous places are occupied
by the resources with spatial over-coverage that completely cover the query area. The
results with under-coverage are placed last. In the proposed system, no aggregation is
generated for this query, but the ordering is improved since the best result is placed first
and those with over-coverage are sorted according to their spatial similarity with the
query. Additionally, the resources with spatial under-coverage are not returned because
the aggregation process identifies that they are only reliable if complemented with
another one that is reliable by itself.
Q2 analyzes the behavior of the systems when there are resources completely
covering the thematic restrictions but not the spatial ones. For Q2, the collection contains
nine relevant resources, five with spatial over-coverage (the same as in Q1) and four with
partial spatial coverage. The latter are four tiles of the Andalucía Elevation Digital Model
(‘98433’, ‘98434’, ‘98443’ and ‘98444’). In the basic system, the first five results are those
with spatial over-coverage, and the last four are those that have partial coverage. In
the proposed system, the four resources with partial spatial coverage are aggregated
into a single result that perfectly matches the query. This aggregation is provided first in
the result list. The rest of the results are sorted according to their spatial distance with
respect to the query.
Q3 analyzes a scenario with resources that cover the spatial aspects of the query but
only partially the thematic ones. For this query, there are 25 resources focused on Galicia
and Spain about ‘road networks’ and ‘soil use’, but none containing both. In the basic IR
system, the results obtained are not distinguishable since they all completely cover the
query area and contain one query theme. In the proposed system, 24 compatible
aggregations that fulfill the user needs are found. These aggregations add to each result
(i.e. focused on one theme) the spatially closest result of the other theme. For example,
the first result is the aggregation consisting of the first two results of the basic system,
i.e. ‘Topographic Base of Galicia’ (providing road networks) and ‘Map of Coverages and
Soil Uses’ (providing soil use).
Finally, Q4 focuses on the most general case: a query that has no clear candidate
for the thematic and spatial query restrictions. It uses the same themes as the third
query, but it covers an area that includes part of Galicia and its neighboring region of
Castilla y León. In the collection, there are 31 relevant resources, and all of them have
partial coverage of the spatial or the thematic query restrictions. In the basic system,
as in the third query, the results only cover a single theme and they are sorted
according to the degree of intersection with the query bounding box. This places the
results with spatial over-coverage higher in the result list. For example, the first five
results are different versions of the CORINE land cover project about ‘Soil Uses’ in
Spain. Result 17 is the first one about ‘Soil Uses’ focused on a region close to the
query bounding box (Land cover of Castilla y León), and result 19 is the first
about ‘road networks’ (Topographic Base of Galicia). The proposed system
aggregates resources focused on Galicia and Castilla y León to form results that
almost perfectly fit the user needs. Figure 5 compares graphically the first result
obtained in both systems (queries are shown in gray, results in black). The basic
system provides an unfocused result about a single query theme (‘Soil Use’) and the
query area is only a small fragment of the area covered by the resource. The
proposed system, in contrast, returns an aggregation containing four results that provide
an almost complete answer to the user query.
As a final comparison between the two systems, Figure 6 shows the DCG (discounted
cumulative gain) of the four selected queries at each position of their result lists
(number of query results) for the basic and the aggregated IR systems. It can be
observed how the proposed system behaves better for all the query types, being
especially advantageous in the most general case (Q4). In this case, none of the
collection resources perfectly match the query but there are several partial matches.
Therefore, the aggregation process can show its maximum potential.
Figure 5. Graphical comparison of the first result of Q4 in both systems.
6. Discussion
The management of the spatial information as a continuous set is needed for many
tasks such as analyzing the morphology of a river or identifying routes in a road
network. However, when dealing with geospatial data catalogs, usually continuous
information can be found divided into individual resources that do not provide the
implicit spatial/thematic connection between them. This problem is not architectural; it
is related to how data producers manage information. The technologies used for IR are,
in most cases, general purpose solutions not adapted to the nature of spatial data.
The IR system proposed in this paper identifies the spatial and thematic relations in a
collection of metadata records of geospatial resources to produce results closer to the
query restrictions. To identify the spatial closeness of the records with respect to a query,
it uses criteria similar to those indicated in Lanfear (2006), but using the Hausdorff
distance as the spatial similarity measure. The thematic similarity is the ratio between
the common themes in the record and the query. Finally, our system integrates the
spatial and thematic similarity with a ranking formula like the one detailed in Martins
et al. (2005). What makes our system very different from them is the addition of a
processing layer that identifies the spatial and thematic relations between the results to
generate collections of metadata records as query results. The system uses these
relations to combine the results into coherent aggregations that are closer to the user query
constraints than the individual resources.
In this aspect, the paper has similarities with Lieberman (2006) or Lutz et al. (2009) in
the sense that we use a layer on top of a basic catalog to provide improved results. The
difference is that they use ontologies to generate individual results and we focus on the
use of raw metadata to produce aggregations of results. They deal with resources as
individuals while our system goes a step beyond that by considering that a result can be
a composition of several metadata records.
Figure 6. Detailed DCG comparison of the analyzed queries.
The aggregation of the catalog metadata records can be seen as a data integration
task that helps to improve the quality of a catalog IR process. From this perspective, our
work is related to proposals such as Hübner et al. (2004) and Lutz and Klien (2006) but
working at metadata level instead of at data level.
Our aggregation proposal shows that taking into account the collection context
in geospatial data catalogs is a way to provide more complete results. However, we
have observed a limitation caused by the difficulty of generating consistent
aggregations. In general, the themes related to geographical information form a quite
homogeneous terminology set in which the aggregations are really meaningful
(landforms, infrastructures, cadaster). However, even if two resources share the same
theme, they may not be completely compatible if the information they provide is too
different. For example, a resource containing the geometry of parcels and their
type of crop is not very compatible with an aerial thermal image used for crop
analysis. The problem has been mitigated using all the keywords of the resources
as integration context. However, the effectiveness of this approach depends on the
content of a single metadata field. A more sophisticated solution would require taking
into account additional metadata elements to avoid noise in the aggregations.
Additional elements to take into account as factors for data integration would be
the data information models, formats, scales, or resolutions.
With respect to the processing time, the step for computing the aggregations does
not significantly delay the search process because the spatial operations required to
perform the aggregation are restricted to the resources in the result list. Including the
time to access the spatial and textual indices, all the queries have returned their results
in less than 1 s.
7. Conclusions
This paper has identified and analyzed three issues (under-coverage, over-coverage,
partial coverage) arising from ‘concept at location’ style queries in geospatial data catalogs.
These issues are caused by the lack of adaptation of the prevalent IR engines used in these
geospatial data catalogs to the specific nature of geoinformation. As a solution, we have
proposed an IR method that yields aggregations of search results that better match
‘concept at location’ query restrictions.
The proposed IR method takes all the metadata records of resources that partially
fulfill a query (intersect the bounding box or the themes) and finds the spatial and
thematic relations between them. Next, it uses these relations to generate sets of
metadata records that are a better answer to the query than each one individually.
To evaluate its performance in archetypical ‘concept at location’ queries, we have
compared the performance of our proposal with respect to an IR system similar to
those used in prevalent geospatial data catalogs. The results have shown that this
approach may complement the traditional plain list of results of geospatial data
catalogs.
Additionally, our proposed aggregation-based functionality could be easily offered in
any geospatial data catalog by extending the Catalog Service for the Web (CSW) interface
provided by the OGC consortium (OGC 2007a). In the CSW, the GetRecords operation is
responsible for locating resources according to the restrictions specified in the user query.
The CSW standard establishes three levels of detail in query results (Brief, Summary and
Full). To provide interoperability between systems, the structure of Brief and Summary
results is restricted to the Dublin Core (DCMI 2007) based schema defined by OGC. In the
case of Full results, the standard allows the definition of profiles that extend the service
functionality. Through these profiles, the result structure could be redefined to provide
aggregations as a new type of resource that can be returned by the CSW.
Since the structure of the Brief and Summary levels of detail is restricted, it is not
possible to completely describe the aggregations: just the type of returned resource
would indicate that it is a collection (e.g. using dct:Collection as the dc:type value), and the
identifiers of the elements that compose it would be referenced (using the dc:relation field).
However, the Full basic description could be extended as needed to indicate the themes
and area of the query added by each resource in the aggregation. This approach would
make the aggregation-based system compatible with any existing CSW client. In all
three views, a client could obtain the description of resources with standard metadata
fields, and only in the Full view would it need to be specifically adapted to process the
details of the aggregation composition.
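A restricted description of this kind could look like the record built by the Python sketch below. It is only an illustration: it assumes the CSW 2.0.2 `csw:SummaryRecord` element and the Dublin Core element namespace, and the aggregation identifier, title and member identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Sequence

NS = {
    "csw": "http://www.opengis.net/cat/csw/2.0.2",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def aggregation_summary(identifier: str, title: str,
                        member_ids: Sequence[str]) -> ET.Element:
    """Build a summary record that flags an aggregation and lists its members."""
    record = ET.Element(f"{{{NS['csw']}}}SummaryRecord")
    ET.SubElement(record, f"{{{NS['dc']}}}identifier").text = identifier
    ET.SubElement(record, f"{{{NS['dc']}}}title").text = title
    # The type value marks the result as a collection, as suggested in the text.
    ET.SubElement(record, f"{{{NS['dc']}}}type").text = "dct:Collection"
    for member in member_ids:
        # One dc:relation per aggregated metadata record.
        ET.SubElement(record, f"{{{NS['dc']}}}relation").text = member
    return record


# Hypothetical identifiers for an aggregation of four tiles.
record = aggregation_summary(
    "aggregation-0001",
    "Andalucía EDM 98433/34/43/44",
    ["edm-98433", "edm-98434", "edm-98443", "edm-98444"],
)
print(ET.tostring(record, encoding="unicode"))
```

A client that only understands standard records can still read the identifier, title and type, and may resolve each `dc:relation` value with a further request if it wants the member records.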
Future work will explore the use of other metadata elements to solve problems
related to scale and information content. Clustering the resources that only differ in
representation fields, such as the scale, before the indexing process would reduce the
heterogeneity of the results, showing the alternative scales and the type of content
available in each cluster. Temporal information could also be included to extend the
proposed method for dealing with ‘concept at location at time’ queries. Finally, once we
have aggregated metadata results, it could be possible to produce virtual resources that
give access to the associated resources in an integrated way, even if this information
was originally scattered across multiple resources.
Notes
1. http://inspire-geoportal.ec.europa.eu/
2. https://www.geoplatform.gov/
3. http://idee.es/
4. https://data.gov.uk
5. https://geodiscover.alberta.ca/geoportal/
6. http://www.europeana.eu/portal/
7. http://www.idee.es/csw-inspire-idee/srv/spa/catalog.search#/home
Acknowledgments
We are grateful to the Spanish National Geographic Institute for providing us
with data for the experiments.
Disclosure statement
No potential conflict of interest was reported by the authors.
Funding
The work of Borja Espejo-Garcia has been partially supported by Aragon Government through the
grant number [C38/2015].
ORCID
Javier Lacasta http://orcid.org/0000-0003-3071-5819
F. Javier Lopez-Pellicer http://orcid.org/0000-0001-6491-7430
Javier Nogueras-Iso http://orcid.org/0000-0002-1279-0367
F. Javier Zarazaga-Soria http://orcid.org/0000-0002-6557-2494
References
Anderson, K. and Gaston, K.J., 2013. Lightweight unmanned aerial vehicles will revolutionize
spatial ecology. Frontiers in Ecology and the Environment, 11, 138–146. doi:10.1890/120150
Asadi, S., et al., 2005. Searching the World Wide Web for local services and facilities: a review on
the patterns of location-based queries. In: W. Fan, Z. Wu and J. Yang, eds. Advances in web-age
information management, lecture notes in computer science. Cham, Switzerland: Springer, Vol.
3739, 91–101.
Baeza-Yates, R. and Ribeiro-Neto, B., 2011. Modern information retrieval. The concepts and
technologies behind search. Reading, MA: Addison-Wesley.
DCMI, 2007. DCMI abstract model. Dublin Core Metadata Initiative, Technical report. Seoul, Korea:
National Library of Korea.
Ferrés, D. and Rodríguez, H., 2015. Evaluating geographical knowledge re-ranking, linguistic
processing and query expansion techniques for geographical information retrieval. In: C.
Iliopoulos, S. Puglisi and E. Yilmaz, eds. Proceedings of the 22nd International Symposium, SPIRE
2015, 9309 of Lecture Notes in Computer Science, September, London, UK. New York, USA:
Springer-Verlag, 311–323.
Florczyk, A.J., et al., 2010. Applying semantic linkage in the geospatial web. Lecture Notes in
Geoinformation and Cartography (LNG&C). Geospatial Thinking, 201–220.
Göbel, S. and Klein, P., 2002. Ranking mechanisms in metadata information systems for geospatial
data. In: Proceedings of the Earth Observation & Geo-Spatial Web and Internet Workshop, Ispra,
Italy. Munich, Germany: Fraunhofer, 1313.
Graser, A., Asamer, J., and Ponweiser, W., 2015. The elevation factor: Digital elevation model quality
and sampling impacts on electric vehicle energy estimation errors. In: Proceedings of the
International Conference on Models and Technologies for Intelligent Transportation Systems
(MT-ITS) [online], 3–5 June 2015. Budapest, Hungary: Domokos Esztergár-Kiss, 81–86.
Hübner, S., et al., 2004. Ontology-based search for interactive digital maps. IEEE Intelligent Systems,
19, 80–86. doi:10.1109/MIS.2004.15
ISO/TC 211, 2014. ISO 19115-1:2014. Geographic information – Metadata – Part 1: fundamentals.
Geneva, Switzerland: International Organization for Standardization.
ISO/TC 211, 2016. ISO 19135-3:2016. Metadata – Part 3: XML implementation of fundamentals.
Geneva, Switzerland: International Organization for Standardization.
Janowicz, K., et al., 2010. Semantic enablement for spatial data infrastructures. Transactions in GIS,
14, 111–129. doi:10.1111/j.1467-9671.2010.01186.x
Kim, J., Vasardani, M., and Winter, S., 2017. Similarity matching for integrating spatial information
extracted from place descriptions. International Journal of Geographical Information Science, 31,
56–80. doi:10.1080/13658816.2016.1188930
Lanfear, K.J., 2006. A spatial overlay ranking method for a geospatial search of text objects. Reston,
VA: USGS, Technical report.
Larson, R. and Frontiera, P., 2004. Ranking and representation for geographic information retrieval.
In: R. Purves and C. Jones, eds. Proceedings of the SIGIR 2004 Workshop on Geographic
Information Retrieval [online], 25–29 July. Sheffield. Available from:
http://www.geo.uzh.ch/~rsp/gir/abstracts/
Latre, M., et al., 2009. An approach to facilitate the integration of hydrological data by means of
ontologies and multilingual thesauri. In: M. Sester, L. Bernard and V. Paelke, eds. Advances in
GIScience. Lecture notes in geoinformation and cartography (LNG&C), 02–05 June, Hannover,
Germany. Berlin, Germany: Springer-Verlag, 155–171.
Lieberman, J., 2006. Geospatial semantic web interoperability experiment report. Open Geospatial
Consortium, Technical report. Abingdon, UK: Taylor & Francis.
Lutz, M., et al., 2009. Overcoming semantic heterogeneity in spatial data infrastructures. Computers
& Geosciences, 35, 739–752. doi:10.1016/j.cageo.2007.09.017
Lutz, M. and Klien, E., 2006. Ontology-based retrieval of geographic information. International
Journal of Geographical Information Science, 20, 233–260. doi:10.1080/13658810500287107
Martins, B., Silva, M.J., and Andrade, L., 2005. Indexing and ranking in Geo-IR systems. In: C. Jones
and R. Purves, eds. Proceedings of the Workshop on Geographic information retrieval, 04
November, Bremen, Germany. New York, USA: ACM, 31–34.
Megler, V.M. and Maier, D., 2011. Finding haystacks with needles: ranked search for data using
geospatial and temporal characteristics. In: J.B. Cushing, J. French and S. Bowers, eds. Scientific
and statistical database management, no. 6809 of lecture notes in computer science. Berlin,
Germany: Springer Verlag, 55–72.
Nogueras-Iso, J., et al., 2009. Development and deployment of a services catalog in compliance
with the INSPIRE metadata implementing rules. In: B. Van Loenen, J.W.J. Besemer and J.A.
Zevenbergen, eds. SDI convergence: research, emerging trends, and critical assessment.
Amersfoort, Netherlands: The Netherlands Geodetic Commission (NGC), 21–34.
OGC, 2007a. OpenGIS catalogue service implementation specification. Wayland, MA: Open
Geospatial Consortium, Technical report Version 2.02.
OGC, 2007b. OpenGIS catalogue services specification 2.0.2 - ISO metadata application profile.
Wayland, MA: Open Geospatial Consortium, Technical report Version 2.02.
Paneque-Gálvez, J., et al., 2014. Small drones for community-based forest monitoring: An
assessment of their feasibility and potential in tropical areas. Forests, 5, 1481–1507.
doi:10.3390/f5061481
Pereira, P., et al., 2015. The impact of road and railway embankments on runoff and soil erosion in
eastern Spain. Hydrology and Earth System Sciences Discussions, 12, 12947–12985.
doi:10.5194/hessd-12-12947-2015
Renteria-Agualimpia, W., et al., 2016. Improving the geospatial consistency of digital libraries
metadata. Journal of Information Science, 42, 507–523. doi:10.1177/0165551515597364
Sallaberry, C., 2013. Geographical information retrieval in textual corpora. Hoboken, NJ: Wiley.
Smith, T.R., 1996. A digital library for geographically referenced materials. Computer, 29, 54–60.
doi:10.1109/2.493457
Watters, C. and Amoudi, G., 2003. GeoSearcher: Location-based ranking of search engine results.
Journal of the American Society for Information Science and Technology, 54, 140–151.
doi:10.1002/(ISSN)1532-2890
Wolf, A.T., et al., 1999. International river basins of the world. International Journal of Water
Resources Development, 15, 387–427. doi:10.1080/07900629948682
Zhu, X., et al., 2015. Integrating spatial data linkage and analysis services in a geoportal for China
urban research. Transactions in GIS, 19, 107–128. doi:10.1111/tgis.2015.19.issue-1
... Several prior studies have addressed the improvement of IR in the context of geodata catalogues with diverse solutions. Works by Lacasta et al. [28,29] have addressed an often occurring mismatch between the users' query demands and the returned metadata records. Specifically, ref. [29] describes that retrieved datasets often do not entirely cover the area of interest that is requested by the user. ...
... (a) A BERT model fine-tuned on named entity recognition [61] to extract location entities from a query (e.g., "Berlin" from the query "climate data berlin"). (2) Calculating a spatial similarity metric: To calculate the spatial similarity metric between the query bounding box and the bounding boxes of candidates retrieved by the dense retriever, the Hausdorff distance is an appropriate measure, as proposed by [28,63]. The Hausdorff distance takes into account both the size and position of the geometries (polygons in this case). ...
The search for environmental data typically involves lexical approaches, where query terms are matched with metadata records based on measures of term frequency. In contrast, dense retrieval approaches employ language models to comprehend the context and meaning of a query and provide relevant search results. However, for environmental data, this has not been researched and there are no corpora or evaluation datasets to fine-tune the models. This study demonstrates the adaptation of dense retrievers to the domain of climate-related scientific geodata. Four corpora containing text passages from various sources were used to train different dense retrievers. The domain-adapted dense retrievers are integrated into the search architecture of a standard metadata catalogue. To improve the search results further, we propose a spatial re-ranking stage after the initial retrieval phase to refine the results. The evaluation demonstrates superior performance compared to the baseline model commonly used in metadata catalogues (BM25). No clear trends in performance were discovered when comparing the results of the dense retrievers. Therefore, further investigation aspects are identified to finally enable a recommendation of the most suitable corpus composition.
... Their work only considers spatial search without due consideration to thematic queries that we incorporated in this study. Another set of works [10,28,34] was mainly about spatial data infrastructures or geospatial catalogs based on metadata, but is of interest for our study. ...
... Lacasta et al. [34] proposed discovering related geospatial data in different resources by taking all metadata records of resources that partially fulfill a query (i.e., intersect the bounding box or only match the themes), find their spatial & thematic relations and generate sets of metadata records or results that are a better answer to the query than each one individually. Even though their work is not exclusively for open government data, it was of interest for our research work for three reasons. ...
The increasing availability of open government datasets on the Web calls for ways to enable their efficient access and searching. There is however an overall lack of understanding regarding spatial search strategies which would perform best in this context. To address this gap, this work has assessed the impact of different spatial search strategies on performance and user relevance judgment. We harvested machine-readable spatial datasets and their metadata from three English-based open government data portals, performed metadata enhancement, developed a prototype and performed both a theoretical and user-based evaluation. The results highlight that (i) switching between area of overlap and Hausdorff distance for spatial similarity computation does not have any substantial impact on performance; and (ii) the use of Hausdorff distance induces slightly better user relevance ratings than the use of area of overlap. The data collected and the insights gleaned may serve as a baseline against which future work can compare.
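The two spatial similarity measures compared in the abstract above can be illustrated for the simplest case of bounding boxes. The following Python sketch is only an illustration under stated assumptions, not the implementation used in any of the cited works: the `BBox` class and the function names are hypothetical, coordinates are assumed to be planar, and each box is treated as a filled rectangle, for which the Hausdorff distance is attained at a corner.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in planar coordinates (hypothetical helper)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def corners(self):
        return [(self.xmin, self.ymin), (self.xmin, self.ymax),
                (self.xmax, self.ymin), (self.xmax, self.ymax)]


def point_to_box(x, y, box):
    """Euclidean distance from a point to a filled box (0 if the point is inside)."""
    dx = max(box.xmin - x, 0.0, x - box.xmax)
    dy = max(box.ymin - y, 0.0, y - box.ymax)
    return hypot(dx, dy)


def hausdorff(a, b):
    """Hausdorff distance between two filled boxes.

    For convex regions the farthest point of one region from the other
    is a vertex, so checking the four corners of each box is sufficient.
    """
    a_to_b = max(point_to_box(x, y, b) for x, y in a.corners())
    b_to_a = max(point_to_box(x, y, a) for x, y in b.corners())
    return max(a_to_b, b_to_a)


def overlap_area(a, b):
    """Area of the intersection of two boxes (0 if they are disjoint)."""
    w = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    h = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(w, 0.0) * max(h, 0.0)
```

For a query box `BBox(0, 0, 4, 4)` and a candidate `BBox(2, 2, 6, 6)`, the overlap area is 4 and the Hausdorff distance is about 2.83. A larger overlap indicates a better match, whereas a smaller Hausdorff distance does, so a ranking based on both would need to invert or normalise one of them.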
... Further, using fine-grained data point searches, researchers can identify track points that are similar to the original track or track points of interest in an area for in-depth analysis to optimize vehicle scheduling. Specifically, in spatial dataset search, we integrate range-based dataset search and top-k exemplar dataset search based on several distance metrics such as intersecting area (IA) [27], [39], grid-based overlap (GBO) [46], and Hausdorff distance (Haus) [27], [29], [38], [39], [47], [59], as shown in Fig. 2 (the formal definition is in Section III-B). Moreover, building upon the outcomes of the dataset search, we further enhance our capabilities to facilitate a finer-grained data point search process, wherein the results obtained from the dataset search are fed to the subsequent data point search. ...
There has been increased interest in data search as a means to find relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they consider the two types of searches independently. To enable search operations ranging from the coarse-grained dataset level to the fine-grained data point level, we provide an integrated one that supports diverse query types and distance metrics. In this paper, we focus on designing a multi-granularity spatial data search system, called Spadas, that supports both dataset and data point search operations. To address the challenges of the high cost of indexing and susceptibility to outliers, we propose a unified index that can drastically improve query efficiency in various scenarios by organizing data reasonably and removing outliers in datasets. Moreover, to accelerate all data search operations, we propose a set of pruning mechanisms based on the unified index, including fast bound estimation, approximation technique with error bound, and pruning in batch techniques, to effectively filter out non-relevant datasets and points. Finally, we report the results of a detailed experimental evaluation using six spatial data repositories, achieving orders of magnitude faster than the state-of-the-art algorithms and demonstrating the effectiveness by case study. An online spatial data search system of Spadas is also implemented and made accessible to users.
... The literature has some works in this field. For instance, Lacasta et al. [53] describe an IR process for geospatial data catalogues that focuses on solving this fragmentation problem by identifying the implicit spatial/thematic relations among query results. Their process focuses on finding resources spatially and thematically compatible with the user query and identifying their theme and spatial overlap. ...
The discrete representation of resources in geospatial catalogues affects their information retrieval performance. The performance could be improved by using automatically generated clusters of related resources, which we name quasi-spatial dataset series. This work evaluates whether a clustering process can create quasi-spatial dataset series using only textual information from metadata elements. We assess the combination of different kinds of text cleaning approaches, word and sentence-embeddings representations (Word2Vec, GloVe, FastText, ELMo, Sentence BERT, and Universal Sentence Encoder), and clustering techniques (K-Means, DBSCAN, OPTICS, and agglomerative clustering) for the task. The results demonstrate that combining word-embeddings representations with an agglomerative-based clustering creates better quasi-spatial dataset series than the other approaches. In addition, we have found that the ELMo representation with agglomerative clustering produces good results without any preprocessing step for text cleaning.
... Spatial datasets in GIS relate to how primary and secondary data are obtained through the collection process, and then how the data is processed through spatial analysis to be information in the decision support system [3]. Visualization of spatial data can be done with cloud-terminal integration GIS to provide convenience in the process of spatial analysis on a large number of spatial datasets [4], aggregation-based spatial datasets information retrieval system [5]. Spatial datasets as the key to the value of big data in spatial data mining (SDM) that refers to the description of attribute data requirements, how the data is obtained, and what AI method is used to perform spatial analysis of the data [6], [4]. ...
The classification method in the spatial analysis modeling based on the multi-criteria parameter is currently widely used to manage geographic information systems (GIS) software engineering. The accuracy of the proposed model will play an essential role in the successful software development of GIS. This is related to the nature of GIS used for mapping through spatial analysis. This paper aims to propose a framework of spatial analysis using a hybrid estimation model-based on a combination of multi-criteria decision-making (MCDM) and artificial neural networks (ANNs) (MCDM-ANNs) classification. The proposed framework is based on the comparison of existing frameworks through the concept of a literature review. The model in the proposed framework will be used for future work on the traffic accident-prone road classification through testing with a private or public spatial dataset. Model validation testing on the proposed framework uses metaheuristic optimization techniques. Policymakers can use the results of the model on the proposed framework for initial planning developing GIS software engineering through spatial analysis models.
... However, these names that can be found in the gazetteers that fit user needs are noise unless they are explicitly mentioned in metadata records. An improved catalogue system could help by providing an information retrieval process for geospatial catalogue aimed at improving results (Lacasta et al., 2017). ...
This paper presents an idea for the development of a catalogue system for spatial datasets based on indexing both their metadata and their features. This characteristic is not available in spatial data catalogues in Spatial Data Infrastructures. This catalogue uses features for improving the relevance of responses because metadata records may not convey all the information that users may need for dataset discovery. The underlying search engine guarantees that when the query filters using a bounding box, all the returned datasets should contain features in the queried area and their rank position will depend on the characteristics of the matching features. In a similar way, when the query contains a text query expression, all the returned datasets should contain features related to such query even if their metadata records do not mention. The feasibility of this approach is shown with the development of a proof of concept using open source off-the-shelf technologies like Elasticsearch.
With the electronic transformation of traditional paper media, the integration of government new media platforms, the establishment of various institutions on social, music, video, shopping and other websites, and the rise of We-Media, massive new media information resources come out. In order to solve the problem of organizing and managing such information resources, and meet the needs of people’s deep use, propose an intelligent catalog system based on the development status of new media information resources. The intelligent catalog system for new media information resources combines massive data, bibliography theories, various catalogs, emerging intelligent technologies, catalog applications, etc., consisting of catalog data, catalog set, catalog application and support and guarantee. Metadata, classification and coding, and knowledge graph are key technologies for the construction of intelligent catalog system. Through the combination of bibliographic theory and information technologies, new media information resources can be better managed and utilized, and it can be fulfilled to provide personalized, diversified, integrated, accurate knowledge services for the users.
Advances in linked geospatial data, recommender systems, and geographic information retrieval have led to urgent necessity to assess the overall semantic relatedness between keyword sets of geographic metadata. In this study, a new model is proposed for computing the semantic relatedness between arbitrary two keyword sets of geographic metadata stored in current global spatial data infrastructures. In this model, the overall semantic relatedness is derived by pairing these keywords that are found to be most relevant to each other and averaging their relatedness. To find the most relevant keywords across two keyword sets precisely, the keywords in the keyword set of geographic metadata are divided into three kinds: the thesaurus elements, the WordNet elements, and the statistical elements. The thesaurus-lexical relatedness measure (TLRM), the extended thesaurus-lexical relatedness measure (ETLRM), and the Longest Common Substring method are proposed to compute the semantic relatedness between two thesaurus elements, two WordNet elements, a thesaurus element, and a WordNet element and two statistical elements, respectively. A human data set – the geographic-metadata’s keyword set relatedness dataset, which was used to evaluate the precision of the semantic relatedness measures of keyword sets of geographic metadata, was created. The proposed method was evaluated against the human-generated relatedness judgments and was compared with the Jaccard method and Vector Space Model. The results demonstrated that the proposed method achieved a high correlation with human judgments and outperformed the existing methods. Finally, the proposed method was applied to quantitatively linked geospatial data.
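The pairing-and-averaging idea described in the abstract above can be sketched in a few lines of Python. This is a simplified illustration and not the model proposed in that work: it applies only a longest-common-substring measure to every keyword (the abstract reserves that method for "statistical elements"), the normalisation by the longer keyword's length is an assumption, and the function names `lcs_relatedness` and `keyword_set_relatedness` are hypothetical.

```python
from difflib import SequenceMatcher


def lcs_relatedness(k1, k2):
    """Length of the longest common substring, divided by the longer keyword's length.

    The normalisation is an assumption made for this sketch.
    """
    a, b = k1.lower(), k2.lower()
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b))


def keyword_set_relatedness(set_a, set_b):
    """Average, over both sets, of each keyword's best match in the other set."""
    if not set_a or not set_b:
        return 0.0
    best_a = [max(lcs_relatedness(k, other) for other in set_b) for k in set_a]
    best_b = [max(lcs_relatedness(k, other) for other in set_a) for k in set_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))
```

Calling `keyword_set_relatedness(["hydrology", "river basin"], ["hydrography", "rivers"])` returns a score between 0 and 1. A best-match average is only one simple reading of "pairing the keywords that are most relevant to each other"; a one-to-one assignment between the two sets would be a stricter alternative.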
Road and railway infrastructure increased in the Mediterranean region during the last three decades. This included the building of embankments, which are assumed to be a large source of sediments and runoff. However, little is known about soil erosion rates, the factors that control them, and the processes that contribute to detachment, transport and deposition of sediments from road and railway embankments. The objective of this study was therefore to assess the impacts of road and railway embankments as a source of sediment and water, and compare them to other land use types (citrus plantations and shrublands) representative of the Cànyoles watershed to evaluate the importance of road embankments as a source of water and sediment under high magnitude low frequency rainfall events. Sixty rainfall experiments (1 m² plots; 60 min duration; 78 mm h⁻¹ rainfall intensity) were carried out on these land use types: 20 on two railway embankments (10+10), 20 on two road embankments (10+10), and 10 on citrus and 10 on shrubland. Road and railway embankments were characterized by bare soils with low organic matter and high bulk density. Erosion processes were more active in road, railway and citrus plots, and null in the shrublands. The non-sustainable soil erosion rates of 3 Mg ha⁻¹ y⁻¹ measured on the road embankments were due to the efficient runoff connectivity plus low infiltration rates within the plot as the runoff took less than one minute to reach the runoff outlet. Road and railway embankments are both an active source of sediments and runoff, and soil erosion control strategies must be applied. The citrus plantations also act as a source of water and sediments (1.5 Mg ha⁻¹ y⁻¹), while shrublands are sediment sinks, as no overland flow was observed due to the high infiltration rates.
Data gathered through community-based forest monitoring (CBFM) programs may be as accurate as those gathered by professional scientists, but acquired at a much lower cost and capable of providing more detailed data about the occurrence, extent and drivers of forest loss, degradation and regrowth at the community scale. In addition, CBFM enables greater survey repeatability. Therefore, CBFM should be a fundamental component of national forest monitoring systems and programs to measure, report and verify (MRV) REDD+ activities. To contribute to the development of more effective approaches to CBFM, in this paper we assess: (1) the feasibility of using small, low-cost drones (i.e., remotely piloted aerial vehicles) in CBFM programs; (2) their potential advantages and disadvantages for communities, partner organizations and forest data end-users; and (3) to what extent their utilization, coupled with ground surveys and local ecological knowledge, would improve tropical forest monitoring. To do so, we reviewed the existing literature regarding environmental applications of drones, including forest monitoring, and drew on our own firsthand experience flying small drones to map and monitor tropical forests and training people to operate them. We believe that the utilization of small drones can enhance CBFM and that this approach is feasible in many locations throughout the tropics if some degree of external assistance and funding is provided to communities. We suggest that the use of small drones can help tropical communities to better manage and conserve their forests whilst benefiting partner organizations, governments and forest data end-users, particularly those engaged in forestry, biodiversity conservation and climate change mitigation projects such as REDD+.
It is becoming acknowledged that water is likely to be the most pressing environmental concern of the next century. Difficulties in river basin management are only exacerbated when the resource crosses international boundaries. One critical aid in the assessment of international waters has been the Register of International Rivers, a compendium which listed 214 international waterways that cover 47% of the earth's continental land surface. The Register, though, was last updated in 1978 by the now defunct United Nations Department of Economic and Social Affairs. The purpose of this paper is to update the Register in order to reflect the quantum changes that have taken place over the last 22 years, both in global geopolitics and in map coverage and technology. By accessing digital elevation models at spatial resolutions of 30 arc seconds, corroborating at a unified global map coverage of at least 1:1 000 000, and superimposing the results over complete coverage of current political boundaries, we are able to provide a new register which lists 261 international rivers, covering 45.3% of the land surface of the earth (excluding Antarctica). This paper lists all international rivers with their watershed areas, the nations which share each watershed, their respective territorial percentages, and notes on changes in or disputes over international boundaries since 1978.
The general concern about environmental issues has involved the creation of national and international policies that require, at a technical level, the analysis, merging and processing of data obtained from very different sources. This paper proposes an approach for the integration of hydrological data that is based on the use of a multilingual ontology to facilitate the mapping across the local data models in the different sources. The novelty of the proposal is that the multilingual domain ontology is generated automatically by the merging and pruning of existing lexical ontologies. This approach has been tested in the context of the European Water Framework directive for the development of reporting applications in cross-border scenarios. Nevertheless, this approach could be easily extended to other domains.
In order to facilitate the availability of and access to spatial data, Spatial Data Infrastructures must set up a series of services to be reused by their community of users in the construction of different applications and value-added services. One of the key elements to exploit the resources provided by these infrastructures is to facilitate a catalog of services describing the features of the services offered to their users. This article presents the development and deployment of a services catalog within the Spatial Data Infrastructure of Spain, a catalog in compliance with the INSPIRE implementing rules.
Place descriptions are used in everyday communication as a common way to convey spatial information. Processing the information from place descriptions poses multiple significant challenges because these descriptions are written in natural language. In particular, corpora of place descriptions provide a plethora of human spatial knowledge beyond geographical information system, even if these descriptions refer to the same places in various ways. This article focuses on resolving ambiguous or synonymous place names from place descriptions by exploring the given relationships with other spatial features. It matches place names from multiple descriptions by developing a novel labelled graph matching process that relies solely on the comparison of string, linguistic and spatial similarities between identified places. This process uses unstructured place descriptions as an input, and produces a composite place graph with qualitative spatial relations from the descriptions. The performance of this novel process exceeds current toponym resolution by coping with non-gazetteered places.
This paper describes and evaluates the use of Geographical Knowledge Re-Ranking, Linguistic Processing, and Query Expansion techniques to improve Geographical Information Retrieval effectiveness. Geographical Knowledge Re-Ranking is performed with Geographical Gazetteers and conservative Toponym Disambiguation techniques that boost the ranking of the geographically relevant documents retrieved by standard state-of-the-art Information Retrieval algorithms. Linguistic Processing is performed in two ways: 1) Part-of-Speech tagging and Named Entity Recognition and Classification are applied to analyze the text collections and topics to detect toponyms, 2) Stemming (Porter’s algorithm) and Lemmatization are also applied in combination with default stopwords filtering. The Query Expansion methods tested are the Bose-Einstein (Bo1) and Kullback-Leibler term weighting models. The experiments have been performed with the English Monolingual test collections of the GeoCLEF evaluations (from years 2005, 2006, 2007, and 2008) using the TF-IDF, BM25, and InL2 Information Retrieval algorithms over unprocessed texts as baselines. The experiments have been performed with each GeoCLEF test collection (25 topics per evaluation) separately and with the fusion of all these collections (100 topics). The results of evaluating separately Geographical Knowledge Re-Ranking, Linguistic Processing (lemmatization, stemming, and the combination of both), and Query Expansion with the fusion of all the topics show that all these processes improve the Mean Average Precision (MAP) and RPrecision effectiveness measures in all the experiments and show statistical significance over the baselines in most of them. The best results in MAP and RPrecision are obtained with the InL2 algorithm using the following techniques: Geographical Knowledge Re-Ranking, Lemmatization with Stemming, and Kullback-Leibler Query Expansion. 
Some configurations with Geographical Knowledge Re-Ranking, Linguistic Processing and Query Expansion have improved the MAP of the best official results at GeoCLEF evaluations of 2005, 2006, and 2007.
Consistency is an essential aspect of the quality of metadata. Inconsistent metadata records are harmful: given a themed query, the set of retrieved metadata records would contain descriptions of unrelated or irrelevant resources, and may even not contain some resources considered obvious. This is even worse when the description of the location is inconsistent. Inconsistent spatial descriptions may yield invisible or hidden geographical resources that cannot be retrieved by means of spatially themed queries. Therefore, ensuring spatial consistency should be a primary goal when reusing, sharing and developing georeferenced digital collections. We present a methodology able to detect geospatial inconsistencies in metadata collections based on the combination of spatial ranking, reverse geocoding, geographic knowledge organization systems and information-retrieval techniques. This methodology has been applied to a collection of metadata records describing maps and atlases belonging to the Library of Congress. The proposed approach was able to automatically identify inconsistent metadata records (870 out of 10,575) and propose fixes to most of them (91.5%). These results support the ability of the proposed methodology to assess the impact of spatial inconsistency in the retrievability and visibility of metadata records and improve their spatial consistency.