Figure 4 - uploaded by Borja Espejo García
Mean DGC comparison.

Source publication
Article
Full-text available
Geospatial data catalogs enable users to discover and access geographical information. Prevailing solutions are document oriented and fragment the spatial continuum of the geospatial data into independent and disconnected resources described through metadata. Due to this, the complete answer for a query may be scattered across multiple resources, m...

Context in source publication

Context 1
... mean takes into account that some queries return fewer than 10 results. Figure 4 compares the mean DGC of the two systems at each position of their result list (the query result number). It can be observed that the aggregated system always has a higher mean gain. ...
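The per-position mean described in this excerpt can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes each query comes with a list of gain values, one per returned result, and that a query with fewer results than a given position is simply left out of the mean at that position. The function name `mean_gain_per_position` and the parameter `max_rank` are hypothetical.

```python
from typing import List, Optional, Sequence


def mean_gain_per_position(
    gains_per_query: Sequence[Sequence[float]],
    max_rank: int = 10,
) -> List[Optional[float]]:
    """Average a per-position gain over queries, position by position.

    Assumption (not stated in the source): a query that returned fewer
    than `pos + 1` results does not contribute to the mean at `pos`.
    Positions that no query reaches yield None.
    """
    means: List[Optional[float]] = []
    for pos in range(max_rank):
        # Keep only the queries whose result list is long enough.
        values = [gains[pos] for gains in gains_per_query if len(gains) > pos]
        means.append(sum(values) / len(values) if values else None)
    return means


if __name__ == "__main__":
    # Two hypothetical queries: one with two results, one with a single result.
    print(mean_gain_per_position([[1.0, 2.0], [3.0]], max_rank=3))
```

Averaging only over the queries that reach a position keeps short result lists from pulling the mean toward zero at later positions. Whether the source publication handles short lists this way is not stated in the excerpt.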

Similar publications

Article
Full-text available
Although universities conduct research in the SDI field, they have repeatedly erred when storing, preserving, and sharing their geospatial data. The general objective of this research is to develop a proposal for a Spatial Data Infrastructure (SDI) for the graduate programs of the Department of Earth Sciences at Federal University of Paraná (UFPR)....

Citations

... Further, using fine-grained data point searches, researchers can identify track points that are similar to the original track or track points of interest in an area for in-depth analysis to optimize vehicle scheduling. Specifically, in spatial dataset search, we integrate range-based dataset search and top-k exemplar dataset search based on several distance metrics such as intersecting area (IA) [27], [39], grid-based overlap (GBO) [46], and Hausdorff distance (Haus) [27], [29], [38], [39], [47], [59], as shown in Fig. 2 (the formal definition is in Section III-B). Moreover, building upon the outcomes of the dataset search, we further enhance our capabilities to facilitate a finer-grained data point search process, wherein the results obtained from the dataset search are fed to the subsequent data point search. ...
Preprint
There has been increased interest in data search as a means to find relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they consider the two types of searches independently. To enable search operations ranging from the coarse-grained dataset level to the fine-grained data point level, we provide an integrated approach that supports diverse query types and distance metrics. In this paper, we focus on designing a multi-granularity spatial data search system, called Spadas, that supports both dataset and data point search operations. To address the challenges of the high cost of indexing and susceptibility to outliers, we propose a unified index that can drastically improve query efficiency in various scenarios by organizing data reasonably and removing outliers in datasets. Moreover, to accelerate all data search operations, we propose a set of pruning mechanisms based on the unified index, including fast bound estimation, an approximation technique with error bound, and batch pruning techniques, to effectively filter out non-relevant datasets and points. Finally, we report the results of a detailed experimental evaluation using six spatial data repositories, in which Spadas runs orders of magnitude faster than the state-of-the-art algorithms, and we demonstrate its effectiveness through a case study. An online spatial data search system of Spadas is also implemented and made accessible to users.
... Several prior studies have addressed the improvement of IR in the context of geodata catalogues with diverse solutions. Works by Lacasta et al. [28,29] have addressed an often occurring mismatch between the users' query demands and the returned metadata records. Specifically, ref. [29] describes that retrieved datasets often do not entirely cover the area of interest that is requested by the user. ...
... (a) A BERT model fine-tuned on named entity recognition [61] to extract location entities from a query (e.g., "Berlin" from the query "climate data berlin"). (2) Calculating a spatial similarity metric: To calculate the spatial similarity metric between the query bounding box and the bounding boxes of candidates retrieved by the dense retriever, the Hausdorff distance is an appropriate measure, as proposed by [28,63]. The Hausdorff distance takes into account both the size and position of the geometries (polygons in this case). ...
Article
Full-text available
The search for environmental data typically involves lexical approaches, where query terms are matched with metadata records based on measures of term frequency. In contrast, dense retrieval approaches employ language models to comprehend the context and meaning of a query and provide relevant search results. However, for environmental data, this has not been researched and there are no corpora or evaluation datasets to fine-tune the models. This study demonstrates the adaptation of dense retrievers to the domain of climate-related scientific geodata. Four corpora containing text passages from various sources were used to train different dense retrievers. The domain-adapted dense retrievers are integrated into the search architecture of a standard metadata catalogue. To improve the search results further, we propose a spatial re-ranking stage after the initial retrieval phase to refine the results. The evaluation demonstrates superior performance compared to the baseline model commonly used in metadata catalogues (BM25). No clear trends in performance were discovered when comparing the results of the dense retrievers. Therefore, further investigation aspects are identified to finally enable a recommendation of the most suitable corpus composition.
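The citing excerpt for this article names the Hausdorff distance between the query bounding box and each candidate's bounding box as the spatial similarity measure. A minimal sketch of that measure for two axis-aligned boxes follows. It assumes planar coordinates and treats each box as a filled rectangle, in which case the maximum distance is always reached at a corner. The source does not say how the distance is computed in practice, so the type alias `BBox` and the function names are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical representation: (xmin, ymin, xmax, ymax) in planar units.
BBox = Tuple[float, float, float, float]


def point_to_box(x: float, y: float, box: BBox) -> float:
    """Euclidean distance from a point to a filled axis-aligned box."""
    xmin, ymin, xmax, ymax = box
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)


def corners(box: BBox) -> List[Tuple[float, float]]:
    xmin, ymin, xmax, ymax = box
    return [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]


def directed_hausdorff(a: BBox, b: BBox) -> float:
    """Largest distance from any point of box a to box b.

    For convex shapes the maximum is reached at a vertex, so checking
    the four corners of a is sufficient.
    """
    return max(point_to_box(x, y, b) for x, y in corners(a))


def hausdorff_bbox(a: BBox, b: BBox) -> float:
    """Symmetric Hausdorff distance between two filled boxes."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    query = (0.0, 0.0, 10.0, 10.0)
    candidate = (4.0, 4.0, 6.0, 6.0)  # small box nested inside the query box
    print(hausdorff_bbox(query, candidate))
```

The nested example yields a non-zero distance even though the candidate lies entirely inside the query box, which illustrates the excerpt's point that the measure reflects both size and position. Real catalogue coordinates are usually geographic, so a production version would need a projection or a geodesic distance, which this sketch leaves out.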
... The literature has some works in this field. For instance, Lacasta et al. [53] describe an IR process for geospatial data catalogues that focuses on solving this fragmentation problem by identifying the implicit spatial/thematic relations among query results. Their process focuses on finding resources spatially and thematically compatible with the user query and identifying their theme and spatial overlap. ...
Article
Full-text available
The discrete representation of resources in geospatial catalogues affects their information retrieval performance. The performance could be improved by using automatically generated clusters of related resources, which we name quasi-spatial dataset series. This work evaluates whether a clustering process can create quasi-spatial dataset series using only textual information from metadata elements. We assess the combination of different kinds of text cleaning approaches, word and sentence-embeddings representations (Word2Vec, GloVe, FastText, ELMo, Sentence BERT, and Universal Sentence Encoder), and clustering techniques (K-Means, DBSCAN, OPTICS, and agglomerative clustering) for the task. The results demonstrate that combining word-embeddings representations with an agglomerative-based clustering creates better quasi-spatial dataset series than the other approaches. In addition, we have found that the ELMo representation with agglomerative clustering produces good results without any preprocessing step for text cleaning.
... Spatial datasets in GIS relate to how primary and secondary data are obtained through the collection process, and then how the data are processed through spatial analysis to become information in the decision support system [3]. Visualization of spatial data can be done with cloud-terminal integration GIS, which facilitates spatial analysis on a large number of spatial datasets [4], or with an aggregation-based information retrieval system for spatial datasets [5]. Spatial datasets are the key to the value of big data in spatial data mining (SDM), which refers to the description of attribute data requirements, how the data is obtained, and what AI method is used to perform spatial analysis of the data [6], [4]. ...
Article
Full-text available
The classification method in the spatial analysis modeling based on the multi-criteria parameter is currently widely used to manage geographic information systems (GIS) software engineering. The accuracy of the proposed model will play an essential role in the successful software development of GIS. This is related to the nature of GIS used for mapping through spatial analysis. This paper aims to propose a framework of spatial analysis using a hybrid estimation model-based on a combination of multi-criteria decision-making (MCDM) and artificial neural networks (ANNs) (MCDM-ANNs) classification. The proposed framework is based on the comparison of existing frameworks through the concept of a literature review. The model in the proposed framework will be used for future work on the traffic accident-prone road classification through testing with a private or public spatial dataset. Model validation testing on the proposed framework uses metaheuristic optimization techniques. Policymakers can use the results of the model on the proposed framework for initial planning developing GIS software engineering through spatial analysis models.
... Their work only considers spatial search without due consideration to thematic queries that we incorporated in this study. Another set of works [10,28,34] was mainly about spatial data infrastructures or geospatial catalogs based on metadata, but is of interest for our study. ...
... Lacasta et al. [34] proposed discovering related geospatial data in different resources by taking all metadata records of resources that partially fulfill a query (i.e., intersect the bounding box or only match the themes), find their spatial & thematic relations and generate sets of metadata records or results that are a better answer to the query than each one individually. Even though their work is not exclusively for open government data, it was of interest for our research work for three reasons. ...
Conference Paper
The increasing availability of open government datasets on the Web calls for ways to enable their efficient access and searching. There is however an overall lack of understanding regarding spatial search strategies which would perform best in this context. To address this gap, this work has assessed the impact of different spatial search strategies on performance and user relevance judgment. We harvested machine-readable spatial datasets and their metadata from three English-based open government data portals, performed metadata enhancement, developed a prototype and performed both a theoretical and user-based evaluation. The results highlight that (i) switching between area of overlap and Hausdorff distance for spatial similarity computation does not have any substantial impact on performance; and (ii) the use of Hausdorff distance induces slightly better user relevance ratings than the use of area of overlap. The data collected and the insights gleaned may serve as a baseline against which future work can compare.
... However, these names that can be found in the gazetteers that fit user needs are noise unless they are explicitly mentioned in metadata records. An improved catalogue system could help by providing an information retrieval process for geospatial catalogue aimed at improving results (Lacasta et al., 2017). ...
Conference Paper
Full-text available
This paper presents an idea for the development of a catalogue system for spatial datasets based on indexing both their metadata and their features. This characteristic is not available in spatial data catalogues in Spatial Data Infrastructures. This catalogue uses features for improving the relevance of responses because metadata records may not convey all the information that users may need for dataset discovery. The underlying search engine guarantees that when the query filters using a bounding box, all the returned datasets should contain features in the queried area and their rank position will depend on the characteristics of the matching features. In a similar way, when the query contains a text query expression, all the returned datasets should contain features related to such query even if their metadata records do not mention it. The feasibility of this approach is shown with the development of a proof of concept using open source off-the-shelf technologies like Elasticsearch.
... For example, in the context of geographic information retrieval, Lacasta et al. [23] aggregated users' search results of geospatial datasets by identifying the implicit spatial and thematic relations between the metadata records of geospatial datasets as similarity to offer complete answers for a user's query. Martins et al. [16] ranked geospatial data retrieval results according to a combination of thematic and geographical similarity between a user's query and the geospatial datasets. ...
Article
Full-text available
To help users discover the most relevant spatial datasets in the ever-growing global spatial data infrastructures (SDIs), a number of similarity measures of geospatial data based on metadata have been proposed. Researchers have assessed the similarity of geospatial data according to one or more characteristics of the geospatial data. They created different similarity algorithms for each of the selected characteristics and then combined these elementary similarities to the overall similarity of the geospatial data. The existing combination methods are mainly linear and may not be the most accurate. This paper reports our experiences in attempting to learn the optimal non-linear similarity integration functions, from the knowledge of experts, using an artificial neural network. First, a multiple-layer feed forward neural network (MLFFN) was created. Then, the intrinsic characteristics were used to represent the metadata of geospatial data and the similarity algorithms for each of the intrinsic characteristics were built. The training and evaluation data of MLFFN were derived from the knowledge of domain experts. Finally, the MLFFN was trained, evaluated, and compared with traditional linear combination methods, which was mainly a weighted sum. The results show that our method outperformed the existing methods in terms of precision. Moreover, we found that the combination of elementary similarities of experts to the overall similarity of geospatial data was not linear.
Chapter
With the electronic transformation of traditional paper media, the integration of government new media platforms, the establishment of various institutions on social, music, video, shopping and other websites, and the rise of We-Media, massive new media information resources have emerged. In order to solve the problem of organizing and managing such information resources, and to meet the needs of people’s deep use, this chapter proposes an intelligent catalog system based on the development status of new media information resources. The intelligent catalog system for new media information resources combines massive data, bibliography theories, various catalogs, emerging intelligent technologies, catalog applications, etc., and consists of catalog data, catalog set, catalog application, and support and guarantee. Metadata, classification and coding, and knowledge graph are key technologies for the construction of the intelligent catalog system. Through the combination of bibliographic theory and information technologies, new media information resources can be better managed and utilized, providing personalized, diversified, integrated, and accurate knowledge services for users.