
Efficient spatial queries over complex polygons with hybrid representations


Abstract

One major goal of spatial query processing is to mitigate I/O costs and minimize the search space. However, geometric computation can be heavy-duty for spatial queries, in particular for complex geometries such as polygons with many edges based on a vector-based representation. Many past techniques have been provided for spatial partitioning and indexing, which are mainly built on minimal bounding boxes or other approximation methods and are not optimized for reducing geometric computation. In this paper, we propose a novel vector-raster hybrid approach through rasterization, where rich pixel-centric information is preserved to help not only filter out more candidates but also reduce geometry computation load. Based on the hybrid model, we implement four typical spatial queries, which can be generalized for other types of spatial queries. We also propose cost models to estimate the latency for those query types. Our experiments demonstrate that the hybrid model can boost the performance of spatial queries on complex polygons by up to one order of magnitude.
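The filtering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `HybridPolygon`, the helper `segment_hits_box`, the grid resolution `n`, and the three pixel labels (`IN`, `OUT`, `BORDER`) are all assumptions made for illustration. The polygon's bounding box is rasterized once; a point query that lands in an `IN` or `OUT` pixel is answered by a table lookup, and only `BORDER` pixels fall back to exact ray casting. The sketch assumes a simple polygon with a non-degenerate bounding box.

```python
from typing import List, Tuple

Point = Tuple[float, float]
IN, OUT, BORDER = "in", "out", "border"  # hypothetical pixel labels


def point_in_polygon(p: Point, poly: List[Point]) -> bool:
    """Exact even-odd ray casting against every edge of the polygon."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def segment_hits_box(a: Point, b: Point, xmin, ymin, xmax, ymax) -> bool:
    """Liang-Barsky clipping: True if segment ab touches the closed box."""
    t0, t1 = 0.0, 1.0
    dx, dy = b[0] - a[0], b[1] - a[1]
    for p, q in ((-dx, a[0] - xmin), (dx, xmax - a[0]),
                 (-dy, a[1] - ymin), (dy, ymax - a[1])):
        if p == 0:
            if q < 0:  # parallel to this side and outside it
                return False
        else:
            t = q / p
            if p < 0:
                t0 = max(t0, t)
            else:
                t1 = min(t1, t)
            if t0 > t1:
                return False
    return True


class HybridPolygon:
    """Vector polygon plus an n x n raster of labels over its bounding box."""

    def __init__(self, poly: List[Point], n: int = 8):
        self.poly, self.n = poly, n
        xs, ys = [x for x, _ in poly], [y for _, y in poly]
        self.xmin, self.ymin = min(xs), min(ys)
        self.xmax, self.ymax = max(xs), max(ys)
        self.w = (self.xmax - self.xmin) / n
        self.h = (self.ymax - self.ymin) / n
        edges = list(zip(poly, poly[1:] + poly[:1]))
        self.cells = []  # indexed as cells[row j][column i]
        for j in range(n):
            row = []
            for i in range(n):
                x0, y0 = self.xmin + i * self.w, self.ymin + j * self.h
                x1, y1 = x0 + self.w, y0 + self.h
                if any(segment_hits_box(a, b, x0, y0, x1, y1) for a, b in edges):
                    row.append(BORDER)  # an edge touches this pixel
                elif point_in_polygon((x0 + self.w / 2, y0 + self.h / 2), poly):
                    row.append(IN)      # pixel lies fully inside
                else:
                    row.append(OUT)     # pixel lies fully outside
            self.cells.append(row)

    def contains(self, p: Point) -> bool:
        x, y = p
        if not (self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax):
            return False  # bounding-box filter
        i = min(int((x - self.xmin) / self.w), self.n - 1)
        j = min(int((y - self.ymin) / self.h), self.n - 1)
        label = self.cells[j][i]
        if label == BORDER:
            return point_in_polygon(p, self.poly)  # exact refinement
        return label == IN


l_shape = [(0, 0), (8, 0), (8, 8), (4, 8), (4, 4), (0, 4)]
hp = HybridPolygon(l_shape, n=8)
print(hp.contains((1.5, 1.5)), hp.contains((1.5, 5.5)), hp.contains((0.5, 0.5)))
```

In this sketch the first two queries are resolved from the raster alone, while the third falls in a border pixel and triggers the exact test. A finer grid shrinks the border region at the cost of more memory, which is the kind of trade-off a cost model would need to capture.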
GeoInformatica (2024) 28:459–497
https://doi.org/10.1007/s10707-023-00508-2
RESEARCH
Dejun Teng¹ · Furqan Baig² · Zhaohui Peng¹ · Jun Kong³ · Fusheng Wang⁴
Received: 26 October 2022 / Revised: 29 September 2023 / Accepted: 31 October 2023 / Published online: 27 December 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023
Keywords: Spatial database · Spatial representations
Zhaohui Peng
pzh@sdu.edu.cn
Dejun Teng
teng@sdu.edu.cn
Furqan Baig
fbaig@illinois.edu
Jun Kong
jkong@gsu.edu
Fusheng Wang
fusheng.wang@stonybrook.edu
¹ The School of Computer Science and Technology, Shandong University, 72 Binhai Road, Qingdao 266237, Shandong, China
² CyberGIS Center for Advanced Digital and Spatial Studies, University of Illinois at Urbana-Champaign, 1301 W Green St, Urbana 61801, IL, USA
³ Department of Mathematics and Statistics, Georgia State University, Atlanta 30303, GA, USA
⁴ Department of Computer Science, Stony Brook University, Stony Brook 11794, NY, USA
Article
We released open-source software Hadoop-GIS in 2011, and presented and published the work in VLDB 2013. This work initiated the development of a new spatial data analytical ecosystem characterized by its large-scale capacity in both computing and data storage, high scalability, compatibility with low-cost commodity processors in clusters and open-source software. After more than a decade of research and development, this ecosystem has matured and is now serving many applications across various fields. In this paper, we provide the background on why we started this project and give an overview of the original Hadoop-GIS software architecture, along with its unique technical contributions and legacy. We present the evolution of the ecosystem and its current state-of-the-art, which has been influenced by the Hadoop-GIS project. We also describe the ongoing efforts to further enhance this ecosystem with hardware accelerations to meet the increasing demands for low latency and high throughput in various spatial data analysis tasks. Finally, we will summarize the insights gained and lessons learned over more than a decade in pursuing high-performance spatial data analytics.
Conference Paper
As individual traffic and public transport in cities are changing, city authorities need to analyze urban geospatial data to improve transportation and infrastructure. To that end, they rely heavily on spatial aggregation queries that extract summarized information from point data (e.g., Uber rides) contained in a given polygonal region (e.g., a city neighborhood). To support such queries, current analysis tools either allow only predefined aggregates on predefined regions and are thus unsuitable for exploratory analyses, or access the raw data to compute aggregate results on-the-fly, which severely limits the interactivity. At the same time, existing pre-aggregation techniques are inadequate since they maintain aggregates over rectangular regions. As a result, when applied over arbitrary polygonal regions, they induce an approximation error that cannot be bounded. In this paper, we introduce GeoBlocks, a novel pre-aggregating data structure that supports spatial aggregation over arbitrary polygons. GeoBlocks closely approximate polygons using a set of fine-grained grid cells and, in contrast to prior work, allow the approximation error to be bounded by adjusting the cell size. Furthermore, GeoBlocks employ a trie-like cache that caches aggregate results of frequently queried regions, thereby dynamically adapting to the skew inherently present in query workloads and improving performance over time. In summary, GeoBlocks outperform on-the-fly aggregation by up to three orders of magnitude, achieving the sub-second query latencies required for interactive exploratory analytics.
Conference Paper
Geometric intersection algorithms are fundamental in spatial analysis in Geographic Information Systems (GIS). Applying high performance computing to perform geometric intersection on huge amounts of spatial data to get real-time results is necessary. Given two input geometries (polygon or polyline) of a candidate pair, we introduce a new two-step geospatial filter that first creates sketches of the geometries and uses them to detect workload, and then refines the sketches by the common areas of the sketches to decrease the overall computations in the refine phase. We call this filter the PolySketch-based CMBR (PSCMBR) filter. We show the application of this filter in speeding up the line segment intersection (LSI) reporting task, which is a basic computation in a variety of geospatial applications like polygon overlay and spatial join. We also developed a parallel PolySketch-based PNP filter to perform PNP tests on GPU, which reduces the computational workload in PNP tests. Finally, we integrated these new filters into the hierarchical filter and refinement system to solve the geometric intersection problem. We have implemented the new filter and refine system on GPU using CUDA. The new filters introduced in this paper reduce more computational workload when compared to existing filters. The processing rate of the new filter and refine system for the line segment intersection reporting task is 61 million/sec on average.
Conference Paper
Connected mobility applications rely heavily on geospatial joins that associate point data, such as locations of Uber cars, to static polygonal regions, such as city neighborhoods. These joins typically involve expensive geometric computations, which makes it hard to provide an interactive user experience. In this paper, we propose an adaptive polygon index that leverages true hit filtering to avoid expensive geometric computations in most cases. In particular, our approach closely approximates polygons by combining quadtrees with true hit filtering, and stores these approximations in a query-efficient radix tree. Based on this index, we introduce two geospatial join algorithms: an approximate one that guarantees a user-defined precision, and an exact one that adapts to the expected point distribution. In summary, our technique outperforms existing CPU-based joins by up to two orders of magnitude and is competitive with state-of-the-art GPU implementations.
Conference Paper
In this paper, we introduce our hierarchical filter and refinement technique that we have developed for parallel geometric intersection operations involving large polygons and polylines. The inputs are two layers of large polygonal datasets and the computations are spatial intersection on a pair of cross-layer polygons. These intersections are the compute-intensive spatial data analytic kernels in spatial join and map overlay computations. We have extended the classical filter and refine algorithms using PolySketch Filter to improve the performance of geospatial computations. In addition to filtering polygons by their Minimum Bounding Rectangle (MBR), our hierarchical approach explores further filtering using tiles (smaller MBRs) to increase the effectiveness of filtering and decrease the computational workload in the refinement phase. We have implemented this filter and refine system on CPU and GPU by using OpenMP and OpenACC. After using R-tree, on average, our filter technique can still discard 69% of polygon pairs which do not have segment intersection points. PolySketch filter reduces on average 99.77% of the workload of finding line segment intersections. PNP based task reduction and Striping algorithms filter out on average 95.84% of the workload of Point-in-Polygon tests. Our CPU-GPU system performs spatial join on two shapefiles, namely USA Water Bodies and USA Block Group Boundaries with 683K polygons in about 10 seconds using NVidia Titan V and Titan Xp GPU.
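The general filter-and-refine pattern that this abstract extends can be sketched in a few lines of Python. This is a generic illustration, not the PolySketch filter or the authors' tiling scheme: the function names (`mbr`, `boxes_overlap`, `boundaries_intersect`) are hypothetical, and the per-edge bounding box used as a second-level filter is only a simple stand-in for the tiles mentioned above. The function reports whether two polygon boundaries cross, together with the number of exact segment tests that survived filtering. It does not detect the case where one polygon lies entirely inside the other.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(points: List[Point]) -> Box:
    """Minimum bounding rectangle of a set of points."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two closed axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def orient(p: Point, q: Point, r: Point) -> float:
    """Signed area test: > 0 if r is left of pq, < 0 if right, 0 if collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_intersect(p1: Point, p2: Point, p3: Point, p4: Point) -> bool:
    """Exact test for segments p1p2 and p3p4, including touching endpoints."""
    d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
    d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
    if d1 * d2 < 0 and d3 * d4 < 0:
        return True  # proper crossing
    # Collinear endpoint r lies on segment se if it is inside the segment's box.
    cases = ((d1, p3, p4, p1), (d2, p3, p4, p2), (d3, p1, p2, p3), (d4, p1, p2, p4))
    return any(d == 0 and boxes_overlap(mbr([s, e]), (r[0], r[1], r[0], r[1]))
               for d, s, e, r in cases)


def boundaries_intersect(poly_a: List[Point], poly_b: List[Point]):
    """Return (answer, number of exact segment tests performed)."""
    # Level 1 filter: whole-polygon MBRs.
    if not boxes_overlap(mbr(poly_a), mbr(poly_b)):
        return False, 0
    edges_a = list(zip(poly_a, poly_a[1:] + poly_a[:1]))
    edges_b = list(zip(poly_b, poly_b[1:] + poly_b[:1]))
    exact_tests = 0
    for a1, a2 in edges_a:
        box_a = mbr([a1, a2])
        for b1, b2 in edges_b:
            # Level 2 filter: per-edge boxes (stand-in for finer tiles).
            if not boxes_overlap(box_a, mbr([b1, b2])):
                continue
            exact_tests += 1  # refinement step
            if segments_intersect(a1, a2, b1, b2):
                return True, exact_tests
    return False, exact_tests


square_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_b = [(2, 2), (6, 2), (6, 6), (2, 6)]
far_away = [(10, 10), (12, 10), (12, 12), (10, 12)]
print(boundaries_intersect(square_a, square_b))
print(boundaries_intersect(square_a, far_away))
```

The second call is rejected by the polygon-level box test without any exact segment test, which is the effect a filter stage is meant to achieve.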
Article
In this work, we propose a framework to store and manage spatial data, which includes new efficient algorithms to perform operations accepting as input a raster dataset and a vector dataset. More concretely, we present algorithms for solving a spatial join between a raster and a vector dataset imposing a restriction on the values of the cells of the raster; and an algorithm for retrieving K objects of a vector dataset that overlap cells of a raster dataset, such that the K objects are those overlapping the highest (or lowest) cell values among all objects. The raster data is stored using a compact data structure, which can directly manipulate compressed data without the need for prior decompression. This leads to better running times and lower memory consumption. In our experimental evaluation comparing our solution to other baselines, we obtain the best space/time trade-offs.
Conference Paper
Geometric computation can be heavy-duty for spatial queries, in particular for complex geometries such as polygons with many edges based on a vector-based representation. While many techniques have been provided for spatial partitioning and indexing, they are mainly built on minimal bounding boxes or other approximation methods, which do not mitigate the high cost of geometric computation. In this paper, we propose a novel vector-raster hybrid approach through rasterization, where rich pixel-centric information is preserved to help not only filter out more candidates but also reduce geometry computation load. Based on the hybrid model, we develop an efficient rasterization-based ray casting method for point-in-polygon queries and a circle buffering method for point-to-polygon distance calculation, which is a common operation for distance-based queries. Our experiments demonstrate that the hybrid model can boost the performance of spatial queries on complex polygons by up to one order of magnitude.
Conference Paper
Recent advances in remote sensing technology have resulted in petabytes of data in raster format. To process this data, it is often combined with high-resolution vector data that represents, for example, region boundaries. One of the common operations that combine big vector and raster data is zonal statistics, which computes some aggregate values for each polygon in the vector dataset. This paper proposes a novel and scalable algorithm for zonal statistics that can scale to petabytes of raster and vector data. The proposed method does not require any preprocessing or indexing, making it well suited to the ad-hoc queries that scientists usually want to run. We implement a prototype for the proposed method, and the preliminary results show that the proposed method can scale up to a trillion pixels.
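To make the zonal statistics operation concrete, the following Python sketch computes count, sum, and mean of the raster pixels whose centers fall inside one polygon. It is a naive baseline that only defines the operation and is not the scalable algorithm the abstract proposes. The function name `zonal_stats`, the unit-sized pixels, and the raster origin at (0, 0) with pixel (row r, column c) covering [c, c+1] x [r, r+1] are all assumptions made for this example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, poly: List[Point]) -> bool:
    """Exact even-odd ray casting against every edge of the polygon."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zonal_stats(raster: List[List[float]], poly: List[Point]) -> dict:
    """Aggregate the pixels whose centers lie inside the polygon (naive scan)."""
    rows, cols = len(raster), len(raster[0])
    xs, ys = [x for x, _ in poly], [y for _, y in poly]
    # Restrict the scan to the polygon's bounding box, clipped to the raster.
    c0, c1 = max(0, math.floor(min(xs))), min(cols, math.ceil(max(xs)))
    r0, r1 = max(0, math.floor(min(ys))), min(rows, math.ceil(max(ys)))
    count, total = 0, 0
    for r in range(r0, r1):
        for c in range(c0, c1):
            if point_in_polygon((c + 0.5, r + 0.5), poly):
                count += 1
                total += raster[r][c]
    return {"count": count, "sum": total, "mean": total / count if count else None}


raster = [[r * 4 + c for c in range(4)] for r in range(4)]
zone = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(zonal_stats(raster, zone))
```

Each pixel in the bounding box costs one full point-in-polygon test here, so the work grows with both the pixel count and the number of polygon edges. That per-pixel cost is what makes the operation expensive at scale.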