Conference Paper
PDF Available

R-Trees: A Dynamic Index Structure for Spatial Searching

Authors: Antonin Guttman

Abstract

In order to handle spatial data efficiently, as required in computer-aided design and geo-data applications, a database system needs an index mechanism that will help it retrieve data items quickly according to their spatial locations. However, traditional indexing methods are not well suited to data objects of non-zero size located in multi-dimensional spaces. In this paper we describe a dynamic index structure called an R-tree which meets this need, and give algorithms for searching and updating it. We present the results of a series of tests which indicate that the structure performs well, and conclude that it is useful for current database systems in spatial applications.
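The retrieval idea at the heart of the structure is that every node entry carries a bounding rectangle, and a search descends only into subtrees whose rectangles overlap the query window. Below is a minimal Python sketch of that traversal; the Rect and Node layouts are illustrative assumptions, not the paper's exact record formats.

```python
from dataclasses import dataclass, field

@dataclass
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        # Rectangles overlap iff they overlap on every axis.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

@dataclass
class Node:
    is_leaf: bool
    # Each entry pairs a bounding rectangle with a child node
    # (internal node) or a stored object's identifier (leaf node).
    entries: list = field(default_factory=list)

def search(node: Node, query: Rect, results: list) -> None:
    """Collect all leaf entries whose rectangles intersect the query window."""
    for rect, child in node.entries:
        if rect.intersects(query):
            if node.is_leaf:
                results.append(child)
            else:
                search(child, query, results)

# Example: a single leaf holding two objects; only "a" overlaps the query.
root = Node(is_leaf=True, entries=[(Rect(0, 0, 2, 2), "a"), (Rect(5, 5, 7, 7), "b")])
hits: list = []
search(root, Rect(1, 1, 3, 3), hits)  # hits == ["a"]
```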
... The same applies to the R-tree, a data structure similar to a one-dimensional B-tree but supporting multiple dimensions [20]. In an R-tree, each node represents a bounding rectangle that encloses a group of points or other bounding rectangles. ...
... Here, the number of bins should stay below k = 6. We want to add the values 10,11,42,43,23,22,20,45,46,16,18,12,19,15 to the node in this order. At the beginning, the bin size is 1. ...
Preprint
Full-text available
In previous work, we have presented an approach to index 3D LiDAR point clouds in real time, i.e. while they are being recorded. We have further introduced a novel data structure called M3NO, which allows arbitrary attributes to be indexed directly during data acquisition. Based on this, we now present an integrated approach that supports not only real-time indexing but also visualization with attribute filtering. We specifically focus on large datasets from airborne and land-based mobile mapping systems. Compared to traditional indexing approaches running offline, the M3NO is created incrementally. This enables dynamic queries based on spatial extent and value ranges of arbitrary attributes. The points in the data structure are assigned to levels of detail (LOD), which can be used to create interactive visualizations. This is in contrast to other approaches, which focus on either spatial or attribute indexing, only support a limited set of attributes, or do not support real-time visualization. Using several publicly available large data sets, we evaluate the approach, assess quality and query performance, and compare it with existing state-of-the-art indexing solutions. The results show that our data structure is able to index 5.24 million points per second. This is more than most commercially available laser scanners can record and proves that low-latency visualization during the capturing process is possible.
... Multi-dimensional data management and analytics are crucial in a wide range of fields, including business intelligence [15], smart transportation [93], neuroscience [74], climate studies [23], etc. As data volumes grow at an exponential rate, conventional multi-dimensional indexes, such as the R-tree [33] and its variants [4,9,40,77], have been designed to accelerate data access and query processing in large multi-dimensional databases. ...
... For low-dimensional data, conventional indexes can be classified into three classes based on different space partition strategies: R-tree variants, kd-tree variants, and grids. ❶ R-trees [33] organize data using minimum bounding rectangles (MBRs) to encapsulate spatial objects, where each internal node represents an MBR covering its child nodes. Variants of the R-tree, such as the R*-tree [9], STR-tree [46], and Hilbert R-tree [40], focus on minimizing overlap between nodes through different packing strategies, improving query performance. ...
Article
Full-text available
Efficient indexing is fundamental to managing and analyzing multi-dimensional data. A growing trend is to directly learn the storage layout of multi-dimensional data using simple machine learning models, leading to the concept of Learned Index. Compared to conventional indexing methods that have been used for decades (e.g., kd-tree and R-tree variants), learned indexes have demonstrated empirical advantages in both space and time efficiency on modern architectures. However, there is a lack of comprehensive evaluation across existing multi-dimensional learned indexes under a standardized benchmark, making it challenging to identify the most suitable index for specific data types and query patterns. This gap also hinders the widespread adoption of learned indexes in practical applications. In this paper, we present the first in-depth empirical study to answer the question: how good are multi-dimensional learned indexes? We evaluate ten recently published indexes under a unified experimental framework, which includes standardized implementations, datasets, query workloads, and evaluation metrics. We thoroughly investigate the evaluation results and discuss the findings that may provide insights for future learned index design.
... We design a new data structure stream segment tree to efficiently handle stream updates. It is much faster here than the interval tree [54] and the R-tree [23]. • Hit set pruning algorithm. ...
... Comparison of data structures. We compare our stream segment tree with the interval tree [54] (using AVL balancing [3]) and the (1-dimensional) R-tree [23] (using B-tree balancing [8]) in terms of the Append operation. We report in Figure 7 the cumulative latency and the cumulative number of Stitch operations in Append because the Stitch operation is the bottleneck during Append. ...
Preprint
Tucker decomposition has been widely used in a variety of applications to obtain latent factors of tensor data. In these applications, a common need is to compute Tucker decomposition for a given time range. Furthermore, real-world tensor time series are typically evolving in the time dimension. Such needs call for a data structure that can efficiently and accurately support range queries of Tucker decomposition and stream updates. Unfortunately, existing methods do not support either range queries or stream updates. This challenging problem has remained open for years prior to our work. To solve this challenging problem, we propose TUCKET, a data structure that can efficiently and accurately handle both range queries and stream updates. Our key idea is to design a new data structure that we call a stream segment tree by generalizing the segment tree, a data structure that was originally invented for computational geometry. For a range query of length L, our TUCKET can find O(log L) nodes (called the hit set) from the tree and efficiently stitch their preprocessed decompositions to answer the range query. We also propose an algorithm to optimally prune the hit set via an approximation of subtensor decomposition. For the T-th stream update, our TUCKET modifies only amortized O(1) nodes and only O(log T) nodes in the worst case. Extensive evaluation demonstrates that our TUCKET consistently achieves the highest efficiency and accuracy across four large-scale datasets. Our TUCKET achieves at least 3 times lower latency and at least 1.4 times smaller reconstruction error than Zoom-Tucker on all datasets.
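TUCKET's stream segment tree generalizes the classic segment tree, and the O(log L) hit set mentioned above corresponds to the canonical node cover of a query range in that classic structure. A small Python sketch of this decomposition for a plain segment tree over [0, n) follows; it illustrates only the covering step, not TUCKET's decomposition stitching.

```python
def canonical_cover(lo, hi, node_lo, node_hi, node_id, out):
    """Collect the canonical segment-tree nodes that exactly cover [lo, hi)."""
    if hi <= node_lo or node_hi <= lo:
        return  # query and node ranges are disjoint
    if lo <= node_lo and node_hi <= hi:
        out.append((node_id, node_lo, node_hi))  # fully covered: one hit-set node
        return
    mid = (node_lo + node_hi) // 2
    canonical_cover(lo, hi, node_lo, mid, 2 * node_id, out)
    canonical_cover(lo, hi, mid, node_hi, 2 * node_id + 1, out)

cover = []
canonical_cover(3, 11, 0, 16, 1, cover)
# cover holds the O(log n) disjoint ranges [3,4), [4,8), [8,10), [10,11);
# a range query stitches the answers precomputed at these nodes.
```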
... When a match is found, this indicates that a similar object is already present in the global map and must be updated with the information provided by the new object. In our approach, in addition to the global semantic map, objects are organized in the R-tree data structure [103]. In an R-tree, objects are grouped into ... To identify the match for a new object, we apply an overlap search in the R-tree structure. ...
Thesis
Full-text available
Today, the efficient management of indoor spaces is essential for multiple aspects, including security, evacuation, and the deployment of robotic solutions. Digital twin (DT) technology is emerging as a cutting-edge solution for this purpose. It creates a real-time link between the physical environment and its virtual replica, enabling comprehensive monitoring, simulation, analysis, and enhancement of real-world performance. This thesis presents a novel approach to semantic mapping, enhancing the DT with a semantic map that incorporates contextual information about the environment. Firstly, it reviews state-of-the-art semantic mapping approaches. Secondly, it introduces a new method of semantic mapping using a mobile robot equipped with an RGB-D camera to acquire real-time semantic information about objects. This method integrates object detection, scene segmentation, and a computational geometry algorithm to generate detailed point clouds and define object occupancy zones. Prior knowledge (PK), such as object categories and 3D CAD models, is incorporated to validate data and estimate occupancy zones using optimal bounding boxes (OBB). Thirdly, this research proposes a collaborative approach to maintaining the semantic map in the DT. Autonomous mobile robots continuously update this map by generating individual semantic maps that are communicated to the DT, gradually integrating them into the existing semantic map. Data consistency is ensured through contextual validation, while information updating is guaranteed by a data fusion technique based on spatial and semantic correspondences. Experimental results in various office environments demonstrate significant improvements in object occupancy estimates, effective management of changes such as object addition, movement, and removal within indoor environments, and highlight the benefits of integrating PK.
... Notably, when the vectors possess identical norms, solving the MIPS problem is equivalent to performing NNS. In lower-dimensional spaces, exact solutions like kd-trees (Bentley, 1975) and R-trees (Guttman, 1984) can provide an exact nearest neighbor solution. However, these methods suffer from inefficient indexing when the dimensionality exceeds 20. ...
Preprint
The softmax function is a cornerstone of multi-class classification, integral to a wide range of machine learning applications, from large-scale retrieval and ranking models to advanced large language models. However, its computational cost grows linearly with the number of classes, which becomes prohibitively expensive in scenarios with millions or even billions of classes. The sampled softmax, which relies on self-normalized importance sampling, has emerged as a powerful alternative, significantly reducing computational complexity. Yet, its estimator remains unbiased only when the sampling distribution matches the true softmax distribution. To improve both approximation accuracy and sampling efficiency, we propose the MIDX Sampler, a novel adaptive sampling strategy based on an inverted multi-index approach. Concretely, we decompose the softmax probability into several multinomial probabilities, each associated with a specific set of codewords and the last associated with the residual score of queries, thus reducing time complexity to the number of codewords instead of the number of classes. To further boost efficiency, we replace the query-specific residual probability with a simple uniform distribution, simplifying the computation while retaining high performance. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds. The results demonstrate that a smaller divergence from the ideal softmax distribution leads to faster convergence and improved generalization. Extensive experiments on large-scale language models, sequential recommenders, and extreme multi-class classification tasks confirm that the MIDX-Sampler delivers superior effectiveness and efficiency compared to existing approaches.
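For reference, the self-normalized estimator that the sampled softmax relies on can be stated compactly. In one standard formulation (notation assumed here, not necessarily the paper's), a subset S of classes is drawn from a proposal distribution q and the logits are corrected by subtracting log q:

$$
\hat{p}_i \;=\; \frac{\exp(s_i - \log q_i)}{\sum_{k \in S} \exp(s_k - \log q_k)}, \qquad S \sim q .
$$

The induced gradient estimator is exact only when q matches the true softmax distribution, which is the bias the MIDX Sampler aims to reduce by making q adaptive.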
Article
Full-text available
Vector data is prevalent across business and scientific applications, and its popularity is growing with the proliferation of learned embeddings. Vector data collections often reach billions of vectors with thousands of dimensions, thus, increasing the complexity of their analysis. Vector search is the backbone of many critical analytical tasks, and graph-based methods have become the best choice for analytical tasks that do not require guarantees on the quality of the answers. We briefly survey in-memory graph-based vector search, outline the chronology of the different methods and classify them according to five main design paradigms: seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer. We conduct an exhaustive experimental evaluation of twelve state-of-the-art methods on seven real data collections, with sizes up to 1 billion vectors. We share key insights about the strengths and limitations of these methods; e.g., the best approaches are typically based on incremental insertion and neighborhood diversification, and the choice of the base graph can hurt scalability. Finally, we discuss open research directions, such as the importance of devising more sophisticated data-adaptive seed selection and diversification strategies.
Article
Full-text available
Much research has recently been devoted to "multikey" searching problems. In this paper the particular multikey problem of range searching is investigated and a number of data structures that have been proposed as solutions to this problem are surveyed. The purposes of this paper are to bring together a collection of widely scattered results, to acquaint the reader with the structures currently available for solving the particular problem of range searching, and to display a set of general methods for attacking multikey searching problems.
Article
The 'Reduced Instruction Set Computer' (RISC) project investigates the implications of simplifying the instruction set of a general-purpose computer. For a VLSI implementation, such an approach makes effective use of the limited on-chip silicon resources by devoting them to those tasks which are used most frequently. The micro-architecture of a second nMOS implementation, RISC II, is presented. It evolved with the goal of attaining the highest performance in a register-to-register architecture, using a three-stage pipeline. The area-time tradeoffs are discussed in detail. Organization of the control section is presented, showing how its area was drastically reduced owing to the simple instruction set. Design metrics of RISC II are given.
Article
This document describes a technique for storing large sets of spatial objects so that proximity queries are handled efficiently as part of the accessing mechanism. This technique is based on a transformation of spatial objects into points in higher-dimensional spaces and on a data structure called the grid file. The grid file was designed to store highly dynamic sets of multi-dimensional data in such a way that it can be accessed using few disk accesses: a point query requires two disk accesses, a range query requires at most two disk accesses per data bucket retrieved. The efficiency of our technique is based on two facts: (1) many types of proximity queries lead to cone-shaped regions of the search space; and (2) the grid file allows an efficient enumeration of all the points in such a cone.
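The transformation the abstract refers to is easiest to see for 1-D intervals: the classic mapping sends an interval to a point in 2-D, and an intersection query becomes a quadrant-shaped (cone) region of the transformed space. A hedged Python sketch of just this mapping, with the grid-file machinery itself omitted:

```python
def interval_to_point(lo: float, hi: float) -> tuple:
    """Map a 1-D interval [lo, hi] to the 2-D point (lo, hi)."""
    return (lo, hi)

def in_query_cone(point: tuple, q_lo: float, q_hi: float) -> bool:
    """An interval intersects the query [q_lo, q_hi] iff lo <= q_hi and hi >= q_lo.
    In the transformed space this is the cone-shaped region of all points
    left of x = q_hi and above y = q_lo, which a grid file can enumerate."""
    lo, hi = point
    return lo <= q_hi and hi >= q_lo

p = interval_to_point(2.0, 4.0)
in_query_cone(p, 3.0, 9.0)  # True: [2, 4] overlaps [3, 9]
```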
Article
The report summarizes the progress made during a three-year period on research in data base management. The primary effort has been the design and development of a major data base management system of the relational type, INGRES. In addition, a number of new directions, such as distributed data bases and data base machines, were initiated.
Conference Paper
This paper explores the use of one form of abstract data types in CAD data bases. Basically, new data types for columns of a relation, such as boxes, wires and polygons, become possible. Also explored is the possibility of secondary indices for new data types that can support existing and user-defined operators. The performance and query complexity considerations of these features are examined.
Article
This paper develops the multidimensional binary search tree (or k-d tree, where k is the dimensionality of the search space) as a data structure for storage of information to be retrieved by associative searches. The k-d tree is defined and examples are given. It is shown to be quite efficient in its storage requirements. A significant advantage of this structure is that a single data structure can handle many types of queries very efficiently. Various utility algorithms are developed; their proven average running times in an n record file are: insertion, O(log n); deletion of the root, O(n^((k-1)/k)); deletion of a random node, O(log n); and optimization (guarantees logarithmic performance of searches), O(n log n). Search algorithms are given for partial match queries with t keys specified [proven maximum running time of O(n^((k-t)/k))] and for nearest neighbor queries [empirically observed average running time of O(log n)]. These performances far surpass the best currently known algorithms for these tasks. An algorithm is presented to handle any general intersection query. The main focus of this paper is theoretical. It is felt, however, that k-d trees could be quite useful in many applications, and examples of potential uses are given.
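A minimal Python sketch of the structure described above: insertion cycles the discriminating coordinate with depth (axis = depth mod k), and a range search visits only subtrees the query box can reach. The layout is illustrative; the paper's full algorithm suite also covers deletion and optimization.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class KDNode:
    point: Tuple[float, ...]
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def insert(node: Optional[KDNode], point: Tuple[float, ...], depth: int = 0) -> KDNode:
    """Insert a point, discriminating on axis = depth mod k at each level."""
    if node is None:
        return KDNode(point)
    axis = depth % len(point)
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

def range_search(node, lo, hi, depth=0, out=None):
    """Report points inside the axis-aligned box [lo, hi]."""
    if out is None:
        out = []
    if node is None:
        return out
    axis = depth % len(node.point)
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    if lo[axis] < node.point[axis]:    # box may extend into the left subtree
        range_search(node.left, lo, hi, depth + 1, out)
    if hi[axis] >= node.point[axis]:   # box may extend into the right subtree
        range_search(node.right, lo, hi, depth + 1, out)
    return out

root = None
for p in [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1)]:
    root = insert(root, p)
range_search(root, (3, 2), (9, 6))  # -> [(5, 4), (9, 6)]
```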