Conference Paper

Approximation and Analytical Studies of Inter-clustering Performances of Space-Filling Curves

Abstract

A discrete space-filling curve provides a linear traversal/indexing of a multi-dimensional grid space. This paper presents an application of random walks to the study of inter-clustering of space-filling curves and an analytical study of the inter-clustering performance of the 2-dimensional Hilbert and z-order curve families. Two underlying measures are employed: the mean inter-cluster distance over all inter-cluster gaps and the mean total inter-cluster distance over all subgrids. We show how approximating the mean inter-cluster distance statistics of continuous multi-dimensional space-filling curves fits into the formalism of random walks, and derive exact formulas for the two statistics for both curve families. The excellent agreement between the approximate and true mean inter-cluster distance statistics suggests that the random walk may furnish an effective model for developing approximations to clustering and locality statistics for space-filling curves. Based on the analytical results, the asymptotic comparisons indicate that the z-order curve family performs better than the Hilbert curve family with respect to both statistics.
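
The two measures can be illustrated by brute force on small grids. The sketch below is not the paper's derivation. It assumes that a cluster is a maximal run of consecutive curve indices inside a subgrid, that an inter-cluster distance is the index difference between the last point of one cluster and the first point of the next, and that the subgrids are all 2^q × 2^q translates inside the 2^k × 2^k grid. All function names are hypothetical.

```python
def hilbert_index(k, x, y):
    """Index of grid point (x, y) on a 2^k x 2^k Hilbert curve."""
    n = 1 << k
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def z_index(k, x, y):
    """Index of (x, y) on a 2^k x 2^k z-order curve (bit interleaving, x lowest)."""
    d = 0
    for i in range(k):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d


def gaps(indices):
    """Inter-cluster distances: index jumps larger than 1 between sorted indices."""
    s = sorted(indices)
    return [b - a for a, b in zip(s, s[1:]) if b - a > 1]


def inter_cluster_stats(index_fn, k, q):
    """Return (mean distance over all gaps, mean total distance per subgrid)."""
    n, m = 1 << k, 1 << q
    all_gaps = []
    subgrids = 0
    for x0 in range(n - m + 1):
        for y0 in range(n - m + 1):
            idx = [index_fn(k, x, y)
                   for x in range(x0, x0 + m)
                   for y in range(y0, y0 + m)]
            all_gaps.extend(gaps(idx))
            subgrids += 1
    total = sum(all_gaps)
    mean_gap = total / len(all_gaps) if all_gaps else 0.0
    return mean_gap, total / subgrids


if __name__ == "__main__":
    for name, fn in (("Hilbert", hilbert_index), ("z-order", z_index)):
        mean_gap, mean_total = inter_cluster_stats(fn, k=5, q=2)
        print(f"{name}: mean gap = {mean_gap:.3f}, mean total = {mean_total:.3f}")
```

Small grids like these say nothing about the asymptotic comparison; the sketch only makes the two definitions concrete under the stated assumptions.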

... Empirical and analytical studies of clustering performances of various low-dimensional space-filling curves have been reported in the literature (see [8,11,13,19,20,27,31] for details). Generally, the Hilbert and z-order curve families exhibit good performance in this respect. ...
... For clustering performance based on inter-cluster statistics, Dai and Su [11] obtain the exact formulas for the following three statistics for two-dimensional \(H^2_k\) and \(Z^2_k\): (1) the summation of all inter-cluster distances over all \(2^q \times 2^q\) query subgrids, (2) the universe mean inter-cluster distance over all inter-cluster gaps from all \(2^q \times 2^q\) subgrids, and (3) the mean total inter-cluster distance over all \(2^q \times 2^q\) subgrids. Based on the analytical results, the asymptotic comparisons indicate that, for a two-dimensional grid space, the z-order curve family performs better than the Hilbert curve family with respect to the statistics. ...
Article
Full-text available
A discrete space-filling curve provides a one-dimensional indexing or traversal of a multi-dimensional grid space. Sample applications of space-filling curves include multi-dimensional indexing methods, data structures and algorithms, parallel computing, and image compression. Common measures for the applicability of space-filling curve families are locality and clustering. Locality preservation reflects proximity between grid points, that is, close-by grid points are mapped to close-by indices or vice versa. We present analytical and empirical studies on the locality properties of the two-dimensional Hilbert curve family. The underlying locality measure, based on the p-normed metric \(d_{p}\), is the maximum ratio of \(d_{p}(v, u)^{m}\) to \(d_{p}({\tilde{v}}, {\tilde{u}})\) over all corresponding point-pairs (v, u) and \(({\tilde{v}}, {\tilde{u}})\) in the m-dimensional grid space and one-dimensional index space, respectively. Our analytical results close the gaps between the current best lower and upper bounds with exact formulas for \(p \in \{1, 2\}\), and extend to all reals \(p \ge 2\). We also verify the results with computer programs over various grid-orders and p-values. Our empirical results will shed some light on determining the exact formulas for the locality measure for all reals \(p \in (1, 2)\).
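
The locality measure described above can be evaluated exhaustively for small grid-orders. The following sketch assumes m = 2, takes the distance in the one-dimensional index space to be the absolute index difference, and uses a standard iterative Hilbert conversion that may differ from the authors' curve orientation. The names `d_p` and `max_locality_ratio` are hypothetical.

```python
import math
from itertools import combinations


def hilbert_index(k, x, y):
    """Index of grid point (x, y) on a 2^k x 2^k Hilbert curve."""
    n = 1 << k
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def d_p(u, v, p):
    """p-normed distance between two grid points (p = math.inf gives the max norm)."""
    dx, dy = abs(u[0] - v[0]), abs(u[1] - v[1])
    if math.isinf(p):
        return float(max(dx, dy))
    return (dx ** p + dy ** p) ** (1.0 / p)


def max_locality_ratio(k, p, m=2):
    """Maximum of d_p(v, u)^m / |index(v) - index(u)| over all distinct point pairs."""
    n = 1 << k
    index = {(x, y): hilbert_index(k, x, y) for x in range(n) for y in range(n)}
    best = 0.0
    for u, v in combinations(index, 2):
        ratio = d_p(u, v, p) ** m / abs(index[u] - index[v])
        best = max(best, ratio)
    return best


if __name__ == "__main__":
    for p in (1, 2, math.inf):
        for k in (1, 2, 3, 4):
            print(f"p = {p}, k = {k}: {max_locality_ratio(k, p):.6f}")
```

The exhaustive search is quadratic in the number of grid points, so it is only practical for small k; it is meant for spot-checking, not as a replacement for the exact formulas.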
... 3). In other words, the Hilbert curve is everywhere dense in the region in which it is constructed; that is, it passes through an arbitrarily small neighborhood of every point of that region [13,14]. The Hilbert curve maps the interval [0; 1] onto an n-dimensional hyperparallelepiped. ...
... Fig. 13 ...
... Parallel grid-based clustering further divides cells into sub-cells, processes each sub-cell, and combines the individual results to build the final clusters [28,34]. This approach is particularly useful for mining spatial data, as the grid can be defined using Hilbert or Z-order space-filling curves [3,13]. ...
Conference Paper
Full-text available
Clustering is a fundamental task in Knowledge Discovery and Data mining. It aims to discover the unknown nature of data by grouping together data objects that are more similar. While hundreds of clustering algorithms have been proposed, many are complex and do not scale well as more data become available, making them inadequate for analyzing very large datasets. In addition, many clustering algorithms are sequential, thus inherently difficult to parallelize. We propose PatchWork, a novel clustering algorithm to address those issues. PatchWork is a distributed density clustering algorithm with linear computational complexity and linear horizontal scalability. It presents several desirable characteristics in knowledge discovery; in particular, it does not require the number of clusters to be specified a priori, and it offers a natural protection against outliers and noise. In addition, PatchWork makes it possible to discover spatially large clusters instead of dense clusters only. PatchWork relies on the map/reduce paradigm to parallelize computations and was implemented using Apache Spark, the distributed computation framework. As a result, PatchWork can cluster a billion points in only a few minutes, a 40x improvement over the distributed implementation of k-means in Spark MLLib.
Chapter
A discrete space-filling curve provides a linear traversal or indexing of a multi-dimensional grid space. This paper presents two analytical studies on clustering analyses of the 2-dimensional Hilbert and z-order curve families. The underlying measure is the mean number of clusters over all identically shaped subgrids. We derive the exact formulas for the clustering statistics for the 2-dimensional Hilbert and z-order curve families. The exact results allow us to compare their relative performances with respect to this measure: when the grid-order is sufficiently larger than the subgrid-order (the typical scenario for most applications), the Hilbert curve family performs significantly better than the z-order curve family.
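
A brute-force version of this clustering measure is easy to state in code. The sketch below assumes that a cluster is a maximal run of consecutive curve indices inside a subgrid and that "identically shaped subgrids" means all 2^q × 2^q translates inside the 2^k × 2^k grid; the function names are hypothetical, and the Hilbert conversion is the standard iterative one, not necessarily the orientation used in the chapter.

```python
def hilbert_index(k, x, y):
    """Index of grid point (x, y) on a 2^k x 2^k Hilbert curve."""
    n = 1 << k
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def z_index(k, x, y):
    """Index of (x, y) on a 2^k x 2^k z-order curve (bit interleaving, x lowest)."""
    d = 0
    for i in range(k):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d


def count_clusters(indices):
    """Number of maximal runs of consecutive integers in a non-empty index set."""
    s = sorted(indices)
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b - a > 1)


def mean_clusters(index_fn, k, q):
    """Mean number of clusters over all 2^q x 2^q subgrids of a 2^k x 2^k grid."""
    n, m = 1 << k, 1 << q
    total = 0
    subgrids = 0
    for x0 in range(n - m + 1):
        for y0 in range(n - m + 1):
            idx = [index_fn(k, x, y)
                   for x in range(x0, x0 + m)
                   for y in range(y0, y0 + m)]
            total += count_clusters(idx)
            subgrids += 1
    return total / subgrids


if __name__ == "__main__":
    for k, q in ((4, 1), (5, 2), (6, 2)):
        h = mean_clusters(hilbert_index, k, q)
        z = mean_clusters(z_index, k, q)
        print(f"k = {k}, q = {q}: Hilbert = {h:.4f}, z-order = {z:.4f}")
```

Even on a 4 × 4 grid with 2 × 2 subgrids the Hilbert curve averages fewer clusters than the z-order curve under these assumptions (14/9 versus 2), which matches the direction of the comparison stated above.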
Conference Paper
A discrete space-filling curve provides a 1-dimensional indexing or traversal of a multi-dimensional grid space. Applications of space-filling curves include multi-dimensional indexing methods, parallel computing, and image compression. Common goodness-measures for the applicability of space-filling curve families are locality and clustering. Locality reflects proximity preservation: close-by grid points are mapped to close-by indices or vice versa. We present an analytical study on the locality property of the 2-dimensional Hilbert curve family. The underlying locality measure, based on the p-normed metric \(d_{p}\), is the maximum ratio of \(d_{p}(u, v)^{m}\) to \(d_{p}(\tilde{u}, \tilde{v})\) over all corresponding point-pairs (u, v) and \((\tilde{u}, \tilde{v})\) in the m-dimensional grid space and 1-dimensional index space, respectively. Our analytical results identify all candidate representative grid-point pairs (realizing the locality-measure values) for all real norm-parameters in the interval [1, 2] and all grid-orders. Together with the known results for other norm-parameter values, we have almost complete knowledge of the locality measure of 2-dimensional Hilbert curves over the entire spectrum of possible norm-parameter values.
Conference Paper
A discrete space-filling curve provides a linear traversal or indexing of a multi-dimensional grid space. This paper presents an analytical study of the clustering performance of the 3-dimensional Hilbert curve family. The underlying measure is the mean number of clusters over all identically shaped cubic subgrids. We derive an exact formula for the statistics for the Hilbert curve family, and have verified all exact formulas (intermediate and final) involved in the derivations in the analytical study with computer programs over various grid- and subgrid-orders.
Article
Full-text available
The geometric structural complexity of spatial objects does not render an intuitive distance metric on the data space that measures spatial proximity. However, such a metric provides a formal basis for analytical work in transformation-based multidimensional spatial access methods, including locality preservation of the underlying transformation and distance-based spatial queries. We study the Hausdorff distance metric on the space of multidimensional polytopes, and prove a tight relationship between the metric on the original space of k-dimensional hyperrectangles and the standard p-normed metric on the transform space of 2k-dimensional points under the corner transformation, which justifies the effectiveness of the transformation-based technique in preserving spatial locality.
Conference Paper
A discrete space-filling curve provides a linear traversal or indexing of a multi-dimensional grid space. We present an analytical study of the locality properties of the m-dimensional k-order discrete Hilbert and z-order curve families, \(\{H^m_k | k = 1,2,...\}\) and \(\{Z^m_k | k = 1,2,...\}\), respectively, based on the locality measure \(L_\delta\) that cumulates all index-differences of point-pairs at a common 1-normed distance δ. We derive the exact formulas for \(L_\delta(H^m_k)\) and \(L_\delta(Z^m_k)\) for m = 2 and arbitrary δ that is an integral power of 2, and for m = 3 and δ = 1. The results yield a constant asymptotic ratio \(\lim_{k\rightarrow\infty}\frac{L_\delta(H^m_k)}{L_\delta(Z^m_k)} > 1\), which suggests that the z-order curve family performs better than the Hilbert curve family over the considered parameter ranges.
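
For m = 2 the cumulative measure can be computed directly on small grids. This sketch assumes that each unordered point-pair at 1-normed distance δ is counted once and that the index-difference is taken in absolute value; the abstract does not state either convention, and the names `offsets` and `cumulative_locality` are hypothetical.

```python
def hilbert_index(k, x, y):
    """Index of grid point (x, y) on a 2^k x 2^k Hilbert curve."""
    n = 1 << k
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def z_index(k, x, y):
    """Index of (x, y) on a 2^k x 2^k z-order curve (bit interleaving, x lowest)."""
    d = 0
    for i in range(k):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d


def offsets(delta):
    """Half of the offsets with |dx| + |dy| = delta, so each pair is visited once."""
    out = [(0, delta)]
    for dx in range(1, delta + 1):
        dy = delta - dx
        out.append((dx, dy))
        if dy != 0:
            out.append((dx, -dy))
    return out


def cumulative_locality(index_fn, k, delta):
    """Sum of |index differences| over unordered pairs at 1-normed distance delta."""
    n = 1 << k
    total = 0
    for x in range(n):
        for y in range(n):
            for dx, dy in offsets(delta):
                u, v = x + dx, y + dy
                if 0 <= u < n and 0 <= v < n:
                    total += abs(index_fn(k, x, y) - index_fn(k, u, v))
    return total


if __name__ == "__main__":
    for k in (2, 3, 4, 5, 6):
        h = cumulative_locality(hilbert_index, k, 1)
        z = cumulative_locality(z_index, k, 1)
        print(f"k = {k}: L_1(H) = {h}, L_1(Z) = {z}, ratio = {h / z:.4f}")
```

For k = 2 and δ = 1 this gives 64 for the Hilbert curve and 60 for the z-order curve under the stated conventions, so the ratio already exceeds 1 on a 4 × 4 grid.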
Conference Paper
A discrete space-filling curve provides a linear indexing or traversal of a multi-dimensional grid space. We present an analytical study on the locality properties of the 2-dimensional Hilbert curve family. The underlying locality measure, based on the p-normed metric \(d_{p}\), is the maximum ratio of \(d_{p}(v, u)^{m}\) to \(d_{p}(\tilde{v},\tilde{u})\) over all corresponding point-pairs (v, u) and \((\tilde{v},\tilde{u})\) in the m-dimensional grid space and (1-dimensional) index space, respectively. Our analytical results close the gaps between the current best lower and upper bounds with exact formulas for \(p \in \{1, 2\}\), and extend to all reals \(p \ge 2\).
Article
A spatial join is a query that searches for a set of object pairs satisfying a given spatial relationship from a database. It is one of the most costly queries, and thus requires an efficient processing algorithm that fully exploits the features of the underlying spatial indexes. In our earlier work, we devised a fairly effective algorithm for processing spatial joins with double transformation (DOT) indexing, which is one of several spatial indexing schemes. However, the algorithm is restricted to only the one-dimensional cases. In this paper, we extend the algorithm for the two-dimensional cases, which are general in Geographic Information Systems (GIS) applications. We first extend DOT to two-dimensional original space. Next, we propose an efficient algorithm for processing range queries using extended DOT. This algorithm employs the quarter division technique and the tri-quarter division technique devised by analyzing the regularity of the space-filling curve used in DOT. This greatly reduces the number of space transformation operations. We then propose a novel spatial join algorithm based on this range query processing algorithm. In processing a spatial join, we determine the access order of disk pages so that we can minimize the number of disk accesses. We show the superiority of the proposed method by extensive experiments using data sets of various distributions and sizes. The experimental results reveal that the proposed method improves the performance of spatial join processing up to three times in comparison with the widely-used R-tree-based spatial join method.
Article
Full-text available
Several schemes for the linear mapping of a multidimensional space have been proposed for various applications, such as access methods for spatio-temporal databases and image compression. In these applications, one of the most desired properties of such linear mappings is clustering, which means that the locality between objects in the multidimensional space is preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering (Abel and Mark, 1990; Jagadish, 1990). We analyze the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape (e.g., polygons and polyhedra). Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work. They agree with the empirical results that the number of clusters depends on the hypersurface area of the query region and not on its hypervolume. We also show that the Hilbert curve achieves better clustering than the z curve. From a practical point of view, the formulas given provide a simple measure that can be used to predict the required disk access behaviors and, hence, the total access time.
J. Alber. Locality properties of discrete space-filling curves: Results with relevance for computer science (in German). Studienarbeit Universität Tübingen, Wilhelm-Schickard-Institut für Informatik. July 1997.