Wenzhe Yang’s research while affiliated with Wuhan University and other places


Publications (5)


A Unified Approach for Multi-Granularity Search over Spatial Datasets
  • Preprint
  • December 2024 · 8 Reads
  • Wenzhe Yang · Shixun Huang · [...] · Zhiyong Peng

There has been increased interest in data search as a means of finding relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they treat the two types of search independently. To enable search operations ranging from the coarse-grained dataset level to the fine-grained data point level, we provide an integrated approach that supports diverse query types and distance metrics. In this paper, we focus on designing a multi-granularity spatial data search system, called Spadas, that supports both dataset and data point search operations. To address the high cost of indexing and the susceptibility to outliers, we propose a unified index that can drastically improve query efficiency in various scenarios by organizing data reasonably and removing outliers from datasets. Moreover, to accelerate all data search operations, we propose a set of pruning mechanisms based on the unified index, including fast bound estimation, an approximation technique with an error bound, and batch pruning, to effectively filter out non-relevant datasets and points. Finally, we report the results of a detailed experimental evaluation on six spatial data repositories, in which Spadas runs orders of magnitude faster than state-of-the-art algorithms, and we demonstrate its effectiveness through a case study. An online spatial data search system based on Spadas is also implemented and made accessible to users.
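The idea of serving both dataset-level and point-level queries from one index can be illustrated with a toy sketch. This is not the Spadas index, whose structure the abstract does not describe; it only assumes, for illustration, that each dataset stores a minimum bounding rectangle (MBR) next to its points, and all names (`build_index`, `dataset_search`, `point_search`) are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle of 2D points as (xmin, ymin, xmax, ymax)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(box, query):
    """True if two axis-aligned rectangles overlap (touching counts)."""
    return not (box[2] < query[0] or query[2] < box[0]
                or box[3] < query[1] or query[3] < box[1])


def build_index(repo):
    """Hypothetical two-level index: dataset name -> (MBR, points)."""
    return {name: (mbr(pts), pts) for name, pts in repo.items()}


def dataset_search(index, query):
    """Coarse-grained search: datasets whose MBR overlaps the query rectangle."""
    return [name for name, (box, _) in index.items() if intersects(box, query)]


def point_search(index, query):
    """Fine-grained search: scan only the datasets that survive the MBR test."""
    hits = []
    for name in dataset_search(index, query):
        for x, y in index[name][1]:
            if query[0] <= x <= query[2] and query[1] <= y <= query[3]:
                hits.append((name, (x, y)))
    return hits


repo = {"parks": [(0, 0), (2, 2)], "stops": [(5, 5), (6, 6)]}
index = build_index(repo)
query = (1, 1, 3, 3)
print(dataset_search(index, query))
print(point_search(index, query))
```

The point-level query reuses the dataset-level filter, so datasets whose MBR misses the query rectangle are never scanned.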


Budgeted Spatial Data Acquisition: When Coverage and Connectivity Matter

  • December 2024 · 8 Reads

Data is undoubtedly becoming a commodity like oil, land, and labor in the 21st century. Although there have been many successful marketplaces for data trading, existing data marketplaces do not consider the case where buyers want to acquire a collection of datasets (instead of one) and where the overall spatial coverage and connectivity matter. In this paper, we make the first attempt to formulate this problem as Budgeted Maximum Coverage with Connectivity Constraint (BMCC), which aims to acquire a dataset collection with the maximum spatial coverage under a limited budget while maintaining spatial connectivity. To solve the problem, we propose two approximate algorithms with detailed theoretical guarantees and time complexity analysis, followed by two acceleration strategies that further improve their efficiency. Experiments are conducted on five real-world spatial dataset collections to verify the efficiency and effectiveness of our algorithms.
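To make the BMCC problem statement concrete, here is a minimal greedy heuristic. It is not one of the paper's two algorithms and carries no approximation guarantee. It assumes, purely for illustration, that each dataset is a set of integer grid cells with a positive price, and that two datasets count as connected when they share a cell or have 4-neighboring cells. The function name `greedy_bmcc` is hypothetical.

```python
def greedy_bmcc(datasets, budget):
    """datasets: dict name -> (set of (x, y) grid cells, positive price).

    Repeatedly buys the affordable dataset with the best ratio of newly
    covered cells to price, requiring every dataset after the first to
    touch the region already covered.
    """
    def touches(cells, region):
        # Assumed connectivity rule: shared cell or 4-neighbor adjacency.
        return any((x + dx, y + dy) in region
                   for (x, y) in cells
                   for dx, dy in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)))

    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for name, (cells, price) in datasets.items():
            if name in chosen or spent + price > budget:
                continue  # already bought, or over budget
            if chosen and not touches(cells, covered):
                continue  # would break spatial connectivity
            ratio = len(cells - covered) / price
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break
        chosen.append(best)
        covered |= datasets[best][0]
        spent += datasets[best][1]
    return chosen, len(covered), spent


datasets = {
    "a": ({(0, 0), (1, 0), (2, 0)}, 1.0),
    "b": ({(3, 0), (4, 0)}, 1.0),
    "c": ({(10, 10), (11, 10), (12, 10), (13, 10)}, 2.0),
}
print(greedy_bmcc(datasets, budget=2.0))
```

In this example the greedy pass buys `a`, then the adjacent `b`, and cannot afford the disconnected `c`. A plain ratio-greedy rule like this can be arbitrarily far from optimal, which is why guarantees need a more careful design.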




Fast dataset search with earth mover's distance

  • July 2022 · 18 Reads · 8 Citations
  • Proceedings of the VLDB Endowment

The amount of spatial data in open data portals has increased rapidly, raising the demand for spatial dataset search in large data repositories. In this paper, we tackle spatial dataset search by using the Earth Mover's Distance (EMD) to measure the similarity between datasets. EMD is a robust similarity measure between two distributions and has been successfully applied in multiple domains, such as image retrieval, document retrieval, and multimedia. However, existing EMD-based studies typically depend on a common filtering framework with a single pruning strategy, which still incurs a high search cost. To address this issue, we propose a Dual-Bound Filtering (DBF) framework to accelerate EMD-based spatial dataset search. Specifically, we represent datasets by Z-order histograms and organize them as nodes in a tree structure. During a query, two levels of filtering are conducted, based on pooling-based bounds and a TICT bound on EMD, to prune dissimilar datasets efficiently. We conduct experiments on four real-world spatial data repositories, and the experimental results demonstrate the efficiency and effectiveness of our DBF framework.
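The filter step of such a search can be sketched as follows. This is not the DBF framework: the abstract does not define the pooling-based or TICT bounds, so the sketch swaps in the classic centroid lower bound on EMD, which holds for histograms of equal total mass under a Euclidean ground distance. It assumes points lie in the unit square and that distances are measured in grid-cell units; all function names are hypothetical.

```python
import math


def z_order(ix, iy, bits):
    """Interleave the bits of cell indices (ix, iy) into a Z-order (Morton) code."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code


def decode(code, bits):
    """Inverse of z_order: recover the cell indices (ix, iy)."""
    ix = iy = 0
    for b in range(bits):
        ix |= ((code >> (2 * b)) & 1) << b
        iy |= ((code >> (2 * b + 1)) & 1) << b
    return ix, iy


def z_histogram(points, bits):
    """Normalized histogram over a 2^bits x 2^bits grid, keyed by Z-order code."""
    n = 1 << bits
    hist = {}
    for x, y in points:
        ix, iy = min(int(x * n), n - 1), min(int(y * n), n - 1)
        code = z_order(ix, iy, bits)
        hist[code] = hist.get(code, 0.0) + 1.0 / len(points)
    return hist


def centroid(hist, bits):
    """Mass-weighted mean cell position of a histogram."""
    cx = cy = 0.0
    for code, weight in hist.items():
        ix, iy = decode(code, bits)
        cx += weight * ix
        cy += weight * iy
    return cx, cy


def centroid_lower_bound(h1, h2, bits):
    """EMD(h1, h2) is at least the distance between the two centroids."""
    (x1, y1), (x2, y2) = centroid(h1, bits), centroid(h2, bits)
    return math.hypot(x1 - x2, y1 - y2)


def prune(query, candidates, bits, threshold):
    """Keep candidates that the lower bound cannot rule out at this threshold."""
    return [name for name, hist in candidates.items()
            if centroid_lower_bound(query, hist, bits) <= threshold]


bits = 2
q = z_histogram([(0.1, 0.1), (0.2, 0.2)], bits)
candidates = {
    "near": z_histogram([(0.3, 0.1)], bits),
    "far": z_histogram([(0.9, 0.9)], bits),
}
print(prune(q, candidates, bits, threshold=2.0))
```

Only the candidates that survive `prune` would need an exact EMD computation, which is the expensive step that a filter-and-refine pipeline tries to avoid.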

Citations (3)


... Yang et al. [10] tackle spatial dataset search using Earth Mover's Distance (EMD) to measure the similarity between datasets and propose a Dual-Bound Filtering (DBF) framework to accelerate EMD-based spatial dataset search. Li et al. [23] design the order-preserving encrypted similarity to achieve secure similarity calculation and propose the baseline search scheme PriDAS and the optimized search scheme PriDAS+. Mottin et al. [24] demonstrate the functionality of a query engine XQ, which enables top-k spatial dataset search and shows its applicability in various situations. ...

Reference:

Efficient Top-k Spatial Dataset Search Processing
Privacy-preserving Spatial Dataset Search in Cloud
  • Citing Conference Paper
  • October 2024

... The development of sensor technology, GPS-enabled mobile devices, and wireless communication brings prosperity to spatial data, and the emergence of a large number of spatial data has fueled the demand for searching spatial data [14], [26], [27], [46], [62], [67], [75]. Since a spatial dataset usually consists of a set of spatial data points [67], various spatial dataset search [15], [41], [42], [67] and data point search systems [8], [21], [28] have been developed. For example, Google Dataset Search [15] provides search capabilities over potentially all datasets published on the Web. ...

EDSS: An Exemplar Dataset Search Service over Encrypted Spatial Datasets
  • Citing Conference Paper
  • July 2024

... Examples include statistical properties, temporal patterns, geometric and feature space characteristics, Distance-based measures, and domain-specific features [118]. In statistical approach, similarity of datasets can be compared through either descriptive statistics (e.g., mean, variance, skewness, and kurtosis) or distributional similarity using metrics like the Kolmogorov-Smirnov (KS) test [119], Earth Mover's Distance (EMD) [120], or Maximum Mean Discrepancy (MMD) [121]. In temporal patterns, the similarity can be measured by either calculating autocorrelation and partial autocorrelation functions, or decomposition techniques like STL decomposition to compare seasonality and trends in the dataset. ...

Fast dataset search with earth mover's distance
  • Citing Article
  • July 2022

Proceedings of the VLDB Endowment