Preprint

A Unified Approach for Multi-granularity Search over Spatial Datasets


Abstract

There has been increased interest in data search as a means to find relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they consider the two types of searches independently. To enable search operations ranging from the coarse-grained dataset level to the fine-grained data point level, we provide an integrated approach that supports diverse query types and distance metrics. In this paper, we focus on designing a multi-granularity spatial data search system, called Spadas, that supports both dataset and data point search operations. To address the challenges of the high cost of indexing and susceptibility to outliers, we propose a unified index that can drastically improve query efficiency in various scenarios by organizing data reasonably and removing outliers in datasets. Moreover, to accelerate all data search operations, we propose a set of pruning mechanisms based on the unified index, including fast bound estimation, an approximation technique with an error bound, and batch pruning techniques, to effectively filter out non-relevant datasets and points. Finally, we report the results of a detailed experimental evaluation using six spatial data repositories, showing that Spadas is orders of magnitude faster than state-of-the-art algorithms, and we demonstrate its effectiveness through a case study. An online spatial data search system based on Spadas has also been implemented and made accessible to users.
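The abstract does not spell out the pruning machinery, but the flavor of dataset-level filtering can be sketched in a few lines of Python. The minimum-bounding-rectangle (MBR) bound below is a generic geometric lower bound chosen for illustration, not the paper's actual fast bound estimation:

import numpy as np

def mbr(points):
    # Minimum bounding rectangle of a point set: (lower corner, upper corner).
    pts = np.asarray(points, dtype=float)
    return pts.min(axis=0), pts.max(axis=0)

def mbr_min_dist(a, b):
    # Lower bound on the distance between any point in MBR a and any point in MBR b.
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    gap = np.maximum(0.0, np.maximum(b_lo - a_hi, a_lo - b_hi))  # per-axis gap, 0 on overlap
    return float(np.linalg.norm(gap))

def prune_datasets(query_points, datasets, threshold):
    # Keep only datasets whose lower bound does not already exceed the threshold.
    q = mbr(query_points)
    return [d for d in datasets if mbr_min_dist(q, mbr(d)) <= threshold]

Any dataset filtered out here can be skipped entirely; the surviving candidates would then be refined with an exact distance computation.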


References
Article
Full-text available
With the proliferation of location-based social media, it is of great importance to provide web users and mobile users with timely and high-quality information. In this light, we study the problem of continuous spatial keyword search over a stream of spatio-temporal messages, taking spatial relevance, textual relevance, and result diversification into consideration. We define a novel continuous query named the Diversified Continuous Spatial Keyword (DCSK) query. A DCSK query consists of a spatial region, a set of query keywords, a similarity threshold 𝜃, and the number of query results k. Given a stream of spatio-temporal messages, the DCSK query continuously receives spatio-temporal messages such that: (1) they are located inside the query region and contain at least one query keyword, and (2) the similarities between each message and its previous k messages are all less than 𝜃. Compared to the traditional continuous spatial keyword query, the DCSK query can provide subscribers with spatio-temporal messages of higher quality because it takes both spatio-temporal relevance and query result diversification into consideration. We develop a Spatio-temporal Diversified Publish/Subscribe (STD-PS) framework to process a large number of DCSK queries efficiently. We conduct extensive experiments with real-world datasets. Our experimental results confirm the capability of our proposal in terms of result diversity, efficiency, and scalability.
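As a rough illustration of the two admission conditions above (just the per-message filter, not the STD-PS framework itself), consider the sketch below; the message fields, the region's contains test, and the similarity function sim are hypothetical placeholders:

def passes_dcsk_filter(msg, recent_k_results, theta, region, query_keywords, sim):
    # (1) Spatial and textual relevance: inside the query region, shares a keyword.
    if not region.contains(msg.location):
        return False
    if not set(msg.keywords) & set(query_keywords):
        return False
    # (2) Diversification: similarity to each of the previous k results is below theta.
    return all(sim(msg, prev) < theta for prev in recent_k_results)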
Conference Paper
Full-text available
Efficiently querying multiple spatial data sets is a growing challenge for scientists. Astronomers query data sets that contain different types of stars (e.g., dwarfs, giants, stragglers) while neuroscientists query different data sets that model different aspects of the brain in the same space (e.g., neurons, synapses, blood vessels). The results of each query determine the combination of data sets to be queried next. Not knowing a priori the queried data sets makes it hard to choose an efficient indexing strategy. In this paper, we show that indexing and querying the data sets separately incurs considerable overhead but so does using one index for all data sets. We therefore develop STITCH, a novel index structure for the scalable execution of spatial range queries on multiple data sets. Instead of indexing all data sets separately or indexing all of them together, the key insight we use in STITCH is to partition all data sets individually and to connect them to the same reference space. By doing so, STITCH only needs to query the reference space and follow the links to the data set partitions to retrieve the relevant data. With experiments we show that STITCH scales with the number of data sets and outperforms the state-of-the-art by a factor of up to 12.3.
Article
Full-text available
The spatial join is a popular operation in spatial database systems and its evaluation is a well-studied problem. This paper reviews research and recent trends on spatial join evaluation. The complexity of different data types, the consideration of different join predicates, the use of modern commodity hardware, and support for parallel processing open the road to a number of interesting directions for future research, some of which we outline in the paper.
Article
Full-text available
Spatial keyword search has been playing an indispensable role in personalized route recommendation and geo-textual information retrieval. In this light, we conduct a survey of existing studies of spatial keyword search. We categorize existing works on spatial keyword search based on the types of their input data, output results, and methodologies. For each category, we summarize their common features in terms of input data, output result, indexing scheme, and search algorithms. In addition, we provide a detailed description of each study of spatial keyword search. This survey summarizes the findings of existing spatial keyword search studies, thus uncovering new insights that may guide software engineers as well as further research.
Article
Full-text available
PyOD is an open-source Python toolbox for performing scalable outlier detection on multivariate data. Uniquely, it provides access to a wide range of outlier detection algorithms, including established outlier ensembles and more recent neural network-based approaches, under a single, well-documented API designed for use by both practitioners and researchers. With robustness and scalability in mind, best practices such as unit testing, continuous integration, code coverage, maintainability checks, interactive examples and parallelization are emphasized as core components in the toolbox's development. PyOD is compatible with both Python 2 and 3 and can be installed through Python Package Index (PyPI) or https://github.com/yzhao062/pyod.
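Since PyOD exposes every detector behind one interface, a minimal usage example consistent with its documented API looks like the following (the kNN detector and the toy data are arbitrary choices):

import numpy as np
from pyod.models.knn import KNN  # any PyOD detector shares this interface

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # dense inlier cluster
               rng.uniform(-6, 6, (10, 2))])  # scattered outliers

clf = KNN(contamination=0.05)   # expected fraction of outliers
clf.fit(X)
print(clf.labels_[:10])           # 0 = inlier, 1 = outlier, on the training data
print(clf.decision_scores_[:10])  # raw outlier scores; higher means more anomalous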
Conference Paper
Full-text available
This paper studies the location-based web search and aims to build a unified processing paradigm for two purposes: (1) efficiently support each of the various types of location-based queries (kNN query, top-k spatial-textual query, etc.) on two major forms of geo-tagged data, i.e., spatial point data such as geo-tagged web documents, and spatial trajectory data such as a sequence of geo-tagged travel blogs by a user; (2) support interactive search to provide quick response for a query session, within which a user usually keeps refining her query by either issuing different query types or specifying different constraints (e.g., adding a keyword and/or location, changing the choice of k, etc.) until she finds the desired results. To achieve this goal, we first propose a general Top-k query called Monotone Aggregate Spatial Keyword query-MASK, which is able to cover most types of location-based web search. Next, we develop a unified indexing (called Textual-Grid-Point Inverted Index) and query processing paradigm (called ETAIL Algorithm) to answer a single MASK query efficiently. Furthermore, we extend ETAIL to provide interactive search for multiple queries within one query session, by exploiting the commonality of textual and/or spatial dimension among queries. Last, extensive experiments on four real datasets verify the robustness and efficiency of our approach.
Article
Full-text available
The first successful isolation-based anomaly detector, i.e., iForest, uses trees as a means to perform isolation. Although it has been shown to have advantages over existing anomaly detectors, we have identified four weaknesses, i.e., its inability to detect local anomalies, anomalies with a high percentage of irrelevant attributes, anomalies that are masked by axis-parallel clusters, and anomalies in multimodal data sets. To overcome these weaknesses, this paper shows that an alternative isolation mechanism is required and thus presents iNNE, or isolation using a Nearest Neighbor Ensemble. Although relying on nearest neighbors, iNNE runs significantly faster than existing nearest neighbor-based methods such as the local outlier factor, especially on data sets having thousands of dimensions or millions of instances. This is because the proposed method has linear time complexity and constant space complexity.
Article
Full-text available
Visual exploration of spatial data relies heavily on spatial aggregation queries that slice and summarize the data over different regions. These queries comprise computationally-intensive point-in-polygon tests that associate data points to polygonal regions, challenging the responsiveness of visualization tools. This challenge is compounded by the sheer amounts of data, requiring a large number of such tests to be performed. Traditional pre-aggregation approaches are unsuitable in this setting since they fix the query constraints and support only rectangular regions. On the other hand, query constraints are defined interactively in visual analytics systems, and polygons can be of arbitrary shapes. In this paper, we convert a spatial aggregation query into a set of drawing operations on a canvas and leverage the rendering pipeline of the graphics hardware (GPU) to enable interactive response times. Our technique trades-off accuracy for response time by adjusting the canvas resolution, and can even provide accurate results when combined with a polygon index. We evaluate our technique on two large real-world data sets, exhibiting superior performance compared to index-based approaches.
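The point-in-polygon tests that the paper offloads to the GPU can be written on the CPU in a few lines; this sketch uses matplotlib's Path.contains_points as a stand-in for the rendering-based approach, with an arbitrary toy polygon:

import numpy as np
from matplotlib.path import Path

polygon = Path([(0, 0), (4, 0), (4, 3), (0, 3)])  # polygonal region boundary
rng = np.random.default_rng(0)
points = rng.random((100_000, 2)) * 5             # data points to aggregate

inside = polygon.contains_points(points)          # one point-in-polygon test per point
print(inside.sum())                               # count aggregated over the region

It is exactly this per-point test, repeated over millions of points and arbitrary polygons, that makes rasterizing the query on graphics hardware attractive.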
Article
Full-text available
Geospatial data catalogs enable users to discover and access geographical information. Prevailing solutions are document oriented and fragment the spatial continuum of the geospatial data into independent and disconnected resources described through metadata. Because of this, the complete answer for a query may be scattered across multiple resources, making its discovery and access more difficult. This paper proposes an improved information retrieval process for geospatial data catalogs that aggregates the search results by identifying the implicit spatial/thematic relations between the metadata records of the resources. These aggregations are constructed in such a way that they match the user query better than each resource does individually.
Article
Full-text available
Continuous outlier detection in data streams has important applications in fraud detection, network security, and public health. The arrival and departure of data objects in a streaming manner impose new challenges for outlier detection algorithms, especially in time and space efficiency. In the past decade, several studies have addressed the problem of distance-based outlier detection in data streams (DODDS), which adopts an unsupervised definition and makes no distributional assumptions on data values. Our work is motivated by the lack of comparative evaluation among the state-of-the-art algorithms using the same datasets on the same platform. We systematically evaluate the most recent algorithms for DODDS under various stream settings and outlier rates. Our extensive results show that in most settings, the MCOD algorithm offers superior performance among all the algorithms evaluated, including the most recent algorithm, Thresh_LEAP.
Article
Full-text available
We present ShapeNet: a richly-annotated, large-scale repository of shapes represented by 3D CAD models of objects. ShapeNet contains 3D models from a multitude of semantic categories and organizes them under the WordNet taxonomy. It is a collection of datasets providing many semantic annotations for each 3D model such as consistent rigid alignments, parts and bilateral symmetry planes, physical sizes, keywords, as well as other planned annotations. Annotations are made available through a public web-based interface to enable data visualization of object attributes, promote data-driven geometric analysis, and provide a large-scale quantitative benchmark for research in computer graphics and vision. At the time of this technical report, ShapeNet has indexed more than 3,000,000 models, 220,000 models out of which are classified into 3,135 categories (WordNet synsets). In this report we describe the ShapeNet effort as a whole, provide details for all currently available datasets, and summarize future plans.
Article
Full-text available
The Hausdorff distance (HD) between two point sets is a commonly used dissimilarity measure for comparing point sets and image segmentations. Especially when very large point sets are compared using the HD, for example when evaluating magnetic resonance volume segmentations, or when the underlying applications are based on time-critical tasks, like motion detection, the computational complexity of HD algorithms becomes an important issue. In this paper we propose a novel efficient algorithm for computing the exact Hausdorff distance. In a runtime analysis, the proposed algorithm is demonstrated to have nearly-linear complexity. Furthermore, it performs efficiently for large point sets as well as large grid sizes; it performs equally well for sparse and dense point sets; and it is general, with no restrictions on the characteristics of the point set. The proposed algorithm is tested against the HD algorithm of the widely used National Library of Medicine Insight Segmentation and Registration Toolkit (ITK) using magnetic resonance volumes of extremely large size. The proposed algorithm outperforms the ITK HD algorithm both in speed and in memory required. In an experiment using trajectories from a road network, the proposed algorithm significantly outperforms an HD algorithm based on R-trees.
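The core of exact-HD computation is the directed distance with an early-break test: once some point of B is closer to a than the running maximum, a can no longer raise that maximum and the inner scan may stop. Below is a minimal (quadratic worst-case) sketch of that idea, not the paper's full algorithm; SciPy's scipy.spatial.distance.directed_hausdorff provides a ready-made routine in the same spirit:

import numpy as np

def directed_hd(A, B):
    # h(A, B) = max over a in A of min over b in B of d(a, b), with early break.
    cmax = 0.0
    for a in A:
        cmin = np.inf
        for b in B:
            d = np.linalg.norm(a - b)
            if d < cmax:        # early break: a cannot increase h(A, B)
                cmin = d
                break
            cmin = min(cmin, d)
        cmax = max(cmax, cmin)
    return cmax

def hausdorff(A, B):
    # The symmetric Hausdorff distance is the larger of the two directed distances.
    return max(directed_hd(A, B), directed_hd(B, A))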
Article
Full-text available
Spatial analysis and social network analysis typically consider social processes in their own specific contexts, either geographical or network space. Both approaches demonstrate strong conceptual overlaps. For example, actors close to each other tend to have greater similarity than those far apart; this phenomenon has different labels in geography (spatial autocorrelation) and in network science (homophily). In spite of those conceptual and observed overlaps, the integration of geography and social network context has not received the attention needed in order to develop a comprehensive understanding of their interaction or their impact on outcomes of interest, such as population health behaviors, information dissemination, or human behavior in a crisis. In order to address this gap, this paper discusses the integration of geographic with social network perspectives applied to understanding social processes in place from two levels: the theoretical level and the methodological level. At the theoretical level, this paper argues that the concepts of nearness and relationship in terms of a possible extension of the First Law of Geography are a matter of both geographical and social network distance, relationship, and interaction. At the methodological level, the integration of geography and social network contexts are framed within a new interdisciplinary field: visual analytics, in which three major application-oriented subfields (data exploration, decision-making, and predictive analysis) are used to organize discussion. In each subfield, this paper presents a theoretical framework first, and then reviews what has been achieved regarding geo-social visual analytics in order to identify potential future research.
Article
Full-text available
Given two locations s and t in a road network, a distance query returns the minimum network distance from s to t, while a shortest path query computes the actual route that achieves the minimum distance. These two types of queries find important applications in practice, and a plethora of solutions have been proposed in the past few decades. The existing solutions, however, are optimized for either practical or asymptotic performance, but not both. In particular, the techniques with enhanced practical efficiency are mostly heuristic-based, and they offer unattractive worst-case guarantees in terms of space and time. On the other hand, the methods that are worst-case efficient often entail prohibitive preprocessing or space overheads, which render them inapplicable to the large road networks (with millions of nodes) commonly used in modern map applications. This paper presents Arterial Hierarchy (AH), an index structure that narrows the gap between theory and practice in answering shortest path and distance queries on road networks. On the theoretical side, we show that, under a realistic assumption, AH answers any distance query in Õ(log α) time, where α = dmax/dmin, and dmax (resp. dmin) is the largest (resp. smallest) L∞ distance between any two nodes in the road network. In addition, any shortest path query can be answered in Õ(k + log α) time, where k is the number of nodes on the shortest path. On the practical side, we experimentally evaluate AH on a large set of real road networks with up to twenty million nodes, and we demonstrate that (i) AH outperforms the state of the art in terms of query time, and (ii) its space and pre-computation overheads are moderate.
Article
Full-text available
With the growing number of mobile applications, data analysis on large sets of historical moving object trajectories becomes increasingly important. Nearest neighbor search is a fundamental problem in spatial and spatio-temporal databases. In this paper, we consider the following problem: given a set of moving object trajectories D and a query trajectory mq, find the k nearest neighbors to mq within D for any instant of time within the lifetime of mq. We assume D is indexed in a 3D-R-tree and employ a filter-and-refine strategy. The filter step traverses the index and creates a stream of so-called units (linear pieces of a trajectory) as a superset of the units required to build the result of the query. The refinement step processes an ordered stream of units and determines the pieces of units forming the precise result. To support the filter step, for each node p of the index, a time-dependent coverage function Cp(t) is computed in preprocessing, which is the number of trajectories represented in p that are present at time t. Within the filter step, sophisticated data structures keep track of the aggregated coverages of the nodes seen so far in the index traversal to enable pruning. Moreover, the R-tree index is built in a special way to obtain coverage functions that are effective for pruning. As a result, one obtains a highly efficient kNN algorithm for moving data and query points that outperforms the two competing algorithms by a wide margin. Implementations of the new algorithm and of the competing techniques are made available as well. The algorithms can be used in a system context including, for example, visualization and animation of results. The experiments of the paper can easily be checked or repeated, and new experiments performed.
Article
We demonstrate ModsNet, a search tool for pre-trained data science model recommendation using an exemplar dataset. Given a set of pre-trained data science models, an "example" input dataset, and a user-specified performance metric, ModsNet answers the following query: "what are the top-k models that have the best expected performance for the input data?" The need for searching high-quality pre-trained models is evident in data-driven analysis. Inspired by the "query by example" paradigm, ModsNet does not require users to write complex queries; users only provide an "exemplar" dataset, a task description, and a performance measure as input, and it can automatically suggest the top-k matching models that are expected to perform the task well over the provided sample dataset. ModsNet utilizes a knowledge graph to integrate model performances over datasets and synchronizes it with a bipartite graph neural network to estimate model performance, reduce inference cost, and promptly respond to top-k model search queries. To cope with strict cold start (upon receiving a new dataset, when no historical performance of registered models has been observed), it performs a dynamic, cost-bounded "probe-and-select" strategy to incrementally identify promising models. We demonstrate the application of ModsNet in enabling efficient scientific data analysis.
Article
The amount of spatial data in open data portals has increased rapidly, raising the demand for spatial dataset search in large data repositories. In this paper, we tackle spatial dataset search by using the Earth Mover's Distance (EMD) to measure the similarity between datasets. EMD is a robust similarity measure between two distributions and has been successfully applied to multiple domains such as image retrieval, document retrieval, multimedia, etc. However, the existing EMD-based studies typically depend on a common filtering framework with a single pruning strategy, which still has a high search cost. To address this issue, we propose a Dual-Bound Filtering (DBF) framework to accelerate the EMD-based spatial dataset search. Specifically, we represent datasets by Z-order histograms and organize them as nodes in a tree structure. During a query, two levels of filtering are conducted based on pooling-based bounds and a TICT bound on EMD to prune dissimilar datasets efficiently. We conduct experiments on four real-world spatial data repositories and the experimental results demonstrate the efficiency and effectiveness of our DBF framework.
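On a Z-order histogram the bins form a one-dimensional sequence along the curve, so the base EMD computation reduces to one-dimensional transport. The sketch below shows only that base measure (using SciPy's wasserstein_distance, with bin indices as support points), not the paper's pooling-based or TICT bounds:

import numpy as np
from scipy.stats import wasserstein_distance

def zorder_histogram_emd(h1, h2):
    # 1-D EMD between two equally-binned histograms ordered along the Z-curve.
    bins = np.arange(len(h1))
    return wasserstein_distance(bins, bins, u_weights=h1, v_weights=h2)

h1 = np.array([5.0, 0.0, 3.0, 2.0])
h2 = np.array([1.0, 4.0, 2.0, 3.0])
print(zorder_histogram_emd(h1, h2))

Distance along the curve order is, of course, only an approximation of planar distance between cells, which is one reason efficient filtering bounds matter in this setting.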
Article
The large volumes of structured data currently available, from Web tables to open-data portals and enterprise data, open up new opportunities for progress in answering many important scientific, societal, and business questions. However, finding relevant data is difficult. While search engines have addressed this problem for Web documents, there are many new challenges involved in supporting the discovery of structured data. We demonstrate how the Auctus dataset search engine addresses some of these challenges. We describe the system architecture and how users can explore datasets through a rich set of queries. We also present case studies which show how Auctus supports data augmentation to improve machine learning models as well as to enrich analytics.
Article
In order to conduct analytical tasks, data scientists often need to find relevant data from an avalanche of sources (e.g., data lakes, large organizational databases). This effort is typically made in an ad hoc, non-systematic manner, which makes it a daunting endeavour. Current data discovery systems typically require the users to find relevant tables manually, usually by issuing multiple queries (e.g., using SQL). However, expressing such queries is nontrivial, as it requires knowledge of the underlying structure (schema) of the data organization in advance. This issue is further exacerbated when data resides in data lakes, where there is no predefined schema that data must conform to. On the other hand, data scientists can often come up with a few example records of interest quickly. Motivated by this observation, we developed DICE, a human-in-the-loop system for Data dIsCovery by Example, that takes user-provided example records as input and returns more records that satisfy the user intent. DICE's key idea is to synthesize a SQL query that captures the user intent, specified via examples. To this end, DICE follows a three-step process: (1) DICE first discovers a few candidate queries by finding join paths across tables within the data lake. (2) Then DICE consults with the user for validation by presenting a few records to them, and, thus, eliminating spurious queries. (3) Based on the user feedback, DICE refines the search and repeats the process until the user is satisfied with the results. We will demonstrate how DICE can help in data discovery through an interactive, example-based interaction.
Article
The rapid explosion of urban cities has modernized residents' lives and generated a large amount of data (e.g., human mobility data, traffic data, and geographical data), especially activity trajectory data that contains spatial and temporal as well as activity information. With these data, urban computing can provide better services, such as location-based applications for smart cities. Recently, a novel exemplar query paradigm has become popular that considers a user query as an example of the data of interest, which plays an important role in dealing with the information deluge. In this article, we propose a novel query, called searching activity trajectory by exemplar, where, given an exemplar trajectory τq, the goal is to find the top-k trajectories with the smallest distances to τq. We first introduce an inverted-index-based algorithm (ILA) using a threshold ranking strategy. To further improve efficiency, we propose a grid-tree threshold approach (GTA) to quickly locate candidates and prune unnecessary trajectories. In addition, we extend GTA to support parallel processing. Finally, extensive experiments verify the high efficiency and scalability of the proposed algorithms.
Conference Paper
The increasing availability of open government datasets on the Web calls for ways to enable their efficient access and searching. There is however an overall lack of understanding regarding spatial search strategies which would perform best in this context. To address this gap, this work has assessed the impact of different spatial search strategies on performance and user relevance judgment. We harvested machine-readable spatial datasets and their metadata from three English-based open government data portals, performed metadata enhancement, developed a prototype and performed both a theoretical and user-based evaluation. The results highlight that (i) switching between area of overlap and Hausdorff distance for spatial similarity computation does not have any substantial impact on performance; and (ii) the use of Hausdorff distance induces slightly better user relevance ratings than the use of area of overlap. The data collected and the insights gleaned may serve as a baseline against which future work can compare.
Article
Aiming at the problem of top-k spatial join query processing in cloud computing systems, a Spark-based top-k spatial join (STKSJ) query processing algorithm is proposed. In this algorithm, the whole data space is divided into grid cells of the same size by a grid partitioning method, and each spatial object in one data set is projected into a grid cell. The Minimum Bounding Rectangle (MBR) of all spatial objects in each grid cell is computed. The spatial objects overlapping with these MBRs in another spatial data set are replicated to the corresponding grid cells, thereby filtering out spatial objects for which there are no join results, thus reducing the cost of subsequent spatial join processing. An improved plane sweeping algorithm is also proposed that speeds up the scanning mode and applies threshold filtering, thus greatly reducing the communication and computation costs of intermediate join results in subsequent top-k aggregation operations. Experimental results on synthetic and real data sets show that the proposed algorithm has clear advantages, and better performance than existing top-k spatial join query processing algorithms.
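The grid-and-MBR stage described above can be sketched independently of Spark; the cell size and the two-dimensional layout are arbitrary assumptions here, and the replication and plane-sweep steps are omitted:

import numpy as np
from collections import defaultdict

def grid_mbrs(points, cell_size):
    # Assign each 2-D point to a fixed-size grid cell, then compute per-cell MBRs.
    cells = defaultdict(list)
    for p in np.asarray(points, dtype=float):
        key = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells[key].append(p)
    return {k: (np.min(v, axis=0), np.max(v, axis=0)) for k, v in cells.items()}

Objects of the second data set would then be replicated only to cells whose MBRs they overlap, which is what filters out pairs that cannot produce join results.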
Conference Paper
There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and providing data journalists easier access to information and its provenance. In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the “long tail” of the Web. In this paper, we discuss both social and technical challenges in building this type of tool, and the lessons that we learned from this experience.
Article
Although many fast methods exist for constructing a kNN-graph for low-dimensional data, it is still an open question how to do so efficiently for high-dimensional data. We present a new method to construct an approximate kNN-graph for medium- to high-dimensional data. Our method uses a one-dimensional mapping with a Z-order curve to construct an initial graph and then continues to improve this using neighborhood propagation. Experiments show that the method is faster than the compared methods on five different benchmark datasets, the dimensionality of which ranges from 14 to 784. Compared to a brute-force approach, the method provides a speedup between 12.7:1 and 414.2:1 depending on the dataset. We also show that errors in the approximate kNN-graph are more likely to originate from outlier points, and that the points likely to have erroneous neighbors can be detected at runtime.
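A Z-order (Morton) mapping interleaves the bits of the coordinates; the 2-D sketch below shows the idea, while the paper applies the same one-dimensional mapping to much higher-dimensional data:

def morton2d(x, y, bits=16):
    # Interleave the bits of two non-negative integers to get the Z-order index.
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

pts = [(3, 5), (0, 1), (7, 2)]
pts.sort(key=lambda p: morton2d(*p))  # neighbors on the curve are often close in space

Sorting by this code gives the initial ordering from which an approximate kNN-graph can be seeded and then refined by neighborhood propagation.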
Article
Point-of-interest (POI) recommendation has become an important way to help people discover attractive and interesting places, especially when they travel out of town. However, the extreme sparsity of user-POI matrix and cold-start issues severely hinder the performance of collaborative filtering-based methods. Moreover, user preferences may vary dramatically with respect to the geographical regions due to different urban compositions and cultures. To address these challenges, we stand on recent advances in deep learning and propose a Spatial-Aware Hierarchical Collaborative Deep Learning model (SH-CDL). The model jointly performs deep representation learning for POIs from heterogeneous features and hierarchically additive representation learning for spatial-aware personal preferences. To combat data sparsity in spatial-aware user preference modeling, both the collective preferences of the public in a given target region and the personal preferences of the user in adjacent regions are exploited in the form of social regularization and spatial smoothing. To deal with the multimodal heterogeneous features of the POIs, we introduce a late feature fusion strategy into our SH-CDL model. The extensive experimental analysis shows that our proposed model outperforms the state-of-the-art recommendation models, especially in out-of-town and cold-start recommendation scenarios.
Conference Paper
Recently, many governments have developed open government data portals as a way to facilitate finding and accessing datasets produced by their agencies. The development of these portals has facilitated the retrieval of this kind of data, but they still have significant limitations. One drawback of current portals concerns the resolution of queries with spatial constraints. Many portals answer spatial queries by selecting the datasets whose descriptions contain the place name given by the user, which can lead to queries with low recall and precision. Aiming to overcome these limitations, we propose a new spatial search engine to improve information retrieval in open government data portals. The main contributions of this work are the development of a system that retrieves OGD at the level of resources and the proposition of a ranking metric that evaluates the relevance of each resource retrieved by a query. We validated the proposed search engine using real data provided by the Brazilian open government data portal. The results obtained from the initial experiments showed that our solution is viable, as it can retrieve data with good accuracy for many spatial queries of different granularities.
Article
We study the problem of k-means clustering in the presence of outliers. The goal is to cluster a set of data points to minimize the variance of the points assigned to the same cluster, with the freedom of ignoring a small set of data points that can be labeled as outliers. Clustering with outliers has received a lot of attention in the data processing community, but practical, efficient, and provably good algorithms remain unknown for the most popular k-means objective. Our work proposes a simple local search-based algorithm for k-means clustering with outliers. We prove that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets. Using empirical evaluation on both synthetic and large-scale real-world data, we demonstrate that the algorithm dominates recently proposed heuristic approaches for the problem.
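The paper's algorithm is local-search based; as a simpler illustration of the objective (cluster while ignoring z presumed outliers), the sketch below uses a 'k-means--'-style Lloyd iteration that drops the z points farthest from their centers in each step, under arbitrary toy assumptions:

import numpy as np

def kmeans_ignoring_outliers(X, k, z, iters=50, seed=0):
    # Lloyd iteration that excludes the z farthest points from the mean updates.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # n x k distances
        assign, nearest = d.argmin(axis=1), d.min(axis=1)
        keep = np.argsort(nearest)[: len(X) - z]   # ignore the z farthest points
        for j in range(k):
            members = keep[assign[keep] == j]
            if len(members) > 0:
                centers[j] = X[members].mean(axis=0)
    outliers = np.argsort(nearest)[len(X) - z :]
    return centers, assign, outliers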
Article
The Hausdorff Distance (HD) is a very important similarity measure in pattern recognition, shape matching, and artificial intelligence. Because of its inherent computational complexity, computing the HD using the NAIVEHD (brute force) algorithm is difficult, especially for comparing the similarity between large-scale point sets in the era of big data. To overcome this problem, we propose a novel, efficient, and general algorithm for computing the exact HD for arbitrary point sets, which takes advantage of the spatial locality of point sets, namely Local Start Search (LSS). Different from the state-of-the-art algorithm EARLYBREAK in PAMI 2015, our idea comes from the observation that the neighbor points of a break position in the current loop have a higher probability of breaking the next loop than other points. Therefore, our algorithm adds a mechanism that records the current break position as a start position, which initializes the search center of the next loop. LSS then executes the next loop by scanning the neighbor points around this center. In this way, LSS maintains high performance in both overlapping and non-overlapping situations. Furthermore, LSS can process arbitrary data by adopting the Morton curve to establish an order over scattered point sets, whereas EARLYBREAK mainly applies to regular data that require the same grid size, such as medical images or voxel data. In the non-overlapping situation, when comparing pairs of arbitrary point sets, LSS performs as well as EARLYBREAK. In the overlapping situation, however, LSS is faster than EARLYBREAK by three orders of magnitude. Thus, as a whole, LSS outperforms EARLYBREAK. In addition, LSS was compared against the incremental Hausdorff distance calculation algorithm (INC) and outperforms it by an order of magnitude. Experiments demonstrate the efficiency and accuracy of the proposed method.
Conference Paper
From tweets to urban data sets, there has been an explosion in the volume of textual data that is associated with both temporal and spatial components. Efficiently evaluating queries over these data is challenging. Previous approaches have focused on the spatial aspect. Some used separate indices for space and text, thus incurring the overhead of storing separate indices and joining their results. Others proposed a combined index that either inserts terms into a spatial structure or adds a spatial structure to an inverted index. These benefit queries with highly-selective constraints that match the primary index structure but have limited effectiveness and pruning power otherwise. We propose a new indexing strategy that uniformly handles text, space, and time in a single structure, and is thus able to efficiently evaluate queries that combine keywords with spatial and temporal constraints. We present a detailed experimental evaluation using real data sets which shows that our index not only attains substantially lower query processing times but can also be constructed in a fraction of the time required by state-of-the-art approaches.
Conference Paper
A fundamental problem on time series is k nearest neighbor (k-NN) query processing. However, existing methods are not fast enough for large datasets. In this paper, we propose a novel approach, STS3, that processes k-NN queries by transforming time series into sets and measuring similarity under the Jaccard metric. Our approach is more accurate than Dynamic Time Warping (DTW) in suitable scenarios and is faster than most existing methods, owing to the efficiency of similarity search over sets. In addition, we develop an index, a pruning technique, and an approximation technique to improve the k-NN query procedure. As the experimental results show, all of them accelerate query processing effectively.
Article
The problem of maximizing bichromatic reverse k nearest neighbor queries (BRkNN) has been extensively studied in spatial databases. In this work, we present a related query for spatial-textual databases that finds an optimal location and a set of keywords that maximize the size of the bichromatic reverse spatial-textual k nearest neighbors (MaxBRSTkNN). Such a query has many practical applications, including social media advertisements where a limited number of relevant advertisements are displayed to each user. The problem is to find the location and the text contents to include in an advertisement so that it will be displayed to the maximum number of users. The increasing availability of spatial-textual collections allows us to answer these queries for both spatial proximity and textual similarity. This paper is the first to consider the MaxBRSTkNN query. We show that the problem is NP-hard and present both approximate and exact solutions.
Article
With the proliferation of mobile devices, accompanied by advances in location detection technologies such as GPS, GSM network logs, and call description records (CDR), a large amount of spatio-temporal data is being generated. Much research is directed towards understanding and discovering cumulative movement and behaviour patterns of users from this digital footprint. An interesting problem in this category is identifying the similarity between mobile users' trajectories. In this paper, a generic grid-based approach for user similarity mining from such location logs is presented. The technique is applied to real geo-location data to derive trajectory similarity patterns.
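The exact formulation is not given above, but a common grid-based trajectory similarity compares the sets of cells the trajectories visit; the cell size is an arbitrary assumption in this sketch:

def grid_cells(trajectory, cell_size):
    # Map a sequence of (x, y) points to the set of grid cells it touches.
    return {(int(x // cell_size), int(y // cell_size)) for x, y in trajectory}

def trajectory_similarity(t1, t2, cell_size=0.01):
    # Jaccard similarity of the two trajectories' grid-cell footprints.
    a, b = grid_cells(t1, cell_size), grid_cells(t2, cell_size)
    return len(a & b) / len(a | b) if (a | b) else 0.0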
Article
Massive amounts of data that are geo-tagged and associated with text information are being generated at an unprecedented scale. These geo-textual data cover a wide range of topics. Users are interested in receiving up-to-date tweets whose locations are close to a user-specified location and whose texts are interesting to them. For example, a user may want to be updated with tweets near her home on the topic 'food poisoning vomiting.' We consider the Temporal Spatial-Keyword Top-k Subscription (TaSK) query. Given a TaSK query, we continuously maintain up-to-date top-k most relevant results over a stream of geo-textual objects (e.g., geo-tagged tweets) for the query. The TaSK query takes into account text relevance, spatial proximity, and recency of geo-textual objects in evaluating its relevance with a geo-textual object. We propose a novel solution to efficiently process a large number of TaSK queries over a stream of geo-textual objects. We evaluate the efficiency of our approach on two real-world datasets and the experimental results show that our solution achieves a reduction in processing time of 70-80% compared with two baselines.
Article
Search engines are continuously employing advanced techniques that aim to capture user intentions and provide results that go beyond the data that simply satisfy the query conditions. Examples include the personalized results, related searches, similarity search, popular and relaxed queries. In this work we introduce a novel query paradigm that considers a user query as an example of the data in which the user is interested. We call these queries exemplar queries and claim that they can play an important role in dealing with the information deluge. We provide a formal specification of the semantics of such queries and show that they are fundamentally different from notions like queries by example, approximate and related queries. We provide an implementation of these semantics for graph-based data and present an exact solution with a number of optimizations that improve performance without compromising the quality of the answers. We also provide an approximate solution that prunes the search space and achieves considerably better time-performance with minimal or no impact on effectiveness. We experimentally evaluate the effectiveness and efficiency of these solutions with synthetic and real datasets, and illustrate the usefulness of exemplar queries in practice.
Article
Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches. In this paper we assess several distance-based outlier detection approaches and evaluate them. We begin by surveying and examining the design landscape of extant approaches, while identifying key design decisions of such approaches. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations proposed by us as well as those proposed in the literature, both independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge this is the first such study in the literature. The outcome of this study is a family of state of the art distance-based outlier detection algorithms. Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.
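A classic distance-based definition used by this family of algorithms flags a point as an outlier when fewer than k other points lie within radius r of it. A minimal index-assisted sketch of that rule (not one of the optimized algorithms evaluated above):

import numpy as np
from scipy.spatial import cKDTree

def distance_based_outliers(X, r, k):
    # DB(k, r)-style rule: outlier if fewer than k neighbors within radius r.
    tree = cKDTree(X)
    counts = np.array([len(tree.query_ball_point(x, r)) - 1 for x in X])  # -1 drops x itself
    return np.where(counts < k)[0]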
Conference Paper
Hausdorff distance (HD) is a useful measure of the extent to which one shape is similar to another, which is one of the most important problems in pattern recognition, computer vision, and image analysis. However, HD is sensitive to outliers. Many researchers have proposed modifications of HD. HD and its modifications are all based on computing the distance from each point in the model image to its nearest point in the test image, collectively called nearest neighbor based Hausdorff distances (NNHDs). In this paper, we propose modifications of Hausdorff distance measurements that use k-nearest neighbors (kNN). We use the average distance from each point in the model image to its kNN in the test image to replace the NN procedure of NNHDs and obtain Hausdorff distances based on kNN, named kNNHDs. When k = 1, kNNHDs are equal to NNHDs. kNNHDs inherit the outlier tolerance of their NNHD prototypes and are more tolerant to noise.
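A kNN-based directed Hausdorff distance along these lines is straightforward to sketch with a spatial index; the choice of k and the use of SciPy's cKDTree are assumptions made here for illustration:

import numpy as np
from scipy.spatial import cKDTree

def directed_knn_hd(A, B, k=3):
    # For each point of A, average the distances to its k nearest neighbors in B,
    # then take the maximum of those averages (k = 1 reduces to the usual NNHD).
    d, _ = cKDTree(B).query(A, k=k)
    d = np.asarray(d).reshape(len(A), -1)  # normalize shape for k = 1
    return float(d.mean(axis=1).max())

def knn_hd(A, B, k=3):
    return max(directed_knn_hd(A, B, k), directed_knn_hd(B, A, k))

Averaging over k neighbors is what dampens the influence of a single outlying point on the final distance.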
Conference Paper
Incrementally finding the k nearest neighbors (kNN) in a spatial network is an important problem in location-based services. One method (INE) simply applies Dijkstra's algorithm. Another method (IER) computes the k nearest neighbors using Euclidean distance, computes their corresponding network distances, and then incrementally finds the next nearest neighbors in order of increasing Euclidean distance until reaching one whose Euclidean distance is greater than the network distance of the current kth nearest neighbor. The LBC method improves on INE by avoiding visits to nodes that cannot possibly lead to the k nearest neighbors, using a Euclidean heuristic estimator; it improves on IER by avoiding repeated visits to nodes in the spatial network that appear on the shortest paths to different members of the k nearest neighbors, by performing multiple instances of heuristic search with a Euclidean heuristic estimator on candidate objects around the query point. LBC's drawback is that maintaining multiple instances of heuristic search (called wavefronts) requires k priority queues, and the queue operations required to maintain them incur a high in-memory processing cost. A method (SWH) is proposed that utilizes a novel heuristic function which considers the objects surrounding the query point together as a single unit, instead of one destination at a time as in LBC, thereby eliminating the need for multiple wavefronts and requiring just one priority queue. This results in a significant reduction of the in-memory processing cost while retaining the same reduced cost of access to the spatial network as LBC. SWH is also extended to support the incremental distance semi-join (IDSJ) query, which is a multiple-query-point generalization of the kNN query. In addition, SWH is shown to support landmark-based heuristic functions, enabling it to be applied to non-spatial networks/graphs such as social networks. Experimental comparisons of SWH for kNN queries with INE, the best single-wavefront method, show that SWH is 2.5 times faster, and with LBC, the best existing heuristic search method, show that SWH is 3.5 times faster. For IDSJ queries, SWH-IDSJ is 5 times faster than INE-IDSJ and 4 times faster than LBC-IDSJ.
Article
In view-based 3-D object retrieval, each object is described by a set of views. Group matching thus plays an important role. Previous research efforts have shown the effectiveness of Hausdorff distance in group matching. In this paper, we propose a 3-D object retrieval scheme with Hausdorff distance learning. In our approach, relevance feedback information is employed to select positive and negative view pairs with a probabilistic strategy, and a view-level Mahalanobis distance metric is learned. This Mahalanobis distance metric is adopted in estimating the Hausdorff distances between objects, based on which the objects in the 3-D database are ranked. We conduct experiments on three testing data sets, and the results demonstrate that the proposed Hausdorff learning approach can improve 3-D object retrieval performance.
Article
This note presents a simplification and generalization of an algorithm for searching k-dimensional trees for nearest neighbors reported by Friedman et al. [3]. If the distance between records is measured using L2, the Euclidean norm, the data structure used by the algorithm to determine the bounds of the search space can be simplified to a single number. Moreover, because distance measurements in L2 are rotationally invariant, the algorithm can be generalized to allow a partition plane to have an arbitrary orientation, rather than insisting that it be perpendicular to a coordinate axis, as in the original algorithm. When a k-dimensional tree is built, this plane can be found from the principal eigenvector of the covariance matrix of the records to be partitioned. These techniques and others yield variants of k-dimensional trees customized for specific applications. It is wrong to assume that k-dimensional trees guarantee that a nearest-neighbor query completes in logarithmic expected time. For small k, logarithmic behavior is observed on all but tiny trees. However, for larger k, logarithmic behavior is achievable only with extremely large numbers of records. For k = 16, a search of a k-dimensional tree of 76,000 records examines almost every record.
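This behavior is easy to probe with an off-the-shelf k-d tree; the sizes below mirror the k = 16 case mentioned above (SciPy's cKDTree is used here for convenience, and the observed cost naturally depends on the data distribution):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
records = rng.random((76_000, 16))   # 76,000 records in 16 dimensions
tree = cKDTree(records)

query = rng.random(16)
dist, idx = tree.query(query, k=1)   # nearest neighbor under the L2 norm
print(dist, idx)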