Table 1 - uploaded by Hyun-Hwa Choi
Summary of symbols and respective definitions

Symbol   Description
D        Number of dimensions
Q        Query point
k        Number of nearest neighbors
k̄        Average kth distance between points in a sample
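The last symbol in the table can be illustrated with a short sketch. The table does not say exactly how the average is taken, so the code below assumes it means the mean, over all points in a sample, of the distance from each point to its kth nearest neighbor. The function names `kth_nn_distance` and `average_kth_distance` are hypothetical and are not taken from the source publication.

```python
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor, excluding itself."""
    dists = sorted(
        math.dist(points[i], p) for j, p in enumerate(points) if j != i
    )
    return dists[k - 1]


def average_kth_distance(sample, k):
    """Mean k-th nearest-neighbor distance over a sample (assumed reading of k-bar)."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    total = sum(kth_nn_distance(sample, i, k) for i in range(len(sample)))
    return total / len(sample)


# Corners of the unit square in D = 2 dimensions.
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
k_bar = average_kth_distance(sample, k=1)
```

This brute-force version costs O(n^2 log n) per sample, so it is only reasonable for small samples.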
Source publication
Although conventional index structures provide various nearest-neighbor search algorithms for high-dimensional data, there are additional requirements to improve search performance and to support index scalability for large-scale datasets. To support these requirements, we propose a distributed high-dimensional index structure based on cl...
Similar publications
Entity Resolution suffers from quadratic time complexity. To increase its time efficiency, three kinds of filtering techniques are typically used for restricting its search space: (i) blocking workflows, which group together entity profiles with identical or similar signatures, (ii) string similarity join algorithms, which quickly detect entities m...
Citations
... Some approaches exist which utilize kd-tree for space partitioning on top of local index structures [22,23]. In [9], a distributed high-dimensional index structure (DVA-tree) is proposed which is based on a hybrid spill-tree and Vector Approximation files. Some papers [18,12] make the assumption that prior information is available, based on past experience, logs, etc. on the distribution of query sets. ...
... Previous work mainly focused on declustering and box queries [16,17,20,25,4,8,18] or clustering and proximity search [9,23]. However, there were also attempts to satisfy these conflicting demands [3,22,27,31,12]. ...
There are several approaches to handling and storing massive amounts of multidimensional data. State-of-the-art database systems use a shared-nothing architecture for scalable spatial data management and indexing, where data co-location and query load balancing are the primary objectives. Hence, data placement is an important factor in efficiency. Szalai-Gindl et al. (SG17) proposed a data distribution algorithm for this task in an earlier work [26]. This paper investigates possible improvements to that algorithm.
... A further benefit of better initialization is a significant improvement in the algorithm's running time. There is also increasing attention to, and effort toward, understanding how the K-means algorithm runs in a distributed setup [28][29][30]. ...
Real-world prediction problems often involve predicting rare occurrences. Standard classification algorithms are biased against these rare events because of the data imbalance. Typical approaches to this imbalance involve oversampling the “rare events” or undersampling the majority events. The Synthetic Minority Oversampling Technique (SMOTE) is one technique that addresses this class imbalance effectively. However, existing implementations of SMOTE fail when the data grows and can no longer be stored on a single machine. In this paper, we present our solution to this “big data challenge.” We provide a distributed version of SMOTE using scalable k-means++ and M-Trees. With this implementation of SMOTE, we were able to oversample the “rare events” and achieve results that are better than those of the existing Python version of SMOTE.
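The core oversampling step of SMOTE can be sketched on a single machine as follows. This is a minimal illustration, not the distributed implementation described in the abstract: it uses a brute-force neighbor search in place of k-means++ partitioning and M-Trees, and the function name `smote` and its parameters are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbors (brute-force search)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the chosen base point.
        neighbors = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (o - b) for b, o in zip(base, other))
        )
    return synthetic


rare_events = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(rare_events, n_new=5)
```

Each synthetic point lies on the line segment between two existing minority points, so the oversampled class stays inside the region the original rare events occupy.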
Smart medical technologies, which combine Internet of Things, cloud computing, and artificial intelligence technologies, are redefining family life. With the advent of the era of big data, traditional medical service systems cannot meet the needs of big data processing in the current medical system because of limited computing resources, slow operation speed, and poor distributed processing capacity. In this chapter, a cloud-based smart medical system that applies MapReduce distributed processing technology is proposed to solve these problems. A new distributed k-nearest neighbour (kNN) algorithm that combines the Voronoi-inverted grid (VIG) index and the MapReduce programming framework is developed to improve the efficiency of data processing. Here, VIG is a spatial index that uses a grid structure and an inverted index based on Voronoi partitioning. The results of extensive experimental evaluations on real and synthetic data sets indicate the efficiency and scalability of the proposed approach.
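The general map-and-reduce pattern behind a distributed kNN query can be sketched as below. This is only an illustration of the pattern under simplifying assumptions: it runs in a single process, scans every partition, and omits the VIG index entirely, so it does not reproduce the chapter's algorithm. The function names are hypothetical.

```python
import heapq
import math


def map_local_knn(partition, query, k):
    """Map step: the k nearest candidates within one data partition."""
    return heapq.nsmallest(k, ((math.dist(p, query), p) for p in partition))


def reduce_global_knn(local_results, k):
    """Reduce step: merge the per-partition candidates into a global top k."""
    merged = (cand for result in local_results for cand in result)
    return heapq.nsmallest(k, merged)


def distributed_knn(partitions, query, k):
    """Run the map step on every partition, then reduce to the k nearest points."""
    local_results = [map_local_knn(part, query, k) for part in partitions]
    return [point for _, point in reduce_global_knn(local_results, k)]


partitions = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(1.0, 1.0), (9.0, 9.0)],
    [(2.0, 2.0)],
]
nearest = distributed_knn(partitions, query=(0.0, 0.0), k=2)
```

The result is exact because every true nearest neighbor is also among the k nearest points of its own partition. A spatial index such as VIG would serve to avoid scanning partitions that cannot contain a neighbor.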