This is the presentation talk covering the results of my master's degree research, presented at the kick-off event of the Scalable Similarity Search project at ITU-Copenhagen (the project is coordinated by Prof. Rasmus Pagh):
http://sss.projects.itu.dk/kickoff.html
The complete thesis is available here: https://www.researchgate.net/publication/267394989_Metric_space_indexing_for_nearest_neighbor_search_in_multimedia_context

The problem of searching the set of keys in a file to find a key which is closest to a given query key is discussed. After “closest,” in terms of a metric on the key space, is suitably defined, three file structures are presented together with their corresponding search algorithms, which are intended to reduce the number of comparisons required to achieve the desired result. These methods are derived using certain inequalities satisfied by metrics and by graph-theoretic concepts. Some empirical results are presented which compare the efficiency of the methods.
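
As a rough illustration of how a metric inequality can reduce the number of comparisons, the sketch below uses the triangle inequality with a single pivot key. It is not one of the three file structures from the abstract; the names `hamming`, `build_pivot_table`, and `nearest` are hypothetical, and Hamming distance on equal-length strings is assumed only as an example metric.

```python
def hamming(a, b):
    """Example metric (an assumption): number of differing positions."""
    return sum(ca != cb for ca, cb in zip(a, b))


def build_pivot_table(keys, pivot, dist):
    """Precompute the distance from every key to one pivot key."""
    return [(key, dist(key, pivot)) for key in keys]


def nearest(query, table, pivot, dist):
    """Return (closest key, its distance, number of distance computations)."""
    d_qp = dist(query, pivot)
    comparisons = 1  # the query-to-pivot distance
    best_key, best_d = None, float("inf")
    # Visit keys by increasing lower bound |d(q, p) - d(k, p)| <= d(q, k).
    for key, d_kp in sorted(table, key=lambda entry: abs(d_qp - entry[1])):
        if abs(d_qp - d_kp) >= best_d:
            break  # no remaining key can be closer than the current best
        comparisons += 1
        d = dist(query, key)
        if d < best_d:
            best_key, best_d = key, d
    return best_key, best_d, comparisons
```

Because the lower bound is computed from stored pivot distances, keys whose bound already reaches the best distance found so far are never compared with the query.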

Local descriptors have been extensively used in CBIR systems, where their robustness to intense geometric and photometric transformations allows the identification of a target object/image with great reliability. However, due to their excessive discriminating power, their application to the retrieval of complex categories is challenging. The introduction of the technique of visual dictionaries (also known as dictionary of visual terms) is an important step towards the conciliation between the robustness of local descriptors and the flexibility of generalization needed by complex queries. As a bonus, we become able to employ advanced retrieval techniques which were so far available only for textual data.

We investigate variants of Lloyd's heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.

It is well known that high-dimensional nearest-neighbor retrieval is very expensive. Many signal processing methods suffer from this computing cost. Dramatic performance gains can be obtained by using approximate search, such as the popular Locality-Sensitive Hashing. This paper improves LSH by performing an on-line selection of the most appropriate hash functions from a pool of functions. An additional improvement originates from the use of E8 lattices for geometric hashing instead of one-dimensional random projections. A performance study based on state-of-the-art high-dimensional descriptors computed on real images shows that our improvements to LSH greatly reduce the search complexity for a given level of accuracy.

It is well known that high-dimensional nearest neighbor retrieval is very expensive. Dramatic performance gains are obtained using approximate search schemes, such as the popular Locality-Sensitive Hashing (LSH). Several extensions have been proposed to address the limitations of this algorithm, in particular, by choosing more appropriate hash functions to better partition the vector space. All the proposed extensions, however, rely on a structured quantizer for hashing, which fits real data sets poorly and limits performance in practice. In this paper, we compare several families of space hashing functions in a real setup, namely when searching for high-dimensional SIFT descriptors. The comparison of random projections, lattice quantizers, k-means and hierarchical k-means reveals that an unstructured quantizer significantly improves the accuracy of LSH, as it closely fits the data in the feature space. We then compare two querying mechanisms introduced in the literature with the one originally proposed in LSH, and discuss their respective merits and limitations.
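
A minimal sketch of the random-projection family mentioned in this abstract, assuming the common form h(x) = floor((a·x + b) / w) with Gaussian `a` and uniform `b`. The function names, the bucket width, and the number of hash functions are illustrative assumptions, not values from the paper.

```python
import math
import random


def make_hash(dim, width, rng):
    """One random-projection hash: project onto a random line, then quantize."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / width)

    return h


def make_key_fn(dim, n_hashes, width, seed=0):
    """Concatenate several hashes into one bucket key (one hash table)."""
    rng = random.Random(seed)
    hashes = [make_hash(dim, width, rng) for _ in range(n_hashes)]
    return lambda x: tuple(h(x) for h in hashes)


def build_index(points, key_fn):
    """Group point indices by bucket key."""
    buckets = {}
    for i, p in enumerate(points):
        buckets.setdefault(key_fn(p), []).append(i)
    return buckets


def candidates(query, buckets, key_fn):
    """Indices of the points that fall into the same bucket as the query."""
    return buckets.get(key_fn(query), [])
```

Only the returned candidates need an exact distance computation; a real setup would use several tables, each built from `make_key_fn` with a different seed, to raise the chance that a true neighbor shares at least one bucket with the query.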

In this paper we address the subject of large multimedia database indexing for content-based retrieval. We introduce multicurves, a new scheme for indexing high-dimensional descriptors. This technique, based on the simultaneous use of moderate-dimensional space-filling curves, has as main advantages the ability to handle high-dimensional data (100 dimensions and over), to allow the easy maintenance of the indexes (inclusion and deletion of data), and to adapt well to secondary storage, thus providing scalability to huge databases (millions, or even thousands of millions of descriptors). We use multicurves to perform the approximate k nearest neighbors search with a very good compromise between precision and speed. The evaluation of multicurves, carried out on large databases, demonstrates that the strategy compares well to other up-to-date k nearest neighbor search strategies. We also test multicurves on the real-world application of image identification for cultural institutions. In this application, which requires the fast search of a large amount of local descriptors, multicurves allows a dramatic speed-up in comparison to the brute-force strategy of sequential search, without any noticeable precision loss.

We present an efficient GPU-based parallel LSH algorithm to perform approximate k-nearest neighbor computation in high-dimensional spaces. We use the Bi-level LSH algorithm, which can compute k-nearest neighbors with higher accuracy and is amenable to parallelization. During the first level, we use the parallel RP-tree algorithm to partition datasets into several groups so that items similar to each other are clustered together. The second level involves computing the Bi-Level LSH code for each item and constructing a hierarchical hash table. The hash table is based on parallel cuckoo hashing and Morton curves. In the query step, we use GPU-based work queues to accelerate short-list search, which is one of the main bottlenecks in LSH-based algorithms. We demonstrate the results on large image datasets with 200,000 images which are represented as 512 dimensional vectors. In practice, our GPU implementation can obtain more than 40X acceleration over a single-core CPU-based LSH implementation.

Efficient high-dimensional similarity search structures are essential for building scalable content-based search systems on feature-rich multimedia data. In the last decade, Locality Sensitive Hashing (LSH) has been proposed as indexing technique for approximate similarity search. Among the most recent variations of LSH, multi-probe LSH techniques have been proved to overcome the overlinear space cost drawback of common LSH. Multi-probe LSH is built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by previous work on probabilistic similarity search structures and improves upon recent theoretical work on multi-probe and query adaptive LSH. Whereas these methods are based on likelihood criteria that a given bucket contains query results, we define a more reliable a posteriori model taking account some prior about the queries and the searched objects. This prior knowledge allows a better quality control of the search and a more accurate selection of the most probable buckets. We implemented a nearest neighbors search based on this paradigm and performed experiments on different real visual features datasets. We show that our a posteriori scheme outperforms other multi-probe LSH while offering a better quality control. Comparisons to the basic LSH technique show that our method allows consistent improvements both in space and time efficiency.

The concept of Locality-sensitive Hashing (LSH) has been successfully used for searching in high-dimensional data and a number of locality-preserving hash functions have been introduced. In order to extend the applicability of the LSH approach to a general metric space, we focus on a recently presented Metric Index (M-Index), we redefine its hashing and searching process in the terms of LSH, and perform extensive measurements on two datasets to verify that the M-Index fulfills the conditions of the LSH concept. We widely discuss "optimal" properties of LSH functions and the efficiency of a given LSH function with respect to kNN queries. The results also indicate that the M-Index hashing and searching is more efficient than the tested standard LSH approach for Euclidean distance.

Metric space as a universal and versatile model of similarity can be applied in various areas of non-text information retrieval. However, a general, efficient and scalable solution for metric data management is still a resisting research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index), that employs practically all known principles of metric space partitioning, pruning and filtering. The heart of the M-Index is a general mapping mechanism that enables to actually store the data in well-established structures such as the B+-tree or even in a distributed storage. We have implemented the M-Index with B+-tree and performed experiments on a combination of five MPEG-7 descriptors in a database of hundreds of thousands digital images. The experiments put under test several M-Index variants and compare them with two orthogonal approaches – the PM-Tree and the iDistance. The trials show that the M-Index outperforms the others in terms of efficiency of search-space pruning, I/O costs, and response times for precise similarity queries. Furthermore, the M-Index demonstrates an excellent ability to keep similar data close in the index which makes its approximation algorithm very efficient – maintaining practically constant response times while preserving a very high recall as the dataset grows.

Modeling proximity search problems as a metric space provides a general framework usable in many areas, such as pattern recognition, web search, clustering, data mining, knowledge management, and textual and multimedia information retrieval, to name a few. Metric indexes have been improved over the years and many instances of the problem can be solved efficiently. However, when very large or high-dimensional metric databases are indexed, exact approaches are not yet capable of solving the problem efficiently; performance in these circumstances degrades to almost sequential search. To overcome this limitation, non-exact proximity searching algorithms can be used to give answers that are close to the exact result, either in probability or within an approximation factor. Approximation is acceptable in many contexts, especially when human judgement about closeness is involved. In vector spaces, on the other hand, there is a very successful approach dubbed Locality Sensitive Hashing, which consists in making a succinct representation of the objects. This succinct representation is relatively insensitive to small variations of the locality. Unfortunately, the hashing functions have to be carefully designed, very close to the data model, and different functions are used when objects come from different domains. In this paper we give a new schema to encode objects in a general metric space with a uniform framework, independent of the data model. Finally, we provide experimental support for our claims using several real-life databases with different data models and distance functions, obtaining excellent results in both speed and recall, especially for large databases.

With the proliferation of high-speed internet access, piracy of multimedia data has developed into a major problem and media distributors, such as photo agencies, are making strong efforts to protect their digital property. Some recent work on image processing has therefore focused on content-based methods to detect image copyright violations, and a ``local descriptor'' method, which extracts several characteristic points of an image and describes through high-dimensional vectors, has been shown to be quite effective, albeit very inefficient. We have applied a recent approximate query processing method, the OMEDRANK algorithm, to the image copyright protection method above and shown that it does not result in more efficient query processing without sacrificing the quality of the results. Therefore we have proposed a new index structure, the PVS-index, which segments the descriptor collection based on projections to random lines and utilizes all the nice properties of the OMEDRANK algorithm. In a detailed performance study using a collection of over 20 million image descriptors, we show that using OMEDRANK on top of the PVS-index results in extremely efficient and effective query processing.

We propose MONORAIL, an indexing scheme for very large multimedia descriptor databases. Our index is based on the Hilbert curve, which is able to map the high-dimensional space of those descriptors to a single dimension. Instead of using several curves to mitigate boundary effects, we use a single curve with several surrogate points for each descriptor. Thus, we are able to reduce the random accesses to the bare minimum. In a rigorous empirical comparison with another method based on multiple surrogates, ours shows a significant improvement, due to our careful choice of the surrogate points.

In this work, a fast approximate nearest neighbour search algorithm using a single Space-filling Curve (SPFC) mapping and a set of synthetic prototype representations is presented. The results are comparable to a multiple space-filling curve scheme, but achieve a much faster execution time, since computing multiple transformations and SPFC mappings is avoided, at the expense of having a more densely populated one-dimensional representation of the data-set. The advantages and limitations of the model are discussed, and an experimental evaluation with synthetic data and with a large, real high-dimensional optical character recognition data-set is presented.

This is a book, not a book review.

The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. A large number of solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reinvented several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of "intrinsic dimensionality". We also present a unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy which allows us to devise new algorithms from combinations of concepts which were not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.

This paper introduces a product quantization-based approach for approximate nearest neighbor search. The idea is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. A vector is represented by a short code composed of its subspace quantization indices. The Euclidean distance between two vectors can be efficiently estimated from their codes. An asymmetric version increases precision, as it computes the approximate distance between a vector and a code. Experimental results show that our approach searches for nearest neighbors efficiently, in particular in combination with an inverted file system. Results for SIFT and GIST image descriptors show excellent search accuracy, outperforming three state-of-the-art approaches. The scalability of our approach is validated on a data set of two billion vectors.
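
The encoding and the asymmetric distance described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions: the codebooks are given by hand rather than learned (in practice they would typically be trained, for example with k-means per subspace), distances are squared Euclidean, and all function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def split(vec, m):
    """Cut a vector into m contiguous sub-vectors of equal length."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def encode(vec, codebooks):
    """Code = index of the nearest centroid in each subspace."""
    code = []
    for sub, centroids in zip(split(vec, len(codebooks)), codebooks):
        distances = [sq_dist(sub, c) for c in centroids]
        code.append(distances.index(min(distances)))
    return code


def adc_table(query, codebooks):
    """Distances from each query sub-vector to every centroid of its subspace."""
    subs = split(query, len(codebooks))
    return [[sq_dist(sub, c) for c in centroids]
            for sub, centroids in zip(subs, codebooks)]


def adc_distance(table, code):
    """Asymmetric estimate: the query stays exact, the database vector is a code."""
    return sum(table[i][j] for i, j in enumerate(code))
```

The lookup table is computed once per query, so estimating the distance to each stored code costs only one table lookup and one addition per subspace.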

During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography, or molecular biology. An important research issue in the field of multimedia databases is the content based retrieval of similar multimedia objects such as images, text, and videos. However, in contrast to searching data in a relational database, a content based retrieval requires the search of similar objects as a basic functionality of the database system. Most of the approaches addressing similarity search use a so-called feature transformation which transforms important properties of the multimedia objects into high-dimensional points (feature vectors). Thus, the similarity search is transformed into a search of points in the feature space which are close to a given query point in the high-dimensional feature space. Query Processing in high-dimensional spaces has therefore been a very active research area over the last few years. A number of new index structures and algorithms have been proposed. It has been shown that the new index structures considerably improve the performance in querying large multimedia databases. Based on recent tutorials [BK 98, BK 00], in this survey we provide an overview of the current state-of-the-art in querying multimedia databases, describing the index structures and algorithms for an efficient query processing in high-dimensional spaces. We identify the problems of processing queries in high-dimensional space, and we provide an overview of the proposed approaches to overcome these problems.

We present a new approach for approximate nearest neighbor queries for sets of high dimensional points under any L<sub>t</sub>-metric, t=1,...,∞. The proposed algorithm is efficient and simple to implement. The algorithm uses multiple shifted copies of the data points and stores them in up to (d+1) B-trees where d is the dimensionality of the data, sorted according to their position along a space filling curve. This is done in a way that allows us to guarantee that a neighbor within an O(d<sup>1+1/t</sup>) factor of the exact nearest can be returned with at most (d+1) log<sub>p</sub> n page accesses, where p is the branching factor of the B-trees. In practice, for real data sets, our approximate technique finds the exact nearest neighbor between 87% and 99% of the time and a point no farther than the third nearest neighbor between 98% and 100% of the time. Our solution is dynamic, allowing insertion or deletion of points in O(d log<sub>p</sub> n) page accesses and generalizes easily to find approximate k-nearest neighbors.

Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, the paper investigates how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, the paper develops two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.

Many recent database applications need to deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The slim-tree uses the triangle inequality to prune the distance calculations that are needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the slim-tree uses a minimal spanning tree to help with the splitting. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The slim-tree is a metric access method that tackles the problem of overlaps between nodes in metric spaces and that allows one to minimize the overlap. The proposed "fat-factor" is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed "slim-down" algorithm. This paper also presents a new tool in the slim-tree's arsenal of resources, aimed at visualizing it. Visualization is a powerful tool for interactive data mining and for the visual tracking of the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new slim-tree algorithms lead to performance improvements. These results show that the slim-tree outperforms the M-tree by up to 200% for range queries. For insertion and splitting, the minimal-spanning-tree-based algorithm achieves up to 40 times faster insertions. We observed improvements of up to 40% in range queries after applying the slim-down algorithm

In this paper, we propose a new method for indexing large amounts of point and spatial data in high-dimensional space. An analysis shows that index structures such as the R*-tree are not adequate for indexing high-dimensional data sets. The major problem of R-tree-based index structures is the overlap of the bounding boxes in the directory, which increases with growing dimension. To avoid this problem, we introduce a new organization of the directory which uses a split algorithm minimizing overlap and additionally utilizes the concept of supernodes. The basic idea of overlap-minimizing split and supernodes is to keep the directory as hierarchical as possible, and at the same time to avoid splits in the directory that would result in high overlap. Our experiments show that for high-dimensional data, the X-tree outperforms the well-known R*-tree and the TV-tree by up to two orders of magnitude.

Similarity searching has been a research issue for many years, and searching has probably become the most important web application today. As the complexity of data objects grows, it is more and more difficult to reason about digital objects otherwise than through the similarity. In this article, we first discuss concepts of similarity and searching in light of future perspectives before a concise history of similarity searching technology is presented. We use the historical knowledge to extend the trends to future. We analyze the bottlenecks of application development and discuss perspectives of search computing for future applications. We also present a model of search technology and its position in computer clouds for application development. Finally, execution platforms for multi-modal findability and security issues for outsourced similarity searching environments are suggested as important research challenges.

Locality-sensitive hashing (LSH) is the basis of many algorithms that use a probabilistic approach to find nearest neighbors. We describe an algorithm for optimizing the parameters and use of LSH. Prior work ignores these issues or suggests a search for the best parameters. We start with two histograms: one that characterizes the distributions of distances to a point's nearest neighbors and the second that characterizes the distance between a query and any point in the data set. Given a desired performance level (the chance of finding the true nearest neighbor) and a simple computational cost model, we return the LSH parameters that allow an LSH index to meet the performance goal and have the minimum computational cost. We can also use this analysis to connect LSH to deterministic nearest-neighbor algorithms such as k-d trees and thus start to unify the two approaches.

Querying the k nearest neighbors of a query point from a data set in high-dimensional space is one of the important operations in spatial databases. The classic nearest neighbor query algorithms are based on the R-tree. However, the R-tree suffers from overlapping minimum bounding rectangles, which makes its time complexity depend exponentially on the dimensionality of the space. Reducing the dimensionality is therefore the key point. The Hilbert curve fills high-dimensional space linearly, divides the space into equal-size grids and maps the points lying in those grids into a linear space. Using this dimensionality-reducing property of the Hilbert curve, the paper presents an approximate k nearest neighbor query algorithm, AKNN, and analyzes the quality of the k nearest neighbors in theory. According to the experimental results, the execution time of AKNN is shorter than that of the nearest neighbor query algorithm based on the R-tree in high-dimensional space, and the quality of the approximate k nearest neighbors satisfies the needs of real applications.
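
The mapping from grid cells to a linear order can be sketched for the two-dimensional case with the well-known iterative Hilbert index computation. This is an illustration only, not the AKNN algorithm from the abstract; `hilbert_index`, `hilbert_order`, and the choice of a power-of-two grid side are assumptions made for the example.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n-by-n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip so the next, finer level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(cells, n):
    """Sort grid cells by their position along the curve."""
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
```

Consecutive positions on the curve are always neighboring cells, which is why a one-dimensional index over these positions tends to keep nearby points close together. The converse does not hold in general, which is the boundary effect that multi-curve and multi-surrogate schemes try to mitigate.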

This paper is concerned with the packing of equal spheres in Euclidean spaces [n] of n > 8 dimensions. To be precise, a packing is a distribution of spheres any two of which have at most a point of contact in common. If the centres of the spheres form a lattice, the packing is said to be a lattice packing. The densest lattice packings are known for spaces of up to eight dimensions (1, 2), but not for any space of more than eight dimensions. Further, although non-lattice packings are known in [3] and [5] which have the same density as the densest lattice packings, none is known which has greater density than the densest lattice packings in any space of up to eight dimensions, neither, for any space of more than two dimensions, has it been shown that they do not exist.

Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method of choice, especially when the data is high-dimensional. At its core, LSH is based on hashing the data points to a number of buckets such that similar points are more likely to map to the same buckets. To guarantee high search quality, the LSH scheme needs a rather large number of hash tables. This entails a large space requirement, and in the distributed setting, with each query requiring a network call per hash bucket look up, this also entails a big network load. The Entropy LSH scheme proposed by Panigrahy significantly reduces the number of required hash tables by looking up a number of query offsets in addition to the query itself. While this improves the LSH space requirement, it does not help with (and in fact worsens) the search network efficiency, as now each query offset requires a network call. In this paper, focusing on the Euclidean space under $l_2$ norm and building upon Entropy LSH, we propose the distributed Layered LSH scheme, and prove that it exponentially decreases the network cost, while maintaining a good load balance between different machines. Our experiments also verify that our scheme results in a significant network traffic reduction that brings about large runtime improvement in real world applications.

The Signature Quadratic Form Distance on feature signatures represents a flexible distance-based similarity model for effective content-based multimedia retrieval. Although metric indexing approaches are able to speed up query processing by two orders of magnitude, their applicability to large-scale multimedia databases containing billions of images is still a challenging issue. In this paper, we propose a parallel approach that balances the utilization of CPU and many-core GPUs for efficient similarity search with the Signature Quadratic Form Distance. In particular, we show how to process multiple distance computations and other parts of the search procedure in parallel, achieving maximal performance of the combined CPU/GPU system. The experimental evaluation demonstrates that our approach implemented on a common workstation with 2 GPU cards outperforms traditional parallel implementation on a high-end 48-core NUMA server in terms of efficiency almost by an order of magnitude. If we consider also the price of the high-end server that is ten times higher than that of the GPU workstation then, based on price/performance ratio, the GPU-based similarity search beats the CPU-based solution by almost two orders of magnitude. Although proposed for the SQFD, our approach of fast GPU-based similarity search is applicable for any distance function that is efficiently parallelizable in the SIMT execution model.

Advances in computational geometry and machine learning that offer new methods for search, regression, and classification with large amounts of high-dimensional data.
Regression and classification methods based on similarity of the input to stored examples have not been widely used in applications involving very large sets of high-dimensional data. Recent advances in computational geometry and machine learning, however, may alleviate the problems in using these methods on large data sets. This volume presents theoretical and practical discussions of nearest-neighbor (NN) methods in machine learning and examines computer vision as an application domain in which the benefit of these advanced methods is often dramatic. It brings together contributions from researchers in theory of computation, machine learning, and computer vision with the goals of bridging the gaps between disciplines and presenting state-of-the-art methods for emerging applications. The contributors focus on the importance of designing algorithms for NN search, and for the related classification, regression, and retrieval tasks, that remain efficient even as the number of points or the dimensionality of the data grows very large. The book begins with two theoretical chapters on computational geometry and then explores ways to make the NN approach practicable in machine learning applications where the dimensionality of the data and the size of the data sets make the naïve methods for NN search prohibitively expensive. The final chapters describe successful applications of an NN algorithm, locality-sensitive hashing (LSH), to vision tasks.

A DataCutter framework that is designed to provide support for subsetting and processing of datasets in a distributed and heterogeneous environment is presented. The use of DataCutter with several data-intensive applications from diverse fields was illustrated. The experimental results demonstrate the impact of heterogeneity on an application, and further suggest that any static application organization will likely not perform efficiently in all cases. The DataCutter filtering service uses techniques such as careful placement of filters, multiple filter group instances, and transparent copies to adjust dynamically to the heterogeneity present in the targeted runtime environment.

Divide-and-conquer search strategies are described for satisfying proximity queries involving arbitrary distance metrics.

This paper introduces Hypercurves, a flexible framework for providing similarity search indexing to high-throughput multimedia services. Hypercurves efficiently and effectively answers k-nearest neighbor searches on multigigabyte high-dimensional databases. It supports massively parallel processing and adapts its parallelization regimens at runtime to keep answer times optimal for both low and high demands. In order to achieve its goals, Hypercurves introduces new techniques for selecting parallelism configurations and allocating threads to computation cores, including hyperthreaded cores. Its efficiency gains are thoroughly validated on a large database of multimedia descriptors, where it presented near-linear speedups and superlinear scaleups. The adaptation reduces query response times by 43% and 74% on the two platforms tested, when compared to the best static parallelism regimens.

A new k-medoids algorithm is presented for spatial clustering in large applications. The new algorithm utilizes the TIN (triangulated irregular network) of medoids to facilitate local computation when searching for the optimal medoids. It is more efficient than most existing k-medoids methods while retaining exactly the same clustering quality as the basic k-medoids algorithm. The application of the new algorithm to road network extraction from classified imagery is also discussed, and the preliminary results are encouraging.

In this paper the impact of the metric indexing paradigm on the real-world applications is discussed. We pose questions whether the priorities in research of metric access methods (MAMs) established in the past decades reflect the actual needs of practitioners. In particular, we formulate the following pragmatic questions: Are the established MAM cost measures relevant? Isn't the metric space model too general when the majority of real-world applications use Lp spaces? On the other hand, isn't the metric model too restrictive with respect to the growing community of practitioners using non-metric distances? Are the simple similarity queries competitive enough? Have the real-world similarity search engines ever used a general metric access method, or do they use specific indexing? Is there a real demand for content-based similarity search or will the annotations and keyword search win the game? We present justification of these questions, investigating relevant literature and search engines. Finally, we try to transform the questions into answers and suggestions to the future research on MAMs.

Similarity indices for high-dimensional data are very desirable for building content-based search systems for feature-rich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been proposed as indexing techniques for approximate similarity search. A significant drawback of these approaches is the requirement for a large number of hash tables in order to achieve good search quality. This paper proposes a new indexing scheme called multi-probe LSH that overcomes this drawback. Multi-probe LSH is built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by and improves upon recent theoretical work on entropy-based LSH designed to reduce the space requirement of the basic LSH method. We have implemented the multi-probe LSH method and evaluated the implementation with two different high-dimensional datasets. Our evaluation shows that the multi-probe LSH method substantially improves upon previously proposed methods in both space and time efficiency. To achieve the same search quality, multi-probe LSH has a similar time-efficiency

Multiattribute hashing and its variations have been proposed for partial match and range queries in the past. The main idea is that each record yields a bitstring (“record signature”), according to the values of its attributes. The binary value of this string decides the bucket in which the record is stored. In this paper we propose to use Gray codes instead of binary codes in order to map record signatures to buckets. In Gray codes, successive codewords differ in the value of exactly one bit position; thus, successive buckets hold records with similar record signatures. The proposed method achieves better clustering of similar records and avoids some of the (expensive) random disk accesses, replacing them with sequential ones. We develop a mathematical model, derive formulas giving the average performance of both methods, and show that the proposed method achieves 0% to 50% relative savings over the binary codes. We also discuss how Gray codes could be applied to some retrieval methods designed for range queries, such as the grid file [Nievergelt84a] and the approach based on the so-called z-ordering [Orenstein84a].
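
The one-bit-difference property mentioned in this abstract can be illustrated with a short sketch. It assumes the standard binary-reflected Gray code, which the abstract does not name explicitly; the helper names `gray_encode` and `hamming` are hypothetical and not taken from the paper.

```python
def gray_encode(i: int) -> int:
    """Return the i-th codeword of the binary-reflected Gray code (assumed variant)."""
    return i ^ (i >> 1)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Codewords for bucket positions 0..7 (3-bit signatures).
codes = [gray_encode(i) for i in range(8)]  # [0, 1, 3, 2, 6, 7, 5, 4]

# Successive codewords differ in exactly one bit, so neighbouring
# buckets hold records whose signatures are similar.
one_bit_steps = all(hamming(codes[i], codes[i + 1]) == 1 for i in range(7))
```

With plain binary codes, by contrast, moving from bucket 3 (011) to bucket 4 (100) flips three bits at once.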

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
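
The abstract does not spell out the seeding rule. The sketch below assumes the commonly described "D² weighting", in which each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far. The function names `sq_dist` and `d2_seeding` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def d2_seeding(points, k, rng=None):
    """Pick k initial centers by D^2-weighted sampling (assumed seeding rule)."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]  # first center: uniform at random
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # every point coincides with a chosen center; fall back to uniform
            centers.append(rng.choice(points))
        else:
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        # refresh nearest-center distances with the newly added center
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]
    return centers
```

The returned centers would then be handed to an ordinary Lloyd-style k-means iteration as its starting configuration.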

Asymptotic results from the statistical theory of k-means clustering are applied to problems of vector quantization. The behavior of quantizers constructed from long training sequences of data is analyzed by relating it to the consistency problem for k-means.

This work focuses on fast nearest neighbor (NN) search algorithms that can work in any metric space (not just with the Euclidean distance) and where the distance computation is very time consuming. One of the best-known methods in this field is the AESA algorithm, used as a baseline for performance measurement for over twenty years. AESA works in two steps that repeat: first it searches for a promising NN candidate and computes its distance (approximation step); next it eliminates all the unsuitable NN candidates in view of the new information acquired in the previous calculation (elimination step). This work introduces the PiAESA algorithm, which improves the performance of AESA by splitting the approximation criterion: in the first iterations, when there is not enough information to find good NN candidates, it uses a list of pivots (objects in the database) to obtain a cheap approximation of the distance function. Once a good approximation is obtained, it switches to the usual AESA behavior. As the pivot list is built at preprocessing time, the run time of PiAESA is almost the same as that of AESA. We report experiments comparing it with some competing methods. Our empirical results show that this new approach obtains a significant reduction in distance computations with no execution time penalty.
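
The approximation/elimination loop described in this abstract can be sketched as follows. This is a minimal illustration of the baseline AESA idea, not of PiAESA. It assumes a fully precomputed distance matrix and uses the triangle-inequality lower bound as the approximation criterion, which is only one of the criteria used in the literature. The names `build_matrix` and `aesa_nn` are hypothetical.

```python
def build_matrix(db, dist):
    """Preprocessing: all pairwise distances between database objects."""
    return [[dist(a, b) for b in db] for a in db]


def aesa_nn(db, matrix, query, dist):
    """Return (index, distance, number of distance computations) of the NN."""
    alive = set(range(len(db)))
    lower = [0.0] * len(db)  # lower bounds on dist(query, db[i])
    best, best_d, computed = None, float("inf"), 0
    while alive:
        # Approximation step: pick the candidate with the smallest lower bound.
        c = min(alive, key=lambda i: lower[i])
        alive.remove(c)
        d = dist(query, db[c])
        computed += 1
        if d < best_d:
            best, best_d = c, d
        # Elimination step: |d(q,c) - d(c,i)| <= d(q,i) by the triangle
        # inequality, so any i whose bound reaches best_d cannot improve it.
        for i in list(alive):
            lower[i] = max(lower[i], abs(d - matrix[c][i]))
            if lower[i] >= best_d:
                alive.remove(i)
    return best, best_d, computed
```

For example, with `db = [0, 3, 7, 12, 20]`, `dist = lambda a, b: abs(a - b)` and `query = 8`, the search returns index 2 (the object 7) at distance 1, typically after fewer distance computations than a linear scan.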

The nearest neighbor search (NNS) problem is the following: given a set of n points P = {p_1, …, p_n} in some metric space X, preprocess P so as to efficiently answer queries which require finding a point in P closest to a query point q ∈ X. The approximate nearest neighbor search (c-NNS) is a relaxation of NNS which allows returning any point within c times the distance to the nearest neighbor (called a c-nearest neighbor). This problem is of major and growing importance to a variety of applications. In this paper, we give an algorithm for (4⌈log_{1+ρ} log 4d⌉ + 1)-NNS in l_∞^d with O(d n^{1+ρ} log^{O(1)} n) storage and O(d log^{O(1)} n) query time. Moreover, we obtain an algorithm for 3-NNS for l_∞ with n^{log d + 1} storage. The preprocessing time is close to linear in the size of the data structure. The algorithm can also be used (after simple modifications) to output the exact nearest neighbor in time bounded by O(d log^{O(1)} n) plus the number of (4⌈log_{1+ρ} log 4d⌉ + 1)-nearest neighbors of the query point. Building on this result, we also obtain an approximation algorithm for a general class of product metrics. Finally, we show that for any c

Motivated by the urgent need to improve the efficiency of similarity queries, approximate similarity retrieval is investigated in the environment of a metric tree index called the M-tree. Three different approximation techniques are proposed, which show how to forsake query precision for improved performance. Measures are defined that can quantify the improvements in performance efficiency and the quality of approximations. The proposed approximation techniques are then tested on various synthetic and real-life files. The evidence obtained from the experiments confirms our hypothesis that a high-quality approximated similarity search can be performed at a much lower cost than that needed to obtain the exact results. The proposed approximation techniques are scalable and appear to be independent of the metric used. Extensions of these techniques to the environments of other similarity search indexes are also discussed.

This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm and tests several methods for selecting initial medoids. The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step. To evaluate the proposed algorithm, we use some real and artificial data sets and compare with the results of other algorithms in terms of the adjusted Rand index. Experimental results show that the proposed algorithm takes significantly less computation time than partitioning around medoids (PAM), with comparable performance.
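
A k-means-like k-medoids iteration over a precomputed distance matrix, as outlined in this abstract, might look like the sketch below. The initial medoids are passed in by the caller because the abstract does not say which selection method works best. The function names `assign` and `kmedoids` and the `max_iter` parameter are hypothetical.

```python
def assign(D, medoids):
    """Label each object with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: D[i][medoids[c]])
            for i in range(len(D))]


def kmedoids(D, medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing.

    D is the full distance matrix, computed once up front; medoids is a list
    of object indices used as the starting configuration.
    """
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(D, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # empty cluster: keep the previous medoid
                updated.append(old)
                continue
            # new medoid: the member with the smallest total distance to the rest
            updated.append(min(members, key=lambda m: sum(D[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(D, medoids)
```

Because every step only reads entries of `D`, no distance is recomputed after preprocessing, which is the source of the time savings the abstract reports.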

We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the l_2 norm. It also yields the first known provably efficient approximate NN algorithm for the case p less than or equal to 1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying a certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on points in the Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors and is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than a kd-tree.
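
The abstract does not give the hash function itself. A minimal sketch for the l_2 case, assuming the commonly cited form h(v) = ⌊(a·v + b) / w⌋ with a drawn from a Gaussian (a 2-stable distribution) and b uniform in [0, w), is shown below. The name `make_pstable_hash` and the bucket width parameter `w` are hypothetical.

```python
import math
import random


def make_pstable_hash(dim, w, rng):
    """Build one hash function h(v) = floor((a . v + b) / w) for the l_2 norm.

    Assumed construction: a has i.i.d. standard normal entries (2-stable),
    b is uniform in [0, w), and w is the bucket width.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


# Points that are close in l_2 land in the same bucket more often than
# points that are far apart, which is what makes the family locality-sensitive.
rng = random.Random(0)
hashes = [make_pstable_hash(2, 4.0, rng) for _ in range(2000)]
p, near, far = (0.0, 0.0), (0.1, 0.0), (10.0, 0.0)
near_hits = sum(h(p) == h(near) for h in hashes)
far_hits = sum(h(p) == h(far) for h in hashes)
```

In a full index, several such functions would be concatenated into a bucket key and the whole construction repeated over multiple tables.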

This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

New architectural trends in chip design resulted in machines with multiple processing units as well as efficient communication networks, leading to the wide availability of systems that provide multiple levels of parallelism, both inter- and intra-machine. Developing applications that efficiently make use of such systems is a challenge, especially for application-domain programmers. In this paper we present a new version of the Anthill programming environment that efficiently exploits multi-level parallelism, along with experimental results that demonstrate such efficiency. Anthill is based on the filter-stream model; in this model, applications are decomposed into a set of filters communicating through streams, which has already been shown to be efficient for expressing inter-machine parallelism. We replaced the filter run-time environment, originally process-oriented, with an event-oriented version. This new version allows programmers to efficiently express opportunities for parallelism within each compute node through a higher-level programming abstraction. We evaluated our solution on dual- and quad-core machines with two data mining applications: Eclat and KNN. Both had drops in execution time nearly proportional to the number of cores on a single machine. When using a cluster of dual-core machines, speed-ups were close to linear in the number of available cores for both applications, confirming that event-oriented Anthill performs well at both the inter- and intra-machine parallelism levels.

Given a set S of n points in d-dimensional Euclidean space E^d, and a query point q ∈ E^d, we wish to determine the nearest neighbor of q, that is, the point of S whose Euclidean distance to q is minimum. The goal is to preprocess the point set S, such that queries can be answered as efficiently as possible. We assume that the dimension d is a constant independent of n. Although reasonably good solutions to this problem exist when d is small, as d increases the performance of these algorithms degrades rapidly. We present a randomized algorithm for approximate nearest neighbor searching. Given any set of n points S ⊂ E^d, and a constant ε > 0, we produce a data structure, such that given any query point, a point of S will be reported whose distance from the query point is at most a factor of (1 + ε) from that of the true nearest neighbor. Our algorithm runs in O(log n) expected time and requires O(n log n) space. The data structure can be built in … expected time. The constant factors depend on d and ε. Because of the practical importance of nearest neighbor searching in higher dimensions, we have implemented a practical variant of this algorithm, and show empirically that for many point distributions this variant of the algorithm finds the nearest neighbor in moderately large dimension significantly faster than existing practical approaches.
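
The (1 + ε) guarantee stated in this abstract can be made concrete with a small brute-force checker. This only illustrates the definition of an acceptable answer, not the randomized data structure itself; the function name `is_eps_approx_nn` is hypothetical.

```python
import math


def is_eps_approx_nn(points, q, candidate, eps):
    """True if candidate is within a factor (1 + eps) of the true NN distance.

    The true nearest-neighbor distance is found by a linear scan, so this
    checker is only meant for validating answers on small inputs.
    """
    nn_dist = min(math.dist(p, q) for p in points)
    return math.dist(candidate, q) <= (1.0 + eps) * nn_dist
```

For instance, with points (0, 0), (3, 0) and (10, 0) and query (1, 0), the true nearest neighbor is at distance 1. The point (3, 0), at distance 2, is an acceptable answer for ε = 1 but not for ε = 0.5.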