K-medoids LSH: a new locality sensitive hashing in general
metric space
Eliezer S. Silva, Eduardo Valle
RECOD Lab DCA/FEEC, University of Campinas, Brazil
{eliezers, dovalle}@dca.fee.unicamp.br
Abstract. The increasing availability of multimedia content poses a challenge for information retrieval researchers. Users want not only to have access to multimedia documents, but also to make sense of them: the ability of finding specific content in extremely large collections of textual and non-textual documents is paramount. At such large scales, Multimedia Information Retrieval systems must rely on the ability to perform search by similarity efficiently. However, multimedia documents are often represented by high-dimensional feature vectors, or by other complex representations in metric spaces. Providing efficient similarity search for that kind of data is extremely challenging. In this article, we explore one of the most cited families of solutions for similarity search, Locality-Sensitive Hashing (LSH), which is based upon the creation of hashing functions that assign, with higher probability, the same key to data that are similar. LSH is available only for a handful of distance functions, but, where available, it has been found to be extremely efficient for architectures with uniform access cost to the data. Most existing LSH functions are restricted to vector spaces. We propose a novel LSH method for generic metric spaces based on K-medoids clustering. We present comparisons with well-established LSH methods in vector spaces and with recent competing methods for metric spaces. Our early results show promise, but also demonstrate how challenging it is to work around those difficulties.
Categories and Subject Descriptors: H.2 [Database Management]: Miscellaneous; H.3.1 [Content Analysis and
Indexing]: Indexing methods; H.3.3 [Information Search and Retrieval]: Miscellaneous
Keywords: hashing, metric space indexing, nearest neighbor search, similarity search
1. INTRODUCTION
Content-based Multimedia Information Retrieval (CMIR) is an alternative to keyword-based or tag-
based retrieval, which works by extracting features based on distinctive properties of the multimedia
objects. Those features are organized in multimedia descriptors, which are used as surrogates of the
multimedia object, in such a way that the retrieval of similar objects is based solely on that higher
level representation, without the need to refer to the actual low-level encoding of the media. The
descriptor can be seen as a compact and distinctive representation of multimedia content, encoding
some invariant properties of the content. For example, in image retrieval the successful Scale Invariant
Feature Transform (SIFT) [8] encodes local gradient patterns around Points-of-Interest, in a way that
is (partially) invariant to illumination and geometric transformations.
The descriptor framework also allows multimedia retrieval systems to abstract away the media details. The operation of looking for similar multimedia documents becomes the more abstract operation of looking for multimedia descriptors with small distances between them. The notion of a "feature space" emerges, one that organizes the documents in a geometry, putting close together those that are similar. Of course, CMIR systems are usually much more complex than that, but looking for similar descriptors often plays a critical role in the processing chain of a complex system.
Although the operation of finding descriptors which have small distances seems simple enough,
performing it fast for multimedia descriptors is actually very challenging, due to the scale of the
collections, the dimensionality of the descriptors, and the diversity of distance functions [1]. The literature on the subject is very extensive, but in this article we focus on one of the most cited families of solutions, Locality-Sensitive Hashing (LSH) [5; 4; 3], proposing K-medoids LSH as an extension to general metric spaces.
2. LOCALITY SENSITIVE HASHING (LSH)
The LSH indexing method relies on a family of locality-sensitive hashing functions H [5] to map objects from a metric domain U in a D-dimensional space (usually R^D) to a countable set C (usually Z), with the following property: nearby points in the high-dimensional space are hashed to the same value with high probability.
Definition 1. Given a distance function d : U × U → R+, a function family H = {h : U → C} is (r, cr, p_1, p_2)-sensitive for a given data set S ⊆ U if, for any points p, q ∈ S and h ∈ H:
- if d(p, q) ≤ r then Pr_H[h(q) = h(p)] ≥ p_1 (probability of colliding within the ball of radius r);
- if d(p, q) > cr then Pr_H[h(q) = h(p)] ≤ p_2 (probability of colliding outside the ball of radius cr);
where c > 1 and p_1 > p_2.
The basic scheme [5] provided locality-sensitive families for the Hamming distance on Hamming spaces, and the Jaccard distance in spaces of sets. For several years those were the only families available, although extensions for L_1-normed (Manhattan) and L_2-normed (Euclidean) spaces were proposed by embedding those spaces into a Hamming space [4].
The practical success of LSH, however, came with E2LSH (Euclidean LSH) [3] (LSH Algorithm and Implementation (E2LSH): http://www.mit.edu/~andoni/LSH/, accessed 22/09/2013), for L_p-normed spaces, where a new family of LSH functions was introduced:

H = {h_i : R^D → Z}    (1)

h_i(v) = ⌊(a_i · v + b_i) / w⌋    (2)

where a_i ∈ R^D is a random vector with each coordinate picked independently from a Gaussian distribution N(0, 1), b_i is an offset value sampled from a uniform distribution in the range [0, w], and w is a scalar quantization width. Applying h_i to a point or object v corresponds to the composition of a projection onto a random line and a quantization operation, given by the quantization width w and the floor operation.
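For concreteness, a minimal sketch of one such function h_i is given below. It is illustrative only: the class name and the use of NumPy are ours, not part of the E2LSH package.

import numpy as np

class E2LSHFunction:
    """One h_i from the E2LSH family: project onto a random line, shift, and quantize."""
    def __init__(self, dim, w, rng=None):
        rng = rng or np.random.default_rng()
        self.a = rng.standard_normal(dim)   # coordinates drawn i.i.d. from N(0, 1)
        self.b = rng.uniform(0, w)          # offset in [0, w]
        self.w = w                          # quantization width

    def __call__(self, v):
        # projection onto the random line, shift, and floor quantization (Equation 2)
        return int(np.floor((np.dot(self.a, v) + self.b) / self.w))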
A function family G is constructed by concatenating M randomly sampled functions h_i ∈ H, such that each g_j ∈ G has the following form: g_j(v) = (h_1(v), ..., h_M(v)). The use of multiple h_i functions reduces the probability of false positives, since two objects will have the same key for g_j only if their values coincide for all h_i component functions. Each object v from the input dataset is indexed by hashing it against L hash functions g_1, ..., g_L. At the search phase, a query object q is hashed using the same L hash functions, and the objects stored in the corresponding buckets are used as the candidate set. Then, a ranking is performed within the candidate set according to their distance to the query, and the k closest objects are returned.
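The full indexing and querying scheme can be sketched as follows. Again this is illustrative: the class and parameter names are ours, and the Euclidean re-ranking at the end assumes vector data.

import numpy as np
from collections import defaultdict
import heapq

class E2LSHIndex:
    """Sketch of the E2LSH index: L tables, each keyed by an M-fold concatenation g_j."""
    def __init__(self, dim, w, M, L, seed=0):
        rng = np.random.default_rng(seed)
        # each g_j is represented by an (M, dim) projection matrix and M offsets
        self.A = [rng.standard_normal((M, dim)) for _ in range(L)]
        self.B = [rng.uniform(0, w, size=M) for _ in range(L)]
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]
        self.data = None

    def _key(self, j, v):
        # g_j(v): tuple of the M quantized projections
        return tuple(np.floor((self.A[j] @ v + self.B[j]) / self.w).astype(int))

    def index(self, points):
        self.data = np.asarray(points)
        for i, v in enumerate(self.data):
            for j in range(len(self.tables)):
                self.tables[j][self._key(j, v)].append(i)

    def query(self, q, k=10):
        # union of the L colliding buckets forms the candidate set
        candidates = set()
        for j in range(len(self.tables)):
            candidates.update(self.tables[j].get(self._key(j, q), []))
        # final re-ranking of the shortlist by exact distance to the query
        return heapq.nsmallest(k, candidates,
                               key=lambda i: np.linalg.norm(self.data[i] - q))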
A few recent works approach the problem of designing LSH indexing schemes using only distance information: the M-Index [9], DFLSH (Distribution-Free Locality-Sensitive Hashing) [6], and the method of Tellez and Chavez [13].
K-means LSH. Paulevé et al. [12] present a comparison between structured (regular random projections, lattices) and unstructured (K-means and hierarchical K-means) quantizers in the task of searching high-dimensional SIFT descriptors, resulting in the proposal of a new LSH family based on the latter (Equation 3). Results indicate that the LSH functions based on unstructured quantizers perform better, as the induced Voronoi partitioning adapts to the data distribution, generating a more uniform population of the hash cells than the structured LSH quantizers. However, K-means and hierarchical K-means are clustering algorithms for vector spaces, which restricts the applicability of this approach. In order to overcome this limitation, we turn to a clustering algorithm designed to work in generic metric spaces: K-medoids clustering.
3. K-MEDOIDS LSH
We propose a novel method for locality-sensitive hashing in the metric search framework and compare it with other similar methods in the literature. Our method is rooted in the idea of partitioning the data space with a distance-based clustering algorithm (K-medoids) as an initial quantization step and constructing a hash table from the quantization results. K-medoids LSH follows a direct approach taken from [12]: each partition cell is a hash table bucket, and therefore the hash value is the index of the partition cell.
Definition 2. Given a metric space (U, d) (U is the domain set and d is the distance function), the set of cluster centers C = {c_1, . . . , c_k} ⊆ U and an object x ∈ U:

h_C : U → N
h_C(x) = argmin_{i=1,...,k} d(x, c_i)    (3)
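A direct reading of Equation 3 in code (a minimal sketch; the function name and the dist callable are ours):

def kmedoids_hash(x, medoids, dist):
    """h_C(x): index of the nearest medoid, used as the bucket key (Equation 3)."""
    return min(range(len(medoids)), key=lambda i: dist(x, medoids[i]))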
PAM (Partitioning Around Medoids) [7] is the classic algorithm for K-medoids clustering. In that work a medoid is defined as the most centrally located object within a cluster. The algorithm initially selects a group of medoids at random (or using some heuristic), then iteratively assigns each non-medoid point to the nearest medoid and updates the medoid set, looking for the optimal set of medoid points that minimizes the sum of distances from points to their medoids. PAM has a prohibitive computational cost, since it performs O(k(n − k)²) distance calculations in each iteration. There are a few methods designed to cope with that complexity. Park and Jun [11] propose a simple approximate algorithm based on PAM with significant performance improvements: instead of searching the entire dataset for a new optimal medoid, the method restricts the search to points within the cluster. We chose to apply this method for its simplicity and performance.
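The following is a simplified sketch of such a within-cluster medoid update, in the spirit of Park and Jun [11]; it is not their exact algorithm (in particular, their initialization and stopping criteria differ), and all names are ours.

import random

def fast_kmedoids(points, k, dist, max_iter=20, rng=random.Random(0)):
    """Simplified K-medoids: Lloyd-style iterations where the new medoid of each
    cluster is searched only within that cluster, as in Park and Jun's approach."""
    medoids = rng.sample(range(len(points)), k)          # random initialization
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest medoid
        clusters = [[] for _ in range(k)]
        for i, x in enumerate(points):
            j = min(range(k), key=lambda m: dist(x, points[medoids[m]]))
            clusters[j].append(i)
        # update step: the new medoid minimizes the sum of distances within the cluster
        new_medoids = []
        for j, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[j])           # keep an empty cluster's medoid
                continue
            new_medoids.append(min(members,
                key=lambda c: sum(dist(points[c], points[i]) for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return [points[m] for m in medoids]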
Therefore, our baseline method consists of the following steps (a sketch of the full pipeline is given after the list):
(1) Generate L lists of k cluster centers (C_i = {c_i1, . . . , c_ik}, i ∈ {1, . . . , L}) in L runs of the Park and Jun fast K-medoids algorithm (as in Paulevé et al. [12], this clustering is done over a sample of the dataset).
(2) Index each point x of the dataset using h_{C_i}(x) as the bucket key for each table (i ∈ {1, . . . , L}).
(3) Given a query point q, hash it using h_{C_i}(q) (i ∈ {1, . . . , L}) and retrieve all colliding points into a candidate set for the kNN (k nearest neighbors) search. Then perform a linear scan over that candidate list.
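Putting the three steps together, a sketch of the baseline pipeline, reusing the fast_kmedoids and kmedoids_hash sketches above (class name, parameters and defaults are ours, not the exact implementation used in the experiments):

from collections import defaultdict
import heapq, random

class KMedoidsLSH:
    """Sketch of the baseline pipeline: L hash tables, each keyed by the index of
    the nearest medoid from an independently clustered sample (steps 1-3 above)."""
    def __init__(self, points, dist, k_centers, L, sample_size=1000, seed=0):
        rng = random.Random(seed)
        self.points, self.dist = points, dist
        self.center_lists, self.tables = [], []
        for _ in range(L):
            # step (1): fast K-medoids on a random sample of the dataset
            sample_idx = rng.sample(range(len(points)), min(sample_size, len(points)))
            centers = fast_kmedoids([points[i] for i in sample_idx], k_centers, dist, rng=rng)
            # step (2): bucket key = index of the nearest medoid
            table = defaultdict(list)
            for i, x in enumerate(points):
                table[kmedoids_hash(x, centers, dist)].append(i)
            self.center_lists.append(centers)
            self.tables.append(table)

    def query(self, q, k=10):
        # step (3): union of the colliding buckets, then a linear scan over the shortlist
        candidates = set()
        for centers, table in zip(self.center_lists, self.tables):
            candidates.update(table.get(kmedoids_hash(q, centers, self.dist), []))
        return heapq.nsmallest(k, candidates, key=lambda i: self.dist(self.points[i], q))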
Initialization. K-means clustering results exhibit a strong dependence on the initial cluster selection. K-means++ [10; 2] solves the O(log k)-approximate K-means problem (the exact problem is NP-hard) simply by carefully designing a good initialization algorithm. That raises the question of how the initialization could affect the final result of the kNN search for the proposed methods. The K-means++ initialization method is based on metric distance information and sampling, and thus can be plugged into a K-medoids algorithm without further changes. Park and Jun [11] also propose a special initialization for their fast K-medoids algorithm. We implemented both of those initializations, as well as random selection, and empirically evaluated their impact on the similarity search task (a distance-only sketch of the K-means++ seeding is given below).
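The K-means++ seeding uses only pairwise distances (D² sampling), which is what makes it directly applicable to K-medoids. A minimal sketch (the function name is ours):

import random

def kmeanspp_seeds(points, k, dist, rng=random.Random(0)):
    """K-means++-style seeding using only distances: each new center is sampled with
    probability proportional to the squared distance to the nearest center so far."""
    centers = [points[rng.randrange(len(points))]]
    d2 = [dist(x, centers[0]) ** 2 for x in points]
    for _ in range(k - 1):
        total = sum(d2)
        if total == 0:                      # all points already coincide with a center
            centers.append(points[rng.randrange(len(points))])
            continue
        r, acc, idx = rng.uniform(0, total), 0.0, len(d2) - 1
        for i, w in enumerate(d2):          # roulette-wheel selection over d2
            acc += w
            if acc >= r:
                idx = i
                break
        centers.append(points[idx])
        # keep, for each point, the squared distance to its nearest chosen center
        d2 = [min(d2[i], dist(x, points[idx]) ** 2) for i, x in enumerate(points)]
    return centers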
3.1 Experiments and analysis
Datasets. For the initial experiments we used a dataset called APM (Arquivo Público Mineiro, the Public Archives of Minas Gerais), created by applying various transformations (15 transformations of scale, rotation, etc.) to 100 antique photographs from the Public Archives. The SIFT descriptors of each transformed image were computed, resulting in a dataset of 2,871,300 feature vectors (a SIFT descriptor is a 128-dimensional vector). The query dataset is built from the SIFT descriptors of the original images (a total of 263,968 feature vectors). Each query point is equipped with its set of true nearest neighbors, the ground truth. For these experiments we used 500 points uniformly sampled from the query dataset and performed a 10-NN search.
Metrics. We are concerned especially with the relation between recall and selectivity as the methods' parameters vary. The recall metric is the fraction of correct answers retrieved over the number of true relevant answers. The selectivity metric is the fraction of the dataset selected for the shortlist processing (the term selectivity has distinct uses in the database and image-retrieval research communities; here we adopt the latter, following Paulevé et al. [12]). Selectivity is an important metric for the methods under consideration, since the shortlist processing is a bottleneck in the whole query processing [12].
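Both metrics are straightforward to compute; a sketch, with names of our choosing:

def recall(retrieved, ground_truth):
    """Fraction of the true nearest neighbors that appear among the returned answers."""
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)

def selectivity(candidate_set_size, dataset_size):
    """Fraction of the dataset that reaches the shortlist (sequential re-ranking) stage."""
    return candidate_set_size / dataset_size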
K-means LSH, K-medoids LSH and DFLSH. In order to evaluate the feasibility of K-medoids LSH we compared it with the K-means LSH of Paulevé et al. [12] and with DFLSH [6]. We implemented all the methods in the same framework, namely Java and the Apache Commons Mathematics Library (http://commons.apache.org/proper/commons-math/, accessed 22/09/2013). K-means LSH is used as a baseline method, since its performance and behavior are well studied in the literature, and DFLSH as an alternative method for the same problem. Figure 1 shows recall, query time and selectivity statistics averaged over 500 queries and using a single hash function (including more than one hash function would have approximately the same effect on the recall and query time of all methods; however, the preprocessing time of K-means LSH and K-medoids LSH, which involve extra rounds of the optimization procedure, would increase much more than that of DFLSH, which merely samples more points).
Figure 1a presents results relating recall to selectivity. Notice that a high selectivity means that a large part of the dataset is being processed in the final sequential ranking; that is not a desirable operating point, since the main performance bottleneck in the query processing time is located at that stage (Figure 1b). For a selectivity as low as 0.3%, both methods' recall is approximately 65%, and with 1% selectivity it grows up to 80%. These results could be improved by using more than one hash function. Clearly, K-means LSH presents the best result on the recall metric; however, it is important to notice that both DFLSH and K-medoids LSH do not exploit any vector-coordinate properties, relying solely on distance information.
Figure 1b depicts a strong linear correlation between query time and selectivity. Another noticeably strong correlation appears between selectivity and the number of cluster centers (Figure 1c). Theoretically, it is possible to show that, given an approximately uniform population of points in the hash buckets, the selectivity for t cluster centers is O(1/t). The plot shows that the experimental data follow a power-law curve.
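A one-line version of that argument, under the stated assumption that each of the t buckets of a table holds roughly n/t of the n points and that a query with a single hash function probes one bucket:

\mathrm{selectivity}(t) \;=\; \frac{\mathbb{E}\big[\,|\text{shortlist}|\,\big]}{n} \;\approx\; \frac{n/t}{n} \;=\; \frac{1}{t} \;=\; O(t^{-1}).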
Using DFLSH as a baseline, we plot the difference in recall from K-means LSH and K-medoids LSH to DFLSH (Figure 1d). The recall difference can be up to 0.07 (11%) in favor of K-means LSH and up to 0.03 (5%) for K-medoids LSH. The trend in the differences is sustained along the curves for different values of selectivity. There is a clear indication that K-medoids LSH is a viable approach with performance equivalent to K-means LSH and DFLSH.

Fig. 1: Comparison of K-means LSH, K-medoids LSH and DFLSH for 10-NN. (a) Recall × Selectivity (log scale); (b) Query time (ms) × Selectivity; (c) Selectivity × Number of cluster centers; (d) Difference of Recall × Selectivity (log scale).
Initialization effect. First of all, it is important to notice that the initial segments of the curves in Figure 2 are not very informative: those points correspond to numbers of clusters up to 20, implying an almost brute-force search over the dataset (notice the explosion of the selectivity in Figure 1c). Figure 2 depicts the difference in the recall metric for the two initialization algorithms, using random selection as a baseline. The K-means++ initialization affects the results positively, with an average gain of 3%. On the other hand, the Park and Jun initialization does not present a clear gain over the random baseline (in fact, the average gain is negative). Further experiments with more datasets are needed in order to achieve statistical confidence in these results, but the initial analysis indicates that the K-means++ initialization can contribute to better recall, while the Park and Jun method has a null (or negative) effect.

Fig. 2: Effect of the initialization procedure for the K-medoids clustering on the quality of nearest-neighbor search.
4. CONCLUSION
Efficient large-scale similarity search is a crucial operation for Content-based Multimedia Information
Retrieval (CMIR) systems. But because those systems employ high-dimensional feature vectors, or
other complex representations in metric spaces, providing fast similarity search for them has been
a persistent research challenge. LSH, a very successful family of methods, has been advanced as a
solution to the problem, but it is available only for a few distance functions. In this article we propose
to address that limitation, by extending LSH to general metric spaces, using K-medoids clustering as
the basis for an LSH family of functions. We show in our experiments that K-medoids LSH improves the results over the random selection of centers used by DFLSH, while keeping the advantage of relying only on distance information. As expected, K-medoids LSH performs slightly worse than K-means LSH, but it is important to note that K-means relies heavily on the vector-space structure, and many data types of interest do not offer such structure.
REFERENCES
[1] Fernando Akune, Eduardo Valle, and Ricardo Torres. MONORAIL: A Disk-Friendly Index for Huge Descriptor Databases. In 2010 20th International Conference on Pattern Recognition, pages 4145–4148. IEEE, August 2010.
[2] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[3] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, SCG '04, page 253, New York, NY, USA, 2004. ACM Press.
[4] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB '99, pages 518–529, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[5] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC '98, pages 604–613, New York, NY, USA, 1998. ACM Press.
[6] Byungkon Kang and Kyomin Jung. Robust and Efficient Locality Sensitive Hashing for Nearest Neighbor Search in Large Data Sets. In NIPS Workshop on Big Learning (BigLearn), pages 1–8, Lake Tahoe, Nevada, 2012.
[7] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 9th edition, March 1990.
[8] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.
[9] David Novak, Martin Kyselak, and Pavel Zezula. On locality-sensitive indexing in generic metric spaces. In Proceedings of the Third International Conference on SImilarity Search and APplications, SISAP '10, page 59, 2010.
[10] Rafail Ostrovsky, Yuval Rabani, Leonard Schulman, and Chaitanya Swamy. The Effectiveness of Lloyd-Type Methods for the k-Means Problem. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 165–176. IEEE, December 2006.
[11] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2):3336–3341, 2009.
[12] Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 31(11):1348–1358, August 2010.
[13] Eric Sadit Tellez and Edgar Chavez. On locality sensitive hashing in metric spaces. In Proceedings of the Third International Conference on SImilarity Search and APplications, SISAP '10, pages 67–74, New York, NY, USA, 2010. ACM.