K-medoids LSH: a new locality sensitive hashing in general
metric space
Eliezer S. Silva, Eduardo Valle
RECOD Lab – DCA/FEEC, University of Campinas, Brazil
{eliezers, dovalle}@dca.fee.unicamp.br
Abstract. The increasing availability of multimedia content poses a challenge for information retrieval researchers. Users want not only to access multimedia documents, but also to make sense of them: the ability to find specific content in extremely large collections of textual and non-textual documents is paramount. At such large scales, Multimedia Information Retrieval systems must rely on the ability to perform search by similarity efficiently. However, multimedia documents are often represented by high-dimensional feature vectors, or by other complex representations in metric spaces, and providing efficient similarity search for that kind of data is extremely challenging. In this article, we explore one of the most cited families of solutions for similarity search, Locality-Sensitive Hashing (LSH), which is based on the creation of hashing functions that assign, with high probability, the same key to data items that are similar. LSH is available only for a handful of distance functions, but, where available, it has been found to be extremely efficient for architectures with uniform access cost to the data. Most existing LSH functions are restricted to vector spaces. We propose a novel LSH method for generic metric spaces based on K-medoids clustering. We present comparisons with well-established LSH methods in vector spaces and with recent competing methods for metric spaces. Our early results show promise, but also demonstrate how challenging it is to work around the difficulties involved.
Categories and Subject Descriptors: H.2 [Database Management]: Miscellaneous; H.3.1 [Content Analysis and
Indexing]: Indexing methods; H.3.3 [Information Search and Retrieval]: Miscellaneous
Keywords: hashing, metric space indexing, nearest neighbor search, similarity search
1. INTRODUCTION
Content-based Multimedia Information Retrieval (CMIR) is an alternative to keyword-based or tag-based retrieval: it works by extracting features based on distinctive properties of the multimedia objects. Those features are organized into multimedia descriptors, which are used as surrogates of the multimedia object, in such a way that the retrieval of similar objects is based solely on that higher-level representation, without the need to refer to the actual low-level encoding of the media. The descriptor can be seen as a compact and distinctive representation of multimedia content, encoding some invariant properties of the content. For example, in image retrieval the successful Scale-Invariant Feature Transform (SIFT) [8] encodes local gradient patterns around points of interest, in a way that is (partially) invariant to illumination and geometric transformations.
The descriptor framework also makes it possible to abstract the media details in multimedia retrieval systems. The operation of looking for similar multimedia documents becomes the more abstract operation of looking for multimedia descriptors with small distances between them. The notion of a "feature space" emerges: a geometry that organizes the documents so that similar ones lie close together. Of course, CMIR systems are usually much more complex than that, but looking for similar descriptors often plays a critical role in the processing chain of a complex system.
Although the operation of finding descriptors which have small distances seems simple enough,
performing it fast for multimedia descriptors is actually very challenging, due to the scale of the
collections, the dimensionality of the descriptors and the diversity of distance functions [1]. The
literature on the subject is very extensive, but in this article we focus on one of the most cited families of solutions, Locality-Sensitive Hashing (LSH) [5; 4; 3], and propose K-medoids LSH as an extension to general metric spaces.
2. LOCALITY SENSITIVE HASHING (LSH)
The LSH indexing method relies on a family of locality-sensitive hashing functions $H$ [5] to map objects from a metric domain $U$ in a $D$-dimensional space (usually $\mathbb{R}^D$) to a countable set $C$ (usually $\mathbb{Z}$), with the following property: nearby points in the high-dimensional space are hashed to the same value with high probability.
Definition 1. Given a distance function $d : U \times U \to \mathbb{R}^+$, a function family $H = \{h : U \to C\}$ is $(r, cr, p_1, p_2)$-sensitive for a given data set $S \subseteq U$ if, for any points $p, q \in S$ and any $h \in H$:
—if $d(p, q) \le r$ then $\Pr_H[h(q) = h(p)] \ge p_1$ (probability of colliding within the ball of radius $r$);
—if $d(p, q) > cr$ then $\Pr_H[h(q) = h(p)] \le p_2$ (probability of colliding outside the ball of radius $cr$);
—with $c > 1$ and $p_1 > p_2$.
The basic scheme [5] provided locality-sensitive families for the Hamming distance on Hamming spaces, and the Jaccard distance on spaces of sets. For several years those were the only families available, although extensions for $L_1$-normed (Manhattan) and $L_2$-normed (Euclidean) spaces were proposed by embedding those spaces into a Hamming space [4].
The practical success of LSH, however, came with E2LSH^1 (Euclidean LSH) [3], for $L_p$-normed spaces, where a new family of LSH functions was introduced:

\[
H = \{h_i : \mathbb{R}^D \to \mathbb{Z}\} \tag{1}
\]
\[
h_i(v) = \left\lfloor \frac{a_i \cdot v + b_i}{w} \right\rfloor \tag{2}
\]

where $a_i \in \mathbb{R}^D$ is a random vector with each coordinate picked independently from a Gaussian distribution $N(0, 1)$, $b_i$ is an offset sampled from a uniform distribution on $[0, w]$, and $w$ is a scalar defining the quantization width. Applying $h_i$ to a point or object $v$ corresponds to the composition of a projection onto a random line and a quantization operation, given by the quantization width $w$ and the floor operation.
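As a concrete illustration of Equation 2, the minimal sketch below samples a single $h_i$: a random projection followed by floor quantization. It is for intuition only; the function name and the example width $w = 4.0$ are ours, not part of the released E2LSH package.

```python
import numpy as np

def make_e2lsh_hash(dim, w, rng):
    """Sample one h_i as in Equation 2: project onto a random line
    (a_i ~ N(0,1) per coordinate), shift by b_i ~ U[0, w], quantize by w."""
    a = rng.standard_normal(dim)
    b = rng.uniform(0.0, w)
    return lambda v: int(np.floor((np.dot(a, v) + b) / w))

rng = np.random.default_rng(0)
h = make_e2lsh_hash(dim=128, w=4.0, rng=rng)   # w chosen arbitrarily here
v1 = rng.standard_normal(128)
v2 = v1 + 0.01 * rng.standard_normal(128)      # a nearby point
print(h(v1), h(v2))                            # nearby points usually share a key
```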
A function family $G$ is constructed by concatenating $M$ randomly sampled functions $h_i \in H$, such that each $g_j \in G$ has the following form: $g_j(v) = (h_1(v), \ldots, h_M(v))$. The use of multiple $h_i$ functions reduces the probability of false positives, since two objects will have the same key for $g_j$ only if their values coincide for all $h_i$ component functions. Each object $v$ from the input dataset is indexed by hashing it against $L$ hash functions $g_1, \ldots, g_L$. At the search phase, a query object $q$ is hashed using the same $L$ hash functions, and the objects stored in the corresponding buckets are used as the candidate set. Then a ranking is performed within the candidate set according to the distance to the query, and the $k$ closest objects are returned.
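The sketch below illustrates this multi-table scheme under the same assumptions as the previous one: each $g_j$ concatenates $M$ quantized random projections, $L$ tables are built over the dataset (assumed to be a NumPy matrix with one point per row), and a query is ranked against the union of colliding buckets. Names and parameters are illustrative, not E2LSH's own.

```python
import numpy as np
from collections import defaultdict

def make_g(dim, M, w, rng):
    """One g_j: M independent h_i functions (Equation 2) joined into a tuple key."""
    A = rng.standard_normal((M, dim))          # row i holds the vector a_i
    b = rng.uniform(0.0, w, size=M)            # one offset b_i per component
    return lambda v: tuple(np.floor((A @ v + b) / w).astype(int))

def build_index(data, M, L, w, seed=0):
    """Index every point of 'data' (n x dim array) in L tables keyed by g_1..g_L."""
    rng = np.random.default_rng(seed)
    gs = [make_g(data.shape[1], M, w, rng) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for idx, v in enumerate(data):
        for g, table in zip(gs, tables):
            table[g(v)].append(idx)
    return gs, tables

def e2lsh_query(q, k, data, gs, tables):
    """Collect colliding points from all L tables, then rank them by distance to q."""
    candidates = {i for g, t in zip(gs, tables) for i in t.get(g(q), [])}
    return sorted(candidates, key=lambda i: np.linalg.norm(data[i] - q))[:k]
```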
A few recent works approach the problem of designing LSH indexing schemes using only distance information: M-Index [9], DFLSH (Distribution-Free Locality-Sensitive Hashing) [6], and the method of Tellez and Chavez [13].

^1 LSH Algorithm and Implementation (E2LSH), accessed on 22/09/2013: http://www.mit.edu/~andoni/LSH/
K-means LSH. Paulevé et al. [12] present a comparison between structured (regular random projections, lattices) and unstructured (K-means and hierarchical K-means) quantizers in the task of searching high-dimensional SIFT descriptors, resulting in the proposal of a new LSH family based on the latter (Equation 3). Their results indicate that LSH functions based on unstructured quantizers perform better, as the induced Voronoi partitioning adapts to the data distribution, generating more uniform hash-cell populations than the structured LSH quantizers. However, K-means and hierarchical K-means are clustering algorithms for vector spaces, which restricts the applicability of this approach. In order to overcome this limitation, we turn to a clustering algorithm designed to work in generic metric spaces: K-medoids.
3. K-MEDOIDS LSH
We propose a novel method for locality-sensitive hashing in the metric search framework and compare it with other similar methods in the literature. Our method is rooted in the idea of partitioning the data space with a distance-based clustering algorithm (K-medoids) as an initial quantization step and building a hash table from the quantization results. K-medoids LSH follows a direct approach taken from [12]: each partition cell is a hash table bucket, and therefore the hash value of an object is the index of the partition cell it falls into.
Definition 2. Given a metric space $(U, d)$ (where $U$ is the domain set and $d$ is the distance function), a set of cluster centers $C = \{c_1, \ldots, c_k\} \subset U$ and an object $x \in U$:

\[
h_C : U \to \mathbb{N}, \qquad h_C(x) = \underset{i = 1, \ldots, k}{\arg\min}\; d(x, c_i) \tag{3}
\]
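A minimal sketch of the hash function in Equation 3: it needs nothing beyond the medoids and the distance function, which is what makes it applicable to generic metric spaces. The toy data below are ours, purely to show the mechanics.

```python
def kmedoids_hash(x, medoids, dist):
    """h_C(x): index of the medoid closest to x (Equation 3)."""
    return min(range(len(medoids)), key=lambda i: dist(x, medoids[i]))

# Toy one-dimensional metric space, just for illustration.
medoids = [2.0, 7.5, 11.0]
dist = lambda a, b: abs(a - b)
print(kmedoids_hash(6.9, medoids, dist))   # -> 1, the bucket of medoid 7.5
```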
PAM (Partitioning Around Medoids) [7] is the classic algorithm for K-medoids clustering. In that work a medoid is defined as the most centrally located object within a cluster. The algorithm initially selects a group of medoids at random (or using some heuristic), then iteratively assigns each non-medoid point to the nearest medoid and updates the medoid set, looking for the set of medoids that minimizes the sum of distances from the points to their medoids. PAM has a prohibitive computational cost, since it performs $O(k(n-k)^2)$ distance calculations per iteration. A few methods have been designed to cope with that complexity. Park and Jun [11] propose a simple approximate algorithm based on PAM with significant performance improvements: instead of searching the entire dataset for a new optimal medoid, the method restricts the search to points within the cluster. We chose this method for its simplicity and performance.
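The sketch below captures the restricted-search update described above, using random initialization for simplicity (the initialization variants are discussed later). It is our reading of the Park and Jun idea, not their reference implementation, and all names are ours.

```python
import numpy as np

def fast_kmedoids(data, k, dist, max_iter=20, seed=0):
    """K-medoids with the medoid update restricted to each cluster's own members,
    in the spirit of Park and Jun [11]. Random initial medoids for simplicity."""
    rng = np.random.default_rng(seed)
    n = len(data)
    medoids = sorted(rng.choice(n, size=k, replace=False).tolist())
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest medoid.
        labels = [min(range(k), key=lambda j: dist(data[i], data[medoids[j]]))
                  for i in range(n)]
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:                      # keep the old medoid of an empty cluster
                new_medoids.append(medoids[j])
                continue
            # Update step: search only inside the cluster for the member that
            # minimizes the total distance to the other members.
            new_medoids.append(min(members,
                key=lambda m: sum(dist(data[m], data[i]) for i in members)))
        if new_medoids == medoids:               # converged
            break
        medoids = new_medoids
    return [data[m] for m in medoids]
```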
Therefore, our baseline method consists of the following steps (a sketch of the whole pipeline is given after the list):

(1) Generate $L$ lists of $k$ cluster centers ($C_i = \{c_{i1}, \ldots, c_{ik}\}$, for all $i \in \{1, \ldots, L\}$) in $L$ runs of the Park and Jun fast K-medoids algorithm (as in Paulevé et al. [12], this clustering is done over a sample of the dataset).

(2) Index each point $x$ of the dataset using $h_{C_i}(x)$ as the bucket key for each table ($i \in \{1, \ldots, L\}$).

(3) Given a query point $q$, hash it using $h_{C_i}(q)$ ($i \in \{1, \ldots, L\}$) and retrieve all colliding points as the candidate set for the kNN^2 search. Then perform a linear scan over that candidate list.

^2 kNN: k nearest neighbors.
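A sketch of the three steps, assuming the fast_kmedoids and kmedoids_hash sketches given earlier; the sample size, seeds and names are illustrative, not the parameters used in our experiments.

```python
import numpy as np
from collections import defaultdict

def build_kmedoids_lsh(data, k, L, dist, sample_size=10000, seed=0):
    """Steps (1) and (2): cluster a sample L times, then hash every point of the
    dataset to its nearest medoid in each of the L tables."""
    rng = np.random.default_rng(seed)
    medoid_sets, tables = [], []
    for i in range(L):
        idx = rng.choice(len(data), size=min(sample_size, len(data)), replace=False)
        sample = [data[j] for j in idx]
        C_i = fast_kmedoids(sample, k, dist, seed=seed + i)
        table = defaultdict(list)
        for pos, x in enumerate(data):
            table[kmedoids_hash(x, C_i, dist)].append(pos)
        medoid_sets.append(C_i)
        tables.append(table)
    return medoid_sets, tables

def kmedoids_lsh_query(q, knn, data, medoid_sets, tables, dist):
    """Step (3): union of the colliding buckets, then a linear scan (ranking)."""
    candidates = {i for C_i, t in zip(medoid_sets, tables)
                  for i in t.get(kmedoids_hash(q, C_i, dist), [])}
    return sorted(candidates, key=lambda i: dist(q, data[i]))[:knn]
```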
Initialization. K-means clustering results exhibit a strong dependence on the initial cluster selection. K-means++ [10; 2] solves the $O(\log k)$-approximate K-means problem^3 simply by carefully designing a good initialization algorithm. That raises the question of how the initialization could affect the final result of the kNN search for the proposed methods. The K-means++ initialization method is based only on distance information and sampling, and thus can be plugged into a K-medoids algorithm without further changes (see the sketch after this paragraph). Park and Jun [11] also propose a special initialization for their fast K-medoids algorithm. We implemented both of those initializations, as well as random selection, and empirically evaluated their impact on the similarity search task.

^3 The exact solution is NP-hard.
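A minimal sketch of K-means++-style seeding using only the distance function, so it can seed the K-medoids clustering directly; this is our restatement of the standard D² sampling of [2], not the code used in our experiments.

```python
import numpy as np

def kmeanspp_seeds(data, k, dist, seed=0):
    """D^2 seeding: the first seed is uniform; each further seed is drawn with
    probability proportional to the squared distance to its nearest chosen seed."""
    rng = np.random.default_rng(seed)
    n = len(data)
    seeds = [int(rng.integers(n))]
    d2 = np.array([dist(data[i], data[seeds[0]]) ** 2 for i in range(n)])
    for _ in range(1, k):
        c = int(rng.choice(n, p=d2 / d2.sum()))      # D^2 sampling step
        seeds.append(c)
        d2 = np.minimum(d2, [dist(data[i], data[c]) ** 2 for i in range(n)])
    return seeds   # indices into data; the corresponding objects are the initial medoids
```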
3.1 Experiments and analysis
Datasets. For the initial experiments we used a dataset called APM (Arquivo Público Mineiro, the Public Archives of the state of Minas Gerais), created by applying various transformations (15 transformations of scale, rotation, etc.) to 100 antique photographs from the Public Archives. The SIFT descriptors of each transformed image were computed, resulting in a dataset of 2,871,300 feature vectors (a SIFT descriptor is a 128-dimensional vector). The query dataset is built from the SIFT descriptors of the original images (a total of 263,968 feature vectors). Each query point is equipped with its set of true nearest neighbors, the ground truth. For these experiments we used 500 points uniformly sampled from the query dataset and performed a 10-NN search.
Metrics. We are especially concerned with the relation between the recall and selectivity metrics as the parameters of the methods vary. Recall is the fraction of correct answers retrieved over the number of true relevant answers. Selectivity^4 is the fraction of the dataset selected for the shortlist processing. Selectivity is an important metric for the methods under consideration, since the shortlist processing is a bottleneck of the whole query processing [12].
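Written out (our restatement of the definitions above), for a query whose ground truth is the set $GT$ of true nearest neighbors and whose shortlist (candidate set) is $S$, drawn from a dataset of $n$ points:

\[
\mathrm{recall} = \frac{|\,\text{returned answers} \cap GT\,|}{|GT|},
\qquad
\mathrm{selectivity} = \frac{|S|}{n}.
\]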
K-means LSH, K-medoids LSH and DFLSH. In order to evaluate the feasibility of K-medoids LSH, we compared it with K-means LSH [12] and DFLSH [6]. We implemented all the methods in the same framework, namely Java and the Apache Commons Mathematics Library^5. K-means LSH is used as a baseline method, since its performance and behavior are well studied in the literature, and DFLSH is an alternative method for the same problem. Figure 1 shows recall, query time and selectivity statistics averaged over 500 queries and using a single hash function^6.
Figure 1a presents results relating recall to selectivity. Notice that high selectivity means that a large part of the dataset is processed in the final sequential ranking; that is not a desirable operating point, since the main performance bottleneck of the query processing lies at that stage (Figure 1b). For a selectivity as low as 0.3%, both methods' recall is approximately 65%, and at 1% selectivity it grows to about 80%. These results could be improved by using more than one hash function. K-means LSH clearly presents the best results on the recall metric; however, it is important to notice that both DFLSH and K-medoids LSH do not exploit any vector-coordinate properties, relying solely on distance information.
Figure 1b depicts a strong linear correlation between query time and selectivity. Another noticeably strong correlation appears between selectivity and the number of cluster centers (Figure 1c). Theoretically, it is possible to show that, given an approximately uniform population of points in the hash buckets, the selectivity for $t$ cluster centers is $O(t^{-1})$. The plot shows that the experimental data follow a power-law curve.
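The argument, spelled out: with $t$ cluster centers and roughly uniform bucket populations, each bucket holds about $n/t$ of the $n$ points, so a single-table query scans

\[
\mathrm{selectivity} \approx \frac{n/t}{n} = \frac{1}{t} = O(t^{-1}),
\]

which is consistent with the power-law decay observed in Figure 1c.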
Using DFLSH as a baseline, we plot the difference in recall from K-means LSH and K-medoids LSH to DFLSH (Figure 1d). The recall difference can be up to +0.07 (11%) for K-means LSH and +0.03 (5%) for K-medoids LSH. The trend in the differences is sustained along the curves for different values of selectivity. There is a clear indication that K-medoids LSH is a viable approach, with performance equivalent to K-means LSH and DFLSH.

Fig. 1: Comparison of K-means LSH, K-medoids LSH and DFLSH for 10-NN search: (a) recall vs. selectivity (log scale); (b) query time (ms) vs. selectivity; (c) selectivity vs. number of cluster centers; (d) difference of recall vs. selectivity (log scale).

^4 The term selectivity has distinct uses in the database and image retrieval research communities; here we adopt the latter, following Paulevé et al.
^5 Commons Math: The Apache Commons Mathematics Library, accessed on 22/09/2013. http://commons.apache.org/proper/commons-math/
^6 Including more than one hash function would have approximately the same effect on the recall and querying time metrics for all methods; nevertheless, the preprocessing time for K-means LSH and K-medoids LSH (which involves extra rounds of the optimization procedure) would increase much more than for DFLSH (which only needs to sample more points).
Initialization effect. First of all, it is important to notice that the initial segments of the curves in Figure 2 are not very informative: those points correspond to numbers of clusters up to 20, which implies an almost brute-force search over the dataset (notice the explosion of the selectivity in Figure 1c). Figure 2 depicts the difference in the recall metric for the two initialization algorithms, using random selection as a baseline. The K-means++ initialization affects the results positively, with an average gain of 3%. On the other hand, the Park and Jun initialization does not present a significant gain over the random baseline (in fact, the average gain is negative). Further experiments with more datasets are needed to achieve statistical confidence in these results, but the initial analysis indicates that the K-means++ initialization can contribute to better recall, while the Park and Jun method has a null (or negative) effect.
4. CONCLUSION
Efficient large-scale similarity search is a crucial operation for Content-based Multimedia Information Retrieval (CMIR) systems. But because those systems employ high-dimensional feature vectors, or other complex representations in metric spaces, providing fast similarity search for them has been a persistent research challenge. LSH, a very successful family of methods, has been advanced as a solution to the problem, but it is available only for a few distance functions. In this article we propose to address that limitation by extending LSH to general metric spaces, using K-medoids clustering as the basis for an LSH family of functions. Our experiments show that K-medoids LSH improves the results over the random sample choice of DFLSH, while keeping the advantage of relying only on distance information. As expected, K-medoids LSH performs slightly worse than K-means LSH, but it is important to note that K-means relies heavily on the vector-space structure, and many data types of interest do not offer such structure.

Fig. 2: Effect of the initialization procedure for the K-medoids clustering on the quality of the nearest-neighbor search.
REFERENCES

[1] Fernando Akune, Eduardo Valle, and Ricardo Torres. MONORAIL: A disk-friendly index for huge descriptor databases. In 20th International Conference on Pattern Recognition (ICPR), pages 4145–4148. IEEE, August 2010.
[2] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pages 1027–1035. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.
[3] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry (SCG '04), page 253. ACM Press, New York, NY, USA, 2004.
[4] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), pages 518–529. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
[5] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC '98), pages 604–613. ACM Press, New York, NY, USA, 1998.
[6] Byungkon Kang and Kyomin Jung. Robust and efficient locality sensitive hashing for nearest neighbor search in large data sets. In NIPS Workshop on Big Learning (BigLearn), pages 1–8, Lake Tahoe, Nevada, 2012.
[7] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, March 1990.
[8] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.
[9] David Novak, Martin Kyselak, and Pavel Zezula. On locality-sensitive indexing in generic metric spaces. In Proceedings of the Third International Conference on SImilarity Search and APplications (SISAP '10), page 59, 2010.
[10] Rafail Ostrovsky, Yuval Rabani, Leonard Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), pages 165–176. IEEE, December 2006.
[11] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2):3336–3341, 2009.
[12] Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 31(11):1348–1358, August 2010.
[13] Eric Sadit Tellez and Edgar Chavez. On locality sensitive hashing in metric spaces. In Proceedings of the Third International Conference on SImilarity Search and APplications (SISAP '10), pages 67–74. ACM, New York, NY, USA, 2010.