
K-medoids LSH: a new locality sensitive hashing in general metric space

Eliezer S. Silva, Eduardo Valle

RECOD Lab – DCA/FEEC, University of Campinas, Brazil

{eliezers, dovalle}@dca.fee.unicamp.br

Abstract. The increasing availability of multimedia content poses a challenge for information retrieval researchers. Users want not only to have access to multimedia documents, but also to make sense of them – the ability of finding specific content in extremely large collections of textual and non-textual documents is paramount. At such large scales, Multimedia Information Retrieval systems must rely on the ability to perform search by similarity efficiently. However, multimedia documents are often represented by high-dimensional feature vectors, or by other complex representations in metric spaces. Providing efficient similarity search for that kind of data is extremely challenging. In this article, we explore one of the most cited families of solutions for similarity search, Locality-Sensitive Hashing (LSH), which is based upon the creation of hashing functions which assign, with higher probability, the same key to data that are similar. LSH is available only for a handful of distance functions, but, where available, it has been found to be extremely efficient for architectures with uniform access cost to the data. Most existing LSH functions are restricted to vector spaces. We propose a novel LSH method for generic metric spaces based on K-medoids clustering. We present a comparison with well-established LSH methods in vector spaces and with recent competing methods for metric spaces. Our early results show promise, but also demonstrate how challenging the task is.

Categories and Subject Descriptors: H.2 [Database Management]: Miscellaneous; H.3.1 [Content Analysis and Indexing]: Indexing methods; H.3.3 [Information Search and Retrieval]: Miscellaneous

Keywords: hashing, metric space indexing, nearest neighbor search, similarity search

1. INTRODUCTION

Content-based Multimedia Information Retrieval (CMIR) is an alternative to keyword-based or tag-based retrieval, which works by extracting features based on distinctive properties of the multimedia objects. Those features are organized in multimedia descriptors, which are used as surrogates of the multimedia object, in such a way that the retrieval of similar objects is based solely on that higher-level representation, without the need to refer to the actual low-level encoding of the media. The descriptor can be seen as a compact and distinctive representation of multimedia content, encoding some invariant properties of the content. For example, in image retrieval the successful Scale Invariant Feature Transform (SIFT) [8] encodes local gradient patterns around Points-of-Interest, in a way that is (partially) invariant to illumination and geometric transformations.

The descriptor framework also makes it possible to abstract away the media details in multimedia retrieval systems. The operation of looking for similar multimedia documents becomes the more abstract operation of looking for multimedia descriptors which have small distances between them. There emerges the notion of a "feature space", which organizes the documents in a geometry, placing close together those that are similar. Of course, CMIR systems are usually much more complex than that, but, nevertheless, looking for similar descriptors often plays a critical role in the processing chain of a complex system.

Although the operation of finding descriptors which have small distances seems simple enough, performing it fast for multimedia descriptors is actually very challenging, due to the scale of the collections, the dimensionality of the descriptors, and the diversity of distance functions [1]. The literature on the subject is very extensive, but in this article we focus on one of the most cited families of solutions, Locality-Sensitive Hashing (LSH) [5; 4; 3], proposing K-medoids LSH as an extension to metric spaces.

Copyright © 2012 this is a preprint. Contact author for copyright issues.

Brazilian Symposium on Databases - preprint, 2013, Pages 1–6.

2. LOCALITY SENSITIVE HASHING (LSH)

The LSH indexing method relies on a family of locality-sensitive hash functions H [5] to map objects from a metric domain U in a D-dimensional space (usually R^D) to a countable set C (usually Z), with the following property: nearby points in the high-dimensional space are hashed to the same value with high probability.

Definition 1. Given a distance function d : U × U → R+, a function family H = {h : U → C} is (r, cr, p_1, p_2)-sensitive for a given data set S ⊆ U if, for any points p, q ∈ S, h ∈ H:

—If d(p, q) ≤ r then Pr_H[h(q) = h(p)] ≥ p_1 (probability of colliding within the ball of radius r);

—If d(p, q) > cr then Pr_H[h(q) = h(p)] ≤ p_2 (probability of colliding outside the ball of radius cr);

—c > 1 and p_1 > p_2.

The basic scheme [5] provided locality-sensitive families for the Hamming distance on Hamming spaces, and the Jaccard distance on spaces of sets. For several years those were the only families available, although extensions for L_1-normed (Manhattan) and L_2-normed (Euclidean) spaces were proposed by embedding those spaces into a Hamming space [4].

The practical success of LSH, however, came with E2LSH¹ (Euclidean LSH) [3], for L_p-normed spaces, where a new family of LSH functions was introduced:

H = {h_i : R^D → Z}    (1)

h_i(v) = ⌊(a_i · v + b_i) / w⌋    (2)

a_i ∈ R^D is a random vector with each coordinate picked independently from a Gaussian distribution N(0, 1), b_i is an offset value sampled from a uniform distribution in the range [0, w], and w is a scalar for the quantization width. Applying h_i to a point or object v corresponds to the composition of a projection onto a random line and a quantization operation, given by the quantization width w and the floor operation.
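As an illustration, Equation 2 can be sketched in a few lines of Python (the function name `make_e2lsh_hash` and the parameter defaults are ours, not part of E2LSH):

```python
import numpy as np

def make_e2lsh_hash(dim, w, seed=0):
    """Sample one function h_i from the E2LSH family (Equation 2):
    a random Gaussian projection followed by quantization of width w."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(dim)   # a_i: coordinates drawn i.i.d. from N(0, 1)
    b = rng.uniform(0.0, w)        # offset b_i drawn uniformly from [0, w]
    return lambda v: int(np.floor((a @ v + b) / w))

h = make_e2lsh_hash(dim=128, w=4.0)
v = np.zeros(128)
print(h(v) == h(v + 0.01))  # nearby points collide with high probability
```

Larger w makes collisions more likely at every distance; the (r, cr, p_1, p_2) trade-off of Definition 1 is tuned through it.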

A function family G is constructed by concatenating M randomly sampled functions h_i ∈ H, such that each g_j ∈ G has the form g_j(v) = (h_1(v), ..., h_M(v)). The use of multiple h_i functions reduces the probability of false positives, since two objects will have the same key for g_j only if their values coincide for all h_i component functions. Each object v from the input dataset is indexed by hashing it against L hash functions g_1, . . . , g_L. At the search phase, a query object q is hashed using the same L hash functions, and the objects stored in the corresponding buckets are used as the candidate set. Then a ranking is performed among the candidate set according to distance to the query, and the k closest objects are returned.
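The index-then-rank scheme above can be sketched as follows (a minimal Python sketch; the class and parameter names are ours, not from any E2LSH release):

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """L hash tables, each keyed by g_j(v) = (h_1(v), ..., h_M(v))."""
    def __init__(self, dim, M, L, w, seed=0):
        rng = np.random.default_rng(seed)
        # For each of the L tables, sample M (a_i, b_i) pairs.
        self.A = rng.standard_normal((L, M, dim))
        self.B = rng.uniform(0.0, w, size=(L, M))
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, j, v):
        # g_j(v): M quantized projections, used together as one bucket key.
        return tuple(np.floor((self.A[j] @ v + self.B[j]) / self.w).astype(int))

    def index(self, v):
        i = len(self.points)
        self.points.append(np.asarray(v, dtype=float))
        for j, table in enumerate(self.tables):
            table[self._key(j, v)].append(i)

    def query(self, q, k):
        # Union of the L colliding buckets forms the candidate set...
        cand = {i for j, t in enumerate(self.tables)
                for i in t.get(self._key(j, q), [])}
        # ...which is then ranked by true distance to the query.
        return sorted(cand, key=lambda i: np.linalg.norm(self.points[i] - q))[:k]
```

Raising M shrinks each bucket (fewer false positives); raising L adds tables to recover recall.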

A few recent works approach the problem of designing LSH indexing schemes using only the distance information: M-Index [9], DFLSH (Distribution-Free Locality-Sensitive Hashing) [6], and [13].

¹LSH Algorithm and Implementation (E2LSH), accessed on 22/09/2013. http://www.mit.edu/~andoni/LSH/


K-means LSH. [12] present a comparison between structured (regular random projections, lattices) and unstructured (K-means and hierarchical K-means) quantizers in the task of searching high-dimensional SIFT descriptors, resulting in the proposal of a new LSH family based on the latter (Equation 3). Results indicate that the LSH functions based on unstructured quantizers perform better, as the induced Voronoi partitioning adapts to the data distribution, generating more uniformly populated hash cells than the structured LSH quantizers. However, K-means and hierarchical K-means are clustering algorithms for vector spaces, restricting the application of this approach. In order to overcome this limitation we turn to a clustering algorithm designed to work in generic metric spaces – K-medoids clustering.

3. K-MEDOIDS LSH

We propose a novel method for locality-sensitive hashing in the metric search framework and compare it with other similar methods in the literature. Our method is rooted in the idea of partitioning the data space with a distance-based clustering algorithm (K-medoids) as an initial quantization step and constructing a hash table with the quantization results. K-medoids LSH follows a direct approach taken from [12]: each partition cell is a hash table bucket, so the hash value is the index number of the partition cell.

Definition 2. Given a metric space (U, d) (U is the domain set and d is the distance function), the set of cluster centers C = {c_1, . . . , c_k} ⊂ U and an object x ∈ U:

h_C : U → N

h_C(x) = argmin_{i=1,...,k} {d(x, c_1), . . . , d(x, c_i), . . . , d(x, c_k)}    (3)
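Equation 3 needs nothing beyond the distance function, which is what makes it applicable to generic metric spaces. A minimal Python sketch (the function name and the toy metric are ours):

```python
def kmedoids_hash(x, centers, d):
    """h_C(x) from Equation 3: the index of the nearest cluster center,
    computed using only the distance function d."""
    return min(range(len(centers)), key=lambda i: d(x, centers[i]))

# Any metric works; here a toy metric on numbers stands in for a real one.
d = lambda a, b: abs(a - b)
centers = [0.0, 10.0, 20.0]
print(kmedoids_hash(14.0, centers, d))  # -> 1 (nearest center is 10.0)
```

The same call works unchanged for strings under edit distance, sets under Jaccard distance, and so on, since only d is evaluated.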

PAM (Partitioning Around Medoids) [7] is the classic algorithm for K-medoids clustering. In that work a medoid is defined as the most centrally located object within a cluster. The algorithm initially selects a group of medoids at random (or using some heuristic), then iteratively assigns each non-medoid point to the nearest medoid and updates the medoid set, looking for the optimal set of medoid points minimizing the quadratic sum of distances. PAM has a prohibitive computational cost, since it performs O(k(n − k)²) distance calculations per iteration. There are a few methods designed to cope with that complexity constraint. Park and Jun [11] propose a simple approximate algorithm based on PAM with significant performance improvements: instead of searching the entire dataset for a new optimal medoid, the method restricts the search to points within the cluster. We chose to apply this method for its simplicity and performance.
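The restricted medoid update can be sketched as below. This is our own simplified rendering of a Park-and-Jun-style iteration, not their exact algorithm (in particular, their special initialization and stopping criterion are omitted):

```python
import random

def fast_kmedoids(points, k, d, iters=10, seed=0):
    """Assign each point to its nearest medoid, then search for a better
    medoid only *within* each cluster rather than over the whole dataset."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(iters):
        # Assignment step: nearest medoid for every point.
        clusters = [[] for _ in range(k)]
        for i, x in enumerate(points):
            m = min(range(k), key=lambda m: d(x, points[medoids[m]]))
            clusters[m].append(i)
        # Update step: within each cluster, pick the member minimizing
        # the sum of distances to the other cluster members.
        new = [min(c, key=lambda i: sum(d(points[i], points[j]) for j in c))
               if c else medoids[m] for m, c in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids
```

Each update step costs O(sum of squared cluster sizes) distance evaluations instead of PAM's O(k(n − k)²), which is what makes the method practical at our scales.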

Therefore, our baseline method consists of the following steps:

(1) Generate L lists of k cluster centers (C_i = {c_i1, . . . , c_ik}, ∀i ∈ {1, . . . , L}) in L runs of the Park and Jun fast K-medoids algorithm (as in Paulevé et al. [12], this clustering is done over a sampling of the dataset).

(2) Index each point x of the dataset using h_{C_i}(x) as the bucket key for each table (i ∈ {1, . . . , L}).

(3) Given a query point q, hash it using h_{C_i}(q) (i ∈ {1, . . . , L}) and retrieve all colliding points as the candidate set for the kNN² search. Then perform a linear scan over that candidate list.

Initialization. K-means clustering results exhibit a strong dependence on the initial cluster selection. K-means++ [10; 2] solves the O(log k)-approximate K-means problem³ simply by carefully designing a good initialization algorithm. That raises the question of how the initialization could affect the final result of the kNN search for the proposed methods. The K-means++ initialization method is based on metric distance information and sampling, and thus can be plugged into a K-medoids algorithm without further changes. Park and Jun [11] also propose a special initialization for their fast K-medoids algorithm. We implemented both of those initializations, as well as random selection, and empirically evaluated their impact on the similarity search task.

²kNN - k nearest neighbors

³the exact solution is NP-Hard
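The K-means++ seeding rule indeed needs only distances, as the following Python sketch shows (function name is ours; this follows the D² sampling rule of [2] under that assumption):

```python
import random

def kmeanspp_seeds(points, k, d, seed=0):
    """K-means++-style seeding using only the distance function d: each
    new center is drawn with probability proportional to the squared
    distance to the nearest center already chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2 for every point, relative to the centers picked so far.
        dist2 = [min(d(x, c) for c in centers) ** 2 for x in points]
        if sum(dist2) == 0:  # every point coincides with some center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=dist2, k=1)[0])
    return centers
```

Since no coordinates are touched, the same seeding drops into the K-medoids pipeline of Section 3 unchanged.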

3.1 Experiments and analysis

Datasets. For the initial experiments we resorted to a dataset called APM (Arquivo Público Mineiro – The Public Archives in Minas Gerais), which was created by applying various transformations (15 transformations of scale, rotation, etc.) to 100 antique photographs of the Public Archives. The SIFT descriptors of each transformed image were calculated, resulting in a dataset of 2,871,300 feature vectors (the SIFT descriptor is a 128-dimensional vector). The query dataset is built from the SIFT descriptors of the original images (a total of 263,968 feature vectors). Each query point is equipped with its set of true nearest neighbors – the ground truth. For these experiments we used 500 points uniformly sampled from the query dataset and performed a 10-NN search.

Metrics. We are especially concerned with the relation between the recall and selectivity metrics under variation of the methods' parameters. The recall metric is the fraction of correct answers retrieved over the number of true relevant answers. The selectivity⁴ metric is the fraction of the dataset selected for the shortlist processing. Selectivity is an important metric for the methods being considered, since the shortlist processing is a bottleneck in the whole query processing [12].
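Concretely, the two metrics reduce to simple ratios (a Python sketch with our own function names):

```python
def recall(retrieved, ground_truth):
    """Fraction of the true nearest neighbors found in the answer."""
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)

def selectivity(candidate_set_size, dataset_size):
    """Fraction of the dataset scanned in the shortlist step."""
    return candidate_set_size / dataset_size

print(recall([1, 2, 3, 9], [1, 2, 3, 4]))  # -> 0.75 (3 of the 4 true NNs)
print(selectivity(3000, 1_000_000))        # -> 0.003
```

A good operating point combines high recall with low selectivity, since the linear scan over the candidate set dominates query time.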

K-means LSH, K-medoids LSH and DFLSH. In order to evaluate the feasibility of K-medoids LSH, we compared it with the K-means LSH of Paulevé et al. [12] and with DFLSH [6]. We implemented all the methods in the same framework, namely Java and The Apache Commons™ Mathematics Library⁵. K-means LSH is used as a baseline method, since its performance and behavior are well studied in the literature, and DFLSH is an alternative method for the same problem. Figure 1 shows recall, query time and selectivity statistics averaged over 500 queries and using a single hash function⁶.

Figure 1a presents results relating recall to selectivity. Notice that a high selectivity means that a large part of the dataset is processed in the final sequential ranking – that is not a desirable operating point, since the main performance bottleneck in the query processing time is located at that stage (Figure 1b). For a selectivity as low as 0.3%, both methods' recall is approximately 65%, and with 1% selectivity it grows up to 80%. These results could be improved by using more than one hash function. Clearly K-means LSH presents the best result on the recall metric; however, it is important to notice that DFLSH and K-medoids LSH do not exploit any vector-coordinate properties, relying solely on distance information.

Figure 1b depicts a strong linear correlation between query time and selectivity. Another noticeably strong correlation appears between selectivity and the number of cluster centers (Figure 1c). Theoretically, it is possible to show that, given an approximately uniform population of points in the hash buckets, the selectivity for t cluster centers is O(t⁻¹). The plot shows that the experimental data follows a power-law curve.

Using DFLSH as a baseline, we plot the difference in recall from K-means LSH and K-medoids LSH to DFLSH (Figure 1d). The recall difference can be up to 0.07 (11%) in favor of K-means LSH and 0.03 (5%) for K-medoids LSH. The trend in the differences is sustained along the curves for different values of selectivity. There is a clear indication that K-medoids LSH is a viable approach, with performance equivalent to K-means LSH and DFLSH.

Fig. 1: Comparison of K-means LSH, K-medoids LSH and DFLSH for 10-NN. (a) Recall × Selectivity (log scale); (b) Query time (ms) × Selectivity; (c) Selectivity × Number of cluster centers; (d) Difference of Recall × Selectivity (log scale).

⁴the term selectivity has distinct uses in the database and image retrieval research communities. Here we adopt the latter, following Paulevé et al.

⁵Commons Math: The Apache Commons Mathematics Library, accessed on 22/09/2013. http://commons.apache.org/proper/commons-math/

⁶Including more than one hash function would have approximately the same effect on the recall and query-time metrics for all methods; nevertheless, the preprocessing time for K-means LSH and K-medoids LSH (involving extra rounds of the optimization procedure) would increase much more than for DFLSH (which only samples more points).

Initialization effect. First of all, it is important to notice that the initial segments of the curves in Figure 2 are not very informative – those points correspond to numbers of clusters up to 20, implying an almost brute-force search over the dataset (notice the explosion in selectivity in Figure 1c). Figure 2 depicts the difference in the recall metric for the two initialization algorithms, using random selection as a baseline. The K-means++ initialization affects the results positively, with an average gain of 3%. On the other hand, the Park and Jun initialization does not present great gains over the random baseline (in fact, the average gain is negative). Further experiments with more datasets are needed in order to achieve statistical confidence in these results, but the initial analysis indicates that the K-means++ initialization can contribute to better results in the recall metric, while the Park and Jun method has a null (or negative) effect.

4. CONCLUSION

Efficient large-scale similarity search is a crucial operation for Content-based Multimedia Information Retrieval (CMIR) systems. But because those systems employ high-dimensional feature vectors, or other complex representations in metric spaces, providing fast similarity search for them has been a persistent research challenge. LSH, a very successful family of methods, has been advanced as a solution to the problem, but it is available only for a few distance functions. In this article we propose to address that limitation by extending LSH to general metric spaces, using K-medoids clustering as the basis for an LSH family of functions. We show in our experiments that K-medoids LSH improves the results over the random sample choice of DFLSH, while keeping the advantage of relying only on distance information. As expected, K-medoids LSH performance is slightly worse than that of K-means LSH, but it is important to note that K-means relies heavily on the vector-space structure, and many data types of interest do not offer such structure.

Fig. 2: Effect of the initialization procedure for the K-medoids clustering on the quality of the nearest-neighbors search.

REFERENCES

[1] Fernando Akune, Eduardo Valle, and Ricardo Torres. MONORAIL: A Disk-Friendly Index for Huge Descriptor Databases. In 2010 20th International Conference on Pattern Recognition, pages 4145–4148. IEEE, August 2010.

[2] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[3] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, SCG '04, page 253, New York, NY, USA, 2004. ACM Press.

[4] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB '99, pages 518–529, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[5] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC '98, pages 604–613, New York, NY, USA, 1998. ACM Press.

[6] Byungkon Kang and Kyomin Jung. Robust and Efficient Locality Sensitive Hashing for Nearest Neighbor Search in Large Data Sets. In NIPS Workshop on Big Learning (BigLearn), pages 1–8, Lake Tahoe, Nevada, 2012.

[7] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 9th edition, March 1990.

[8] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.

[9] David Novak, Martin Kyselak, and Pavel Zezula. On locality-sensitive indexing in generic metric spaces. In Proceedings of the Third International Conference on SImilarity Search and APplications, SISAP '10, page 59, 2010.

[10] Rafail Ostrovsky, Yuval Rabani, Leonard Schulman, and Chaitanya Swamy. The Effectiveness of Lloyd-Type Methods for the k-Means Problem. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 165–176. IEEE, December 2006.

[11] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2):3336–3341, 2009.

[12] Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 31(11):1348–1358, August 2010.

[13] Eric Sadit Tellez and Edgar Chavez. On locality sensitive hashing in metric spaces. In Proceedings of the Third International Conference on SImilarity Search and APplications, SISAP '10, pages 67–74, New York, NY, USA, 2010. ACM.
