Eliezer de Souza da Silva
“Metric space indexing for nearest neighbor search in
multimedia context”
Indexação de espaços métricos para busca de vizinho mais próximo em contexto multimídia
CAMPINAS
2014
University of Campinas
School of Electrical and Computer Engineering
Universidade Estadual de Campinas
Faculdade de Engenharia Elétrica e de Computação
Eliezer de Souza da Silva
“Metric space indexing for nearest neighbor search in
multimedia context”
Supervisor:
Orientador(a): Prof. Dr. Eduardo Valle
“Indexação de espaços métricos para busca de vizinho mais próximo em contexto multimídia”
MSc Dissertation presented to the Post-Graduate Program of the School of Electrical and Computer Engineering of the University of Campinas to obtain an MSc degree in Electrical Engineering, in the concentration area of Computer Engineering.
Dissertação de Mestrado apresentada ao Programa de Pós-Graduação em Engenharia Elétrica da Faculdade de Engenharia Elétrica e de Computação da Universidade Estadual de Campinas para obtenção do título de Mestrado em Engenharia Elétrica na área de concentração Engenharia de Computação.
THIS VOLUME CORRESPONDS TO THE FINAL VERSION OF THE DISSERTATION DEFENDED BY ELIEZER DE SOUZA DA SILVA, UNDER THE SUPERVISION OF PROF. DR. EDUARDO VALLE.
Este exemplar corresponde à versão final da Dissertação defendida por Eliezer de Souza da Silva, sob orientação de Prof. Dr. Eduardo Valle.
Supervisor’s signature / Assinatura do Orientador(a)
CAMPINAS
2014
Ficha catalográfica
Universidade Estadual de Campinas
Biblioteca da Área de Engenharia e Arquitetura
Rose Meire da Silva - CRB 8/5974
Silva, Eliezer de Souza da, 1988-
Si38m   Metric space indexing for nearest neighbor search in multimedia context / Eliezer de Souza da Silva. – Campinas, SP : [s.n.], 2014.
        Orientador: Eduardo Alves do Valle Junior.
        Dissertação (mestrado) – Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.
        1. Método k-vizinho mais próximo. 2. Hashing (Computação). 3. Estruturas de dados (Computação). I. Valle Junior, Eduardo Alves do. II. Universidade Estadual de Campinas. Faculdade de Engenharia Elétrica e de Computação. III. Título.
Informações para Biblioteca Digital
Título em outro idioma: Indexação de espaços métricos para busca de vizinho mais próximo
em contexto multimídia
Palavras-chave em inglês:
k-nearest neighbor
Hashing (Computer science)
Data structures (Computer)
Área de concentração: Engenharia de Computação
Titulação: Mestre em Engenharia Elétrica
Banca examinadora:
Eduardo Alves do Valle Junior [Orientador]
Agma Juci Machado Traina
Romis Ribeiro de Faissol Attux
Data de defesa: 26-08-2014
Programa de Pós-Graduação: Engenharia Elétrica
Abstract
The increasing availability of multimedia content poses a challenge for information retrieval researchers. Users want not only to have access to multimedia documents, but also to make sense of them — the ability to find specific content in extremely large collections of textual and non-textual documents is paramount. At such large scales, Multimedia Information Retrieval systems must rely on the ability to perform search by similarity efficiently. However, multimedia documents are often represented by high-dimensional feature vectors, or by other complex representations in metric spaces. Providing efficient similarity search for that kind of data is extremely challenging. In this project, we explore one of the most cited families of solutions for similarity search, Locality-Sensitive Hashing (LSH), which is based upon the creation of hashing functions that assign, with higher probability, the same key to data that are similar. LSH is available only for a handful of distance functions, but, where available, it has been found to be extremely efficient for architectures with uniform access cost to the data. Most existing LSH functions are restricted to vector spaces. We propose two novel LSH methods (VoronoiLSH and VoronoiPlex LSH) for generic metric spaces based on metric hyperplane partitioning (random centroids and K-medoids). We present a comparison with well-established LSH methods in vector spaces and with recent competing new methods for metric spaces. We develop a theoretical probabilistic modeling of the behavior of the proposed algorithms and show some relations and bounds for the probability of hash collision. Among the algorithms proposed for generalizing LSH to metric spaces, this theoretical development is new. Although the problem is very challenging, our results demonstrate that it can be successfully tackled. This dissertation presents the development of the methods, along with theoretical and experimental discussion and reasoning about their performance.
Keywords: Similarity Search; Nearest-Neighbor Search; Locality-Sensitive Hashing; Quantization; Metric Space Indexing; Geometric Data Structures; Content-Based Multimedia Information Retrieval.
Resumo
A crescente disponibilidade de conteúdo multimídia é um desafio para a pesquisa em Recuperação de Informação. Usuários querem não apenas ter acesso aos documentos multimídia, mas também obter semântica destes documentos, de modo que a capacidade de encontrar um conteúdo específico em grandes coleções de documentos textuais e não textuais é fundamental. Nessas grandes escalas, sistemas de recuperação de informação multimídia devem contar com a capacidade de executar a busca por semelhança de forma eficiente. No entanto, documentos multimídia são muitas vezes representados por descritores multimídia na forma de vetores de alta dimensionalidade, ou por outras representações complexas em espaços métricos. Fornecer a possibilidade de uma busca por similaridade eficiente para esse tipo de dados é extremamente desafiador. Neste projeto, vamos explorar uma das famílias mais citadas de soluções para a busca de similaridade, o Hashing Sensível à Localidade (LSH – Locality-Sensitive Hashing em inglês), que se baseia na criação de funções de hash que atribuem, com maior probabilidade, a mesma chave para os dados que são semelhantes. O LSH está disponível apenas para um punhado de funções de distância, mas, quando disponível, verificou-se ser extremamente eficiente para arquiteturas com custo de acesso uniforme aos dados. A maioria das funções LSH existentes são restritas a espaços vetoriais. Propomos dois métodos novos para o LSH, generalizando-o para espaços métricos quaisquer utilizando particionamento métrico (centróides aleatórios e k-medoids). Apresentamos uma comparação com os métodos LSH bem estabelecidos em espaços vetoriais e com os novos métodos concorrentes para espaços métricos. Desenvolvemos uma modelagem teórica do comportamento probabilístico dos algoritmos propostos e demonstramos algumas relações e limitantes para a probabilidade de colisão de hash. Dentre os algoritmos propostos para generalizar o LSH para espaços métricos, esse desenvolvimento teórico é novo. Embora o problema seja muito desafiador, nossos resultados demonstram que ele pode ser atacado com sucesso. Esta dissertação apresentará os desenvolvimentos do método, a formulação teórica e a discussão experimental dos métodos propostos.
Palavras-chave: Busca de Similaridade; Busca de Vizinho Mais Próximo; Hashing Sensível à Localidade; Quantização; Indexação de Espaços Métricos; Estruturas de Dados Geométricas; Recuperação de Informação Multimídia.
Contents
Abstract
Resumo
Dedication
Acknowledgements
1 Introduction
1.1 Defining the Problem
1.2 Applications
1.3 Our Approach and Contributions
1.3.1 Publications
2 Theoretical Background and Literature Review
2.1 Geometric Notions
2.1.1 Vector and Metric Space
2.1.2 Distances
2.1.3 Curse of Dimensionality
2.1.4 Embeddings
2.2 Indexing Metric Data
2.3 Locality-Sensitive Hashing
2.3.1 Basic Method in Hamming Space
2.3.2 Extensions in Vector Spaces
2.3.3 Structured or Unstructured Quantization?
2.3.4 Extensions in General Metric Spaces
2.4 Final Remarks
3 Towards a Locality-Sensitive Hashing in General Metric Spaces
3.1 Voronoi LSH
3.1.1 Basic Intuition
3.1.2 Algorithms
3.1.3 Cost Models and Complexity Aspects
3.1.4 Hashing Probabilities Bounds
3.2 VoronoiPlex LSH
3.2.1 Basic Intuition
3.2.2 Algorithms
3.2.3 Theoretical Characterization
3.3 Parallel Voronoi LSH
3.4 Final Remarks
4 Experiments
4.1 Datasets
4.2 Techniques Evaluated
4.3 Evaluation Metrics
4.4 Results
4.4.1 APM Dataset: Comparison with K-Means LSH
4.4.2 Dictionary Dataset: Comparison of Voronoi LSH and BPI-LSH
4.4.3 Listeria Dataset: DFLSH and VoronoiPlex LSH
4.4.4 BigANN Dataset: Large-Scale Distributed Voronoi LSH
5 Conclusion
5.1 Open Questions and Future Work
5.2 Concluding Remarks
Bibliography
A Implementation
In the memory of what we yet have to see
Acknowledgements
I would like to thank all the support I have received during these years of study. The support we need and receive is multivariate and changes over time, and I must acknowledge that I would never have come this far without all the help and love I have received over these years.
I thank my supervisor, Prof. Dr. Eduardo Valle, for the patience, the trust and the lessons taught
about a wide range of aspects in academic and professional life.
I thank and appreciate all the love my family has given me. Thank you dad (Jose), mom (Marta) and little brother (Elienai). I thank my love, Juliana Vita. You have endured the hard times with me and always given me encouragement to keep walking on the right track.
I thank this wide extended family of good friends and colleagues that I have around the world.
A special thanks to friends from Unicamp (especially people at APOGEEU, LCA, IEEE-CIS):
Alan Godoy, David Kurka, Rafael Figueiredo, Raul Rosa, Paul Hidalgo, Micael Carvalho, Carlos
Azevedo, Michel Fornacielli, Roberto Medeiros, Ricardo Righetto, and many others. I thank my
friends from ABU-Campinas, residents at Casa Douglas and visitors: Lois McKinney Douglas,
Marcos Bacon, Jonas Kienitz, Tiago Bember, Carla Bueno, Esther Alves, Daniel Franzolin, Pedro
Ivo, Maria (Mauê), Edu Oliveira, Nathan Maciel, Micael Poget, Gabriel Dolara, Patrick Timmer,
Paulo Castro, Fernanda Longo, and many others. I thank you all for the warm welcome in Campinas, for the friendly environment, and for the good and bad times spent together. It was really nice to share some of my time in Campinas with you all.
I thank the University of Campinas and the School of Electrical and Computer Engineering
for being a high-level educational and research environment. A special thanks to Prof. Dr. George Teodoro and Thiago Teixeira for the opportunity of excellent collaborative work. A special thanks to CAPES for the financial support.
I thank God for guiding me and being the lenses through which I see everything.
List of Figures
1.1 Schematics of a CBMIR system
2.1 Illustration of a hypothetical situation where the distance distribution gets more concentrated as the dimensionality increases, and the filtered portion of the dataset using a distance bound $r_0 \leq d(p, q) \leq r_1$. If in a low-dimensional space this bound implied searching over 10% of the dataset, in a higher-dimensional space this bound could lead to approximately 50% of the dataset being evaluated.
2.2 Domination mechanism and range queries: which range queries can be answered using the upper and lower bounding distance functions?
2.3 LSH and $(R, c)$-NN: the probability of reporting points inside the closed ball $B_X(q, r)$ as R-near neighbors of $q$ is high; the probability of reporting points outside the closed ball $B_X(q, cr)$ as cr-near neighbors of $q$ is lower. Points in the middle may be wrongly reported as R-near with some probability.
2.4 E2LSH: projection on a random line and quantization
3.1 Each hash table of Voronoi LSH employs a hash function induced by a Voronoi diagram over the data space. Differences between the diagrams due to the data sample used and the initialization seeds employed allow for diversity among the hash functions.
3.2 Voronoi LSH: a toy example with points $C = \{c_1, c_2, c_3\}$ as centroids, query point $p$, relevant nearest neighbor $q$, irrelevant nearest neighbor $p'$, radii of search $r$ and $cr$, and random variables $Z_q$, $Z_p$ and $Z_{p'}$.
3.3 Closed balls centered at points $q$ and $p$, illustrating the case where $Z_q + Z_p \leq d(p, q)$, which implies that $NN_C(p) \neq NN_C(q)$.
3.4 VoronoiPlex LSH: a toy example with points $C = \{c_1, c_2, c_3, c_4, c_5\}$ as centroids shared between partitionings, choosing three centroids for each partitioning and concatenating four partitionings in the final hash value.
4.1 (a) Listeria gene dataset string-length distribution; (b) English dataset string-length distribution
4.2 APM dataset: comparison of Voronoi LSH with three centroid-selection strategies (K-means, K-medoids and random) for 10-NN
4.3 APM dataset: comparison of the impact of the number of cluster centers on the extensivity metric, and the correlation between extensivity and recall, for K-means LSH, Voronoi LSH and DFLSH (10-NN)
4.4 Effect of the initialization procedure for the K-medoids clustering on the quality of the nearest-neighbor search
4.5 Voronoi LSH recall–cost compromises are competitive with those of BPI (error bars are the imprecision of our interpretation of BPI original numbers). To make the results commensurable, time is reported as a fraction of brute-force linear scan.
4.6 Different choices for the seeds of Voronoi LSH: K-medoids with different initializations (K-means++, Park & Jun, random), and using random seeds (DFLSH). Experiments on the English dictionary dataset.
4.7 Indexing time (ms) varying with the number of centroids. Experiments on the English dictionary dataset.
4.8 Recall of Voronoi LSH using random centroids (DFLSH) for the Listeria gene dataset, using L=2 and L=3 hash tables and varying the number of centroids (obtaining varying extensivity)
4.9 Recall of VoronoiPlex LSH using random centroids (10 centroids selected from a 4000-point sample set) for the Listeria gene dataset, using L=1 and L=8 hash tables and varying the key length w
4.10 Random seeds vs. K-means++ centroids on Parallel Voronoi LSH (BigANN). Well-chosen seeds have little effect on recall, but occasionally lower query times.
4.11 Efficiency of Parallel Voronoi LSH parallelization as the number of nodes used and the reference dataset size increase proportionally. The results show that the parallelism scheme scales up well even for a very large dataset, with modest overhead.
A.1 Class diagram of the system
Chapter 1
Introduction
The success of the Internet and the popularization of personal digital multimedia devices (digital
cameras, mobile phones, etc.) has spurred the availability of online multimedia content. Meanwhile,
the need to make sense of ever-growing data collections has also increased. We face the challenge
of processing very large collections, in many media (photos, videos, text and sound), geographically
dispersed, with assorted appearance and semantics, available instantaneously at the fingertips of the
users. Information Retrieval, Data Mining and Knowledge Management must attend to the needs of
the “new wave” of multimedia data [Datta et al., 2008].
Scientific interest in Multimedia Information Retrieval has been steadily increasing, with the
convergence of various disciplines (Databases, Statistics, Computational Geometry, Information
Systems, Data Structures, etc.), and the appearance of a wide range of potential applications, for
both private and public sectors. As an example of this interest we refer to the Multimedia Grand
Challenge¹, opened in the ACM Multimedia conference of 2009, which consists of a competition
with a set of problems and issues brought by the industry to the scientific community. Amongst
others, Datta et al. [2008] present results showing an exponential growth in the number of articles
containing the keyword “Image Retrieval” as indicative of the growing interest in the area.
Similarity search is a key step in most of these systems (Information Retrieval, Machine Learning and Pattern Recognition systems), and there is the need to support different distance functions and data formats, as well as to design fast and scalable algorithms, especially for processing billions or more multimedia items in tractable time. The strategy for this task may be twofold: improving data structures and algorithms for sequential processing, and adapting algorithms and data structures for parallel and distributed processing.
The idea of comparing the similarity of two (or more) abstract objects can be formally specified
and intuitively comprehended using the concept of distance between points in some generic space.
Thus, if we entertain the possibility of representing our abstract objects as points in some generic
space equipped with a distance measure, we may follow the intuition that the closer the distance
between the points, the more similar the objects. This has been established as a standard theoretical
and applied framework in Content-based Multimedia Retrieval, Data Mining, Pattern Recognition, Computer Vision and Machine Learning. Although such an approach requires a certain level of abstraction and approximation, it is very practical and appropriate for the task.

¹ http://www.acmmm12.org/call-for-multimedia-grand-challenge-solutions/
However, since there are many possible distance measures for different types of data (strings, text documents, images, video, sounds, etc.), solutions that are effective and efficient only for specific distance measures are useful, but limited. Our purpose is to study existing specific solutions in order to generalize them to generic metric spaces without losing efficiency and effectiveness.
In this chapter we will formally define similarity search, discuss the available solutions and present its most common applications.
1.1 Defining the Problem
A definition of Metric Access Methods (MAM) is given by Skopal [2010]:
Set of algorithms and data structure(s) providing efficient (fast) similarity search under
the metric space model.
The metric space model assumes that the domain of the problem is captured by a metric space
and that the measure of similarity between the objects of that domain can be represented using some
distance function in the metric space. Thus, the problem of finding, classifying or grouping similar
objects from a given domain is translated to a geometric problem of finding nearest-neighbor points
in a metric space. The challenge is to provide data structures and algorithms that can accomplish
this task efficiently and effectively in a context of large scale search.
The obvious approach to nearest neighbor search is a linear sequential algorithm that scans the whole dataset. However, this approach is not scalable in realistic set-ups with large datasets and large query sets. The goal is to achieve much more efficient query performance by using refined data structures and algorithms.
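For concreteness, a minimal sketch of that brute-force baseline is given below (Python, with our own helper names; it is an illustration, not the dissertation's implementation):

```python
import heapq

def linear_scan_knn(dataset, query, k, dist):
    """Exact k-NN baseline: scans the whole dataset, O(n) distance computations."""
    return heapq.nsmallest(k, dataset, key=lambda x: dist(query, x))

# Toy usage with the Euclidean distance:
euclidean = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (0.2, 0.1)]
print(linear_scan_knn(data, (0.1, 0.1), k=2, dist=euclidean))  # [(0.2, 0.1), (0.0, 0.0)]
```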
Another challenge for the naive approach is the case of datasets with high dimensionality, meaning concentrated histograms of distances, sparsity of points and various other non-trivial phenomena related to high dimensionality. Exact algorithms have failed to tackle this challenge, and approximate methods have proven to be the most promising approach.
Finally we can enunciate our specific problem statement:
The development of effective and efficient methods (data structures and algorithms) for approximate similarity search in generic metric spaces. The question is whether it is possible to offer a better efficiency/effectiveness trade-off than the methods available in the literature, and under which conditions this improvement can be achieved.
1.2 Applications
Content-based multimedia information retrieval (CBMIR) is an alternative to keyword-based or tag-
based retrieval, which works by extracting features based on distinctive properties of the multimedia objects.

Figure 1.1: Schematics of a CBMIR system — multimedia documents in the collection are converted into descriptors and stored; the query is likewise converted into query descriptors, and a search for stored descriptors close to the query descriptor returns the most similar documents.

Those features are organized in multimedia descriptors, which are used as surrogates of
the multimedia object, in such a way that the retrieval of similar objects is based solely on that higher-level representation (Figure 1.1), without the need to refer to the actual low-level encoding of the media. The descriptor can be seen as a compact and distinctive representation of multimedia content, encoding some invariant properties of the content. For example, in image retrieval the successful Scale Invariant Feature Transform (SIFT) [Lowe, 2004] encodes local gradient patterns around Points-of-Interest, in a way that is (partially) invariant to illumination and geometric transformations.
The descriptor framework also allows abstracting the media details in multimedia retrieval systems. The operation of looking for similar multimedia documents becomes the more abstract operation of looking for multimedia descriptors which have small distances. The notion of a “feature space” emerges, organizing the documents in a geometry that puts close-by those that are similar. Of course, CBMIR systems are usually much more complex than that, but nevertheless, looking for similar descriptors often plays a critical role in the processing chain of a complex system.
1.3 Our Approach and Contributions
The fields of Similarity Search and Metric Data Structures are very broad fields of research, accumulating decades of effort and a great variety of results. It is not our intention to dismiss all the accumulated research, but rather to offer a modest contribution by presenting a generalization of a successful technique that has been used mostly for Euclidean and Hamming spaces. Locality-Sensitive Hashing, a technique invented more than a decade ago, is based on the existence of functions that map points in a metric space to scalar (integer) values, with a probabilistic bound that points nearby in the original space are mapped to equal or similar values. Until now most existing LSH functions were designed for a specific space – the most common ones for Euclidean and Hamming space. Nevertheless, it has been one of the most cited techniques for nearest-neighbor search in practical and theoretical communities of research.
Taking advantage of the success of Locality-Sensitive Hashing in Euclidean and Hamming spaces, we sought to investigate a configuration of the technique for metric spaces in general. For this purpose we use data-dependent partitioning of the space (for example, clustering) and hashing associated with the partitions of the space. Intuitively, the idea is that points assigned to the same partition have a high probability of being relevant nearest neighbors, and points in distinct partitions have a low probability of being real nearest neighbors (and conversely, a high probability of being “far” points). This idea was explored in the course of the research and resulted in two methods that generalize LSH to arbitrary metric spaces. This dissertation will present an intuitive description and discussion of the proposed methods, along with empirical results, but will also, as far as possible, describe our contributions in a formal fashion, presenting and demonstrating time and space complexities and proving some important and significant bounds on the hashing probabilities under reasonable assumptions. As far as we know, among the works proposing LSH for general metric spaces [Kang and Jung, 2012; Tellez and Chavez, 2010; Novak et al., 2010; Silva and Valle, 2013; Silva et al., 2014], this is the first to develop a theoretical characterization of the hashing collision probabilities (Sections 3.1.4, 3.1.3, 3.2.3).
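To make that intuition concrete before the formal treatment, the sketch below illustrates the core idea of hashing by metric partition: the key of a point is the index of its nearest anchor, and independently chosen anchor sets give independent hash tables. This is only an illustrative sketch under our own names (anchors chosen at random); the actual Voronoi LSH and VoronoiPlex LSH algorithms are specified in Chapter 3.

```python
import random
from collections import defaultdict

class VoronoiStyleTable:
    """One hash table induced by a metric partition: key(x) = index of the nearest anchor."""

    def __init__(self, dataset, dist, num_anchors, seed=0):
        rng = random.Random(seed)
        self.dist = dist
        self.anchors = rng.sample(dataset, num_anchors)  # random anchors; K-medoids is another choice
        self.buckets = defaultdict(list)
        for x in dataset:
            self.buckets[self.key(x)].append(x)

    def key(self, x):
        # Cell of the Voronoi (Dirichlet) partition that contains x.
        return min(range(len(self.anchors)), key=lambda i: self.dist(x, self.anchors[i]))

    def candidates(self, query):
        # Points falling in the same cell as the query form the candidate set.
        return self.buckets[self.key(query)]
```

A query is then answered by ranking, with the true distance, the union of candidate sets retrieved from several such tables built with different seeds.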
1.3.1 Publications
Some of the contributions in this dissertation have already been reported to the research community. A preliminary work describing VoronoiLSH was accepted and presented at the major national conference on Databases – the Brazilian Symposium on Databases. In this work, we introduce the method, review a significant part of the literature and report some experimental data supporting the viability of the method. The paper was accepted as a short paper (poster session) but was included in the small group of short papers given oral presentation time in the technical sessions. The parallel version of VoronoiLSH using multistage dataflow programming, developed in collaboration with Prof. Dr. George Teodoro, Thiago Teixeira and Prof. Dr. Eduardo Valle, was accepted for the 7th International Conference on Similarity Search and Applications (SISAP 2014) and should be presented there in October, 2014. We are also planning an additional publication reporting VoronoiPlex LSH and taking a more theoretical stance, reporting and extending the theoretical results presented in this dissertation. The complete list of publications related to this research is:
Eliezer Silva, Thiago Teixeira, George Teodoro, and Eduardo Valle. Large-scale distributed locality-sensitive hashing for general metric data. In Similarity Search and Applications – Proceedings of the 7th International Conference on Similarity Search and Applications. Springer, October 2014 [Silva et al., 2014]

Eliezer Silva and Eduardo Valle. K-medoids LSH: a new locality sensitive hashing in general metric space. In Proceedings of the 28th Brazilian Symposium on Databases, pages 127–132, Brazil, 2013. SBC. URL http://sbbd2013.cin.ufpe.br/Proceedings/artigos/sbbd_shp_22.html [Silva and Valle, 2013]
Chapter 2
Theoretical Background and Literature
Review
In this chapter we will present and discuss extensively the concepts necessary to understand our contributions and the state of the art on the subject. First, we will introduce the basic mathematical and algorithmic notions and notation. Next, we will discuss how the broader problem of Similarity Search in Metric Spaces has been approached in the specialized literature, without diving deeply into the references of each specific method. We will focus the discussion on the main ideas applied to metric indexing and refer the reader to detailed surveys of the algorithms. Finally, Locality-Sensitive Hashing is addressed and discussed in detail, including a formal presentation of the algorithms and mathematical demonstrations of selected properties.
2.1 Geometric Notions
In this dissertation we will use the framework of Metric Spaces to address the problem of Similarity Search. We adopt this model because of its generality and versatility. Any collection of objects equipped with a function measuring the similarity of the objects and obeying a small set of axioms (the metric axioms) can be analyzed and processed using the tools developed for metric spaces.
Another advantage is the possibility of using geometric reasoning and intuition in the analysis of objects that are not trivially thought of as geometric (for example, strings or text documents). So, in this setting, it is possible to speak of a “ball” around a string using edit distances, for example. In this section we will develop some of these intuitions, formalizing geometric notions such as balls and the triangle inequality over generic metric spaces.
2.1.1 Vector and Metric Space
We will briefly present the definition and axioms of metric spaces and the fundamental operations we are interested in.
Metric spaces are general sets equipped with a metric (or distance), which is a real-valued non-negative function between pairs of points. Given a definition of a set of objects and a definition of a function comparing pairs of objects (obeying certain properties), we have a metric space. The distance function must obey four basic axioms: the image of the distance function is non-negative, the function is symmetric with respect to the order of the points, points with distance zero are equal, and the triangle inequality holds.
Definition 2.1 (Metric space properties). Given a set $U$ (domain) and a function $d: U \times U \to \mathbb{R}$ (distance or dissimilarity function), the pair $(U, d)$ is a metric space if the distance function $d$ has the following properties [Chávez et al., 2001]:

$\forall x, y \in U,\; d(x, y) \geq 0$ (non-negativeness)

$\forall x, y \in U,\; d(x, y) = d(y, x)$ (symmetry)

$\forall x, y \in U,\; d(x, y) = 0 \Leftrightarrow x = y$ (identity of indiscernibles)

$\forall x, y, z \in U,\; d(x, y) \leq d(x, z) + d(z, y)$ (triangle inequality)
If we want to build a notion of neighborhood for points in a metric space there is only one tool to use: the distance information between pairs of points. A natural way of accomplishing that is to define a distance range around a point and analyze the region of the space within that range; this idea is similar to looking at an interval with a central point on the real line, generalized to metric spaces using distances between points. This definition is useful for the set of indexing methods that partition the space using metric balls.
Definition 2.2 (Open and Closed ball). Given a set $X$ in a metric space $(U, d)$ and a point $p \in X$, we define the Open Ball of radius $r$ around $p$ as the set $B_X(p, r) = \{x \in X \mid d(x, p) < r\}$, and the Closed Ball of radius $r$ around $p$ as $B_X[p, r] = \{x \in X \mid d(x, p) \leq r\}$.
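To make the earlier remark about a “ball” around a string concrete, the sketch below materializes a closed ball $B_X[p, r]$ under the edit (Levenshtein) distance; the helpers are ours and only illustrative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (a metric on strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closed_ball(X, p, r, dist=levenshtein):
    """B_X[p, r]: all points of X within distance r of p."""
    return [x for x in X if dist(x, p) <= r]

# closed_ball(["cat", "cart", "dog", "cast"], "cat", 1) -> ['cat', 'cart', 'cast']
```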
Definition 2.3 (Finite Vector Spaces). A given metric space $(U, d)$ is a finite-dimensional vector space (which for the sake of briefness we will refer to simply as a vector space) with dimension $D$ if each $x \in U$ can be represented as a tuple of $D$ real values, $x = (x_1, \ldots, x_D)$. The most common distances for vector spaces are the $L_p$ distances [Chávez et al., 2001; Skopal, 2010]: for $x, y \in U$, where $x = (x_1, \ldots, x_D)$ and $y = (y_1, \ldots, y_D)$,
$$L_p(x, y) = \sqrt[p]{\sum_{i=1}^{D} |x_i - y_i|^p}$$
Now we define three fundamental search problems central for similarity search in metric spaces: Range Search, K-Nearest Neighbor Search and $(R, c)$-Nearest Neighbor Search. Given a subset of a metric space and a set of queries, the Range Search problem is that of efficiently finding the data points in a metric ball around each query point, with the radius given as a parameter (called the range).
Definition 2.4 (Range search). Given the metric space $(U, d)$, a dataset $X \subseteq U$, a query point $q \in U$ and a range $r$, find the set $R(q, r) = \{x \in X \mid d(q, x) \leq r\}$ [Clarkson, 2006].
Definition 2.5 (K-Nearest Neighbors (kNN) search). Given the metric space $(U, d)$, a dataset $X \subseteq U$, a query point $q \in U$ and an integer $k$, find the set $NN_X(q, k)$ such that $|NN_X(q, k)| = k$ and $\forall (x \in X \setminus NN_X(q, k))\, \forall (y \in NN_X(q, k)): d(y, q) \leq d(x, q)$ (the $k$ closest points to $q$) [Clarkson, 2006].
A related problem is the $(c, R)$-Nearest Neighbor, defined as the approximate nearest neighbor for a given radius.
Definition 2.6 (c-approximate R-near neighbor, or (c, R)-NN [Andoni and Indyk, 2006]). Given a set $X$ of points in a metric space $(U, d)$, and parameters $R > 0$, $\delta > 0$, construct a data structure such that, given any query point $q \in U$, if there is $p \in X$ with $d(p, q) \leq R$ ($p$ is an R-near point of $q$), it reports some point $p' \in X$ with $d(p', q) \leq cR$ ($p'$ is a cR-near neighbor of $q$ in $X$) with probability at least $1 - \delta$.
Given the many challenges of large-scale search in metric spaces, the choice of approximate and randomized algorithms is justified by the possibility of a quality/time (efficacy/efficiency) trade-off, which may favor scaling the algorithms under acceptable error rates. Patella and Ciaccia [2009] present a survey of approximate methods and of the major challenges of approximate search in spatial (vector) and metric data.
2.1.2 Distances
There are a variety of possible distance definitions over a variety of objects. We will restrict ourselves to presenting just a small sample of the population of metric distances. Our aim is only to illustrate and contextualize our discussion of Similarity Search by presenting some distances that are useful in applications, especially in Content-Based Multimedia Retrieval, Machine Learning, Pattern Recognition, Databases and Data Mining.
The usual distance functions in coordinate spaces (especially Euclidean) are the $L_p$ distances, also known as Minkowski distances. $L_p$ distances are related to our most elementary geometric intuitions and are widely applied in models of similarity and in distance-based algorithms; for example, Skopal [2010] says that more than 80% of the relevant literature in metric indexing applies $L_p$ distances.
Definition 2.7 ($L_p$ distances on finite-dimensional coordinate spaces). Given a coordinate metric space $(U, d)$ of dimensionality $D$, the $L_p$ distance between two points $p = (p_1, \cdots, p_D)$ and $q = (q_1, \cdots, q_D)$ is defined as
$$L_p(p, q) = \left( \sum_{i=1}^{D} |p_i - q_i|^p \right)^{1/p}$$
A Euclidean metric space with $D$ dimensions equipped with an $L_p$ metric may be referred to as an $L_p^D$ space.
Three $L_p$ distances with widespread use are the Euclidean ($p = 2$), the Manhattan ($p = 1$, also known as the Taxicab metric), and the Chebyshev ($p = \infty$, also known as the maximum metric). The Euclidean distance is related to common analytic geometry, and is the generalization to higher dimensions of the Pythagorean distance between two points in the plane – the square root of a sum of squares.
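A small sketch of these $L_p$ distances (an illustrative helper of ours, shown only to fix notation):

```python
import math

def minkowski(x, y, p):
    """L_p (Minkowski) distance between D-dimensional points; p = float('inf') gives Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)                       # maximum (Chebyshev) metric
    return sum(d ** p for d in diffs) ** (1.0 / p)

# minkowski((0, 0), (3, 4), 1) -> 7 (Manhattan); p=2 -> 5.0 (Euclidean); p=float('inf') -> 4
```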
Another important definition is the distance from a point to a set, a composite of the distances to the individual points in the set. This concept will be applied in the analysis of our proposed methods.
Definition 2.8 (Point distance to sets). Given a metric space $(U, d)$, a point $x \in U$ and a set $C \subseteq U$, $d(x, C)$ is defined as the distance from $x$ to the nearest point in $C$:
$$d(x, C) = \min\{d(x, c) \mid c \in C\}$$
These are restricted examples of a long and diverse set of distances; we choose to define here only the ones that are necessary for understanding the concepts and algorithms described in this dissertation. Deza and Deza [2009] have done a compendious work of cataloging an exhaustive collection of distances and metrics; that is the proper reference for the reader interested in further details and more examples of distances and metrics.
2.1.3 Curse of Dimensionality
The “Curse of Dimensionality” (CoD) is a generic term associated with the intractability and challenges that come with growing dimensionality of the search space in algorithms for statistical analysis, mathematical optimization, geometric operations and other areas. It is related to the fact that geometric intuitions from lower dimensionality are sometimes completely subverted in higher dimensionality, and many times in a way that undermines the strategies that worked in lower dimensions. However, not all properties associated with the curse are always negative: depending on the subject area the curse can be a blessing [Donoho et al., 2000]. The term was first coined by Bellman [1961] as an argument against the strategy of using discretization and brute force over the search space, in the context of optimization and dynamic programming. The argument is simple: the number of partitions grows exponentially with the dimensionality; that is, to approximately optimize a function over a space with dimensionality $D$ using grid search, achieving an error $\epsilon$ requires searching over $(1/\epsilon)^D$ grid cells [Donoho et al., 2000].
In Similarity Search (Metric Access Methods and Spatial Access Methods) the curse has been related to the difficulty of pruning the search space as the (intrinsic) dimensionality grows – the performance of many methods in high dimensions is no better than a linear scan. Also, the very notion of nearest neighbor in high-dimensional spaces becomes blurred, especially when the distance to the nearest point and the distance to the farthest point in the dataset become indistinguishable: general conditions over the distance distribution have been demonstrated, covering a wide class of data and query workloads, implying with high probability that as the dimensionality increases the distance to the farthest point and the distance to the closest point become practically the same [Aggarwal et al., 2001; Shaft and Ramakrishnan, 2006; Beyer et al., 1999]. This effect has been related to the concentration of distances and to the intrinsic dimensionality of the dataset [Chávez et al., 2001; Shaft and Ramakrishnan, 2006; Pestov, 2000, 2008], but it is still an open question which formulation explains how the difficulties for similarity search emerge.
Figure 2.1: Illustration of a hypothetical situation where the distance distribution gets more concentrated as the dimensionality increases, and the filtered portion of the dataset using a distance bound $r_0 \leq d(p, q) \leq r_1$. If in a low-dimensional space this bound implied searching over 10% of the dataset, in a higher-dimensional space this bound could lead to approximately 50% of the dataset being evaluated.

In fact, Volnyansky and Pestov [2009] demonstrate
that, under several general workloads and appropriate assumptions, pivot-based methods present concentration of measure, using the fact that these methods can be seen as a 1-Lipschitz embedding (the mapping to pivot space), and this can degrade the performance of a large class of metric indexing methods. Intuitively, it is possible to see that if we are using bounds on the distance distribution to perform a fast similarity search, a sharply concentrated distribution can be much more difficult to search: as the distribution gets concentrated around some value, a larger portion of the dataset is not going to be filtered by the bounds – in the limit we would have a full scan of the dataset. If in a low-dimensional space a distance bound $[r_0, r_1]$ ($r_0 \leq d(p, q) \leq r_1$) can be applied to filter out a huge portion of the dataset, in a higher-dimensional space with a concentrated distance distribution this same bound could filter out very few points; in order to achieve an equivalent selectivity, we would need a sharper bound, a process that could lead to very unstable query processing, where a small numerical perturbation of the distances would imply a large performance loss (see Figure 2.1 for an illustration).
Recently the hubness phenomenon has been studied and associated with high dimensionality and the CoD. Hubs are points that consistently appear in the kNN set of many other points in the dataset. Although hubness and concentration of measure are distinct phenomena, both emerge with increasing dimensionality. However, this property can also be exploited positively to avoid problems of high dimensionality. In fact, if there are natural clusters in the dataset, and the hubness property holds, it is expected that with increasing dimensionality the clusters become more well-defined [Radovanović et al., 2009, 2010]; in this situation metrics based on shared nearest neighbors can offer good results [Flexer and Schnitzer, 2013].
2.1.4 Embeddings
Since there are long-standing solutions for proximity search in more specific spaces (for example, Euclidean or vector spaces), a possible general approach to the problem is to map from the general metric space to those spaces where a solution already exists – a general technique in Computer Science, the reduction of an unsolved problem to another problem with a known solution. In order for those mappings to be useful, some properties of the original space must be preserved, specifically the distance information for any pair of points in the original space. There are also techniques focusing on preserving volume (or content) information between spaces [Rao, 1999], but we will not discuss them.
Our theoretical and practical interest for approximate similarity search relies on a class of embeddings that preserve pairwise distances up to a certain degree of distortion of the original distance. In particular, we are interested in Lipschitz embeddings with a fixed maximum distortion.
Definition 2.9 (c-Lipschitz mapping [Deza and Deza, 2009]). Given a positive scalar $c$, a c-Lipschitz mapping is a mapping $f: U \to S$ such that $d_S(f(p), f(q)) \leq c\, d(p, q)$, for $p, q \in U$.
Definition 2.10 (Bi-Lipschitz mapping [Deza and Deza, 2009]). Given metric spaces $(U, d_U)$ and $(S, d_S)$, a function $f: U \to S$ is a c-bi-Lipschitz mapping if there exists a scalar $c > 0$ such that $\frac{1}{c}\, d_U(p, q) \leq d_S(f(p), f(q)) \leq c\, d_U(p, q)$, for $p, q \in U$.
A simple example of a 1-Lipschitz embedding is the pivot space (as we will see later, this is the basis for a whole family of metric indexing methods): given a metric space $(M, d)$, take $k$ points as the pivot set $P = \{p_1, \cdots, p_k\}$, and build the mapping $g_P(x) = (d(x, p_1), \cdots, d(x, p_k))$. Calculating the maximum metric over two embedded points $x, y \in M$ we obtain $L_\infty(g_P(x), g_P(y)) = \max_{i \in 1, \cdots, k}\{|d(x, p_i) - d(y, p_i)|\}$, and by the triangle inequality $|d(x, p_i) - d(y, p_i)| \leq d(x, y)$ for all $p_i$ (since $d(x, y) + d(y, p_i) \geq d(x, p_i)$ and $d(x, y) + d(x, p_i) \geq d(y, p_i)$, so $d(x, y) \geq |d(x, p_i) - d(y, p_i)|$).
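The 1-Lipschitz (lower-bound) property of this pivot mapping can be checked numerically with a small sketch (a toy Euclidean setting and helper names of our own, not from the original text):

```python
import random

def pivot_embedding(x, pivots, dist):
    """g_P(x) = (d(x, p_1), ..., d(x, p_k)): a 1-Lipschitz map into the pivot space."""
    return tuple(dist(x, p) for p in pivots)

def linf(u, v):
    """Maximum (Chebyshev) metric in the pivot space."""
    return max(abs(a - b) for a, b in zip(u, v))

# Sanity check of the lower bound L_inf(g_P(x), g_P(y)) <= d(x, y) on random Euclidean points.
euclidean = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
rng = random.Random(42)
points = [tuple(rng.random() for _ in range(5)) for _ in range(100)]
pivots = points[:4]
for _ in range(1000):
    x, y = rng.choice(points), rng.choice(points)
    g_x = pivot_embedding(x, pivots, euclidean)
    g_y = pivot_embedding(y, pivots, euclidean)
    assert linf(g_x, g_y) <= euclidean(x, y) + 1e-9
```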
In general there are results indicating that an $n$-point metric space can be embedded in Euclidean space with $\log(n)$ distortion [Matoušek, 1996; Bourgain, 1985; Matoušek, 2002]. The reader interested in further details about metric space embeddings should refer to the works of Deza and Laurent [1997] and Matoušek [2002].
By relying on the general theory of embedding, one can advance the theoretical and practical
understanding of a great number of similarity search algorithms in metric space. Although not
always explicitly stated, many times embeddings are essential components of those algorithms. In
the next section we will discuss how this is accomplished in some classes of algorithms.
2.2 Indexing Metric Data
The basic purpose of indexing metric data is to avoid a full sequential search, decreasing the number of distance computations and of points processed. We may think of it as a data structure partitioning the data space in such a way that query processing is computationally efficient and effective. Even if we increase the amount of pre-processing, in the long run, in a scenario with multiple queries arriving, the whole process is much more efficient than sequential search. So a good indexing structure should display polynomial complexity in the pre-processing phase and sub-linear complexity in the query search phase. We may also understand the index structure as the implementation of some sort of metric filtering strategy.

Figure 2.2: Domination mechanism and range queries: which range queries can be answered using the upper and lower bounding distance functions $d_U(q, x)$ and $d_L(q, x)$?
Hetland [2009], in a more recent survey of metric indexing methods, tries to uncover the most essential principles and mechanisms from the vast literature of metric methods, synthesizing previous surveys and reference works on metric indexing [Chávez et al., 2001; Hjaltason and Samet, 2003; Zezula et al., 2006]. Leaving aside detailed algorithm descriptions and thorough literature discussion, the author focuses on the metric properties, distance bounds and index-construction mechanisms constituting the building blocks of widespread metric indexing methods. In general, before taking into account specific metric properties, the author highlights three mechanisms for distance index construction: domination, signature and aggregation. Domination is the adoption of computationally cheaper distances, lower and upper bounds of the original distance, in order to avoid distance calculations (if the distance function $d_U$ is always greater than or equal to another distance $d$, then $d_U$ dominates $d$). Applying domination mechanisms to range queries is useful for avoiding expensive distance computations. For example, given two distance functions bounding the original distance such that for any pair of points $p$ and $q$, $d_L(p, q) \leq d(p, q) \leq d_U(p, q)$, if the range $R$ is greater than $d_U(p, q)$ the query “$d(p, q) < R$?” can be positively answered without the calculation of $d(p, q)$. Symmetrically, the distance computation can be avoided when the range $R$ is less than $d_L(p, q)$ (Figure 2.2 illustrates both situations). For instance, if the distance functions $d_L$ and $d_U$ are computationally cheaper than the distance $d$, we obtain a mechanism for improving the index efficiency.
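The resulting three-way decision (“yes”, “no”, “maybe”) for a range query $d(q, x) \leq R$ can be sketched as follows (hypothetical helper names; d_lower, d_upper and d_exact stand for $d_L$, $d_U$ and $d$):

```python
def range_query_filter(q, x, R, d_lower, d_upper, d_exact):
    """Answer 'd(q, x) <= R?' using cheap bounds, paying for d only when unavoidable."""
    if d_upper(q, x) <= R:
        return True             # "Yes": the upper bound already certifies the match
    if d_lower(q, x) > R:
        return False            # "No": the lower bound already rules the point out
    return d_exact(q, x) <= R   # "Maybe": only now compute the expensive distance
```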
Another lower-bounding mechanism is a mapping that takes points from the original space to a signature space. The signature of a point is a new representation of the object in a distinct space, such that the distance computed using the signatures is a lower bound to the distance in the original space. Taking two points $p$ and $q$ in the original space $U$, a non-expansive mapping (also known as a 1-Lipschitz mapping, Definition 2.9) is a function from the original space $U$ to the signature space $S$, $\sigma: U \to S$, such that $d_S(\sigma(p), \sigma(q)) \leq d(p, q)$; a non-expansive mapping is an example of a lower bound in the signature space. It is possible to see that any Lipschitz or bi-Lipschitz (Definition 2.10) mapping could be used as a signature mapping. The aggregation mechanism, used more often in conjunction with the signature or dominance mechanisms, consists in partitioning the search space into regions such that the distance bounds are applied to the aggregate rather than to individual points of the dataset. Those principles are used in specific metric mechanisms for indexing:
Pivoting and Pivot Scheme: a set of points $P = \{p_1, \cdots, p_k\}$ is selected and the dataset is stored with precomputed distance information that is later used to lower-bound the distance from queries to points in the dataset. Given a point $p$ in the dataset, a query $q$ and the pivot set, there is a lower bound on the distance from the query to points in the dataset given by $\max_{i \in 1, \cdots, k}\{|d(p, p_i) - d(q, p_i)|\} \leq d(p, q)$.

Metric balls and shells: pivots and search spaces are organized in ball and shell partitions in order to avoid distance calculations (aggregation). Generally some information regarding the radius of the aggregate region must be stored with the index in order to apply metric distance bounds using a representative point of the region, while taking into consideration all the other points. Most metric tree methods apply this technique, but so do methods like the List-of-Clusters (a hierarchical list of metric balls inside metric shells) [Chávez and Navarro, 2005; Fredriksson, 2007].

Metric Hyperplanes and Dirichlet Domains: in this case the regions are not of a particular shape, but are the result of dividing the metric dataset using metric hyperplanes. Taking two points $p_1$ and $p_2$, we can divide the space using the distance to these points; points closer to $p_1$ form one region, and points closer to $p_2$ form another region – the separating metric hyperplane is the set of points with $d(x, p_1) = d(x, p_2)$. This idea can be generalized if we use many reference points, or a hierarchical organization of the partitions. Chávez et al. [2001] use the compact partition relation to analyze different techniques relying on metric hyperplanes.
The survey by Chávez et al. [2001], despite being dated and not covering a considerable number of new relevant techniques, is still very relevant, since many challenges for similarity search are still not solved and remain open to new methods and approaches. Moreover, Chávez et al. [2001] show that those different views of metric indexing are equivalent and can be comprehensively understood using the unifying model of equivalence relations, equivalence classes and partitioning. This unified model is applied to the development of a rigorous algorithmic analysis of metric indexing methods, specifically methods based on pivot spaces and metric hyperplanes (also called compact partitions), where lower bounds on the probability of discarding whole partition classes are given and related to a specific measure of intrinsic dimensionality (the square of the mean over twice the variance of the distance histogram).
Metric trees: Burkhard and Keller [1973] (BK-Tree), in their seminal work, introduced a tree structure for searching with discrete metric distances. More recently, Uhlmann [1991] introduced the concept of a “metric tree”, which can be seen as a generalization of the BK-Tree for general metric distances. The Vantage-Point Tree (VPT), by Yianilos [1993], is a binary metric tree that starts with a random root point, and then separates the dataset into left and right subtrees using the median distance as the separating criterion. The M-tree [Ciaccia et al., 1997] is a balanced tree constructed using only distance information: at the search phase, it uses the triangle-inequality property to prune the nodes to be visited. The Slim-tree [Traina et al., 2000, 2002] is a dynamic metric tree with enhanced features, including the minimization of the overlap between nodes of the metric tree using a Minimum Spanning Tree algorithm.
Permutation-based Indexing: a recent family of MAM is the permutation-based indexing methods. These methods are based on the idea of taking a reference set (the permutants) and using the perspective of each point with respect to the permutants – the ordering of distances from a point in the dataset to the permutants – as relevant information for approximate search [Chávez et al., 2005; Gonzalez et al., 2008; Amato and Savino, 2008]. This ordering is very interesting because it is a mapping from a general metric space to a permutation space, which has a potentially cheaper distance function that can be exploited to derive new bounds and offer better performance. In fact, this mapping can be used with an inverted file and compared using the Spearman Rho, Kendall Tau or Spearman Footrule measures to perform approximate search in an effective procedure [Gonzalez et al., 2008; Amato and Savino, 2008].
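A minimal sketch of the permutation representation and the Spearman Footrule comparison (illustrative helpers of ours; practical permutation indexes also use truncated permutations and inverted files):

```python
def permutation_of(x, permutants, dist):
    """Rank vector of the permutants as seen from x: entry i is the rank of permutant i."""
    order = sorted(range(len(permutants)), key=lambda i: dist(x, permutants[i]))
    ranks = [0] * len(permutants)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

def spearman_footrule(ranks_a, ranks_b):
    """L1 difference between two rank vectors; small values suggest nearby objects."""
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b))
```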
For a comprehensive survey of Metric Access Methods the reader may refer to existing surveys [Chávez et al., 2001; Hjaltason and Samet, 2003; Zezula et al., 2006; Samet, 2005; Hetland, 2009; Clarkson, 2006]. Skopal [2010] and Zezula [2012] offer a critical review of the evolution of the area, and an evaluation of possible future directions, making explicit claims about the necessity of even more scalable algorithms for the future of the area. The PhD thesis of Batko [2006] is also a good recent reference and survey for Metric Access Methods and general principles of metric indexing.
2.3 Locality-Sensitive Hashing
The LSH indexing method relies on a family of locality-sensitive hashing functions $\mathcal{H}$ [Indyk and Motwani, 1998] to map objects from a metric domain $X$ in a $D$-dimensional space (usually $\mathbb{R}^D$) to a countable set $C$ (usually $\mathbb{Z}$), with the following property: nearby points in the metric space are hashed to the same value with high probability. It is presented in the seminal article [Indyk and Motwani, 1998] as an efficient and theoretically interesting approach to the Approximate Nearest-Neighbors problem, and later also as a solution for the $(R, c)$-NN problem [Datar et al., 2004; Andoni and Indyk, 2006] (Figure 2.3). A parallel line of work by Broder et al. [2000] developed the idea of MinHash (Min-Wise Independent Permutations) for fast estimation of set and document similarity, and later SimHash [Charikar, 2002] for the cosine distance between vectors, using a (slightly) distinct formulation of LSH. However, both formulations rely on a definition that relates similarities (and distances) between objects and hash probabilities. We will follow the formulation of Indyk and Motwani [1998].²

Figure 2.3: LSH and $(R, c)$-NN: the probability of reporting points inside the closed ball $B_X(q, r)$ as R-near neighbors of $q$ is high; the probability of reporting points outside the closed ball $B_X(q, cr)$ as cr-near neighbors of $q$ is lower. Points in the middle may be wrongly reported as R-near with some probability.

² It is worth noting that the authors of these articles, Andrei Broder, Moses Charikar, Piotr Indyk and Rajeev Motwani, were named recipients of the prestigious Paris Kanellakis Theory and Practice Award in 2012 for their contributions to the development of LSH. ACM Awards Page: The Paris Kanellakis Theory and Practice Award, http://awards.acm.org/kanellakis/, accessed June 7, 2014.
Definition 2.11. Given a distance function $d: X \times X \to \mathbb{R}^+$, a function family $\mathcal{H} = \{h: X \to C\}$ is $(r, cr, p_1, p_2)$-sensitive for a given data set $S \subseteq X$ if, for any points $p, q \in S$ and $h \in \mathcal{H}$:

if $d(p, q) \leq r$ then $Pr_{\mathcal{H}}[h(q) = h(p)] \geq p_1$ (probability of colliding within the ball of radius $r$);

if $d(p, q) > cr$ then $Pr_{\mathcal{H}}[h(q) = h(p)] \leq p_2$ (probability of colliding outside the ball of radius $cr$);

with $c > 1$ and $p_1 > p_2$.
A function family $\mathcal{G}$ is constructed by concatenating $M$ randomly sampled functions $h_i \in \mathcal{H}$, such that each $g_j \in \mathcal{G}$ has the form $g_j(v) = (h_1(v), \ldots, h_M(v))$. The use of multiple $h_i$ functions reduces the probability of false positives, since two objects will have the same key for $g_j$ only if their values coincide for all $h_i$ component functions. Each object $v$ from the input dataset is indexed by hashing it against $L$ hash functions $g_1, \ldots, g_L$. At the search phase, a query object $q$ is hashed using the same $L$ hash functions and the objects stored in the corresponding buckets are used as the candidate set. Then, the candidate set is ranked according to distance to the query, and the $k$ closest objects are returned.

²It is worth noting that the authors of these articles, Andrei Broder, Moses Charikar and Piotr Indyk, were named recipients of the prestigious Paris Kanellakis Theory and Practice Award in 2012 for their contribution to the development of LSH. ACM Awards Page: The Paris Kanellakis Theory and Practice Award, http://awards.acm.org/kanellakis/, accessed June 7, 2014.
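To make the indexing and querying procedure just described concrete, the following is a minimal sketch of a generic LSH index with $M$-fold concatenation and $L$ tables; the class name and the sample_hash argument are illustrative placeholders assumed for this example, not part of any specific library.

from collections import defaultdict

class GenericLSHIndex:
    """Sketch of an LSH index: L tables, each keyed by the concatenation
    of M hash functions drawn from a locality-sensitive family."""

    def __init__(self, sample_hash, M, L):
        # sample_hash() must return a fresh function h: object -> hashable
        # value, drawn at random from the family H (assumption of this sketch).
        self.g = [[sample_hash() for _ in range(M)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.data = []

    def _key(self, j, x):
        # g_j(x) = (h_1(x), ..., h_M(x)): concatenation reduces false positives.
        return tuple(h(x) for h in self.g[j])

    def index(self, dataset):
        self.data = list(dataset)
        for i, x in enumerate(self.data):
            for j, table in enumerate(self.tables):
                table[self._key(j, x)].append(i)

    def query(self, q, k, dist):
        # The union of the L colliding buckets forms the candidate set
        # (shortlist); the k closest candidates are returned after exact ranking.
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(j, q), []))
        return sorted((self.data[i] for i in candidates),
                      key=lambda x: dist(q, x))[:k]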
LSH works by boosting the locality sensitiveness of the hash functions. As $M$ grows, the probability of a false positive (points that are far away having the same value on a given $g_j$) drops sharply, but so grows the probability of a false negative (points that are close having different values). But as $L$ grows and we check all hash tables, the probability of false negatives falls, and the probability of false positives grows. LSH theory shows that it is possible to set $M$ and $L$ so as to have a small probability of false negatives, with an acceptable number of false positives. This allows the correct points to be found among a small number of candidates, dramatically reducing the number of distance computations needed to answer the queries.
The need to maintain and query $L$ independent hash tables is the main weak point of LSH. In the effort to keep both false positives and false negatives low, there is an "arms race" between $M$ and $L$, and the technique tends to favor large values for those parameters. The large number of hash tables results in excessive storage overheads. Referential locality also suffers, due to the need to random-access a bucket in each of the large number of tables. More importantly, it becomes unfeasible to replicate the data on so many tables, so each table has to store only pointers to the data. Once the index retrieves a bucket of pointers on one hash table, a cascade of random accesses ensues to retrieve the actual data.
2.3.1 Basic Method in Hamming Space
The basic scheme [Indyk and Motwani, 1998] (Hamming LSH) provided locality-sensitive families for the Hamming distance on Hamming spaces, and for the Jaccard distance on spaces of sets.

The original method [Indyk and Motwani, 1998] is limited to the Hamming space (bit-vectors of fixed size) using the Hamming distance ($d_H$, the number of differing bits at corresponding positions, i.e., the sum of an exclusive-or operation) and to point-set spaces using Jaccard similarities. Equation 2.1 describes the hash function family for the Hamming distance. The idea is to choose one position of the Hamming point coordinates as representative of the point.

$$\mathcal{H} = \{h_i : \{0,1\}^D \to \{0,1\} \subset \mathbb{Z}\}$$
$$h_i((b_1, \cdots, b_D)) = b_i, \quad 1 \leq i \leq D \qquad (2.1)$$
Because the distance between two Hamming points is bounded and the number of distinct hash functions is also bounded, it is easy to see that the probabilities $p_1$ and $p_2$ are bounded and obey the restrictions of the LSH definition. Indeed, there are $D$ possible functions $h_i$ for a Hamming space of dimensionality $D$, and the Hamming distance between two points measures the number of positions (out of the $D$ possible) at which the hashes of the two points have distinct values.
Figure 2.4: E2LSH: projection on a random line and quantization
Thus, the probability of not colliding is given by:
$$Pr_{h_i \in \mathcal{H}}[h_i(q) \neq h_i(p)] = \frac{d_H(q, p)}{|\mathcal{H}|} = \frac{d_H(p, q)}{D}$$
Considering the case of $d_H(q, p) > cr$, we obtain
$$Pr_{h_i \in \mathcal{H}}[h_i(q) \neq h_i(p)] = \frac{d_H(p, q)}{D} > \frac{cr}{D}$$
$$Pr_{h_i \in \mathcal{H}}[h_i(q) = h_i(p)] = 1 - Pr_{h_i \in \mathcal{H}}[h_i(q) \neq h_i(p)] < 1 - \frac{cr}{D}$$
And if $d_H(q, p) \leq r$,
$$Pr_{h_i \in \mathcal{H}}[h_i(q) \neq h_i(p)] \leq \frac{r}{D}$$
$$Pr_{h_i \in \mathcal{H}}[h_i(q) = h_i(p)] \geq 1 - \frac{r}{D}$$
So it is clear that $\mathcal{H}$ is an $(r, cr, 1 - \frac{r}{D}, 1 - \frac{cr}{D})$-sensitive hash family for the Hamming metric space.
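As a concrete illustration of this family, the sketch below draws bit-sampling hash functions for binary vectors; the helper name sample_hamming_hash is an illustrative choice assumed here, not taken from any particular library.

import random

def sample_hamming_hash(D):
    """Draw h_i uniformly from the bit-sampling family of Equation 2.1:
    h_i returns the i-th bit of a D-dimensional binary vector."""
    i = random.randrange(D)
    return lambda bits: bits[i]

# Minimal usage: two vectors at Hamming distance 1 collide
# with probability 1 - 1/D under a randomly drawn h_i.
D = 8
h = sample_hamming_hash(D)
p = (0, 1, 1, 0, 1, 0, 0, 1)
q = (0, 1, 1, 0, 1, 0, 0, 0)   # differs from p only in the last bit
print(h(p) == h(q))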
2.3.2 Extensions in Vector Spaces
For several years those were the only families available, although extensions for $L_1$-normed (Manhattan) and $L_2$-normed (Euclidean) spaces were proposed by embedding those spaces into Hamming spaces [Gionis et al., 1999].

The practical success of LSH, however, came with E2LSH³ (Euclidean LSH) [Datar et al., 2004], for $L_p$-normed spaces, where a new family of LSH functions was introduced (Figure 2.4 is a graphical representation of this equation):

$$\mathcal{H} = \{h_i : \mathbb{R}^D \to \mathbb{Z}\} \qquad (2.2)$$
$$h_i(v) = \left\lfloor \frac{a_i \cdot v + b_i}{w} \right\rfloor \qquad (2.3)$$

³LSH Algorithm and Implementation (E2LSH), http://www.mit.edu/~andoni/LSH/, accessed 22/09/2013.
$a_i \in \mathbb{R}^D$ is a random vector with each coordinate picked independently from a Gaussian distribution $N(0, 1)$, $b_i$ is an offset value sampled from a uniform distribution in the range $[0, w]$, and $w$ is a fixed scalar that determines the quantization width. Applying $h_i$ to a point or object $v$ corresponds to the composition of a projection onto a random line and a quantization operation, given by the quantization width $w$ and the floor operation.
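A minimal sketch of this hash family is shown below, assuming NumPy is available; the parameter names (w, a, b) follow Equation 2.3, while the function name and usage values are illustrative assumptions.

import numpy as np

def sample_e2lsh_hash(D, w, rng=None):
    """Draw one h_i from the E2LSH family (Equation 2.3): project onto a
    random Gaussian direction, add a random offset in [0, w), and quantize
    with bucket width w."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.normal(0.0, 1.0, size=D)   # random projection direction
    b = rng.uniform(0.0, w)            # random offset
    return lambda v: int(np.floor((np.dot(a, v) + b) / w))

# Usage: a point and a small perturbation of it usually share a bucket.
h = sample_e2lsh_hash(D=16, w=4.0)
v = np.ones(16)
print(h(v), h(v + 0.01))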
Andoni and Indyk [2006] extend LSH using geometric ball hashing and lattices. This approach achieves a complexity near the lower bound established by Motwani et al. [2006] and O'Donnell et al. [2014] for LSH-based algorithms. Instead of projecting the high-dimensional points onto a line, as in Datar et al. [2004] and Indyk and Motwani [1998], the points are projected onto a lower-dimensional space of dimension $k \ll D$ but greater than one ($D$ is the original space dimension), followed by a "ball partitioning" quantization in the low-dimensional space. Nevertheless, for practical purposes (fast encoding and decoding) a Leech lattice quantizer is used as an alternative to the "ball partitioning" quantizer. The Leech lattice is a dense lattice in 24-dimensional space, introduced by Leech [1964] for the ball-packing problem.
Query-adaptive LSH [Jegou et al., 2008] introduces a dynamic bucket filtering scheme based on the relative position of the hashed (but not quantized) value of the query point with respect to the quantizer cell frontier. Suppose the hashing function $h(x)$ may be seen as the composition of a function $f : M \to \mathbb{R}$, mapping the point to a scalar value, and a quantizer $g : x \mapsto \lfloor x \rfloor$. The frontiers of the cell of $h(x)$ are $\lfloor f(x) \rfloor$ and $\lfloor f(x) \rfloor + 1$, and the distance to the center of the cell, given by $|\lfloor f(x) \rfloor - f(x) + 1/2|$, can be seen as a relevance criterion for the quality of the bucket. The further from the cell frontier (or the closer to the cell center), the better the quality of the bucket. Using this relevance criterion, the "best" buckets are selected without any point-wise distance calculation.
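For illustration, the relevance criterion can be computed as in the sketch below (an assumption-laden example: it only presumes the unquantized projection value f(x) is available for the query).

import math

def bucket_relevance(fx):
    """Distance of the (unquantized) projected value fx to the center of its
    quantization cell; smaller values mean the point lies nearer the cell
    center, so its bucket is judged more reliable, without touching the data."""
    return abs(math.floor(fx) - fx + 0.5)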
Panigrahy [2006] proposes entropy-based LSH, an alternative approach to LSH for high-dimensional nearest neighbor search that employs very few hash tables (often just one) and, in the search phase, probes multiple randomly "perturbed" buckets around the query bucket in each table, in order to maintain the probabilistic response guarantees. Theoretical analysis supports the assertion that this approach yields performance similar to basic LSH. Although the number of probes is very large, increasing the query cost, the trade-off might still be interesting for some applications, especially in very large-scale databases, since main memory might be a system-wide constraint while processing time might be available.
Multiprobe LSH [Lv et al., 2007] follows the approach of Panigrahy but, instead of using expensive random probes, generates a careful probing sequence, visiting first the buckets most likely to contain the results. The probing sequence follows a success-likelihood estimation, which generates an optimal probing sequence whose quality can be controlled by setting the number of probes. Joly and Buisson's a posteriori multi-probe LSH [Joly and Buisson, 2008] extends that work by turning the likelihoods into probabilities, incorporating a Gaussian prior estimated from training data, in a scheme aptly called a posteriori LSH. They employ an "estimated accuracy probability" stop criterion instead of a fixed number of probes.
Other contributions approach the problems of parameter tuning and of the dynamic adaptation of the LSH method. LSH Forest [Bawa et al., 2005] uses variable-length hashes and a prefix-tree structure for self-tuning of the LSH parameters. Ocsa and Sousa [2010] propose a similar adaptive multilevel LSH index for dynamic index construction, changing the hash-length parameter at each level of the structure. A more recent work by Slaney et al. [2012] focuses on the optimization of the parameters using prior knowledge of the distance distribution of the dataset to achieve a desired quality level for the nearest neighbor method.
2.3.3 Structured or Unstructured Quantization?
In the original LSH formulations [Indyk and Motwani, 1998; Gionis et al., 1999; Datar et al., 2004; Andoni and Indyk, 2006], the hashing schemes are based on structured random quantization of the data space, and this regularity is useful for deriving the bounds on the hash collision probability. However, the question can be raised whether this structure is really necessary for locality-sensitiveness and whether other non-regular partitionings could be applied. This discussion concerns how distinct space or data partitionings can be useful in the LSH framework.

Indeed, early proposals, Hamming LSH and E2LSH for example, used exclusively regular quantizers of the space, independent of the data. However, K-means LSH [Paulevé et al., 2010] presents a comparison between structured (random projections, lattices) and unstructured (K-means and hierarchical K-means) quantizers in the task of searching high-dimensional SIFT descriptors, resulting in the proposal of a new LSH family based on the latter (Equation 3.2). Experimental results indicate that the LSH functions based on unstructured quantizers perform better, as the induced Voronoi partitioning adapts to the data distribution, generating more uniform hash-cell populations than the structured LSH quantizers. A drawback of this work is the reliance solely on empirical evidence and the lack of theoretical results demonstrating, for example, how the collision probabilities in K-means LSH could offer better results than the previous versions based on structured quantization.
Advancing the theoretical analysis in the direction of unstructured quantization, Andoni et al. [2014] developed an adapted version of LSH using two-level data-dependent hashing. Two reference works [Andoni and Indyk, 2006; Andoni, 2009] demonstrate lower bounds for approximate nearest-neighbor search with locality-sensitive hashing (a lower bound on $\rho$, given that the query complexity is $dn^{\rho + o(1)}$); an LSH family based on lattices and ball partitioning achieving near-optimal performance is also presented. Furthermore, relying on the same ideas, a two-level data-dependent hashing scheme may be constructed that performs (theoretically) even better [Andoni et al., 2014].

Finally, K-means and hierarchical K-means are clustering algorithms for vector spaces, which restricts this approach to vector data. In order to overcome this limitation we turn to clustering and data-dependent partitioning algorithms designed to work in generic metric spaces, such as K-medoids clustering and random Voronoi partitioning.
2.3.4 Extensions in General Metric Spaces
As far as we know, there are only three works that tackle the problem of designing LSH indexing schemes using only distance information, all of them exploiting the idea of Voronoi partitioning of the space.

M-Index
M-Index [Novak and Batko, 2009] is a Metric Access Method for exact and approximate similarity search based on a universal mapping from the original metric space to scalar values in $[0, 1]$. The values of the mapping are affected by the permutation order of a set of reference points and by the distances to these points. It uses a broad range of metric filtering when performing the query processing. In a follow-up work [Novak et al., 2010], this indexing scheme is analyzed empirically as a locality-sensitive hashing for general metric spaces.
Permutation-based Hashing
Tellez and Chavez [2010] present a general metric space LSH method using permutation-based indexing, combining a mapping technique with hashing. In the permutation-index approach the similarity is inferred from the perspective on a group of points called permutants (if point $p$ sees the permutants in the same order as point $q$, then $p$ and $q$ are likely to be close to each other). In a way, this is similar to embedding the data into a space suited to LSH and then applying a built-in (in that space) LSH function. Indeed, the method consists of two steps: first it creates a permutation index (designed to be compared using the Hamming distance); then it hashes the permutation indexes using Hamming LSH [Indyk and Motwani, 1998].
DFLSH
DFLSH (Distribution-Free Locality-Sensitive Hashing) is introduced in a recent paper by Kang and Jung [Kang and Jung, 2012]. The idea is to randomly choose $t$ points from the original dataset (with $n > t$ points) as centroids and index the dataset using the nearest centroid as hash key; this construction yields an approximately uniform number of points per bucket: $O(n/t)$. Repeating this procedure $L$ times, it is possible to generate $L$ hash tables. A theoretical analysis is provided and, with some simplifications, it shows that this approach satisfies the locality-sensitive property.
2.4 Final Remarks
We presented the fundamental mathematical and algorithmic concepts essential for this work, with emphasis on the definition of metric space and on the different types of similarity search over metric spaces. Despite the large number of metric indexing methods, we relied on the general techniques and principles in the literature to give a broad description of the intuition behind most of the metric indexing methods. Nevertheless, scaling these algorithms up in terms of dataset size and dimensionality remains an open challenge. Given the flexibility of permutation-based methods as a possible family of LSH methods, and taking into consideration the idea of data-based quantizers for Euclidean spaces, we present in the next chapter two algorithms for Similarity Search in general metric spaces using a family of LSH functions in metric spaces.
Chapter 3
Towards a Locality-Sensitive Hashing in General Metric Spaces
This chapter introduces our main contribution. We present two methods for Locality-Sensitive Hashing in general metric spaces and a theoretical characterization of the proposed methods as locality-sensitive. In general metric spaces, all structural and local information is encoded in the distances between the points, forcing us to use this information in the design of the hash functions. Our practical solution is to partition the metric space using clustering or simple Voronoi diagrams (or Dirichlet Domains) induced by a subset of points (generalized hyperplane partitioning), to assign numbers to the partitions, and to use these numbers to build hash functions. We present VoronoiLSH in Section 3.1 and VoronoiPlex LSH in Section 3.2, using an initial intuitive presentation and then a theoretical discussion of the methods. Section 3.3 introduces the work on the parallelization of VoronoiLSH using dataflow programming.
3.1 VoronoiLSH
We propose a novel method for locality-sensitive hashing in the metric search framework and compare it with other similar methods in the literature. Our method is rooted in the idea of partitioning the data space with a distance-based clustering algorithm (K-medoids) as an initial quantization step and constructing a hash table with the quantization results. Voronoi LSH follows a direct approach taken from Paulevé et al. [2010]: each partition cell is a hash table bucket, and therefore the hash value is the index number of the partition cell.
3.1.1 Basic Intuition
Each hash table of Voronoi LSH employs a hash function induced by a Voronoi diagram over the data space. If the space is known to be Euclidean (or at least a coordinate space), the Voronoi diagram can use as seeds the centroids learned by a Euclidean clustering algorithm, like K-means (in which case Voronoi LSH coincides with K-means LSH). However, if nothing is known about the space, points from the dataset must be used as seeds, in order to make as few assumptions as possible about the structure of the space. In the latter case, randomly sampled points can be used (in which case Voronoi LSH coincides with DFLSH). However, it is also possible to try to select the seeds by employing a distance-based clustering algorithm, like K-medoids, and using the medoids as seeds. Note that the Voronoi partitioning is implicit: only the centroids of the partitions and pointers to the points of the dataset need to be stored, which is necessary in order to avoid exponential space usage.

Neither K-means nor K-medoids is guaranteed to converge to a global optimum, due to the dependence on the initialization. For Voronoi LSH, however, obtaining the best clustering is not a necessity. Moreover, when it is used with several hash tables, the differences between runs of the clustering are employed in its favor, as one of the sources of diversity among the hash tables (Figure 3.1). We further explore the problem of optimizing clustering initialization below.
Figure 3.1: Each hash table of Voronoi LSH employs a hash function induced by a Voronoi diagram over the data space. Differences between the diagrams, due to the data sample used and the initialization seeds employed, allow for diversity among the hash functions.
PAM (Partitioning Around Medoids) [Kaufman and Rousseeuw, 1990] is the basic algorithm for K-medoids clustering. A medoid is defined as the most centrally located object within a cluster. The algorithm initially selects a group of medoids at random or using some heuristic, then iteratively assigns each non-medoid point to the nearest medoid and updates the medoid set, looking for the optimal set of points that minimizes the quadratic sum of distances $f(C)$, with $C = \{c_1, \ldots, c_k\}$ being the set of centroids (Equation 3.1; refer to Definition 2.7 for $d(x, C)$).

$$f(C) = \sum_{x \in X \setminus C} d(x, C)^2 \qquad (3.1)$$

PAM has a prohibitive computational cost, since it performs $O(k(n - k)^2)$ distance calculations per iteration. There are a few methods designed to cope with this complexity, such as CLARA [Kaufman and Rousseeuw, 1990], CLARANS [Ng, 2002], CLATIN [Zhang and Couloigner, 2005], the algorithm by Park and Jun [2009] and FAMES [Paterlini et al., 2011]. Park and Jun [2009] propose a simple approximate algorithm based on PAM with significant performance improvements: instead of searching the entire dataset for a new optimal medoid, the method restricts the search to points within the cluster. We chose to apply this method, for its simplicity, as an initial attempt and baseline formulation; further exploration of other metric space clustering algorithms is performed and the results are reported.
3.1.2 Algorithms
We define more formally the hash function used by Voronoi LSH in Equation 3.2. In summary, it computes the index of the Voronoi seed closest to a given input object.¹ In addition, we present the indexing and querying phases of Voronoi LSH, respectively, in Algorithms 1 and 2. Indexing consists in creating $L$ lists with $k$ Voronoi seeds each ($C_i = \{c_{i1}, \ldots, c_{ik}\}$, $i \in \{1, \ldots, L\}$). When using K-medoids, which is expensive, we suggest the Park and Jun fast K-medoids algorithm (apud [Paulevé et al., 2010]), performing this clustering over a sample of the dataset and limiting the number of iterations to 30. Then, for each point in the dataset, the index stores a reference in the hash table $T_i$ ($i \in \{1, \ldots, L\}$), using the hash function defined in Equation 3.2 ($h_{C_i}(x)$). When using K-means, we suggest the K-means++ variation [Ostrovsky et al., 2006; Arthur and Vassilvitskii, 2007]. The seeds can also simply be chosen at random (as in DFLSH). The querying phase is conceptually similar: the same set of $L$ hash functions ($h_{C_i}(x)$) is computed for a query point, and all hash tables are queried to retrieve the references to the candidate answer set. The actual points are retrieved from the dataset using the references, forming a candidate set (shortlist), and the best answers are selected from this shortlist.

In this work, we focus on k-nearest neighbor queries. Therefore, this final step consists of computing the dissimilarity function to all points in the shortlist and selecting the k closest ones.
Definition 3.1. Given a metric space $(X, d)$ ($X$ is the domain set and $d$ is the distance function), the set of cluster centers $C = \{c_1, \ldots, c_k\} \subset U$ and an object $x \in X$:

$$h_C : U \to \mathbb{N}$$
$$h_C(x) = \operatorname{argmin}_{i=1,\ldots,k}\{d(x, c_1), \ldots, d(x, c_i), \ldots, d(x, c_k)\} \qquad (3.2)$$
input : Set of points X, number of hash tables L, size of the sample set S, and number of Voronoi seeds k
output: list of L index tables T_1, ..., T_L populated with all points from X, and list of L Voronoi seed sets C_i = {c_i1, ..., c_ik}, i ∈ 1, ..., L

for i ← 1 to L do
    Draw sample set S_i from X;
    C_i ← choose k seeds from sample S_i (random, K-means, K-medoids, etc.);
    for x ∈ X do
        T_i[h_{C_i}(x)] ← T_i[h_{C_i}(x)] ∪ {pointer to x};
    end
end

Algorithm 1: Voronoi LSH indexing phase: a Voronoi seed list (C_i) is independently selected for each of the L hash tables used. Further, each input data point is stored in the bucket entry (h_{C_i}(x)) of each hash table.

¹Chávez et al. [2001] use the term "compact partition" to refer to this kind of partitioning of the dataset and demonstrate some lower bounds for algorithms using compact partitions.
input : Query point q, index tables T_1, ..., T_L, L lists of k Voronoi seeds each C_i = {c_i1, ..., c_ik}, number of nearest neighbors to be retrieved N
output: set of N nearest neighbors NN(q, N) = {n_1, ..., n_N} ⊂ X

CandidateSet ← ∅;
for i ← 1 to L do
    CandidateSet ← CandidateSet ∪ T_i[h_{C_i}(q)];
end
NN(q, N) ← {N closest points to q in CandidateSet};

Algorithm 2: Voronoi LSH querying phase: the input query point is hashed using the same L functions as in the indexing phase, and the points in the colliding bucket of each hash table are used as the nearest-neighbor candidate set. Finally, the N closest points to the query are selected from the candidate set.
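As a concrete, hedged sketch of Algorithms 1 and 2, the Python code below implements Voronoi LSH with randomly chosen seeds (the DFLSH-like variant); the class and argument names are illustrative assumptions, and the random seed selection could equally be replaced by a K-medoids clustering step.

import random
from collections import defaultdict

class VoronoiLSH:
    """Sketch of Voronoi LSH: L hash tables, each induced by k Voronoi
    seeds; the hash of a point is the index of its nearest seed (Eq. 3.2)."""

    def __init__(self, dist, k, L):
        self.dist, self.k, self.L = dist, k, L

    def _hash(self, seeds, x):
        # h_C(x): index of the closest seed under the metric dist.
        return min(range(len(seeds)), key=lambda i: self.dist(x, seeds[i]))

    def index(self, dataset):
        self.data = list(dataset)
        # Seeds drawn at random from the dataset; K-medoids or K-means++
        # seeding could be plugged in here instead (assumption).
        self.seeds = [random.sample(self.data, self.k) for _ in range(self.L)]
        self.tables = [defaultdict(list) for _ in range(self.L)]
        for idx, x in enumerate(self.data):
            for i in range(self.L):
                self.tables[i][self._hash(self.seeds[i], x)].append(idx)

    def query(self, q, N):
        # Union of the colliding buckets, then exact ranking of the shortlist.
        candidates = set()
        for i in range(self.L):
            candidates.update(self.tables[i].get(self._hash(self.seeds[i], q), []))
        return sorted((self.data[j] for j in candidates),
                      key=lambda x: self.dist(q, x))[:N]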
Clustering Initialization
K-means and K-medoids clustering results exhibit a strong dependence on the initial cluster selection. Usually those points are selected at random, but some heuristics have been proposed. K-means++ [Ostrovsky et al., 2006; Arthur and Vassilvitskii, 2007] solves the $O(\log k)$-approximate K-means problem² simply by carefully designing a good initialization algorithm. That raises the question of how the initialization could affect the final result of the kNN search for the proposed methods. The K-means++ initialization method (Algorithm 3) is based on metric distance information and sampling, and thus can be plugged into a K-medoids algorithm without further changes. Park and Jun [2009] also propose a special initialization (Algorithm 4) for the fast K-medoids algorithm. We implemented both of those initializations, as well as random selection, and empirically evaluated their impact on the similarity search task.

input : Set of points X and number of cluster centers k
output: list of cluster centers {c_1, ..., c_k}

Choose an initial cluster center c_1 at random from X;
while total number of cluster centers < k do
    choose the next center c_i, selecting c_i = x' ∈ X with probability D(x')² / Σ_{x∈X} D(x)²;
end

Algorithm 3: K-means++ procedure for initial cluster selection

²The exact solution is NP-hard.
input : Set of points X = {x_1, ..., x_n} and expected number of cluster centers k
output: list of cluster centers {c_1, ..., c_k}

Calculate the distance matrix {d_ij = d(x_i, x_j)}_{n×n};
for j ← 1 to n do
    v_j ← Σ_{i=1}^{n} ( d_ij / Σ_{l=1}^{n} d_il );
end
Sort {v_1, ..., v_n} in ascending order;
Select the k points with smallest v_j value as initial cluster centers;

Algorithm 4: Park and Jun procedure for initial cluster selection
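As a hedged sketch, the K-means++ seeding of Algorithm 3 can be written using only pairwise distances, so it plugs directly into a K-medoids clustering in a generic metric space; the function and argument names below are illustrative assumptions.

import random

def kmeanspp_init(points, k, dist, rng=random):
    """Sketch of the K-means++ seeding (Algorithm 3), expressed with pairwise
    distances only, so it can also seed K-medoids in a generic metric space."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x): distance from x to the closest center chosen so far.
        weights = [min(dist(x, c) for c in centers) ** 2 for x in points]
        # The next center is x' with probability D(x')^2 / sum_x D(x)^2.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers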
3.1.3 Cost models and Complexity Aspects
We will consider two models for query cost: a simpler one that accounts only for distance computations, and another that takes other costs into account. We must remember that there are two basic components of the query cost: the cost of finding the appropriate bucket (the internal cost) and the cost of sequentially processing and ranking the points in the bucket (the external cost). The basic assumption is that the number of points in a bucket can be approximated by its mean when we evaluate the average query cost.
Given a dataset $X$ of $n$ points and $k$ centroids $c_i$, each associated with a bucket $X_i$ (the buckets form a partition of the dataset, implying that $\bigcup_{i=1}^{k} X_i = X$ and $\bigcap_{i=1}^{k} X_i = \emptyset$), the average number of points in a bucket ($\bar{n}$) is given by:
$$\bar{n} = \frac{\sum_{i=1}^{k} |X_i|}{k} = \frac{n}{k}$$
For the first model, let us consider that the distance function evaluation is computationally much more expensive than any other computation in the algorithm, so the complexity is dominated by the number of distance function evaluations. We call this the Range Search Cost Model because it does not differentiate between range queries and k-nearest neighbor queries in terms of cost. The internal cost is the number of distance computations in the linear scan for the centroid nearest to the query, and is fixed at $k$. The external cost is the number of distance computations for the final ranking. In the average case we approximate this cost by the average number of points $\bar{n}$ in each bucket. So the final cost is given by:
$$RC(n, k) = \frac{n}{k} + k$$
Another possibility is to jointly analyze the final ranking cost, which does not depend solely on the number of distance computations. For that matter, we assume that any basic operation (mathematical operators, number comparisons, memory transfers) has unit cost and that a given distance computation has a constant cost $d > 1$. We call this the Nearest-Neighbors Cost Model. The final ranking operation has the complexity of a sorting algorithm, which is $O(\bar{n} \log \bar{n})$, over the average number of points in each bucket. Taking all of this into account, the final cost is:
$$NNC(n, k, d) = \frac{n}{k}\log\left(\frac{n}{k}\right) + d\frac{n}{k} + dk$$
Optimizing the parameters: Voronoi LSH has basically two parameters, the number of tables $L$ and the number of centroids $k$. For the sake of simplicity we will look only at the parameter $k$ (assuming a fixed $L$, given that this parameter has a fixed multiplicative positive impact on the query time complexity and a non-obvious multiplicative positive impact on the quality of the search).

Let us start with the Range Search Model; note that $1 \leq k \leq n$ and that in practical setups $k \ll n$.
$$\frac{\partial RC(n, k)}{\partial k} = -\frac{n}{k^2} + 1 = 0$$
So the optimal number of centroids $k^*$ is
$$k^* = \sqrt{n}$$
$$RC(n) = 2\sqrt{n}$$
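As a quick illustrative calculation (not part of the original derivation): for a dataset of $n = 10^6$ points, the optimum under this model is $k^* = \sqrt{10^6} = 1000$ centroids, giving an expected query cost of $RC = 2\sqrt{10^6} = 2000$ distance computations, against $10^6$ for a linear scan.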
For the Nearest-Neighbor Model we proceed in the same way, using the same assumptions:
$$\frac{\partial NNC(n, k, d)}{\partial k} = -\frac{n}{k^2}\left[\log\left(\frac{n}{k}\right) + 1\right] - \frac{dn}{k^2} + d = 0$$
$$\frac{n}{k^2}\left[\log\left(\frac{n}{k}\right) + 1 + d\right] = d$$
So the optimal number of centroids $k^*$ is implicitly given by
$$k^2 = \frac{n}{d}\left[\log\left(\frac{n}{k}\right) + 1 + d\right]$$
Knowing that $\log(n/k) > 0$ we conclude that
$$k^2 > \frac{n}{d}(d+1)$$
$$\frac{n^2}{k^2} < \frac{d}{d+1}n$$
$$\log\left(\frac{n}{k}\right) < \frac{1}{2}\log\left(\frac{d}{d+1}n\right)$$
$$\frac{n}{d}(d+1) < k^2 = \frac{n}{d}\left[\log\left(\frac{n}{k}\right) + 1 + d\right] < \frac{n}{d}\left[\frac{1}{2}\log\left(\frac{d}{d+1}n\right) + 1 + d\right]$$
$$\sqrt{n\left(\frac{1}{d}+1\right)} < k < \sqrt{n\left[\frac{1}{2d}\log\left(\frac{d}{d+1}n\right) + \frac{1}{d} + 1\right]}$$
$$d\sqrt{n\left(\frac{1}{d}+1\right)} < dk < d\sqrt{n\left[\frac{1}{2d}\log\left(\frac{d}{d+1}n\right) + \frac{1}{d} + 1\right]}$$
$$\sqrt{nd(d+1)} < dk < \sqrt{nd\left[\frac{1}{2}\log\left(\frac{d}{d+1}n\right) + d + 1\right]}$$
$$1\Big/\sqrt{n\left[\frac{1}{2d}\log\left(\frac{d}{d+1}n\right) + \frac{1}{d} + 1\right]} < \frac{1}{k} < 1\Big/\sqrt{n\left(\frac{1}{d}+1\right)}$$
$$\frac{n}{\sqrt{n\left[\frac{1}{2d}\log\left(\frac{d}{d+1}n\right) + \frac{1}{d} + 1\right]}} < \frac{n}{k} < \frac{n}{\sqrt{n\left(\frac{1}{d}+1\right)}}$$
$$\sqrt{\frac{dn}{\frac{1}{2}\log\left(\frac{d}{d+1}n\right) + d + 1}} < \frac{n}{k} < \sqrt{\frac{dn}{d+1}}$$
Knowing that the optimal $k$ must satisfy
$$kd = \frac{n}{k}\log\frac{n}{k} + d\frac{n}{k} + \frac{n}{k}$$
$$kd - \frac{n}{k} = \frac{n}{k}\log\frac{n}{k} + d\frac{n}{k}$$
we conclude that the optimal cost $NNC_{opt}(n, d) = NNC(n, k^*, d)$ may be simplified to
$$NNC_{opt}(n, d) = 2dk^* - \frac{n}{k^*}$$
Combining this expression with the bounds derived above for $k$ and $n/k$ (and noting that $2dk - n/k$ is increasing in $k$), it follows that
$$\sqrt{nd}\left[2\sqrt{d+1} - \frac{1}{\sqrt{d+1}}\right] < NNC_{opt}(n, d) < 2\sqrt{nd\left[\frac{1}{2}\log\left(\frac{d}{d+1}n\right) + d + 1\right]}$$
$$NNC_{opt}(n, d) = O\left(\sqrt{nd(\log(n) + d + 1)}\right) = O\left(d\sqrt{n\left(\frac{\log(n)+1}{d} + 1\right)}\right)$$
In the Nearest-Neighbor cost model the dimensionality information is built into the distance cost $d$: for instance, if we consider an $L_p$ distance in a $D$-dimensional Euclidean space, the distance cost $d$ may be approximated by the dimensionality $D$. It is interesting to note that, although we did not achieve a logarithmic query cost, we obtained a cost that is sub-linear in the number of points (and linear in the distance cost) and does not hide exponential costs in the dimensionality. Comparing with standard LSH in Euclidean space, which solves the $(c, r)$-approximate nearest-neighbors problem with $O(dn^{\rho})$ query complexity, where $d$ is the dimensionality (and the approximate cost of evaluating the distance function) and $\rho$ is a fractional power (depending on the approximation factor $c$) [Indyk and Motwani, 1998; Andoni and Indyk, 2006; Datar et al., 2004], we may conclude that the expected query complexity (under reasonable assumptions) is very close to that of Euclidean LSH. However, our approach can be used with a broader set of distance functions, without the need to embed the data in a Euclidean space, which could add complexity and distortions to the method.

Figure 3.2: Voronoi LSH: a toy example with points $C = \{c_1, c_2, c_3\}$ as centroids, query point $p$, relevant nearest neighbor $q$, irrelevant nearest neighbor $p'$, radii of search $r$ and $cr$, and random variables $Z_q$, $Z_p$ and $Z_{p'}$.
Regarding space complexity, the analysis is straightforward: the algorithm stores $L$ copies of a hash table with $k$ entries, each entry holding pointers to points of the dataset. The number of pointers is $Ln$ and the number of table entries is $Lk$, for a total of $L(n + k)$.
3.1.4 Hashing Probabilities Bounds
Now we proceed to analyze the hash collision probabilities and to verify the locality-sensitiveness of our proposed scheme. We consider the metric space $(X, d)$ and a collection of random points of size $k$, $C = \{c_1, \ldots, c_k\} \subset X$, inducing a random Voronoi partitioning of the dataset and distinct hash functions. Our hashing scheme consists of attributing an integer code $i$ to each point in the region of the space covered by some point $c_i \in C$ (defined by the subset of points closest to $c_i$: $X_i = \{x \in X \mid d(x, c_i) = \inf_{c \in C} d(x, c)\}$). Alternatively, given a point $p \in X$, the notation $NN_C(p)$ represents the nearest neighbor of $p$ in the random set $C$. Putting it all together, given the random set $C$, a point $p \in X$ is associated with the random variable $Z_p = d(p, NN_C(p)) = d(p, C)$, and if $NN_C(p) = c_i$ then $h_C(p) = i$. We want to analyze the probabilities of points being assigned the same hash value (attributed to the same Voronoi region) by looking at the distance between the points, for the different possible hash functions.
Because we are dealing with generic metric spaces without any specific structural information, we shall resort to the tools and language of probability measure spaces. Our probability measure space is $(\Omega, F, Pr)$, where $\Omega = \{d(x, NN_C(x)) \mid x \in X, C \subseteq X\} \subseteq [0, M]$ is the sample space, consisting of all possible distances from points in $X$ to subsets of $X$; $F$ is a Borel $\sigma$-algebra over $\Omega$ (a collection of subsets of $\Omega$ closed under the operations of complement and disjoint union), which constitutes the event space; and $Pr$ is a probability measure over the $\sigma$-algebra, which assigns a probability to each event. As an abuse of notation, we shall use inequalities (or other relations) with the random variables in $\Omega$ to represent elements of $F$: for example, $Pr[Z < r]$, with $Z \in \Omega$, is the probability measure associated with the event (a set $A \in F$) where the relation is true. Similarly, we shall use a comprehension-style set notation with inequalities (or other relations) and random variables as a representation of the events (sets in $F$) where that relation is true: for example, $\{Z < r\}$ represents the event where a random variable $Z$ is less than a fixed value $r$. Taking both notations into consideration, and using the same example, $Pr[Z < r]$ and $Pr[\{Z < r\}]$ represent the same probability, which is the measure of the event $\{Z < r\} \in F$.
The analysis will be developed within the framework of the $(R, c)$-Nearest Neighbors problem and the random variables associated with the distance distribution. We will demonstrate that it is possible to upper- and lower-bound the hashing probabilities using distance distribution information and that, under certain conditions, the scheme may be shown to be locality-sensitive.
Theorem 3.1. The family of hash functions $\{h_C \mid h_C : X \to \mathbb{Z}^{+}$, with $C \subset X$ and $|C| = k\}$ over a metric space $(X, d)$, mapping a point to the index of the corresponding Voronoi region induced by $C$, is locality-sensitive for every pair of points $p, q \in X$ and some fixed scalars $r > 0$, $c > 3$. In other words:

If $d(p, q) \leq r$ then $Pr[h_C(q) = h_C(p)] \geq p_1 = Pr[|Z_q - Z_p| \geq r]$ (probability of colliding within the ball of radius $r$),

If $d(p, q) > cr$ then $Pr[h_C(q) = h_C(p)] \leq p_2 = Pr[Z_q + Z_p \geq cr]$ (probability of colliding outside the ball of radius $cr$),

$p_1 > p_2$, which is equivalent to $Pr[|Z_q - Z_p| \geq r] > Pr[Z_q + Z_p \geq cr]$.
Figure 3.3: Closed balls centered at points $q$ and $p$, illustrating the case of events with $Z_q + Z_p \leq d(p, q)$, which implies that $NN_C(p) \neq NN_C(q)$.
Proof. Let us take two points, $p$ and $q$, and look at the values of the random variables representing the distance from a point to its nearest neighbor among the set $C$, $Z_p$ and $Z_q$, and how their outcomes might affect $h_C(p)$ and $h_C(q)$. We shall analyze the hashing probabilities in the cases $d(p, q) \leq r$ and $d(p, q) > cr$, and finally show how those probabilities are related.

Considering the case where $d(p, q) \leq r$, by the triangle inequality we have:
$$d(p, NN_C(q)) \leq d(p, q) + d(q, NN_C(q)) = d(p, q) + Z_q \leq r + Z_q$$
$$d(q, NN_C(p)) \leq d(p, q) + d(p, NN_C(p)) = d(p, q) + Z_p \leq r + Z_p$$
The probability of $h_C(p) \neq h_C(q)$ is modeled as:
$$Pr[h_C(p) \neq h_C(q)] = Pr[\{d(q, NN_C(q)) < d(q, NN_C(p))\}, \{d(p, NN_C(p)) < d(p, NN_C(q))\}]$$
$$= Pr[Z_q < d(q, NN_C(p)),\; Z_p < d(p, NN_C(q))]$$
$$\leq Pr[Z_q < r + Z_p,\; Z_p < r + Z_q] = Pr[|Z_q - Z_p| < r]$$
$$Pr[h_C(p) \neq h_C(q)] \leq Pr[|Z_q - Z_p| < r]$$
Since $Pr[h_C(p) = h_C(q)] = 1 - Pr[h_C(p) \neq h_C(q)]$ we finally obtain:
$$Pr[h_C(p) = h_C(q)] \geq Pr[|Z_q - Z_p| \geq r] = p_1$$
Consider the case where $d(p, q) > cr$. We define the events of interest $\xi_0 = \{Z_q + Z_p < cr\}$, $\xi_1 = \{Z_q + Z_p < d(p, q)\}$ and $\xi_2 = \{Z_q < d(q, NN_C(p))\} \cap \{Z_p < d(p, NN_C(q))\}$.
From the definitions and the hypothesis that $cr < d(p, q)$:
$$Z_p + Z_q < cr \Rightarrow Z_p + Z_q < d(p, q)$$
$$\xi_0 \subseteq \xi_1$$
Applying the triangle inequality, in conjunction with $Z_p + Z_q < d(p, q)$, we obtain:
$$Z_q + Z_p < d(p, q) \leq d(p, NN_C(p)) + d(q, NN_C(p)) = Z_p + d(q, NN_C(p))$$
$$Z_q + Z_p < d(p, q) \leq d(q, NN_C(q)) + d(p, NN_C(q)) = Z_q + d(p, NN_C(q))$$
$$\Rightarrow Z_q < d(q, NN_C(p))$$
$$\Rightarrow Z_p < d(p, NN_C(q))$$
$$\xi_1 \subseteq \xi_2$$
$$\xi_0 \subseteq \xi_1 \subseteq \xi_2$$
Knowing that the event $\xi_2$ represents the situation where $h_C(p) \neq h_C(q)$, we obtain a lower bound for the hashing probabilities:
$$Pr[h_C(p) \neq h_C(q)] = Pr[\xi_2] \geq Pr[\xi_0] = Pr[Z_q + Z_p < cr]$$
$$Pr[h_C(p) = h_C(q)] \leq Pr[Z_q + Z_p \geq cr] = p_2$$
To show that $p_1 > p_2$ we have only a proof sketch that works under strict conditions. Start with the complement event $\xi_0^{C} = \{Z_q + Z_p \geq cr\}$. Notice that $cr > r$ and assume that $Z_q < \delta r$ (for some $\delta > 1$), meaning that the search range $r$ cannot be less than a fixed factor of the distance to the closest cluster center (a limiting condition that does not hold in general). Putting it all together:
$$Z_q + Z_p \geq cr$$
$$Z_q + Z_p - 2Z_q \geq cr - 2Z_q$$
Since $Z_q < \delta r$, we know that $-Z_q > -\delta r$
$$Z_q + Z_p - 2Z_q = Z_p - Z_q \geq (c - 2\delta) r$$
Choosing $c > 2\delta + 1$ we obtain
$$Z_p - Z_q \geq r$$
$$\Rightarrow \{Z_q + Z_p \geq cr\} \subseteq \{|Z_q - Z_p| \geq r\}$$
$$\Rightarrow p_2 \leq p_1$$
In order to finish the proof sketch, consider the hypothetical case where $Z_q = r - \epsilon$ and $Z_p = 2\delta r$, for an arbitrary $\epsilon > 0$ (and maintaining the previous assumptions on $\delta$ and $c$). Taking $Z_p - Z_q$, we obtain $2\delta r - r + \epsilon = (2\delta - 1)r + \epsilon$, meaning that $|Z_q - Z_p| > r$ ($Z_p$ and $Z_q$ are elements of the event $\{|Z_q - Z_p| \geq r\}$). Computing $Z_p + Z_q = 2\delta r + r - \epsilon$, we obtain $(2\delta + 1)r - \epsilon$, implying that $Z_q + Z_p < (2\delta + 1)r$; since $(2\delta + 1) < c$ (an assumption from the previous step of the proof), we obtain $Z_q + Z_p < cr$, and conclude that $\{Z_q + Z_p \geq cr\}$ is a proper subset of $\{|Z_q - Z_p| \geq r\}$ under the conditions we are analyzing.
$$p_2 < p_1$$
This is a generic result, depending on some assumptions about the metric structure of the space and on a somewhat reasonable workload for the $(r, c)$-ANN problem, but it does not elucidate the relationship between the choice of parameters (namely $k$ and $L$) and the probabilistic guarantees of the method. Further development along this line of argument is possible by observing that the distributions of the random variables $Z_q$ and $Z_p$ may be understood as an order statistic over the distribution of distances in the dataset. Indeed, if we define the (unordered) sequence of random variables $Z_q(1), \cdots, Z_q(k)$ as the distances from $q \in X$ to each point of $\{c_1, \cdots, c_k\}$ from a random sample $C \subseteq X$ ($|C| = k$), the random variable $Z_q$ corresponds to the order statistic $\min(Z_q(1), \cdots, Z_q(k))$. We may compute the cumulative distribution function (cdf) $F_{min}$ of the minimum order statistic using the hypothesis that the random variables $Z_q(1), \cdots, Z_q(k)$ are i.i.d. (not an unrealistic one, given that the sample $C$ is randomly chosen from the original dataset) and are distributed according to a cdf $F$:
$$F_{min}(x) = P(Z_q \leq x) = P(\min(Z_q(1), \cdots, Z_q(k)) \leq x) = 1 - P(\min(Z_q(1), \cdots, Z_q(k)) > x)$$
$$= 1 - P(Z_q(1) > x, \cdots, Z_q(k) > x) = 1 - \prod_{i=1}^{k} P(Z_q(i) > x) = 1 - \prod_{i=1}^{k} (1 - F(x))$$
$$F_{min}(x) = 1 - (1 - F(x))^{k}$$
The interesting fact about this relation is that, somehow, without using the "concatenation" trick (building a composite LSH hash by concatenating multiple LSH functions, which implies multiplying the hashing probabilities), we achieve a similar exponential behavior for the nearest-neighbor distance probability derived from the distance distribution.
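As an illustrative calculation (not from the original text): if $F(x) = 0.01$ for some radius $x$, then with $k = 100$ random seeds $F_{min}(x) = 1 - 0.99^{100} \approx 0.63$, so the nearest seed lies within distance $x$ of the query with probability of roughly 63%, even though each individual seed does so with probability only 1%.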
3.2 VoronoiPlex LSH
VoronoiLSH was our first attempt to develop an LSH family for general metric spaces, seeking a good trade-off between effectiveness, efficiency and generality. We took a very simple design and, as presented in the theoretical characterization of the method, despite the minimalistic approach (in relation to other methods that directly apply various metric space properties for filtering and pruning), the method achieves competitive average query time performance and can be formally described in the locality-sensitive framework; that is, there are provable probabilistic bounds for the approximate nearest-neighbors search. However, given that the only parameter available for tuning the metric by constraining the size of the partitions is the number of cluster centers, and that the relation between the number of cluster centers and the method's potential for selecting the most relevant nearest neighbors is not trivial, an extension of the method was developed, taking advantage of the locality-sensitiveness of the hashing family in the same fashion as Indyk and Motwani [Indyk and Motwani, 1998]: boosting the probabilities of "good" hash collisions more than the "bad" ones, while minimizing the number of distance computations needed for multiple VoronoiLSH tables.

Figure 3.4: VoronoiPlex LSH: a toy example with points $C = \{c_1, c_2, c_3, c_4, c_5\}$ as centroids shared between partitionings, choosing three centroids for each partitioning and concatenating four partitionings in the final hash value.
3.2.1 Basic Intuition
The basic idea is to boost the probability of finding true nearest neighbors in the same bin more than the probability of finding true nearest neighbors in distinct bins. This objective may be achieved by applying multiple random partitionings of the dataset (in our framework, multiple VoronoiLSH) and hashing in such a way that only points that share the same nearest neighbor in all partitionings share the same hash value. We would expect that, under these random perturbations of the partitioning, most points that share all the same partitions are very similar, and that most falsely similar points are filtered out. An alternative view of this approach is to see it as a two-level partitioning: the first level is a direct Voronoi partitioning using random points as centroids, and this operation is repeated many times; the second level is a single partitioning built by combining all the multiple Voronoi partitionings in such a way that points sharing the same code in all of them are hashed together.

We have four control parameters in this approach: the number of hash tables ($L$), the number of centroids for each partitioning ($p$), the number of centroids shared between partitionings ($k$), and the number of partitionings that are later combined ($w$). The number of hash tables has a positive impact on the overall quality of the approximate nearest-neighbor search; however, in isolation it does not improve the selectivity of the method, increasing the number of false positives. The other three parameters have a non-trivial impact on the selectivity and quality of the search that we still need to investigate further. In the following sections, we will present a brief theoretical characterization of this impact. Figure 3.4 presents an example with five shared centroids ($k = 5$), four partitionings concatenated ($w = 4$) and three centroids in each partitioning ($p = 3$).
3.2.2 Algorithms
The algorithms for indexing and kNN searching are essentially the same as described in Algorithm 1 and Algorithm 2, with the exception of the hash function. Hence, we will present the algorithms for the construction of the hash function and for the application of the hash to a point in the metric space. We will assume that functions/methods are first-class objects that can be returned from a method, encapsulating data and computation.

Algorithm 5 performs a multiple selection of indexes and stores the selected indexes in a data structure that represents the hash function. The selected indexes are used as sub-samples of a given sample of the dataset, and each group of selected indexes forms a set of centroid points that spans a Voronoi partitioning of the space. Those $w$ partitionings form a product space of hash vectors of size $w$. The intuitive idea is to build upon the VoronoiLSH intuition, using the compact nearest-center relation as a hash function and, with a bounded number of distance computations, construct a product space of hash functions. The idea is to boost the single-hash probabilities with a controlled number of distance computations.
input : Integer size k of the sample C = {c_1, ..., c_k} ⊂ X, integer number of distinct partitionings w, and integer number of centroids for each partitioning p < k
output: A hash function object h_{k,w,p} with two arrays indicating the random choices of centroids made by the method

selected ← new binary array of size k;
subsample ← new integer multi-array of size w × p;
for j ← 1 to w do
    Randomly sample S = {s_1, ..., s_p} from {1, ..., k};
    for i ← 1 to p do
        subsample[j, i] ← s_i;
        selected[s_i] ← 1;
    end
end
h_{k,w,p} ← (selected, subsample);

Algorithm 5: Hash function building
input : Hash function object h_{k,w,p}, sample C = {c_1, ..., c_k} ⊂ X (|C| = k) and a point q ∈ X
output: Integer value h_{k,w,p}(q)

(selected, subsample) ← retrieved from h_{k,w,p};
distances ← new floating-point array of size k;
for j ← 1 to k do
    if selected[j] == 1 then
        distances[j] ← d(q, c_j);
    end
end
hasharray ← new integer array of size w;
for i ← 1 to w do
    hasharray[i] ← the element of subsample[i] that minimizes distances[j] (varying j);
end
h_{k,w,p}(q) ← hash(hasharray);

Algorithm 6: Hash function application
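A minimal Python sketch of Algorithms 5 and 6 is given below; the function names are illustrative assumptions, and the final combination of the per-partitioning codes simply reuses Python's built-in tuple hashing.

import random

def build_voronoiplex_hash(k, w, p, rng=random):
    """Sketch of Algorithm 5: for each of the w partitionings, pick p
    centroid indexes out of the k shared sample points."""
    subsample = [rng.sample(range(k), p) for _ in range(w)]
    selected = set(i for group in subsample for i in group)
    return selected, subsample

def apply_voronoiplex_hash(hash_obj, sample, dist, q):
    """Sketch of Algorithm 6: compute distances only to the selected
    centroids, take the nearest centroid per partitioning, and combine
    the w codes into a single hash value."""
    selected, subsample = hash_obj
    distances = {i: dist(q, sample[i]) for i in selected}
    codes = tuple(min(group, key=lambda i: distances[i]) for group in subsample)
    return hash(codes)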
3.2.3 Theoretical characterization
First, we must note that the hash function building procedure (Algorithm 5) is blind to the actual data: it is a random selection of indexes into the arrays that store the sampled points. So, at first sight, we can see that it could be precomputed, or at least re-used in various distinct settings of the algorithm (for example, in a distributed-memory system, there could be strong re-utilization of the hash functions). The algorithm consists of two "for" loops and a simple random sampling from a set of $k$ elements, resulting in a time complexity of $O(w(p + k))$ and a space complexity of $O(k + wp)$ (with size $k$ of the sample $C = \{c_1, \ldots, c_k\} \subset X$, integer number of distinct partitionings $w$, and integer number of centroids per partitioning $p < k$).

For the time complexity we continue to focus on the expensive computational bottleneck: distance computations. The discussion will be focused on the expected number of positions in the selected array that are set to one, $\mathbb{E}_{i=1,\ldots,k}[\text{selected}[i] = 1]$.
Proposition 3.1. After $w$ rounds of the outer "for" loop in Algorithm 5, the expected number of positions of the array selected set to one is $\mathbb{E}_{i=1,\ldots,k}[\text{selected}[i] = 1] = k - k\left(1 - \frac{p}{k}\right)^{w}$.
Proof. Start with the probability of a specific position being assigned 1 in a single iteration of the for loop. There are $\binom{k}{p}$ possible ways of sampling $p$ values from $k$. There are $\binom{k-1}{p}$ ways of sampling $p$ values from $k$ excluding some specific element of the set of size $k$. So, the number of ways of sampling $p$ values without assigning selected[i] for one of them is $\binom{k-1}{p}$. Putting it all together:
$$Pr_{\text{single iteration}}[\text{selected}[i] = 1] = 1 - \frac{\binom{k-1}{p}}{\binom{k}{p}} = \frac{p}{k}$$
After $w$ iterations, the probability of position $i$ not being assigned 1 is the product of the probabilities of not being assigned at each iteration:
$$Pr_{\text{after } w \text{ iterations}}[\text{selected}[i] \neq 1] = \prod_{n=1}^{w} Pr_{\text{single iteration}}[\text{selected}[i] \neq 1]$$
$$Pr_{\text{after } w \text{ iterations}}[\text{selected}[i] \neq 1] = \left(1 - \frac{p}{k}\right)^{w}$$
$$Pr_{\text{after } w \text{ iterations}}[\text{selected}[i] = 1] = 1 - \left(1 - \frac{p}{k}\right)^{w}$$
Using the probability of being assigned after $w$ iterations of the for loop, we calculate the final expectation of the number of elements in the array selected that are assigned 1:
$$\mathbb{E}_{i=1,\ldots,k}[\text{selected}[i] = 1] = k \cdot Pr_{\text{after } w \text{ iterations}}[\text{selected}[i] = 1]$$
$$\mathbb{E}_{i=1,\ldots,k}[\text{selected}[i] = 1] = k - k\left(1 - \frac{p}{k}\right)^{w}$$
This result is important because it helps us understand how the number of distance computations (and hence the internal cost) of the hashing algorithm is bounded in terms of the parameters with which the hash function is built, taking into account the randomness of the building algorithm. In general, the internal cost, counting only distance computations, is $O(k - k(1 - p/k)^{w})$, and this factor is closely related to the effectiveness of the search.
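As an illustrative calculation (not from the original text): with $k = 100$ shared centroids, $p = 10$ centroids per partitioning and $w = 4$ partitionings, the expected number of distinct centroids actually probed is $100 - 100(1 - 0.1)^{4} \approx 34.4$, so computing one VoronoiPlex hash costs about 34 distance computations instead of the $wp = 40$ that a naive concatenation of independent Voronoi hashes would require.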
3.3 Parallel VoronoiLSH
The design and adaptation of the original sequential algorithm for a distributed and parallel framework were developed through collaborative work of the author, Prof. Eduardo Valle, Prof. George Teodoro and Thiago Teixeira. George Teodoro and Thiago Teixeira specialize in distributed systems and developed a substantial part of the parallelization, especially the implementation using the dataflow programming paradigm and the performance tests on a distributed cluster. This collaboration resulted in a conference paper accepted at SISAP 2014.

The parallelization strategy we employ is based on the dataflow programming paradigm [Arpaci-Dusseau et al., 1999; Beynon et al., 2001; Teodoro et al., 2008]. Dataflow applications are typically
represented as a set of computing stages, which are connected to each other using directed streams.
Our parallelization decomposes Voronoi LSH into five computing stages organized into two
conceptual pipelines, which execute the index building and the search phases of the application.
All stages may be replicated in the computing environment to create as many copies as necessary.
Additionally, the streams connecting the application stages implement a special type of commu-
nication policy referred here as labeled-stream. Messages sent through a labeled-stream have an
associated label or tag, which provides an affordable scheme to map message tags to specific copies
of the receiver stage in a stream. We rely on this communication policy to partition the input dataset
and to perform parallel reduction of partial results computed during a query execution. The data
communication streams and process management are built on top of the Message Passing Interface (MPI).
The index building phase of the application, which includes the Input Reader (IR), Bucket Index (BI), and Data Points (DP) stages, is responsible for reading input data objects and building the distributed LSH indexes that are managed by the BI and the DP stages. In this phase, the input data objects are read in parallel using multiple IR stage copies and are sent (1) to be stored in the DP stage (message i) and (2) to be indexed by the BI stage (message ii). First, each object read is mapped to a specific DP copy, meaning that there is no replication of input data objects. The mapping of objects to DPs is carried out using the data distribution function obj_map (the labeled-stream mapping function), which calculates the specific copy of the DP stage that should store an object as it is sent through the stream connecting IR and DP. Further, the pair <object identifier, DP copy in which it is stored> is sent to every BI copy holding buckets into which the object was hashed. The distribution of buckets among BI stage copies is carried out using another mapping function, bucket_map, which is calculated based on the bucket value/key. Again, there is no replication of buckets among BIs and each bucket value is stored in a single BI copy. The obj_map and bucket_map functions used in our implementation are modulo operations based on the number of copies of the receiver in a stream. We plan to evaluate other hashing strategies in the future.
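For illustration, a minimal sketch of such modulo-based mapping functions is shown below; the names mirror the stage mapping functions described above, but the code itself is an assumption for exposition (the actual implementation runs on top of MPI).

def obj_map(obj_id, num_dp_copies):
    """Map an object identifier to the Data Points (DP) copy that stores it."""
    return obj_id % num_dp_copies

def bucket_map(bucket_key, num_bi_copies):
    """Map a bucket key to the Bucket Index (BI) copy that holds it."""
    return hash(bucket_key) % num_bi_copies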
The index construction is very compute-intensive, and involves many distance calculations
between the input data objects and the Voronoi seeds. For Euclidean data, we implemented a
vectorized code using Intel SSE/AVX intrinsics to take advantage of the wide SIMD (Single
Instruction, Multiple Data) instructions of current processors. Preliminary measurements have
shown that the use of SIMD instructions speeds up the index building by a factor of 8.
The search phase of the parallel LSH uses four stages, two of them shared with the index
building phase: Query Receiver (QR), Bucket Index (BI), Data Points (DP), and Aggregator (AG).
The QR stage reads the query objects and calculates the bucket values into which the query is hashed for the $L$ hash tables. Each bucket value computed for a query is mapped to a BI copy
using the bucket map function. The query is then sent to those BI stage copies that store at least
one bucket of interest (message iii). Each BI copy that receives a query message visits the buckets
of interest, retrieves the identifiers of the objects stored in those buckets, aggregates all object
identifiers to be sent to the same DP copy (list(obj id)), and sends a single message to each DP
stage that stores at least one of the retrieved objects (message iv). For each message received by a
DP copy, it calculates the distance from the query to the objects of interest, selects the k-nearest