Pivot Selection Strategies for
Permutation-Based Similarity Search
Giuseppe Amato, Andrea Esuli, and Fabrizio Falchi
Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo”,
via G. Moruzzi 1, Pisa 56124, Italy
{firstname}.{lastname}@isti.cnr.it
Abstract. Recently, permutation-based indexes have attracted interest in the area of similarity search. The basic idea of permutation-based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or reference objects). Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query. This, of course, assumes that similar objects are represented by similar permutations of the pivots.
In the context of permutation-based indexing, most authors propose to select pivots randomly from the data set, given that traditional pivot selection strategies do not reveal better performance. However, to the best of our knowledge, no rigorous comparison has been performed yet. In this paper we compare five pivot selection strategies on three permutation-based similarity access methods. Among those, we propose a novel strategy specifically designed for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested strategies. Second, there is no strategy that is universally the best for all permutation-based access methods; rather, different strategies are optimal for different methods.
Keywords: permutation-based, pivot, metric space, similarity search, inverted files, content-based image retrieval
1 Introduction
Given a set of objects C from a domain D, a distance function d : D × D → R, and a query object q ∈ D, a similarity search problem can be generally defined as the problem of finding a subset S ⊆ C of the objects that are closest to q with respect to d. Specific formulations of the problem can, for example, require finding the k closest objects (k-nearest neighbors search, k-NN).
Permutation-based indexes have been proposed as a new approach to approximate similarity search [1, 8, 11, 20]. In permutation-based indexes, data objects and queries are represented as appropriate permutations of a set of pivots P = {p_1, …, p_n} ⊂ D. Formally, every object o ∈ D is associated with a permutation Π_o that lists the identifiers of the pivots by their closeness to o, i.e., for all j ∈ {1, 2, …, n−1}, d(o, p_{Π_o(j)}) ≤ d(o, p_{Π_o(j+1)}), where p_{Π_o(j)} indicates the pivot at position j in the permutation associated with object o. For convenience, we denote the position of a pivot p_i in the permutation of an object o ∈ D as Π_o^{-1}(i), so that Π_o(Π_o^{-1}(i)) = i.
The similarity between objects is approximated by comparing their representations in terms of permutations. The basic intuition is that if the permutations relative to two objects are similar, i.e., the two objects see the pivots in a similar order of distance, then the two objects are likely to be similar also with respect to the original distance function d.
Once the set of pivots P is defined, it must be kept fixed for all the indexed objects and queries, because permutations deriving from different sets of pivots are not comparable. The selection of a "good" set of pivots is thus an important step in the indexing process, where the "goodness" of the set is measured by the effectiveness and efficiency of the resulting index structure at search time.
The paper is structured as follows. In Section 2 we discuss related work. Section 3 presents the strategies being compared. The tested similarity search access methods are presented in Section 4. Section 5 describes the experiments and comments on their results. Conclusion and future work are given in Section 6.
2 Related Work
The study of pivot selection strategies for access methods usually classified as pivot-based [25] has been an active research topic in the field of similarity search in metric spaces since the nineties. Most access methods make use of pivots to reduce the set of data objects accessed during similarity query execution. In an early work by Shapiro [23], it was noticed that good performance was obtained by locating pivots far away from data clusters. Following this intuition, several heuristics were proposed in [24, 18, 5] to select pivots among the outliers and far away from each other. Pivot selection techniques that maximize the mean of the distance distribution in the pivoted space were exploited in [7]. It was also argued that while good pivots are usually outliers, the reverse is not true. In [22, 6], the problem of dynamically selecting pivots as the database grows is addressed. In [17], Principal Component Analysis (PCA) has been proposed for pivot selection: the principal components (PC) of the dataset are identified by applying PCA to it (actually to a subset, to make the method computationally feasible), and the objects in the dataset that are best aligned with the PC vectors are selected as pivots.
Works that use permutation-based indexing techniques have mostly performed a random selection of pivots [1, 8, 11], following the observation that the role of pivots in permutation-based indexes appears to be substantially different from the one they have in traditional pivot-based access methods, and also because the use of previous selection strategies did not reveal significant advantages. To the best of our knowledge, the only report on the definition of a specific selection technique for permutation-based indexing is in [8], where it was mentioned that no significant improvement, with respect to random selection, was obtained by maximizing or minimizing the Spearman Rho distance through a greedy algorithm.
3 Pivot Selection Strategies
Permutation-based access methods use pivots to build the permutations that represent data objects. This paper compares four promising selection strategies, used in combination with different permutation-based indexes, to make a comprehensive evaluation and to identify the specific features that can be exploited in the various cases.
As the baseline we tested the random (rnd) strategy, which samples pivots from the dataset following a uniform probability distribution.
3.1 Farthest-First Traversal (FFT)
A very well known topic in metric spaces is the NP-hard k-center problem, which asks: given a set of objects C and an integer k, find a subset P of k objects in C that minimizes the largest distance of any object in C from its closest object in P. FFT (so called by Dasgupta in [9]) finds a solution close to optimal by selecting an arbitrary object p_1 ∈ C and choosing, at each subsequent iteration, the object p_i ∈ C for which min_{1≤j<i} d(p_i, p_j) is maximum. In [14], it has been proved that FFT achieves an approximation of at most a factor of 2 with respect to the optimal solution. Note that FFT actually tries to maximize the minimum distance between the pivots, which intuitively could be a desirable property of the resulting pivot set. The computational cost of this algorithm is O(n|C|), where n is the number of requested pivots.
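As an illustration, the greedy selection above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the experiments; `objects` is any list and `dist` any metric on its elements:

```python
def fft_select(objects, dist, n):
    """Farthest-First Traversal: greedily pick n pivots, each maximizing
    the minimum distance to the pivots selected so far."""
    pivots = [objects[0]]                       # arbitrary first pivot p1
    # min_d[j] = distance of objects[j] to its closest selected pivot
    min_d = [dist(o, pivots[0]) for o in objects]
    while len(pivots) < n:
        j = max(range(len(objects)), key=min_d.__getitem__)
        pivots.append(objects[j])
        min_d = [min(m, dist(o, objects[j])) for m, o in zip(min_d, objects)]
    return pivots
```

Each iteration costs O(|C|) distance computations, which gives the O(n|C|) total cost mentioned above.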
3.2 k-medoids (kMED)
Originally proposed in [15], k-medoids is a partitional clustering algorithm that tries to minimize the average distance between objects and the selected cluster medoids. k-medoids is very similar to k-means; the difference is that it uses objects from the dataset as representatives of the cluster centers rather than computing centroids, which may not be possible in general metric spaces. Moreover, k-medoids is more robust to noise and outliers because it minimizes distances rather than their squares. While FFT minimizes the largest distance of an object from its closest pivot, k-medoids minimizes the average distance of the objects from their closest pivot.
3.3 Pivoted Space Incremental Selection (PSIS)
In [7], several strategies for selecting pivots were proposed and tested, considering the average distance of the transformed space obtained by leveraging the set of selected pivots and the triangle inequality of the original metric space [25]. The presented algorithms try to maximize the average distance in the pivoted space, defined as:

D_P(x, y) = max_{1≤i≤n} |d(x, p_i) − d(y, p_i)|.

The goal of the proposed algorithm is to have good lower bounds D_P for the original distance d. Bustos et al. observed that the chosen pivots are outliers, but that not all outliers are good pivots for maximizing the average D_P. The overall best among the proposed methods is the incremental selection technique, which greedily selects the first and subsequent pivots maximizing D_P on a set of pairs of objects in C.
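The pivoted distance and the incremental selection can be sketched as follows. This is our reading of the technique in [7], not the original code; the helper names are illustrative:

```python
def d_p(x, y, pivots, dist):
    """Pivoted-space distance: D_P(x, y) = max_i |d(x, p_i) - d(y, p_i)|,
    a lower bound on d(x, y) by the triangle inequality."""
    return max(abs(dist(x, p) - dist(y, p)) for p in pivots)

def psis_select(candidates, pairs, dist, n):
    """Incremental selection: greedily add the candidate pivot that
    maximizes the average D_P over a fixed sample of object pairs."""
    pivots, remaining = [], list(candidates)
    # best[k] = D_P of the k-th pair under the pivots selected so far
    best = [0.0] * len(pairs)
    for _ in range(n):
        def avg_dp(p):
            return sum(max(b, abs(dist(x, p) - dist(y, p)))
                       for b, (x, y) in zip(best, pairs)) / len(pairs)
        p_star = max(remaining, key=avg_dp)
        remaining.remove(p_star)
        pivots.append(p_star)
        best = [max(b, abs(dist(x, p_star) - dist(y, p_star)))
                for b, (x, y) in zip(best, pairs)]
    return pivots
```

Note that only the sampled pairs, not the whole dataset, are touched at each step, which is what keeps the selection feasible on large collections.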
3.4 Balancing Pivot-Position occurences (BPP)
While the other pivot selection approaches mainly originate from the literature on similarity search in metric spaces and clustering, in this section we propose an algorithm specifically intended for permutation-based access methods. The intuition is that each pivot should appear uniformly across the various positions in the permutations. In fact, if a pivot p_i ∈ P appears in the same position in all the permutations, that pivot is useless.
Let c(p_i, j) = |{Π_o : Π_o^{-1}(i) = j}| be the number of permutations in which p_i appears at position j. The mean value of c(p_i, j), 1 ≤ j ≤ n, is independent of the specific set of pivots and is always equal to |C|/n. BPP tries to minimize the deviation of the c(p_i, j) values from their mean. The algorithm starts by randomly selecting a set P ⊂ C of n̂ > n candidate pivots and evaluating the permutations for all the objects o ∈ C (or a subset S ⊂ C). At each iteration, the algorithm evaluates the effect of removing each p_i ∈ P (or a fixed number t of candidate pivots) on the distribution of c(p_i, j) and removes the pivot for which the minimum average standard deviation is obtained. The algorithm ends when the number of candidate pivots satisfies the request, i.e., |P| = n.
In [1] it was observed that the first pivots in the permutations, i.e., those nearest to the object, are the most relevant. Thus, in our experiments, we applied this general algorithm considering c(p_i, j) for 1 ≤ j ≤ l, where l is the actual length of the permutations we are considering. The complexity of the algorithm is thus O(n̂|S|) for the initialization using the distance d, and O(tn̂²|S|) for the iterative selection, where the cost is dominated by evaluating each candidate pivot's occurrences in the permutations.
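The balance criterion driving BPP can be sketched as follows (a simplified illustration with hypothetical helper names; the full algorithm additionally recomputes the counts after each candidate removal):

```python
from statistics import pstdev

def position_counts(perms, n_pivots, l):
    """c(p_i, j): how many permutations place pivot i at position j,
    restricted to the first l positions as in the text."""
    c = [[0] * l for _ in range(n_pivots)]
    for perm in perms:                  # perm[j] = pivot id at position j
        for j, i in enumerate(perm[:l]):
            c[i][j] += 1
    return c

def imbalance(c):
    """Average standard deviation of the per-pivot position counts; BPP
    removes, at each iteration, the candidate whose removal minimizes it."""
    return sum(pstdev(row) for row in c) / len(c)
```

A perfectly balanced candidate set has imbalance 0; a pivot that always occupies the same position drives the measure up, matching the intuition that such a pivot carries no discriminative information.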
4 Similarity Access Methods
We have compared the pivot selection strategies on three permutation-based index structures that reasonably cover the various approaches adopted in the literature by methods based on permutations.
4.1 Permutations Spearman Rho (PSR)
The idea of predicting the closeness between elements by comparing the way they "see" a set of pivots was originally proposed in [8]. As distances between permutations, Spearman Rho, Kendall Tau and Spearman Footrule [12] were tested, and Spearman Rho revealed the best performance. Given two permutations Π_x and Π_y, Spearman Rho is defined as:

S_ρ(Π_x, Π_y) = sqrt( Σ_{1≤i≤n} (Π_x^{-1}(i) − Π_y^{-1}(i))² )
When a k-NN search is performed, a candidate set of results of size k′ > k is retrieved considering the similarity of the permutations based on S_ρ (in our experiments we fixed k′ = 10k). This set is then reordered considering the original distance d. In [8], optimal incremental sorting [21] was used to improve efficiency when the size of the candidate set to be retrieved using the Spearman Rho is not known in advance. In this work we just perform a linear scan of the permutations, defining the size of the candidate set in advance.
As already mentioned, the most relevant information of the permutation Π_o is in the very first, i.e., nearest, pivots. Thus, we decided to also test truncated permutations. In this case we used the Spearman Rho distance with location parameter, S_{ρ,l}, defined in [12], which is intended for the comparison of top-l lists. S_{ρ,l} differs from S_ρ in the use of an inverted truncated permutation Π̃_o^{-1} that assumes all pivots further than p_{Π_o(l)} from o to be at position l+1. Formally, Π̃_o^{-1}(i) = Π_o^{-1}(i) if Π_o^{-1}(i) ≤ l, and Π̃_o^{-1}(i) = l+1 otherwise.
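Both distances follow directly from the definitions above. A minimal Python sketch, assuming pivot identifiers 0..n−1 and 1-based positions:

```python
import math

def inverse(perm):
    """Pi^{-1}: maps each pivot id to its (1-based) position in perm."""
    pos = [0] * len(perm)
    for j, i in enumerate(perm, start=1):
        pos[i] = j
    return pos

def spearman_rho(perm_x, perm_y):
    """S_rho: Euclidean distance between the inverted permutations."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(inverse(perm_x), inverse(perm_y))))

def spearman_rho_l(perm_x, perm_y, l):
    """Location-parameter variant: pivots beyond position l count as l+1."""
    def inv_trunc(perm):
        pos = [l + 1] * len(perm)
        for j, i in enumerate(perm[:l], start=1):
            pos[i] = j
        return pos
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(inv_trunc(perm_x), inv_trunc(perm_y))))
```

For l = n the two functions coincide; smaller l both shrinks the stored permutations and cheapens each comparison, which is the cost/effectiveness trade-off explored in Section 5.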
4.2 MI-File
The Metric Inverted File approach (MI-File) [2, 1] uses an inverted file to store relationships between permutations. It also uses some approximations and optimizations to improve both efficiency and effectiveness. The basic idea is that the entries (the lexicon) of the inverted file are the pivots P. The posting list associated with an entry p_i ∈ P is a list of pairs (o, Π_o^{-1}(i)), o ∈ C, i.e., a list where each object o of the dataset C is associated with the position of the pivot p_i in Π_o.
As already mentioned, in [1] it was observed that truncated permutations can be used without a significant loss of effectiveness. The MI-File allows truncating the permutations of data and query objects independently. We denote with l_x the length of the permutations used for indexing and with l_s the one used for searching (i.e., the length of the query permutation).
The MI-File also uses a strategy to read just a small portion of the accessed posting lists, containing the most promising objects, further reducing the search cost. The most promising data objects in the posting list associated with a pivot p_i, for a query q, are those whose position of p_i in their associated permutation is closest to the position of p_i in the permutation associated with q. That is, the promising objects are the objects o, in the posting list, having a small |Π_o^{-1}(i) − Π_q^{-1}(i)|. To control this, a parameter is used to specify a threshold on the maximum allowed position difference (mpd) among pivots in data and query objects. Provided that the entries in the posting lists are kept sorted according to the position of the associated pivot, small values of mpd imply accessing just a small portion of the posting lists.
Finally, in order to improve the effectiveness of the approximate search, when the MI-File executes a k-NN query it first retrieves k·amp objects using the inverted file, then selects, from these, the best k objects according to the original distance. The factor amp ≥ 1 specifies the size of the set of candidate objects to be retrieved using the permutation-based technique, which will be reordered according to the original distance to retrieve the best k objects.
The MI-File search algorithm incrementally computes a relaxed version of the Spearman Footrule distance with location parameter l between the query and the data objects retrieved from the read portions of the accessed posting lists.
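A simplified in-memory sketch of the two core ideas, posting lists keyed by pivot and score accumulation with the mpd cut, may help fix the data layout (the real MI-File keeps the lists on disk and also penalizes pivots missing from an object's truncated permutation):

```python
def build_postings(perms, lx):
    """Inverted file: entry i -> list of (object id, position of pivot i),
    keeping only the lx closest pivots of each object."""
    postings = {}
    for oid, perm in enumerate(perms):
        for pos, i in enumerate(perm[:lx], start=1):
            postings.setdefault(i, []).append((oid, pos))
    for plist in postings.values():
        # sorted by position, so the mpd cut reads a contiguous slice
        plist.sort(key=lambda e: e[1])
    return postings

def accumulate(postings, qperm, ls, mpd):
    """Relaxed footrule scores: sum of position differences over the
    query's ls closest pivots, skipping entries beyond mpd."""
    scores = {}
    for qpos, i in enumerate(qperm[:ls], start=1):
        for oid, pos in postings.get(i, []):
            if abs(pos - qpos) <= mpd:
                scores[oid] = scores.get(oid, 0) + abs(pos - qpos)
    return scores
```

The k·amp objects with the smallest accumulated scores would then be re-ranked with the original distance d.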
4.3 PP-Index
The Permutation Prefix Index (PP-Index) [10, 11] associates each indexed object o with the short prefix Π_o^l, of length l, of the permutation Π_o. The permutation prefixes of the indexed objects are indexed by a prefix tree kept in main memory. All the indexed objects are serialized sequentially in a data storage, kept on disk, following the lexicographic order defined by the permutation prefixes.
At search time, the permutation prefix Π_q^l of the query q is used to find, in the prefix tree, the smallest subtree which includes at least z ≥ k candidates (z is a parameter of the search function). All the z′ ≥ z candidates included in that subtree, i.e., o_1 … o_{z′}, are then retrieved from the data storage and sorted, using a max-heap of k elements, by their distance d(q, o_i), thus determining the approximate k-NN result.
A key property of the PP-Index is that any subtree of the prefix tree maps directly into a single sequence of contiguous objects in the data storage. Sequential access to disk is crucial for search efficiency. For example, in our experimental setup, randomly reading from disk the data representing 10,000 objects of the test dataset (described in Section 5.1) takes 87.4 seconds, while a sequential read of the same number of objects takes 0.14 seconds. Computing 10,000 distances between objects in the test dataset takes only 0.0046 seconds, which indicates that a good disk access pattern is the key aspect for efficiency.
The approach of the PP-Index to similarity search is close to the one of the M-Index [19], which uses permutation prefixes to compute a mapping of any object to a real number that is then used as the key to sequentially sort the indexed objects in a secondary memory data structure, such as a sequential file or a B+-tree. Both the PP-Index and the M-Index share many intuitions with the Locality-Sensitive Hashing (LSH) model [13, 20]. For example, following the same principle as Multi-Probe LSH [16], the PP-Index adopts a multiple-query strategy that generates additional queries by performing local permutations on the original permutation prefix of the query object, i.e., retrieving additional candidates that are still close to the query because their permutation prefix differs only by a swap of a pair of adjacent pivots. The first pair that is swapped is the one that has the minimum difference of distances with respect to the query, i.e., min_j (d(q, p_{Π_q(j+1)}) − d(q, p_{Π_q(j)})), and so on. Note that some of the additional queries may end up selecting the same subtree as other queries, so that the number of sequences of candidate objects accessed on disk may be less than the number of queries.
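The order in which adjacent pairs are swapped can be sketched as follows. This is a hypothetical helper, not the PP-Index implementation; `qdists[j]` is assumed to hold d(q, p_{Π_q(j)}) for the prefix pivots, in nondecreasing order:

```python
def additional_prefixes(qprefix, qdists):
    """Yield perturbed query prefixes: swap one adjacent pivot pair at a
    time, starting from the pair with the smallest distance difference."""
    order = sorted(range(len(qprefix) - 1),
                   key=lambda j: qdists[j + 1] - qdists[j])
    for j in order:
        perturbed = list(qprefix)
        perturbed[j], perturbed[j + 1] = perturbed[j + 1], perturbed[j]
        yield perturbed
```

Pairs of pivots that are almost equidistant from the query are the most likely to be swapped in the permutations of the query's true neighbors, which is why they are probed first.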
5 Experiments
5.1 Experimental Settings
Datasets and Groundtruth: Experiments were conducted using the CoPhIR dataset [4], which is currently the largest multimedia metadata collection available for research purposes. It consists of a crawl of 106 million images from the Flickr photo sharing website. We have run experiments using as the distance function d a linear combination of the five distance functions for the five MPEG-7 descriptors that have been extracted from each image. As weights for the linear combination we adopted those proposed in [3]. As the ground truth, we randomly selected 1,000 objects from the dataset as test queries and sequentially scanned the entire CoPhIR to compute the exact results.
Evaluation Measures: All the tested similarity search techniques re-rank a set of approximate results using the original distance. Thus, if the k-NN result list R̃_k returned by a search technique has an intersection with the ground truth R_k, the objects in the intersection are ranked consistently in both lists. The most appropriate measure to use is then the recall: |R̃_k ∩ R_k|/k. In the experiments we fixed the number of results k requested to each similarity search technique to 100 and evaluated recall@r, defined as |R̃_r ∩ R_r|/r, where R̃_r indicates the sub-list of the first r results in R̃_k (1 ≤ r ≤ k). Note that, since the two lists are consistently ordered, R̃_k ∩ R_r ⊆ R̃_r always holds and thus R̃_r ∩ R_r = R̃_k ∩ R_r, i.e., none of the results in R̃_k after the r-th position can contribute to recall@r. Given that the queries were selected from the dataset and that all the tested access methods always found them, we decided to remove each query from its approximate result list. In fact, not removing them would artificially raise recall@r for small values of r.
The average query cost of each tested technique was measured adopting a specific cost model that will be specified in Section 5.2.
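Given the consistent-ordering property discussed above, recall@r reduces to a set intersection over the top-r sublists; a minimal sketch:

```python
def recall_at_r(approx, exact, r):
    """recall@r = |approx[:r] ∩ exact[:r]| / r, for result lists of ids
    that are consistently ordered by the original distance."""
    return len(set(approx[:r]) & set(exact[:r])) / r
```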
Selection Techniques Parameters: Given previous results reported in [2, 1, 10, 11], we decided to use 1,000 pivots. The parameters of each selection strategy were chosen so that all strategies required almost the same time to compute (about 10 hours):
FFT: We selected the pivots among a subset of 1 million randomly selected
objects performing at each iteration 100,000 tries for selecting the added
pivot.
kMED: We performed the clustering algorithm on a subset of 1 million ran-
domly selected objects.
PSIS: We randomly selected 10,000 pairs of objects from the dataset and
performed 10 trials at each iteration.
BPP: We randomly selected a set of 10,000 candidate pivots and tested them
on 100,000 randomly selected objects performing at each iteration no more
than 100 trials for selecting the pivot to be removed.
5.2 Results
For each tested similarity access method we show a pair of figures. In the first one we report the recall@r obtained by the various selection strategies, keeping the parameters of the access method fixed. Even if the parameters are fixed, the use of different sets of pivots results in different average query costs, which cannot be inferred from this figure. For this reason, in the second figure we report an orthogonal evaluation that compares recall@10 versus the query cost while varying some parameters of the access method.
PSR: In Figure 1 we report the recall@r obtained by PSR for location parameter l = 100. The results show that FFT outperforms the other techniques in terms of effectiveness. PSIS performs significantly worse than all the others, while the rest of the strategies obtained very similar results. In Figure 2 we tested various values of the location parameter l, which directly impacts the query cost by reducing the index size and the permutation comparison cost. The results confirm that FFT significantly outperforms the others, but also reveal that the differences are more relevant when l is closer to n, i.e., when more complete permutations are used. For values of l greater than 100, none of the techniques reported significant variations. The value of l used for the results reported in Figure 1 was chosen according to this observation.
PP-Index: Following the results of [11], we tested the PP-Index by setting the length of the prefixes l to 6 and the value of z to 1,000. We tested both single- and multiple-query search, exploring a range of additional queries from 1 to 8. As the reference configuration we have chosen the one using the multiple-query search method with eight additional queries (nine in total). As already noted in Section 4.3, some of the additional queries may result in selecting the same subtree of candidates: in fact, only 4.61 sequential blocks of candidates are accessed on disk, on average, for the above configuration.
Figure 3 shows that the PP-Index obtains its best results when using the kMED strategy, which is clearly better than the other strategies. FFT and PSIS form a group of second-best strategies, followed by rnd and BPP, which are the worst performing ones. With respect to the other tested access methods, the PP-Index proved to be more robust (or less sensitive) to the change of pivot selection strategy: the recall curves for the various strategies have an almost identical slope, and there is only an average difference of 1.3% between the best and worst strategies, almost constant across all the recall levels.
For the PP-Index, we measured the query cost induced by the various strategies in terms of the number of candidate objects selected by the queries on the prefix tree. Figure 4 shows that the two best strategies with respect to the recall/cost tradeoff are kMED and FFT, followed by rnd and PSIS, with BPP being the worst one. In the nine-query setup, BPP needs about 20% more candidates to score a slightly worse recall than FFT. Again, the differences between the various strategies are smaller than those observed for the other access methods.
Fig. 1. Recall@r obtained by PSR for l = 100, varying r.
Fig. 2. Recall@10 obtained by PSR for various location parameters l.
Fig. 3. Recall@r, varying r, obtained by the PP-Index using the multiple-query search (eight additional queries).
Fig. 4. Recall@10 versus the number of candidates accessed (z′) by the PP-Index when using the multiple-query search method with zero (lower left corner) to eight (upper right) additional queries.
Fig. 5. Recall@r obtained by the MI-File using l_s = 5, varying the number of retrieved objects r from 1 to 100.
Fig. 6. Recall@10 obtained by the MI-File ranging l_s from 1 to 5.
Note that the X axis of Figure 4 has a logarithmic scale. The almost straight lines indicate that the number of candidates grows with a logarithmic trend as more queries are used with the multiple-query search strategy, while the recall grows linearly, indicating that the multiple-query strategy has a very convenient recall/cost trend.
In summary, kMED proved to be the best strategy for the PP-Index, yielding higher recall at a competitive cost.
MI-File: The MI-File was tested indexing data objects using the closest 100 pivots (l_x = 100). Queries were executed varying the number of closest pivots from 1 to 5, i.e., l_s ∈ {1, …, 5} (see Section 4.2). The maximum allowed position difference among pivots in data and query objects was 5 (mpd = 5). The size of the set of candidate objects retrieved was set to 50 times k (amp = 50).
Figure 5 shows the results obtained with l_s fixed to 5. For r < 10, BPP and rnd reveal better performance, while for r > 10 all the strategies almost overlap, except PSIS, which is always the worst.
Figure 6 shows the results varying l_s from 1 to 5. Larger values of l_s imply a larger number of disk block reads. It can be seen that, once a target recall value is fixed, the cost needed by the MI-File to achieve that recall varies significantly among the strategies. The cost needed to achieve a specific recall using the BPP method is one order of magnitude smaller than using the FFT method. For instance, the cost needed to obtain a recall@10 of 0.26 is 3,000 disk block reads using BPP, while the same recall requires 25,000 disk block reads using FFT.
The BPP method is overall the one offering the best performance with the MI-File. The recall values obtained using l_s = 5 are mostly at the top, and the cost needed to execute queries is significantly lower than with all the other strategies. This can be explained by the fact that, as discussed in Section 3.4, the BPP strategy has been designed to distribute the positions of the various pivots uniformly across the permutations. This means that the posting lists of the MI-File are well balanced and tend to contain blocks of entries, related to the same pivot position, of equal size. As a consequence, there are no posting lists that are both very long and mostly accessed by every query, which simultaneously improves effectiveness and efficiency.
6 Conclusion and Future Work
In this paper we compared five pivot selection strategies on three permutation-based access methods. For all the tested access methods we found at least one strategy that significantly outperforms random selection. Another interesting point is that there is no strategy that is universally the best for all the access methods. The PSR method, i.e., the sequential scan of the permutations adopting the Spearman Rho distance with location parameter l, largely benefited from the use of FFT. For the PP-Index the best strategy was kMED, even if the performance differences are small. The novel BPP strategy we propose significantly outperformed the others when used in combination with the MI-File. This means that even if all the tested access methods are permutation-based, they significantly differ in the way they exploit the permutation space.
The CoPhIR collection is one of the largest non-synthetic collections available for experiments on similarity search, and its objects have a relatively high dimensionality. The results we observed on this collection should thus be a good reference for practical applications with similar characteristics (e.g., large collections of images). We plan to extend the comparison to other collections with different characteristics in terms of data type, collection size and dimensionality. In the future we also plan to expand the comparison to other data structures, such as the M-Index [19], and to test novel strategies that make use of information on the queries, e.g., from a query log (as suggested in [11]).
References
1. Giuseppe Amato, Claudio Gennaro, and Pasquale Savino. MI-File: Using inverted files for scalable approximate similarity search. Multimedia Tools and Applications, (Online first), November 2012.
2. Giuseppe Amato and Pasquale Savino. Approximate similarity search in metric
spaces using inverted files. In Proceedings of the 3rd international conference on
Scalable information systems, InfoScale ’08, pages 28:1–28:10, ICST, Brussels, Bel-
gium, Belgium, 2008. ICST (Institute for Computer Sciences, Social-Informatics
and Telecommunications Engineering).
3. Michal Batko, Fabrizio Falchi, Claudio Lucchese, David Novak, Raffaele Perego,
Fausto Rabitti, Jan Sedmidubsky, and Pavel Zezula. Building a web-scale image
similarity search system. Multimedia Tools and Applications.
4. Paolo Bolettieri, Andrea Esuli, Fabrizio Falchi, Claudio Lucchese, Raffaele Perego,
Tommaso Piccioli, and Fausto Rabitti. Cophir: a test collection for content-based
image retrieval. CoRR, abs/0905.4627, 2009.
5. Sergey Brin. Near neighbor search in large metric spaces. In VLDB’95, Proceedings
of 21th International Conference on Very Large Data Bases, September 11-15,
1995, Zurich, Switzerland, pages 574–584. Morgan Kaufmann, 1995.
6. B. Bustos, O. Pedreira, and N. Brisaboa. A dynamic pivot selection technique for
similarity search. In Data Engineering Workshop, 2008. ICDEW 2008. IEEE 24th
International Conference on, pages 394–401, 2008.
7. Benjamin Bustos, Gonzalo Navarro, and Edgar Chávez. Pivot selection techniques for proximity searching in metric spaces. Pattern Recogn. Lett., 24(14):2357–2366, October 2003.
8. Edgar Chávez, Karina Figueroa, and Gonzalo Navarro. Effective proximity retrieval by ordering permutations. IEEE Trans. Pattern Anal. Mach. Intell., 30(9):1647–1658, 2008.
9. Sanjoy Dasgupta. Performance guarantees for hierarchical clustering. In 15th
Annual Conference on Computational Learning Theory, pages 351–363. Springer,
2002.
10. Andrea Esuli. Mipai: Using the pp-index to build an efficient and scalable similarity
search system. In SISAP, pages 146–148, 2009.
11. Andrea Esuli. Use of permutation prefixes for efficient and scalable approximate
similarity search. Information Processing & Management, 48(5):889 – 902, 2012.
12. Ronald Fagin, Ravi Kumar, and D. Sivakumar. Comparing top k lists. In Pro-
ceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms,
SODA ’03, pages 28–36, Philadelphia, PA, USA, 2003. Society for Industrial and
Applied Mathematics.
13. Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high
dimensions via hashing. In Proceedings of 25th International Conference on Very
Large Data Bases, VLDB ’99, pages 518–529, 1999.
14. Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance.
Theor. Comput. Sci., 38:293–306, 1985.
15. Leonard Kaufman and Peter J. Rousseeuw. Finding groups in data: an introduction
to cluster analysis. John Wiley and Sons, New York, 1990.
16. Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe
lsh: efficient indexing for high-dimensional similarity search. In Proceedings of the
33rd International Conference Very Large Data Bases, VLDB ’07, pages 950–961,
Vienna, Austria, 2007.
17. Rui Mao, Willard L. Miranker, and Daniel P. Miranker. Dimension reduction for
distance-based indexing. In Proceedings of the Third International Conference on
SImilarity Search and APplications, SISAP ’10, pages 25–32, New York, NY, USA,
2010. ACM.
18. María Luisa Micó, José Oncina, and Enrique Vidal. A new version of the nearest-
neighbour approximating and eliminating search algorithm (AESA) with linear
preprocessing time and memory requirements. Pattern Recogn. Lett., 15(1):9–17,
January 1994.
19. David Novak, Michal Batko, and Pavel Zezula. Metric index: An efficient and scal-
able solution for precise and approximate similarity search. Inf. Syst., 36(4):721–
733, June 2011.
20. David Novak, Martin Kyselak, and Pavel Zezula. On locality-sensitive indexing
in generic metric spaces. In Proceedings of the Third International Conference on
SImilarity Search and APplications, SISAP ’10, pages 59–66, New York, NY, USA,
2010. ACM.
21. Rodrigo Paredes and Gonzalo Navarro. Optimal incremental sorting. In Proc.
8th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 171–
182. SIAM Press, 2006.
22. Oscar Pedreira and Nieves R. Brisaboa. Spatial selection of sparse pivots for
similarity search in metric spaces. In Proceedings of the 33rd conference on Current
Trends in Theory and Practice of Computer Science, SOFSEM ’07, pages 434–445,
Berlin, Heidelberg, 2007. Springer-Verlag.
23. Marvin Shapiro. The choice of reference points in best-match file searching. Com-
mun. ACM, 20(5):339–343, May 1977.
24. Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in
general metric spaces. In Proceedings of the fourth annual ACM-SIAM Symposium
on Discrete algorithms, SODA ’93, pages 311–321, Philadelphia, PA, USA, 1993.
Society for Industrial and Applied Mathematics.
25. Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity
Search - The Metric Space Approach, volume 32 of Advances in Database Systems.
Kluwer, 2006.