Some Theoretical and Experimental Observations on Permutation Spaces and Similarity Search

Giuseppe Amato, Fabrizio Falchi, Fausto Rabitti, and Lucia Vadicamo
Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo”,
via G. Moruzzi 1, Pisa 56124, Italy
{firstname}.{lastname}@isti.cnr.it
Abstract. Permutation-based approaches represent data objects as ordered
lists of predefined reference objects. Similarity queries are executed
by searching for data objects whose permutation representation is similar
to that of the query. Various permutation-based indexes have been recently
proposed. They typically allow high efficiency with acceptable effectiveness.
Moreover, various parameters can be set in order to find an optimal
trade-off between quality of results and costs.
In this paper we studied the permutation space without referring to any
particular index structure, focusing on both theoretical and experimental
aspects. We used both synthetic and real-world datasets for our experiments.
The results of this work are relevant both for developing permutation-based
similarity search approaches and for setting their parameters.
Keywords: permutation-based indexing, similarity search, content based
image retrieval
1 Introduction
Representing dataset objects as lists of preselected pivots ordered by their
closeness to each object is a recent approach that has proved very useful
in many approximate similarity search techniques [3, 8, 14, 20]. These
approaches share the intuition that similarity between objects can be
approximated by comparing their representations in terms of permutations. The
quality of the obtained results has shown that whenever the permutations of two
objects are similar, the two objects are likely to be similar also with respect
to the original distance function.
In this paper, we studied the permutation space without relying on any
specific indexing structure, with the goal of making theoretical and experimental
observations that can help both in setting the parameters of existing
permutation-based approaches and in developing new ones.
2 Related Work
Predicting the closeness between objects on the basis of ranked lists of a set of
pivots was originally and independently proposed in [8] and [4]. In [8], data
objects and queries are represented as appropriate permutations of a set of
reference objects, called permutants, and their similarity is approximated by
comparing their representations in terms of permutations. As distances between
permutations, Spearman rho, Kendall Tau and Spearman Footrule were tested.
Spearman rho performed better.
The MI-File approach [4, 3] uses an inverted file to store relationships between
permutations. The Spearman Footrule distance is used to estimate the similarity
between the query and the database objects. To reduce storage, each object
is encoded using only its nearest reference points, and further approximations
and optimizations are adopted to improve both efficiency and effectiveness.
The Permutation Prefix Index (PP-Index) was proposed in [13, 14]. PP-Index
associates each indexed object with a short prefix, of predefined length, of the
full permutation. The prefixes are indexed by a prefix tree kept in main memory,
and all the relevant information about the indexed objects is serialized
sequentially in a data storage kept on disk. PP-Index uses the permutation
prefixes in order to quickly retrieve a candidate set of objects that are likely
to be at close distance to the query. The result set is then obtained using the
original distance function by a sequential scan of the candidate set.
In [20], the concept of Locality-Sensitive Hashing (LSH) was extended to a
general metric space by using a permutation approach. In [19], a quantized
representation of the permutation lists was proposed, together with a specific
data structure, namely the Metric Permutation Table. In [22], the authors
presented the neighborhood approximation (NAPP) technique, whose main idea is
to represent each object by the set of its nearest pivots and to
approximate the similarity between objects on the basis of the number of shared
pivots. Three strategies for the parallelization of permutation-based indexes
using inverted files were presented in [18]. Posting lists decomposition,
reference points decomposition, and multiple independent inverted files were
studied and compared.
In [2], various pivot selection techniques were tested on three
permutation-based indexing approaches (i.e., [8, 3, 14]). The results revealed
that each indexing approach has its own best selection strategy, but also that
the random selection of pivots, even if never the best, results in good
performance.
In [17, 1], a Surrogate Text Representation (STR) derived from the MI-File
was proposed. The conversion of the permutations into a textual form allows
using off-the-shelf text search engines for similarity search.
3 Permutation-based representation
Given a domain $\mathcal{D}$, a distance function
$d : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ and a fixed set of
objects $P = \{p_1, \dots, p_n\} \subset \mathcal{D}$ that we call pivots, we
define a permutation-based representation $\Pi_o$ (briefly, permutation) of an
object $o \in \mathcal{D}$ as the list of pivot identifiers ordered by their
closeness to $o$.
Formally, the permutation-based representation
$\Pi_o = (\Pi_o(1), \Pi_o(2), \dots, \Pi_o(n))$ lists the pivot identifiers in
an order such that $\forall j \in \{1, 2, \dots, n-1\}$,
$d(o, p_{\Pi_o(j)}) \le d(o, p_{\Pi_o(j+1)})$, where $p_{\Pi_o(j)}$ indicates
the pivot at position $j$ in the permutation associated with object $o$.
Denoting the position of a pivot $p_i$ in the permutation of an object
$o \in \mathcal{D}$ as $\Pi_o^{-1}(i)$, so that $\Pi_o(\Pi_o^{-1}(i)) = i$, we
obtain an equivalent representation $\Pi_o^{-1}$:
$$\Pi_o^{-1} = (\Pi_o^{-1}(1), \Pi_o^{-1}(2), \dots, \Pi_o^{-1}(n))$$
This representation is very useful for essentially two reasons: first,
$\Pi_o^{-1} \in \mathbb{R}^n$, which allows permutations to be represented in
the Cartesian coordinate system; second, the Euclidean distance between two
objects $x, y$ represented as $\Pi_x^{-1}$ and $\Pi_y^{-1}$ is
equivalent to the Spearman rho distance between $\Pi_x$ and $\Pi_y$ (see
Section 3.1).
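The two representations can be illustrated with a minimal Python sketch. The function names, the 1-based pivot identifiers, the toy pivots and the tie-breaking rule (ties resolved by pivot identifier) are assumptions made for illustration only; the text does not prescribe them.

```python
import math

def permutation(obj, pivots, dist):
    """Return Pi_o: pivot identifiers (1-based) ordered by closeness to obj."""
    ids = range(1, len(pivots) + 1)
    # Assumption: ties in distance are broken by pivot identifier.
    return sorted(ids, key=lambda i: (dist(obj, pivots[i - 1]), i))

def inverse(perm):
    """Return Pi_o^{-1}: inv[i - 1] is the position of pivot i in perm."""
    inv = [0] * len(perm)
    for position, pivot_id in enumerate(perm, start=1):
        inv[pivot_id - 1] = position
    return inv

# Toy example in the plane with the Euclidean distance (hypothetical data).
pivots = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
o = (1.0, 0.4)
perm = permutation(o, pivots, math.dist)  # pivot 2 is nearest, then 3, then 1
inv = inverse(perm)                       # positions of pivots 1, 2, 3
print(perm, inv)
```

Here `perm` lists pivots by closeness, while `inv` is the vector in $\mathbb{R}^n$ used as the Cartesian representation.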
3.1 Comparing permutations
The idea of approximating the distance $d(x, y)$ between any two objects
$x, y \in \mathcal{D}$ by comparing their permutation-based representations
$\Pi_x, \Pi_y$ was originally proposed in [8]. As distances between
permutations, Spearman rho, Kendall Tau and Spearman Footrule were tested.
Spearman rho performed better.
Given two permutations $\Pi_x$ and $\Pi_y$, Spearman rho is defined as:
$$S_\rho(\Pi_x, \Pi_y) = \sqrt{\sum_{1 \le i \le n} \left(\Pi_x^{-1}(i) - \Pi_y^{-1}(i)\right)^2}$$
Following the intuition that the most relevant information of the permutation
$\Pi_o$ is in the very first, i.e. nearest, pivots, the Spearman rho distance
with location parameter $S_{\rho,l}$ defined in [15], intended for the
comparison of top-$l$ lists, has also been proposed.
$S_{\rho,l}$ differs from $S_\rho$ in its use of an inverted truncated
permutation $\tilde{\Pi}_o^{-1}$ that assumes that pivots further than
$p_{\Pi_o(l)}$ from $o$ are at position $l + 1$.
Formally, $\tilde{\Pi}_o^{-1}(i) = \Pi_o^{-1}(i)$ if $\Pi_o^{-1}(i) \le l$, and
$\tilde{\Pi}_o^{-1}(i) = l + 1$ otherwise.
It is worth noting that only the first $l$ elements of the permutation $\Pi_o$
are needed in order to compare any two objects with $S_{\rho,l}$.
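As a hedged illustration of the two distances just defined, the sketch below computes the Spearman rho on inverted permutations and its variant with location parameter from permutation prefixes only. The function names and the toy permutations are assumptions introduced here, not part of the original work.

```python
import math

def spearman_rho(inv_x, inv_y):
    """Euclidean distance between two inverted permutations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(inv_x, inv_y)))

def truncated_inverse(prefix, n, l):
    """Inverted truncated permutation built from the first l pivot ids.

    Pivots that do not appear among the first l positions get position l + 1.
    """
    inv = [l + 1] * n
    for position, pivot_id in enumerate(prefix[:l], start=1):
        inv[pivot_id - 1] = position
    return inv

def spearman_rho_l(prefix_x, prefix_y, n, l):
    """Spearman rho with location parameter l; only prefixes are needed."""
    return spearman_rho(truncated_inverse(prefix_x, n, l),
                        truncated_inverse(prefix_y, n, l))

# Hypothetical permutations over n = 5 pivots.
perm_x = [1, 2, 3, 4, 5]
perm_y = [2, 1, 3, 5, 4]
full = spearman_rho_l(perm_x, perm_y, n=5, l=5)          # l = n: standard rho
top2 = spearman_rho_l(perm_x[:2], perm_y[:2], n=5, l=2)  # only top-2 prefixes
print(full, top2)
```

With $l = n$ the truncated inverse coincides with the full inverse, so the second function reduces to the standard Spearman rho.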
4 Theoretical observations
As mentioned in Section 3, the permutation-space representation $\Pi_o^{-1}$
belongs to $\mathbb{R}^n$. Moreover, the Spearman rho distance between two
permutations $\Pi_x$ and $\Pi_y$ is the Euclidean distance between
$\Pi_x^{-1}$ and $\Pi_y^{-1}$. In the following we
consider the $\Pi_o^{-1}$ representation in a Cartesian coordinate system.
If we consider the case $n = 3$, the set of all possible permutation-based
representations (i.e., the set of all permutations of 3 elements) is
{(1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1)}. It is easy to see that
all these points lie on the plane $x + y + z = 6$ and are the vertices of a
regular hexagon, as depicted in Figure 1.
Fig. 1. The six points in $\mathbb{R}^3$ obtained by permuting the coordinates
of the vector (1, 2, 3)
Fig. 2. Permutahedron with 4! = 24 vertices
Consider now the $n = 4$ case: the vectors of all possible $\Pi_o^{-1}$ lie in a
three-dimensional subspace of $\mathbb{R}^4$ and are the vertices of a truncated
octahedron (see Figure 2).
In general, the $n!$ points $x$ obtained by permuting the coordinates of the
vector $(1, 2, \dots, n)$ form the vertices of an $(n-1)$-dimensional polytope
embedded in an $n$-dimensional space, referred to as the permutahedron (also
spelled permutohedron) [23, 16]. In fact, given that both the sum of the vector
values $x_i$ (i.e., $\Pi_o^{-1}(i)$) and the sum of their squared values are
fixed, all the vertices lie on both the hyperplane
$$x_1 + x_2 + \dots + x_n = \frac{n(n+1)}{2}$$
and the $n$-sphere
$$x_1^2 + x_2^2 + \dots + x_n^2 = \frac{n(n+1)(2n+1)}{6}.$$
That is, they lie on the intersection between a hyperplane and a sphere, both in
$\mathbb{R}^n$, i.e., on an $(n-1)$-sphere residing in $n$-dimensional space.
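The two constraints can be checked exhaustively for small $n$. The following is only a sanity-check sketch (the function name is an assumption); it enumerates all $n!$ vertices, so it is practical only for small values of $n$.

```python
from itertools import permutations

def on_hyperplane_and_sphere(n):
    """Check that every permutation of (1, ..., n) satisfies both constraints."""
    target_sum = n * (n + 1) // 2
    target_squares = n * (n + 1) * (2 * n + 1) // 6
    return all(
        sum(v) == target_sum and sum(x * x for x in v) == target_squares
        for v in permutations(range(1, n + 1))
    )

# For n = 3 the constraints are x + y + z = 6 and x^2 + y^2 + z^2 = 14.
results = {n: on_hyperplane_and_sphere(n) for n in (3, 4, 5, 6)}
print(results)
```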
The permutahedron is a very interesting convex polytope. It is centrally
symmetric, and its vertices can be identified with the permutations of $n$
objects in such a way that two vertices are connected by an edge if and only if
the corresponding permutations differ by an adjacent transposition. It is rather
easy to see that the squared Euclidean distance between any two vertices is an
even integer. Moreover, for $n > 4$, the squared distances constitute every even
integer up to the maximum possible value, that is $\frac{1}{3}(n^3 - n)$
[21, 23].
As observed in [21], standing on any vertex of a permutahedron and looking
around at neighbouring vertices, the view of the surrounding space is the same:
there are $n - 1$ adjacent vertices evenly distributed around the observation
vertex, each at Euclidean distance $\sqrt{2}$. Furthermore, the number of
vertices and their relative positions within a generic $\epsilon$-ball
neighbourhood are independent of the observation vertex.
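These properties can be verified by brute force for a small case. The sketch below, for $n = 5$, compares the sorted squared distances seen from two arbitrarily chosen vertices. It relies on the fact that an adjacent transposition changes exactly two coordinates by one unit each, so adjacent vertices are those at squared distance 2; the function name and the chosen vertices are assumptions for illustration.

```python
from itertools import permutations

def squared_distances_from(vertex):
    """Sorted squared Euclidean distances from vertex to all n! vertices."""
    n = len(vertex)
    return sorted(
        sum((a - b) ** 2 for a, b in zip(vertex, other))
        for other in permutations(range(1, n + 1))
    )

n = 5
from_identity = squared_distances_from((1, 2, 3, 4, 5))
from_other = squared_distances_from((3, 1, 5, 2, 4))

same_view = from_identity == from_other      # same distances from both vertices
all_even = all(d2 % 2 == 0 for d2 in from_identity)
adjacent = from_identity.count(2)            # vertices at squared distance 2
largest = from_identity[-1]                  # compare with (n^3 - n) / 3
print(same_view, all_even, adjacent, largest)
```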
The permutahedron precisely illustrates how the permutation-based
representations are positioned in the space where the Euclidean distance is
equivalent to the Spearman rho. It is worth mentioning that the Spearman
Footrule, sometimes used in permutation-based indexing, corresponds to the L1
(also called Manhattan) distance in the same space. However, it does not help
very much in understanding the distance distribution.
In order to understand the Spearman rho distance distribution, it is useful
to consider its squared variant $S_\rho^2$ (i.e., without the square root)
because of its interesting distribution properties. In [11] it was shown that
the $S_\rho^2$ distance has:
mean: $\frac{1}{6}(n^3 - n)$
variance: $\frac{1}{36} n^2 (n-1)(n+1)^2$
maximum value: $\frac{1}{3}(n^3 - n)$
Unfortunately, $S_\rho^2$ is not a metric. However, due to the monotonicity of
the square root function, the order of the results of a $k$-NN search does not
change with respect to the one obtained with $S_\rho$. Moreover, normalized
by its mean and variance, $S_\rho^2$ has a limiting normal distribution [12].
Chávez's intrinsic dimensionality [10] of the permutation space with squared
Spearman rho distance is $\frac{1}{2}(n-1)$.
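The closed forms above can be checked by exhaustive enumeration for small $n$. The sketch below uses exact rational arithmetic and assumes that the intrinsic dimensionality of [10] is computed as $\mu^2 / (2\sigma^2)$, with $\mu$ and $\sigma^2$ the mean and variance of the distance distribution; substituting the mean and variance listed above into that ratio gives $\frac{1}{2}(n-1)$. The function name is an assumption, and by the symmetry of the permutahedron it is enough to measure distances from a single vertex.

```python
from fractions import Fraction
from itertools import permutations

def squared_rho_stats(n):
    """Exact mean, variance and maximum of the squared Spearman rho."""
    identity = tuple(range(1, n + 1))
    values = [
        sum((a - b) ** 2 for a, b in zip(identity, p))
        for p in permutations(identity)
    ]
    mean = Fraction(sum(values), len(values))
    variance = Fraction(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, variance, max(values)

checks = []
for n in (3, 4, 5, 6):
    mean, variance, maximum = squared_rho_stats(n)
    checks.append((
        mean == Fraction(n ** 3 - n, 6),
        variance == Fraction(n ** 2 * (n - 1) * (n + 1) ** 2, 36),
        maximum == (n ** 3 - n) // 3,
        mean ** 2 / (2 * variance) == Fraction(n - 1, 2),  # mu^2 / (2 sigma^2)
    ))
print(checks)
```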
5 Performance evaluation of the permutation space
For our experiments we did not use any specific indexing approach. Instead, we
performed a sequential scan of archives of permutation-based representations in
order to retrieve the objects most similar to the query according to the
Spearman rho distance function.
5.1 Datasets and Ground Truth
Random float vectors. As a synthetic dataset we considered randomly generated
vectors of floats of various dimensionalities $d$ between 2 and 10. For each
dimension we randomly generated a float between 0 and 1. As the distance measure
for comparing any two vectors we used the Euclidean distance.
CoPhIR. As a real-world dataset we used the CoPhIR dataset [7], which is the
largest multimedia metadata collection available for research purposes. It
consists of 106 million images crawled from Flickr. We ran experiments using as
the distance function $d$ a linear combination of the five distance functions
for the five MPEG-7 descriptors that have been extracted from each image. We
adopted the weights proposed in [5]. As the ground truth, we randomly selected
100 objects from the dataset as test queries and sequentially scanned CoPhIR
to compute the exact results. The queries were removed from the dataset itself.
5.2 Pivots selection
For the CoPhIR dataset we randomly selected 10,000 pivots from the whole
106M-object collection. We then created subsets of this first selection. In the
following we report experiments obtained on a subset of the entire CoPhIR
collection. Thus, some pivots are also in the dataset while others are not.
Pivots for the random float vectors were randomly generated rather than
selected from the objects in the dataset.
Various pivot selection strategies have been proposed for permutation-based
indexing [2]. Experimental results have shown that while each specific indexing
strategy has its own best selection approach, random selection is always
a good choice.
5.3 Parameters
In this section we summarize the parameters that have to be set for each
specific experiment.
$d$ - float vector dimensionality. This parameter is only necessary to indicate
which random float vector dataset was used for the specific experiment.
Experiments are reported for $d = 2, 4, 6, 8$.
$m$ - dataset size. For both the synthetic and the CoPhIR dataset we recursively
selected subsets of the collection. We performed experiments up to 1M and 10M
objects for the random float vectors and CoPhIR datasets respectively.
$n$ - number of pivots. The maximum number of pivots we used was 10,000. The
smaller sets of pivots were obtained by recursively selecting a subset of the
larger collection.
$l$ - permutation length. Various values of $l$ for the Spearman rho with
location parameter (see Section 3.1) were tested. Please note that $l = n$
results in the standard Spearman rho distance.
$a$ - amplification factor. When a $k$-NN search is performed, a candidate set
of results of size $k' = a \cdot k$ is retrieved considering the similarity of
the permutations based on $S_\rho$. This set is then reordered considering the
original distance $d : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$.
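The role of the amplification factor can be sketched as a two-step search over a sequentially scanned archive of inverted permutations. This is a minimal illustration, not the code used for the experiments: all names, the toy data and the parameter values are assumptions, and the squared Spearman rho is used for ranking because it yields the same candidate order as $S_\rho$.

```python
import heapq
import math
import random

def inverse_permutation(obj, pivots, dist):
    """Position (1-based) of each pivot when pivots are sorted by distance."""
    order = sorted(range(len(pivots)), key=lambda i: (dist(obj, pivots[i]), i))
    inv = [0] * len(pivots)
    for position, pivot_index in enumerate(order, start=1):
        inv[pivot_index] = position
    return inv

def squared_rho(inv_x, inv_y):
    return sum((a - b) ** 2 for a, b in zip(inv_x, inv_y))

def knn_filter_and_refine(query, data, inv_data, pivots, dist, k, a):
    """Scan the permutations for a*k candidates, then re-rank them with dist."""
    q_inv = inverse_permutation(query, pivots, dist)
    candidates = heapq.nsmallest(
        a * k, range(len(data)), key=lambda i: squared_rho(q_inv, inv_data[i]))
    return heapq.nsmallest(k, candidates, key=lambda i: dist(query, data[i]))

# Hypothetical toy setting: 200 random 2-d vectors and 20 random pivots.
random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
pivots = [(random.random(), random.random()) for _ in range(20)]
inv_data = [inverse_permutation(o, pivots, math.dist) for o in data]
query = (0.5, 0.5)

exact = heapq.nsmallest(10, range(len(data)),
                        key=lambda i: math.dist(query, data[i]))
approx = knn_filter_and_refine(query, data, inv_data, pivots, math.dist, k=10, a=2)
full = knn_filter_and_refine(query, data, inv_data, pivots, math.dist, k=10, a=20)
```

When $a \cdot k$ reaches the dataset size, as for `full` above, every object is a candidate and the re-ranking returns the exact result.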
Fig. 3. Variances (eigenvalues) $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$,
for various numbers of pivots $n$ (10, 40, 100, 400, 1,000), corresponding to
each principal component of the permutations obtained from the random float
vectors of dimensionality 4 (a), 8 (b). Horizontal axis: $j$; vertical axis:
$\lambda_j$.
5.4 Evaluation Measure
Permutation-based indexing approaches typically re-rank a set of approximate
results using the original distance. In this work we did the same. Thus, if the
$k$-NN result list $\tilde{R}_k$ returned by a search technique has an
intersection with the ground truth $R_k$, the objects in the intersection are
ranked consistently in both lists. The most appropriate measure to use is then
the recall: $|\tilde{R}_k \cap R_k| / k$. In the experiments we fixed the number
of results $k$ to 10.
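The recall measure is simple enough to be stated as a short function. The sketch below assumes that results are given as lists of object identifiers; the function name and the identifiers are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact k nearest neighbours found in the approximate list."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Hypothetical result lists sharing 7 of 10 objects.
approx_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exact_ids = [1, 2, 3, 4, 5, 6, 7, 11, 12, 13]
recall = recall_at_k(approx_ids, exact_ids, k=10)
print(recall)
```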
5.5 Principal Component Analysis
While PCA cannot be performed on a generic domain $\mathcal{D}$, which may have
a non-metric distance and/or not be a vector space, once the permutation-based
representation has been obtained it is always possible to run PCA on the
$\Pi_o^{-1}$. We did this for both the random float vectors and the CoPhIR
dataset.
In Figure 3, we show the eigenvalues of each principal component of the
permutations obtained for various numbers of pivots $n$. The dimensionality of
the float vectors was 4 for (a) and 8 for (b). Please note that both axes have
logarithmic scale. With 1,000 pivots it is clear in both cases what the original
dimensionality of the vector space was. In fact, there is a large drop in the
eigenvalues passing from the 4th to the 5th eigenvector in (a), and from the 8th
to the 9th in (b). The results also show that with more pivots we obtain a
permutation-based representation that better fits the original data complexity.
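A possible way to reproduce this kind of analysis on synthetic data is sketched below with NumPy. The dataset size, number of pivots and random seed are arbitrary assumptions, much smaller than those used in the experiments, so the resulting spectrum is only indicative; no particular eigenvalue profile is claimed for this toy setting.

```python
import numpy as np

def inverted_permutations(data, pivots):
    """Matrix whose rows are the inverted permutations of the data objects."""
    # Euclidean distances between every object and every pivot.
    dists = np.linalg.norm(data[:, None, :] - pivots[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)      # pivot indices sorted by closeness
    return np.argsort(order, axis=1) + 1   # 1-based position of each pivot

def pca_eigenvalues(points):
    """Eigenvalues of the covariance matrix, in decreasing order."""
    cov = np.cov(points, rowvar=False)
    return np.linalg.eigvalsh(cov)[::-1]

rng = np.random.default_rng(0)
d, n, m = 4, 100, 2000                     # dimensionality, pivots, objects
data = rng.random((m, d))
pivots = rng.random((n, d))
eig = pca_eigenvalues(inverted_permutations(data, pivots))
print(eig[:8])                             # inspect the leading eigenvalues
```

Since the coordinates of every inverted permutation sum to the same constant, the smallest eigenvalue is zero up to numerical error, which reflects the fact that the points lie on a hyperplane.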
We did the same for the CoPhIR dataset reporting the results in Figure 4.
It is interesting to see that, in the logarithmic scale, the eigenvalues linearly
decrease. However, CoPhIR did not reveal any specific dimensionality.
Fig. 4. Variances (eigenvalues) $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$,
for various numbers of pivots $n$, corresponding to each principal component of
the data obtained from the mapping in the permutation space of 30,000 CoPhIR
objects
In [6], it was shown that the combined distance function that we also use
in our experiments results, on the CoPhIR dataset, in a near-normal distribution
with an intrinsic dimensionality, measured following the approach presented in
[9], of about 13. Unfortunately, the same information cannot be inferred from
Figure 4. Some non-linearity can be seen around 6 and 9, but performing PCA
on CoPhIR does not reveal the intrinsic dimensionality of the dataset as clearly
as it revealed the real dimensionality of the randomly generated float vectors.
5.6 Recall
In this section we relate the various parameters presented in Section 5.3 to the
recall obtained on $k$-NN searching for $k = 10$. As mentioned before, results
were obtained by sequentially scanning archives of permutations using the
Spearman rho with and without location parameter $l$. Please note that for
$l = n$, i.e. for location parameter $l$ equal to the number of pivots, the
Spearman rho with and without location parameter are equivalent.
In Figure 5, we report the recall obtained on the random float vector datasets
of dimensionality 2 (a), 4 (b), 6 (c) and 8 (d), varying the location parameter
$l$ for various numbers $n$ of pivots. In these experiments we fixed the
amplification factor $a = 1$. The most interesting result is that for small
dimensionalities (2 and 4) there is a maximum recall that can be obtained by
varying $l$. In other words, $l = n$ is not always the best solution; rather,
there is an optimal $l$ that appears not to vary for $n > l$. It is also
interesting to see that this optimal $l$ varies significantly with the
dimensionality of the original vector space. For 8-dimensional vectors we are
not even able to see this effect in the results. Probably, in this case the
optimal $l$ is well above 10,000, which is the maximum number of pivots we
tried. Another important observation is that the differences between the recall
values obtained with the various sets of pivots tend to be smaller for higher
$d$.
Fig. 5. Recall varying $l$ for various numbers of pivots (100, 400, 1,000,
4,000, 10,000) obtained on 100,000 random float vectors of dimensionality
2 (a), 4 (b), 6 (c), 8 (d)
The same type of experiments was conducted on the CoPhIR dataset. In
Figure 6, we report the recall obtained for $a = 1$ (a) and $a = 10$ (b). As for
the random float vectors, there appears to be an optimal $l$ that does not vary
significantly with $n$. While the amplification factor does significantly impact
the overall recall, the optimal $l$ still remains almost the same. These results
are consistent with the ones obtained on the random float vectors for
dimensionality of about 4. In terms of indexability with respect to the
permutation-based approach, CoPhIR appears to be as complex as randomly
generated vectors of dimensionality between 4 and 6. In fact, we showed in
Figure 5 that for random float vectors of 8 dimensions, the optimal $l$ equals
the number of pivots.
In Figure 7, the recall obtained varying the number of pivots for various sizes
of CoPhIR subsets is reported. In this case we used $a = 1$. In Figure 7 (a), we
show the results obtained for $l = n$ (i.e., the standard Spearman rho). In
Figure 7 (b), we report the recall obtained for the optimal $l$, which depends
on both $n$ and the dataset size. Comparing these two figures, it is evident
that higher recall can be obtained by increasing the number of pivots only if
the optimal location parameter $l$ is used. However, in our experiments, we
obtained very nearly optimal results by using $l = 200$ (as can be seen for 10M
objects in Figure 6). The intuition is that beyond a certain number of pivots,
information regarding distant pivots is not only useless but distracting. Please
note that the experiments performed on the random vectors indicate that the
distant pivots are useful when the dimensionality of the dataset is above 8 (up
to 10,000 pivots). Thus, while the observations made on the CoPhIR dataset are
useful for understanding its characteristics and the fact that an optimal $l$
exists for a specific dataset, $l = 200$ is a near-optimal solution only for the
CoPhIR dataset, and it probably reflects its complexity, which appears to be
lower than the intrinsic dimensionality evaluated in [6] following the approach
of [9].
In Figure 8, we show the recall obtained varying the size of the CoPhIR
subset for various numbers of pivots, optimal $l$ (different for each
combination of number of pivots $n$ and dataset size) and $a = 10$. This graph
is useful for understanding the loss in recall when the dataset size increases.
The results show that there is an almost linear dependency between the number of
pivots needed to achieve a given quality of results and the dataset size.
In Figure 9, we fixed both the number of pivots (10,000) and the dataset
size (10M), reporting the recall varying $a$ for various $l$. As expected, the
larger the amplification factor $a$, the better the quality of the results.
Please note that $l$ and $a$ are the most relevant parameters in trading
efficiency versus effectiveness in permutation-based indexes. In fact, the
shorter the permutation $\Pi_o$, the less information has to be stored for each
object. Moreover, the smaller the amplification factor $a$, the smaller the
number of objects to be retrieved for each search.
Fig. 6. Recall varying location parameter $l$ for various numbers of pivots and
$a = 1$ (a) and 10 (b) on CoPhIR 10M objects
Fig. 7. Recall varying the number of pivots for various dataset sizes (all
subsets of CoPhIR) obtained without location parameter $l$ (a) and with the
optimum $l$ (b).
Fig. 8. Recall varying dataset size (all subsets of CoPhIR) for various $l$ and
$a = 10$.
Fig. 9. Recall varying $a$ for various $l$, number of pivots $n = 10,000$ and
size of the CoPhIR subsets 10M.
6 Conclusion
In this work we studied the permutation space, focusing on both theoretical and
experimental aspects and not relying on any specific index structure. We used
both synthetic data and the CoPhIR dataset for the experiments, varying several
parameters that are typically used for trading off efficiency and
effectiveness.
We first made some observations on the permutation space in order to understand
its specific characteristics. We showed that the points are vertices of a
permutahedron, and that using a squared Spearman rho results in a Gaussian
distance distribution.
The experiments conducted using random float vectors of various dimensionalities
showed that the complexity of the dataset affects the optimal value of $l$ in
terms of recall, and that the dimensionality of the original vector space can be
inferred by performing PCA on the permutation space.
Also in the case of the CoPhIR dataset we found that there exists an optimal
$l$ for each specific number of pivots. Moreover, this optimal $l$ is very
stable and typically around 200. Thus, we believe that the optimal length of the
permutations is related to the intrinsic complexity of the dataset, even if this
complexity cannot be clearly seen by performing PCA on the permutation space.
The experiments also revealed a linear dependency between the number of
pivots and the dataset size. Other results were shown for combinations of $l$
and the amplification factor $a$, since they are the most useful parameters in
trading off efficiency and effectiveness in permutation-based indexes.
References
1. Amato, G., Bolettieri, P., Falchi, F., Gennaro, C., Rabitti, F.: Combining local
and global visual feature similarity using a text search engine. In: Content-Based
Multimedia Indexing (CBMI), 2011 9th International Workshop on. pp. 49 –54.
IEEE Computer Society (2011)
2. Amato, G., Esuli, A., Falchi, F.: Pivot selection strategies for permutation-based
similarity search. In: Brisaboa, N., Pedreira, O., Zezula, P. (eds.) Similarity Search
and Applications, Lecture Notes in Computer Science, vol. 8199, pp. 91–102.
Springer Berlin Heidelberg (2013)
3. Amato, G., Gennaro, C., Savino, P.: Mi-file: using inverted files for scalable ap-
proximate similarity search. Multimedia Tools and Applications pp. 1–30 (2012)
4. Amato, G., Savino, P.: Approximate similarity search in metric spaces using in-
verted files. In: Proceedings of the 3rd international conference on Scalable In-
formation Systems. pp. 28:1–28:10. InfoScale ’08, ICST (Institute for Computer
Sciences, Social-Informatics and Telecommunications Engineering) (2008)
5. Batko, M., Falchi, F., Lucchese, C., Novak, D., Perego, R., Rabitti, F., Sedmidub-
sky, J., Zezula, P.: Building a web-scale image similarity search system. Multimedia
Tools and Applications 47(3), 599–629 (2010)
6. Batko, M., Kohoutková, P., Novak, D.: CoPhIR image collection under the
microscope. In: Skopal, T., Zezula, P. (eds.) Similarity Search and
Applications, 2009. SISAP ’09. Second International Workshop on. pp. 47–54.
IEEE Computer Society (2009)
7. Bolettieri, P., Esuli, A., Falchi, F., Lucchese, C., Perego, R., Piccioli, T., Rabitti, F.:
CoPhIR: a test collection for content-based image retrieval. CoRR abs/0905.4627
(2009)
8. Chávez, E., Figueroa, K., Navarro, G.: Effective proximity retrieval by ordering
permutations. Pattern Analysis and Machine Intelligence, IEEE Transactions on
30(9), 1647–1658 (2008)
9. Chávez, E., Navarro, G.: Measuring the dimensionality of general metric spaces.
Department of Computer Science, University of Chile, Tech. Rep. TR/DCC-00-1
(2000)
10. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric
spaces. ACM Computing Surveys 33(3), 273–321 (2001)
11. Diaconis, P.: Group representations in probability and statistics, Lecture Notes-
Monograph Series, vol. 11. Institute of Mathematical Statistics (1988)
12. Diaconis, P., Graham, R.L.: Spearman’s footrule as a measure of disarray. Journal
of the Royal Statistical Society. Series B (Methodological) 39(2), 262–268 (1977)
13. Esuli, A.: MiPai: Using the PP-index to build an efficient and scalable similarity
search system. In: Skopal, T., Zezula, P. (eds.) Similarity Search and Applications,
2009. SISAP ’09. Second International Workshop on. pp. 146–148. IEEE Computer
Society (2009)
14. Esuli, A.: Use of permutation prefixes for efficient and scalable approximate simi-
larity search. Information Processing & Management 48(5), 889–902 (2012)
15. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: Proceedings of
the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 28–36.
SODA ’03, Society for Industrial and Applied Mathematics (2003)
16. Gaiha, P., Gupta, S.K.: Adjacent vertices on a permutohedron. SIAM Journal on
Applied Mathematics 32(2), 323–327 (1977)
17. Gennaro, C., Amato, G., Bolettieri, P., Savino, P.: An approach to content-based
image retrieval based on the lucene search engine library. In: Lalmas, M., Jose, J.,
Rauber, A., Sebastiani, F., Frommholz, I. (eds.) Research and Advanced Technol-
ogy for Digital Libraries, Lecture Notes in Computer Science, vol. 6273, pp. 55–66.
Springer Berlin Heidelberg (2010)
18. Mohamed, H., Marchand-Maillet, S.: Parallel approaches to permutation-based
indexing using inverted files. In: Navarro, G., Pestov, V. (eds.) Similarity Search
and Applications, Lecture Notes in Computer Science, vol. 7404, pp. 148–161.
Springer Berlin Heidelberg (2012)
19. Mohamed, H., Marchand-Maillet, S.: Quantized ranking for permutation-based
indexing. In: Brisaboa, N., Pedreira, O., Zezula, P. (eds.) Similarity Search and
Applications, Lecture Notes in Computer Science, vol. 8199, pp. 103–114. Springer
Berlin Heidelberg (2013)
20. Novak, D., Kyselak, M., Zezula, P.: On locality-sensitive indexing in generic metric
spaces. In: Proceedings of the Third International Conference on Similarity Search
and Applications. pp. 59–66. SISAP ’10, ACM (2010)
21. Santmyer, J.: For all possible distances look to the permutohedron. Mathematics
Magazine 80(2), 120–125 (2007)
22. Tellez, E.S., Chavez, E., Navarro, G.: Succinct nearest neighbor search. Information
Systems 38(7), 1019 – 1030 (2013)
23. Ziegler, G.M.: Lectures on Polytopes. Graduate Texts in Mathematics, Springer
New York (1995)
... The permutation representation of an object is computed by ordering the identifiers of a set of pivots according to their distances to the object [3]. However, the computation of these distances is just one, yet effective, approach to associate a permutation to each data object. ...
... This choice is based on the intuition that the most relevant information in the permutation is present in its very first elements, i.e. the identifiers of the closest pivots. Moreover, using the positions of the nearest l out of n pivots often leads to obtaining better or similar effectiveness to using the full-length permutation [4,3], resulting also in a more compact data encoding. The permutation prefixes are compared using top-l distances [14], like the Spearman Rho with location parameter l defined as S ρ,l , ...
... Pivoted embedding -As a consequence of Equation 3 we have ...
... This characterization have been adopted in several research papers that further investigated the properties of this data representations and ways to efficiently index them, e.g. [2,[13][14][15][17][18][19]. Moreover, some alternative permutation-based representations have been defined in the literature [1,21], but only for representing objects of specific metric spaces. ...
... The use of permutation prefixes may be dictated by either the employed data structure (e.g. prefix tree), efficiency issues (more compact data encoding and better performance when using inverted files) or even by effectiveness reasons (in some cases the use of prefixes gives better results than full-length permutations [2,3]). ...
Chapter
In the domain of approximate metric search, Permutation-based Indexing (PBI) approaches have proved particularly suitable for dealing with large data collections. These methods employ a permutation-based representation of the data, which can be efficiently indexed using data structures such as inverted files. In the literature, the permutation of a metric object is defined by ordering a set of pivots according to their distances to the object. In this paper, we aim at generalizing this definition in order to enlarge the class of permutations that can be used by PBI approaches. As a practical outcome, we defined a new type of permutation that is calculated using distances from pairs of pivots. The proposed technique permits us to produce longer permutations than traditional ones for the same number of object-pivot distance calculations. The advantage is that the use of inverted files built on permutation prefixes leads to greater efficiency in the search phase when longer permutations are used.
... For example, the Euclidean vector [0.4, 1.6, 0.3, 0.5] is transformed into the permutation [3,1,4,2], since the third element of the vector is the smallest one, the first element is the second smallest one, and so on. ...
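The transformation in the example above amounts to listing the 1-based positions of the vector's components from smallest to largest. A minimal sketch follows; the function name is hypothetical, and ties are broken by position because Python's sort is stable, a convention the excerpt does not specify.

```python
def vector_to_permutation(values):
    """Return the 1-based positions of `values`, ordered from smallest to largest value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [i + 1 for i in order]

print(vector_to_permutation([0.4, 1.6, 0.3, 0.5]))  # prints [3, 1, 4, 2]
```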
... This is evident in Fig. 1, where the recall slightly decreases as the location parameter l grows. This phenomenon has also been observed experimentally on other data sets, as shown in several works (see e.g., [3,4]). Our SPLX-Perms seems not to be affected by this phenomenon, since its recall increases when considering larger l. ...
Chapter
Many approaches for approximate metric search rely on a permutation-based representation of the original data objects. The main advantage of transforming metric objects into permutations is that the latter can be efficiently indexed and searched using data structures such as inverted files and prefix trees. Typically, the permutation is obtained by ordering the identifiers of a set of pivots according to their distances to the object to be represented. In this paper, we present a novel approach to transform metric objects into permutations. It uses the object-pivot distances in combination with a metric transformation, called n-Simplex projection. The resulting permutation-based representation, named SPLX-Perm, is suitable only for the large class of metric spaces satisfying the n-point property. We tested the proposed approach on two benchmarks for similarity search. Our preliminary results are encouraging and open new perspectives for further investigations on the use of the n-Simplex projection for supporting permutation-based indexing.
... These structures have a number of parameters on which their actual performance depends, and their choices are generally made empirically, either based on heuristics or on the statistics of the data in question [1,3,7,4]. However, a formal modeling of the relationship between these choices and their impact on performance, based on a sound modeling of the encoding created by the indexing scheme, is still missing [2,20]. ...
... Knowing that p ∈ V k R (R k ) for all k < n is sufficient to determine the ordered list π p . As noted in [2], equivalence classes are the vertices of the permutahedron of order n, the polytope whose edges connect all permutations differing by a local flip. Proposition 3. The edges of the order-n permutahedron form an equivalent "order-(n − 1) Delaunay graph" for the ordered order-(n − 1) Voronoi diagram. ...
Conference Paper
Providing fast and accurate (exact or approximate) access to large-scale multidimensional data is a ubiquitous problem that dates back to the early days of large-scale Information Systems. Similarity search, which requires resolving nearest neighbor (NN) searches, is a fundamental tool for structuring information space. Permutation-based Indexing (PBI) is a reference-based indexing scheme that accelerates NN search by combining the use of landmark points and ranking in place of distance calculation. In this paper, we are interested in understanding the approximation made by the PBI scheme. The aim is to understand the robustness of the scheme by modeling it and quantifying its invariance properties. After discussing the geometry of PBI in relation to the study of ranking, and based on empirical evidence, we make proposals to cater for the inconsistencies of this structure.
... In the past, we developed and proposed various techniques to support approximate similarity search in metric spaces, including approaches that rely on transforming the data objects into permutations (Permutation-based indexing) [39,40,41,42], low-dimensional Euclidean vectors (Supermetric search) [43,44], or compact binary codes (Sketching technique) [45]. Moreover, for a class of metric spaces that satisfy the so-called "4-point property" [46] we derived a new pruning rule named Hilbert Exclusion [47], which can be used with any indexing mechanism based on hyperplane partitioning in order to determine subsets of data that do not need to be exhaustively inspected. ...
Technical Report
The Artificial Intelligence for Multimedia Information Retrieval (AIMIR) research group is part of the NeMIS laboratory of the Information Science and Technologies Institute “A. Faedo” (ISTI) of the Italian National Research Council (CNR). The AIMIR group has long experience in topics related to Artificial Intelligence, Multimedia Information Retrieval, Computer Vision, and similarity search on a large scale. We aim at investigating the use of Artificial Intelligence and Deep Learning for Multimedia Information Retrieval, addressing both effectiveness and efficiency. Multimedia information retrieval techniques should be able to provide users with pertinent results, quickly, on huge amounts of multimedia data. Application areas of our research results range from cultural heritage to smart tourism, from security to smart cities, from mobile visual search to augmented reality. This report summarizes the 2019 activities of the research group.
... Exact range search requires efficient algorithms for partitioning the feature space to build a balanced search tree, along with a sound search algorithm for finding all data points that reside in the range of the given query. There are two approaches for exact search: 1) compact partition indexes that directly partition the data points, such as KD-tree, cover-tree, and ball-tree, and 2) pivot-based indexes that use a set of points, called pivots, to map the points to another space in which the distance is easier to compute [3,29]. While pivot-based approaches are faster in medium-sized datasets, the required number of pivots is extremely large for high-dimensional datasets [34]. ...
Article
Emerging location-based systems and data analysis frameworks require efficient management of spatial data for approximate and exact search. Exact similarity search can be done using space partitioning data structures, such as Kd-tree, R*-tree, and Ball-tree. In this paper, we focus on Ball-tree, an efficient search tree that is specific to spatial queries which use Euclidean distance. Each node of a Ball-tree defines a ball, i.e. a hypersphere that contains a subset of the points to be searched. In this paper, we propose Ball*-tree, an improved Ball-tree that is more efficient for spatial queries. Ball*-tree enjoys a modified space partitioning algorithm that considers the distribution of the data points in order to find an efficient splitting hyperplane. Also, we propose a new algorithm for KNN queries with restricted range using Ball*-tree, which performs better than both KNN and range search for such queries. Results show that Ball*-tree performs 39%-57% faster than the original Ball-tree algorithm.
Chapter
Indexing exploits assumptions on the inner structures of a dataset to make the nearest neighbor queries cheaper to resolve. Datasets are generally indexed at once into a unique index for similarity search. By indexing a given dataset as a whole, one faces the parameters of its global structure, which may be adverse. A typical well-studied example is a high global dimensionality of the dataset, making any indexing strategy inefficient due to the curse of dimensionality.
Conference Paper
Recently, permutation-based indexes have attracted interest in the area of similarity search. The basic idea of permutation-based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or reference objects). Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query. This, of course, assumes that similar objects are represented by similar permutations of the pivots. In the context of permutation-based indexing, most authors propose to select pivots randomly from the data set, given that traditional pivot selection strategies do not reveal better performance. However, to the best of our knowledge, no rigorous comparison has been performed yet. In this paper we compare five pivot selection strategies on three permutation-based similarity access methods. Among those, we propose a novel strategy specifically designed for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested strategies. Second, there is no strategy that is universally the best for all permutation-based access methods; rather, different strategies are optimal for different methods.
Conference Paper
We propose a new approach to perform approximate similarity search in metric spaces. The idea at the basis of this technique is that when two objects are very close to each other they 'see' the world around them in the same way. Accordingly, we can use a measure of dissimilarity between the views of the world, from the perspective of the two objects, in place of the distance function of the underlying metric space. To exploit this idea we represent each object of a dataset by the ordering of a number of reference objects of the metric space according to their distance from the object itself. In order to compare two objects of the dataset we compare the two corresponding orderings of the reference objects. We show that efficient and effective approximate similarity searching can be obtained by using inverted files, relying on this idea. We show that the proposed approach performs better than other approaches in the literature.
Article
Spearman's measure of disarray D is the sum of the absolute values of the difference between the ranks. We treat D as a metric on the set of permutations. The limiting mean, variance and normality are established. D is shown to be related to the metric I arising from Kendall's τ through the combinatorial inequality I ≤ D ≤ 2I.
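To make the inequality I ≤ D ≤ 2I concrete, the following sketch computes both quantities for a pair of permutations and checks the bounds exhaustively on all permutations of five items. The function names are hypothetical, and I is taken here to be the number of pairs ranked in opposite order by the two permutations, which is an assumption about the abstract's notation.

```python
from itertools import permutations

def footrule(p, q):
    """Spearman's D: sum over items of |rank in p - rank in q|."""
    rank_q = {item: r for r, item in enumerate(q)}
    return sum(abs(r - rank_q[item]) for r, item in enumerate(p))

def kendall(p, q):
    """I: number of item pairs that p and q rank in opposite order."""
    rank_q = {item: r for r, item in enumerate(q)}
    ranks = [rank_q[item] for item in p]
    n = len(ranks)
    return sum(1 for i in range(n) for j in range(i + 1, n) if ranks[i] > ranks[j])

identity = (0, 1, 2, 3, 4)
# Brute-force check of I <= D <= 2I against the identity permutation.
holds = all(
    kendall(p, identity) <= footrule(p, identity) <= 2 * kendall(p, identity)
    for p in permutations(identity)
)
print(holds)
```

For example, comparing [3, 1, 4, 2] with [1, 2, 3, 4] gives D = 6 and I = 3, so the upper bound D = 2I is attained.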
Article
The K-Nearest Neighbor (K-NN) search problem consists of finding the K closest and most similar objects to a given query. K-NN search is essential for many applications such as information retrieval and visualization, machine learning and data mining. The exponential growth of data makes approximate approaches to this problem necessary. Permutation-based indexing is one of the most recent techniques for approximate similarity search. Objects are represented by permutation lists ordering their distances to a set of selected reference objects, following the idea that two neighboring objects have the same surroundings. In this paper, we propose a novel quantized representation of permutation lists with its related data structure for effective retrieval on single and multicore architectures. Our novel permutation-based indexing strategy is built to be fast, memory efficient and scalable. This is experimentally demonstrated in comparison to existing proposals using several large-scale datasets of millions of documents and of different dimensions.
Conference Paper
We present parallel strategies for indexing and searching permutation-based indexes for high dimensional data using inverted files. In this paper, three strategies for parallelization are discussed: posting lists decomposition, reference points decomposition, and multiple independent inverted files. We study performance, efficiency, and effectiveness of our strategies on high dimensional datasets of millions of images. Experimental results show good performance compared to the sequential version, with the same efficiency and effectiveness.
Book
Based on a graduate course given at the Technische Universität Berlin, these lectures present a wealth of material on the modern theory of convex polytopes. The clear and straightforward presentation features many illustrations, and provides complete proofs for most theorems. The material requires only linear algebra as a prerequisite, but takes the reader quickly from the basics to topics of recent research, including a number of unanswered questions. The lectures introduce the basic facts about polytopes, with an emphasis on the methods that yield the results (Fourier-Motzkin elimination, Schlegel diagrams, shellability, Gale transforms, and oriented matroids); discuss important examples and elegant constructions (cyclic and neighborly polytopes, zonotopes, Minkowski sums, permutahedra and associahedra, fiber polytopes, and the Lawrence construction); and show the excitement of current work in the field (Kalai's new diameter bounds, construction of non-rational polytopes, the Bohne-Dress tiling theorem, the upper-bound theorem, and nonextendable shellings). They should provide interesting and enjoyable reading for researchers as well as students.
Article
The convex hull $P_n$ of \[S = \left\{ \left( a_{\pi(1)}, a_{\pi(2)}, \ldots, a_{\pi(n)} \right) \mid \pi \text{ is a permutation of } (1, 2, 3, \ldots, n) \right\},\] where $a_1, a_2, \ldots, a_n$ are integers, is defined as a permutohedron. If $a_1 < a_2 < \cdots < a_n$, it is shown that every vertex of $P_n$ has $n - 1$ adjacent vertices and a method for determining the adjacent vertices is given. The algebraic description of $P_n$ is given by considering its facets.
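The abstract above does not spell out its method for finding adjacent vertices. As a hedged illustration only, the sketch below uses one standard characterization for distinct coordinates: two vertices are adjacent when one is obtained from the other by exchanging two values that are consecutive in sorted order. The function name is hypothetical, and this may differ in presentation from the method given in the article.

```python
def adjacent_vertices(vertex):
    """Neighbours of a permutohedron vertex, assuming all coordinates are distinct.

    Each neighbour swaps two values that are consecutive in sorted order,
    giving n - 1 neighbours for a vertex with n coordinates.
    """
    values = sorted(vertex)
    neighbours = []
    for low, high in zip(values, values[1:]):
        swapped = tuple(high if x == low else low if x == high else x for x in vertex)
        neighbours.append(swapped)
    return neighbours

print(adjacent_vertices((3, 1, 4, 2)))
```

For a vertex with four coordinates this returns three neighbours, in line with the $n - 1$ count stated in the abstract.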