
Metric Embedding into the Hamming Space with the n-Simplex Projection



Lucia Vadicamo¹, Vladimir Mic², Fabrizio Falchi¹, and Pavel Zezula²
¹ Institute of Information Science and Technologies (ISTI), CNR, Pisa, Italy
{lucia.vadicamo, fabrizio.falchi}
² Masaryk University, Brno, Czech Republic
{xmic, zezula}
Abstract. Transformations of data objects into the Hamming space are often exploited to speed up similarity search in metric spaces. Techniques applicable in generic metric spaces require expensive learning, e.g., selection of pivoting objects. However, when searching in a common Euclidean space, the best performance is usually achieved by transformations specifically designed for this space. We propose a novel transformation technique that provides a good trade-off between applicability and the quality of the space approximation. It uses the n-Simplex projection to transform metric objects into a low-dimensional Euclidean space, and then transforms this space into the Hamming space. We compare our approach theoretically and experimentally with several techniques for metric embedding into the Hamming space. We focus on applicability, learning cost, and the quality of the search space approximation.
Keywords: sketch · metric search · metric embedding · n-point property
1 Introduction
The metric search problem aims at finding the data objects most similar to a given query object, under the assumption that there exists a metric function assessing the dissimilarity of any two objects. The broad applicability of the metric space similarity model makes metric search a challenging task, since the distance function is the only operation that can be exploited to compare two objects. One way to speed up metric searching is to transform the space so that a cheaper similarity function can be used or the data object sizes are reduced [4,9,14,19]. Recently, Connor et al. proposed the n-Simplex projection, which transforms the metric space into a finite-dimensional Euclidean space [8,9]. Here, specialised similarity search techniques can be applied. Moreover, the Euclidean distance is more efficient to evaluate than many distance functions.
Another class of metric space transformations is formed by sketching techniques that transform data objects into short bit-strings called sketches [4,17,19]. The similarity of sketches is expressed by the Hamming distance, and sketches are exploited to prune the search space during query executions [18,19]. While some sketching techniques are applicable in generic metric spaces, others are designed for specific spaces [4]. The metric-based sketching techniques are broadly applicable, but their performance is often worse than that of the vector-based sketching approaches when dealing with vector spaces [4,17].
© Springer Nature Switzerland AG 2019
G. Amato et al. (Eds.): SISAP 2019, LNCS 11807, pp. 265–272, 2019.
Final authenticated publication: 23
We propose a novel sketching technique, NSP 50, that combines the advantages of both approaches: wide applicability and good space approximation. It is applicable to the large class of metric spaces meeting the n-point property [3,7], and it consists of the projection of the search space into a low-dimensional Euclidean space (the n-Simplex projection) followed by the binarization of the vectors. The NSP 50 technique is particularly advantageous for expensive metric functions, since the learning of the projection requires a low number of distance computations. The main contribution of NSP 50 is a better trade-off between applicability, quality of the space approximation, and pre-processing cost.
2 Background and Related Work
We focus on similarity search in domains modelled by a metric space $(D, d)$, with the domain of objects $D$ and the metric (distance) function $d: D \times D \to \mathbb{R}^+$ [21] that expresses the dissimilarity of objects $o \in D$. We consider a data set $S \subseteq D$ and the so-called kNN queries, which search for the $k$ closest objects from $S$ to a query object $q \in D$. Similarity queries are often evaluated in an approximate manner, since slightly imprecise results are sufficient in many real-life applications and can be delivered significantly faster than the precise ones. Many metric space transformations have been proposed to speed up approximate similarity searching, including those producing the Hamming space [4,5,11,18,19], Euclidean space [9,16], and permutation space [1,6,20]. We further restrict our attention to metric embedding into the Hamming space.
2.1 Bit String Sketches for Speeding-up Similarity Search
Sketching techniques $sk(\cdot)$ transform the metric space $(D, d)$ to the Hamming space $(\{0,1\}^\lambda, h)$ to approximate it with smaller objects and a more efficient distance function. We denote the produced bit strings as sketches of length $\lambda$. Many sketching techniques have been proposed; see, for instance, the survey [4]. Their main features are: (1) quality, i.e., the ability to approximate the original metric space; (2) applicability to various search spaces; (3) robustness with respect to the (intrinsic) dimensionality of the data; (4) cost of the object-to-sketch transformation; (5) cost of the transformation learning. In the following, we summarise the concepts of three techniques that we later compare with the newly proposed NSP 50 technique. They all produce sketches with balanced bits, i.e., each bit $i$ is set to 1 in one half of the sketches $sk(o), o \in S$. This is denoted by the suffix 50 in their notations.
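
To make the "more efficient distance function" concrete, the following minimal sketch computes Hamming distances on bit-packed sketches with NumPy. The helper names `pack_sketches` and `hamming_to_all` are illustrative assumptions and do not come from the paper; a production system would more likely use hardware popcount instructions.

```python
import numpy as np

def pack_sketches(bits: np.ndarray) -> np.ndarray:
    """Pack an (N, lambda) array of 0/1 values into bytes, 8 bits per byte."""
    return np.packbits(bits.astype(np.uint8), axis=1)

# Lookup table: number of set bits in every possible byte value.
_POPCOUNT = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint8)

def hamming_to_all(query: np.ndarray, sketches: np.ndarray) -> np.ndarray:
    """Hamming distances between one packed sketch and many packed sketches."""
    xor = np.bitwise_xor(sketches, query)   # bytes whose set bits mark differences
    return _POPCOUNT[xor].sum(axis=1, dtype=np.int64)

# Three toy 16-bit sketches: rows 0 and 1 are equal, row 2 flips the first 8 bits.
bits = np.array([[1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
                 [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
                 [0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]])
packed = pack_sketches(bits)
distances = hamming_to_all(packed[0], packed)   # distances from sketch 0
```

Each 16-bit sketch occupies two bytes here, so comparing a query with the whole data set is a single XOR pass followed by a table lookup.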
GHP 50 technique [18] uses $\lambda$ pairs of reference objects (pivots) that define $\lambda$ instances of the Generalized Hyperplane Partitioning (GHP) [21] of the data set $S$. Each GHP instance splits the data set into two parts according to the closer pivot, and these parts define the values of one bit of all sketches $sk(o), o \in S$. The pivots are selected to produce balanced and low-correlated bits [18]: (1) an initial set of pivots $P_{sup} \subset D$ is selected at random, (2) the balance of the GHP is evaluated for all pivot pairs using a sample set $T$ of $S$, (3) a set $P_{bal}$ is formed by the pivot pairs that divide $T$ into parts balanced to at least 45 % to 55 %, and the corresponding sketches $sk_{bal}$ are created, (4) the correlation matrix $M$ with absolute values of the Pearson correlation coefficient is evaluated for all pairs of bits of the sketches $sk_{bal}$, (5) a heuristic is applied to select rows and columns of $M$ that form a sub-matrix of size $\lambda \times \lambda$ with low values, and (6) finally, the $\lambda$ pivot pairs that produce the corresponding low-correlated bits define the sketches $sk(o), o \in S$.
BP 50 uses the Ball Partitioning (BP) instead of the GHP [18]. BP uses one pivot and a radius to split the data into two parts, which again define the values of one bit of the sketches $sk(o), o \in S$. Pivots are again selected from a random set of pivots $P_{sup}$, for which we evaluate the radii dividing the sample set $T$ into halves. The same heuristic as in the case of the GHP 50 technique is then employed to select $\lambda$ pivots that produce low-correlated bits.
PCA 50 is a simple sketching technique that approximates Euclidean spaces surprisingly well [4,12,13,15,17]. It uses the Principal Component Analysis (PCA) to shrink the original vectors, which are then rotated using a random matrix and binarized by thresholding. The $i$-th bit of the sketch $sk(o)$ thus expresses whether the $i$-th value in the shortened vector is greater than the median computed on a sample set $T$. If sketches longer than the original vectors are desired, we propose to apply the PCA and to rotate the transformed vectors using independent random matrices. Then we concatenate the corresponding binarized vectors.
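
The pivot-based bits described for GHP 50 and BP 50 can be illustrated with the following minimal sketch. It assumes a Euclidean stand-in for the metric $d$, and the convention for which side of a partition maps to 1 is arbitrary; all function names are hypothetical. It covers the bit definitions, the balance check, and the correlation matrix $M$, but not the selection heuristic of [18].

```python
import numpy as np

def euclidean(a, b):
    # Stand-in metric; any metric function d(o1, o2) could be used instead.
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def ghp_bit(o, p1, p2, d=euclidean):
    """GHP: the bit encodes which of the two pivots is closer to o."""
    return int(d(o, p1) > d(o, p2))

def bp_bit(o, pivot, radius, d=euclidean):
    """BP: the bit encodes whether o lies outside the ball (pivot, radius)."""
    return int(d(o, pivot) > radius)

def balanced_radius(sample, pivot, d=euclidean):
    """Radius dividing the sample set T into halves: the median distance."""
    return float(np.median([d(o, pivot) for o in sample]))

def bit_balance(bits):
    """Fraction of ones per bit; 0.5 means a perfectly balanced bit."""
    return np.asarray(bits).mean(axis=0)

def abs_bit_correlations(bits):
    """Matrix M of absolute Pearson correlations between all pairs of bits."""
    return np.abs(np.corrcoef(np.asarray(bits, dtype=float), rowvar=False))

rng = np.random.default_rng(0)
T = rng.normal(size=(200, 8))      # toy sample set T
pivots = T[:4]                     # toy pivots taken from the sample
radius = balanced_radius(T, pivots[0])
bp_bits = np.array([[bp_bit(o, pivots[0], radius)] for o in T])
ghp_bits = np.array([[ghp_bit(o, pivots[0], pivots[1]),
                      ghp_bit(o, pivots[2], pivots[3])] for o in T])
M = abs_bit_correlations(ghp_bits)
```

The BP bit is balanced by construction because the radius is the median distance, whereas a GHP bit has to be checked with `bit_balance` and discarded if it falls outside the accepted range.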
Sketching techniques applicable to generic metric spaces, e.g., GHP 50 and BP 50, are usually of a worse quality than vector-based sketching techniques when dealing with vector spaces [4,17]. Moreover, they require expensive learning of the transformation. We propose the sketching technique NSP 50 to provide a better trade-off between the quality of the space approximation, the applicability of the sketching, and the pre-processing cost.
2.2 The n-Simplex projection
The n-Simplex projection [9] associated with a set of $n$ pivots $P_n$ is a space transformation $\phi_{P_n}: (D, d) \to (\mathbb{R}^n, \ell_2)$ that maps the original metric space to an $n$-dimensional Euclidean space. It can be applied to any metric space with the n-point property, which states that any $n$ points $o_1, \dots, o_n$ of the space can be isometrically embedded in the $(n-1)$-dimensional Euclidean space. Many commonly used metric spaces, such as Euclidean spaces of any dimension, spaces with the Triangular or Jensen-Shannon distances, and, more generally, any Hilbert-embeddable spaces, meet the n-point property [7]. The n-Simplex projection is described in detail in [9]. Here, we sketch just the main concepts.
First, the n-point property guarantees that there exists an isometric embedding of the $n$ pivots into the $(\mathbb{R}^{n-1}, \ell_2)$ space, i.e., it is possible to construct vertices $v_{p_i} \in \mathbb{R}^{n-1}$ such that $\ell_2(v_{p_i}, v_{p_j}) = d(p_i, p_j)$ for all $i, j \in \{1, \dots, n\}$. These vertices form the so-called base simplex. Second, for any other object $o \in D$, the $(n+1)$-point property guarantees that there exists a vertex $v_o \in \mathbb{R}^n$ such that $\ell_2(v_o, v_{p_i}) = d(o, p_i)$ for all $i = 1, \dots, n$. The n-Simplex projection assigns such a $v_o$ to $o$, and Connor et al. [9] provide an iterative algorithm to compute the coordinates of the vertices $v_{p_i}$ of the base simplex as well as the coordinates of the vector $v_o$ associated with $o \in D$. The base simplex is computed once and reused to project all data objects $o \in S$. Moreover, the Euclidean distance between any two projected vectors $v_{o_1}, v_{o_2} \in \mathbb{R}^n$ is a lower bound of their actual distance, and this bound becomes tighter with an increasing number of pivots $n$ [9].
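
The two steps above can be illustrated numerically. The sketch below is not the iterative algorithm of Connor et al. [9]; as a swapped-in alternative, it builds the base simplex from a Cholesky factorisation of the pivots' Gram matrix and obtains each apex by solving a small linear system. It assumes non-degenerate pivots (otherwise the factorisation fails) and uses Euclidean toy data; all names are hypothetical.

```python
import numpy as np

def base_simplex(pivot_dists):
    """Vertices of the base simplex in R^(n-1) from the (n, n) pivot distances.

    The first pivot is placed at the origin; the rows of the Cholesky factor
    of the Gram matrix are the remaining vertices (lower-triangular layout).
    """
    sq = np.asarray(pivot_dists, dtype=float) ** 2
    # Gram matrix of v_i - v_1 for i = 2..n, from the law of cosines.
    gram = 0.5 * (sq[1:, :1] + sq[:1, 1:] - sq[1:, 1:])
    lower = np.linalg.cholesky(gram)        # fails for degenerate pivot sets
    return np.vstack([np.zeros(lower.shape[1]), lower])

def project(obj_dists, vertices):
    """Map an object, given its distances to the n pivots, to a vector in R^n."""
    sq = np.asarray(obj_dists, dtype=float) ** 2
    lower = vertices[1:]                    # vertices v_2..v_n
    norms = (lower ** 2).sum(axis=1)        # squared norms of v_2..v_n
    rhs = 0.5 * (sq[0] + norms - sq[1:])    # required dot products x . v_i
    head = np.linalg.solve(lower, rhs)      # first n-1 coordinates
    altitude_sq = sq[0] - head @ head       # squared last coordinate
    return np.append(head, np.sqrt(max(altitude_sq, 0.0)))

def dist(a, b):
    # Euclidean toy metric; it has the n-point property.
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 10))
pivots = data[:5]                           # n = 5 pivots
pivot_dists = np.array([[dist(a, b) for b in pivots] for a in pivots])
vertices = base_simplex(pivot_dists)

def phi(o):
    return project([dist(o, p) for p in pivots], vertices)

v1, v2 = phi(data[10]), phi(data[20])       # two projected objects in R^5
```

In this toy setting, the distances from `v1` to the zero-padded base vertices equal the original object-pivot distances, and the Euclidean distance between `v1` and `v2` does not exceed the original distance of the two objects.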
3 The n-Simplex Sketching: Proposal & Comparison
We propose the sketching technique NSP 50 that transforms metric spaces with the n-point property to the Hamming space. It uses the n-Simplex projection with $\lambda$ pivots to project objects into a $\lambda$-dimensional Euclidean space; the obtained vectors are then randomly rotated and binarized using the median values in each coordinate. These medians are evaluated on a data sample set. The random rotation is applied to distribute information equally over the vectors, as the n-Simplex projection returns vectors with decreasing values along the dimensions.
For each data set $S$, there exists a finite number of pivots $\tilde{n}$ such that $\phi_{P_{\tilde{n}}}$ is an isometric space embedding³. The identification of the minimum $\tilde{n}$ with this property is still an open problem. The convergence is achieved when all the projected data points have a zero value in their last component, so the NSP 50 technique as described above cannot produce meaningful sketches of length $\lambda > \tilde{n}$. We overcome this issue by concatenating smaller sketches obtained using different rotation matrices.
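
The rotation, the median thresholding, and the concatenation of smaller sketches can be sketched as follows. The projected vectors are simulated here by random data with decaying magnitudes, which is an assumption standing in for real n-Simplex output, and the random orthogonal matrix is drawn via a QR decomposition, which is one common choice rather than a detail given in the paper. All names are hypothetical.

```python
import numpy as np

def random_rotation(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
    # Fixing the column signs makes the distribution uniform; the matrix is
    # orthogonal and may include a reflection.
    return q * np.sign(np.diag(r))

def learn_binarizer(sample: np.ndarray, rng: np.random.Generator):
    """Learn one rotation and the per-coordinate medians on a sample set T."""
    rotation = random_rotation(sample.shape[1], rng)
    medians = np.median(sample @ rotation, axis=0)
    return rotation, medians

def sketch(vectors: np.ndarray, binarizers) -> np.ndarray:
    """Concatenate the bits produced by each (rotation, medians) pair."""
    parts = [(vectors @ rot > med).astype(np.uint8) for rot, med in binarizers]
    return np.hstack(parts)

rng = np.random.default_rng(2)
# Stand-in for projected data: 1000 vectors in R^16 with decaying magnitudes.
projected = rng.normal(size=(1000, 16)) * np.linspace(4.0, 0.25, 16)
# Two independent rotations give 2 x 16 = 32-bit sketches from 16 dimensions.
binarizers = [learn_binarizer(projected, rng) for _ in range(2)]
sketches = sketch(projected, binarizers)
```

Thresholding at the median is what yields balanced bits; here the medians are learned on the same vectors that are sketched, so every bit is set in exactly half of the sketches.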
The proposed NSP 50 technique is inspired by the PCA 50 approach, but provides significantly broader applicability, as it can transform all metric spaces with the n-point property. This includes spaces with very expensive distance functions, as mentioned in Section 2.2. Sketching techniques also differ significantly in the complexity of the transformation learning they require. We compare the novel NSP 50 technique with the GHP 50, BP 50, and PCA 50 approaches, and we provide a table summarising the main features of these sketching techniques, including the costs of the learning and of the object-to-sketch transformation in terms of floating point operations and distance computations. Due to the paper length limitation, this table is provided online⁴.
³ The proof is trivial: it suffices to select all objects of the data set $S$ as pivots.
Fig. 1: Distance densities for the (a) DeCAF, (b) SIFT, and (c) SQFD data sets
The GHP 50 and BP 50 techniques require expensive pivot learning. Specifically, GHP 50 requires (1) examining the balance of the GHPs defined by various pivot pairs to create long sketches with balanced bits, (2) an analysis of the pairwise bit correlations of these sketches, and (3) a selection of low-correlated bits. The learning of BP 50 is cheaper, since the proper radii are selected for a set of pivots directly. The rest of the learning is the same as in the case of GHP 50. The cost of the PCA 50 learning is given by the PCA learning cost and the evaluation of the medians over the transformed vectors. We compute the PCA matrix using the Singular Value Decomposition (SVD) over the centred data. The learning of NSP 50 is the cheapest one; it consists of the n-Simplex projection, which has a quadratic cost with respect to the number of pivots $n$, and the binarization, which consists of the evaluation of the medians over the coordinates of the vectors in the sample set $T$.
4 Experiments
We evaluate the search quality of the NSP 50 technique and compare it with the sketching techniques PCA 50, GHP 50, and BP 50. We use three real-life data sets of visual features extracted from images:
SQFD: 1 million adaptive-binning feature histograms [2] extracted from the Profiset collection⁵. Each signature consists of, on average, 60 cluster centroids in a 7-dimensional space. A weight is associated with each cluster, and the signatures are compared by the Signature Quadratic Form Distance [2]. Note that this metric is a cheaper alternative to the Earth Mover's Distance; nevertheless, the cost of the Signature Quadratic Form Distance evaluation is quadratic with respect to the number of cluster centroids.
DeCAF: 1 million deep features extracted from the Profiset collection using the Deep Convolutional Neural Network described in [10]. Each feature is a 4,096-dimensional vector of values from the last hidden layer (fc7) of the neural network. The deep features use the ReLU activation function and are not $\ell_2$-normalised. These features are compared with the Euclidean distance.
SIFT: 1 million SIFT descriptors from the ANN data set⁶. Each descriptor is a 128-dimensional vector. The Euclidean distance is used for the comparison.
Fig. 2: SQFD data set: (a) quality of three sketching techniques for varying sketch lengths; (b) comparison of 128-bit sketches using various candidate set sizes
Figure 1 shows the distance densities of the particular data sets. We express the quality of the sketching techniques by the recall of the k-NN queries evaluated using a simple sketch-based filtering. More specifically, sketches are applied to select the candidate set $CandSet(q)$ for each query object $q \in D$; it consists of a fixed number of the sketches most similar to the query sketch $sk(q)$. Then, the candidate set is refined by the distance $d(q, o), o \in CandSet(q)$ to return the $k$ objects $o$ most similar to $q$ among those with sketches in the candidate set $CandSet(q)$. This approximate answer is compared with the precise one, which consists of the $k$ closest objects $o \in S$ to $q$. The candidate sets consist of 2,000 sketches in the case of the DeCAF and SIFT data sets, and 1,000 sketches in the case of the SQFD data set.
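
The filter-and-refine evaluation described above can be sketched as follows. The data, the Euclidean stand-in for $d$, and the sign-bit sketches are toy assumptions chosen only to keep the example self-contained; they are not the NSP 50 sketches or the data sets used in the experiments, and the function names are hypothetical.

```python
import numpy as np

def knn_exact(query, data, k):
    """Precise answer: indices of the k closest objects under the distance d."""
    dists = np.linalg.norm(data - query, axis=1)    # Euclidean stand-in for d
    return np.argsort(dists, kind="stable")[:k]

def knn_filtered(query, q_sketch, data, sketches, k, cand_size):
    """Select CandSet(q) by Hamming distance, then refine it by the distance d."""
    hamming = np.count_nonzero(sketches != q_sketch, axis=1)
    cand = np.argsort(hamming, kind="stable")[:cand_size]   # CandSet(q)
    dists = np.linalg.norm(data[cand] - query, axis=1)      # refinement step
    return cand[np.argsort(dists, kind="stable")[:k]]

def recall(approx, exact):
    """Fraction of the precise answer contained in the approximate answer."""
    return len(set(approx.tolist()) & set(exact.tolist())) / len(exact)

rng = np.random.default_rng(3)
data = rng.normal(size=(2000, 32))
sketches = (data > 0).astype(np.uint8)    # toy sign-bit sketches (placeholder)
query = rng.normal(size=32)
q_sketch = (query > 0).astype(np.uint8)

exact = knn_exact(query, data, k=10)
approx = knn_filtered(query, q_sketch, data, sketches, k=10, cand_size=200)
r = recall(approx, exact)
# With the whole data set as the candidate set, the answer becomes precise.
full = knn_filtered(query, q_sketch, data, sketches, k=10, cand_size=len(data))
```

The candidate set size trades recall against the number of evaluations of $d$: only `cand_size` original distances are computed per query instead of one per data object.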
We evaluate experiments using 1,000 randomly selected query objects $q \in D$, and we depict the results by Tukey box plots to show the distributions of the recall values for particular query objects: the lower and upper bounds of the box show the quartiles, and the lines inside the boxes depict the medians of the recall values. The ends of the whiskers represent the minimum and the maximum non-outliers, and dots show the outlying recall values. In all cases, we examine 100 nearest neighbours queries to properly investigate the variance of the recall values over particular query objects. We use sketches of lengths $\lambda \in \{64, 128, 196, 256\}$.
Results. Figure 2a shows the results for the SQFD data set. The colours of the box plots distinguish particular sketching techniques, and the suffix of the column names denotes the length of the sketches. The proposed NSP 50 technique significantly outperforms both the GHP 50 and BP 50 techniques for each fixed sketch length. The PCA 50 approach is not applicable to this data set, as the searched space is not Euclidean. The BP 50 technique performs worst and provides a median recall of just 0.67 in the case of 256-bit sketches. The NSP 50 and GHP 50 approaches achieve a solid median recall of 0.88 and 0.81, respectively, even in the case of 192-bit sketches. We also show the coherence of the results when varying the candidate set size. Figure 2b reports the recalls for the candidate set sizes $c \in \{100, 500, 1000, 2000, 3000, 4000\}$ and sketches of length 128 bits made by the sketching techniques NSP 50 and GHP 50. This figure shows that a given recall value can be achieved by the NSP 50 technique using a smaller candidate set than in the case of GHP 50.
Fig. 3: Quality of sketching techniques for varying sketch lengths: (a) DeCAF data set, (b) SIFT data set
The recall values for the DeCAF and SIFT data sets are depicted in Figure 3. The BP 50 technique is less robust with respect to the dimensionality of the data, so it achieves poor recall in the case of the DeCAF descriptors, but it is still reasonable for the SIFT data set. The quality of the newly proposed NSP 50 technique is slightly better than that of the GHP 50 technique in the case of the DeCAF data set. Both are, however, outperformed by the PCA 50 technique, which is specialised for the Euclidean space. This interpretation is valid for all the sketch lengths $\lambda$ we have tested. The differences between the NSP 50 and PCA 50 techniques practically vanish in the case of the SIFT data set. Both these techniques achieve significantly better recall than the BP 50 and GHP 50 techniques.
5 Conclusions
We contribute to the area of metric space embeddings into the Hamming space. We propose the NSP 50 technique, which leverages the n-Simplex projection to transform metric objects into bit-string sketches. We compare the NSP 50 technique with three other state-of-the-art sketching techniques designed either for the general metric space or for the Euclidean vector space. The experiments are conducted on three real-life data sets of visual features using four different sketch lengths. We show that our technique provides the advantages of both metric-based and specialised vector-based techniques, as it provides a good trade-off between the quality of the space approximation, applicability, and transformation learning cost.
Acknowledgements. The work was partially supported by VISECH ARCO-CNR, CUP B56J17001330004, and the AI4EU project, funded by the EC (H2020 - Contract n. 825619). This research was supported by ERDF "CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822).
References
1. Amato, G., Gennaro, C., Savino, P.: MI-File: Using inverted files for scalable approximate similarity search. Multimed. Tools Appl. 71(3), 1333–1362 (2014)
2. Beecks, C., Uysal, M.S., Seidl, T.: Signature quadratic form distance. In: Proceedings of the ACM-CIVR 2010. pp. 438–445. ACM (2010)
3. Blumenthal, L.M.: Theory and applications of distance geometry. Clarendon Press (1953)
4. Cao, Y., Qi, H., Zhou, W., Kato, J., Li, K., Liu, X., Gui, J.: Binary hashing for approximate nearest neighbor search on big data: A survey. IEEE Access 6, 2039–2054 (2018)
5. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of ACM-STOC 2002. ACM (2002)
6. Chávez, E., Figueroa, K., Navarro, G.: Effective proximity retrieval by ordering permutations. IEEE Trans. Pattern Anal. Mach. Intell. 30(9), 1647–1658 (2008)
7. Connor, R., Cardillo, F.A., Vadicamo, L., Rabitti, F.: Hilbert Exclusion: Improved metric search through finite isometric embeddings. ACM Trans. Inf. Syst. 35(3), 17:1–17:27 (Dec 2016)
8. Connor, R., Vadicamo, L., Cardillo, F.A., Rabitti, F.: Supermetric search. Information Systems (2018)
9. Connor, R., Vadicamo, L., Rabitti, F.: High-dimensional simplexes for supermetric search. In: Proceedings of SISAP 2017. pp. 96–109. Springer (2017)
10. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. In: Proceedings of ICML 2014. vol. 32, pp. 647–655 (2014)
11. Douze, M., Jégou, H., Perronnin, F.: Polysemous codes. In: ECCV - 14th European Conference, Netherlands, 2016, Proceedings, Part II. pp. 785–801 (2016)
12. Gong, Y., Lazebnik, S., Gordo, A., Perronnin, F.: Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2916–2929 (2013)
13. Gordo, A., Perronnin, F., Gong, Y., Lazebnik, S.: Asymmetric distances for binary embeddings. IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 33–47 (2014)
14. Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of ACM STOC 1998. pp. 604–613 (1998)
15. Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: Proceedings of CVPR 2010. pp. 3304–3311. IEEE (2010)
16. Kruskal, J.B.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1), 1–27 (Mar 1964)
17. Mic, V., Novak, D., Vadicamo, L., Zezula, P.: Selecting sketches for similarity search. In: Proceedings of ADBIS 2018. pp. 127–141 (2018)
18. Mic, V., Novak, D., Zezula, P.: Designing sketches for similarity filtering. In: Proceedings of IEEE ICDM Workshops. pp. 655–662 (Dec 2016)
19. Mic, V., Novak, D., Zezula, P.: Binary sketches for secondary filtering. ACM Trans. Inf. Syst. 37(1), 1:1–1:28 (Dec 2018)
20. Novak, D., Zezula, P.: PPP-codes for large-scale similarity searching. In: TLDKS XXIV. pp. 61–87. Springer (2016)
21. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity search: the metric space approach, vol. 32. Springer Science & Business Media (2006)