Using Apache Lucene to Search Vector of Locally Aggregated Descriptors
Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Lucia Vadicamo
ISTI-CNR, Via G. Moruzzi,1, Pisa, Italy
Keywords: Bag of Features, Bag of Words, Local Features, Compact Codes, Image Retrieval, Vector of Locally
Aggregated Descriptors
Abstract: Surrogate Text Representation (STR) is an effective solution for efficient similarity search in metric spaces using conventional text search engines, such as Apache Lucene. The technique compares permutations of a set of reference objects in place of the original metric distance. However, the Achilles heel of the STR approach is the need to reorder the result set of the search according to the metric distance. This forces the use of a support database to store the original objects, which requires efficient random I/O on fast secondary memory (such as flash-based storage). In this paper, we propose an extension of the Surrogate Text Representation that specifically addresses a class of visual metric objects known as Vectors of Locally Aggregated Descriptors (VLAD). The approach represents the individual sub-vectors forming the VLAD vector with the STR, which provides a finer representation of the vector and allows us to dispense with the reordering phase. Experiments on a publicly available dataset show that the extended STR outperforms the baseline STR and achieves satisfactory performance, close to that obtained with the original VLAD vectors.
1 Introduction
Multimedia information retrieval on a large-scale database must address effectiveness and efficiency at the same time. Search results should be pertinent to the submitted queries and should be obtained quickly, even in the presence of very large multimedia archives and a simultaneous query load.
Vectors of Locally Aggregated Descriptors (VLAD) (Jégou et al., 2012) were recently proposed as a way of producing a compact representation of local visual descriptors, such as SIFT (Lowe, 2004), while still retaining a high level of accuracy. In fact, experiments demonstrated that VLAD accuracy is higher than that of Bag of Words (BoW) (Sivic and Zisserman, 2003). The advantage of the BoW representation is that it is very sparse, which allows inverted files to be used to achieve high efficiency as well. The VLAD representation is not sparse, so general indexing methods for similarity searching (Zezula et al., 2006) must be used, which are typically less efficient than inverted files.
One of the best performing generic methods for similarity searching is the use of permutation-based indexes (Chavez et al., 2008; Amato et al., 2014b). Permutation-based indexes rely on the assumption that two objects that are very similar “see” the space around them in a similar way. This assumption is exploited by representing each object as the ordering of a fixed set of reference objects (or pivots), sorted by their distance from the object itself. If two objects are very similar, the two corresponding orderings of the reference objects will be similar as well.
However, measuring the similarity between objects using the similarity between permutations is a coarse approximation. In fact, to achieve high accuracy as well, the similarity between permutations is used only to identify an appropriate set of candidates, which is then reordered according to the original similarity function to obtain the final result. This reordering phase contributes to the overall search cost.
Given that objects are represented as orderings (permutations) of reference objects, permutation-based indexes offer the possibility of using inverted files in every similarity search problem where the distance function is a metric. In fact, (Gennaro et al., 2010) presents an approach in which the Lucene text search engine is used to index and retrieve objects by similarity. The technique is based on an encoding of the permutations by means of a Surrogate Text Representation (STR). In this respect, VLAD can be easily indexed using this technique, as discussed in (Amato et al., 2014a), so that efficient and effective image search engines can be built on top of a standard text search engine.
In this paper, we propose an advancement of this basic technique that exploits the internal structure of VLAD. Specifically, the STR technique is applied independently to portions of the entire VLAD. This leads to both higher efficiency and higher accuracy, without the need to reorder the set of candidates as described above. The final result is obtained by directly using the similarity between the permutations (the textual representation), which saves both time in the search algorithm and space, since the original VLAD vectors no longer need to be stored.
The paper is organized as follows. Section 2 surveys related work. Section 3 provides a brief introduction to the VLAD approach. Section 4 introduces the proposed approach. Section 5 discusses the validation tests. Section 6 concludes.
2 Related Work
In the last two decades, breakthroughs in the field of image retrieval have been mainly based on the use of local features. Local features, such as SIFT (Lowe, 2004) and SURF (Bay et al., 2006), are visual descriptors of selected interest points of an image. Their use allows one to effectively match local structures between images. However, the cost of comparing local features limits their use at large scale, since each image is typically represented by thousands of local descriptors. Therefore, various methods for the aggregation of local features have been proposed.
One of the most popular aggregation methods is Bag-of-Words (BoW), initially proposed in (Sivic and Zisserman, 2003; Csurka et al., 2004) for matching objects in videos. BoW uses a visual vocabulary to quantize the local descriptors extracted from images; each image is then represented by a histogram of occurrences of visual words. The BoW approach used in computer vision is very similar to the BoW used in natural language processing and information retrieval (Salton and McGill, 1986), thus many text indexing techniques, such as inverted files (Witten et al., 1999), have been applied to image search. From the very beginning (Sivic and Zisserman, 2003), word reduction techniques have been used and images have been ranked using the standard term frequency-inverse document frequency (tf-idf) (Salton and McGill, 1986) weighting. In order to improve the efficiency of BoW, several approaches for the reduction of visual words have been investigated (Thomee et al., 2010; Amato et al., 2013b). Search results obtained using BoW in CBIR (Content Based Image Retrieval) have also been improved by exploiting additional geometrical information (Philbin et al., 2007; Perd’och et al., 2009; Tolias and Avrithis, 2011; Zhao et al., 2013) and applying re-ranking approaches (Philbin et al., 2007; Jégou et al., 2008; Chum et al., 2007; Tolias and Jégou, 2013). The baseline BoW encoding is affected by the loss of information about the original descriptors due to the quantization process. For example, corresponding descriptors in two images may be assigned to different visual words. To overcome the quantization loss, more accurate representations of the original descriptors and alternative encoding techniques have been used, such as Hamming Embedding (Jégou et al., 2008; Jégou et al., 2010), soft-assignment (Philbin et al., 2008; Van Gemert et al., 2008; Van Gemert et al., 2010), multiple assignment (Jégou et al., 2010; Jégou et al., 2010b), locality-constrained linear coding (Wang et al., 2010), sparse coding (Yang et al., 2009; Boureau et al., 2010) and the use of spatial pyramids (Lazebnik et al., 2006).
Recently, other aggregation schemes, such as the Fisher Vector (FV) (Perronnin and Dance, 2007; Jaakkola and Haussler, 1998) and the Vector of Locally Aggregated Descriptors (VLAD) (Jégou et al., 2010a), have attracted much attention because of their effectiveness in both image classification and large-scale image search. Both FV and VLAD use some statistics about the distribution of the local descriptors in order to transform an incoming set of descriptors into a fixed-size vector representation.
The basic idea of FV is to characterize how a sample of descriptors deviates from an average distribution that is modeled by a parametric generative model. The Gaussian Mixture Model (GMM) (McLachlan and Peel, 2000), estimated on a training set, is typically used as the generative model and can be understood as a “probabilistic visual vocabulary”.
BoW counts the occurrences of visual words and so takes into account just 0-order statistics. The VLAD approach, similarly to BoW, uses a visual vocabulary to quantize the local descriptors of an image. The visual vocabulary is learned using a clustering algorithm, for example k-means. Compared to BoW, VLAD exploits more aspects of the distribution of the descriptors assigned to a visual word. In fact, VLAD encodes the accumulated difference between the visual words and the associated descriptors, rather than just the number of descriptors assigned to each visual word. As a common post-processing step, VLAD is power and L2 normalized (Jégou et al., 2012; Perronnin et al., 2010). Furthermore, PCA dimensionality reduction and product quantization have been applied, and several enhancements to the basic VLAD have been proposed (Arandjelovic and Zisserman, 2013; Perronnin et al., 2010; Chen et al., 2011; Delhumeau et al., 2013; Zhao et al., 2013).
In this work, we focus on VLAD, which is very similar to FV. In fact, VLAD has been shown to be a simplified non-probabilistic version of FV that performs very similarly to FV (Jégou et al., 2012). However, while BoW is a sparse vector of occurrences, VLAD is not. Thus, inverted files cannot be directly applied for indexing, and Euclidean Locality-Sensitive Hashing (Datar et al., 2004) is, as far as we know, the only technique tested with VLAD. Many other similarity search indexing techniques (Zezula et al., 2006) could be applied to VLAD. A very promising direction is Permutation-Based Indexing (Chavez et al., 2008; Amato et al., 2014b; Esuli, 2009). In particular, the MI-File allows one to use inverted files to perform similarity search with an arbitrary similarity function. Moreover, in (Gennaro et al., 2010; Amato et al., 2011) a Surrogate Text Representation (STR) derived from the MI-File has been proposed. The conversion of the image description into textual form enables us to exploit off-the-shelf search engine features with little implementation effort.
In this paper, we extend the STR approach to deal with VLAD descriptions, comparing both effectiveness and efficiency with the baseline STR approach, which has been studied in (Amato et al., 2013a). The experimentation was carried out on the same hardware and software infrastructure using the publicly available INRIA Holidays dataset (Jégou et al., 2008), and effectiveness was also compared with that of a sequential scan.
3 Vector of Locally Aggregated Descriptors (VLAD)
The VLAD representation was proposed in (Jégou et al., 2010). As for BoW, a visual vocabulary, here called a codebook, $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K\}$¹, is first learned using a clustering algorithm (e.g. k-means). Each local descriptor $\mathbf{x}_t$ is then associated with its nearest visual word (or codeword) $NN(\mathbf{x}_t)$ in the codebook. For each codeword $\boldsymbol{\mu}_i$, the differences between the descriptors $\mathbf{x}_t$ assigned to $\boldsymbol{\mu}_i$ and the codeword itself are accumulated:

$$\mathbf{v}_i = \sum_{\mathbf{x}_t : NN(\mathbf{x}_t) = \boldsymbol{\mu}_i} (\mathbf{x}_t - \boldsymbol{\mu}_i).$$

The VLAD is the concatenation of the accumulated sub-vectors, i.e. $V = (\mathbf{v}_1, \ldots, \mathbf{v}_K)$. Throughout the paper, we refer to the accumulated sub-vectors $\mathbf{v}_i$ simply as “sub-vectors”.
¹ Throughout the paper bold letters denote row vectors.
Two normalizations are performed: first, a power normalization with power 0.5; second, an L2 normalization. After this process, two descriptions can be compared using the inner product.
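The aggregation and the two normalization steps described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used in the experiments: the function name `vlad`, the brute-force nearest-codeword search, and the signed form of the power normalization (sign(v)·|v|^0.5, chosen because accumulated differences can be negative) are assumptions.

```python
import numpy as np

def vlad(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Aggregate local descriptors (T x D) against a codebook (K x D)."""
    # Squared Euclidean distance from every descriptor to every codeword.
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)  # index of NN(x_t) for each descriptor

    K, D = codebook.shape
    sub_vectors = np.zeros((K, D))
    for i in range(K):
        assigned = descriptors[nearest == i]
        if len(assigned):
            # v_i = sum of (x_t - mu_i) over the descriptors assigned to mu_i
            sub_vectors[i] = (assigned - codebook[i]).sum(axis=0)

    v = sub_vectors.ravel()               # concatenation V = (v_1, ..., v_K)
    v = np.sign(v) * np.sqrt(np.abs(v))   # power normalization (assumed signed), power 0.5
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v    # L2 normalization
```

With a two-word codebook `[[0, 0], [10, 10]]` and descriptors `[[1, 0], [0, 1], [9, 10]]`, the two sub-vectors before normalization are `(1, 1)` and `(-1, 0)`.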
The observation that the descriptors are relatively sparse and very structured suggests performing a principal component analysis (PCA) to reduce the dimensionality of the VLAD. In this work, we decided not to use dimensionality reduction techniques, because we will show that our space transformation approach is independent of the original dimensionality of the description. In fact, the STR approach that we propose transforms the VLAD description into a set of words from a vocabulary that is independent of the original VLAD dimensionality.
4 Surrogate Text Representation for VLAD Vectors
In this paper, we propose to index VLAD using a text encoding that allows any text retrieval engine to be used to perform image similarity search. As discussed later, we implemented this idea on top of the Lucene text retrieval engine².
To this end, we extend the permutation-based approach developed by Chavez et al. (Chavez et al., 2008) to deal with the internal representation of the VLAD vectors. In this section, we first introduce the basic principle of the permutation-based approach and then describe the generalization to VLAD vectors.
4.1 Baseline Permutation-based Approach and Surrogate Text
The key idea of the permutation-based approach relies on the observation that if two objects are near one another, they have a similar view of the objects around them. This means that the orderings (permutations) of the surrounding objects, according to their distances from the two objects, should be similar as well.
Let $\mathcal{D}$ be a domain of objects (features, points, etc.), and $d : \mathcal{D} \times \mathcal{D} \rightarrow \mathbb{R}$ a distance function able to assess the dissimilarity between two objects of $\mathcal{D}$. Let $R \subset \mathcal{D}$ be a set of $m$ distinct objects (reference objects), i.e., $R = \{r_1, \ldots, r_m\}$. Given any object $o \in \mathcal{D}$, we denote the vector of rank positions of the reference objects, ordered by increasing distance from $o$, as $\mathbf{p}(o) = (p_1(o), \ldots, p_m(o))$. For instance, if $p_3(o) = 2$, then $r_3$ is the 2nd nearest object to $o$ among those in $R$. The essence of the permutation-based approach is to perform similarity search using distances between permutations in place of the original distance between objects. As discussed in the following, this has the advantage that a standard text retrieval engine can be used to execute the similarity search.
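As a minimal sketch of the definition above, the rank vector of an object can be computed by sorting the reference objects by distance and inverting the resulting ordering. The function name `rank_positions` and the tie-breaking rule (stable sort, so ties are broken by reference index) are assumptions made for illustration only.

```python
import numpy as np

def rank_positions(o, refs, dist):
    """Return p(o): entry i is the 1-based rank of reference i by distance from o."""
    d = np.array([dist(o, r) for r in refs])
    order = np.argsort(d, kind="stable")       # reference indices, nearest first
    p = np.empty(len(refs), dtype=int)
    p[order] = np.arange(1, len(refs) + 1)     # invert the ordering into rank positions
    return p

# Toy usage on the real line: the third reference (4.0) is the nearest to o = 3.0,
# so its rank is 1; the first reference (0.0) has rank 2 and the second (10.0) rank 3.
ranks = rank_positions(3.0, [0.0, 10.0, 4.0], lambda a, b: abs(a - b))
```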
There are several standard methods for comparing two ordered lists, such as Kendall’s tau distance, the Spearman Footrule distance, and the Spearman Rho distance. In this paper, we concentrate on the latter, which is also used in (Chavez et al., 2008). The reason for this choice (explained later on) is tied to the way standard search engines compute the similarity between documents and queries.
In particular, we exploit a generalization of the Spearman Rho distance that allows us to compare two top-$k$ ranked lists. A top-$k$ list is a particular case of a partial ranked list, which is a list that contains rankings for only a subset of items. For top-$k$ lists, we can use a generalization of the Spearman Rho distance $\tilde{d}(o, q)$, called location parameter distance (Fagin et al., 2003), which assigns rank $k+1$ to all items of the list that have rank greater than $k$.
In particular, let $k$ be an integer less than or equal to $m$, and $\mathbf{p}^k(o) = (p^k_1(o), \ldots, p^k_m(o))$ the vector defined as follows:

$$p^k_i(o) = \begin{cases} p_i(o) & \text{if } p_i(o) \le k \\ k+1 & \text{if } p_i(o) > k. \end{cases} \qquad (1)$$
Given two top-$k$ ranked lists with $k = k_q$ and $k = k_x$, we define the approximate distance function $\tilde{d}(o, q)$ as follows:

$$\tilde{d}(o, q) = \| \mathbf{p}^{k_x}(o) - \mathbf{p}^{k_q}(q) \|_2, \qquad (2)$$

where $k_q$ is used for queries and $k_x$ for indexing. The reason for using two different values of $k$ is that the performance of inverted files is optimal when the size of the queries is much smaller than the size of the documents. Therefore, we will typically require that $k_q \le k_x$.
Since the square root in Eq. (2) is monotonic, it does not affect the ordering (Fagin et al., 2003), so we can safely use $\tilde{d}(o, q)^2$ instead of its square root:

$$\tilde{d}(o, q)^2 = \| \mathbf{p}^{k_x}(o) \|^2 + \| \mathbf{p}^{k_q}(q) \|^2 - 2\, \mathbf{p}^{k_x}(o) \cdot \mathbf{p}^{k_q}(q). \qquad (3)$$
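The truncation of Eq. (1) and the squared form of the distance in Eq. (2) can be illustrated as follows. This is a sketch under stated assumptions: the function names `truncate` and `approx_dist_sq` are hypothetical, and the input rank vectors are assumed to be full permutations of 1..m.

```python
import numpy as np

def truncate(p, k):
    """p^k: keep ranks <= k and map every rank greater than k to k + 1."""
    p = np.asarray(p)
    return np.where(p <= k, p, k + 1)

def approx_dist_sq(p_o, p_q, k_x, k_q):
    """Squared distance between the truncated rank vectors of an object and a query."""
    diff = truncate(p_o, k_x) - truncate(p_q, k_q)
    return int((diff ** 2).sum())

# Toy usage with m = 5 reference objects, k_x = 3 for the indexed object, k_q = 2 for the query.
p_o = [3, 2, 5, 4, 1]   # truncated with k_x = 3 -> [3, 2, 4, 4, 1]
p_q = [2, 5, 4, 3, 1]   # truncated with k_q = 2 -> [2, 3, 3, 3, 1]
d2 = approx_dist_sq(p_o, p_q, k_x=3, k_q=2)   # (1)^2 + (-1)^2 + (1)^2 + (1)^2 + 0 = 4
```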
Figure 1 exemplifies the transformation process.
Figure 1a sketches a number of reference objects
(black points), objects (white points), and a query ob-
ject (gray point). Figure 1b shows the encoding of
the data objects in the transformed space. We will use
this illustration as a running example throughout the
remainder of the paper.
So far, we have presented a method for approximating the function $d$. However, our primary objective is to implement the function $\tilde{d}(o, q)$ efficiently by exploiting the built-in cosine similarity measure of standard text search engines based on the vector space model. For this purpose, we associate each element $r_i \in R$ with a unique key $\tau_i$. The set of keys $\{\tau_1, \ldots, \tau_m\}$ represents our so-called “reference-dictionary”. Then, we define a function $t^k(o)$ that returns a space-separated concatenation of zero or more repetitions of the keywords $\tau_i$, as follows:

$$t^k(o) = \bigcup_{i=1}^{m} \bigcup_{j=1}^{k+1-p^k_i(o)} \tau_i,$$

where, by abuse of notation, we denote the space-separated concatenation of keywords with the union operator $\cup$. The function $t^k(o)$ is used to generate the Surrogate Text Representation for both indexing and querying purposes; $k$ assumes, in general, the value $k_x$ for indexing and $k_q$ for querying. For instance, consider the case exemplified in Figure 1c, and let us assume $\tau_1 =$ A, $\tau_2 =$ B, etc. The function $t^k$ with $k_x = 3$ and $k_q = 2$ will generate the following outputs:

$t^{k_x}(o_1)$ = “E E E B B A”
$t^{k_x}(o_2)$ = “D D D C C E”
$t^{k_q}(q)$ = “E E A”

As can be seen intuitively, the strings corresponding to $o_1$ and $q$ are more similar to each other than those corresponding to $o_2$ and $q$, which reflects the behavior of the distance $\tilde{d}$. However, this should not mislead the reader: our proposal is not a heuristic, since the distance between the strings corresponds exactly to the distance $\tilde{d}$ between the objects, as we will prove below.
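The generation of the surrogate text can be sketched as below: each key is repeated once for every position it gains over rank k + 1, and keys ranked beyond k are omitted. The function name `surrogate_text` is hypothetical, and the ranks given to the references beyond the top k in the usage example are assumed values, since only the nearest references are determined by the strings quoted in the text.

```python
def surrogate_text(p, k, keys):
    """Build the surrogate text: keys[i] is repeated k + 1 - p[i] times when p[i] <= k."""
    # Visit the references from nearest to farthest so that the output reads like
    # the strings in the text; word order is irrelevant to a term-frequency model.
    ranked = sorted(range(len(p)), key=lambda i: p[i])
    words = []
    for i in ranked:
        if p[i] <= k:
            words.extend([keys[i]] * (k + 1 - p[i]))
    return " ".join(words)

keys = ["A", "B", "C", "D", "E"]
# Ranks of A..E as seen from o1 and q (the ranks 4 and 5 are assumed for illustration).
p_o1 = [3, 2, 4, 5, 1]
p_q = [2, 3, 4, 5, 1]
text_o1 = surrogate_text(p_o1, 3, keys)   # "E E E B B A"
text_q = surrogate_text(p_q, 2, keys)     # "E E A"
```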
As explained above, the objective now is to force a standard text search engine to compute the approximate distance function $\tilde{d}$. How this objective is achieved becomes clear from the following considerations. A text search engine will generate a vector representation of the STRs produced by $t^{k_x}(o)$ and $t^{k_q}(q)$, containing the number of occurrences of words in the texts. This is the case of the simple term-frequency weighting scheme. This means that if, for instance, the keyword $\tau_i$ corresponding to the reference object $r_i$ ($1 \le i \le m$) appears $n$ times, the $i$-th element of the vector will assume the value $n$, and whenever $\tau_i$ does not appear it will be 0. Let $\mathbf{k}_x$ and $\mathbf{k}_q$ be, respectively, the constant $m$-dimensional vectors $(k_x+1, \ldots, k_x+1)$ and $(k_q+1, \ldots, k_q+1)$; then

$$\hat{\mathbf{p}}^{k_x}(o) = \mathbf{k}_x - \mathbf{p}^{k_x}(o), \qquad \hat{\mathbf{p}}^{k_q}(q) = \mathbf{k}_q - \mathbf{p}^{k_q}(q). \qquad (4)$$

[Figure 1, panels a), b), c); in panel c) the reference objects are labelled “A” to “E” and the objects are encoded as “E E E B B A”, “D D D C C E”, and “E E A”.]
Figure 1: Example of perspective based space transformation. a) Black points are reference objects; white points are data objects; the grey point is a query. b) Encoding of the data objects in the transformed space. c) Encoding of the data objects in textual form.
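It is easy to see that the vectors corresponding to $t^{k_x}(o)$ and $t^{k_q}(q)$ are the same as $\hat{\mathbf{p}}^{k_x}(o)$ and $\hat{\mathbf{p}}^{k_q}(q)$, respectively.
The cosine similarity is typically adopted to determine the similarity between the query vector and a vector in the database of the search engine, and it is defined as:

$$sim_{cos}(o, q) = \frac{\hat{\mathbf{p}}^{k_x}(o) \cdot \hat{\mathbf{p}}^{k_q}(q)}{\| \hat{\mathbf{p}}^{k_x}(o) \| \, \| \hat{\mathbf{p}}^{k_q}(q) \|}. \qquad (5)$$

It is worth noting that $\hat{\mathbf{p}}^k$ is a permutation of the $m$-dimensional vector $(1, 2, \ldots, k, 0, \ldots, 0)$, thus its norm equals $\sqrt{k(k+1)(2k+1)/6}$. Since $k_x$ and $k_q$ are constants, the norms of the vectors $\hat{\mathbf{p}}^{k_x}$ and $\hat{\mathbf{p}}^{k_q}$ are constants too, and can therefore be neglected during the cosine evaluation (they do not affect the final ranking of the search result).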
What we now show is that $sim_{cos}$ can be used to evaluate the similarity of two objects in place of the distance $\tilde{d}$: it is possible to prove that the former is an order-reversing monotonic transformation of the latter (they are equivalent for practical purposes). This means that if we use $\tilde{d}(o, q)$ and take the first $k$ nearest objects from a dataset $X \subset \mathcal{D}$ (i.e., from the shortest distance to the longest), we obtain exactly the same objects, in the same order, as when we use $sim_{cos}(o, q)$ and take the first $k$ most similar objects (i.e., from the greatest values to the smallest).
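By substituting Eq. (4) into Eq. (5), we obtain:

$$sim_{cos}(o, q) \propto (\mathbf{k}_x - \mathbf{p}^{k_x}(o)) \cdot (\mathbf{k}_q - \mathbf{p}^{k_q}(q)) = \mathbf{k}_x \cdot \mathbf{k}_q - \mathbf{k}_x \cdot \mathbf{p}^{k_q}(q) - \mathbf{k}_q \cdot \mathbf{p}^{k_x}(o) + \mathbf{p}^{k_x}(o) \cdot \mathbf{p}^{k_q}(q). \qquad (6)$$

Since $\mathbf{p}^{k_x}(o)$ ($\mathbf{p}^{k_q}(q)$) includes all the integers from 1 to $k_x$ ($k_q$), and the remaining elements assume the value $k_x+1$ ($k_q+1$), the scalar product $\mathbf{k}_q \cdot \mathbf{p}^{k_x}(o)$ ($\mathbf{k}_x \cdot \mathbf{p}^{k_q}(q)$) is constant. We can substitute the first three terms in Eq. (6) with a constant $L(m, k_x, k_q)$, which depends only on $m$, $k_x$, and $k_q$, as follows:

$$sim_{cos}(o, q) \propto L(m, k_x, k_q) + \mathbf{p}^{k_x}(o) \cdot \mathbf{p}^{k_q}(q). \qquad (7)$$

Finally, combining Eq. (7) with Eq. (3), we obtain:

$$sim_{cos}(o, q) \propto L(m, k_x, k_q) + \frac{1}{2} \left( \| \mathbf{p}^{k_x}(o) \|^2 + \| \mathbf{p}^{k_q}(q) \|^2 - \tilde{d}(o, q)^2 \right). \qquad (8)$$

Since $\| \mathbf{p}^{k_x}(o) \|$ and $\| \mathbf{p}^{k_q}(q) \|$ depend only on the constants $m$, $k_x$, and $k_q$, Eq. (8) proves that $sim_{cos}(o, q)$ is a monotonic transformation of $\tilde{d}(o, q)^2$ of the form $sim_{cos} = \alpha - \beta\, \tilde{d}^2$.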
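The order-reversing relation can be checked numerically on random permutations. The sketch below, with the hypothetical helper `check_monotone`, computes the term-frequency inner product and the squared approximate distance for many random objects against one fixed query, and collects the distinct values of twice the inner product plus the squared distance. If the relation holds, that quantity does not depend on the object, so the set has a single element. The parameter values are arbitrary.

```python
import numpy as np

def truncate(p, k):
    """Keep ranks <= k and map every rank greater than k to k + 1."""
    return np.where(p <= k, p, k + 1)

def check_monotone(m=8, k_x=5, k_q=3, trials=200, seed=0):
    """Return the distinct values of 2 * (tf inner product) + (squared distance)."""
    rng = np.random.default_rng(seed)
    q = truncate(rng.permutation(m) + 1, k_q)        # truncated ranks of a fixed query
    values = set()
    for _ in range(trials):
        o = truncate(rng.permutation(m) + 1, k_x)    # truncated ranks of a random object
        dot = int(((k_x + 1 - o) * (k_q + 1 - q)).sum())   # inner product of term frequencies
        d2 = int(((o - q) ** 2).sum())                      # squared approximate distance
        values.add(2 * dot + d2)                            # doubled to stay in integers
    return values

distinct = check_monotone()   # a single value: the inner product is a constant minus d2 / 2
```

Because the norms of the term-frequency vectors are constant as well, sorting by decreasing cosine similarity and sorting by increasing approximate distance produce the same ranking.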
To summarize, given a distance function $d$, we were able to determine an approximate distance function $\tilde{d}$, which we transformed into a similarity measure. We proved that this similarity measure can be obtained using the STR and that, from the point of view of the result ranking, it is equivalent to $\tilde{d}$.
Note, however, that searching directly with the distance between permutations suffers from low precision. To improve effectiveness, (Amato et al., 2014b) proposes to reorder the result set according to the original distance function $d$. Suppose we are searching for the $k$ most similar descriptors (nearest neighbors) to the query. The quality of the approximation is improved by reordering, using the original distance function $d$, the first $c$ ($c \ge k$) descriptors from the approximate result set, at the cost of $c$ additional distance computations.
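The reordering step amounts to re-sorting the head of the approximate result list with the original distance. A minimal sketch follows; the function name `rerank` is hypothetical, and the candidates are assumed to be the original objects already fetched from the support database.

```python
def rerank(query, candidates, dist, c, k):
    """Re-sort the first c approximate results by the original distance; return the top k."""
    head = sorted(candidates[:c], key=lambda obj: dist(query, obj))  # c distance computations
    return head[:k]

# Toy usage on the real line: only the first c = 3 candidates are reordered, so 2.0
# is never considered even though it is closer to the query than 5.0.
top = rerank(0.0, [5.0, 1.0, 9.0, 2.0], lambda a, b: abs(a - b), c=3, k=2)
```

The example also shows why the choice of c matters: a true neighbor that the approximate search ranks beyond position c cannot be recovered by the reordering.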
4.2 Blockwise Permutation-based Approach
The idea described so far uses a textual/permutation representation of the object as a whole. However, in our particular scenario, we can exploit the fact that the VLAD vector is the result of the concatenation of sub-vectors. In short, we apply and compare the textual/permutation representation for each sub-vector $\mathbf{v}_i$ of the whole VLAD independently. We refer to this approach as the Blockwise Permutation-based approach.
As we will see, this approach has the advantage of providing a finer representation of objects, in terms of permutations, so that no reordering is needed to guarantee the quality of the search result.
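In order to decrease the complexity of the approach, and since the sub-vectors $\mathbf{v}_i$ are homogeneous, we represent them as permutations using the same set of reference objects $R = \{r_1, \ldots, r_m\}$, taken at random from the dataset of VLAD vectors. Let $\mathbf{v}_i$ be the $i$-th sub-vector of a VLAD vector $V$; we denote by $\mathbf{p}^{k_x}(\mathbf{v}_i)$ the corresponding permutation vector. Given two VLAD vectors $V = (\mathbf{v}_1, \ldots, \mathbf{v}_K)$ and $W = (\mathbf{w}_1, \ldots, \mathbf{w}_K)$, and their corresponding concatenated permutation vectors $O = (\mathbf{p}^{k_x}(\mathbf{v}_1), \ldots, \mathbf{p}^{k_x}(\mathbf{v}_K))$ and $Q = (\mathbf{p}^{k_q}(\mathbf{w}_1), \ldots, \mathbf{p}^{k_q}(\mathbf{w}_K))$, we generalize the Spearman Rho distance for two vectors $V$ and $W$ as follows:

$$\tilde{d}(V, W)^2 = \| O - Q \|_2^2 = \sum_{i=1}^{K} \tilde{d}(\mathbf{v}_i, \mathbf{w}_i)^2. \qquad (9)$$

This generalization has the advantage of being faster to compute, since it treats the concatenated permutation vector as a whole. Moreover, it does not require square roots and it can be evaluated using the cosine. Defining, in the same way as above:

$$\hat{\mathbf{p}}^{k_x}(\mathbf{v}_i) = \mathbf{k}_x - \mathbf{p}^{k_x}(\mathbf{v}_i), \qquad \hat{\mathbf{p}}^{k_q}(\mathbf{w}_i) = \mathbf{k}_q - \mathbf{p}^{k_q}(\mathbf{w}_i), \qquad (10)$$

it is possible to prove, by a procedure similar to the one shown above, that also in this case $sim_{cos}(V, W) \propto \alpha - \beta\, \tilde{d}(V, W)^2$.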
In order to correctly match the transformed blockwise vectors, we need to extend the reference dictionary to distinguish the keys produced from sub-vectors $\mathbf{v}_i$ with different subscripts $i$. Therefore, for a set of $m$ reference objects and $K$ elements in the VLAD codebook, we employ a dictionary including a set of $m \times K$ keys $\tau_{i,j}$ ($1 \le i \le m$, $1 \le j \le K$). For example, we associate, say, the set of keys A1, B1, ... to the sub-vector $\mathbf{v}_1$, A2, B2, ... to the sub-vector $\mathbf{v}_2$, and so on.
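A blockwise surrogate text can be sketched by encoding each sub-vector against the shared reference set and tagging every key with the index of the block it comes from. The function name `blockwise_text` and the use of the Euclidean distance between a sub-vector and a reference object are assumptions made for illustration; the key names follow the A1, B1, ... convention of the example above.

```python
import numpy as np

def blockwise_text(sub_vectors, refs, k, key_names):
    """Concatenate the surrogate text of each sub-vector, tagging keys with the block index."""
    words = []
    for j, v in enumerate(sub_vectors, start=1):
        d = np.linalg.norm(refs - v, axis=1)         # assumed Euclidean distance to each reference
        nearest = np.argsort(d, kind="stable")[:k]   # the k closest references, nearest first
        for rank, i in enumerate(nearest, start=1):
            # The key of reference i in block j is repeated k + 1 - rank times.
            words.extend([f"{key_names[i]}{j}"] * (k + 1 - rank))
    return " ".join(words)

refs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0]])     # shared reference objects A, B, C
blocks = [np.array([0.9, 0.0]), np.array([0.0, 4.0])]      # two sub-vectors of a toy VLAD
text = blockwise_text(blocks, refs, k=2, key_names=["A", "B", "C"])
# text == "B1 B1 A1 C2 C2 A2"
```

Since the keys of different blocks never coincide, a term-frequency match between two such texts only counts agreements between sub-vectors with the same index.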
4.3 Dealing with VLAD Ambiguities
One of the well-known problems of VLAD occurs when no local descriptor is assigned to a codeword (Peng et al., 2014). A simple approach to this problem is to produce a sub-vector of all zeros ($\mathbf{v}_i = \mathbf{0}$), but this has the disadvantage of being ambiguous, since it is identical to the case in which the mean of the local descriptors assigned to a codeword is equal to the codeword itself.
Moreover, as pointed out by (Spyromitros-Xioufis et al., 2014), given two images and the corresponding VLAD vectors $V$ and $W$, and assuming that $\mathbf{v}_i = \mathbf{0}$, the contribution of codeword $\boldsymbol{\mu}_i$ to the cosine similarity of $V$ and $W$ will be the same when either $\mathbf{w}_i = \mathbf{0}$ or $\mathbf{w}_i \ne \mathbf{0}$. Therefore, this under-estimates the importance of jointly zero components, which give some limited yet important evidence on visual similarity (Jégou and Chum, 2012). In (Jégou and Chum, 2012), this problem was treated by measuring the cosine between vectors $V$ and $W$ at a different point from the origin.
This technique, however, did not lead to significant improvements in our experiments. To tackle this problem, we simply discard the sub-vectors $\mathbf{v}_i = \mathbf{0}$ and do not transform them into text. Mathematically, this means that we assume $\hat{\mathbf{p}}^{k_x}(\mathbf{0}) = \mathbf{0}$.
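In practice, discarding the zero sub-vectors is a filtering step applied before the text transformation. A minimal sketch, with the hypothetical helper `nonzero_blocks`, assuming the VLAD vector is stored as a flat array of K equally sized sub-vectors:

```python
import numpy as np

def nonzero_blocks(vlad_vector, K):
    """Split a VLAD vector into K sub-vectors and keep only the non-zero ones."""
    blocks = np.asarray(vlad_vector, dtype=float).reshape(K, -1)
    # Each kept block is returned with its 1-based index, which selects its key set.
    return [(j, v) for j, v in enumerate(blocks, start=1) if np.any(v != 0)]

# Toy usage: the second of three sub-vectors is all zeros and produces no text.
kept = nonzero_blocks([1.0, 0.0, 0.0, 0.0, 0.0, 2.0], K=3)
kept_indices = [j for j, _ in kept]   # [1, 3]
```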
5 Experiments
5.1 Setup
INRIA Holidays (Jégou et al., 2010a; Jégou et al., 2012) is a collection of 1,491 holiday images. The authors selected 500 queries and, for each of them, a list of positive results. As in (Jégou et al., 2009; Jégou et al., 2010; Jégou et al., 2012), to evaluate the approaches on a large scale, we merged the Holidays dataset with the Flickr1M collection³. SIFT features have been extracted by Jégou et al. for both the Holidays and the Flickr1M datasets⁴.
For representing the images using the VLAD approach, we selected 64 reference features using k-means over a subset of the Flickr1M dataset.
All experiments were conducted on an Intel Core i7 CPU at 2.67 GHz with 12.0 GB of RAM and a 2 TB 7200 RPM hard disk for the Lucene index. We used Lucene v4.7 running on 64-bit Java 6.
The quality of the retrieved images is typically evaluated by means of precision and recall measures. As in many other papers (Jégou et al., 2009; Perronnin et al., 2010; Jégou et al., 2012), we combined this information by means of the mean Average Precision (mAP), which represents the area below the precision-recall curve.
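For reference, a common non-interpolated way of computing the mAP is sketched below. It is an assumption that this matches the evaluation protocol used for the experiments, which may differ in details; the function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@i over the ranks i at which a relevant item is retrieved."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for i, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / i          # precision at the rank of this relevant item
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy usage: relevant items found at ranks 1 and 3 give AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision(["a", "x", "b"], ["a", "b"])
```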
5.2 Results
In a first experimental analysis, we compared the performance of the blockwise approach with that of the baseline approach (with and without reordering), which treats the VLAD vectors as whole objects and was studied in (Amato et al., 2014a). In the latter approach, as explained in Section 4, since the performance was low, we had to reorder the best results using the actual distance between the VLAD descriptors. With this experiment, we want to show that with the blockwise approach this phase is no longer necessary, and the search is based only on the result provided by the Lucene text search engine. For the baseline approach we used $m$ = 4,000 reference objects, while for the blockwise approach we used 20,000. In both cases we set $k_x = 50$, which, we recall, is the number of closest reference objects used during indexing.
Figure 2 shows this comparison in terms of mAP. We refer to the baseline approach as STR, to the baseline approach with reordering as rSTR, and to the blockwise approach as BSTR. For the rSTR approach, we reordered the first 1,000 objects of the result set. The horizontal line at the top represents the performance obtained by matching the original VLAD descriptors with the inner product, performing a sequential scan of the dataset, which exhibits a mAP of 0.55. The graph in the middle shows the mAP of our approach (BSTR) versus the baseline approach without reordering (STR) and with reordering (rSTR). The graphs also show how the performance changes when varying $k_q$ (the number of closest reference objects for the query) from 10 to 50.
An interesting by-product of the experiment is that we obtain a small improvement of the mAP for the BSTR approach when the number of reference objects used for the query is 20.
[Plot: legend entries recovered from the figure: Exact (inner product), BSTR tfidf.]
Figure 2: Effectiveness (mAP) of the various approaches for the INRIA Holidays dataset, using $k_x = 50$ for STR, rSTR, BSTR, and BSTR tfidf (higher values mean better results).
A quite intuitive way of generalizing the idea of reducing the size of the query is to exploit the tf*idf (i.e., term frequency * inverse document frequency) statistic of the BSTR textual representation. Instead of simply reducing the $k_q$ of the query, i.e., the top-$k_q$ elements nearest to the query, we can retain the elements that exhibit the greatest values of tf*idf, starting from the document generated with $k_q = 50$, and eliminate the others. Therefore, we take, for instance, the first 40 elements that have the best tf*idf, the first 30 elements, and so on. Figure 2 shows the performance of this approach under the name ‘BSTR tfidf’. It is interesting to note not only that the mAP improves considerably as the queries are reduced further, but also that this approach outperforms the inner product on the original VLAD dataset.
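The query reduction can be sketched as a selection of the distinct terms with the highest tf*idf score. Everything specific in the sketch is an assumption: the function name `reduce_query`, the plain log(N/df) form of the idf (which may differ from the weighting a given search engine applies), and the choice to apply the selection to the whole text passed in rather than block by block.

```python
import math
from collections import Counter

def reduce_query(query_text, doc_freq, n_docs, keep):
    """Keep the `keep` distinct terms with the highest tf*idf, preserving their repetitions."""
    tf = Counter(query_text.split())
    # Assumed idf form: log(N / df); unseen terms are given df = 1 to avoid division by zero.
    score = {t: f * math.log(n_docs / max(doc_freq.get(t, 0), 1)) for t, f in tf.items()}
    kept = sorted(tf, key=lambda t: (-score[t], t))[:keep]   # ties broken alphabetically
    return " ".join(" ".join([t] * tf[t]) for t in kept)

# Toy usage: E1 occurs in every document, so its idf is zero and it is dropped first.
reduced = reduce_query("E1 E1 E1 B1 B1 A1", {"E1": 100, "B1": 10, "A1": 50}, n_docs=100, keep=2)
# reduced == "B1 B1 A1"
```

The toy example illustrates the intuition behind the selection: a key that appears in most indexed documents contributes little to discriminating between them, even when it corresponds to the nearest reference object.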
In order to ascertain the soundness of the proposed approach, we tested it on the larger and more challenging Flickr1M dataset.
The results are shown in Figure 3. We can see that BSTR tfidf is still the winner in terms of mAP. However, in this case all the techniques exhibit lower performance than the inner product on the original VLAD dataset. The latter test is performed as a sequential scan of the entire dataset, obtaining a mAP of 0.34. The results presented in this figure also show the performance of the approach called BSTR tfidf², which consists in applying the tf*idf reduction of the blockwise textual representation to the indexed documents as well (in addition to the queries), setting $k_x = k_q$ for all the experiments. The mAP values in this case are slightly lower than those of BSTR tfidf; however, as we are going to see in the next experiment, there is a great advantage in terms of space occupation.
[Plot: legend entries recovered from the figure: Exact (inner product), BSTR tfidf, BSTR tfidf².]
Figure 3: Effectiveness (mAP) of the various approaches for the INRIA Holidays + Flickr1M dataset, using $k_x = 50$ for STR, rSTR, BSTR, and BSTR tfidf, while for BSTR tfidf² we set $k_x = k_q$ (higher values mean better results).
In order to assess which approach is most promis-
ing, we have also evaluated the efficiency in terms of
space and time overhead. Figure 4 shows the average
time for a query for the proposed approaches. The
rSTR approach considers also the time for reordering
the result set, however, its average time is obtained us-
ing a solid state disk (SSD) disk in which the original
VLAD vectors are available for the reordering. The
SSD is necessary to guarantee fast random I/O, while
using a standard disk the seek time would affect the
query time of more than one order of magnitude.
Figure 5 presents the index occupation expressed
in GB. The rSTR approach occupies 16.8 GB on
disk, including the overhead for the storage of
the VLAD vectors used for the reordering of the
results. The BSTR tfidf2 solution has a great impact on
the space occupation: a reduction of just 20%
of the documents (i.e., from kx = 50 to kx = 40) yields
a reduction of 80% for the inverted file.
Considering all the alternatives seen so far, an
optimal choice could be BSTR tfidf2 with kx = kq = 20,
which is efficient in terms of both time and space
overheads and still maintains a satisfactory mAP.
6 Conclusions and Future Work
In this work, we proposed a ‘blockwise’ extension of
surrogate text representation, which is in principle
applicable not only to VLAD but also to any other
vector or compound metric object. The main advantage
of this approach is that it eliminates the need for the
reordering phase. Using the same hardware and text
search engine (i.e., Lucene), we were able to compare
it with the state-of-the-art baseline STR approach,
which exploits the reordering phase.
[Figure 4 plot omitted; axis ticks from 0 to 60; axis label: Average query time; surviving legend entries: BSTR tfidf, BSTR tfidf2]
Figure 4: Average time per query in seconds of the various
approaches for the INRIA Holidays + Flickr1M dataset,
using kx = 50 for rSTR, BSTR, and BSTR tfidf, while for
BSTR tfidf2 we set kx = kq (higher values mean worse
performance).
Figure 5: Space occupation of the index for the different
types of solutions, using the same value of kx = 50 for BSTR
and rSTR, and varying kx for BSTR tfidf2. Note that for
rSTR we also consider the overhead for the storage of the
VLAD vectors used for the reordering of the results (higher
values mean greater occupation).
The experimental evaluation of the blockwise
extension revealed very promising performance in terms
of mAP and response time. However, its drawback
lies in the expansion of the number of terms in
the textual representation of the VLADs. This
produces an inverted index that, using Lucene, is one
order of magnitude larger than that of the baseline
STR. To alleviate this problem, we propose to shrink
the index by reducing the documents, as we did for the
queries, by eliminating the terms associated with a low
tf*idf weight. This approach is very effective but has
the disadvantage that it needs a double indexing phase,
or at least a pre-analysis of the dataset, in order to
calculate the tf*idf weights of the terms. Future work
will investigate this aspect in more detail.
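The pre-analysis mentioned above amounts to one pass over the collection that gathers the document frequency of every term before any document is reduced. A minimal sketch of that pass follows, assuming the plain idf form log(N/df); the function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Pre-analysis pass: count in how many documents each term occurs."""
    df = Counter()
    for terms in docs:
        # A term counts once per document, whatever its term frequency.
        df.update(set(terms))
    return df


def idf_table(df, n_docs):
    """Plain idf = log(N / df) for every term seen in the pre-analysis."""
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Toy collection of three surrogate texts (term labels are hypothetical).
collection = [["a", "a", "b"], ["a", "c"], ["a", "b", "b"]]
df = document_frequencies(collection)
idf = idf_table(df, n_docs=len(collection))
```

Only after this table is complete can the low-weight terms of each document be identified, which is why the reduction of the indexed documents cannot be done in a single streaming pass.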
This work was partially supported by EAGLE,
Europeana network of Ancient Greek and Latin
Epigraphy, co-funded by the European Commission,
CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Grant
Agreement n. 325122.
Amato, G., Bolettieri, P., Falchi, F., and Gennaro, C.
(2013a). Large scale image retrieval using vector of
locally aggregated descriptors. In Brisaboa, N., Pe-
dreira, O., and Zezula, P., editors, Similarity Search
and Applications, volume 8199 of Lecture Notes in
Computer Science, pages 245–256. Springer Berlin
Amato, G., Bolettieri, P., Falchi, F., Gennaro, C., and Ra-
bitti, F. (2011). Combining local and global visual fea-
ture similarity using a text search engine. In Content-
Based Multimedia Indexing (CBMI), 2011 9th Inter-
national Workshop on, pages 49 –54.
Amato, G., Falchi, F., and Gennaro, C. (2013b). On reduc-
ing the number of visual words in the bag-of-features
representation. In VISAPP 2013 - Proceedings of the
International Conference on Computer Vision Theory
and Applications, volume 1, pages 657–662.
Amato, G., Falchi, F., Gennaro, C., and Bolettieri, P.
(2014a). Indexing vectors of locally aggregated de-
scriptors using inverted files. In Proceedings of Inter-
national Conference on Multimedia Retrieval, ICMR
’14, pages 439:439–439:442.
Amato, G., Gennaro, C., and Savino, P. (2014b). MI-
File: using inverted files for scalable approximate sim-
ilarity search. Multimedia Tools and Applications,
Arandjelovic, R. and Zisserman, A. (2013). All about
VLAD. In Computer Vision and Pattern Recogni-
tion (CVPR), 2013 IEEE Conference on, pages 1578–
Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF:
Speeded Up Robust Features. In Leonardis, A.,
Bischof, H., and Pinz, A., editors, Computer Vision
- ECCV 2006, volume 3951 of Lecture Notes in Com-
puter Science, pages 404–417. Springer Berlin Hei-
Boureau, Y.-L., Bach, F., LeCun, Y., and Ponce, J. (2010).
Learning mid-level features for recognition. In Com-
puter Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on, pages 2559–2566.
Chavez, G., Figueroa, K., and Navarro, G. (2008). Effec-
tive proximity retrieval by ordering permutations. Pat-
tern Analysis and Machine Intelligence, IEEE Trans-
actions on, 30(9):1647 –1658.
Chen, D., Tsai, S., Chandrasekhar, V., Takacs, G., Chen, H.,
Vedantham, R., Grzeszczuk, R., and Girod, B. (2011).
Residual enhanced visual vectors for on-device im-
age matching. In Signals, Systems and Computers
(ASILOMAR), 2011 Conference Record of the Forty
Fifth Asilomar Conference on, pages 850–854.
Chum, O., Philbin, J., Sivic, J., Isard, M., and Zisserman,
A. (2007). Total recall: Automatic query expansion
with a generative feature model for object retrieval. In
Computer Vision, 2007. ICCV 2007. IEEE 11th Inter-
national Conference on, pages 1–8.
Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray,
C. (2004). Visual categorization with bags of key-
points. Workshop on statistical learning in computer
vision, ECCV, 1(1-22):1–2.
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S.
(2004). Locality-sensitive hashing scheme based on
p-stable distributions. In Proceedings of the twentieth
annual symposium on Computational geometry, SCG
’04, pages 253–262.
Delhumeau, J., Gosselin, P.-H., J´
egou, H., and P ´
erez, P.
(2013). Revisiting the VLAD image representation.
In Proceedings of the 21st ACM International Confer-
ence on Multimedia, MM ’13, pages 653–656.
Esuli, A. (2009). MiPai: Using the PP-Index to Build an
Efficient and Scalable Similarity Search System. In
Proceedings of the 2009 Second International Work-
shop on Similarity Search and Applications, SISAP
’09, pages 146–148.
Fagin, R., Kumar, R., and Sivakumar, D. (2003). Compar-
ing top-k lists. SIAM J. of Discrete Math., 17(1):134–
Gennaro, C., Amato, G., Bolettieri, P., and Savino, P.
(2010). An approach to content-based image retrieval
based on the lucene search engine library. In Lal-
mas, M., Jose, J., Rauber, A., Sebastiani, F., and
Frommholz, I., editors, Research and Advanced Tech-
nology for Digital Libraries, volume 6273 of Lecture
Notes in Computer Science, pages 55–66. Springer
Berlin Heidelberg.
Jaakkola, T. and Haussler, D. (1998). Exploiting generative
models in discriminative classifiers. In In Advances
in Neural Information Processing Systems 11, pages
egou, H. and Chum, O. (2012). Negative evidences and
co-occurences in image retrieval: The benefit of pca
and whitening. In Fitzgibbon, A., Lazebnik, S., Per-
ona, P., Sato, Y., and Schmid, C., editors, Computer
Vision–ECCV 2012, volume 7573 of Lecture Notes in
Computer Science, pages 774–787. Springer.
egou, H., Douze, M., and Schmid, C. (2008). Hamming
embedding and weak geometric consistency for large
scale image search. In Forsyth, D., Torr, P., and Zis-
serman, A., editors, Computer Vision – ECCV 2008,
volume 5302 of Lecture Notes in Computer Science,
pages 304–317. Springer Berlin Heidelberg.
egou, H., Douze, M., and Schmid, C. (2009). Packing bag-
of-features. In Computer Vision, 2009 IEEE 12th In-
ternational Conference on, pages 2357 –2364.
egou, H., Douze, M., and Schmid, C. (2010). Improving
bag-of-features for large scale image search. Interna-
tional Journal of Computer Vision, 87:316–336.
egou, H., Douze, M., Schmid, C., and P´
erez, P. (2010a).
Aggregating local descriptors into a compact image
representation. In IEEE Conference on Computer Vi-
sion & Pattern Recognition, pages 3304–3311.
egou, H., Perronnin, F., Douze, M., S`
anchez, J., P´
P., and Schmid, C. (2012). Aggregating local im-
age descriptors into compact codes. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
egou, H., Schmid, C., Harzallah, H., and Verbeek, J.
(2010b). Accurate image search using the contextual
dissimilarity measure. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 32(1):2–11.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond
bags of features: Spatial pyramid matching for recog-
nizing natural scene categories. In Computer Vision
and Pattern Recognition, 2006 IEEE Computer Soci-
ety Conference on, volume 2.
Lowe, D. (2004). Distinctive image features from scale-
invariant keypoints. International Journal of Com-
puter Vision, 60(2):91–110.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models.
Wiley series in probability and statistics. Wiley.
Peng, X., Wang, L., Qiao, Y., and Peng, Q. (2014). Boost-
ing vlad with supervised dictionary learning and high-
order statistics. In Fleet, D., Pajdla, T., Schiele, B.,
and Tuytelaars, T., editors, Computer Vision - ECCV
2014, volume 8691 of Lecture Notes in Computer Sci-
ence, pages 660–674. Springer International Publish-
Perd’och, M., Chum, O., and Matas, J. (2009). Efficient
representation of local geometry for large scale object
retrieval. In Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on, pages
Perronnin, F. and Dance, C. (2007). Fisher kernels on vi-
sual vocabularies for image categorization. In Com-
puter Vision and Pattern Recognition, 2007. CVPR
’07. IEEE Conference on, pages 1–8.
Perronnin, F., Liu, Y., Sanchez, J., and Poirier, H. (2010).
Large-scale image retrieval with compressed fisher
vectors. In Computer Vision and Pattern Recogni-
tion (CVPR), 2010 IEEE Conference on, pages 3384
Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman, A.
(2007). Object retrieval with large vocabularies and
fast spatial matching. In Computer Vision and Pattern
Recognition, 2007. CVPR 2007. IEEE Conference on,
pages 1–8.
Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman,
A. (2008). Lost in quantization: Improving partic-
ular object retrieval in large scale image databases.
In Computer Vision and Pattern Recognition, 2008.
CVPR 2008. IEEE Conference on, pages 1–8.
Salton, G. and McGill, M. J. (1986). Introduction to Mod-
ern Information Retrieval. McGraw-Hill, Inc., New
York, NY, USA.
Sivic, J. and Zisserman, A. (2003). Video google: A text
retrieval approach to object matching in videos. In
Proceedings of the Ninth IEEE International Confer-
ence on Computer Vision - Volume 2, ICCV ’03, pages
Spyromitros-Xioufis, E., Papadopoulos, S., Kompatsiaris,
I. Y., Tsoumakas, G., and Vlahavas, I. (2014). A com-
prehensive study over vlad and product quantization in
large-scale image retrieval. Multimedia, IEEE Trans-
actions on, 16(6):1713–1728.
Thomee, B., Bakker, E. M., and Lew, M. S. (2010). TOP-
SURF: A visual words toolkit. In Proceedings of
the International Conference on Multimedia, MM ’10,
pages 1473–1476.
Tolias, G. and Avrithis, Y. (2011). Speeded-up, relaxed spa-
tial matching. In Computer Vision (ICCV), 2011 IEEE
International Conference on, pages 1653–1660.
Tolias, G. and J´
egou, H. (2013). Local visual query expan-
sion: Exploiting an image collection to refine local
descriptors. Research Report RR-8325, INRIA.
Van Gemert, J., Veenman, C., Smeulders, A., and Geuse-
broek, J.-M. (2010). Visual word ambiguity. Pat-
tern Analysis and Machine Intelligence, IEEE Trans-
actions on, 32(7):1271–1283.
Van Gemert, J. C., Geusebroek, J.-M., Veenman, C. J., and
Smeulders, A. W. (2008). Kernel codebooks for scene
categorization. In Forsyth, D., Torr, P., and Zisserman,
A., editors, Computer Vision - ECCV 2008, volume
5304 of Lecture Notes in Computer Science, pages
696–709. Springer Berlin Heidelberg.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong,
Y. (2010). Locality-constrained linear coding for
image classification. In Computer Vision and Pat-
tern Recognition (CVPR), 2010 IEEE Conference on,
pages 3360–3367.
Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing
gigabytes: compressing and indexing documents and
images. Multimedia Information and Systems Series.
Morgan Kaufmann Publishers.
Yang, J., Yu, K., Gong, Y., and Huang, T. (2009). Lin-
ear spatial pyramid matching using sparse coding for
image classification. In Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on,
pages 1794–1801.
Zezula, P., Amato, G., Dohnal, V., and Batko, M. (2006).
Similarity Search: The Metric Space Approach, vol-
ume 32 of Advances in Database Systems. Springer-
Zhao, W.-L., J´
egou, H., and Gravier, G. (2013). Oriented
pooling for dense and non-dense rotation-invariant
features. In BMVC - 24th British Machine Vision Con-