Large Scale Image Retrieval Using Vector of
Locally Aggregated Descriptors
Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro
{giuseppe.amato, paolo.bolettieri, fabrizio.falchi claudio.gennaro}@isti.cnr.it
ISTI - CNR, Pisa, Italy
Abstract. Vector of locally aggregated descriptors (VLAD) is a promis-
ing approach for addressing the problem of image search on a very large
scale. This representation was proposed to overcome the quantization error
problem faced by the Bag-of-Words (BoW) representation. However, text
search engines have not been used yet for indexing VLAD, given that it is
not a sparse vector of occurrence counts. For this reason, the BoW approach
is still the most widely adopted method for finding images that represent
the same object or location, given an image as a query and a large set of
images as a dataset.
In this paper, we propose to enable inverted files of standard text search
engines to exploit VLAD representation to deal with large-scale image
search scenarios. We show that the use of inverted files with VLAD
significantly outperforms BoW in terms of efficiency and effectiveness on
the same hardware and software infrastructure.
Keywords: bag of features, bag of words, local features, compact codes, image
retrieval.
1 Introduction
In the last few years, local features [16] extracted from selected regions [22]
have emerged as a promising way of representing image content, so that tasks
such as object recognition and similar ones (e.g., landmark recognition, copy
detection, etc.) can be effectively executed. A drawback of the use of local
features is that a single image is represented by a large set (typically thousands)
of (local) descriptors that should be individually matched and processed in order
to compare the visual content of two images. In principle, a query image should
be compared with each dataset object independently. In fact, each local feature of
the query should be compared with all the local features of any dataset image in
order to find possible matches. Moreover, candidate matches should be validated
by evaluating a geometric transformation (typically a homography) able to map a
region of the query to a region of the dataset image. Even though data structures
such as the kd-tree [9] are used to efficiently search candidate matching pairs in
any two images, the approach is still not scalable.
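To make the cost of this pairwise matching concrete, the following is a minimal sketch (ours, not part of the original paper) of brute-force local-feature matching with a nearest-neighbor ratio test, assuming descriptors are given as NumPy arrays; all names are illustrative.

import numpy as np

def match_local_features(query_desc, image_desc, ratio=0.8):
    """Brute-force matching between two sets of local descriptors.

    query_desc: (n, d) array; image_desc: (m, d) array. A query
    descriptor is matched to its nearest neighbor only if the latter
    is sufficiently closer than the second nearest (ratio test).
    Cost is O(n * m) distance computations per image pair, which is
    why this does not scale to large datasets.
    """
    matches = []
    for i, q in enumerate(query_desc):
        dists = np.linalg.norm(image_desc - q, axis=1)
        nn1, nn2 = np.argsort(dists)[:2]
        if dists[nn1] < ratio * dists[nn2]:
            matches.append((i, int(nn1)))
    return matches

The surviving matches would then be validated with a geometric consistency check (e.g., homography estimation via RANSAC), which adds further cost on top of the quadratic matching step.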
A very popular method to achieve scalability is the Bag-of-Words (BoW)
[21] (or bag-of-features) approach, which consists in replacing each original local
descriptor with the id of the most similar descriptor in a predefined vocabulary.
Following the BoW approach, an image is described as a histogram of occurrences
of visual words over the global vocabulary. Thus, the BoW approach used
in computer vision is very similar to the traditional BoW approach in natural
language processing and information retrieval [5]. However, as mentioned in [24],
"a fundamental difference between an image query (e.g. 1500 visual terms) and
a text query (e.g. 3 terms) is largely ignored in existing index design". From the
very beginning [21], word reduction techniques were used (e.g., removing the 10%
most frequent visual words). In [2], removing query words with small tf*idf [20]
proved very effective in improving the efficiency of the BoW approach
with a limited loss in effectiveness. In this work, we make use of the parametric
tf*idf approach to facilitate trade-offs between efficiency and effectiveness in
the BoW approach.
Efficiency and memory constraints have recently been addressed by aggregating
local descriptors into a fixed-size vector representation that describes the
whole image. In particular, Fisher Vector (FV) [18] and VLAD [12] have shown
better performance than BoW [15]. In this work we focus on VLAD, which
has been proved to be a simplified non-probabilistic version of FV. Despite its
simplicity, VLAD performance is comparable to that of FV [15].
Euclidean Locality-Sensitive Hashing [7] is, as far as we know, the only
indexing technique tested with VLAD. While many other similarity search indexing
techniques [23] could be applied to VLAD, in this work we decided to investigate
the use of inverted files, allowing a comparison of the VLAD and BoW
approaches on the same index. Permutation-Based Indexing [6, 4, 8] allows using
inverted files to perform similarity search with an arbitrary similarity function.
Moreover, in [10, 1] a Surrogate Text Representation (STR) derived from the
MI-File has been proposed. The conversion of the image description into a textual
form allows us to employ off-the-shelf search engine indexing and searching
capabilities with little implementation effort.
In this paper, we applied the STR technique to the VLAD method, comparing
both effectiveness and efficiency with the state-of-the-art BoW approach on
the very same hardware and software infrastructure, using the publicly available
and widely adopted 1M photos dataset. Given that the STR combination gives
approximate results with respect to a complete sequential scan, we also compare
the effectiveness of VLAD-STR with that of standard VLAD. Moreover,
we considered balancing efficiency and effectiveness with both the BoW and VLAD-STR
approaches. For VLAD-STR, a similar trade-off is obtained by varying the
number of results used for re-ordering. Thus, we do not only compare VLAD-STR
and BoW on specific settings, but show efficiency vs. effectiveness graphs
for both.
Results confirm the higher performance of VLAD with respect to BoW already
shown in [12, 15], even when VLAD is combined with STR and an off-the-shelf
text search engine (i.e., Lucene) is used. Thus, our main contribution is
proving that the proposed VLAD-STR approach can be used, in place of BoW,
in combination with traditional text search engines, achieving good scalability
and preserving the improvement in effectiveness already shown in [15].
The paper is organized as follows. Section 2 presents relevant previous work.
In Section 3 we present the STR approach that is used for indexing VLAD with
a text search engine. Results are presented in Section 4. Finally, in Section 5 we
present our conclusions and describe future work.
2 Related Work
2.1 Local Features
Local features [16] describe the visual content of local interest regions computed
by region detectors [22]. Good local features should be distinctive and at
the same time robust to changes in viewing conditions as well as to errors of
the detector. Developed mainly in Computer Vision, their typical applications
include finding locations and particular objects, and detecting image near-duplicates
and deformed copies. A drawback of the use of local features is that a single
image is represented by a large set (typically thousands) of descriptors that
should be individually matched and processed in order to compare the visual
content of two images.
2.2 Bag of Words (BoW)
State-of-the-art techniques for performing large-scale content-based image retrieval
using local features typically involve the BoW approach. BoW was initially
proposed in [21] and has been studied in many other papers. The goal of
the BoW approach is to substitute each local descriptor of an image with a visual
word obtained from a predefined vocabulary, in order to apply traditional text
retrieval techniques to CBIR.
The first step is selecting some visual words, creating a vocabulary. The visual
vocabulary is typically built by clustering the local descriptors of the
dataset (e.g., using k-means) and selecting the centroids. The second step assigns each local descriptor
to the identifier of the nearest word in the vocabulary. To speed up this
second phase, an approximate kd-tree is often used, at a small price in effectiveness. At
the end of the process, each image is described as a set of visual words. The
retrieval phase is then performed using text retrieval techniques, considering a
query image as a disjunctive text query. Typically, the cosine similarity measure in
conjunction with a term weighting scheme is adopted for evaluating the similarity
between any two images.
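As a concrete illustration, here is a minimal sketch (ours, not from the paper) of the two steps just described, assuming local descriptors are NumPy arrays and using k-means from scikit-learn; the vocabulary size is whatever the application requires (we use a 20K vocabulary in the experiments).

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def build_vocabulary(training_descriptors, n_words):
    """Step 1: cluster dataset descriptors; the centroids are the visual words."""
    return KMeans(n_clusters=n_words).fit(training_descriptors).cluster_centers_

def bow_histogram(image_descriptors, vocabulary):
    """Step 2: assign each descriptor to its nearest visual word and
    describe the image as a histogram of word occurrences."""
    words = cdist(image_descriptors, vocabulary).argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary))

Retrieval then treats the non-zero bins of the query histogram as a disjunctive text query, scoring images by cosine similarity with a term weighting scheme.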
Even though inverted files offer a significant improvement in efficiency, in
many cases efficiency is not yet satisfactory. In fact, a query image is associated
with thousands of visual words. Therefore, the search algorithm on inverted files
has to access thousands of different posting lists. From the very beginning [21],
word reduction techniques were used (e.g., removing the 10% most frequent
visual words). However, as far as we know, no experiments have been reported on the
impact of this reduction on both efficiency and effectiveness.
In [2], various techniques to reduce the number of words describing an image
obtained with the BoW approach were evaluated. Filtering by tf*idf [20] proved
very effective in improving efficiency with a limited loss in effectiveness. In this
work, we make use of the parametric tf*idf approach to allow trade-offs between
efficiency and effectiveness.
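A minimal sketch of this kind of query-word filtering, under the assumption that words are scored by tf*idf and only the top-scoring ones are kept (variable names are ours):

import numpy as np

def filter_query_words(query_word_ids, idf, max_words):
    """Keep only the query visual words with the highest tf*idf.

    query_word_ids: visual-word ids of the query image (with repetitions);
    idf: array mapping each word id to its inverse document frequency.
    """
    ids, tf = np.unique(query_word_ids, return_counts=True)
    top = np.argsort(-(tf * idf[ids]))[:max_words]
    return ids[top]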
2.3 Fisher Vector
Fisher kernels [11] describe how a set of descriptors deviates from an average
distribution, modeled by a parametric generative model. Fisher kernels have been
applied in the context of image classification [17] and large-scale image search
[18]. In [15] it has been proved that Fisher Vectors (FVs) extend the BoW. While
the BoW approach counts the number of descriptors assigned to each region of
the space, the FV also encodes the approximate location of the descriptors in each
region, and has a normalization that can be interpreted as an IDF term. The FV
image representation proposed by [17] assumes that the samples are distributed
according to a Gaussian Mixture Model (GMM) estimated on a training set.
Results reported in [15] reveal that FV indexed using LSH outperforms BoW.
2.4 VLAD
The VLAD representation was proposed in [12]. As for BoW, a codebook
$\{\mu_1, \ldots, \mu_K\}$ is first learned using a clustering algorithm (e.g., k-means). Each
local descriptor $x_t$ of an image is then associated with its nearest visual word
$NN(x_t)$ in the codebook. For each codeword, the differences between the vectors
$x_t$ assigned to $\mu_i$ and the codeword itself are accumulated:

$$v_i = \sum_{x_t : NN(x_t) = i} x_t - \mu_i$$

VLAD is the concatenation of the accumulated vectors, i.e., $V = [v_1^T \ldots v_K^T]$.
Please note that all the $v_i$ ($i = 1, \ldots, K$) have the same size, which is equal to the size
of the used local feature (e.g., 128 for SIFT). Given a codebook $\{\mu_1, \ldots, \mu_K\}$,
$K$ is fixed (typically $16 \leq K \leq 128$), thus the dimensionality of the whole
vector $V$ describing any image is fixed too. In other words, VLAD evaluates a
global descriptor statistically describing a set of local features with respect to a
predefined codebook.
In order to improve the effectiveness of the VLAD approach, two normalizations
are performed: first, a power normalization with power 0.5; second, an L2
normalization. After this process, two global descriptors $V_1$ and $V_2$ related to any
two images can be compared using the inner product.
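The following is a minimal sketch of the VLAD construction just described (residual accumulation, power normalization with exponent 0.5, and L2 normalization), assuming NumPy arrays; it is illustrative, not the authors' code.

import numpy as np
from scipy.spatial.distance import cdist

def vlad(local_descriptors, codebook):
    """VLAD of one image: local_descriptors (n, d), codebook (K, d)."""
    K, d = codebook.shape
    nearest = cdist(local_descriptors, codebook).argmin(axis=1)
    v = np.zeros((K, d))
    for i in range(K):
        assigned = local_descriptors[nearest == i]
        if len(assigned) > 0:
            v[i] = (assigned - codebook[i]).sum(axis=0)  # residuals v_i
    v = v.reshape(-1)                       # V = [v_1^T ... v_K^T]
    v = np.sign(v) * np.sqrt(np.abs(v))     # power normalization (0.5)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v            # L2 normalization

After this normalization, the inner product of two VLAD vectors directly serves as their similarity.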
The observation that the VLAD descriptor has high dimensionality but is relatively
sparse and very structured suggests dimensionality reduction: a principal component analysis (PCA)
is usually performed to reduce the size of the VLAD vectors.
In this work, we decided not to use dimensionality reduction techniques, because
our space transformation approach is independent of the
original dimensionality of the description. In fact, the STR approach that we
propose transforms the VLAD description into a set of words from a vocabulary
that is independent of the original VLAD dimensionality. In our proposal,
PCA could still be used to increase the efficiency of the STR transformation.
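Although we do not apply it here, a PCA reduction of the VLAD vectors would be a one-step addition, e.g. with scikit-learn; a sketch under the assumption that dataset_vlads and query_vlads are matrices of VLAD vectors (both names are ours):

from sklearn.decomposition import PCA

# Fit the projection on dataset VLAD vectors, then apply it to the queries.
pca = PCA(n_components=128)                  # target dimensionality (illustrative)
dataset_reduced = pca.fit_transform(dataset_vlads)
query_reduced = pca.transform(query_vlads)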
In [15], it has been shown that VLAD is a simplified non-probabilistic version
of FV: VLAD is to FV what k-means is to GMM clustering. Indeed, k-means
clustering can be viewed as a non-probabilistic limit case of GMM clustering.
In [15], Euclidean Locality-Sensitive Hashing and its variants have been proposed
to efficiently search VLAD descriptors.
3 Perspective Transformation and Surrogate Text
Representation
In this paper, we propose to index the VLAD descriptors using a surrogate
text representation. This allows using any text retrieval engine to perform image
similarity search. As discussed later, for the experiments, we implemented these
ideas on top of the Lucene text retrieval engine.
The approach used in this paper to encode global features (such as VLAD) leverages
the perspective-based space transformation developed in [4, 10]. The
idea at the basis of this technique is that when two descriptors are very similar,
with respect to a given similarity function, they 'see' the 'world around' them
in the same way. In the following, we will see that the 'world around' can be
encoded as a Surrogate Text Representation (STR), which can be managed with
an inverted index by means of a standard text-based search engine. The conversion of
the visual descriptor into a textual form allows us to employ off-the-shelf
search engine indexing and searching capabilities with little implementation effort.
3.1 STR Generation
Let $\mathcal{D}$ be the domain of the global descriptors $o$, and $d: \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ a distance
function able to assess the dissimilarity between any two $o_1, o_2 \in \mathcal{D}$.
Let $R \in \mathcal{D}^m$ be a vector of $m$ distinct reference descriptors (or pivots) $r_i$,
i.e., $R = (r_1, \ldots, r_m)$. We denote by $P(o) = (p_1(o), \ldots, p_m(o))$ the vector of
positions of the reference objects in $R$, ranked by increasing distance with respect
to an object $o \in \mathcal{D}$. As an example, if $p_3(o) = 2$ then $r_3$ is the 2nd nearest
object to $o$ among those in $R$.
The objective is to define a function that transforms a global descriptor
into a sequence of terms (i.e., a textual document) that can be fed into a text
search engine such as Lucene. Of course, the ultimate goal is that
the distance between the documents and the query is an approximation
of the original distance function of the global descriptors. To achieve this, we
associate each element $r_i \in R$ with a unique alphanumeric keyword $\tau_i$, and define
a function $t_k(o)$ that returns a space-separated concatenation of zero or more
repetitions of the keywords $\tau_i$, as follows:
$$t_k(o) = \bigcup_{i=1}^{m} \; \bigcup_{j=1}^{(k+1) - p_i^k(o)} \tau_i$$

where $p_i^k(o) = p_i(o)$ if $p_i(o) \leq k$, and $p_i^k(o) = k+1$ otherwise (so that pivots
beyond the $k$ nearest contribute no keyword). By abuse of notation,
we denote the space-separated concatenation of keywords with the union operator
$\cup$. The inner $\cup$ simply repeats, $(k+1) - p_i^k(o)$ times, the alphanumeric keyword
$\tau_i$ used to indicate the reference object $r_i \in R$. The outer $\cup$ concatenates the
repeated occurrences, if any, of the keywords $\tau_i$ for $i = 1, \ldots, m$. The function $t_k(o)$ is
used to generate the STR employed for both indexing and querying purposes.
$k$ is used to consider only the $k$ nearest reference objects in $R$ to $o$, and typically
assumes two distinct values for the query $q$ and for the objects in the dataset ($k_x$
for indexing and $k_q$ for querying). For instance, consider the case exemplified in
Figure 1, and let us assume $\tau_1 = A$, $\tau_2 = B$, etc. The function $t_k$ will generate the
following outputs:

$t_{k_x}(o_1)$ = "E E E B B A"
$t_{k_x}(o_2)$ = "D D D C C E"
$t_{k_q}(q)$ = "E E A"
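A minimal sketch of $t_k(o)$ follows (ours, with the keyword $\tau_i$ rendered as the string "Ri"); with $k = 3$ and the pivot ranks of Figure 1 it reproduces outputs such as "E E E B B A":

def str_terms(o, pivots, dist, k):
    """Surrogate Text Representation t_k(o).

    pivots: the m reference descriptors r_1..r_m; dist: the original
    distance d. The keyword of the pivot ranked p-th (p <= k) by
    increasing distance from o is repeated (k + 1) - p times; pivots
    beyond the k nearest contribute nothing.
    """
    order = sorted(range(len(pivots)), key=lambda i: dist(o, pivots[i]))
    words = []
    for p, i in enumerate(order[:k], start=1):  # p = p_i(o), 1-based rank
        words.extend(["R%d" % i] * ((k + 1) - p))
    return " ".join(words)

# Dataset objects are indexed with k = k_x; queries use k = k_q:
# doc = str_terms(o, pivots, dist, k_x); qry = str_terms(q, pivots, dist, k_q)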
As can be intuitively seen, the string corresponding to $o_1$ is more similar to that
of $q$ than the string corresponding to $o_2$ is, so this approximates the original distance $d$.
Without going into the mathematical details, we leverage the fact that a text-based
search engine will generate a vector representation of the STRs generated with
$t_{k_x}(o)$ and $t_{k_q}(q)$, containing the number of occurrences of the words in the texts. This
is the case of the simple term-frequency weighting scheme. This means that if,
for instance, the keyword $\tau_i$ corresponding to the reference object $r_i \in R$ appears $n$
times, the $i$-th element of the vector will contain the number $n$, and whenever
$\tau_i$ does not appear it will contain 0. With simple mathematical manipulations,
it is easy to see how, by applying the cosine similarity to the query vector and a
database vector, corresponding to $t_{k_q}(q)$ and $t_{k_x}(o)$ respectively, we get
a degree of similarity that reflects the similarity of the orderings of the reference descriptors
(pivots) around the descriptors in the original space.
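To see the claim concretely, here is a small sketch (ours) of the term-frequency vectors a text engine would build from the STRs, and of the cosine similarity it computes on them:

import numpy as np

def tf_vector(str_doc, m):
    """Term-frequency vector of an STR document over keywords R0..R(m-1)."""
    v = np.zeros(m)
    for word in str_doc.split():
        v[int(word[1:])] += 1  # "R17" -> index 17
    return v

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a.dot(b) / (na * nb)) if na > 0 and nb > 0 else 0.0

# Objects o are ranked by cosine(tf_vector(qry, m), tf_vector(doc, m)),
# which approximates the ordering induced by the original distance d.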
For more information on how the technique works from the mathematical
point of view, we refer the reader to [10, 1]. The impact of $k_x$ on the effectiveness
of the search has been studied in [3].
3.2 Reordering Search Results
The idea described so far uses a textual representation of the descriptors, and a
matching measure based on the similarity offered by standard text search engines,
to order the descriptors in the dataset by decreasing similarity with respect to
the query. The result set can be made more precise by reordering it using the
original distance function $d$.
Fig. 1. Example of perspective-based space transformation and Surrogate Text Representation. a) Black points are reference objects; white points are data objects; the gray point is a query. b) Encoding of the data objects in the STR.
Suppose we are searching for the $k$ most similar (nearest-neighbor) descriptors
to the query. We can improve the quality of the approximation by re-ranking,
using the original distance function $d$, the first $c$ ($c \geq k$) descriptors from the
approximate result set, at the cost of $c$ additional distance computations. We will show
that this technique significantly improves the accuracy, while requiring only a
very low search cost. In fact, when $c$ is much smaller than the size of the dataset,
this extra cost can be considered negligible with respect to the cost of accessing
the inverted file. For instance, when $k$ is 10 and $c$ = 1,000, with a dataset size of
1,000,000, we have to reorder a number of descriptors equivalent
to just 0.1% of the entire dataset. Usually, this is not true for other access methods,
for instance tree-based access methods, where the efficiency of the search
algorithms strongly depends on the number of descriptors retrieved.
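A sketch of the overall approximate search with re-ranking follows (our own illustrative Python; text_index stands for the text engine, e.g. Lucene, and vlad_store for access to the original descriptors; both names are assumptions):

import numpy as np

def knn_search(query_vlad, query_terms, text_index, vlad_store, k=10, c=1000):
    """Approximate k-NN with re-ranking of the top-c candidates.

    text_index.search(terms, top) is assumed to return the ids of the
    c objects whose STR best matches the query STR; vlad_store[oid]
    returns the original VLAD vector of object oid.
    """
    candidates = text_index.search(query_terms, top=c)
    # Re-rank the c candidates with the true distance on the original VLADs.
    reranked = sorted(candidates,
                      key=lambda oid: np.linalg.norm(vlad_store[oid] - query_vlad))
    return reranked[:k]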
4 Experiments
4.1 Setup
INRIA Holidays [14, 15] is a collection of 1,491 holiday images. The authors
selected 500 queries and, for each of them, a list of positive results. To evaluate the
approaches on a large scale, we merged the Holidays dataset with the Flickr1M
collection (http://press.liacs.nl/mirflickr/), as in [13, 12, 15]. The ground truth is
the one built on the INRIA Holidays dataset alone, but it is largely accepted that no
relevant images are to be found among the Flickr1M images. SIFT descriptors and
various vocabularies were made publicly available by Jégou et al. for both the
Holidays and the Flickr1M datasets (http://lear.inrialpes.fr/~jegou/data.php).
For the BoW approach we used the 20K vocabulary.
Table 1. Effectiveness (mAP) and efficiency (avg mSec per query) of the BoW approach with respect to the average number of distinct words per query, obtained varying the query size.

avg #words | mAP  | avg mSec
         7 | 0.03 |      525
        16 | 0.07 |      555
        37 | 0.11 |      932
        90 | 0.14 |     1463
       233 | 0.15 |     2343
Table 2. Effectiveness (mAP) and efficiency (avg mSec per query) of the VLAD approach in combination with STR, with respect to the number of results used for reordering.

#reordered | mAP  | avg mSec
         0 | 0.13 |      139
       100 | 0.24 |      205
      1000 | 0.29 |      800
      2000 | 0.30 |     1461
      4000 | 0.31 |     2784
For representing the images using the VLAD approach, we selected 64 reference
descriptors (codewords) using k-means over a subset of the Flickr1M dataset. As
explained in Section 3, a drawback of the perspective-based space transformation used
for indexing the VLAD descriptors with a text search engine is that it is an approximate
technique. To alleviate this problem, we reorder the best results using
the actual distance between the VLAD descriptors. For the STR we used 4,000
references (i.e., m = 4,000), randomly selected from the Flickr1M dataset.
During the experimentation, 256 references for VLAD and up to 10,000
references for the STR were also tried, but the results were only slightly better
than the ones presented, while efficiency was significantly reduced.
All experiments were conducted on an Intel Core i7 CPU at 2.67 GHz, with 12.0
GB of RAM, a 2 TB 7200 RPM HDD for the Lucene index, and a 250 GB SSD
for the VLAD reordering. We used Lucene v3.6 running on Java 6 64-bit over
Windows 7 Professional.
The quality of the retrieved images is typically evaluated by means of precision
and recall measures. As in many other papers [19, 13, 18, 15], we combined
this information by means of the mean Average Precision (mAP), which represents
the area below the precision-recall curve.
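For reference, a minimal sketch (ours) of the average precision of a single query; mAP is then the mean over the 500 queries, and variable names are illustrative:

def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant hit,
    normalized by the total number of relevant images for the query."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, oid in enumerate(ranked_ids, start=1):
        if oid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0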
4.2 Results
In Table 1, we report the mAP obtained with the BoW approach varying the
size of the query in terms of the average number of distinct words. The query words
have been filtered using the tf*idf approach mentioned in Section 2.2. The
average number of words per image, as extracted by the INRIA group, is 1,471,
and they were all inserted in the index without any filtering. The filtering was
used only for the queries, and results are reported for an average number of distinct
words up to 250; larger queries result in a heavy load on the system. It is
worth mentioning that we were able to obtain 0.23 mAP performing a sequential
scan of the dataset with the unfiltered queries.
The results show that while the BoW approach is in principle very effective
(i.e., performing a sequential scan), the high number of query visual words needed
to achieve good results significantly reduces its usability. As mentioned in [24],
"a fundamental difference between an image query (e.g. 1,500 visual terms) and
a text query (e.g. 3 terms) is largely ignored in existing index design. This
difference makes the inverted list inappropriate to index images".
In Table 2, we report the results obtained using the VLAD approach in
combination with the STR illustrated in Section 3. As explained in Section 4.1,
given that we used an STR for indexing the images, it is useful to reorder the best
results obtained from the text search engine using the actual VLAD distance.
Thus, we report mAP and avg mSec per query for the non-reordering case and
for various numbers of results used for reordering. The reordering phase dominates
the average query time, but it significantly improves effectiveness, especially if
only 100 or 1,000 objects are considered for reordering. As mentioned before, we
make use of an SSD to speed up the reordering phase, but even higher efficiency could
be obtained using PCA, as proposed in [15]. Please note that even though the
reordering phase cost for VLAD can be reduced, the reported results already
show that VLAD outperforms BoW.
It is worth mentioning that we also performed a sequential scan of the entire
dataset, obtaining a mAP of 0.34 for VLAD. In fact, as explained in Section 3, the
results obtained with the VLAD-STR approach are an approximation of the results
obtained with a complete pairwise comparison between the query and the
dataset objects. The same is true when LSH indexing is used, as in [15]. Results
show that the approximation introduced by STR does not significantly impact
the effectiveness of the system when at least 1,000 objects are considered for
reordering.
In Figures 2 and 3 we report the precision-recall curves for BoW and
VLAD. The results essentially confirm the ones reported in Tables 1 and 2. In
fact, no significant differences can be found in the distribution of precision
with respect to recall.
In Figure 4 we plot mAP with respect to the average query execution time for
both BoW and VLAD, as reported in Tables 1 and 2. The graph highlights
both the efficiency and the effectiveness advantages of VLAD with respect to
the BoW approach.
5 Conclusions and Future Work
In this work, we proposed the usage of STR in combination with VLAD
descriptions in order to index VLAD with off-the-shelf text search engines. Using
the very same hardware and text search engine (i.e., Lucene), we were able to
compare against the state-of-the-art BoW approach. Results obtained for BoW confirm that
the high number of visual terms in the query significantly reduces the efficiency of
inverted lists. Even though results showed that this can be mitigated by reducing the
number of visual terms in the query with a tf*idf weighting scheme, VLAD-STR
significantly outperforms BoW in terms of both efficiency and effectiveness.
Fig. 2. Precision-recall curves obtained with the BoW approach for various query sizes (avg #words = 233, 90, 37, 16, 7).
Fig. 3. Precision-recall curves obtained with VLAD-STR for various numbers of results used for reordering (4000, 2000, 1000, 100, 0).
The efficiency vs. effectiveness graph reveals that VLAD-STR is able to obtain
the same mAP values as BoW with an order of magnitude less
response time. Moreover, for the same response time, VLAD-STR is able to
obtain twice the mAP of BoW.
Future work includes improving the VLAD-STR reordering phase. With regard
to efficiency, PCA could be used on VLAD as suggested in [15]. Moreover,
in recognition scenarios (e.g., landmark recognition), the reordering phase typically
involves geometric consistency checks performed using RANSAC. This
could also be done with the VLAD description.
As mentioned in the paper, VLAD is essentially a non-probabilistic version
of the Fisher kernel that typically results in almost the same performance. It
would be interesting to test the STR approach with Fisher kernels as well, comparing
it with both VLAD-STR and BoW.
References
1. G. Amato, P. Bolettieri, F. Falchi, C. Gennaro, and F. Rabitti. Combining local
and global visual feature similarity using a text search engine. In Content-Based
Multimedia Indexing (CBMI), 2011 9th International Workshop on, pages 49 –54,
june 2011.
2. G. Amato, F. Falchi, and C. Gennaro. On reducing the number of visual words in
the bag-of-features representation. In VISAPP 2012 - Proceedings of the Interna-
tional Conference on Computer Vision Theory and Applications, to appear.
3. G. Amato, C. Gennaro, and P. Savino. Mi-file: using inverted files for scalable
approximate similarity search. Multimedia Tools and Applications, pages 1–30,
2012.
4. G. Amato and P. Savino. Approximate similarity search in metric spaces using
inverted files. In Proceedings of the 3rd international conference on Scalable infor-
mation systems, InfoScale ’08, pages 28:1–28:10, ICST, Brussels, Belgium, Belgium,
2008. ICST (Institute for Computer Sciences, Social-Informatics and Telecommu-
nications Engineering).
Fig. 4. Effectiveness (mAP) with respect to efficiency (mSec per query) obtained by VLAD and BoW for various settings.
5. R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval - the
concepts and technology behind search, Second edition. Pearson Education Ltd.,
Harlow, England, 2011.
6. E. Chávez, K. Figueroa, and G. Navarro. Effective proximity retrieval by ordering
permutations. IEEE Trans. Pattern Anal. Mach. Intell., 30(9):1647–1658, 2008.
7. M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing
scheme based on p-stable distributions. In Proceedings of the twentieth annual
symposium on Computational geometry, SCG ’04, pages 253–262, New York, NY,
USA, 2004. ACM.
8. A. Esuli. Mipai: Using the pp-index to build an efficient and scalable similarity
search system. In Proceedings of the 2009 Second International Workshop on Sim-
ilarity Search and Applications, SISAP ’09, pages 146–148, Washington, DC, USA,
2009. IEEE Computer Society.
9. J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best
matches in logarithmic expected time. ACM Trans. Math. Softw., 3(3):209–226,
1977.
10. C. Gennaro, G. Amato, P. Bolettieri, and P. Savino. An approach to content-based
image retrieval based on the Lucene search engine library. In Proceedings of the 14th
European Conference on Research and Advanced Technology for Digital Libraries
(ECDL 2010), LNCS.
11. T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classi-
fiers. In Advances in Neural Information Processing Systems 11, pages 487–493.
MIT Press, 1998.
12. H. Jégou, M. Douze, J. Sánchez, and P. Pérez. Aggregating local descriptors into
a compact image representation. In Computer Vision and Pattern Recognition
(CVPR), 2010 IEEE Conference on, pages 3304 –3311, june 2010.
13. H. Jegou, M. Douze, and C. Schmid. Packing bag-of-features. In Computer Vision,
2009 IEEE 12th International Conference on, pages 2357 –2364, 29 2009-oct. 2
2009.
14. H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a
compact image representation. In IEEE Conference on Computer Vision & Pattern
Recognition, pages 3304–3311, jun 2010.
15. H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggre-
gating local image descriptors into compact codes. IEEE Transactions on Pattern
Analysis and Machine Intelligence, Sept. 2012.
16. K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(10):1615
–1630, oct. 2005.
17. F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image cate-
gorization. In Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE
Conference on, pages 1 –8, june 2007.
18. F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier. Large-scale image retrieval with
compressed fisher vectors. In Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pages 3384 –3391, june 2010.
19. J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with
large vocabularies and fast spatial matching. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2007.
20. G. Salton and M. J. McGill. Introduction to Modern Information Retrieval.
McGraw-Hill, Inc., New York, NY, USA, 1986.
21. J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object
matching in videos. In Proceedings of the Ninth IEEE International Conference
on Computer Vision - Volume 2, ICCV ’03, pages 1470–, Washington, DC, USA,
2003. IEEE Computer Society.
22. T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: a survey.
Found. Trends. Comput. Graph. Vis., 3(3):177–280, 2008.
23. P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search - The Metric
Space Approach, volume 32 of Advances in Database Systems. Kluwer, 2006.
24. X. Zhang, Z. Li, L. Zhang, W.-Y. Ma, and H.-Y. Shum. Efficient indexing for large
scale visual search. In Computer Vision, 2009 IEEE 12th International Conference
on, pages 1103 –1110, 29 2009-oct. 2 2009.