Indexing Vectors of Locally Aggregated Descriptors
Using Inverted Files
Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro
ISTI-CNR
via G. Moruzzi, 1 - 56124 Pisa Italy
<name>.<last name>@isti.cnr.it
ABSTRACT
Vector of locally aggregated descriptors (VLAD) is a promising approach for addressing the problem of image search on a very large scale. This representation is proposed to overcome the quantization error problem faced in Bag-of-Words (BoW) representation. In this paper, we propose to enable inverted files of standard text search engines to exploit VLAD representation to deal with large-scale image search scenarios. We show that the use of inverted files with VLAD significantly outperforms BoW in terms of efficiency and effectiveness on the same hardware and software infrastructure.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.5 [Information Storage and Retrieval]: Online Information Services—Commercial services
General Terms
Experimentation, Algorithms
Keywords
landmark recognition, image classification, local features
1. INTRODUCTION
In the last few years, local features [10] extracted from selected regions [14] have emerged as a promising method of representing image content in such a way that tasks such as object recognition can be effectively executed. A drawback of the use of local features is that a single image is represented by a large set of local descriptors that should be individually matched and processed in order to compare the visual content of two images. In principle, a query image should be compared with each dataset object independently. A very popular method to achieve scalability is the Bag-of-Words (BoW) [13] (or bag-of-features) approach, which consists in
replacing the original local descriptors with the id of the most similar descriptor in a predefined vocabulary. Following the BoW approach, an image is described as a histogram of occurrences of visual words over the global vocabulary. Thus, the BoW approach used in computer vision is very similar to the traditional BoW approach in natural language processing and information retrieval. However, as mentioned in [16], "a fundamental difference between an image query (e.g. 1500 visual terms) and a text query (e.g. 3 terms) is largely ignored in existing index design".
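As a purely illustrative aside (not part of the original paper), the following minimal sketch shows how such a histogram can be computed, assuming a precomputed vocabulary of visual-word centroids; all names are ours.

```python
import numpy as np

def bow_histogram(descriptors: np.ndarray, vocabulary: np.ndarray) -> np.ndarray:
    """Quantize local descriptors (n x d) against a visual vocabulary
    (k x d): each descriptor is replaced by the id of its most similar
    visual word, and the image becomes a k-bin histogram of occurrences."""
    # Squared Euclidean distances between all descriptors and all centroids.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    word_ids = d2.argmin(axis=1)
    return np.bincount(word_ids, minlength=vocabulary.shape[0])
```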
Efficiency and memory constraints have been recently addressed by aggregating local descriptors into a fixed-size vector representation that describes the whole image. In particular, the Fisher Vector (FV) and VLAD have shown better performance than BoW [9]. In this work we focus on VLAD, which has been proved to be a simplified non-probabilistic version of FV [9]. Despite its simplicity, VLAD performance is comparable to that of FV [9].
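For concreteness, here is a minimal sketch of the VLAD aggregation described in [9], under the usual formulation: each local descriptor is assigned to its nearest centroid of a k-means codebook, the residuals are accumulated per centroid, and the concatenation is L2-normalized. The code is our illustration, not the authors' implementation.

```python
import numpy as np

def vlad(descriptors: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Aggregate local descriptors (n x d) into a VLAD vector of fixed
    size k*d, given a k-means codebook (k x d)."""
    k, d = centroids.shape
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    v = np.zeros((k, d))
    for desc, c in zip(descriptors, nearest):
        v[c] += desc - centroids[c]      # accumulate residuals per centroid
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v   # L2 normalization
```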
Euclidean Locality-Sensitive Hashing [4] is, as far as we know, the only indexing technique tested with VLAD. While many other similarity search indexing techniques [15] could be applied to VLAD, in this work we decided to investigate the use of inverted files, allowing a comparison of the VLAD and BoW approaches on the same index. Permutation-Based Indexing [3, 2, 5] allows using inverted files to perform similarity search with an arbitrary similarity function. Moreover, in [6, 1] a Surrogate Text Representation (STR) derived from the MI-File [2] has been proposed. The conversion of the image description into a textual form allows us to employ the off-the-shelf indexing and searching abilities of a search engine with little implementation effort.
2. PROPOSED APPROACH
Conventional search engines use inverted files to speed up the resolution of user queries. We are studying a methodology which enables inverted files of standard text search engines to index vectors of locally aggregated descriptors (VLAD) in order to deal with large-scale image search scenarios. To this end, we first encode VLAD features by means of the perspective-based space transformation developed in [2]. The idea underlying this technique is that when two descriptors are very similar, with respect to a given similarity function, they "see" the "world around" them in the same way. In a next step, the "world around" can be encoded as a surrogate text representation (STR), which can be managed with an inverted index using a standard text-based search engine. The conversion of visual descriptors into a textual form allows us to employ off-the-shelf indexing and searching functions with little implementation effort.

Figure 1: Example of perspective-based space transformation and surrogate text representation: 1) from the images we extract the VLAD features, represented by points in a metric space (blue points are reference features; colored points are data features); 2) the points are transformed into permutations of the references; 3) the permutations are transformed into text documents; 4) the text documents associated with the images are indexed.
Our transformation process is shown in Figure 1: the blue points represent reference VLAD features; the other colours represent dataset VLAD features. The figure also shows the encoding of the data features in the transformed space and their representation in textual form (STR). As can be seen intuitively, the strings corresponding to VLAD features X and Y are more similar than those corresponding to X and Z. Therefore, the distance between strings can be interpreted as an approximation of the actual VLAD distance d. Without going into the math, we leverage the fact that a text-based search engine will generate a vector representation of STRs, containing the number of occurrences of words in texts. With simple mathematical manipulations, it is easy to see how applying the cosine similarity to the query vector and a vector in the database corresponding to the string representations gives us a degree of similarity that reflects the similarity order of reference descriptors around descriptors in the original space. Mathematical details of the technique are outlined in [6].
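To make the transformation concrete, here is a minimal sketch in the spirit of [2, 6]; the exact token scheme and truncation length used in the paper are given in [6], so treat the weighting below (the i-th closest of the references repeated k - i times) as an illustrative assumption.

```python
import numpy as np

def surrogate_text(feature: np.ndarray, references: np.ndarray, k: int = 50) -> str:
    """Encode a VLAD feature as a surrogate text document. The feature is
    represented by the ordering of its k closest reference features: the
    i-th closest reference contributes the synthetic term 'R<id>' repeated
    (k - i) times, so term frequencies, and hence the engine's cosine
    similarity, approximate the ordering of references around the feature."""
    dists = np.linalg.norm(references - feature, axis=1)  # distance to each of m references
    ranking = np.argsort(dists)[:k]                       # ids of the k nearest references
    tokens = []
    for i, ref_id in enumerate(ranking):
        tokens.extend([f"R{ref_id}"] * (k - i))           # nearer reference => more repeats
    return " ".join(tokens)
```

The resulting string can be fed unchanged to a standard text search engine such as Lucene, both at indexing and at query time.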
The idea described so far uses a textual representation of the descriptors and a matching measure based on a similarity offered by standard text search engines to order the descriptors in the dataset in decreasing similarity with respect to the query. The result set will increase in precision if we order it using the original distance function used for comparing features. Suppose we are searching for the k most similar (nearest neighbour) descriptors to the query. We can improve the quality of the approximation by re-ranking, using the original distance function d, the first c (c ≥ k) descriptors from the approximate result set, at the cost of c more distance computations. This technique significantly improves accuracy at a very low search cost.
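A minimal sketch of this re-ranking step (our illustration; the store mapping image ids to VLAD vectors is an assumed helper):

```python
import numpy as np

def rerank(query: np.ndarray, approx_ids: list, vlad_store: dict, c: int = 1000) -> list:
    """Re-rank the first c results returned by the text search engine
    using the original VLAD distance d (Euclidean here); the tail of
    the list keeps its approximate order."""
    head = sorted(approx_ids[:c],
                  key=lambda i: np.linalg.norm(vlad_store[i] - query))
    return head + approx_ids[c:]
```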
We applied the STR technique to the VLAD method, comparing both effectiveness and efficiency with the state-of-the-art BoW approach on the same hardware and software infrastructure, using the publicly available and widely adopted 1M photos dataset. Given that the STR combination gives approximate results with respect to a complete sequential scan, we also compare the effectiveness of VLAD-STR with standard VLAD. Moreover, we considered balancing efficiency and effectiveness with both the BoW and VLAD-STR approaches. For VLAD-STR, the trade-off is obtained by varying the number of results used for re-ordering. Thus, we do not only compare VLAD-STR and BoW on specific settings, but show efficiency vs. effectiveness graphs for both.
3. EXPERIMENTS
3.1 Setup
INRIA Holidays [8, 9] is a collection of 1,491 holiday images. The authors selected 500 queries and, for each of them, a list of positive results. To evaluate the approaches on a large scale, we merged the Holidays dataset with the Flickr1M collection (http://press.liacs.nl/mirflickr/) as in [7, 8, 9]. The ground truth is the one built on the INRIA Holidays dataset alone, but it is largely accepted that no relevant images can be found among the Flickr1M images.
Table 1: Effectiveness (mAP) and efficiency (mSec) with respect to the average number of distinct words per query, obtained with the BoW approach by varying the query size.

avg#Words   mAP    avg mSec
        7   0.03        525
       16   0.07        555
       37   0.11        932
       90   0.14       1463
      233   0.15       2343
SIFT descriptors and various vocabularies were made publicly available by Jégou et al. for both the Holidays and the Flickr1M datasets (http://lear.inrialpes.fr/~jegou/data.php). For the BoW approach we used the 20K vocabulary.
For representing the images using the VLAD approach, we selected 64 reference descriptors using k-means over a subset of the Flickr1M dataset. As explained in Section 2, a drawback of the perspective-based space transformation used for indexing the VLAD with a text search engine is that it is an approximate technique. However, to alleviate this problem, we reorder the best results using the actual distance between the VLAD descriptors. For the STR we used 4,000 references (i.e., m = 4,000) randomly selected from the Flickr1M dataset. During the experimentation, 256 references for VLAD and up to 10,000 references for the STR were also tested, but the results were only slightly better than the ones presented, while efficiency was significantly reduced.
All experiments were conducted on an Intel Core i7 CPU at 2.67 GHz with 12.0 GB of RAM, a 2 TB 7200 RPM HDD for the Lucene index, and a 250 GB SSD for the VLAD reordering. We used Lucene v3.6 running on 64-bit Java 6 over Windows 7 Professional.
The quality of the retrieved images is typically evaluated by means of precision and recall measures. As in many other papers [12, 7, 11, 9], we combined this information by means of the mean Average Precision (mAP), which represents the area below the precision–recall curve.
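For reference, a minimal sketch of average precision over a single ranked result list (mAP is its mean over all queries); this is the standard formulation, not code from the paper:

```python
def average_precision(ranked_ids: list, relevant: set) -> float:
    """Average precision of a ranked list, given the set of relevant ids."""
    hits, ap = 0, 0.0
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in relevant:
            hits += 1
            ap += hits / rank          # precision at each relevant hit
    return ap / len(relevant) if relevant else 0.0
```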
3.2 Results
In Table 1, we report the mAP obtained with the BoW approach, varying the size of the query in terms of average number of distinct words. In this case, the query words have been filtered using the tf*idf approach. The average number of words per image, as extracted by the INRIA group, is 1,471, and they were all inserted in the index without any filtering. The filtering was used only for the queries, and results are reported for an average number of distinct words up to 250; in fact, bigger queries result in a heavy load on the system. It is worth mentioning that we were able to obtain 0.23 mAP performing a sequential scan of the dataset with the unfiltered queries.
The results show that while the BoW approach is in principle very effective (i.e., performing a sequential scan), the high number of query visual words needed to achieve good results significantly reduces its usability.

Table 2: Effectiveness (mAP) and efficiency (mSec) obtained with the VLAD approach in combination with STR, with respect to the number of results used for reordering.

#reordered   mAP    avg mSec
         0   0.13        139
       100   0.24        205
      1000   0.29        800
      2000   0.30       1461
      4000   0.31       2784
In Table 2, we report the results obtained using the VLAD approach in combination with the STR. Given that for indexing the images we used an STR, it is useful to reorder the best results obtained from the text search engine using the actual VLAD distance. Thus, we report mAP and avg mSec per query for the non-reordering case and for various numbers of results used for reordering. The reordering phase dominates the average query time, but it significantly improves effectiveness, especially if only 100 or 1,000 objects are considered for reordering. As mentioned before, we make use of an SSD to speed up the reordering phase, but even higher efficiency could be obtained using PCA as proposed in [9]. Please note that even though the reordering phase cost for VLAD can be reduced, the reported results already show that VLAD outperforms BoW.
It is worth mentioning that we also performed a sequential scan of the entire dataset, obtaining a mAP of 0.34 for VLAD. In fact, as depicted in Table 2, the results obtained with the VLAD-STR approach are an approximation of the results obtained with a complete pairwise comparison between the query and the dataset objects. The same is true when LSH indexing is used as in [9]. Results show that the approximation introduced by the STR does not significantly impact the effectiveness of the system when at least 1,000 objects are considered for reordering.
In Figure 2, we plot mAP with respect to the average query execution time for both BoW and VLAD, as reported in Table 1 and Table 2. The graph underlines both the efficiency and effectiveness advantages of the VLAD technique with respect to the BoW approach. The efficiency vs. effectiveness graph reveals that VLAD-STR obtains the same mAP values as BoW for an order of magnitude less in response time. Moreover, for the same response time, VLAD-STR is able to obtain twice the mAP of BoW.

Figure 2: Effectiveness (mAP) with respect to efficiency (mSec per query) obtained by VLAD and BoW for various settings.
4. CONCLUSIONS
In this work, we proposed the usage of STR in combination with VLAD descriptions in order to index VLAD with off-the-shelf text search engines. Using the very same hardware and text search engine (i.e., Lucene), we were able to compare with the state-of-the-art BoW approach. Results obtained for BoW confirm that the high number of visual terms in the query significantly reduces the efficiency of inverted lists. Even though results showed that this can be mitigated by reducing the number of visual terms in the query with a tf*idf weighting scheme, VLAD-STR significantly outperforms BoW in terms of both efficiency and effectiveness. The efficiency vs. effectiveness graph reveals that VLAD-STR is able to obtain the same mAP values as BoW for an order of magnitude less in response time. Moreover, for the same response time, VLAD-STR is able to obtain twice the mAP of BoW.
Acknowledgments
This work was partially supported by the Europeana network of Ancient Greek and Latin Epigraphy (EAGLE, grant agreement number: 325122), co-funded by the European Commission within the ICT Policy Support Programme.
5. REFERENCES
[1] G. Amato, P. Bolettieri, F. Falchi, C. Gennaro, and F. Rabitti. Combining local and global visual feature similarity using a text search engine. In Content-Based Multimedia Indexing (CBMI), 2011 9th International Workshop on, pages 49–54, June 2011.
[2] G. Amato and P. Savino. Approximate similarity search in metric spaces using inverted files. In Proceedings of the 3rd International Conference on Scalable Information Systems (InfoScale '08), pages 28:1–28:10, Brussels, Belgium, 2008. ICST.
[3] G. Chavez, K. Figueroa, and G. Navarro. Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1647–1658, Sept. 2008.
[4] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry (SCG '04), pages 253–262, New York, NY, USA, 2004. ACM.
[5] A. Esuli. MiPai: Using the PP-Index to build an efficient and scalable similarity search system. In Proceedings of the 2009 Second International Workshop on Similarity Search and Applications (SISAP '09), pages 146–148, Washington, DC, USA, 2009. IEEE Computer Society.
[6] C. Gennaro, G. Amato, P. Bolettieri, and P. Savino. An approach to content-based image retrieval based on the Lucene search engine library. In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2010), LNCS, 2010.
[7] H. Jégou, M. Douze, and C. Schmid. Packing bag-of-features. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2357–2364, Sept.–Oct. 2009.
[8] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3304–3311, June 2010.
[9] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, Sept. 2012.
[10] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, Oct. 2005.
[11] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3384–3391, June 2010.
[12] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[13] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2 (ICCV '03), pages 1470–1477, Washington, DC, USA, 2003. IEEE Computer Society.
[14] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.
[15] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search - The Metric Space Approach, volume 32 of Advances in Database Systems. Kluwer, 2006.
[16] X. Zhang, Z. Li, L. Zhang, W.-Y. Ma, and H.-Y. Shum. Efficient indexing for large scale visual search. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1103–1110, Sept.–Oct. 2009.