Large Scale Indexing and Searching Deep
Convolutional Neural Network Features
Giuseppe Amato, Franca Debole, Fabrizio Falchi,
Claudio Gennaro, and Fausto Rabitti*
Via G. Moruzzi 1
56124 Pisa - Italy
Abstract. Content-based image retrieval using Deep Learning has become
very popular during the last few years. In this work, we propose an
approach to index Deep Convolutional Neural Network Features to support
efficient retrieval on very large image databases. The idea is to provide
a text encoding for these features, enabling the use of a text retrieval
engine to perform image similarity search. In this way, we built LuQ, a
robust retrieval system that combines full-text search with content-based
image retrieval capabilities. In order to optimize the index occupation
and the query response time, we evaluated various tuning parameters to
generate the text encoding. We have also developed a web-based
prototype to efficiently search through a dataset of 100 million images.
Keywords: Convolutional Neural Network, Deep Learning, Inverted In-
dex, Image Retrieval
1 Introduction
Deep Convolutional Neural Networks (DCNNs) have recently shown impressive
performance in Computer Vision tasks such as image classification and object
recognition [15, 21, 9]. The activations of the DCNN hidden layers have also
been used in the context of transfer learning and content-based image
retrieval [6, 19]. In fact, Deep Learning methods are "representation-learning
methods with multiple levels of representation, obtained by composing simple
but non-linear modules that each transform the representation at one level
(starting with the raw input) into a representation at a higher, slightly more
abstract level" [16]. These representations can be successfully used as
features in generic recognition or visual similarity search tasks.
*This work was partially funded by: EAGLE, Europeana network of Ancient
Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-
PSP.2012.2.1 - Europeana and creativity, Grant Agreement n. 325122; and Smart
News, Social sensing for breaking news, co-funded by the Tuscany region under
the FAR-FAS 2014 program, CUP CIPE D58C15000270008.
The first layers of a DCNN are typically useful for recognizing low-level
characteristics of images, such as edges and blobs, while higher layers have
proved more suitable for semantic similarity search. One major obstacle to the
use of DCNN features for indexing large image datasets is their internal
representation, which is high-dimensional and therefore subject to the curse
of dimensionality. For instance, in the well-known AlexNet architecture [15]
the output of the sixth layer (fc6) has 4,096 dimensions, while the fifth layer
(pool5) has 9,216 dimensions. An effective approach to tackling the
dimensionality curse is the application of approximate access methods such as
permutation-based indexes.
A drawback of these approaches is, on the one hand, the dependence of the
index on a set of reference objects (or pivots) and, on the other hand, the
need to reorder the result set of the search according to the original feature
distance. The former issue requires the selection of a set of reference objects
that represents well the variety of the datasets we want to index. The latter
issue forces the use of a support database to store the features, which
requires efficient random I/O on fast secondary memory (such as flash-based
storage).
In this paper, we propose an approach, which we refer to as LuQ, to
specifically index DCNN features to support efficient content-based retrieval
on large datasets. The proposed approach exploits the ability of inverted files
to deal with the sparsity of the convolutional features. To this end, we make
use of the efficient and robust full-text search library Lucene. The idea is to
associate each component of the feature vector with a unique alphanumeric
keyword and to generate a textual representation in which we boost the relative
term proportionally to the value of the corresponding component.
A browser-based application that provides an interface for combined textual
and visual searching on a dataset of about 100 million images is available at
http://melisandre.deepfeatures.org.
The paper is organized as follows. Section 2 surveys related work. Section 3
presents the proposed approach. Section 4 discusses the validation tests.
Section 5 concludes.
2 Related Work
Recently, a new class of image descriptors built upon Deep Convolutional
Neural Networks (DCNNs) has been used as an effective alternative to
descriptors built using local features such as SIFT, SURF, ORB, BRIEF, etc.
Starting from 2012, DCNNs have attracted enormous interest within the Computer
Vision community because of the state-of-the-art results achieved in the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In Computer Vision,
DCNNs have been used to perform several tasks, including not only image
classification but also image retrieval [6, 2] and object detection, to cite
some. In particular, it has been shown that the multiple levels of
representation learned by a DCNN on a specific task (typically supervised) can
be used to transfer learning across tasks. This means that the activations of
the neurons of specific layers, preferably the last ones, can be used as
features for describing the visual content.
Liu et al. [17] proposed a framework that adapts the Bag-of-Words model and
the inverted table to DCNN feature indexing, which is similar to LuQ. However,
for large-scale datasets, Liu et al. have to use a product quantization method
to learn a large-scale visual dictionary from a training set of global DCNN
features. In any case, using this approach the authors reported a search time
that is one order of magnitude higher than in our case for the same dataset.
3 Text Encoding for Deep Convolutional Neural Network
3.1 Feature Indexing
The main idea of LuQ is to index DCNN features using a text encoding that
allows us to use a text retrieval engine to perform image similarity search. As
discussed later, we implemented this idea on top of the Lucene text retrieval
engine; however, any full-text engine supporting the vector space model can be
used for this purpose.
In principle, to perform similarity search on DCNN feature vectors we should
compare the vector extracted from the query with all the vectors extracted from
the images of the dataset and take the nearest images according to the L2
distance. This is sometimes called the brute-force approach. However, if the
database is large, the brute-force approach can become time-consuming.
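For reference, the brute-force scan described above can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary-based dataset layout, and the toy vectors are assumptions made for the example and are not taken from the paper.

```python
import heapq
import math

def l2_distance(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def brute_force_knn(query, dataset, k):
    """Return the k (distance, image_id) pairs closest to the query.

    dataset maps an image id to its feature vector. Every vector is
    visited, so the cost grows linearly with the collection size.
    """
    scored = ((l2_distance(query, vec), img_id) for img_id, vec in dataset.items())
    return heapq.nsmallest(k, scored)

# Toy, hypothetical unit-length vectors (real fc6 features have 4,096 dimensions).
toy_dataset = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.6, 0.8, 0.0],
}
nearest = brute_force_knn([0.0, 1.0, 0.0], toy_dataset, k=2)
```

Each query touches every stored vector, which is why this strategy stops being practical on collections of millions of images.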
Starting from this approach, we observed that given the sparsity of the DCNN
features, which contain mostly zeros (about 75%), we are able to use a well-
known IR technique, i.e., an inverted index.
In fact, DCNN feature vectors typically exhibit better results if they are
L2-normalized to unit length [20, 4]. If two vectors x and y have length equal
to 1 (as in our case), the following relationship holds between the L2
distance $d_2(x, y)$ and the inner product $x \cdot y$:
$$ d_2(x, y)^2 = 2\,(1 - x \cdot y) $$
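The relationship follows from expanding the squared norm. A short derivation, using only the unit-length assumption stated above, is:

```latex
\begin{aligned}
d_2(x, y)^2 &= \lVert x - y \rVert_2^2 \\
            &= \lVert x \rVert_2^2 + \lVert y \rVert_2^2 - 2\, x \cdot y \\
            &= 1 + 1 - 2\, x \cdot y \\
            &= 2\,(1 - x \cdot y)
\end{aligned}
```

Ranking by decreasing inner product is therefore equivalent to ranking by increasing L2 distance.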
The advantage of computing the similarity between vectors in terms of the
inner product is that we can efficiently exploit the sparsity of the vectors by
accumulating the products of the non-zero entries of x with the corresponding
non-zero entries of y. Moreover, Lucene, like other search engines, computes
the similarity between documents using the cosine similarity, which is the
inner product of the two vectors divided by the product of their lengths.
Therefore, in our case, cosine similarity and inner product coincide. Having
ascertained this behavior, our idea is to fill the inverted index of Lucene
with the DCNN feature vectors. For space-saving reasons, however, text search
engines do not store float numbers in the posting entries of the inverted index
representing documents; rather, they store the term frequencies, which are
represented as integers. Therefore, we must guarantee that the posting entries
contain numeric values proportional to the float values of the deep feature
entries.
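The accumulation over posting lists can be illustrated with a minimal Python sketch. It does not reproduce Lucene's internals: the names `build_inverted_index` and `inner_products`, the dictionary-based index layout, and the toy term frequencies are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: {term: integer tf}} -> {term: [(doc_id, tf), ...]}"""
    index = defaultdict(list)
    for doc_id, term_freqs in docs.items():
        for term, tf in term_freqs.items():
            if tf > 0:  # zero entries are never stored
                index[term].append((doc_id, tf))
    return index

def inner_products(index, query):
    """Accumulate query_tf * doc_tf over the posting lists of the query terms."""
    scores = defaultdict(int)
    for term, q_tf in query.items():
        for doc_id, d_tf in index.get(term, ()):
            scores[doc_id] += q_tf * d_tf
    return dict(scores)

# Hypothetical quantized features: term -> integer term frequency.
docs = {
    "img_a": {"f2": 4, "f3": 2},
    "img_b": {"f1": 3, "f3": 1},
    "img_c": {"f4": 5},
}
index = build_inverted_index(docs)
scores = inner_products(index, {"f2": 1, "f3": 2})
```

Only the posting lists of the query terms are read, so documents that share no non-zero component with the query (here `img_c`) are never touched.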
To implement this idea, in LuQ we provide a text encoding for the DCNN
feature vectors that guarantees direct proportionality between the feature
components and the term frequencies. Let $w = (w_1, \dots, w_m)$ denote the
L2-normalized DCNN vector of m dimensions. First, we associate each of its
components $w_i$ with a unique alphanumeric term $\tau_i$ (for instance, the
prefix 'f' followed by the numeric value of the index i). The text encoding
doc(w) corresponding to the vector w is given by:
$$ \mathrm{doc}(w) = \bigcup_{i=1}^{m} \bigcup_{j=1}^{\lfloor Q w_i \rfloor} \tau_i $$
where $\lfloor \cdot \rfloor$ denotes the floor function and Q is a
multiplication factor > 1 that works as a quantization factor².
Therefore, we form the text encoding of $w_i$ by repeating the term $\tau_i$,
for the non-zero components, a number of times directly proportional to $w_i$.
This process introduces a quantization error due to the representation of float
components as integers. However, as we will see, this error does not affect the
retrieval effectiveness. The accuracy of this approximation depends on the
factor Q used to transform the vector w. For instance, if we fix Q = 2, then
$\lfloor Q w_i \rfloor = 0$ for $w_i < 0.5$, while $\lfloor Q w_i \rfloor \geq 1$
for $w_i \geq 0.5$. On the other hand, the smaller we set Q, the smaller the
inverted index will be, because the floor function sets more entries of the
posting lists to zero. Hence, we have to find a good trade-off between the
effectiveness of the retrieval system and its space occupation.
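The trade-off can be made explicit with a small bound. This is a standard property of the floor function rather than a result reported in the experiments: the value reconstructed from a posting entry differs from the original component by less than 1/Q, and a non-negative component is dropped exactly when it is smaller than 1/Q.

```latex
0 \le w_i - \frac{\lfloor Q w_i \rfloor}{Q} < \frac{1}{Q},
\qquad\qquad
\lfloor Q w_i \rfloor = 0 \iff 0 \le w_i < \frac{1}{Q}
```

A larger Q therefore lowers the per-component error but keeps more non-zero entries in the posting lists.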
For example, if we set Q = 30 and we have a feature vector with just three
components, w = (0.01, 0.15, 0.09), the corresponding integer representation
of the vector will be (0, 4, 2) and its text encoding will be
doc(w) = "f2 f2 f2 f2 f3 f3".
Since on average about 25% of the DCNN feature components are non-zero (in our
specific case, for the fc6 layer), the corresponding text encoding contains
only a small fraction of the unique terms present in the whole dictionary
(composed of 4,096 terms). In our case, a document contains on average about
275 unique terms, which is about 6.7% of the dictionary, because quantization
sets to zero the feature components smaller than 1/Q.
3.2 Query Reduction
When we process a similarity search, the query is itself the text encoding of
a feature vector, so the search engine has to handle queries of that size
(about 275 unique terms on average). These unusually long queries can affect
the response time if the inverted index contains millions of items.
² By abuse of notation, we denote the space-separated concatenation of keywords
with the union operator ∪.
A quite intuitive way to overcome this issue is to reduce the size of the query
by exploiting the tf*idf (i.e., term frequency * inverse document frequency)
statistic of the text encoding, which comes for free in standard full-text
retrieval engines. We retain the elements of the query that exhibit the
greatest tf*idf values and eliminate the others. For instance, for a query of
about 275 unique terms on average, if we keep only the ten terms with the
highest tf*idf, we obtain a query time reduction of about 96%.
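A minimal Python sketch of this term selection is given here. The function name `reduce_query`, the toy statistics, and in particular the idf variant log(N / df) are assumptions made for illustration, since the exact weighting formula is not spelled out in the text.

```python
import math

def reduce_query(query_tf, doc_freq, num_docs, max_terms):
    """Keep the max_terms query terms with the highest tf*idf weight.

    query_tf: {term: term frequency in the query encoding}
    doc_freq: {term: number of indexed documents containing the term}
    idf is taken as log(num_docs / df), one common variant (an assumption).
    """
    def tf_idf(term):
        df = doc_freq.get(term, 0)
        if df == 0:
            return 0.0  # a term absent from the index cannot match anything
        return query_tf[term] * math.log(num_docs / df)

    kept = sorted(query_tf, key=tf_idf, reverse=True)[:max_terms]
    return {term: query_tf[term] for term in kept}

# Hypothetical query and collection statistics.
query_tf = {"f1": 3, "f2": 1, "f3": 2, "f4": 1}
doc_freq = {"f1": 900, "f2": 10, "f3": 50, "f4": 1000}
reduced = reduce_query(query_tf, doc_freq, num_docs=1000, max_terms=2)
```

Terms that occur in almost every document (here `f1` and `f4`) are dropped first, which also avoids scanning the longest posting lists.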
This query reduction comes, however, at a price: it decreases the precision
of the results. To attenuate this problem, for a top-k query, we reorder the
results using the cosine similarity between the original query (i.e., the one
without reduction) and the first $C_r \times k$ candidate documents retrieved,
where $C_r$ is an amplification factor that we refer to as the reordering
factor. For instance, if we have to return k = 100 images and we set
$C_r = 10$, we take and reorder the first 10 × 100 = 1000 candidate documents
retrieved by the reduced query.
In order to calculate the cosine similarity between the original query and the
$C_r \times k$ candidates, we have to reconstruct the quantized features by
accessing the posting lists of the documents returned by the search engine. As
we will see, this approach does not significantly affect the efficiency of the
query but can offer great improvements in terms of effectiveness.
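The reordering step can be sketched as follows. This is an illustrative Python outline under stated assumptions: the names `cosine` and `rerank` are hypothetical, and the quantized features of the candidates are assumed to be already available as term-frequency dictionaries.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rerank(full_query, candidates, term_vectors, k):
    """Reorder candidate ids by cosine similarity with the unreduced query.

    candidates:   doc ids returned by the reduced query (at most Cr * k)
    term_vectors: {doc_id: {term: integer tf}}, the quantized features
    """
    ranked = sorted(candidates,
                    key=lambda doc_id: cosine(full_query, term_vectors[doc_id]),
                    reverse=True)
    return ranked[:k]

# Hypothetical quantized features of three candidates.
term_vectors = {
    "img_a": {"f2": 4, "f3": 2},
    "img_b": {"f2": 1, "f7": 6},
    "img_c": {"f2": 4, "f3": 2, "f9": 1},
}
full_query = {"f2": 4, "f3": 2}
top = rerank(full_query, ["img_b", "img_c", "img_a"], term_vectors, k=2)
```

Because only $C_r \times k$ candidates are rescored, the cost of this step is independent of the collection size.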
Figure 1 summarizes the indexing and searching phases of LuQ.
In order to test the efficiency of LuQ, we used the Yahoo Flickr Creative
Commons 100 Million (YFCC100M) dataset. This dataset was created in 2014 as
part of the Yahoo Webscope program. YFCC100M consists of 99.2 million photos
and 0.8 million videos uploaded to Flickr between 2004 and 2014 and released
under a Creative Commons commercial or noncommercial license. More information
about the dataset can be found in the article in Communications of the
ACM [22].
For extracting deep features we used the Caffe [14] deep learning framework.
In particular, we used the Hybrid-DCNN neural network, whose model and weights
are publicly available in the Caffe Model Zoo. The Hybrid-DCNN was trained on
1,183 categories (205 scene categories from the Places Database and 978 object
categories from the training data of ILSVRC2012 (ImageNet)) with 3.6 million
images [23]. The architecture is the same as the Caffe reference network.
The deep features we have extracted are the activations of the fc6 layer. We
have made them publicly available at http://www.deepfeatures.org and they
will soon be included in the Multimedia Commons initiative corpus.
A ground truth is not yet available for the YFCC100M dataset. Therefore, the
effectiveness of the proposed approach was evaluated using the INRIA Holidays
dataset [13, 11]. It is a collection of 1,491 holiday images. The authors
selected 500 queries and, for each of them, a list of positive results. As in
[10, 12, 11], to evaluate the approach on a large scale, we merged the Holidays
dataset with the Flickr1M collection. We also extracted DCNN features from
these datasets in order to test our technique.
[Figure 1: diagram with indexing, querying, and reordering stages; labels: data
vector w, query vector wq, query doc(wq), reduced query doc(wq), Cr * k.]
Fig. 1. Diagram showing the indexing and searching phases of LuQ.
All experiments were conducted on an Intel Core i7 CPU at 2.67 GHz with 12.0
GB of RAM and a 2 TB 7200 RPM hard disk for the Lucene index. We used Lucene
v5.5 running on 64-bit Java 8.
The quality of the retrieved images is typically evaluated by means of
precision and recall measures. As in many other papers [10, 18, 11], we
combined this information by means of the mean Average Precision (mAP), which
represents the area under the precision-recall curve.
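As an illustration of the measure, the following Python sketch computes the common non-interpolated form of mAP. The function names and the toy rankings are assumptions. The official Holidays evaluation protocol may differ in details, such as how the query image itself is treated.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values measured at each relevant hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(results, ground_truth):
    """results: {query: ranked ids}; ground_truth: {query: relevant ids}."""
    scores = [average_precision(results[q], ground_truth[q]) for q in ground_truth]
    return sum(scores) / len(scores)

# Hypothetical rankings for two queries.
results = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
ground_truth = {"q1": ["a", "b"], "q2": ["c"]}
m_ap = mean_average_precision(results, ground_truth)
```

In this toy case the average precision is (1/1 + 2/3) / 2 for q1 and 1/2 for q2, so the mAP is 2/3.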
4.2 Effectiveness and Quantization Factor Q
In a first experimental analysis, we evaluated the optimal value of Q over the
Flickr1M dataset. As explained above, by keeping the value of Q to a minimum,
we can reduce the space occupation of the inverted index. Figure 2 shows the
mAP as a function of Q and the corresponding space occupation of the inverted
index. From this analysis, we conclude that an optimal choice of the
quantization factor Q is 30, which leads to a mAP of 0.62 and a space
occupation of 2.31 GB. We stress that the mAP using the brute-force approach
(on the exact DCNN feature vectors) is about 0.60. This means that our
quantization error leads to a slight improvement of the precision for Q ≥ 30.
Another important aspect is that this effectiveness was obtained by forcing
Lucene to use the standard inner product on tf weights, without idf or any
other document normalization. A further improvement can be obtained using the
Lucene similarity function called LMDirichlet, which provided a mAP of about
0.64.
Figure 3 shows the document frequency distribution, in the Flickr1M dataset,
of the terms $\tau_i$ (i.e., components $w_i$), sorted in decreasing order. As
can be seen, the distribution is quite skewed and some terms are much more
frequent than others, with document frequencies ranging from 313 to 378,876 in
a collection of about 1 million features. This aspect has some impact on the
performance of the inverted index, since it means that the posting list of the
most frequent term $\tau_x$ has 378,876 items.
The observation about document frequency leads to the idea of using tf*idf
to reduce the query length by cutting off the terms with lower tf*idf weight.
Since in inverted files the query time is usually proportional to the length of
the query, this approach gives a great improvement in terms of query response
time.
Figure 4 shows the mAP values for different values of the reordering factor
$C_r$ and the query length $L_q$. Note that $C_r = 0$ means no reordering,
$C_r = 1$ means reordering of the first k candidates, $C_r = 2$ reordering of
the first 2k candidates, and so on. Concerning $L_q$, we have considered a
range of values between 2 and 50. Since the average document length is about
275, this corresponds to an average query length reduction of between roughly
80% ($L_q = 50$) and more than 99% ($L_q = 2$). In the graph of Figure 4, we
have also plotted the mAP level obtained with the query without reduction
(namely fullQ) and the mAP obtained with a sequential scan using L2
(brute-force). In all experiments we have set k = 100. As these experiments
show, the configuration $C_r = 10$ and $L_q = 8$ exhibits a mAP comparable to
the case of brute-force using L2.
Fig. 2. Effectiveness (mAP) vs space occupation for increasing values of the
quantization factor Q.
4.3 Evaluation of the Efficiency
In order to evaluate the efficiency of LuQ, in Figure 5 we have plotted the
average search time for the same queries of the previous experiments. Please
note that the y-axis is in logarithmic scale. When there is no reordering
(i.e., $C_r = 0$), the length of the query has a great impact on the search
time (more than one order of magnitude). For increasing values of the
reordering factor $C_r$, $L_q$ has a lower and lower influence on the search
time. Clearly, by increasing $C_r$ we increase the cost of the search. However,
as we can see, there is still a big improvement in efficiency even when the
reordering factor is at its maximum, i.e., $C_r = 10$.
In order to further validate LuQ, we tested our index on the much larger
YFCC100M dataset. We used the DCNN features of the fc6 layer and indexed them
with Lucene using a quantization factor Q = 30. Since a ground truth is not
available for this dataset, we only report the performance of the query in
terms of average search time (see Figure 6). From this experiment, we see that,
for instance, for the configuration $L_q = 10$ and $C_r = 10$ we have an
average query time of less than 4 seconds (without any parallelization), which
is quite encouraging considering that for the same dataset the query time was
of the order of 10 minutes using the brute-force approach. In this case, we
observe that $C_r$ has practically no effect on the efficiency of LuQ.
Fig. 3. Distribution of the vector components of the DCNN features (Q = 30).
5 Conclusions and Future Work
In this work, we proposed LuQ, an efficient approach to building a CBIR system
on top of a text search engine, specifically developed for Deep Convolutional
Neural Network features. This approach is very straightforward and does not
demand costly elaborations during the indexing process, unlike, for instance,
the permutation-based approaches, which require ordering a set of predefined
reference features for each feature to be indexed. Moreover, in our approach,
we can tune the query cost versus the quality of the approximation by
specifying the length of the query, without the need to maintain the original
features for reordering the result set.
We evaluated different implementation strategies to balance index occupation,
effectiveness, and query response time. In order to show that our approach is
able to handle large datasets, we have developed a browser-based application
that provides an interface for combined textual and visual searching on a
dataset of about 100 million images, available at
http://melisandre.deepfeatures.org. The whole Lucene 5.5 archive of the LuQ
approach is also available for download from the same location. This index can
be directly queried by simply extracting the term vectors from the archive.
This work was partially funded by: EAGLE, Europeana network of Ancient
Greek and Latin Epigraphy, co-funded by the European Commission,
CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Grant Agreement n. 325122; and
Smart News, Social sensing for breaking news, co-funded by the Tuscany region
under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.
Fig. 4. Effectiveness (mAP) for various query lengths $L_q$ and reordering
factors $C_r$ (with k = 100; $C_r = 0$ means no reordering) vs the query
without reduction (fullQ), and the mAP obtained with sequential scan using L2.
References
1. Giuseppe Amato, Claudio Gennaro, and Pasquale Savino. MI-File: using inverted
files for scalable approximate similarity search. Multimedia Tools and Applications,
2. Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. Neural
codes for image retrieval. In Computer Vision–ECCV 2014, pages 584–599.
3. Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information
retrieval: Implementing and evaluating search engines. MIT Press, 2010.
4. Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return
of the devil in the details: Delving deep into convolutional nets. arXiv preprint
5. G.E. Chavez, K. Figueroa, and G. Navarro. Effective proximity retrieval by
ordering permutations. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 30(9):1647–1658, Sept. 2008.
6. Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng,
and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual
recognition. arXiv preprint arXiv:1310.1531, 2013.
7. ZongYuan Ge, Chris McCool, Conrad Sanderson, and Peter Corke. Modelling
local deep convolutional neural network features to improve fine-grained image
classification. In Image Processing (ICIP), 2015 IEEE International Conference
on, pages 4112–4116. IEEE, 2015.
Fig. 5. Average search time (sec) for various query lengths $L_q$ and
reordering factors $C_r$ (with k = 100; $C_r = 0$ means no reordering) vs the
query without reduction (fullQ), and the mAP obtained with sequential scan
using L2.
8. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 580–587,
9. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. arXiv preprint arXiv:1512.03385, 2015.
10. H. Jégou, M. Douze, and C. Schmid. Packing bag-of-features. In Computer Vision,
2009 IEEE 12th International Conference on, pages 2357–2364, 29 2009-oct. 2
11. H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid.
Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 34(9):1704–1716, Sept 2012.
12. Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Improving bag-of-features for
large scale image search. International Journal of Computer Vision, 87:316–336,
13. Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating
local descriptors into a compact image representation. In IEEE Conference on
Computer Vision & Pattern Recognition, pages 3304–3311, jun 2010.
14. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long,
Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional
architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
15. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
Fig. 6. Average search time (sec) for the YFCC100M dataset using Lucene.
16. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,
17. Ruoyu Liu, Yao Zhao, Shikui Wei, Zhenfeng Zhu, Lixin Liao, and Shuang
Qiu. Indexing of cnn features for large scale image search. arXiv preprint
18. F. Perronnin, Yan Liu, J. Sanchez, and H. Poirier. Large-scale image retrieval with
compressed fisher vectors. In Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pages 3384–3391, june 2010.
19. Ali Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn
features off-the-shelf: an astounding baseline for recognition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages
20. Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson.
Cnn features off-the-shelf: An astounding baseline for recognition. In The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops,
21. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
22. Bart Thomee, Benjamin Elizalde, David A Shamma, Karl Ni, Gerald Friedland,
Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in
multimedia research. Communications of the ACM, 59(2):64–73, 2016.
23. Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva.
Learning deep features for scene recognition using places database. In Advances
in neural information processing systems, pages 487–495, 2014.