Large Scale Indexing and Searching Deep Convolutional Neural Network Features

Giuseppe Amato, Franca Debole, Fabrizio Falchi, Claudio Gennaro, and Fausto Rabitti*

ISTI-CNR, Via G. Moruzzi 1, 56124 Pisa, Italy
name.surname@isti.cnr.it
Abstract. Content-based image retrieval using Deep Learning has become very popular during the last few years. In this work, we propose an approach to index Deep Convolutional Neural Network Features to support efficient retrieval on very large image databases. The idea is to provide a text encoding for these features enabling the use of a text retrieval engine to perform image similarity search. In this way, we built LuQ, a robust retrieval system that combines full-text search with content-based image retrieval capabilities. In order to optimize the index occupation and the query response time, we evaluated various tuning parameters to generate the text encoding. To this end, we have developed a web-based prototype to efficiently search through a dataset of 100 million images.
Keywords: Convolutional Neural Network, Deep Learning, Inverted Index, Image Retrieval
1 Introduction
Deep Convolutional Neural Networks (DCNNs) have recently shown impressive performance on Computer Vision tasks such as image classification and object recognition [15, 21, 9]. The activations of the DCNN hidden layers have also been used in the context of transfer learning and content-based image retrieval [6, 19]. In fact, Deep Learning methods are "representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level" [16]. These representations can be successfully used as features in generic recognition or visual similarity search tasks.
* This work was partially funded by: EAGLE, Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Grant Agreement n. 325122; and Smart News, Social sensing for breaking news, co-funded by the Tuscany Region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.
The first layers of a DCNN are typically useful for recognizing low-level characteristics of images such as edges and blobs, while higher layers have proved more suitable for semantic similarity search. One major obstacle to the use of DCNN features for indexing large datasets of images is their internal representation, which is high-dimensional, leading to the curse of dimensionality [7]. For instance, in the well-known AlexNet architecture [15], the output of the sixth layer (fc6) has 4,096 dimensions, while the fifth layer (pool5) has 9,216 dimensions. An effective approach to tackling the dimensionality curse is the application of approximate access methods such as permutation-based indexes [5, 1].
A drawback of these approaches is, on one hand, the dependence of the index on a set of reference objects (or pivots) and, on the other hand, the need to reorder the result set of the search according to the original feature distance. The former issue requires the selection of a set of reference objects that represents well the variety of the datasets we want to index. The latter forces us to use a support database to store the features, which requires efficient random I/O on fast secondary memory (such as flash-based storage).
In this paper, we propose an approach to specifically index DCNN features to support efficient content-based retrieval on large datasets, which we refer to as LuQ. The proposed approach exploits the ability of inverted files to deal with the sparsity of the convolutional features. To this end, we make use of the efficient and robust full-text search library Lucene¹. The idea is to associate each component of the feature vector with a unique alphanumeric keyword and to generate a textual representation in which each term is boosted proportionally to the intensity of the corresponding component.
A browser-based application that provides an interface for combined textual and visual searching on a dataset of about 100 million images is available at http://melisandre.deepfeatures.org.

¹ http://lucene.apache.org
The paper is organized as follows. Section 2 surveys related work. Section 3 presents the proposed approach. Section 4 discusses the validation tests. Section 5 concludes.
2 Related Work
Recently, a new class of image descriptors, built upon Deep Convolutional Neural Networks (DCNNs), has been used as an effective alternative to descriptors built using local features such as SIFT, SURF, ORB, BRIEF, etc. Starting from 2012 [15], DCNNs have attracted enormous interest within the Computer Vision community because of the state-of-the-art results achieved in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In Computer Vision, DCNNs have been used to perform several tasks, including not only image classification but also image retrieval [6, 2] and object detection [8], to cite a few. In particular, it has been shown that the multiple levels of representation that are learned by a DCNN on a specific (typically supervised) task
can be used to transfer learning across tasks: the activations of neurons of specific layers, preferably the last ones, can be used as features for describing the visual content [19].
Liu et al. [17] proposed a framework that adapts the Bag-of-Words model and inverted table to DCNN feature indexing, which is similar to LuQ. However, for large-scale datasets, Liu et al. have to learn a large-scale visual dictionary from a training set of global DCNN features using the product quantization method. In any case, with this approach the authors reported a search time one order of magnitude higher than ours on the same dataset.
3 Text Encoding for Deep Convolutional Neural Network Features
3.1 Feature Indexing
The main idea of LuQ is to index DCNN features using a text encoding that allows us to use a text retrieval engine to perform image similarity search. As discussed later, we implemented this idea on top of the Lucene text retrieval engine; however, any full-text engine supporting the vector space model can be used for this purpose.
In principle, to perform similarity search on DCNN feature vectors, we should compare the vector extracted from the query with all the vectors extracted from the images of the dataset and take the nearest images according to the L2 distance. This is sometimes called the brute-force approach; however, if the database is large, it becomes time-consuming.
Starting from this observation, we noted that, given the sparsity of the DCNN features, which contain mostly zeros (about 75%), we are able to use a well-known IR technique: the inverted index.
In fact, DCNN feature vectors typically exhibit better results if they are L2-normalized to unit length [20, 4]. If two vectors x and y have length equal to 1 (as in our case), the following relationship between the L2 distance d2(x, y) and the inner product x·y holds:

d2(x, y)² = 2(1 − x·y)
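This relation follows directly by expanding the squared Euclidean distance and using ‖x‖ = ‖y‖ = 1:

d2(x, y)² = ‖x − y‖² = ‖x‖² + ‖y‖² − 2 x·y = 2 − 2 x·y = 2(1 − x·y)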
The advantage of computing the similarity between vectors in terms of inner product is that we can efficiently exploit the sparsity of the vectors by accumulating the products of the non-zero entries of x with the corresponding non-zero entries of y. Moreover, Lucene, like other search engines, computes the similarity between documents using the cosine similarity, i.e., the inner product of the two vectors divided by the product of their lengths; in our case, therefore, cosine similarity and inner product coincide. Having ascertained this behavior, our idea is to fill the inverted index of Lucene with the DCNN
feature vectors. For space-saving reasons, however, text search engines do not store float numbers in the posting entries of the inverted index representing documents; rather, they store term frequencies, which are represented as integers. Therefore, we must guarantee that posting entries contain numeric values proportional to the float values of the deep feature entries.
To implement this idea, in LuQ, we provide a text encoding for the DCNN feature vectors that guarantees direct proportionality between the feature components and the term frequencies. Let w = (w1, . . . , wm) denote the L2-normalized DCNN vector of m dimensions. First, we associate each component wi with a unique alphanumeric term τi (for instance, the prefix 'f' followed by the numeric value of the index i). The text encoding doc(w) corresponding to the vector w is given by:

doc(w) = ∪_{i=1}^{m} ∪_{j=1}^{⌊Q wi⌋} τi

where ⌊·⌋ denotes the floor function and Q is a multiplication factor > 1 that works as a quantization factor².
Therefore, we form the text encoding of w by repeating each term τi of the non-zero components a number of times directly proportional to wi. This process introduces a quantization error due to the representation of float components as integers. However, as we will see, this error does not affect the retrieval effectiveness. The accuracy of this approximation depends on the factor Q used to transform the vector w. For instance, if we fix Q = 2, then ⌊Q wi⌋ = 0 for wi < 0.5, while ⌊Q wi⌋ ≥ 1 for wi ≥ 0.5. In contrast, the smaller we set Q, the smaller the inverted index will be, because the floor function sets more entries of the posting lists to zero. Hence, we have to find a good trade-off between the effectiveness of the retrieval system and its space occupation.
For example, if we set Q = 30 and we have a feature vector with just three components w = (0.01, 0.15, 0.09), the corresponding integer representation of the vector will be (0, 4, 2) and its text encoding will be: doc(w) = "f2 f2 f2 f2 f3 f3".
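As an illustration of this encoding step, the following minimal Java sketch (our own, not taken from the paper; the prefix 'f' and the 1-based indexing match the example above) builds the text document for a feature vector:

```java
/** Builds the LuQ text encoding of an L2-normalized feature vector w. */
static String encode(float[] w, int q) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < w.length; i++) {
        int reps = (int) Math.floor(q * w[i]); // quantized term frequency, floor(Q * wi)
        for (int j = 0; j < reps; j++) {
            sb.append('f').append(i + 1).append(' '); // term tau_i, 1-based as in the example
        }
    }
    return sb.toString().trim();
}
```

With this sketch, encode(new float[]{0.01f, 0.15f, 0.09f}, 30) returns "f2 f2 f2 f2 f3 f3", matching the example.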
Since on average 25% of the DCNN feature components are non-zero (in our specific case, the fc6 layer), the corresponding text encoding contains only a small fraction of the unique terms present in the whole dictionary (composed of 4,096 terms). In our case, a document contains on average about 275 unique terms, i.e., about 6.7% of the dictionary, because the quantization sets to zero the feature components smaller than 1/Q.
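A possible way to feed such encoded documents to Lucene 5.5 is sketched below; the field names ("id", "feature") and the choice to store term vectors (useful to reconstruct the quantized vectors later) are our assumptions, not details given in the paper:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

/** Indexes one image feature as a fake-word text document. */
static void indexFeature(IndexWriter writer, String imageId, float[] w) throws IOException {
    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setStoreTermVectors(true); // allows rebuilding the quantized vector at query time
    Document doc = new Document();
    doc.add(new StringField("id", imageId, Field.Store.YES));
    doc.add(new Field("feature", encode(w, 30), ft)); // encode() as sketched above, Q = 30
    writer.addDocument(doc);
}
```

A plain WhitespaceAnalyzer is sufficient as the IndexWriter analyzer, since the fake-word documents are already space-separated tokens.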
3.2 Query Reduction
When we process a similarity search, the query itself is the text encoding of the query feature vector, so the search engine has to handle queries of that size. These unusually long queries can affect the response time if the inverted index contains millions of items.

² By abuse of notation, we denote the space-separated concatenation of keywords with the union operator ∪.
A quite intuitive way to overcome this issue is to reduce the size of the query by exploiting the tf*idf (i.e., term frequency * inverse document frequency) statistics of the text encoding, which come for free in standard full-text retrieval engines: we retain the query elements that exhibit the greatest tf*idf values and eliminate the others. For instance, for a query of about 275 unique terms on average, by taking only the ten terms with the highest tf*idf we obtain a query time reduction of about 96%.
This query reduction comes, however, at a price: it decreases the precision of the results. To attenuate this problem, for a top-k query, we reorder the results using the cosine similarity between the original query (i.e., the one without reduction) and the first Cr × k candidate documents retrieved, where Cr is an amplification factor that we refer to as the reordering factor. For instance, if we have to return k = 100 images and we set Cr = 10, we take and reorder the first 10 × 100 = 1000 candidate documents retrieved by the reduced query.
In order to compute the cosine similarity between the original query and the Cr × k candidates, we reconstruct the quantized features by accessing the posting lists of the documents returned by the search engine. As we will see, this approach does not significantly affect the efficiency of the query but can offer great improvements in terms of effectiveness.
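To make the two query-time steps concrete, here is a hedged Java sketch using plain sparse maps (term → quantized frequency); the method names and data layout are our own illustration, as the paper does not publish code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Keeps only the maxTerms query terms with the highest tf*idf weight. */
static Map<String, Integer> reduceQuery(Map<String, Integer> queryTf,
                                        Map<String, Integer> docFreq,
                                        int numDocs, int maxTerms) {
    List<Map.Entry<String, Integer>> terms = new ArrayList<>(queryTf.entrySet());
    // Sort by descending tf*idf: tf is the quantized component, idf comes from the index.
    terms.sort(Comparator.comparingDouble((Map.Entry<String, Integer> e) ->
            e.getValue() * Math.log((double) numDocs / docFreq.get(e.getKey()))).reversed());
    Map<String, Integer> reduced = new HashMap<>();
    for (Map.Entry<String, Integer> e : terms.subList(0, Math.min(maxTerms, terms.size()))) {
        reduced.put(e.getKey(), e.getValue());
    }
    return reduced;
}

/** Cosine similarity between two sparse quantized vectors. */
static double cosine(Map<String, Integer> x, Map<String, Integer> y) {
    double dot = 0, nx = 0, ny = 0;
    for (Map.Entry<String, Integer> e : x.entrySet()) {
        Integer v = y.get(e.getKey());
        if (v != null) dot += e.getValue() * (double) v; // only shared terms contribute
        nx += e.getValue() * (double) e.getValue();
    }
    for (int v : y.values()) ny += v * (double) v;
    return dot / (Math.sqrt(nx) * Math.sqrt(ny) + 1e-12); // guard against empty vectors
}
```

The reordering step then simply sorts the first Cr × k documents returned by the reduced query according to cosine(originalQuery, candidate) and keeps the top k.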
Figure 1 summarizes the indexing and searching phases of LuQ.
4 Experiments
4.1 Setup
In order to test the efficiency of LuQ, we used the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset³. This dataset was created in 2014 as part of the Yahoo Webscope program. YFCC100M consists of 99.2 million photos and 0.8 million videos uploaded to Flickr between 2004 and 2014 and published under a Creative Commons commercial or noncommercial license. More information about the dataset can be found in the Communications of the ACM article [22].
For extracting deep features we used the Caffe [14] deep learning framework. In particular, we used the Hybrid-DCNN neural network, whose model and weights are publicly available in the Caffe Model Zoo⁴. The Hybrid-DCNN was trained on 1,183 categories (205 scene categories from the Places Database and 978 object categories from the training data of ILSVRC2012 (ImageNet), with 3.6 million images [23]). The architecture is the same as the Caffe reference network. The deep features we extracted are the activations of the fc6 layer. We have made them publicly available at http://www.deepfeatures.org and they will soon be included in the Multimedia Commons initiative corpus.
A ground-truth is not yet available for the YFCC100M dataset. Therefore, the effectiveness of the proposed approach was evaluated using the INRIA Holidays dataset [13, 11].
³ http://bit.ly/yfcc100md
⁴ http://github.com/BVLC/caffe/wiki/Model-Zoo
Fig. 1. Diagram showing the indexing and searching phases of LuQ.
The Holidays dataset is a collection of 1,491 holiday images; the authors selected 500 queries and, for each of them, a list of positive results. As in [10, 12, 11], to evaluate the approach on a large scale, we merged the Holidays dataset with the
Flickr1M collection⁵. We extracted DCNN features from these datasets as well, in order to test our technique.
All experiments were conducted on an Intel Core i7 CPU at 2.67 GHz, with 12 GB of RAM and a 2 TB 7200 RPM HD for the Lucene index. We used Lucene v5.5 running on 64-bit Java 8. The quality of the retrieved images is typically evaluated by means of precision and recall measures. As in many other papers [10, 18, 11], we combined this information by means of the mean Average Precision (mAP), which represents the area below the precision-recall curve.
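For completeness, the standard computation of average precision for one query, of which mAP is the mean over all queries, can be sketched as follows (our own illustration):

```java
import java.util.List;
import java.util.Set;

/** Average precision of a ranked result list, given the set of relevant ids. */
static double averagePrecision(List<String> ranked, Set<String> relevant) {
    double sum = 0;
    int hits = 0;
    for (int i = 0; i < ranked.size(); i++) {
        if (relevant.contains(ranked.get(i))) {
            hits++;
            sum += (double) hits / (i + 1); // precision at each relevant rank
        }
    }
    return relevant.isEmpty() ? 0 : sum / relevant.size();
}
```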
4.2 Effectiveness and Quantization factor Q
In a first experimental analysis, we have evaluated the optimal value of Qover the
Flickr1M dataset. As explained above, by keeping the value Qto the minimum,
we can reduce the space occupation of the inverted index. Figure 2 shows the
mAP as function of Qand the corresponding space occupation of the inverted
index. From this analysis, we conclude that an optimal choice of the quantization
factor Qis 30, which leads to a mAP of 0.62 and a space occupation of 2.31GB.
We stress that the mAP using the brute-force approach (on the exact DCNN
feature vectors) is about 0.60. This means that our quantization error leads to
a slight improvement of the precision for Q30. Another important aspect is
that this effectiveness was obtained forcing Lucene to use the standard inner
product on tf weight without idf and any other document normalization. A
further improvement can be obtained using the similarity function of Lucene
called LMDirichlet, which provided a mAP of about 0.64.
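In Lucene 5.5, one way to force such a pure-tf inner product is to override the classic tf*idf similarity so that idf and length normalization become constant; a minimal sketch under that assumption follows (the paper does not show its exact implementation, and method signatures vary across Lucene versions):

```java
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.ClassicSimilarity;

/** tf-only scoring: raw term frequency, no idf, no document length normalization. */
public class InnerProductSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        return freq; // raw frequency instead of the default sqrt(freq)
    }
    @Override
    public float idf(long docFreq, long numDocs) {
        return 1f; // neutralize inverse document frequency
    }
    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1f; // neutralize document length normalization
    }
}
```

The similarity must be set both on the IndexWriterConfig at indexing time (norms are baked into the index) and on the IndexSearcher at query time; the LMDirichlet variant mentioned above corresponds to Lucene's LMDirichletSimilarity.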
Figure 3 shows the document frequency distribution, in the Flickr1M dataset, of the terms τi (i.e., of the components wi), sorted in decreasing order. As can be seen, the distribution is quite skewed, and some terms are much more frequent than others, ranging from 313 to 378,876 occurrences in a collection of about 1 million features. This aspect has some impact on the performance of the inverted index, since the posting list of the most frequent term has 378,876 items, as that term appears in many image features.
The observation about document frequency leads to the idea of using tf*idf to reduce the query length by cutting off the terms with the lowest tf*idf weight. Since in inverted files the query time is usually proportional to the length of the query [3], this approach yields a great improvement in query response time.
Figure 4 shows the mAP values at different levels of the reordering factor Cr and query length Lq. Note that Cr = 0 means no reordering, Cr = 1 means reordering of the first k candidates, Cr = 2 reordering of the first 2k candidates, and so on. Concerning Lq, we considered a range of values between 2 and 50; since the average document length is about 275, this corresponds to an average length reduction from 0.3% to 80%. In the graph of Figure 4, we have also plotted the mAP level obtained with the query without reduction (namely fullQ) and the mAP obtained with a sequential scan using L2 (brute-force). In all experiments we have
⁵ http://press.liacs.nl/mirflickr/
Fig. 2. Effectiveness (mAP) vs space occupation for increasing values of the quantization factor Q.
set k = 100. As these experiments show, the configuration Cr = 10 and Lq = 8 exhibits a mAP comparable to that of the brute-force approach using L2.
4.3 Evaluation of the Efficiency
In order to evaluate the efficiency of LuQ, Figure 5 plots the average search time for the same queries as in the previous experiments (note that the y-axis is in logarithmic scale). When there is no reordering (i.e., Cr = 0), the length of the query has a great impact on the search time (more than one order of magnitude). For increasing values of the reordering factor Cr, Lq has less and less influence on the search time. Clearly, increasing Cr increases the cost of the search. However, as we can see, there is still a big improvement in efficiency even when the reordering factor is at its maximum, i.e., Cr = 10.
In order to further validate LuQ, we tested our index on the much larger YFCC100M dataset. We used the DCNN features of the fc6 layer and indexed them with Lucene using a quantization factor Q = 30. Since a ground-truth is not available for this dataset, we only report the performance of the queries in terms of average search time (see Figure 6). From this experiment, we see that, for instance, for the configuration Lq = 10 and Cr = 10, we have an average query time of less than 4 seconds (without any parallelization), which is quite encouraging considering that, for the same dataset, the query time was of the order of 10 minutes using the brute-force approach. In this case, we observe that Cr does not practically affect the efficiency of LuQ.
Fig. 3. Distribution of the vector components of the DCNN features (Q = 30).
5 Conclusions and Future Work
In this work, we propose LuQ, an efficient approach to building a CBIR system on top of a text search engine, specifically developed for Deep Convolutional Neural Network features. This approach is very straightforward and does not demand costly elaborations during the indexing process, unlike, for instance, permutation-based approaches such as [1], which require ordering a set of predefined reference features for each feature to be indexed. Moreover, in our approach, we can tune the query cost versus the quality of the approximation by specifying the length of the query, without the need to maintain the original features for reordering the result set.
We evaluated different implementation strategies to balance index occupation, effectiveness, and query response time. In order to show that our approach is able to handle large datasets, we have developed a browser-based application that provides an interface for combined textual and visual searching on a dataset of about 100 million images, available at http://melisandre.deepfeatures.org. The whole Lucene 5.5 archive of the LuQ approach is also available for download from the same location. This index can be directly queried by simply extracting the term vectors from the archive.
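Assuming term vectors are stored in the distributed archive (consistent with the indexing sketch in Section 3.1, whose field name "feature" is our own choice), a quantized feature vector could be rebuilt from the Lucene index roughly as follows:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/** Rebuilds the (term -> quantized frequency) map of one indexed feature document. */
static Map<String, Integer> readQuantizedVector(IndexReader reader, int docId)
        throws IOException {
    Map<String, Integer> vec = new HashMap<>();
    Terms terms = reader.getTermVector(docId, "feature"); // requires stored term vectors
    TermsEnum it = terms.iterator();
    BytesRef term;
    while ((term = it.next()) != null) {
        PostingsEnum pe = it.postings(null);
        pe.nextDoc(); // a term vector holds exactly one (pseudo) document
        vec.put(term.utf8ToString(), pe.freq());
    }
    return vec;
}
```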
Acknowledgments
This work was partially funded by: EAGLE, Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Grant Agreement n. 325122; and Smart News, Social sensing for breaking news, co-funded by the Tuscany Region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.

Fig. 4. Effectiveness (mAP) for various query lengths Lq and reordering factors Cr (with k = 100; Cr = 0 means no reordering), compared with the query without reduction (fullQ) and the mAP obtained with a sequential scan using L2.
References
1. Giuseppe Amato, Claudio Gennaro, and Pasquale Savino. MI-File: using inverted files for scalable approximate similarity search. Multimedia Tools and Applications, 71(3):1333–1362, 2014.
2. Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. Neural codes for image retrieval. In Computer Vision–ECCV 2014, pages 584–599. Springer, 2014.
3. Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information retrieval: Implementing and evaluating search engines. MIT Press, 2010.
4. Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
5. G. E. Chavez, K. Figueroa, and G. Navarro. Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1647–1658, Sept. 2008.
6. Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
7. ZongYuan Ge, Chris McCool, Conrad Sanderson, and Peter Corke. Modelling local deep convolutional neural network features to improve fine-grained image classification. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 4112–4116. IEEE, 2015.

Fig. 5. Average search time (sec) for various query lengths Lq and reordering factors Cr (with k = 100; Cr = 0 means no reordering), compared with the query without reduction (fullQ).
8. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
9. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
10. H. Jégou, M. Douze, and C. Schmid. Packing bag-of-features. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2357–2364, 2009.
11. H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, Sept. 2012.
12. Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision, 87:316–336, May 2010.
13. Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3304–3311, June 2010.
14. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
15. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
Fig. 6. Average search time (sec) for the YFCC100M dataset using Lucene.
16. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
17. Ruoyu Liu, Yao Zhao, Shikui Wei, Zhenfeng Zhu, Lixin Liao, and Shuang Qiu. Indexing of CNN features for large scale image search. arXiv preprint arXiv:1508.00217, 2015.
18. F. Perronnin, Yan Liu, J. Sanchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3384–3391, June 2010.
19. Ali Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
20. Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2014.
21. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
22. Bart Thomee, Benjamin Elizalde, David A. Shamma, Karl Ni, Gerald Friedland, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
23. Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.