Eicient Indexing of Regional Maximum Activations of
Convolutions using Full-Text Search Engines
Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro
ISTI-CNR, via G. Moruzzi 1, 56124, Pisa, Italy
{name.surname}@isti.cnr.it
ABSTRACT
In this paper, we adapt a surrogate text representation technique to develop efficient instance-level image retrieval using Regional Maximum Activations of Convolutions (R-MAC). R-MAC features have recently shown outstanding performance in visual instance retrieval. However, contrary to the activations of hidden layers adopting ReLU (Rectified Linear Unit), these features are dense. This constitutes an obstacle to the direct use of inverted indexes, which rely on sparsity of data. We propose the use of deep permutations, a recent approach for efficient evaluation of permutations, to generate a surrogate text representation of R-MAC features, enabling indexing of visual features as text into a standard search engine. The experiments, conducted on Lucene, show the effectiveness and efficiency of the proposed approach.
KEYWORDS
Similarity Search, Permutation-Based Indexing, Deep Convolutional Neural Network
ACM Reference format:
Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro. 2017. Efficient Indexing of Regional Maximum Activations of Convolutions using Full-Text Search Engines. In Proceedings of ICMR '17, June 6–9, 2017, Bucharest, Romania, 4 pages.
DOI: http://dx.doi.org/10.1145/3078971.3079035
1 INTRODUCTION
Deep learning has rapidly become the state-of-the-art approach for many computer vision tasks, such as classification [21], content-based image retrieval [3, 12], cross-media retrieval [10], and smart cameras [1]. A practical and convenient way of using Deep Convolutional Neural Networks (DCNNs) to support fast content-based image retrieval is to treat the neuron activations of the hidden layers as global features [12, 22, 27]. Recently, Tolias et al. [31] have gone further, proposing to use Regional Maximum Activation of Convolutions (R-MAC) as a compact and effective image representation for instance-level retrieval. This feature is the result of the spatial aggregation of the activations of a convolutional layer of a DCNN, and it is therefore robust to scale and translation. Gordo et al. [17] extended the R-MAC representation by improving the region
pooling mechanism and including it in a differentiable pipeline trained end-to-end for retrieval tasks.
However, these features are typically of high dimensionality, which prevents the use of space-partitioning data structures such as the kd-tree [9]. For instance, in the well-known AlexNet architecture [21] the output of the sixth layer (fc6) has 4,096 dimensions, while the R-MAC extracted by Gordo et al. [17] produces a 2048-dimensional image descriptor.
To overcome this problem, various partitioning methods have been proposed. In particular, the inverted multi-index uses product quantization both to define the coarse level and to code the residual vectors [6, 25]. This approach, combined with binary compression techniques, outperforms the state of the art by a large margin [13]. Our approach, as we will see, is implemented on top of an existing text retrieval engine and requires minimal pre-processing.
Mohedano et al. [23] propose a sparse visual descriptor based on a Bag of Local Convolutional Features (BLCF), which allows fast image retrieval by means of an inverted index. This method, however, relies on a priori learning of a codebook (which involves the use of k-means) and on the VGG16 [30] pre-trained network, which makes it difficult to retrain the whole pipeline on a new set of images.
Our approach to tackle the dimensionality curse problem is still based on the application of approximate access methods, but relies on a permutation approach similar to [5, 11, 14, 24]. The key idea is to represent metric objects (i.e., features) as sequences (permutations) of reference objects, chosen from a predefined set of objects. Similarity queries are executed by searching for data objects whose permutation representations are similar to the query permutation representation. Each permutation is generated by sorting the entire set of reference objects according to their distances from the object to be represented. The total number of reference objects to be used for building permutations depends on the size of the dataset to be indexed and can amount to tens of thousands [5]. In these cases, both indexing time and searching time are affected by the cost of generating permutations for objects being inserted or for the queries.
In this paper, we propose an adaptation of the surrogate text representation [15] suitable for Regional Maximum Activations of Convolutions features, which exploits the so-called deep permutation approach introduced in [4]. An advantage of this approach lies in its low computational cost, since it does not require the distance calculation between the reference objects and the objects to be represented.
The rest of the paper is organized as follows. Section 2 provides background for the reader. In Section 3, we introduce our approach to generate permutations for R-MAC features. Section 4 presents some experimental results using real-life datasets. Section 5 concludes the paper.
2 BACKGROUND
2.1 Permutation-Based Approach
Given a domain $\mathcal{D}$, a distance function $d : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$, and a fixed set of reference objects $P = \{p_1, \dots, p_n\} \subset \mathcal{D}$ that we call pivots or reference objects, we define a permutation-based representation $\Pi_o$ (briefly, permutation) of an object $o \in \mathcal{D}$ as the sequence of pivot identifiers sorted in ascending order by their distance from $o$ [5, 11, 14, 24].
Formally, the permutation-based representation $\Pi_o = (\Pi_o(1), \dots, \Pi_o(n))$ lists the pivot identifiers in an order such that $\forall j \in \{1, \dots, n-1\},\ d(o, p_{\Pi_o(j)}) \le d(o, p_{\Pi_o(j+1)})$, where $p_{\Pi_o(j)}$ indicates the pivot at position $j$ in the permutation associated with object $o$. If we denote as $\Pi_o^{-1}(i)$ the position of a pivot $p_i$ in the permutation of an object $o \in \mathcal{D}$, so that $\Pi_o(\Pi_o^{-1}(i)) = i$, we obtain the equivalent inverted representation of permutations $\Pi_o^{-1} = (\Pi_o^{-1}(1), \dots, \Pi_o^{-1}(n))$. In $\Pi_o$ the value in each position of the sequence is the identifier of the pivot in that position. In the inverted representation $\Pi_o^{-1}$, each position corresponds to a pivot and the value in each position corresponds to the rank of the corresponding pivot. The inverted representation of permutations $\Pi_o^{-1}$ is a vector that we refer to as the vector of permutations, and which allows us to easily define most of the distance functions between permutations.
Permutations are generally compared using the Spearman Rho, Kendall Tau, or Spearman Footrule distances. In our implementation, we apply a simple algebraic transformation to permutations in order to compute, using dot products, the same ranking scores that we would obtain with the Spearman Rho distance. Due to lack of space, we do not present the proof of this result; further details can be found in [15].
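As an illustration only (not code from the paper), the following Python sketch builds the inverted permutation representation of an object from its distances to the pivots, and shows the ranking behavior mentioned above: for full permutations, maximizing the dot product of rank complements orders the results the same way as minimizing the Spearman Rho distance. Names such as `permutation_vector` and `rank_complement` are introduced here for the example.

```python
import numpy as np

def permutation_vector(obj, pivots, dist):
    """Inverted permutation representation: position i holds the rank of pivot p_i."""
    d = np.array([dist(obj, p) for p in pivots])
    order = np.argsort(d)                          # pivot indexes by increasing distance (Pi_o)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(pivots) + 1)   # Pi_o^{-1}: rank of each pivot
    return ranks

def spearman_rho(r1, r2):
    return np.sqrt(np.sum((r1 - r2) ** 2))

def rank_complement(ranks):
    n = len(ranks)
    return (n + 1) - ranks                         # high values for pivots close to the object

# toy example: 2-D Euclidean space, 4 pivots, 5 database objects
rng = np.random.default_rng(0)
pivots = rng.random((4, 2))
euclid = lambda a, b: np.linalg.norm(a - b)
q = rng.random(2)
db = rng.random((5, 2))

rq = permutation_vector(q, pivots, euclid)
for x in db:
    rx = permutation_vector(x, pivots, euclid)
    # For full permutations, the two criteria induce the same ranking of the results.
    print(spearman_rho(rq, rx), np.dot(rank_complement(rq), rank_complement(rx)))
```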
2.2 Deep Features
Recently, a new class of image descriptors, built upon Deep Convolutional Neural Networks, has been used as an effective alternative to descriptors built using local features such as SIFT, SURF, ORB, BRIEF, etc. DCNNs have attracted enormous interest within the Computer Vision community because of the state-of-the-art results [21] achieved in challenging image classification competitions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In computer vision, DCNNs have been used to perform several tasks, including image classification, as well as image retrieval [8, 12] and object detection [16], to cite some. In particular, it has been proved that the multiple levels of representation which are learned by a DCNN on a specific task (typically supervised) can be used to transfer learning across tasks [12, 27]. The activations of neurons of specific layers, in particular the last ones, can be used as features for describing the visual content.
2.3 R-MAC Features
Recently, image descriptors built upon activations of convolutional layers have shown brilliant results in image instance retrieval [7, 28, 31]. Tolias et al. [31] proposed the R-MAC feature representation, which encodes and aggregates several regions of the image in a dense and compact global image representation. To compute an R-MAC feature, an input image is fed to a fully convolutional network pre-trained on ImageNet [29]. The output of the last convolutional layer is max-pooled over different spatial regions at different positions and scales, obtaining a feature vector for each region. These vectors are then ℓ2-normalized, PCA-whitened, ℓ2-normalized again, and finally aggregated by summing them together and ℓ2-normalizing the final result. The obtained representation is an effective aggregation of local convolutional features at multiple positions and scales that can be compared with the cosine similarity function.
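For concreteness, here is a schematic numpy sketch of the aggregation step just described; it is not the authors' implementation. It assumes the convolutional feature map and the list of pooling regions are already available, and `pca_mean`/`pca_matrix` stand for whitening parameters learned offline.

```python
import numpy as np

def l2n(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rmac(feature_map, regions, pca_mean, pca_matrix):
    """Schematic R-MAC aggregation.
    feature_map: (C, H, W) activations of the last convolutional layer.
    regions: list of (x, y, w, h) boxes on the H x W grid (fixed multi-scale grid).
    pca_mean, pca_matrix: whitening parameters learned offline (assumed given)."""
    region_vecs = []
    for (x, y, w, h) in regions:
        patch = feature_map[:, y:y + h, x:x + w]
        region_vecs.append(patch.max(axis=(1, 2)))         # max-pool each region -> (C,)
    v = np.stack(region_vecs)                              # (R, C)
    v = l2n(v)                                             # l2-normalize per region
    v = l2n((v - pca_mean) @ pca_matrix.T)                 # PCA-whiten, l2-normalize again
    return l2n(v.sum(axis=0))                              # sum-aggregate and l2-normalize

# toy usage with random data (C=512 channels, 2 regions, PCA to 256 dims);
# the cosine similarity between two descriptors is then just a dot product.
fm = np.random.rand(512, 20, 30)
regs = [(0, 0, 20, 20), (10, 0, 20, 20)]
pca_mean, pca_mat = np.zeros(512), np.random.randn(256, 512)
desc = rmac(fm, regs, pca_mean, pca_mat)                   # -> (256,) unit-norm descriptor
```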
Gordo et al. [17] built on the work of Tolias et al. [31] and inserted the R-MAC feature extractor in an end-to-end differentiable pipeline in order to learn a representation optimized for visual instance retrieval through back-propagation. The whole pipeline is composed of a fully convolutional neural network, a region proposal network, the R-MAC extractor and PCA-like dimensionality reduction layers, and it is trained using a ranking loss based on image triplets. The obtained pipeline can extract an optimized R-MAC feature representation that outperforms methods based on costly local features and spatial geometry verification.
An additional performance boost is obtained using state-of-the-art deep convolutional architectures, such as very deep residual networks [18], and aggregating descriptors extracted at multiple resolutions. A multi-resolution R-MAC descriptor is obtained by feeding the network with images at different resolutions and then aggregating the obtained representations by summing them together and performing a final ℓ2-normalization.
In our work, we used the ResNet-101 trained model provided by [17] as an R-MAC feature extractor, which has been shown to achieve the best performance on standard benchmarks. We extracted the R-MAC features using fixed regions at two different scales as proposed in [31], instead of using the region proposal network. Defining S as the size in pixels of the minimum side of an image, we extracted both single-resolution descriptors (with S = 800) and multi-resolution ones (aggregating descriptors with S = 550, 800, and 1050).
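The multi-resolution aggregation can be sketched as follows; `rmac_descriptor` is a hypothetical callable standing for whatever network produces a single-resolution, ℓ2-normalized R-MAC vector for a given minimum image side (the resizing policy is assumed, for illustration only).

```python
import numpy as np

def multires_rmac(image, rmac_descriptor, sizes=(550, 800, 1050)):
    """Aggregate single-resolution R-MAC descriptors extracted at several scales.
    `rmac_descriptor(image, min_side)` is assumed to resize the image so that its
    minimum side equals `min_side` and to return an l2-normalized descriptor."""
    descs = np.stack([rmac_descriptor(image, s) for s in sizes])
    agg = descs.sum(axis=0)                   # sum the per-resolution descriptors
    return agg / np.linalg.norm(agg)          # final l2-normalization
```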
3 SURROGATE TEXT REPRESENTATION FOR
DEEP FEATURES
As introduced above, the basic idea of permutation-based indexing techniques is to represent feature objects as permutations built using a set of reference object identifiers as permutants.
Using the permutation-based representation, the similarity between two objects is estimated by computing the similarity between the two corresponding permutations, rather than using the original distance function. The rationale behind this is that, when permutations are built using this strategy, objects that are very close to each other have similar permutation representations as well. In other words, if two objects are very close to each other, they will sort the set of reference objects in a very similar way.
Notice, however, that the relevant aspect when building permutations is the capability of generating sequences of identifiers (permutations) in such a way that similar objects have similar permutations. Sorting a set of reference objects according to their distance from the object to be represented is just one, yet effective, approach.
When the objects to be indexed are vectors, as in our case of deep features, we can use the approach presented in [4], which allows
Eicient Indexing of R-MAC using Full-Text Search Engines ICMR ’17, , June 6–9, 2017, Bucharest, Romania
us to generate sequences of identifiers not associated with reference objects. The basic idea is as follows. Permutants are the indexes of the elements of the deep feature vectors. Given a deep feature vector, the corresponding permutation is obtained by sorting the indexes of the elements of the vector in descending order with respect to the values of the corresponding elements. Suppose for instance the feature vector is $fv = [0.1, 0.3, 0.4, 0, 0.2]$¹. The permutation-based representation of $fv$ is $\Pi_{fv} = (3, 2, 5, 1, 4)$, that is, permutant (index) 3 is in position 1, permutant 2 is in position 2, permutant 5 is in position 3, etc. The permutation vector, introduced in Section 2, is $\Pi_{fv}^{-1} = (4, 2, 1, 5, 3)$, that is, permutant (index) 1 is in position 4, permutant 2 is in position 2, permutant 3 is in position 1, etc.
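The worked example above can be reproduced with a few lines of numpy (an illustration, not the paper's code; the name `deep_permutation` is introduced here):

```python
import numpy as np

def deep_permutation(fv):
    """Deep permutation of a feature vector: indexes sorted by decreasing activation.
    Returns (Pi, Pi_inv) using 1-based permutant identifiers, as in the text."""
    fv = np.asarray(fv)
    pi = np.argsort(-fv, kind="stable") + 1        # Pi: permutant ids ordered by value
    pi_inv = np.empty_like(pi)
    pi_inv[pi - 1] = np.arange(1, len(fv) + 1)     # Pi^{-1}: rank of each permutant
    return pi, pi_inv

fv = [0.1, 0.3, 0.4, 0.0, 0.2]
pi, pi_inv = deep_permutation(fv)
print(pi)       # [3 2 5 1 4]
print(pi_inv)   # [4 2 1 5 3]
```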
The intuition behind this is that features in the high levels of the neural network carry some sort of high-level visual information. We can imagine that individual dimensions of the deep feature vectors represent some sort of visual concept, and that the value of each dimension specifies the importance of that visual concept in the image. Similar deep feature vectors sort the visual concepts (the dimensions) in the same way, according to the activation values.
Without entering into the technical details of this approach (for that, we refer the reader to [4]), let us just stress that, although the vectors of permutations have the same dimensionality as the DCNN vectors, the advantage is that they can be easily encoded into an inverted index. Moreover, following the intuition that the most relevant information of the permutation is in the very first positions, we can truncate the vector of permutations to the top-$K$ (i.e., permutations truncated at $K$). The elements of the vectors beyond $K$ can be ignored; this approach allows us to modulate the size of vectors and reduce the size of the index by introducing more sparsity.
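A minimal sketch of the top-K truncation, under the same conventions as the previous snippet (1-based permutant identifiers); the dictionary output is just one possible sparse encoding.

```python
import numpy as np

def truncated_permutation(fv, k):
    """Keep only the k permutants with the highest activations; the rest are dropped.
    Returns a dict {permutant_id: rank} with ranks in 1..k (a sparse representation)."""
    fv = np.asarray(fv)
    top = np.argsort(-fv, kind="stable")[:k] + 1    # top-k permutant ids
    return {int(p): r for r, p in enumerate(top, start=1)}

fv = [0.1, 0.3, 0.4, 0.0, 0.2]
print(truncated_permutation(fv, 3))   # {3: 1, 2: 2, 5: 3} -> only 3 of 5 entries kept
```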
In order to index the permutation vectors with a text retrieval engine such as Lucene, we use the surrogate text representation introduced in [15], which simply consists in assigning a codeword to each item of the permutation vector $\Pi^{-1}$ and repeating each codeword a number of times equal to the complement of the rank of the corresponding item within the permutation. For instance, let $\tau_i$ be the codeword corresponding to the $i$-th component of the permutation vector; for the vector $\Pi_{fv}^{-1} = (4, 2, 1, 5, 3)$, we generate the following surrogate text: "τ1 τ1 τ2 τ2 τ2 τ2 τ3 τ3 τ3 τ3 τ3 τ4 τ5 τ5 τ5".
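The surrogate text of the example can be generated as in the following sketch (illustrative only; `surrogate_text` and the `tau{i}` codeword spelling are ours). Indexed in a whitespace-analyzed text field, the term frequency of each codeword then equals the rank complement, which is what allows a dot-product-style text similarity to reproduce the ranking behavior discussed in Section 2.1.

```python
def surrogate_text(pi_inv):
    """Surrogate text for an inverted permutation vector (1-based ranks).
    Codeword tau_i is repeated (n + 1 - rank_i) times, so that the term frequency
    seen by the search engine encodes the rank complement."""
    n = len(pi_inv)
    terms = []
    for i, rank in enumerate(pi_inv, start=1):
        terms.extend([f"tau{i}"] * (n + 1 - rank))
    return " ".join(terms)

print(surrogate_text([4, 2, 1, 5, 3]))
# tau1 tau1 tau2 tau2 tau2 tau2 tau3 tau3 tau3 tau3 tau3 tau4 tau5 tau5 tau5
```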
4 EXPERIMENTAL EVALUATION
The assessment of the proposed algorithm in a multimedia information retrieval task was performed using the R-MAC features, extracted as explained above, from the INRIA Holidays [19] and Oxford Buildings [26] datasets.
INRIA Holidays [19] is a collection of 1,491 images, which mainly contains personal holiday photos. The images are of high resolution and represent a large variety of scene types (natural, man-made, water, fire effects, etc.). The authors selected 500 queries and manually identified a list of qualified results for each of them. As in [20], we merged the Holidays dataset with the distractor dataset MIRFlickr, including 1M images².
Oxford Buildings [26] is composed of 5,062 images of 11 Oxford landmarks downloaded from Flickr. A manually labelled ground truth is available for 5 queries for each landmark, for a total of 55 queries. As for INRIA Holidays, we merged the dataset with the distractor dataset Flickr100k, including 100k images³.
¹ In reality, the number of dimensions is 2,048 or more.
² http://press.liacs.nl/mirflickr/
[Figure 1: two panels plotting mAP vs. sparsity, one for Holidays + MIRFlickr1M and one for Oxford + Flickr100k; each panel shows the Permutation and Permutation (multi-res) curves together with horizontal lines for the Baseline and Baseline (multi-res) mAP levels.]
Figure 1: mAP vs sparsity for increasing values of K (reported near each point). Note that the sparsity of the dataset is given by K/2048. The horizontal lines represent the levels of mAP for the baselines, i.e., using the original R-MAC vectors.
We generated different sets of permutations from the original features with different values of K (i.e., we consider the permutations truncated at K), and for each K, we measured the mAP obtained and the query execution times. Results on both datasets are reported in Figure 1. The experiments show the mAP of our Lucene implementation as a function of the sparsity of the database introduced by the truncation at K. The greater K (indicated near each point in the graphs), the lower the sparsity, and hence the greater the mAP. The levels of mAP for the baselines, i.e., using the original R-MAC vectors, are also reported in the figure. These levels can be considered as the upper bounds for our approach, since we are dealing with an approximate approach. However, as it is possible to see, for a sparsity level of about 80%, we reach satisfactory levels of effectiveness.
In order to see the impact of the sparsity of the database, in Figure 2 we report the average query time versus the parameter K. Clearly, by increasing K the query time increases. However, for larger database sizes, a strategy of query reduction similar to the one presented in [2] can be used.
3hp://www.robots.ox.ac.uk/vgg/data/oxbuildings/
[Figure 2: plot of Query Time [s] (0–8 s) vs K (log scale, roughly 10² to 2×10³).]
Figure 2: Average query times in seconds on the INRIA Holidays dataset + MIRFlickr1M distractor set for different values of K on Lucene.
5 CONCLUSION
In this paper, we presented an approach for indexing R-MAC features as permutations, built on top of the full-text retrieval engine Lucene, that exploits the surrogate text representation. The advantage is that, compared to the classical permutation-based approach, this technique does not need to compute distances between pivots and data objects, but uses the activation values of the neural network themselves as a source for associating deep feature vectors with permutations.
In this preliminary study, we obtained encouraging results on two important image retrieval benchmarks. In the future, we plan to test our approach on the more challenging Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset, available at http://bit.ly/yfcc100md.
REFERENCES
[1] Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro, Carlo Meghini, and Claudio Vairo. 2017. Deep learning for decentralized parking lot occupancy detection. Expert Systems with Applications 72 (2017), 327–334.
[2] Giuseppe Amato, Franca Debole, Fabrizio Falchi, Claudio Gennaro, and Fausto Rabitti. 2016. Large Scale Indexing and Searching Deep Convolutional Neural Network Features. In Proceedings of the 18th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2016). Springer. to appear.
[3] Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, and Fausto Rabitti. 2016. YFCC100M HybridNet fc6 Deep Features for Content-Based Image Retrieval. In Proceedings of the 2016 ACM Workshop on Multimedia COMMONS. ACM, 11–18.
[4] Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, and Lucia Vadicamo. 2016. Deep Permutations: Deep Convolutional Neural Networks and Permutation-Based Indexing. In International Conference on Similarity Search and Applications. Springer, 93–106.
[5] Giuseppe Amato, Claudio Gennaro, and Pasquale Savino. 2014. MI-File: using inverted files for scalable approximate similarity search. Multimedia Tools and Applications 71, 3 (2014), 1333–1362. DOI: http://dx.doi.org/10.1007/s11042-012-1271-1
[6] Artem Babenko and Victor Lempitsky. 2012. The inverted multi-index. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 3069–3076.
[7] Artem Babenko and Victor Lempitsky. 2015. Aggregating local deep features for image retrieval. In Proceedings of the IEEE International Conference on Computer Vision. 1269–1277.
[8] Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. 2014. Neural codes for image retrieval. In Computer Vision – ECCV 2014. Springer, 584–599.
[9] Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
[10] Fabio Carrara, Andrea Esuli, Tiziano Fagni, Fabrizio Falchi, and Alejandro Moreo Fernández. 2016. Picture it in your mind: Generating high level visual representations from textual descriptions. arXiv preprint arXiv:1606.07287 (2016).
[11] Edgar Chávez, Karina Figueroa, and Gonzalo Navarro. 2008. Effective Proximity Retrieval by Ordering Permutations. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30, 9 (2008), 1647–1658.
[12] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2013. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013).
[13] Matthijs Douze, Hervé Jégou, and Florent Perronnin. 2016. Polysemous Codes. Springer International Publishing, Cham, 785–801.
[14] Andrea Esuli. 2012. Use of permutation prefixes for efficient and scalable approximate similarity search. Information Processing & Management 48, 5 (2012), 889–902.
[15] Claudio Gennaro, Giuseppe Amato, Paolo Bolettieri, and Pasquale Savino. 2010. An Approach to Content-Based Image Retrieval Based on the Lucene Search Engine Library. In Research and Advanced Technology for Digital Libraries, Mounia Lalmas, Joemon Jose, Andreas Rauber, Fabrizio Sebastiani, and Ingo Frommholz (Eds.). Lecture Notes in Computer Science, Vol. 6273. Springer Berlin Heidelberg, 55–66. http://dx.doi.org/10.1007/978-3-642-15464-5_8
[16] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[17] Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. 2016. End-to-end learning of deep visual representations for image retrieval. arXiv preprint arXiv:1610.07940 (2016).
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385 (2015).
[19] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2008. Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search. In Computer Vision – ECCV 2008, David Forsyth, Philip Torr, and Andrew Zisserman (Eds.). Lecture Notes in Computer Science, Vol. 5302. Springer Berlin Heidelberg, 304–317. http://dx.doi.org/10.1007/978-3-540-88682-2_24
[20] H. Jégou, M. Douze, and C. Schmid. 2009. Packing bag-of-features. In Computer Vision, 2009 IEEE 12th International Conference on. 2357–2364. DOI: http://dx.doi.org/10.1109/ICCV.2009.5459419
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[23] Eva Mohedano, Kevin McGuinness, Noel E. O'Connor, Amaia Salvador, Ferran Marques, and Xavier Giro-i-Nieto. 2016. Bags of Local Convolutional Features for Scalable Instance Search. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval (ICMR '16). ACM, New York, NY, USA, 327–331.
[24] David Novak, Martin Kyselak, and Pavel Zezula. 2010. On locality-sensitive indexing in generic metric spaces. In Proceedings of the Third International Conference on Similarity Search and Applications (SISAP '10). ACM, 59–66.
[25] Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. 2010. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31, 11 (2010), 1348–1358.
[26] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. 2007. Object Retrieval with Large Vocabularies and Fast Spatial Matching. In Computer Vision and Pattern Recognition, 2007. CVPR 2007. IEEE Conference on. 1–8.
[27] Ali S. Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on. IEEE, 512–519.
[28] Ali Sharif Razavian, Josephine Sullivan, Stefan Carlsson, and Atsuto Maki. 2014. Visual instance retrieval with deep convolutional networks. arXiv preprint arXiv:1412.6574 (2014).
[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. DOI: http://dx.doi.org/10.1007/s11263-015-0816-y
[30] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[31] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. 2015. Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015).
This paper considers the problem of approximate nearest neighbor search in the compressed domain. We introduce polysemous codes, which offer both the distance estimation quality of product quantization and the efficient comparison of binary codes with Hamming distance. Their design is inspired by algorithms introduced in the 90’s to construct channel-optimized vector quantizers. At search time, this dual interpretation accelerates the search. Most of the indexed vectors are filtered out with Hamming distance, letting only a fraction of the vectors to be ranked with an asymmetric distance estimator. The method is complementary with a coarse partitioning of the feature space such as the inverted multi-index. This is shown by our experiments performed on several public benchmarks such as the BIGANN dataset comprising one billion vectors, for which we report state-of-the-art results for query times below 0.3 millisecond per core. Last but not least, our approach allows the approximate computation of the k-NN graph associated with the Yahoo Flickr Creative Commons 100M, described by CNN image descriptors, in less than 8 h on a single machine.