Efficient Indexing of Regional Maximum Activations of Convolutions using Full-Text Search Engines
Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro
ISTI-CNR, via G. Moruzzi 1, 56124, Pisa, Italy
{name.surname}@isti.cnr.it
ABSTRACT
In this paper, we adapt a surrogate text representation technique to develop efficient instance-level image retrieval using Regional Maximum Activations of Convolutions (R-MAC). R-MAC features have recently shown outstanding performance in visual instance retrieval. However, contrary to the activations of hidden layers adopting ReLU (Rectified Linear Unit), these features are dense. This constitutes an obstacle to the direct use of inverted indexes, which rely on data sparsity. We propose the use of deep permutations, a recent approach for efficient evaluation of permutations, to generate surrogate text representations of R-MAC features, enabling indexing of visual features as text in a standard search engine. The experiments, conducted on Lucene, show the effectiveness and efficiency of the proposed approach.
KEYWORDS
Similarity Search, Permutation-Based Indexing, Deep Convolutional Neural Network
ACM Reference format:
Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro. 2017. Efficient Indexing of Regional Maximum Activations of Convolutions using Full-Text Search Engines. In Proceedings of ICMR '17, June 6–9, 2017, Bucharest, Romania, 4 pages.
DOI: http://dx.doi.org/10.1145/3078971.3079035
1 INTRODUCTION
Deep learning has rapidly become the state-of-the-art approach for many computer vision tasks such as classification [21], content-based image retrieval [3, 12], cross-media retrieval [10], and smart cameras [1]. A practical and convenient way of using Deep Convolutional Neural Networks (DCNNs) to support fast content-based image retrieval is to treat the neuron activations of the hidden layers as global features [12, 22, 27]. Recently, Tolias et al. [31] have gone further, proposing the use of Regional Maximum Activation of Convolutions (R-MAC) as a compact and effective image representation for instance-level retrieval. This feature is the result of the spatial aggregation of the activations of a convolutional layer of a DCNN and is therefore robust to scale and translation. Gordo et al. [17] extended the R-MAC representation by improving the region
pooling mechanism and including it in a differentiable pipeline trained end-to-end for retrieval tasks.
However, these features are typically of high dimensionality, which prevents the use of space-partitioning data structures such as the kd-tree [9]. For instance, in the well-known AlexNet architecture [21] the output of the sixth layer (fc6) has 4,096 dimensions, while the R-MAC extracted by Gordo et al. [17] produces a 2,048-dimensional image descriptor.
To overcome this problem, various partitioning methods have been proposed. In particular, the inverted multi-index uses product quantization both to define the coarse level and to code residual vectors [6, 25]. This approach, combined with binary compression techniques, outperforms the state of the art by a large margin [13]. Our approach, as we will see, is implemented on top of an existing text retrieval engine and requires minimal pre-processing.
Mohedano et al. [23] propose a sparse visual descriptor based on a Bag of Local Convolutional Features (BLCF), which allows fast image retrieval by means of an inverted index. This method, however, relies on a priori learning of a codebook (which involves the use of k-means) and on the VGG16 [30] pre-trained network, which makes it difficult to retrain the whole pipeline on a new set of images.
Our approach to tackling the dimensionality curse problem is still based on the application of approximate access methods, but relies on a permutation approach similar to [5, 11, 14, 24]. The key idea is to represent metric objects (i.e., features) as sequences (permutations) of reference objects, chosen from a predefined set of objects. Similarity queries are executed by searching for data objects whose permutation representations are similar to the query permutation representation. Each permutation is generated by sorting the entire set of reference objects according to their distances from the object to be represented. The total number of reference objects to be used for building permutations depends on the size of the dataset to be indexed and can amount to tens of thousands [5]. In these cases, both indexing time and searching time are affected by the cost of generating permutations for the objects being inserted or for the queries.
In this paper, we propose an adaptation of the surrogate text representation [15] suitable for Regional Maximum Activations of Convolutions features, which exploits the so-called deep permutation approach introduced in [4]. An advantage of this approach lies in its low computational cost, since it does not require distance calculations between the reference objects and the objects to be represented.
The rest of the paper is organized as follows. Section 2 provides background for the reader. In Section 3, we introduce our approach to generate permutations for R-MAC features. Section 4 presents some experimental results using real-life datasets. Section 5 concludes the paper.
2 BACKGROUND
2.1 Permutation-Based Approach
Given a domain $D$, a distance function $d : D \times D \to \mathbb{R}$, and a fixed set of reference objects $P = \{p_1, \dots, p_n\} \subset D$ that we call pivots or reference objects, we define a permutation-based representation $\Pi_o$ (briefly, permutation) of an object $o \in D$ as the sequence of pivot identifiers sorted in ascending order by their distance from $o$ [5, 11, 14, 24].
Formally, the permutation-based representation $\Pi_o = (\Pi_o(1), \dots, \Pi_o(n))$ lists the pivot identifiers in an order such that $\forall j \in \{1, \dots, n-1\},\ d(o, p_{\Pi_o(j)}) \le d(o, p_{\Pi_o(j+1)})$, where $p_{\Pi_o(j)}$ indicates the pivot at position $j$ in the permutation associated with object $o$. If we denote as $\Pi_o^{-1}(i)$ the position of a pivot $p_i$ in the permutation of an object $o \in D$, so that $\Pi_o(\Pi_o^{-1}(i)) = i$, we obtain the equivalent inverted representation of permutations $\Pi_o^{-1} = (\Pi_o^{-1}(1), \dots, \Pi_o^{-1}(n))$. In $\Pi_o$, the value in each position of the sequence is the identifier of the pivot in that position. In the inverted representation $\Pi_o^{-1}$, each position corresponds to a pivot and the value in each position corresponds to the rank of the corresponding pivot. The inverted representation $\Pi_o^{-1}$ is a vector that we refer to as the vector of permutations, and which allows us to easily define most of the distance functions between permutations.
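To make the definitions concrete, here is a minimal Python sketch (ours, not from the paper; the function names `permutation` and `inverted_permutation` are our own) that computes $\Pi_o$ and $\Pi_o^{-1}$ for an object under the Euclidean distance:

```python
import numpy as np

def permutation(o, pivots):
    """Pi_o: pivot identifiers (1-based) sorted by ascending distance from o."""
    dists = np.linalg.norm(pivots - o, axis=1)    # d(o, p_i) for each pivot
    return np.argsort(dists, kind="stable") + 1   # ties broken by pivot id

def inverted_permutation(pi):
    """Pi_o^-1: the position (rank) of each pivot id within Pi_o."""
    inv = np.empty_like(pi)
    inv[pi - 1] = np.arange(1, len(pi) + 1)
    return inv

# Toy example: four pivots in the plane and one object to represent.
pivots = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
o = np.array([0.9, 0.1])
pi = permutation(o, pivots)      # [2, 1, 3, 4]: p2 is closest, then p1, ...
inv = inverted_permutation(pi)   # [2, 1, 3, 4]: p1 ranks 2nd, p2 ranks 1st, ...
```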
Permutations are generally compared using the Spearman rho, Kendall tau, or Spearman footrule distances. In our implementation, we apply a simple algebraic transformation to permutations in order to compute, using dot products, the same ranking scores that we would obtain with the Spearman rho distance. Due to lack of space we do not present the proof of this result; further details can be found in [15].
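The proof is omitted here as in the paper; the following is our reconstruction of the standard identity such a transformation can exploit, not necessarily the authors' exact derivation. The Spearman rho distance between two permutations is the Euclidean distance between their inverted representations:

\[ S_\rho(\Pi_x, \Pi_y) = \sqrt{\sum_{i=1}^{n} \left( \Pi_x^{-1}(i) - \Pi_y^{-1}(i) \right)^2 }. \]

Since $\sum_{i=1}^{n} \Pi^{-1}(i)^2 = \sum_{j=1}^{n} j^2$ is the same constant for every full permutation, expanding the square gives $S_\rho^2 = 2 \sum_{j=1}^{n} j^2 - 2\, \Pi_x^{-1} \cdot \Pi_y^{-1}$, so ranking by increasing $S_\rho$ is equivalent to ranking by decreasing dot product $\Pi_x^{-1} \cdot \Pi_y^{-1}$, which is exactly the kind of score a term-frequency-based text engine computes.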
2.2 Deep Features
Recently, a new class of image descriptors, built upon Deep Convolutional Neural Networks, has been used as an effective alternative to descriptors built using local features such as SIFT, SURF, ORB, BRIEF, etc. DCNNs have attracted enormous interest within the computer vision community because of the state-of-the-art results [21] achieved in challenging image classification competitions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In computer vision, DCNNs have been used to perform several tasks, including image classification, as well as image retrieval [8, 12] and object detection [16], to cite some. In particular, it has been shown that the multiple levels of representation learned by a DCNN on a specific (typically supervised) task can be used to transfer learning across tasks [12, 27]. The activations of neurons of specific layers, in particular the last ones, can be used as features for describing the visual content.
2.3 R-MAC Features
Recently, image descriptors built upon the activations of convolutional layers have shown brilliant results in image instance retrieval [7, 28, 31]. Tolias et al. [31] proposed the R-MAC feature representation, which encodes and aggregates several regions of the image in a dense and compact global image representation. To compute an R-MAC feature, an input image is fed to a fully convolutional network pre-trained on ImageNet [29]. The output of the last convolutional layer is max-pooled over different spatial regions at different positions and scales, obtaining a feature vector for each region. These vectors are then l2-normalized, PCA-whitened, l2-normalized again, and finally aggregated by summing them together and l2-normalizing the final result. The obtained representation is an effective aggregation of local convolutional features at multiple positions and scales that can be compared with the cosine similarity function.
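As an illustration of the aggregation pipeline just described, a minimal numpy sketch (ours, not the authors' implementation; `regions`, `pca_mean`, and `pca_proj` are assumed to be given):

```python
import numpy as np

def l2n(x, axis=-1):
    """l2-normalize along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def rmac(conv_maps, regions, pca_mean, pca_proj):
    """conv_maps: (C, H, W) activations of the last convolutional layer.
    regions: (y0, y1, x0, x1) boxes at several positions and scales.
    pca_mean, pca_proj: whitening parameters learned offline."""
    feats = [conv_maps[:, y0:y1, x0:x1].max(axis=(1, 2))  # per-region max-pool
             for (y0, y1, x0, x1) in regions]
    feats = l2n(np.stack(feats))                  # l2-normalize each region
    feats = l2n((feats - pca_mean) @ pca_proj.T)  # PCA-whiten, l2 again
    return l2n(feats.sum(axis=0))                 # sum-aggregate, final l2
```

After the final normalization step, the cosine similarity between two such descriptors reduces to a plain dot product.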
Gordo et al. [17] built on the work of Tolias et al. [31] and inserted the R-MAC feature extractor in an end-to-end differentiable pipeline in order to learn a representation optimized for visual instance retrieval through back-propagation. The whole pipeline is composed of a fully convolutional neural network, a region proposal network, the R-MAC extractor, and PCA-like dimensionality reduction layers, and it is trained using a ranking loss based on image triplets. The resulting pipeline extracts an optimized R-MAC feature representation that outperforms methods based on costly local features and spatial geometry verification.
An additional performance boost is obtained by using state-of-the-art deep convolutional architectures, such as very deep residual networks [18], and by aggregating descriptors extracted at multiple resolutions. A multi-resolution R-MAC descriptor is obtained by feeding the network with images at different resolutions and then aggregating the obtained representations, summing them together and performing a final l2-normalization.
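Concretely, the multi-resolution aggregation is a sum-and-renormalize step over per-resolution R-MAC vectors; a sketch of ours (`extract_rmac` is an assumed helper, and `l2n` is the normalization helper from the previous sketch):

```python
import numpy as np

def multires_rmac(image, scales=(550, 800, 1050)):
    """extract_rmac(image, S) is an assumed helper that rescales the image so
    that its shorter side is S pixels and returns its R-MAC descriptor."""
    descs = np.stack([extract_rmac(image, S) for S in scales])
    return l2n(descs.sum(axis=0))  # sum per-resolution vectors, re-normalize
```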
In our work, we used the ResNet-101 trained model provided by [17] as an R-MAC feature extractor, which has been shown to achieve the best performance on standard benchmarks. We extracted the R-MAC features using fixed regions at two different scales, as proposed in [31], instead of using the region proposal network. Defining $S$ as the size in pixels of the minimum side of an image, we extracted both single-resolution descriptors (with $S = 800$) and multi-resolution ones (aggregating descriptors with $S = 550$, 800, and 1050).
3 SURROGATE TEXT REPRESENTATION FOR
DEEP FEATURES
As introduced above, the basic idea of permutation-based indexing
techniques is to represent feature objects as permutations built
using a set of reference object identiers as permutants.
Using the permutation-based representation, the similarity between two objects is estimated by computing the similarity between the two corresponding permutations, rather than by using the original distance function. The rationale behind this is that, when permutations are built using this strategy, objects that are very close to one another have similar permutation representations as well. In other words, if two objects are very close to one another, they will sort the set of reference objects in a very similar way.
Notice, however, that the relevant aspect when building permutations is the capability of generating sequences of identifiers (permutations) in such a way that similar objects have similar permutations. Sorting a set of reference objects according to their distance from the object to be represented is just one, yet effective, approach.
When the objects to be indexed are vectors, as in our case of deep features, we can use the approach presented in [4], which allows us to generate sequences of identifiers not associated with reference objects. The basic idea is as follows. Permutants are the indexes of the elements of the deep feature vectors. Given a deep feature vector, the corresponding permutation is obtained by sorting the indexes of the elements of the vector in descending order with respect to the values of the corresponding elements. Suppose, for instance, that the feature vector is $fv = [0.1, 0.3, 0.4, 0, 0.2]$ (in reality, the number of dimensions is 2,048 or more). The permutation-based representation of $fv$ is $\Pi_{fv} = (3, 2, 5, 1, 4)$, that is, permutant (index) 3 is in position 1, permutant 2 is in position 2, permutant 5 is in position 3, etc. The permutation vector, introduced in Section 2, is $\Pi_{fv}^{-1} = (4, 2, 1, 5, 3)$, that is, permutant (index) 1 is in position 4, permutant 2 is in position 2, permutant 3 is in position 1, etc.
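The worked example can be reproduced with two argsorts (a minimal sketch of ours; the first argsort, in descending order of activation, yields the permutation, and the second yields the permutation vector):

```python
import numpy as np

fv = np.array([0.1, 0.3, 0.4, 0.0, 0.2])

# Deep permutation: indexes (1-based) sorted by descending activation value.
pi = np.argsort(-fv, kind="stable") + 1   # Pi_fv    = (3, 2, 5, 1, 4)

# Permutation vector: the rank of each index within the permutation.
inv = np.argsort(pi - 1) + 1              # Pi_fv^-1 = (4, 2, 1, 5, 3)
```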
The intuition behind this is that features in the high levels of the neural network carry some sort of high-level visual information. We can imagine that the individual dimensions of the deep feature vectors represent some sort of visual concept, and that the value of each dimension specifies the importance of that visual concept in the image. Similar deep feature vectors sort the visual concepts (the dimensions) in the same way, according to the activation values.
Without entering into the technical details of this approach (for which we refer the reader to [4]), let us just stress that, although the permutation vectors have the same dimensionality as the DCNN vectors, the advantage is that they can be easily encoded into an inverted index. Moreover, following the intuition that the most relevant information of the permutation lies in its very first elements, we can truncate the permutation vectors to the top-$K$ elements (i.e., permutations truncated at $K$). The elements of the vectors beyond $K$ are ignored; this approach allows us to modulate the size of the vectors and to reduce the size of the index by introducing more sparsity.
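Truncation then amounts to keeping only the entries whose rank is at most K; the dict-of-ranks encoding below is just one way to picture the resulting sparse vector (a sketch of ours, not the paper's code):

```python
import numpy as np

def truncated_permutation_vector(fv, K):
    """Sparse mapping {dimension id (1-based): rank}, keeping only ranks <= K."""
    pi = np.argsort(-fv, kind="stable") + 1  # full deep permutation
    return {int(p): rank for rank, p in enumerate(pi[:K], start=1)}

fv = np.array([0.1, 0.3, 0.4, 0.0, 0.2])
print(truncated_permutation_vector(fv, K=3))  # {3: 1, 2: 2, 5: 3}
```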
In order to index the permutation vectors with a text retrieval engine such as Lucene, we use the surrogate text representation introduced in [15], which simply consists in assigning a codeword to each item of the permutation vector $\Pi^{-1}$ and repeating the codewords a number of times equal to the complement of the rank of each item within the permutation. For instance, let $\tau_i$ be the codeword corresponding to the $i$-th component of the permutation vector; for the vector $\Pi_{fv}^{-1} = (4, 2, 1, 5, 3)$, we generate the following surrogate text: “τ1 τ1 τ2 τ2 τ2 τ2 τ3 τ3 τ3 τ3 τ3 τ4 τ5 τ5 τ5”.
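A sketch of the surrogate text generation (ours; it assumes the complement of the rank is taken with respect to $K + 1$, which reproduces the example above for a full permutation with $K = n = 5$):

```python
import numpy as np

def surrogate_text(fv, K=None):
    """Repeat codeword 'tau<i>' (K + 1 - rank) times for each of the top-K dims."""
    K = K or len(fv)
    pi = np.argsort(-fv, kind="stable") + 1
    words = []
    for rank, p in enumerate(pi[:K], start=1):
        words += [f"tau{p}"] * (K + 1 - rank)
    return " ".join(sorted(words))

fv = np.array([0.1, 0.3, 0.4, 0.0, 0.2])
print(surrogate_text(fv))
# tau1 tau1 tau2 tau2 tau2 tau2 tau3 tau3 tau3 tau3 tau3 tau4 tau5 tau5 tau5
```

The tokens are sorted only to match the paper's example string; their order is irrelevant to a bag-of-words engine. The resulting string can be stored in a standard text field (e.g., in Lucene), so that querying with the surrogate text of a query feature scores documents by a term-frequency dot product, in line with Section 2.1 (assuming the engine's similarity function is configured to compute plain term-frequency products).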
4 EXPERIMENTAL EVALUATION
The assessment of the proposed algorithm in a multimedia information retrieval task was performed using the R-MAC features, extracted as explained above, from the INRIA Holidays [19] and Oxford Buildings [26] datasets.
INRIA Holidays [19] is a collection of 1,491 images, which mainly contains personal holiday photos. The images are high-resolution and represent a large variety of scene types (natural, man-made, water, fire effects, etc.). The authors selected 500 queries and manually identified a list of qualified results for each of them. As in [20], we merged the Holidays dataset with the distractor dataset MIRFlickr, which includes 1M images (http://press.liacs.nl/mirflickr/).
Oxford Buildings [26] is composed of 5,062 images of 11 Oxford landmarks downloaded from Flickr. A manually labelled ground truth is available for 5 queries for each landmark, for a total of 55 queries. As for INRIA Holidays, we merged the dataset with the distractor dataset Flickr100k, which includes 100k images (http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/).
[Figure 1: two plots, "mAP vs Sparsity - Holidays_MIRFlickr1M" and "mAP vs Sparsity - Oxford_Flickr100k", each showing mAP (y-axis, 0.1 to 0.9) against sparsity (x-axis, 0.0 to 1.0) for four curves: Baseline, Permutation, Baseline (multi-res), and Permutation (multi-res); K values 20, 37, 68, 127, 236, 438, 813, 2048 are marked near the points.]
Figure 1: mAP vs. sparsity for increasing values of K (reported near each point). Note that the sparsity of the dataset is given by 1 − K/2048. The horizontal lines represent the levels of mAP for the baselines, i.e., using the original R-MAC vectors.
We generated different sets of permutations from the original features with different values of K (i.e., we consider permutations truncated at K), and for each K, we measured the mAP obtained and the query execution times. Results on both datasets are reported in Figure 1. The experiments show the mAP of our Lucene implementation as a function of the sparsity of the database introduced by the truncation at K. The greater K (indicated near the points in the graphs), the lower the sparsity, and hence the greater the mAP. The levels of mAP for the baselines, i.e., using the original R-MAC vectors, are also reported in the figure. These levels can be considered as upper bounds for our approach, since ours is an approximate approach. However, as can be seen, at a sparsity level of about 80% we already reach satisfactory levels of effectiveness.
In order to see the impact of the sparsity of the database, in Figure 2 we report the average query time versus the parameter K. Clearly, increasing K increases the query time. However, for larger database sizes, a query reduction strategy similar to the one presented in [2] can be used.
[Figure 2: average query time in seconds (y-axis, 0 to 8) versus K (x-axis, log scale from 10^2 to 10^3).]

Figure 2: Average query times in seconds on the INRIA Holidays dataset + MIRFlickr1M distractor set for different values of K on Lucene.
5 CONCLUSION
In this paper, we presented an approach for indexing R-MAC features as permutations, built on top of the full-text retrieval engine Lucene, that exploits the surrogate text representation. The advantage is that, compared to the classical permutation-based approach, this technique does not need to compute distances between pivots and data objects, but uses the activation values of the neural network themselves as the source for associating deep feature vectors with permutations.
In this preliminary study, we obtained encouraging results on two important image retrieval benchmarks. In the future, we plan to test our approach on the more challenging Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset, available at http://bit.ly/yfcc100md.
REFERENCES
[1] Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro, Carlo Meghini, and Claudio Vairo. 2017. Deep learning for decentralized parking lot occupancy detection. Expert Systems with Applications 72 (2017), 327–334.
[2] Giuseppe Amato, Franca Debole, Fabrizio Falchi, Claudio Gennaro, and Fausto Rabitti. 2016. Large Scale Indexing and Searching Deep Convolutional Neural Network Features. In Proceedings of the 18th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2016). Springer, to appear.
[3] Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, and Fausto Rabitti. 2016. YFCC100M HybridNet fc6 Deep Features for Content-Based Image Retrieval. In Proceedings of the 2016 ACM Workshop on Multimedia COMMONS. ACM, 11–18.
[4] Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, and Lucia Vadicamo. 2016. Deep Permutations: Deep Convolutional Neural Networks and Permutation-Based Indexing. In International Conference on Similarity Search and Applications. Springer, 93–106.
[5] Giuseppe Amato, Claudio Gennaro, and Pasquale Savino. 2014. MI-File: using inverted files for scalable approximate similarity search. Multimedia Tools and Applications 71, 3 (2014), 1333–1362. DOI: http://dx.doi.org/10.1007/s11042-012-1271-1
[6] Artem Babenko and Victor Lempitsky. 2012. The inverted multi-index. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 3069–3076.
[7] Artem Babenko and Victor Lempitsky. 2015. Aggregating local deep features for image retrieval. In Proceedings of the IEEE International Conference on Computer Vision. 1269–1277.
[8] Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. 2014. Neural codes for image retrieval. In Computer Vision – ECCV 2014. Springer, 584–599.
[9] Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
[10] Fabio Carrara, Andrea Esuli, Tiziano Fagni, Fabrizio Falchi, and Alejandro Moreo Fernández. 2016. Picture it in your mind: Generating high level visual representations from textual descriptions. arXiv preprint arXiv:1606.07287 (2016).
[11] Edgar Chávez, Karina Figueroa, and Gonzalo Navarro. 2008. Effective Proximity Retrieval by Ordering Permutations. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30, 9 (2008), 1647–1658.
[12] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2013. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013).
[13] Matthijs Douze, Hervé Jégou, and Florent Perronnin. 2016. Polysemous Codes. Springer International Publishing, Cham, 785–801.
[14] Andrea Esuli. 2012. Use of permutation prefixes for efficient and scalable approximate similarity search. Information Processing & Management 48, 5 (2012), 889–902.
[15] Claudio Gennaro, Giuseppe Amato, Paolo Bolettieri, and Pasquale Savino. 2010. An Approach to Content-Based Image Retrieval Based on the Lucene Search Engine Library. In Research and Advanced Technology for Digital Libraries, Mounia Lalmas, Joemon Jose, Andreas Rauber, Fabrizio Sebastiani, and Ingo Frommholz (Eds.). Lecture Notes in Computer Science, Vol. 6273. Springer Berlin Heidelberg, 55–66. http://dx.doi.org/10.1007/978-3-642-15464-5_8
[16] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[17] Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. 2016. End-to-end learning of deep visual representations for image retrieval. arXiv preprint arXiv:1610.07940 (2016).
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385 (2015).
[19] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2008. Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search. In Computer Vision – ECCV 2008, David Forsyth, Philip Torr, and Andrew Zisserman (Eds.). Lecture Notes in Computer Science, Vol. 5302. Springer Berlin Heidelberg, 304–317. http://dx.doi.org/10.1007/978-3-540-88682-2_24
[20] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2009. Packing bag-of-features. In Computer Vision, 2009 IEEE 12th International Conference on. 2357–2364. DOI: http://dx.doi.org/10.1109/ICCV.2009.5459419
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[23] Eva Mohedano, Kevin McGuinness, Noel E. O'Connor, Amaia Salvador, Ferran Marques, and Xavier Giro-i-Nieto. 2016. Bags of Local Convolutional Features for Scalable Instance Search. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval (ICMR '16). ACM, New York, NY, USA, 327–331.
[24] David Novak, Martin Kyselak, and Pavel Zezula. 2010. On locality-sensitive indexing in generic metric spaces. In Proceedings of the Third International Conference on Similarity Search and Applications (SISAP '10). ACM, 59–66.
[25] Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. 2010. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31, 11 (2010), 1348–1358.
[26] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. 2007. Object Retrieval with Large Vocabularies and Fast Spatial Matching. In Computer Vision and Pattern Recognition, 2007. CVPR 2007. IEEE Conference on. 1–8.
[27] Ali S. Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on. IEEE, 512–519.
[28] Ali Sharif Razavian, Josephine Sullivan, Stefan Carlsson, and Atsuto Maki. 2014. Visual instance retrieval with deep convolutional networks. arXiv preprint arXiv:1412.6574 (2014).
[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. DOI: http://dx.doi.org/10.1007/s11263-015-0816-y
[30] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[31] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. 2015. Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015).