
Combining local and global visual feature similarity using a text search engine



In this paper we propose a novel approach that allows processing image content based queries expressed as arbitrary combinations of local and global visual features, by using a single index realized as an inverted file. The index was implemented on top of the Lucene retrieval engine. This is particularly useful to allow people to efficiently and interactively check the quality of the retrieval result by exploiting combinations of various features when using content based retrieval systems.
Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Fausto Rabitti
{giuseppe.amato, fabrizio.falchi, paolo.bolettieri, claudio.gennaro, fausto.rabitti}
ISTI - CNR, Pisa, Italy
Categories and Subject Descriptors: H.3 [Information
Storage and Retrieval]: H.3.3 Information Search and Retrieval
Keywords: Approximate Similarity Search, Access Methods, Lucene
1 Introduction
Image content based retrieval techniques are becoming
increasingly popular for accessing cultural heritage
information, as a complement to metadata-based search.
In fact, in some cases the metadata associated with images
do not describe the content in sufficient detail to satisfy
user queries, or are missing altogether. Images that
reproduce works of art contain a lot of implicit information
that is generally not described in manually generated
metadata.
Furthermore, in some cases users prefer, or are obliged,
to use visual queries rather than keyword-based queries.
Consider for instance a scenario where a user visiting a
tourist site is in front of a monument (a building or a statue,
for instance) whose name he/she does not even know,
and wants information about it. An easy thing to do in
that case would be to take a picture with a smartphone
and use it as a query to an image content based
retrieval engine.
This work was partially supported by the ASSETS project, funded by
the European Commission.
Image content based retrieval is typically obtained by
extracting visual features from images and by comparing the
visual features rather than the original images. Various types of
visual features are available, which offer different
performance depending on the type of application and the type of
queries. Even within the cultural heritage domain, effectively
processing different types of queries requires choosing
the right type of visual features. For example,
identifying a specific work of art, such as a specific
building or a specific statue, might require the use of
local features, such as SIFT [17] or SURF [5]. On
the other hand, recognizing specific art styles, or types of
subjects, is better achieved with combinations of
global features, such as combinations of MPEG-7
descriptors [2, 22]. In many cases, however, higher
performance can be obtained by combining global
and local features at the same time.
Providing users (or applications) with the possibility of
combining different types of features poses some efficiency
problems, especially when dealing with a very large number
of images (for instance, the web itself). Unless a predefined
combination of features is imposed beforehand, each
available type of feature must be indexed separately from
the others, and queries expressed as combinations of features
must be processed by accessing the various indexes separately
and combining the results.
In this paper we propose a novel approach that allows
processing image content based queries expressed as arbi-
trary combinations of local and global visual features, by
using a single index realized as an inverted file. More
specifically, the index was implemented on top of the Lucene text
retrieval engine.
The article is organized as follows. Section 2 introduces
the main types of visual features. Section 3 presents
related work. Section 4 presents the proposed
approach. Section 5 describes a prototype that we
have realized. Section 6 concludes.
2 Local and Global Features
As reported in [9], a feature captures a certain visual
property of an image, either globally for the entire image
or locally for a small group of pixels. The most commonly
used features include those reflecting color, texture, shape,
and interest (or salient) points in an image. In global
extraction, features are computed to capture the overall
characteristics of an image. The advantage of global extraction
is its high speed for both extracting features and computing
similarity.
In the MPEG-7 standard [14] various global features
have been defined. In particular: Scalable Color is a
histogram of the colors of the pixels in an image, when colors
are represented in the Hue Saturation Value (HSV) space.
Color Structure expresses local color structure in an image
by use of a structuring element that comprises several
image samples. Color Layout is a compact description
of the spatial distribution of colors in an image. The Edge
Histogram descriptor describes edge distribution with a
histogram based on local edge distribution in an image, using
five types of edges. The Homogeneous Texture descriptor
characterizes the properties of the texture in an image. For
extracting the MPEG-7 visual descriptors we made use of the
MPEG-7 eXperimental Model (XM) Reference Software [15].
Unfortunately, global features can be oversensitive to
location and hence fail to identify important visual
characteristics. To increase robustness to spatial transformations,
local features can be used to describe images. Features based
on local invariants (e.g. corner points or interest points)
were originally introduced for stereo matching but are
nowadays also used for multimedia information retrieval.
Local features are low level descriptions of keypoints in an
image. The most famous local feature is probably the Scale
Invariant Feature Transform (SIFT) [17]. In SIFT, keypoints
are interest points in an image that are invariant to
scale and orientation. Each keypoint in an image is
associated with one or more orientations, based on local image
gradients. Image matching is performed by comparing the
descriptions of the keypoints in the images.
Another widely used local feature is Speeded Up Robust
Features (SURF) [5], which is quite similar to SIFT. SURF
detects some keypoints in an image and describes these
keypoints using orientation information. However, SURF
uses a new method for both the detection of keypoints
and their description that is much faster while still guaranteeing
performance comparable to or even better than SIFT.
Specifically, keypoint detection relies on a technique based on an
approximation of the Hessian matrix. The descriptor of a
keypoint is built considering the distribution of Haar-wavelet
responses around the keypoint itself. For both detecting
keypoints and extracting the SURF features, we used the
publicly available noncommercial software developed by the
authors [1].
3 Related Work
3.1 Indexes for Global Features
Global features consist of visual descriptors that are ob-
tained by computing statistics of the entire image content.
In order to compare two images and assess their similarity,
each global feature is associated with a similarity function,
or equivalently with a distance function (a dissimilarity
function). The assumption is that the closer the features
are, the more similar the images are.
To efficiently process similarity queries based on global
features, various index structures and search methods have been
proposed in the literature that organize data in such a way that k
nearest neighbors search and range search can be executed
efficiently. In the case of k nearest neighbors search, the
objective is to retrieve the k features closest to the query feature.
In the case of range search, the aim is to retrieve the features
whose similarity to the query is above a certain threshold.
An extensive survey of the literature on this topic can be found
in [25]. Here we introduce some of the most significant
techniques, classifying them as sequential scan techniques,
hash based techniques, and tree based techniques.
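The two query types just described can be illustrated with a brute-force sketch, before any index is involved. It assumes features are numeric vectors compared with the Euclidean distance, so that "similarity above a threshold" becomes "distance at most a radius". The function names and the toy data are illustrative and not taken from the paper.

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_search(query, dataset, k, dist=euclidean):
    """k nearest neighbors search: the k items closest to the query, nearest first."""
    return heapq.nsmallest(k, dataset, key=lambda item: dist(query, item))

def range_search(query, dataset, radius, dist=euclidean):
    """Range search: every item whose distance from the query is at most radius."""
    return [item for item in dataset if dist(query, item) <= radius]

# Toy 2-D "features" (assumed data, for illustration only).
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
print(knn_search((0.1, 0.0), features, k=2))
print(range_search((0.1, 0.0), features, radius=1.0))
```

Both functions scan the whole dataset. The index structures surveyed in the rest of this section aim to answer the same queries while examining only a fraction of it.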
The basic technique to efficiently process similarity
search queries is to perform a single scan of the entire
database, provided that data are stored on contiguous blocks
of disk. This technique might be faster than accessing ran-
domly small blocks spread widely on the secondary storage.
A relevant approach based on sequential scan, that can be
used when features are represented in a vector space, is the
VA-File (Vector Approximation File) [24]. It reduces the
size of the data set using a coarse approximate representa-
tion of data objects. Sequential scan is performed on the
reduced data set containing the coarse representation.
Hashing techniques for similarity search use hash
functions that preserve the closeness of data objects. A significant
hashing approach applicable to data represented in vector
spaces is the grid file [20]. It partitions the search space
symmetrically in all dimensions. Each cell of the resulting
grid is associated with data buckets. Another interesting
hashing approach for similarity search in metric spaces is
the D-Index [12]. It is a multilevel hash structure that takes
advantage of the idea of excluded middle partitioning
to build an access method based on excluded middle
vantage points. A recently proposed approach
is to represent a data object as a permutation of a set of
reference objects, ordered according to their distance from
the data object being represented [7]. Similarity between
permutations is used in place of similarity between data
objects. In [3] it is shown how to use this technique with
inverted files.
Tree based access methods rely mainly on a hierarchical
decomposition of the data space. One of the first
proposed approaches is the K-d-Tree [6], which can be used with
features represented in vector spaces. Let us suppose that
vectors have dim dimensions. At every level of the tree a
value (key) is used as a discriminator for the branching decision
on a specific dimension of the corresponding vector space.
The root node (level 0) discriminates on the first dimension.
Nodes pointed to by the root (level 1) discriminate on
the second dimension, and so on, up to level dim − 1.
If more than dim levels are needed, the process starts again
from the first dimension.
R-Trees were originally proposed in [13] and can be
used when features are represented in vector spaces. The
tree is built by hierarchically grouping vectors with bounding
boxes. Leaf nodes represent bounding boxes that include
vectors. Internal nodes represent bounding boxes that include
other bounding boxes.
M-Trees [8] are secondary storage height balanced ac-
cess methods that can be used when features are represented
in metric spaces. They organize data objects by creating
partitions where sets of objects are bounded by ball regions.
As before, leaf nodes represent ball regions containing fea-
tures. Internal nodes represent ball regions containing other
ball regions.
In [18] the authors propose a similar approach that adopts
the Lucene text search engine. However, to the best of our
knowledge, that approach extends the scoring function of Lucene,
while our implementation exploits the
cosine similarity scoring, so that any full text search
engine can be adopted.
3.2 Indexes for Local Features
Comparing two images on the basis of their local fea-
tures requires matching of their interest points. Typically,
as a first step, each interest point in the query image is used
as the query for a nearest neighbor search among the points of
the other image. A different approach, recently proposed, is
to assign each interest point to a visual word of a predefined
vocabulary. At search time, two local features assigned to
the same visual word are considered as matching. In
the following, we report the most important works related
to both approaches.
Searching for the nearest neighbor of a given local feature
is a problem very similar to the one described in the
previous section. Although many similarity searches must be
performed in order to select the matching points for every local
feature describing the query image, each search can be
efficiently performed using vector or metric access structures,
just as for single global features. In fact, a very popular
index used for local feature matching is the K-d-Tree [6],
also discussed in the previous section. However, the computer vision
community has also developed some specific indexes. The
most important is probably the algorithm that applies priority
search on hierarchical k-means trees proposed in [19].
The bag of visual words model was initially proposed in
[21]. The first step to describe images using visual words is
to select some visual words creating a vocabulary. The vi-
sual vocabulary is typically built grouping local descriptors
of the dataset using a clustering algorithm such as k-means.
The second step is to describe each image using the words
of the vocabulary that occur in it. At the end of the process,
each image is described as a set of visual words.
In [16], Hamming embedding and weak geometric con-
sistency constraints were proposed to improve bag-of-
features efficacy. The proposed approach can be used in
combination with traditional inverted files when high ac-
curacy is needed. In [23] the use of term weighting tech-
niques and classical distances from text retrieval in the case
of images has been explored. The experiments show that
the effectiveness of a given weighting scheme or distance
is strongly linked to the dataset used. In the case of large
and varied image collections, the noise in descriptor
assignment and the need to use larger vocabularies tend to make
all distances and weights equivalent.
4 Approximate Content Based Retrieval on
Global and Local Features Using Text-
Based Indexing Techniques
The technique that we present here leverages a special
text-based encoding of local and global
features, such that any text retrieval engine can be used to
perform image similarity search based on local and global
visual features. As discussed later, we implemented these
ideas on top of the Lucene text retrieval engine.
Two different approaches are used to generate a convenient
text encoding for local and global visual features.
For encoding local features we use the state-of-the-art
Bag of Features approach discussed in Section
3.2. In particular, we selected 20,000 visual words out of the
local descriptors of the dataset using the k-means++
clustering method proposed in [4]. Then each image was
described using a subset of the vocabulary. In particular, for
each local feature in each image, the nearest neighbor
among the visual words is selected and added to the set
of words describing the image. Finally, the images are
converted into textual form using a textual representation
for each visual word of the vocabulary.
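The assignment of local descriptors to visual words, and the conversion of an image into text, can be sketched as follows. This is a minimal illustration that assumes the vocabulary has already been built (for example with k-means++) and is given as a list of centroids. The token format `w<index>`, the function names and the toy 2-D data are assumptions made for the example, not details from the paper.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to a local descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))

def bof_text(descriptors, vocabulary, prefix="w"):
    """Describe an image as the set of visual words nearest to its local
    descriptors, written as a whitespace-separated string of tokens."""
    words = {nearest_word(d, vocabulary) for d in descriptors}
    return " ".join(f"{prefix}{i}" for i in sorted(words))

# Toy vocabulary of three 2-D "visual words"; real SURF descriptors have
# many more dimensions and the paper uses 20,000 words.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
image_descriptors = [(0.1, 0.1), (0.9, 1.1), (0.0, 0.2)]
print(bof_text(image_descriptors, vocabulary))
```

A linear scan over the vocabulary is used here only for clarity. With a large vocabulary, one of the nearest neighbor indexes from Section 3.2 would normally replace it.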
The approach used to encode global features leverages
the perspective based space transformation developed in
[3]. The idea at the basis of this technique is that when
two global descriptors (GDs) are very similar, with respect
to a given similarity function, they 'see' the 'world around'
them in the same way. In the following, we will see that the
'world around' can be encoded as a surrogate text
representation (STR), which can be managed with an inverted index
by means of a standard text-based search. The conversion
of the GDs into textual form allows us to employ the
off-the-shelf indexing and searching abilities of the search
engine with little implementation effort.
Let $\mathcal{D}$ be a domain of GDs and $X \subset \mathcal{D}$ a dataset. We
define a ranking function $f_k(o, X)$ as a function that takes a
GD $o$ as input and returns an ordered subset of $k$ GDs from
$X$ in order of decreasing similarity to $o$. $f_k$ is also known
as a top-k query.
Let $R \subset \mathcal{D}$ be a set of $m$ reference GDs chosen from $\mathcal{D}$.
Given a GD $o$, $f_k(o, R)$ returns the top-k elements of $R$ in order of
decreasing similarity with $o$. Let $p^k_i(o, R)$ be the position that
$r_i \in R$ assumes in $f_k(o, R)$, assuming $p^k_i(o, R) = k + 1$
when $r_i$ is not present in $f_k(o, R)$. We define the distance
function $d(o, q)$ between two GDs $o$ and $q$ as follows:
$$d(o, q) = \sqrt{\sum_{i=1}^{m} \left( p^{k_x}_i(o, R) - p^{k_q}_i(q, R) \right)^2}$$
where $k_x$ and $k_q$ are two integers such that $k_x \ge k_q$.
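The rank positions and the distance $d(o, q)$ defined above can be sketched in a few lines. The example assumes that GDs are numeric vectors and that "decreasing similarity" means increasing Euclidean distance. The function names and the one-dimensional toy data are illustrative only.

```python
import math

def top_k(o, refs, k, dist=math.dist):
    """f_k(o, R): indices of the k reference descriptors closest to o."""
    return sorted(range(len(refs)), key=lambda i: dist(o, refs[i]))[:k]

def positions(o, refs, k, dist=math.dist):
    """p_i^k(o, R) for every i: 1-based rank of r_i in the top-k list,
    or k + 1 when r_i does not appear in it."""
    pos = [k + 1] * len(refs)
    for rank, i in enumerate(top_k(o, refs, k, dist), start=1):
        pos[i] = rank
    return pos

def perm_distance(o, q, refs, kx, kq, dist=math.dist):
    """d(o, q): Euclidean distance between the two vectors of rank positions."""
    po = positions(o, refs, kx, dist)
    pq = positions(q, refs, kq, dist)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(po, pq)))

# Four 1-D reference descriptors and two objects (assumed toy data).
refs = [(0.0,), (1.0,), (2.0,), (3.0,)]
o, q = (0.2,), (2.9,)
print(positions(o, refs, k=2))   # [1, 2, 3, 3]
print(positions(q, refs, k=2))   # [3, 3, 2, 1]
print(perm_distance(o, q, refs, kx=2, kq=2))
```

Here $o$ and $q$ lie at opposite ends of the reference set, so their rank vectors differ in every component and the distance is $\sqrt{10}$. Two objects that rank the references identically are at distance 0.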
$d(o, q)$ is a generalization of the Spearman Rho Distance
with location parameter for the special case $l = k_x = k_q$
[10], which evaluates the distance (or dissimilarity) of two
top-k ranked lists. By using this distance we can produce an
approximate ranking function $\tilde{f}_k$. Note, in fact, that $d$
measures the discrepancy between the orderings of the reference
descriptors in $R$, as seen from $o$ and $q$ respectively. These two
orderings can be seen as representations of the view of the
'world around' $o$ and $q$. According to [3], the more similar
$o$ and $q$ are, the more similar their views of the world around, so
$\tilde{f}_k(q, X)$ is an approximation of $f_k(q, X)$.
In order to implement the functions $d(o, q)$ and $\tilde{f}_k(q, X)$
in an efficient way, leveraging the search functionality
offered by a text retrieval engine, we associate each
element $r_i \in R$ with a unique alphanumeric keyword $\tau_i$.
Then we use the function $t^{\bar{k}}(o)$, defined in the following,
to obtain a space-separated concatenation of zero or more
repetitions of the $\tau_i$ words:
$$t^{\bar{k}}(o) = \bigcup_{i=1}^{m} \; \bigcup_{j=1}^{\bar{k} + 1 - p^{\bar{k}}_i(o, R)} \tau_i$$
where, by abuse of notation, we denote the space-separated
concatenation of words with the union operator
$\bigcup$. The function $t^{\bar{k}}(o)$ returns a text representation of $o$
such that, if the reference descriptor $r_i$ appears in position $s$
in $f_{\bar{k}}(o, R)$, then the term $\tau_i$ is repeated $(\bar{k} + 1) - s$ times in
the text. The function $t^{\bar{k}}(o)$ is used to generate the STR to
be used for both indexing and querying purposes. Specifically,
we use $\bar{k} = k_x$ for indexing and $\bar{k} = k_q$ for querying.
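The text generator described above can be sketched as follows. As in the definition, the reference descriptor at position $s$ of the top-k list contributes $(k + 1) - s$ copies of its keyword, and references outside the list contribute none. The keyword format `R<index>`, the use of the Euclidean distance and the toy data are assumptions made for the example.

```python
import math

def surrogate_text(o, refs, k, dist=math.dist, prefix="R"):
    """Surrogate text representation (STR) of o: the keyword of the
    reference at 1-based rank s is repeated (k + 1) - s times."""
    ranked = sorted(range(len(refs)), key=lambda i: dist(o, refs[i]))[:k]
    tokens = []
    for s, i in enumerate(ranked, start=1):
        tokens.extend([f"{prefix}{i}"] * (k + 1 - s))
    return " ".join(tokens)

# Four 1-D reference descriptors (assumed toy data).
refs = [(0.0,), (1.0,), (2.0,), (3.0,)]
print(surrogate_text((0.2,), refs, k=2))   # R0 R0 R1
print(surrogate_text((2.9,), refs, k=3))   # R3 R3 R3 R2 R2 R1
```

The tokens are emitted in rank order for readability. Since the text is later treated as a bag of terms, only the number of repetitions of each keyword matters, not the order.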
Most text search engines, including Lucene, use the Vector
Space model to represent text. In this representation, a
text document is represented as a vector of terms, each
associated with the number of occurrences of the term in the
document. In our case, this means that if, for instance, the term
$\tau_i$ corresponding to the reference descriptor $r_i$ ($1 \le i \le m$)
appears $n$ times, the $i$-th element of the vector will contain
the number $n$, and whenever $\tau_i$ does not appear it will contain
0. Let us refer to these vectors of size $m$ as $o^{k_x}$ and
$o^{k_q}$, which correspond to $t^{k_x}(o)$ and $t^{k_q}(o)$, respectively.
The cosine similarity is typically adopted to determine the
similarity of the query vector and a vector in the database
of the text search engine, and it is defined as:
$$\mathrm{sim}_{\cos}(o, q) = \frac{o^{k_x} \cdot q^{k_q}}{\| o^{k_x} \| \, \| q^{k_q} \|}$$
where $\cdot$ is the scalar product. $\mathrm{sim}_{\cos}$ can be used as a
function that evaluates the similarity of the two ranked lists
in the same way as $d(o, q)$ (although the latter is defined as a distance),
and it is possible to prove that the first is an order
reversing monotonic transformation of the second, and
hence that they are equivalent for practical purposes (see footnote 1). This
means that if we use $d(o, q)$ and take the first $k$ nearest
GDs from $X$ (i.e., from the shortest distance to the longest)
we obtain exactly the same descriptors in the same order as if
we use $\mathrm{sim}_{\cos}(o, q)$ and take the first $k$ most similar GDs (i.e.,
from the greatest values to the smallest).
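The equivalence stated above can be checked numerically in a simple setting. The sketch below is not the proof given in [11]. It only covers the special case $k_x = k_q = k$ with $m \ge k$, where every term vector has the same squared norm $C = 1^2 + 2^2 + \dots + k^2$, so that $d^2(o, q) = 2C - 2\,(o \cdot q) = 2C\,(1 - \mathrm{sim}_{\cos}(o, q))$. All names and parameter values are illustrative.

```python
import random

def rank_vector(top_k_refs, m, k):
    """Term-frequency vector of an STR: the reference at 1-based rank s
    gets weight k + 1 - s, every other reference gets 0."""
    v = [0] * m
    for s, i in enumerate(top_k_refs, start=1):
        v[i] = k + 1 - s
    return v

def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_d(vo, vq):
    # With k_x = k_q = k, p_i = (k + 1) - v_i, so the rank differences
    # p_i(o) - p_i(q) equal v_i(q) - v_i(o).
    return sum((a - b) ** 2 for a, b in zip(vo, vq))

def check_identity(m=20, k=5, trials=1000, seed=0):
    """Check d^2 = 2C - 2 (o . q) on random top-k lists, C = 1^2 + ... + k^2."""
    rng = random.Random(seed)
    c = sum(j * j for j in range(1, k + 1))  # squared norm of every vector
    q = rank_vector(rng.sample(range(m), k), m, k)
    for _ in range(trials):
        o = rank_vector(rng.sample(range(m), k), m, k)
        if squared_d(o, q) != 2 * c - 2 * dot(o, q):
            return False
    return True

print(check_identity())
```

Because $C$ is the same for every object, a larger cosine similarity always corresponds to a smaller distance, which is the order reversing behavior described in the text. All quantities are integers, so the check involves no floating point tolerance.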
To summarize, given a dataset of objects $X \subset \mathcal{D}$ and a
ranking function $f_k$, we are able to produce an approximate
ranking function $\tilde{f}_k$ starting from a random selection of $m$
reference descriptors $R$ from $\mathcal{D}$. The approximate ranking
function $\tilde{f}_k$ can be obtained using the text generator $t^{\bar{k}}(o)$,
indexing the documents corresponding to $t^{k_x}(o)$ for all
$o \in X$ with a standard text-based search engine, and using
the text $t^{k_q}(q)$ for querying with the object $q$.
5 Prototype Description
In order to index and search a collection of images by
visual features, the following operations have to be
performed, as depicted in the overview of Figure 1:
Feature Extraction – Starting from image files, the
system extracts visual descriptors. In particular, we
adopted the SURF descriptor [5] as local feature and five
MPEG-7 descriptors [14] (CS, CL, SC, HT, EH) as
global features.
Indexing – Each image is associated with six text
fields, one corresponding to the BoF obtained from the
SURF descriptors and five STRs, one for each MPEG-7
descriptor. These segments, which form the basic
units on which search is performed, are stored in the
Lucene inverted index for fast processing. We use the
WhitespaceAnalyzer in order to avoid stemming and
stop word analysis.
Querying – The similarity computation between a
query image and the images in the collection is based
on the standard full text facilities of Lucene. In
particular, we use the multi field query feature of Lucene,
which allows us to combine different descriptors in the
same query.
Figure 1. Overview of the indexing architecture.
1 To be precise, it is possible to prove that $\mathrm{sim}_{\cos}(o, q)$ is an order
reversing monotonic transformation of $d^2(o, q)$. However, since $d(o, q)$ is
non-negative, squaring it is monotonic and this does not affect the ordering.
Due to lack of space, the proof of this statement is given in [11].
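The idea behind the Querying step, combining an arbitrary subset of descriptor fields in one query, can be illustrated with a small in-memory sketch. It does not use Lucene and does not reproduce Lucene's scoring formula. It simply sums one cosine similarity per selected field. The field names `bof` and `cs`, the token formats and the toy index are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query_doc, index, fields, top=10):
    """Rank images by the sum of per-field cosine similarities,
    using only the fields selected for this query."""
    q = {f: Counter(query_doc[f].split()) for f in fields}
    scored = []
    for image_id, doc in index.items():
        score = sum(cosine(q[f], Counter(doc[f].split())) for f in fields)
        scored.append((score, image_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for _, image_id in scored[:top]]

# Toy index: one BoF field and one STR field per image (assumed data).
index = {
    "img1": {"bof": "w1 w2 w3", "cs": "R0 R0 R1"},
    "img2": {"bof": "w7 w8", "cs": "R1 R1 R0"},
    "img3": {"bof": "w1 w2", "cs": "R4 R4 R5"},
}
query = {"bof": "w1 w2", "cs": "R0 R0 R1"}
print(search(query, index, ["bof"]))        # ['img3', 'img1', 'img2']
print(search(query, index, ["cs"]))         # ['img1', 'img2', 'img3']
print(search(query, index, ["bof", "cs"]))  # ['img1', 'img3', 'img2']
```

The three calls show how the ranking changes with the selected combination: `img3` wins on local features alone, `img1` wins on the global descriptor alone, and `img1` also wins when both are combined. In the prototype the same effect is obtained with a single multi field query against the inverted index.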
For both testing and demonstration, we developed a web
user interface to search among the indexed images. In the
following we briefly describe the web user interface, which
is publicly available at the address http://melampo.
Starting from the home page, the user can perform an image
similarity search beginning from one of the randomly selected
images, or from an image uploaded through the upload form
(the visual features will be automatically extracted by the
system). For each visual search it is possible to set, by means of
some checkboxes placed above the search bar, the combination
of visual descriptors used to perform the query. The
available checkboxes allow the user to combine the 5 MPEG-7
descriptors (CS, SC, CL, HT, EH) with Bag of Features (BoF). A
typical results page shows the images most similar to the
query, and through some links and the visual descriptor
checkboxes a user can:
perform a similarity search with the given result as query, using the 'similar' link available on top of each query result
select a new combination of visual descriptors
show the image info by clicking the link at the bottom of each image result
see the next/previous results in order of relevance to the query using the navigation buttons at the bottom of the page
go back to the home page
In the results page, whenever the descriptor combination is
changed, the last query is automatically performed again with the new
combination.
6 Conclusions and future work
Good performance of an image retrieval system is typically
obtained by the right choice of visual features. In
many cases a combination of various features is needed to
obtain the best performance for a query. In this paper we
propose a novel approach to index different types of features
at the same time and to allow the user to interactively choose
the right combination for the current query. The technique
was implemented on top of the Lucene retrieval engine.
References
[1] SURF detector.˜surf/. last accessed on 10-Jan-2011.
[2] G. Amato, F. Falchi, C. Gennaro, F. Rabitti, P. Savino,
and P. Stanchev. Improving image similarity search ef-
fectiveness in a multimedia content management sys-
tem. In Proc. of Workshop on Multimedia Information
System (MIS), pages 139–146, 2004.
[3] G. Amato and P. Savino. Approximate similarity
search in metric spaces using inverted files. In Pro-
ceedings of the 3rd international conference on Scal-
able information systems (InfoScale ’08), pages 1–10.
ICST, 2008.
[4] D. Arthur and S. Vassilvitskii. k-means++: the ad-
vantages of careful seeding. In Proceedings of the
eighteenth annual ACM-SIAM symposium on Discrete
algorithms, SODA ’07, pages 1027–1035, Philadel-
phia, PA, USA, 2007. Society for Industrial and Ap-
plied Mathematics.
[5] H. Bay, T. Tuytelaars, and L. J. V. Gool. Surf: Speeded
up robust features. In ECCV (1), pages 404–417,
[6] J. Bentley. Multidimensional binary search trees in
database applications. IEEE Transactions on Software
Engineering, 5(4):333–340, 1979.
[7] E. Chavez Gonzalez, K. Figueroa, and G. Navarro. Ef-
fective proximity retrieval by ordering permutations.
IEEE Trans. Pattern Anal. Mach. Intell., 30(9):1647–
1658, 2008.
[8] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An
efficient access method for similarity search in met-
ric spaces. In M. Jarke, M. J. Carey, K. R. Dittrich,
F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld,
editors, VLDB’97, Proceedings of 23rd International
Conference on Very Large Data Bases, August 25-29,
1997, Athens, Greece, pages 426–435. Morgan Kauf-
mann, 1997.
[9] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image re-
trieval: Ideas, influences, and trends of the new age.
ACM Comput. Surv., 40:5:1–5:60, May 2008.
[10] R. Fagin, R. Kumar, and D. Sivakumar. Comparing
top-k lists. SIAM J. of Discrete Math., 17(1):134–160,
[11] C. Gennaro, G. Amato, P. Bolettieri, and P. Savino.
An approach to content-based image retrieval based
on the lucene search engine library. In M. Lalmas,
J. Jose, A. Rauber, F. Sebastiani, and I. Frommholz,
editors, Research and Advanced Technology for Digi-
tal Libraries, volume 6273 of Lecture Notes in Com-
puter Science, pages 55–66. Springer Berlin / Heidel-
berg, 2010.
[12] C. Gennaro, P. Savino, and P. Zezula. Similarity
search in metric databases through hashing. In Pro-
ceedings of MIR 2001 - 3rd Intl Workshop on Multi-
media Information Retrieval October 5, 2001. Ottawa,
Canada, 2001. In conjunction with ACM Multimedia
[13] A. Guttman. R-trees: A dynamic index structure for
spatial searching. In Proceedings of the 1984 ACM
SIGMOD International Conference on Management
of Data, Boston, MA, pages 47–57, 1984.
[14] ISO/IEC. Information technology - Multimedia con-
tent description interfaces, 2003. 15938.
[15] ISO/IEC. Information technology - Multimedia con-
tent description interfaces. Part 6: Reference Soft-
ware, 2003. 15938-6:2003.
[16] H. Jégou, M. Douze, and C. Schmid. Improving bag-
of-features for large scale image search. Int. J. Com-
put. Vision, 87:316–336, May 2010.
[17] D. G. Lowe. Distinctive image features from scale-
invariant keypoints. International Journal of Com-
puter Vision, 60(2):91–110, 2004.
[18] M. Lux and S. A. Chatzichristofis. Lire: lucene im-
age retrieval: an extensible java cbir library. In MM
’08: Proceeding of the 16th ACM international con-
ference on Multimedia, pages 1085–1088, New York,
NY, USA, 2008. ACM.
[19] M. Muja and D. G. Lowe. Fast approximate near-
est neighbors with automatic algorithm configuration.
In VISAPP 2009 - Proceedings of the Fourth Inter-
national Conference on Computer Vision Theory and
Applications, Lisboa, Portugal, February 5-8, 2009 -
Volume 1, pages 331–340, 2009.
[20] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The
grid file: An adaptable, symmetric multikey file struc-
ture. TODS, 9(1):38–71, 1984.
[21] J. Sivic and A. Zisserman. Video google: A text
retrieval approach to object matching in videos. In
Proceedings of the Ninth IEEE International Confer-
ence on Computer Vision - Volume 2, ICCV ’03, pages
1470–, Washington, DC, USA, 2003. IEEE Computer
[22] E. Spyrou, H. Le Borgne, T. Mailis, E. Cooke,
Y. Avrithis, and N. O’Connor. Fusing mpeg-7 visual
descriptors for image classification. In Proceedings of
the 15th International Conference on Artificial Neural
Networks (ICANN’05), pages 847–852, 2005.
[23] P. Tirilly, V. Claveau, and P. Gros. Distances and
weighting schemes for bag of visual words image re-
trieval. In Proceedings of the international conference
on Multimedia information retrieval, MIR ’10, pages
323–332, New York, NY, USA, 2010. ACM.
[24] R. Weber, H.-J. Schek, and S. Blott. A quantitative
analysis and performance study for similarity-search
methods in high-dimensional spaces. In A. Gupta,
O. Shmueli, and J. Widom, editors, VLDB’98, Proceedings
of 24th International Conference on Very
Large Data Bases, August 24-27, 1998, New York City,
New York, USA, pages 194–205. Morgan Kaufmann,
[25] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Simi-
larity Search - The Metric Space Approach, volume 32
of Advances in Database Systems. Springer, 2006.
Technical Report
The Artificial Intelligence for Multimedia Information Retrieval (AIMIR) research group is part of the NeMIS laboratory of the Information Science and Technologies Institute "A. Faedo" (ISTI) of the Italian National Research Council (CNR). The AIMIR group has long experience in topics related to Artificial Intelligence, Multimedia Information Retrieval, Computer Vision and large-scale similarity search. We aim at investigating the use of Artificial Intelligence and Deep Learning for Multimedia Information Retrieval, addressing both effectiveness and efficiency. Multimedia information retrieval techniques should be able to provide users with pertinent results, fast, on huge amounts of multimedia data. Application areas of our research results range from cultural heritage to smart tourism, from security to smart cities, from mobile visual search to augmented reality. This report summarizes the 2019 activities of the research group.
... [9] introduced MI-File, an approach that allows using inverted files to perform similarity search with an arbitrary similarity function. In [4,5] a Surrogate Text Representation (STR) derived from the MI-File has been proposed. The conversion of the permutations in a textual form allows using off-the-shelf text search engines for similarity search. ...
During the last 35 years, data management principles such as physical and logical independence, declarative querying and cost-based optimization have led to profound pervasiveness of relational databases in any kind of organization. More importantly, these technical advances have enabled the first round of business intelligence applications and laid the foundation for managing and analyzing Big Data today.
... In [17,1] a Surrogate Text Representation (STR) derived from the MI-File has been proposed. The conversion of the permutations in a textual form allows using off-the-shelf text search engines for similarity search. ...
Conference Paper
Permutation based approaches represent data objects as ordered lists of predefined reference objects. Similarity queries are executed by searching for data objects whose permutation representation is similar to the query one. Various permutation-based indexes have been recently proposed. They typically allow high efficiency with acceptable effectiveness. Moreover, various parameters can be set in order to find an optimal trade-off between quality of results and costs. In this paper we studied the permutation space without referring to any particular index structure, focusing on both theoretical and experimental aspects. We used both synthetic and real-world datasets for our experiments. The results of this work are relevant in both developing and setting parameters of permutation-based similarity searching approaches.
... In particular the MI-File allows one to use inverted files to perform similarity search with an arbitrary similarity function. Moreover, in (Gennaro et al., 2010; Amato et al., 2011) a Surrogate Text Representation (STR) derived from the MI-File has been proposed. The conversion of the image description in textual form enables us to exploit the off-the-shelf search engine features with little implementation effort. ...
Conference Paper
Surrogate Text Representation (STR) is a profitable solution for efficient similarity search in metric spaces using conventional text search engines, such as Apache Lucene. This technique is based on comparing the permutations of some reference objects in place of the original metric distance. However, the Achilles heel of the STR approach is the need to reorder the result set of the search according to the metric distance. This forces the use of a support database to store the original objects, which requires efficient random I/O on fast secondary memory (such as flash-based storage). In this paper, we propose to extend the Surrogate Text Representation to specifically address a class of visual metric objects known as Vector of Locally Aggregated Descriptors (VLAD). This approach is based on representing the individual sub-vectors forming the VLAD vector with the STR, providing a finer representation of the vector and enabling us to get rid of the reordering phase. The experiments on a publicly available dataset show that the extended STR outperforms the baseline STR, achieving satisfactory performance close to that obtained with the original VLAD vectors.
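The baseline STR idea mentioned in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the cited implementation: the function name `surrogate_text`, the term names `R1`..`R4`, the Euclidean distance, and the repetition scheme (the i-th nearest of k reference objects is repeated k - i + 1 times) are chosen here as one plausible way to make term frequency reflect a reference object's rank.

```python
import math


def surrogate_text(obj, refs, k):
    """Build a surrogate text for obj from its k nearest reference objects.

    refs maps a term (e.g. "R1") to a reference vector. The closer a
    reference object is to obj, the more often its term is repeated,
    so term frequency encodes the position in the permutation.
    """
    # Order reference terms by distance from obj; ties broken by term name.
    ordered = sorted(refs, key=lambda name: (math.dist(obj, refs[name]), name))
    words = []
    for pos, name in enumerate(ordered[:k]):
        words.extend([name] * (k - pos))  # nearest term is repeated k times
    return " ".join(words)


# Hypothetical reference objects in a 2-D space.
refs = {"R1": (0, 0), "R2": (10, 0), "R3": (0, 10), "R4": (10, 10)}
print(surrogate_text((1, 2), refs, k=3))  # R1 R1 R1 R3 R3 R2
```

Strings produced this way can be indexed and queried by a conventional text search engine, whose term-frequency-based ranking then stands in for a comparison of permutations. As the abstract notes, the result list would normally still be reordered by the original distance.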
... Localized feature representations employing the MPEG-7, HS and the HSV colour histogram descriptor [28] are extracted and achieve better results compared to global techniques. In [37], authors index a collection of images combining local and global features. The method extracts SURF local features and five MPEG-7 descriptors (CS, CL, SC, HT, EH) as global features. ...
In this paper, we explore, extend and simplify the localization of the description ability of the well-established MPEG-7 (Scalable Colour Descriptor (SCD), Colour Layout Descriptor (CLD) and Edge Histogram Descriptor (EHD)) and MPEG-7-like (Color and Edge Directivity Descriptor (CEDD)) global descriptors, which we call the SIMPLE family of descriptors. Sixteen novel descriptors are introduced that utilize four different sampling strategies for the extraction of image patches to be used as points of interest. Designing with focused attention for content-based image retrieval tasks, we investigate, analyse and propose the preferred process for the definition of the parameters involved (point detection, description, codebook sizes and descriptors’ weighting strategies). The experimental results conducted on four different image collections reveal an astonishing boost in the retrieval performance of the proposed descriptors compared to their performance in their original global form. Furthermore, they manage to outperform common SIFT- and SURF-based approaches while they perform comparably, if not better, against recent state-of-the-art methods that base their success on much more complex data manipulation.
Conference Paper
As a result of digitization initiatives in recent years, most galleries hold digital copies of their masterpieces. In order to attract more visitors, public galleries are interested in advertising their content on websites and tourist-centric applications deployed in public spaces. The online version of CultureCam has the goal of stimulating the reuse of cultural heritage content by creative designers. In this paper, we present the Interactive Installation version of CultureCam tool, which has the goal of attracting the interest of public users when exploring public galleries. It concentrates on enhancing the user experience, by offering access to the images in an immersive environment, using an intuitive, easy-to-use tool that supports touch free interaction with the gallery content. A novel image similarity search algorithm was developed in order to adapt to user expectations when searching in small image datasets. The user feedback collected from exhibitions in different European cities indicates a very high acceptance of the CultureCam tool by the public. The intuitive and seamless interaction with the tool, as well as the automation and enhancement of the search algorithm are the main improvements over the previous version of CultureCam.
Conference Paper
With the continuous reduction in the cost of internet-enabled smartphones and recent advances in mobile computing, demand is growing for content based image retrieval tools used for landmark identification on smartphones. We have developed a system that allows users to take photos of famous landmarks using a smartphone and receive information about those landmarks from the system. This landmark image retrieval system consists of an Android application as the frontend client and a backend web server. It is based on Lucene and takes advantage of GPS data. In this paper we present our ideas, design and architecture of the system. We also propose an approach that uses combined features and GPS filtering to improve the accuracy of classifying images. The experimental results show that the accuracy of the combined features is higher than that of individual features and that the system works quickly and reliably.
We propose a new approach to perform approximate similarity search in metric spaces. The idea at the basis of this technique is that when two objects are very close to each other they 'see' the world around them in the same way. Accordingly, we can use a measure of dissimilarity between the views of the world, from the perspective of the two objects, in place of the distance function of the underlying metric space. To exploit this idea we represent each object of a dataset by the ordering of a number of reference objects of the metric space according to their distance from the object itself. In order to compare two objects of the dataset we compare the two corresponding orderings of the reference objects. We show that efficient and effective approximate similarity searching can be obtained by using inverted files, relying on this idea. We show that the proposed approach performs better than other approaches in the literature.
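The ordering-based comparison described in this abstract can be made concrete with a short Python sketch. The abstract does not name the dissimilarity measure between orderings; the Spearman footrule used below is one common choice and is an assumption here, as are the function names and the toy 2-D data.

```python
import math


def permutation(obj, refs):
    """Return reference-object indices ordered by distance from obj."""
    return sorted(range(len(refs)), key=lambda i: (math.dist(obj, refs[i]), i))


def spearman_footrule(perm_a, perm_b):
    """Sum, over reference objects, of the absolute difference in position."""
    pos_b = {ref: p for p, ref in enumerate(perm_b)}
    return sum(abs(p - pos_b[ref]) for p, ref in enumerate(perm_a))


# Hypothetical reference objects and data objects in a 2-D Euclidean space.
refs = [(0, 0), (10, 0), (0, 10), (10, 10)]
a, b, c = (1, 2), (2, 1), (9, 9)

pa, pb, pc = (permutation(o, refs) for o in (a, b, c))
print(pa, pb, pc)                 # [0, 2, 1, 3] [0, 1, 2, 3] [3, 1, 2, 0]
print(spearman_footrule(pa, pb))  # 2
print(spearman_footrule(pa, pc))  # 8
```

Objects `a` and `b` are close in the original space and their orderings differ only slightly, while the distant object `c` sees the reference objects in nearly reversed order, which is the intuition the abstract relies on.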
Conference Paper
In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF’s strong performance.
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Conference Paper
Cited by 6 (since 1996). Conference: ACM Multimedia 2001 Workshops - Multimedia Information Retrieval. Conference date: 5 October 2001. Conference code: 58702.
Conference Paper
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching in two full length feature films.
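The text-retrieval analogy in this abstract (vector quantization of descriptors into "visual words", followed by an inverted file and ranking) can be sketched as follows. This is a simplified illustration, not the cited method: the tiny hand-written codebook, the function names, and the ranking by shared word counts (instead of a weighted scheme such as tf-idf) are assumptions.

```python
import math
from collections import Counter, defaultdict


def quantize(descriptor, codebook):
    """Return the index of the nearest codebook centre (the 'visual word')."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))


def build_inverted_file(frames, codebook):
    """Map each visual word to the frames containing it, with occurrence counts."""
    inverted = defaultdict(Counter)
    for frame_id, descriptors in frames.items():
        for d in descriptors:
            inverted[quantize(d, codebook)][frame_id] += 1
    return inverted


def search(query_descriptors, inverted, codebook):
    """Rank frames by the number of visual-word occurrences shared with the query."""
    scores = Counter()
    query_words = Counter(quantize(d, codebook) for d in query_descriptors)
    for word, q_count in query_words.items():
        # Only the posting lists of the query's words are visited.
        for frame_id, count in inverted.get(word, {}).items():
            scores[frame_id] += min(q_count, count)
    return [frame_id for frame_id, _ in scores.most_common()]


# Hypothetical 2-D "descriptors" and a two-word codebook.
codebook = [(0, 0), (10, 10)]
frames = {"f1": [(0, 1), (1, 0)], "f2": [(9, 9), (0, 0)], "f3": [(10, 9)]}
inverted = build_inverted_file(frames, codebook)
print(search([(0, 0.5), (1, 1)], inverted, codebook))  # ['f1', 'f2']
```

Because quantization is done once at indexing time, a query touches only the posting lists of its own visual words, which is what makes retrieval fast on large collections.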