Generic similarity search engine demonstrated by an image retrieval application



We introduce a generic engine for large-scale similarity search and demonstrate it on a set of 100 million Flickr images.
David Novak
Masaryk University
Brno, Czech Republic
Michal Batko
Masaryk University
Brno, Czech Republic
Pavel Zezula
Masaryk University
Brno, Czech Republic
Practically any information can currently be in digital form. Searching the future Internet will be complicated by (1) the diversity of data types and the ways in which data can be sorted, compared, or classified, and (2) the quickly increasing amount of digital data. Accordingly, a successful search engine must address the problems of extensibility and scalability. We present and demonstrate the capabilities of MUFIN (Multi-Feature Indexing Network). From a general point of view, the search problem has three dimensions: (1) data and query types, (2) index structures and search algorithms, and (3) the infrastructure to run the system on. MUFIN adopts the metric space as a very general model of similarity [3]. Its indexing and searching mechanisms are based on the concept of structured Peer-to-Peer (P2P) networks, which makes the approach highly scalable and independent of the specific hardware infrastructure. This approach is schematically depicted in the following figure.
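The metric-space model of similarity can be illustrated by a minimal sketch. The vectors, the Euclidean metric, and the exhaustive scan below are illustrative assumptions only; MUFIN works with MPEG-7 descriptors and answers queries through distributed index structures rather than a linear scan:

```python
import heapq
import math

def euclidean(x, y):
    # Any pairwise distance satisfying non-negativity, symmetry, identity,
    # and the triangle inequality qualifies as a metric; Euclidean
    # distance over feature vectors is one common choice.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn(database, query, k, dist=euclidean):
    # Exhaustive k-nearest-neighbour query: return the k objects with the
    # smallest distance to the query object. A metric index such as
    # M-Chord answers the same query without scanning the whole collection.
    return heapq.nsmallest(k, database, key=lambda obj: dist(obj, query))

db = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (5.0, 5.0)]
print(knn(db, (0.0, 0.0), 2))  # the two objects closest to the query
```

Because the model relies only on the metric axioms, the same search mechanism applies unchanged to any data type for which a suitable distance function exists.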
We demonstrate an “instance of MUFIN” designed for content-based search in large databases of general digital images. The dataset consists of 100 million images taken from the CoPhIR (Content-based Photo Image Retrieval) Database. Each image is represented by five global MPEG-7 descriptors [1] aggregated into a single metric space, and the system retrieves the k images that are the most similar to a given query image (according to the aggregated metric). The data is indexed by M-Chord [2], a P2P-based data structure, with 2000 logical peers. Peers organize their data locally in an M-Tree. The system physically runs on six IBM servers (two quad-core CPUs, 16 GB RAM, and six disks with RAID 5 each). The search is demonstrated via a Web-based interface that is available online. The following figure shows an example of a MUFIN query result.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR 2009, Boston, USA
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.
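The aggregation of the five MPEG-7 descriptor distances into a single metric can be sketched as follows. The descriptor names, the L1 partial distance, and the weights below are illustrative assumptions, not the exact combination used by the demonstrated system:

```python
def l1(x, y):
    # Illustrative partial distance between two descriptor vectors.
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical per-descriptor weights. A positively weighted sum of
# metrics is itself a metric, so the combined function can directly
# drive metric index structures such as M-Chord and M-Tree.
WEIGHTS = {
    "scalable_color": 2.0,
    "color_structure": 3.0,
    "color_layout": 2.0,
    "edge_histogram": 4.0,
    "homogeneous_texture": 0.5,
}

def aggregated_distance(img_a, img_b):
    # Each image is modeled as a dict: descriptor name -> feature vector.
    return sum(w * l1(img_a[name], img_b[name])
               for name, w in WEIGHTS.items())

a = {name: (0.0, 0.0) for name in WEIGHTS}
b = {name: (1.0, 1.0) for name in WEIGHTS}
print(aggregated_distance(a, b))  # 2.0 per descriptor, weighted sum: 23.0
```

The key design point is that the combination is performed once, at the metric level, so the index structure never needs to know how many descriptors contribute to the distance.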
Several systems for Content-based Image Retrieval (CBIR) are currently available:
ALIPR searches a set of images according to automatically generated annotations,
ImBrowse allows searching about 750,000 images by color, texture, and shapes (and combinations thereof), employing five independent engines,
Idée searches a commercial database of 2.8 million images according to image signatures,
GazoPa is a private service by Hitachi searching 50 million images by color and shape.
All current CBIR systems, except for the very recent project GazoPa, search databases two orders of magnitude smaller than MUFIN's. Moreover, approaches based on signatures usually work well only for near-duplicates. The mentioned systems are designed only for searching digital images by a specific method, which is in contrast with the highly versatile MUFIN approach.
[1] MPEG-7: Multimedia content description interfaces. Part 3: Visual. ISO/IEC 15938-3:2002, 2002.
[2] D. Novak and P. Zezula. M-Chord: A scalable distributed similarity search structure. In Proceedings of INFOSCALE 2006, New York, 2006. ACM Press.
[3] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach. Springer, 2006.
Technical Requirements
The image search engine runs continuously on our hardware infrastructure, and we demonstrate it via a standard Web-based interface using our own notebook. For the demonstration, we require:
power access, ideally with a European power adapter,
an Internet connection, preferably a cable connection (because of bandwidth and latency),
a projector and a projection screen or a large display, if available,
a board for a poster with the description of our approach (not necessary).
Digital images have become a commodity which is searched on the Web as ordinarily as web pages. However, current large-scale engines search the images only on the basis of their annotations, while the content-based similarity systems do not seem to be ready for such scales. In this paper, we open the way to Web-scale image similarity search. We present a flexible system based on the metric space model and on the peer-to-peer paradigm. It uses M-Chord and M-Tree structures as its fundamental components and measures the image similarity by a combination of five MPEG-7 features. The system has been implemented including a graphical interface for online demonstrations and it currently indexes 10 million images crawled from the Web. We propose a novel strategy for approximate evaluation of similarity queries and we test its performance by a series of experiments. The results show that the system provides high-quality answers with response times around 0.5 second.