David Novak

  • PhD
  • Researcher at Masaryk University

About

49
Publications
9,312
Reads
1,015
Citations
Current institution
Masaryk University
Current position
  • Researcher
Additional affiliations
September 2015 - present
University of Massachusetts Amherst
Position
  • Visiting Fulbright Scholar
January 2006 - present
Masaryk University
Position
  • Research Assistant

Publications

Article
Full-text available
This article addresses the problem of matching the most similar data objects to a given query object. We adopt a generic model of similarity that involves the domain of objects and metric distance functions only. We examine the case of a large dataset in a complex data space, which makes this problem inherently difficult. Many indexing and searchin...
Chapter
Full-text available
Hamming embedding techniques, which produce bit-string sketches, have recently been applied successfully to speed up similarity search. Sketches are usually compared by the Hamming distance and used to filter out non-relevant objects during query evaluation. As several sketching techniques exist and each can produce sketches with differe...
Conference Paper
Full-text available
To make similarity search more efficient, compact bit strings compared by the Hamming distance, so-called sketches, have been proposed as a form of dimensionality reduction. To maximize the data compression and, at the same time, minimize the loss of information, sketches typically have the following two properties: (1) each bit divid...
Conference Paper
Full-text available
In this paper, we survey different state-of-the-art visual processing methods and utilize them in hyperlinking. Visual information, calculated using Features Signatures, SIMILE descriptors and convolutional neural networks (CNN), is utilized as similarity between video frames and used to find similar faces, objects and setting. Visual concepts in f...
Article
Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account s...
Preprint
Full-text available
Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account s...
Conference Paper
Full-text available
Efficient object retrieval based on a generic similarity is one of the fundamental tasks in the area of information retrieval. We propose an enhancement for techniques that use the distance-based model of similarity. This enhancement is based on sketches, compact bit strings compared by the Hamming distance, which represent data objects from the orig...
Article
The rapid growth of unstructured data, commonly denoted as the Big Data challenge, requires new technologies that are capable of dealing with complex data objects such as multimedia. In this work, the authors focus on the content-based retrieval approach, which is able to organize such data by exploiting the similarity of data content. In particula...
Chapter
Many current applications need to organize data with respect to mutual similarity between data objects. A typical general strategy to retrieve objects similar to a given sample is to access and then refine a candidate set of objects. We propose an indexing and search technique that can significantly reduce the candidate set size by combination of s...
Article
Full-text available
Approximate similarity search techniques are becoming more popular because of their high efficiency and good scalability. A key point of such techniques is the identification of a small candidate set, which can be done using sketches. A sketch is a compact representation of an object that approximates the object's position in a space. It consists of a bit string. Each bit val...
Conference Paper
Full-text available
To be presented at SISAP '15. We present an efficiency evaluation of similarity search techniques applied to visual features from deep neural networks. Our test collection consists of 20 million 4096-dimensional descriptors (320 GB of data). We test approximate k-NN search using several techniques, specifically the FLANN library (a popular in-memory...
Article
Full-text available
We propose a system architecture for large-scale similarity search in various types of digital data. The architecture combines contemporary highly-scalable distributed data stores with recent efficient similarity indexes and also with other types of search indexes. The system enables various types of data access by distance-based similarity queri...
Conference Paper
Full-text available
Many current applications need to organize data with respect to mutual similarity between data objects. Generic similarity retrieval in large data collections is a tough task that has been drawing researchers' attention for two decades. A typical general strategy to retrieve the most similar objects to a given example is to access and then refine a...
Conference Paper
Full-text available
We propose a generic distributed system architecture for large-scale similarity search in various types of digital data. The system combines current highly-scalable distributed data stores with recent efficient similarity indexes and also with other types of search indexes. The system is designed to provide several types of queries – distance-based...
Article
The general trend in data management is to outsource data to third-party systems that would provide data retrieval as a service. This approach naturally brings privacy concerns about the (potentially sensitive) data. Recently, quite extensive research has been done on privacy-preserving outsourcing of traditional exact-match and keyword search. Howev...
Article
Full-text available
This work targets the problem of search efficiency vs. answer quality of approximate metric-based similarity search. We especially focus on techniques based on recursive Voronoi-like partitioning or, from another perspective, on pivot permutations. These techniques use sets of reference objects (anchors/pivots) to partition the metric space into ce...
Conference Paper
Full-text available
We propose a similarity index that ensures data privacy and thus is suitable for search systems outsourced in a cloud. The proposed solution can exploit existing efficient metric indexes based on a fixed set of reference points. The method has been fully implemented as a security extension of an existing established approach called M-Index. This En...
Conference Paper
Full-text available
The success of content-based retrieval systems stands or falls with the quality of the utilized similarity model. In the case of having no additional keywords or annotations provided with the multimedia data, the hard task is to guarantee the highest possible retrieval precision using only content-based retrieval techniques. In this paper we push t...
Conference Paper
Full-text available
Subsequence matching has appeared to be an ideal approach for solving many problems related to the fields of data mining and similarity retrieval. It has been shown that almost any data class (audio, image, biometrics, signals) is or can be represented by some kind of time series or string of symbols, which can be seen as an input for various subse...
Article
Full-text available
We give an overview of current problems of audio retrieval and time-series subsequence matching. We discuss the usage of subsequence matching approaches in audio data processing, especially in the area of automatic speech recognition (ASR), and we aim at improving the performance of the retrieval process. To overcome the problems known from the time-series area like the...
Article
With the increasing number of applications that base searching on similarity rather than on exact matching, novel index structures are needed to speed up the execution of similarity queries. An important stream of research in this direction uses the metric space as a model of similarity. We explain the principles and survey the most important representa...
Conference Paper
Full-text available
The recent techniques for approximate similarity search focus on optimizing answer precision/recall and they typically improve the average of these measures over a set of sample queries. However, according to our observation, the recall for particular indexes and queries can fluctuate considerably. In order to stabilize the recall, we propose a que...
Article
Metric space is a universal and versatile model of similarity that can be applied in various areas of information retrieval. However, a general, efficient, and scalable solution for metric data management remains an open research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index) that employs practic...
Conference Paper
This paper briefly describes an audio similarity retrieval engine included in the MUFIN project. The engine uses low-level audio descriptors defined by the MPEG-7 standard to calculate a similarity measure between audio recordings. The core of the engine is implemented in Java with the use of the MESSIF framework that provides support for metric-ba...
Conference Paper
Full-text available
The concept of Locality-sensitive Hashing (LSH) has been successfully used for searching in high-dimensional data, and a number of locality-preserving hash functions have been introduced. To extend the applicability of the LSH approach to a general metric space, we focus on the recently presented Metric Index (M-Index) and redefine its hashin...
Article
Full-text available
As the number of digital images is growing fast and Content-based Image Retrieval (CBIR) is gaining in popularity, CBIR systems should leap towards Web-scale datasets. In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images. The first big challenge we have b...
Article
Full-text available
It has become customary that practically any information can be in digital form. Searching through the future Internet will be complicated because of: (1) the diversity of ways in which specific data can be sorted, compared, related, or classified, and (2) the exponentially increasing amount of digital data. Accordingly, a successful search engine shou...
Conference Paper
Full-text available
It has become customary that practically any information can be in a digital form. However, searching for relevant information can be complicated because of: (1) the diversity of ways in which specific data can be sorted, compared, related, or classified, and (2) the exponentially increasing amount of digital data. Accordingly, a successful search...
Conference Paper
Full-text available
The Content-based Photo Image Retrieval (CoPhIR) dataset is the largest available database of digital images with corresponding visual descriptors. It contains five MPEG-7 global descriptors extracted from more than 106 million images from the Flickr photo-sharing system. In this paper, we analyze this dataset focusing on 1) efficiency of similarity-ba...
Conference Paper
Full-text available
Metric space, as a universal and versatile model of similarity, can be applied in various areas of non-text information retrieval. However, a general, efficient and scalable solution for metric data management remains an open research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index), that employs pra...
Conference Paper
Full-text available
We introduce a generic engine for large-scale similarity search and demonstrate it on a set of 100 million Flickr images.
Article
Due to the increasing complexity of current digital data, similarity search has become a fundamental computational task in many applications. Unfortunately, its costs are still high and grow linearly on single-server structures, which prevents their efficient application to large data volumes. In this paper, we briefly describe four recent scal...
Conference Paper
Full-text available
Digital images have become a commodity which is searched on the Web as ordinarily as web pages. However, current large-scale engines search the images only on the basis of their annotations, while the content-based similarity systems do not seem to be ready for such scales. In this paper, we open the way to Web-scale image similarity search. We pre...
Conference Paper
Full-text available
Due to the exponential growth of digital data and its complexity, we need a technique which allows us to search such collections efficiently. A suitable solution seems to be based on the peer-to-peer (P2P) network paradigm and the metric-space model of similarity. During the building phase of the distributed structure, the peers often split as new...
Conference Paper
Full-text available
In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images, to show the possibility to scale Content-based Image Retrieval (CBIR) systems towards the Web size. First, we had to tackle the non-trivial process of image crawling and descriptive feature extraction,...
Article
Full-text available
The concept of peer-to-peer structures has recently been applied to the problem of large-scale similarity search. This resulted in systems where the computational load of the peers is of high importance. Since no current load-balancing technique is designed for structures of this kind, we propose LOBS, a general system for load-balancing in...
Conference Paper
Full-text available
Similarity search has become a fundamental computational task in many applications. One of the mathematical models of similarity, the metric space, has drawn the attention of many researchers, resulting in several sophisticated metric-indexing techniques. An important part of research in this area is typically a prototype implementation and...
Conference Paper
Full-text available
Due to the increasing complexity of current digital data, similarity search has become a fundamental computational task in many applications. Unfortunately, its costs are still high, and the linear scalability of single-server implementations prevents efficient searching in large data volumes. In this paper, we briefly describe four recent...
Conference Paper
Full-text available
The need for retrieval based not on attribute values but on the data content itself has recently led to the rise of metric-based similarity search. The computational complexity of such retrieval and the large volumes of processed data call for distributed processing, which makes scalability achievable. In this paper, we propose M-Chord, a distr...
Article
Full-text available
One of the issues considered in all Peer-to-Peer Data Networks, or Structured Overlays, is keeping a fair load distribution among the nodes participating in the network. Whilst this issue is well defined and basically solved for systems with relatively simple search paradigms, none of the existing solutions is appropriate or applicable for simila...
Article
Full-text available
The popularity of Peer-to-Peer Data Networks has increased in recent years since they fairly combine the functionality of a distributed storage and a parallel searching engine. Moreover, these systems are open, self-organizing and dynamic by nature. One of the aspects that all P2P systems deal with is load-balancing of the nodes in the system. Thi...
