About
71 Publications
9,763 Reads
574 Citations
Publications (71)
One of the characteristics of big data is its internal complexity and also variety manifested in many types of datasets that are to be managed, searched, or analyzed. In their natural forms, some of the data entities are unstructured, such as texts or multimedia objects, while some are structured but too complex. In this paper, we have investigated...
In data science and content-based retrieval, we find many domain-specific techniques that employ a data processing pipeline with two fundamental steps. First, data entities are represented by some visualizations, while in the second step, the visualizations are used with a machine learning model to extract deep features. Deep convolutional neural n...
Today, open data catalogs enable users to search for datasets with full-text queries in metadata records combined with simple faceted filtering. Using this combination, a user is able to discover a significant number of the datasets relevant to a user’s search intent. However, there still remain relevant datasets that are hard to find because of th...
Purpose
Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide b...
Similarity queries play a crucial role in content-based retrieval. The similarity function itself is regarded as the function of relevance between a query object and objects from the database; the most similar objects are understood as the most relevant. However, such an automatic adoption of similarity as relevance leads to limited applicability of...
Decision-making in our everyday lives is surrounded by visually important information. Fashion, housing, dating, food or travel are just a few examples. At the same time, most commonly used tools for information retrieval operate on relational and text-based search models which are well understood by end users, but unable to directly cover visual i...
The two-volume set LNCS 12572 and 12573 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June 2021.
Of the 211 submitted regular papers, 40 papers were selected for oral presentation and 33 for poster presentation; 16 special session papers were a...
The TriGen algorithm is a general approach to transform distance spaces in order to provide both exact and approximate similarity search in metric and non-metric spaces. This paper focuses on the reduction of intrinsic dimensionality using TriGen. Besides the well-known intrinsic dimensionality based on distance distribution, we inspect properties...
Many institutions choose to make their datasets available as Open Data. Open Data datasets are described by publisher-provided metadata and are registered in catalogs such as the European Data Portal. In spite of that, findability still remains a major issue. One of the main reasons is that metadata is captured in different contexts and with differe...
We present SIMILANT, a data analytics tool for modeling similarity in content-based retrieval scenarios. In similarity search, data elements are modeled using black-box descriptors, where a pair-wise similarity function is the only way how to relate data elements to each other. Only these relations provide information about the dataset structure. D...
The metric space model is a popular and extensible model for indexing data for fast similarity search. However, there is often a need for broader concepts of similarity (beyond the metric space model), and these cannot directly benefit from metric indexing. This paper focuses on approximate search in semi-metric spaces using a genetic variant of t...
There is a large quantity of datasets available as Open Data on the Web. However, it is challenging for users to find datasets relevant to their needs, even though the datasets are registered in catalogs such as the European Data Portal. This is because the available metadata such as keywords or textual description is not descriptive enough. At the...
The success of many businesses is based on a thorough knowledge of their clients. There exists a number of supervised as well as unsupervised data mining or other approaches that allow to analyze data about clients, their behavior or environment. In our ongoing project focusing primarily on bank clients, we propose an innovative strategy that will...
When searching for complex data entities, such as products in an e-shop, relational attributes are used as filters within structured queries. However, in many domains the visual appearance of an item is important for a user, while coverage of visual appearance by relational attributes is left to the database designer at design time and is by nature an...
In this paper, we present a prototype web application of a product search engine of a fashion e-shop. Today, e-shop product metadata consist of text description, simple attributes (price, size, color, fabric, etc.) and visual information (product photo). Search engines used in e-shops mostly provide text and attribute/category interface for product...
Collecting various types of data about users/clients in order to improve the services and competitiveness of companies has a long history. However, these approaches are often based on classical statistical methods and an assumption of limited computational power. In this paper we introduce the vision of our applied research project targeting to the...
In this demo paper, we present a prototype web application of a product search engine of a fashion e-shop. Although e-shop products consist of full-text description, relational attributes (e.g., price, type, size, color, etc.) as well as visual information (product photo), traditional search engines in e-shops only provide full-text and relational...
We present a demo of behaviour-based similarity retrieval in network traffic data. The underlying framework is intended to support domain experts searching for network nodes (computers) infected by malicious software, especially in cases when single client-server communication does not have to be sufficient to reliably identify the infection. The f...
Since the boom in new proposals on techniques for efficient querying of XML data is now over and the research world has shifted its attention toward new types of data formats, we believe that it is crucial to review what has been done in the area to help users choose an appropriate strategy and scientists exploit the contributions in new areas of d...
The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra produced by shotgun proteomics. Growing protein sequence databases and noisy query spectra demand database indexing techniques and better similarity measures for the compari...
In this paper, we present detection of malware in HTTPS traffic using k-NN classification. We focus on the metric space approach for approximate k-NN searches over a dataset of sparse high-dimensional descriptors of network traffic. We show that the classification based on approximate k-NN search using a metric index exhibits a false positive rate reduced by...
This paper presents a tool for interactive filtering and browsing of up to hundreds of hours of video content. In particular, we address the known-item search, i.e., searching for a short video clip known visually or by textual description. Video content is filtered with simple user-defined sketches of the searched scenes consisting of its distinct...
Traditional content-based retrieval approaches usually use flat querying, where the whole multimedia database is searched for the result of a similarity query with a user-specified query object. However, there are retrieval scenarios (e.g., multimedia exploration) where users may not have clear search intents in mind; they just want to i...
Understanding the saliency of keyframes in short casual/home-made videos containing redundant information is an important step towards the design of successful keyframe selection and summarization techniques for such videos. Therefore, we present an extensive user study focusing on saliency of keyframes in such short redundant videos. In our study,...
Similarity search and content-based retrieval have become widely used in multimedia database systems that often manage huge data collections. Unfortunately, many effective content-based similarity models cannot be fully utilized for larger datasets, as they are computationally demanding and require massive parallel processing for both feature extra...
The success of our Signature-Based Video Browser presented last year at Video Browser Showdown 2014 (now renamed to Video Search Showcase) was mainly based on effective filtering using position-color feature signatures, while browsing in the results comprising matched keyframes was based just on a simple sequential search approach. Since the result...
Over the last decades, various similarity models suitable for specific similarity search tasks have emerged. In this paper, we present a web-based portal that combines two popular similarity models (based on feature signatures and SURF descriptors) in order to improve the recall of multimedia exploration. Compared to a single-model approach, we...
In this paper, we present an effective yet efficient approach for known-item search in video data. The approach employs feature signatures based on color distribution to represent video key-frames. At the same time, the feature signatures enable users to intuitively draw simple colored sketches of the desired scene. We describe in detail the video...
Most of the current metric indexes focus on indexing the collection of reference. In this work we study the problem of indexing the query set by exploiting some property that query objects may have. Thereafter, we present the Snake Table, which is an index structure designed for supporting streams of k-NN searches within a content-based similarity...
With the huge expansion of smart devices and mobile applications, the ordinary users are consistently changing the conventional similarity search model. The users want to explore the multimedia data, so the typical query-by-example principle and the well-known keyword searching have become just a part of more complex retrieval processes. The emergi...
After two decades of research, the techniques for efficient similarity search in metric spaces have combined virtually all the available tricks resulting in many structural index designs. As the representative state-of-the-art metric access methods (also called metric indexes) that vary in the usage of filtering rules and in structural designs, we...
In this demo paper, we present a video retrieval and browsing tool inspired by the natural human ability to memorize visual stimuli of color regions in video frames. Our tool utilizes feature signatures that can be used to represent both significant color regions in the key-frames and simple query sketches. As recently shown at the video browser sh...
The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra produced by shotgun proteomics. Growing protein sequence databases and noisy query spectra demand database indexing techniques and better similarity measures for the compari...
Metric indexing is the state of the art in general distance-based retrieval. Relying on the triangular inequality, metric indexes achieve significant online speed-up beyond a linear scan. Recently, the idea of Ptolemaic indexing was introduced, which substitutes Ptolemy's inequality for the triangular one, potentially yielding higher efficiency for...
The increasing amount of available unstructured content has introduced a new concept of searching for information: content-based retrieval. The underlying principle is that objects are compared based on their content, which is far more complex than simple text or metadata based searching. Many indexing techniques arose to provide an efficient and e...
The popularity of similarity search expanded with the increased interest in multimedia databases, bioinformatics, or social networks, and with the growing number of users trying to find information in huge collections of unstructured data. During the exploration, the users handle database objects in different ways based on the utilized similarity m...
In this demo paper, we focus on the dynamic multimedia exploration techniques which are an intuitive, effective and entertaining way to present a pre-selected subset of a multimedia database to the users. More specifically, we present an exploration schema employing a similarity model based on SIFT descriptors that can be used to explore image data...
Recent popular applications like online video analysis or image exploration techniques utilizing content-based retrieval create a serious demand for fast and scalable feature extraction implementations. One of the promising content-based retrieval models is based on the feature signatures and the signature quadratic form distance. Although the mode...
In this paper, we present the vision of the usage of an object-based video data storage format for similarity search. The efficient (fast) and effective (accurate) search in video streams is an ongoing and still unsolved problem. Using an object-based format of multimedia data, all the information that is needed to answer queries is already availab...
Similarity search in protein structure databases is an important task of computational biology. To reduce the time required to search for similar structures, indexing techniques are often introduced. However, as the indexing phase is computationally very expensive, it becomes useful only when a large number of searches are expected (so that t...
SimTandem is a tool for fast identification of protein and peptide sequences from tandem mass spectra. The identification is based on similarity search of spectra captured by a tandem mass spectrometer in databases of theoretical mass spectra generated from databases of known protein sequences. Since the number of protein sequences in the databases...
With the emerging applications dealing with complex multimedia retrieval, such as the multimedia exploration, appropriate indexing structures need to be designed. A formalism for compact metric region description can significantly simplify the design of algorithms for such indexes, thus more complex and efficient metric indexes can be developed. In...
Similarity search is becoming popular in even more disciplines, such as multimedia databases, bioinformatics, social networks, to name a few. The existing indexing techniques often assume the metric space model that could be too restrictive from the domain point of view. Hence, many modern applications that involve complex similarities do not use a...
We present the Snake Table, an index structure designed for supporting streams of k-NN searches within a content-based similarity search framework. The index is created and updated in the online phase while resolving the queries, thus it does not need a preprocessing step. This index is intended to be used when the stream of query objects fits a sn...
We present the Smart Image Retrieval meta-search engine that allows content-based exploration of the results obtained from various sources (mostly based on keyword query). The online feature extraction architecture and exploration models utilizing single-/multi-query approaches are the two key features of our demo application that shows very promis...
The dynamic time warping (DTW) distance has been used as a popular measure to compare similarities of numeric time series because it provides robust matching that recognizes warps in time, different sampling rate, etc. Although DTW computation can be optimized by dynamic programming, it is still expensive, so there have been many attempts proposed...
We present an image meta-search engine that allows content-based exploration of the results obtained from various sources (mostly based on keyword query). The online feature extraction and the particle physics model are the two key features of our demo application that shows very promising results.
An important research issue in multimedia databases is the retrieval of similar objects. For most applications in multimedia databases, an exact search is not meaningful. Thus, much effort has been devoted to develop efficient and effective similarity search techniques. A recent approach that has been shown to improve the effectiveness of similarit...
In biological applications, the tandem mass spectrometry is a widely used method for determining protein and peptide sequences from an “in vitro” sample. The sequences are not determined directly, but they must be interpreted from the mass spectra, which is the output of the mass spectrometer. This work is focused on a similarity-search approach to...
Similarity search in protein databases is one of the most essential issues in computational proteomics. With the growing number of experimentally resolved protein structures, the focus has shifted from sequences to structures. The area of structure similarity poses a big challenge, since not even a standard definition of optimal structure similarity exists...
The quadratic form distance (QFD) has been utilized as an effective similarity function in multimedia retrieval, in particular, when a histogram representation of objects is used. Unlike the widely used Euclidean distance, the QFD allows to arbitrarily correlate the histogram bins (dimensions), allowing thus to better model the similarity between h...
Metric access methods (MAMs) serve as a tool for speeding similarity queries. However, all MAMs developed so far are index-based; they need to build an index on a given database. The indexing itself is either static (the whole database is indexed at once) or dynamic (insertions/deletions are supported), but there is always a preprocessing step need...
Since its introduction in 1997, the M-tree became a respected metric access method (MAM), while remaining, together with its descendants, still the only database-friendly MAM, that is, a dynamic structure persistent in paged index. Although there have been many other MAMs developed over the last decade, most of them require either static or expensi...
So far, an efficient similarity search in multimedia databases has been carried out by metric access methods (MAMs), where the utilized similarity measure had to satisfy the metric properties (reflexivity, non-negativity, symmetry, triangle inequality). Recently, the introduction of TriGen algorithm (turning any nonmetric into metric) enabled MAMs...
The M-tree and its variants have been proved to provide an efficient similarity search in database environments. In order to further improve their performance, in this paper we propose an extension of the M-tree family, which makes use of nearest-neighbor (NN) graphs. Each tree node maintains its own NN-graph, a structure that stores for each node...
In multimedia databases, the spatial index structures based on trees (like R-tree, M-tree) have been proved to be efficient and scalable for low-dimensional data retrieval. However, if the data dimensionality is too high, the hierarchy of nested regions (represented by the tree nodes) becomes spatially indistinct. Hence, the query processing deteri...
The retrieval of objects from a multimedia database employs a measure which defines a similarity score for every pair of objects. The measure should effectively follow the nature of similarity, hence, it should not be limited by the triangular inequality, regarded as a restriction in similarity modeling. On the other hand, the retrieval should be a...
In multi-dimensional databases, the essential tool for accessing data is the range query (or window query). In this paper we introduce a new algorithm for processing range queries in the universal B-tree (UB-tree), which is an index structure for searching in multi-dimensional databases. The new range query algorithm (called the DRU algorithm) works effici...
We introduce a method of searching the k nearest neighbours (k-NN) using the PM-tree. The PM-tree is a metric access method for similarity search in large multimedia databases. As an extension of the M-tree, the structure of the PM-tree exploits local dynamic pivots (as the M-tree does) as well as global static pivots (used by LAESA-like methods). While in M-...
Text collections represented in LSI model are hard to search efficiently (i.e. quickly), since there exists no indexing method for the LSI matrices. The inverted file, often used in both boolean and classic vector model, cannot be effectively utilized, because query vectors in LSI model are dense. A possible way for efficient search in LSI matrices...
In the area of Text Retrieval, processing a query in the vector model has been verified to be qualitatively more effective than searching in the boolean model. However, in case of the classic vector model the current methods of processing many-term queries are inefficient, in case of LSI model there does not exist an efficient method for processing...
Multi-dimensional data structures are applied in many real index applications, e.g., data mining, indexing multimedia data, indexing non-structured text documents and so on. Many index structures and algorithms have been proposed. There are two major approaches to multi-dimensional indexing. These are data structures for indexing metric and vector...
The area of information retrieval deals with problems of storage and retrieval within a huge collection of text documents. In IR models, the semantics of a document is usually characterized using a set of terms. A common need to various IR models is an efficient term retrieval provided via a term index. Existing approaches of term indexing, e.g., t...
Using the terminology usual in databases, it is possible to view XML as a language for data modeling. To retrieve XML data from XML databases, several query languages have been proposed. The common feature of such languages is the use of regular path expressions. They enable the user to navigate through arbitrary long paths in XML data. If we consi...
Abstract. Indexing and querying of multimedia data is currently among the hot topics in the area of document information systems. In this contribution, we present two methods for indexing multimedia documents that are based on a geometric representation of documents. The first method treats documents as points in a multidimensional vector...