Source publication
During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography, and molecular biology. An important research issue in the field of multimedia databases is the content-based retrieval of similar multimedia objects such as images, text, and videos. However, in contrast to sear...
Similar publications
Abstract— In database research, there has been scattered work on the automated extraction of similarity/ranking functions from a database. Relational database systems support the efficient execution of complex queries. In query recommendation, users employ a query interface to issue a series of SQL queries in order to analyze the...
In this paper, we address the problem of extending a relational database system to facilitate efficient real-time application of dynamic probabilistic models to streaming data. We use the recently proposed abstraction of model-based views for this purpose, by allowing users to declaratively specify the model to be applied, and by presenting the out...
Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bound disk-based systems. Some...
Supporting efficient access to XML data using XPath [3] continues to be an important research problem [6, 12]. XPath queries are used to specify node-labeled trees which match portions of the hierarchical XML data. In XPath qu...
In this work, we present our implementation for managing moving objects on top of MySQL, a popular relational database system: SpADE (a spatio-temporal autonomic database engine for managing moving objects). In our SpADE system, non-static entities like vehicles and pedestrians are abstracted as moving objects. They obtain positioning informati...
Citations
... Due to these advantages, the kd-tree is the data structure of choice in many applications. Indeed, after its invention by Bentley in 1975 [11], the kd-tree has been widely used and cited over ten thousand times across multiple areas such as databases [23,40,44,59], data science [30,63,80,87], machine learning [28,56,57,74], clustering algorithms [55,58,61,72,76], and computational geometry [21,43,60,78]. ...
The kd-tree is one of the most widely used data structures to manage multi-dimensional data. Due to the ever-growing data volume, it is imperative to consider parallelism in kd-trees. However, we observed challenges in existing parallel kd-tree implementations, for both construction and updates. The goal of this paper is to develop efficient in-memory kd-trees by supporting high parallelism and cache-efficiency. We propose the Pkd-tree (Parallel kd-tree), a parallel kd-tree that is efficient both in theory and in practice. The Pkd-tree supports parallel tree construction, batch update (insertion and deletion), and various queries including k-nearest neighbor search, range query, and range count. We proved that our algorithms have strong theoretical bounds in work (sequential time complexity), span (parallelism), and cache complexity. Our key techniques include 1) an efficient construction algorithm that optimizes work, span, and cache complexity simultaneously, and 2) reconstruction-based update algorithms that guarantee the tree to be weight-balanced. With the new algorithmic insights and careful engineering effort, we achieved a highly optimized implementation of the Pkd-tree. We tested the Pkd-tree with various synthetic and real-world datasets, including both uniform and highly skewed data. We compared the Pkd-tree with state-of-the-art parallel kd-tree implementations. In all tests, the Pkd-tree is consistently much faster in construction and updates than all baselines, with better or competitive query performance. We released our code.
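As background, the median-split construction that kd-trees are built on can be sketched in a few lines. This is a minimal sequential sketch, not the parallel, cache-efficient Pkd-tree construction from the paper; the class and function names are mine.

```python
# Minimal kd-tree construction by median splitting along cycling axes.
# Sequential and O(n log^2 n) due to repeated sorting; the Pkd-tree paper
# parallelizes construction and adds weight-balancing guarantees.

class KdNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point  # splitting point stored at this node
        self.axis = axis    # dimension compared at this node
        self.left = left    # subtree with smaller coordinates on axis
        self.right = right  # subtree with larger-or-equal coordinates

def build_kdtree(points, depth=0):
    """Recursively split the point set at the median of the current axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KdNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

root = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(root.point, root.axis)  # (7, 2) 0 -> median on the x axis
```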
... Content-Based Multimedia Retrieval (CBMR) tools and applications are already very popular and are still gaining traction in several domains. Some of these include automatic tagging of photos in social networks and image search engines [1,2]. In these systems, multimedia objects (e.g. ...
Content-Based Multimedia Retrieval (CBMR) has become very popular in several applications, driven by the growing routine use of multimedia data. Since the datasets used in real-world applications are very large and the descriptors' dimensionality is high, querying is an expensive, albeit important, functionality. Further, exact search is prohibitive in most cases, motivating the use of Approximate Nearest Neighbour Search (ANNS) algorithms, trading accuracy for performance. These have mainly been developed targeting sequential execution on a single node. However, the large and increasing datasets used and the high query loads submitted to those systems typically surpass the memory and computing resources available in a single node. This motivated the development of parallel distributed-memory ANNS solutions to provide the computing capabilities required by those applications. A common problem that must be handled when using distributed-memory systems is data partitioning and its impact on load imbalance. Several data partitioning approaches have already been proposed, including elaborate spatial-aware strategies. However, little effort has been put into carefully analyzing the performance of those strategies at scale. Here, we evaluated the commonly used data partitioning strategies in ANNS and identified their limitations to propose a novel class of partitioning algorithms that can minimize load imbalance while improving data locality to attain high performance on the distributed-memory search. Experimentally, we found that our proposed algorithms (SABBS and SABBSR) improved search performance by up to 1.64× compared to the best previous solution. In a distributed-memory weak-scaling evaluation, with up to 12 billion 128-dimensional descriptors and 60 compute nodes, the gains were maintained as the system scaled with our novel approaches. These results demonstrate the efficiency of our new algorithms for billion-scale ANNS and the importance of considering not only data locality but also data and load imbalance in the data partitioning.
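As a hedged illustration of the load-imbalance issue discussed above, the sketch below assigns clusters of descriptors to compute nodes greedily, largest first onto the least-loaded node. It is a generic balancing heuristic under my own naming, not the paper's SABBS/SABBSR algorithms.

```python
import heapq

def assign_clusters_to_nodes(cluster_sizes, n_nodes):
    """Greedy balanced assignment: place each cluster (largest first)
    on the node with the smallest current load."""
    heap = [(0, n) for n in range(n_nodes)]  # (load, node_id) min-heap
    heapq.heapify(heap)
    assignment = {}
    for cid, size in sorted(enumerate(cluster_sizes), key=lambda c: -c[1]):
        load, node = heapq.heappop(heap)
        assignment[cid] = node
        heapq.heappush(heap, (load + size, node))
    return assignment

# Four clusters spread over two nodes -> loads 50 and 50.
print(assign_clusters_to_nodes([40, 10, 30, 20], n_nodes=2))
```

Keeping whole clusters on one node preserves locality, while ordering by size keeps the final per-node loads close to even.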
... These algorithms are frequently conceived to work with data having spatial components or represented in a vector space with few dimensions. However, they become unusable in high-dimensional spaces, where many features characterise each element due to the Curse of Dimensionality [1]. As a result, the distance between elements tends to become increasingly uniform, making it challenging to differentiate between them based solely on their distances or when the number of dimensions significantly exceeds the quantity of data points. ...
... Let $\ell_p$, $p \ge 1$, be the vector space of real sequences $x = (x_n)$ such that $\sum_{n=1}^{\infty} |x_n|^p < \infty$, with the norm given by $\|x\|_p = \left( \sum_{n=1}^{\infty} |x_n|^p \right)^{1/p}$. This norm induces the $\ell_p$ or Minkowski metric, and consequently $\ell_p$ and $\ell_p^d = (\mathbb{R}^d, d_p)$ are Banach spaces. ...
This paper presents GMASK, a general algorithm for distributed approximate similarity search that accepts any arbitrary distance function. GMASK requires a clustering algorithm that induces Voronoi regions in a dataset and returns a representative element for each region. Then, it creates a multilevel indexing structure suitable for large datasets with high dimensionality and sparsity, usually stored in distributed systems. Many similarity search algorithms rely on k-means, typically associated with the Euclidean distance, which is inappropriate for specific problems. Instead, in this work we implement GMASK using k-medoids to make it compatible with any distance and a wider range of problems. Experimental results verify the applicability of this method with real datasets, improving the performance of alternative algorithms for approximate similarity search. In addition, results confirm existing intuitions regarding the advantages of using certain instances of the Minkowski distance in high-dimensional datasets.
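Since both snippets above revolve around Minkowski ($\ell_p$) distances, a minimal NumPy sketch of the metric from the definition quoted earlier may help; the function name is mine.

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """l_p (Minkowski) distance ||x - y||_p, defined for p >= 1."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

a, b = [0.0, 0.0, 0.0], [1.0, 2.0, 2.0]
print(minkowski(a, b, p=1))  # 5.0 (Manhattan)
print(minkowski(a, b, p=2))  # 3.0 (Euclidean)
```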
... In addition, the dimensionality curse has an impact on indexing structures that speed up nearest neighbor searches. As dimensionality rises, traditional indexing methods like k-d trees and R-trees experience exponential growth in both node count and tree depth [34]. As a result, there is a phenomenon called the "dimensionality curse," wherein these indexing structures perform no better than a linear scan of the whole dataset [35]. ...
Effective similarity search and retrieval are now possible thanks to vector databases, which have become a potent tool for organizing and searching high-dimensional data. This article provides a thorough examination of vector databases, their underlying theories, and their applications across a range of industries. In handling complex data types, we address the significance of vector representation and emphasize the benefits of vector databases over traditional databases [1]. The article explores the process of creating a vector database, highlighting the critical function of indexing strategies such as IVF (Inverted File Indexing) and HNSW (Hierarchical Navigable Small World) in guaranteeing the effectiveness and precision of searches [2]. In addition, we discuss the problems caused by the curse of dimensionality and offer solutions to lessen its effects on nearest neighbor searches [3].
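To make the IVF idea mentioned above concrete: the dataset is clustered by a coarse quantizer, each vector is filed under its nearest centroid, and a query probes only a few nearest lists. The following toy NumPy sketch (my own function names and a naive k-means, not any production library's API) illustrates this:

```python
import numpy as np

def build_ivf(data, n_lists=16, n_iter=10, seed=0):
    """Toy IVF index: naive k-means coarse quantizer plus inverted lists."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), n_lists, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(((data[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(n_lists):
            members = data[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(((data[:, None] - centroids) ** 2).sum(-1), axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_lists)}
    return centroids, lists

def ivf_search(q, data, centroids, lists, n_probe=4, k=5):
    """Probe the n_probe nearest lists, then rank those candidates exactly."""
    order = np.argsort(((centroids - q) ** 2).sum(-1))[:n_probe]
    cand = np.concatenate([lists[c] for c in order])
    dists = ((data[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
base = rng.normal(size=(1000, 8)).astype(np.float32)
centroids, lists = build_ivf(base, n_lists=16)
print(ivf_search(base[0], base, centroids, lists, n_probe=4, k=3))
```

Probing fewer lists trades recall for speed, which is exactly the accuracy/efficiency dial such indexes expose.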
... Our contributions are delineated as follows: (1) We furnish a comprehensive framework for approximate KNN query processing across diverse data sources within web and peer-to-peer (p2p) environments, encompassing both preprocessing and query stages. (2) During the preprocessing stage, we offer a detailed exposition on the construction of a UBR-Tree and introduce an algorithm for constructing a Centroid Base. Concurrently, we enhance and implement an index structure proposed by [34] to expedite the retrieval of relevant data sources and KNN outcomes for arbitrary queries. ...
... The issue of nearest neighbor searches, commonly referred to as the K-nearest neighbor (KNN) query, has been a focal point of inquiry within the database community for numerous decades [1]. Two surveys on this topic can be found in [2,13]. A plethora of methodologies for processing (approximate) KNN queries across datasets rely on a diverse array of tree structures or hash-based mechanisms. ...
A KNN query aims to identify the K closest neighbors or tuples from a dataset based on a specified distance metric. This paper delves into the realm of approximate KNN query processing, focusing on the meticulous selection of multiple data sources characterized by diverse dimensions. We provide a framework for processing approximate KNN queries over multiple data sources, proposing algorithms to construct a UBR-Tree and a Centroid Base for selecting related data sources and retrieving KNN tuples. We enhance and apply an index structure to quickly retrieve related data sources and KNN tuples for a query. For a KNN query Q, the query processing consists of the following steps: (1) Estimate a search distance using the index structure. (2) Use the search distance to select relevant data sources from the Centroid Base, sorting them according to their representative tuple. (3) Employ a heap structure to merge the local KNN tuples obtained from the related data sources to form global KNN tuples for Q. Additionally, update the index structure when processing the query. Extensive experiments over low-dimensional and high-dimensional datasets demonstrate the performances of our proposed approaches.
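Step (3) of the query processing above merges already-sorted local KNN lists into a global answer, a standard multi-way heap merge. A minimal sketch (the function name is mine):

```python
import heapq
from itertools import islice

def merge_local_knn(local_results, k):
    """Heap-merge per-source KNN lists into the global top-k.

    local_results: one list per data source of (distance, tuple_id)
    pairs, each already sorted ascending, as local KNN answers are.
    """
    return list(islice(heapq.merge(*local_results), k))

src_a = [(0.2, "a1"), (0.9, "a2"), (1.4, "a3")]
src_b = [(0.5, "b1"), (0.7, "b2")]
print(merge_local_knn([src_a, src_b], k=3))
# [(0.2, 'a1'), (0.5, 'b1'), (0.7, 'b2')]
```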
... This fundamental problem can be described as follows: given a query $q \in \mathbb{R}^D$ and a database $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^D$, the goal of NN search is to find a nearest neighbor $x^* \in X$ with the minimum distance to $q$, i.e., $x^* = \arg\min_{x \in X} d(x, q)$ (Eq. (1)). However, due to the explosive growth of database size and the inevitable curse of dimensionality [4], NN search cannot meet the requirements of efficiency and cost in practice. As a result, researchers show more interest in approximate nearest neighbor (ANN) search. ...
Vector quantization (VQ) is a widely used Approximate Nearest Neighbor (ANN) search method. By constructing multiple codebooks, VQ can create more codeword vectors with lower memory consumption, enabling the indexing of large-scale databases. In recent years, many VQ-based methods have been proposed, but the codeword vectors constructed in these methods are often underutilized due to lack of data support, and the unimodal data distribution within each cell is not considered. To address these issues, this paper introduces a new quantization method called Cyclic Hierarchical Product Quantization (CHPQ). This method first constructs a hierarchical quantization structure in each subspace, with each hierarchical structure composed of several sub-quantizers. Then, the codebook is locally optimized under the sub-quantizers according to the data distribution of each cell, significantly improving quantization performance compared to other methods and greatly enhancing the accuracy of ANN search. Additionally, this paper proposes a new hierarchical quantization structure named the cyclic hierarchical structure, which can generate more diverse codeword vectors in different cells compared to the traditional hierarchical quantization structure. Extensive experiments on two publicly available datasets demonstrate that CHPQ outperforms existing methods in terms of retrieval accuracy while maintaining comparable computational efficiency.
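For background, CHPQ builds on product quantization: the vector space is split into subspaces, each with its own codebook, and query-to-database distances are computed asymmetrically (exact query subvectors against quantized codewords). The sketch below shows only that generic encode/ADC mechanism under my own naming; codebook training is omitted and this is not the paper's CHPQ.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode x as one nearest-codeword index per subspace.

    codebooks: array of shape (m, ks, d_sub); x has dimension m * d_sub.
    """
    m, ks, d_sub = codebooks.shape
    subs = x.reshape(m, d_sub)
    return np.array([np.argmin(((codebooks[j] - subs[j]) ** 2).sum(-1))
                     for j in range(m)])

def adc_distance(q, code, codebooks):
    """Asymmetric squared distance: exact query subvectors vs. the
    database vector's quantized codewords, summed over subspaces."""
    m, ks, d_sub = codebooks.shape
    qs = q.reshape(m, d_sub)
    return float(sum(((qs[j] - codebooks[j, code[j]]) ** 2).sum()
                     for j in range(m)))

rng = np.random.default_rng(0)
books = rng.normal(size=(4, 256, 2))  # m=4 subspaces, 256 codewords each
x, q = rng.normal(size=8), rng.normal(size=8)
print(adc_distance(q, pq_encode(x, books), books))
```

With ks ≤ 256, each database vector is stored in m bytes, which is the memory saving such multi-codebook schemes rely on.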
... Content-based multimedia retrieval (CBMR) is increasingly important due to the growth of its use in several applications [1][2][3][4][5]. Example applications employing CBMR include content-based image search systems, video identification and misuse, and several applications in social networks such as data propagation. ...
... MS-ADAPT is compared against the static configuration with the best average performance among all QLF × ILF levels evaluated. Our code is available online.¹ ...
Similarity search is a key operation in content-based multimedia retrieval (CBMR) applications. Online CBMR applications, which are the focus of this work, perform a large number of search operations on dynamic datasets, which are updated at run-time. Additionally, the rates of search and data insertion (update) operations vary during the execution. Such applications that rely on similarity search are required to fulfill these demands while also offering low response times. Thus, it is common for the computing demands in such applications to exceed the processing power of a single computer, motivating the usage of large-scale compute systems. As such, we propose in this work a distributed memory parallelization of similarity search that addresses these challenges. Our solution employs the efficient Inverted File System with Asymmetric Distance Computation (IVFADC) algorithm as the baseline, which is extended to support dynamic datasets. A dynamic resource management algorithm, called Multi-Stream Adaptation (MS-ADAPT), is proposed. It allows run-time changes on resource assignment with the goal of reducing response times. We evaluate our solution with multiple data partitioning strategies using up to 160 compute nodes and a dataset with 344 billion multimedia descriptors. Our experiments demonstrate superlinear scalability, and MS-ADAPT outperforms the best static approach (oracle) by improving the response times up to 32× on high-load cases.
... Step 2: Compare graphs Measuring the differences and similarities between the plan graph and the functional program can be split into two separate stages: one analyzing the layout of the building and the other analyzing the suitability of a single room to a function. Feature-based similarity search 26 focusses on the latter, where a node of the functional program, that is, an activity, is compared to a node of the plan graph, that is, a room. This similarity may treat two nodes as equivalent even though they have different neighborhood structures. ...
... The search for similarity between a demand for a room and an existing physical room is transformed into the search for the distance between two points in a high-dimensional feature space. 26 The calculation of distances between vectors is computationally straightforward, even when considering large graphs. 26 These two points are represented by feature vectors (FV), containing numerical values extracted from the node attributes, each describing a certain characteristic; for example, [area, presence of windows, number of doors] could result in [16 m², true, 1] for a bedroom. A weight vector is applied to the distance function, by which the influence of each feature on the distance value is defined, for example, [30%, 60%, 10%], because it is important for the architect in this example that the rooms have a window if requested. ...
Building, demolishing, and rebuilding new structures, instead of renovating vacant buildings for adaptive reuse, generates a significant amount of waste. Assessing if and how a building can house a new function can be time-consuming. We developed a method that automates this process using a graph-based strategy in which both the building and the functional program are represented by attributed graphs. Specifically, the new method applies a node-matching strategy based on the properties of the nodes and the topological structure of the graphs, by which each empty room is filled with an activity of the new function.
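The weighted feature-vector comparison quoted in the snippets above is easy to make concrete. Below is a minimal sketch using the snippet's bedroom example, with one common form of weighted Euclidean distance (the function name is mine; a real system would first normalize heterogeneous features such as area versus door counts):

```python
import numpy as np

def weighted_distance(fv_a, fv_b, weights):
    """Weighted Euclidean distance between two room feature vectors."""
    diff = np.asarray(fv_a, dtype=float) - np.asarray(fv_b, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(weights) * diff ** 2)))

# Features: [area in m^2, window present (0/1), number of doors];
# weights as in the snippet: windows matter most to this architect.
demand = [16.0, 1.0, 1.0]   # required bedroom
room = [14.0, 1.0, 2.0]     # candidate physical room
print(weighted_distance(demand, room, [0.30, 0.60, 0.10]))
```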
... It demonstrated superlinear scalability for our Spatial-Aware data partition algorithms and MS-ADAPT outperformed the best static approach (oracle) by reducing the response times up to 32× on high-load cases. Content-based multimedia retrieval (CBMR) is increasingly important due to the growth of its use in several applications [1,7,12,22,32]. Example applications employing CBMR include content-based image search systems, video identification and misuse, and several applications in social networks such as data propagation. A common aspect of the mentioned examples is the large and increasing amount of data employed. ...
Similarity search is a key operation in content-based multimedia retrieval (CBMR) applications. Online CBMR applications, which are the focus of this work, have to search in large and dynamic datasets that are updated during the execution while offering low response times. Additionally, these applications are submitted to workloads that vary at runtime. The computing demands in this scenario exceed the processing power of a single computer, motivating the use of large-scale machines in the domain. Thus, in this work, we propose a distributed memory parallelization of similarity search that addresses the mentioned challenges. Our solution employs the efficient Inverted File System with Asymmetric Distance Computation (IVFADC) algorithm as the baseline, which is extended here to support dynamic datasets. Further, we developed a dynamic resource management algorithm, called Multi-Stream Adaptation (MS-ADAPT), that is executed at run-time to change the computing resource assignment with the goal of minimizing response times. We have evaluated our solution with multiple data partitioning strategies using up to 160 compute nodes and a dataset with 344 billion multimedia descriptors. It demonstrated superlinear scalability for our Spatial-Aware data partition algorithms, and MS-ADAPT outperformed the best static approach (oracle) by reducing the response times up to 32× on high-load cases.
... The cross-media indexing issue belongs to high-dimensional indexing [5] and multi-feature indexing [17] methods. There is a long stream of research addressing high-dimensional indexing problems [5]. Existing techniques can be divided into four main categories. ...
With the rapid growth of multimedia data (e.g., text, images, video, audio, and 3D models) on the web, there are a large number of media objects with different modalities in multimedia documents such as webpages, which exhibit latent semantic correlation. As a new type of multimedia retrieval method, cross-media retrieval is becoming increasingly attractive: by submitting a query of any media type, users can obtain results of various media types carrying the same semantic information. The explosive increase in the number of media objects, however, makes it difficult for the traditional local standalone mode to process them efficiently. Therefore, the powerful parallel processing capability of cloud computing is leveraged to facilitate efficient large-scale cross-media retrieval. In this paper, based on a Multi-Layer Cross-Reference Graph (MLCRG) model, we propose an efficient parallel cross-media retrieval (PCMR) method in which two enabling techniques (i.e., 1) an adaptive cross-media data allocation algorithm and 2) the PCIndex scheme) are employed to effectively speed up retrieval performance. To the best of our knowledge, there is little research on the parallel retrieval processing of large-scale cross-media databases in the mobile cloud network. Extensive experiments demonstrate that our proposed PCIndex method outperforms the three competitors (the PFAR (Mao et al., 22), the MBSR (Retrieval 4(2):153-164, 42), and the SPECH (Knowl Based Syst 251(5):1-13, 40)) in terms of effectiveness and efficiency.