Article

A new version of the Nearest-Neighbour Approximating and Eliminating Search Algorithm (AESA) with linear preprocessing time and memory requirements

Authors:
María Luisa Micó, José Oncina, Enrique Vidal

Abstract

The Approximating and Eliminating Search Algorithm (AESA) can currently be considered one of the most efficient procedures for finding Nearest Neighbours in Metric Spaces where distance computation is expensive. One of the major bottlenecks of the AESA, however, is its quadratic preprocessing time and memory space requirements, which, in practice, can severely limit the applicability of the algorithm to large sets of data. In this paper a new version of the AESA is introduced which only requires linear preprocessing time and memory. The performance of the new version, referred to as ‘Linear AESA’ (LAESA), is studied through a number of simulation experiments in abstract metric spaces. The results show that LAESA achieves a search performance similar to that of the AESA, while definitely overcoming the quadratic costs bottleneck.
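Since the abstract and several of the citing excerpts below describe the same mechanism, a linear table of object-to-pivot distances plus triangle-inequality elimination, here is a minimal, hedged sketch of that idea in Python. All names are ours, and the candidate ordering is simplified relative to the published algorithm.

```python
# Hedged sketch of the LAESA idea: O(n*l) preprocessing (distances from every
# object to l pivots) and triangle-inequality elimination at query time.
# Illustrative only, not the paper's exact procedure.
import numpy as np

def preprocess(objects, pivot_ids, dist):
    """Store d(x, p) for every object x and pivot p -- linear in n for fixed l."""
    return np.array([[dist(x, objects[p]) for p in pivot_ids] for x in objects])

def nn_search(query, objects, pivot_ids, table, dist):
    # Distances from the query to the pivots (l extra computations).
    q2p = np.array([dist(query, objects[p]) for p in pivot_ids])
    # Lower bound from the triangle inequality: max_j |d(q,p_j) - d(x,p_j)| <= d(q,x).
    lb = np.max(np.abs(table - q2p), axis=1)
    best_id, best_d = None, np.inf
    for i in np.argsort(lb):          # visit candidates in increasing bound order
        if lb[i] >= best_d:           # eliminate: the bound already exceeds the best
            break                     # all remaining candidates are pruned too
        d = dist(query, objects[i])
        if d < best_d:
            best_id, best_d = i, d
    return best_id, best_d
```

With l pivots fixed, the table costs O(nl) space and preprocessing time, which is the linear bound the abstract refers to; the elimination loop above remains exact because candidates are visited in increasing lower-bound order.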

... Using measures based on points from the data set to assess boundaries on norms is a concept already employed in, e.g., spatial indexing. Methods like LAESA [7] use so-called pivot/reference/prototype points and the triangle inequality to prune the data set during spatial queries. Tree-based methods like the Balltree [8] use the triangle inequality to exclude entire subtrees, while permutation based indexing [3,14] uses the relative closeness to reference points to partition the data. ...
... In spatial indexing, pivots have been successfully used to bound distances via the triangle inequality [7,8]. We propose to bound distances in terms of a decomposition of the squared Euclidean norm into dot products given by ...
... which is the triangle inequality for cosines introduced in [10]. Triangle-inequality-based bounds have been used in spatial indexing in methods like, e.g., LAESA [7]. For multiple pivots, these approaches take the minimum or maximum of the bounds obtained separately for each pivot. ...
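For reference, the bound these pivot methods rely on follows in one line from the metric triangle inequality; a standard derivation, with notation ours:

```latex
% For a metric d, any query q, object x, and pivot p_j:
%   d(q,x) + d(x,p_j) >= d(q,p_j)   and   d(q,x) + d(q,p_j) >= d(x,p_j),
% hence d(q,x) >= |d(q,p_j) - d(x,p_j)| for every pivot j, and with l pivots
\[
  d(q,x) \;\ge\; \max_{1 \le j \le l} \bigl|\, d(q,p_j) - d(x,p_j) \,\bigr| ,
\]
% which is the maximum-combination of per-pivot bounds mentioned in the
% excerpt above; x is pruned from a range query of radius r whenever the
% right-hand side exceeds r.
```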
Preprint
The merit of projecting data onto linear subspaces is well known from, e.g., dimension reduction. One key aspect of subspace projections, the maximum preservation of variance (principal component analysis), has been thoroughly researched and the effect of random linear projections on measures such as intrinsic dimensionality still is an ongoing effort. In this paper, we investigate the less explored depths of linear projections onto explicit subspaces of varying dimensionality and the expectations of variance that ensue. The result is a new family of bounds for Euclidean distances and inner products. We showcase the quality of these bounds as well as investigate the intimate relation to intrinsic dimensionality estimation.
... The performance of pivot-based methods depends on the quality of the pivots used. Thus, many proposals [1,4,5,9,16,17,20,22,24,27,29,34,36] exist on how to design algorithms for selecting high-quality pivots for use in metric indexes. Pivot-based indexing has attracted recent attention [6,10,11,28,30,33], and in two previous studies [10,11], we survey all metric indexes, including pivot-based metric indexes. ...
... We survey all existing general pivot selection algorithms known to us, and classify them into three categories: (i) P-P distribution based algorithms [1,16,17,27,29] that select pivots according to the distance distribution among pivots; (ii) P-O distribution based algorithms [22,24,34] that select pivots according to the distance distribution between pivots and data objects; and (iii) O-O distribution based algorithms [4,5,9] that select pivots according to the distance distribution between data objects, and aim to maximize the similarity between original metric distances and lower bound distances derived by using pivots. We provide detailed descriptions of the algorithms, and analyze their time complexities. ...
... Balancing Pivot-Position Occurrences (BPP) [1] tends to choose centers as pivots. In contrast, algorithms such as Base-Prototypes Selection (BPS) [27] and Hull of Foci (HF) [17] choose outliers as pivots. ...
Article
Full-text available
Similarity search in metric spaces is used widely in areas such as multimedia retrieval, data mining, data integration, to name but a few. To accelerate metric similarity search, pivot-based indexing is often employed. Pivot-based indexing first computes the distances between data objects and pivots and then exploits filtering techniques that use the triangle inequality on pre-computed distances to prune search space during search. The performance of pivot-based indexing depends on the quality of the pivots used, and many algorithms have been proposed for selecting high-quality pivots. We present a comprehensive empirical study of pivot selection algorithms. Specifically, we classify all existing algorithms into three categories according to the types of distances they use for selecting pivots. We also propose a new pivot selection algorithm that exploits the power law probabilistic distribution. Next, we report on a comprehensive empirical study of the search performance enabled by different pivot selection approaches, using different datasets and indexes, thus contributing new insight into the strengths and weaknesses of existing selection techniques. Finally, we offer advice on how to select appropriate pivot selection algorithms for different settings.
... The total search complexity is given by the sum of the internal complexity (comparisons of q against each pivot) and the external complexity (comparisons of q against each object in the candidate list). Some algorithms, such as [4,18,19,53,56,76], are implementations of this idea; they differ essentially in the extra structure used to reduce the CPU cost of finding the pivots, but not in the number of distance evaluations. There are also some tree-type data structures that use the pivot idea in a more indirect way [10,11,15,52,71,81,82]. ...
... Several works proposing heuristics for pivot selection can be found in the literature. In [53] the pivots are selected so as to maximize the sum of the distances between them. In [80] and [12] the pivots selected are those that are farthest from each other. ...
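A minimal sketch of the greedy selection heuristic these excerpts describe (pivots chosen to be maximally separated); the incremental scoring is our own simplification, and all names are illustrative:

```python
# Hedged sketch of a greedy "farthest-first" pivot selection heuristic of the
# kind cited above: repeatedly add the object with the largest summed distance
# to the pivots chosen so far.
import numpy as np

def select_pivots(data, l, dist, seed=0):
    rng = np.random.default_rng(seed)
    first = int(rng.integers(len(data)))     # arbitrary first pivot
    pivots = [first]
    score = np.array([dist(x, data[first]) for x in data])
    score[first] = -np.inf                   # never re-pick a pivot
    for _ in range(l - 1):
        nxt = int(np.argmax(score))          # most separated remaining object
        pivots.append(nxt)
        score += np.array([dist(x, data[nxt]) for x in data])
        score[nxt] = -np.inf
    return pivots
```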
Book
Full-text available
The work developed in this thesis had as its objective the design, implementation and evaluation of a distributed index for objects in metric spaces, together with its corresponding parallel query-processing strategy for search engines.
... Many variants of BST, including the Monotonous BST (MBST) [92], the Voronoi Tree (VT) [46,47] and the Bottom-Up Tree (BU-Tree) [72], have been proposed to improve the efficiency of BST. The Generalized Hyperplane Tree (GHT) [115] is similar to BST [69]. [A complexity comparison table was flattened here during extraction. It listed metric indexes (BST, MBST, VT, BU-Tree, GHT, GNAT, EGNAT, SAT, DSAT, DSACLT, HRG, kNNG, M-tree, PM-tree, LC, DLC, HC, D-index, MB+-tree, AESA, ROAESA, iAESA, LAESA, TLAESA, EPT(*), CPT, BKT, FQT, FHQT, FQA, VPT, DVPT, MVPT, the Omni-family, the SPB-tree and the M-index(*)) together with their category (pivot-based, compact-partitioning or hybrid), storage cost and construction cost. Legible entries include AESA [102,117] with O(ns + n²) storage and O(n²) construction time, and LAESA [79] with O(ns + ls + nl) storage and O(nl) construction time for l pivots.] ...
... To improve the search efficiency, Reduced-Overhead AESA (ROAESA) [118] and iAESA [51,52] were designed; their data structure is the same as AESA's, and iAESA additionally sorts the pre-computed distances for each object. To reduce the storage of AESA, Linear AESA (LAESA) [79] keeps only the distances from every object to selected pivots. Unlike LAESA, which uses a single set of pivots, the Extreme Pivot Table (EPT*) [37,103] selects several sets of pivots. ...
Preprint
With the continued digitalization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity and variety. Many studies address volume or velocity, while much fewer studies concern variety. The metric space is ideal for addressing variety because it can accommodate any type of data as long as its associated distance notion satisfies the triangle inequality. To accelerate search in metric space, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search, by i) summarizing all the existing partitioning, pruning and validation techniques used for metric indexes, ii) providing the time and storage complexity analysis of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparisons are used to evaluate index performance during search because complexity analyses reveal little about similarity query processing, whose performance depends on pruning and validation abilities that are tied to the data distribution. This article aims at revealing the different strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting, and to direct future research on metric indexes.
... In [27], the process of Sparse Pivot Selection is described, together with remarks on previous work in this field, such as [28], [29], [30] and [34]. As we can see in the aforementioned works, the robustness of pivot-based similarity search methods depends directly on the number of pivots, the distribution among them, and the distribution in the metric space. ...
... As we can see in the aforementioned works, the robustness of pivot-based similarity search methods depends directly on the number of pivots, the distribution among them, and the distribution in the metric space. Precisely, in [30] the maximization of the distance between pivots is pursued, with empirical results showing the effectiveness of this method. More recently, [35] presents a dynamic pivot selection method that can modify the set of pivots as the database grows. ...
Preprint
Full-text available
The Median String Problem is W[1]-Hard under the Levenshtein distance; thus, approximation heuristics are used. Perturbation-based heuristics have proved to be very competitive as regards the ratio of approximation accuracy to convergence speed. However, the computational burden increases with the size of the set. In this paper, we explore the idea of reducing the size of the problem by selecting a subset of representative elements, i.e. pivots, that are used to compute the approximate median instead of the whole set. We aim to reduce the computation time through a reduction of the problem size while achieving similar approximation accuracy. We explain how we find those pivots and how to compute the median string from them. Results on commonly used test data suggest that our approach can reduce the computational requirements (measured in computed edit distances) by 8% with approximation accuracy as good as the state-of-the-art heuristic. This work has been supported in part by CONICYT-PCHA/Doctorado Nacional/2014-63140074 through a Ph.D. Scholarship; Universidad Católica de la Santísima Concepción through the research project DIN-01/2016; European Union's Horizon 2020 under the Marie Skłodowska-Curie grant agreement 690941; Millennium Institute for Foundational Research on Data (IMFD); FONDECYT-CONICYT grant number 1170497; and, for O. Pedreira, Xunta de Galicia/FEDER-UE refs. CSI ED431G/01 and GRC: ED431C 2017/58.
... For this reason tree-based structures are not flexible enough to provide greater elimination power when needed. In contrast, vantage point structures like LAESA [34], Spaghettis [12] and FQA [13] represent another family of solutions. They use more space and construction time, but provide greater efficiency at query time. ...
... To the best of our knowledge, the first vantage-point structure that appeared in the literature was LAESA [34], as a special case of AESA [43]. There have been some improvements over the basic LAESA algorithm, such as keeping distances to the vantage points sorted and doing binary searches to identify which objects can be eliminated from consideration [36]. ...
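A small sketch of the improvement this excerpt mentions: per-pivot distances kept sorted, so that elimination becomes a binary search for the surviving interval [d(q,p) - r, d(q,p) + r]. Function names are illustrative.

```python
# Hedged sketch: with one pivot's distances sorted, the objects that survive
# the triangle-inequality filter form a contiguous run found by binary search.
import bisect

def build(dists_to_pivot, ids):
    """dists_to_pivot[i] = d(ids[i], pivot); returns both lists sorted by distance."""
    order = sorted(range(len(ids)), key=lambda i: dists_to_pivot[i])
    return [dists_to_pivot[i] for i in order], [ids[i] for i in order]

def candidates(sorted_d, sorted_ids, d_q_pivot, radius):
    # Only objects with d(x,p) in [d(q,p)-r, d(q,p)+r] can satisfy d(q,x) <= r.
    lo = bisect.bisect_left(sorted_d, d_q_pivot - radius)
    hi = bisect.bisect_right(sorted_d, d_q_pivot + radius)
    return sorted_ids[lo:hi]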
Thesis
Full-text available
The complex and unstructured nature of many types of data, such as multimedia objects, text documents, protein sequences, requires the use of similarity search techniques for retrieval of information from databases. One popular approach for similarity searching is mapping database objects into feature vectors, which introduces an undesirable element of indirection into the process. A more direct approach is to define a distance function directly between objects. Typically such a function is taken from a metric space, which satisfies a number of properties, such as the triangle inequality. Index structures that can work for metric spaces have been shown to provide satisfactory performance, and were reported to outperform vector-based counterparts in many applications. Metric spaces also provide a more general framework, and for some domains defining a distance between objects can be accomplished more intuitively than mapping objects to feature vectors. In this thesis we will investigate new efficient methods for similarity searching in metric spaces. We will first show that current solutions to indexing in metric spaces have several drawbacks. Tree-based solutions do not provide the best trade-offs between construction time and query performance. Tree structures are also difficult to make dynamic without further degrading their performance. There is also a family of flat structures that address some of the deficiencies of tree-based indices, but they introduce their own unique problems in terms of higher construction cost, higher space usage, and extra CPU overhead. In this thesis a new family of flat structures will be introduced, which are very flexible and simple. We will show that dynamic operations can easily be performed, and that they can be customized to work under different performance requirements. They also address many of the general drawbacks of flat structures as outlined above. A new framework, composite metrics, will also be introduced, which provides a more flexible similarity searching process by allowing several metrics to be combined in one search structure. Two indexing structures will be introduced that can handle similarity queries in this setting, and it will be shown that they provide competitive query performance with respect to data structures for standard metrics.
... In this context, one of the very first embeddings proposed in a metric search scenario was the one representing each data object with a vector of its distances to the pivots. The LAESA [16] is a notable example of an indexing technique using this approach. Recently, Connor et al. [12,11,10] observed that for a large class of metric spaces it is possible to use the distances to a set of n pivots to project the data objects into an n-dimensional Euclidean space such that in the projected space 1) the object-pivot distances are preserved, 2) the Euclidean distance between any two points is a lower bound of the actual distance, and 3) an upper bound can also be easily computed. ...
... , p_n}, the corresponding mapping f_l provides upper and lower bounds that are less tight than those obtained using f_n. This family of embeddings is typically used in indexing tables like LAESA [16] or for space pruning [23]. However, as further described in Section 4, in this work we use them not for indexing purposes, but rather as techniques to approximate the distances between a query and data objects already indexed using a permutation-based approach. ...
... The Case-Based Reasoning (CBR) paradigm (Watson and Marir 1994) was proposed by Slade in 1991 to mine experience from previous cases. Herein, the nearest-neighbour matching algorithm (Micó et al. 1994) is used to calculate the attribute similarity between the new case and previous cases. The matching rate is the most critical definition in the ASM, and its development process is as follows. ...
Article
Full-text available
Unlike most brownfields, which are located in urban centers, there is a special kind of brownfield produced during the Third Front Construction (TFC) period of China; these are named Third Front Brownfields (TFBs) in this paper. In addition to commercial value, other values should also be considered when TFBs are redeveloped, which means they may need a specific protective reuse approach, so their revitalization process is relatively slow. Therefore, it is of great significance to study the redevelopment mode of TFBs. Accordingly, this paper presents a redevelopment mode selection framework to support stakeholders' investment decision-making and facilitate the reuse of TFBs. First, a base of previous cases comprising two sets is developed for experience mining. Specifically, an attribute set and a TFB redevelopment mode set of previous successful cases are established through the literature and expert interviews. Second, the weights of the abovementioned attributes are determined using the G1 method. Third, a concept of matching rate is defined based on the Attribute Similarity Model (ASM) to measure the similarity between the new TFB and previous cases so that stakeholders can get advice on the redevelopment of the new TFB. A case study is conducted to show the effectiveness of the proposed framework, and some policy suggestions are made according to the study process.
... Then, all remaining objects are classified according to their distances to the pivot objects. Clearly, ball partitioning methods fall in this category, which also includes other MAMs that make use of precomputed distance matrices between pairs of elements in the dataset, such as AESA [44] and LAESA [45]. AESA is very fast, but it consumes a lot of resources (it requires O(n²) space and construction time). ...
Preprint
Similarity search based on a distance function in metric spaces is a fundamental problem for many applications. Queries for similar objects lead to the well-known machine learning task of nearest-neighbours identification. Many data indexing strategies, collectively known as Metric Access Methods (MAM), have been proposed to speed up queries for similar elements in this context. Moreover, since exact approaches to solve similarity queries can be complex and time-consuming, alternative options have appeared to reduce query execution time, such as returning approximate results or resorting to distributed computing platforms. In this paper, we introduce MASK (Multilevel Approximate Similarity search with $k$-means), an unconventional application of the $k$-means algorithm as the foundation of a multilevel index structure for approximate similarity search, suitable for metric spaces. We show that inherent properties of $k$-means, like representing high-density data areas with fewer prototypes, can be leveraged for this purpose. An implementation of this new indexing method is evaluated, using a synthetic dataset and a real-world dataset in a high-dimensional and high-sparsity space. Results are promising and underpin the applicability of this novel indexing method in multiple domains.
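The following is our own minimal reading of such a multilevel k-means index, not the authors' implementation: a few Lloyd iterations per level, recursive partitioning, and greedy descent for an approximate nearest neighbour. The parameters k and leaf are placeholders.

```python
# Hedged sketch of a multilevel k-means index for Euclidean vectors.
import numpy as np

def build(points, k=8, leaf=32):
    points = np.asarray(points, dtype=float)
    if len(points) <= leaf:
        return {"leaf": points}
    centers = points[np.random.choice(len(points), k, replace=False)].copy()
    for _ in range(5):                      # a few plain Lloyd iterations
        lbl = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(lbl == j):
                centers[j] = points[lbl == j].mean(axis=0)
    lbl = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
    if max(np.sum(lbl == j) for j in range(k)) == len(points):
        return {"leaf": points}             # degenerate split: stop recursing
    return {"centers": centers,
            "kids": [build(points[lbl == j], k, leaf) for j in range(k)]}

def search(node, q):
    while "kids" in node:                   # greedy descent to nearest prototype
        node = node["kids"][int(np.argmin(((node["centers"] - q) ** 2).sum(-1)))]
    pts = node["leaf"]
    return None if len(pts) == 0 else pts[int(np.argmin(((pts - q) ** 2).sum(-1)))]
```

The greedy descent makes the search approximate: the true nearest neighbour may sit in a sibling cluster, which is the usual accuracy/speed trade-off of this index family.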
... The efficiency of the search in M-tree is reduced when the volume is high; thus, the Pivoting M-tree (PM-tree) is proposed [251,252] to resolve this problem. PM-tree is a hybrid structure, which combines the "local-pivoting strategies" of M-tree [241] with the "global-pivoting strategies" of LAESA [253]. Recently, Razent et al. [254] presented a new construction algorithm for the two indexing structures M-tree and PM-tree. ...
Article
Full-text available
The past decade has been characterized by the growing volumes of data due to the widespread use of the Internet of Things (IoT) applications, which introduced many challenges for efficient data storage and management. Thus, the efficient indexing and searching of large data collections is a very topical and urgent issue. Such solutions can provide users with valuable information about IoT data. However, efficient retrieval and management of such information in terms of index size and search time require optimization of indexing schemes which is rather difficult to implement. The purpose of this paper is to examine and review existing indexing techniques for large-scale data. A taxonomy of indexing techniques is proposed to enable researchers to understand and select the techniques that will serve as a basis for designing a new indexing scheme. The real-world applications of the existing indexing techniques in different areas, such as health, business, scientific experiments, and social networks, are presented. Open problems and research challenges, e.g., privacy and large-scale data mining, are also discussed.
... The efficiency of the search in M-tree is reduced when the volume is high; thus, the Pivoting M-tree (PM-tree) is proposed [388,390] to resolve this problem. PM-tree is a hybrid structure, which combines the "local-pivoting strategies" of M-tree [105] with the "global-pivoting strategies" of LAESA [285]. Recently, Razent et al. [346] presented a new construction algorithm for these two indexing structures. ...
Thesis
The work presented in this Ph.D. thesis focuses on developing a large-scale distributed video surveillance system for tracking suspicious moving objects and analysing their behaviour in an IoT environment. To date, video surveillance systems still face the problem of unnecessary and redundant data caused by multiple detections of the same events by several cameras with overlapping fields of view. These problems not only increase the processing, storage, and communication resources consumed, but also degrade the quality of the tracking, the quality of the behavioural analysis of the tracked objects, and the real-time operation of the system. These effects are particularly exacerbated by dense deployments of large-scale camera networks. To address these issues, we proposed a new distributed and collaborative tracking system. This system aims to improve the quality of monitoring and reduce the cost of data processing by reducing the number of active cameras and the amount of communicated data. The proposed system operates in two steps: (i) electing a leader that has the best view of the detected object among the neighbouring cameras; (ii) choosing the best assistants from neighbouring cameras to maximise detection when the leader's vision is insufficient to track the object. Only the leader and its assistants are active; the other neighbouring cameras remain inactive. To improve real-time data processing, the system's load is distributed throughout the IoVT computing architecture. We proposed two methods of grouping cameras based on the field-of-view (FoV) overlap criterion instead of radius and distance criteria, to reduce the complexity of the coordination mechanism, ensure the feasibility of operation in large-scale networks, and restrict the communication range between cameras. The first proposed technique is based on an agglomerative hierarchical clustering algorithm. This method mainly focuses on grouping the cameras that have maximum overlap. Unfortunately, the method has only partial knowledge of the network and its state, i.e., it only considers the maximum overlap; the other overlaps are neglected. To overcome this limit, we proposed a second grouping technique that groups not only the two most overlapping cameras, but all overlapping cameras with as many cameras as possible. To find such groups, we used the Bron–Kerbosch clique search algorithm. We also proposed a new and efficient indexing mechanism based on a tree structure. The proposed mechanism aims to index the massive data generated by a large-scale camera network so as to reduce the search time as much as possible and ensure real-time system operation. This structure is based on recursive partitioning of the space using the k-means clustering algorithm to separate it into non-overlapping subspaces and thereby improve search and discovery results. The results obtained demonstrate the effectiveness of the proposed methods in terms of tracking quality, amount of network data, energy consumption, and real-time operation compared to a conventional system.
... Metric similarity search indexes using this approach include the ball-tree [13], the metric tree [18] aka. the vantage-point tree [19], the LAESA index [9,15], the Geometric Near-neighbor Access Tree (GNAT) [3] aka. the multi-vantage-point tree [2], the M-tree [5], the SA-tree [11] and Distal SAT [4], the iDistance index [6], the cover tree [1], the M-index [12], and many more. (Neither the k-d-tree, quad-tree, nor the R-tree belong to this family; these indexes are coordinate-based, and require lower bounds based on hyperplanes and bounding boxes, respectively.) ...
Preprint
Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, Cosine similarity has become a popular alternative choice to the standard Euclidean metric, in particular in the context of textual data and neural network embeddings. Unfortunately, Cosine similarity is not metric and does not satisfy the standard triangle inequality. Instead, many search techniques for Cosine rely on approximation techniques such as locality sensitive hashing. In this paper, we derive a triangle inequality for Cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree); show that this bound is tight and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possible other similarity measures beyond the existing work for distance metrics.
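To make the idea concrete, here is a hedged sketch of the angle-based bound such a result builds on: since angles themselves satisfy the ordinary triangle inequality, the cosine addition formulas bound sim(a,c) from both sides. This is the elementary version; see the paper itself for its tight bound and fast approximations. Names are ours.

```python
# Hedged sketch: bound sim(a,c) given sim(a,b) and sim(b,c), using
# theta_ac in [|theta_ab - theta_bc|, theta_ab + theta_bc] and the fact
# that cosine is decreasing on [0, pi].
import math

def cos_sim_bounds(sim_ab, sim_bc):
    sin_ab = math.sqrt(max(0.0, 1.0 - sim_ab * sim_ab))
    sin_bc = math.sqrt(max(0.0, 1.0 - sim_bc * sim_bc))
    lower = sim_ab * sim_bc - sin_ab * sin_bc   # cos(theta_ab + theta_bc)
    upper = sim_ab * sim_bc + sin_ab * sin_bc   # cos(theta_ab - theta_bc)
    return lower, upper
```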
... The idea of reconstructing the distance between any pair of objects in a metric space by exploiting distances with a group of reference objects was probably first addressed in [9]. The authors proposed an embedding into another metric space where it is possible to deduce upper and lower bounds on the actual distance of any pair of objects. ...
Chapter
Efficient indexing and retrieval in generic metric spaces often translate into the search for approximate methods that can retrieve relevant samples to a query performing the least amount of distance computations. To this end, when indexing and fulfilling queries, distances are computed and stored only against a small set of reference points (also referred to as pivots) and then adopted in geometrical rules to estimate real distances and include or exclude elements from the result set. In this paper, we propose to learn a regression model that estimates the distance between a pair of metric objects starting from their distances to a set of reference objects. We explore architectural hyper-parameters and compare with the state-of-the-art geometrical method based on the n-simplex projection. Preliminary results show that our model provides a comparable or slightly degraded performance while being more efficient and applicable to generic metric spaces.
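A hedged sketch of the setup this abstract describes: a regressor trained to map a pair's pivot-distance vectors to their true distance. The sampling scheme, architecture, and hyper-parameters below are placeholders, not the authors' choices.

```python
# Hedged sketch: learn distance(x_i, x_j) from the two pivot-distance vectors.
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_estimator(pivot_dists, true_dist, n_pairs=20000, seed=0):
    """pivot_dists: (n, l) array; true_dist(i, j): actual metric distance."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(pivot_dists), size=n_pairs)
    j = rng.integers(len(pivot_dists), size=n_pairs)
    X = np.hstack([pivot_dists[i], pivot_dists[j]])        # features: both vectors
    y = np.array([true_dist(a, b) for a, b in zip(i, j)])  # target: d(x_i, x_j)
    return MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300).fit(X, y)
```

Unlike the geometric bounds above, such a learned estimate carries no correctness guarantee, which matches the abstract's framing of comparable-but-approximate performance.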
... Probabilistic approaches can cope more efficiently with higher dimensionality while still providing logarithmic performance with respect to the number of nodes (Weiss 1980). The efficiency of probabilistic approaches can be increased even further by preprocessing the data, as has been discussed using the example of the AESA algorithm (Vidal 1994) and its improved variants LAESA (Micó et al. 1994), TLAESA (Micó et al. 1996), and an improved TLAESA algorithm (Tokoro et al. 2006). Further variants of the aforenamed strategies exist (Sproull 1991, Roussopoulos et al. 1995). ...
Article
Full-text available
The naïve algorithm for generating nearest-neighbour models determines the distance between every pair of nodes, resulting in quadratic running time. Such time complexity is common among spatial problems and impedes the generation of larger spatial models. In this article, an improved algorithm for the Mocnik model, an example of nearest-neighbour models, is introduced. Instead of solving k nearest-neighbour problems for each node (k dynamic in the sense that it varies among the nodes), the improved algorithm presented exploits the notion of locality by introducing a corresponding spatial index, resulting in a linear average-case time complexity. This makes it possible to generate very large prototypical spatial networks, which can serve as testbeds to evaluate and improve spatial algorithms, in particular with respect to the optimization of algorithms towards big geospatial data.
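As a hedged illustration of the locality idea (not the article's actual index), a uniform grid over the plane restricts neighbour candidates to nearby cells, which yields linear average-case behaviour on roughly uniform point sets. Names and the cell size parameter are ours.

```python
# Hedged sketch: bucket 2D points into grid cells, then only inspect the
# cells around a point when looking for its nearest neighbours.
from collections import defaultdict

def build_grid(points, cell):
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(idx)
    return grid

def nearby(grid, p, cell, rings=1):
    """Yield candidate indices from the (2*rings+1)^2 cells around p."""
    cx, cy = int(p[0] // cell), int(p[1] // cell)
    for dx in range(-rings, rings + 1):
        for dy in range(-rings, rings + 1):
            yield from grid.get((cx + dx, cy + dy), [])
```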
... There is a certain genre of structures, however, that do the exact opposite-where instead of discovering data, you eliminate it. Take, for example, the LAESA structure [17]: a table of distances between so-called pivots and the other points in the data set. The query is compared to each of these pivots, and the computed query-pivot distances, along with the stored pivot-data-point distances, are used to determine whether any given data point may possibly be relevant. ...
Preprint
A follow-up to my previous tutorial on metric indexing, this paper walks through the classic structures, placing them all in the context of the recently proposed "sprawl of ambits" framework. The indexes are presented as configurations of a single, more general structure, all queried using the same search procedure.
... In Section 5 we demonstrate the efficacy of using convex transforms with two well-known pivot-based metric approaches, LAESA [11] and Vantage Point Trees (VPT) [15]. The former performs filtering, while the latter performs indexing. ...
Article
Scalable similarity search in metric spaces relies on using the mathematical properties of the space in order to allow efficient querying. Most important in this context is the triangle inequality property, which can allow the majority of individual similarity comparisons to be avoided for a given query. However many important metric spaces, typically those with high dimensionality, are not amenable to such techniques. In the past convex transforms have been studied as a pragmatic mechanism which can overcome this effect; however the problem with this approach is that the metric properties may be lost, leading to loss of accuracy. Here, we study the underlying properties of such transforms and their effect on metric indexing mechanisms. We show there are some spaces where certain transforms may be applied without loss of accuracy, and further spaces where we can understand the engineering tradeoffs between accuracy and efficiency. We back these observations with experimental analysis. To highlight the value of the approach, we show three large spaces deriving from practical domains whose dimensionality prevents normal indexing techniques, but where the transforms applied give scalable access with a relatively small loss of accuracy.
... That is to say, the idea is to allow searching for items based on any combination of the previous strategies, following a sublinear algorithmic order. For example, classical alternatives include Burkhard-Keller Trees [58], the Approximating and Eliminating Search Algorithm [59], Spatial Approximation Trees [60], Vantage Point Trees [61], and the Excluded Middle Vantage Point Forest [62], among others. ...
Chapter
Nowadays, data are generated both by users and by systems that derive new data from previous data to support decision making. Electronic Health Records contain structured data (e.g. hospital id), semi-structured data (e.g. Health Level Seven-based records), and unstructured data (e.g. patients' symptoms). The big challenge for health in smart cities is prevention, from both the business and the human-health points of view. That is to say, avoiding the propagation of certain disease patterns is the best option not just for people, but also for the city's health and the local economy. Thus, an architecture able to integrate into an Organizational Memory the medical data coming from heterogeneous repositories, with the aim of gathering different kinds of symptoms, is introduced. A query in the architecture is understood as either an unstructured text (i.e. symptoms) or an electronic health record. In this sense, the architecture is able to retrieve similar cases from the organizational memory based on a textual similarity analysis that limits the search space. Next, using the International Classification of Diseases, it is possible to convert a case to a vector-model representation in order to compute metric distances and obtain other cases ordered by level of similarity. Each query answer contains a set of recommendations, based on the frequency of diagnoses related to similar cases, given in order to share previous experiences. The process view of the architecture is outlined. Finally, some conclusions and future works are outlined.
... At query time, it performs a fixed number of distance computations for a database; it is also necessary to compute a linear number of arithmetic and logical operations. A linear restriction of the same idea is presented in LAESA [19], where a constant number of pivots is used, independently of the size of the database. ...
Article
Full-text available
This manuscript presents the extreme pivots (EP) metric index, a data structure, to speed up exact proximity searching in the metric space model. For the EP, we designed an automatic rule to select the best pivots for a dataset working on limited memory resources. The net effect is that our approach solves queries efficiently with a small memory footprint, and without a prohibitive construction time. In contrast with other related structures, our performance is achieved automatically without dealing directly with the index’s parameters, using optimization techniques over a model of the index. The EP’s model is studied in-depth in this contribution. In practical terms, an interested user only needs to provide the available memory and a sample of the query distribution as parameters. The resulting index is quickly built, and has a good trade-off among memory usage, preprocessing, and search time. We provide an extensive experimental comparison with state-of-the-art searching methods. We also carefully compared the performance of metric indexes in several scenarios, firstly with synthetic data to characterize performance as a function of the intrinsic dimension and the size of the database, and also in different real-world datasets with excellent results.
... They are designed to partition the metric space into many subsets and discard regions that do not intersect the query ball during the search. The second class comprises pivot-based methods like AESA (Ruiz, 1986), LAESA (Chávez et al., 2001; Micó et al., 1994) and EP (Ruiz et al., 2013). They use a matrix of distances between dataset objects and reference objects called pivots. ...
Article
Full-text available
Owing to the development of image data production and use, the quantity of image datasets has increased exponentially in the last decade. Consequently, the similarity search cost in image datasets has become a severe problem which affects the efficiency of similarity search engines for this data type. In this paper, we address the problem of reducing the similarity search cost in large, high-dimensional and scalable image datasets; we propose an improvement of the D-index method to reduce the search cost and to deal efficiently with scalable datasets. The proposed improvement is based on two propositions: first, we propose criteria and algorithms to choose effective separation values which can reduce the search cost; second, we propose an algorithm for updating the structure in the case of scalable datasets, to resist the impact of object insertions on the search cost. The experiments show that the proposed D-index version achieves good search performance in comparison with the classical D-index and significant resistance to dataset scalability relative to the original D-index.
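For context, a minimal sketch of the ball-partitioning split on which the D-index family is built, where the separation value ρ sends borderline objects to an exclusion set so that the two separable sets can be searched independently for small query radii. This sketches the underlying principle only, not the improvement proposed above.

```python
# Hedged sketch of a rho-split function: objects within rho of the splitting
# radius dm go to an exclusion set; the remaining two sets are "separable"
# (no query ball of radius <= rho can intersect both).
def rho_split(d_x_pivot, dm, rho):
    if d_x_pivot <= dm - rho:
        return 0      # inner separable set
    if d_x_pivot > dm + rho:
        return 1      # outer separable set
    return -1         # exclusion set, revisited only when the query requires it
```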
... In order to ensure a responsive behavior of the similarity search component in terms of response time, we implemented a Nearest-Neighbour Approximating and Eliminating Search Algorithm [13]. In the first step, the algorithm selects a set of well-distributed feature points in the search space as pivot elements. ...
Conference Paper
Full-text available
With the growing hype for wearable devices recording biometric data comes the readiness to capture and combine even more personal information as a form of digital diary - lifelogging today is practiced ever more and can be categorized anywhere between an informative hobby and a life-changing experience. From an information processing point of view, analyzing the entirety of such multi-source data is immensely challenging, which is why the first Lifelog Search Challenge 2018 competition has been brought into being, so as to encourage the development of efficient interactive data retrieval systems. Answering this call, we present a retrieval system based on our video search system diveXplore, which has successfully been used in the Video Browser Showdown 2017 and 2018. Due to the different task definition and available data corpus, the base system was adapted and extended for this new challenge. The resulting lifeXplore system is a flexible retrieval and exploration tool that offers various easy-to-use, yet still powerful search and browsing features that have been optimized for lifelog data and for usage by novice users. Besides efficient presentation and summarization of lifelog data, it includes searchable feature maps, concept and metadata filters, similarity search and sketch search.
Chapter
The merit of projecting data onto linear subspaces is well known from, e.g., dimension reduction. One key aspect of subspace projections, the maximum preservation of variance (principal component analysis), has been thoroughly researched and the effect of random linear projections on measures such as intrinsic dimensionality still is an ongoing effort. In this paper, we investigate the less explored depths of linear projections onto explicit subspaces of varying dimensionality and the expectations of variance that ensue. The result is a new family of bounds for Euclidean distances and inner products. We showcase the quality of these bounds as well as investigate the intimate relation to intrinsic dimensionality estimation.
Article
Similarity search finds similar objects for a given query object based on a certain similarity metric. Similarity search in metric spaces has attracted increasing attention, as the metric space can accommodate any type of data and support flexible distance metrics. However, a metric space only models a single data type with a specific similarity metric. In contrast, a multi-metric space combines multiple metric spaces to simultaneously model a variety of data types and a collection of associated similarity metrics. Thus, a multi-metric space is capable of performing similarity search over any combination of metric spaces. Many studies focus on indexing a single metric space, while only a few aim at indexing multi-metric spaces to accelerate similarity search. In this paper, we propose DESIRE, an efficient dynamic cluster-based forest index for similarity search in multi-metric spaces. DESIRE first selects high-quality centers to cluster objects into compact regions, and then employs B⁺-trees to effectively index distances between centers and corresponding objects. To support dynamic scenarios, efficient update strategies are developed. Further, we provide filtering techniques to accelerate similarity queries in multi-metric spaces. Extensive experiments on four real datasets demonstrate the superior efficiency and scalability of our proposed DESIRE compared with the state-of-the-art multi-metric space indexes.
Chapter
In this paper, we propose a hybrid algorithm for exact nearest neighbors queries in high-dimensional spaces. Indexing structures typically used for exact nearest neighbors search become less efficient in high-dimensional spaces, effectively requiring brute-force search. Our method uses a massively-parallel approach to brute-force search that efficiently splits the computational load between CPU and GPU. We show that the performance of our algorithm scales linearly with the dimensionality of the data, improving upon previous approaches for high-dimensional datasets. The algorithm is implemented in Julia, a high-level programming language for numerical and scientific computing. It is openly available at https://github.com/davnn/ParallelNeighbors.jl.
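A hedged, CPU-only sketch of the batched brute-force pattern such an approach rests on; the cited implementation is in Julia and additionally splits batches between CPU and GPU, which is not shown here.

```python
# Hedged sketch of batched brute-force exact kNN for Euclidean vectors.
import numpy as np

def knn_bruteforce(db, queries, k, batch=1024):
    out_idx = np.empty((len(queries), k), dtype=np.int64)
    db_sq = (db * db).sum(1)                 # |x|^2, reused across batches
    for s in range(0, len(queries), batch):
        q = queries[s:s + batch]
        # Squared distances via the expansion |q|^2 - 2 q.x + |x|^2.
        d2 = (q * q).sum(1)[:, None] - 2.0 * q @ db.T + db_sq[None, :]
        part = np.argpartition(d2, k - 1, axis=1)[:, :k]     # unordered top-k
        row = np.arange(len(q))[:, None]
        order = np.argsort(d2[row, part], axis=1)            # sort just the k
        out_idx[s:s + batch] = part[row, order]
    return out_idx
```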
Article
With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while fewer studies concern the variety. Metric spaces are ideal for addressing variety because they can accommodate any data as long as it can be equipped with a distance notion that satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data have been proposed. However, existing surveys offer limited coverage, and a comprehensive empirical study has yet to be reported. We offer a comprehensive survey of existing metric indexes that support exact similarity search: we summarize existing partitioning, pruning, and validation techniques used by metric indexes to support exact similarity search; we provide the time and space complexity analyses of index construction; and we offer an empirical comparison of their query processing performance. Empirical studies are important when evaluating metric indexing performance, because performance can depend highly on the effectiveness of available pruning and validation as well as on the data distribution, which means that complexity analyses often offer limited insights. This article aims at revealing strengths and weaknesses of different indexing techniques to offer guidance on selecting an appropriate indexing technique for a given setting, and to provide directions for future research on metric indexing.
Article
Chatter detection from sensor signals has been an active field of research. While some success has been reported using several featurization tools and machine learning algorithms, existing methods have several drawbacks, including the need for data pre-processing by an expert. In this paper, we present an alternative approach for chatter detection based on the K-Nearest Neighbor (KNN) algorithm for classification and Dynamic Time Warping (DTW) as a time series similarity measure. The time series used are the acceleration signals acquired from the tool holder in a series of turning experiments. Our results show that this approach achieves detection accuracies that can outperform existing methods, and it does not require data pre-processing. We compare our results to the traditional methods based on the Wavelet Packet Transform (WPT) and the Ensemble Empirical Mode Decomposition (EEMD), as well as to the more recent Topological Data Analysis (TDA) based approach. We show that in two out of four cutting configurations our DTW-based approach is within the error range of the highest accuracy or attains the highest classification rate, reaching in one case as high as 98% accuracy. Moreover, we combine the Approximate and Eliminate Search Algorithm (AESA) and parallel computing with the DTW-based approach to achieve chatter classification in less than 2 s, thus making our approach applicable for online chatter detection.
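A minimal sketch of the core combination described above, 1-NN classification under a plain O(nm) DTW distance, without the AESA acceleration, parallelism, or warping-band constraints a production version would add. Names are illustrative.

```python
# Hedged sketch: dynamic programming DTW plus 1-nearest-neighbour voting.
import numpy as np

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn1(train_series, train_labels, query):
    dists = [dtw(query, s) for s in train_series]
    return train_labels[int(np.argmin(dists))]
```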
Chapter
In the ongoing multimedia age, search needs become more variable and challenging to aid. In the area of content-based similarity search, asking search engines for one or just a few nearest neighbours to a query does not have to be sufficient to accomplish a challenging search task. In this work, we investigate a task type where users search for one particular multimedia object in a large database. Complexity of the task is empirically demonstrated with a set of experiments and the need for a larger number of nearest neighbours is discussed. A baseline approach for finding a larger number of approximate nearest neighbours is tested, showing potential speed-up with respect to a naive sequential scan. Last but not least, an open efficiency challenge for metric access methods is discussed for datasets used in the experiments.
Chapter
Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, cosine similarity has become a popular alternative choice to the standard Euclidean metric, in particular in the context of textual data and neural network embeddings. Unfortunately, cosine similarity is not metric and does not satisfy the standard triangle inequality. Instead, many search techniques for cosine rely on approximation techniques such as locality sensitive hashing. In this paper, we derive a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree); show that this bound is tight and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possible other similarity measures beyond the existing work for distance metrics.
Article
In high dimensional datasets, exact indexes are ineffective for proximity queries, and a sequential scan over the entire dataset is unavoidable. Accepting this, here we present a new approach employing two-dimensional embeddings. Each database element is mapped to the XY plane using the four-point property. The caveat is that the mapping is local: in other words, each object is mapped using a different mapping. The idea is that each element of the data is associated with a pair of reference objects that is well-suited to filter that particular object, in cases where it is not relevant to a query. This maximises the probability of excluding that object from a search. At query time, a query is compared with a pool of reference objects which allow its mapping to all the planes used by data objects. Then, for each query/object pair, a lower bound of the actual distance is obtained. The technique can be applied to any metric space that possesses the four-point property, therefore including Euclidean, Cosine, Triangular, Jensen–Shannon, and Quadratic Form distances. Our experiments show that for all the datasets tested, of varying dimensionality, our approach can filter more objects than a standard metric indexing approach. For low dimensional data this does not make a good search mechanism in its own right, as it does not scale with the size of the data: that is, its cost is linear with respect to the data size. However, we also show that it can be added as a post-filter to other mechanisms, increasing efficiency with little extra cost in space or time. For high-dimensional data, we show related approximate techniques which, we believe, give the best known compromise for speeding up the essential sequential scan. The potential uses of our filtering technique include pure GPU searching, taking advantage of the tiny memory footprint of the mapping.
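A hedged sketch of the planar mapping this abstract describes: with two reference objects a and b, the law of cosines places any object on the XY plane, and for spaces with the four-point property the planar distance between same-side images lower-bounds the true distance. Names are ours and degenerate cases (d(a,b) = 0) are ignored.

```python
# Hedged sketch: map objects to the plane via two pivots, then use the
# planar distance as a cheap lower bound on the true metric distance.
import math

def to_plane(d_xa, d_xb, d_ab):
    """Place x so that its distances to a=(0,0) and b=(d_ab,0) are preserved."""
    x = (d_xa**2 + d_ab**2 - d_xb**2) / (2.0 * d_ab)
    y = math.sqrt(max(0.0, d_xa**2 - x * x))
    return x, y

def lower_bound(d_qa, d_qb, d_xa, d_xb, d_ab):
    qx, qy = to_plane(d_qa, d_qb, d_ab)
    sx, sy = to_plane(d_xa, d_xb, d_ab)
    # Same-side placement gives a lower bound under the four-point property;
    # reflecting one image across the pivot axis would give an upper bound.
    return math.hypot(qx - sx, qy - sy)
```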
Article
Efficient knn computation for high-dimensional data is an important, yet challenging task. Today, most information systems use a column-store back-end for relational data. For such systems, multi-dimensional indexes accelerating selections are known. However, they cannot be used to accelerate knn queries. Consequently, one relies on sequential scans, specialized knn indexes, or trades result quality for speed. To avoid storing one specialized index per query type, we envision multipurpose indexes allowing to efficiently compute multiple query types. In this paper, we focus on additionally supporting knn queries as a first step towards this goal. To this end, we study how to exploit total orders for accelerating knn queries based on the sub-space distance-equalities observation. It means that non-equal points in the full space, which are projected to the same point in a sub-space, have the same distance to every other point in this sub-space. In case one can easily find these equalities and tune storage structures towards them, this offers two effects one can exploit to accelerate knn queries. The first effect allows pruning of point groups based on a cascade of lower bounds. The second allows re-using previously computed sub-space distances between point groups. This results in a worst-case execution bound, which is independent of the distance function. We present knn algorithms exploiting both effects and show how to tune a storage structure already known to work well for multi-dimensional selections. Our investigations reveal that the effects are robust to increasing, e.g., the dimensionality, suggesting generally good knn performance. Comparing our knn algorithms to well-known competitors reveals large performance improvements up to one order of magnitude. Furthermore, the algorithms deliver at least comparable performance as the next fastest competitor, suggesting that the algorithms are only marginally affected by the curse of dimensionality.
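A minimal sketch of the pruning effect described above, under the assumption of Euclidean data: the distance on a prefix of the dimensions never exceeds the full distance, and points sharing a prefix projection share that bound, so it is computed once per group. The cascade over several sub-spaces is not shown.

```python
# Hedged sketch: group points by a dimension prefix, bound each group once,
# and prune whole groups against the current kth-best distance.
import numpy as np

def knn_with_prefix_bound(db, q, k, prefix=2):
    keys, inv = np.unique(db[:, :prefix], axis=0, return_inverse=True)
    lb_group = np.sqrt(((keys - q[:prefix]) ** 2).sum(1))   # one bound per group
    best = np.full(k, np.inf)            # k best distances, ascending
    for g in np.argsort(lb_group):
        if lb_group[g] >= best[-1]:
            break                        # all remaining groups are prunable
        for i in np.where(inv == g)[0]:
            d = np.linalg.norm(db[i] - q)
            if d < best[-1]:
                best[-1] = d
                best.sort()
    return best
```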
Article
All-pairs distance computation for a collection of strings is a computation-intensive task with important applications in bioinformatics, in particular, in distance-based phylogenetic analysis techniques. Even if the computationally efficient Hamming distance is used for this purpose, the quadratic number of sequence pairs may be challenging. We propose a number of practical algorithms for efficient pairwise Hamming distance computation under a given distance threshold. The techniques are based on such concepts as pivot-based similarity search in metric spaces, the pigeonhole principle for approximate string matching, cache-friendly data arrangement, bit-parallelism, and others. We experimentally show that our solutions are often about an order of magnitude faster than the average-case linear-time LCP-based clusters method proposed recently, on both real and synthetic benchmarks.
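A hedged sketch of two of the listed ingredients, bit-parallel Hamming distance via popcount and the pigeonhole pre-filter (if ham(a,b) <= t, at least one of t+1 disjoint blocks must match exactly). Block handling is simplified relative to a careful implementation.

```python
# Hedged sketch: popcount-based Hamming plus a pigeonhole pre-filter.
def hamming(a: int, b: int) -> int:
    """Bit-parallel Hamming distance on bit strings packed into ints (Py >= 3.10)."""
    return (a ^ b).bit_count()

def blocks(bits: str, t: int):
    w = len(bits) // (t + 1)
    # Simplified: any tail beyond (t+1)*w is ignored, which keeps the filter
    # conservative (it can only let extra pairs through, never reject a match).
    return [bits[i * w:(i + 1) * w] for i in range(t + 1)]

def within_threshold(s1: str, s2: str, t: int) -> bool:
    # Pigeonhole: if no block matches exactly, ham(s1,s2) must exceed t.
    if all(b1 != b2 for b1, b2 in zip(blocks(s1, t), blocks(s2, t))):
        return False
    return sum(c1 != c2 for c1, c2 in zip(s1, s2)) <= t
```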
Article
Pivot-based algorithms are effective tools for proximity searching in metric spaces. They allow trading space overhead for the number of distance evaluations performed at query time. With additional search structures (that pose extra space overhead) they can also reduce the amount of side computations. We introduce a new data structure, the Fixed Queries Array (FQA), whose novelties are (1) it permits sublinear extra CPU time without any extra data structure; (2) it permits trading the number of pivots for their precision so as to make better use of the available memory. We show experimentally that the FQA is an efficient tool to search in metric spaces and that it compares favorably against other state-of-the-art approaches. Its simplicity makes it an effective tool for practitioners seeking a black-box method to plug into their applications.
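A minimal sketch of the FQA's two novelties as we read them: pivot distances discretised to a few bits each (trading pivot count for precision) and stored in one sorted array searched by plain binary search. The quantisation scheme and parameters are placeholders, not the paper's.

```python
# Hedged sketch: quantised pivot signatures in a single sorted array.
import bisect

def signature(dists_to_pivots, step=1.0, bits=4):
    # Quantise each pivot distance into 2**bits buckets.
    return tuple(min(int(d / step), (1 << bits) - 1) for d in dists_to_pivots)

def build_fqa(all_dists):
    """all_dists[i] = list of distances from object i to the pivots."""
    return sorted((signature(d), i) for i, d in enumerate(all_dists))

def range_on_first_pivot(fqa, q_d0, radius, step=1.0, bits=4):
    # Candidates whose first-pivot bucket is compatible with |d(q,p)-d(x,p)| <= r;
    # further pivots would then be checked within this run.
    lo = min(int(max(q_d0 - radius, 0.0) / step), (1 << bits) - 1)
    hi = min(int((q_d0 + radius) / step), (1 << bits) - 1)
    a = bisect.bisect_left(fqa, ((lo,),))
    b = bisect.bisect_right(fqa, ((hi + 1,),))
    return fqa[a:b]
```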
Chapter
A follow-up to my previous tutorial on metric indexing, this paper walks through the classic structures, placing them all in the context of the recently proposed sprawl of ambits framework. The indexes are presented as configurations of a single, more general structure, all queried using the same search procedure.
Article
Full-text available
In the context of support vector machines, identifying the support vectors is a key issue when dealing with large data sets. In Camelo et al. (Ann Oper Res 235:85–101, 2015), the authors present a promising approach to finding or approximating most of the support vectors through a procedure based on sub-sampling and enriching the support vector sets by nearest neighbors. This method has been shown to improve the computational efficiency of support vector machines on large data sets with low or intermediate feature space dimension. In the present article we discuss ways of adapting the nearest neighbor enriching methodology to the context of very high dimensional data, such as text data or other high dimensional data types, for which nearest neighbor queries involve, in principle, a high computational cost. Our approach incorporates the proximity preserving order search algorithm of Chavez et al. (MICAI 2005: advances in artificial intelligence, Springer, Berlin, pp 405–414, 2005) into the nearest neighbor enriching method of Camelo et al. (2015), in order to adapt this procedure to the high dimension setting. For the required set of pivots, both random pivots and the base prototype pivot set of Micó et al. (Pattern Recogn Lett 15:9–17, 1994) are considered. The methodology proposed is evaluated on real data sets.
Article
We define BitPart (Bitwise representations of binary Partitions), a novel exact search mechanism intended for use in high-dimensional spaces. In outline, a fixed set of reference objects is used to define a large set of regions within the original space, and each data item is characterised according to its containment within these regions. In contrast with other mechanisms only a subset of this information is selected, according to the query, before a search within the re-cast space is performed. Partial data representations are accessed only if they are known to be potentially useful towards the calculation of the exact query solution. Our mechanism requires Ω(N log N) space to evaluate a query, where N is the cardinality of the data, and therefore does not scale as well as previously defined mechanisms with low-dimensional data. However it has recently been shown that, for a nearest neighbour search in high dimensions, a sequential scan of the data is essentially unavoidable. This result has been suspected for a long time, and has been referred to as the curse of dimensionality in this context. In the light of this result, the compromise achieved by this work is to make the best possible use of the available fast memory, and to offer great potential for parallel query evaluation. To our knowledge, it gives the best compromise currently known for performing exact search over data whose dimensionality is too high to allow the useful application of metric indexing, yet is still sufficiently low to give at least some traction from the metric and supermetric properties.
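A hedged sketch of query-time exclusion with bitwise region containment in this spirit; we use ball regions only, whereas BitPart itself selects among richer region types and compressed representations. Names and parameters are ours.

```python
# Hedged sketch: one containment bit per (datum, reference ball); whole
# columns are excluded with the triangle inequality before any distance
# to the data itself is computed.
import numpy as np

def build_bits(dist_to_refs, taus):
    """dist_to_refs: (n_data, n_refs) array; taus: (n_refs,) ball radii."""
    return dist_to_refs <= taus[None, :]          # bit[i, j] = x_i in ball(r_j, tau_j)

def candidates(bits, q_to_refs, taus, radius):
    keep = np.ones(bits.shape[0], dtype=bool)
    for j, (dq, tau) in enumerate(zip(q_to_refs, taus)):
        if dq - tau > radius:        # everything inside the ball is too far
            keep &= ~bits[:, j]
        elif tau - dq > radius:      # everything outside the ball is too far
            keep &= bits[:, j]
    return np.nonzero(keep)[0]       # survivors still need real distance checks
```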
Chapter
The metric space model is a popular and extensible model for indexing data for fast similarity search. However, there is often a need for broader concepts of similarity (beyond the metric space model), which cannot directly benefit from metric indexing. This paper focuses on approximate search in semi-metric spaces using a genetic variant of the TriGen algorithm. The original TriGen algorithm generates metric modifications of semi-metric distance functions, thus allowing metric indexes to index non-metric models. However, the “analytic” modifications provided by TriGen are not stable in predicting the retrieval error. In our approach, the genetic variant of TriGen – the TriGenGA – uses genetically learned semi-metric modifiers (piecewise linear functions) that lead to better estimates of the retrieval error. Additionally, the TriGenGA modifiers result in better overall performance than the original TriGen modifiers.
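The mechanism behind TriGen-style modifiers can be illustrated compactly: a concave increasing function f with f(0) = 0 applied to a semi-metric tends to repair triangle-inequality violations, and a modifier can be scored by the fraction of sampled triplets that still violate the inequality. The power modifier and brute-force check below are our illustration; TriGen and TriGenGA search much richer modifier families.

import itertools, random

def modifier(d, w=0.5):
    # A simple concave "fractional power" modifier; TriGen's bases and the
    # learned piecewise-linear modifiers of TriGenGA play this role.
    return d ** w

def triangle_error(dist, sample):
    # Fraction of sampled triplets violating d(a,c) <= d(a,b) + d(b,c).
    triples = list(itertools.combinations(sample, 3))
    bad = 0
    for a, b, c in triples:
        x, y, z = sorted([dist(a, b), dist(b, c), dist(a, c)])
        if z > x + y + 1e-12:       # tolerance for floating-point noise
            bad += 1
    return bad / len(triples)

random.seed(1)
pts = [random.random() for _ in range(30)]
sqdist = lambda a, b: (a - b) ** 2           # a semi-metric: violates triangle
print(triangle_error(sqdist, pts))           # > 0
fixed = lambda a, b: modifier(sqdist(a, b))  # sqrt of squared diff = |a - b|
print(triangle_error(fixed, pts))            # 0.0: a true metric now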
Chapter
The concept of local pivoting is to partition a metric space so that each element in the space is associated with precisely one of a fixed set of reference objects or pivots. The idea is that each object of the data set is associated with the reference object that is best suited to filter that particular object if it is not relevant to a query, maximising the probability of excluding it from a search. The notion does not in itself lead to a scalable search mechanism, but instead gives a good chance of exclusion based on a tiny memory footprint and a fast calculation. It is therefore most useful in contexts where main memory is at a premium, or in conjunction with another, scalable, mechanism.
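A minimal sketch of the bookkeeping (ours, with a deliberately naive pivot-assignment heuristic): each object keeps only the identity of one pivot and its distance to it, so the footprint is two values per object and exclusion costs one subtraction per object.

import random

def euclid(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def assign_local_pivots(data, pivots):
    # Per object, keep only (index of its pivot, distance to that pivot).
    # Here we pick the farthest pivot -- just one plausible heuristic;
    # the paper studies principled assignment choices.
    table = []
    for x in data:
        i = max(range(len(pivots)), key=lambda j: euclid(x, pivots[j]))
        table.append((i, euclid(x, pivots[i])))
    return table

def range_query(q, r, data, pivots, table):
    dq = [euclid(q, p) for p in pivots]
    out = []
    for x, (i, dxp) in zip(data, table):
        if abs(dq[i] - dxp) > r:    # excluded by its own pivot, no distance
            continue
        if euclid(q, x) <= r:
            out.append(x)
    return out

random.seed(2)
data = [tuple(random.random() for _ in range(6)) for _ in range(1000)]
pivots = data[:8]
table = assign_local_pivots(data, pivots)
print(len(range_query(data[0], 0.4, data, pivots, table)))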
Preprint
Basic assumptions about comparison-based indexing are laid down and a general design space is derived from these. An index structure spanning this design space (the sprawl) is described, along with an associated family of partitioning predicates, or regions (the ambits), as well as algorithms for search and, to some extent, construction. The sprawl of ambits forms a unification and generalization of current indexing methods, and a jumping-off point for future designs.
Chapter
The nearest neighbour algorithm is an effective but relatively time-consuming method of non-parametric regression and classification [9]. The method can easily be adapted to work with missing data using simple marginalisation or other preprocessing [1, 8, 10, 12], and the efficiency of this solution remains high. In this chapter, a rough version of the algorithm is presented. First, the basic version of the k-nearest neighbour classifier is recalled; then a rough version prepared for missing data is proposed.
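A sketch of the marginalisation idea for the plain k-NN classifier (ours; the distance rescaling is one simple choice, and the chapter's rough-set variant goes further): distances are computed only over the features observed in both vectors and rescaled to the full dimension.

from collections import Counter

def marg_dist(a, b):
    # Compare only on features present (not None) in both vectors.
    shared = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    if not shared:
        return float("inf")                    # nothing to compare on
    s = sum((u - v) ** 2 for u, v in shared)
    return (s * len(a) / len(shared)) ** 0.5   # rescale for missing coords

def knn_classify(q, data, labels, k=3):
    nearest = sorted(range(len(data)), key=lambda i: marg_dist(q, data[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

X = [(1.0, 2.0, None), (0.9, None, 3.1), (5.0, 5.2, 4.9)]
y = ["a", "a", "b"]
print(knn_classify((1.0, 2.1, 3.0), X, y, k=1))   # -> "a"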
Article
Full-text available
The index is a data structure that stores data in a suitably abstracted and compressed form to facilitate rapid processing by an application. Multidimensional databases may also contain a great deal of redundant data. The indexed data therefore need to be aggregated to decrease the size of the index, which further eliminates unnecessary comparisons. Feature-based indexing has proved quite useful for speeding up retrieval, and much has been proposed in this regard. Hence, there are growing research efforts towards developing new indexing techniques for data analysis. In this article, we propose a comprehensive survey of indexing techniques together with an application and evaluation framework. First, we present a review of articles categorized into hash-based and non-hash-based indexing techniques; a total of 45 techniques have been examined. We discuss the advantages and disadvantages of each method, listed in tabular form. We then study the evaluation results of hash-based indexing techniques on different image datasets, followed by evaluation campaigns in multimedia retrieval. In all, 36 datasets and three evaluation campaigns have been reviewed. The primary aim of this study is to apprise the reader of the significance of the different techniques, the datasets used, and their respective pros and cons.
Article
This work presents a new method for determining a parameterization, resampling, and dimension search of an uncertainty model that can be used for efficient engineering models in control design. An algorithm using the Cayley–Menger determinant as a measure of the dimension test geometry (volume/area/length) of the parametric data points is presented to search for a reduced number of dimensions that can represent the parameters of a model capturing the uncertainty in a dynamic system (an uncertainty model). A genetic algorithm (GA) is utilized to solve the nonconvex problem of finding the coefficients of a parameterization of the uncertainty model. A resampling approach for the uncertainty model is also presented. The methods are demonstrated on an electrohydraulic valve control system problem. This demonstration includes the dimensional search, data resampling, and parameterization of an uncertainty class determined from test data for 30 replications of an electrohydraulic flow control valve, which were experimentally modeled in the lab. The suggested resampling method and the parameterization of the uncertainty are used to analyze the robust stability of a control system for the class of valves, using both frequency-domain H-infinity methods and analysis of closed-loop poles for the resampled uncertainty model.
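As a concrete anchor for the dimension test, the Cayley–Menger determinant recovers the k-volume of a simplex from pairwise distances alone. The sketch below is our minimal illustration with a naive determinant, not the paper's implementation.

import math

def det(m):
    # Laplace expansion along the first row; fine for small matrices.
    n = len(m)
    if n == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] *
               det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(n))

def cm_volume_sq(d2):
    # d2: (k+1)x(k+1) matrix of squared pairwise distances between the
    # k+1 vertices of a k-simplex. Returns the squared k-volume.
    k = len(d2) - 1
    cm = [[0] + [1] * (k + 1)] + [[1] + list(row) for row in d2]
    coef = (-1) ** (k + 1) / (2 ** k * math.factorial(k) ** 2)
    return coef * det(cm)

# Two points at distance 2: the "1-volume" (length) should be 2.
print(cm_volume_sq([[0, 4], [4, 0]]) ** 0.5)            # 2.0
# Right triangle with legs 1, 1 and hypotenuse sqrt(2): area 0.5.
print(cm_volume_sq([[0, 1, 1], [1, 0, 2], [1, 2, 0]]) ** 0.5)  # 0.5

A near-zero k-volume for every k-simplex sampled from the data signals that the points effectively occupy fewer than k dimensions, which is the kind of geometric evidence such a dimension search can exploit.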
Article
Full-text available
Similarity search consists of retrieving all the objects in a database that are similar or relevant to a given query. It is currently a topic of great interest to the scientific community owing to its many fields of application, such as word and image search on the World Wide Web, pattern recognition, plagiarism detection, and multimedia databases, among others. Similarity (or proximity) search is modelled mathematically through a metric space, in which objects are treated as black boxes and the only available information is the distance from one object to the others. In general, computing the distance function is costly, and search systems must handle a high rate of queries per unit of time. To optimize this processing, numerous metric structures have been developed; they work as indexes, preprocessing the data in order to reduce the number of distance evaluations at search time. On the other hand, the need to process large volumes of data makes the use of a structure impractical in real applications unless it considers parallel processing environments. A number of technologies exist for implementing parallel processing; among the most current are those based on multi-CPU (multi-core) and GPU/multi-GPU architectures, which are attractive because of their high performance and low cost. This technical report addresses similarity search and the implementation of metric structures on parallel environments. It presents the state of the art in similarity search with metric structures and in parallelization technologies, and provides comparative analyses of experiments aimed at identifying the behaviour of a selected set of metric spaces and metric structures on multicore- and GPU-based processing platforms.
Article
Full-text available
The problem of searching the set of keys in a file to find a key which is closest to a given query key is discussed. After “closest”, in terms of a metric on the key space, is suitably defined, three file structures are presented together with their corresponding search algorithms, which are intended to reduce the number of comparisons required to achieve the desired result. These methods are derived using certain inequalities satisfied by metrics and by graph-theoretic concepts. Some empirical results are presented which compare the efficiency of the methods.
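One classic structure in this lineage, commonly known as the BK-tree, stores each child under its integer distance to the parent key; the triangle inequality then confines a range search to the subtrees whose edge labels lie within [d - r, d + r] of the node's distance d to the query. The sketch below is our illustration for a discrete metric, not code from the paper.

class BKTree:
    def __init__(self, dist):
        self.dist, self.root = dist, None   # node layout: [key, {dist: child}]

    def add(self, key):
        if self.root is None:
            self.root = [key, {}]
            return
        node = self.root
        while True:
            d = self.dist(key, node[0])
            if d in node[1]:
                node = node[1][d]           # descend along the matching edge
            else:
                node[1][d] = [key, {}]
                return

    def search(self, q, r):
        out, stack = [], [self.root] if self.root else []
        while stack:
            key, children = stack.pop()
            d = self.dist(q, key)
            if d <= r:
                out.append(key)
            # Only subtrees with edge label in [d - r, d + r] can hold matches.
            stack.extend(c for e, c in children.items() if d - r <= e <= d + r)
        return out

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

t = BKTree(hamming)
for w in ["0000", "0011", "0101", "1111", "1000"]:
    t.add(w)
print(t.search("0001", 1))   # ['0000', '0011', '0101'] in some order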
Article
Full-text available
Relational models are frequently used in high-level computer vision. Finding a correspondence between a relational model and an image description is an important operation in the analysis of scenes. In this paper the process of finding the correspondence is formalized by defining a general relational distance measure that computes a numeric distance between any two relational descriptions: a model and an image description, two models, or two image descriptions. The distance measure is proved to be a metric, and is illustrated with examples of distance between object models. A variant measure used in our past studies is shown not to be a metric.
Article
Full-text available
A scheme to answer best-match queries from a file containing a collection of objects is described. A best-match query is to find the objects in the file that are closest (according to some (dis)similarity measure) to a given target. Previous work [5, 331] suggests that one can reduce the number of comparisons required to achieve the desired results using the triangle inequality, starting with a data structure for the file that reflects some precomputed intrafile distances. We generalize the technique to allow the optimum use of any given set of precomputed intrafile distances. Some empirical results are presented which illustrate the effectiveness of our scheme, and its performance relative to previous algorithms.
Conference Paper
Full-text available
Given a set of n points or 'prototypes' and another point or 'test sample', the authors present an algorithm that finds a prototype that is a nearest neighbour of the test sample by computing only a constant number of distances on average. This is achieved through a preprocessing procedure that computes a number of distances and uses an amount of memory that grow only linearly with n. The algorithm is an improvement of the previously introduced AESA algorithm and, as such, does not assume the data to be structured into a vector space, making use only of the metric properties of the given distance.
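The search loop the abstract alludes to can be sketched over a linear-size table of object-to-prototype distances. This is our simplified reconstruction (random prototype selection, candidate order fixed once the query-pivot distances are known), not the published algorithm: candidates are visited in order of their triangle-inequality lower bound, and the scan stops as soon as the best surviving bound cannot beat the best distance found.

import random

def euclid(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def preprocess(data, n_pivots, dist=euclid):
    pivots = random.sample(data, n_pivots)  # the paper selects these better
    table = [[dist(x, p) for p in pivots] for x in data]  # O(n*k) memory
    return pivots, table

def nn_search(q, data, pivots, table, dist=euclid):
    dq = [dist(q, p) for p in pivots]       # k distance computations
    # Lower bound on d(q, x) for every object, from the triangle inequality.
    lb = [max(abs(a - b) for a, b in zip(dq, row)) for row in table]
    order = sorted(range(len(data)), key=lambda i: lb[i])  # best bound first
    best, best_d, n_dists = None, float("inf"), len(pivots)
    for i in order:
        if lb[i] >= best_d:                 # eliminate: no remaining candidate
            break                           # can beat the current best
        d = dist(q, data[i])                # approximate: try best candidate
        n_dists += 1
        if d < best_d:
            best, best_d = data[i], d
    return best, best_d, n_dists

random.seed(2)
data = [tuple(random.random() for _ in range(6)) for _ in range(2000)]
pivots, table = preprocess(data, 12)
q = tuple(random.random() for _ in range(6))
print(nn_search(q, data, pivots, table)[1:])   # (NN distance, #distance evals)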
Article
Experiments and results of the application of the Approximating and Eliminating Search Algorithm (AESA) to multi-speaker data are reported. Previous (single-speaker) results had already shown that the performance (speed) of the AESA remains largely insensitive to increases in the size of the dictionary, while a very strong (exponential) tendency to higher performance is exhibited as the test utterances come close to their corresponding prototypes. Following these results, we show in this paper that, by increasing the number of tokens included in dictionaries with multiply represented words, a simultaneous reduction can be achieved in both the error rate and the number of distance computations required. The speech data used in the experiments correspond to the Spanish digit vocabulary uttered several times by 10 different male and female speakers, and it has been found that very accurate (>99%) recognition of this vocabulary can be achieved while requiring only about 5 DTW computations on average.
A method to determine a distance measure between two nonhierarchical attributed relational graphs is presented. In order to apply this distance measure, the graphs are characterised by descriptive graph grammars (DGG). The proposed distance measure is based on the computation of the minimum number of modifications required to transform an input graph into the reference one. Specifically, the distance measure is defined as the cost of recognition of nodes plus the number of transformations which include node insertion, node deletion, branch insertion, branch deletion, node label substitution and branch label substitution. The major difference between the proposed distance measure and the other ones is the consideration of the cost of recognition of nodes in the distance computation. In order to do this, the principal features of the nodes are described by one or several cost functions which are used to compute the similarity between the input nodes and the reference ones. Finally, an application of this distance measure to the recognition of lower case handwritten English characters is presented.
Article
The authors present an efficient algorithm for fast nearest-neighbour search in multidimensional space under a so called approximation-elimination framework. The algorithm is based on an approximation procedure which selects codevectors for distance computation in the close proximity of the test vector and eliminates codevectors using the triangle inequality based elimination. The algorithm is studied in the context of vector quantization of speech and compared with related algorithms proposed earlier. It is shown to be more efficient in terms of reducing the main search complexity, overhead costs and storage.
Article
The art and science of speech recognition have been advanced to the state where it is now possible to communicate reliably with a computer by speaking to it in a disciplined manner using a vocabulary of moderate size. It is the purpose of this paper to outline two aspects of speech-recognition research. First, we discuss word recognition as a classical pattern-recognition problem and show how some fundamental concepts of signal processing, information theory, and computer science can be combined to give us the capability of robust recognition of isolated words and simple connected word sequences. We then describe methods whereby these principles, augmented by modern theories of formal language and semantic analysis, can be used to study some of the more general problems in speech recognition. It is anticipated that these methods will ultimately lead to accurate mechanical recognition of fluent speech under certain controlled conditions.
Article
The intrinsic dimensionality of a set of patterns is important in determining an appropriate number of features for representing the data and whether a reasonable two- or three-dimensional representation of the data exists. We propose an intuitively appealing, noniterative estimator for intrinsic dimensionality which is based on near-neighbor information. We give plausible arguments supporting the consistency of this estimator. The method works well in identifying the true dimensionality for a variety of artificial data sets and is fairly insensitive to the number of samples and to the algorithmic parameters. Comparisons between this new method and the global eigenvalue approach demonstrate the utility of our estimator.
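A bare-bones near-neighbour estimator in this spirit (ours; the paper's estimator adds bias corrections that this sketch omits): in D dimensions the mean distance to the k-th nearest neighbour grows roughly like k^(1/D), so the reciprocal of the slope of log(mean k-NN distance) against log k estimates the dimensionality.

import math, random

def knn_dists(data, kmax):
    # Mean distance to the k-th nearest neighbour, for k = 1..kmax.
    sums = [0.0] * kmax
    for i, x in enumerate(data):
        ds = sorted(math.dist(x, y) for j, y in enumerate(data) if j != i)
        for k in range(kmax):
            sums[k] += ds[k]
    return [s / len(data) for s in sums]

def estimate_id(data, kmax=10):
    ys = [math.log(r) for r in knn_dists(data, kmax)]
    xs = [math.log(k + 1) for k in range(kmax)]
    mx, my = sum(xs) / kmax, sum(ys) / kmax
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return 1.0 / slope

random.seed(3)
# 3-D data embedded in 10-D: the estimate should come out near 3.
pts = [tuple(random.random() for _ in range(3)) + (0.0,) * 7 for _ in range(800)]
print(round(estimate_id(pts), 1))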
Article
A new algorithm is proposed which finds the Nearest Neighbour of a given sample in approximately constant average time complexity (i.e. independent of the data set size). The algorithm does not assume the data to be structured into any vector space, and only makes use of the metric properties of the given distance, thus being of general use in many present applications of Pattern Recognition. Simulation results for different sizes, metrics, and dimensions show that the average number of distance computations is less than 4 in a 2-dimensional space, and less than 60 in 10 dimensions. These results are obtained at the expense of a quadratic space complexity and, for data-set sizes over 1000 samples, represent a time complexity improvement of at least one order of magnitude over the best results reported until now for the same task.
Article
Recently, the Approximating and Eliminating Search Algorithm (AESA) was introduced to search for Nearest Neighbours in asymptotically constant average time complexity. In this paper, a new development of the AESA is presented which formally adheres to the general algorithmic strategy of (best-first) Branch and Bound (B&B). This development naturally suggests a new selection or Approximating Criterion which: (a) is cheaper to compute, (b) significantly reduces the “overhead” or computation not allotted to distance computation, (c) leads to a more compact and clear presentation of the AESA, and (d) slightly but consistently reduces the average number of required distance computations. Experimental evidence assessing the last-mentioned improvement is presented.
Article
An efficient approximation-elimination search algorithm for fast nearest-neighbour search is proposed based on a spherical distance coordinate formulation, where a vector in K-dimensional space is represented uniquely by its distances from K + 1 fixed points. The proposed algorithm uses triangle-inequality-based elimination rules which are applicable to search using metric distance measures. It is a more efficient fixed-point equivalent of the Approximation Elimination Search Algorithm (AESA) proposed earlier by Vidal [2]. In comparison to AESA, which has a very high O(N²) storage complexity, the proposed algorithm uses only O(N) storage with very low approximation-elimination computational overheads, while achieving complexity reductions closely comparable to AESA. The algorithm is used for fast vector quantization of speech waveforms and is observed to have O(K + 1) average complexity.
Article
Improvements to the exhaustive search method of best-match file searching have previously been achieved by doing a preprocessing step involving the calculation of distances from a reference point. This paper discusses the proper choice of reference points and extends the previous algorithm to use more than one reference point. It is shown that reference points should be located outside of data clusters. The results of computer simulations are presented which show that large improvements can be achieved by the proper choice and location of multiple reference points.
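The paper's central claim is easy to illustrate with a toy simulation (ours; the constants are arbitrary): a reference point outside the cluster spreads the precomputed distances d(x, p) more widely than a central one, and a wider spread makes the pruning test |d(q, p) - d(x, p)| > r fire more often.

import math, random, statistics

random.seed(4)
cluster = [tuple(random.gauss(0, 1) for _ in range(5)) for _ in range(2000)]

central = (0.0,) * 5                # inside the data cluster
outlying = (6.0,) + (0.0,) * 4      # well outside it

for name, p in [("central", central), ("outlying", outlying)]:
    ds = [math.dist(x, p) for x in cluster]
    print(name, round(statistics.pstdev(ds), 2))
# The outlying reference point shows a clearly larger spread of distances.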
Article
Fast search algorithms are proposed and studied for vector quantization encoding using the K-dimensional (K-d) tree structure. Here, the emphasis is on the optimal design of the K-d tree for efficient nearest neighbor search in multidimensional space under a bucket-Voronoi intersection search framework. Efficient optimization criteria and procedures are proposed for designing the K-d tree, for the case when the test data distribution is available (as in vector quantization application in the form of training data) as well as for the case when the test data distribution is not available and only the Voronoi intersection information is to be used. The criteria and bucket-Voronoi intersection search procedure are studied in the context of vector quantization encoding of speech waveform. They are empirically observed to achieve constant search complexity for O(log N) tree depths and are found to be more efficient in reducing the search complexity. A geometric interpretation is given for the maximum product criterion, explaining reasons for its inefficiency with respect to the optimization criteria.
Article
Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P)/L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (the length of P). It is shown that, in general, d(X, Y) cannot be computed by first obtaining the conventional (unnormalized) edit distance between X and Y and then normalizing this value by the length of the corresponding editing path. In order to compute normalized edit distances, an algorithm that can be implemented to work in O(m·n²) time and O(n²) memory space is proposed, where m and n are the lengths of the strings under consideration and m ≥ n. Experiments in handwritten digit recognition are presented, revealing that the normalized edit distance consistently provides better results than both unnormalized and post-normalized classical edit distances.
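The definition suggests a direct dynamic program over editing-path lengths, which also makes the non-equivalence with post-normalization concrete. The sketch below is that naive DP (ours, asymptotically slower than the paper's O(m·n²) algorithm; it assumes nonempty strings, and the weights are chosen by us for the example).

import math

def ned(X, Y, w_sub=1.0, w_indel=1.0, w_match=0.0):
    # dp[l][i][j]: minimum total weight of an editing path with exactly l
    # elementary operations turning X[:i] into Y[:j] (matches count as ops).
    m, n = len(X), len(Y)
    INF = math.inf
    maxlen = m + n
    dp = [[[INF] * (n + 1) for _ in range(m + 1)] for _ in range(maxlen + 1)]
    dp[0][0][0] = 0.0
    for l in range(1, maxlen + 1):
        for i in range(m + 1):
            for j in range(n + 1):
                best = INF
                if i > 0:                        # delete X[i-1]
                    best = min(best, dp[l - 1][i - 1][j] + w_indel)
                if j > 0:                        # insert Y[j-1]
                    best = min(best, dp[l - 1][i][j - 1] + w_indel)
                if i > 0 and j > 0:              # substitute or match
                    c = w_match if X[i - 1] == Y[j - 1] else w_sub
                    best = min(best, dp[l - 1][i - 1][j - 1] + c)
                dp[l][i][j] = best
    return min(dp[l][m][n] / l for l in range(1, maxlen + 1)
               if dp[l][m][n] < INF)

# Post-normalising is not the same thing:
X, Y = "abb", "bba"
print(ned(X, Y, w_sub=1.0, w_indel=1.2))  # 0.6: del a, match b, match b, ins a
# The minimum-weight path uses two substitutions (weight 2, length 3), so
# computing the conventional edit distance first and dividing by that path's
# length would give 2/3 instead.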
The intrinsic dimensionality (ID) of different sets of isolated word utterances is estimated through a method recently proposed by K.W. Pettis et al. (1979). The results show ID values ranging from 3 to 15, which are consistent with the intuitive degree of difficulty associated with the sets considered. Also, some speculative applications of ID estimation are discussed.
The Approximating and Eliminating Search Algorithm (AESA) was recently introduced for finding nearest neighbours in metric spaces. Although the AESA was originally developed for reducing the time complexity of dynamic time-warping isolated word recognition (DTW-IWR), only rather limited experiments had previously been carried out to check its performance in this task. A set of experiments aimed at filling this gap is reported. The main results show that the important features reflected in previous simulation experiments also hold for real speech samples. With single-speaker dictionaries of up to 200 words, and for most of the different speech parameterizations, local metrics, and DTW productions tested, the AESA consistently found the appropriate prototype while requiring only an average of 7-12 DTW computations (94-96% savings for 200 words), with a strong tendency to need fewer computations when the samples are close to their corresponding prototypes.
Article
Computation of the k-nearest neighbors generally requires a large number of expensive distance computations. The method of branch and bound is implemented in the present algorithm to facilitate rapid calculation of the k-nearest neighbors, by eliminating the necessity of calculating many distances. Experimental results demonstrate the efficiency of the algorithm. Typically, an average of only 61 distance computations was made to find the nearest neighbor of a test sample among 1000 design samples.
Casacuberta, F. and E. Vidal (1987). Reconocimiento Automático del Habla. Marcombo.
Dasarathy, B. (1991). Nearest Neighbour (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Soc. Press, Silver Spring, MD.
Ramasubramanian, V. (1991). Fast Algorithms for Nearest-Neighbour Search and Application to Vector Quantization. PhD dissertation, University of Bombay.
Micó, M.L., J. Oncina and E. Vidal (1991). Algoritmo para encontrar el vecino más próximo en un tiempo medio constante con una complejidad espacial lineal. Tech. Report DSIC-II/14-91, Universidad Politécnica de Valencia.