Article

Indefinite Proximity Learning: A Review

Authors: Frank-Michael Schleif, Peter Tino

Abstract

Efficient learning of a data analysis task strongly depends on the data representation. Most methods rely on (symmetric) similarity or dissimilarity representations by means of metric inner products or distances, providing easy access to powerful mathematical formalisms like kernel or branch-and-bound approaches. Similarities and dissimilarities are, however, often naturally obtained by nonmetric proximity measures that cannot easily be handled by classical learning algorithms. Major efforts have been undertaken to provide approaches that can either directly be used for such data or to make standard methods available for these types of data. We provide a comprehensive survey for the field of learning with nonmetric proximities. First, we introduce the formalism used in nonmetric spaces and motivate specific treatments for nonmetric proximity data. Second, we provide a systematization of the various approaches. For each category of approaches, we provide a comparative discussion of the individual algorithms and address complexity issues and generalization properties. In a summarizing section, we provide a larger experimental study for the majority of the algorithms on standard data sets. We also address the problem of large-scale proximity learning, which is often overlooked in this context and of major importance to make the method relevant in practice. The algorithms we discuss are in general applicable for proximity-based clustering, one-class classification, classification, regression, and embedding approaches. In the experimental part, we focus on classification tasks.
... The kernel theory (Hofmann et al., 2008) is based on positive semi-definite (psd) and symmetric matrices, but there is recent research that argues for also considering indefinite kernels (Mehrkanoon et al., 2018; Schleif and Tino, 2015; Schleif et al., 2018). Such kernels emerge from domain-specific object comparisons like protein sequence alignments or, as in the case of the truncated Manhattan (TL1) kernel (Huang et al., 2018), are motivated by local learning properties. ...
... Such kernels emerge from domain-specific object comparisons like protein sequence alignments or, as in the case of the truncated Manhattan (TL1) kernel (Huang et al., 2018), are motivated by local learning properties. Since these kernels do not fulfill the psd assumption of kernel theory, new techniques were proposed to handle such non-psd matrices (Loosli et al., 2015; Liu et al., 2018; Oglic and Gärtner, 2019; Gisbrecht and Schleif, 2015) or to convert the matrices to be psd (Schleif and Tino, 2015). As briefly noticed in , MEKA can lead to non-psd matrix approximations, and hence this paper aims to discuss different aspects of this problem and to show how these limitations can be overcome. ...
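To make the TL1 example concrete, here is a minimal sketch that computes a TL1 Gram matrix, assuming the common form $k(x, y) = \max(\rho - \lVert x - y\rVert_1, 0)$; the bandwidth `rho` and the toy data are illustrative choices, not taken from the cited works.

```python
import numpy as np

def tl1_kernel(X, Y, rho=2.0):
    """Truncated Manhattan (TL1) kernel, assumed form:
    k(x, y) = max(rho - ||x - y||_1, 0)."""
    # Pairwise L1 distances between rows of X and rows of Y.
    d1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2)
    return np.maximum(rho - d1, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # toy data
K = tl1_kernel(X, X)
# The minimum eigenvalue is typically negative here, i.e. K is indefinite.
print(np.linalg.eigvalsh(K).min())
```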
... If a kernel function leads to a non-psd kernel matrix, a Kreĭn space is induced instead (Duin and Pekalska, 2005; Schleif and Tino, 2015). A Kreĭn space $\mathcal{K}$ is spanned by two orthogonal Hilbert spaces $\mathcal{H}_+$ and $\mathcal{H}_-$ such that $\langle x, y \rangle_{\mathcal{K}} = \langle x_+, y_+ \rangle_{\mathcal{H}_+} - \langle x_-, y_- \rangle_{\mathcal{H}_-}$ for $x, y \in \mathcal{K}$, with $x_+, y_+$ the projections onto $\mathcal{H}_+$ and $x_-, y_-$ the projections onto $\mathcal{H}_-$. ...
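For a finite sample this decomposition can be made explicit: an eigendecomposition of a symmetric indefinite Gram matrix splits it into the two Hilbert-space parts. The following sketch (toy data, variable names ours) verifies the Kreĭn inner product identity numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
K = (A + A.T) / 2.0                       # symmetric but indefinite Gram matrix

# The eigendecomposition separates the two Hilbert-space components.
vals, vecs = np.linalg.eigh(K)
pos, neg = vals > 0, vals < 0

# Sample embeddings into H+ and H- (one row per object).
X_pos = vecs[:, pos] * np.sqrt(vals[pos])
X_neg = vecs[:, neg] * np.sqrt(-vals[neg])

# Krein inner product: <x,y>_K = <x+,y+>_{H+} - <x-,y->_{H-}.
K_reconstructed = X_pos @ X_pos.T - X_neg @ X_neg.T
assert np.allclose(K, K_reconstructed)
```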
Preprint
Full-text available
Matrix approximations are a key element in large-scale algebraic machine learning approaches. The recently proposed method MEKA (Si et al., 2014) effectively employs two common assumptions in Hilbert spaces: the low-rank property of an inner product matrix obtained from a shift-invariant kernel function and a data compactness hypothesis by means of an inherent block-cluster structure. In this work, we extend MEKA to be applicable not only for shift-invariant kernels but also for non-stationary kernels like polynomial kernels and an extreme learning kernel. We also address in detail how to handle non-positive semi-definite kernel functions within MEKA, either caused by the approximation itself or by the intentional use of general kernel functions. We present a Lanczos-based estimation of a spectrum shift to develop a stable positive semi-definite MEKA approximation, also usable in classical convex optimization frameworks. Furthermore, we support our findings with theoretical considerations and a variety of experiments on synthetic and real-world data.
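As a rough illustration of the spectrum-shift idea (not the paper's full Lanczos-based estimator), one can estimate the smallest eigenvalue with an iterative Lanczos-type solver and shift the diagonal accordingly; tolerances and toy data below are assumptions.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def psd_shift(K, eps=1e-10):
    """Make a symmetric matrix psd by shifting its eigenspectrum.

    The smallest algebraic eigenvalue is estimated with a Lanczos-type
    solver (ARPACK via eigsh), which only needs matrix-vector products.
    """
    lam_min = eigsh(K, k=1, which='SA', return_eigenvectors=False)[0]
    if lam_min < 0:
        # All eigenvalues of K + |lam_min| * I are >= 0.
        K = K + (abs(lam_min) + eps) * np.eye(K.shape[0])
    return K

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 100))
K = psd_shift((A + A.T) / 2.0)            # toy symmetric indefinite input
print(np.linalg.eigvalsh(K).min())        # >= 0 up to numerical error
```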
... In general, non-psd kernels arise much more frequently than typically assumed and can already occur due to normalization procedures or careless parameter settings [5]. Several correction and adjustment procedures were proposed (eigenspectrum correction, proxy matrix learning or dedicated models for indefinite kernels) [15]. We consider a finite collection of objects $X = \{x_i\}$, $i = 1, \dots$ ...
... The $\langle \cdot, \cdot \rangle$ can be any symmetric similarity function. In the case of Mercer kernels this could be the Euclidean inner product or other types of kernels, but also domain-specific non-psd similarities such as alignment functions for sequential data and the like [15]. ...
... This is achieved by an eigendecomposition $K_m = Q_m \Lambda_m Q_m^T$, with $\Lambda_m$ containing the eigenvalues and $Q_m$ the corresponding eigenvectors of $K_m$, and by modifications of $\Lambda_m$ (like clip, flip, shift, square) ensuring that $\tilde{\lambda}_{m_i} = \tilde{\Lambda}_m[ii] \geq 0$ for $i = 1, \dots, N$, which eventually results in a positive semi-definite $\tilde{K}_m = Q_m \tilde{\Lambda}_m Q_m^T$ [15]. Instead of modifying the eigenspectrum of $K_m$ directly, the authors in [10] proposed to learn a psd proxy matrix with maximum alignment to $K_m$. ...
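A minimal sketch of these four standard eigenspectrum corrections, assuming a symmetric input matrix; function and variable names are ours.

```python
import numpy as np

def correct_spectrum(K, mode='clip'):
    """Eigenvalue corrections for a symmetric indefinite matrix K.

    clip:   set negative eigenvalues to zero
    flip:   replace eigenvalues by their absolute values
    shift:  add |smallest negative eigenvalue| to all eigenvalues
    square: square the eigenvalues (equivalent to K @ K for symmetric K)
    """
    vals, vecs = np.linalg.eigh(K)
    if mode == 'clip':
        vals = np.maximum(vals, 0.0)
    elif mode == 'flip':
        vals = np.abs(vals)
    elif mode == 'shift':
        vals = vals - min(vals.min(), 0.0)
    elif mode == 'square':
        vals = vals ** 2
    return (vecs * vals) @ vecs.T          # Q diag(vals) Q^T

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 40))
K = (A + A.T) / 2.0
for mode in ('clip', 'flip', 'shift', 'square'):
    print(mode, np.linalg.eigvalsh(correct_spectrum(K, mode)).min() >= -1e-9)
```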
... However, this is not sufficient to cover all applications of relational data. Indeed, as described in Schleif and Tino (2015), relational data can be described by measures of resemblance or dissemblance that may fall outside the Euclidean framework. We then speak of similarity or dissimilarity, whose formal definition is not completely settled in the literature, but which we formalize as follows: a dissimilarity is a measure of dissemblance, $\delta : X \times X \to \mathbb{R}_+$, that can be evaluated for any pair of observations: $\forall\, x_i, x_{i'} \in X$, $\delta_{ii'} := \delta(x_i, x_{i'})$. ...
... $S$ is then simply a kernel matrix and the formalism of the previous section applies; - the dissimilarity matrix $\Delta$ is a Euclidean distance matrix (Schoenberg, 1935; Young and Householder, 1938). Schleif and Tino (2015) review the approaches that allow non-Euclidean similarity measures to be analyzed. Schematically, these fall into two main families: one consists of transforming non-Euclidean similarity data into a kernel (by spectrum correction, spectral truncation or spectral inversion approaches (Chen et al., 2009), or by embedding into a Euclidean space while minimizing the distortion with respect to the initial measures (Kruskal, 1964)). ...
... Despite the great developments in theory and application, positive semi-definite kernels may be inappropriate in situations where non-metric pairwise proximities are used [31]. For example, pairwise proximities may be asymmetric and negative for object comparisons in text documents, graphs, and semi-groups. ...
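A common preprocessing step for such proximities (our illustration, not prescribed by the cited work) is to symmetrize the matrix and then test for definiteness:

```python
import numpy as np

def symmetrize_and_check(S, tol=1e-10):
    """Symmetrize a proximity matrix and test whether the result is psd."""
    S_sym = 0.5 * (S + S.T)               # standard symmetrization
    lam_min = np.linalg.eigvalsh(S_sym).min()
    return S_sym, lam_min >= -tol

rng = np.random.default_rng(4)
S = rng.normal(size=(30, 30))             # asymmetric, with negative entries
S_sym, is_psd = symmetrize_and_check(S)
print(is_psd)                             # typically False for such data
```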
Preprint
Full-text available
In this paper, we consider coefficient-based regularized distribution regression, which aims to regress from probability measures to real-valued responses over a reproducing kernel Hilbert space (RKHS), where the regularization is put on the coefficients and the kernels are assumed to be indefinite. The algorithm involves two stages of sampling: the first-stage sample consists of distributions, and the second-stage sample is obtained from these distributions. Asymptotic behaviors of the algorithm in different regularity ranges of the regression function are comprehensively studied, and learning rates are derived via integral operator techniques. We obtain the optimal rates under some mild conditions, which match the one-stage sampled minimax optimal rate. Compared with the kernel methods for distribution regression in the literature, the algorithm under consideration does not require the kernel to be symmetric and positive semi-definite and hence provides a simple paradigm for designing indefinite kernel methods, which enriches the theme of distribution regression. To the best of our knowledge, this is the first result for distribution regression with indefinite kernels, and our algorithm can improve the saturation effect.
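The key point that coefficient-based regularization remains well posed without symmetry or positive semi-definiteness can be seen in a simplified one-stage sketch below; the two-stage distribution setting and the paper's regularization scaling are omitted, and the toy kernel is an assumption.

```python
import numpy as np

def fit_coefficient_regularized(K, y, lam=1e-2):
    """Coefficient-based regularized least squares.

    Solves  min_a ||K a - y||^2 + lam ||a||^2,  which is well posed for
    any square kernel matrix K: neither symmetry nor positive
    semi-definiteness is required.
    """
    n = K.shape[0]
    return np.linalg.solve(K.T @ K + lam * np.eye(n), K.T @ y)

def predict(K_new, alpha):
    """K_new[i, j] = k(x_new_i, x_train_j); f(x) = sum_j alpha_j k(x, x_j)."""
    return K_new @ alpha

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 2))
y = np.sin(X[:, 0])
K = -np.abs(X[:, None] - X[None]).sum(-1)   # toy indefinite similarity
alpha = fit_coefficient_regularized(K, y)
```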
... However, these methods usually assume specific properties of the proximity matrices, such as positive definiteness or the Euclidean property, for mathematical consistency [65]. Unfortunately, these assumptions are frequently violated in practice, such that the methods are either not applicable or may yield incorrect results [82]. However, in the case of small deviations from the strong setting, the relational methods frequently still achieve promising performance using moderate correction procedures [65], [66], [78]. ...
Article
The large amounts of biological sequence data generated during the last decades, together with algorithmic and hardware improvements, have made it possible to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and to adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With the realization of the disadvantages of alignments for sequence comparison, some typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of 1) the sequence data transformation, comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss-free data transformation intended to obtain an appropriate representation, whereas feature generation is inevitably information-lossy, with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of available methods and assists in identifying suitable methods in machine learning and data analysis to compare sequences under these premises.
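A toy illustration of this coding versus feature-generation distinction (alphabet, k-mer length, and function names are our assumptions): one-hot coding is invertible, while k-mer counting deliberately discards positional information.

```python
import numpy as np
from itertools import product

ALPHABET = 'ACGT'

def one_hot(seq):
    """Coding: information-loss-free transformation; the sequence can
    be recovered exactly from its one-hot matrix."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    M = np.zeros((len(seq), len(ALPHABET)))
    for pos, c in enumerate(seq):
        M[pos, idx[c]] = 1.0
    return M

def kmer_counts(seq, k=2):
    """Feature generation: lossy transformation keeping composition
    information; the positional order is discarded."""
    kmers = [''.join(p) for p in product(ALPHABET, repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return np.array([counts[km] for km in kmers], dtype=float)

print(one_hot('ACGT').shape)        # (4, 4), invertible encoding
print(kmer_counts('ACGTAC').sum())  # 5 overlapping 2-mers
```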
... When such conditions do not hold, D is simply called a dissimilarity dataset, which is a particular case of proximity or relational datasets. Schleif and Tino (2015) have proposed a typology of such datasets and described different approaches that can be used to extend statistical or learning methods defined for Euclidean data to such proximity data. In brief, the first main strategy consists in finding a way to turn a non-Euclidean dissimilarity into a Euclidean distance that is the closest (in some sense) to the original dissimilarity. ...
Thesis
The spatial organization of the genome inside the cell nucleus has a major impact on the regulation of gene expression, with important implications for fetal development, cell differentiation, and the development of diseases. This is the initial motivation of this work, whose object is the study of the three-dimensional structure of the genetic material and of its variations based on Hi-C data. First, we address the modeling of the hierarchical structure of the genome from Hi-C data. We study extensions of a natural statistical tool for the examination of hierarchical structures, agglomerative hierarchical clustering (AHC), to justify its application to Hi-C data. This justifies modeling the structures by binary trees (resulting from AHC). We then develop a method for comparing two samples of trees in order to identify significant differences.
Article
Pairwise learning usually refers to learning problems that work with pairs of training samples, such as ranking, similarity and metric learning, and AUC maximization. To overcome the challenge of pairwise learning in large-scale computation, this paper introduces a Nyström sampling approach to the coefficient-based regularized pairwise algorithm in the context of kernel networks. Our theorems establish that the obtained Nyström estimator achieves the minimax error over all estimators using the whole data, provided that the subsampling level is not too small. We derive the functional relation between the subsampling level and the regularization parameter that guarantees computation cost reduction and optimal asymptotic behavior simultaneously. The Nyström coefficient-based pairwise learning method does not require the kernel to be symmetric or positive semi-definite, which provides more flexibility and adaptivity in the learning process. We apply the method to the bipartite ranking problem, which improves the state-of-the-art theoretical results in previous works. By developing probability inequalities for U-statistics on Hilbert–Schmidt operators, we provide new mathematical tools for handling pairs of examples involved in pairwise learning.
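For reference, a generic Nyström sketch of the subsampling idea (landmark count, kernel, and data are illustrative; this is not the paper's estimator): the full Gram matrix is approximated from a column block and the landmark block.

```python
import numpy as np

def nystroem(kernel, X, m, rng):
    """Nystrom approximation K ~= C pinv(W) C^T from m landmark points."""
    idx = rng.choice(len(X), size=m, replace=False)
    C = kernel(X, X[idx])                 # n x m block of the Gram matrix
    W = C[idx]                            # m x m landmark block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
rbf = lambda A, B: np.exp(-((A[:, None] - B[None]) ** 2).sum(-1))
K_approx = nystroem(rbf, X, m=40, rng=rng)
K_exact = rbf(X, X)
print(np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact))
```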
Article
High-throughput sequencing technology leads to a significant increase in the number of generated protein sequences, and the anchor database UniProt doubles approximately every two years. This large set of annotated data is used by many bioinformatics algorithms. Searching within these databases, typically without using any annotations, is challenging due to the variable lengths of the entries and the non-standard comparison measures used. A promising strategy to address these issues is to find fixed-length, information-preserving representations of the variable-length protein sequences. A systematic algorithmic evaluation of the proposals is, however, surprisingly missing. In this work, we analyze how different algorithms perform in generating general protein sequence representations and provide a thorough evaluation framework, PROVAL. The strategies range from a proximity representation using the classical Smith-Waterman algorithm to state-of-the-art embedding techniques by means of transformer networks. The methods are evaluated by, e.g., molecular function classification, embedding space visualization, computational complexity, and the carbon footprint.
Thesis
In intelligent data analysis there are two common data representations, namely the vectorial representation and the pairwise representation. The translation of the latter into the former is called embedding, a non-trivial problem of continuing scientific interest. While the pairwise representation imposes fewer restrictions on the data and is thus potentially able to capture richer structure, the vectorial representation offers many powerful data-analytic tools, since probabilistic models for the data are available in such spaces. Pairwise data that satisfy restrictive conditions can be mapped faithfully into a vectorial representation. Pairwise data for which this is not possible are called non-metric. This doctoral thesis is concerned with non-metric pairwise data. It is an investigative and exploratory study of non-metric pairwise data, based on theoretical and conceptual as well as empirical considerations. First, the reader is familiarized with the two data representations. Pairwise data are illustrated and the first problems are addressed. Common embedding methods are presented. Then it is shown that these two data representations coincide for a certain class of learning algorithms, even when the pairwise data are non-metric and traditional techniques lead only to approximate vector representations. The newly developed embedding is exact with respect to structure. The main emphasis lies on capturing the nature and the consequences of metric violations. Although the scientific community appears to be aware of the problem, to the best of the author's knowledge it has never been clearly formulated. Metric violations have commonly been regarded as a random byproduct of noise and have been treated mathematically accordingly. A simple model of metric violations shows that this assumption is wrong. A special embedding is used to visualize and interpret the information content of metric perturbations. Finally, it is shown that a simple algorithm, which evaluates structure via a stability index, can efficiently extract the structure encoded by metric violations.
Conference Paper
Showing the nearest neighbor is a useful explanation for the result of an automatic classification. Given distance measures, defined by experts, may be improved on the basis of a training set. We study several proposals to optimize such measures for nearest neighbor classification, explicitly including non-Euclidean measures. Some of them may directly improve the distance measure; others may construct a dissimilarity space in which the Euclidean distances show significantly better performance. Results are application dependent and raise the question of which characteristics of the original distance measures influence the possibilities of metric learning.
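The dissimilarity-space construction mentioned above can be sketched as follows (prototype choice and toy data are our assumptions): each object is represented by its vector of dissimilarities to a set of prototypes, after which Euclidean tools such as nearest-neighbor classifiers apply.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data with a precomputed L1 (city-block) dissimilarity matrix D.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
D = np.abs(X[:, None] - X[None]).sum(-1)

# Dissimilarity space: each object is described by its dissimilarities
# to a small set of prototypes (here chosen at random).
protos = rng.choice(100, size=10, replace=False)
V = D[:, protos]

clf = KNeighborsClassifier(n_neighbors=3).fit(V, y)
print(clf.score(V, y))                    # Euclidean NN in the new space
```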
Article
Heterogeneous data sets are typically represented in different feature spaces, making it difficult to analyze relationships spanning different data sets even when they are semantically related. Data fusion via space alignment can remedy this by integrating multiple data sets lying in different spaces into one common space. Given a set of reference correspondence data that share the same semantic meaning across different spaces, space alignment attempts to place the corresponding reference data as close together as possible, and accordingly, the entire data are aligned in a common space. Space alignment involves optimizing two potentially conflicting criteria: minimum deformation of the original relationships and maximum alignment between the different spaces. To solve this problem, we provide a novel graph embedding framework for space alignment, which converts each data set into a graph and assigns zero distance between reference correspondence pairs, resulting in a single graph. We propose a graph embedding method for fusion based on nonmetric multidimensional scaling (MDS). Its criterion, which uses rank order rather than distance, allows nonmetric MDS to effectively handle both deformation and alignment. Experiments using parallel data sets demonstrate that our approach works well in comparison to existing methods such as constrained Laplacian eigenmaps, Procrustes analysis, and tensor decomposition. We also present standard cross-domain information retrieval tests as well as interesting visualization examples using space alignment.
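One plausible minimal reading of this construction is sketched below using scipy and scikit-learn's nonmetric MDS; the shortest-path completion of the cross-set distances is our assumption, not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

def fuse_spaces(D1, D2, ref_pairs, eps=1e-8):
    """Fuse two symmetric dissimilarity data sets into one 2-D space.

    Builds a joint block-diagonal dissimilarity graph, links each
    reference correspondence pair (i, j) with a near-zero edge, completes
    the missing cross-set distances by shortest paths, and embeds the
    result with nonmetric MDS.
    """
    n1 = len(D1)
    big = np.full((n1 + len(D2), n1 + len(D2)), np.inf)
    big[:n1, :n1], big[n1:, n1:] = D1, D2
    for i, j in ref_pairs:                # i indexes D1, j indexes D2
        # eps instead of 0: dense csgraph input reads zeros as "no edge".
        big[i, n1 + j] = big[n1 + j, i] = eps
    D = shortest_path(big, method='D', directed=False)
    mds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
              random_state=0)
    return mds.fit_transform(D)
```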
Conference Paper
Spectral methods for manifold learning and clustering typically construct a graph weighted with affinities from a dataset and compute eigenvectors of a graph Laplacian. With large datasets, the eigendecomposition is too expensive and is usually approximated by solving for a smaller graph defined on a subset of the points (landmarks) and then applying the Nyström formula to estimate the eigenvectors over all points. This has the problem that the affinities between landmarks do not benefit from the remaining points and may poorly represent the data if few landmarks are used. We introduce a modified spectral problem that uses all data points by constraining the latent projection of each point to be a local linear function of the landmarks' latent projections. This constructs a new affinity matrix between landmarks that preserves manifold structure even with few landmarks, allows one to reduce the eigenproblem size, and defines a fast, nonlinear out-of-sample mapping.
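For context, the standard Nyström out-of-sample formula that this work modifies can be sketched as follows (variable names are ours; it assumes the retained landmark eigenvalues are nonzero):

```python
import numpy as np

def nystroem_eigvecs(K_mm, K_nm):
    """Extend eigenvectors of the landmark kernel K_mm (m x m) to all
    n points via the cross kernel K_nm (n x m):
        u_i(x) ~= (1 / lam_i) * k(x, landmarks) @ u_i
    Assumes the retained eigenvalues lam_i are nonzero.
    """
    lam, U = np.linalg.eigh(K_mm)
    order = np.argsort(lam)[::-1]          # leading eigenpairs first
    lam, U = lam[order], U[:, order]
    return (K_nm @ U) / lam                # columns are extended eigenvectors
```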
Article
In this paper, an entropy-regularized fuzzy clustering approach for non-Euclidean relational data and indefinite kernel data, which has not previously been discussed, is developed. It is important because relational data and kernel data are not always Euclidean and positive semi-definite, respectively. It is theoretically shown that an entropy-regularized approach for both non-Euclidean relational data and indefinite kernel data can be applied without using a β-spread transformation, and that two other options make the clustering results crisp for both data types. These results are in contrast to those from the standard approach. Numerical experiments are employed to verify the theoretical results, and the clustering accuracy of three entropy-regularized approaches for non-Euclidean relational data, and three for indefinite kernel data, is compared.
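For reference, the β-spread transformation that the proposed entropy-regularized approach avoids is simply a constant off-diagonal shift; a minimal sketch (function name ours):

```python
import numpy as np

def beta_spread(D, beta):
    """Beta-spread transformation D_beta = D + beta * (J - I): adds a
    constant to all off-diagonal dissimilarities, which for sufficiently
    large beta renders a non-Euclidean relational matrix Euclidean."""
    n = D.shape[0]
    return D + beta * (np.ones((n, n)) - np.eye(n))
```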