Fig. 3. Relative variance for data distributed as F Ã, as a function of p. We can see that a maximum is obtained for a rather large value of p, far from 1.
Source publication
Nearest neighbor search and many other numerical data analysis tools most often rely on the use of the Euclidean distance. When data are high dimensional, however, Euclidean distances seem to concentrate; all distances between pairs of data elements seem to be very similar. Therefore, the relevance of the Euclidean distance has been questioned...
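The concentration effect described in the abstract is easy to reproduce on synthetic data. The sketch below is an illustration only: it uses uniformly distributed toy data (not any dataset from the paper) and measures how the relative contrast of pairwise Euclidean distances shrinks as the dimension grows.

```python
# Illustrative only: concentration of pairwise Euclidean distances on
# synthetic uniform data as the dimension d grows.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))       # 500 points drawn uniformly in [0, 1]^d
    dist = pdist(X)                      # all pairwise Euclidean distances
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  mean={dist.mean():7.3f}  std={dist.std():.3f}  "
          f"relative contrast={contrast:.3f}")
```

As d increases, the mean distance grows while the standard deviation stays of the same order, so the relative contrast collapses.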
Citations
... We will see that the TCA map discriminates the two groups almost as well as the (meta-spec) map; furthermore it provides additional complementary information on the two groups through the interpretation of the words associated with each group, as is practised in CA. 6) All of the above-mentioned methods are based on the pairwise Euclidean distances among the 590 chapters, where each chapter is a vector of counts of 8260 words. The well-known phenomenon of the norm concentration of Euclidean distances, see for instance among others (François, Wertz, Verleysen 2007) and (Lee and Verleysen 2011), manifests itself clearly and differently in the high-dimensional data space and in the embedded low-dimensional visualization maps: this is the main reason that the maps produced by the 12 simple methods are not good enough. ...
We present an overview of taxicab correspondence analysis, a robust variant of correspondence analysis, for the visualization of extremely sparse contingency tables. In particular we visualize an extremely sparse textual data set of size 590 by 8265 concerning fragments of 8 sacred books recently introduced by Sah and Fokoué (2019) and studied in considerable detail with (12 + 1) dimension reduction methods (t-SNE, UMAP, PHATE, ...) by Ma, Sun and Zou (2022).
... The second axiom focuses on the effect of the data dimension on the distance δ, due to the so-called curse of dimensionality [9]. Growing dimensionality increases the average of pairwise distances while the variances remain constant [11,24,38]; thus the differences between distances become negligible. To be used for Label-T&C, CVM should be shift invariant [38,39] to cancel the shift of the average distances caused by the different dimensions of X and Z. ...
A common way to evaluate the reliability of dimensionality reduction (DR) embeddings is to quantify how well labeled classes form compact, mutually separated clusters in the embeddings. This approach is based on the assumption that the classes stay as clear clusters in the original high-dimensional space. However, in reality, this assumption can be violated; a single class can be fragmented into multiple separated clusters, and multiple classes can be merged into a single cluster. We thus cannot always assure the credibility of the evaluation using class labels. In this paper, we introduce two novel quality measures -- Label-Trustworthiness and Label-Continuity (Label-T&C) -- advancing the process of DR evaluation based on class labels. Instead of assuming that classes are well-clustered in the original space, Label-T&C work by (1) estimating the extent to which classes form clusters in the original and embedded spaces and (2) evaluating the difference between the two. A quantitative evaluation showed that Label-T&C outperform widely used DR evaluation measures (e.g., Trustworthiness and Continuity, Kullback-Leibler divergence) in terms of the accuracy in assessing how well DR embeddings preserve the cluster structure, and are also scalable. Moreover, we present case studies demonstrating that Label-T&C can be successfully used for revealing the intrinsic characteristics of DR techniques and their hyperparameters.
... Distance selection. Distance choice is a pervasive issue in data analysis; see (Francois et al. 2007; Aggarwal et al. 2001) for guidance on choosing distances and transformations for high-dimensional datasets. In Deza and Deza (2013), the authors broadly describe many distance functions, which is very helpful in understanding which distance is better for which type of data. ...
... It is a recurrent issue in data analysis, and the right choice depends on the nature of the data. The Euclidean distance is the most commonly used distance; however, it becomes biased by the largest components and is not recommended for high-dimensional data (Aggarwal et al. 2001; Francois et al. 2007). It is advisable to apply data transformations such as the standard or logarithmic transformations, as we do with the Cancer and Wine datasets. ...
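As a concrete illustration of such preprocessing, the sketch below standardises or log-transforms the features before computing pairwise distances. It is a sketch only: the skewed toy data and the coefficient-of-variation comparison are assumptions, not the Cancer/Wine preprocessing of the cited work.

```python
# Sketch: standard and logarithmic transformations applied before computing
# pairwise Euclidean distances; illustrative toy data, not the Cancer/Wine sets.
import numpy as np
from scipy.spatial.distance import pdist

def standardize(X):
    """Zero-mean, unit-variance scaling per feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def log_transform(X):
    """Logarithmic compression for non-negative, heavy-tailed features."""
    return np.log1p(X)

rng = np.random.default_rng(1)
X = rng.lognormal(mean=0.0, sigma=2.0, size=(200, 30))   # skewed synthetic features

for name, Z in [("raw", X), ("log", log_transform(X)), ("standardised", standardize(X))]:
    d = pdist(Z)
    print(f"{name:12s} coefficient of variation of distances = {d.std() / d.mean():.3f}")
```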
... In Aggarwal et al. (2001), it was proved that distances with higher values of p behave poorly in high-dimensional settings. On the contrary, the Manhattan distance behaves better than the Euclidean distance in almost every situation, and distances with exponent 0 < p < 1 were the best suited for high-dimensional datasets (Aggarwal et al. 2001; Francois et al. 2007). ...
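The effect of the exponent p can be checked with a small experiment. This is again only a sketch on synthetic uniform data; the distances are computed by brute force because SciPy's built-in Minkowski metric does not reliably accept p < 1 across versions.

```python
# Relative contrast of Minkowski/fractional distances for several exponents p,
# computed by brute force on synthetic uniform data (illustrative only).
import numpy as np

def pairwise_p_distances(X, p):
    """All pairwise distances with exponent p (a quasi-norm when 0 < p < 1)."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    D = (diff ** p).sum(axis=-1) ** (1.0 / p)
    return D[np.triu_indices(len(X), k=1)]

rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 100))             # 200 points in 100 dimensions

for p in (0.5, 1.0, 2.0, 4.0):
    dist = pairwise_p_distances(X, p)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"p={p:3.1f}  relative contrast = {contrast:.3f}")
```

On such data the smaller exponents typically yield a larger relative contrast, in line with the trend quoted above.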
Topological Data Analysis (TDA) is an emerging field that aims to discover a dataset's underlying topological information. TDA tools have been commonly used to create filters and topological descriptors to improve Machine Learning (ML) methods. This paper proposes a different TDA pipeline to classify balanced and imbalanced multi-class datasets without additional ML methods. Our proposed method was designed to solve multi-class and imbalanced classification problems with no data resampling preprocessing stage. The proposed TDA-based classifier (TDABC) builds a filtered simplicial complex on the dataset representing high-order data relationships. Following the assumption that a meaningful sub-complex exists in the filtration that approximates the data topology, we apply Persistent Homology (PH) to guide the selection of that sub-complex by considering the detected topological features. We use each unlabeled point's link and star operators to provide different-sized, multi-dimensional neighborhoods to propagate labels from labeled to unlabeled points. The labeling function depends on the entire history of the filtered simplicial complex and is encoded within the persistence diagrams at various dimensions. We select eight datasets with different dimensions, degrees of class overlap, and imbalanced samples per class to validate our method. The TDABC outperforms all baseline methods at classifying multi-class imbalanced data with high imbalance ratios and data with overlapping classes. Also, on average, the proposed method was better than K Nearest Neighbors (KNN) and weighted KNN, and behaved competitively with the Support Vector Machine and Random Forest baseline classifiers on balanced datasets.
... Consequently, the k-nearest neighbor classifier exhibits erratic behavior [4]. Due to distance concentration, Euclidean distance (ED)-based classifiers suffer certain limitations in HDLSS situations [2,13]. Some recent work has studied the effect of distance concentration on widely used classifiers based on Euclidean distances, such as the 1-nearest neighbor (1-NN) classifier [15] and support vector machines (SVM) [8]. ...
Classification of high-dimensional low sample size (HDLSS) data poses a challenge in a variety of real-world situations, such as gene expression studies, cancer research, and medical imaging. This article presents the development and analysis of some classifiers that are specifically designed for HDLSS data. These classifiers are free of tuning parameters and are robust, in the sense that they do not require any moment conditions on the underlying data distributions. It is shown that they yield perfect classification in the HDLSS asymptotic regime, under some fairly general conditions. The comparative performance of the proposed classifiers is also investigated. Our theoretical results are supported by extensive simulation studies and real data analysis, which demonstrate promising advantages of the proposed classification techniques over several widely recognized methods.
... Our analysis is a particular manifestation of the curse of dimensionality. The effect of growing dimensions on distance concentration and the meaningfulness of nearest neighbors has been extensively explored in classical settings, especially in the context of kernels (François et al., 2007; Evangelista et al., 2006; Aggarwal et al., 2001; Beyer et al., 1999). In particular, the fact that the ratio of the distance variance to the distance mean of i.i.d. ...
Precision and Recall are two prominent metrics of generative performance, which were proposed to separately measure the fidelity and diversity of generative models. Given their central role in comparing and improving generative models, understanding their limitations is crucially important. To that end, in this work, we identify a critical flaw in the common approximation of these metrics using k-nearest-neighbors, namely, that the very interpretations of fidelity and diversity that are assigned to Precision and Recall can fail in high dimensions, resulting in very misleading conclusions. Specifically, we empirically and theoretically show that as the number of dimensions grows, two model distributions with supports at equal point-wise distance from the support of the real distribution can have vastly different Precision and Recall regardless of their respective distributions, hence an emergent asymmetry in high dimensions. Based on our theoretical insights, we then provide simple yet effective modifications to these metrics to construct symmetric metrics regardless of the number of dimensions. Finally, we provide experiments on real-world datasets to illustrate that the identified flaw is not merely a pathological case, and that our proposed metrics are effective in alleviating its impact.
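For context, the k-nearest-neighbor approximation criticised in that abstract is usually built along the following lines. This is a simplified sketch in the spirit of the improved precision/recall metric, not the authors' implementation; the function names, the value of k and the toy data are illustrative assumptions.

```python
# Sketch of k-NN-based Precision/Recall for generative models:
# precision = fraction of generated samples inside the k-NN balls of real samples,
# recall    = fraction of real samples inside the k-NN balls of generated samples.
import numpy as np
from scipy.spatial.distance import cdist

def knn_radii(X, k):
    """Distance from each point in X to its k-th nearest neighbour within X."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)          # exclude the point itself
    return np.sort(D, axis=1)[:, k - 1]

def coverage(queries, support, radii):
    """Fraction of queries falling inside at least one support ball."""
    D = cdist(queries, support)
    return float(np.mean((D <= radii[None, :]).any(axis=1)))

def precision_recall(real, fake, k=3):
    precision = coverage(fake, real, knn_radii(real, k))   # fake covered by real manifold
    recall = coverage(real, fake, knn_radii(fake, k))      # real covered by fake manifold
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))
fake = rng.normal(loc=0.5, size=(500, 64))
print(precision_recall(real, fake))
```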
... The Manhattan distance (corresponding to p = 1) has been found beneficial when sparsity is preferred (Froyland et al. 2019), but the utilisation of quasi-norms has led to mixed results (Aggarwal et al. 2001;Flexer and Schnitzer 2015;Mirkes et al. 2019). Smaller values of p do not inherently circumvent the phenomenon of distance concentration, and the optimal choice appears to be highly application dependent and must be chosen empirically (Francois et al. 2007). In our case, p is taken to be the largest value in P = {0.1, ...
We develop a transfer operator-based method for the detection of coherent structures and their associated lifespans. Characterising the lifespan of coherent structures allows us to identify dynamically meaningful time windows, which may be associated with transient coherent structures in the localised phase space, as well as with time intervals within which these structures experience fundamental changes, such as merging or separation events. The localised transfer operator approach we pursue allows one to explore the fundamental properties of a dynamical system without full knowledge of the dynamics. The algorithms we develop prove useful not only in the simple case of a periodically driven double well potential model, but also in more complex cases generated using the rotating Boussinesq equations.
... Accurately clustering high-dimension (m) yet low-sample-size (n) (HDLSS) data remains challenging when n ≪ m [9], [10], [11]. The critical challenge that hinders the clustering of HDLSS data is called the concentration effect [12], [13], which refers to the situation in which the pairwise Euclidean distances among the samples collapse to a constant, so that the sample-wise affinity tends to be indiscriminating in high feature dimensions [14], [15], [16]. This effect is an insurmountable roadblock that prevents most clustering methods, which rely on the pairwise affinity of samples, from achieving precise clustering. ...
Conventional clustering methods based on pairwise affinity usually suffer from the concentration effect when processing data with huge feature dimensions yet low sample sizes, resulting in inaccurate encoding of sample proximity and suboptimal clustering performance. To address this issue, we propose a unified tensor clustering method (UTC) that characterizes sample proximity using the affinity of multiple samples, thereby supplementing rich spatial sample distributions to boost clustering. Specifically, we find that the triadic tensor affinity can be constructed via the Khatri-Rao product of two affinity matrices. Furthermore, our early work shows that the fourth-order tensor affinity is defined by the Kronecker product. Therefore, we utilize these matrix products, the Khatri-Rao and Kronecker products, to mathematically integrate different orders of affinity into a unified tensor clustering framework. Thus, UTC learns a joint low-dimensional embedding that combines the various orders. Finally, a numerical scheme is designed to solve the problem. Experiments on synthetic and real-world datasets demonstrate that 1) the use of high-order tensor affinity can provide a supplementary characterization of sample proximity to the popular affinity matrix; 2) the proposed UTC method is affirmed to enhance clustering by exploiting different-order affinities when processing high-dimensional data.
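To make the triadic construction concrete, here is a minimal sketch of building a third-order affinity tensor via the Khatri-Rao (column-wise Kronecker) product. Everything here is an illustrative assumption: the Gaussian-kernel affinity, the random HDLSS-style data, and the omission of UTC's normalisation and actual clustering steps.

```python
# Sketch: third-order affinity tensor from two pairwise affinity matrices
# via the Khatri-Rao product (illustrative, not the full UTC pipeline).
import numpy as np
from scipy.linalg import khatri_rao
from scipy.spatial.distance import pdist, squareform

def gaussian_affinity(X, sigma=1.0):
    """Pairwise Gaussian-kernel affinity matrix."""
    return np.exp(-squareform(pdist(X)) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))            # HDLSS-style data: n = 50, m = 1000
A = gaussian_affinity(X)                   # ordinary pairwise (second-order) affinity

# Khatri-Rao product of A with itself: an (n*n) x n matrix whose k-th column
# is the Kronecker product of the k-th columns of the two factors ...
KR = khatri_rao(A, A)
# ... which reshapes into an n x n x n triadic affinity tensor T[i, j, k] = A[i, k] * A[j, k].
n = A.shape[0]
T = KR.reshape(n, n, n)
print(A.shape, KR.shape, T.shape)          # (50, 50) (2500, 50) (50, 50, 50)
```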
... High-dimensional data are difficult to study because many classical machine learning techniques are impaired by the so-called curse of dimensionality [4,12]. One of the manifestations of this curse is the tendency of distances to concentrate: pairwise distances between observations have both a large mean and a small variance (see [5,17]). This also shows that a multivariate Gaussian distribution is mostly concentrated on a central sphere. ...
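That last observation can be checked numerically; the snippet below is an illustration, not taken from the cited works. For a standard Gaussian in dimension d, the norm concentrates around sqrt(d) with a standard deviation that stays of order one, so the mass lies close to a thin spherical shell.

```python
# Illustrative check: norms of standard d-dimensional Gaussian samples
# concentrate around sqrt(d) with roughly constant spread.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    norms = np.linalg.norm(rng.normal(size=(10000, d)), axis=1)
    print(f"d={d:5d}  mean |x| = {norms.mean():7.2f} (sqrt(d) = {d ** 0.5:7.2f}), "
          f"std = {norms.std():.3f}")
```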
... For instance, µ_kj{0} is the j-th coordinate of the directional mean of the k-th component when β = 0. By a natural extension, r_k{β} is the result of applying equation (23) to Θ{β} (using equations (17) and (15)). ...
Mixtures of von Mises-Fisher distributions can be used to cluster data on the unit hypersphere. This is particularly adapted to high-dimensional directional data such as texts. We propose in this article to estimate a von Mises mixture using an l1-penalized likelihood. This leads to sparse prototypes that improve clustering interpretability. We introduce an expectation-maximisation (EM) algorithm for this estimation and explore the trade-off between the sparsity term and the likelihood term with a path-following algorithm. The model's behaviour is studied on simulated data, and we show the advantages of the approach on real-data benchmarks. We also introduce a new data set on financial reports and exhibit the benefits of our method for exploratory analysis.
... As rigorously shown in [4], the relative contrast provided by a norm with a smaller parameter p is more likely to dominate that of a norm with a larger parameter p as the dimensionality increases. The fractional distance concentration [30] dictates that we either work with a topological space (instead of a metric space) or use ℓ0 as a pseudo-norm that recognizes the limitations of distance metrics. Interestingly, the problem of ℓ∞-optimization has been shown to be useful for approximate NN search with anti-sparse coding [48]. ...
In this paper, we revisit the problem of computational modeling of simple and complex cells for an over-parameterized and direct-fit model of visual perception. Unlike conventional wisdom, we highlight the difference in parallel and sequential binding mechanisms between simple and complex cells. A new proposal for abstracting them into space partitioning and composition is developed as the foundation of our new hierarchical construction. Our construction can be interpreted as a product topology-based generalization of the existing k-d tree, making it suitable for brute-force direct-fit in a high-dimensional space. The constructed model has been applied to several classical experiments in neuroscience and psychology. We provide an anti-sparse coding interpretation of the constructed vision model and show how it leads to a dynamic programming (DP)-like approximate nearest-neighbor search based on ℓ∞-optimization. We also briefly discuss two possible implementations based on asymmetrical (decoder matters more) auto-encoder and spiking neural networks (SNN), respectively.
... For example, distances (or measures) tend to concentrate in higher dimensions. In fact, as the dimensionality approaches infinity, distances between pairs of objects become effectively useless because they are indistinguishable [28]. This distance concentration can be explained by the fact that, with increasing dimension, the volume of a unit hypercube grows faster than the volume of a unit hyperball. ...
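The volume argument is easy to verify numerically; the small computation below is illustrative and not part of the cited work. It compares the volume of the unit ball with that of the hypercube [-1, 1]^d enclosing it.

```python
# Illustrative check: the unit ball occupies a vanishing fraction of the
# enclosing hypercube [-1, 1]^d as the dimension d grows.
from math import gamma, pi

def unit_ball_volume(d):
    """Volume of the d-dimensional Euclidean ball of radius 1."""
    return pi ** (d / 2) / gamma(d / 2 + 1)

for d in (1, 2, 5, 10, 20, 50):
    cube = 2.0 ** d                       # enclosing cube has side length 2
    ratio = unit_ball_volume(d) / cube
    print(f"d={d:3d}  ball volume = {unit_ball_volume(d):.3e}  ball/cube ratio = {ratio:.3e}")
```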
The flexibility of Knowledge Graphs to represent heterogeneous entities and relations of many types is challenging for conventional data integration frameworks. In order to address this challenge, the use of Knowledge Graph Embeddings (KGEs) to encode entities from different data sources into a common lower-dimensional embedding space has been a highly active research field. It was recently discovered, however, that KGEs suffer from the so-called hubness phenomenon. If a dataset suffers from hubness, some entities become hubs that dominate the nearest neighbor search results of the other entities. Since nearest neighbor search is an integral step in the entity alignment procedure when using KGEs, hubness is detrimental to the alignment quality. We investigate a variety of hubness reduction techniques and (approximate) nearest neighbor libraries to show that we can perform hubness-reduced nearest neighbor search at practically no cost w.r.t. speed, while reaping a significant improvement in quality. We ensure the statistical significance of our results with a Bayesian analysis. For practical use and future research we provide the open-source python library kiez at https://github.com/dobraczka/kiez.