Fig 3 - uploaded by Vincent Wertz


# Relative variance for data distributed as F Ã, as a function of p. The maximum occurs at a rather large value of p, far from 1.
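The quantity plotted in the figure can be estimated numerically. The sketch below computes the relative variance (standard deviation over mean) of the p-norm of random vectors for several values of p. The i.i.d. uniform components, the dimension, and the sample size are illustrative assumptions; the shape of the resulting curve depends on the underlying distribution, and the figure's distribution F differs.

```python
# Estimate the relative variance of ||X||_p as a function of p.
# Data distribution (i.i.d. uniform), d, and n are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 20000
X = rng.uniform(size=(n, d))

rv = {}
for p in [0.1, 0.5, 1.0, 2.0, 8.0]:
    norms = (X ** p).sum(axis=1) ** (1.0 / p)
    rv[p] = norms.std() / norms.mean()    # relative variance: std / mean
    print(f"p={p}: relative variance = {rv[p]:.4f}")
```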

## Source publication

Nearest neighbor search and many other numerical data analysis tools most often rely on the Euclidean distance. When data are high-dimensional, however, Euclidean distances seem to concentrate: all distances between pairs of data elements look very similar. The relevance of the Euclidean distance has therefore been questioned...
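The concentration effect described above is easy to reproduce numerically. The following sketch (with i.i.d. Gaussian data as an illustrative assumption) shows the relative spread (std/mean) of pairwise Euclidean distances shrinking as the dimension grows:

```python
# Demonstrate distance concentration: as d grows, the std of pairwise
# Euclidean distances shrinks relative to their mean.
# Data distribution (i.i.d. standard normal) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n = 200
ratios = {}
for d in [2, 10, 100, 1000]:
    X = rng.standard_normal((n, d))
    # pairwise distances via the Gram matrix, avoiding an (n, n, d) array
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    iu = np.triu_indices(n, k=1)          # upper triangle, no self-distances
    ratios[d] = D[iu].std() / D[iu].mean()
    print(f"d={d:5d}: std/mean = {ratios[d]:.4f}")
```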

## Citations

... DiskANN is slightly better than DumpyOS on data series datasets, while IVF and IVFPQ are much slower than DumpyOS. In classical long data series datasets like DNA, IVF suffers from the curse of dimensionality [31] and thus from poor clustering quality, which results in performance degradation. DumpyOS, on the other hand, can capture the differences between data series on individual segments and thus provides superior query performance. ...

Data series indexes are necessary for managing and analyzing the increasing amounts of data series collections that are nowadays available. These indexes support both exact and approximate similarity search, with approximate search providing high-quality results within milliseconds, which makes it very attractive for certain modern applications. Reducing the pre-processing (i.e., index building) time and improving the accuracy of search results are two major challenges. DSTree and the iSAX index family are state-of-the-art solutions for this problem. However, DSTree suffers from long index building times, while iSAX suffers from low search accuracy. In this paper, we identify two problems of the iSAX index family that adversely affect the overall performance. First, we observe the presence of a proximity-compactness trade-off related to the index structure design (i.e., the node fanout degree), significantly limiting the efficiency and accuracy of the resulting index. Second, a skewed data distribution will negatively affect the performance of iSAX. To overcome these problems, we propose Dumpy, an index that employs a novel multi-ary data structure with an adaptive node splitting algorithm and an efficient building workflow. Furthermore, we devise Dumpy-Fuzzy as a variant of Dumpy which further improves search accuracy by proper duplication of series. To fully leverage the potential of modern hardware, including multicore CPUs and Solid State Drives (SSDs), we parallelize Dumpy to DumpyOS with sophisticated indexing and pruning-based querying algorithms. An optimized approximate search algorithm, DumpyOS-F, which prominently improves the search accuracy without violating the index, is also proposed. Experiments with a variety of large, real datasets demonstrate that the Dumpy solutions achieve considerably better efficiency, scalability, and search accuracy than their competitors.
DumpyOS further improves on Dumpy, by delivering several times faster index building and querying, and DumpyOS-F improves the search accuracy of Dumpy-Fuzzy without the additional space cost of Dumpy-Fuzzy. This paper is an extension of the previously published SIGMOD paper [81].

... The literature on clustering methods is vast (Everitt et al. 2011) and includes approaches such as partitioning, hierarchical, density-based, grid-based, model-based, and constraint-based methods. Algorithms such as k-means and k-medoids, with their many variants, are among the most popular techniques. ...

... This transformation provides a large data reduction for HDLSS data, where p ≫ N. The Euclidean distance between two vectors does not take into account their probability mass distribution and depends only on those two vectors. Violation of the neighbourhood structure in high dimensions is recognised for the Euclidean and fractional distances (Aggarwal et al. 2001; Francois et al. 2007; Beyer et al. 1999). The violation has adverse effects on the performance of machine learning methods that use the Euclidean distance, such as the nearest neighbour (NN) classifier. ...

... However, like other nonparametric methods, this classifier also suffers from the curse of dimensionality (see, e.g., Carreira-Perpiñán, 2009), especially when the dimension of the data is much larger than the training sample size. In such high-dimension, low-sample-size (HDLSS) situations, the concentration of pairwise distances (see, e.g., Hall et al., 2005; François et al., 2007), the presence of hubs, and the violation of the neighborhood structure (see, e.g., Radovanovic et al., 2010; Pal et al., 2016) often have adverse effects on the performance of the nearest neighbor classifier. ...

... Note that (A2) holds under the ρ-mixing condition. François et al. (2007) observed that for high-dimensional data with highly correlated or dependent measurement variables, pairwise distances are less concentrated than when all variables are independent. They claimed that the distance concentration phenomenon depends on the intrinsic dimension (see, e.g., Levina and Bickel, 2004; Camastra and Staiano, 2016) of the data, rather than on the dimension of the embedding space. ...

The nearest neighbor classifier is arguably the simplest and most popular nonparametric classifier in the literature. However, due to the concentration of pairwise distances and the violation of the neighborhood structure, this classifier often suffers in high-dimension, low-sample-size (HDLSS) situations, especially when the scale difference between the competing classes dominates their location difference. Several attempts have been made in the literature to address this problem. In this article, we discuss some of these existing methods and propose some new ones. We carry out theoretical investigations in this regard and analyze several simulated and benchmark datasets to compare the empirical performance of the proposed methods with some of the existing ones.
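The scale-difference failure mode mentioned above can be illustrated with a small simulation (the class distributions, sample sizes, and scale factor are assumptions for illustration): in high dimensions, every point of the larger-scale class finds its nearest neighbor in the smaller-scale class, so leave-one-out 1-NN accuracy collapses to about 50%.

```python
# HDLSS simulation: two classes with the same location but different scales.
# Class sizes, dimension, and the scale factor are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 20, 2000, 3.0
A = rng.standard_normal((n, d))           # class 0: scale 1
B = sigma * rng.standard_normal((n, d))   # class 1: scale 3
X = np.vstack([A, B])
y = np.array([0] * n + [1] * n)

# leave-one-out 1-NN via the full distance matrix
sq = (X ** 2).sum(axis=1)
D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
np.fill_diagonal(D, np.inf)               # exclude each point from its own search
nn = D.argmin(axis=1)

acc = (y[nn] == y).mean()
frac = (y[nn[y == 1]] == 0).mean()        # class-1 points whose NN is class 0
print(f"LOO 1-NN accuracy: {acc:.2f}")
print(f"large-scale points with small-scale NN: {frac:.2f}")
```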

... The curse of dimensionality refers to the phenomenon where the Euclidean distances between data points tend to become identical [15]. This diminished discrimination makes it difficult for the Euclidean distance to model proximity in high-dimensional settings. ...

Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-aware Counterfactual Regression (PCR) to exploit proximity for representation balancing within the HTE estimation context. Specifically, we introduce a local proximity preservation regularizer based on optimal transport to depict the local proximity in discrepancy calculation. Furthermore, to overcome the curse of dimensionality that renders the estimation of discrepancy ineffective, exacerbated by limited data availability for HTE estimation, we develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that PCR accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://anonymous.4open.science/status/ncr-B697.

... This can be challenging in real applications, especially for high-dimension m yet low-sample-size n (HDLSS) data, where n ≪ m [15], [16]. The clustering performance on HDLSS data is hindered by concentration effects, also known as the "curse of dimensionality" [17]. The collapse of pairwise distances in high-dimensional feature spaces presents a formidable challenge for clustering algorithms reliant on pairwise affinity, impeding accurate clustering [18]. ...

Graph-based multi-view clustering encodes multi-view data into sample affinities to find a consensus representation, effectively overcoming heterogeneity across different views. However, traditional affinity measures tend to collapse as the feature dimension expands, posing challenges in estimating a unified alignment that reveals both cross-view and inner relationships. To tackle this challenge, we propose to achieve multi-view uniform clustering via consensus representation co-regularization. First, the sample affinities are encoded by both the popular dyadic affinity and recent high-order affinities to comprehensively characterize the spatial distribution of the HDLSS data. Second, a fused consensus representation is learned by aligning the multi-view low-dimensional representations through co-regularization. The learning of the fused representation is modeled as a high-order eigenvalue problem within manifold space to preserve the intrinsic connections and complementary correlations of the original data. A numerical scheme via manifold minimization is designed to solve the high-order eigenvalue problem efficaciously. Experiments on eight HDLSS datasets demonstrate the effectiveness of our proposed method in comparison with thirteen recent benchmark methods.

... Those estimators give rise to different feature selection methods (Traina et al., 2010; Mo & Huang, 2012; Suryakumar et al., 2013; Golay et al., 2016), occasionally based on gradients to learn an embedding with the desired properties (Pope et al., 2021). However, these algorithms do not help to decide if, and to what extent, the data set is affected by the curse of dimensionality and the related concentration phenomena (François et al., 2007; Houle, 2013). ...

Difficulties in the replication and reproducibility of empirical evidence in machine learning research have become a prominent topic in recent years. Ensuring that machine learning research results are sound and reliable requires reproducibility, which verifies the reliability of research findings using the same code and data. This promotes open and accessible research, robust experimental workflows, and the rapid integration of new findings. Evaluating the degree to which research publications support these different aspects of reproducibility is one goal of the present work. For this we introduce an ontology of reproducibility in machine learning and apply it to methods for graph neural networks. Building on these efforts, we turn towards another critical challenge in machine learning, namely the curse of dimensionality, which poses challenges in data collection, representation, and analysis, making it harder to find representative data and impeding the training and inference processes. Using the closely linked concept of geometric intrinsic dimension, we investigate to what extent the machine learning models used are influenced by the intrinsic dimension of the data sets they are trained on.

... The performance of spectral clustering hinges on the pairwise similarities in the affinity matrix. However, pairwise similarities are easily broken by noise contamination [18] or the concentration effect [19] in high-dimensional data. To address this issue, recent works on tensor spectral clustering [13] attempted to use high-order tensor affinities among more than two samples to compensate for the inefficacy of pairwise similarities. ...

Tensor spectral clustering (TSC) is an emerging approach that explores multi-wise similarities to boost learning. However, two key challenges have yet to be well addressed in the existing TSC methods: (1) the construction and storage of high-order affinity tensors to encode the multi-wise similarities are memory-intensive and hamper their applicability, and (2) they mostly employ a two-stage approach that integrates multiple affinity tensors of different orders to learn a consensus tensor spectral embedding, often leading to a suboptimal clustering result. To this end, this paper proposes a tensor spectral clustering network (TSC-Net) to achieve one-stage learning of a consensus tensor spectral embedding, while reducing the memory cost. TSC-Net employs a deep neural network that learns to map the input samples to the consensus tensor spectral embedding, guided by a TSC objective with multiple affinity tensors. It uses stochastic optimization to calculate only a small part of the affinity tensors, thereby avoiding loading the whole affinity tensors for computation and thus significantly reducing the memory cost. By using an ensemble of multiple affinity tensors, TSC-Net can dramatically improve clustering performance. Empirical studies on benchmark datasets demonstrate that TSC-Net outperforms the recent baseline methods.
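The memory-saving idea described in the abstract, computing only a sampled batch of affinity-tensor entries instead of materializing all n³ of them, can be sketched as follows. The triple-wise affinity formula used here (a Gaussian of the three pairwise squared distances) is an illustrative assumption, not the one used by TSC-Net:

```python
# Sketch: evaluate sampled entries of an order-3 affinity tensor on the fly,
# never building the full n x n x n tensor. Affinity formula is an assumption.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 16
X = rng.standard_normal((n, d))

def affinity_entries(idx, sigma=1.0):
    """Compute A[i, j, k] for a batch of index triples without building A."""
    i, j, k = idx[:, 0], idx[:, 1], idx[:, 2]
    d2 = (np.linalg.norm(X[i] - X[j], axis=1) ** 2
          + np.linalg.norm(X[j] - X[k], axis=1) ** 2
          + np.linalg.norm(X[i] - X[k], axis=1) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

batch = rng.integers(0, n, size=(256, 3))   # sampled triples for one step
vals = affinity_entries(batch)
print(vals.shape)   # 256 entries instead of n**3 = 1e9
```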

... The second axiom focuses on the effect of the data dimension on the distance δ, due to the so-called curse of dimensionality [9]. Growing dimensions increase the average of pairwise distances while the variances remain constant [11,24,38], so the differences between distances become negligible. To be used for Label-T&C, a CVM should be shift invariant [38,39] to cancel the shift of the average distances due to the different dimensions of X and Z. ...

A common way to evaluate the reliability of dimensionality reduction (DR) embeddings is to quantify how well labeled classes form compact, mutually separated clusters in the embeddings. This approach is based on the assumption that the classes stay as clear clusters in the original high-dimensional space. However, in reality, this assumption can be violated; a single class can be fragmented into multiple separated clusters, and multiple classes can be merged into a single cluster. We thus cannot always assure the credibility of the evaluation using class labels. In this paper, we introduce two novel quality measures, Label-Trustworthiness and Label-Continuity (Label-T&C), advancing the process of DR evaluation based on class labels. Instead of assuming that classes are well-clustered in the original space, Label-T&C work by (1) estimating the extent to which classes form clusters in the original and embedded spaces and (2) evaluating the difference between the two. A quantitative evaluation showed that Label-T&C outperform widely used DR evaluation measures (e.g., Trustworthiness and Continuity, Kullback-Leibler divergence) in terms of the accuracy in assessing how well DR embeddings preserve the cluster structure, and are also scalable. Moreover, we present case studies demonstrating that Label-T&C can be successfully used for revealing the intrinsic characteristics of DR techniques and their hyperparameters.

... In the literature, several papers describe the underlying characteristics that make datasets more or less difficult for similarity search purposes. In particular, dataset properties that might affect the performance of similarity search methods include intrinsic dimensionality [2], relative contrast [7], fractal correlation [4], and the concentration of distances [6]. However, works in the field rely on properties like these to select datasets for experimentation without analyzing the interconnection between the properties that affect dataset variability. ...

... Several metrics can be extracted from datasets to measure the complexity of a similarity search problem [2,10]. Existing metrics include the Relative Contrast [7], intrinsic dimensionality [2], fractal correlation [4], and the concentration of distances [6]. These metrics can characterize datasets providing valuable insights for choosing similarity search algorithms and their parameters. ...
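As an illustration, one of the metrics mentioned, the relative contrast, can be estimated with a few lines of code. The estimator below (ratio of the mean distance to the nearest-neighbor distance, averaged over queries sampled from the data) is a simplified variant; published definitions differ in detail:

```python
# Simplified relative-contrast estimator: a value near 1 signals a hard
# similarity search problem. Dataset sizes here are illustrative assumptions.
import numpy as np

def relative_contrast(X, n_queries=50, seed=1):
    rng = np.random.default_rng(seed)
    dmin, dmean = [], []
    for q in rng.choice(len(X), size=n_queries, replace=False):
        dist = np.linalg.norm(X - X[q], axis=1)
        dist = dist[dist > 0]             # drop the query itself
        dmin.append(dist.min())
        dmean.append(dist.mean())
    return np.mean(dmean) / np.mean(dmin)

rng = np.random.default_rng(0)
rc_low = relative_contrast(rng.standard_normal((500, 2)))
rc_high = relative_contrast(rng.standard_normal((500, 500)))
print(f"RC in 2-d:   {rc_low:.2f}")   # large: easy problem
print(f"RC in 500-d: {rc_high:.2f}")  # near 1: hard problem
```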

Most papers on similarity retrieval present experiments executed on an assortment of complex datasets. However, no work focuses on analyzing how datasets are selected to evaluate the techniques proposed in the related literature. Ideally, the datasets chosen for experimental analysis should cover a variety of properties to ensure a proper evaluation; however, this is not always the case. This paper introduces the dataset-similarity-based approach, a new conceptual view of datasets that explores how they vary according to their characteristics. The approach is based on extracting a set of features from the datasets to represent them in a similarity space and analyzing their distribution in this space. We present an instantiation of our approach using datasets gathered by surveying dataset usage in papers published in relevant conferences on similarity retrieval, along with sample analyses. Our analyses show that datasets often used together in experiments are more similar than they seem at first glance, reducing the variability. The proposed representation of datasets in a similarity space allows future works to improve the choice of datasets for running experiments in similarity retrieval.

... Distance-based classification techniques, like the k-Nearest Neighbors family of methods, are classifiers that usually perform well when n > m. However, they are also known to suffer from the curse of dimensionality in the opposite situation, because the pairwise distances between all observations then concentrate around a single value (François et al. 2007). As before, a large part of the solutions proposed in the literature are based on dimensionality reduction techniques, e.g., Deegalla and Bostrom (2006). ...

High dimension, low sample size (HDLSS) problems are numerous among real-world applications of machine learning. From medical images to text processing, traditional machine learning algorithms are usually unsuccessful in learning the best possible concept from such data. In a previous work, we proposed a dissimilarity-based approach for multi-view classification, the random forest dissimilarity, which achieves state-of-the-art results for such problems. In this work, we transpose the core principle of this approach to HDLSS classification problems, by using the RF similarity measure as a learned precomputed SVM kernel (RFSVM). We show that such a learned similarity measure is particularly suited and accurate for this classification context. Experiments conducted on 40 public HDLSS classification datasets, supported by rigorous statistical analyses, show that the RFSVM method outperforms existing methods for the majority of HDLSS problems while remaining very competitive for low or non-HDLSS problems.
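The core construction, an RF similarity used as a precomputed SVM kernel, can be sketched with scikit-learn. The proximity used below (fraction of trees in which two samples share a leaf) and the synthetic dataset are illustrative assumptions; the paper's RFSVM may define the similarity differently:

```python
# Sketch of an RF-proximity kernel fed to an SVM as a precomputed kernel.
# Dataset and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = rf.apply(X)                      # (n_samples, n_trees) leaf indices
# proximity: fraction of trees where two samples fall in the same leaf
K = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

svm = SVC(kernel="precomputed").fit(K, y)
print(svm.score(K, y))                    # training accuracy on the kernel
```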