Article

Robust embedding and outlier detection of metric space data

... , 0.30. A similar setting was used in Heinonen et al. (2023). ...
Preprint
We propose a novel measure of statistical depth, the metric spatial depth, for data residing in an arbitrary metric space. The measure assigns high (low) values to points located near (far away from) the bulk of the data distribution, allowing their centrality/outlyingness to be quantified. This depth measure is shown to have highly interpretable geometric properties, making it appealing in object data analysis where standard descriptive statistics are difficult to compute. The proposed measure reduces to the classical spatial depth in a Euclidean space. In addition to studying its theoretical properties, to provide intuition on the concept, we explicitly compute metric spatial depths in several different metric spaces. Finally, we showcase the practical usefulness of the metric spatial depth in outlier detection, non-convex depth region estimation and classification.
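The abstract notes that the measure reduces to the classical spatial depth in the Euclidean case. As a point of reference, that classical quantity can be sketched in a few lines of NumPy; this is only the Euclidean special case, not the authors' metric generalization, and the function name is illustrative.

```python
import numpy as np

def spatial_depth(x, X, eps=1e-12):
    """Classical (Euclidean) spatial depth of point x with respect to sample X (n x p)."""
    diffs = X - x                                # vectors from x to each observation
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > eps                           # drop observations coinciding with x
    units = diffs[keep] / norms[keep, None]      # unit vectors (spatial signs)
    return 1.0 - np.linalg.norm(units.mean(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
print(spatial_depth(np.zeros(3), X))      # near the bulk -> depth close to 1
print(spatial_depth(np.full(3, 6.0), X))  # far away      -> depth close to 0
```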
Article
Full-text available
The concept of depth has proved very important for multivariate and functional data analysis, as it essentially acts as a surrogate for the notion of ranking of observations which is absent in more than one dimension. Motivated by the rapid development of technology, in particular the advent of ‘Big Data’, we extend here that concept to general metric spaces, propose a natural depth measure and explore its properties as a statistical depth function. Working in a general metric space allows the depth to be tailored to the data at hand and to the ultimate goal of the analysis, a very desirable property given the polymorphic nature of modern data sets. This flexibility is thoroughly illustrated by several real data analyses.
Article
Full-text available
The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are not of that form. Moreover, the computation time of MRCD increases substantially when the number of variables goes up, and nowadays datasets with many variables are common. The proposed kernel minimum regularized covariance determinant (KMRCD) estimator addresses both issues. It is not restricted to elliptical data because it implicitly computes the MRCD estimates in a kernel-induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations. Based on the KMRCD estimates, a rule is proposed to flag outliers. The KMRCD algorithm performs well in simulations, and is illustrated on real-life data.
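The KMRCD estimator itself is not sketched here, but the classical MCD-style flagging rule it builds on can be illustrated with scikit-learn's MinCovDet: fit a robust location/scatter estimate and flag observations with large robust Mahalanobis distances. The chi-squared cutoff is a common convention and not necessarily the paper's exact rule.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
X[:10] += 6.0                                   # plant a few outliers

mcd = MinCovDet(random_state=0).fit(X)          # robust location and scatter
d2 = mcd.mahalanobis(X)                         # squared robust Mahalanobis distances
cutoff = chi2.ppf(0.975, df=X.shape[1])         # common flagging threshold
print(np.where(d2 > cutoff)[0])                 # indices flagged as outliers
```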
Article
Full-text available
Common representations of functional networks of resting state fMRI time series, including covariance, precision, and cross-correlation matrices, belong to the family of symmetric positive definite (SPD) matrices forming a special mathematical structure called Riemannian manifold. Due to its geometric properties, the analysis and operation of functional connectivity matrices may well be performed on the Riemannian manifold of the SPD space. Analysis of functional networks on the SPD space takes account of all the pairwise interactions (edges) as a whole, which differs from the conventional rationale of considering edges as independent from each other. Despite its geometric characteristics, only a few studies have been conducted for functional network analysis on the SPD manifold, and inference methods specialized for connectivity analysis on the SPD manifold are rarely found. The current study aims to show the significance of connectivity analysis on the SPD space and introduce inference algorithms on the SPD manifold, such as regression analysis of functional networks in association with behaviors, principal geodesic analysis, clustering, state transition analysis of dynamic functional networks and statistical tests for network equality on the SPD manifold. We applied the proposed methods to both simulated data and experimental resting state fMRI data from the human connectome project and argue for the importance of analyzing functional networks under the SPD geometry. All the algorithms for numerical operations and inferences on the SPD manifold are implemented as a MATLAB library, called SPDtoolbox, for public use to expedite functional network analysis on the right geometry.
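One standard choice of geometry on this manifold is the affine-invariant Riemannian metric; the distance between two SPD connectivity matrices can be computed from the generalized eigenvalues of the pair. The sketch below is an illustrative Python equivalent and is not part of the MATLAB SPDtoolbox described above.

```python
import numpy as np
from scipy.linalg import eigvalsh

def spd_airm_distance(A, B):
    """Affine-invariant Riemannian distance between SPD matrices A and B."""
    lam = eigvalsh(B, A)                # generalized eigenvalues of (B, A), all > 0
    return np.sqrt(np.sum(np.log(lam) ** 2))

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)             # two random SPD matrices
N = rng.normal(size=(4, 4))
B = N @ N.T + 4 * np.eye(4)
print(spd_airm_distance(A, B))
print(spd_airm_distance(A, A))          # zero distance to itself
```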
Article
Full-text available
This paper aims to propose an efficient numerical method for the most challenging problem known as the robust Euclidean embedding (REE) in the family of multidimensional scaling (MDS). The problem is notoriously nonsmooth and nonconvex, and its objective is non-Lipschitzian. We first explain that the semidefinite programming (SDP) relaxations and the Euclidean distance matrix (EDM) approach, popular for other types of problems in the MDS family, fail to provide a viable method for this problem. We then propose a penalized REE (PREE), which can be economically majorized. We show that the majorized problem is convex provided that the penalty parameter is above a certain threshold. Moreover, it has a closed-form solution, resulting in an efficient algorithm dubbed PREEEDM (Penalized REE via EDM optimization). We prove, among other things, that PREEEDM converges to a stationary point of PREE. Finally, the efficiency of PREEEDM is compared with several state-of-the-art methods, including SDP and EDM solvers, on a large number of test problems from sensor network localization and molecular conformation.
Article
Full-text available
Recent developments have found rather unexpected applications of the theory of independent component analysis (ICA) in outlier detection, data clustering and multivariate data visualization. Accurate identification of outliers plays an important role in statistical analysis. If classical statistical models are blindly applied to data containing outliers, the results can be misleading at best. In addition, outliers themselves are often the special points of interest in many practical situations, and their identification is the main purpose of the investigation. This paper proposes a new method for multivariate outlier detection using ICA and compares it with other outlier detection techniques in the literature.
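One simple way to operationalize the idea of ICA-based outlier detection, not necessarily the paper's exact procedure, is to estimate independent components with FastICA and flag observations that are extreme in any component; the threshold and robust standardization below are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
S = rng.laplace(size=(500, 3))                 # independent non-Gaussian sources
A = rng.normal(size=(3, 3))
X = S @ A.T
X[:5] += 10.0                                  # plant a few outliers

ica = FastICA(n_components=3, random_state=0)
scores = ica.fit_transform(X)                  # estimated independent component scores

med = np.median(scores, axis=0)
mad = 1.4826 * np.median(np.abs(scores - med), axis=0)
z = np.abs(scores - med) / mad                 # robust z-scores per component
print(np.where((z > 4).any(axis=1))[0])        # observations extreme in some component
```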
Article
Full-text available
The sample covariance matrix, which is well known to be highly nonrobust, plays a central role in many classical multivariate statistical methods. A popular way of making such multivariate methods more robust is to replace the sample covariance matrix with some robust scatter matrix. The aim of this paper is to point out that multivariate methods often require that certain properties of the covariance matrix hold also for the robust scatter matrix in order for the corresponding robust plug-in method to be a valid approach, but that not all scatter matrices possess the desired properties. Plug-in methods for independent components analysis, observational regression and graphical modelling are considered in more detail. For each case, it is shown that replacing the sample covariance matrix with a symmetrized robust scatter matrix yields a valid robust multivariate procedure.
Conference Paper
Full-text available
A recent trend in computer vision is to represent images through covariance matrices, which can be treated as points on a special class of Riemannian manifolds. A popular way of analysing such manifolds is to embed them in Euclidean spaces, a process which can be interpreted as warping the feature space. Embedding manifolds is not without problems, as the manifold structure may not be accurately preserved. In this paper, we propose a new method for analysing Riemannian manifolds, where embedding into Euclidean spaces is not explicitly required. To this end, we propose to represent Riemannian points through their similarities to a set of reference points on the manifold, with the aid of the recently proposed Stein divergence, which is a symmetrised version of Bregman matrix divergence. Classification problems on manifolds are then effectively converted into the problem of finding appropriate machinery over the space of similarities, which can be tackled by conventional Euclidean learning methods such as linear discriminant analysis. Experiments on face recognition, person re-identification and texture classification show that the proposed method outperforms state-of-the-art approaches, such as Tensor Sparse Coding, Histogram Plus Epitome and the recent Riemannian Locality Preserving Projection.
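The Stein divergence mentioned above (also known as the Jensen-Bregman LogDet divergence) has a simple closed form for SPD matrices, S(A, B) = log det((A+B)/2) - (1/2) log det(AB). A minimal sketch is given below; the reference-point representation and the classifiers built on top of it are not reproduced here.

```python
import numpy as np

def stein_divergence(A, B):
    """Symmetric Stein (Jensen-Bregman LogDet) divergence between SPD matrices."""
    _, ld_mean = np.linalg.slogdet((A + B) / 2.0)
    _, ld_A = np.linalg.slogdet(A)
    _, ld_B = np.linalg.slogdet(B)
    return ld_mean - 0.5 * (ld_A + ld_B)

rng = np.random.default_rng(4)
M = rng.normal(size=(5, 5)); A = M @ M.T + np.eye(5)
N = rng.normal(size=(5, 5)); B = N @ N.T + np.eye(5)
print(stein_divergence(A, B), stein_divergence(A, A))   # positive, then zero
```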
Article
Full-text available
Multidimensional scaling (MDS) seeks an embedding of N objects in a p < N dimensional space such that inter-vector distances approximate pairwise object dissimilarities. Despite their popularity, MDS algorithms are sensitive to outliers, yielding grossly erroneous embeddings even if few outliers contaminate the available dissimilarities. This work introduces robust MDS approaches exploiting the degree of sparsity in the outliers present. Links with compressive sampling lead to robust MDS solvers capable of coping with unstructured and structured outliers. The novel algorithms rely on a majorization-minimization approach to minimize a regularized stress function, whereby iterative MDS solvers involving Lasso and sparse group-Lasso operators are obtained. The resulting schemes identify outliers and obtain the desired embedding at computational cost comparable to that of their nonrobust MDS alternatives. The robust structured MDS algorithm considers outliers introduced by a sparse set of objects. In this case, two types of sparsity are exploited: i) sparsity of outliers in the dissimilarities; and ii) sparsity of the objects introducing outliers. Numerical tests on synthetic and real datasets illustrate the merits of the proposed algorithms.
Article
Full-text available
In this paper, shape matrix estimators based on spatial sign and rank vectors are considered. The estimators considered here are slight modifications of the estimators introduced in Dümbgen (1998) and Oja and Randles (2004) and further studied, for example, in Sirkiä et al. (2009). The shape estimators are computed using pairwise differences of the observed data, so there is no need to estimate the location center of the data. When the estimator is based on signs, the use of differences also implies that the estimators have the so-called independence property if the estimator used as an initial estimator has it. The influence functions and limiting distributions of the estimators are derived in the multivariate elliptical case. The estimators are shown to be highly efficient in the multinormal case, and for heavy-tailed distributions they outperform the shape estimator based on the sample covariance matrix.
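The sign-based building block of these estimators, a covariance matrix of the spatial signs of pairwise differences, can be sketched as below. The actual Dümbgen and Oja-Randles shape estimators add an iterative standardization (inner scatter) step that is omitted here, so this is only the non-iterated ingredient.

```python
import numpy as np

def sign_cov_of_differences(X, eps=1e-12):
    """Spatial sign covariance matrix of the pairwise differences of the rows of X."""
    n, p = X.shape
    i, j = np.triu_indices(n, k=1)
    D = X[i] - X[j]                                 # all pairwise differences
    norms = np.linalg.norm(D, axis=1)
    U = D[norms > eps] / norms[norms > eps, None]   # spatial signs of the differences
    return (U.T @ U) / len(U)                       # average of outer products u u^T

rng = np.random.default_rng(5)
X = rng.standard_t(df=3, size=(100, 4))             # heavy-tailed sample
print(sign_cov_of_differences(X).round(3))
```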
Article
Full-text available
A new method for performing a nonlinear form of principal component analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map—for instance, the space of all possible five-pixel products in 16 × 16 images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.
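The computation described, kernel PCA, amounts to building a Gram matrix with a kernel, double-centering it, and taking its leading eigenvectors as scaled component scores. A minimal NumPy sketch with an RBF kernel follows; the bandwidth and toy data are illustrative.

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=0.5):
    """Kernel PCA with an RBF kernel via eigendecomposition of the centered Gram matrix."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # Gram matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                               # centering in feature space
    vals, vecs = np.linalg.eigh(Kc)              # ascending eigenvalues
    vals, vecs = vals[::-1][:n_components], vecs[:, ::-1][:, :n_components]
    return vecs * np.sqrt(np.maximum(vals, 0))   # component scores of the training points

rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
print(kernel_pca(X).shape)                       # (200, 2)
```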
Article
Full-text available
The existence and uniqueness of a limiting form of a Huber-type M-estimator of multivariate scatter is established under certain conditions on the observed sample. These conditions hold with probability one when sampling randomly from a continuous multivariate distribution. The existence of the estimator is proven by showing that it is the limiting point of a specific algorithm. Hence, the proof is constructive. For continuous populations, the estimator of multivariate scatter is shown to be strongly consistent and asymptotically normal. An important property of the estimator is that its asymptotic distribution is distribution-free with respect to the class of continuous elliptically distributed populations. This distribution-free property also holds for the finite sample size distribution when the location parameter is known. In addition, the estimator is the "most robust" estimator of the scatter matrix of an elliptical distribution in the sense of minimizing the maximum asymptotic variance.
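The constructive algorithm referred to above is a simple fixed-point iteration for Tyler's M-estimator of scatter. A sketch with the location assumed known (zero) is given below; the trace normalization and stopping tolerance are illustrative conventions.

```python
import numpy as np

def tyler_scatter(X, max_iter=200, tol=1e-8):
    """Tyler's M-estimator of scatter (location assumed known and equal to zero)."""
    n, p = X.shape
    V = np.eye(p)
    for _ in range(max_iter):
        d2 = np.einsum('ij,jk,ik->i', X, np.linalg.inv(V), X)   # x_i' V^{-1} x_i
        V_new = (p / n) * (X / d2[:, None]).T @ X               # weighted outer products
        V_new *= p / np.trace(V_new)                            # fix the scale (trace = p)
        if np.linalg.norm(V_new - V, ord='fro') < tol:
            return V_new
        V = V_new
    return V

rng = np.random.default_rng(7)
X = rng.standard_t(df=2, size=(500, 3))      # very heavy-tailed sample
print(tyler_scatter(X).round(2))
```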
Article
Full-text available
This letter discusses the robustness issue of kernel principal component analysis. A class of new robust procedures is proposed based on eigenvalue decomposition of weighted covariance. The proposed procedures will place less weight on deviant patterns and thus be more resistant to data contamination and model deviation. Theoretical influence functions are derived, and numerical examples are presented as well. Both theoretical and numerical results indicate that the proposed robust method outperforms the conventional approach in the sense of being less sensitive to outliers. Our robust method and results also apply to functional principal component analysis.
Article
Full-text available
A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.
Article
Full-text available
Statistical depth functions are being formulated ad hoc with increasing popularity in nonparametric inference for multivariate data. Here we introduce several general structures for depth functions, classify many existing examples as special cases, and establish results on the possession, or lack thereof, of four key properties desirable for depth functions in general. Roughly speaking, these properties may be described as: affine invariance, maximality at center, monotonicity relative to deepest point, and vanishing at infinity. This provides a more systematic basis for selection of a depth function. In particular, from these and other considerations it is found that the halfspace depth behaves very well overall in comparison with various competitors. AMS 1991 Subject Classification: Primary 62H05; Secondary 62G20. Key words and phrases: statistical depth functions; halfspace depth; simplicial depth; multivariate symmetry.
Article
Full-text available
We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data. We cover a wide range of methods, ranging from binary classifiers to sophisticated methods for estimation with structured data. Published in the Annals of Statistics (http://dx.doi.org/10.1214/009053607000000677).
Article
Starting with Tukey's pioneering work in the 1970s, the notion of depth in statistics has been widely extended, especially in the last decade. Such extensions include those to high‐dimensional data, functional data, and manifold‐valued data. In particular, in the learning paradigm, the depth‐depth method has become a useful technique. In this article, we extend the lens depth to the case of data in metric spaces and study its main properties. We also introduce, for Riemannian manifolds, the weighted lens depth. The weighted lens depth is nothing more than a lens depth for a weighted version of the Riemannian distance. To build it, we replace the geodesic distance on the manifold with the Fermat distance, which has the important property of taking into account the density of the data together with the geodesic distance. Next, we illustrate our results with some simulations and also in some interesting real datasets, including pattern recognition in phylogenetic trees, using the depth‐depth approach.
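The empirical lens depth of a point x only needs pairwise distances: it is the fraction of sample pairs (X_i, X_j) for which d(X_i, X_j) >= max(d(x, X_i), d(x, X_j)), i.e. for which x lies inside the "lens" of the pair. A sketch working from a precomputed distance matrix is below; the Fermat-distance weighting discussed for manifolds is not included, and the helper names are illustrative.

```python
import numpy as np
from itertools import combinations

def lens_depth(dist_to_x, D):
    """Empirical lens depth of a point, given its distances to the sample (dist_to_x)
    and the pairwise sample distance matrix D (any metric)."""
    n = len(dist_to_x)
    hits = 0
    for i, j in combinations(range(n), 2):
        if D[i, j] >= max(dist_to_x[i], dist_to_x[j]):
            hits += 1                            # x lies inside the lens of (X_i, X_j)
    return hits / (n * (n - 1) / 2)

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
center, far = np.zeros(2), np.full(2, 5.0)
print(lens_depth(np.linalg.norm(X - center, axis=1), D))  # deep point
print(lens_depth(np.linalg.norm(X - far, axis=1), D))     # shallow point
```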
Article
We develop a novel exploratory tool for non-Euclidean object data based on data depth, extending celebrated Tukey’s depth for Euclidean data. The proposed metric halfspace depth, applicable to data objects in a general metric space, assigns to data points depth values that characterize the centrality of these points with respect to the distribution and provides an interpretable center-outward ranking. Desirable theoretical properties that generalize standard depth properties postulated for Euclidean data are established for the metric halfspace depth. The depth median, defined as the deepest point, is shown to have high robustness as a location descriptor both in theory and in simulation. We propose an efficient algorithm to approximate the metric halfspace depth and illustrate its ability to adapt to the intrinsic data geometry. The metric halfspace depth was applied to an Alzheimer’s disease study, revealing group differences in the brain connectivity, modeled as covariance matrices, for subjects in different stages of dementia. Based on phylogenetic trees of 7 pathogenic parasites, our proposed metric halfspace depth was also used to construct a meaningful consensus estimate of the evolutionary history and to identify potential outlier trees.
Article
The R-package REPPlab is designed to explore multivariate data sets using one-dimensional unsupervised projection pursuit. It is useful as a preprocessing step to find clusters or as an outlier detection tool for multivariate data. Apart from the packages tourr and rggobi, there is no other implementation of exploratory projection pursuit tools available in R. REPPlab is an R interface for the Java program EPP-lab that implements four projection indices and three biologically inspired optimization algorithms. It also provides new tools for plotting and combining the results and specific tools for outlier detection. The functionality of the package is illustrated through some simulations and using some real data.
Article
Increasingly, statisticians are faced with the task of analyzing complex data that are non-Euclidean and specifically do not lie in a vector space. To address the need for statistical methods for such data, we introduce the concept of Fréchet regression. This is a general approach to regression when responses are complex random objects in a metric space and predictors are in R^p, achieved by extending the classical concept of a Fréchet mean to the notion of a conditional Fréchet mean. We develop generalized versions of both global least squares regression and local weighted least squares smoothing. The target quantities are appropriately defined population versions of global and local regression for response objects in a metric space. We derive asymptotic rates of convergence for the corresponding fitted regressions using observed data to the population targets under suitable regularity conditions by applying empirical process methods. For the special case of random objects that reside in a Hilbert space, such as regression models with vector predictors and functional data as responses, we obtain a limit distribution. The proposed methods have broad applicability. Illustrative examples include responses that consist of probability distributions and correlation matrices, and we demonstrate both global and local Fréchet regression for demographic and brain imaging data. Local Fréchet regression is also illustrated via a simulation with response data which lie on the sphere.
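As a rough sketch from memory (see the paper for the precise formulation), the global version of this conditional Fréchet mean takes the form of a weighted Fréchet minimization,

$$ m_\oplus(x) \;=\; \operatorname*{arg\,min}_{y \in \Omega} \; E\big[\, s(X, x)\, d^2(Y, y) \,\big], \qquad s(X, x) = 1 + (X - \mu)^\top \Sigma^{-1} (x - \mu), $$

where $\mu = E[X]$ and $\Sigma = \operatorname{Cov}(X)$; in the Euclidean special case this recovers ordinary linear regression, while the local version replaces the weight $s$ with kernel-based weights.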
Book
It is difficult to imagine that the statistical analysis of compositional data has been a major issue of concern for more than 100 years. It is even more difficult to realize that so many statisticians and users of statistics are unaware of the particular problems affecting compositional data, as well as their solutions. The issue of "spurious correlation", as the situation was phrased by Karl Pearson back in 1897, affects all data that measure parts of some whole, such as percentages, proportions, ppm and ppb. Such measurements are present in all fields of science, ranging from geology, biology, environmental sciences, forensic sciences, medicine and hydrology. This book presents the history and development of compositional data analysis along with Aitchison's log-ratio approach. Compositional Data Analysis describes the state of the art both in theoretical fields as well as applications in the different fields of science.
Key Features:
• Reflects the state-of-the-art in compositional data analysis.
• Gives an overview of the historical development of compositional data analysis, as well as basic concepts and procedures.
• Looks at advances in algebra and calculus on the simplex.
• Presents applications in different fields of science, including genomics, ecology, biology, geochemistry, planetology, chemistry and economics.
• Explores connections to correspondence analysis and the Dirichlet distribution.
• Presents a summary of three available software packages for compositional data analysis.
• Supported by an accompanying website featuring R code.
Applied scientists working on compositional data analysis in any field of science, both in academia and as professionals, will benefit from this book, along with graduate students in any field of science working with compositional data.
Article
Multi-dimensional scaling (MDS) plays a central role in data-exploration, dimensionality reduction and visualization. State-of-the-art MDS algorithms are not robust to outliers, yielding significant errors in the embedding even when only a handful of outliers are present. In this paper, we introduce a technique to detect and filter outliers based on geometric reasoning. We test the validity of triangles formed by three points, and mark a triangle as broken if its triangle inequality does not hold. The premise of our work is that unlike inliers, outlier distances tend to break many triangles. Our method is tested and its performance is evaluated on various datasets and distributions of outliers. We demonstrate that for a reasonable amount of outliers, e.g., under 20%, our method is effective, and leads to a high embedding quality.
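A minimal reading of the stated premise can be sketched directly: for each point, count the triangles through it that violate the triangle inequality in the given dissimilarity matrix, and flag points with unusually many broken triangles. This is only an illustration of the idea, not the paper's exact detection and filtering procedure, and the corruption and tolerance below are illustrative.

```python
import numpy as np
from itertools import combinations

def broken_triangle_counts(D, tol=1e-9):
    """For each point, count triangles through it that violate the triangle inequality."""
    n = D.shape[0]
    counts = np.zeros(n, dtype=int)
    for i, j, k in combinations(range(n), 3):
        a, b, c = D[i, j], D[i, k], D[j, k]
        if a > b + c + tol or b > a + c + tol or c > a + b + tol:
            counts[[i, j, k]] += 1               # the triangle (i, j, k) is broken
    return counts

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
bad = rng.choice(np.arange(1, 30), size=10, replace=False)
D[0, bad] += 50.0                                # corrupt some dissimilarities of point 0
D[bad, 0] = D[0, bad]
counts = broken_triangle_counts(D)
print(int(np.argmax(counts)))                    # point 0 accumulates by far the most broken triangles
```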
Article
Fréchet mean and variance provide a way of obtaining mean and variance for general metric space valued random variables and can be used for statistical analysis of data objects that lie in abstract spaces devoid of algebraic structure and operations. Examples of such spaces include covariance matrices, graph Laplacians of networks and univariate probability distribution functions. We derive a central limit theorem for Fréchet variance under mild regularity conditions, utilizing empirical process theory, and also provide a consistent estimator of the asymptotic variance. These results lead to a test to compare k populations based on Fréchet variance for general metric space valued data objects, with emphasis on comparing means and variances. We examine the finite sample performance of this inference procedure through simulation studies for several special cases that include probability distributions and graph Laplacians, which leads to tests to compare populations of networks. The proposed methodology has good finite sample performance in simulations for different kinds of random objects. We illustrate the proposed methods with data on mortality profiles of various countries and resting state Functional Magnetic Resonance Imaging data.
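The raw sample quantities behind this test only require pairwise distances. A crude sketch restricts the Fréchet mean to the observed points (a medoid), which works with nothing but a distance matrix; the asymptotic test itself is not reproduced, and the function names are illustrative.

```python
import numpy as np

def sample_frechet_mean_and_variance(D):
    """Medoid-restricted sample Frechet mean and the Frechet variance at it:
    argmin_j (1/n) sum_i d(Y_j, Y_i)^2, given the pairwise distance matrix D."""
    costs = (D ** 2).mean(axis=1)
    j = int(np.argmin(costs))
    return j, costs[j]                 # medoid index, sample Frechet variance

rng = np.random.default_rng(9)
Y = rng.normal(size=(100, 3))
D = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
idx, var = sample_frechet_mean_and_variance(D)
print(idx, round(var, 3))
```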
Article
The paper develops a general regression framework for the analysis of manifold-valued response in a Riemannian symmetric space (RSS) and its association with multiple covariates of interest, such as age or gender, in Euclidean space. Such RSS-valued data arise frequently in medical imaging, surface modelling and computer vision, among many other fields. We develop an intrinsic regression model solely based on an intrinsic conditional moment assumption, avoiding specifying any parametric distribution in RSS. We propose various link functions to map from the Euclidean space of multiple covariates to the RSS of responses. We develop a two-stage procedure to calculate the parameter estimates and determine their asymptotic distributions. We construct the Wald and geodesic test statistics to test hypotheses of unknown parameters. We systematically investigate the geometric invariant property of these estimates and test statistics. Simulation studies and a real data analysis are used to evaluate the finite sample properties of our methods.
Book
The visual interpretation of data is an essential step to guide any further processing or decision making. Dimensionality reduction (or manifold learning) tools may be used for visualization if the resulting dimension is constrained to be 2 or 3. The field of machine learning has developed numerous nonlinear dimensionality reduction tools in the last decades. However, the diversity of methods reflects the diversity of quality criteria used both for optimizing the algorithms, and for assessing their performances. In addition, these criteria are not always compatible with subjective visual quality. Finally, the dimensionality reduction methods themselves do not always possess computational properties that are compatible with interactive data visualization. This paper presents current and future developments to use dimensionality reduction methods for data visualization.
Article
Detecting and reading text from natural images is a hard computer vision task that is central to a variety of emerging applications. Related problems like document character recognition have been widely studied by computer vision and machine learning researchers and are virtually solved for practical applications like reading handwritten digits. Reliably recognizing characters in more complex scenes like photographs, however, is far more difficult: the best existing methods lag well behind human performance on the same tasks. In this paper we attack the problem of recognizing digits in a real application using unsupervised feature learning methods: reading house numbers from street level photos. To this end, we introduce a new benchmark dataset for research use containing over 600,000 labeled digits cropped from Street View images. We then demonstrate the difficulty of recognizing these digits when the problem is approached with hand-designed features. Finally, we employ variants of two recently proposed unsupervised feature learning methods and find that they are convincingly superior on our benchmarks.
Article
The Davis–Kahan theorem is used in the analysis of many statistical procedures to bound the distance between subspaces spanned by population eigenvectors and their sample versions. It relies on an eigenvalue separation condition between certain population and sample eigenvalues. We present a variant of this result that depends only on a population eigenvalue separation condition, making it more natural and convenient for direct application in statistical contexts, and provide an improvement in many cases to the usual bound in the statistical literature. We also give an extension to situations where the matrices under study may be asymmetric or even non-square, and where interest is in the distance between subspaces spanned by corresponding singular vectors.
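For a single eigenvector, the variant discussed takes roughly the following form (a sketch from memory; see the paper for the exact constants and the general subspace and singular-vector statements). With $\Sigma$ and $\hat\Sigma$ the population and sample matrices and population eigenvalues $\lambda_1 \ge \dots \ge \lambda_p$,

$$ \sin \angle(\hat v_j, v_j) \;\le\; \frac{2\,\|\hat\Sigma - \Sigma\|_{\mathrm{op}}}{\min(\lambda_{j-1} - \lambda_j,\; \lambda_j - \lambda_{j+1})}, $$

with the conventions $\lambda_0 = \infty$ and $\lambda_{p+1} = -\infty$; the key point is that the eigengap in the denominator involves only population eigenvalues.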
Article
Kernel principal component analysis (KPCA) fails to detect the nonlinear structure of data well when outliers exist. To reduce this problem, this paper presents a novel algorithm, named robust weighted KPCA (RWKPCA). RWKPCA works well in dealing with outliers and can be carried out in an iterative manner. The algorithm forms a weighted mean vector and weighted covariance matrix based on an M-estimator from robust statistics; the weight of each datum is then obtained by iterative computation, and the outliers are downweighted accordingly. The RWKPCA algorithm not only retains the nonlinearity property of KPCA but also achieves better robustness and improves the accuracy of KPCA. Simulation experiments show that the developed RWKPCA algorithm outperforms the KPCA algorithm.
Conference Paper
We derive a robust Euclidean embedding procedure based on semidefinite programming that may be used in place of the popular classical multidimensional scaling (cMDS) algorithm. We motivate this algorithm by arguing that cMDS is not particularly robust and has several other deficiencies. General-purpose semidefinite programming solvers are too memory intensive for medium to large sized applications, so we also describe a fast subgradient-based implementation of the robust algorithm. Additionally, since cMDS is often used for dimensionality reduction, we provide an in-depth look at reducing dimensionality with embedding procedures. In particular, we show that it is NP-hard to find optimal low-dimensional embeddings under a variety of cost functions.
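For reference, the classical MDS step whose non-robustness motivates the paper can be sketched in a few lines: double-center the squared dissimilarities and embed with the top eigenpairs. A single grossly corrupted dissimilarity perturbs the whole centered Gram matrix, which is the fragility being argued; the toy corruption below is illustrative.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS from a dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    vals, vecs = vals[::-1][:dim], vecs[:, ::-1][:, :dim]
    return vecs * np.sqrt(np.maximum(vals, 0))   # embedding coordinates

rng = np.random.default_rng(11)
X = rng.normal(size=(50, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
D[0, 1] = D[1, 0] = 100.0                        # a single gross outlier in the dissimilarities
Y = classical_mds(D)
print(np.abs(Y).max())                           # the embedding is dominated by the outlier
```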
Article
We extend the theory of distance (Brownian) covariance from Euclidean spaces, where it was introduced by Székely, Rizzo and Bakirov, to general metric spaces. We show that for testing independence, it is necessary and sufficient that the metric space be of strong negative type. In particular, we show that this holds for separable Hilbert spaces, which answers a question of Kosorok. Instead of the manipulations of Fourier transforms used in the original work, we use elementary inequalities for metric spaces and embeddings in Hilbert spaces.
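The Euclidean sample version being generalized can be computed directly from the two pairwise distance matrices by double-centering them and averaging their elementwise product; the metric-space extension essentially allows those distances to come from other metrics (of strong negative type). A sketch of the sample statistic, with illustrative toy data:

```python
import numpy as np

def distance_covariance(A, B):
    """Sample distance covariance from two pairwise distance matrices A and B."""
    def center(D):
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    return np.sqrt(max((center(A) * center(B)).mean(), 0.0))

rng = np.random.default_rng(12)
x = rng.normal(size=200)
y = x ** 2 + 0.1 * rng.normal(size=200)          # dependent on x, but uncorrelated
z = rng.normal(size=200)                         # independent of x
A = np.abs(x[:, None] - x[None, :])
B = np.abs(y[:, None] - y[None, :])
C = np.abs(z[:, None] - z[None, :])
print(distance_covariance(A, B))                 # dependence  -> clearly positive
print(distance_covariance(A, C))                 # independence -> near zero
```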
Article
A general method for exploring multivariate data by comparing different estimates of multivariate scatter is presented. The method is based on the eigenvalue-eigenvector decomposition of one scatter matrix relative to another. In particular, it is shown that the eigenvectors can be used to generate an affine invariant co-ordinate system for the multivariate data. Consequently, we view this method as a method for "invariant co-ordinate selection". By plotting the data with respect to this new invariant co-ordinate system, various data structures can be revealed. For example, under certain independent components models, it is shown that the invariant co-ordinates correspond to the independent components. Another example pertains to mixtures of elliptical distributions. In this case, it is shown that a subset of the invariant co-ordinates corresponds to Fisher's linear discriminant subspace, even though the class identifications of the data points are unknown. Some illustrative examples are given.
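A minimal sketch of the comparison described, using the ordinary covariance matrix and a kurtosis-based fourth-moment scatter as the two scatters: the invariant coordinates are obtained from the generalized eigenvectors of the pair. This particular scatter pair is one common illustrative choice, not the only one considered in the paper.

```python
import numpy as np
from scipy.linalg import eigh

def ics_coordinates(X):
    """Invariant coordinates from comparing cov(X) with a kurtosis-weighted scatter."""
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    S1 = np.cov(Xc, rowvar=False)
    r2 = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(S1), Xc)    # squared Mahalanobis distances
    S2 = (Xc * r2[:, None]).T @ Xc / (n * (p + 2))              # fourth-moment scatter
    _, U = eigh(S2, S1)                                         # generalized eigenvectors
    return Xc @ U                                               # invariant coordinates

rng = np.random.default_rng(13)
X = np.r_[rng.normal(size=(450, 3)), rng.normal(loc=4.0, size=(50, 3))]  # hidden small cluster
Z = ics_coordinates(X)
print(Z.shape)      # coordinates in which mixture structure tends to become visible
```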
Article
Sufficient conditions are given for the uniqueness of intrinsic and extrinsic means as measures of location of probability measures Q on Riemannian manifolds. It is shown that, when uniquely defined, these are estimated consistently by the corresponding indices of the empirical measure Q̂_n. Asymptotic distributions of extrinsic sample means are derived. Explicit computations of these indices of Q̂_n and their asymptotic dispersions are carried out for distributions on the sphere S^d (directional spaces), real projective space RP^{N-1} (axial spaces) and complex projective space CP^{k-2} (planar shape spaces).
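For the sphere, the extrinsic mean has a particularly simple form: it is the projection of the ordinary Euclidean mean back onto the sphere, defined whenever that mean is nonzero. A tiny sketch, with illustrative data around the north pole:

```python
import numpy as np

def extrinsic_mean_sphere(X):
    """Extrinsic mean of unit vectors X (rows on the sphere): project the Euclidean mean."""
    m = X.mean(axis=0)
    norm = np.linalg.norm(m)
    if norm == 0:
        raise ValueError("Extrinsic mean is not unique: the Euclidean mean is zero.")
    return m / norm

rng = np.random.default_rng(14)
v = np.array([0.0, 0.0, 1.0])
X = v + 0.2 * rng.normal(size=(100, 3))          # points scattered around the north pole
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(extrinsic_mean_sphere(X).round(3))         # close to (0, 0, 1)
```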
Sliced inverse regression in metric spaces
  • J Virta