Conference Paper

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions.

Conference: Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada.
Source: DBLP

ABSTRACT Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.
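To make the quantities in the abstract concrete, the following is a minimal sketch of a biased finite-sample estimate of the squared MMD with a Gaussian kernel, together with a generalized MMD taken as the maximum over a finite grid of bandwidths. It is an illustration under stated assumptions, not the authors' implementation: the function names, the representation of samples as tuples of floats, and the use of a finite bandwidth grid to approximate the supremum over the kernel class are all choices made here.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared_biased(xs, ys, bandwidth):
    """Biased (V-statistic) estimate of MMD^2 between two samples.

    This equals the squared RKHS distance between the two empirical
    mean embeddings: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    """
    m, n = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in xs) / (m * m)
    k_yy = sum(gaussian_kernel(a, b, bandwidth) for a in ys for b in ys) / (n * n)
    k_xy = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def generalized_mmd(xs, ys, bandwidths):
    """Maximum of the estimated MMD over a finite grid of bandwidths.

    Assumption: the grid stands in for the supremum over a kernel class.
    The clamp at zero only guards against floating-point round-off.
    """
    return max(
        math.sqrt(max(mmd_squared_biased(xs, ys, b), 0.0)) for b in bandwidths
    )
```

For example, calling `generalized_mmd(xs, ys, [0.5, 1.0, 2.0])` on two lists of one-dimensional points such as `[(0.1,), (0.3,)]` returns the largest estimated MMD among the three bandwidths. The double loops cost quadratic time in the sample size, which is acceptable for a small illustration only.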

  • ABSTRACT: Let P be a distribution with support S. The salient features of S can be quantified with persistent homology, which summarizes topological features of the sublevel sets of the distance function (the distance of any point x to S). Given a sample from P we can infer the persistent homology using an empirical version of the distance function. However, the empirical distance function is highly non-robust to noise and outliers. Even one outlier is deadly. The distance-to-a-measure (DTM), introduced by Chazal et al. (2011), and the kernel distance, introduced by Phillips et al. (2014), are smooth functions that provide useful topological information but are robust to noise and outliers. Chazal et al. (2014) derived concentration bounds for DTM. Building on these results, we derive limiting distributions and confidence sets, and we propose a method for choosing tuning parameters.
  • ABSTRACT: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
    Advances in Neural Information Processing Systems 25 (NIPS 2012); 01/2012
  • ABSTRACT: Recently, domain adaptation learning (DAL) has shown surprising performance by utilizing labeled samples from the source (or auxiliary) domain to learn a robust classifier for the target domain of interest, which has few or even no labeled samples. In this paper, by incorporating the classical graph-based transductive semi-supervised learning (SSL) paradigm, a novel DAL method is proposed based on a sparse graph constructed via kernel sparse representation of data in an optimal reproducing kernel Hilbert space (RKHS) recovered by minimizing inter-domain distribution discrepancy. Our method, named Sparsity regularization Label Propagation for Domain Adaptation Learning (SLPDAL), can propagate the labels of the labeled data from both domains to the unlabeled data in the target domain using their sparsely reconstructed objects with sufficient smoothness, in three steps: (1) an optimal RKHS is first recovered so as to minimize the discrepancy between the data distributions of the two domains; (2) the best kernel sparse reconstruction coefficients are then computed for each data point in both domains by using l1-norm minimization in the RKHS, thus constructing a sparse graph; and (3) the labels of the labeled data from both domains are finally propagated to the unlabeled points in the target domain over the sparse graph based on our proposed sparsity regularization framework, in which it is assumed that the label of each data point can be sparsely reconstructed by those of other data points from both domains. Furthermore, based on the proposed sparsity regularization framework, an easy way is derived to extend SLPDAL to out-of-sample data. Promising experimental results have been obtained on both a series of toy datasets and several real-world datasets such as face, visual video and text.
    Neurocomputing 09/2014; 139:202–219. DOI:10.1016/j.neucom.2014.02.044 · 2.01 Impact Factor
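
The two-sample-test abstract in the list above mentions a test statistic computed with cost linear in the sample size. A common construction of this kind averages a four-term kernel expression over disjoint consecutive pairs of observations, so each sample is read once and nothing beyond the current pair needs to be stored. The sketch below illustrates that idea only. It assumes a Gaussian kernel with a fixed bandwidth and samples given as tuples of floats, the function names are hypothetical, and it does not implement the kernel selection or test threshold described in that abstract.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def linear_time_mmd_squared(xs, ys, bandwidth):
    """Linear-time estimate of MMD^2 from disjoint consecutive pairs.

    For each pair (x1, x2), (y1, y2) the summand is
    k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1).
    The estimate can be negative, since it is an average of such terms.
    """
    num_pairs = min(len(xs), len(ys)) // 2
    if num_pairs == 0:
        raise ValueError("need at least two observations from each sample")
    total = 0.0
    for i in range(num_pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += (
            gaussian_kernel(x1, x2, bandwidth)
            + gaussian_kernel(y1, y2, bandwidth)
            - gaussian_kernel(x1, y2, bandwidth)
            - gaussian_kernel(x2, y1, bandwidth)
        )
    return total / num_pairs
```

Compared with a quadratic-time estimate over all pairs, this trades statistical efficiency for a single pass over the data, which is the property that makes such statistics attractive for streaming settings.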