Conference Paper

# Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

##### Bernhard Schölkopf
Conference: Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada.
Source: DBLP

ABSTRACT Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation-invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.
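To make the quantities in the abstract concrete, here is a minimal sketch (in Python with NumPy; the function names are ours, not the paper's) of the standard biased finite-sample MMD estimate, and of the generalized MMD taken as a supremum over a family of Gaussian bandwidths:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(X, Y, sigma):
    """Biased (V-statistic) estimate of squared MMD between X ~ P and Y ~ Q."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def generalized_mmd2(X, Y, sigmas):
    """Generalized MMD^2: supremum of MMD^2 over a family of bandwidths."""
    return max(mmd2_biased(X, Y, s) for s in sigmas)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
print(generalized_mmd2(X, Y, [0.5, 1.0, 2.0]))
```

The biased estimate is the squared RKHS norm of the difference of empirical mean embeddings, so it is nonnegative and vanishes when the two samples coincide.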

##### Article: Testing Hypotheses by Regularized Maximum Mean Discrepancy
ABSTRACT: Do two data samples come from different distributions? Recent studies of this fundamental problem focused on embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean Discrepancy (RMMD), our novel measure for kernel-based hypothesis testing, yields substantial improvements even when sample sizes are small, and excels at hypothesis tests involving multiple comparisons with power control. We derive asymptotic distributions under the null and alternative hypotheses, and assess power control. Outstanding results are obtained on challenging EEG data, MNIST, the Berkeley Covertype data, and the Flare-Solar dataset.
05/2013.
##### Article: Equivalence of distance-based and RKHS-based statistics in hypothesis testing
ABSTRACT: We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, Maximum Mean Discrepancies (MMD), i.e., distances between embeddings of distributions into reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed the distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as an energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.
The Annals of Statistics 10/2013; 41(5):2263-2291.
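The stated equivalence is easy to check numerically. A sketch (function names are ours), taking the Euclidean distance as the negative-type semimetric ρ and defining the distance kernel centred at an arbitrary point z as k(x, y) = ½(ρ(x, z) + ρ(y, z) − ρ(x, y)); the z-dependent terms cancel in the MMD, and the (biased) energy distance equals twice the biased MMD² under this kernel:

```python
import numpy as np

def pairwise_dist(X, Y):
    """Euclidean distance matrix rho(x_i, y_j)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.sqrt(np.maximum(sq, 0.0))

def energy_distance(X, Y):
    """V-statistic estimate: 2 E rho(X,Y) - E rho(X,X') - E rho(Y,Y')."""
    return (2 * pairwise_dist(X, Y).mean()
            - pairwise_dist(X, X).mean()
            - pairwise_dist(Y, Y).mean())

def distance_kernel(X, Y, z):
    """Distance kernel k(x, y) = (rho(x,z) + rho(y,z) - rho(x,y)) / 2."""
    return 0.5 * (pairwise_dist(X, z[None, :])
                  + pairwise_dist(Y, z[None, :]).T
                  - pairwise_dist(X, Y))

def mmd2_distance_kernel(X, Y, z):
    """Biased MMD^2 computed with the distance kernel centred at z."""
    return (distance_kernel(X, X, z).mean()
            + distance_kernel(Y, Y, z).mean()
            - 2 * distance_kernel(X, Y, z).mean())

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 3))
Y = rng.normal(1.0, 1.0, size=(120, 3))
z = np.zeros(3)
print(energy_distance(X, Y), 2 * mmd2_distance_kernel(X, Y, z))
```

Expanding the three kernel averages shows every ρ(·, z) term cancels, leaving exactly half the energy distance, whatever z is chosen.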
##### Article: On the Optimal Estimation of Probability Measures in Weak and Strong Topologies
ABSTRACT: Given random samples drawn i.i.d. from a probability measure $\mathbb{P}$ (defined, say, on $\mathbb{R}^d$), it is well-known that the empirical estimator is an optimal estimator of $\mathbb{P}$ in the weak topology but not even a consistent estimator of its density (if it exists) in the strong topology (induced by the total variation distance). On the other hand, various popular density estimators such as kernel and wavelet density estimators are optimal in the strong topology in the sense of achieving the minimax rate over all estimators for a Sobolev ball of densities. Recently, it has been shown in a series of papers by Gin\'{e} and Nickl that these density estimators on $\mathbb{R}$ that are optimal in the strong topology are also optimal in $\Vert\cdot\Vert_\mathcal{F}$ for certain choices of $\mathcal{F}$ such that $\Vert\cdot\Vert_\mathcal{F}$ metrizes the weak topology, where $\Vert\mathbb{P}\Vert_\mathcal{F}:=\sup\{\int f\,d\mathbb{P}:f\in\mathcal{F}\}$. In this paper, we investigate this problem of optimal estimation in weak and strong topologies by choosing $\mathcal{F}$ to be a unit ball in a reproducing kernel Hilbert space (say $\mathcal{F}_H$, defined over $\mathbb{R}^d$), where this choice is both of theoretical and computational interest. Under some mild conditions on the reproducing kernel, we show that $\Vert\cdot\Vert_{\mathcal{F}_H}$ metrizes the weak topology and that the kernel density estimator (with $L^1$ optimal bandwidth) estimates $\mathbb{P}$ at the dimension-independent optimal rate of $n^{-1/2}$ in $\Vert\cdot\Vert_{\mathcal{F}_H}$, along with providing a uniform central limit theorem for the kernel density estimator.
10/2013.
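The dimension-independent $n^{-1/2}$ rate in $\Vert\cdot\Vert_{\mathcal{F}_H}$ can be illustrated for the plain empirical estimator, where the RKHS distance to $\mathbb{P}$ is available in closed form when $\mathbb{P}$ is a standard Gaussian and the RKHS kernel is Gaussian (the paper's result concerns kernel density estimators; this sketch, with our own function names, only visualizes the rate):

```python
import numpy as np

def rkhs_distance_to_gaussian(X, sigma):
    """Exact ||P_hat_n - P||_{F_H} for P = N(0, I_d) and Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), via closed-form embeddings."""
    n, d = X.shape
    s2 = sigma**2
    # <mu_hat, mu_hat>: average of k over all sample pairs
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    term1 = np.exp(-sq / (2 * s2)).mean()
    # <mu_hat, mu_P>: E_Y k(x, Y) is closed form for Y ~ N(0, I_d)
    c1 = (s2 / (s2 + 1)) ** (d / 2)
    term2 = c1 * np.exp(-np.sum(X**2, 1) / (2 * (s2 + 1))).mean()
    # <mu_P, mu_P> = (sigma^2 / (sigma^2 + 2))^{d/2}
    term3 = (s2 / (s2 + 2)) ** (d / 2)
    return np.sqrt(max(term1 - 2 * term2 + term3, 0.0))

rng = np.random.default_rng(0)
for n in [20, 200, 2000]:
    X = rng.standard_normal((n, 2))
    print(n, rkhs_distance_to_gaussian(X, sigma=1.0))
```

Since the expected squared distance is $(1/n)\,(1 - (\sigma^2/(\sigma^2+2))^{d/2})$, the printed values shrink roughly as $n^{-1/2}$ regardless of $d$.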