Conference Paper

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Conference: Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada.
Source: DBLP


Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.
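
To make the quantities above concrete, the following Python sketch computes the standard unbiased MMD² estimate with a Gaussian kernel and then takes the generalized MMD as the supremum over a small family of bandwidths. It is a minimal illustration rather than the paper's implementation; the Gaussian kernel choice, bandwidth grid, and sample sizes are assumptions made purely for the example.

```python
import numpy as np

def gaussian_gram(X, Y, bandwidth):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def mmd2_unbiased(X, Y, bandwidth):
    """Unbiased estimate of MMD^2 between samples X ~ p and Y ~ q."""
    m, n = len(X), len(Y)
    Kxx = gaussian_gram(X, X, bandwidth)
    Kyy = gaussian_gram(Y, Y, bandwidth)
    Kxy = gaussian_gram(X, Y, bandwidth)
    # Drop diagonal terms so the within-sample sums are unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

def generalized_mmd2(X, Y, bandwidths=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Supremum of MMD^2 estimates over a family of Gaussian bandwidths."""
    return max(mmd2_unbiased(X, Y, s) for s in bandwidths)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))   # samples from p
Y = rng.normal(0.5, 1.0, size=(200, 1))   # samples from q (shifted mean)
print(generalized_mmd2(X, Y))             # clearly positive: p != q
```

Taking the supremum over the bandwidth family removes the need to commit to a single kernel in advance, which is the motivation for the generalized MMD described in the abstract.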

  • Source
    • "The MMD has been attracting attention in recent two-sample test research due to its solid theoretical base (Smola, Gretton, Song, & Schölkopf , 2008; Sriperumbudur, Gretton, Fukumizu, Schölkopf , & Lanckriet, 2010; Sriperumbudur, Fukumizu, & Lanckriet, 2011; Sejdinovic, Sriperumbudur, Gretton, & Fukumizu, 2013) and successful applications, including biological data tests, data integration and attribute matching (Gretton, Fukumizu, Harchaoui, & Sriperumbudur, 2009), outlier detection, data classifiability (Sriperumbudur, Fukumizu, Gretton, Lanckriet, & Schölkopf , 2009), and domain adaptation. By generalizing the MMD to kernel families as the supremum of MMDs on a class of kernels, it has also been effectively used for some basic machine learning problems such as kernel selection (Sriperumbudur et al., 2009). The exact MMD needs O(N 2 d) computational cost, where N and d denote the size and dimension of samples, respectively, to calculate the kernel values of all pairs from the assessed two-sample sets. "
    ABSTRACT: The maximum mean discrepancy (MMD) is a recently proposed test statistic for the two-sample test. Its quadratic time complexity, however, greatly hampers its availability to large-scale applications. To accelerate the MMD calculation, in this study we propose an efficient method called FastMMD. The core idea of FastMMD is to equivalently transform the MMD with shift-invariant kernels into the amplitude expectation of a linear combination of sinusoid components, based on Bochner's theorem and the Fourier transform [Rahimi07]. Taking advantage of sampling from the Fourier transform, FastMMD decreases the time complexity of the MMD calculation from O(N²d) to O(Nd), where N and d are the size and dimension of the sample set, respectively. For kernels that are spherically invariant, the computation can be further accelerated to O(N log d) by using the Fastfood technique [LeQ13]. The uniform convergence of our method has also been theoretically proved for both unbiased and biased estimates. We further provide a geometric explanation of our method, namely ensemble of circular discrepancy, which helps in understanding the insight of the MMD and may inspire further metrics for assessing the two-sample test. Experimental results substantiate that FastMMD achieves accuracy similar to the exact MMD, with faster computation and lower variance than existing MMD approximation methods.
    Neural Computation 27(6), 04/2014. DOI:10.1162/NECO_a_00732
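
The abstract above rests on the random Fourier feature idea: by Bochner's theorem a shift-invariant kernel is the expectation of cosine features, so mean embeddings (and hence the MMD) can be approximated in time linear in the sample size. The sketch below shows that generic construction, not the FastMMD algorithm itself (which additionally exploits the sinusoid-amplitude formulation and Fastfood); the bandwidth and feature count are illustrative choices.

```python
import numpy as np

def rff_mean_embedding(X, W, b):
    """Average random Fourier feature map over a sample.

    By Bochner's theorem, k(x, y) = E_w[ sqrt(2) cos(w.x + b) * sqrt(2) cos(w.y + b) ]
    when w is drawn from the kernel's spectral density and b ~ Uniform[0, 2*pi).
    """
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b).mean(axis=0)

def approx_mmd2(X, Y, bandwidth=1.0, num_features=512, seed=0):
    """Approximate (biased) MMD^2 in O(N * num_features) time instead of O(N^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Spectral density of the Gaussian kernel with this bandwidth: N(0, (1/bandwidth^2) I).
    W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    mu_p = rff_mean_embedding(X, W, b)
    mu_q = rff_mean_embedding(Y, W, b)
    return float(np.sum((mu_p - mu_q) ** 2))
```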
  • Source
    • ", 2 5 . In our third experiment, we used the benchmark data of [15]: one distribution was a univariate Gaussian, and the second was a univariate Gaussian with a sinusoidal perturbation of increasing frequency (where higher frequencies correspond to differences in distribution that are more difficult to detect). "
    ABSTRACT: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
    Advances in Neural Information Processing Systems 25 (NIPS 2012), 01/2012
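
For completeness, here is a minimal permutation-based version of the two-sample test discussed above: the threshold at level alpha is the (1 - alpha) quantile of the MMD statistic under random relabelings of the pooled sample. This is a generic baseline, not the linear-time statistic or the power-maximizing kernel selection proposed in that paper; the bandwidth and permutation count are arbitrary defaults.

```python
import numpy as np

def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased MMD^2 estimate with a Gaussian kernel (V-statistic form)."""
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * Z @ Z.T
    K = np.exp(-sq / (2.0 * bandwidth**2))
    m = len(X)
    return K[:m, :m].mean() + K[m:, m:].mean() - 2.0 * K[:m, m:].mean()

def permutation_test(X, Y, alpha=0.05, num_perms=500, bandwidth=1.0, seed=0):
    """Reject H0: p = q if the observed MMD^2 exceeds the permutation (1 - alpha) quantile."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(X, Y, bandwidth)
    Z = np.vstack([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(num_perms):
        idx = rng.permutation(len(Z))       # random relabeling under the null
        null_stats.append(mmd2_biased(Z[idx[:m]], Z[idx[m:]], bandwidth))
    threshold = np.quantile(null_stats, 1.0 - alpha)
    return observed > threshold, observed, threshold
```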
  • Source
    ABSTRACT: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γ_k, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γ_k. First, we consider the question of determining the conditions on the kernel k for which γ_k is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on ℝ^d, then it is characteristic if and only if the support of its Fourier transform is the entire ℝ^d. Second, we show that the distance between distributions under γ_k results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γ_k, we relate γ_k to other popular metrics on probability measures, and present conditions on the kernel k under which γ_k metrizes the weak topology.
    Journal of Machine Learning Research 11:1517-1561, 07/2009
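
A small numerical illustration (not drawn from the paper) of why being characteristic matters: a linear kernel embeds each distribution through its mean only, so two distributions with equal means but different variances are assigned zero distance, whereas a characteristic Gaussian kernel separates them.

```python
import numpy as np

def mmd2(X, Y, kernel):
    """Biased MMD^2 estimate under an arbitrary kernel(A, B) -> Gram matrix."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()

def linear_kernel(A, B):
    # k(x, y) = <x, y>: the embedding captures only the mean, so it is not characteristic.
    return A @ B.T

def gaussian_kernel(A, B, bandwidth=1.0):
    # Gaussian kernels are characteristic: their Fourier transform has full support on R^d.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * bandwidth**2))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(2000, 1))   # p = N(0, 1)
Y = rng.normal(0.0, 2.0, size=(2000, 1))   # q = N(0, 4): same mean, different variance
print(mmd2(X, Y, linear_kernel))           # ~0: linear kernel cannot separate p and q
print(mmd2(X, Y, gaussian_kernel))         # clearly > 0: Gaussian kernel distinguishes them
```

At the population level the linear-kernel distance equals the Euclidean distance between the means, so the first printed value is only approximately zero due to sampling noise.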

