Conference Paper

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions.

Conference: Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada.
Source: DBLP

ABSTRACT Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and com- paring probabilities. In particular, the distance between embeddings (the maxi- mum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are intro- duced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distri- butions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation in- variant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by min- imizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demon- strated on a homogeneity testing example.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: A Hilbert space embedding for probability measures has recently been proposed, wherein any probability measure is represented as a mean element in a reproducing kernel Hilbert space (RKHS). Such an embedding has found applications in homogeneity testing, independence testing, dimensionality reduction, etc., with the requirement that the reproducing kernel is characteristic, i.e., the embedding is injective. In this paper, we generalize this embedding to finite signed Borel measures, wherein any finite signed Borel measure is represented as a mean element in an RKHS. We show that the proposed embedding is injective if and only if the kernel is universal. This therefore, provides a novel characterization of universal kernels, which are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms. By exploiting this relation between universality and the embedding of finite signed Borel measures into an RKHS, we establish the relation between universal and characteristic kernels. Comment: 30 pages, 1 figure
    Journal of Machine Learning Research - Proceedings Track. 01/2010; 9:773-780.
  • [Show abstract] [Hide abstract]
    ABSTRACT: Exploratory tools that are sensitive to arbitrary statistical variations in spike train observations open up the possibility of novel neuroscientific discoveries. Developing such tools, however, is difficult due to the lack of Euclidean structure of the spike train space, and an experimenter usually prefers simpler tools that capture only limited statistical features of the spike train, such as mean spike count or mean firing rate. We explore strictly positive-definite kernels on the space of spike trains to offer both a structural representation of this space and a platform for developing statistical measures that explore features beyond count or rate. We apply these kernels to construct measures of divergence between two point processes and use them for hypothesis testing, that is, to observe if two sets of spike trains originate from the same underlying probability law. Although there exist positive-definite spike train kernels in the literature, we establish that these kernels are not strictly definite and thus do not induce measures of divergence. We discuss the properties of both of these existing nonstrict kernels and the novel strict kernels in terms of their computational complexity, choice of free parameters, and performance on both synthetic and real data through kernel principal component analysis and hypothesis testing.
    Neural Computation 04/2012; 24(8):2223-50. · 1.76 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk. First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on ℜd, then it is characteristic if and only if the support of its Fourier transform is the entire ℜd. Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk, we relate γk to other popular metrics on probability measures, and present conditions on the kernel k under which γk metrizes the weak topology.
    Journal of Machine Learning Research 07/2009; 11:1517-1561. · 3.42 Impact Factor


Available from