
Kernel Choice and Classifiability for RKHS

Embeddings of Probability Distributions

Bharath K. Sriperumbudur

Department of ECE

UC San Diego, La Jolla, USA

bharathsv@ucsd.edu

Kenji Fukumizu

The Institute of Statistical Mathematics

Tokyo, Japan

fukumizu@ism.ac.jp

Arthur Gretton

Carnegie Mellon University

MPI for Biological Cybernetics

arthur.gretton@gmail.com

Gert R. G. Lanckriet

Department of ECE

UC San Diego, La Jolla, USA

gert@ece.ucsd.edu

Bernhard Schölkopf

MPI for Biological Cybernetics

Tübingen, Germany

bs@tuebingen.mpg.de

Abstract

Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.

1 Introduction

Kernel methods are broadly established as a useful way of constructing nonlinear algorithms from linear ones, by embedding points into higher dimensional reproducing kernel Hilbert spaces (RKHSs) [12]. A generalization of this idea is to embed probability distributions into RKHSs, giving us a linear method for dealing with higher order statistics [8, 15, 17]. More specifically, suppose we are given the set P of all Borel probability measures defined on the topological space M, and the RKHS (H, k) of functions on M with k as its reproducing kernel (r.k.). For P ∈ P, denote by Pk := ∫_M k(·, x) dP(x). If k is measurable and bounded, then we may define the embedding of P in H as Pk ∈ H. The RKHS distance between two such mappings associated with P, Q ∈ P is called the maximum mean discrepancy (MMD) [8, 17], and is written

γ_k(P, Q) = ‖Pk − Qk‖_H.  (1)

We say that k is characteristic [6, 17] if the mapping P ↦ Pk is injective, in which case (1) is zero if and only if P = Q, i.e., γ_k is a metric on P. An immediate application of the MMD is to problems of comparing distributions based on finite samples: examples include tests of homogeneity [8], independence [9], and conditional independence [6]. In this application domain, the question of whether k is characteristic is key: without this property, the algorithms can fail through inability to distinguish between particular distributions.
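To make (1) concrete, its squared empirical version can be computed entirely from kernel evaluations. Below is a minimal numpy sketch, assuming a Gaussian kernel and a biased V-statistic estimate (both are illustrative choices, not prescribed by the text):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-sigma * ||x - y||^2) for all pairs of rows of x and y."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sigma * d2)

def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of gamma_k(P, Q)^2 = ||Pk - Qk||_H^2."""
    kxx = gaussian_kernel(x, x, sigma).mean()
    kyy = gaussian_kernel(y, y, sigma).mean()
    kxy = gaussian_kernel(x, y, sigma).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 1))
y = rng.normal(1.0, 1.0, size=(500, 1))  # shifted mean, so P differs from Q
print(mmd2(x, y))  # clearly positive
print(mmd2(x, x))  # identical samples: zero
```

Note that the estimate is exactly zero when the two samples coincide, mirroring the metric property when k is characteristic.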

Characteristic kernels are important in binary classification: The problem of distinguishing distributions is strongly related to binary classification: indeed, one would expect easily distinguishable distributions to be easily classifiable.¹ The link between these two problems is especially direct in the case of the MMD: in Section 2, we show that γ_k is the negative of the optimal risk (corresponding to a linear loss function) associated with the Parzen window classifier [12, 14] (also called the kernel classification rule [4, Chapter 10]), where the Parzen window turns out to be k. We also show that γ_k is an upper bound on the margin of a hard-margin support vector machine (SVM). The importance of using characteristic RKHSs is further underlined by this link: if the property does not hold, then there exist distributions that are unclassifiable in the RKHS H. We further strengthen this by showing that characteristic kernels are necessary (and sufficient under certain conditions) to achieve the Bayes risk in kernel-based classification algorithms.

Characterization of characteristic kernels: Given the centrality of the characteristic property to both RKHS classification and RKHS distribution testing, we should take particular care in establishing which kernels satisfy this requirement. Early results in this direction include [8], where k is shown to be characteristic on compact M if it is universal in the sense of Steinwart [18, Definition 4]; and [6, 7], which address the case of non-compact M, and show that k is characteristic if and only if H + R is dense in the Banach space of p-power (p ≥ 1) integrable functions. The conditions in both these studies can be difficult to check and interpret, however, and the restriction of the first to compact M is limiting. In the case of translation invariant kernels, [17] proved the kernel to be characteristic if and only if the support of the Fourier transform of k is the entire R^d, which is a much easier condition to verify. Similar sufficient conditions are obtained by [7] for translation invariant kernels on groups and semi-groups. In Section 3, we expand the class of characteristic kernels to include kernels that may or may not be translation invariant, with the introduction of a novel criterion: strictly positive definite kernels (see Definition 3) on M are characteristic.

Choice of characteristic kernels: In expanding the families of allowable characteristic kernels, we have so far neglected the question of which characteristic kernel to choose. A practitioner asking by how much two samples differ does not want to receive a blizzard of answers for every conceivable kernel and bandwidth setting, but a single measure that satisfies some “reasonable” notion of distance across the family of kernels considered. Thus, in Section 4, we propose a generalization of the MMD, yielding a new distance measure between P and Q defined as

γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖Pk − Qk‖_H : k ∈ K},  (2)

which is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. For example, K can be the family of Gaussian kernels on R^d indexed by the bandwidth parameter. This distance measure is very natural in the light of our results on binary classification (in Section 2): most directly, it corresponds to the problem of learning the kernel by minimizing the risk of the associated Parzen-based classifier. As a less direct justification, we also increase the upper bound on the margin allowed for a hard-margin SVM between the samples. To apply the generalized MMD in practice, we must ensure its empirical estimator is consistent. In our main result of Section 4, we provide an empirical estimate of γ(P, Q) based on finite samples, and show that many popular kernels like the Gaussian, Laplacian, and the entire Matérn class on R^d yield consistent estimates of γ(P, Q). The proof is based on bounding the Rademacher chaos complexity of K, which can be understood as the U-process equivalent of Rademacher complexity [3].

¹There is a subtlety here, since unlike the problem of testing for differences in distributions, classification suffers from slow learning rates. See [4, Chapter 7] for details.

Finally, in Section 5, we provide a simple experimental demonstration that the generalized MMD

can be applied in practice to the problem of homogeneity testing. Specifically, we show that when

two distributions differ on particular length scales, the kernel selected by the generalized MMD

is appropriate to this difference, and the resulting hypothesis test outperforms the heuristic kernel

choice employed in earlier studies [8]. The proofs of the results in Sections 2-4 are provided in the

appendix.

2 Characteristic Kernels and Binary Classification

One of the most important applications of the maximum mean discrepancy is in nonparametric hypothesis testing [8, 9, 6], where the characteristic property of k is required to distinguish between probability measures. In the following, we show how MMD naturally appears in binary classification, with reference to the Parzen window classifier and hard-margin SVM. This motivates the need for characteristic k to guarantee that classes arising from different distributions can be classified by kernel-based algorithms.

To this end, let us consider the binary classification problem with X being an M-valued random variable, Y being a {−1, +1}-valued random variable and the product space, M × {−1, +1}, being endowed with an induced Borel probability measure µ. A discriminant function, f, is a real valued measurable function on M, whose sign is used to make a classification decision. Given a loss function L : {−1, +1} × R → R, the goal is to choose an f that minimizes the risk associated with L, with the optimal L-risk being defined as

R^L_{F⋆} = inf_{f∈F⋆} ∫ L(y, f(x)) dµ(x, y) = inf_{f∈F⋆} [ ε ∫_M L₁(f) dP + (1 − ε) ∫_M L₋₁(f) dQ ],  (3)

where F⋆ is the set of all measurable functions on M, L₁(α) := L(1, α), L₋₁(α) := L(−1, α), P(X) := µ(X|Y = +1), Q(X) := µ(X|Y = −1), ε := µ(M, Y = +1). Here, P and Q represent the class-conditional distributions and ε is the prior distribution of class +1. Now, we present the result that relates γ_k to the optimal risk associated with the Parzen window classifier.

Theorem 1 (γ_k and Parzen classification). Let L₁(α) = −α/ε and L₋₁(α) = α/(1 − ε). Then, γ_k(P, Q) = −R^L_{F_k}, where F_k = {f : ‖f‖_H ≤ 1} and H is an RKHS with a measurable and bounded k. Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from µ and m = |{i : Y_i = 1}|. If f̂ ∈ F_k is an empirical minimizer of (3) (where F⋆ is replaced by F_k in (3)), then

sign(f̂(x)) = 1 if (1/m) Σ_{Y_i=1} k(x, X_i) > (1/(N − m)) Σ_{Y_i=−1} k(x, X_i), and −1 if (1/m) Σ_{Y_i=1} k(x, X_i) ≤ (1/(N − m)) Σ_{Y_i=−1} k(x, X_i),  (4)

which is the Parzen window classifier.

Theorem 1 shows that γ_k is the negative of the optimal L-risk (where L is the linear loss as defined in Theorem 1) associated with the Parzen window classifier. Therefore, if k is not characteristic, which means γ_k(P, Q) = 0 for some P ≠ Q, then R^L_{F_k} = 0, i.e., the risk is maximum (note that since 0 ≤ γ_k(P, Q) = −R^L_{F_k}, the maximum risk is zero). In other words, if k is characteristic, then the maximum risk is obtained only when P = Q. This motivates the importance of characteristic kernels in binary classification. In the following, we provide another result which provides a similar motivation for the importance of characteristic kernels in binary classification, wherein we relate γ_k to the margin of a hard-margin SVM.

Theorem 2 (γ_k and hard-margin SVM). Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from µ. Assuming the training sample is separable, let f_svm be the solution to the program inf{‖f‖_H : Y_i f(X_i) ≥ 1, ∀i}, where H is an RKHS with measurable and bounded k. If k is characteristic, then

1/‖f_svm‖_H ≤ γ_k(P_m, Q_n)/2,  (5)

where P_m := (1/m) Σ_{Y_i=1} δ_{X_i}, Q_n := (1/n) Σ_{Y_i=−1} δ_{X_i}, m = |{i : Y_i = 1}| and n = N − m. δ_x represents the Dirac measure at x.
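A minimal sketch of the rule (4), assuming a one-dimensional Gaussian window and synthetic two-class data (both assumptions for illustration):

```python
import numpy as np

def parzen_classify(x, X, Y, sigma=1.0):
    """Rule (4): compare the mean kernel evaluation over each class."""
    k = lambda a: np.exp(-sigma * (x - a) ** 2)
    score_pos = k(X[Y == 1]).mean()    # (1/m)     sum over {i : Y_i = +1} of k(x, X_i)
    score_neg = k(X[Y == -1]).mean()   # (1/(N-m)) sum over {i : Y_i = -1} of k(x, X_i)
    return 1 if score_pos > score_neg else -1

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
Y = np.concatenate([np.ones(200, dtype=int), -np.ones(200, dtype=int)])
print(parzen_classify(-2.0, X, Y), parzen_classify(2.0, X, Y))
```

Points near each class mean are assigned to that class, since the class-conditional kernel density estimate dominates there.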


Theorem 2 provides a bound on the margin of the hard-margin SVM in terms of MMD. (5) shows that a smaller MMD between P_m and Q_n enforces a smaller margin (i.e., a less smooth classifier, f_svm, where smoothness is measured as ‖f_svm‖_H). We can observe that the bound in (5) may be loose if the number of support vectors is small. Suppose k is not characteristic; then γ_k(P_m, Q_n) can be zero for P_m ≠ Q_n and therefore the margin is zero, which means even unlike distributions can become inseparable in this feature representation.

Another justification for using characteristic kernels in kernel-based classification algorithms can be provided by studying the conditions on H under which the Bayes risk is realized for all µ. Steinwart and Christmann [19, Corollary 5.37] showed that under certain conditions on L, the Bayes risk is achieved for all µ if and only if H is dense in L^p(M, η) for all η, where η = εP + (1 − ε)Q. Here, L^p(M, η) represents the Banach space of p-power integrable functions, where p ∈ [1, ∞) depends on the loss function, L. Denseness of H in L^p(M, η) implies H + R is dense in L^p(M, η), which therefore yields that k is characteristic [6, 7]. On the other hand, if constant functions are included in H, then it is easy to show that the characteristic property of k is also sufficient to achieve the Bayes risk. As an example, it can be shown that characteristic kernels are necessary (and sufficient if constant functions are in H) for SVMs to achieve the Bayes risk [19, Example 5.40]. Therefore, the characteristic property of k is fundamental in kernel-based classification algorithms. Having shown how characteristic kernels play a role in kernel-based classification, in the following section we provide a novel characterization for them.

3 Novel Characterization for Characteristic Kernels

A positive definite (pd) kernel, k, is said to be characteristic to P if and only if γ_k(P, Q) = 0 ⇔ P = Q, ∀P, Q ∈ P. The following result provides a novel characterization for characteristic kernels, which shows that strictly pd kernels are characteristic to P. An advantage of this characterization is that it holds for an arbitrary topological space M, unlike the earlier characterizations where a group structure on M is assumed [17, 7]. First, we define strictly pd kernels as follows.

Definition 3 (Strictly positive definite kernels). Let M be a topological space. A measurable and bounded kernel, k, is said to be strictly positive definite if and only if ∫_M ∫_M k(x, y) dµ(x) dµ(y) > 0 for all finite non-zero signed Borel measures, µ, defined on M.

Note that the above definition is not equivalent to the usual definition of strictly pd kernels that involves finite sums [19, Definition 4.15]. The above definition is a generalization of integrally strictly positive definite functions [20, Section 6]: ∫∫ k(x, y) f(x) f(y) dx dy > 0 for all f ∈ L²(R^d), which is the strict positive definiteness of the integral operator given by the kernel. Definition 3 is stronger than the finite sum definition, as [19, Theorem 4.62] shows a kernel that is strictly pd in the finite sum sense but not in the integral sense.

Theorem 4 (Strictly pd kernels are characteristic). If k is strictly positive definite on M, then k is characteristic to P.

The proof idea is to derive necessary and sufficient conditions for a kernel not to be characteristic. We show that choosing k to be strictly pd violates these conditions, and k is therefore characteristic to P. Examples of strictly pd kernels on R^d include exp(−σ‖x−y‖²₂), σ > 0; exp(−σ‖x−y‖₁), σ > 0; (c² + ‖x−y‖²₂)^{−β}, β > 0, c > 0; B_{2l+1}-splines; etc. Note that k̃(x, y) = f(x) k(x, y) f(y) is a strictly pd kernel if k is strictly pd, where f : M → R is a bounded continuous function. Therefore, translation-variant strictly pd kernels can be obtained by choosing k to be a translation invariant strictly pd kernel. A simple example of a translation-variant kernel that is strictly pd on compact sets of R^d is k̃(x, y) = exp(σ xᵀy), σ > 0, where we have chosen f(·) = exp(σ‖·‖²₂/2) and k(x, y) = exp(−σ‖x − y‖²₂/2), σ > 0. Therefore, k̃ is characteristic on compact sets of R^d, which is the same result that follows from the universality of k̃ [18, Section 3, Example 1].

The following result in [13], which is based on the usual definition of strictly pd kernels, can be obtained as a corollary to Theorem 4.

Corollary 5 ([13]). Let X = {x_i}_{i=1}^m ⊂ M, Y = {y_j}_{j=1}^n ⊂ M and assume that x_i ≠ x_j, y_i ≠ y_j, ∀ i ≠ j. Suppose k is strictly positive definite. Then Σ_{i=1}^m α_i k(·, x_i) = Σ_{j=1}^n β_j k(·, y_j) for some α_i, β_j ∈ R\{0} ⇒ X = Y.

Suppose we choose α_i = 1/m, ∀i, and β_j = 1/n, ∀j, in Corollary 5. Then Σ_{i=1}^m α_i k(·, x_i) and Σ_{j=1}^n β_j k(·, y_j) represent the mean functions in H. Note that the Parzen classifier in (4) is a mean classifier (that separates the mean functions) in H, i.e., sign(⟨k(·, x), w⟩_H), where w = (1/m) Σ_{i=1}^m k(·, x_i) − (1/n) Σ_{i=1}^n k(·, y_i). Suppose k is strictly pd (more generally, suppose k is characteristic). Then, by Corollary 5, the normal vector, w, to the hyperplane in H passing through the origin is zero, i.e., the mean functions coincide (and are therefore not classifiable) if and only if X = Y.
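The algebra behind the translation-variant example above, f(x) k(x, y) f(y) = exp(σ xᵀy) when f(x) = exp(σ‖x‖²/2) and k(x, y) = exp(−σ‖x − y‖²/2), follows from ‖x‖² + ‖y‖² − ‖x − y‖² = 2xᵀy and is easy to verify numerically (the points and σ below are arbitrary illustrative choices):

```python
import numpy as np

sigma = 0.7
f = lambda x: np.exp(sigma * np.dot(x, x) / 2.0)              # f(x) = exp(sigma ||x||^2 / 2)
k = lambda x, y: np.exp(-sigma * np.dot(x - y, x - y) / 2.0)  # translation invariant, strictly pd

x = np.array([0.3, -1.2])
y = np.array([1.0, 0.4])
lhs = f(x) * k(x, y) * f(y)           # the translation-variant kernel
rhs = np.exp(sigma * np.dot(x, y))    # exp(sigma x^T y)
print(lhs, rhs)                       # the two values agree
```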

4 Generalizing the MMD for Classes of Characteristic Kernels

The discussion so far has been related to the characteristic property of k that makes γ_k a metric on P. We have seen that this characteristic property is of prime importance both in distribution testing, and to ensure classifiability of dissimilar distributions in the RKHS. We have not yet addressed how to choose among a selection/family of characteristic kernels, given a particular pair of distributions we wish to discriminate between. We introduce one approach to this problem in the present section.

Let M = R^d and k_σ(x, y) = exp(−σ‖x − y‖²₂), σ ∈ R₊, where σ represents the bandwidth parameter. {k_σ : σ ∈ R₊} is the family of Gaussian kernels and {γ_{k_σ} : σ ∈ R₊} is the family of MMDs indexed by the kernel parameter, σ. Note that k_σ is characteristic for any σ ∈ R₊₊ and therefore γ_{k_σ} is a metric on P for any σ ∈ R₊₊. However, in practice, one would prefer a single number that defines the distance between P and Q. The question therefore to be addressed is how to choose an appropriate σ. The choice of σ has important implications for the statistical behavior of γ_{k_σ}. Note that as σ → 0, k_σ → 1 and as σ → ∞, k_σ → 0 a.e., which means γ_{k_σ}(P, Q) → 0 as σ → 0 or σ → ∞ for all P, Q ∈ P (this behavior is also exhibited by k_σ(x, y) = exp(−σ‖x − y‖₁) and k_σ(x, y) = σ²/(σ² + ‖x − y‖²₂), which are also characteristic). This means choosing sufficiently small or sufficiently large σ (depending on P and Q) makes γ_{k_σ}(P, Q) arbitrarily small. Therefore, σ has to be chosen appropriately in applications to effectively distinguish between P and Q. Presently, the applications involving MMD set σ heuristically [8, 9].

To generalize the MMD to families of kernels, we propose the following modification to γ_k, which yields a pseudometric on P,

γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖Pk − Qk‖_H : k ∈ K}.  (6)

Note that γ is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. It is easy to check that if any k ∈ K is characteristic, then γ is a metric on P. Examples for K include: K_g := {e^{−σ‖x−y‖²₂}, x, y ∈ R^d : σ ∈ R₊}; K_l := {e^{−σ‖x−y‖₁}, x, y ∈ R^d : σ ∈ R₊}; K_ψ := {e^{−σψ(x,y)}, x, y ∈ M : σ ∈ R₊}, where ψ : M × M → R is a negative definite kernel; K_rbf := {∫₀^∞ e^{−λ‖x−y‖²₂} dµ_σ(λ), x, y ∈ R^d, µ_σ ∈ M₊ : σ ∈ Σ ⊂ R^d}, where M₊ is the set of all finite nonnegative Borel measures, µ_σ, on R₊ that are not concentrated at zero; etc.

The proposal of γ(P, Q) in (6) can be motivated by the connection that we have established in Section 2 between γ_k and the Parzen window classifier. Since the Parzen window classifier depends on the kernel, k, one can propose to learn the kernel as in support vector machines [10], wherein the kernel is chosen such that R^L_{F_k} in Theorem 1 is minimized over k ∈ K, i.e., inf_{k∈K} R^L_{F_k} = −sup_{k∈K} γ_k(P, Q) = −γ(P, Q). A similar motivation for γ can be provided based on (5), as learning the kernel in a hard-margin SVM by maximizing its margin.

At this point, we briefly discuss the issue of normalized vs. unnormalized kernel families, K, in (6). We say a translation-invariant kernel, k, on R^d is normalized if ∫ ψ(y) dy = c (some positive constant independent of the kernel parameter), where k(x, y) = ψ(x − y). K is a normalized kernel family if every kernel in K is normalized. If K is not normalized, we say it is unnormalized. For example, it is easy to see that K_g and K_l are unnormalized kernel families. Let us consider the normalized Gaussian family, K_g^n = {(σ/π)^{d/2} e^{−σ‖x−y‖²₂}, x, y ∈ R^d : σ ∈ [σ₀, ∞)}. It can be shown that for any k_σ, k_τ ∈ K_g^n, 0 < σ < τ < ∞, we have γ_{k_σ}(P, Q) ≥ γ_{k_τ}(P, Q), which means γ(P, Q) = γ_{k_{σ₀}}(P, Q). Therefore, the generalized MMD reduces to a single kernel MMD. A similar result also holds for the normalized inverse-quadratic kernel family, {√(2σ²/π) (σ² + ‖x − y‖²₂)^{−1}, x, y ∈ R : σ ∈ [σ₀, ∞)}. These examples show that the generalized MMD definition is usually not very useful if K is a normalized kernel family. In addition, σ₀ should be chosen beforehand, which is equivalent to heuristically setting the kernel parameter in γ_k. Note that σ₀ cannot be zero because in the limiting case of σ → 0, the kernels approach a Dirac distribution, which means the limiting kernel is not bounded and therefore the definition of MMD in (1) does not hold. So, in this work, we consider unnormalized kernel families to render the definition of the generalized MMD in (6) useful.


To use γ in statistical applications where P and Q are known only through i.i.d. samples {X_i}_{i=1}^m and {Y_j}_{j=1}^n, respectively, we require its estimator γ(P_m, Q_n) to be consistent, where P_m and Q_n represent the empirical measures based on {X_i}_{i=1}^m and {Y_j}_{j=1}^n. For k measurable and bounded, [8, 15] have shown that γ_k(P_m, Q_n) is a √(mn/(m+n))-consistent estimator of γ_k(P, Q). The statistical consistency of γ(P_m, Q_n) is established in the following theorem, which uses tools from U-process theory [3, Chapters 3, 5]. We begin with the following definition.

Definition 6 (Rademacher chaos). Let G be a class of functions on M × M and {ρ_i}_{i=1}^n be independent Rademacher random variables, i.e., Pr(ρ_i = 1) = Pr(ρ_i = −1) = 1/2. The homogeneous Rademacher chaos process of order two with respect to {ρ_i}_{i=1}^n is defined as {n^{−1} Σ_{i<j} ρ_i ρ_j g(x_i, x_j) : g ∈ G} for some {x_i}_{i=1}^n ⊂ M. The Rademacher chaos complexity over G is defined as

U_n(G; {x_i}_{i=1}^n) := E_ρ sup_{g∈G} | n^{−1} Σ_{i<j} ρ_i ρ_j g(x_i, x_j) |.  (7)

We now provide the main result of the present section.

Theorem 7 (Consistency of γ(P_m, Q_n)). Let every k ∈ K be measurable and bounded with ν := sup_{k∈K, x∈M} k(x, x) < ∞. Then, with probability at least 1 − δ, |γ(P_m, Q_n) − γ(P, Q)| ≤ A, where

A = 16 U_m(K; {X_i})/m + 16 U_n(K; {Y_i})/n + (√(8ν) + √(36ν log(4/δ))) √(m + n)/√(mn).  (8)

From (8), it is clear that if U_m(K; {X_i}) = O_P(1) and U_n(K; {Y_i}) = O_Q(1), then γ(P_m, Q_n) → γ(P, Q) a.s. The following result provides a bound on U_m(K; {X_i}) in terms of the entropy integral.

Lemma 8 (Entropy bound). For any K as in Theorem 7 with 0 ∈ K, there exists a universal constant C such that

U_m(K; {X_i}_{i=1}^m) ≤ C ∫₀^ν log N(K, D, ε) dε,  (9)

where D(k₁, k₂) = (1/m) (Σ_{i<j} (k₁(X_i, X_j) − k₂(X_i, X_j))²)^{1/2} and N(K, D, ε) represents the ε-covering number of K with respect to the metric D.

Assuming K to be a VC-subgraph class, the following result, as a corollary to Lemma 8, provides an estimate of U_m(K; {X_i}_{i=1}^m). Before presenting the result, we first provide the definition of a VC-subgraph class.

Definition 9 (VC-subgraph class). The subgraph of a function g : M → R is the subset of M × R given by {(x, t) : t < g(x)}. A collection G of measurable functions on a sample space is called a VC-subgraph class if the collection of all subgraphs of the functions in G forms a VC-class of sets (in M × R).

The VC-index (also called the VC-dimension) of a VC-subgraph class, G, is the same as the pseudo-dimension of G. See [1, Definition 11.1] for details.

Corollary 10 (U_m(K; {X_i}) for VC-subgraph K). Suppose K is a VC-subgraph class with V(K) being the VC-index. Assume K satisfies the conditions in Theorem 7 and 0 ∈ K. Then

U_m(K; {X_i}) ≤ C ν log(C₁ V(K) (16e⁹)^{V(K)}),  (10)

for some universal constants C and C₁.

Using (10) in (8), we have |γ(P_m, Q_n) − γ(P, Q)| = O_{P,Q}(√((m + n)/mn)) and, by the Borel-Cantelli lemma, |γ(P_m, Q_n) − γ(P, Q)| → 0 a.s. Now, the question reduces to which of the kernel classes, K, have V(K) < ∞. [22, Lemma 12] showed that V(K_g) = 1 (also see [23]) and U_m(K_rbf) ≤ C₂ U_m(K_g), where C₂ < ∞. It can be shown that V(K_ψ) = 1 and V(K_l) = 1. All these classes satisfy the conditions of Theorem 7 and Corollary 10 and therefore provide consistent estimates of γ(P, Q) for any P, Q ∈ P. Examples of kernels on R^d that are covered by these classes include the Gaussian, Laplacian, inverse multiquadratics, Matérn class, etc. Other choices for K that are popular in machine learning are the linear combinations of kernels, K_lin := {k_λ = Σ_{i=1}^l λ_i k_i | k_λ is pd, Σ_{i=1}^l λ_i = 1} and K_con := {k_λ = Σ_{i=1}^l λ_i k_i | λ_i ≥ 0, Σ_{i=1}^l λ_i = 1}. [16, Lemma 7] have shown that V(K_con) ≤ V(K_lin) ≤ l. Therefore, instead of using a class based on a fixed, parameterized kernel, one can also use a finite linear combination of kernels to compute γ.
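The Rademacher chaos complexity in Definition 6 can be approximated numerically: replace the expectation E_ρ by a Monte Carlo average and the supremum over K by a maximum over a finite bandwidth grid standing in for K_g (the grid, sample size, and Monte Carlo budget below are illustrative assumptions):

```python
import numpy as np

def chaos_complexity(x, sigmas, n_mc=200, seed=0):
    """Monte Carlo estimate of U_n(K; {x_i}) in (7) for the Gaussian
    family restricted to a finite grid of bandwidth parameters."""
    rng = np.random.default_rng(seed)
    n = len(x)
    iu = np.triu_indices(n, k=1)  # index pairs with i < j
    gs = [np.exp(-s * (x[:, None] - x[None, :]) ** 2)[iu] for s in sigmas]
    total = 0.0
    for _ in range(n_mc):
        rho = rng.choice([-1.0, 1.0], size=n)         # Rademacher signs
        prods = (rho[:, None] * rho[None, :])[iu]     # rho_i * rho_j for i < j
        total += max(abs(prods @ g) / n for g in gs)  # sup over the grid
    return total / n_mc

x = np.random.default_rng(1).normal(size=50)
u = chaos_complexity(x, sigmas=[0.1, 1.0, 10.0])
print(u)
```

Consistent with the role of U_n in Theorem 7, the estimate stays O(1) as the sample grows, since the n^{−1} factor in (7) balances the roughly n²/2 pair terms.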


So far, we have presented the metric property and statistical consistency (of the empirical estimator) of γ. Now, the question is how to compute γ(P_m, Q_n) in practice. To show this, in the following we present two examples.

Example 11. Suppose K = K_g. Then γ(P_m, Q_n) can be written as

γ²(P_m, Q_n) = sup_{σ∈R₊} [ (1/m²) Σ_{i,j=1}^m e^{−σ‖X_i−X_j‖²} + (1/n²) Σ_{i,j=1}^n e^{−σ‖Y_i−Y_j‖²} − (2/mn) Σ_{i,j=1}^{m,n} e^{−σ‖X_i−Y_j‖²} ].  (11)

The optimum σ* can be obtained by solving (11), and γ(P_m, Q_n) = ‖P_m k_{σ*} − Q_n k_{σ*}‖_{H_{σ*}}.

Example 12. Suppose K = K_con. Then γ(P_m, Q_n) becomes

γ²(P_m, Q_n) = sup_{k∈K_con} ‖P_m k − Q_n k‖²_H = sup_{k∈K_con} ∫∫ k d(P_m − Q_n) ⊗ (P_m − Q_n) = sup{λᵀa : λᵀ1 = 1, λ ⪰ 0},  (12)

where we have replaced k by Σ_{i=1}^l λ_i k_i. Here λ = (λ₁, ..., λ_l) and (a)_i = ‖P_m k_i − Q_n k_i‖²_{H_i} = (1/m²) Σ_{a,b=1}^m k_i(X_a, X_b) + (1/n²) Σ_{a,b=1}^n k_i(Y_a, Y_b) − (2/mn) Σ_{a,b=1}^{m,n} k_i(X_a, Y_b). It is easy to see that γ²(P_m, Q_n) = max_{1≤i≤l} (a)_i.

Similar examples can be provided for other K, where γ(P_m, Q_n) can be computed by solving a semidefinite program (K = K_lin) or by constrained gradient descent (K = K_l, K_rbf).

Finally, while the approach in (6) to generalizing γ_k is our focus in this paper, an alternative Bayesian strategy would be to define a non-negative finite measure λ over K, and to average γ_k over that measure, i.e., β(P, Q) := ∫_K γ_k(P, Q) dλ(k). This also yields a pseudometric on P. That said, β(P, Q) ≤ λ(K) γ(P, Q), ∀P, Q, which means if P and Q can be distinguished by β, they can be distinguished by γ, but not vice versa. In this sense, γ is stronger than β. One further complication with the Bayesian approach is in defining a sensible λ over K. Note that γ_{k₀} (the single kernel MMD based on k₀) can be obtained by defining λ(k) = δ(k − k₀) in β(P, Q).

5 Experiments

In this section, we present a benchmark experiment illustrating that the generalized MMD proposed in Section 4 is preferred above the single kernel MMD where the kernel parameter is set heuristically. The experimental setup is as follows.

Let p = N(0, σ_p²), a normal distribution in R with zero mean and variance σ_p². Let q be the perturbed version of p, given as q(x) = p(x)(1 + sin νx). Here p and q are the densities associated with P and Q respectively. It is easy to see that q differs from p at increasing frequencies with increasing ν. Let k(x, y) = exp(−(x − y)²/σ). Now, the goal is that, given random samples drawn i.i.d. from P and Q (with ν fixed), we would like to test H₀ : P = Q vs. H₁ : P ≠ Q. The idea is that as ν increases, it will be harder to distinguish between P and Q for a fixed sample size. Therefore, using this setup, we can verify whether the adaptive bandwidth selection achieved by γ (as the test statistic) helps to distinguish between P and Q at higher ν compared to γ_k with a heuristic σ. To this end, using γ(P_m, Q_n) and γ_k(P_m, Q_n) (with various σ) as test statistics T_mn, we design a test that returns H₀ if T_mn ≤ c_mn, and H₁ otherwise. The problem therefore reduces to finding c_mn. c_mn is determined as the (1 − α) quantile of the asymptotic distribution of T_mn under H₀, which therefore fixes the type-I error (the probability of rejecting H₀ when it is true) to α. The consistency of this test under γ_k (for any fixed σ) is proved in [8]. A similar result can be shown for γ under some conditions on K. We skip the details here.

In our experiments, we set m = n = 1000, σ_p² = 10, and draw two sets of independent random samples from Q. The distribution of T_mn is estimated by bootstrapping on these samples (250 bootstrap iterations are performed) and the associated 95th quantile (we choose α = 0.05) is computed. Since the performance of the test is judged by its type-II error (the probability of accepting H₀ when H₁ is true), we draw a random sample, one each from P and Q, and test whether P = Q. This process is repeated 300 times, and estimates of type-I and type-II errors are obtained for both γ and γ_k. 14 different values for σ are considered on a logarithmic scale of base 2 with exponents (−3, −2, −1, 0, 1, 3/2, 2, 5/2, 3, 7/2, 4, 5, 6), along with the median distance between samples as one more choice. 5 different choices for ν are considered: (1/2, 3/4, 1, 5/4, 3/2).
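The supremum in Example 11 can be approximated by a grid search over σ. The sketch below also draws from the perturbed density q of this section via a simple rejection sampler; the bandwidth grid, sample sizes, and sampler are illustrative assumptions:

```python
import numpy as np

def mmd2_sigma(x, y, sigma):
    """Single-kernel squared MMD for the Gaussian kernel exp(-sigma (x - y)^2)."""
    k = lambda a, b: np.exp(-sigma * (a[:, None] - b[None, :]) ** 2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def generalized_mmd2(x, y, sigmas):
    """gamma^2(Pm, Qn): maximize the single-kernel squared MMD over a sigma grid, cf. (11)."""
    vals = [mmd2_sigma(x, y, s) for s in sigmas]
    i = int(np.argmax(vals))
    return vals[i], sigmas[i]

rng = np.random.default_rng(2)
nu = 1.0
x = rng.normal(0.0, np.sqrt(10.0), 1000)                       # sample from p = N(0, 10)
cand = rng.normal(0.0, np.sqrt(10.0), 20000)                   # rejection sampler for
keep = rng.uniform(0.0, 2.0, 20000) < 1.0 + np.sin(nu * cand)  # q(t) = p(t)(1 + sin(nu t))
y = cand[keep][:1000]
val, best_sigma = generalized_mmd2(x, y, sigmas=2.0 ** np.arange(-3, 7, dtype=float))
print(val, best_sigma)
```

The maximizing bandwidth plays the role of σ* in Example 11, selected in a distribution-dependent fashion rather than heuristically.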


[Figure 1: five panels of hypothesis-testing results; the plots themselves are not reproduced here.]

Figure 1: (a) Type-I and type-II errors (in %) for γ for varying ν. (b, c) Type-I and type-II errors (in %) for γ_k (with different σ) for varying ν. The dotted line in (c) corresponds to the median heuristic, which shows that its associated type-II error is very large at large ν. (d) Box plot of log σ grouped by ν, where σ is selected by γ. (e) Box plot of the median distance between points (which is also a choice for σ), grouped by ν. Refer to Section 5 for details.

Figure 1(a) shows the estimated type-I and type-II errors using γ as the test statistic for varying ν. Note that the type-I error is close to its design value of 5%, while the type-II error is zero for all ν, which means γ distinguishes between P and Q for all perturbations. Figures 1(b, c) show the estimates of type-I and type-II errors using γ_k as the test statistic for different σ and ν. Figure 1(d) shows the box plot for log σ, grouped by ν, where σ is the bandwidth selected by γ. Figure 1(e) shows the box plot of the median distance between points (which is also a choice for σ), grouped by ν. From Figures 1(c) and (e), it is easy to see that the median heuristic exhibits high type-II error for ν = 3/2, while γ exhibits zero type-II error (from Figure 1(a)). Figure 1(c) also shows that heuristic choices of σ can result in high type-II errors. It is intuitive to note that as ν increases (which means the characteristic function of Q differs from that of P at higher frequencies), a smaller σ is needed to detect these changes. The advantage of using γ is that it selects σ in a distribution-dependent fashion, and its behavior in the box plot shown in Figure 1(d) matches the previously mentioned intuition about the behavior of σ with respect to ν. These results demonstrate the validity of using γ as a distance measure in applications.
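The testing protocol of this section, comparing T_mn to a resampled null quantile, can be sketched in a few lines. The sketch below uses a permutation null rather than the paper's bootstrap, and a fixed bandwidth; both are simplifying assumptions:

```python
import numpy as np

def mmd2(x, y, sigma):
    """Biased squared-MMD statistic with a Gaussian kernel."""
    k = lambda a, b: np.exp(-sigma * (a[:, None] - b[None, :]) ** 2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def two_sample_test(x, y, sigma=1.0, alpha=0.05, n_perm=250, seed=0):
    """Reject H0: P = Q iff the observed statistic exceeds the
    (1 - alpha) quantile of its permutation-null distribution."""
    rng = np.random.default_rng(seed)
    t_obs = mmd2(x, y, sigma)
    z = np.concatenate([x, y])
    null = []
    for _ in range(n_perm):
        rng.shuffle(z)  # relabel the pooled sample under H0
        null.append(mmd2(z[:len(x)], z[len(x):], sigma))
    return bool(t_obs > np.quantile(null, 1.0 - alpha))

rng = np.random.default_rng(3)
same = two_sample_test(rng.normal(size=300), rng.normal(size=300))
diff = two_sample_test(rng.normal(size=300), rng.normal(1.0, 1.0, size=300))
print(same, diff)
```

With a clear mean shift the test rejects; under H₀ it rejects only at roughly the design rate α.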

6 Conclusions

In this work, we have shown how MMD appears in binary classification, and thus that characteristic kernels are important in kernel-based classification algorithms. We have broadened the class of characteristic RKHSs to include those induced by strictly positive definite kernels (with particular application to kernels on non-compact domains, and/or kernels that are not translation invariant). We have further provided a convergent generalization of MMD over families of kernel functions, which becomes necessary even in considering relatively simple families of kernels (such as the Gaussian kernels parameterized by their bandwidth). The usefulness of the generalized MMD is illustrated experimentally with a two-sample testing problem.

Acknowledgments

The authors thank anonymous reviewers for their constructive comments, and especially the reviewer who pointed out the connection between characteristic kernels and the achievability of Bayes risk. B. K. S. was supported by the MPI for Biological Cybernetics, National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. A. G. was supported by grants DARPA IPTO FA8750-09-1-0141, ONR MURI N000140710747, and ARO MURI W911NF0810242.


A Proofs

We provide proofs for the results in Sections 2-4.

A.1 Proof of Theorem 1

To prove Theorem 1, we need the following result from [17].

Theorem 13 ([17]). Let F_k := {f : ‖f‖_H ≤ 1}, where (H, k) is an RKHS defined on a measurable space M with k measurable and bounded. Then,

γ_k(P, Q) = sup_{f ∈ F_k} |Pf − Qf| = ‖Pk − Qk‖_H,   (13)

where ‖·‖_H represents the RKHS norm.
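For empirical measures, the squared norm in (13) expands into Gram-matrix sums via the reproducing property ⟨k(·, x), k(·, y)⟩_H = k(x, y), which is what makes γ_k computable from samples. A worked expansion (our own restatement of this standard identity, not a result stated in the paper):

```latex
\gamma_k^2(P_m, Q_n)
= \left\| \frac{1}{m}\sum_{i=1}^m k(\cdot, X_i) - \frac{1}{n}\sum_{j=1}^n k(\cdot, Y_j) \right\|_{\mathcal{H}}^2
= \frac{1}{m^2}\sum_{i,i'=1}^m k(X_i, X_{i'})
+ \frac{1}{n^2}\sum_{j,j'=1}^n k(Y_j, Y_{j'})
- \frac{2}{mn}\sum_{i=1}^m \sum_{j=1}^n k(X_i, Y_j).
```

Each term is an average over one of the three Gram blocks, so the estimate costs O((m + n)²) kernel evaluations.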

Proof of Theorem 1: From (3), we have

ε ∫_M L_1(f) dP + (1 − ε) ∫_M L_{−1}(f) dQ = ∫_M f dQ − ∫_M f dP = Qf − Pf.

Therefore,

R^L_{F_k} = inf_{f ∈ F_k} (Qf − Pf) = − sup_{f ∈ F_k} (Pf − Qf) = − sup_{f ∈ F_k} |Pf − Qf| = −γ_k(P, Q),

which follows from Theorem 13 (the third equality uses the fact that F_k is symmetric, i.e., f ∈ F_k ⇒ −f ∈ F_k). Given {(X_i, Y_i)}_{i=1}^N drawn i.i.d. from µ, the empirical equivalent of (3) is given by

inf { −(1/m) Σ_{Y_i = 1} f(X_i) + (1/(N − m)) Σ_{Y_i = −1} f(X_i) : f ∈ F_k }.

Solving this for f gives

f = [ (1/m) Σ_{Y_i = 1} k(·, X_i) − (1/(N − m)) Σ_{Y_i = −1} k(·, X_i) ] / ‖ (1/m) Σ_{Y_i = 1} k(·, X_i) − (1/(N − m)) Σ_{Y_i = −1} k(·, X_i) ‖_H,

and the result in (4) follows.
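The minimizer above is, up to RKHS normalization, the difference of the empirical mean embeddings of the two classes, and sign(f) acts as a Parzen-window-type classifier. A minimal sketch evaluating this witness function (NumPy; the Gaussian kernel and sample data are our own illustrative choices, not from the paper):

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def witness(T, Xp, Xq, sigma=1.0):
    # f(t) proportional to (1/m) sum_i k(t, X_i) - (1/n) sum_j k(t, Y_j):
    # the (unnormalized) optimizer of the empirical risk above
    return gaussian_gram(T, Xp, sigma).mean(1) - gaussian_gram(T, Xq, sigma).mean(1)

rng = np.random.default_rng(1)
Xp = rng.normal(-1.0, 1.0, (300, 1))   # class +1 sample (from P)
Xq = rng.normal(+1.0, 1.0, (300, 1))   # class -1 sample (from Q)

T = np.array([[-2.0], [0.0], [2.0]])
print(np.sign(witness(T, Xp, Xq)))     # sign(f) labels the test points
```

Points closer to the bulk of P receive positive witness values and points closer to Q receive negative ones, so the classifier's margin-like quantity is exactly the MMD between the class-conditional samples.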

A.2 Proof of Theorem 2

Before proving Theorem 2, we present a lemma that will be used in its proof.

Lemma 14. Let θ : V → R and ψ : V → R be convex functions on a real vector space V. Suppose

a = sup{θ(x) : ψ(x) ≤ b},   (14)

where θ is not constant on {x : ψ(x) ≤ b} and a < ∞. Then

b = inf{ψ(x) : θ(x) ≥ a}.   (15)

We need the following result from [11, Theorem 32.1] to prove Lemma 14.

Theorem 15 ([11]). Let f be a convex function, and let C be a convex set contained in the domain of f. If f attains its supremum relative to C at some point of the relative interior of C, then f is actually constant throughout C.

Proof of Lemma 14: Note that A := {x : ψ(x) ≤ b} is a convex subset of V. Since θ is not constant on A, by Theorem 15, θ attains its supremum over A only on the boundary of A. Therefore, any solution x* to (14) satisfies θ(x*) = a and ψ(x*) = b. Let G := {x : θ(x) > a}. For any x ∈ G, ψ(x) > b; if this were not the case, then x* would not be a solution to (14). Let H := {x : θ(x) = a}. Clearly, x* ∈ H, and so there exists an x ∈ H for which ψ(x) = b. Suppose inf{ψ(x) : x ∈ H} = c < b, i.e., there exists some x̃ ∈ H with ψ(x̃) < b, which means x̃ lies in the relative interior of A. From (14), this implies θ attains its supremum relative to A at a point of the relative interior of A. By Theorem 15, θ is then constant on A, leading to a contradiction. Therefore, inf{ψ(x) : x ∈ H} = b, and the result in (15) follows.
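A concrete instance of Lemma 14 may help fix ideas (our own example, not from the paper): take V = R^d, θ linear and ψ the Euclidean norm.

```latex
% \theta(x) = \langle w, x \rangle \text{ with } w \neq 0, \qquad \psi(x) = \|x\|_2.
a \;=\; \sup\{\langle w, x\rangle : \|x\|_2 \le b\} \;=\; b\,\|w\|_2
\quad\Longrightarrow\quad
b \;=\; \inf\{\|x\|_2 : \langle w, x\rangle \ge b\,\|w\|_2\}.
```

The supremum is attained only on the boundary, at x* = b w/‖w‖₂ (θ is non-constant on the ball, as the lemma requires), and Cauchy-Schwarz gives ‖x‖₂ ≥ b whenever ⟨w, x⟩ ≥ b‖w‖₂, so the infimum is exactly b, as (15) asserts.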


Proof of Theorem 2: By Theorem 13, γ_k(P, Q) = sup{Pf − Qf : ‖f‖_H ≤ 1}. Note that Pf − Qf and ‖f‖_H are convex functionals on H. For P ≠ Q, Pf − Qf is not constant on F_k, since k is characteristic. Therefore, by Lemma 14, we have

1 = inf{‖f‖_H : Pf − Qf ≥ γ_k(P, Q), f ∈ H}.

Since this holds for all P ≠ Q, it holds for P_m and Q_n. Therefore, we have

2/γ_k(P_m, Q_n) = inf{‖f‖_H : P_m f − Q_n f ≥ 2, f ∈ H}.

Consider

{f ∈ H : P_m f − Q_n f ≥ 2} = { f ∈ H : (1/m) Σ_{Y_i = 1} f(X_i) − (1/n) Σ_{Y_i = −1} f(X_i) ≥ 2 } ⊃ {f ∈ H : Y_i f(X_i) ≥ 1, ∀i},

which means

2/γ_k(P_m, Q_n) ≤ ‖f_svm‖_H.
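The identity 2/γ_k(P_m, Q_n) = inf{‖f‖_H : P_m f − Q_n f ≥ 2} can be checked numerically: by homogeneity, the infimum is attained at f* = (2/γ_k²)(P_m k − Q_n k), which meets the constraint with equality and has norm exactly 2/γ_k. A sketch (NumPy; the Gaussian kernel and the data are our own illustrative choices):

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.normal(-1.0, 1.0, (100, 1))    # sample defining P_m
Y = rng.normal(+1.0, 1.0, (120, 1))    # sample defining Q_n
m, n = len(X), len(Y)

Kxx, Kyy, Kxy = gaussian_gram(X, X), gaussian_gram(Y, Y), gaussian_gram(X, Y)
gamma2 = Kxx.sum()/m**2 + Kyy.sum()/n**2 - 2*Kxy.sum()/(m*n)   # gamma_k(P_m, Q_n)^2

# f* = (2/gamma^2)(P_m k - Q_n k), expanded in kernel sections at the sample points
cx = np.full(m, 2.0 / (gamma2 * m))    # coefficients on k(., X_i)
cy = np.full(n, -2.0 / (gamma2 * n))   # coefficients on k(., Y_j)

fX = Kxx @ cx + Kxy @ cy               # f* evaluated at the X sample
fY = Kxy.T @ cx + Kyy @ cy             # f* evaluated at the Y sample
norm2 = cx @ Kxx @ cx + 2 * cx @ (Kxy @ cy) + cy @ Kyy @ cy    # ||f*||_H^2

print(fX.mean() - fY.mean())             # P_m f* - Q_n f* = 2
print(np.sqrt(norm2) * np.sqrt(gamma2))  # ||f*||_H * gamma_k = 2
```

Since f_svm satisfies the stronger pointwise constraints Y_i f(X_i) ≥ 1, its norm can only be larger, which is the inequality above.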

A.3 Proof of Theorem 4

To prove Theorem 4, we need the following lemma that provides necessary and sufficient conditions

for a kernel not to be characteristic.

Lemma 16. Let k be measurable and bounded on M. Then ∃ P ≠ Q, P, Q ∈ P such that γ_k(P, Q) = 0 if and only if there exists a finite non-zero signed Borel measure µ that satisfies:

(i) ∫_M ∫_M k(x, y) dµ(x) dµ(y) = 0,

(ii) µ(M) = 0.

Proof. (⇐) Suppose there exists a finite non-zero signed Borel measure µ that satisfies (i) and (ii) in Lemma 16. By the Jordan decomposition theorem [5, Theorem 5.6.1], there exist unique positive measures µ⁺ and µ⁻ such that µ = µ⁺ − µ⁻ and µ⁺ ⊥ µ⁻ (µ⁺ and µ⁻ are singular). By (ii), we have µ⁺(M) = µ⁻(M) =: α. Define P = α⁻¹µ⁺ and Q = α⁻¹µ⁻. Clearly, P ≠ Q and P, Q ∈ P. Then,

γ_k²(P, Q) = ‖Pk − Qk‖²_H = α⁻²‖µk‖²_H = α⁻²⟨µk, µk⟩_H.   (16)

From the proof of Theorem 13 (see Theorem 3 in [17]), we have ⟨µk, µk⟩_H = ∫∫ k(x, y) dµ(x) dµ(y), and therefore, by (i), γ_k(P, Q) = 0. So, we have constructed P ≠ Q such that γ_k(P, Q) = 0.

(⇒) Suppose ∃ P ≠ Q, P, Q ∈ P such that γ_k(P, Q) = 0. Let µ = P − Q. Clearly, µ is a finite non-zero signed Borel measure that satisfies µ(M) = 0, which is (ii). Note that γ_k²(P, Q) = ‖Pk − Qk‖²_H = ‖µk‖²_H = ∫∫ k(x, y) dµ(x) dµ(y), and therefore (i) follows.

Proof of Theorem 4: Since k is strictly pd on M, we have ∫∫ k(x, y) dη(x) dη(y) > 0 for any finite non-zero signed Borel measure η. This means there does not exist a finite non-zero signed Borel measure that satisfies (i) in Lemma 16. Therefore, by Lemma 16, there does not exist P ≠ Q, P, Q ∈ P such that γ_k(P, Q) = 0, which implies k is characteristic to P.
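The strict positive definiteness assumption in Theorem 4 is essential. For a kernel that is not strictly pd, such as the quadratic kernel k(x, y) = (1 + xy)², any signed measure µ = P − Q whose total mass, mean and second moment all vanish satisfies condition (i) of Lemma 16, so γ_k(P, Q) = 0 with P ≠ Q. A numeric sketch of such a pair (our own example; the Gaussian kernel is included for contrast):

```python
import numpy as np

def energy(points, weights, kernel):
    # double integral of k d(mu) d(mu) for the discrete signed measure
    # mu = sum_i weights[i] * delta_{points[i]}
    K = kernel(points[:, None], points[None, :])
    return weights @ K @ weights

quad = lambda x, y: (1.0 + x * y) ** 2             # polynomial kernel: not strictly pd
gauss = lambda x, y: np.exp(-0.5 * (x - y) ** 2)   # Gaussian kernel: characteristic

# P uniform on {-1, 0, 1}, Q uniform on {-a, a} with a = sqrt(2/3):
# equal mass, mean and second moment, yet P != Q.
a = np.sqrt(2.0 / 3.0)
pts = np.array([-1.0, 0.0, 1.0, -a, a])
w = np.array([1/3, 1/3, 1/3, -1/2, -1/2])          # mu = P - Q

print(energy(pts, w, quad))    # essentially 0: gamma = 0 although P != Q
print(energy(pts, w, gauss))   # strictly positive: the Gaussian kernel separates P and Q
```

Expanding (1 + xy)² = 1 + 2xy + x²y² shows the quadratic-kernel energy is (µ(M))² + 2(∫x dµ)² + (∫x² dµ)², which vanishes whenever the first three moments of P and Q match.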

A.4 Proof of Corollary 5

Consider Σ_{i=1}^m α_i k(·, x_i) = ∫ k(·, x) dµ_X(x) = µ_X k, where µ_X = Σ_{i=1}^m α_i δ_{x_i}. Here δ_{x_i} represents the Dirac measure at x_i ∈ M. So, we have µ_X k = µ_Y k, which is equivalent to

∫_M ∫_M k(x, y) d(µ_X − µ_Y)(x) d(µ_X − µ_Y)(y) = 0.   (17)

Since k is strictly pd, by Lemma 16, we have µ_X k = µ_Y k ⇒ µ_X = µ_Y ⇒ X = Y.


A.5 Proof of Theorem 7

Consider

|γ(P_m, Q_n) − γ(P, Q)| = | sup_{k ∈ K} ‖P_m k − Q_n k‖_H − sup_{k ∈ K} ‖Pk − Qk‖_H |
  ≤ sup_{k ∈ K} | ‖P_m k − Q_n k‖_H − ‖Pk − Qk‖_H |
  ≤ sup_{k ∈ K} ‖P_m k − Q_n k − Pk + Qk‖_H
  ≤ sup_{k ∈ K} [ ‖P_m k − Pk‖_H + ‖Q_n k − Qk‖_H ]
  ≤ sup_{k ∈ K} ‖P_m k − Pk‖_H + sup_{k ∈ K} ‖Q_n k − Qk‖_H.   (18)

We now bound the terms sup_{k ∈ K} ‖P_m k − Pk‖_H and sup_{k ∈ K} ‖Q_n k − Qk‖_H. Since sup_{k ∈ K} ‖P_m k − Pk‖_H satisfies the bounded difference property, using McDiarmid's inequality gives that with probability at least 1 − δ/4 over the choice of {X_i}, we have

sup_{k ∈ K} ‖P_m k − Pk‖_H ≤ E sup_{k ∈ K} ‖P_m k − Pk‖_H + √((2ν/m) log(4/δ)).   (19)

By invoking symmetrization for E sup_{k ∈ K} ‖P_m k − Pk‖_H, we have

E sup_{k ∈ K} ‖P_m k − Pk‖_H ≤ 2 E E_ρ sup_{k ∈ K} ‖ (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ‖_H,   (20)

where {ρ_i}_{i=1}^m represent i.i.d. Rademacher random variables and E_ρ represents the expectation w.r.t. {ρ_i} conditioned on {X_i}. Since E_ρ sup_{k ∈ K} ‖ (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ‖_H satisfies the bounded difference property, by McDiarmid's inequality, with probability at least 1 − δ/4 over the choice of the random samples of size m, we have

E E_ρ sup_{k ∈ K} ‖ (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ‖_H ≤ E_ρ sup_{k ∈ K} ‖ (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ‖_H + √((2ν/m) log(4/δ)).   (21)

By writing

‖ (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ‖_H = (1/m) √( Σ_{i,j=1}^m ρ_i ρ_j k(X_i, X_j) ) ≤ (√2/m) √( | Σ_{i<j} ρ_i ρ_j k(X_i, X_j) | ) + √ν/√m,   (22)

(using Σ_{i=1}^m k(X_i, X_i) ≤ mν and √(a + b) ≤ √a + √b),

we have with probability at least 1 − δ/4 that the following holds:

E E_ρ sup_{k ∈ K} ‖ (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ‖_H ≤ √( 2 U_m(K; {X_i}) / m ) + √ν/√m + √((2ν/m) log(4/δ)).   (23)

Combining (19)-(23), we have that with probability at least 1 − δ/2 over the choice of {X_i}, the following holds:

sup_{k ∈ K} ‖P_m k − Pk‖_H ≤ √( 8 U_m(K; {X_i}) / m ) + 2√ν/√m + √((18ν/m) log(4/δ)).   (24)

Performing a similar analysis for sup_{k ∈ K} ‖Q_n k − Qk‖_H, we have that with probability at least 1 − δ/2 over the choice of {Y_i},

sup_{k ∈ K} ‖Q_n k − Qk‖_H ≤ √( 8 U_n(K; {Y_i}) / n ) + 2√ν/√n + √((18ν/n) log(4/δ)).   (25)

Using (24) and (25) with √a + √b ≤ √(2(a + b)) provides the result.


A.6 Proof of Lemma 8

From [2, Proposition 2.2, Proposition 2.6] (also see [3, Corollary 5.1.8]), we have that there exists a universal constant C < ∞ such that U_m(K; {X_i}) ≤ C ∫_0^ν log N(K, D, ε) dε, where

D²(k₁, k₂) = E_ρ ( (1/m) Σ_{i<j} ρ_i ρ_j h(X_i, X_j) )² = (1/m²) E_ρ Σ_{i<j, r<s} ρ_i ρ_j ρ_r ρ_s h(X_i, X_j) h(X_r, X_s) = (1/m²) Σ_{i<j} h²(X_i, X_j),

where h(X_i, X_j) = k₁(X_i, X_j) − k₂(X_i, X_j).
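The core identity behind the last equality above (stripped of the 1/m² factor), E_ρ(Σ_{i<j} ρ_i ρ_j h(X_i, X_j))² = Σ_{i<j} h²(X_i, X_j), holds because E[ρ_i ρ_j ρ_r ρ_s] vanishes unless (i, j) = (r, s). It can be verified exactly by enumerating all sign patterns (our own sketch, with an arbitrary symmetric h on a small sample):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
m = 5
h = rng.normal(size=(m, m))
h = (h + h.T) / 2                      # symmetric, like k1 - k2 on sample points

pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
# exact expectation over all 2^m Rademacher sign vectors
second_moment = np.mean([
    sum(r[i] * r[j] * h[i, j] for i, j in pairs) ** 2
    for r in itertools.product([-1, 1], repeat=m)
])
target = sum(h[i, j] ** 2 for i, j in pairs)
print(second_moment, target)           # the two agree up to rounding
```

The cross terms with (i, j) ≠ (r, s) always leave at least one unpaired ρ index, whose expectation is zero, so only the diagonal terms h² survive.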

A.7 Proof of Corollary 10

The result follows by bounding the uniform covering number of the VC-subgraph class K. By [21, Theorem 2.6], we have

N(K, D, ε) ≤ C₁ V(K) (16ν²ε⁻²e)^{V(K)},   (26)

where V(K) is the VC-index of K. Therefore,

U_m(K; {X_i}) ≤ C ∫_0^ν log N(K, D, ε) dε ≤ 4V(K)C ∫_0^ν log(√ν/√ε) dε + νV(K)C log(16e) + Cν log(C₁V(K)).   (27)

Note that ∫_0^ν log(√ν/√ε) dε ≤ 2ν. Using this in (27) and rearranging the terms provides the result.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, UK, 1999.

[2] M. A. Arcones and E. Giné. Limit theorems for U-processes. Annals of Probability, 21:1494–1542, 1993.

[3] V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer-Verlag, NY, 1999.

[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.

[5] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.

[6] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.

[7] K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473–480, 2009.

[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.

[9] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press, 2008.

[10] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

[11] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

[12] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[13] B. Schölkopf, B. K. Sriperumbudur, A. Gretton, and K. Fukumizu. RKHS representation of measures. In Learning Theory and Approximation Workshop, Oberwolfach, Germany, 2008.

[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, UK, 2004.

[15] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany, 2007.

[16] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In G. Lugosi and H. U. Simon, editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169–183, 2006.

[17] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122, 2008.

[18] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2002.

[19] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

[20] J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain Journal of Mathematics, 6(3):409–433, 1976.

[21] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.

[22] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In Proc. of the 22nd Annual Conference on Learning Theory, 2009.

[23] Y. Ying and D. X. Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8:249–276, 2007.