Kernel Choice and Classifiability for RKHS
Embeddings of Probability Distributions
Bharath K. Sriperumbudur
Department of ECE
UC San Diego, La Jolla, USA
bharathsv@ucsd.edu
Kenji Fukumizu
The Institute of Statistical Mathematics
Tokyo, Japan
fukumizu@ism.ac.jp
Arthur Gretton
Carnegie Mellon University
MPI for Biological Cybernetics
arthur.gretton@gmail.com
Gert R. G. Lanckriet
Department of ECE
UC San Diego, La Jolla, USA
gert@ece.ucsd.edu
Bernhard Schölkopf
MPI for Biological Cybernetics
Tübingen, Germany
bs@tuebingen.mpg.de
Abstract
Embeddings of probability measures into reproducing kernel Hilbert spaces have
been proposed as a straightforward and practical means of representing and com-
paring probabilities. In particular, the distance between embeddings (the maxi-
mum mean discrepancy, or MMD) has several key advantages over many classical
metrics on distributions, namely easy computability, fast convergence and low bias
of finite sample estimates. An important requirement of the embedding RKHS is
that it be characteristic: in this case, the MMD between two distributions is zero
if and only if the distributions coincide. Three new results on the MMD are intro-
duced in the present study. First, it is established that MMD corresponds to the
optimal risk of a kernel classifier, thus forming a natural link between the distance
between distributions and their ease of classification. An important consequence
is that a kernel must be characteristic to guarantee classifiability between distri-
butions in the RKHS. Second, the class of characteristic kernels is broadened to
incorporate all strictly positive definite kernels: these include non-translation in-
variant kernels and kernels on non-compact domains. Third, a generalization of
the MMD is proposed for families of kernels, as the supremum over MMDs on
a class of kernels (for instance the Gaussian kernels with different bandwidths).
This extension is necessary to obtain a single distance measure if a large selection
or class of characteristic kernels is potentially appropriate. This generalization is
reasonable, given that it corresponds to the problem of learning the kernel by min-
imizing the risk of the corresponding kernel classifier. The generalized MMD is
shown to have consistent finite sample estimates, and its performance is demon-
strated on a homogeneity testing example.
1 Introduction
Kernel methods are broadly established as a useful way of constructing nonlinear algorithms
from linear ones, by embedding points into higher dimensional reproducing kernel Hilbert spaces
(RKHSs) [12]. A generalization of this idea is to embed probability distributions into RKHSs,
giving us a linear method for dealing with higher order statistics [8, 15, 17]. More specifically, suppose
we are given the set $\mathcal{P}$ of all Borel probability measures defined on the topological space $M$, and
the RKHS $(H,k)$ of functions on $M$ with $k$ as its reproducing kernel (r.k.). For $P \in \mathcal{P}$, denote by
$Pk := \int_M k(\cdot,x)\,dP(x)$. If $k$ is measurable and bounded, then we may define the embedding of $P$
in $H$ as $Pk \in H$. The RKHS distance between two such mappings associated with $P,Q \in \mathcal{P}$ is
called the maximum mean discrepancy (MMD) [8, 17], and is written
$$\gamma_k(P,Q) = \|Pk - Qk\|_H. \tag{1}$$
We say that $k$ is characteristic [6, 17] if the mapping $P \mapsto Pk$ is injective, in which case (1) is
zero if and only if $P = Q$, i.e., $\gamma_k$ is a metric on $\mathcal{P}$. An immediate application of the MMD is to
problems of comparing distributions based on finite samples: examples include tests of homogeneity
[8], independence [9], and conditional independence [6]. In this application domain, the question of
whether k is characteristic is key: without this property, the algorithms can fail through inability to
distinguish between particular distributions.
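The squared distance in (1) can be written out using kernel evaluations only. The following is a short derivation sketch, assuming $k$ is measurable and bounded (so that the embeddings exist) and using the reproducing property $\langle k(\cdot,x), k(\cdot,y)\rangle_H = k(x,y)$; the primed variables are notation introduced here for independent copies.

```latex
\begin{align*}
\gamma_k^2(P,Q)
  &= \langle Pk - Qk,\; Pk - Qk \rangle_H \\
  &= \langle Pk, Pk \rangle_H - 2\,\langle Pk, Qk \rangle_H + \langle Qk, Qk \rangle_H \\
  &= \iint_M k(x,x')\,dP(x)\,dP(x')
     - 2 \iint_M k(x,y)\,dP(x)\,dQ(y)
     + \iint_M k(y,y')\,dQ(y)\,dQ(y').
\end{align*}
```

Replacing $P$ and $Q$ by empirical measures turns each double integral into a double sum over sample points, which is why the MMD is easy to compute from finite samples.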
Characteristic kernels are important in binary classification: The problem of distinguishing distributions is strongly related to binary classification: indeed, one would expect easily distinguishable distributions to be easily classifiable.¹ The link between these two problems is especially direct in the case of the MMD: in Section 2, we show that $\gamma_k$ is the negative of the optimal risk (corresponding to a linear loss function) associated with the Parzen window classifier [12, 14] (also called kernel classification rule [4, Chapter 10]), where the Parzen window turns out to be $k$. We also show that $\gamma_k$ is an upper bound on the margin of a hard-margin support vector machine (SVM). The importance of using characteristic RKHSs is further underlined by this link: if the property does not hold,
then there exist distributions that are unclassifiable in the RKHS H. We further strengthen this by
showing that characteristic kernels are necessary (and sufficient under certain conditions) to achieve
Bayes risk in the kernel-based classification algorithms.
Characterization of characteristic kernels: Given the centrality of the characteristic property to
both RKHS classification and RKHS distribution testing, we should take particular care in estab-
lishing which kernels satisfy this requirement. Early results in this direction include [8], where k is
shown to be characteristic on compact M if it is universal in the sense of Steinwart [18, Definition
4]; and [6, 7], which address the case of non-compact M, and show that k is characteristic if and
only if H+R is dense in the Banach space of p-power (p ≥ 1) integrable functions. The conditions
in both these studies can be difficult to check and interpret, however, and the restriction of the first
to compact M is limiting. In the case of translation invariant kernels, [17] proved the kernel to
be characteristic if and only if the support of the Fourier transform of $k$ is the entire $\mathbb{R}^d$, which is
a much easier condition to verify. Similar sufficient conditions are obtained by [7] for translation
invariant kernels on groups and semi-groups. In Section 3, we expand the class of characteristic
kernels to include kernels that may or may not be translation invariant, with the introduction of a
novel criterion: strictly positive definite kernels (see Definition 3) on M are characteristic.
Choice of characteristic kernels: In expanding the families of allowable characteristic kernels, we
have so far neglected the question of which characteristic kernel to choose. A practitioner asking by
how much two samples differ does not want to receive a blizzard of answers for every conceivable
kernel and bandwidth setting, but a single measure that satisfies some “reasonable” notion of dis-
tance across the family of kernels considered. Thus, in Section 4, we propose a generalization of the
MMD, yielding a new distance measure between P and Q defined as
$$\gamma(P,Q) = \sup\{\gamma_k(P,Q) : k \in \mathcal{K}\} = \sup\{\|Pk - Qk\|_H : k \in \mathcal{K}\}, \tag{2}$$
which is the maximal RKHS distance between $P$ and $Q$ over a family $\mathcal{K}$ of positive definite kernels. For example, $\mathcal{K}$ can be the family of Gaussian kernels on $\mathbb{R}^d$ indexed by the bandwidth parameter. This distance measure is very natural in the light of our results on binary classification (in Section 2): most directly, this corresponds to the problem of learning the kernel by minimizing the risk of the associated Parzen-based classifier. As a less direct justification, we also increase the upper bound on the margin allowed for a hard-margin SVM between the samples. To apply the generalized MMD in practice, we must ensure its empirical estimator is consistent. In our main result of Section 4, we provide an empirical estimate of $\gamma(P,Q)$ based on finite samples, and show that many popular kernels like the Gaussian, Laplacian, and the entire Matérn class on $\mathbb{R}^d$ yield consistent estimates of $\gamma(P,Q)$. The proof is based on bounding the Rademacher chaos complexity of $\mathcal{K}$, which can be understood as the U-process equivalent of Rademacher complexity [3].
¹ There is a subtlety here, since unlike the problem of testing for differences in distributions, classification suffers from slow learning rates. See [4, Chapter 7] for details.
Finally, in Section 5, we provide a simple experimental demonstration that the generalized MMD
can be applied in practice to the problem of homogeneity testing. Specifically, we show that when
two distributions differ on particular length scales, the kernel selected by the generalized MMD
is appropriate to this difference, and the resulting hypothesis test outperforms the heuristic kernel
choice employed in earlier studies [8]. The proofs of the results in Sections 2-4 are provided in the
appendix.
2 Characteristic Kernels and Binary Classification
One of the most important applications of the maximum mean discrepancy is in nonparametric hy-
pothesis testing [8, 9, 6], where the characteristic property of k is required to distinguish between
probability measures. In the following, we show how MMD naturally appears in binary classifica-
tion, with reference to the Parzen window classifier and hard-margin SVM. This motivates the need
for characteristic k to guarantee that classes arising from different distributions can be classified by
kernel-based algorithms.
To this end, let us consider the binary classification problem with $X$ being an $M$-valued random variable, $Y$ being a $\{-1,+1\}$-valued random variable and the product space, $M \times \{-1,+1\}$, being endowed with an induced Borel probability measure $\mu$. A discriminant function, $f$, is a real valued measurable function on $M$, whose sign is used to make a classification decision. Given a loss function $L : \{-1,+1\} \times \mathbb{R} \to \mathbb{R}$, the goal is to choose an $f$ that minimizes the risk associated with $L$, with the optimal $L$-risk being defined as
$$R^L_{\mathcal{F}_\star} = \inf_{f \in \mathcal{F}_\star} \int_M L(y, f(x))\,d\mu(x,y) = \inf_{f \in \mathcal{F}_\star} \left\{ \varepsilon \int_M L_1(f)\,dP + (1-\varepsilon) \int_M L_{-1}(f)\,dQ \right\}, \tag{3}$$
where $\mathcal{F}_\star$ is the set of all measurable functions on $M$, $L_1(\alpha) := L(1,\alpha)$, $L_{-1}(\alpha) := L(-1,\alpha)$, $P(X) := \mu(X \mid Y = +1)$, $Q(X) := \mu(X \mid Y = -1)$, $\varepsilon := \mu(M, Y = +1)$. Here, $P$ and $Q$ represent the class-conditional distributions and $\varepsilon$ is the prior distribution of class $+1$. Now, we present the result that relates $\gamma_k$ to the optimal risk associated with the Parzen window classifier.
Theorem 1 ($\gamma_k$ and Parzen classification). Let $L_1(\alpha) = -\frac{\alpha}{\varepsilon}$ and $L_{-1}(\alpha) = \frac{\alpha}{1-\varepsilon}$. Then, $\gamma_k(P,Q) = -R^L_{\mathcal{F}_k}$, where $\mathcal{F}_k = \{f : \|f\|_H \le 1\}$ and $H$ is an RKHS with a measurable and bounded $k$. Suppose $\{(X_i,Y_i)\}_{i=1}^N$, $X_i \in M$, $Y_i \in \{-1,+1\}$, $\forall i$, is a training sample drawn i.i.d. from $\mu$ and $m = |\{i : Y_i = 1\}|$. If $\hat{f} \in \mathcal{F}_k$ is an empirical minimizer of (3) (where $\mathcal{F}_\star$ is replaced by $\mathcal{F}_k$ in (3)), then
$$\operatorname{sign}(\hat{f}(x)) = \begin{cases} 1, & \frac{1}{m}\sum_{Y_i=1} k(x,X_i) > \frac{1}{N-m}\sum_{Y_i=-1} k(x,X_i) \\ -1, & \frac{1}{m}\sum_{Y_i=1} k(x,X_i) \le \frac{1}{N-m}\sum_{Y_i=-1} k(x,X_i) \end{cases}, \tag{4}$$
which is the Parzen window classifier.
Theorem 1 shows that $\gamma_k$ is the negative of the optimal $L$-risk (where $L$ is the linear loss as defined in Theorem 1) associated with the Parzen window classifier. Therefore, if $k$ is not characteristic, which means $\gamma_k(P,Q) = 0$ for some $P \neq Q$, then $R^L_{\mathcal{F}_k} = 0$, i.e., the risk is maximum (note that since $0 \le \gamma_k(P,Q) = -R^L_{\mathcal{F}_k}$, the maximum risk is zero). In other words, if $k$ is characteristic, then the maximum risk is obtained only when $P = Q$. This motivates the importance of characteristic kernels in binary classification. In the following, we provide another result which provides a similar motivation for the importance of characteristic kernels in binary classification, wherein we relate $\gamma_k$ to the margin of a hard-margin SVM.
Theorem 2 ($\gamma_k$ and hard-margin SVM). Suppose $\{(X_i,Y_i)\}_{i=1}^N$, $X_i \in M$, $Y_i \in \{-1,+1\}$, $\forall i$, is a training sample drawn i.i.d. from $\mu$. Assuming the training sample is separable, let $f_{svm}$ be the solution to the program, $\inf\{\|f\|_H : Y_i f(X_i) \ge 1, \forall i\}$, where $H$ is an RKHS with measurable and bounded $k$. If $k$ is characteristic, then
$$\frac{1}{\|f_{svm}\|_H} \le \frac{\gamma_k(P_m,Q_n)}{2}, \tag{5}$$
where $P_m := \frac{1}{m}\sum_{Y_i=1} \delta_{X_i}$, $Q_n := \frac{1}{n}\sum_{Y_i=-1} \delta_{X_i}$, $m = |\{i : Y_i = 1\}|$ and $n = N - m$. $\delta_x$ represents the Dirac measure at $x$.
Theorem 2 provides a bound on the margin of hard-margin SVM in terms of MMD. (5) shows that a smaller MMD between $P_m$ and $Q_n$ enforces a smaller margin (i.e., a less smooth classifier, $f_{svm}$, where smoothness is measured as $\|f_{svm}\|_H$). We can observe that the bound in (5) may be loose if the number of support vectors is small. Suppose $k$ is not characteristic; then $\gamma_k(P_m,Q_n)$ can be zero for $P_m \neq Q_n$ and therefore the margin is zero, which means even unlike distributions can become inseparable in this feature representation.
Another justification of using characteristic kernels in kernel-based classification algorithms can be provided by studying the conditions on $H$ for which the Bayes risk is realized for all $\mu$. Steinwart and Christmann [19, Corollary 5.37] have shown that under certain conditions on $L$, the Bayes risk is achieved for all $\mu$ if and only if $H$ is dense in $L^p(M,\eta)$ for all $\eta$, where $\eta = \varepsilon P + (1-\varepsilon)Q$. Here, $L^p(M,\eta)$ represents the Banach space of $p$-power integrable functions, where $p \in [1,\infty)$ is dependent on the loss function, $L$. Denseness of $H$ in $L^p(M,\eta)$ implies $H + \mathbb{R}$ is dense in $L^p(M,\eta)$, which therefore yields that $k$ is characteristic [6, 7]. On the other hand, if constant functions are included in $H$, then it is easy to show that the characteristic property of $k$ is also sufficient to achieve the Bayes risk. As an example, it can be shown that characteristic kernels are necessary (and sufficient if constant functions are in $H$) for SVMs to achieve the Bayes risk [19, Example 5.40]. Therefore, the characteristic property of $k$ is fundamental in kernel-based classification algorithms. Having shown how characteristic kernels play a role in kernel-based classification, in the following section, we provide a novel characterization for them.
3 Novel Characterization for Characteristic Kernels
A positive definite (pd) kernel, $k$, is said to be characteristic to $\mathcal{P}$ if and only if $\gamma_k(P,Q) = 0 \Leftrightarrow P = Q$, $\forall P,Q \in \mathcal{P}$. The following result provides a novel characterization for characteristic kernels, which shows that strictly pd kernels are characteristic to $\mathcal{P}$. An advantage with this characterization is that it holds for any arbitrary topological space $M$, unlike the earlier characterizations where a group structure on $M$ is assumed [17, 7]. First, we define strictly pd kernels as follows.
Definition 3 (Strictly positive definite kernels). Let $M$ be a topological space. A measurable and bounded kernel, $k$, is said to be strictly positive definite if and only if
$$\int_M \int_M k(x,y)\,d\mu(x)\,d\mu(y) > 0$$
for all finite non-zero signed Borel measures, $\mu$, defined on $M$.
Note that the above definition is not equivalent to the usual definition of strictly pd kernels that involves finite sums [19, Definition 4.15]. The above definition is a generalization of integrally strictly positive definite functions [20, Section 6]:
$$\iint k(x,y) f(x) f(y)\,dx\,dy > 0 \quad \text{for all } f \in L^2(\mathbb{R}^d),$$
which is the strictly positive definiteness of the integral operator given by the kernel. Definition 3 is stronger than the finite sum definition as [19, Theorem 4.62] shows a kernel that is strictly pd in the finite sum sense but not in the integral sense.
Theorem 4 (Strictly pd kernels are characteristic). If $k$ is strictly positive definite on $M$, then $k$ is characteristic to $\mathcal{P}$.
The proof idea is to derive necessary and sufficient conditions for a kernel not to be characteristic. We show that choosing $k$ to be strictly pd violates these conditions and $k$ is therefore characteristic to $\mathcal{P}$. Examples of strictly pd kernels on $\mathbb{R}^d$ include $\exp(-\sigma\|x-y\|_2^2)$, $\sigma > 0$; $\exp(-\sigma\|x-y\|_1)$, $\sigma > 0$; $(c^2 + \|x-y\|_2^2)^{-\beta}$, $\beta > 0$, $c > 0$; $B_{2l+1}$-splines etc. Note that $\tilde{k}(x,y) = f(x)k(x,y)f(y)$ is a strictly pd kernel if $k$ is strictly pd, where $f : M \to \mathbb{R}$ is a bounded continuous function. Therefore, translation-variant strictly pd kernels can be obtained by choosing $k$ to be a translation invariant strictly pd kernel. A simple example of a translation-variant kernel that is a strictly pd kernel on compact sets of $\mathbb{R}^d$ is $\tilde{k}(x,y) = \exp(\sigma x^T y)$, $\sigma > 0$, where we have chosen $f(\cdot) = \exp(\sigma\|\cdot\|_2^2/2)$ and $k(x,y) = \exp(-\sigma\|x-y\|_2^2/2)$, $\sigma > 0$. Therefore, $\tilde{k}$ is characteristic on compact sets of $\mathbb{R}^d$, which is the same result that follows from the universality of $\tilde{k}$ [18, Section 3, Example 1].
The following result in [13], which is based on the usual definition of strictly pd kernels, can be obtained as a corollary to Theorem 4.
Corollary 5 ([13]). Let $X = \{x_i\}_{i=1}^m \subset M$, $Y = \{y_j\}_{j=1}^n \subset M$ and assume that $x_i \neq x_j$, $y_i \neq y_j$, $\forall\, i \neq j$. Suppose $k$ is strictly positive definite. Then $\sum_{i=1}^m \alpha_i k(\cdot,x_i) = \sum_{j=1}^n \beta_j k(\cdot,y_j)$ for some $\alpha_i, \beta_j \in \mathbb{R}\setminus\{0\}$ $\Rightarrow$ $X = Y$.
Suppose we choose $\alpha_i = \frac{1}{m}$, $\forall i$, and $\beta_j = \frac{1}{n}$, $\forall j$, in Corollary 5. Then $\sum_{i=1}^m \alpha_i k(\cdot,x_i)$ and $\sum_{j=1}^n \beta_j k(\cdot,y_j)$ represent the mean functions in $H$. Note that the Parzen classifier in (4) is a mean classifier (that separates the mean functions) in $H$, i.e., $\operatorname{sign}(\langle k(\cdot,x), w\rangle_H)$, where $w = \frac{1}{m}\sum_{i=1}^m k(\cdot,x_i) - \frac{1}{n}\sum_{i=1}^n k(\cdot,y_i)$. Suppose $k$ is strictly pd (more generally, suppose $k$ is characteristic). Then, by Corollary 5, the normal vector, $w$, to the hyperplane in $H$ passing through the origin is zero, i.e., the mean functions coincide (and are therefore not classifiable) if and only if $X = Y$.
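The mean-classifier reading of (4) can be made concrete with a few lines of code. The following is a minimal sketch, assuming points are given as tuples of floats and using a Gaussian kernel as one possible strictly pd choice; the names `gaussian_kernel`, `parzen_classify`, `pos` and `neg` are hypothetical and introduced here only for illustration.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-sigma * ||x - y||_2^2) for points given as sequences."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, y)))


def parzen_classify(x, pos, neg, kernel=gaussian_kernel):
    """Parzen window rule of (4).

    Compares the average kernel value between x and the class +1 points
    (pos) with the average over the class -1 points (neg). This is the
    sign of <k(., x), w>_H for w = mean embedding of pos minus mean
    embedding of neg. Ties are assigned to -1, as in (4).
    """
    score_pos = sum(kernel(x, xi) for xi in pos) / len(pos)
    score_neg = sum(kernel(x, xi) for xi in neg) / len(neg)
    return 1 if score_pos > score_neg else -1
```

If the two empirical mean functions coincide, both scores are equal for every `x` and the rule returns -1 everywhere, which is the "not classifiable" case described above.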
4 Generalizing the MMD for Classes of Characteristic Kernels
The discussion so far has been related to the characteristic property of $k$ that makes $\gamma_k$ a metric on $\mathcal{P}$. We have seen that this characteristic property is of prime importance both in distribution testing, and to ensure classifiability of dissimilar distributions in the RKHS. We have not yet addressed how to choose among a selection/family of characteristic kernels, given a particular pair of distributions we wish to discriminate between. We introduce one approach to this problem in the present section.
Let $M = \mathbb{R}^d$ and $k_\sigma(x,y) = \exp(-\sigma\|x-y\|_2^2)$, $\sigma \in \mathbb{R}_+$, where $\sigma$ represents the bandwidth parameter. $\{k_\sigma : \sigma \in \mathbb{R}_+\}$ is the family of Gaussian kernels and $\{\gamma_{k_\sigma} : \sigma \in \mathbb{R}_+\}$ is the family of MMDs indexed by the kernel parameter, $\sigma$. Note that $k_\sigma$ is characteristic for any $\sigma \in \mathbb{R}_{++}$ and therefore $\gamma_{k_\sigma}$ is a metric on $\mathcal{P}$ for any $\sigma \in \mathbb{R}_{++}$. However, in practice, one would prefer a single number that defines the distance between $P$ and $Q$. The question therefore to be addressed is how to choose an appropriate $\sigma$. The choice of $\sigma$ has important implications on the statistical aspect of $\gamma_{k_\sigma}$. Note that as $\sigma \to 0$, $k_\sigma \to 1$ and as $\sigma \to \infty$, $k_\sigma \to 0$ a.e., which means $\gamma_{k_\sigma}(P,Q) \to 0$ as $\sigma \to 0$ or $\sigma \to \infty$ for all $P,Q \in \mathcal{P}$ (this behavior is also exhibited by $k_\sigma(x,y) = \exp(-\sigma\|x-y\|_1)$ and $k_\sigma(x,y) = \sigma^2/(\sigma^2 + \|x-y\|_2^2)$, which are also characteristic). This means choosing sufficiently small or sufficiently large $\sigma$ (depending on $P$ and $Q$) makes $\gamma_{k_\sigma}(P,Q)$ arbitrarily small. Therefore, $\sigma$ has to be chosen appropriately in applications to effectively distinguish between $P$ and $Q$. Presently, the applications involving MMD set $\sigma$ heuristically [8, 9].
To generalize the MMD to families of kernels, we propose the following modification to $\gamma_k$, which yields a pseudometric on $\mathcal{P}$,
$$\gamma(P,Q) = \sup\{\gamma_k(P,Q) : k \in \mathcal{K}\} = \sup\{\|Pk - Qk\|_H : k \in \mathcal{K}\}. \tag{6}$$
Note that $\gamma$ is the maximal RKHS distance between $P$ and $Q$ over a family, $\mathcal{K}$, of positive definite kernels. It is easy to check that if any $k \in \mathcal{K}$ is characteristic, then $\gamma$ is a metric on $\mathcal{P}$. Examples for $\mathcal{K}$ include: $\mathcal{K}_g := \{e^{-\sigma\|x-y\|_2^2},\ x,y \in \mathbb{R}^d : \sigma \in \mathbb{R}_+\}$; $\mathcal{K}_l := \{e^{-\sigma\|x-y\|_1},\ x,y \in \mathbb{R}^d : \sigma \in \mathbb{R}_+\}$; $\mathcal{K}_\psi := \{e^{-\sigma\psi(x,y)},\ x,y \in M : \sigma \in \mathbb{R}_+\}$, where $\psi : M \times M \to \mathbb{R}$ is a negative definite kernel; $\mathcal{K}_{rbf} := \{\int_0^\infty e^{-\lambda\|x-y\|_2^2}\,d\mu_\sigma(\lambda),\ x,y \in \mathbb{R}^d,\ \mu_\sigma \in \mathcal{M}_+ : \sigma \in \Sigma \subset \mathbb{R}^d\}$, where $\mathcal{M}_+$ is the set of all finite nonnegative Borel measures, $\mu_\sigma$, on $\mathbb{R}_+$ that are not concentrated at zero, etc.
The proposal of $\gamma(P,Q)$ in (6) can be motivated by the connection that we have established in Section 2 between $\gamma_k$ and the Parzen window classifier. Since the Parzen window classifier depends on the kernel, $k$, one can propose to learn the kernel like in support vector machines [10], wherein the kernel is chosen such that $R^L_{\mathcal{F}_k}$ in Theorem 1 is minimized over $k \in \mathcal{K}$, i.e., $\inf_{k \in \mathcal{K}} R^L_{\mathcal{F}_k} = -\sup_{k \in \mathcal{K}} \gamma_k(P,Q) = -\gamma(P,Q)$. A similar motivation for $\gamma$ can be provided based on (5) as learning the kernel in a hard-margin SVM by maximizing its margin.
At this point, we briefly discuss the issue of normalized vs. unnormalized kernel families, $\mathcal{K}$, in (6). We say a translation-invariant kernel, $k$, on $\mathbb{R}^d$ is normalized if $\int_M \psi(y)\,dy = c$ (some positive constant independent of the kernel parameter), where $k(x,y) = \psi(x-y)$. $\mathcal{K}$ is a normalized kernel family if every kernel in $\mathcal{K}$ is normalized. If $\mathcal{K}$ is not normalized, we say it is unnormalized. For example, it is easy to see that $\mathcal{K}_g$ and $\mathcal{K}_l$ are unnormalized kernel families. Let us consider the normalized Gaussian family, $\mathcal{K}^n_g = \{(\sigma/\pi)^{d/2} e^{-\sigma\|x-y\|_2^2},\ x,y \in \mathbb{R}^d : \sigma \in [\sigma_0,\infty)\}$. It can be shown that for any $k_\sigma, k_\tau \in \mathcal{K}^n_g$, $0 < \sigma < \tau < \infty$, we have $\gamma_{k_\sigma}(P,Q) \ge \gamma_{k_\tau}(P,Q)$, which means $\gamma(P,Q) = \gamma_{k_{\sigma_0}}(P,Q)$. Therefore, the generalized MMD reduces to a single kernel MMD. A similar result also holds for the normalized inverse-quadratic kernel family, $\{\sqrt{2\sigma^2/\pi}\,(\sigma^2 + \|x-y\|_2^2)^{-1},\ x,y \in \mathbb{R} : \sigma \in [\sigma_0,\infty)\}$. These examples show that the generalized MMD definition is usually not very useful if $\mathcal{K}$ is a normalized kernel family. In addition, $\sigma_0$ should be chosen beforehand, which is equivalent to heuristically setting the kernel parameter in $\gamma_k$. Note that $\sigma_0$ cannot be zero because in the limiting case of $\sigma \to 0$, the kernels approach a Dirac distribution, which means the limiting kernel is not bounded and therefore the definition of MMD in (1) does not hold. So, in this work, we consider unnormalized kernel families to render the definition of generalized MMD in (6) useful.
To use $\gamma$ in statistical applications where $P$ and $Q$ are known only through i.i.d. samples $\{X_i\}_{i=1}^m$ and $\{Y_i\}_{i=1}^n$ respectively, we require its estimator $\gamma(P_m,Q_n)$ to be consistent, where $P_m$ and $Q_n$ represent the empirical measures based on $\{X_i\}_{i=1}^m$ and $\{Y_j\}_{j=1}^n$. For $k$ measurable and bounded, [8, 15] have shown that $\gamma_k(P_m,Q_n)$ is a $\sqrt{mn/(m+n)}$-consistent estimator of $\gamma_k(P,Q)$. The statistical consistency of $\gamma(P_m,Q_n)$ is established in the following theorem, which uses tools from U-process theory [3, Chapters 3,5]. We begin with the following definition.
Definition 6 (Rademacher chaos). Let $\mathcal{G}$ be a class of functions on $M \times M$ and $\{\rho_i\}_{i=1}^n$ be independent Rademacher random variables, i.e., $\Pr(\rho_i = 1) = \Pr(\rho_i = -1) = \frac{1}{2}$. The homogeneous Rademacher chaos process of order two with respect to $\{\rho_i\}_{i=1}^n$ is defined as $\{n^{-1}\sum_{i<j}^n \rho_i\rho_j g(x_i,x_j) : g \in \mathcal{G}\}$ for some $\{x_i\}_{i=1}^n \subset M$. The Rademacher chaos complexity over $\mathcal{G}$ is defined as
$$U_n(\mathcal{G};\{x_i\}_{i=1}^n) := \mathbb{E}_\rho \sup_{g \in \mathcal{G}} \left| \frac{1}{n} \sum_{i<j}^n \rho_i\rho_j g(x_i,x_j) \right|. \tag{7}$$
We now provide the main result of the present section.
Theorem 7 (Consistency of $\gamma(P_m,Q_n)$). Let every $k \in \mathcal{K}$ be measurable and bounded with $\nu := \sup_{k \in \mathcal{K}, x \in M} k(x,x) < \infty$. Then, with probability at least $1 - \delta$, $|\gamma(P_m,Q_n) - \gamma(P,Q)| \le A$, where
$$A = \sqrt{\frac{16\,U_m(\mathcal{K};\{X_i\})}{m}} + \sqrt{\frac{16\,U_n(\mathcal{K};\{Y_i\})}{n}} + \frac{\left(\sqrt{8\nu} + \sqrt{36\nu\log\frac{4}{\delta}}\right)\sqrt{m+n}}{\sqrt{mn}}. \tag{8}$$
From (8), it is clear that if $U_m(\mathcal{K};\{X_i\}) = O_P(1)$ and $U_n(\mathcal{K};\{Y_i\}) = O_Q(1)$, then $\gamma(P_m,Q_n) \xrightarrow{a.s.} \gamma(P,Q)$. The following result provides a bound on $U_m(\mathcal{K};\{X_i\})$ in terms of the entropy integral.
Lemma 8 (Entropy bound). For any $\mathcal{K}$ as in Theorem 7 with $0 \in \mathcal{K}$, there exists a universal constant $C$ such that
$$U_m(\mathcal{K};\{X_i\}_{i=1}^m) \le C \int_0^\nu \log N(\mathcal{K},D,\epsilon)\,d\epsilon, \tag{9}$$
where $D(k_1,k_2) = \left[\frac{1}{m}\sum_{i<j}^m (k_1(X_i,X_j) - k_2(X_i,X_j))^2\right]^{\frac{1}{2}}$. $N(\mathcal{K},D,\epsilon)$ represents the $\epsilon$-covering number of $\mathcal{K}$ with respect to the metric $D$.
Assuming $\mathcal{K}$ to be a VC-subgraph class, the following result, as a corollary to Lemma 8, provides an estimate of $U_m(\mathcal{K};\{X_i\}_{i=1}^m)$. Before presenting the result, we first provide the definition of a VC-subgraph class.
Definition 9 (VC-subgraph class). The subgraph of a function $g : M \to \mathbb{R}$ is the subset of $M \times \mathbb{R}$ given by $\{(x,t) : t < g(x)\}$. A collection $\mathcal{G}$ of measurable functions on a sample space is called a VC-subgraph class, if the collection of all subgraphs of the functions in $\mathcal{G}$ forms a VC-class of sets (in $M \times \mathbb{R}$).
The VC-index (also called the VC-dimension) of a VC-subgraph class, $\mathcal{G}$, is the same as the pseudo-dimension of $\mathcal{G}$. See [1, Definition 11.1] for details.
Corollary 10 ($U_m(\mathcal{K};\{X_i\})$ for VC-subgraph, $\mathcal{K}$). Suppose $\mathcal{K}$ is a VC-subgraph class with $V(\mathcal{K})$ being the VC-index. Assume $\mathcal{K}$ satisfies the conditions in Theorem 7 and $0 \in \mathcal{K}$. Then
$$U_m(\mathcal{K};\{X_i\}) \le C\nu\log\left(C_1 V(\mathcal{K})(16e^9)^{V(\mathcal{K})}\right), \tag{10}$$
for some universal constants $C$ and $C_1$.
Using (10) in (8), we have $|\gamma(P_m,Q_n) - \gamma(P,Q)| = O_{P,Q}(\sqrt{(m+n)/mn})$ and by the Borel-Cantelli lemma, $|\gamma(P_m,Q_n) - \gamma(P,Q)| \xrightarrow{a.s.} 0$. Now, the question reduces to which of the kernel classes, $\mathcal{K}$, have $V(\mathcal{K}) < \infty$. [22, Lemma 12] showed that $V(\mathcal{K}_g) = 1$ (also see [23]) and $U_m(\mathcal{K}_{rbf}) \le C_2 U_m(\mathcal{K}_g)$, where $C_2 < \infty$. It can be shown that $V(\mathcal{K}_\psi) = 1$ and $V(\mathcal{K}_l) = 1$. All these classes satisfy the conditions of Theorem 7 and Corollary 10 and therefore provide consistent estimates of $\gamma(P,Q)$ for any $P,Q \in \mathcal{P}$. Examples of kernels on $\mathbb{R}^d$ that are covered by these classes include the Gaussian, Laplacian, inverse multiquadratics, Matérn class etc. Other choices for $\mathcal{K}$ that are popular in machine learning are the linear combination of kernels, $\mathcal{K}_{lin} := \{k_\lambda = \sum_{i=1}^l \lambda_i k_i \mid k_\lambda \text{ is pd},\ \sum_{i=1}^l \lambda_i = 1\}$ and $\mathcal{K}_{con} := \{k_\lambda = \sum_{i=1}^l \lambda_i k_i \mid \lambda_i \ge 0,\ \sum_{i=1}^l \lambda_i = 1\}$. [16, Lemma 7] have shown that $V(\mathcal{K}_{con}) \le V(\mathcal{K}_{lin}) \le l$. Therefore, instead of using a class based on a fixed, parameterized kernel, one can also use a finite linear combination of kernels to compute $\gamma$.
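To build some intuition for the Rademacher chaos complexity in Definition 6, it can be approximated numerically. The sketch below is a Monte Carlo estimate under two simplifying assumptions that are not part of the analysis above: the class of functions is replaced by a finite list, so the supremum becomes a maximum, and the expectation over the Rademacher variables is replaced by an average over random draws. The function name and its parameters are hypothetical.

```python
import random


def chaos_complexity(xs, funcs, n_draws=200, seed=0):
    """Monte Carlo estimate of U_n(G; {x_i}) in (7) for a finite class G.

    xs    : list of sample points x_1, ..., x_n
    funcs : finite list of functions g(x, y) standing in for the class G
    """
    rng = random.Random(seed)
    n = len(xs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Precompute g(x_i, x_j) over all pairs i < j, one row per function g.
    values = [[g(xs[i], xs[j]) for (i, j) in pairs] for g in funcs]
    total = 0.0
    for _ in range(n_draws):
        rho = [rng.choice((-1, 1)) for _ in range(n)]
        signs = [rho[i] * rho[j] for (i, j) in pairs]
        # sup over g of |(1/n) * sum_{i<j} rho_i rho_j g(x_i, x_j)|
        total += max(
            abs(sum(s * v for s, v in zip(signs, row))) / n for row in values
        )
    return total / n_draws
```

For a family such as the Gaussian kernels, `funcs` could hold the kernels for a grid of bandwidths; the resulting number is only a lower approximation of the supremum over the full family.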
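So far, we have presented the metric property and statistical consistency (of the empirical estimator) of $\gamma$. Now, the question is how do we compute $\gamma(P_m,Q_n)$ in practice. To show this, in the following, we present two examples.
Example 11. Suppose $\mathcal{K} = \mathcal{K}_g$. Then, $\gamma(P_m,Q_n)$ can be written as
$$\gamma^2(P_m,Q_n) = \sup_{\sigma \in \mathbb{R}_+} \left\{ \sum_{i,j=1}^m \frac{e^{-\sigma\|X_i-X_j\|_2^2}}{m^2} + \sum_{i,j=1}^n \frac{e^{-\sigma\|Y_i-Y_j\|_2^2}}{n^2} - 2\sum_{i,j=1}^{m,n} \frac{e^{-\sigma\|X_i-Y_j\|_2^2}}{mn} \right\}. \tag{11}$$
The optimum $\sigma^*$ can be obtained by solving (11) and $\gamma(P_m,Q_n) = \|P_m k_{\sigma^*} - Q_n k_{\sigma^*}\|_{H_{\sigma^*}}$.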
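Since the supremum in (11) is over a single scalar, one simple way to approximate it is to evaluate the objective on a grid of bandwidths and keep the largest value. The sketch below does this in plain Python; the grid search, the function names and the representation of points as tuples are assumptions made for illustration, not the optimization procedure used in the paper.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_kernel(us, vs, sigma):
    """Average of exp(-sigma * ||u - v||_2^2) over all pairs (u, v)."""
    total = sum(math.exp(-sigma * sq_dist(u, v)) for u in us for v in vs)
    return total / (len(us) * len(vs))


def mmd2_gauss(xs, ys, sigma):
    """Objective inside the supremum in (11) for one bandwidth sigma."""
    return (mean_kernel(xs, xs, sigma) + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


def generalized_mmd2(xs, ys, sigmas):
    """Grid approximation of (11): returns (largest value, its sigma)."""
    return max((mmd2_gauss(xs, ys, s), s) for s in sigmas)
```

A call such as `generalized_mmd2(xs, ys, [2.0 ** e for e in range(-3, 7)])` returns both the approximate $\gamma^2(P_m,Q_n)$ and the selected bandwidth. A finer grid or a one-dimensional optimizer would tighten the approximation at higher cost.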
Example 12. Suppose $\mathcal{K} = \mathcal{K}_{con}$. Then, $\gamma(P_m,Q_n)$ becomes
$$\gamma^2(P_m,Q_n) = \sup_{k \in \mathcal{K}_{con}} \|P_m k - Q_n k\|_H^2 = \sup_{k \in \mathcal{K}_{con}} \iint k\,d(P_m - Q_n) \otimes (P_m - Q_n) = \sup\{\lambda^T a : \lambda^T \mathbf{1} = 1,\ \lambda \succeq 0\}, \tag{12}$$
where we have replaced $k$ by $\sum_{i=1}^l \lambda_i k_i$. Here $\lambda = (\lambda_1,\ldots,\lambda_l)$ and $(a)_i = \|P_m k_i - Q_n k_i\|_{H_i}^2 = \frac{1}{m^2}\sum_{a,b=1}^m k_i(X_a,X_b) + \frac{1}{n^2}\sum_{a,b=1}^n k_i(Y_a,Y_b) - \frac{2}{mn}\sum_{a,b=1}^{m,n} k_i(X_a,Y_b)$. It is easy to see that $\gamma^2(P_m,Q_n) = \max_{1 \le i \le l} (a)_i$.
Similar examples can be provided for other $\mathcal{K}$, where $\gamma(P_m,Q_n)$ can be computed by solving a semidefinite program ($\mathcal{K} = \mathcal{K}_{lin}$) or by constrained gradient descent ($\mathcal{K} = \mathcal{K}_l, \mathcal{K}_{rbf}$).
Finally, while the approach in (6) to generalizing $\gamma_k$ is our focus in this paper, an alternative Bayesian strategy would be to define a non-negative finite measure $\lambda$ over $\mathcal{K}$, and to average $\gamma_k$ over that measure, i.e., $\beta(P,Q) := \int_{\mathcal{K}} \gamma_k(P,Q)\,d\lambda(k)$. This also yields a pseudometric on $\mathcal{P}$. That said, $\beta(P,Q) \le \lambda(\mathcal{K})\gamma(P,Q)$, $\forall P,Q$, which means if $P$ and $Q$ can be distinguished by $\beta$, they can be distinguished by $\gamma$, but not vice-versa. In this sense, $\gamma$ is stronger than $\beta$. One further complication with the Bayesian approach is in defining a sensible $\lambda$ over $\mathcal{K}$. Note that $\gamma_{k_0}$ (single kernel MMD based on $k_0$) can be obtained by defining $\lambda(k) = \delta(k - k_0)$ in $\beta(P,Q)$.
5 Experiments
In this section, we present a benchmark experiment which illustrates that the generalized MMD proposed in Section 4 is preferable to the single kernel MMD where the kernel parameter is set heuristically. The experimental setup is as follows.
Let $p = N(0,\sigma_p^2)$, a normal distribution in $\mathbb{R}$ with zero mean and variance $\sigma_p^2$. Let $q$ be the perturbed version of $p$, given as $q(x) = p(x)(1 + \sin\nu x)$. Here $p$ and $q$ are the densities associated with $P$ and $Q$ respectively. It is easy to see that $q$ differs from $p$ at increasing frequencies with increasing $\nu$. Let $k(x,y) = \exp(-(x-y)^2/\sigma)$. Now, the goal is that given random samples drawn i.i.d. from $P$ and $Q$ (with $\nu$ fixed), we would like to test $H_0 : P = Q$ vs. $H_1 : P \neq Q$. The idea is that as $\nu$ increases, it will be harder to distinguish between $P$ and $Q$ for a fixed sample size. Therefore, using this setup we can verify whether the adaptive bandwidth selection achieved by $\gamma$ (as the test statistic) helps to distinguish between $P$ and $Q$ at higher $\nu$ compared to $\gamma_k$ with a heuristic $\sigma$. To this end, using $\gamma(P_m,Q_n)$ and $\gamma_k(P_m,Q_n)$ (with various $\sigma$) as test statistics $T_{mn}$, we design a test that returns $H_0$ if $T_{mn} \le c_{mn}$, and $H_1$ otherwise. The problem therefore reduces to finding $c_{mn}$. $c_{mn}$ is determined as the $(1-\alpha)$ quantile of the asymptotic distribution of $T_{mn}$ under $H_0$, which therefore fixes the type-I error (the probability of rejecting $H_0$ when it is true) to $\alpha$. The consistency of this test under $\gamma_k$ (for any fixed $\sigma$) is proved in [8]. A similar result can be shown for $\gamma$ under some conditions on $\mathcal{K}$. We skip the details here.
In our experiments, we set $m = n = 1000$, $\sigma_p^2 = 10$ and draw two sets of independent random samples from $Q$. The distribution of $T_{mn}$ is estimated by bootstrapping on these samples (250 bootstrap iterations are performed) and the associated 95th quantile (we choose $\alpha = 0.05$) is computed. Since the performance of the test is judged by its type-II error (the probability of accepting $H_0$ when $H_1$ is true), we draw a random sample, one each from $P$ and $Q$, and test whether $P = Q$. This process is repeated 300 times, and estimates of type-I and type-II errors are obtained for both $\gamma$ and $\gamma_k$. 14 different values for $\sigma$ are considered on a logarithmic scale of base 2 with exponents $(-3,-2,-1,0,1,\frac{3}{2},2,\frac{5}{2},3,\frac{7}{2},4,5,6)$ along with the median distance between samples as one more choice. 5 different choices for $\nu$ are considered: $(\frac{1}{2},\frac{3}{4},1,\frac{5}{4},\frac{3}{2})$.
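The setup above can be sketched in code. The following is a hedged illustration rather than the authors' experimental code: it samples from $q$ by rejection sampling (valid because $q(x) \le 2p(x)$), uses the kernel $k(x,y) = \exp(-(x-y)^2/\sigma)$ on the bandwidth grid listed above, and estimates the threshold $c_{mn}$ by permuting the pooled sample, which is a swapped-in resampling scheme and not the bootstrap on two samples from $Q$ described in the text. All function names and default arguments are hypothetical, and the sample sizes should be kept small because the implementation is plain Python.

```python
import math
import random

# Bandwidths 2**e for the exponents listed in the text (median heuristic omitted).
SIGMAS = [2.0 ** e for e in (-3, -2, -1, 0, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5, 6)]


def sample_p(rng, n, var_p=10.0):
    """Draw n points from p = N(0, var_p)."""
    sd = math.sqrt(var_p)
    return [rng.gauss(0.0, sd) for _ in range(n)]


def sample_q(rng, n, nu, var_p=10.0):
    """Draw n points from q(x) = p(x) * (1 + sin(nu * x)) by rejection:
    a proposal x ~ p is accepted with probability (1 + sin(nu * x)) / 2."""
    sd = math.sqrt(var_p)
    out = []
    while len(out) < n:
        x = rng.gauss(0.0, sd)
        if rng.random() < 0.5 * (1.0 + math.sin(nu * x)):
            out.append(x)
    return out


def gram_matrices(pooled, sigmas):
    """One Gram matrix per bandwidth for k(x, y) = exp(-(x - y)^2 / sigma)."""
    return [[[math.exp(-(a - b) ** 2 / s) for b in pooled] for a in pooled]
            for s in sigmas]


def gen_mmd2(grams, idx_x, idx_y):
    """Largest squared empirical MMD over the bandwidth grid."""
    m, n = len(idx_x), len(idx_y)
    best = 0.0
    for g in grams:
        xx = sum(g[i][j] for i in idx_x for j in idx_x) / (m * m)
        yy = sum(g[i][j] for i in idx_y for j in idx_y) / (n * n)
        xy = sum(g[i][j] for i in idx_x for j in idx_y) / (m * n)
        best = max(best, xx + yy - 2.0 * xy)
    return best


def permutation_test(xs, ys, sigmas, n_perm=50, alpha=0.05, seed=0):
    """Return (statistic, threshold, reject_H0) using a permutation threshold."""
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    m = len(xs)
    grams = gram_matrices(pooled, sigmas)
    idx = list(range(len(pooled)))
    stat = gen_mmd2(grams, idx[:m], idx[m:])
    null = []
    for _ in range(n_perm):
        rng.shuffle(idx)  # random relabeling of the pooled sample
        null.append(gen_mmd2(grams, idx[:m], idx[m:]))
    null.sort()
    k = min(n_perm - 1, int(math.ceil((1.0 - alpha) * n_perm)) - 1)
    threshold = null[k]  # empirical (1 - alpha) quantile under H0
    return stat, threshold, stat > threshold
```

With, say, `rng = random.Random(0)`, calling `permutation_test(sample_p(rng, 40), sample_q(rng, 40, nu=1.0), SIGMAS)` runs one such test. Repeating it over many independent draws would give rough type-I and type-II error estimates, although at these small sample sizes they will not match the figures reported for $m = n = 1000$.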
[Figure 1, five panels: (a) type-I and type-II error (in %) versus ν; (b) type-I error (in %) versus log σ for ν = 0.5, 0.75, 1.0, 1.25, 1.5; (c) type-II error (in %) versus log σ for the same values of ν; (d) log σ versus ν; (e) median as σ versus ν.]
Figure 1: (a) Type-I and type-II errors (in %) for γ for varying ν. (b,c) Type-I and type-II error (in %) for γk (with different σ) for varying ν. The dotted line in (c) corresponds to the median heuristic, which shows that its associated type-II error is very large at large ν. (d) Box plot of log σ grouped by ν, where σ is selected by γ. (e) Box plot of the median distance between points (which is also a choice for σ), grouped by ν. Refer to Section 5 for details.
Figure 1(a) shows the estimated type-I and type-II errors using $\gamma$ as the test statistic for varying $\nu$. Note that the type-I error is close to its design value of 5%, while the type-II error is zero for all $\nu$, which means $\gamma$ distinguishes between $P$ and $Q$ for all perturbations. Figures 1(b,c) show the estimates of type-I and type-II errors using $\gamma_k$ as the test statistic for different $\sigma$ and $\nu$. Figure 1(d) shows the box plot for $\log\sigma$, grouped by $\nu$, where $\sigma$ is the bandwidth selected by $\gamma$. Figure 1(e) shows the box plot of the median distance between points (which is also a choice for $\sigma$), grouped by $\nu$. From Figures 1(c) and (e), it is easy to see that the median heuristic exhibits high type-II error for $\nu = \frac{3}{2}$, while $\gamma$ exhibits zero type-II error (from Figure 1(a)). Figure 1(c) also shows that heuristic choices of $\sigma$ can result in high type-II errors. It is intuitive to note that as $\nu$ increases (which means the characteristic function of $Q$ differs from that of $P$ at higher frequencies), a smaller $\sigma$ is needed to detect these changes. The advantage of using $\gamma$ is that it selects $\sigma$ in a distribution-dependent fashion and its behavior in the box plot shown in Figure 1(d) matches the previously mentioned intuition about the behavior of $\sigma$ with respect to $\nu$. These results demonstrate the validity of using $\gamma$ as a distance measure in applications.
6 Conclusions
In this work, we have shown how MMD appears in binary classification, and thus that characteristic kernels are important in kernel-based classification algorithms. We have broadened the class of characteristic RKHSs to include those induced by strictly positive definite kernels (with particular application to kernels on non-compact domains, and/or kernels that are not translation invariant). We have further provided a convergent generalization of MMD over families of kernel functions, which becomes necessary even in considering relatively simple families of kernels (such as the Gaussian kernels parameterized by their bandwidth). The usefulness of the generalized MMD is illustrated experimentally with a two-sample testing problem.
Acknowledgments
The authors thank anonymous reviewers for their constructive comments and especially the
reviewer who pointed out the connection between characteristic kernels and the achievability of Bayes
risk. B. K. S. was supported by the MPI for Biological Cybernetics, National Science Foundation
(grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO
program. A. G. was supported by grants DARPA IPTO FA8750-09-1-0141, ONR MURI
N000140710747, and ARO MURI W911NF0810242.
A Proofs
We provide proofs for the results in Sections 2-4.
A.1 Proof of Theorem 1
To prove Theorem 1, we need the following result from [17].
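Theorem 13 ([17]). Let $\mathcal{F}_k := \{f : \|f\|_{\mathcal{H}} \le 1\}$, where $(\mathcal{H},k)$ is an RKHS defined on a measurable
space $M$ with $k$ measurable and bounded. Then,
$$\gamma_k(P,Q) = \sup_{f\in\mathcal{F}_k} |Pf - Qf| = \|Pk - Qk\|_{\mathcal{H}}, \tag{13}$$
where $\|\cdot\|_{\mathcal{H}}$ represents the RKHS norm.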
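Proof of Theorem 1: From (3), we have
$$\varepsilon\int_M L_1(f)\,dP + (1-\varepsilon)\int_M L_{-1}(f)\,dQ = \int_M f\,dQ - \int_M f\,dP = Qf - Pf.$$
Therefore,
$$R^L_{\mathcal{F}_k} = \inf_{f\in\mathcal{F}_k}(Qf - Pf) = -\sup_{f\in\mathcal{F}_k}(Pf - Qf) = -\sup_{f\in\mathcal{F}_k}|Pf - Qf| = -\gamma_k(P,Q),$$
which follows from Theorem 13. Given $\{(X_i,Y_i)\}_{i=1}^N$ drawn i.i.d. from $\mu$, the empirical equivalent
of (3) is given by
$$\inf\left\{ -\frac{1}{m}\sum_{Y_i=1} f(X_i) + \frac{1}{N-m}\sum_{Y_i=-1} f(X_i) \;:\; f\in\mathcal{F}_k \right\}.$$
Solving this for $f$ gives
$$f = \frac{\frac{1}{m}\sum_{Y_i=1} k(\cdot,X_i) - \frac{1}{N-m}\sum_{Y_i=-1} k(\cdot,X_i)}{\left\| \frac{1}{m}\sum_{Y_i=1} k(\cdot,X_i) - \frac{1}{N-m}\sum_{Y_i=-1} k(\cdot,X_i) \right\|_{\mathcal{H}}},$$
and the result in (4) follows.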
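The closed-form minimizer can be checked numerically. The sketch below is an illustration only: the Gaussian kernel, the sample sizes and the toy data are assumptions, and the names `mean_k`, `empirical_mmd` and `witness` are hypothetical helpers. It evaluates the empirical objective at the normalized difference of mean embeddings and compares it with the negative of the empirical MMD.

```python
import math
import random

def k(x, y):
    # Gaussian kernel with unit bandwidth (an assumption of this sketch).
    return math.exp(-0.5 * (x - y) ** 2)

def mean_k(us, vs):
    # (1 / (|us| |vs|)) * sum_{u, v} k(u, v): inner product of two mean embeddings.
    return sum(k(u, v) for u in us for v in vs) / (len(us) * len(vs))

def empirical_mmd(pos, neg):
    # RKHS norm of (1/m) sum_{Y_i=1} k(., X_i) - (1/(N-m)) sum_{Y_i=-1} k(., X_i).
    sq = mean_k(pos, pos) + mean_k(neg, neg) - 2.0 * mean_k(pos, neg)
    return math.sqrt(max(sq, 0.0))

def witness(x, pos, neg, gamma):
    # The unit-norm minimizer f from the proof, evaluated at the point x.
    g = sum(k(x, a) for a in pos) / len(pos) - sum(k(x, b) for b in neg) / len(neg)
    return g / gamma

rng = random.Random(1)
pos = [rng.gauss(0.0, 1.0) for _ in range(40)]  # points with Y_i = +1
neg = [rng.gauss(1.0, 1.0) for _ in range(30)]  # points with Y_i = -1

gamma = empirical_mmd(pos, neg)
# Empirical objective: -(1/m) sum_{Y_i=1} f(X_i) + (1/(N-m)) sum_{Y_i=-1} f(X_i).
risk = (-sum(witness(a, pos, neg, gamma) for a in pos) / len(pos)
        + sum(witness(b, pos, neg, gamma) for b in neg) / len(neg))
print(gamma, risk)
```

Up to floating-point error, `risk` equals `-gamma`, mirroring the population identity $R^L_{\mathcal{F}_k} = -\gamma_k(P,Q)$ on the sample.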
A.2 Proof of Theorem 2
Before proving Theorem 2, we present a lemma that is used in its proof.
Lemma 14. Let $\theta : V \to \mathbb{R}$ and $\psi : V \to \mathbb{R}$ be convex functions on a real vector space $V$. Suppose
$$a = \sup\{\theta(x) : \psi(x) \le b\}, \tag{14}$$
where $\theta$ is not constant on $\{x : \psi(x) \le b\}$ and $a < \infty$. Then
$$b = \inf\{\psi(x) : \theta(x) \ge a\}. \tag{15}$$
We need the following result from [11, Theorem 32.1] to prove Lemma 14.
Theorem 15 ([11]). Let $f$ be a convex function, and let $C$ be a convex set contained in the domain
of $f$. If $f$ attains its supremum relative to $C$ at some point of the relative interior of $C$, then $f$ is actually
constant throughout $C$.
Proof of Lemma 14: Note that $A := \{x : \psi(x) \le b\}$ is a convex subset of $V$. Since $\theta$ is not constant
on $A$, by Theorem 15, $\theta$ attains its supremum on the boundary of $A$. Therefore, any solution $x^*$ to
(14) satisfies $\theta(x^*) = a$ and $\psi(x^*) = b$. Let $G := \{x : \theta(x) > a\}$. For any $x \in G$, $\psi(x) > b$. If
this were not the case, then $x^*$ would not be a solution to (14). Let $H := \{x : \theta(x) = a\}$. Clearly, $x^* \in H$
and so there exists an $x \in H$ for which $\psi(x) = b$. Suppose $\inf\{\psi(x) : x \in H\} = c < b$, which
means that for some $x_* \in H$, $x_* \in A$ with $\psi(x_*) < b$. From (14), this implies $\theta$ attains its supremum
relative to $A$ at some point of the relative interior of $A$. By Theorem 15, this implies $\theta$ is constant on $A$,
leading to a contradiction. Therefore, $\inf\{\psi(x) : x \in H\} = b$ and the result in (15) follows.
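Proof of Theorem 2: By Theorem 13, $\gamma_k(P,Q) = \sup\{Pf - Qf : \|f\|_{\mathcal{H}} \le 1\}$. Note that $Pf - Qf$
and $\|f\|_{\mathcal{H}}$ are convex functionals on $\mathcal{H}$. For $P \ne Q$, $Pf - Qf$ is not constant on $\mathcal{F}_k$, since $k$ is
characteristic. Therefore, by Lemma 14, we have
$$1 = \inf\{\|f\|_{\mathcal{H}} : Pf - Qf \ge \gamma_k(P,Q),\ f \in \mathcal{H}\}.$$
Since this holds for all $P \ne Q$, it holds for $P_m$ and $Q_n$. Therefore, rescaling $f$ by $2/\gamma_k(P_m,Q_n)$, we have
$$\frac{2}{\gamma_k(P_m,Q_n)} = \inf\{\|f\|_{\mathcal{H}} : P_m f - Q_n f \ge 2,\ f \in \mathcal{H}\}.$$
Consider
$$\{f \in \mathcal{H} : P_m f - Q_n f \ge 2\} = \left\{ f \in \mathcal{H} : \frac{1}{m}\sum_{Y_i=1} f(X_i) - \frac{1}{n}\sum_{Y_i=-1} f(X_i) \ge 2 \right\} \supset \{f \in \mathcal{H} : Y_i f(X_i) \ge 1,\ \forall i\},$$
which means
$$\frac{2}{\gamma_k(P_m,Q_n)} \le \|f_{\mathrm{svm}}\|_{\mathcal{H}}.$$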
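The rescaling step can be made concrete with a small numerical sketch. Under the assumptions of a Gaussian kernel and toy one-dimensional data (neither taken from the paper), the function $f = (2/\gamma_k^2(P_m,Q_n))\,(P_m k - Q_n k)$ is feasible for the constraint $P_m f - Q_n f \ge 2$ and has RKHS norm $2/\gamma_k(P_m,Q_n)$, the value of the infimum above. All variable names are hypothetical.

```python
import math
import random

def k(x, y):
    # Gaussian kernel with unit bandwidth (an assumption of this sketch).
    return math.exp(-0.5 * (x - y) ** 2)

rng = random.Random(2)
pos = [rng.gauss(0.0, 1.0) for _ in range(20)]  # X_i with Y_i = +1 (m points)
neg = [rng.gauss(1.5, 1.0) for _ in range(25)]  # X_i with Y_i = -1 (n points)
points = pos + neg

# Coefficients of g = P_m k - Q_n k = sum_i c_i k(., z_i).
coef = [1.0 / len(pos)] * len(pos) + [-1.0 / len(neg)] * len(neg)
gram = [[k(a, b) for b in points] for a in points]

gamma_sq = sum(ci * cj * gram[i][j]
               for i, ci in enumerate(coef) for j, cj in enumerate(coef))
gamma = math.sqrt(gamma_sq)

# Rescaled witness f = (2 / gamma^2) * g, its values at the sample, and its norm.
f_coef = [2.0 * c / gamma_sq for c in coef]
f_vals = [sum(fc * gram[i][j] for j, fc in enumerate(f_coef))
          for i in range(len(points))]
f_norm = math.sqrt(sum(a * b * gram[i][j]
                       for i, a in enumerate(f_coef) for j, b in enumerate(f_coef)))

# P_m f - Q_n f: mean of f over the positive points minus mean over the negative ones.
mean_gap = sum(f_vals[:len(pos)]) / len(pos) - sum(f_vals[len(pos):]) / len(neg)
print(mean_gap, f_norm, 2.0 / gamma)
```

Here `mean_gap` equals 2 and `f_norm` equals `2.0 / gamma` up to rounding. Any hard-margin solution satisfying $Y_i f(X_i) \ge 1$ for all $i$ is also feasible for the mean constraint, so its norm can only be larger.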
A.3 Proof of Theorem 4
To prove Theorem 4, we need the following lemma that provides necessary and sufficient conditions
for a kernel not to be characteristic.
Lemma 16. Let $k$ be measurable and bounded on $M$. Then $\exists\, P \ne Q$, $P,Q \in \mathscr{P}$ such that
$\gamma_k(P,Q) = 0$ if and only if there exists a finite non-zero signed Borel measure $\mu$ that satisfies:
(i) $\int_M\int_M k(x,y)\,d\mu(x)\,d\mu(y) = 0$,
(ii) $\mu(M) = 0$.
Proof. ($\Leftarrow$) Suppose there exists a finite non-zero signed Borel measure $\mu$ that satisfies (i) and (ii)
in Lemma 16. By the Jordan decomposition theorem [5, Theorem 5.6.1], there exist unique positive
measures $\mu^+$ and $\mu^-$ such that $\mu = \mu^+ - \mu^-$ and $\mu^+ \perp \mu^-$ ($\mu^+$ and $\mu^-$ are singular). By (ii), we
have $\mu^+(M) = \mu^-(M) =: \alpha$. Define $P = \alpha^{-1}\mu^+$ and $Q = \alpha^{-1}\mu^-$. Clearly, $P \ne Q$, $P,Q \in \mathscr{P}$.
Then,
$$\gamma_k^2(P,Q) = \|Pk - Qk\|_{\mathcal{H}}^2 = \alpha^{-2}\|\mu k\|_{\mathcal{H}}^2 = \alpha^{-2}\langle \mu k, \mu k\rangle_{\mathcal{H}}. \tag{16}$$
From the proof of Theorem 13 (see Theorem 3 in [17]), we have $\langle \mu k, \mu k\rangle_{\mathcal{H}} = \int\!\!\int k(x,y)\,d\mu(x)\,d\mu(y)$
and therefore, by (i), $\gamma_k(P,Q) = 0$. So, we have constructed $P \ne Q$ such that $\gamma_k(P,Q) = 0$.
($\Rightarrow$) Suppose $\exists\, P \ne Q$, $P,Q \in \mathscr{P}$ such that $\gamma_k(P,Q) = 0$. Let $\mu = P - Q$. Clearly $\mu$ is a finite
non-zero signed Borel measure that satisfies $\mu(M) = 0$. Note that $\gamma_k^2(P,Q) = \|Pk - Qk\|_{\mathcal{H}}^2 =
\|\mu k\|_{\mathcal{H}}^2 = \int\!\!\int k(x,y)\,d\mu(x)\,d\mu(y)$, and therefore (i) follows.
Proof of Theorem 4: Since $k$ is strictly pd on $M$, we have $\int\!\!\int k(x,y)\,d\eta(x)\,d\eta(y) > 0$ for any
finite non-zero signed Borel measure $\eta$. This means there does not exist a finite non-zero signed
Borel measure that satisfies (i) in Lemma 16. Therefore, by Lemma 16, there does not exist $P \ne Q$,
$P,Q \in \mathscr{P}$ such that $\gamma_k(P,Q) = 0$, which implies $k$ is characteristic to $\mathscr{P}$.
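Conditions (i) and (ii) of Lemma 16 are easy to inspect for discrete signed measures. The sketch below uses an example chosen for illustration, not taken from the paper: on $M = [-1,1]$, take $P = (\delta_{-1} + \delta_{1})/2$ and $Q = \delta_0$, so that $\mu = P - Q$ has total mass zero. The linear kernel $k(x,y) = xy$ (bounded on this $M$) gives a vanishing quadratic form, so it cannot separate $P$ from $Q$, while a Gaussian kernel gives a strictly positive value for the same $\mu$.

```python
import math

def quad_form(kernel, points, weights):
    # Discrete analogue of the double integral in condition (i):
    # sum_i sum_j w_i w_j k(x_i, x_j).
    return sum(wi * wj * kernel(xi, xj)
               for xi, wi in zip(points, weights)
               for xj, wj in zip(points, weights))

def linear(x, y):
    return x * y

def gaussian(x, y):
    return math.exp(-0.5 * (x - y) ** 2)

# mu = P - Q with P = (delta_{-1} + delta_{1}) / 2 and Q = delta_0.
points = [-1.0, 0.0, 1.0]
weights = [0.5, -1.0, 0.5]

total_mass = sum(weights)                          # condition (ii)
linear_form = quad_form(linear, points, weights)   # condition (i), linear kernel
gaussian_form = quad_form(gaussian, points, weights)
print(total_mass, linear_form, gaussian_form)
```

The linear kernel satisfies both conditions for this $\mu$ (it only compares means), whereas the Gaussian quadratic form is positive here, as expected for a strictly pd kernel.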
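A.4 Proof of Corollary 5
Consider $\sum_{i=1}^m \alpha_i k(\cdot,x_i) = \int k(\cdot,x)\,d\mu_X(x) = \mu_X k$, where $\mu_X = \sum_{i=1}^m \alpha_i \delta_{x_i}$. Here $\delta_{x_i}$
represents the Dirac measure at $x_i \in M$. So, we have $\mu_X k = \mu_Y k$, which is equivalent to
$$\int_M\int_M k(x,y)\,d(\mu_X - \mu_Y)(x)\,d(\mu_X - \mu_Y)(y) = 0. \tag{17}$$
Since $k$ is strictly pd, by Lemma 16, we have $\mu_X k = \mu_Y k \Rightarrow \mu_X = \mu_Y \Rightarrow X = Y$.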
A.5 Proof of Theorem 7
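Consider
$$\begin{aligned}
|\gamma(P_m,Q_n) - \gamma(P,Q)| &= \Big| \sup_{k\in\mathcal{K}} \|P_m k - Q_n k\|_{\mathcal{H}} - \sup_{k\in\mathcal{K}} \|Pk - Qk\|_{\mathcal{H}} \Big| \\
&\le \sup_{k\in\mathcal{K}} \big| \|P_m k - Q_n k\|_{\mathcal{H}} - \|Pk - Qk\|_{\mathcal{H}} \big| \\
&\le \sup_{k\in\mathcal{K}} \|P_m k - Q_n k - Pk + Qk\|_{\mathcal{H}} \\
&\le \sup_{k\in\mathcal{K}} \big[ \|P_m k - Pk\|_{\mathcal{H}} + \|Q_n k - Qk\|_{\mathcal{H}} \big] \\
&\le \sup_{k\in\mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} + \sup_{k\in\mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}.
\end{aligned} \tag{18}$$
We now bound the terms $\sup_{k\in\mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$ and $\sup_{k\in\mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}$. Since $\sup_{k\in\mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$
satisfies the bounded difference property, using McDiarmid's inequality gives that with probability
at least $1 - \frac{\delta}{4}$ over the choice of $\{X_i\}$, we have
$$\sup_{k\in\mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le E \sup_{k\in\mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} + \sqrt{\frac{2\nu}{m}\log\frac{4}{\delta}}. \tag{19}$$
By invoking symmetrization for $E \sup_{k\in\mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$, we have
$$E \sup_{k\in\mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le 2\, E E_\rho \sup_{k\in\mathcal{K}} \Big\| \frac{1}{m}\sum_{i=1}^m \rho_i k(\cdot,X_i) \Big\|_{\mathcal{H}}, \tag{20}$$
where $\{\rho_i\}_{i=1}^m$ represent i.i.d. Rademacher random variables and $E_\rho$ represents the expectation
w.r.t. $\{\rho_i\}$ conditioned on $\{X_i\}$. Since $E_\rho \sup_{k\in\mathcal{K}} \|\frac{1}{m}\sum_{i=1}^m \rho_i k(\cdot,X_i)\|_{\mathcal{H}}$ satisfies the bounded
difference property, by McDiarmid's inequality, with probability at least $1 - \frac{\delta}{4}$ over the choice of
the random samples of size $m$, we have
$$E E_\rho \sup_{k\in\mathcal{K}} \Big\| \frac{1}{m}\sum_{i=1}^m \rho_i k(\cdot,X_i) \Big\|_{\mathcal{H}} \le E_\rho \sup_{k\in\mathcal{K}} \Big\| \frac{1}{m}\sum_{i=1}^m \rho_i k(\cdot,X_i) \Big\|_{\mathcal{H}} + \sqrt{\frac{2\nu}{m}\log\frac{4}{\delta}}. \tag{21}$$
By writing
$$\Big\| \frac{1}{m}\sum_{i=1}^m \rho_i k(\cdot,X_i) \Big\|_{\mathcal{H}} = \frac{1}{m}\sqrt{\sum_{i,j=1}^m \rho_i\rho_j k(X_i,X_j)} \le \frac{\sqrt{2}}{m}\sqrt{\Big| \sum_{i<j}^m \rho_i\rho_j k(X_i,X_j) \Big|} + \frac{\sqrt{\nu}}{\sqrt{m}}, \tag{22}$$
we have with probability at least $1 - \frac{\delta}{4}$, the following holds:
$$E E_\rho \sup_{k\in\mathcal{K}} \Big\| \frac{1}{m}\sum_{i=1}^m \rho_i k(\cdot,X_i) \Big\|_{\mathcal{H}} \le \sqrt{\frac{2U_m(\mathcal{K};\{X_i\})}{m}} + \frac{\sqrt{\nu}}{\sqrt{m}} + \sqrt{\frac{2\nu}{m}\log\frac{4}{\delta}}. \tag{23}$$
Tying (19)-(23), we have that w.p. at least $1 - \frac{\delta}{2}$ over the choice of $\{X_i\}$, the following holds:
$$\sup_{k\in\mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le \sqrt{\frac{8U_m(\mathcal{K};\{X_i\})}{m}} + \frac{2\sqrt{\nu}}{\sqrt{m}} + \sqrt{\frac{18\nu}{m}\log\frac{4}{\delta}}. \tag{24}$$
Performing similar analysis for $\sup_{k\in\mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}$, we have that w.p. at least $1 - \frac{\delta}{2}$ over the
choice of $\{Y_i\}$,
$$\sup_{k\in\mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}} \le \sqrt{\frac{8U_n(\mathcal{K};\{Y_i\})}{n}} + \frac{2\sqrt{\nu}}{\sqrt{n}} + \sqrt{\frac{18\nu}{n}\log\frac{4}{\delta}}. \tag{25}$$
Using (24) and (25) with $\sqrt{a} + \sqrt{b} \le \sqrt{2(a+b)}$ provides the result.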
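The decomposition in (22) separates the diagonal of the Gram matrix from the off-diagonal chaos term, and it can be spot-checked numerically. The sketch below assumes a Gaussian kernel, for which $k(x,x) = 1$, and takes $\nu = 1$ as the bound on the diagonal (an assumption about how $\nu$ is defined, since its definition lies outside this appendix). It draws random Rademacher vectors and compares both sides of the inequality.

```python
import math
import random

def k(x, y):
    # Gaussian kernel with unit bandwidth, so k(x, x) = 1 for every x.
    return math.exp(-0.5 * (x - y) ** 2)

rng = random.Random(0)
m = 25
xs = [rng.uniform(-2.0, 2.0) for _ in range(m)]
nu = 1.0  # assumed bound on k(x, x)

def both_sides(rho):
    # Left side: ||(1/m) sum_i rho_i k(., X_i)||_H via the full Gram double sum.
    full = sum(rho[i] * rho[j] * k(xs[i], xs[j]) for i in range(m) for j in range(m))
    # Off-diagonal chaos term: sum over i < j only.
    off = sum(rho[i] * rho[j] * k(xs[i], xs[j])
              for i in range(m) for j in range(i + 1, m))
    lhs = math.sqrt(max(full, 0.0)) / m
    rhs = math.sqrt(2.0) / m * math.sqrt(abs(off)) + math.sqrt(nu) / math.sqrt(m)
    return lhs, rhs

results = [both_sides([rng.choice((-1, 1)) for _ in range(m)]) for _ in range(200)]
all_hold = all(lhs <= rhs + 1e-12 for lhs, rhs in results)
print(all_hold)
```

The inequality holds for every draw because the double sum equals the diagonal contribution (at most $m\nu$) plus twice the off-diagonal sum, and $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$.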
A.6 Proof of Lemma 8
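From [2, Proposition 2.2, Proposition 2.6] (also see [3, Corollary 5.1.8]), we have that there exists a
universal constant $C < \infty$ such that $U_m(\mathcal{K};\{X_i\}) \le C\int_0^\nu \log N(\mathcal{K},D,\epsilon)\,d\epsilon$, where
$$D^2(k_1,k_2) = E_\rho\Big[ \frac{1}{m}\sum_{i<j}^m \rho_i\rho_j h(X_i,X_j) \Big]^2 = \frac{1}{m^2} E_\rho \sum_{i<j,\,r<s}^m \rho_i\rho_j\rho_r\rho_s\, h(X_i,X_j)h(X_r,X_s) = \frac{1}{m^2}\sum_{i<j}^m h^2(X_i,X_j),$$
where $h(X_i,X_j) = k_1(X_i,X_j) - k_2(X_i,X_j)$.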
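The last equality in the expression for $D^2(k_1,k_2)$ holds because $E_\rho[\rho_i\rho_j\rho_r\rho_s] = 1$ when $(i,j) = (r,s)$ and $0$ for all other index pairs with $i<j$, $r<s$. The sketch below checks this by exact enumeration over all $2^m$ sign vectors for a small $m$. The two Gaussian kernels, their bandwidths and the sample points are arbitrary choices made for illustration.

```python
import itertools
import math

def gaussian(x, y, sigma):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

xs = [-1.3, -0.4, 0.2, 0.9, 1.7]  # arbitrary sample points (m = 5)
m = len(xs)
pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]

# h = k_1 - k_2 evaluated on all pairs i < j (bandwidths 0.5 and 2.0 are assumptions).
h = {(i, j): gaussian(xs[i], xs[j], 0.5) - gaussian(xs[i], xs[j], 2.0)
     for i, j in pairs}

# Left-hand side: exact expectation over all 2^m Rademacher sign vectors.
signs = list(itertools.product((-1, 1), repeat=m))
lhs = sum((sum(r[i] * r[j] * h[i, j] for i, j in pairs) / m) ** 2
          for r in signs) / len(signs)

# Right-hand side: closed form (1 / m^2) * sum_{i<j} h^2(X_i, X_j).
rhs = sum(v * v for v in h.values()) / m ** 2
print(lhs, rhs)
```

The two printed values agree up to rounding, so $D$ can be computed from the pairwise kernel differences alone, without sampling Rademacher variables.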
A.7 Proof of Corollary 10
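The result follows by bounding the uniform covering number of the VC-subgraph class, $\mathcal{K}$. By [21,
Theorem 2.6], we have
$$N(\mathcal{K},D,\epsilon) \le C_1 V(\mathcal{K})\,(16\nu^2\epsilon^{-2}e)^{V(\mathcal{K})}, \tag{26}$$
where $V(\mathcal{K})$ is the VC-index of $\mathcal{K}$. Therefore,
$$U_m(\mathcal{K};\{X_i\}) \le C\int_0^\nu \log N(\mathcal{K},D,\epsilon)\,d\epsilon \le 4V(\mathcal{K})C\int_0^\nu \log\Big(\frac{\sqrt{\nu}}{\sqrt{\epsilon}}\Big)\,d\epsilon + \nu V(\mathcal{K})C\log(16e) + C\nu\log(C_1 V(\mathcal{K})). \tag{27}$$
Note that $\int_0^\nu \log(\sqrt{\nu}/\sqrt{\epsilon})\,d\epsilon = \nu/2 \le 2\nu$. Using this in (27) and rearranging the terms provides the result.
References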
[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University
Press, UK, 1999.
[2] M. A. Arcones and E. Giné. Limit theorems for U-processes. Annals of Probability, 21:1494–1542, 1993.
[3] V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer-Verlag, NY,
1999.
[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag,
New York, 1996.
[5] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
[6] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J.C.
Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems
20, pages 489–496, Cambridge, MA, 2008. MIT Press.
[7] K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups
and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural
Information Processing Systems 21, pages 473–480, 2009.
[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample
problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing
Systems 19, pages 513–520. MIT Press, 2007.
[9] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of
independence. In Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press,
2008.
[10] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix
with semidefinite programming. Journal of Machine Learning Research, 5:24–72, 2004.
[11] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[12] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[13] B. Schölkopf, B. K. Sriperumbudur, A. Gretton, and K. Fukumizu. RKHS representation of measures. In
Learning Theory and Approximation Workshop, Oberwolfach, Germany, 2008.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press,
UK, 2004.
[15] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In
Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag,
Berlin, Germany, 2007.
[16] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In
G. Lugosi and H. U. Simon, editors, Proc. of the 19th Annual Conference on Learning Theory, pages
169–183, 2006.
[17] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert
space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual
Conference on Learning Theory, pages 111–122, 2008.
[18] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of
Machine Learning Research, 2:67–93, 2002.
[19] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[20] J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain Journal
of Mathematics, 6(3):409–433, 1976.
[21] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag,
New York, 1996.
[22] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In Proc. of the 22nd Annual
Conference on Learning Theory, 2009.
[23] Y. Ying and D. X. Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning
Research, 8:249–276, 2007.