
A Kernel Statistical Test of Independence

Arthur Gretton

MPI for Biological Cybernetics

Tübingen, Germany

arthur@tuebingen.mpg.de

Kenji Fukumizu

Inst. of Statistical Mathematics

Tokyo, Japan

fukumizu@ism.ac.jp

Choon Hui Teo

NICTA, ANU

Canberra, Australia

choonhui.teo@gmail.com

Le Song

NICTA, ANU

and University of Sydney

lesong@it.usyd.edu.au

Bernhard Schölkopf

MPI for Biological Cybernetics

Tübingen, Germany

bs@tuebingen.mpg.de

Alexander J. Smola

NICTA, ANU

Canberra, Australia

alex.smola@gmail.com

Abstract

Although kernel measures of independence have been widely applied in machine

learning (notably in kernel ICA), there is as yet no method to determine whether

they have detected statistically significant dependence. We provide a novel test of

the independence hypothesis for one particular kernel independence measure, the

Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m²), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage

is greater for multivariate data. Finally, we show the HSIC test also applies to

text (and to structured data more generally), for which no other independence test

presently exists.

1 Introduction

Kernel independence measures have been widely applied in recent machine learning literature, most commonly in independent component analysis (ICA) [2, 11], but also in fitting graphical models [1]

and in feature selection [22]. One reason for their success is that these criteria have a zero expected

value if and only if the associated random variables are independent, when the kernels are universal

(in the sense of [23]). There is presently no way to tell whether the empirical estimates of these

dependence measures indicate a statistically significant dependence, however. In other words, we

are interested in the threshold an empirical kernel dependence estimate must exceed, before we can

dismiss with high probability the hypothesis that the underlying variables are independent.

Statistical tests of independence have been associated with a broad variety of dependence measures. Classical tests such as Spearman's ρ and Kendall's τ are widely applied; however, they are not guaranteed to detect all modes of dependence between the random variables. Contingency table-based methods, and in particular the power-divergence family of test statistics [17], are the best known general-purpose tests of independence, but are limited to relatively low dimensions, since they require a partitioning of the space in which each random variable resides. Characteristic function-based tests [6, 13] have also been proposed, which are more general than kernel density-based tests [19], although to our knowledge they have been used only to compare univariate random variables.

In this paper we present three main results: first, and most importantly, we show how to test whether statistically significant dependence is detected by a particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC, from [9]). That is, we provide a fast (O(m²) for sample size m) and accurate means of obtaining a threshold which HSIC will only exceed with small probability, when the underlying variables are independent. Second, we show the distribution


of our empirical test statistic in the large sample limit can be straightforwardly parameterised in

terms of kernels on the data. Third, we apply our test to structured data (in this case, by establishing

the statistical dependence between a text and its translation). To our knowledge, ours is the first

independence test for structured data.

We begin our presentation in Section 2, with a short overview of cross-covariance operators be-

tween RKHSs and their Hilbert-Schmidt norms: the latter are used to define the Hilbert-Schmidt

Independence Criterion (HSIC). In Section 3, we describe how to determine whether the depen-

dence returned via HSIC is statistically significant, by proposing a hypothesis test with HSIC as its

statistic. In particular, we show that this test can be parameterised using a combination of covariance

operator norms and norms of mean elements of the random variables in feature space. Finally, in

Section 4, we give our experimental results, both for testing dependence between random vectors

(which could be used for instance to verify convergence in independent subspace analysis [25]),

and for testing dependence between text and its translation. Software to implement the test may be

downloaded from http://www.kyb.mpg.de/bs/people/arthur/indep.htm

2 Definitions and description of HSIC

Our problem setting is as follows:

Problem 1 Let P_{xy} be a Borel probability measure defined on a domain X × Y, and let P_x and P_y be the respective marginal distributions on X and Y. Given an i.i.d. sample Z := (X, Y) = {(x_1, y_1), ..., (x_m, y_m)} of size m drawn independently and identically distributed according to P_{xy}, does P_{xy} factorise as P_x P_y (equivalently, we may write x ⊥⊥ y)?

We begin with a description of our kernel dependence criterion, leaving to the following section the

question of whether this dependence is significant. This presentation is largely a review of material

from [9, 11, 22], the main difference being that we establish links to the characteristic function-based independence criteria in [6, 13]. Let F be an RKHS, with the continuous feature mapping φ(x) ∈ F from each x ∈ X, such that the inner product between the features is given by the kernel function k(x, x′) := ⟨φ(x), φ(x′)⟩. Likewise, let G be a second RKHS on Y with kernel l(·, ·) and feature

map ψ(y). Following [7], the cross-covariance operator Cxy : G → F is defined such that for all

f ∈ F and g ∈ G,

⟨f, C_{xy} g⟩_F = E_{xy}([f(x) − E_x(f(x))][g(y) − E_y(g(y))]).

The cross-covariance operator itself can then be written

C_{xy} := E_{xy}[(φ(x) − µ_x) ⊗ (ψ(y) − µ_y)],   (1)

where µ_x := E_x φ(x), µ_y := E_y ψ(y), and ⊗ is the tensor product [9, Eq. 6]: this is a generalisation of the cross-covariance matrix between random vectors. When F and G are universal reproducing kernel Hilbert spaces (that is, dense in the space of bounded continuous functions [23]) on the compact domains X and Y, the largest singular value of this operator, ‖C_{xy}‖, is zero if and only if x ⊥⊥ y [11, Theorem 6]: the operator therefore induces an independence criterion, and can be used to solve Problem 1. The maximum singular value gives a criterion similar to that originally proposed in [18], but with more restrictive function classes (rather than functions of bounded variance). Rather than the maximum singular value, we may use the squared Hilbert-Schmidt norm (the sum of the squared singular values), which has the population expression

HSIC(P_{xy}, F, G) = E_{xx′yy′}[k(x, x′) l(y, y′)] + E_{xx′}[k(x, x′)] E_{yy′}[l(y, y′)] − 2 E_{xy}[E_{x′}[k(x, x′)] E_{y′}[l(y, y′)]]   (2)

(assuming the expectations exist), where x′ denotes an independent copy of x [9, Lemma 1]: we

call this the Hilbert-Schmidt independence criterion (HSIC).

We now address the problem of estimating HSIC(Pxy,F,G) on the basis of the sample Z. An

unbiased estimator of (2) is a sum of three U-statistics [21, 22],

\mathrm{HSIC}(Z) = \frac{1}{(m)_2} \sum_{(i,j) \in i^m_2} k_{ij} l_{ij} + \frac{1}{(m)_4} \sum_{(i,j,q,r) \in i^m_4} k_{ij} l_{qr} - \frac{2}{(m)_3} \sum_{(i,j,q) \in i^m_3} k_{ij} l_{iq},   (3)


where (m)_n := m!/(m − n)!, the index set i^m_r denotes the set of all r-tuples drawn without replacement from the set {1, ..., m}, k_{ij} := k(x_i, x_j), and l_{ij} := l(y_i, y_j). For the purpose of testing independence, however, we will find it easier to use an alternative, biased empirical estimate [9, Definition 2], obtained by replacing the U-statistics with V-statistics¹

\mathrm{HSIC}_b(Z) = \frac{1}{m^2} \sum_{i,j}^m k_{ij} l_{ij} + \frac{1}{m^4} \sum_{i,j,q,r}^m k_{ij} l_{qr} - \frac{2}{m^3} \sum_{i,j,q}^m k_{ij} l_{iq} = \frac{1}{m^2} \mathrm{trace}(KHLH),   (4)

where the summation indices now denote all r-tuples drawn with replacement from {1, ..., m} (r being the number of indices below the sum), K is the m × m matrix with entries k_{ij}, H = I − (1/m) 1 1⊤, and 1 is an m × 1 vector of ones (the cost of computing this statistic is O(m²)). When a Gaussian kernel k_{ij} := exp(−σ^{−2} ‖x_i − x_j‖²) is used (or a kernel deriving from [6, Eq. 4.10]), the latter statistic is equivalent to the characteristic function-based statistic [6, Eq. 4.11] and the T_{2n} statistic of [13, p. 54]: details are reproduced in [10] for comparison. Our setting allows for more general kernels, however, such as kernels on strings (as in our experiments in Section 4) and graphs (see [20] for further details of kernels on structures): this is not possible under the characteristic function framework, which is restricted to Euclidean spaces (R^d in the case of [6, 13]). As pointed out in [6, Section 5], the statistic in (4) can also be linked to the original quadratic test of Rosenblatt [19] given an appropriate kernel choice; the main differences being that characteristic function-based tests (and RKHS-based tests) are not restricted to using kernel densities, nor should they reduce their kernel width with increasing sample size. Another related test described in [4] is based on the functional canonical correlation between F and G, rather than the covariance: in this sense the test statistic resembles those in [2]. The approach in [4] differs from both the present work and [2], however, in that the function spaces F and G are represented by finite sets of basis functions (specifically B-spline kernels) when computing the empirical test statistic.
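As a concrete illustration of (4), the biased statistic can be computed from the two Gram matrices alone. The sketch below (our own naming, assuming NumPy and a Gaussian kernel; it is not the released software) centres K once, so that trace(KHLH) is obtained in O(m²) without forming H explicitly:

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix with entries k_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / sigma**2)

def hsic_b(K, L):
    """Biased V-statistic HSIC_b(Z) = trace(KHLH) / m^2, H = I - (1/m) 1 1^T.

    Centring K in place (Kc = HKH) costs O(m^2); for symmetric L,
    trace(KHLH) = sum(Kc * L), so no O(m^3) matrix product is needed.
    """
    m = K.shape[0]
    Kc = K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()
    return float(np.sum(Kc * L)) / m**2
```

For small m one can check directly that the trace form agrees with the three V-statistic sums in (4).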

3 Test description

We now describe a statistical test of independence for two random variables, based on the test

statistic HSICb(Z). We begin with a more formal introduction to the framework and terminology

of statistical hypothesis testing. Given the i.i.d. sample Z defined earlier, the statistical test T(Z) : (X × Y)^m → {0, 1} is used to distinguish between the null hypothesis H_0 : P_{xy} = P_x P_y and the alternative hypothesis H_1 : P_{xy} ≠ P_x P_y. This is achieved by comparing the test statistic, in our case HSIC_b(Z), with a particular threshold: if the threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population HSIC indicates P_{xy} = P_x P_y). The acceptance region of the test is thus defined as any real number below the threshold. Since the test is based on a finite sample, it is possible that an incorrect answer will be returned: the Type I error is defined as the probability of rejecting H_0 based on the observed sample, despite x and y being independent. Conversely, the Type II error is the probability of accepting P_{xy} = P_x P_y when the underlying variables are dependent. The level α of a test is an upper bound on the Type I error, and is a design parameter of the test, used to set the test threshold. A consistent test achieves a level α, and a Type II error of zero, in the large sample limit.

How, then, do we set the threshold of the test given α? The approach we adopt here is to derive the asymptotic distribution of the empirical estimate HSIC_b(Z) of HSIC(P_{xy}, F, G) under H_0. We then use the 1 − α quantile of this distribution as the test threshold.² Our presentation in this section is therefore divided into two parts. First, we obtain the distribution of HSIC_b(Z) under both H_0 and H_1; the latter distribution is also needed to ensure consistency of the test. We shall see, however, that the null distribution has a complex form, and cannot be evaluated directly. Thus, in the second part of this section, we describe ways to accurately approximate the 1 − α quantile of this distribution.

Asymptotic distribution of HSIC_b(Z) We now describe the distribution of the test statistic in (4). The first theorem holds under H_1.

¹The U- and V-statistics differ in that the latter allow indices of different sums to be equal.
²An alternative would be to use a large deviation bound, as provided for instance by [9] based on Hoeffding's inequality. It has been reported in [8], however, that such bounds are generally too loose for hypothesis testing.


Theorem 1 Let

h_{ijqr} = \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv},   (5)

where the sum represents all ordered quadruples (t, u, v, w) drawn without replacement from (i, j, q, r), and assume E[h²] < ∞. Under H_1, HSIC_b(Z) converges in distribution as m → ∞ to a Gaussian according to

m^{1/2} (\mathrm{HSIC}_b(Z) - \mathrm{HSIC}(P_{xy}, F, G)) \xrightarrow{D} N(0, \sigma_u^2).   (6)

The variance is \sigma_u^2 = 16 \left( E_i \left[ E_{j,q,r} h_{ijqr} \right]^2 - \mathrm{HSIC}(P_{xy}, F, G)^2 \right), where E_{j,q,r} := E_{z_j, z_q, z_r}.

Proof We first rewrite (4) as a single V-statistic,

\mathrm{HSIC}_b(Z) = \frac{1}{m^4} \sum_{i,j,q,r}^m h_{ijqr},   (7)

where we note that h_{ijqr} defined in (5) does not change with permutation of its indices. The associated U-statistic HSIC_s(Z) converges in distribution as (6) with variance σ_u² [21, Theorem 5.5.1(A)]: see [22]. Since the difference between HSIC_b(Z) and HSIC_s(Z) drops as 1/m (see [9], or Theorem 3 below), HSIC_b(Z) converges asymptotically to the same distribution.

The second theorem applies under H_0.

Theorem 2 Under H_0, the U-statistic HSIC_s(Z) corresponding to the V-statistic in (7) is degenerate, meaning E_i h_{ijqr} = 0. In this case, HSIC_b(Z) converges in distribution according to [21, Section 5.5.2]

m \, \mathrm{HSIC}_b(Z) \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l z_l^2,   (8)

where z_l ∼ N(0, 1) i.i.d., and λ_l are the solutions to the eigenvalue problem

\lambda_l \psi_l(z_j) = \int h_{ijqr} \psi_l(z_i) \, dF_{i,q,r},

where the integral is over the distribution of variables z_i, z_q, and z_r.

Proof This follows from the discussion of [21, Section 5.5.2], making appropriate allowance for the fact that we are dealing with a V-statistic (which is why the terms in (8) are not centred: in the case of a U-statistic, the sum would be over terms λ_l(z_l² − 1)).

A hypothesis test using HSIC_b(Z) could be derived from Theorem 2 above by computing the (1 − α)th quantile of the distribution (8), where consistency of the test (that is, the convergence to zero of the Type II error for m → ∞) is guaranteed by the decay as m^{−1} of the variance of HSIC_b(Z) under H_1. The distribution under H_0 is complex, however: the question then becomes how to accurately approximate its quantiles.

Approximating the 1 − α quantile of the null distribution

One approach, taken by [6], is to use a Monte Carlo resampling technique: the ordering of the Y sample is permuted repeatedly while that of X is kept fixed, and the 1 − α quantile is obtained from the resulting distribution of HSIC_b values. This can be very expensive, however. A second approach, suggested in [13, p. 34], is to approximate the null distribution as a two-parameter Gamma distribution [12, p. 343, p. 359]: this is one of the more straightforward approximations of an infinite sum of χ² variables (see [12, Chapter 18.8] for further ways to approximate such distributions; in particular, we wish to avoid using moments of order greater than two, since these can become expensive to compute). Specifically, we make the approximation

m \, \mathrm{HSIC}_b(Z) \sim \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}, \quad \text{where} \quad \alpha = \frac{(E(\mathrm{HSIC}_b(Z)))^2}{\mathrm{var}(\mathrm{HSIC}_b(Z))}, \quad \beta = \frac{m \, \mathrm{var}(\mathrm{HSIC}_b(Z))}{E(\mathrm{HSIC}_b(Z))}.   (9)
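The Monte Carlo resampling approach above can be sketched as follows; this is a simple illustration rather than the paper's released software, and the helper names and permutation count are our own choices:

```python
import numpy as np

def hsic_b(K, L):
    # Biased statistic of (4): trace(KHLH) / m^2, via O(m^2) centring of K.
    m = K.shape[0]
    Kc = K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()
    return float(np.sum(Kc * L)) / m**2

def permutation_threshold(K, L, level=0.05, n_perm=500, seed=0):
    # Permute the ordering of the Y sample (rows and columns of L together)
    # while X is kept fixed; the test threshold is the empirical 1 - level
    # quantile of the resulting null HSIC_b values.
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        p = rng.permutation(m)
        null_stats[b] = hsic_b(K, L[np.ix_(p, p)])
    return float(np.quantile(null_stats, 1.0 - level))
```

The cost is n_perm times that of the statistic itself, which is exactly why the Gamma approximation of (9) is attractive.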


Figure 1: mHSIC_b cumulative distribution function (Emp) under H_0 for m = 200, obtained empirically using 5000 independent draws of mHSIC_b. The two-parameter Gamma distribution (Gamma) is fit using α = 1.17 and β = 8.3 × 10^{−4} in (9), with mean and variance computed via Theorems 3 and 4.


An illustration of the cumulative distribution function (CDF) obtained via the Gamma approximation is given in Figure 1, along with an empirical CDF obtained by repeated draws of HSIC_b. We note the Gamma approximation is quite accurate, especially in areas of high probability (which we use to compute the test quantile). The accuracy of this approximation will be further evaluated experimentally in Section 4.

To obtain the Gamma distribution from our observations, we need empirical estimates for E(HSIC_b(Z)) and var(HSIC_b(Z)) under the null hypothesis. Expressions for these quantities are given in [13, pp. 26–27]; however, these are in terms of the joint and marginal characteristic functions, and not in our more general kernel setting (see also [14, p. 313]). In the following two theorems, we provide much simpler expressions for both quantities, in terms of norms of the mean elements µ_x and µ_y, and the covariance operators

C_{xx} := E_x[(φ(x) − µ_x) ⊗ (φ(x) − µ_x)]

and C_{yy}, in feature space. The main advantage of our new expressions is that they are computed entirely in terms of kernels, which makes possible the application of the test to any domain on which kernels can be defined, and not only R^d.

Theorem 3 Under H_0,

E(\mathrm{HSIC}_b(Z)) = \frac{1}{m} \mathrm{Tr}\,C_{xx} \, \mathrm{Tr}\,C_{yy} = \frac{1}{m} \left( 1 + \|\mu_x\|^2 \|\mu_y\|^2 - \|\mu_x\|^2 - \|\mu_y\|^2 \right),   (10)

where the second equality assumes k_{ii} = l_{ii} = 1. An empirical estimate of this statistic is obtained by replacing the norms above with \widehat{\|\mu_x\|^2} = (m)_2^{-1} \sum_{(i,j) \in i^m_2} k_{ij}, bearing in mind that this results in a (generally negligible) bias of O(m^{−1}) in the estimate of \|\mu_x\|^2 \|\mu_y\|^2.

Theorem 4 Under H_0,

\mathrm{var}(\mathrm{HSIC}_b(Z)) = \frac{2(m-4)(m-5)}{(m)_4} \|C_{xx}\|_{HS}^2 \|C_{yy}\|_{HS}^2 + O(m^{-3}).

Denoting by ⊙ the entrywise matrix product, A^{·2} the entrywise matrix power, and B = ((HKH) ⊙ (HLH))^{·2}, an empirical estimate with negligible bias may be found by replacing the product of covariance operator norms with 1⊤(B − diag(B))1: this is slightly more efficient than taking the product of the empirical operator norms (although the scaling with m is unchanged).

Proofs of both theorems may be found in [10], where we also compare with the original characteristic function-based expressions in [13]. We remark that these parameters, like the original test statistic in (4), may be computed in O(m²).

4 Experiments

General tests of statistical independence are most useful for data having complex interactions that simple correlation does not detect. We investigate two cases where this situation arises: first, we test vectors in R^d which have a dependence relation but no correlation, as occurs in independent subspace analysis; and second, we study the statistical dependence between a text and its translation.

Independence of subspaces One area where independence tests have been applied is in determining the convergence of algorithms for independent component analysis (ICA), which involves separating random variables that have been linearly mixed, using only their mutual independence. ICA generally entails optimisation over a non-convex function (including when HSIC is itself the optimisation criterion [9]), and is susceptible to local minima, hence the need for these tests (in fact, for classical approaches to ICA, the global minimum of the optimisation might not correspond to independence for certain source distributions). Contingency table-based tests have been applied [15]