
A Kernel Statistical Test of Independence

Arthur Gretton

MPI for Biological Cybernetics

Tübingen, Germany

arthur@tuebingen.mpg.de

Kenji Fukumizu

Inst. of Statistical Mathematics

Tokyo Japan

fukumizu@ism.ac.jp

Choon Hui Teo

NICTA, ANU

Canberra, Australia

choonhui.teo@gmail.com

Le Song

NICTA, ANU

and University of Sydney

lesong@it.usyd.edu.au

Bernhard Schölkopf

MPI for Biological Cybernetics

Tübingen, Germany

bs@tuebingen.mpg.de

Alexander J. Smola

NICTA, ANU

Canberra, Australia

alex.smola@gmail.com

Abstract

Although kernel measures of independence have been widely applied in machine

learning (notably in kernel ICA), there is as yet no method to determine whether

they have detected statistically significant dependence. We provide a novel test of

the independence hypothesis for one particular kernel independence measure, the

Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m^2),

where m is the sample size. We demonstrate that this test outperforms established

contingency table and functional correlation-based tests, and that this advantage

is greater for multivariate data. Finally, we show the HSIC test also applies to

text (and to structured data more generally), for which no other independence test

presently exists.

1 Introduction

Kernel independence measures have been widely applied in recent machine learning literature, most

commonly in independent component analysis (ICA) [2, 11], but also in fitting graphical models [1]

and in feature selection [22]. One reason for their success is that these criteria have a zero expected

value if and only if the associated random variables are independent, when the kernels are universal

(in the sense of [23]). There is presently no way to tell whether the empirical estimates of these

dependence measures indicate a statistically significant dependence, however. In other words, we

are interested in the threshold an empirical kernel dependence estimate must exceed, before we can

dismiss with high probability the hypothesis that the underlying variables are independent.

Statistical tests of independence have been associated with a broad variety of dependence measures.

Classical tests such as Spearman's ρ and Kendall's τ are widely applied; however, they are not

guaranteed to detect all modes of dependence between the random variables. Contingency table-

based methods, and in particular the power-divergence family of test statistics [17], are the best

known general purpose tests of independence, but are limited to relatively low dimensions, since they

require a partitioning of the space in which each random variable resides. Characteristic function-

based tests [6, 13] have also been proposed, which are more general than kernel density-based tests

[19], although to our knowledge they have been used only to compare univariate random variables.

In this paper we present three main results: first, and most importantly, we show how to test whether

statistically significant dependence is detected by a particular kernel independence measure, the

Hilbert-Schmidt independence criterion (HSIC, from [9]). That is, we provide a fast (O(m^2) for

sample size m) and accurate means of obtaining a threshold which HSIC will only exceed with

small probability, when the underlying variables are independent. Second, we show the distribution


of our empirical test statistic in the large sample limit can be straightforwardly parameterised in

terms of kernels on the data. Third, we apply our test to structured data (in this case, by establishing

the statistical dependence between a text and its translation). To our knowledge, ours is the first

independence test for structured data.

We begin our presentation in Section 2, with a short overview of cross-covariance operators be-

tween RKHSs and their Hilbert-Schmidt norms: the latter are used to define the Hilbert Schmidt

Independence Criterion (HSIC). In Section 3, we describe how to determine whether the depen-

dence returned via HSIC is statistically significant, by proposing a hypothesis test with HSIC as its

statistic. In particular, we show that this test can be parameterised using a combination of covariance

operator norms and norms of mean elements of the random variables in feature space. Finally, in

Section 4, we give our experimental results, both for testing dependence between random vectors

(which could be used for instance to verify convergence in independent subspace analysis [25]),

and for testing dependence between text and its translation. Software to implement the test may be

downloaded from http://www.kyb.mpg.de/bs/people/arthur/indep.htm

2 Definitions and description of HSIC

Our problem setting is as follows:

Problem 1 Let Pxy be a Borel probability measure defined on a domain X × Y, and let Px and Py be the respective marginal distributions on X and Y. Given a sample Z := (X, Y) = {(x1, y1), ..., (xm, ym)} of size m drawn independently and identically distributed (i.i.d.) according to Pxy, does Pxy factorise as PxPy (equivalently, we may write x ⊥⊥ y)?

We begin with a description of our kernel dependence criterion, leaving to the following section the

question of whether this dependence is significant. This presentation is largely a review of material

from [9, 11, 22], the main difference being that we establish links to the characteristic function-based independence criteria in [6, 13]. Let F be an RKHS, with the continuous feature mapping φ(x) ∈ F from each x ∈ X, such that the inner product between the features is given by the kernel function k(x, x′) := ⟨φ(x), φ(x′)⟩. Likewise, let G be a second RKHS on Y with kernel l(·,·) and feature map ψ(y). Following [7], the cross-covariance operator Cxy : G → F is defined such that for all f ∈ F and g ∈ G,

$$\langle f, C_{xy} g \rangle_{\mathcal{F}} = \mathbf{E}_{xy}\big( [f(x) - \mathbf{E}_x(f(x))] \, [g(y) - \mathbf{E}_y(g(y))] \big).$$

The cross-covariance operator itself can then be written

$$C_{xy} := \mathbf{E}_{xy}\big[ (\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y) \big], \qquad (1)$$

where µx := Ex φ(x), µy := Ey ψ(y), and ⊗ is the tensor product [9, Eq. 6]: this is a generalisation of the cross-covariance matrix between random vectors. When F and G are universal reproducing kernel Hilbert spaces (that is, dense in the space of bounded continuous functions [23]) on the compact domains X and Y, then the largest singular value of this operator, ‖Cxy‖, is zero if and only if x ⊥⊥ y [11, Theorem 6]: the operator therefore induces an independence criterion, and can be used to solve Problem 1. The maximum singular value gives a criterion similar to that originally proposed in [18], but with more restrictive function classes (rather than functions of bounded variance). Rather than the maximum singular value, we may use the squared Hilbert-Schmidt norm (the sum of the squared singular values), which has a population expression

$$\mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G}) = \mathbf{E}_{xx'yy'}[k(x,x')\,l(y,y')] + \mathbf{E}_{xx'}[k(x,x')]\,\mathbf{E}_{yy'}[l(y,y')] - 2\,\mathbf{E}_{xy}\big[ \mathbf{E}_{x'}[k(x,x')]\,\mathbf{E}_{y'}[l(y,y')] \big] \qquad (2)$$

(assuming the expectations exist), where x′ denotes an independent copy of x [9, Lemma 1]: we call this the Hilbert-Schmidt independence criterion (HSIC).

We now address the problem of estimating HSIC(Pxy,F,G) on the basis of the sample Z. An

unbiased estimator of (2) is a sum of three U-statistics [21, 22],

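$$\mathrm{HSIC}(Z) = \frac{1}{(m)_2} \sum_{(i,j) \in \mathbf{i}_2^m} k_{ij} l_{ij} + \frac{1}{(m)_4} \sum_{(i,j,q,r) \in \mathbf{i}_4^m} k_{ij} l_{qr} - 2\,\frac{1}{(m)_3} \sum_{(i,j,q) \in \mathbf{i}_3^m} k_{ij} l_{iq}, \qquad (3)$$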


where $(m)_n := \frac{m!}{(m-n)!}$, the index set $\mathbf{i}_r^m$ denotes the set of all r-tuples drawn without replacement from the set {1,...,m}, $k_{ij} := k(x_i, x_j)$, and $l_{ij} := l(y_i, y_j)$. For the purpose of testing independence, however, we will find it easier to use an alternative, biased empirical estimate [9, Definition 2], obtained by replacing the U-statistics with V-statistics¹

$$\mathrm{HSIC}_b(Z) = \frac{1}{m^2} \sum_{i,j}^{m} k_{ij} l_{ij} + \frac{1}{m^4} \sum_{i,j,q,r}^{m} k_{ij} l_{qr} - 2\,\frac{1}{m^3} \sum_{i,j,q}^{m} k_{ij} l_{iq} = \frac{1}{m^2}\,\mathrm{trace}(KHLH), \qquad (4)$$

where the summation indices now denote all r-tuples drawn with replacement from {1,...,m} (r being the number of indices below the sum), K is the m × m matrix with entries $k_{ij}$ (and L likewise with entries $l_{ij}$), $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$, and $\mathbf{1}$ is an m × 1 vector of ones (the cost of computing this statistic is O(m^2)). When a Gaussian kernel $k_{ij} := \exp\left(-\sigma^{-2}\|x_i - x_j\|^2\right)$ is used (or a kernel deriving from [6, Eq. 4.10]), the latter statistic is equivalent to the characteristic function-based statistic [6, Eq. 4.11] and the $T_n^2$ statistic of [13, p. 54]: details are reproduced in [10] for comparison. Our setting allows for more general kernels, however, such as kernels on strings (as in our experiments in Section 4) and graphs (see [20] for further details of kernels on structures): this is not possible under the characteristic function framework, which is restricted to Euclidean spaces ($\mathbb{R}^d$ in the case of [6, 13]). As pointed out in [6, Section 5], the statistic in (4) can also be linked to the original quadratic test of Rosenblatt [19] given an appropriate kernel choice; the main differences are that characteristic function-based tests (and RKHS-based tests) are not restricted to using kernel densities, nor should they reduce their kernel width with increasing sample size. Another related test described in [4] is based on the functional canonical correlation between F and G, rather than the covariance: in this sense the test statistic resembles those in [2]. The approach in [4] differs from both the present work and [2], however, in that the function spaces F and G are represented by finite sets of basis functions (specifically B-spline kernels) when computing the empirical test statistic.
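As an illustration of (4), the following is a minimal NumPy sketch, not the authors' released software. The function names `gaussian_gram`, `hsic_b` and `hsic_b_sums` are hypothetical, and the kernel widths in the usage are arbitrary. It computes the biased statistic once through the trace expression and once through the three V-statistic sums, so that the two forms can be compared numerically.

```python
import numpy as np

def gaussian_gram(x, sigma):
    """Gram matrix with entries exp(-||x_i - x_j||^2 / sigma^2)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / sigma**2)

def hsic_b(K, L):
    """Biased estimate (1/m^2) trace(KHLH) with H = I - (1/m) 1 1^T."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m**2

def hsic_b_sums(K, L):
    """Same statistic written as the three V-statistic sums in (4)."""
    m = K.shape[0]
    term1 = np.sum(K * L) / m**2                                  # sum_ij k_ij l_ij
    term2 = K.sum() * L.sum() / m**4                              # sum_ijqr k_ij l_qr
    term3 = 2.0 * np.sum(K.sum(axis=1) * L.sum(axis=1)) / m**3    # sum_ijq k_ij l_iq
    return term1 + term2 - term3
```

The sum form only needs elementwise products and row sums of the two Gram matrices, which is one way to stay within the O(m^2) cost quoted above; the explicit matrix products in `hsic_b` are written for readability rather than speed.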

3 Test description

We now describe a statistical test of independence for two random variables, based on the test

statistic HSICb(Z). We begin with a more formal introduction to the framework and terminology

of statistical hypothesis testing. Given the i.i.d. sample Z defined earlier, the statistical test, T(Z) :

(X × Y)^m → {0,1} is used to distinguish between the null hypothesis H0 : Pxy = PxPy and

the alternative hypothesis H1 : Pxy ≠ PxPy. This is achieved by comparing the test statistic, in

our case HSICb(Z), with a particular threshold: if the threshold is exceeded, then the test rejects

the null hypothesis (bearing in mind that a zero population HSIC indicates Pxy = PxPy). The

acceptance region of the test is thus defined as any real number below the threshold. Since the test

is based on a finite sample, it is possible that an incorrect answer will be returned: the Type I error

is defined as the probability of rejecting H0 based on the observed sample, despite x and y being

independent. Conversely, the Type II error is the probability of accepting Pxy = PxPy when the

underlying variables are dependent. The level α of a test is an upper bound on the Type I error, and

is a design parameter of the test, used to set the test threshold. A consistent test achieves a level α,

and a Type II error of zero, in the large sample limit.

How, then, do we set the threshold of the test given α? The approach we adopt here is to derive

the asymptotic distribution of the empirical estimate HSICb(Z) of HSIC(Pxy,F,G) under H0. We

then use the 1 − α quantile of this distribution as the test threshold.² Our presentation in this section

is therefore divided into two parts. First, we obtain the distribution of HSICb(Z) under both H0 and

H1; the latter distribution is also needed to ensure consistency of the test. We shall see, however, that

the null distribution has a complex form, and cannot be evaluated directly. Thus, in the second part

of this section, we describe ways to accurately approximate the 1 − α quantile of this distribution.

¹ The U- and V-statistics differ in that the latter allow indices of different sums to be equal.

² An alternative would be to use a large deviation bound, as provided for instance by [9] based on Hoeffding's inequality. It has been reported in [8], however, that such bounds are generally too loose for hypothesis testing.

Asymptotic distribution of HSICb(Z)

We now describe the distribution of the test statistic in (4). The first theorem holds under H1.


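Theorem 1 Let

$$h_{ijqr} = \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv}, \qquad (5)$$

where the sum represents all ordered quadruples (t,u,v,w) drawn without replacement from (i,j,q,r), and assume $\mathbf{E}(h^2) < \infty$. Under H1, HSICb(Z) converges in distribution as m → ∞ to a Gaussian according to

$$m^{1/2} \big( \mathrm{HSIC}_b(Z) - \mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G}) \big) \xrightarrow{D} \mathcal{N}(0, \sigma_u^2). \qquad (6)$$

The variance is $\sigma_u^2 = 16 \Big( \mathbf{E}_i \big( \mathbf{E}_{j,q,r} h_{ijqr} \big)^2 - \mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G}) \Big)$, where $\mathbf{E}_{j,q,r} := \mathbf{E}_{z_j, z_q, z_r}$.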

Proof We first rewrite (4) as a single V-statistic,

$$\mathrm{HSIC}_b(Z) = \frac{1}{m^4} \sum_{i,j,q,r}^{m} h_{ijqr}, \qquad (7)$$

where we note that $h_{ijqr}$ defined in (5) does not change with permutation of its indices. The associated U-statistic HSICs(Z) converges in distribution as (6) with variance $\sigma_u^2$ [21, Theorem 5.5.1(A)]: see [22]. Since the difference between HSICb(Z) and HSICs(Z) drops as 1/m (see [9], or Theorem 3 below), HSICb(Z) converges asymptotically to the same distribution.

The second theorem applies under H0.

Theorem 2 Under H0, the U-statistic HSICs(Z) corresponding to the V-statistic in (7) is degenerate, meaning $\mathbf{E}_i h_{ijqr} = 0$. In this case, HSICb(Z) converges in distribution according to [21, Section 5.5.2]

$$m\,\mathrm{HSIC}_b(Z) \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l z_l^2, \qquad (8)$$

where $z_l \sim \mathcal{N}(0,1)$ i.i.d., and $\lambda_l$ are the solutions to the eigenvalue problem

$$\lambda_l \psi_l(z_j) = \int h_{ijqr}\, \psi_l(z_i)\, dF_{i,q,r},$$

where the integral is over the distribution of variables $z_i$, $z_q$, and $z_r$.

Proof This follows from the discussion of [21, Section 5.5.2], making appropriate allowance for the fact that we are dealing with a V-statistic (which is why the terms in (8) are not centred: in the case of a U-statistic, the sum would be over terms $\lambda_l(z_l^2 - 1)$).

Approximating the 1 − α quantile of the null distribution

A hypothesis test using HSICb(Z) could be derived from Theorem 2 above by computing the (1 − α)th quantile of the distribution (8), where consistency of the test (that is, the convergence to zero of the Type II error for m → ∞) is guaranteed by the decay as m^−1 of the variance of HSICb(Z) under H1. The distribution under H0 is complex, however: the question then becomes how to accurately approximate its quantiles.

One approach, taken by [6], is to use a Monte Carlo resampling technique: the ordering of the Y sample is permuted repeatedly while that of X is kept fixed, and the 1 − α quantile is obtained from the resulting distribution of HSICb values. This can be very expensive, however. A second approach, suggested in [13, p. 34], is to approximate the null distribution as a two-parameter Gamma distribution [12, p. 343, p. 359]: this is one of the more straightforward approximations of an infinite sum of χ² variables (see [12, Chapter 18.8] for further ways to approximate such distributions; in particular, we wish to avoid using moments of order greater than two, since these can become expensive to compute). Specifically, we make the approximation

$$m\,\mathrm{HSIC}_b(Z) \sim \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)} \quad \text{where} \quad \alpha = \frac{\big(\mathbf{E}(\mathrm{HSIC}_b(Z))\big)^2}{\mathrm{var}(\mathrm{HSIC}_b(Z))}, \quad \beta = \frac{m\,\mathrm{var}(\mathrm{HSIC}_b(Z))}{\mathbf{E}(\mathrm{HSIC}_b(Z))}. \qquad (9)$$
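The moment matching in (9) can be sketched with the Python standard library alone. The helper names `gamma_cdf`, `gamma_quantile` and `gamma_threshold` are hypothetical, and the quantile is found by bisection on a power-series evaluation of the Gamma CDF. That is one simple choice made for this illustration, not necessarily how the authors computed their threshold.

```python
import math

def gamma_cdf(x, alpha, beta):
    """CDF of the density x^(alpha-1) exp(-x/beta) / (beta^alpha Gamma(alpha)),
    via the power series of the regularised lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    t = x / beta
    term = 1.0 / alpha
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= t / (alpha + n)
        total += term
        n += 1
    return total * math.exp(alpha * math.log(t) - t - math.lgamma(alpha))

def gamma_quantile(p, alpha, beta):
    """Invert the CDF by bisection (the CDF is monotone in x)."""
    lo, hi = 0.0, beta * max(alpha, 1.0)
    while gamma_cdf(hi, alpha, beta) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, alpha, beta) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gamma_threshold(mean_hsic, var_hsic, m, level=0.05):
    """Threshold for m * HSIC_b from null estimates of E(HSIC_b) and var(HSIC_b)."""
    alpha = mean_hsic**2 / var_hsic
    beta = m * var_hsic / mean_hsic
    return gamma_quantile(1.0 - level, alpha, beta), alpha, beta
```

With these parameters the Gamma distribution has mean αβ = m E(HSICb(Z)) and variance αβ² = m² var(HSICb(Z)), so it matches the first two moments of mHSICb(Z). The test would then reject H0 when mHSICb(Z) exceeds the returned threshold.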


Figure 1: mHSICb cumulative distribution function (Emp) under H0 for m = 200, obtained empirically using 5000 independent draws of mHSICb. The two-parameter Gamma distribution (Gamma) is fit using α = 1.17 and β = 8.3 × 10^−4 in (9), with mean and variance computed via Theorems 3 and 4. [Plot omitted: horizontal axis mHSICb, from 0 to 2; vertical axis P(mHSICb(Z) < mHSICb), from 0 to 1; curves Emp and Gamma.]

An illustration of the cumulative distribution function

(CDF) obtained via the Gamma approximation is given

in Figure 1, along with an empirical CDF obtained by

repeated draws of HSICb. We note the Gamma approxi-

mation is quite accurate, especially in areas of high prob-

ability (which we use to compute the test quantile). The

accuracy of this approximation will be further evaluated

experimentally in Section 4.

To obtain the Gamma distribution from our observa-

tions, we need empirical estimates for E(HSICb(Z)) and

var(HSICb(Z)) under the null hypothesis. Expressions

for these quantities are given in [13, pp. 26-27], however

these are in terms of the joint and marginal characteris-

tic functions, and not in our more general kernel setting

(see also [14, p. 313]). In the following two theorems,

we provide much simpler expressions for both quantities,

in terms of norms of mean elements µx and µy, and the

covariance operators

Cxx:= Ex[(φ(x) − µx) ⊗ (φ(x) − µx)]

and Cyy, in feature space. The main advantage of our new expressions is that they are computed

entirely in terms of kernels, which makes possible the application of the test to any domains on

which kernels can be defined, and not only R^d.

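Theorem 3 Under H0,

$$\mathbf{E}(\mathrm{HSIC}_b(Z)) = \frac{1}{m}\,\mathrm{Tr}\,C_{xx}\,\mathrm{Tr}\,C_{yy} = \frac{1}{m} \left( 1 + \|\mu_x\|^2 \|\mu_y\|^2 - \|\mu_x\|^2 - \|\mu_y\|^2 \right), \qquad (10)$$

where the second equality assumes $k_{ii} = l_{ii} = 1$. An empirical estimate of this statistic is obtained by replacing the norms above with $\widehat{\|\mu_x\|^2} = (m)_2^{-1} \sum_{(i,j) \in \mathbf{i}_2^m} k_{ij}$, bearing in mind that this results in a (generally negligible) bias of $O(m^{-1})$ in the estimate of $\|\mu_x\|^2 \|\mu_y\|^2$.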
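The empirical estimate described in Theorem 3 can be sketched as follows, assuming Gram matrices with unit diagonal (as for a Gaussian kernel). The function name `mean_hsic_b_null` is hypothetical.

```python
import numpy as np

def mean_hsic_b_null(K, L):
    """Estimate E(HSIC_b) under H0 via (10), assuming k_ii = l_ii = 1."""
    m = K.shape[0]
    pairs = m * (m - 1)                        # (m)_2: ordered pairs with i != j
    mu_x2 = (K.sum() - np.trace(K)) / pairs    # estimate of ||mu_x||^2
    mu_y2 = (L.sum() - np.trace(L)) / pairs    # estimate of ||mu_y||^2
    return (1.0 + mu_x2 * mu_y2 - mu_x2 - mu_y2) / m
```

The bracket in (10) factorises as (1 − ‖µx‖²)(1 − ‖µy‖²), so the estimate vanishes when either kernel is constant and equals 1/m when both Gram matrices are the identity.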

Theorem 4 Under H0,

$$\mathrm{var}(\mathrm{HSIC}_b(Z)) = \frac{2(m-4)(m-5)}{(m)_4}\, \|C_{xx}\|_{HS}^2\, \|C_{yy}\|_{HS}^2 + O(m^{-3}).$$

Denoting by ⊙ the entrywise matrix product, $A^{\cdot 2}$ the entrywise matrix power, and $B = ((HKH) \odot (HLH))^{\cdot 2}$, an empirical estimate with negligible bias may be found by replacing the product of covariance operator norms with $\mathbf{1}^\top (B - \mathrm{diag}(B)) \mathbf{1}$: this is slightly more efficient than taking the product of the empirical operator norms (although the scaling with m is unchanged).

Proofs of both theorems may be found in [10], where we also compare with the original characteristic function-based expressions in [13]. We remark that these parameters, like the original test statistic in (4), may be computed in O(m^2).
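A hedged sketch of the variance estimate in Theorem 4 follows. The function name `var_hsic_b_null` is hypothetical. The division of 1ᵀ(B − diag(B))1 by m(m − 1), which averages over the off-diagonal pairs, is an assumption of this sketch: the text above states the replacement without an explicit normalisation.

```python
import numpy as np

def var_hsic_b_null(K, L):
    """Estimate var(HSIC_b) under H0 following Theorem 4."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    B = ((H @ K @ H) * (H @ L @ H)) ** 2           # ((HKH) ⊙ (HLH))^{·2}
    off_diag_sum = B.sum() - np.trace(B)           # 1^T (B - diag(B)) 1
    norm_product = off_diag_sum / (m * (m - 1))    # assumed normalisation (see lead-in)
    falling = m * (m - 1) * (m - 2) * (m - 3)      # (m)_4
    return 2.0 * (m - 4) * (m - 5) / falling * norm_product
```

Together with an estimate of E(HSICb(Z)) under H0, this provides the two moments required by the Gamma approximation in (9).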

4 Experiments

General tests of statistical independence are most useful for data having complex interactions that simple correlation does not detect. We investigate two cases where this situation arises: first, we test vectors in R^d which have a dependence relation but no correlation, as occurs in independent subspace analysis; and second, we study the statistical dependence between a text and its translation.

Independence of subspaces

One area where independence tests have been applied is in determining the convergence of algorithms for independent component analysis (ICA), which involves separating random variables that have been linearly mixed, using only their mutual independence. ICA generally entails optimisation over a non-convex function (including when HSIC is itself the optimisation criterion [9]), and is susceptible to local minima, hence the need for these tests (in fact, for classical approaches to ICA, the global minimum of the optimisation might not correspond to independence for certain source distributions). Contingency table-based tests have been applied [15]


in this context, while the test of [13] has been used in [14] for verifying ICA outcomes when the

data are stationary random processes (through using a subset of samples with a sufficiently large

delay between them). Contingency table-based tests may be less useful in the case of independent

subspace analysis (ISA, see e.g. [25] and its bibliography), where higher dimensional independent

random vectors are to be separated. Thus, characteristic function-based tests [6, 13] and kernel

independence measures might work better for this problem.

In our experiments, we tested the independence of random vectors, as a way of verifying the so-

lutions of independent subspace analysis. We assumed for ease of presentation that our subspaces

have respective dimension dx = dy = d, but this is not required. The data were constructed as

follows. First, we generated m samples of two univariate random variables, each drawn at random

from the ICA benchmark densities in [11, Table 3]: these include super-Gaussian, sub-Gaussian,

multimodal, and unimodal distributions. Second, we mixed these random variables using a rota-

tion matrix parameterised by an angle θ, varying from 0 to π/4 (a zero angle means the data are

independent, while dependence becomes easier to detect as the angle increases to π/4: see the two

plots in Figure 2, top left). Third, we appended d − 1 dimensional Gaussian noise of zero mean

and unit standard deviation to each of the mixtures. Finally, we multiplied each resulting vector

by an independent random d-dimensional orthogonal matrix, to obtain vectors dependent across all

observed dimensions. We emphasise that classical approaches (such as Spearman’s ρ or Kendall’s

τ) are completely unable to find this dependence, since the variables are uncorrelated; nor can we

recover the subspace in which the variables are dependent using PCA, since this subspace has the

same second order properties as the noise. We investigated sample sizes m = 128,512,1024,2048,

and d = 1,2,4.
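The four construction steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions: the two sources are a uniform and a Laplace density with zero mean and unit variance, used as stand-ins for the benchmark densities of [11, Table 3], and the names `random_orthogonal` and `make_isa_data` are hypothetical.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random d x d orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))          # flip column signs; q stays orthogonal

def make_isa_data(m, d, theta, rng):
    """Return (x, y), each of shape (m, d), dependent only through a rotation by theta."""
    # Step 1: two independent univariate sources (assumed stand-in densities).
    s = np.column_stack([
        rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), m),    # sub-Gaussian, unit variance
        rng.laplace(0.0, 1.0 / np.sqrt(2.0), m),        # super-Gaussian, unit variance
    ])
    # Step 2: mix the sources with a rotation matrix of angle theta.
    c, sn = np.cos(theta), np.sin(theta)
    mixed = s @ np.array([[c, -sn], [sn, c]]).T
    # Step 3: append d - 1 dimensions of standard Gaussian noise to each mixture.
    x = np.column_stack([mixed[:, :1], rng.standard_normal((m, d - 1))])
    y = np.column_stack([mixed[:, 1:], rng.standard_normal((m, d - 1))])
    # Step 4: multiply each vector by an independent random orthogonal matrix.
    return x @ random_orthogonal(d, rng).T, y @ random_orthogonal(d, rng).T
```

Because both sources have unit variance, the rotated mixtures remain uncorrelated for every θ, which is why correlation-based tests cannot detect the dependence introduced in step 2.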

We compared two different methods for computing the 1−α quantile of the HSIC null distribution:

repeated random permutation of the Y sample ordering as in [6] (HSICp), where we used 200 permutations;

and Gamma approximation (HSICg) as in [13], based on (9). We used a Gaussian kernel,

with kernel size set to the median distance between points in input space. We also compared with

two alternative tests, the first based on a discretisation of the variables, and the second on functional

canonical correlation. The discretisation based test was a power-divergence contingency table test

from [17] (PD), which consisted in partitioning the space, counting the number of samples falling

in each partition, and comparing this with the number of samples that would be expected under the

null hypothesis (the test we used, described in [15], is more refined than this short description would

suggest). Rather than a uniform space partitioning, we divided our space into roughly equiprobable

bins as in [15], using a Gessaman partition for higher dimensions [5, Figure 21.4] (Ku and Fine did

not specify a space partitioning strategy for higher dimensions, since they dealt only with univariate

random variables). All remaining parameters were set according to [15]. The functional correlation-based

test (fCorr) is described in [4]: the main differences with respect to our test are that it uses

the spectrum of the functional correlation operator, rather than the covariance operator; and that it

approximates the RKHSs F and G by finite sets of basis functions. Parameter settings were as in

[4, Table 1], with the second order B-spline kernel and a twofold dyadic partitioning. Note that

fCorr applies only in the univariate case. Results are plotted in Figure 2 (average over 500 repetitions).

The y-intercept on these plots corresponds to the acceptance rate of H0 at independence, or

1 − (Type I error), and should be close to the design parameter of 1 − α = 0.95. Elsewhere, the

plots indicate acceptance of H0 where the underlying variables are dependent, i.e. the Type II error.

As expected, we observe that dependence becomes easier to detect as θ increases from 0 to π/4,

when m increases, and when d decreases. The PD and fCorr tests perform poorly at m = 128,

but approach the performance of HSIC-based tests for increasing m (although PD remains slightly

worse than HSIC at m = 512 and d = 1, while fCorr becomes slightly worse again than PD). PD

also scales very badly with d, and never rejects the null hypothesis when d = 4, even for m = 2048.

Although HSIC-based tests are unreliable for small θ, they generally do well as θ approaches π/4

(besides m = 128, d = 2). We also emphasise that HSICp and HSICg perform identically, although

HSICp is far more costly (by a factor of around 100, given the number of permutations used).
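The permutation variant (HSICp) with a median-distance kernel size can be sketched as below. This is a minimal illustration rather than the experimental code. The names `median_sigma`, `gaussian_gram`, `center` and `permutation_test` are hypothetical, and using the median pairwise distance directly as σ in exp(−σ^−2‖x_i − x_j‖²) is one reading of the kernel-size rule stated above.

```python
import numpy as np

def median_sigma(x):
    """Median pairwise Euclidean distance between sample points."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    return np.median(dist[np.triu_indices(len(x), k=1)])

def gaussian_gram(x, sigma):
    """Gram matrix with entries exp(-||x_i - x_j||^2 / sigma^2)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    diff = x[:, None, :] - x[None, :, :]
    return np.exp(-np.sum(diff**2, axis=-1) / sigma**2)

def center(K):
    """Return HKH with H = I - (1/m) 1 1^T, without forming H."""
    return K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()

def permutation_test(x, y, n_perm=200, level=0.05, rng=None):
    """Reject H0 if HSIC_b exceeds the (1 - level) quantile over permutations of y."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(x)
    Kc = center(gaussian_gram(x, median_sigma(x)))
    L = gaussian_gram(y, median_sigma(y))
    stat = np.sum(Kc * L) / m**2               # equals trace(KHLH) / m^2
    null = np.empty(n_perm)
    for b in range(n_perm):
        p = rng.permutation(m)                 # reorder the Y sample, keep X fixed
        null[b] = np.sum(Kc * L[np.ix_(p, p)]) / m**2
    threshold = np.quantile(null, 1.0 - level)
    return stat > threshold, stat, threshold
```

Each permutation costs one O(m^2) pass over the Gram matrices, so the total cost grows linearly with the number of permutations, in line with the cost comparison between HSICp and HSICg above.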

Figure 2: Top left plots: Example dataset for d = 1, m = 200, and rotation angles θ = π/8 (left) and θ = π/4 (right). In this case, both sources are mixtures of two Gaussians (source (g) in [11, Table 3]). We remark that the random variables appear "more dependent" as the angle θ increases, although their correlation is always zero. Remaining plots: Rate of acceptance of H0 for the PD, fCorr, HSICp, and HSICg tests. "Samp" is the number m of samples, and "Dim" is the dimension d of x and y. [Plots omitted: two scatter plots of Y against X, titled "Rotation θ = π/8" and "Rotation θ = π/4"; six panels titled "Samp:128, Dim:1", "Samp:128, Dim:2", "Samp:512, Dim:1", "Samp:512, Dim:2", "Samp:1024, Dim:4", and "Samp:2048, Dim:4", each with horizontal axis "Angle (×π/4)" from 0 to 1 and vertical axis "% acceptance of H0" from 0 to 1.]

Dependence and independence between text

In this section, we demonstrate independence testing on text. Our data are taken from the Canadian Hansard corpus (http://www.isi.edu/natural-language/download/hansard/). These consist of the official records of the 36th Canadian parliament, in English and French. We used debate transcripts on the three topics of Agriculture, Fisheries, and Immigration, due to the relatively large volume of data in these categories. Our goal was to test whether there exists a statistical dependence between English text and its French translation. Our dependent data consisted of a set of paragraph-long

(5 line) English extracts and their French translations. For our independent data, the English para-

graphs were matched to random French paragraphs on the same topic: for instance, an English

paragraph on fisheries would always be matched with a French paragraph on fisheries. This was

designed to prevent a simple vocabulary check from being used to tell when text was mismatched.

We also ignored lines shorter than five words long, since these were not always part of the text (e.g.

identification of the person speaking). We used the k-spectrum kernel of [16], computed according

to the method of [24]. We set k = 10 for both languages, where this was chosen by cross validating

on an SVM classifier for Fisheries vs National Defense, separately for each language (performance

was not especially sensitive to choice of k; k = 5 also worked well). We compared this kernel with

a simple kernel between bags of words [3, pp. 186–189]. Results are in Table 1.
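For readers unfamiliar with the k-spectrum kernel, a naive sketch is given below. It counts shared length-k substrings with a hash map rather than using the suffix-array method of [24], so it illustrates only the definition. The function names are hypothetical, and the cosine normalisation (which gives k(s, s) = 1) is an assumption added here, not something stated in the text.

```python
from collections import Counter

def spectrum_features(text, k):
    """Counts of every contiguous length-k substring of text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def spectrum_kernel(s, t, k):
    """Inner product between the k-spectrum count vectors of s and t."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    if len(fs) > len(ft):
        fs, ft = ft, fs                     # iterate over the smaller map
    return sum(count * ft[sub] for sub, count in fs.items())

def normalised_spectrum_kernel(s, t, k):
    """Cosine-normalised variant (an assumption of this sketch)."""
    kst = spectrum_kernel(s, t, k)
    if kst == 0:
        return 0.0
    return kst / (spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k)) ** 0.5
```

A Gram matrix built from such a kernel on the English extracts, and another on the French extracts, can be passed to the HSIC statistic in the same way as the Gaussian Gram matrices used for vector data.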

Our results demonstrate the excellent performance of the HSICp test on this task: even for small

sample sizes, HSICp with a spectral kernel always achieves zero Type II error, and a null acceptance rate

under H0 close to the design value of 0.95 (that is, a Type I error close to 5%). We further observe for m = 10 that HSICp with the spectral kernel

always has better Type II error than the bag-of-words kernel. This suggests that a kernel with a more

sophisticated encoding of text structure induces a more sensitive test, although for larger sample

sizes, the advantage vanishes. The HSICg test does less well on this data, always accepting H0 for m = 10, and returning a Type I error of zero, rather than the design value of 5%, when m = 50. It appears that this is due to a very low variance estimate returned by the Theorem 4 expression, which could be caused by the high diagonal dominance of kernels on strings. Thus, while the test threshold for HSICg at m = 50 still fell between the dependent and independent values of HSICb, this was not the result of an accurate modelling of the null distribution. We would therefore recommend the permutation approach for this problem. Finally, we also tried testing with 2-line extracts and 10-line

extracts, which yielded similar results.

5 Conclusion

We have introduced a test of whether significant statistical dependence is obtained by a kernel dependence measure, the Hilbert-Schmidt independence criterion (HSIC). Our test costs O(m^2) for sample size m. In our experiments, HSIC-based tests always outperformed the contingency table [17] and functional correlation [4] approaches, for both univariate random variables and higher dimensional vectors which were dependent but uncorrelated. We would therefore recommend HSIC-based

tests for checking the convergence of independent component analysis and independent subspace

analysis. Finally, our test also applies on structured domains, being able to detect the dependence of passages of text and their translation. Another application along these lines might be in testing dependence between data of completely different types, such as images and captions.

Table 1: Independence tests for cross-language dependence detection. Topics are in the first column, where the total number of 5-line extracts for each dataset is in parentheses. BOW(10) denotes a bag of words kernel and m = 10 sample size, Spec(50) is a k-spectrum kernel with m = 50. The first entry in each cell is the null acceptance rate of the test under H0 (i.e. 1 − (Type I error); should be near 0.95); the second entry is the null acceptance rate under H1 (the Type II error, small is better). Each entry is an average over 300 repetitions.

Topic               BOW(10)                   Spec(10)                  BOW(50)                   Spec(50)
                    HSICg        HSICp        HSICg        HSICp        HSICg        HSICp        HSICg        HSICp
Agriculture (555)   1.00, 0.99   0.94, 0.18   1.00, 1.00   0.95, 0.00   1.00, 0.00   0.93, 0.00   1.00, 0.00   0.95, 0.00
Fisheries (408)     1.00, 1.00   0.94, 0.20   1.00, 1.00   0.94, 0.00   1.00, 0.00   0.93, 0.00   1.00, 0.00   0.95, 0.00
Immigration (289)   1.00, 1.00   0.96, 0.09   1.00, 1.00   0.91, 0.00   0.99, 0.00   0.94, 0.00   1.00, 0.00   0.95, 0.00

Acknowledgements: NICTA is funded through the Australian Government’s Backing Australia’s

Ability initiative, in part through the ARC. This work was supported in part by the IST Programme

of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

References

[1] F. Bach and M. Jordan. Tree-dependent component analysis. In UAI 18, 2002.
[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.
[3] I. Calvino. If on a winter's night a traveler. Harvest Books, Florida, 1982.
[4] J. Dauxois and G. M. Nkiet. Nonlinear canonical analysis and independence tests. Ann. Statist., 26(4):1254–1278, 1998.
[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of mathematics. Springer, New York, 1996.
[6] A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
[7] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
[8] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. In NIPS 19, pages 513–520, Cambridge, MA, 2007. MIT Press.
[9] A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pages 63–77, 2005.
[10] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. Technical Report 168, MPI for Biological Cybernetics, 2008.
[11] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075–2129, 2005.
[12] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1 (Second Edition). John Wiley and Sons, 1994.
[13] A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
[14] J. Karvanen. A resampling test for the total independence of stationary time series: Application to the performance evaluation of ICA algorithms. Neural Processing Letters, 22(3):311–324, 2005.
[15] C.-J. Ku and T. Fine. Testing for stochastic independence: application to blind source separation. IEEE Transactions on Signal Processing, 53(5):1815–1826, 2005.
[16] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 564–575, 2002.
[17] T. Read and N. Cressie. Goodness-Of-Fit Statistics for Discrete Multivariate Analysis. Springer-Verlag, New York, 1988.
[18] A. Rényi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441–451, 1959.
[19] M. Rosenblatt. A quadratic measure of deviation of two-dimensional density estimates and a test of independence. The Annals of Statistics, 3(1):1–14, 1975.
[20] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.
[21] R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
[22] L. Song, A. Smola, A. Gretton, K. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In Proc. Intl. Conf. Machine Learning, pages 823–830. Omnipress, 2007.
[23] I. Steinwart. The influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 2002.
[24] C. H. Teo and S. V. N. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In ICML, pages 929–936, 2006.
[25] F. J. Theis. Towards a general independent subspace analysis. In NIPS 19, 2007.