A Gretton’s research while affiliated with University College London and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (48)


Kernel methods for causal functions: dose, heterogeneous and incremental response curves
  • Article

July 2023 · 20 Reads · 20 Citations · Biometrika

R Singh · [...] · A Gretton

We propose estimators based on kernel ridge regression for nonparametric causal functions such as dose, heterogeneous and incremental response curves. The treatment and covariates may be discrete or continuous in general spaces. Because of a decomposition property specific to the reproducing kernel Hilbert space, our estimators have simple closed-form solutions. We prove uniform consistency with finite sample rates via an original analysis of generalized kernel ridge regression. We extend our main results to counterfactual distributions and to causal functions identified by front and back door criteria. We achieve state-of-the-art performance in nonlinear simulations with many covariates, and conduct a policy evaluation of the US Job Corps training programme for disadvantaged youths.
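
The closed-form character of the estimators can be illustrated with a minimal sketch: a dose-response curve obtained by averaging a kernel ridge regression of the outcome on (treatment, covariates) over the empirical covariate distribution. This is an illustration only, not the paper's generalized kernel ridge estimator; the Gaussian kernel, bandwidth and regularizer lam are assumed choices.

```python
# Minimal sketch (not the paper's estimator): dose-response curve via kernel ridge
# regression of Y on (treatment D, covariates X), averaged over the empirical
# covariate distribution. Kernel, bandwidth and regularizer are illustrative.
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between rows of A (n, d) and B (m, d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def dose_response(D, X, Y, d_grid, lam=1e-2):
    """theta(d) = (1/n) sum_i f_hat(d, x_i); D is (n,), X is (n, p), Y is (n,)."""
    n = len(Y)
    Z = np.column_stack([D, X])
    K = rbf_kernel(Z, Z)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), Y)      # closed-form ridge weights
    theta = []
    for d in d_grid:
        Zd = np.column_stack([np.full(n, d), X])             # fix treatment at d
        theta.append((rbf_kernel(Zd, Z) @ alpha).mean())     # average over covariates
    return np.array(theta)
```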


Figure 1: Sinusoid experiment with joint density p_{xy} ∝ 1 + sin(ωx)sin(ωy) on (−π, π)² for some frequency ω > 0.
Figure 2: Gaussian Sign experiment where X ∼ N(0, I_d) and Y = |Z| ∏_{i=1}^{d} sgn(X_i) with noise Z ∼ N(0, 1).
Discussion of ‘Multi-scale Fisher’s independence test for multivariate dependence’
  • Article
  • Full-text available

August 2022 · 43 Reads · 1 Citation · Biometrika

[...] · Z Szabó · [...] · A Gretton

1. Introduction. We read with interest the work of Gorsky & Ma (2022) on statistical dependence testing using a multi-scale Fisher's independence test, MultiFIT. The procedure consists of first transforming the data to map to the unit ball, then performing univariate Fisher's exact tests of independence on a collection of 2×2 contingency tables, and finally correcting for the use of multiple testing. The collection is obtained using a divide-and-conquer approach with a coarse-to-fine procedure: the unit ball is partitioned into cuboids, and 2×2 contingency tables of counts of samples in the cuboids are tested; the cuboids with small associated p-values are then further partitioned at finer resolutions and tested again. This approach has a number of advantages; chief among them are that the test is multivariate, the computational cost is in general O(n log n) as a function of the sample size n, and the test threshold is exact at...
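
For concreteness, the basic building block, a univariate Fisher's exact test of independence on a 2×2 contingency table followed by a multiple-testing correction, can be sketched as follows. The mapping to the unit ball and the coarse-to-fine cuboid recursion of MultiFIT are not reproduced; the split points and the Bonferroni correction below are illustrative assumptions.

```python
# Minimal sketch of the building block only: Fisher's exact tests on 2x2 count
# tables with a Bonferroni-style correction. MultiFIT's mapping to the unit ball
# and its coarse-to-fine cuboid recursion are not reproduced here.
import numpy as np
from scipy.stats import fisher_exact

def fisher_2x2(x, y, x_split, y_split):
    """p-value of Fisher's exact test on the 2x2 table from thresholding x and y."""
    a = np.sum((x <= x_split) & (y <= y_split))
    b = np.sum((x <= x_split) & (y > y_split))
    c = np.sum((x > x_split) & (y <= y_split))
    d = np.sum((x > x_split) & (y > y_split))
    _, p = fisher_exact([[a, b], [c, d]])
    return p

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = x + 0.1 * rng.normal(size=500)           # a dependent pair, for illustration
splits = [0.25, 0.5, 0.75]                   # a few resolutions, illustrative
pvals = [fisher_2x2(x, y, s, t) for s in splits for t in splits]
reject = min(pvals) < 0.05 / len(pvals)      # Bonferroni correction over the collection
print(reject)
```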


A Weaker Faithfulness Assumption based on Triple Interactions

July 2021 · 23 Reads · 11 Citations

One of the core assumptions in causal discovery is the faithfulness assumption, i.e. assuming that independencies found in the data are due to separations in the true causal graph. This assumption can, however, be violated in many ways, including XOR connections, deterministic functions or cancelling paths. In this work, we propose a weaker assumption that we call 2-adjacency faithfulness. In contrast to adjacency faithfulness, which assumes that there is no conditional independence between each pair of variables that are connected in the causal graph, we only require no conditional independence between a node and a subset of its Markov blanket that can contain up to two nodes. Equivalently, we adapt orientation faithfulness to this setting. We further propose a sound orientation rule for causal discovery that applies under weaker assumptions. As a proof of concept, we derive a modified Grow and Shrink algorithm that recovers the Markov blanket of a target node and prove its correctness under strictly weaker assumptions than the standard faithfulness assumption.
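
For orientation, the standard Grow and Shrink recovery of a Markov blanket, the algorithm the authors modify, looks roughly as follows. Here ci_independent is a hypothetical placeholder for any conditional independence test; the paper's modified version under 2-adjacency faithfulness is not reproduced.

```python
# Sketch of the standard Grow-Shrink recovery of a Markov blanket, for background
# only; the paper's modified algorithm under 2-adjacency faithfulness is not
# reproduced. ci_independent(data, i, j, S) is a hypothetical placeholder for any
# conditional independence test returning True if X_i is independent of X_j given X_S.
def grow_shrink(data, target, variables, ci_independent):
    mb = set()
    changed = True
    while changed:                                   # grow phase: add dependent variables
        changed = False
        for v in variables:
            if v != target and v not in mb and not ci_independent(data, target, v, mb):
                mb.add(v)
                changed = True
    for v in list(mb):                               # shrink phase: drop redundant ones
        if ci_independent(data, target, v, mb - {v}):
            mb.remove(v)
    return mb
```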


KALE Flow: A Relaxed KL Gradient Flow for Probabilities with Disjoint Support

January 2021 · 6 Reads · 23 Citations

We study the gradient flow for a relaxed approximation to the Kullback-Leibler (KL) divergence between a moving source and a fixed target distribution. This approximation, termed the KALE (KL Approximate Lower bound Estimator), solves a regularized version of the Fenchel dual problem defining the KL over a restricted class of functions. When using a Reproducing Kernel Hilbert Space (RKHS) to define the function class, we show that the KALE continuously interpolates between the KL and the Maximum Mean Discrepancy (MMD). Like the MMD and other Integral Probability Metrics, the KALE remains well-defined for mutually singular distributions. Nonetheless, the KALE inherits from the limiting KL a greater sensitivity to mismatch in the support of the distributions, compared with the MMD. These two properties make the KALE gradient flow particularly well suited when the target distribution is supported on a low-dimensional manifold. Under an assumption of sufficient smoothness of the trajectories, we show the global convergence of the KALE flow. We propose a particle implementation of the flow given initial samples from the source and the target distribution, which we use to empirically confirm the KALE’s properties.
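
As a point of reference for the interpolation, the MMD endpoint has a simple closed-form sample estimate; a minimal sketch with a Gaussian kernel is given below. The bandwidth is an assumed choice, and the KALE objective itself (the regularized Fenchel dual of the KL) is not reproduced here.

```python
# Minimal sketch of the (biased, V-statistic) squared MMD between two samples,
# the integral probability metric endpoint that KALE interpolates towards.
# Gaussian kernel and bandwidth are illustrative choices.
import numpy as np

def gaussian_gram(A, B, bandwidth=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased estimate of MMD^2 between samples X (n, d) and Y (m, d)."""
    Kxx = gaussian_gram(X, X, bandwidth)
    Kyy = gaussian_gram(Y, Y, bandwidth)
    Kxy = gaussian_gram(X, Y, bandwidth)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```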


Self-Supervised Learning with Kernel Dependence Maximization

January 2021 · 12 Reads · 56 Citations

We approach self-supervised learning of image representations from a statistical dependence perspective, proposing Self-Supervised Learning with the Hilbert-Schmidt Independence Criterion (SSL-HSIC). SSL-HSIC maximizes dependence between representations of transformations of an image and the image identity, while minimizing the kernelized variance of those representations. This framework yields a new understanding of InfoNCE, a variational lower bound on the mutual information (MI) between different transformations. While the MI itself is known to have pathologies which can result in learning meaningless representations, its bound is much better behaved: we show that it implicitly approximates SSL-HSIC (with a slightly different regularizer). Our approach also gives us insight into BYOL, a negative-free SSL method, since SSL-HSIC similarly learns local neighborhoods of samples. SSL-HSIC allows us to directly optimize statistical dependence in time linear in the batch size, without restrictive data assumptions or indirect mutual information estimators. Trained with or without a target network, SSL-HSIC matches the current state of the art for standard linear evaluation on ImageNet [1], semi-supervised learning and transfer to other classification and vision tasks such as semantic segmentation, depth estimation and object recognition. Code is available at https://github.com/deepmind/ssl-hsic.
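
The dependence measure at the core of the method has a standard biased estimator, tr(KHLH)/(n−1)²; a minimal sketch is below, with Gaussian kernels as an assumed choice. The paper's SSL-HSIC objective and its linear-time estimator live in the linked repository, not here.

```python
# Minimal sketch of the standard biased HSIC estimate tr(K H L H) / (n - 1)^2,
# with Gaussian kernels as an illustrative choice; not the SSL-HSIC objective.
import numpy as np

def gaussian_gram(A, bandwidth=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * bandwidth**2))

def hsic_biased(X, Z, bandwidth=1.0):
    """Biased HSIC estimate between paired samples X (n, d1) and Z (n, d2)."""
    n = X.shape[0]
    K = gaussian_gram(X, bandwidth)
    L = gaussian_gram(Z, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```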



CIFAR-10.1 (α = 0.05): mean rejection rates.
Learning deep kernels for non-parametric two-sample tests

January 2020 · 79 Reads · 70 Citations

We propose a class of kernel-based two-sample tests, which aim to determine whether two sets of samples are drawn from the same distribution. Our tests are constructed from kernels parameterized by deep neural nets, trained to maximize test power. These tests adapt to variations in distribution smoothness and shape over space, and are especially suited to high dimensions and complex data. By contrast, the simpler kernels used in prior kernel testing work are spatially homogeneous, and adaptive only in lengthscale. We explain how this scheme includes popular classifier-based two-sample tests as a special case, but improves on them in general. We provide the first proof of consistency for the proposed adaptation method, which applies both to kernels on deep features and to simpler radial basis kernels or multiple kernel learning. In experiments, we establish the superior performance of our deep kernels in hypothesis testing on benchmark and real-world data.
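
A minimal sketch of the test phase only is shown below: an MMD statistic computed with a Gaussian kernel on features from some trained network phi (a hypothetical placeholder), with the rejection threshold set by permutation. The paper's deep-kernel parameterization and its power-maximizing training objective are not reproduced.

```python
# Sketch of the test phase only: a permutation test on an MMD statistic computed
# with a Gaussian kernel applied to features phi(x) from some trained network.
# phi is a hypothetical placeholder; the paper's deep-kernel parameterization and
# power-maximizing training are not reproduced here.
import numpy as np

def mmd2_on_features(FX, FY, bandwidth=1.0):
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * bandwidth**2))
    return gram(FX, FX).mean() + gram(FY, FY).mean() - 2 * gram(FX, FY).mean()

def permutation_test(X, Y, phi, n_perm=200, alpha=0.05, rng=None):
    """Two-sample test on equal-size samples X, Y using features from phi."""
    rng = np.random.default_rng() if rng is None else rng
    FX, FY = phi(X), phi(Y)
    stat = mmd2_on_features(FX, FY)
    pooled = np.vstack([FX, FY])
    n = len(FX)
    null = []
    for _ in range(n_perm):                      # resample under the null
        perm = rng.permutation(len(pooled))
        null.append(mmd2_on_features(pooled[perm[:n]], pooled[perm[n:]]))
    p_value = (1 + np.sum(np.array(null) >= stat)) / (1 + n_perm)
    return stat, p_value, p_value <= alpha
```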


Maximum Mean Discrepancy Gradient Flow

December 2019 · 102 Reads · 125 Citations

We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties. The MMD is an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), and serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, which can be related to particle transport when optimizing neural networks. We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. The practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.
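
The closed-form gradient can be made concrete with a minimal particle sketch: each source particle descends the gradient of the MMD witness function against a fixed target sample, with the gradient evaluated at a noise-perturbed position. The step size, bandwidth and noise level below are assumed values, not the paper's settings.

```python
# Minimal particle sketch: source particles descend the gradient of the MMD
# witness function against a fixed target sample, with the gradient evaluated at
# noise-perturbed positions (the noise-injection idea). Step size, bandwidth and
# noise level are illustrative choices.
import numpy as np

def witness_grad(z, X, Y, bandwidth=1.0):
    """Gradient at z of f(z) = mean_j k(z, x_j) - mean_j k(z, y_j), Gaussian k."""
    def grad_sum(A):
        diff = z - A                                             # (m, d)
        k = np.exp(-np.sum(diff**2, 1) / (2 * bandwidth**2))     # (m,)
        return (-(diff / bandwidth**2) * k[:, None]).mean(0)
    return grad_sum(X) - grad_sum(Y)

def mmd_flow(X0, Y, steps=500, step_size=0.1, noise=0.1, bandwidth=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = X0.copy()
    for _ in range(steps):
        Xn = X + noise * rng.normal(size=X.shape)                # noise injection
        grads = np.stack([witness_grad(x, X, Y, bandwidth) for x in Xn])
        X = X - step_size * grads
    return X
```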


Exponential Family Estimation via Adversarial Dynamics Embedding

December 2019 · 25 Reads · 20 Citations

We present an efficient algorithm for maximum likelihood estimation (MLE) of exponential family models, with a general parametrization of the energy function that includes neural networks. We exploit the primal-dual view of the MLE with a kinetics augmented model to obtain an estimate associated with an adversarial dual sampler. To represent this sampler, we introduce a novel neural architecture, dynamics embedding, that generalizes Hamiltonian Monte Carlo (HMC). The proposed approach inherits the flexibility of HMC while enabling tractable entropy estimation for the augmented model. By learning both a dual sampler and the primal model simultaneously, and sharing parameters between them, we obviate the requirement to design a separate sampling procedure once the model has been trained, leading to more effective learning. We show that many existing estimators, such as contrastive divergence, pseudo/composite-likelihood, score matching, minimum Stein discrepancy estimator, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow, are special cases of the proposed approach, each expressed by a different (fixed) dual sampler. An empirical investigation shows that adapting the sampler during MLE can significantly improve on state-of-the-art estimators.
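
As background for the dynamics being generalized, a standard HMC leapfrog step is sketched below. This is plain HMC on a user-supplied score function, not the paper's learned dynamics embedding.

```python
# Standard HMC leapfrog integrator on an unnormalized log-density, shown only as
# the dynamics that a learned "dynamics embedding" would generalize; nothing here
# is the paper's proposed neural architecture. grad_log_p is any score function.
import numpy as np

def leapfrog(x, p, grad_log_p, step_size=0.1, n_steps=10):
    """Simulate Hamiltonian dynamics with potential U(x) = -log p(x)."""
    x, p = x.copy(), p.copy()
    p += 0.5 * step_size * grad_log_p(x)          # half step for momentum
    for _ in range(n_steps - 1):
        x += step_size * p                        # full step for position
        p += step_size * grad_log_p(x)            # full step for momentum
    x += step_size * p
    p += 0.5 * step_size * grad_log_p(x)          # final half step
    return x, p

# Example: one leapfrog trajectory targeting a standard Gaussian.
grad_log_p = lambda x: -x
x, p = leapfrog(np.zeros(2), np.random.default_rng(0).normal(size=2), grad_log_p)
```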


Kernel Instrumental Variable Regression

December 2019 · 62 Reads · 123 Citations

Instrumental variable (IV) regression is a strategy for learning causal relationships in observational data. If measurements of input X and output Y are confounded, the causal relationship can nonetheless be identified if an instrumental variable Z is available that influences X directly, but is conditionally independent of Y given X and the unmeasured confounder. The classic two-stage least squares algorithm (2SLS) simplifies the estimation problem by modeling all relationships as linear functions. We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild assumptions, and derive conditions under which convergence occurs at the minimax optimal rate for unconfounded, single-stage RKHS regression. In doing so, we obtain an efficient ratio between training sample sizes used in the algorithm's first and second stages. In experiments, KIV outperforms state-of-the-art alternatives for nonparametric IV regression.
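
A much-simplified nonlinear two-stage sketch conveys the flavour: kernel ridge regression of X on Z, then of Y on the fitted values. This is not the KIV estimator itself, which works with RKHS feature maps rather than regressing X directly, and it omits the paper's sample-splitting and tuning analysis; kernels, bandwidths and regularizers below are assumed choices.

```python
# Much-simplified two-stage sketch: kernel ridge regression of X on the instrument
# Z, then of Y on the fitted first-stage values. This conveys the nonlinear-2SLS
# flavour only; it is NOT the paper's KIV estimator. Shapes: Z (n, d_z), X (n, d_x),
# Y (n,). Kernel, bandwidth and regularizer lam are illustrative.
import numpy as np

def gaussian_gram(A, B, bandwidth=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def kernel_ridge_fit(Z, target, lam=1e-2):
    """Return a predictor fitted by closed-form kernel ridge regression."""
    n = Z.shape[0]
    alpha = np.linalg.solve(gaussian_gram(Z, Z) + n * lam * np.eye(n), target)
    return lambda Znew: gaussian_gram(Znew, Z) @ alpha

def naive_nonlinear_2sls(Z, X, Y, lam=1e-2):
    stage1 = kernel_ridge_fit(Z, X, lam)       # roughly E[X | Z]
    X_hat = stage1(Z)
    stage2 = kernel_ridge_fit(X_hat, Y, lam)   # structural function of the fitted values
    return stage2
```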


Citations (40)


... A plug-in estimator. We further require some regularity conditions on the RKHS, which are commonly assumed [46, 71]. ...

Reference:

Doubly-Robust Estimation of Counterfactual Policy Mean Embeddings
Kernel methods for causal functions: dose, heterogeneous and incremental response curves
  • Citing Article
  • July 2023

Biometrika

... Once textual data is converted into numerical representations, various methods can be applied to assess text similarity. For instance, a Maximum Mean Discrepancy (MMD) procedure is proposed to infer whether two sets of documents convey similar meanings based on the vector space model [22]. Matrix factorization techniques, such as Latent Semantic Analysis (LSA), have also been employed to construct vector-based metrics for document comparison [23]. ...

Interpretable Distribution Features with Maximum Testing Power
  • Citing Article
  • October 2016

... We wish to start by expressing our gratitude to the Editorial Board for giving us the opportunity to discuss our article, and to all of the discussants for their very insightful, thought-provoking contributions. These discussions provide a more comprehensive review of the vast relevant literature along with more in-depth views on several popular approaches for multivariate dependency testing, including computationally efficient approximation to kernel methods (Schrab et al., 2022), the k-nearest-neighbour-based mutual information test (Berrett, 2022) and the binary-expansion-test-based framework (Lee et al., 2022a). The discussions also complement our paper with additional numerical experiments that shed light on the statistical and computational strengths and weaknesses of the various approaches, and suggest interesting future research directions such as testing functional dependency (Lee et al., 2022a). ...

Discussion of ‘Multi-scale Fisher’s independence test for multivariate dependence’

Biometrika

... Our findings underscore the benefits of using an ELBO-driven approach for BiNN training and suggest that quantum-inspired variational methods may offer a principled pathway toward more effective learning. Future work could further explore the theoretical relationship between explicit and surrogate ELBO formulations, particularly through the KALE divergence [27]. Additionally, improvements in parameter initialization for quantum circuits could improve convergence stability and training efficiency. ...

KALE Flow: A Relaxed KL Gradient Flow for Probabilities with Disjoint Support
  • Citing Conference Paper
  • January 2021

... For instance, SSL-based recommendation systems can autonomously learn user preferences by identifying patterns in implicit interactions, such as scrolling behaviour, session duration, and cursor movement. This eliminates the need for extensive labelled feedback while enabling real-time personalization [173], [174], [175]. ...

Self-Supervised Learning with Kernel Dependence Maximization
  • Citing Conference Paper
  • January 2021

... While the Markov property is a given in almost every causal inference method, causal faithfulness is more controversial and its validity has been discussed in various places [1,24]. As a consequence, weaker versions of causal faithfulness have been developed, most notably adjacency and orientation faithfulness [5,25]. We study under which conditions the causal Markov property, faithfulness and some of its relatives do and do not carry over from a fine-grained micro-level causal graphical model to a more coarse-grained macro-level graph in which the micro-level variables are partitioned into groups, see Figure 1. ...

A Weaker Faithfulness Assumption based on Triple Interactions
  • Citing Conference Paper
  • July 2021

... A number of approaches have been developed to extend the two-stage least squares (2SLS) algorithm (Angrist et al., 1996) to non-linear settings. A common approach is to use non-linear basis functions, such as Sieve IV (Newey & Powell, 2003; Blundell et al., 2007; Chen & Christensen, 2018), Kernel IV (Singh et al., 2019) and Dual IV (Muandet et al., 2020). These methods enjoy theoretical benefits, but their flexibility is limited by the set of basis functions. ...

Kernel Instrumental Variable Regression
  • Citing Conference Paper
  • December 2019

... Score-matching methods and their variants, first explored in [5, 17, 18], have faced challenges in EBM training, particularly in addressing the presence of multiple, imbalanced modes in the target distribution. These shortcomings are examined in [19, 20, 21]. Contrastive divergence, introduced in [6, 7, 8], is a widely used technique for approximating the gradient of the log-likelihood. ...

Learning deep kernels for exponential family densities

... Kernel-based GOF tests achieve this through a tractable test statistic based on the score function (i.e., the gradient of the log-density), which is widely available since it can be evaluated without knowledge of the normalization constant of the density. This property has made kernel GOF tests popular and has led to numerous extensions specializing the approach to problems from survival analysis (Fernandez et al., 2020) to inference with manifold data (Xu & Matsuda, 2020), graphs (Xu & Reinert, 2021) and protein data (Amin et al., 2023). ...

Kernelized Stein Discrepancy Tests of Goodness-of-fit for Time-to-Event Data
  • Citing Conference Paper
  • August 2020