# Wittawat JitkrittumGoogle Inc. | Google

Wittawat Jitkrittum

Doctor of Philosophy

## About

77

Publications

7,986

Reads

**How we measure 'reads'**

A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more

794

Citations

Citations since 2017

Introduction

I am interested in both theoretical and practical sides of machine learning. The main focus of my recent works is on automatic representation of empirical distributions through the use of a positive definite kernel. This has led to an improvement of a number of techniques in the areas of approximate Bayesian inference, two-sample testing, and independence testing.

Education

September 2013 - September 2017

March 2010 - February 2012

May 2005 - May 2009

## Publications

Publications (77)

Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on t...

We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., Pro...

Learning to reject (L2R) and out-of-distribution (OOD) detection are two classical problems, each of which involve detecting certain abnormal samples: in L2R, the goal is to detect "hard" samples on which to abstain, while in OOD detection, the goal is to detect "outlier" samples not drawn from the training distribution. Intriguingly, despite being...

Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative g...

We address the problem of retrieving in-the-wild images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by e...

1. Introduction
We read with interest the work of Gorsky & Ma (2022) on statistical dependence testing using a multi-scale Fisher’s independence test, MultiFIT. The procedure consists of first transforming the data to map to the unit ball, then performing univariate Fisher’s exact tests of independence on a collection of |$2 \times 2$| contingency...

We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one al...

We discuss how MultiFIT, the Multiscale Fisher's Independence Test for Multivariate Dependence proposed by Gorsky and Ma (2022), compares to existing linear-time kernel tests based on the Hilbert-Schmidt independence criterion (HSIC). We highlight the fact that the levels of the kernel tests at any finite sample size can be controlled exactly, as i...

We discuss how MultiFIT, the Multiscale Fisher's Independence Test for Multivariate Dependence proposed by Gorsky and Ma (2022), compares to existing linear-time kernel tests based on the Hilbert-Schmidt independence criterion (HSIC). We highlight the fact that the levels of the kernel tests at any finite sample size can be controlled exactly, as i...

Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models an...

Multi-party computation (MPC) is a branch of cryptography where multiple non-colluding parties execute a well designed protocol to securely compute a function. With the non-colluding party assumption, MPC has a cryptographic guarantee that the parties will not learn sensitive information from the computation process, making it an appealing framewor...

We developed a novel approximate Bayesian computation (ABC) framework, ABCDP, which produces differentially private (DP) and approximate posterior samples. Our framework takes advantage of the sparse vector technique (SVT), widely studied in the differential privacy literature. SVT incurs the privacy cost only when a condition (whether a quantity o...

We propose data-dependent test statistics based on a one-dimensional witness function, which we call witness two-sample tests (WiTS tests). We first optimize the witness function by maximizing an asymptotic test-power objective and then use as the test statistic the difference in means of the witness evaluated on two held-out test samples. When the...

Refining one’s hypotheses in the light of data is a common scientific practice; however, the dependency on the data introduces selection bias and can lead to specious statistical analysis. An approach for addressing this is via conditioning on the selection procedure to account for how we have used the data to generate our hypotheses, and prevent i...

This paper is an in-depth investigation of using kernel methods to immunize optimization solutions against distributional ambiguity. We propose kernel distributionally robust optimization (K-DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct ambiguit...

Modern large-scale kernel-based tests such as maximum mean discrepancy (MMD) and kernelized Stein discrepancy (KSD) optimize kernel hyperparameters on a held-out sample via data splitting to obtain the most powerful test statistics. While data splitting results in a tractable null distribution, it suffers from a reduction in test power due to small...

In order to anticipate rare and impactful events, we propose to quantify the worst-case risk under distributional ambiguity using a recent development in kernel methods -- the kernel mean embedding. Specifically, we formulate the generalized moment problem whose ambiguity set (i.e., the moment constraint) is described by constraints in the associat...

We propose two nonparametric statistical tests of goodness of fit for conditional distributions: given a conditional probability density function $p(y|x)$ and a joint sample, decide whether the sample is drawn from $p(y|x)r_x(x)$ for some density $r_x$. Our tests, formulated with a Stein operator, can be applied to any differentiable conditional de...

We propose a new family of specification tests called kernel conditional moment (KCM) tests. Our tests are built on conditional moment embeddings (CMME)---a novel representation of conditional moment restrictions in a reproducing kernel Hilbert space (RKHS). After transforming the conditional moment restrictions into a continuum of unconditional co...

We propose a new family of specification tests called kernel conditional moment (KCM) tests. Our tests are built on a novel representation of conditional moment restrictions in a reproducing kernel Hilbert space (RKHS) called conditional moment embedding (CMME). After transforming the conditional moment restrictions into a continuum of unconditiona...

We propose two nonparametric statistical tests of goodness of fit for conditional distributions: given a conditional probability density function p(y|x) and a joint sample, decide whether the sample is drawn from p(y|x)rx(x) for some density rx. Our tests, formulated with a Stein operator, can be applied to any differentiable conditional density mo...

Modern implicit generative models such as generative adversarial networks (GANs) are generally known to suffer from issues such as instability, uninterpretability, and difficulty in assessing their performance. If we see these implicit models as dynamical systems, some of these issues are caused by being unable to control their behavior in a meanin...

We address the problem of non-parametric multiple model comparison: given $l$ candidate models, decide whether each candidate is as good as the best one(s) or worse than it. We propose two statistical tests, each controlling a different notion of decision errors. The first test, building on the post selection inference framework, provably controls...

Refining one's hypotheses in the light of data is a commonplace scientific practice, however, this approach introduces selection bias and can lead to specious statistical analysis. One approach of addressing this phenomena is via conditioning on the selection procedure, i.e., how we have used the data to generate our hypotheses, and prevents inform...

We develop a novel approximate Bayesian computation (ABC) framework, ABCDP, that obeys the notion of differential privacy (DP). Under our framework, simply performing ABC inference with a mild modification yields differentially private posterior samples. We theoretically analyze the interplay between the ABC similarity threshold $\epsilon_{abc}$ (f...

We propose a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. Our test generalises the kernel Stein discrepancy (KSD) tests of (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018, Jitkrittum et al., 2018) which required exact access to unnormalized...

We propose a novel procedure which adds "content-addressability" to any given unconditional implicit model e.g., a generative adversarial network (GAN). The procedure allows users to control the generative process by specifying a set (arbitrary size) of desired examples based on which similar samples are generated from the model. The proposed appro...

Modern implicit generative models such as generative adversarial networks (GANs) are generally known to suffer from instability and lack of interpretability as it is difficult to diagnose what aspects of the target distribution are missed by the generative model. In this work, we propose a theoretically grounded solution to these issues by augmenti...

We propose a novel procedure which adds "content-addressability" to any given unconditional implicit model e.g., a generative adversarial network (GAN). The procedure allows users to control the generative process by specifying a set (arbitrary size) of desired examples based on which similar samples are generated from the model. The proposed appro...

We address the problem of non-parametric multiple model comparison: given $l$ candidate models, decide whether each candidate is as good as the best one(s) or worse than it. We propose two statistical tests, each controlling a different notion of decision errors. The first test, building on the post selection inference framework, provably controls...

Maximum Likelihood Estimators (MLE) has many good properties. For example, the asymptotic variance of MLE solution attains equality of the asymptotic CramérRao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such MLE solution requires calculating the likelihood function which may...

In kernel methods, the median heuristic has been widely used as a way of setting the bandwidth of RBF kernels. While its empirical performances make it a safe choice under many circumstances, there is little theoretical understanding of why this is the case. Our aim in this paper is to advance our understanding of the median heuristic by focusing o...

Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a...

Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a...

The Kullback-Leilber divergence from model to data is a classic goodness of fit measure but can be intractable in many cases. In this paper, we estimate the ratio function between a data density and a model density with the help of Stein operator. The estimated density ratio allows us to compute the likelihood ratio function which is a surrogate to...

We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to comp...

The kernel mean embedding is known to provide a data representation which preserves full information of the data distribution. While typically computationally costly, its nonparametric nature has an advantage of requiring no explicit model specification of the data. At the other extreme are approaches which summarize data distributions into a finit...

We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to comp...

A new computationally efficient dependence measure, and an adaptive statistical test of independence, are proposed. The dependence measure is the difference between analytic embeddings of the joint distribution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bo...

Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e, features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using th...

Positive and negative moods can be treated as prior expectations over future delivery of rewards and punishments. This provides an inferential foundation for the cognitive (judgement) bias task, now widely-used for assessing affective states in non-human animals. In the task, information about affect is extracted from the optimistic or pessimistic...

Changes in mean (+/- SEM) (a) Affective Grid Activation score and (b) PANAS NA score across the study in Pleasant Room (solid line) and Unpleasant Room (dashed line) subjects.
(EPS)

Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e, features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using th...

Two semimetrics on probability distributions are proposed, based on a difference between features chosen from each, where these features can be in either the spatial or Fourier domains. The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound of power for a statistical test using these featu...

Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the simi...

We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression....

We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression....

We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for non-linear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. The model allows straightforward variational optimisation of the poster...

We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression....

Complicated generative models often result in a situation where computing the
likelihood of observed data is intractable, while simulating from the
conditional density given a parameter value is relatively easy. Approximate
Bayesian Computation (ABC) is a paradigm that enables simulation-based
posterior inference in such cases by measuring the simi...

We propose to learn a kernel-based message operator which takes as input all
expectation propagation (EP) incoming messages to a factor node and produces an
outgoing message. In ordinary EP, computing an outgoing message involves
estimating a multivariate integral which may not have an analytic expression.
Learning such an operator allows one to by...

The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wi...

We propose squared-loss mutual information regularization (SMIR) for multi-class probabilistic classification, following the information maximization principle. SMIR is convex under mild conditions and thus improves the nonconvexity of mutual information regularization. It offers all of the following four abilities to semi-supervised algorithms: An...

Feature selection is a technique to screen out less important features. Many
existing supervised feature selection algorithms use redundancy and relevancy
as the main criteria to select features. However, feature interaction,
potentially a key characteristic in real-world problems, has not received much
attention. As an attempt to take feature inte...

The goal of supervised feature selection is to find a subset of input
features that are responsible for predicting output values. The least absolute
shrinkage and selection operator (Lasso) allows computationally efficient
feature selection based on linear dependency between input features and output
values. In this paper, we consider a feature-wis...

We propose an open-domain question answering system using Thai Wikipedia as the knowledge base. Two types of information are used for answering a question: (1) structured information extracted and stored in the form of Resource Description Framework (RDF), and (2) unstructured texts stored as a search index. For the structured information, SPARQL t...

We propose a feature called category browsing to enhance the full-text search function of Thai-language news article search engine. The category browsing allows users to browse and filter search results based on some predefined categories. To implement the category browsing feature, we applied and compared among several text categorization algorith...

In this paper, we describe our ongoing project to help alleviate the digital divide problem among high schools in rural areas
of Thailand. The idea is to select, organize, index and distribute useful educational Web contents to schools where the Internet
connection is not available. These Web contents can be used by teachers and students to enhance...