Content uploaded by Shai Ben-David


Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaptation. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set.
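The generalization bound referred to above is usually quoted in the following form (a paraphrase in the notation of the follow-up journal treatment, not the paper's exact statement): for every hypothesis h in a class H,

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda,
\qquad
\lambda \;=\; \min_{h' \in \mathcal{H}} \big( \epsilon_S(h') + \epsilon_T(h') \big),
```

where epsilon_S and epsilon_T are the source and target errors, d_H is the H-divergence between the induced source and target distributions, and lambda measures how well a single hypothesis can do on both domains at once.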


... Domain shift is often measured by the dissimilarity of the distributions of each domain. A number of metrics have been proposed to this end (Ben-David et al., 2006; 2010; Mansour et al., 2009; Germain et al., 2013), but the notion most relevant to the present study is that of H-divergence. Building on the work of Kifer et al. (2004), it was later used in the formalization of the domain adaptation theory of Ben-David et al. (2006; 2010). ...
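For reference, the H-divergence mentioned here is typically defined as follows (stated from the standard formulation; notation is mine):

```latex
d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}')
  \;=\; 2 \sup_{h \in \mathcal{H}}
  \Big|\; \Pr_{x \sim \mathcal{D}}\big[h(x)=1\big]
     \;-\; \Pr_{x \sim \mathcal{D}'}\big[h(x)=1\big] \;\Big|
```

That is, two distributions are far apart exactly when some hypothesis in H labels them very differently, which is what makes the quantity estimable from finite samples using a classifier.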

... A number of metrics have been proposed to this end (Ben-David et al., 2006; 2010; Mansour et al., 2009; Germain et al., 2013), but the notion most relevant to the present study is that of H-divergence. Building on the work of Kifer et al. (2004), it was later used in the formalization of the domain adaptation theory of Ben-David et al. (2006; 2010). This same theoretical framework led to the Domain-Adversarial Neural Networks (DANN) architecture, one of the first successful deep approaches for DA (Ganin et al., 2016). ...
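In practice the H-divergence is often approximated by training a domain classifier; a common empirical surrogate is the "proxy A-distance" d_A = 2(1 - 2*err), with err the classifier's error at telling source from target samples. A self-contained sketch, assuming a simple linear classifier trained by gradient descent (the hyperparameters and toy data are illustrative choices, not from the cited papers):

```python
import numpy as np

def proxy_a_distance(source, target, epochs=500, lr=0.1):
    """Train a linear domain classifier to tell source from target samples
    and return d_A = 2 * (1 - 2 * err), where err is its training error.
    Values near 0: domains are hard to distinguish; near 2: trivially easy."""
    X = np.vstack([source, target])
    y = np.concatenate([np.zeros(len(source)), np.ones(len(target))])
    X = np.hstack([X, np.ones((len(X), 1))])        # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        z = np.clip(X @ w, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-z))                # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)            # logistic-loss gradient step
    err = np.mean((X @ w > 0) != y)
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(0)
# Identical domains: classifier cannot do better than chance, d_A near 0.
same = proxy_a_distance(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
# Strongly shifted domains: classifier separates them easily, d_A near 2.
shifted = proxy_a_distance(rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2)))
```

DANN's training objective can be read as driving this quantity down: if no classifier can distinguish the domains in feature space, the divergence term in the bound is small.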

Machine learning techniques have enabled researchers to leverage neuroimaging data to decode speech from brain activity, with some amazing recent successes achieved by applications built using invasive devices. However, research requiring surgical implants has a number of practical limitations. Non-invasive neuroimaging techniques provide an alternative but come with their own set of challenges, the limited scale of individual studies being among them. Without the ability to pool the recordings from different non-invasive studies, data on the order of magnitude needed to leverage deep learning techniques to their full potential remains out of reach. In this work, we focus on non-invasive data collected using magnetoencephalography (MEG). We leverage two different, leading speech decoding models to investigate how an adversarial domain adaptation framework augments their ability to generalize across datasets. We successfully improve the performance of both models when training across multiple datasets. To the best of our knowledge, this study is the first application of feature-level, deep-learning-based harmonization for MEG neuroimaging data. Our analysis additionally offers further evidence of the impact of demographic features on neuroimaging data, demonstrating that participant age strongly affects how machine learning models solve speech decoding tasks using MEG data. Lastly, in the course of this study we produce a new open-source implementation of one of these models to the benefit of the broader scientific community.

... The training of deep neural networks commonly relies on the assumption that the distribution of the training data is representative of the distribution at inference. Despite being widely adopted, this assumption has been heavily criticised as it is often challenged in practice [21,5,1,2,17]. Indeed, despite tremendous progress on many vision tasks over recent years, deep neural networks face limitations and reduced performance on tasks requiring the model to shift from its training distribution at test time [25]. ...

Deep neural networks can obtain impressive performance on various tasks under the assumption that their training domain is identical to their target domain. Performance can drop dramatically when this assumption does not hold. One explanation for this discrepancy is the presence of spurious domain-specific correlations in the training data that the network exploits. Causal mechanisms, on the other hand, can be made invariant under distribution changes, as they allow disentangling the factors underlying the data generation. Yet, learning causal mechanisms to improve out-of-distribution generalisation remains an under-explored area. We propose a Bayesian neural architecture that disentangles the learning of the data distribution from the inference process mechanisms. We show theoretically and experimentally that our model approximates reasoning under causal interventions. We demonstrate the performance of our method, outperforming point-estimate counterparts, on out-of-distribution image recognition tasks where the data distribution acts as a strong adversarial confounder.

... Multi-source domain generalization. Domain-invariant feature learning is the most popular family of approaches, originating from the results of Ben-David et al. [1]: The upper bound of the target domain error is a function of the discriminability between the source and target domain in the feature space. Many follow-up methods exist in the MSDG literature, such as kernel-based approaches [43] or multi-task autoencoders [21]. ...

Single-source domain generalization attempts to learn a model on a source domain and deploy it to unseen target domains. Limiting access only to source domain data imposes two key challenges - how to train a model that can generalize and how to verify that it does. The standard practice of validation on the training distribution does not accurately reflect the model's generalization ability, while validation on the test distribution is a malpractice to avoid. In this work, we construct an independent validation set by transforming source domain images with a comprehensive list of augmentations, covering a broad spectrum of potential distribution shifts in target domains. We demonstrate a high correlation between validation and test performance for multiple methods and across various datasets. The proposed validation achieves a relative accuracy improvement over the standard validation equal to 15.4% or 1.6% when used for method selection or learning rate tuning, respectively. Furthermore, we introduce a novel family of methods that increase the shape bias through enhanced edge maps. To benefit from the augmentations during training and preserve the independence of the validation set, a k-fold validation process is designed to separate the augmentation types used in training and validation. The method that achieves the best performance on the augmented validation is selected from the proposed family. It achieves state-of-the-art performance on various standard benchmarks. Code at: https://github.com/NikosEfth/crafting-shifts
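The k-fold separation of augmentation types described above might be sketched as follows. The augmentation names and fold count are placeholders of mine, not the paper's actual lists; the point is only the mechanics of keeping validation-time augmentations disjoint from training-time ones:

```python
import random

# Hypothetical augmentation families; a real list would hold image transforms.
AUGMENTATIONS = ["blur", "jpeg", "contrast", "edge_map", "color_jitter", "noise"]

def augmentation_folds(augmentations, k=3, seed=0):
    """Split augmentation *types* into k disjoint folds. In split i, one
    fold of types is reserved for building the validation set while the
    remaining folds are used for training-time augmentation, so the
    augmented validation set stays independent of training."""
    augs = list(augmentations)
    random.Random(seed).shuffle(augs)
    folds = [augs[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [a for j, fold in enumerate(folds) if j != i for a in fold]
        splits.append((train, val))
    return splits

splits = augmentation_folds(AUGMENTATIONS)
```

Each candidate method is then scored on validation sets built only from augmentations it never trained with, averaged over the k splits.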

... This will open the opportunity to quantify the so-called domain gap, something that has not been achieved so far. The domain gap refers to differences in data characteristics and distribution between source and target domains [5]. In the general context of machine learning, it is typically the difference between the training dataset and the real data the model is applied to. ...

In supervised learning, it is almost always assumed that the training and test input points follow the same probability distribution. However, this assumption is violated, e.g., in interpolation, extrapolation, active learning, or classification with imbalanced data. In such situations, known as covariate shift, the cross-validation estimate of the generalization error is biased, which results in poor model selection. In this paper, we propose an alternative estimator of the generalization error which, under covariate shift, is exactly unbiased if the model includes the learning target function, and is asymptotically unbiased in general. We also show that, in addition to unbiasedness, the proposed generalization error estimator can accurately estimate the difference of the generalization error among different models, which is a desirable property in model selection. Numerical studies show that the proposed method compares favorably with cross-validation.
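The density-ratio idea underlying such covariate-shift corrections can be illustrated with importance weighting. This is a generic sketch, not the paper's exact estimator, and it assumes both densities are known in closed form:

```python
import numpy as np

def importance_weighted_error(losses, p_test, p_train):
    """Estimate the test-distribution expected loss from training-point
    losses by reweighting each point with the density ratio
    p_test(x) / p_train(x). Under covariate shift this reweighted
    average is unbiased for the test-domain expected loss."""
    w = p_test / p_train
    return np.mean(w * losses)

def pdf(x, mu, s):
    """Gaussian density, used here as a stand-in for known densities."""
    return np.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)   # training points ~ N(0, 1)
losses = x ** 2                     # loss of some fixed model at each point
# Test distribution is N(0.5, 1); true expected loss there is 0.5^2 + 1 = 1.25,
# while the naive unweighted average estimates the training value of 1.0.
est = importance_weighted_error(losses, pdf(x, 0.5, 1.0), pdf(x, 0.0, 1.0))
```

The naive average stays near the training-domain value while the weighted estimator tracks the shifted test domain, which is exactly why unweighted cross-validation misleads model selection under covariate shift.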

We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
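The building block the HDP ties together across groups is a single Dirichlet process, whose mixture weights admit the stick-breaking construction. A truncated sketch (parameter values are illustrative):

```python
import numpy as np

def stick_breaking(alpha, n_sticks, rng):
    """GEM(alpha) stick-breaking weights: draw beta_k ~ Beta(1, alpha)
    and set pi_k = beta_k * prod_{j<k} (1 - beta_j). Truncated at
    n_sticks, so the weights sum to just under 1; smaller alpha
    concentrates mass on fewer mixture components."""
    betas = rng.beta(1.0, alpha, size=n_sticks)
    # Mass remaining before breaking stick k: prod of (1 - beta_j), j < k.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking(alpha=1.0, n_sticks=200, rng=rng)
```

In the HDP, each group's mixing measure is itself a Dirichlet process whose base measure is a shared draw of this kind, which is what lets components be reused across groups.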

We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data.

Detecting changes in a data stream is an important area of research with many applications. In this paper, we present a novel method for the detection and estimation of change. In addition to providing statistical guarantees on the reliability of detected changes, our method also provides meaningful descriptions and quantification of these changes. Our approach assumes that the points in the stream are independently generated, but otherwise makes no assumptions on the nature of the generating distribution. Thus our techniques work for both continuous and discrete data. In an experimental study we demonstrate the power of our techniques.
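A hedged sketch of the two-window flavor of distribution-free change detection: compare a reference window against a sliding window with the two-sample Kolmogorov-Smirnov statistic. The window size and threshold are illustrative, and the paper's actual test statistics and guarantees differ in detail:

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of samples a and b, computed by merging."""
    a, b = sorted(a), sorted(b)
    ia = ib = 0
    d = 0.0
    while ia < len(a) and ib < len(b):
        if a[ia] <= b[ib]:
            ia += 1
        else:
            ib += 1
        d = max(d, abs(ia / len(a) - ib / len(b)))
    return d

def detect_change(stream, window=100, threshold=0.3):
    """Return the first index where the current window's distribution
    drifts from the initial reference window beyond the threshold."""
    reference = stream[:window]
    for start in range(window, len(stream) - window + 1):
        if ks_statistic(reference, stream[start:start + window]) > threshold:
            return start
    return None

rng = random.Random(0)
# Mean shift from N(0, 1) to N(3, 1) at index 300.
stream = [rng.gauss(0, 1) for _ in range(300)] + [rng.gauss(3, 1) for _ in range(300)]
change_at = detect_change(stream)
```

Because the KS statistic depends only on ranks, the same detector works unchanged for continuous or discrete-valued streams, matching the distribution-free claim in the abstract.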

We study the phenomenon of cognitive learning from an algorithmic standpoint. How does the brain effectively learn concepts from a small number of examples despite the fact that each example contains a huge amount of information? We provide a novel algorithmic analysis via a model of robust concept learning (closely related to “margin classifiers”), and show that a relatively small number of examples are sufficient to learn rich concept classes. The new algorithms have several advantages: they are faster, conceptually simpler, and resistant to low levels of noise. For example, a robust half-space can be learned in linear time using only a constant number of training examples, regardless of the number of attributes. A general (algorithmic) consequence of the model, that “more robust concepts are easier to learn”, is supported by a multitude of psychological studies.
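One way to make the margin intuition concrete is the classical perceptron, whose mistake bound (R/gamma)^2 depends on the margin gamma and data radius R rather than directly on the number of attributes. This sketch is mine, not the paper's algorithm (which additionally uses techniques such as random projection), and the data generation is a toy construction that enforces a margin:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50

# Linearly separable data with an enforced margin along w_true.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
X = rng.normal(size=(n, d))
y = np.where(X @ w_true >= 0, 1, -1)
X += y[:, None] * w_true              # push each point margin >= 1 off the boundary

# Perceptron: update on every margin violation; the convergence theorem
# caps total mistakes at (R / gamma)^2, independent of d directly.
w = np.zeros(d)
mistakes = 0
converged = False
for _ in range(1000):
    errors = 0
    for i in range(n):
        if y[i] * (X[i] @ w) <= 0:
            w += y[i] * X[i]
            mistakes += 1
            errors += 1
    if errors == 0:                   # a full clean pass: done
        converged = True
        break
accuracy = np.mean(np.sign(X @ w) == y)
```

With margin at least 1 and point norms around sqrt(d), the bound predicts on the order of d mistakes here; the "more robust concepts are easier to learn" claim corresponds to larger gamma shrinking this bound.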

We address the computational complexity of learning in the agnostic framework. For a variety of common concept classes we prove that, unless P=NP, there is no polynomial time approximation scheme for finding a member in the class that approximately maximizes the agreement with a given training sample. In particular our results apply to the classes of monomials, axis-aligned hyper-rectangles, closed balls and monotone monomials. For each of these classes, we prove the NP-hardness of approximating maximal agreement to within some fixed constant (independent of the sample size and of the dimensionality of the sample space). For the class of half-spaces, we prove that, for any ε>0, it is NP-hard to approximately maximize agreements to within a factor of (418/415−ε), improving on the best previously known constant for this problem, and using a simpler proof. An interesting feature of our proofs is that, for each of the classes we discuss, we find patterns of training examples that, while being hard for approximating agreement within that concept class, allow efficient agreement maximization within other concept classes. These results bring up a new aspect of the model selection problem: they imply that the choice of hypothesis class for agnostic learning from among those considered in this paper can drastically affect the computational complexity of the learning process.

Linear prediction methods, such as least squares for regression, and logistic regression and support vector machines for classification, have been extensively used in statistics and machine learning. In this paper, we study stochastic gradient descent (SGD) algorithms on regularized forms of linear prediction methods. This class of methods, related to online algorithms such as the perceptron, are both efficient and very simple to implement. We obtain numerical rates of convergence for such algorithms and discuss their implications. Experiments on text data are provided to demonstrate the numerical and statistical consequences of our theoretical findings.
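A minimal sketch of the kind of algorithm analyzed: SGD on L2-regularized logistic regression, one example per update. Step size, epoch count, and the synthetic data are illustrative choices of mine, not taken from the paper:

```python
import numpy as np

def sgd_logistic(X, y, lam=1e-3, lr=0.1, epochs=20, seed=0):
    """SGD on the L2-regularized logistic loss
        (1/n) * sum_i log(1 + exp(-y_i * w.x_i)) + (lam/2) * ||w||^2,
    with labels y_i in {-1, +1} and one stochastic step per example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w)
            # Gradient of the per-example regularized logistic loss.
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
w_star = rng.normal(size=10)
y = np.sign(X @ w_star)          # noiseless linearly separable labels
w = sgd_logistic(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

Each update touches a single example, so the cost per step is O(d) regardless of the dataset size, which is the efficiency the abstract highlights.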

Discriminative learning methods are widely used in natural language processing. These methods work best when their training and test data are drawn from the same distribution. For many NLP tasks, however, we are confronted with new domains in which labeled data is scarce or non-existent. In such cases, we seek to adapt existing models from a resource-rich source domain to a resource-poor target domain. We introduce structural correspondence learning to automatically induce correspondences among features from different domains. We test our technique on part-of-speech tagging and show performance gains for varying amounts of source and target training data, as well as improvements in target domain parsing accuracy using our improved tagger.
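Structural correspondence learning can be sketched roughly as: train predictors for designated pivot features (features frequent in both domains), then factor their stacked weights with an SVD to obtain a shared low-dimensional projection. The sketch below swaps the paper's modified-Huber pivot predictors for plain least squares, and uses synthetic noise data purely to show the mechanics and shapes:

```python
import numpy as np

def scl_projection(X, pivot_idx, k=2):
    """Minimal linear SCL sketch: predict each pivot feature from the
    remaining features, stack the predictor weight vectors into a
    matrix, and keep the top-k left singular vectors as a shared
    projection of the non-pivot feature space."""
    nonpivot_idx = [j for j in range(X.shape[1]) if j not in pivot_idx]
    Xn = X[:, nonpivot_idx]
    # One weight column per pivot predictor (least-squares stand-in).
    W = np.column_stack([
        np.linalg.lstsq(Xn, X[:, p], rcond=None)[0] for p in pivot_idx
    ])
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :k]          # projection: R^|nonpivot| -> R^k
    return nonpivot_idx, theta

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))                    # rows: examples, cols: features
nonpivot_idx, theta = scl_projection(X, pivot_idx=[0, 1, 2], k=2)
# Augment the original features with the shared low-dimensional ones.
augmented = np.hstack([X, X[:, nonpivot_idx] @ theta])
```

A discriminative tagger trained on the augmented source-domain features can then transfer, because the projected dimensions capture correspondences that are predictive in both domains.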