Article

BuresNet: Conditional Bures Metric for Transferable Representation Learning

Abstract

As a fundamental mechanism of learning and cognition, transfer learning has attracted widespread attention in recent years. Typical transfer learning tasks include unsupervised domain adaptation (UDA) and few-shot learning (FSL), both of which attempt to transfer discriminative knowledge from the training environment to the test environment to improve the model's generalization performance. Previous transfer learning methods usually ignore the potential conditional distribution shift between environments, which degrades discriminability in the test environment. Constructing a learnable and interpretable metric that measures, and then reduces, the gap between conditional distributions is therefore an important problem. In this work, we design the Conditional Kernel Bures (CKB) metric for characterizing conditional distribution discrepancy, and derive an empirical estimator with a convergence guarantee. CKB provides a statistical and interpretable approach, under the optimal transport framework, to understanding the knowledge transfer mechanism; it is essentially an extension of optimal transport from marginal distributions to conditional distributions. CKB can be used as a plug-and-play module placed on the loss layer of a deep network, where it plays a bottleneck role in representation learning. From this perspective, the resulting network architecture is abbreviated as BuresNet, which extracts conditionally invariant features for both UDA and FSL tasks and can be trained in an end-to-end manner. Extensive experimental results on several benchmark datasets validate the effectiveness of BuresNet.
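The full CKB metric is defined on conditional kernel covariance operators with a consistency-guaranteed empirical estimator; purely as a simplified, hedged illustration of the underlying idea (matching class-conditional second-order feature statistics under the Bures-Wasserstein geometry), a PyTorch-style sketch might look as follows. The function names are placeholders of mine, target labels would be pseudo-labels in an unsupervised setting, and the kernelization of the actual method is omitted.

```python
import torch

def _sqrtm_psd(mat, eps=1e-6):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, v = torch.linalg.eigh(mat)
    w = torch.clamp(w, min=eps)
    return (v * w.sqrt()) @ v.T

def bures_distance_sq(cov_a, cov_b, eps=1e-6):
    # Squared Bures-Wasserstein distance between covariance matrices:
    # tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}).
    sqrt_a = _sqrtm_psd(cov_a, eps)
    cross = _sqrtm_psd(sqrt_a @ cov_b @ sqrt_a, eps)
    return torch.trace(cov_a) + torch.trace(cov_b) - 2.0 * torch.trace(cross)

def class_conditional_bures(feats_s, labels_s, feats_t, labels_t, num_classes, eps=1e-6):
    # Average Bures distance between per-class feature covariances of the two
    # domains; with an unlabeled target, labels_t would be pseudo-labels.
    eye = eps * torch.eye(feats_s.shape[1], device=feats_s.device)
    total, used = feats_s.new_zeros(()), 0
    for c in range(num_classes):
        xs, xt = feats_s[labels_s == c], feats_t[labels_t == c]
        if len(xs) < 2 or len(xt) < 2:
            continue
        total = total + bures_distance_sq(torch.cov(xs.T) + eye, torch.cov(xt.T) + eye, eps)
        used += 1
    return total / max(used, 1)
```

In a BuresNet-like pipeline, such a scalar would be added to the classification loss at the bottleneck feature layer and back-propagated through the feature extractor.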

... Tachet des Combes et al. (2020) have demonstrated the necessity of reducing conditional shift for UDA. Several works explore the locality consistency across domains by matching conditional and even joint distributions, e.g., extensions of MMD (Long et al., 2017;Zhu et al., 2021b), conditional adversarial adaptation (Long et al., 2018), centroid alignment (Yang et al., 2022), OT-based joint distribution matching (Bhushan Damodaran et al., 2018;Courty et al., 2017), and conditional statistic alignment (Ren et al., 2023). Most of these methods rely on target pseudo-labels since the target domain is unlabeled. ...
... Xia and Ding (2020) leverage the Gromov-Wasserstein distance to extract cross-domain matching relations. Inspired by Zhang et al. (2019b), Ren et al. (2023) propose the Conditional Kernel Bures metric in BuresNet for characterizing the class-wise domain discrepancy. ...
... The compared methods can be roughly categorized into three groups. (1): Moment alignment methods, e.g., DAN, JAN (Long et al., 2017), TPN (Pan et al., 2019), DSAN (Zhu et al., 2021b), DSAN+CAFT (Kumar et al., 2023), and BuresNet (Ren et al., 2023). (2): Adversarial methods, e.g., DANN (Ganin et al., 2016), CDAN (Long et al., 2018), CDAN+E (Long et al., 2018), CDAN+SDAT (Rangwani et al., 2022), TADA, and SCDA (Li et al., 2021b). ...
Article
Full-text available
As a vital problem in pattern analysis and machine intelligence, Unsupervised Domain Adaptation (UDA) attempts to transfer an effective feature learner from a labeled source domain to an unlabeled target domain. Inspired by the success of the Transformer, several advances in UDA are achieved by adopting pure transformers as network architectures, but such a simple application can only capture patch-level information and lacks interpretability. To address these issues, we propose the Domain-Transformer (DoT) with domain-level attention mechanism to capture the long-range correspondence between the cross-domain samples. On the theoretical side, we provide a mathematical understanding of DoT: (1) We connect the domain-level attention with optimal transport theory, which provides interpretability from Wasserstein geometry; (2) From the perspective of learning theory, Wasserstein distance-based generalization bounds are derived, which explains the effectiveness of DoT for knowledge transfer. On the methodological side, DoT integrates the domain-level attention and manifold structure regularization, which characterize the sample-level information and locality consistency for cross-domain cluster structures. Besides, the domain-level attention mechanism can be used as a plug-and-play module, so DoT can be implemented under different neural network architectures. Instead of explicitly modeling the distribution discrepancy at domain-level or class-level, DoT learns transferable features under the guidance of long-range correspondence, so it is free of pseudo-labels and explicit domain discrepancy optimization. Extensive experiment results on several benchmark datasets validate the effectiveness of DoT.
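The exact DoT formulation is not reproduced here; purely as a hedged sketch of how a domain-level attention could induce a transport-plan-like coupling between cross-domain samples, one might write something like the following (names and the scaling choice are illustrative, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def domain_level_attention(feat_t, feat_s, temperature=0.05):
    # Row-stochastic coupling: each target sample gets a soft correspondence
    # over all source samples, loosely playing the role of a transport plan.
    sim = feat_t @ feat_s.T / (temperature * feat_t.shape[1] ** 0.5)
    coupling = F.softmax(sim, dim=1)               # (n_t, n_s)
    transported = coupling @ feat_s                # source features mapped toward targets
    align_loss = F.mse_loss(transported, feat_t)   # pull matched representations together
    return coupling, align_loss
```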
... Aiming to relax the identical distribution assumption in the standard learning scenario, dataset shift, also known as distribution shift, has received increasing attention in the machine learning, computer vision and statistics communities [1], [2]. In the dataset shift scenario, the primary goal is to learn an invariant model for potentially changing real-world environments, which is closely connected with the domain adaptation (DA) problem [3]-[9]. Specifically, the model trained on a source domain (distribution) P with sufficient knowledge (e.g., annotations) is supposed to be unbiased on a related but different target domain (distribution) Q with less or no prior knowledge, i.e., semi-supervised and unsupervised transfer. ...
... To characterize the essence of shifting distributions, several assumptions have been made based on different factorizations of distributions. Most works formulate dataset shift as covariate shift [20], [21] or conditional shift [3], [9], [22], [23], where the distributions over the covariate X, or over X given the label Y, are shifting, i.e., P_X ≠ Q_X or P_{X|Y} ≠ Q_{X|Y}, respectively. These assumptions are closely related to an important and popular framework called invariant representation learning [4], [5], [11], [24], [25], where the shifting distributions are supposed to be aligned in a latent space via a mapping g: X → Z. Another fruitful assumption is label shift [26]-[29]. ...
... As conditional alignment focuses on the distributions P_{Z|Y=y} and Q_{Z|Y=y}, it is relatively reliable when label shift exists, and also more accurate than marginal/reweighting methods with global structure matching [13], [22]. Typical methods include conditional MMD [48], adversarial training with conditional information [35] and conditional variants of OT [9], [23]. ...
Preprint
As a crucial step toward real-world learning scenarios with changing environments, dataset shift theory and invariant representation learning algorithms have been extensively studied to relax the identical distribution assumption in the classical learning setting. Among the different assumptions on the essence of shifting distributions, generalized label shift (GLS) is the most recently developed one, showing great potential to deal with the complex factors within the shift. In this paper, we aim to explore the limitations of current dataset shift theory and algorithms, and further provide new insights by presenting a comprehensive understanding of GLS. From the theoretical aspect, two informative generalization bounds are derived, and the GLS learner is proved to be sufficiently close to the optimal target model from the Bayesian perspective. The main results show the insufficiency of invariant representation learning, and prove the sufficiency and necessity of GLS correction for generalization, which provides theoretical support and innovations for exploring generalizable models under dataset shift. From the methodological aspect, we provide a unified view of existing shift correction frameworks, and propose a kernel embedding-based correction algorithm (KECA) to minimize the generalization error and achieve successful knowledge transfer. Both theoretical results and extensive experimental evaluations demonstrate the sufficiency and necessity of GLS correction for addressing dataset shift and the superiority of the proposed algorithm.
... Thus, it is appealing to learn discriminative and domain-invariant features by applying OT to domain adaptation. Specifically, various OT-based methods have been proposed, mainly for UDA, including matching domains by learning marginal invariant features [30], [3], [31], joint invariant features [32], [14], and class-conditional invariant features [33]. Several OT-based PDA methods [34], [35], [36] have also been proposed, which are mostly based on Unbalanced OT (UOT) [37]. ...
... Xu et al. [41] incorporate spatial prototype information and intra-domain structures to construct a weighted Kantorovich formulation. Ren et al. [33] propose a variant of the OT distance to quantify the class-conditional distribution discrepancy between domains. ...
Preprint
Full-text available
Visual domain adaptation aims to learn discriminative and domain-invariant representation for an unlabeled target domain by leveraging knowledge from a labeled source domain. Partial domain adaptation (PDA) is a general and practical scenario in which the target label space is a subset of the source one. The challenges of PDA exist due to not only domain shift but also the non-identical label spaces of domains. In this paper, a Soft-masked Semi-dual Optimal Transport (SSOT) method is proposed to deal with the PDA problem. Specifically, the class weights of domains are estimated, and then a reweighed source domain is constructed, which is favorable in conducting class-conditional distribution matching with the target domain. A soft-masked transport distance matrix is constructed by category predictions, which will enhance the class-oriented representation ability of optimal transport in the shared feature space. To deal with large-scale optimal transport problems, the semi-dual formulation of the entropy-regularized Kantorovich problem is employed since it can be optimized by gradient-based algorithms. Further, a neural network is exploited to approximate the Kantorovich potential due to its strong fitting ability. This network parametrization also allows the generalization of the dual variable outside the supports of the input distribution. The SSOT model is built upon neural networks, which can be optimized alternately in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets to demonstrate the effectiveness of SSOT.
... Various UDA approaches have been proposed to reduce domain shift. The main idea is to minimize the domain discrepancy and learn domain-invariant features, which can be broadly classified into distance metric-based domain adaptation (Zellinger et al. 2017; Ren, Luo, and Dai 2023) and adversarial learning-based domain adaptation (Long et al. 2018; Chen et al. 2022). Distance metric-based methods aim to learn a shared feature representation by minimizing a measure of the domain discrepancy. ...
... JAN (Long et al. 2017a) introduces the class-specific MMD to mitigate domain discrepancy at the class level. BuresNet (Ren, Luo, and Dai 2023) proposes a conditional Bures metric to align class-conditioned distributions across domains. Adversarial method DANN (Ganin and Lempitsky 2015) aims to learn domain-invariant representations by confusing a domain discriminator. ...
Article
Optimal transport (OT) is an important methodology to measure distribution discrepancy, which has achieved promising performance in artificial intelligence applications, e.g., unsupervised domain adaptation. However, from the view of transportation, there are still limitations: 1) the local discriminative structures for downstream tasks, e.g., cluster structure for classification, cannot be explicitly admitted by the learned OT plan; 2) the entropy regularization induces a dense OT plan with increasing uncertainty. To tackle these issues, we propose a novel Probability-Polarized OT (PPOT) framework, which can characterize the structure of OT plan explicitly. Specifically, the probability polarization mechanism is proposed to guide the optimization direction of OT plan, which generates a clear margin between similar and dissimilar transport pairs and reduces the uncertainty. Further, a dynamic mechanism for margin is developed by incorporating task-related information into the polarization, which directly captures the intra/inter class correspondence for knowledge transportation. A mathematical understanding for PPOT is provided from the view of gradient, which ensures interpretability. Extensive experiments on several datasets validate the effectiveness and empirical efficiency of PPOT.
... Most transfer learning methods are proposed for Unsupervised Domain Adaptation (UDA) with a classification task. They align the feature distributions of the source and target domains by distribution discrepancy measures [23]-[27], domain adversarial learning [28]-[30], etc. ...
Preprint
Full-text available
While data-driven methods such as neural operators have achieved great success in solving differential equations (DEs), they suffer from domain shift problems caused by different learning environments (with data bias or equation changes), which can be alleviated by transfer learning (TL). However, existing TL methods adopted for DE problems lack either generalizability to general DE problems or physics preservation during training. In this work, we focus on a general transfer learning method that adaptively corrects the domain shift and preserves physical information. Mathematically, we characterize the data domain as a product distribution and the essential problems as distribution bias and operator bias. A Physics-preserved Optimal Tensor Transport (POTT) method that simultaneously admits generalizability to common DEs and physics preservation for a specific problem is proposed to adapt the data-driven model to the target domain, utilizing the push-forward distribution induced by the POTT map. Extensive experiments demonstrate the superior performance, generalizability and physics preservation of the proposed POTT method.
... Classic methods primarily attempt to reduce the difference between the source and target domain data to align them [13], such as CDA [27], which adds a maximum mean discrepancy (MMD) loss [4] on top of contrastive representations to minimize domain discrepancies. Additionally, the authors of [28] designed a Conditional Kernel Bures distance based on the discrepancy between conditional feature distributions, providing an interpretable approach for domain adaptation. Recently, knowledge distillation methods based on teacher-student models have also become a research focus. ...
Article
Full-text available
Unsupervised domain adaptation (UDA) aims to reduce the domain differences between source and target domains by mapping their data to a shared feature space, thereby learning domain-invariant features. The aim of this study is to address the challenges faced by contrastive learning-based UDA methods when dealing with domain discrepancies, particularly the spurious correlations introduced by confounding factors caused by data augmentation. In recent years, contrastive learning has gained attention for its powerful representation learning capabilities, as it can pull similar samples from the source and target domains closer together while separating different classes of negative samples. This process helps alleviate domain differences and enhances the model’s generalization ability. However, mainstream UDA methods based on contrastive learning often introduce confounding factors due to the randomness of data augmentation, leading the model to learn incorrect spurious associations, especially when the target domain contains counterfactual data from the source domain. As the amount of counterfactual data increases, this bias and accuracy loss can significantly exacerbate and are difficult to eliminate through non-causal methods. To address this, this paper proposes causal invariance contrastive adaptation (CICA), a causal-contrastive learning-based unsupervised domain adaptation model for image classification. The model inputs labeled source domain samples and unlabeled target domain samples into a feature generator after data augmentation, and quantifies the degree of confusion between the generated features based on a backdoor criterion. We effectively separate domain-invariant features from spurious features using adversarial training, thereby reducing the interference of confounding factors on the domain adaptation task. Our experiments conducted on four domain adaptation image classification benchmark datasets and one counterfactual dataset show that the model achieves a significant improvement in average classification accuracy compared to state-of-the-art methods on the benchmark datasets, while still maintaining advanced performance on the counterfactual dataset.
... Long et al. proposed minimizing the multi-kernel MMD between two domains along with the classification prediction error, thus learning abstract feature representations at different levels to align the domains [21]. Meanwhile, Ren et al. designed a Conditional Kernel Bures distance based on conditional distribution differences to offer an interpretable transfer method [22]. ...
Article
Full-text available
In the realm of Unsupervised Domain Adaptation (UDA), adversarial learning has achieved significant progress. Existing adversarial UDA methods typically employ additional discriminators and feature extractors to engage in a max-min game. However, these methods often fail to effectively utilize the predicted discriminative information, resulting in mode collapse of the generator. In this paper, we propose a Dynamic Balance-based Domain Adaptation (DBDA) method for self-correlated domain adaptive image classification. Instead of adding extra discriminators, we repurpose the classifier as a discriminator and introduce a dynamic balancing learning approach. This approach ensures explicit domain alignment and category distinction, enabling DBDA to fully leverage the predicted discriminative information for effective feature alignment. Experiments conducted on multiple datasets demonstrate that the proposed method maintains robust classification performance across various scenarios.
... The CKB, a new measure for gauging conditional distribution disparities [29,30], finds its niche within the realm of Optimal Transport (OT). Operating as a statistically grounded and interpretable tool, CKB facilitates an in-depth exploration of the knowledge transfer mechanisms inherent in transfer learning models. ...
Article
Full-text available
In recent years, most research on bearing fault diagnosis has assumed that the source domain and target domain data come from the same machine. The differences in equipment lead to a decrease in diagnostic accuracy. To address this issue, unsupervised domain adaptation techniques have been introduced. However, most cross-device fault diagnosis models overlook the discriminative information under the marginal distribution, which restricts the performance of the models. In this paper, we propose a bearing fault diagnosis method based on envelope spectrum and conditional metric learning. First, envelope spectral analysis is used to extract frequency domain features. Then, to fully utilize the discriminative information from the label distribution, we construct a deep Siamese convolutional neural network based on conditional metric learning to eliminate the data distribution differences and extract common features from the source and target domain data. Finally, dynamic weighting factors are employed to improve the convergence performance of the model and optimize the training process. Experimental analysis is conducted on 12 cross-device tasks and compared with other relevant methods. The results show that the proposed method achieves the best performance on all three evaluation metrics.
... Recently, optimal transport (OT) theory, such as the Wasserstein distance (WD), has been successfully applied to the cross-domain transfer problem [14][15][16][17][18][19][20]. OT defines the minimum cost of moving one data distribution onto another, and the resulting metric quantifies the discrepancy between the two distributions, no matter how severe it is. ...
Article
Transfer learning techniques have been extensively developed for the intelligent diagnosis of rotating machinery as a critical and valuable tool dedicated to minimizing the distributional discrepancies between different working conditions of the machine. However, conditional probability information about fault classes and the geometric features of data distribution is rarely considered in traditional distance metrics, invalidating cross-domain diagnostic models when faced with significant distributional discrepancies. To address these issues, a new cross-domain diagnostic algorithm is proposed via joint conditional Wasserstein distance matching. First, the conditional Bures-Wasserstein distance is constructed based on the second-order statistic cross-covariance operator, approximating the distributions in the source and target domains while constraining the geometry. Then, to avoid losing first-order fault data information, the conditional probability 1-Wasserstein distance is embedded to construct a joint distance adaptation. The entropy loss is introduced into the training process to build reliable pseudo labels for the target domain samples. In the proposed method, the samples and label features of different domains are mapped to the reproducing kernel Hilbert space (RKHS), and the Feature Extractor and Classifier modules of the model are jointly optimized to obtain a more robust diagnostic model. The proposed cross-domain diagnostic model is experimentally validated on bearing and gearbox datasets under variable loads and speeds with significant diagnostic performance compared to existing transfer models.
... CKB is a novel conditional distribution discrepancy measure [30,31], whose effectiveness and interpretability have been validated in the fields of computer vision and pattern recognition. Considering CKB's capability in fitting data under domain shift, we introduce it into cross-device fault diagnosis, specifically cross-device mechanical fault diagnosis, aiming to enhance the interpretability of the transfer model and reduce the disparities among different devices' data. ...
Article
Full-text available
The issue of cross-device fault diagnosis is a focal point in bearing fault diagnosis. Nevertheless, due to the imbalance in bearing fault data, conventional fault diagnosis methods have certain limitations in practical applications. To overcome this problem, this paper proposes a bearing fault diagnosis method based on Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTENC) and deep transfer learning. Firstly, the SMOTENC algorithm is employed to oversample the imbalanced bearing vibration signals, thereby obtaining a balanced dataset. Secondly, a six-layer deep transfer neural network model is constructed, and a novel conditional distribution metric loss function is utilized to minimize the distance between the source and target domains. Lastly, the proposed method is applied to 12 cross-device bearing fault diagnosis tasks under an imbalanced dataset, and validated using three performance metrics. The research findings demonstrate that the bearing fault diagnosis method based on SMOTENC and deep transfer learning exhibits significant advantages in handling imbalanced data, offering an effective solution for research in the field of bearing fault diagnosis.
... For example, entropy minimization in [5], or self-training in [6], seeks to learn knowledge from the model predictions themselves, which is risky due to label noise accumulation [14] and poor model calibration [15]. Hence, recent advances, such as [8,16], have proposed label-aware alignment strategies for cross-domain feature distributions, which consider the integration of label information into adaptation, thereby effectively mitigating the aforementioned issues. ...
Article
As a crucial step toward real-world learning scenarios with changing environments, dataset shift theory and invariant representation learning algorithms have been extensively studied to relax the identical distribution assumption in the classical learning setting. Among the different assumptions on the essence of shifting distributions, generalized label shift (GLS) is the most recently developed one, showing great potential to deal with the complex factors within the shift. In this paper, we aim to explore the limitations of current dataset shift theory and algorithms, and further provide new insights by presenting a comprehensive understanding of GLS. From the theoretical aspect, two informative generalization bounds are derived, and the GLS learner is proved to be sufficiently close to the optimal target model from the Bayesian perspective. The main results show the insufficiency of invariant representation learning, and prove the sufficiency and necessity of GLS correction for generalization, which provides theoretical support and innovations for exploring generalizable models under dataset shift. From the methodological aspect, we provide a unified view of existing shift correction frameworks, and propose a kernel embedding-based correction algorithm (KECA) to minimize the generalization error and achieve successful knowledge transfer. Both theoretical results and extensive experimental evaluations demonstrate the sufficiency and necessity of GLS correction for addressing dataset shift and the superiority of the proposed algorithm.
Article
Unsupervised domain adaptation (UDA) studies how to transfer a learner from a labeled source domain to an unlabeled target domain with different distributions. Existing methods mainly focus on matching marginal distributions of the source and target domains, which probably leads to a misalignment of samples from the same class but different domains. In this paper, we tackle this misalignment issue by achieving the class-conditioned transferring from a new perspective. Specifically, we propose a method named maximizing conditional independence (MCI) for UDA, which maximizes the conditional independence of feature and domain given class in the reproducing kernel Hilbert spaces. The optimization of conditional independence can be viewed as a surrogate for minimizing class-wise mutual information between feature and domain. An interpretable empirical estimation of the conditional dependence measure is deduced and connected with the unconditional case. Besides, we provide an upper bound on the target error by taking the class-conditional distribution into account, which provides a new theoretical insight for class-conditioned transferring. Extensive experiments on six benchmark datasets and various ablation studies validate the effectiveness of the proposed model in dealing with UDA.
Article
Limited labeled training samples constitute a challenge in hyperspectral image classification, with much research devoted to cross-domain adaptation, where the classes of the source and target domains are different. Current cross-domain few-shot learning (FSL) methods only use a small number of sample pairs to learn the discriminant features, which limits their performance. To address this problem, we propose a new framework for cross-domain FSL, considering all possible positive and negative pairs in a training batch and not just pairs between the support and query sets. Furthermore, we propose a new kernel triplet loss to characterize complex nonlinear relationships between samples and design appropriate feature extraction and discriminant networks. Specifically, the source and target data are simultaneously fed into the same feature extraction network, and then, the proposed kernel triplet loss on the embedding feature and the cross-entropy loss on the softmax output are used to learn discriminant features for both source and target data. Finally, an iterative adversarial strategy is employed to mitigate domain shifts between source and target data. The proposed method significantly outperforms state-of-the-art methods in experiments on four target datasets and one source dataset. The code is available at https://github.com/kkcocoon/CFSL-KT .
Article
Unsupervised domain-adaptive object detection uses labeled source domain data and unlabeled target domain data to alleviate the domain shift and reduce the dependence on the target domain data labels. For object detection, the features responsible for classification and localization are different. However, the existing methods basically only consider classification alignment, which is not conducive to cross-domain localization. To address this issue, in this article, we focus on the alignment of localization regression in domain-adaptive object detection and propose a novel localization regression alignment (LRA) method. The idea is that the domain-adaptive localization regression problem can be transformed into a general domain-adaptive classification problem first, and then adversarial learning is applied to the converted classification problem. Specifically, LRA first discretizes the continuous regression space, and the discrete regression intervals are treated as bins. Then, a novel binwise alignment (BA) strategy is proposed through adversarial learning. BA can further contribute to the overall cross-domain feature alignment for object detection. Extensive experiments are conducted on different detectors in various scenarios, and the state-of-the-art performance is achieved; these results demonstrate the effectiveness of our method. The code will be available at: https://github.com/zqpiao/LRA .
Article
Face representation in the wild is extremely hard due to large-scale face variations. Some deep convolutional neural networks (CNNs) have been developed to learn discriminative features by designing proper margin-based losses, which perform well on easy samples but fail on hard samples. Although some methods mainly adjust the weights of hard samples in the training stage to improve feature discrimination, they overlook the distribution property of features. It is worth noting that the misclassified hard samples may be corrected from the feature distribution view. To overcome this problem, this paper proposes the hard samples guided optimal transport (OT) loss for deep face representation, OTFace in short. OTFace aims to enhance the performance on hard samples by introducing the feature distribution discrepancy while maintaining the performance on easy samples. Specifically, we embrace a triplet scheme to indicate hard sample groups in one mini-batch during training. OT is then used to characterize the distribution differences of features from the high-level convolutional layer. Finally, we integrate the margin-based softmax (e.g., ArcFace or AM-Softmax) and OT together to guide deep CNN learning. Extensive experiments were conducted on several benchmark databases. The quantitative results demonstrate the advantages of the proposed OTFace over state-of-the-art methods. The code is available at https://github.com/FST-ZHUSHUMIN/OTFace.
Article
Unsupervised domain adaptation (UDA) aims to generalize the supervised model trained on a source domain to an unlabeled target domain. Previous works mainly rely on the marginal distribution alignment of feature spaces, which ignore the conditional dependence between features and labels, and may suffer from negative transfer. To address this problem, some UDA methods focus on aligning the conditional distributions of feature spaces. However, most of these methods rely on class-specific Maximum Mean Discrepancy or adversarial training, which may suffer from mode collapse and training instability. In this paper, we propose a Deep Conditional Adaptation Network (DCAN) that aligns the conditional distributions by minimizing Conditional Maximum Mean Discrepancy, and extracts discriminant information from the target domain by maximizing the mutual information between samples and the prediction labels. Conditional Maximum Mean Discrepancy measures the difference between conditional distributions directly through their conditional embedding in Reproducing Kernel Hilbert Space, thus DCAN can be trained stably and converge fast. Mutual information can be expressed as the difference between the entropy and conditional entropy of the predicted category variable, thus DCAN can extract the discriminant information of individual and overall distributions in the target domain, simultaneously. In addition, DCAN can be used to address a special scenario, Partial UDA, where the target domain category is a subset of the source domain category. Experiments on both UDA and Partial UDA show that DCAN achieves superior classification performance over state-of-the-art methods.
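DCAN's Conditional MMD is defined through conditional kernel embeddings in an RKHS; the simplified, hedged sketch below instead averages a per-class biased MMD estimate with a single RBF kernel, which conveys the same intuition of matching class-conditional feature distributions. All names are illustrative, and target labels stand in for the model's predictions.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased empirical estimate of the squared MMD with an RBF kernel.
    return (rbf_kernel(x, x, sigma).mean() + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

def conditional_mmd(feats_s, labels_s, feats_t, labels_t, num_classes, sigma=1.0):
    # Per-class MMD averaged over classes observed in both domains.
    terms = []
    for c in range(num_classes):
        xs, xt = feats_s[labels_s == c], feats_t[labels_t == c]
        if len(xs) > 0 and len(xt) > 0:
            terms.append(mmd2(xs, xt, sigma))
    return torch.stack(terms).mean() if terms else feats_s.new_zeros(())
```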
Article
Full-text available
We address the problem of unsupervised domain adaptation under the setting of generalized target shift (joint class-conditional and label shifts). For this framework, we theoretically show that, for good generalization, it is necessary to learn a latent representation in which both marginals and class-conditional distributions are aligned across domains. For this sake, we propose a learning problem that minimizes an importance-weighted loss in the source domain and a Wasserstein distance between weighted marginals. For a proper weighting, we provide an estimator of the target label proportion by blending mixture estimation and optimal matching by optimal transport. This estimation comes with theoretical guarantees of correctness under mild assumptions. Our experimental results show that our method performs better on average than competitors across a range of domain adaptation problems including digits, VisDA and Office. Code for this paper is available at https://github.com/arakotom/mars_domain_adaptation.
Conference Paper
Full-text available
As a vital problem in classification-oriented transfer, unsupervised domain adaptation (UDA) has attracted widespread attention in recent years. Previous UDA methods assume the marginal distributions of different domains are shifted while ignoring the discriminant information in the label distributions. This leads to classification performance degeneration in real applications. In this work, we focus on the conditional distribution shift problem which is of great concern to current conditional invariant models. We aim to seek a kernel covariance embedding for the conditional distribution which remains yet unexplored. Theoretically, we propose the Conditional Kernel Bures (CKB) metric for characterizing conditional distribution discrepancy, and derive an empirical estimation for the CKB metric without introducing the implicit kernel feature map. It provides an interpretable approach to understand the knowledge transfer mechanism. The established consistency theory of the empirical estimation provides a theoretical guarantee for convergence. A conditional distribution matching network is proposed to learn the conditional invariant and discriminative features for UDA. Extensive experiments and analysis show the superiority of our proposed model.
Article
Full-text available
Unsupervised domain adaptation (UDA) aims to learn classification models that make predictions for unlabeled data on a target domain, given labeled data on a source domain whose distribution diverges from the target one. Mainstream UDA methods strive to learn domain-aligned features such that classifiers trained on the source features can be readily applied to the target ones. Although impressive results have been achieved, these methods have a potential risk of damaging the intrinsic data structures of target discrimination, raising an issue of generalization particularly for UDA tasks in an inductive setting. To address this issue, we are motivated by a UDA assumption of structural similarity across domains, and propose to directly uncover the intrinsic target discrimination via constrained clustering, where we constrain the clustering solutions using structural source regularization that hinges on the very same assumption. Technically, we propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one, and we thus term our method H-SRDC. Our hybrid model is based on a deep clustering framework that minimizes the Kullback-Leibler divergence between the distribution of network prediction and an auxiliary one, where we impose structural regularization by learning a domain-shared classifier and cluster centroids. By enriching the structural similarity assumption, we are able to extend H-SRDC for a pixel-level UDA task of semantic segmentation. We conduct extensive experiments on seven UDA benchmarks of image classification and semantic segmentation. With no explicit feature alignment, our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings. We make our implementation codes publicly available at https://github.com/huitangtang/H-SRDC.
Chapter
Full-text available
Few-shot learning aims to train efficient predictive models with a few examples. The lack of training data leads to poor models that perform high-variance or low-confidence predictions. In this paper, we propose to meta-learn the ensemble of epoch-wise empirical Bayes models (E³BM) to achieve robust predictions. “Epoch-wise” means that each training epoch has a Bayes model whose parameters are specifically learned and deployed. “Empirical” means that the hyperparameters, e.g., used for learning and ensembling the epoch-wise models, are generated by hyperprior learners conditional on task-specific data. We introduce four kinds of hyperprior learners by considering inductive vs. transductive, and epoch-dependent vs. epoch-independent, in the paradigm of meta-learning. We conduct extensive experiments for five-class few-shot tasks on three challenging benchmarks: miniImageNet, tieredImageNet, and FC100, and achieve top performance using the epoch-dependent transductive hyperprior learner, which captures the richest information. Our ablation study shows that both “epoch-wise ensemble” and “empirical” encourage high efficiency and robustness in the model performance (Our code is open-sourced at https://gitlab.mpi-klsb.mpg.de/yaoyaoliu/e3bm).
Article
Full-text available
Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain. Previous methods minimize the domain discrepancy neglecting the class information, which may lead to misalignment and poor generalization performance. To tackle this issue, this paper proposes the Contrastive Adaptation Network (CAN), which optimizes a new metric named Contrastive Domain Discrepancy, explicitly modeling the intra-class and inter-class domain discrepancies. To optimize CAN, two technical issues need to be addressed: 1) the target labels are not available and 2) the conventional mini-batch sampling is imbalanced. Thus we design an alternating update strategy to optimize both the target label estimations and the feature representations. Moreover, we develop class-aware sampling to enable more efficient and effective training. Our framework can be generally applied to the single-source and multi-source domain adaptation scenarios. In particular, to deal with multiple source domain data, we propose 1) a multi-source clustering ensemble which exploits the complementary knowledge of distinct source domains to make more accurate and robust target label estimations, and 2) boundary-sensitive alignment to make the decision boundary better fitted to the target. Experiments conducted on three real-world benchmarks demonstrate that CAN performs favorably against previous state-of-the-art methods.
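As a hedged illustration of the contrastive domain discrepancy idea (shrink the discrepancy between same-class samples across domains while enlarging it between different-class samples), a simplified estimate with an RBF-kernel MMD could look like this; the alternating label-estimation and class-aware sampling machinery of the actual method is omitted, and the names are mine.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    return (rbf_kernel(x, x, sigma).mean() + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

def contrastive_domain_discrepancy(fs, ys, ft, yt, num_classes, sigma=1.0):
    # Intra-class term (same class, different domains) is to be minimized;
    # the inter-class term is subtracted, so minimizing the whole expression
    # also pushes different classes apart across domains.
    intra, inter, n_intra, n_inter = 0.0, 0.0, 0, 0
    for c1 in range(num_classes):
        for c2 in range(num_classes):
            xs, xt = fs[ys == c1], ft[yt == c2]
            if len(xs) == 0 or len(xt) == 0:
                continue
            d = mmd2(xs, xt, sigma)
            if c1 == c2:
                intra, n_intra = intra + d, n_intra + 1
            else:
                inter, n_inter = inter + d, n_inter + 1
    return intra / max(n_intra, 1) - inter / max(n_inter, 1)
```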
Article
Full-text available
Unsupervised domain adaptation is effective in leveraging rich information from a labeled source domain to an unlabeled target domain. Though deep learning and adversarial strategy made a significant breakthrough in the adaptability of features, there are two issues to be further studied. First, hard-assigned pseudo labels on the target domain are arbitrary and error-prone, and direct application of them may destroy the intrinsic data structure. Second, batch-wise training of deep learning limits the characterization of the global structure. In this paper, a Riemannian manifold learning framework is proposed to achieve transferability and discriminability simultaneously. For the first issue, this framework establishes a probabilistic discriminant criterion on the target domain via soft labels. Based on pre-built prototypes, this criterion is extended to a global approximation scheme for the second issue. Manifold metric alignment is adopted to be compatible with the embedding space. The theoretical error bounds of different alignment metrics are derived for constructive guidance. The proposed method can be used to tackle a series of variants of domain adaptation problems, including both vanilla and partial settings. Extensive experiments have been conducted to investigate the method and a comparative study shows the superiority of the discriminative manifold learning framework.
Conference Paper
Full-text available
Unsupervised domain adaptation (UDA) is a representative problem in transfer learning, which aims to improve the classification performance on an unlabeled target domain by exploiting discriminant information from a labeled source domain. The optimal transport model has been used for UDA from the perspective of distribution matching. However, the transport distance cannot reflect the discriminant information from either domain knowledge or category prior. In this work, we propose an enhanced transport distance (ETD) for UDA. This method builds an attention-aware transport distance, which can be viewed as the prediction-feedback of the iteratively learned classifier, to measure the domain discrepancy. Further, the Kantorovich potential variable is re-parameterized by deep neural networks to learn the distribution in the latent space. The entropy-based regularization is developed to explore the intrinsic structure of the target domain. The proposed method is optimized alternately in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets to demonstrate the SOTA performance of ETD.
Article
Full-text available
For a target task where the labeled data are unavailable, domain adaptation can transfer a learner from a different source domain. Previous deep domain adaptation methods mainly learn a global domain shift, i.e., align the global source and target distributions without considering the relationships between two subdomains within the same category of different domains, leading to unsatisfying transfer learning performance without capturing the fine-grained information. Recently, more and more researchers pay attention to subdomain adaptation that focuses on accurately aligning the distributions of the relevant subdomains. However, most of them are adversarial methods that contain several loss functions and converge slowly. Based on this, we present a deep subdomain adaptation network (DSAN) that learns a transfer network by aligning the relevant subdomain distributions of domain-specific layer activations across different domains based on a local maximum mean discrepancy (LMMD). Our DSAN is very simple but effective, which does not need adversarial training and converges fast. The adaptation can be achieved easily with most feedforward network models by extending them with LMMD loss, which can be trained efficiently via backpropagation. Experiments demonstrate that DSAN can achieve remarkable results on both object recognition tasks and digit classification tasks. Our code will be available at https://github.com/easezyc/deep-transfer-learning .
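The local MMD (LMMD) used by DSAN reweights samples by class membership: source weights come from one-hot labels and target weights from predicted probabilities, normalized per class. The sketch below follows that recipe in simplified form (single RBF kernel, illustrative names); it is not the reference implementation.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def lmmd(fs, ys_onehot, ft, pt_probs, sigma=1.0):
    # Source samples weighted by one-hot labels, target samples by predicted
    # class probabilities; weights are normalized per class, then the weighted
    # MMD terms are summed over classes.
    ws = ys_onehot / ys_onehot.sum(0, keepdim=True).clamp(min=1e-6)  # (n_s, C)
    wt = pt_probs / pt_probs.sum(0, keepdim=True).clamp(min=1e-6)    # (n_t, C)
    k_ss = rbf_kernel(fs, fs, sigma)
    k_tt = rbf_kernel(ft, ft, sigma)
    k_st = rbf_kernel(fs, ft, sigma)
    loss = fs.new_zeros(())
    for c in range(ws.shape[1]):
        a, b = ws[:, c:c + 1], wt[:, c:c + 1]
        loss = loss + (a.T @ k_ss @ a + b.T @ k_tt @ b - 2 * a.T @ k_st @ b).squeeze()
    return loss
```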
Conference Paper
Full-text available
Few-shot learning has become essential for producing models that generalize from few examples. In this work, we identify that metric scaling and metric task conditioning are important to improve the performance of few-shot algorithms. Our analysis reveals that simple metric scaling completely changes the nature of few-shot algorithm parameter updates. Metric scaling provides improvements up to 14% in accuracy for certain metrics on the mini-Imagenet 5-way 5-shot classification task. We further propose a simple and effective way of conditioning a learner on the task sample set, resulting in learning a task-dependent metric space. Moreover, we propose and empirically test a practical end-to-end optimization procedure based on auxiliary task co-training to learn a task-dependent metric space. The resulting few-shot learning model based on the task-dependent scaled metric achieves state of the art on mini-Imagenet. We confirm these results on another few-shot dataset that we introduce in this paper based on CIFAR100. Our code is publicly available at https://github.com/ElementAI/TADAM.
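As a hedged sketch of metric scaling in a prototype-based few-shot classifier (not the full TADAM model, which also conditions the feature extractor on the task), the logits can be taken as scaled negative distances to class prototypes; the names below are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_prototype_logits(query, support, support_labels, num_classes, alpha=10.0):
    # Prototypes are class means of support embeddings; logits are negative
    # squared Euclidean distances multiplied by a (learnable) scale alpha.
    prototypes = torch.stack([support[support_labels == c].mean(0)
                              for c in range(num_classes)])
    return -alpha * torch.cdist(query, prototypes) ** 2

# Usage: loss = F.cross_entropy(scaled_prototype_logits(q, s, ys, n_way), yq)
```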
Article
Full-text available
In this paper, we study object detection using a large pool of unlabeled images and only a few labeled images per category, named "few-example object detection". The key challenge consists in generating trustworthy training samples as many as possible from the pool. Using few training examples as seeds, our method iterates between model training and high-confidence sample selection. In training, easy samples are generated first and, then the poorly initialized model undergoes improvement. As the model becomes more discriminative, challenging but reliable samples are selected. After that, another round of model improvement takes place. To further improve the precision and recall of the generated training samples, we embed multiple detection models in our framework, which has proven to outperform the single model baseline and the model ensemble method. Experiments on PASCAL VOC'07, MS COCO'14, and ILSVRC'13 indicate that by using as few as three or four samples selected for each category, our method produces very competitive results when compared to the state-of-the-art weakly-supervised approaches using a large number of image-level labels.
Article
Full-text available
The metric $d(A,B)=\left[\operatorname{tr}A+\operatorname{tr}B-2\operatorname{tr}\big(A^{1/2}BA^{1/2}\big)^{1/2}\right]^{1/2}$ on the manifold of $n\times n$ positive definite matrices arises in various optimisation problems, in quantum information and in the theory of optimal transport. It is also related to Riemannian geometry. In the first part of this paper we study this metric from the perspective of matrix analysis, simplifying and unifying various proofs. Then we develop a theory of a mean of two, and a barycentre of several, positive definite matrices with respect to this metric. We explain some recent work on a fixed point iteration for computing this Wasserstein barycentre. Our emphasis is on ideas natural to matrix analysis.
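The formula translates directly into code; a small NumPy/SciPy sketch of this Bures-Wasserstein distance between two positive definite matrices (illustrative only, with a clamp against small negative round-off) is:

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_distance(a, b):
    # d(A, B) = [ tr A + tr B - 2 tr (A^{1/2} B A^{1/2})^{1/2} ]^{1/2}
    sqrt_a = sqrtm(a)
    cross = sqrtm(sqrt_a @ b @ sqrt_a)
    inner = np.trace(a) + np.trace(b) - 2.0 * np.real(np.trace(cross))
    return np.sqrt(max(inner, 0.0))

# Example on two small empirical covariance matrices.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(50, 3)), 2.0 * rng.normal(size=(50, 3))
print(bures_distance(x.T @ x / 50, y.T @ y / 50))
```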
Article
Full-text available
Domain adaptation is critical for success in new, unseen environments. Adversarial adaptation models applied in feature spaces discover domain invariant representations, but are difficult to visualize and sometimes fail to capture pixel-level and low-level domain shifts. Recent work has shown that generative adversarial networks combined with cycle-consistency constraints are surprisingly effective at mapping images between domains, even without the use of aligned image pairs. We propose a novel discriminatively-trained Cycle-Consistent Adversarial Domain Adaptation model. CyCADA adapts representations at both the pixel-level and feature-level, enforces cycle-consistency while leveraging a task loss, and does not require aligned pairs. Our model can be applied in a variety of visual recognition and prediction settings. We show new state-of-the-art results across multiple adaptation tasks, including digit classification and semantic segmentation of road scenes demonstrating transfer from synthetic to real world domains.
Article
Full-text available
In recent years, deep neural networks have emerged as a dominant machine learning tool for a wide variety of application domains. However, training a deep neural network requires a large amount of labeled data, which is an expensive process in terms of time, labor and human expertise. Domain adaptation or transfer learning algorithms address this challenge by leveraging labeled data in a different, but related source domain, to develop a model for the target domain. Further, the explosive growth of digital data has posed a fundamental challenge concerning its storage and retrieval. Due to its storage and retrieval efficiency, recent years have witnessed a wide application of hashing in a variety of computer vision applications. In this paper, we first introduce a new dataset, Office-Home, to evaluate domain adaptation algorithms. The dataset contains images of a variety of everyday objects from multiple domains. We then propose a novel deep learning framework that can exploit labeled source data and unlabeled target data to learn informative hash codes, to accurately classify unseen target data. To the best of our knowledge, this is the first research effort to exploit the feature learning capabilities of deep neural networks to learn representative hash codes to address the domain adaptation problem. Our extensive empirical studies on multiple transfer tasks corroborate the usefulness of the framework in learning efficient hash codes which outperform existing competitive baselines for unsupervised domain adaptation.
Article
Full-text available
This paper deals with the unsupervised domain adaptation problem, where one wants to estimate a prediction function $f$ in a given target domain without any labeled sample by exploiting the knowledge available from a source domain where labels are known. Our work makes the following assumption: there exists a non-linear transformation between the joint feature/label space distributions of the two domains, $\mathcal{P}_s$ and $\mathcal{P}_t$. We propose a solution to this problem with optimal transport, which allows us to recover an estimated target distribution $\mathcal{P}^f_t=(X,f(X))$ by simultaneously optimizing the optimal coupling and $f$. We show that our method corresponds to the minimization of a bound on the target error, and provide an efficient algorithmic solution for which convergence is proved. The versatility of our approach, both in terms of the class of hypotheses and loss functions, is demonstrated on real-world classification and regression problems, for which we reach or surpass state-of-the-art results.
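A hedged sketch of the coupling step of such a joint feature/label OT formulation, using the POT package (assumed installed) and squared Euclidean costs as a simple stand-in for the label loss, could be:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)

def joint_ot_coupling(xs, ys_onehot, xt, f_probs_t, alpha=1.0, beta=1.0):
    # Joint cost: squared feature distance plus a squared distance between
    # source labels and the current target predictions f(x_t).
    cost = alpha * ot.dist(xs, xt) + beta * ot.dist(ys_onehot, f_probs_t)
    a = np.full(len(xs), 1.0 / len(xs))
    b = np.full(len(xt), 1.0 / len(xt))
    return ot.emd(a, b, cost)  # optimal coupling between the two empirical samples
```

In the full method, such a coupling step would alternate with updating the prediction function f under the learned correspondence.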
Article
Full-text available
In domain adaptation, maximum mean discrepancy (MMD) has been widely adopted as a discrepancy metric between the distributions of source and target domains. However, existing MMD-based domain adaptation methods generally ignore the changes of class prior distributions, i.e., class weight bias across domains. This remains an open but ubiquitous problem for domain adaptation, which can be caused by changes in sample selection criteria and application scenarios. We show that MMD cannot account for class weight bias and results in degraded domain adaptation performance. To address this issue, a weighted MMD model is proposed in this paper. Specifically, we introduce class-specific auxiliary weights into the original MMD for exploiting the class prior probability on source and target domains, whose challenge lies in the fact that the class labels in the target domain are unavailable. To account for it, our proposed weighted MMD model is defined by introducing an auxiliary weight for each class in the source domain, and a classification EM algorithm is suggested by alternating between assigning the pseudo-labels, estimating auxiliary weights and updating model parameters. Extensive experiments demonstrate the superiority of our weighted MMD over conventional MMD for domain adaptation.
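A simplified, illustrative version of such a class-weighted MMD (auxiliary per-class weights on source samples, single RBF kernel; not the paper's exact estimator or EM procedure) might read:

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def weighted_mmd2(fs, ys, ft, class_weights, sigma=1.0):
    # Each source sample is reweighted by an auxiliary weight for its class
    # (e.g. an estimated target/source class-prior ratio) before computing MMD.
    w = class_weights[ys]
    w = w / w.sum()
    m = torch.full((len(ft),), 1.0 / len(ft), device=ft.device)
    return (w @ rbf_kernel(fs, fs, sigma) @ w
            + m @ rbf_kernel(ft, ft, sigma) @ m
            - 2 * w @ rbf_kernel(fs, ft, sigma) @ m)
```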
Article
Full-text available
Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.
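The core inference rule, attention over the support set, is compact enough to sketch; the snippet below is a minimal, hedged version with cosine-similarity attention (the paper's full model also uses context embeddings of the support set, omitted here).

```python
import torch
import torch.nn.functional as F

def matching_network_predict(query_emb, support_emb, support_onehot):
    # Attention over the support set: cosine similarities -> softmax weights;
    # the prediction is the attention-weighted combination of support labels.
    q = F.normalize(query_emb, dim=1)
    s = F.normalize(support_emb, dim=1)
    attention = F.softmax(q @ s.T, dim=1)   # (n_query, n_support)
    return attention @ support_onehot       # (n_query, n_classes)
```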
Article
Full-text available
Deep networks rely on massive amounts of labeled data to learn powerful models. For a target task short of labeled data, transfer learning enables model adaptation from a different source domain. This paper addresses deep transfer learning under a more general scenario that the joint distributions of features and labels may change substantially across domains. Based on the theory of Hilbert space embedding of distributions, a novel joint distribution discrepancy is proposed to directly compare joint distributions across domains, eliminating the need of marginal-conditional factorization. Transfer learning is enabled in deep convolutional networks, where the dataset shifts may linger in multiple task-specific feature layers and the classifier layer. A set of joint adaptation networks are crafted to match the joint distributions of these layers across domains by minimizing the joint distribution discrepancy, which can be trained efficiently using back-propagation. Experiments show that the new approach yields state of the art results on standard domain adaptation datasets.
Article
Full-text available
Domain adaptation from one data space (or domain) to another is one of the most challenging tasks of modern data analytics. If the adaptation is done correctly, models built on a specific data space become more robust when confronted with data depicting the same semantic concepts (the classes), but observed by another observation system with its own specificities. Among the many strategies proposed to adapt one domain to another, finding a common representation has shown excellent properties: by finding a common representation for both domains, a single classifier can be effective in both and use labelled samples from the source domain to predict the unlabelled samples of the target domain. In this paper, we propose a regularized unsupervised optimal transportation model to perform the alignment of the representations in the source and target domains. We learn a transportation plan matching both PDFs, which constrains labelled samples in the source domain to remain close during transport. In this way, we simultaneously exploit the limited label information available in the source domain and the unlabelled distributions observed in both domains. Experiments on toy and challenging real visual adaptation examples show the interest of the method, which consistently outperforms state-of-the-art approaches.
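The entropy-regularized transport at the core of such models can be computed with Sinkhorn iterations; the sketch below is a minimal NumPy version with the class-based regularizer omitted, plus the barycentric mapping commonly used to transport source samples toward the target. Names are illustrative.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, reg=0.1, n_iter=200):
    # Entropy-regularized OT via Sinkhorn iterations; the class-based
    # regularizer of the full model is omitted in this simplified version.
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def transport_source(xt, plan):
    # Barycentric mapping: each source sample is mapped to a weighted
    # average of target samples according to its row of the plan.
    return (plan / plan.sum(axis=1, keepdims=True)) @ xt
```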
Article
Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit using a few samples only, typical meta-learning models use shallow neural networks, thus limiting their effectiveness. In order to achieve top performance, some recent works tried to use DNNs pre-trained on large-scale datasets but mostly in straightforward manners, e.g., (1) taking their weights as a warm start of meta-training, and (2) freezing their convolutional layers as the feature extractor of base-learners. In this paper, we propose a novel approach called meta-transfer learning (MTL), which learns to transfer the weights of a deep NN for few-shot learning tasks. Specifically, meta refers to training multiple tasks, and transfer is achieved by learning scaling and shifting functions of DNN weights (and biases) for each task. To further boost the learning efficiency of MTL, we introduce the hard task (HT) meta-batch scheme as an effective learning curriculum of few-shot classification tasks. We conduct experiments for five-class few-shot classification tasks on three challenging benchmarks, miniImageNet, tieredImageNet, and Fewshot-CIFAR100 (FC100), in both supervised and semi-supervised settings. Extensive comparisons to related works validate that our MTL approach trained with the proposed HT meta-batch scheme achieves top performance. An ablation study also shows that both components contribute to fast convergence and high accuracy.
Article
Few-shot learning aims to learn a well-performing model from a few labeled examples. Recently, quite a few works propose to learn a predictor to directly generate model parameter weights with the episodic training strategy of meta-learning and achieve fairly promising performance. However, the predictor in these works is task-agnostic, which means that the predictor cannot adjust to novel tasks in the testing phase. In this article, we propose a novel meta-learning method to learn how to learn a task-adaptive classifier-predictor to generate classifier weights for few-shot classification. Specifically, a meta classifier-predictor module (MPM) is introduced to learn how to adaptively update a task-agnostic classifier-predictor to a task-specialized one on a novel task with a newly proposed center-uniqueness loss function. Compared with previous works, our task-adaptive classifier-predictor can better capture the characteristics of each category in a novel task and thus generate a more accurate and effective classifier. Our method is evaluated on two commonly used benchmarks for few-shot classification, i.e., miniImageNet and tieredImageNet. An ablation study verifies the necessity of learning a task-adaptive classifier-predictor and the effectiveness of our newly proposed center-uniqueness loss. Moreover, our method achieves state-of-the-art performance on both benchmarks, thus demonstrating its superiority.
Article
Machine learning has been highly successful in data-intensive applications but is often hampered when the data set is small. Recently, Few-shot Learning (FSL) has been proposed to tackle this problem. Using prior knowledge, FSL can rapidly generalize to new tasks containing only a few samples with supervised information. In this article, we conduct a thorough survey to fully understand FSL. Starting from a formal definition of FSL, we distinguish FSL from several relevant machine learning problems. We then point out that the core issue in FSL is that the empirical risk minimizer is unreliable. Based on how prior knowledge can be used to handle this core issue, we categorize FSL methods from three perspectives: (i) data, which uses prior knowledge to augment the supervised experience; (ii) model, which uses prior knowledge to reduce the size of the hypothesis space; and (iii) algorithm, which uses prior knowledge to alter the search for the best hypothesis in the given hypothesis space. With this taxonomy, we review and discuss the pros and cons of each category. Promising directions, in the aspects of the FSL problem setups, techniques, applications, and theories, are also proposed to provide insights for future research.
Article
Unsupervised domain adaptation addresses the problem of transferring knowledge from a well-labeled source domain to an unlabeled target domain where the two domains have distinctive data distributions. Thus, the essence of domain adaptation is to mitigate the distribution divergence between the two domains. State-of-the-art methods practice this very idea by either conducting adversarial training or minimizing a metric that defines the distribution gaps. In this paper, we propose a new domain adaptation method named Adversarial Tight Match (ATM) which enjoys the benefits of both adversarial training and metric learning. Specifically, we first propose a novel distance loss, named Maximum Density Divergence (MDD), to quantify the distribution divergence. MDD minimizes the inter-domain divergence ("match" in ATM) and maximizes the intra-class density ("tight" in ATM). Then, to address the equilibrium challenge in adversarial domain adaptation, we leverage the proposed MDD within an adversarial domain adaptation framework. Finally, we tailor the proposed MDD into a practical learning loss and obtain our ATM. Both empirical evaluation and theoretical analysis are reported to verify the effectiveness of the proposed method. The experimental results on five benchmarks, both classical and large-scale, show that our method is able to achieve new state-of-the-art performance on most evaluations.
Article
In this paper, we present a mathematical and computational framework for comparing and matching distributions in reproducing kernel Hilbert spaces (RKHS). This framework, called optimal transport in RKHS, is a generalization of the optimal transport problem in input spaces to (potentially) infinite-dimensional feature spaces. We provide a computable formulation of Kantorovich's optimal transport in RKHS. In particular, we explore the case in which data distributions in RKHS are Gaussian, obtaining closed-form expressions of both the estimated Wasserstein distance and optimal transport map via kernel matrices. Based on these expressions, we generalize the Bures metric on covariance matrices to infinite-dimensional settings, providing a new metric between covariance operators. Moreover, we extend the correlation alignment problem to Hilbert spaces, giving a new strategy for matching distributions in RKHS. Empirically, we apply the derived formulas under the Gaussianity assumption to image classification and domain adaptation. In both tasks, our algorithms yield state-of-the-art performances, demonstrating the effectiveness and potential of our framework.
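The closed-form expression underlying this line of work is the Bures(-Wasserstein) distance between covariance matrices, W^2(N(0, C1), N(0, C2)) = tr(C1) + tr(C2) - 2 tr((C1^{1/2} C2 C1^{1/2})^{1/2}). The NumPy/SciPy sketch below evaluates it for finite-dimensional covariances as a hedged illustration; the RKHS (kernel-matrix) version discussed in the paper is not reproduced here, and the function name is illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein(C1, C2):
    # C1, C2: symmetric positive semi-definite covariance matrices, shape (d, d)
    s1 = sqrtm(C1)
    cross = sqrtm(s1 @ C2 @ s1)                 # (C1^{1/2} C2 C1^{1/2})^{1/2}
    val = np.trace(C1) + np.trace(C2) - 2 * np.trace(cross)
    return np.sqrt(max(val.real, 0.0))          # clip tiny negatives from numerical error

# Example: distance between the covariances of two feature batches
Xs, Xt = np.random.randn(100, 16), np.random.randn(120, 16)
print(bures_wasserstein(np.cov(Xs.T), np.cov(Xt.T)))
```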
Article
Learning domain adaptive features aims to enhance the classification performance in the target domain by exploring the discriminant information from an auxiliary source set. Let X denote the feature and Y the label. The most typical problem to be addressed is that P(X, Y) varies so substantially between domains that classification in the target domain is difficult. In this paper, we study the generalized conditional domain adaptation (DA) problem, in which both P(Y) and P(X|Y) change across domains, from a causal perspective. We propose transforming the class-conditional probability matching into a marginal probability matching problem, under a proper assumption. We build an intermediate domain by employing a regression model. In order to enforce the most relevant data to reconstruct the intermediate representations, a low-rank constraint is placed on the regression model for regularization. The low-rank constraint underlines a global algebraic structure between different domains, and stresses the group compactness in representing the samples. The new model is considered under the discriminant subspace framework, which is favorable for simultaneously extracting the classification information from the source domain and the adaptation information across domains. The model can be solved by alternating optimization between quadratic programming and the Lagrange multiplier method. To the best of our knowledge, this paper is the first to exploit low-rank representation, from the source domain to the intermediate domain, to learn domain adaptive features. Comprehensive experimental results validate that the proposed method provides better classification accuracies with DA, compared with well-established baselines.
Article
Domain adaptation generalizes a learning machine across source domain and target domain under different distributions. Recent studies reveal that deep neural networks can learn transferable features generalizing well to similar novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, feature transferability drops significantly in higher task-specific layers with increasing domain discrepancy. To formally reduce the dataset shift and enhance the feature transferability in task-specific layers, this paper presents a novel framework for deep adaptation networks, which generalizes deep convolutional neural networks to domain adaptation. The framework embeds the deep features of all task-specific layers to reproducing kernel Hilbert spaces (RKHSs) and optimally matches different domain distributions. The deep features are made more transferable by exploring low-density separation of target-unlabeled data and very deep architectures, while the domain discrepancy is further reduced using multiple kernel learning for maximal testing power of kernel embedding matching. This leads to a minimax game framework that learns transferable features with statistical guarantees, and scales linearly with an unbiased estimate of the kernel embedding. Extensive empirical evidence shows that the proposed networks yield state-of-the-art results on standard visual domain adaptation benchmarks.
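A hedged sketch of the multi-kernel discrepancy minimized in the task-specific layers is given below: it sums Gaussian kernels at several bandwidths and forms the biased MMD^2 estimate. Fixed, equal kernel weights and the bandwidth values are assumptions for this example, whereas the abstract describes learning the kernel weights for maximal test power.

```python
import torch

def mk_mmd(src, tgt, sigmas=(0.5, 1.0, 2.0, 4.0)):
    # src: (n, d), tgt: (m, d) activations of one task-specific layer
    def k(x, y):
        d2 = torch.cdist(x, y) ** 2
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)  # sum of Gaussian kernels
    return k(src, src).mean() + k(tgt, tgt).mean() - 2 * k(src, tgt).mean()
```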
Conference Paper
In computer vision, one is often confronted with problems of domain shifts, which occur when one applies a classifier trained on a source dataset to target data sharing similar characteristics (e.g. same classes), but also different latent data structures (e.g. different acquisition conditions). In such a situation, the model will perform poorly on the new data, since the classifier is specialized to recognize visual cues specific to the source domain. In this work we explore a solution, named DeepJDOT, to tackle this problem: through a measure of discrepancy on joint deep representations/labels based on optimal transport, we not only learn new data representations aligned between the source and target domain, but also simultaneously preserve the discriminative information used by the classifier. We applied DeepJDOT to a series of visual recognition tasks, where it compares favorably against state-of-the-art deep domain adaptation methods.
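The per-batch joint cost that a DeepJDOT-style objective transports can be sketched as follows (PyTorch, hedged): each entry mixes a feature distance between a source and a target embedding with the classification loss of the target prediction against the source label. The weights alpha and lam and the function names are illustrative; the resulting cost matrix would then be fed to any OT solver, e.g. the Sinkhorn routine sketched earlier.

```python
import torch
import torch.nn.functional as F

def deepjdot_cost(feat_s, feat_t, y_s, logits_t, alpha=0.001, lam=1.0):
    # feat_s: (n, d) source embeddings, feat_t: (m, d) target embeddings
    # y_s: (n,) integer source labels, logits_t: (m, C) classifier outputs on target samples
    feat_cost = torch.cdist(feat_s, feat_t) ** 2     # (n, m) squared feature distances
    log_pt = F.log_softmax(logits_t, dim=1)          # (m, C)
    label_cost = -log_pt[:, y_s].t()                 # (n, m): CE of target j against source label i
    return alpha * feat_cost + lam * label_cost
```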
Article
In few-shot classification, we are interested in learning algorithms that train a classifier from only a handful of labeled examples. Recent progress in few-shot classification has featured meta-learning, in which a parameterized model for a learning algorithm is defined and trained on episodes representing different classification problems, each with a small labeled training set and its corresponding test set. In this work, we advance this few-shot classification paradigm towards a scenario where unlabeled examples are also available within each episode. We consider two situations: one where all unlabeled examples are assumed to belong to the same set of classes as the labeled examples of the episode, as well as the more challenging situation where examples from other distractor classes are also provided. To address this paradigm, we propose novel extensions of Prototypical Networks (Snell et al., 2017) that are augmented with the ability to use unlabeled examples when producing prototypes. These models are trained in an end-to-end way on episodes, to learn to leverage the unlabeled examples successfully. We evaluate these methods on versions of the Omniglot and miniImageNet benchmarks, adapted to this new framework augmented with unlabeled examples. We also propose a new split of ImageNet, consisting of a large set of classes, with a hierarchical structure. Our experiments confirm that our Prototypical Networks can learn to improve their predictions due to unlabeled examples, much like a semi-supervised algorithm would.
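The prototype-refinement step can be sketched as one soft k-means update that folds unlabeled embeddings into the class prototypes. The PyTorch snippet below is a hedged, simplified illustration that ignores the distractor-class variant; support_counts is assumed to be a float tensor of per-class support sizes, and all names are invented for the example.

```python
import torch

def refine_prototypes(protos, unlabeled, support_counts):
    # protos: (C, d) prototypes from the labeled support set, unlabeled: (U, d),
    # support_counts: (C,) float tensor with the number of support examples per class
    logits = -torch.cdist(unlabeled, protos) ** 2        # affinity of unlabeled points to prototypes
    soft = logits.softmax(dim=1)                          # (U, C) soft cluster assignments
    num = support_counts[:, None] * protos + soft.t() @ unlabeled
    den = support_counts[:, None] + soft.sum(dim=0)[:, None]
    return num / den                                      # refined prototypes, shape (C, d)
```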
Article
We present a conceptually simple, flexible, and general framework for few-shot learning, where a classifier must learn to recognise new classes given only a few examples from each. Our method, called the Relation Network (RN), is trained end-to-end from scratch. During meta-learning, it learns to learn a deep distance metric to compare a small number of images within episodes, each of which is designed to simulate the few-shot setting. Once trained, an RN is able to classify images of new classes by computing relation scores between query images and the few examples of each new class without further updating the network. Besides providing improved performance on few-shot learning, our framework is easily extended to zero-shot learning. Extensive experiments on four datasets demonstrate that our simple approach provides a unified and effective approach for both of these two tasks.
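A hedged, simplified sketch of the relation-scoring idea follows: a query embedding and a class embedding are concatenated and passed through a small learned comparator that outputs a score in [0, 1]. The original relation module is convolutional and operates on feature maps; the MLP stand-in and all names here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())          # relation score in [0, 1]

    def forward(self, query, class_embed):
        # query: (Q, d), class_embed: (C, d) -> relation scores (Q, C)
        q = query[:, None, :].expand(-1, class_embed.size(0), -1)
        c = class_embed[None, :, :].expand(query.size(0), -1, -1)
        return self.net(torch.cat([q, c], dim=-1)).squeeze(-1)
```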
Article
We present the 2017 Visual Domain Adaptation (VisDA) dataset and challenge, a large-scale testbed for unsupervised domain adaptation across visual domains. Unsupervised domain adaptation aims to solve the real-world problem of domain shift, where machine learning models trained on one domain must be transferred and adapted to a novel visual domain without additional supervision. The VisDA2017 challenge is focused on the simulation-to-reality shift and has two associated tasks: image classification and image segmentation. The goal in both tracks is to first train a model on simulated, synthetic data in the source domain and then adapt it to perform well on real image data in the unlabeled test domain. Our dataset is the largest one to date for cross-domain object classification, with over 280K images across 12 categories in the combined training, validation and testing domains. The image segmentation dataset is also large-scale with over 30K images across 18 categories in the three domains. We compare VisDA to existing cross-domain adaptation datasets and provide a baseline performance analysis using various domain adaptation models that are currently popular in the field.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Chapter
We introduce a representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behavior can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new Gradient Reversal Layer. The resulting augmented architecture can be trained using standard backpropagation, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for image classification, where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for the descriptor learning task in the context of a person re-identification application.
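The Gradient Reversal Layer at the heart of this approach is easy to sketch: the forward pass is the identity, while the backward pass flips (and optionally scales) the gradient, so the feature extractor learns features the domain classifier cannot separate. The PyTorch snippet below is a minimal, hedged rendition; the scaling factor lambd and the function names are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)        # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # reversed, scaled gradient for x; no gradient for lambd
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: domain_logits = domain_classifier(grad_reverse(features, lambd))
```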
Article
Domain adaptation arises in supervised learning when the training (source domain) and test (target domain) data have different distributions. Let X and Y denote the features and target, respectively. Previous work on domain adaptation mainly considers the covariate shift situation where the distribution of the features P(X) changes across domains while the conditional distribution P(Y|X) stays the same. To reduce domain discrepancy, recent methods try to find invariant components T(X) that have similar P(T(X)) on different domains by explicitly minimizing a distribution discrepancy measure. However, it is not clear if P(Y|T(X)) in different domains is also similar when P(Y|X) changes. Furthermore, transferable components do not necessarily have to be invariant. If the change in some components is identifiable, we can make use of such components for prediction in the target domain. In this paper, we focus on the case where P(X|Y) and P(Y) both change in a causal system in which Y is the cause for X. Under appropriate assumptions, we aim to extract conditional transferable components whose conditional distribution P(T(X)|Y) is invariant after proper location-scale (LS) transformations, and to identify how P(Y) changes between domains simultaneously. We provide theoretical analysis and empirical evaluation on both synthetic and real-world data to show the effectiveness of our method.
Article
Visual Domain adaptation is an actively researched problem in Computer Vision. In this work, we propose an approach that leverages unsupervised data to bring the source and target distributions closer in a learned joint feature space. We accomplish this by inducing a symbiotic relationship between the learned embedding and a generative adversarial framework. This is in contrast to methods which use an adversarial framework for realistic data generation and retraining deep models with such data. We show the strength and generality of our method by performing experiments on three different tasks: (1) Digit classification (MNIST, SVHN and USPS datasets) (2) Object recognition using OFFICE dataset and (3) Face recognition using the Celebrity Frontal Profile (CFP) dataset.
Article
We propose prototypical networks for the problem of few-shot classification, where a classifier must generalize to new classes not seen in the training set, given only a small number of examples of each new class. Prototypical networks learn a metric space in which classification can be performed by computing Euclidean distances to prototype representations of each class. Compared to recent approaches for few-shot learning, they reflect a simpler inductive bias that is beneficial in this limited-data regime, and achieve state-of-the-art results. We provide an analysis showing that some simple design decisions can yield substantial improvements over recent approaches involving complicated architectural choices and meta-learning. We further extend prototypical networks to the case of zero-shot learning and achieve state-of-the-art zero-shot results on the CU-Birds dataset.
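A hedged sketch of one prototypical episode in PyTorch: prototypes are the per-class means of support embeddings, and query points are classified by negative squared Euclidean distance to each prototype. Function and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def proto_loss(support, support_y, query, query_y, n_classes):
    # support: (S, d), query: (Q, d) embeddings; labels are integers in [0, n_classes)
    protos = torch.stack([support[support_y == c].mean(dim=0) for c in range(n_classes)])
    logits = -torch.cdist(query, protos) ** 2     # (Q, n_classes): closer prototype -> higher logit
    return F.cross_entropy(logits, query_y)
```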
Article
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on a few-shot image classification benchmark, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
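The inner/outer structure can be sketched for one task as follows (hedged; assumes a recent PyTorch where torch.func.functional_call is available): the model is adapted with one gradient step on the support set, and the query loss of the adapted parameters is returned so that the meta-optimizer can differentiate through the inner update. Names and the single-step setting are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def maml_task_loss(model, params, support_x, support_y, query_x, query_y, inner_lr=0.01):
    # params: dict of parameter name -> tensor, e.g. dict(model.named_parameters())
    inner_loss = F.cross_entropy(functional_call(model, params, (support_x,)), support_y)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    adapted = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
    return F.cross_entropy(functional_call(model, adapted, (query_x,)), query_y)

# Outer loop (sketch): sum maml_task_loss over a batch of sampled tasks,
# call backward() on the total, then step the meta-optimizer on model.parameters().
```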
Article
Let X denote the feature and Y the target. We consider domain adaptation under three possible scenarios: (1) the marginal P(Y) changes, while the conditional P(X|Y) stays the same (target shift), (2) the marginal P(Y) is fixed, while the conditional P(X|Y) changes with certain constraints (conditional shift), and (3) the marginal P(Y) changes, and the conditional P(X|Y) changes with constraints (generalized target shift). Using background knowledge, causal interpretations allow us to determine the correct situation for a problem at hand. We exploit importance reweighting or sample transformation to find the learning machine that works well on test data, and propose to estimate the weights or transformations by reweighting or transforming training data to reproduce the covariate distribution on the test domain. Thanks to kernel embedding of conditional as well as marginal distributions, the proposed approaches avoid distribution estimation, and are applicable for high-dimensional problems. Numerical evaluations on synthetic and real-world data sets demonstrate the effectiveness of the proposed framework.
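For the target-shift case, the correction reduces to reweighting source samples by the class ratio beta(y) = P_tgt(Y = y) / P_src(Y = y). The PyTorch snippet below is a hedged sketch of the reweighted training loss, assuming the target class priors have already been estimated; the estimation itself (e.g. via kernel embeddings, as the abstract describes) is not reproduced here, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def reweighted_loss(logits, y_src, src_priors, tgt_priors):
    # logits: (n, C) predictions on source samples, y_src: (n,) integer labels,
    # src_priors / tgt_priors: (C,) class-probability vectors
    beta = tgt_priors / src_priors                             # class-ratio importance weights
    per_sample = F.cross_entropy(logits, y_src, reduction='none')
    return (beta[y_src] * per_sample).mean()
```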
Article
Unlike human learning, machine learning often fails to handle changes between training (source) and test (target) input distributions. Such domain shifts, common in practical scenarios, severely damage the performance of conventional machine learning methods. Supervised domain adaptation methods have been proposed for the case when the target data have labels, including some that perform very well despite being "frustratingly easy" to implement. However, in practice, the target domain is often unlabeled, requiring unsupervised adaptation. We propose a simple, effective, and efficient method for unsupervised domain adaptation called CORrelation ALignment (CORAL). CORAL minimizes domain shift by aligning the second-order statistics of source and target distributions, without requiring any target labels. Even though it is extraordinarily simple (it can be implemented in four lines of Matlab code), CORAL performs remarkably well in extensive evaluations on standard benchmark datasets.
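The second-order alignment can be written as a differentiable loss in a few lines. The PyTorch snippet below follows the widely used "Deep CORAL" form (squared Frobenius distance between feature covariances) rather than the original whitening-and-recoloring transformation, so it is a hedged variant, not the paper's exact procedure.

```python
import torch

def coral_loss(src, tgt):
    # src: (n, d), tgt: (m, d) feature activations from source and target batches
    def cov(x):
        xm = x - x.mean(dim=0, keepdim=True)
        return xm.t() @ xm / (x.size(0) - 1)    # sample covariance, shape (d, d)
    d = src.size(1)
    return ((cov(src) - cov(tgt)) ** 2).sum() / (4 * d * d)
```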