October 2023 · 74 Reads · 12 Citations
June 2023 · 43 Reads
Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
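A minimal sketch of the kind of synthetic setup described above, mixing a fixed global bigram model with a few sequence-specific ("in-context") bigrams; the vocabulary size, sequence length, and number of trigger tokens are illustrative choices, not the paper's exact configuration.

```python
# Tokens mostly follow a global bigram model shared across sequences, but a few
# "trigger" tokens get successors redrawn per sequence, so predicting them
# requires looking back at earlier occurrences in the same context.
import numpy as np

rng = np.random.default_rng(0)
V, T, K = 64, 256, 4  # vocabulary size, sequence length, number of trigger tokens

# Global bigram model: row v gives P(next token | current token = v).
global_bigram = rng.dirichlet(np.ones(V), size=V)

def sample_sequence():
    """One sequence: K trigger tokens whose successor is fixed per sequence."""
    triggers = rng.choice(V, size=K, replace=False)
    context_successor = {int(t): int(rng.integers(V)) for t in triggers}
    seq = [int(rng.integers(V))]
    for _ in range(T - 1):
        cur = seq[-1]
        if cur in context_successor:          # in-context bigram (per sequence)
            nxt = context_successor[cur]
        else:                                 # global bigram (shared statistics)
            nxt = int(rng.choice(V, p=global_bigram[cur]))
        seq.append(nxt)
    return np.array(seq)

print(sample_sequence()[:20])
```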
March 2023 · 268 Reads · 1 Citation
Self-Supervised Learning (SSL) has emerged as the solution of choice to learn transferable representations from unlabeled data. However, SSL requires building pairs of samples that are known to be semantically akin, i.e. positive views. Requiring such knowledge is the main limitation of SSL and is often tackled by ad-hoc strategies, e.g. applying known data augmentations to the same input. In this work, we generalize and formalize this principle through Positive Active Learning (PAL), where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it unveils a theoretically grounded learning framework beyond SSL that can be extended to tackle supervised and semi-supervised learning, depending on the employed oracle. Second, it provides a consistent algorithm to embed a priori knowledge, e.g. some observed labels, into any SSL loss without any change in the training pipeline. Third, it provides a proper active learning framework yielding low-cost solutions for annotating datasets, arguably bridging the gap between the theory and practice of active learning through queries of semantic relationships between inputs that are simple for non-experts to answer.
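A hedged sketch of the oracle abstraction described above: the learner queries whether two samples are semantically related, and swapping the oracle moves between self-supervised-like and (semi-)supervised regimes. All names and signatures below are illustrative, not the paper's API.

```python
from typing import Callable
import random

Oracle = Callable[[int, int], bool]  # True iff samples i and j form a positive pair

def augmentation_oracle(i: int, j: int) -> bool:
    """SSL-style oracle: only two views of the same underlying input are positive."""
    return i == j

def make_label_oracle(labels: dict[int, int]) -> Oracle:
    """(Semi-)supervised oracle: positives share an observed label, when known."""
    def oracle(i: int, j: int) -> bool:
        return i in labels and j in labels and labels[i] == labels[j]
    return oracle

def sample_positive_pair(indices: list[int], oracle: Oracle, trials: int = 1000):
    """Query the oracle on random pairs until a positive pair is found."""
    for _ in range(trials):
        i, j = random.sample(indices, 2)
        if oracle(i, j):
            return i, j
    return None

# Example: an oracle built from partially observed labels drives positive sampling.
pair = sample_positive_pair(list(range(100)),
                            make_label_oracle({k: k % 5 for k in range(100)}))
print(pair)
```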
December 2022 · 9 Reads
Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: download a copy of a foundation model, and fine-tune it using some in-house data about the target task of interest. Consequently, the Internet is swarmed by a handful of foundation models fine-tuned on many diverse tasks. Yet, these individual fine-tunings often lack strong generalization and exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain diverse features. Based on this insight, we propose model recycling, a simple strategy that leverages multiple fine-tunings of the same foundation model on diverse auxiliary tasks, and repurposes them as rich and diverse initializations for the target task. Specifically, model recycling fine-tunes in parallel each specialized model on the target task, and then averages the weights of all target fine-tunings into a final model. Empirically, we show that model recycling maximizes model diversity by benefiting from diverse auxiliary tasks, and achieves a new state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, model recycling is a contribution to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to incrementally and reliably update machine learning models.
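A toy sketch of the final recycling step described above: several fine-tunings of the same architecture on the target task are averaged, weight by weight, into a single model. Tiny linear layers stand in for real fine-tuned foundation models; the helper name is an assumption.

```python
import copy
import torch
import torch.nn as nn

def average_weights(models: list[nn.Module]) -> nn.Module:
    """Average the parameters of identically structured models into one model."""
    averaged = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in averaged.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))
    return averaged

# Stand-ins for target-task fine-tunings started from diverse auxiliary models.
torch.manual_seed(0)
finetuned = [nn.Linear(16, 4) for _ in range(3)]
final_model = average_weights(finetuned)
print(final_model.weight.shape)
```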
December 2022 · 123 Reads
Does the dominant approach to learning representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are "richer" than those obtained with a single optimization episode. This is supported by a collection of empirical results obtained with an apparently naïve ensembling technique: concatenating the representations obtained from multiple training episodes that use the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained from scratch. This proves that the representations constructed by the different training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation, because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
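A toy illustration of the ensembling technique described above: two encoders that differ only in their random seed are trained independently (training loops omitted here), and their outputs are concatenated before a downstream classifier. The architectures and sizes are placeholders.

```python
import torch
import torch.nn as nn

def make_encoder(seed: int) -> nn.Module:
    torch.manual_seed(seed)               # only the random seed differs
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

enc_a, enc_b = make_encoder(0), make_encoder(1)   # two independent training "episodes"
x = torch.randn(8, 32)
rich = torch.cat([enc_a(x), enc_b(x)], dim=1)     # concatenated representation
probe = nn.Linear(rich.shape[1], 10)              # classifier for a new task/distribution
print(rich.shape, probe(rich).shape)
```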
September 2022 · 51 Reads · 1 Citation
Neural Computing and Applications
We propose a system for calculating a “scaling constant” for layers and weights of neural networks. We relate this scaling constant to two important quantities that bear on the optimizability of neural networks, and argue that a network that is “preconditioned” via scaling, in the sense that all weights have the same scaling constant, will be easier to train. This scaling calculus has a number of consequences, among them that the geometric mean of the fan-in and fan-out, rather than the fan-in, the fan-out, or their arithmetic mean, should be used to initialize the variance of weights in a neural network. Our system allows for the off-line design and engineering of ReLU (Rectified Linear Unit) neural networks, potentially replacing blind experimentation. We verify the effectiveness of our approach on a set of benchmark problems.
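A hedged sketch of the initialization rule suggested above: scale the weight variance by the geometric mean of fan-in and fan-out rather than by the fan-in, the fan-out, or their arithmetic mean. The ReLU gain of sqrt(2) is a common convention, not necessarily the paper's exact constant.

```python
import math
import torch
import torch.nn as nn

def geometric_mean_init_(linear: nn.Linear, gain: float = math.sqrt(2.0)) -> None:
    """Initialize weights with std = gain / sqrt(geometric mean of fan-in, fan-out)."""
    fan_out, fan_in = linear.weight.shape
    fan_gm = math.sqrt(fan_in * fan_out)          # geometric mean of the two fans
    std = gain / math.sqrt(fan_gm)
    with torch.no_grad():
        linear.weight.normal_(0.0, std)
        if linear.bias is not None:
            linear.bias.zero_()

layer = nn.Linear(256, 1024)
geometric_mean_init_(layer)
print(layer.weight.std().item())
```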
June 2022 · 7 Reads
Proceedings of the AAAI Conference on Artificial Intelligence
Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, a failure that a low average error over the entire dataset does not expose. In consequential social and economic applications, where data represent people, this can lead to discrimination against underrepresented gender and ethnic groups. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. A practical implication of our results is that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation.
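A minimal sketch of the group-DRO objective discussed above: minimize the largest per-subpopulation loss instead of the dataset-wide average. The model, data, and group assignments are synthetic placeholders.

```python
import torch
import torch.nn as nn

def worst_group_loss(logits, targets, groups, num_groups):
    """Return the max over groups of the mean cross-entropy within each group."""
    per_sample = nn.functional.cross_entropy(logits, targets, reduction="none")
    group_losses = []
    for g in range(num_groups):
        mask = groups == g
        if mask.any():
            group_losses.append(per_sample[mask].mean())
    return torch.stack(group_losses).max()

torch.manual_seed(0)
model = nn.Linear(10, 3)
x, y = torch.randn(64, 10), torch.randint(0, 3, (64,))
g = torch.randint(0, 4, (64,))                  # subpopulation index per sample
loss = worst_group_loss(model(x), y, g, num_groups=4)
loss.backward()                                 # gradient of the worst-group risk
print(loss.item())
```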
April 2022 · 398 Reads · 1 Citation
Regularization is a fundamental technique to prevent over-fitting and to improve generalization performance by constraining a model's complexity. Current deep networks heavily rely on regularizers such as data augmentation (DA) or weight decay, and employ structural risk minimization, i.e. cross-validation, to select the optimal regularization hyper-parameters. In this study, we demonstrate that techniques such as DA or weight decay produce a model with a reduced complexity that is unfair across classes. The optimal amount of DA or weight decay found by cross-validation leads to disastrous model performance on some classes: e.g. on Imagenet with a resnet50, the "barn spider" classification test accuracy falls sharply when random crop DA is introduced during training. Even more surprising, such a performance drop also appears when introducing uninformative regularization techniques such as weight decay. Those results demonstrate that our search for ever-increasing generalization performance -- averaged over all classes and samples -- has left us with models and regularizers that silently sacrifice performance on some classes. This scenario can become dangerous when deploying a model on downstream tasks, e.g. an Imagenet pre-trained resnet50 deployed on INaturalist sees its performance on class #8889 fall sharply when random crop DA is introduced during the Imagenet pre-training phase. Those results demonstrate that designing novel regularizers without class-dependent bias remains an open research question.
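A small sketch of the per-class diagnostic this finding calls for: report per-class test accuracy rather than only the average, so that a regularizer's class-dependent damage (e.g. with versus without random crop DA) becomes visible. The model and data below are synthetic stand-ins.

```python
import torch

@torch.no_grad()
def per_class_accuracy(model, loader, num_classes):
    """Accuracy computed separately for each class over a test loader."""
    correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    for x, y in loader:
        pred = model(x).argmax(dim=1)
        for c in range(num_classes):
            mask = y == c
            correct[c] += (pred[mask] == c).sum()
            total[c] += mask.sum()
    return correct / total.clamp(min=1)

# Tiny synthetic check; in practice compare the outputs for a model trained with
# a given regularizer against one trained without it, class by class.
model = torch.nn.Linear(8, 10)
loader = [(torch.randn(32, 8), torch.randint(0, 10, (32,))) for _ in range(4)]
print(per_class_accuracy(model, loader, num_classes=10))
```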
March 2022 · 34 Reads
There often is a dilemma between ease of optimization and robust out-of-distribution (OoD) generalization. For instance, many OoD methods rely on penalty terms whose optimization is challenging: they are either too strong to optimize reliably or too weak to achieve their goals. In order to escape this dilemma, we propose to first construct a rich representation (RFC) containing a palette of potentially useful features, ready to be used by even simple models. On the one hand, a rich representation provides a good initialization for the optimizer. On the other hand, it also provides an inductive bias that helps OoD generalization. RFC is constructed in a succession of training episodes. During each step of the discovery phase, we craft a multi-objective optimization criterion and its associated datasets in a manner that prevents the network from using the features constructed in the previous iterations. During the synthesis phase, we use knowledge distillation to force the network to simultaneously develop all the features identified during the discovery phase. RFC consistently helps six OoD methods achieve top performance on the challenging invariant training benchmark ColoredMNIST (Arjovsky et al., 2020). Furthermore, on the realistic Camelyon17 task, our method helps both OoD and ERM methods outperform earlier comparable results, reduces the standard deviation of results, and makes hyperparameter tuning and model selection more reliable.
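A hedged sketch of the synthesis phase described above: a student network is distilled to reproduce, simultaneously, the features produced by several frozen discovery-phase networks. The architectures and the distillation loss are simplified stand-ins for the paper's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
teachers = [nn.Linear(32, 16) for _ in range(3)]        # frozen discovery-phase networks
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                        nn.Linear(64, 16 * len(teachers)))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(64, 32)
    with torch.no_grad():                                # teacher features as targets
        target = torch.cat([t(x) for t in teachers], dim=1)
    loss = nn.functional.mse_loss(student(x), target)    # develop all features at once
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```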
February 2022 · 525 Reads · 5 Citations
Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, a failure that a low average error over the entire dataset does not expose. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. The results cover finite and infinite numbers of training distributions, as well as convex and non-convex loss functions. An implication of our results is that for each DRO problem there exists a data distribution such that learning this distribution is equivalent to solving the DRO problem. Yet, important problems that DRO seeks to address (for instance, adversarial robustness and fighting bias) cannot be reduced to finding the one 'unbiased' dataset. Our discussion section addresses this important discrepancy.
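A toy numeric illustration of the equivalence discussed above, in the simplest finite case: the worst-case risk over a finite set of training distributions equals the averaged risk under an adequately (adversarially) chosen mixture weighting, which here simply concentrates on the hardest subpopulation. The loss values are arbitrary example numbers.

```python
import numpy as np

group_losses = np.array([0.31, 0.55, 0.42])      # expected loss per subpopulation

# DRO objective: worst expected risk across the subpopulations.
dro_value = group_losses.max()

# Equivalent view: average loss under mixture weights chosen adversarially on the
# simplex; the maximizer puts all its mass on the worst group.
weights = np.zeros_like(group_losses)
weights[group_losses.argmax()] = 1.0
reweighted_value = float(weights @ group_losses)

print(dro_value, reweighted_value)               # identical by construction
```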
... In contrast, models like TA-VAAL (Kim et al., 2021) and SRAAL incorporate class-conditional relationships, surpassing task-agnostic methods like VAAL (Sinha et al., 2019) and LL4AL (Yoo & Kweon, 2019). Despite these advancements, current methods under-utilize available unlabeled data for task learning (Gao et al., 2020; Cabannes et al., 2023), missing the opportunity to enhance visual representation and model performance. Self-supervised learning (Gidaris et al., 2018; Chen & He, 2021; Doersch et al., 2015; Noroozi & Favaro, 2016) exploits unlabeled data for intermediate representation learning, producing rich and semantically meaningful representations without requiring knowledge of the annotation space (distribution). ...
October 2023
... MoBYv2AL [102] incorporates the self-supervised learning algorithm MoBY into the DeepAL process and jointly optimizes a task-aware objective function with a contrastive loss. Cabannes et al. [103] proposed a unified learning framework that combines semi-supervised, self-supervised, and supervised learning based on the concept of a similarity graph, introducing a strategy called positive active learning that selects samples based on relative similarity comparisons rather than absolute sample labeling. ...
March 2023
... The goal of a good initialisation strategy is to produce similar statistics in every layer. This can be done in the forward pass (LeCun et al., 1998; Mishkin and Matas, 2016; Klambauer et al., 2017; Chang et al., 2020) or during back-propagation (Glorot and Bengio, 2010; Hoedt et al., 2018; Defazio and Bottou, 2021). It is also important to account for the effects due to non-linearities in the network (Saxe et al., ...). ...
September 2022
Neural Computing and Applications
... A statistical error will always be added on top. This can be circumvented to some degree using data augmentation methods, yet excessive use of such techniques bears the risk of inducing bias in the model [75,76]. ...
April 2022
... Differences in CTC-based training losses due to length, speaker, and acoustics may lead to varying magnitudes and irreducible components of losses across different groups. As a result, some groups with disproportionately high losses may dominate training with group DRO, causing under-training of the other groups, and ultimately negatively impacting overall downstream performance (Słowik & Bottou, 2022). ...
February 2022
... In this study, our experiment was conducted on an Intel Core i7-10700 CPU with 16GB RAM. In addition, we used the umap-learn library to implement our method and partially used code from the literature (Klimovskaia et al., 2020) to produce the visualization. Note that we did not vary λ and k, i.e., these parameters were fixed at λ = 0.1 and k = 20. ...
June 2020
... Selecting the right optimizer type becomes crucial for minimizing train and test errors. Both Adam and Adagrad can be recognized as effective optimizers [32]. Various learning rates are used for testing each optimizer. ...
March 2020
... As the study progresses, it delves into the application of deep learning techniques, such as CNNs and RNNs, in the realm of audio processing. The exploration of coupled deep learning frameworks, such as HR-LSTM [47], Dense-U-Net [55][56][57], Wave-U-Net [50][51][52][53][54], Conv-Tasnet [62][63][64][65], Res-U-Net [58,59], and LRCN [49], provides a holistic view of the cutting-edge approaches employed in this field. Moreover, the evaluation of combined architectures, such as DenseLSTM [60,61] and Audio Spectrogram Transformer [77], underscores the potential for increased efficiency and efficacy in separating instrumental acoustics. ...
November 2019
... The invariant features are informative about the labels and shared across domains, which can enable model generalization beyond the training domains [14]. A notable approach to learning invariant features is invariant risk minimization (IRM) [15]. The key idea of IRM is penalizing a dummy classifier to be simultaneously optimal on multiple training domains. ...
July 2019
... Instead, data typically have continuous relationships, more akin to a distance or similarity than a binary connection. Even in studies that have worked with data in this form [9,10], there is no clear prescription for determining the curvature or dimension of the underlying space. Both are important geometric parameters for interpreting continuous maps obtained from discrete data. ...
July 2019