Piotr Bojanowski's research while affiliated with Meta and other places

Publications (54)

Chapter
We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Visio...
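For concreteness, a minimal sketch of the matching step described above, assuming a ViT-style encoder has already produced embeddings for the masked and unmasked views. The momentum-averaged target encoder and the mean-entropy regularizer of the full method are omitted, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def msn_loss(z_masked, z_target, prototypes, temp_student=0.1, temp_teacher=0.025):
    """Match prototype assignments of a masked view to those of the unmasked view.

    z_masked:   (B, D) embeddings of the randomly masked view (student)
    z_target:   (B, D) embeddings of the original unmasked view (teacher)
    prototypes: (K, D) learnable cluster prototypes
    """
    p = F.normalize(prototypes, dim=-1)
    # Cosine similarities to the prototypes, temperature-scaled.
    logits_s = F.normalize(z_masked, dim=-1) @ p.t() / temp_student
    with torch.no_grad():
        # Sharper teacher distribution; no gradient flows to the target branch.
        targets = F.softmax(F.normalize(z_target, dim=-1) @ p.t() / temp_teacher, dim=-1)
    # Cross-entropy between teacher assignments and student predictions.
    return -(targets * F.log_softmax(logits_s, dim=-1)).sum(dim=-1).mean()
```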
Preprint
Full-text available
A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that the formulation of all these methods contains an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably sema...
Article
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When...
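A minimal PyTorch sketch of the alternating block described above; the paper's Affine normalization and LayerScale initialization are replaced by plain LayerNorm here for brevity, so this is an approximation rather than the reference implementation.

```python
import torch
import torch.nn as nn

class ResMLPBlock(nn.Module):
    """One ResMLP block: a cross-patch linear layer, then a per-patch MLP."""
    def __init__(self, num_patches: int, dim: int, hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.patch_mix = nn.Linear(num_patches, num_patches)  # (i) patches interact
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                              # (ii) channels interact
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):  # x: (batch, patches, dim)
        # Transpose so the linear layer mixes across patches, identically per channel.
        x = x + self.patch_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Two-layer feed-forward applied independently to each patch.
        x = x + self.mlp(self.norm2(x))
        return x
```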
Preprint
Learning a diverse set of skills by interacting with an environment without any external supervision is an important challenge. In particular, obtaining a goal-conditioned agent that can reach any given state is useful in many applications. We propose a novel method for training such a goal-conditioned agent without any external rewards or any doma...
Preprint
Full-text available
We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Visio...
Preprint
Full-text available
Discriminative self-supervised learning allows training models on any random group of internet images, possibly recovering salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks. In this work, we...
Preprint
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling by an attention-based aggregation layer, akin to a single transformer block, that weights how the patches are involved in the classification decision. We plug this learned aggregation layer with a s...
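A hedged sketch of such an attention-based aggregation layer, using a single learned query over the patch grid; AttentionPooling and its internals are illustrative choices rather than the paper's exact block.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Replace global average pooling with a single-query attention layer that
    weights how each patch contributes to the classification decision."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learned class query
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, patches):  # patches: (batch, N, dim) from a convnet trunk
        q = self.query.expand(patches.size(0), -1, -1)
        pooled, weights = self.attn(q, patches, patches)  # weights: (batch, 1, N)
        # The attention weights directly visualize patch importance.
        return pooled.squeeze(1), weights
```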
Preprint
Full-text available
Information retrieval is an important component in natural language processing, for knowledge-intensive tasks such as question answering and fact checking. Recently, information retrieval has seen the emergence of dense retrievers, based on neural networks, as an alternative to classical sparse methods based on term frequency. These models have obt...
Preprint
Full-text available
Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e., words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This f...
Preprint
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When...
Preprint
Full-text available
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain e...
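The self-distillation objective underlying this work (DINO) can be sketched as a cross-entropy between sharpened teacher and student outputs. This minimal version omits the momentum teacher update and multi-crop augmentation, and the temperatures are indicative defaults, not the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions.

    student_out, teacher_out: (B, K) head outputs for two views of one image;
    center is a running mean of teacher outputs, subtracted to avoid collapse.
    """
    # Teacher: centered and sharpened, treated as a fixed target.
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()
```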
Preprint
This paper proposes a novel method of learning by predicting view assignments with support samples (PAWS). The method trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled instance are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representation...
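A rough sketch of the non-parametric pseudo-labelling step: a soft nearest-neighbour vote of an unlabelled view against labelled support samples. Target sharpening and the mean-entropy regularizer of the full method are omitted, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def paws_pseudo_labels(anchor, support, support_labels, tau=0.1):
    """Non-parametric pseudo-labels via similarity to the support set.

    anchor:         (B, D) embeddings of an unlabelled view
    support:        (M, D) embeddings of labelled support samples
    support_labels: (M, C) one-hot labels of the support set
    """
    sim = F.normalize(anchor, dim=-1) @ F.normalize(support, dim=-1).t()  # (B, M)
    weights = F.softmax(sim / tau, dim=-1)
    return weights @ support_labels  # (B, C) soft pseudo-labels

def paws_consistency(p_view1, p_view2):
    """Consistency loss: the pseudo-label of one view is the target for the other."""
    return -(p_view2.detach() * torch.log(p_view1 + 1e-8)).sum(dim=-1).mean()
```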
Preprint
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a controlled environment, namely the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset....
Preprint
In this work, we address the problem of image-goal navigation in the context of visually-realistic 3D environments. This task involves navigating to a location indicated by a target image in a previously unseen environment. Earlier attempts, including RL-based and SLAM-based approaches, have either shown poor generalization performance, or are heav...
Preprint
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose...
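The proposal truncated here replaces explicit pairwise comparisons with online cluster assignments (the SwAV method), computed with a few Sinkhorn-Knopp iterations that spread the batch evenly over the prototypes. A minimal sketch of that assignment step, with eps and the iteration count as indicative defaults:

```python
import torch

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    """Online equipartitioned cluster assignment (Sinkhorn-Knopp).

    scores: (B, K) similarities between batch features and K prototypes.
    Returns soft codes Q in which every cluster receives roughly B/K
    samples, preventing all features from collapsing onto one prototype.
    """
    Q = torch.exp(scores / eps).t()  # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K  # normalize rows (clusters)
        Q /= Q.sum(dim=0, keepdim=True); Q /= B  # normalize columns (samples)
    return (Q * B).t()  # (B, K) assignment codes
```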
Preprint
Learning to navigate in a realistic setting where an agent must rely solely on visual inputs is a challenging task, in part because the lack of position information makes it difficult to provide supervision during training. In this paper, we introduce a novel approach for learning to navigate from image inputs without external supervision or reward...
Preprint
Convolutional neural networks trained without supervision come close to matching the performance of supervised pre-training, but sometimes at the cost of an even higher number of parameters. Extracting subnetworks from these large unsupervised convnets while preserving performance is of particular interest to make them less computationally intensive. T...
Preprint
In this paper, we focus on the problem of adapting word vector-based models to new textual data. Given a model pre-trained on large reference data, how can we adapt it to a smaller piece of data with a slightly different language distribution? We frame the adaptation problem as a monolingual word vector alignment problem, and simply average models...
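One standard way to realize the alignment-then-average recipe is orthogonal Procrustes over a shared vocabulary. A small NumPy sketch; the adapt helper and its alpha weight are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map Q minimizing ||XQ - Y||_F for word vector matrices
    X, Y whose rows are indexed by a shared vocabulary."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def adapt(X_ref, X_new, alpha=0.5):
    """Align the reference vectors onto the new-domain vectors, then average.
    X_ref, X_new: (V, d) matrices over the shared vocabulary."""
    Q = procrustes(X_ref, X_new)
    return alpha * (X_ref @ Q) + (1 - alpha) * X_new
```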
Preprint
In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our m...
Preprint
We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in Transformers, while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character-level language modeling, where we...
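The learned span can be implemented as a soft mask on attention weights as a function of token distance, following the published masking formulation; the ramp width below is an indicative default.

```python
import torch

def span_mask(distances, z, ramp=32):
    """Soft adaptive-span mask: attention at distance x is scaled by
    m_z(x) = clamp((ramp + z - x) / ramp, 0, 1),
    where z is a learned span parameter (one per attention head).
    Adding a penalty on z to the loss lets each head shrink its
    context window to what it actually needs.
    """
    return torch.clamp((ramp + z - distances) / ramp, min=0.0, max=1.0)

# Example: mask = span_mask(torch.arange(1024.), z=torch.tensor(100.))
```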
Preprint
Pre-training general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated datasets like ImageNet, whereas using uncurated raw datasets was found to decrease the feature quality...
Preprint
Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the...
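A compact sketch of one DeepCluster round, alternating k-means on the features with classification of the resulting pseudo-labels. Real training streams mini-batches and re-clusters once per epoch, so this in-memory version is only illustrative; it assumes the encoder maps image batches to flat feature vectors.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_round(encoder, images, k=100, epochs=1, lr=1e-3):
    """One DeepCluster round: cluster features, then train on pseudo-labels.

    images: (N, C, H, W) tensor; encoder: maps images to (N, D) features.
    """
    # 1) Compute features for the dataset and cluster them with k-means.
    encoder.eval()
    with torch.no_grad():
        feats = encoder(images)  # (N, D)
    pseudo = torch.as_tensor(
        KMeans(n_clusters=k, n_init=10).fit_predict(feats.cpu().numpy())
    )
    # 2) Train encoder + a re-initialized classifier on cluster ids as labels.
    classifier = nn.Linear(feats.shape[1], k)
    opt = torch.optim.SGD(
        list(encoder.parameters()) + list(classifier.parameters()), lr=lr
    )
    encoder.train()
    for _ in range(epochs):
        logits = classifier(encoder(images))
        loss = nn.functional.cross_entropy(logits, pseudo)
        opt.zero_grad(); loss.backward(); opt.step()
    return pseudo
```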
Article
Continuous word representations, learned on different languages, can be aligned with remarkable precision. Using a small bilingual lexicon as training data, learning the linear transformation is often formulated as a regression problem using the square loss. The obtained mapping is known to suffer from the hubness problem, when used for retrieval t...
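The hubness correction this line of work builds on is the CSLS criterion, which rescales cosine similarities by each point's local neighbourhood density; the article itself goes further and optimizes a relaxed CSLS loss end-to-end. A minimal sketch of the criterion:

```python
import torch

def csls(x_emb, y_emb, k=10):
    """Cross-domain Similarity Local Scaling for retrieval.

    Cosine similarity corrected by each point's average similarity to its
    k nearest neighbours in the other language, penalizing 'hub' words
    that are close to everything.
    x_emb: (n, d), y_emb: (m, d), both with L2-normalized rows.
    """
    sim = x_emb @ y_emb.t()                      # (n, m) cosine similarities
    r_x = sim.topk(k, dim=1).values.mean(dim=1)  # avg sim of each x to its kNN in y
    r_y = sim.topk(k, dim=0).values.mean(dim=0)  # avg sim of each y to its kNN in x
    return 2 * sim - r_x[:, None] - r_y[None, :]
```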
Article
Recurrent neural networks (RNNs) have achieved impressive results in a variety of linguistic processing tasks, suggesting that they can induce non-trivial properties of language. We investigate here to what extent RNNs learn to track abstract hierarchical syntactic structure. We test whether RNNs trained with a generic language modeling objective i...
Article
Full-text available
Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we d...
Article
Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main resul...
Article
This paper shows that a simple baseline based on a Bag-of-Words (BoW) representation learns surprisingly good knowledge graph embeddings. By casting knowledge base completion and question answering as supervised classification problems, we observe that modeling co-occurrences of entities and relations leads to state-of-the-art performance with a tra...
Article
Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks. Such applications include person and action recognition, text-to-video alignment, object co-segmentation and colocalization in videos and images. One drawback of discriminative clustering, however, is its limited scalability. We address this iss...
Article
Generative Adversarial Networks (GANs) have been shown to be able to sample impressively realistic images. GAN training consists of a saddle point optimization problem that can be thought of as an adversarial game between a generator which produces the images, and a discriminator, which judges if the images are real. Both the generator and the disc...
Article
Convolutional neural networks provide visual features that perform remarkably well in many computer vision applications. However, training these networks requires significant amounts of supervision. This paper introduces a generic framework to train deep networks, end-to-end, with no supervision. We propose to fix a set of target representations, c...
Article
We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accur...
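A bare-bones version of product quantization applied to an embedding matrix; the paper adds refinements such as norm-aware quantization and aggressive vocabulary pruning, so this sketch only illustrates the core memory trade-off.

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(E, n_subvectors=4, n_centroids=256):
    """Compress an embedding matrix E (V, d): split each vector into
    sub-vectors and store, per sub-vector, only a centroid index.
    Memory drops from V*d floats to V*n_subvectors bytes plus codebooks.
    """
    V, d = E.shape
    ds = d // n_subvectors
    codebooks, codes = [], []
    for s in range(n_subvectors):
        block = E[:, s * ds:(s + 1) * ds]
        km = KMeans(n_clusters=n_centroids, n_init=4).fit(block)
        codebooks.append(km.cluster_centers_)      # (n_centroids, ds)
        codes.append(km.labels_.astype(np.uint8))  # (V,) one byte per sub-vector
    return codebooks, np.stack(codes, axis=1)      # codes: (V, n_subvectors)

def reconstruct(codebooks, codes):
    """Approximate embeddings by concatenating the indexed centroids."""
    return np.concatenate([cb[c] for cb, c in zip(codebooks, codes.T)], axis=1)
```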
Article
Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Many popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many...
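The subword model represents a word as the sum of vectors for its hashed character n-grams, so vectors exist even for words never seen in training. A small sketch in which Python's built-in hash stands in for the FNV hash of the actual fastText implementation, and the word's own vector (also included in fastText) is omitted.

```python
import numpy as np

def subword_ids(word, n_min=3, n_max=6, n_buckets=2_000_000):
    """Hash the character n-grams of a word into a fixed number of buckets;
    '<' and '>' mark word boundaries so prefixes and suffixes are distinct."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return [hash(g) % n_buckets for g in grams]  # stand-in for FNV-1a hashing

def word_vector(word, bucket_vectors):
    """Sum the subword vectors; works for out-of-vocabulary words too."""
    return bucket_vectors[subword_ids(word)].sum(axis=0)

# bucket_vectors: (n_buckets, dim) array, e.g. np.random.randn(2_000_000, 100)
```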
Article
This paper proposes a simple and efficient approach for text classification and representation learning. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion word...
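Architecturally, the classifier amounts to averaged word (and n-gram) embeddings followed by a linear layer. A minimal PyTorch rendition using nn.EmbeddingBag, with the hierarchical softmax and hashed bigram features of the real system omitted.

```python
import torch
import torch.nn as nn

class FastTextClassifier(nn.Module):
    """fastText-style classifier: mean of token embeddings, then a linear layer.
    nn.EmbeddingBag does the lookup-and-average in one fused step."""
    def __init__(self, vocab_size, dim, n_classes):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.out = nn.Linear(dim, n_classes)

    def forward(self, token_ids, offsets):
        # token_ids: 1-D tensor of all tokens in the batch, concatenated;
        # offsets: start index of each document within token_ids.
        return self.out(self.embed(token_ids, offsets))

model = FastTextClassifier(vocab_size=50_000, dim=100, n_classes=4)
logits = model(torch.tensor([3, 14, 15, 9, 2, 6]), torch.tensor([0, 4]))
```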
Thesis
Videos often depict complex scenes including people, objects and interactions between these and the environment. Relations between agents are likely to evolve in time and agents can perform actions. The automatic understanding of video data is complicated, as it requires properly localizing the agents in both space and time. Moreover, one needs to a...
Article
Recurrent neural networks are convenient and efficient models for language modeling. However, when applied on the level of characters instead of words, they suffer from several problems. In order to successfully model long-term dependencies, the hidden representation needs to be large. This in turn implies higher computational costs, which can beco...
Article
We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. First, we develop a joint model for video and natural language narration that takes advantage of the complementary nature of the two signa...
Article
Full-text available
In this paper, we address the problem of multi-label classification. We consider linear classifiers and propose to learn a prior over the space of labels to directly improve the performance of such methods. This prior takes the form of a quadratic function of the labels and makes it possible to encode both attractive and repulsive relations between labels....
Article
Full-text available
Suppose that we are given a set of videos, along with natural language descriptions in the form of multiple sentences (e.g., manual annotations, movie scripts, sport summaries etc.), and that these sentences appear in the same temporal order as their visual counterparts. We propose in this paper a method for aligning the two modalities, i.e., autom...
Conference Paper
Full-text available
We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone", extracted from, for example, the associated text script. We seek to temporally localize the individual actions in each clip as well as to learn a discriminative classifier for each action. We formulate the probl...
Conference Paper
We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constr...

Citations

... Based on these findings, we thus present a simple parametric classification baseline for generalized category discovery (see Figs. 1 and 7). The representation learning objective follows GCD [33], and the classification objective is simply cross-entropy for labelled samples and self-distillation [1,7] for unlabelled samples. ...
... Model Architectures. We conduct extensive experiments on 45 deep models spanning a range of architectures: the multilayer-perceptron (MLP) based ResMLP (Touvron et al, 2022); classical CNN models (VGG (Simonyan and Zisserman, 2015), ResNet (He et al, 2016), DenseNet (Huang et al, 2017), MobileNet-V2 (Sandler et al, 2018), NASNet (Zoph et al, 2018), InceptionV4 (Szegedy et al, 2017) and the squeeze-and-excitation network (Hu et al, 2018)); and recent transformer models (ViT (Dosovitskiy et al, 2021), Swin (Liu et al, 2021) and DeiT (Touvron et al, 2021)). The pre-trained CNN models are provided by torchvision 0.10.0 (Paszke et al, 2019), and the other pre-trained models are provided by timm 0.4.12 (Wightman, 2019). ...
... While supervised methods have achieved very impressive results [35], the extensive need for supervision has inspired many works aiming to learn with fewer labels [6,33]. Another prominent line of work aims to use large unlabelled datasets to learn a strong visual representation, which can then be utilized for labelling downstream datasets with fewer supervised samples [4,12,13]. Many clustering methods based on deep learning have been proposed for categorizing large datasets without any labels [14,37]. ...
... Notably, the outcome of this representation learning is only an encoder network that is commonly used for downstream tasks, such as multi-class classification and BC. Using this technique, several semi-supervised classification tasks, particularly multi-class classification on datasets such as CIFAR, STL-10, and ImageNet, have been solved with results nearly matching the fully supervised setting, as demonstrated in the studies presented in [7,8,6]. However, these approaches rely on stochastic data augmentation techniques. ...
... These clustering-based methods can account for inter-data similarity; representations are encouraged to encode the semantic structure of the data. Prior works [51,49,4,32] have shown encouraging results in small-scale settings; Caron et al. [6] show that the approach can also be applied to large-scale datasets or even to a non-curated dataset [7]. Recently, several works [2,8,39] have adopted the philosophy of augmentation invariance and achieved strong empirical results. ...
... Artetxe et al. (2018a) form an initial solution with similarity matrices and refine with iterative Procrustes. Grave et al. (2019b) optimize "Procrustes in Wasserstein Distance", employing a quadratic assignment formulation and the Frank-Wolfe method. Ren et al. (2020) form CSLS similarity matrices, iteratively extract cliques, and map with Procrustes. ...
... These adaptive computation methods (e.g. Han et al., 2021; Sukhbaatar et al., 2019; Schuster et al., 2021; Scardapane et al., 2020; Bapna et al., 2020; Elbayad et al., 2019; Schwartz et al., 2020) aim to use fewer compute resources for the easier inference steps. While many of these solutions have proven extremely effective in practice, they usually require changing the model architecture, changing the training procedure and re-training the models, and they do not maintain identical outputs. ...
... and Ethan A. Chi, Stanford University (ethanchi@cs.stanford.edu). Task cites: (Nishino et al., 2019; Rozner et al., 2021; Jones et al., 2020; Mays et al., 1991; Edizel et al., 2019; Sakaguchi et al., 2016; Kim et al., 2015; Xue et al., 2021a; ...
... On most natural language processing (NLP) tasks, some variation of this approach provides the best results. Previously, many multilingual and cross-lingual artificial intelligence approaches to text used pretrained and aligned word embeddings, such as the aligned fastText embeddings [8,30]. But, in line with the overall trend in NLP, recent approaches have also moved towards fine-tuning pretrained transformers. ...
... Classification-based technique. In contrast to the regression-based approach, classification-based methods construct pseudo-labels to help model training. Sun et al. (2020) present Multi-Stage Self-Supervised (M3S) learning, which leverages DeepCluster (Caron et al. 2018) to iteratively train an encoder that assigns pseudo-labels to unlabeled nodes at each training iteration. Similarly, You et al. (2020) introduce a node-clustering method that uses pre-computed cluster indices as self-supervised labels. ...