Armand Joulin’s research while affiliated with Meta and other places

Publications (24)


DINOv2: Learning Robust Visual Features without Supervision
  • Preprint

April 2023

·

621 Reads

·

27 Citations

Maxime Oquab

·

Timothée Darcet

·

[...]

·

Piotr Bojanowski

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.


Figure 2. Automatic evaluation of SIDE components on the WAFER test set.
Figure 3. Crowd annotator evaluation for 2016 claims in the WAFER test set for which SIDE produces a citation with higher evidence score than the existing Wikipedia citation. We collect 5 annotations per claim and report majority voting results, bucketed according to the verification engine score associated with the existing Wikipedia citation (bucket size reported on top).
Improving Wikipedia Verifiability with AI
  • Preprint
  • File available

September 2022

·

72 Reads

·

1 Citation

Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not support a given claim or become obsolete once the original source is updated or deleted. Hence, maintaining and improving the quality of Wikipedia references is an important challenge and there is a pressing need for better tools to assist humans in this effort. Here, we show that the process of improving references can be tackled with the help of artificial intelligence (AI). We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims, and subsequently recommend better ones from the web. We train this model on existing Wikipedia references, therefore learning from the contributions and combined wisdom of thousands of Wikipedia editors. Using crowd-sourcing, we observe that for the top 10% most likely citations to be tagged as unverifiable by our system, humans prefer our system's suggested alternatives compared to the originally cited reference 70% of the time. To validate the applicability of our system, we built a demo to engage with the English-speaking Wikipedia community and find that Side's first citation recommendation collects double the preferences of the existing Wikipedia citation for the same top 10% most likely unverifiable claims according to Side. Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia. More generally, we hope that our work can be used to assist fact checking efforts and increase the general trustworthiness of information online. 
All our code, data, indexes and models are publicly available at https://github.com/facebookresearch/side.

Unbounded cache model for online language modeling with open vocabulary

November 2017

·

41 Reads

·

40 Citations

Recently, continuous cache models were proposed as extensions to recurrent neural network language models, to adapt their predictions to local changes in the data distribution. These models only capture the local context, of up to a few thousand tokens. In this paper, we propose an extension of continuous cache models, which can scale to larger contexts. In particular, we use a large scale non-parametric memory component that stores all the hidden activations seen in the past. We leverage recent advances in approximate nearest neighbor search and quantization algorithms to store millions of representations while searching them efficiently. We conduct extensive experiments showing that our approach significantly improves the perplexity of pre-trained language models on new distributions, and can scale efficiently to much larger contexts than previously proposed local cache models.


Improving Neural Language Models with a Continuous Cache

December 2016

·

230 Reads

·

164 Citations

We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.
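The memory lookup described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the `theta` scaling parameter, the pairing of each stored activation with the word that followed it, and the uniform fallback for an empty memory are all assumptions made for illustration.

```python
import math

def cache_distribution(query, memory, vocab_size, theta=1.0):
    """Turn stored hidden activations into a distribution over words.

    query:  current hidden activation (list of floats).
    memory: list of (hidden_activation, next_word_id) pairs seen so far
            (hypothetical layout, assumed for this sketch).
    theta:  assumed scaling factor applied to each dot product.
    """
    scores = [0.0] * vocab_size
    for hidden, word_id in memory:
        # Similarity between the current state and a stored state.
        dot = sum(q * h for q, h in zip(query, hidden))
        # Words that followed similar past states receive more mass.
        scores[word_id] += math.exp(theta * dot)
    total = sum(scores)
    if total == 0.0:
        # Empty memory: fall back to a uniform distribution (assumption).
        return [1.0 / vocab_size] * vocab_size
    return [s / total for s in scores]

memory = [([1.0, 0.0], 2), ([0.0, 1.0], 0)]
probs = cache_distribution([1.0, 0.0], memory, vocab_size=3)
```

In this toy call, word 2 receives the most probability because its stored activation aligns with the query; word 1 never appeared in the memory and receives none. A full model would presumably combine this distribution with the language model's own prediction, which the sketch leaves out.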


FastText.zip: Compressing text classification models

December 2016

·

1,005 Reads

·

507 Citations

We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantization artefacts. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy.


Variable Computation in Recurrent Neural Networks

November 2016

·

66 Reads

·

27 Citations

Recurrent neural networks (RNNs) have been used extensively and with increasing success to model various types of sequential data. Much of this progress has been achieved through devising recurrent units and architectures with the flexibility to capture complex statistics in the data, such as long range dependency or localized attention phenomena. However, while many sequential data (such as video, speech or language) can have highly variable information flow, most recurrent models still consume input features at a constant rate and perform a constant number of computations per time step, which can be detrimental to both speed and model capacity. In this paper, we explore a modification to existing recurrent units which allows them to learn to vary the amount of computation they perform at each step, without prior knowledge of the sequence's time structure. We show experimentally that not only is our model more computationally efficient, it also leads to better performance overall on our evaluation tasks.


Learning Visual Features from Large Weakly Supervised Data

October 2016

·

149 Reads

·

412 Citations

Lecture Notes in Computer Science

Convolutional networks trained on large supervised datasets produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and comments, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity and learn correspondences between different languages.


Revisiting Visual Question Answering Baselines

October 2016

·

76 Reads

·

243 Citations

Lecture Notes in Computer Science

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to perform “reasoning”. Furthermore, for the task of multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance of 65.8% accuracy on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. Additionally, we explore variants of the model and study the transferability of the model between both datasets. We also present an error analysis of our best model, the results of which suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers.


Figure 4. Computational time for the adaptive softmax on the Bulgarian Europarl data, as a function of the number of clusters.
Figure 5. Finnish Europarl: perplexity (on validation) as a function of time for our method and baselines. We represent the result after each epoch by a point. Our method compares favorably with all other approaches w.r.t. the tradeoff between perplexity and training time. Similar conclusions are drawn for the other languages.
One Billion Word benchmark. Perplexity on the test set for single models. Our result is obtained after 5 epochs.
Efficient softmax approximation for GPUs

September 2016

·

477 Reads

·

163 Citations

We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computational complexity. Our approach further reduces the computational cost by exploiting the specificities of modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing units. Our experiments carried out on standard benchmarks, such as EuroParl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax.


Enriching Word Vectors with Subword Information

July 2016

·

3,072 Reads

·

9,387 Citations

Transactions of the Association for Computational Linguistics

Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, making it possible to train models on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
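The bag-of-character-n-grams representation in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the fastText implementation: the `<` and `>` boundary markers, the default n-gram range of 3 to 6, and the plain dictionary `ngram_vectors` are choices made for the sketch, and both function names are hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    token = "<" + word + ">"  # boundary markers are an assumption here
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams

def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of the word's n-grams found in the lookup table.

    ngram_vectors is a hypothetical dict mapping n-gram -> list of floats;
    n-grams missing from the table contribute nothing.
    """
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        gram_vec = ngram_vectors.get(gram)
        if gram_vec is not None:
            for k in range(dim):
                vec[k] += gram_vec[k]
    return vec

trigrams = char_ngrams("where", 3, 3)
table = {"<wh": [1.0, 2.0], "re>": [0.5, 0.5]}
vec = word_vector("where", table, dim=2)
```

Because the vector is built from n-grams rather than looked up per word, a word absent from the training vocabulary still gets a representation as long as some of its n-grams are in the table.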


Citations (22)


... We consider three different (frozen) backbones to extract features from the input images: ResNet-50 [13], DINOv2 [31], and CLIP [37]. We use the libraries LAMPE [41] and ZUKO [40] to learn a model (of type NPE ) and to manipulate the normalizing flows (of type NSF ), respectively. ...

Reference:

Physically Interpretable Probabilistic Domain Characterization
DINOv2: Learning Robust Visual Features without Supervision
  • Citing Preprint
  • April 2023

... Considerable progress has been made in the improvement of text generating LLMs over past years, which reflects in reaching significant results on such benchmarks datasets for Language Modelling as WikiText-103 [7], WikiText2, Text8, etc. Constant improvements of the text generated quality are presented almost monthly, showing how significant the growth of the sphere has become. These enhancements have been made by developing the LLM itself and then fine-tuning [24] it on the specific tasks. ...

Improving Neural Language Models with a Continuous Cache
  • Citing Article
  • December 2016

... We discovered that depending solely on the language information from the source could be misleading. Hence, we employ a specialized model known as fastText (Joulin, Grave, Bojanowski, Douze, et al., 2016; to accurately determine the article's language. Subsequently, we standardize each article's text. ...

FastText.zip: Compressing text classification models
  • Citing Article
  • December 2016

... PoWER (Goyal et al., 2020) attention drop tokens TR-BERT (Ye et al., 2021) hidden states forward tokens LAT (Kim and Cho, 2021) attention forward tokens LTP attention drop tokens Transkimmer (Guan et al., 2022) hidden states forward tokens VCRNN (Jernite et al., 2017) input token; hidden states partial update with zero-masked weights Skim-RNN (Seo et al., 2018) input token; hidden states partial update with a small RNN HM-RNN (Chung et al., 2017) states of the gates skip a single step; "flush" FHRNN (Ke et al., 2018) query; hidden states update the upper RNN layer ing jumped over important information. LSTM-Shuttle (Fu and Ma, 2018) proposes a bidirectional shuttling mechanism, which can jump multiple time steps both forward and backward, allowing the model to ignore unimportant information and recover lost information if needed. ...

Variable Computation in Recurrent Neural Networks
  • Citing Article
  • November 2016

... Despite recent advances in multi-modal models (Zhang et al., 2024a) using transformer architectures, they remain poorly understood and often show unwanted behaviors such as poor visiocompositional reasoning (Thrush et al., 2022; Diwan et al., 2022) or spatial reasoning skills (Kamath et al., 2023). In addition, in the visual question-answering domain it is a well-known problem that models often lack visual grounding and have trouble integrating textual and visual data (Goyal et al., 2017; Jabri et al., 2016; Agrawal et al., 2018). This makes it perhaps even more puzzling that Alper and Averbuch-Elor (2023) found strong evidence for a bouba-kiki effect in CLIP and Stable Diffusion: even if these models are able to extract sound-symbolic information in the absence of auditory data, they will likely struggle to actually associate that information with visual properties. ...

Revisiting Visual Question Answering Baselines
  • Citing Conference Paper
  • October 2016

Lecture Notes in Computer Science

... In this work, we revisit the classification method for pretraining on large-scale text-image pairs. Some previous works [54,31,39,27,51] attempt to tackle this by employing bag-of-words classification in a weak supervised learning manner. However, most of these studies have been conducted on a small scale, and there is no evidence demonstrating their scalability in terms of data and model size. ...

Learning Visual Features from Large Weakly Supervised Data
  • Citing Conference Paper
  • October 2016

Lecture Notes in Computer Science

... Efficient Softmax: The softmax operation, crucial for generating probability distributions over target vocabularies in language models, presents a significant computational bottleneck, particularly for large vocabularies. Existing approaches address this through techniques such as low-rank approximation of classifier weights [7,36], clustering of classifier weights or hidden states to pre-select target tokens [12,8]. However, these methods remain computationally expensive for NAR models which perform multiple softmax operations within a single forward pass. ...

Efficient softmax approximation for GPUs

... Our team annotated a sample of a couple hundred articles by declaring articles low-quality if the article was not worth reading. We trained a support vector machine (SVM) binary classifier on the annotated data that classifies the articles relying on the fastText embedding (Bojanowski et al., 2017) of their concatenated title and body. We leverage the Croatian fastText embedding model for this purpose. ...

Enriching Word Vectors with Subword Information
  • Citing Article
  • July 2016

Transactions of the Association for Computational Linguistics

... With the advent of Convolutional Neural Networks (CNNs), researchers [20][21][22] began utilizing neural network to extract hierarchical features from images. Some researches employed CNNs to generate candidate bounding boxes [23][24][25], which were evaluated for their likelihood of containing foreground objects. Recently, the introduction of ViT (Vision Transformer) architecture has led to novel ideas in unsupervised object discovery and detection tasks [26][27][28]. ...

Co-localization in Real-World Images
  • Citing Conference Paper
  • June 2014