Victor Sanh’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added these works to it. It is automatically generated from public data to further our legitimate goal of comprehensive and accurate scientific recordkeeping.

Publications (32)


Figure 1: From Laurençon et al. (2024). The self-attention, or fully-autoregressive, architecture: input images are processed by the vision encoder, and the resulting visual features are mapped (and optionally pooled) into the LLM input space to obtain visual tokens. These are concatenated (and potentially interleaved) with the input sequence of text embeddings, and the concatenated sequence is fed to the language model (LLM), which predicts the output text tokens.
Figure 2: The different stages of training and the types of datasets used.
Figure 3: Types of examples used during the pre-training of VLMs. (a) An image-text pair from LAION COCO, (b) an interleaved image-text document from OBELICS, (c) a PDF document from OCR-IDL.
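The fully-autoregressive layout described in the Figure 1 caption can be sketched in a few lines. Everything below is an illustrative stand-in (toy encoders, made-up dimensions), not Idefics3 code; it only shows how visual tokens and text embeddings end up in one input sequence for the LLM.

```python
# Illustrative sketch of the fully-autoregressive VLM layout from Figure 1.
# All component names and dimensions here are hypothetical stand-ins.

def vision_encoder(image, num_patches=4, dim=8):
    # Stand-in vision encoder: one feature vector per image patch.
    return [[0.0] * dim for _ in range(num_patches)]

def project_and_pool(features, pool=2):
    # Map (and optionally pool) the visual features into the LLM input space.
    return [features[i] for i in range(0, len(features), pool)]

def embed_text(tokens, dim=8):
    # Stand-in text embedding lookup.
    return [[0.1] * dim for _ in tokens]

def build_llm_input(image, text_tokens):
    visual_tokens = project_and_pool(vision_encoder(image))
    text_embeddings = embed_text(text_tokens)
    # Concatenate visual tokens with the text embedding sequence; this
    # combined sequence is what the language model consumes autoregressively.
    return visual_tokens + text_embeddings

seq = build_llm_input(image=None, text_tokens=["What", "is", "this", "?"])
print(len(seq))  # 4 patches pooled by 2 -> 2 visual tokens, plus 4 text tokens
```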
Building and better understanding vision-language models: insights and future directions
  • Preprint

August 2024 · 44 Reads

Hugo Laurençon · Andrés Marafioti · Victor Sanh · Léo Tronchon

The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.


The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

June 2024 · 97 Reads

Shayne Longpre · Stella Biderman · [...] · Luca Soldaini

Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g., software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding; precise and limitation-aware artifact documentation; efficient model training; advance awareness of the environmental impact of training; careful evaluation of model capabilities, risks, and claims; and responsible model release, licensing, and deployment practices. We hope this curated collection of resources helps guide more responsible development. Curating this list enabled us to review the AI development ecosystem, revealing which tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring critically under-serve ethical and real-world needs; (ii) evaluations of model safety, capabilities, and environmental impact all lack reproducibility and transparency; (iii) text-centric, and particularly English-centric, analyses continue to dominate over multilingual and multimodal analyses; and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.


BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

June 2023 · 167 Reads · 12 Citations

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.


OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

June 2023 · 148 Reads · 1 Citation

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks that require reasoning over one or multiple images to generate text. However, the datasets used to train these models have not been released, and their collection processes have not been fully specified. We introduce the OBELISC dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELISC, we train an 80-billion-parameter vision-and-language model on the dataset and obtain competitive performance on various multimodal benchmarks. We release the code to reproduce the dataset along with the dataset itself.
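The paper's actual filtering rules are not reproduced here, but the general shape of web-scale document filtering can be sketched as below; every rule and threshold is an illustrative placeholder, not one of OBELISC's published rules.

```python
# A hedged sketch of interleaved-document filtering in the spirit of OBELISC.
# The rules and thresholds are illustrative placeholders only.

def keep_document(doc):
    if len(doc["images"]) == 0:           # require interleaved images
        return False
    if len(doc["text"].split()) < 10:     # drop near-empty pages
        return False
    if doc["text"].count("cookie") > 3:   # crude boilerplate heuristic
        return False
    return True

docs = [
    {"text": "a long article about birds " * 5, "images": ["img1"]},
    {"text": "accept our cookie cookie cookie cookie policy", "images": ["img1"]},
    {"text": "short", "images": []},
]
kept = [d for d in docs if keep_document(d)]
print(len(kept))  # only the first document survives all three checks
```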


Figure 1: Organization of BigScience working groups.
Figure 2: Creation Pipeline of the ROOTS Corpus. The purple-colored sourcing stage of the pipeline and the yellow-colored processing stage are described respectively in Section 3.1.2 and Section 3.1.3.
Figure 5: The BLOOM architecture. The head slope parameters for ALiBi are taken as 2^(-8i/n) for head i of n.
Figure 6: DP+PP+TP combination leads to 3D parallelism.
Figure 7: Performance of various LLMs on a subset of tasks from the SuperGLUE benchmark in zero- and one-shot prompt-based settings.
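The slope formula from the Figure 5 caption is straightforward to compute directly; this small sketch assumes the common ALiBi convention of 1-indexed heads forming a geometric sequence.

```python
# ALiBi head slopes as given in the Figure 5 caption: head i of n uses
# slope 2^(-8i/n), so the slopes form a geometric sequence.

def alibi_slopes(n_heads):
    return [2 ** (-8 * i / n_heads) for i in range(1, n_heads + 1)]

print(alibi_slopes(8))  # [0.5, 0.25, 0.125, ..., 0.00390625]
```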
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

March 2023 · 717 Reads



BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

November 2022 · 3,602 Reads



BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

November 2022 · 667 Reads · 99 Citations



What Language Model to Train if You Have One Million GPU Hours?

October 2022 · 7 Reads · 3 Citations

The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scales, increasing the impact of modeling research. However, with the emergence of state-of-the-art models of 100B+ parameters, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. In the process of building BLOOM--the BigScience Large Open-science Open-access Multilingual language model--our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hour budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization, as well as the performance of a multilingual model compared to an English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience .
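For the "target model size and shape" step, a widely used back-of-the-envelope estimate (an approximation, not the authors' exact accounting) is that a decoder-only Transformer has roughly 12 * n_layers * d_model^2 non-embedding parameters, from the attention and MLP blocks in each layer.

```python
# Rule-of-thumb parameter count for a decoder-only Transformer:
# attention (~4 * d^2) plus MLP (~8 * d^2) per layer gives ~12 * L * d^2
# non-embedding parameters. This is an approximation only.

def approx_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

# BLOOM uses 70 layers with hidden size 14336; the formula gives ~173B,
# close to the quoted 176B (the gap is mostly the token-embedding matrix).
print(approx_params(70, 14336) / 1e9)
```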


Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation With Large Language Models

October 2022 · 47 Reads · 162 Citations

IEEE Transactions on Visualization and Computer Graphics

State-of-the-art neural language models can now be used to solve ad-hoc language tasks through zero-shot prompting, without the need for supervised training. This approach has gained popularity in recent years, and researchers have demonstrated prompts that achieve strong accuracy on specific NLP tasks. However, finding a prompt for a new task requires experimentation: different prompt templates with different wording choices lead to significant accuracy differences. PromptIDE allows users to experiment with prompt variations, visualize prompt performance, and iteratively optimize prompts. We developed a workflow in which users first focus on model feedback using small data before moving on to a large data regime that allows empirical grounding of promising prompts using quantitative measures of the task. The tool then allows easy deployment of the newly created ad-hoc models. We demonstrate the utility of PromptIDE (demo: http://prompt.vizhub.ai ) and our workflow using several real-world use cases.
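The experiment loop that PromptIDE makes interactive can be sketched offline as follows; `model` here is a hypothetical stand-in for a real zero-shot LLM call, and the templates and data are toy examples.

```python
# Sketch of prompt-variation evaluation: fill several templates over a
# small labeled set and compare accuracy. `model` is a toy stand-in that
# "classifies" by keyword instead of calling an actual LLM.

def model(prompt):
    return "positive" if "good" in prompt else "negative"

templates = [
    "Review: {text} Sentiment:",
    "Is the following review positive or negative? {text}",
]
data = [("good movie", "positive"), ("bad plot", "negative")]

def accuracy(template):
    # Score one template by exact-match accuracy over the labeled set.
    hits = sum(model(template.format(text=x)) == y for x, y in data)
    return hits / len(data)

scores = {t: accuracy(t) for t in templates}
best = max(scores, key=scores.get)
print(scores[best])  # both toy templates score 1.0 on this toy data
```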


Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models

August 2022 · 128 Reads · 2 Citations



Citations (22)


... The BloombergGPT model is a decoder-only causal language model based on BLOOM [349]. The model contains 70 layers of transformer decoder blocks defined as follows: ...

Reference:

A Survey on Large Language Models with some Insights on their Capabilities and Limitations
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... Inside FFN, the non-linear function is GELU [30]. ALiBi positional encoding is applied through additive biases at the self-attention component of the transformer network [180]. The input token embeddings are tied to the linear mapping before the final softmax. ...

What Language Model to Train if You Have One Million GPU Hours?
  • Citing Conference Paper
  • January 2022

... Approach. We created six system prompts using three approaches: custom designs (Sys_v1 to Sys_v3), ChatGPT-4 generated prompts based on task descriptions (Sys_v4 and Sys_v5), and a PromptSource-generated prompt (Sys_v6) after we provided the problem and task details [26]. Additionally, we manually designed three user prompts (User_v1 to User_v3) to accompany the system prompts. ...

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts

... Encoder-based models, like BERT [27], and encoder-decoder, such as T5 [28], have been adapted for the medical domain [29] although they faced challenges with QA tasks. More recently, multiple decoder-only LLMs have been developed for the medical field, including BioGPT [30], ClinicalGPT [31], (based on BLOOM-7B [32]), PMC-LLaMA [33], and MediTron-70B [34] (adapted from Llama-2 [22]). In contrast, proprietary medical LLMs like GPT-4 MedPrompt [35] and Med-PALM 2 [36] face usability issues similar to generalpurpose models. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... Other work focused on adapting XLM-RoBERTa to African languages including Setswana [5], and on training multilingual BERT for African languages [31]. Additionally, with the advent of massive large language models, we now have models such as BLOOM [35], which includes Setswana, as well as more Afrocentric models such as Serengeti [1]. ...

What Language Model to Train if You Have One Million GPU Hours?
  • Citing Preprint
  • October 2022

... Interaction through natural language is more agile and democratic, as communication in the form of chat facilitates the description of demands and allows both advanced programmers and lay users to interact with an AI interface without detailed knowledge of its technical processes (Strobelt et al. 2022). With just a few words or phrases, users can gain insights from geospatial data, create customized visualizations, or even automate data analysis tasks without the need for specialized knowledge in geoprocessing or programming, for instance (Mai et al. 2022). ...

Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation With Large Language Models
  • Citing Article
  • October 2022

IEEE Transactions on Visualization and Computer Graphics

... ICL vs. FT. Previous exploratory studies have aimed to compare the performance of ICL and FT methodologies. Some research suggests that ICL exhibits more robust out-of-distribution generalization than FT (Si et al., 2022; Awadalla et al., 2022; Utama et al., 2021). However, some recent studies (Mosbach et al., 2023) argue that these earlier comparisons may be biased. ...

Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning

... with generative AI. An increasingly important human skill in this context is prompting, the writing of prompts, which constitutes the interface between human and model (Strobelt et al. 2022). Current research confirms that choosing the right wording and structure for prompts has a significant influence on the quality and relevance of the results (Holtzman et al. 2020). ...

Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models
  • Citing Preprint
  • August 2022

... However, special considerations are required when attempting to prune large pre-trained models. Previous approaches that were successful in pruning small Transformer-based models, e.g., Movement pruning (Sanh et al., 2020) or Block pruning (Lagunas et al., 2021), are not practical for large models because they require expensive weight updates. ...

Block Pruning For Faster Transformers
  • Citing Conference Paper
  • January 2021