Teven Le Scao’s scientific contributions


Publications (23)


Figure 7: Reasoning over complex figures. An example showcasing Pixtral's capability to understand and reason over complex figures. Pixtral correctly identifies that the green boxes represent the European countries, then reads and sorts the GDP of all the European countries to list the top 5 with accurate GDP numbers.
Figure 8: Multi-image instruction following. Pixtral can process an arbitrary number of images in its context window. The example shows that Pixtral can successfully combine the information from both images into a single markdown table.
Figure 9: Chart Understanding and Analysis. Pixtral demonstrates the capability to interpret and analyze intricate charts with high accuracy. In this instance, Pixtral correctly identifies that "dark-dragon" corresponds to the red line. Furthermore, it recognizes that the training loss is expected to decrease smoothly and notes that the training run became unstable around the 10K-step mark due to a significant spike in loss.
Figure 11: Examples of model responses from Pixtral-12B, Qwen2-VL-7B, and Gemini-1.5 Flash-8B (0827), with LLM-as-a-judge scores. Pixtral's response is complete and accurate, earning a rating of 8, while Gemini Flash-8B extracts wrong information and Qwen2-VL does not elaborate on trends.
Pixtral 12B
  • Preprint
  • File available

October 2024 · 180 Reads · 1 Citation

Pravesh Agrawal · Szymon Antoniak · Emma Bou Hanna · [...] · Thomas Wang

We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks and surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility over the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar size (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under the Apache 2.0 license.
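The abstract's point about resolution-dependent token counts can be made concrete with a small sketch: if the encoder spends one token per fixed-size patch, the cost of an image scales with its native resolution and aspect ratio rather than a fixed crop. The 16-pixel patch size and the helper below are illustrative assumptions, not Pixtral's actual preprocessing code.

```python
import math

def estimate_image_tokens(width: int, height: int, patch_size: int = 16) -> int:
    """Rough token-count estimate for a variable-resolution vision encoder.

    Assumes one token per patch_size x patch_size patch, so token cost grows
    with the image's native resolution. The 16-pixel patch size is an
    assumption for illustration only.
    """
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)

# A small chart costs far fewer tokens than a full-page document scan.
print(estimate_image_tokens(512, 512))    # 1024 tokens
print(estimate_image_tokens(1024, 1536))  # 6144 tokens
```

Under this assumption, several images of different sizes can share the 128K-token context window as long as their combined patch counts, plus the text tokens, fit within it.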


BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

June 2023 · 206 Reads · 16 Citations

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.


Scaling Data-Constrained Language Models

May 2023 · 167 Reads · 3 Citations

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are publicly available at https://github.com/huggingface/datablations.
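A minimal sketch of the diminishing-returns idea in this abstract: repeated tokens contribute progressively less "effective" unique data. The saturating-exponential form below follows the general shape the paper proposes, but the decay constant is a placeholder assumption, not the paper's fitted value.

```python
import math

def effective_data(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Effective unique-data equivalent after repeating a dataset.

    unique_tokens: number of unique training tokens available.
    epochs: how many times the data is seen (1 = a single pass).
    r_star: decay constant governing how quickly repeated epochs lose value
            (a placeholder assumption, not the paper's fitted constant).
    """
    repeats = max(epochs - 1.0, 0.0)  # repetitions beyond the first pass
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

# Around 4 epochs the effective data is still close to epochs * unique_tokens,
# but the benefit of further repetition decays toward a ceiling.
for ep in (1, 4, 16, 64):
    print(ep, round(effective_data(100e9, ep) / 1e9, 1), "B effective tokens")
```

This mirrors the abstract's finding qualitatively: a few epochs of repetition behave almost like fresh data, while heavy repetition adds little, so extra compute is better spent elsewhere.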


Figure 1: Organization of BigScience working groups.
Figure 2: Creation Pipeline of the ROOTS Corpus. The purple-colored sourcing stage of the pipeline and the yellow-colored processing stage are described respectively in Section 3.1.2 and Section 3.1.3.
Figure 5: The BLOOM architecture. The k_head slope parameters for ALiBi are taken as 2^(-8i/n).
Figure 6: DP+PP+TP combination leads to 3D parallelism.
Figure 7: Performance of various LLMs on a subset of tasks from the SuperGLUE benchmark in zero- and one-shot prompt-based settings.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

March 2023 · 822 Reads

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
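The Figure 5 caption in this entry quotes the ALiBi slope formula 2^(-8i/n). A minimal sketch of how those per-head slopes could be computed for a power-of-two head count (the helper is ours for illustration, not the BLOOM training code):

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head ALiBi slopes for n_heads attention heads.

    Follows the formula quoted in the Figure 5 caption: head i (1-indexed)
    gets slope 2 ** (-8 * i / n_heads). Assumes n_heads is a power of two;
    reference ALiBi implementations handle other head counts with an extra
    interpolation step that is omitted here.
    """
    return [2 ** (-8 * i / n_heads) for i in range(1, n_heads + 1)]

# With 8 heads the slopes form the geometric sequence 1/2, 1/4, ..., 1/256;
# each head then penalizes attention to distant tokens at a different rate.
print(alibi_slopes(8))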


The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

March 2023 · 378 Reads · 2 Citations

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
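Since the abstract mentions a publicly released subset of the corpus, here is a minimal sketch of pulling one such subset with the Hugging Face datasets library. The dataset identifier is a hypothetical placeholder; the real subsets are published under the BigScience data organization on the Hub and may require accepting an access agreement.

```python
from datasets import load_dataset

# Hypothetical subset identifier for illustration only; substitute the actual
# ROOTS subset name published by the BigScience data organization on the Hub.
subset = load_dataset(
    "bigscience-data/roots-subset-placeholder", split="train", streaming=True
)

# Streaming avoids downloading a multi-gigabyte shard just to inspect a few records.
for i, record in enumerate(subset):
    print(record.get("text", "")[:200])
    if i == 2:
        break
```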


Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation

February 2023 · 9 Reads

A key feature of neural models is that they can produce semantic vector representations of objects (texts, images, speech, etc.), ensuring that similar objects are close to each other in the vector space. While much work has focused on learning representations for other modalities, there are no aligned cross-modal representations for text and knowledge base (KB) elements. One challenge for learning such representations is the lack of parallel data, which we overcome with contrastive training on heuristics-based datasets and data augmentation, training embedding models on (KB graph, text) pairs. On WebNLG, a cleaner, manually crafted dataset, we show that these models learn aligned representations suitable for retrieval. We then fine-tune on annotated data to create EREDAT (Ensembled Representations for Evaluation of DAta-to-Text), a similarity metric between English text and KB graphs. EREDAT outperforms or matches state-of-the-art metrics in terms of correlation with human judgments on WebNLG even though, unlike them, it does not require a reference text to compare against.
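A minimal sketch of the contrastive objective described here: paired (linearized KB graph, text) examples are embedded and pulled together with an in-batch InfoNCE-style loss, with the other pairs in the batch acting as negatives. The loss shape and names below are illustrative assumptions, not the EREDAT implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(graph_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE loss over paired (KB graph, text) embeddings.

    graph_emb, text_emb: [batch, dim] tensors where row i of each is a pair;
    every other row in the batch serves as a negative.
    """
    graph_emb = F.normalize(graph_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = graph_emb @ text_emb.T / temperature          # [batch, batch] cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: graph-to-text and text-to-graph retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random embeddings standing in for encoder outputs.
g = torch.randn(16, 256)
t = torch.randn(16, 256)
print(contrastive_loss(g, t).item())
```

After training, cosine similarity between a text embedding and a graph embedding serves as a retrieval score and, once fine-tuned on annotated pairs as the abstract describes, as a reference-free evaluation metric.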





BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

November 2022 · 679 Reads · 119 Citations

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.


Citations (16)


... Existing research advances the development of LVLMs capable of addressing diverse tasks via a unified interface that can directly generate natural language, thus avoiding task-specific modifications [80,78,43,1,64]. Utilizing advanced pre-trained LLMs [5,6,23,83] as the language component [56,92], the instruction-following and complex reasoning abilities of LVLMs are significantly improved [21,28]. ...

Reference:

Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Pixtral 12B

... The OpenGPT-X project initially adopted the Megatron-DeepSpeed codebase, developed by NVIDIA, extended by Microsoft researchers, and further adapted during the BigScience research workshop [47]. Other codebases, such as Meta's Open Pretrained Transformer (OPT) [31], also emerged, promising potential advantages in abstraction and usability. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... Large Language Models (LLMs) are increasingly deployed as autonomous agents in multi-agent systems, demonstrating remarkable effectiveness across diverse real-world scenarios including multi-player games [26,30], software development tasks [27,14], medical care applications [22], and education [6]. However, despite their success in structured settings, achieving spontaneous cooperation among self-interested LLM agents remains challenging in open-ended environments, where explicit rules and predefined roles are absent and agents' goals can be inherently conflicting [26]. ...

FinGPT: Large Generative Models for a Small Language
  • Citing Conference Paper
  • January 2023

... We set a balanced sampling rate to elevate the multilingual modeling of Malay and Tamil, while English and Mandarin data play a role in retaining the fundamental language capabilities. Second, we enhanced the model's multilingual instruction following through multi-task learning (Teknium, 2023) and cross-lingual alignment (Muennighoff et al., 2023; Lin et al., 2025), including multilingual role-play corpora generated by simulating diverse conversation scenarios (Sun et al., 2024; Liu et al., 2024b). To further strengthen cross-lingual capabilities, we adopted a hybrid training approach that combines translation and cross-lingual problem-solving tasks (Muennighoff et al., 2022; Liu et al., 2022). ...

Crosslingual Generalization through Multitask Finetuning
  • Citing Conference Paper
  • January 2023

... Although many people assume that a computer has perfect recall and only needs to 'read' material once, AI systems work in a statistical fashion that means re-reading boosts performance, says Niklas Muennighoff, a PhD student at Stanford University and a member of the Data Provenance Initiative. In a 2023 paper published while he was at the AI firm HuggingFace in New York City, he and his colleagues showed that a model learnt just as much from re-reading a given data set four times as from reading the same amount of unique data, although the benefits of re-reading dropped off quickly after that. ...

Scaling Data-Constrained Language Models

... It is a popular choice for research and development projects, supported by a large developer community. The flexibility of customization makes it applicable to a wide variety of projects [24]. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... ChatGPT is an AI system constructed upon models that enable it to generate syntactically accurate texts and statements (Floridi and Chiriatti 2020). However, like other LLMs, ChatGPT is susceptible to issues of hallucination (Radford et al. 2019; Muennighoff et al. 2022) throughout its outputs. ...

Crosslingual Generalization through Multitask Finetuning
  • Citing Preprint
  • November 2022

... Other work focused on adapting XLM-RoBERTa to African languages including Setswana [5] and training multilingual BERT for African languages [31]. Additionally, with the advent of massive large language models, we have models such as BLOOM [35], which includes Setswana, and more Afrocentric models such as Serengeti [1] now available. ...

What Language Model to Train if You Have One Million GPU Hours?
  • Citing Preprint
  • October 2022

... The choice of which LLM to use in an AI project for a given task or functionality may appear daunting due to the sheer number of LLMs currently available and the emergence of a seemingly endless stream of new LLMs, including updated versions of existing LLMs, each purporting to improve on the previous version (Figure 3, page 5) [77]. Moreover, pre-training and fine-tuning techniques for models differ depending on the desired capabilities of the language model [68]. (Figure: Technical evolution of the OpenAI GPT-series models [77].) LLMs have evolved from relatively simple language tasks such as text generation to models capable of performing more complex tasks, such as those illustrated in Figure 4, page 6 [77]. ...

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
  • Citing Preprint
  • April 2022