Antoine Chaffin’s research while affiliated with University of Rennes 2 and other places


Publications (19)


Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
  • Conference Paper

February 2025 · 2 Reads

Gautier Evennou · Antoine Chaffin · Vivien Chappelier · Ewa Kijak

Figure 4. Synthetic dataset creation pipeline leveraging prompt-based image editing and large language models.
Figure 5. Comparison of train set samples from EE, generated by the Emu Edit model, and Syned, generated by a fine-tuned InstructPix2Pix model.
CIDEr scores on CLEVR-Change for the 5 categories of changes.
Results on CLEVR-DC, STD, and IER, reported from their corresponding papers, except †: reported from our own experiments.
CIDEr scores on the Emu Edit test set. We show consistent improvements with synthetic augmentation (EE + Syned) on all models.
Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
  • Preprint
  • File available

December 2024 · 25 Reads

The rise in generative model quality over the past years has enabled the generation of edited variations of images at scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is successfully handled for simple 3D-rendered images, it struggles on real-world images. The reason is twofold: the scarcity of training data, and the difficulty of capturing fine-grained differences between complex images. To address these issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show that it outperforms two-stream approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in an agnostic fashion. We show that our synthetic augmentation strategy provides high-quality data, leading to a challenging new dataset well suited for IDC, named Syned.
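The abstract does not spell out how a single-image captioning model is adapted to two inputs. As a purely illustrative, hedged sketch (the side-by-side canvas, prompt, and Salesforce/blip2-opt-2.7b checkpoint are assumptions, not the authors' exact BLIP2IDC setup), one could frame IDC with an off-the-shelf BLIP-2 captioner as follows; fine-tuning on difference captions would then follow.

```python
# Hypothetical sketch: reusing an off-the-shelf BLIP-2 captioner for Image
# Difference Captioning by pasting both images onto one canvas. The canvas
# trick, prompt, and checkpoint name are assumptions, not the BLIP2IDC recipe.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_difference(path_a: str, path_b: str) -> str:
    img_a, img_b = Image.open(path_a), Image.open(path_b)
    # Place the two images side by side so a single-image model sees both.
    canvas = Image.new("RGB", (img_a.width + img_b.width,
                               max(img_a.height, img_b.height)))
    canvas.paste(img_a, (0, 0))
    canvas.paste(img_b, (img_a.width, 0))
    prompt = "Question: what changed between the left and right image? Answer:"
    inputs = processor(images=canvas, text=prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```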


ModernBERT training settings (table); the dropout setting and those listed below it are shared across all training phases.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

December 2024 · 62 Reads · 2 Citations

Benjamin Warner · Antoine Chaffin · Benjamin Clavié · [...] · Iacopo Poli

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single- and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed- and memory-efficient encoder and is designed for inference on common GPUs.
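A minimal usage sketch follows, assuming the released answerdotai/ModernBERT-base checkpoint and a transformers version that supports it; mean pooling into sentence embeddings is one common way to use an encoder for retrieval, not necessarily the paper's evaluation setup.

```python
# Sketch: mean-pooled sentence embeddings from an encoder-only model.
# Checkpoint name and pooling choice are assumptions, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "answerdotai/ModernBERT-base"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=8192, return_tensors="pt")
    hidden = model(**batch).last_hidden_state              # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # zero out padding
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # mean pooling
    return torch.nn.functional.normalize(emb, dim=-1)

docs = embed(["ModernBERT handles long documents.", "BERT is an encoder."])
print(embed(["long-context encoder"]) @ docs.T)            # cosine similarities
```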



Figure 1: Relative performance degradation at various pooling factors using 16-bit vectors with HNSW indexing.
Figure 2: Relative performance degradation at various pooling factors using PLAID-indexed 2-bit vectors. fiqa at factor 6 is truncated for readability.
Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

September 2024 · 20 Reads

Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space & memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. It also allows for further reductions, cutting the vector count by 66% to 75%, with degradation remaining below 5% on the vast majority of datasets. Importantly, this approach requires no architectural change or query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
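The clustering step can be sketched concretely. Below, the token vectors of a single document are grouped by hierarchical (Ward) clustering and mean-pooled within each cluster, with pool_factor controlling the reduction; the specific clustering method and the L2 re-normalization are assumptions, not necessarily the exact variant evaluated in the paper.

```python
# Sketch: clustering-based token pooling for a multi-vector (ColBERT-style) index.
# Ward clustering and re-normalization are assumptions, not necessarily the
# exact variant evaluated in the paper.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pool_tokens(token_embs: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Reduce (n_tokens, dim) embeddings to roughly n_tokens / pool_factor vectors."""
    n_tokens = token_embs.shape[0]
    n_clusters = max(1, n_tokens // pool_factor)
    if n_clusters >= n_tokens:
        return token_embs
    Z = linkage(token_embs, method="ward")                   # hierarchical clustering
    labels = fcluster(Z, t=n_clusters, criterion="maxclust") # cut into n_clusters groups
    pooled = np.stack([token_embs[labels == c].mean(axis=0)
                       for c in np.unique(labels)])
    # Keep vectors unit-norm so late-interaction (MaxSim) scoring still applies.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

doc = np.random.randn(300, 128).astype(np.float32)  # 300 token vectors of dim 128
print(pool_tokens(doc, pool_factor=2).shape)        # -> roughly (150, 128)
```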



Fig. 1: General illustration of text watermarking for LLMs.
Three Bricks to Consolidate Watermarks for Large Language Models

July 2023 · 53 Reads · 1 Citation

The task of discerning between generated and natural texts is increasingly challenging. In this context, watermarking emerges as a promising technique for ascribing generated text to a specific model. It alters the sampling generation process so as to leave an invisible trace in the generated output, facilitating later detection. This research consolidates watermarks for large language models based on three theoretical and empirical considerations. First, we introduce new statistical tests that offer robust theoretical guarantees which remain valid even at low false-positive rates (less than 10⁻⁶). Second, we compare the effectiveness of watermarks using classical benchmarks in the field of natural language processing, gaining insights into their real-world applicability. Third, we develop advanced detection schemes for scenarios where access to the LLM is available, as well as multi-bit watermarking.
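For context, score-based detection in green-list style watermarking (à la Kirchenbauer et al.) can be sketched as below: count how many tokens fall in a keyed "green" subset of the vocabulary and compute a one-sided binomial p-value under the null hypothesis of unwatermarked text. The hashing scheme and GAMMA value are assumptions, and this illustrates the generic setup rather than the specific tests introduced in this paper.

```python
# Sketch: p-value for a green-list style watermark detector (generic setup,
# not the statistical tests of "Three Bricks"). Hashing and GAMMA are assumed.
import hashlib
from scipy.stats import binom

GAMMA = 0.25  # assumed fraction of the vocabulary marked "green" at each step

def is_green(prev_token: int, token: int, key: str = "secret-key") -> bool:
    # Hypothetical seeding: hash (key, previous token, candidate token) to [0, 1).
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def detection_p_value(token_ids: list[int]) -> float:
    """P(observing at least this many green tokens | text is not watermarked)."""
    pairs = list(zip(token_ids[:-1], token_ids[1:]))
    greens = sum(is_green(prev, tok) for prev, tok in pairs)
    return float(binom.sf(greens - 1, len(pairs), GAMMA))  # one-sided binomial test
```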


BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

June 2023 · 205 Reads · 15 Citations

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.


Figure 1: Organization of BigScience working groups.
Figure 2: Creation Pipeline of the ROOTS Corpus. The purple-colored sourcing stage of the pipeline and the yellow-colored processing stage are described respectively in Section 3.1.2 and Section 3.1.3.
Figure 5: The BLOOM architecture. The ALiBi slope parameters for head i of n heads are taken as 2^(-8i/n).
Figure 6: DP+PP+TP combination leads to 3D parallelism.
Figure 7: Performance of various LLMs on a subset of tasks from the SuperGLUE benchmark in zero- and one-shot prompt-based settings.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

March 2023 · 801 Reads

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
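The Figure 5 caption above gives the ALiBi head slopes as 2^(-8i/n). As a small generic illustration (not BLOOM's actual implementation), the slopes and the resulting additive attention bias could be computed like this:

```python
# Sketch: ALiBi-style attention biases with per-head slopes 2^(-8i/n).
# Generic illustration of the Figure 5 formula, not BLOOM's training code.
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Head i (1-indexed) gets slope 2^(-8 * i / n_heads).
    return torch.tensor([2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Additive bias of shape (n_heads, seq_len, seq_len): -slope * distance."""
    slopes = alibi_slopes(n_heads)                          # (n_heads,)
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # query index - key index
    return -slopes[:, None, None] * distance[None, :, :]

print(alibi_slopes(8))         # tensor([0.5000, 0.2500, ..., 0.0039])
print(alibi_bias(8, 4).shape)  # torch.Size([8, 4, 4])
```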



Citations (7)


... More recently, bias terms are not included, which improves training stability, throughput, and final performance. Additionally, improvements like SwiGLU activation functions and rotary positional embeddings are also commonly utilized [3,4,34,35]. GPT (Generative Pretrained Transformer) models, such as OpenAI's GPT series (GPT-3, GPT-4, etc.), are designed for generative tasks and use transformer decoders [36][37][38]. ...

Reference:

Contrastive learning and mixture of experts enables precise vector embeddings in biological databases
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
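The excerpt above mentions SwiGLU activations (and the omission of bias terms) among modern Transformer tweaks; a tiny generic SwiGLU feed-forward block is sketched below for illustration, not as ModernBERT's exact implementation.

```python
# Sketch: a bias-free SwiGLU feed-forward block, out = (SiLU(x W_g) * x W_u) W_d.
# Generic illustration of the activation mentioned above, not ModernBERT's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)  # gating branch
        self.up = nn.Linear(dim, hidden, bias=False)    # value branch
        self.down = nn.Linear(hidden, dim, bias=False)  # project back to dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

print(SwiGLUFeedForward(dim=768, hidden=2048)(torch.randn(2, 16, 768)).shape)
```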

... The OpenGPT-X project initially adopted the Megatron-DeepSpeed codebase, developed by NVIDIA, extended by Microsoft researchers, and further adapted during the BigScience research workshop [47]. Other codebases, such as Meta's Open Pretrained Transformer (OPT) [31], also emerged, promising potential advantages in abstraction and usability. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... Further, there has been a large-scale exploration of the susceptibility of watermarks to natural paraphrasing attacks and strong adversarial attacks (Krishna et al., 2023; Sadasivan et al., 2023). A good amount of work has proposed new generation-time watermarks and modifications to existing schemes to improve detection and generation performance (Fernandez et al., 2023), for example Christ et al. (2023) and Hu et al. (2023), who propose unbiased watermarking techniques that ensure the watermarking process does not alter the probability distribution of generated text, under certain technical notions of imperceptibility. Another approach, Kuditipudi et al. (2023), proposes a distortion-free watermarking approach that pre-samples a random key for LLM generation, with advantages especially on detection after paraphrasing. ...

Three Bricks to Consolidate Watermarks for Large Language Models
  • Citing Conference Paper
  • December 2023

... It is a popular choice for research and development projects, supported by a large developer community. The flexibility of customization makes it applicable to a wide variety of projects [24]. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... To search for a hypothesis that better satisfies predetermined constraints, some previous methods use rollout in decoding to generate partial future sequences (Chaffin et al., 2022; Lu et al., 2022). These methods become infeasible for large models due to the inefficiency of sampling at decoding time and of handling the large vocabulary constraints in our task. ...

PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided MCTS Decoding
  • Citing Conference Paper
  • January 2022
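The excerpt above (citing PPL-MCTS) refers to rollout-based decoding, where partial future sequences are generated to score candidate continuations. A simplified, hypothetical sketch of that idea is shown below, assuming a Hugging Face style causal LM (lm(...).logits) and a discriminator callable returning a constraint-satisfaction score; it illustrates rollouts in general, not the PPL-MCTS search itself.

```python
# Simplified sketch of rollout-based scoring at decoding time: each candidate
# next token is extended with a short greedy rollout, then a discriminator
# scores the continuation. Hypothetical interfaces; not the PPL-MCTS algorithm.
import torch

@torch.no_grad()
def pick_next_token(lm, discriminator, input_ids, k: int = 5, rollout_len: int = 8):
    logits = lm(input_ids).logits[:, -1, :]              # next-token distribution
    candidates = torch.topk(logits, k, dim=-1).indices[0]
    best_token, best_score = None, float("-inf")
    for tok in candidates:
        seq = torch.cat([input_ids, tok.view(1, 1)], dim=1)
        for _ in range(rollout_len):                     # short greedy rollout
            nxt = lm(seq).logits[:, -1, :].argmax(dim=-1, keepdim=True)
            seq = torch.cat([seq, nxt], dim=1)
        score = float(discriminator(seq))                # e.g. P(constraint holds)
        if score > best_score:
            best_token, best_score = tok, score
    return best_token
```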

... In subsequent research, the Monte Carlo Tree Search technique [51] has been successfully applied across a range of text generation tasks. Empirical evidence shows that these methodologies can achieve state-of-the-art results in several NLG domains, including Question Generation [52], Abstractive Summarization [53], Machine Translation [54], and Constrained Generation [55,56]. Nonetheless, it's critical to bear in mind that all the methodologies mentioned above operate exclusively at the token level. ...

Which Discriminator for Cooperative Text Generation?
  • Citing Conference Paper
  • July 2022

... Recently, a fourth category, based on using large language models (LLMs) as a judge for scoring, shows significant promise compared to the preceding three categories. Utilizing LLMs with carefully crafted prompts [13] has demonstrated remarkable success in various tasks, both within academic benchmarks [14] and real-world settings [15]. However, to our knowledge, no published research has yet reported on using LLMs as a scoring agent for RC tasks in a QA context to mimic human judgments on a Likert scale and on the simpler binary task of correct/incorrect answers. ...

Multitask Prompted Training Enables Zero-Shot Task Generalization