Stéphane Lathuilière’s research while affiliated with Institut Polytechnique de Paris and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (110)


Fig. 2: Comparative Analysis of Diffusion-Based Image Editing Techniques. This review contrasts existing methodologies, which utilize Classifier-Free Guidance (CFG) [6] with various combinations, including the pretrained null prompt ∅, an optimized latent representation ∅*, the descriptive prompt of the input image P_in, and the target editing prompt P_out.
Don't Forget your Inverse DDIM for Image Editing
  • Preprint
  • File available

May 2025 · 3 Reads · Guillermo Gomez-Trenado · [...] · Stéphane Lathuilière

The field of text-to-image generation has undergone significant advancements with the introduction of diffusion models. Nevertheless, the challenge of editing real images persists, as most methods are either computationally intensive or produce poor reconstructions. This paper introduces SAGE (Self-Attention Guidance for image Editing), a novel technique leveraging pre-trained diffusion models for image editing. SAGE builds upon the DDIM algorithm and incorporates a novel guidance mechanism utilizing the self-attention layers of the diffusion U-Net. This mechanism computes a reconstruction objective based on attention maps generated during the inverse DDIM process, enabling efficient reconstruction of unedited regions without the need to precisely reconstruct the entire input image. Thus, SAGE directly addresses the key challenges in image editing. The superiority of SAGE over other methods is demonstrated through quantitative and qualitative evaluations and confirmed by a statistically validated comprehensive user study, in which all 47 surveyed users preferred SAGE over competing methods. Additionally, SAGE ranks as the top-performing method in seven out of ten quantitative analyses and secures second and third places in the remaining three.
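
As a rough illustration of the guidance idea described in the abstract, the PyTorch sketch below nudges a latent so that self-attention maps produced during editing stay close to maps recorded during the inverse DDIM pass. The helper names (`sage_like_guidance_step`, `attn_fn`), the plain MSE objective, and the `scale` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sage_like_guidance_step(latent, attn_fn, attn_ref, scale=1.0):
    # One illustrative guidance update: move the latent so that the
    # self-attention maps produced by `attn_fn` stay close to the reference
    # maps `attn_ref` recorded during the inverse DDIM pass.
    latent = latent.detach().requires_grad_(True)
    loss = F.mse_loss(attn_fn(latent), attn_ref)   # attention-map reconstruction objective
    (grad,) = torch.autograd.grad(loss, latent)    # gradient w.r.t. the current latent
    return (latent - scale * grad).detach()

# Toy usage with random tensors standing in for real U-Net activations.
proj = torch.nn.Conv2d(4, 8, kernel_size=3, padding=1)   # stand-in for an attention head
attn_fn = lambda z: proj(z).flatten(2).softmax(dim=-1)   # fake self-attention maps
attn_ref = attn_fn(torch.randn(1, 4, 32, 32)).detach()   # maps stored at inversion time
latent = torch.randn(1, 4, 32, 32)
guided = sage_like_guidance_step(latent, attn_fn, attn_ref, scale=0.5)
```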


Di[M]O: Distilling Masked Diffusion Models into One-step Generator

March 2025 · 20 Reads

Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference requiring many steps. In this paper, we propose Di[M]O, a novel approach that distills masked diffusion models into a one-step generator. Di[M]O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an 'on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to the teacher's training distribution. We show Di[M]O's effectiveness on both class-conditional and text-conditional image generation, achieving performance competitive with multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.
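
The token-level distribution matching mentioned above can be pictured as a per-token divergence between the one-step student's output logits and the teacher's. The sketch below is a minimal stand-in assuming a simple KL objective and a `temperature` parameter; the paper's on-policy formulation with an auxiliary model is more involved.

```python
import torch
import torch.nn.functional as F

def token_distribution_matching_loss(student_logits, teacher_logits, temperature=1.0):
    # KL divergence between the per-token categorical distributions predicted
    # by a one-step student and a multi-step teacher (illustrative stand-in).
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")   # mean KL over the batch

# Toy usage: batch of 2 sequences, 16 token positions, vocabulary of 1024.
student = torch.randn(2, 16, 1024, requires_grad=True)
teacher = torch.randn(2, 16, 1024)
loss = token_distribution_matching_loss(student, teacher)
loss.backward()
```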


Figure 8. Ablating parameter λ. MIU Avg. Gap when varying parameter λ in CelebA [31], Waterbirds [42], and FairFace [24]. While λ = 1 is optimal in CelebA [31] and FairFace [24], Waterbirds [42] benefits from higher values of λ.
MIU ablations. We compute MIU ablations on each of the three investigated datasets. From left to right, we report the investigated dataset, the retaining term, the unlearning term, the calibration term, and REWEIGHT. We measure performance using UA, GA, and Avg. Gap. The configuration that corresponds to MIU + REWEIGHT is highlighted.
Group-robust Machine Unlearning

March 2025 · 11 Reads · [...] · Stéphane Lathuilière · [...]

Machine unlearning is an emerging paradigm to remove the influence of specific training data (i.e., the forget set) from a model while preserving its knowledge of the rest of the data (i.e., the retain set). Previous approaches assume the forget data to be uniformly sampled from all training data points. However, if the data to unlearn is dominant in one group, we empirically show that performance for this group degrades, leading to fairness issues. This work tackles the overlooked problem of non-uniformly distributed forget sets, which we call group-robust machine unlearning, by presenting a simple, effective strategy that mitigates the performance loss in dominant groups via sample distribution reweighting. Moreover, we present MIU (Mutual Information-aware Machine Unlearning), the first approach for group robustness in approximate machine unlearning. MIU minimizes the mutual information between model features and group information, achieving unlearning while reducing performance degradation in the dominant group of the forget set. Additionally, MIU exploits sample distribution reweighting and mutual information calibration with the original model to preserve group robustness. We conduct experiments on three datasets and show that MIU outperforms standard methods, achieving unlearning without compromising model robustness. Source code available at https://github.com/tdemin16/group-robust_machine_unlearning.
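
A minimal sketch of the sample-distribution reweighting idea: retained samples of a group that loses many samples to the forget set are up-weighted so the group's effective weight is preserved. The function name and the exact rule below are assumptions for illustration, not the paper's REWEIGHT definition.

```python
import torch

def group_reweighting(group_ids, forget_mask, num_groups):
    # Up-weight retained samples of groups that lose many samples to the
    # forget set, so each group's effective mass is preserved after unlearning.
    weights = torch.ones_like(group_ids, dtype=torch.float)
    for g in range(num_groups):
        in_group = group_ids == g
        total = in_group.sum().float()
        retained = (in_group & ~forget_mask).sum().float().clamp(min=1.0)
        # Retained samples of group g also carry the weight of the removed ones.
        weights[in_group & ~forget_mask] = total / retained
    return weights

# Toy usage: 8 samples, 2 groups, group 1 dominates the forget set.
group_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
forget_mask = torch.tensor([False, False, False, False, True, True, True, False])
print(group_reweighting(group_ids, forget_mask, num_groups=2))
```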


Figure 6. Comparison of feature distillation methods on NExT-QA. Each matrix shows the performance of a model trained on tasks (rows) and evaluated on tasks (columns). The diagonal (highlighted in orange) represents in-domain performance, while off-diagonal elements show cross-domain generalization. Higher values (darker colors) indicate better performance.
Ablation study of QUAD components on VQAv2 and NExT-QA.
Comparison of attention distillation methods.
No Images, No Problem: Retaining Knowledge in Continual VQA with Questions-Only Memory

February 2025 · 15 Reads

Continual Learning in Visual Question Answering (VQACL) requires models to learn new visual-linguistic tasks (plasticity) while retaining knowledge from previous tasks (stability). The multimodal nature of VQACL presents unique challenges, requiring models to balance stability across visual and textual domains while maintaining plasticity to adapt to novel objects and reasoning tasks. Existing methods, predominantly designed for unimodal tasks, often struggle to balance these demands effectively. In this work, we introduce QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularisation, eliminating the need to store visual data and addressing both memory and privacy concerns. QUAD achieves stability by introducing a question-only replay mechanism that selectively uses questions from previous tasks to prevent overfitting to the current task's answer space, thereby mitigating the out-of-answer-set problem. Complementing this, we propose attention consistency distillation, which uniquely enforces both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA.
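
The attention consistency distillation described above can be sketched as a loss that pulls the current model's intra-modal and inter-modal attention maps on replayed questions toward those of the frozen previous-task model. The MSE objective, the `alpha` weighting, and the tensor shapes below are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def attention_consistency_distillation(intra_new, intra_old, inter_new, inter_old, alpha=1.0):
    # Match the current model's intra-modal and inter-modal attention maps on
    # replayed questions to those of the frozen previous-task model.
    intra_term = F.mse_loss(intra_new, intra_old.detach())
    inter_term = F.mse_loss(inter_new, inter_old.detach())
    return intra_term + alpha * inter_term

# Toy usage: attention maps shaped (batch, heads, query_len, key_len).
intra_old, inter_old = torch.rand(2, 8, 20, 20), torch.rand(2, 8, 20, 36)
intra_new = torch.rand(2, 8, 20, 20, requires_grad=True)
inter_new = torch.rand(2, 8, 20, 36, requires_grad=True)
loss = attention_consistency_distillation(intra_new, intra_old, inter_new, inter_old)
loss.backward()
```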



TACO: Training-free Sound Prompted Segmentation via Deep Audio-visual CO-factorization

December 2024 · 22 Reads

Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models to reveal shared interpretable concepts. These concepts are passed to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
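
A minimal sketch of the co-factorization idea, assuming both modalities are embedded in a common non-negative feature space: audio frames and image patches are stacked into one matrix and factorized with a shared dictionary, so each row's activations can be read as shared concept scores. It uses scikit-learn's NMF; the actual method's formulation and post-processing may differ.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-ins for frozen-backbone features made non-negative (e.g. via abs/ReLU).
audio_feats = np.abs(np.random.randn(50, 256))     # 50 audio frames x 256 dims
visual_feats = np.abs(np.random.randn(196, 256))   # 196 image patches x 256 dims

# Co-factorization sketch: factorize the concatenated matrix so that audio
# frames and image patches share one set of k concept components.
k = 8
model = NMF(n_components=k, init="nndsvda", max_iter=500)
activations = model.fit_transform(np.vstack([audio_feats, visual_feats]))  # (50 + 196, k)
audio_concepts, visual_concepts = activations[:50], activations[50:]
# `visual_concepts` can be reshaped to a 14x14 map per concept and handed to an
# open-vocabulary segmentation model as a soft region prior.
```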




An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

October 2024 · 34 Reads

Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion while keeping the initial image captioning component unaltered. This alignment allows for the use of either audio or audiovisual input by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performance in zero-shot audio captioning compared to existing approaches.
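
One way to picture the token-distribution alignment is a distribution-matching loss between projected audio tokens and the image captioner's visual tokens. The sketch below uses an RBF-kernel MMD purely as an illustrative choice of objective; the projector dimensions and loss are assumptions, not the paper's exact alignment criterion.

```python
import torch

def mmd_loss(x, y, sigma=1.0):
    # RBF-kernel Maximum Mean Discrepancy between two sets of token embeddings.
    def k(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Toy usage: align projected audio tokens with image-captioner tokens.
proj = torch.nn.Linear(128, 768)      # trainable audio-to-image-token projector
audio_tokens = torch.randn(32, 128)   # frozen audio-backbone tokens
image_tokens = torch.randn(32, 768)   # frozen image-encoder tokens
loss = mmd_loss(proj(audio_tokens), image_tokens)
loss.backward()
```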


OVOSE: Open-Vocabulary Semantic Segmentation in Event-Based Cameras

August 2024 · 27 Reads

Event cameras, known for low-latency operation and superior performance in challenging lighting conditions, are suitable for sensitive computer vision tasks such as semantic segmentation in autonomous driving. However, challenges arise due to limited event-based data and the absence of large-scale segmentation benchmarks. Current works are confined to closed-set semantic segmentation, limiting their adaptability to other applications. In this paper, we introduce OVOSE, the first Open-Vocabulary Semantic Segmentation algorithm for Event cameras. OVOSE leverages synthetic event data and knowledge distillation from a pre-trained image-based foundation model to an event-based counterpart, effectively preserving spatial context and transferring open-vocabulary semantic segmentation capabilities. We evaluate the performance of OVOSE on two driving semantic segmentation datasets, DDD17 and DSEC-Semantic, comparing it with existing conventional image open-vocabulary models adapted for event-based data. Similarly, we compare OVOSE with state-of-the-art methods designed for closed-set settings in unsupervised domain adaptation for event-based semantic segmentation. OVOSE demonstrates superior performance, showcasing its potential for real-world applications. The code is available at https://github.com/ram95d/OVOSE.
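
The image-to-event knowledge distillation can be sketched as a feature-level loss on paired (synthetic event, image) inputs, pulling the event student's features toward those of the frozen image foundation model. The cosine objective and tensor shapes below are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(event_feats, image_feats):
    # Pull per-location features of the event-based student toward those of a
    # frozen image-based foundation model on paired inputs.
    return 1.0 - F.cosine_similarity(event_feats, image_feats.detach(), dim=-1).mean()

# Toy usage with random stand-ins for per-location embeddings (B, H*W, C).
event_feats = torch.randn(2, 196, 512, requires_grad=True)
image_feats = torch.randn(2, 196, 512)
loss = cross_modal_distillation_loss(event_feats, image_feats)
loss.backward()
```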


Citations (50)


... A promising solution involves merging the model's parameters with those of a complementary model specialized in the current task (Lin et al., 2022a). This idea of merging models of old and new tasks has also been empirically shown to enhance overall performance across diverse scenarios (Marouf et al., 2024). However, these methods often gain interpretability by assuming that learning new tasks is minimally influenced by previous knowledge, and then approximate the posterior of each task with an independent Gaussian distribution. ...

Reference:

BECAME: BayEsian Continual Learning with Adaptive Model MErging
Weighted Ensemble Models Are Strong Continual Learners
  • Citing Chapter
  • November 2024

... Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS) denotes the joint execution of open-vocabulary semantic segmentation (OVSS) [1,2,3,4] and domain generalization in semantic segmentation (DGSS) [5,6,7,8] tasks. It involves training a model that, without access to target-domain samples or annotations for novel categories, can generate pixel-wise segmentation for unseen classes while sustaining high performance across previously unseen domains (such as different cities, lighting environments, or climatic conditions). ...

Collaborating Foundation Models for Domain Generalized Semantic Segmentation
  • Citing Conference Paper
  • June 2024

... An initial answer to this question emerges from recent studies [39,40] that have developed approaches for UDA-Rid, focusing on eliminating the necessity for storing images. These approaches align with privacy regulations, thereby clarifying GDPR's practical implications. ...

Source-Guided Similarity Preservation for Online Person Re-Identification

... HEVC [16] and VVC [1], renowned for their high compression efficiency, have been widely adopted as conventional and advanced approaches for video compression. In recent years, learning-based codecs [2, 8-10, 17] have demonstrated superior performance compared to conventional approaches. In particular, they have achieved high-quality reconstruction under challenging low-bitrate conditions, which was difficult for traditional methods, thereby improving encoding efficiency. ...

A Hybrid Deep Animation Codec for Low-Bitrate Video Conferencing
  • Citing Conference Paper
  • October 2022

... It distills useful knowledge in the pretrained network through a co-learning algorithm to boost the pseudo-label quality. DALL-V [31] exploits CLIP that contains the rich world prior robust to domain shift for source-free video domain adaptation. It distills the source/target domain adapted CLIP and original CLIP to a student network. ...

The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation
  • Citing Conference Paper
  • October 2023

... Compositional generation aims to produce images that are faithfully aligned with complex texts [21,60,22,25,52,13,56,53,9,31]. While diffusion models [45,11,3] ...

Zero-shot spatial layout conditioning for text-to-image diffusion models
  • Citing Conference Paper
  • October 2023

... This prevents users with low computation/communication resources from participating in training and potentially causing bias in the global model. Thus, recent works have explored parameter-efficient fine-tuning (PEFT) [8,12,13,23,24,33,40,46], where in lieu of fine-tuning the entire pretrained model, only a small number of lightweight modules are trained; the backbone model is kept frozen. Due to marked reduction in resource consumption and training latency, PEFT has become widely popular in FL [29,39,52,61,66,67]. ...

Mini but Mighty: Finetuning ViTs with Mini Adapters

... Last, in contrast with the current trend in talking face synthesis, we rely on an autoregressive generative network for its inherent ability to model sequential dependencies, and its flexibility to handle sequences of arbitrary length. To do so, we build on the autoregressive Generative Adversarial Network (GAN) baseline of [2], and show that the conditioning speech signal has a stabilizing effect that hinders error accumulation on a much longer term than in the unconditional setting. In particular, we demonstrate experimentally that the error drift can be mitigated on test sequences more than five times the length of the training sequences. ...

Autoregressive GAN for Semantic Unconditional Head Motion Generation
  • Citing Article
  • December 2023

ACM Transactions on Multimedia Computing, Communications and Applications

... Investigations elucidate the advent of Promptable Game Models (PGMs), which furnish highlevel semantic oversight in-game simulation dynamics, facilitating player engagement with the gaming milieu via intuitive language prompts (Menapace et al., 2023). Concurrently, Virtual Mine exemplifies the fusion of gaming AI with Unreal Engine, streamlining the creation of behavioural AI sans the constraints of inflexible scenario sequences, paramount for the efficacious execution of training simulations within virtual realms (Abu-Abed & Zhironkin, 2023). ...

Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models
  • Citing Article
  • December 2023

ACM Transactions on Graphics

... This self-supervised learning framework enables training the COAE with a large amount of unlabeled data, thereby reducing the reliance on annotated samples for fine-tuning a BIQA model. PRIQ [89] aims to address the image quality assessment problem by jointly modeling multiple images that depict the same content, in contrast to existing approaches that predict image quality independently for each image. The motivation behind this approach is the belief that multiple distorted images can provide valuable information to differentiate between content-related features and quality-related features. ...

Test Your Samples Jointly: Pseudo-Reference for Image Quality Evaluation
  • Citing Conference Paper
  • June 2023