Cordelia Schmid’s research while affiliated with École Normale Supérieure - PSL and other places


Publications (516)


Figures (preview): change of question-type distribution after human rater filtering; cluster distribution after diversity sampling; rater UI screenshot; model-based temporal certificate (clip-length querying until the model answers correctly, varying clip length and fps); frame-level temporal certificate (distribution of the minimum number of frames needed for a correct response, compared with EgoSchema).

Neptune: The Long Orbit to Benchmarking Long Video Understanding
  • Preprint
  • File available

December 2024 · 5 Reads · Mingda Zhang · Ramin Mehran · [...] · Tobias Weyand

This paper describes a semi-automatic pipeline to generate challenging question-answer-decoy sets for understanding long videos. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. To mitigate both of these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs) to automatically generate dense, time-aligned video captions, as well as tough question-answer-decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open-source model-based metric, GEM, to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune
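
A minimal sketch of how such a caption-then-question pipeline could be staged is shown below; `call_vlm`, `call_llm`, and the prompt strings are hypothetical stand-ins for the paper's models and prompts, not the actual Neptune pipeline.

```python
# Hypothetical sketch of a caption -> question/answer/decoy (QAD) pipeline in the
# spirit of Neptune; `call_vlm` and `call_llm` are placeholder stand-ins.
from dataclasses import dataclass

@dataclass
class QADItem:
    question: str
    answer: str
    decoys: list

def call_vlm(frames, prompt):          # placeholder: captions a video segment
    return f"[caption for {len(frames)} frames]"

def call_llm(prompt):                  # placeholder: text-only reasoning model
    return "Q: ...? | A: ... | D: ...; ...; ...; ..."

def generate_qad(video_frames, segment_len=16):
    # 1) Dense, time-aligned captions, one per segment.
    captions = []
    for start in range(0, len(video_frames), segment_len):
        segment = video_frames[start:start + segment_len]
        captions.append(call_vlm(segment, "Describe this segment."))
    # 2) Ask an LLM for a question, answer, and hard decoys over the full timeline.
    raw = call_llm(
        "Given these time-ordered captions:\n" + "\n".join(captions) +
        "\nWrite one question that needs the whole video, its answer, and 4 decoys."
    )
    q, a, d = raw.split(" | ")
    return QADItem(question=q, answer=a, decoys=d.split("; "))

print(generate_qad(list(range(64))))
```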


Figure (preview): Improving vision-language models with ViLex tokens.
Visual Lexicon: Rich Image Features in Language Space

December 2024 · 8 Reads

We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings--even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.
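
The self-supervised objective lends itself to a compact sketch: a trainable image encoder emits a handful of text-space tokens, and the only gradient signal is the denoising loss of a frozen text-to-image diffusion model. The PyTorch snippet below is an assumed toy version (the diffusion model is stubbed out); it is not the authors' implementation.

```python
# Assumed minimal sketch of the ViLex training signal, not the released code.
import torch
import torch.nn as nn

class ViLexEncoder(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, num_tokens=8):
        super().__init__()
        self.proj = nn.Linear(img_dim, txt_dim * num_tokens)
        self.num_tokens, self.txt_dim = num_tokens, txt_dim

    def forward(self, img_feat):                       # (B, img_dim)
        out = self.proj(img_feat)                      # (B, txt_dim * num_tokens)
        return out.view(-1, self.num_tokens, self.txt_dim)

def frozen_diffusion_eps(noisy_latent, t, cond_tokens):
    # Stub for a frozen T2I UNet's noise prediction so the sketch runs end to end;
    # the conditioning enters through a crude mean so gradients reach the encoder.
    cond = cond_tokens.mean(dim=(1, 2)).view(-1, 1, 1, 1)
    return noisy_latent * 0.9 + cond

encoder = ViLexEncoder()
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

img_feat = torch.randn(4, 768)      # e.g. pooled ViT features of the input images
latent = torch.randn(4, 4, 8, 8)    # image latents the diffusion model should recover
noise = torch.randn_like(latent)
noisy = latent + noise              # simplified forward-diffusion step

tokens = encoder(img_feat)          # "ViLex" tokens living in the text-embedding space
pred = frozen_diffusion_eps(noisy, t=10, cond_tokens=tokens)
loss = nn.functional.mse_loss(pred, noise)   # denoising loss trains only the encoder
loss.backward()
opt.step()
print(float(loss))
```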


Language-Guided Image Tokenization for Generation

December 2024 · 3 Reads

Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization.
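
A rough way to picture text conditioning in the tokenizer is sketched below: caption embeddings are concatenated into both the encoder and decoder so the latent tokens only need to carry residual visual detail. Dimensions, module choices, and the reconstruction loss are illustrative assumptions, not the TexTok architecture.

```python
# Illustrative sketch of text-conditioned tokenization (shapes are toy values).
import torch
import torch.nn as nn

class TextConditionedTokenizer(nn.Module):
    def __init__(self, dim=256, num_tokens=32, num_patches=196):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_tokens, dim))   # learnable queries
        self.enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.dec = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.num_tokens, self.num_patches = num_tokens, num_patches

    def encode(self, patch_emb, text_emb):
        B = patch_emb.size(0)
        q = self.latents.unsqueeze(0).expand(B, -1, -1)
        # Concatenate queries, image patches, and caption tokens; keep only the
        # query positions as the compact latent tokens.
        x = torch.cat([q, patch_emb, text_emb], dim=1)
        return self.enc(x)[:, : self.num_tokens]

    def decode(self, tokens, text_emb):
        B = tokens.size(0)
        pix_q = torch.zeros(B, self.num_patches, tokens.size(-1))
        x = torch.cat([pix_q, tokens, text_emb], dim=1)
        return self.dec(x)[:, : self.num_patches]      # reconstructed patch embeddings

tok = TextConditionedTokenizer()
patches = torch.randn(2, 196, 256)      # patchified image features
caption = torch.randn(2, 16, 256)       # caption embedding (e.g. from a text encoder)
z = tok.encode(patches, caption)        # (2, 32, 256): only 32 tokens per image
recon = tok.decode(z, caption)
loss = nn.functional.mse_loss(recon, patches)
print(z.shape, float(loss))
```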


Grounded Video Caption Generation

November 2024 · 18 Reads

We propose a new task, dataset and model for grounded video caption generation. This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes. We introduce the following contributions. First, we present a task definition and a manually annotated test dataset for this task, referred to as GROunded Video Caption Generation (GROC). Second, we introduce a large-scale automatic annotation method leveraging an existing model for grounded still image captioning together with an LLM for summarising frame-level captions into temporally consistent captions in video. Furthermore, we prompt the LLM to track by language -- classifying noun phrases from the frame-level captions into noun phrases of the video-level generated caption. We apply this approach to videos from the HowTo100M dataset, which results in a new large-scale training dataset, called HowToGround, with automatically annotated captions and spatio-temporally consistent bounding boxes with coherent natural language labels. Third, we introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset. Finally, results of our VideoGround model set the state of the art for the new task of grounded video caption generation. We perform extensive ablations and demonstrate the importance of key technical contributions of our model.
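
The "track by language" step can be illustrated with a small sketch: each noun phrase from a frame-level caption is assigned (by an LLM, stubbed here with a substring heuristic) to a noun phrase of the video-level caption, and its boxes join that object's track. Function names and data below are hypothetical.

```python
# Hedged sketch of LLM-based noun-phrase tracking for grounded video captions.
from collections import defaultdict

def llm_summarise(frame_captions):                    # placeholder LLM call
    return "a person stirs food in a pan"

def llm_assign(frame_phrase, video_phrases):          # placeholder LLM call
    # A real system would prompt an LLM; here we fall back to a substring match.
    for vp in video_phrases:
        if frame_phrase.split()[-1] in vp:
            return vp
    return None

frame_annotations = [  # (time, noun phrase, box) from a grounded image captioner
    (0, "a person", [10, 10, 50, 90]),
    (0, "a frying pan", [60, 40, 95, 70]),
    (1, "the person", [12, 10, 52, 90]),
    (1, "a pan", [61, 41, 96, 71]),
]

video_caption = llm_summarise([p for _, p, _ in frame_annotations])
video_phrases = ["a person", "food", "a pan"]          # noun phrases of the summary

tracks = defaultdict(list)                             # video phrase -> list of (t, box)
for t, phrase, box in frame_annotations:
    target = llm_assign(phrase, video_phrases)
    if target is not None:
        tracks[target].append((t, box))

print(video_caption)
print(dict(tracks))
```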


Figures (preview): qualitative examples of suboptimal annotations in the OVEN benchmark (input question, input image, ground-truth entity, and our model's top-5 predictions); zero-shot transfer of generative models to fine-grained image classification (top-1 accuracy; all models run by us with the same architecture).
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

October 2024 · 4 Reads

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded finegrained textual description (referred to as "rationale") that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g. +6.9% improvement in OVEN entity task), underscoring the importance of high-quality training data in this domain.
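
A hedged sketch of the curation loop follows: rather than asking a multimodal LLM to label an image directly, the prompt supplies candidate entities plus retrieved Wikipedia context and requests a verified entity, a rationale, and a QA pair. `call_mllm` and `retrieve_wikipedia_summary` are placeholders, and the prompt wording is invented for illustration.

```python
# Illustrative sketch of LLM-driven label verification (not the paper's prompts).
import json

def retrieve_wikipedia_summary(entity):               # placeholder retrieval step
    return f"Summary of the Wikipedia article for {entity}."

def call_mllm(image_path, prompt):                    # placeholder multimodal LLM
    return json.dumps({
        "verified_entity": "Golden Gate Bridge",
        "rationale": "Distinctive red-orange suspension towers match the article.",
        "qa": {"q": "Which bridge is shown?", "a": "The Golden Gate Bridge."},
    })

def curate_example(image_path, candidate_entities):
    context = "\n".join(
        f"- {e}: {retrieve_wikipedia_summary(e)}" for e in candidate_entities
    )
    prompt = (
        "Candidate entities with context:\n" + context +
        "\nPick the entity that matches the image, explain why (rationale), "
        "and write one question-answer pair grounded in the image."
    )
    return json.loads(call_mllm(image_path, prompt))

print(curate_example("bridge.jpg", ["Golden Gate Bridge", "25 de Abril Bridge"]))
```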


Learning Text-to-Video Retrieval from Image Captioning

October 2024 · 27 Reads · 4 Citations · International Journal of Computer Vision

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario, given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
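
The caption-scored temporal pooling can be written down directly: frame embeddings are aggregated with softmax weights given by their similarity to the caption embedding. The snippet below is a minimal PyTorch rendering of that idea (the temperature and dimensions are assumed, not taken from the paper).

```python
# Assumed sketch of caption-scored temporal pooling over frame representations.
import torch
import torch.nn.functional as F

def temporal_pool(frame_emb, caption_emb, temperature=0.07):
    # frame_emb: (T, D) per-frame features, caption_emb: (D,)
    frame_emb = F.normalize(frame_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    scores = frame_emb @ caption_emb                  # (T,) cosine similarities
    weights = torch.softmax(scores / temperature, dim=0)
    return (weights.unsqueeze(-1) * frame_emb).sum(dim=0)   # (D,) video embedding

frames = torch.randn(16, 512)        # e.g. CLIP image features of 16 frames
caption = torch.randn(512)           # text feature of an automatically generated caption
video_emb = temporal_pool(frames, caption)
print(video_emb.shape)               # torch.Size([512])
```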


Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy

October 2024 · 34 Reads

Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation. The benchmark, codes and trained models are available at \url{https://www.di.ens.fr/willow/research/gembench/}.
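
At a very high level, the modular decomposition could look like the stub pipeline below: an LLM splits the instruction into steps, a VLM grounds the referenced object in 3D, and a learned policy executes each grounded step. All three components are placeholders; this is not the released 3D-LOTUS++ code.

```python
# High-level sketch of an LLM-plan / VLM-ground / 3D-policy loop (stubs only).
def llm_plan(instruction):                        # placeholder task planner
    return ["pick up the red block", "place it on the shelf"]

def vlm_ground(step, observation):                # placeholder object grounding
    return {"object": step.split()[-2] + " " + step.split()[-1],
            "position_3d": (0.4, 0.1, 0.2)}

def motion_policy(step, grounding, observation):  # placeholder 3D action policy
    return {"action": "move_to", "target": grounding["position_3d"]}

def run_episode(instruction, observation):
    trajectory = []
    for step in llm_plan(instruction):
        grounding = vlm_ground(step, observation)
        trajectory.append(motion_policy(step, grounding, observation))
    return trajectory

print(run_episode("put the red block on the shelf", observation={}))
```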


CoVR-2: Automatic Data Construction for Composed Video Retrieval

September 2024 · 7 Reads · IEEE Transactions on Pattern Analysis and Machine Intelligence

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
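
The triplet-mining recipe is easy to sketch: pair videos whose captions are similar but not identical, then ask an LLM to phrase the modification that maps one caption to the other. In the toy version below the caption similarity is a simple string ratio and the LLM is stubbed; the real pipeline's mining index, thresholds, and prompts differ.

```python
# Hedged sketch of automatic CoVR triplet construction from video-caption pairs.
from difflib import SequenceMatcher
from itertools import combinations

def llm_modification_text(src_caption, tgt_caption):   # placeholder LLM call
    return f"change '{src_caption}' so that it shows: {tgt_caption}"

def mine_triplets(video_captions, threshold=0.7):
    # video_captions: list of (video_id, caption)
    triplets = []
    for (vid_a, cap_a), (vid_b, cap_b) in combinations(video_captions, 2):
        sim = SequenceMatcher(None, cap_a, cap_b).ratio()
        if threshold <= sim < 1.0:                      # similar but not identical
            triplets.append((vid_a, llm_modification_text(cap_a, cap_b), vid_b))
    return triplets

pairs = [
    ("v1", "a dog running on the beach"),
    ("v2", "a dog running in the snow"),
    ("v3", "a recipe for chocolate cake"),
]
for t in mine_triplets(pairs):
    print(t)   # (source video, modification text, target video)
```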


Figure 4: Evolution of aggregated COMET and CoMMuTE scores when changing the KL penalty coefficient.
Towards Zero-Shot Multimodal Machine Translation

July 2024 · 30 Reads

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
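
The two-term objective can be written compactly: a visually conditioned masked-LM cross-entropy plus a KL penalty that keeps the multimodal model's output distribution close to the frozen text-only model's. The snippet below uses toy tensors and an illustrative penalty coefficient; it is a sketch of the stated loss, not the released training code.

```python
# Minimal sketch (toy tensors, assumed shapes) of a ZeroMMT-style training objective.
import torch
import torch.nn.functional as F

vocab, batch, seq = 1000, 2, 7
mmt_logits = torch.randn(batch, seq, vocab, requires_grad=True)   # new multimodal model
mt_logits = torch.randn(batch, seq, vocab)                        # frozen text-only MT model
masked_targets = torch.randint(0, vocab, (batch, seq))            # tokens hidden by masking

# Visually conditioned masked language modelling loss.
mlm_loss = F.cross_entropy(mmt_logits.view(-1, vocab), masked_targets.view(-1))

# KL(original text-only distribution || new multimodal distribution),
# weighted by a penalty coefficient that trades disambiguation against fidelity.
kl = F.kl_div(
    F.log_softmax(mmt_logits, dim=-1),
    F.softmax(mt_logits, dim=-1),
    reduction="batchmean",
)
kl_coeff = 0.5                                                    # illustrative value only
loss = mlm_loss + kl_coeff * kl
loss.backward()
print(float(mlm_loss), float(kl))
```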



Citations (47)


... Numerous multimodal encoders, such as CLIP [10], BLIP [41] and BLIP2 [42], are designed to align various modalities within a unified representation space, demonstrating notable success in various areas such as image/video captioning [43]-[45], text-to-image generation [46], and also robotics [47]. However, these encoders are not optimally suited for robot learning as they often fail to capture the temporal visual dynamics critical for robotics [31]. ...

Reference:

Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning
Learning Text-to-Video Retrieval from Image Captioning

International Journal of Computer Vision

... Spatial and Numeracy accuracy are computed using GroundingSAM [27] to locate and count the objects. Motion Binding is computed using GroundingSAM and Dense Optical Tracking [17]. TC-Bench adopts GPT-4 Turbo to answer a list of assertion questions related to compositions of the video. ...

Dense Optical Tracking: Connecting the Dots
  • Citing Conference Paper
  • June 2024

... In dense action detection, we aim to identify and temporally localize all actions occurring in an untrimmed video, even when the actions densely overlap in time. Despite advancements in other video understanding tasks [1,7,10,19], dense action detection remains challenging, with the current state-of-the-art mean Average Precision (mAP) reaching only 26.5% on the main benchmark dataset, Charades [15]. This limitation primarily arises from the inherent complexities of the task: (i) the action durations can ...

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
  • Citing Conference Paper
  • June 2024

... Motion-centric approaches. The limitation of static representations of skeleton sequences in ST-GCN models has led to the development of motion-centric representations [14,45,53,95,97,122], such as Taylor-transformed skeleton sequences [95]. These transformations use higher-order temporal derivatives (e.g., velocity and acceleration) to enrich skeleton representations by emphasizing dynamic motion patterns. ...

Learning Correlation Structures for Vision Transformers
  • Citing Conference Paper
  • June 2024

... This process suggests that the ability to understand spatial information is limited by whether the patch token has a corresponding spatial feature, not the spatial localizing capacity itself. Therefore, we decide to adopt the end-to-end STAL framework [16], training the decoder while freezing the vision backbone. The detailed hyperparameters are shown in Table 8. ...

End-to-End Spatio-Temporal Action Localisation with Video Transformers
  • Citing Conference Paper
  • June 2024

... This benchmark also validated the effectiveness of the generative entity recognition framework (GER). Building on this, GER-ALD (Caron et al. 2024b) demonstrated that unAmbiguous Language-based Discriminative (ALD) entity codes offer a performance advantage within the GER framework. AUTOVER (Xiao et al. 2024) achieved an accuracy 11.9 points higher than GER-ALD on the OVEN-Wiki test set through retrieval-augmented constrained decoding. ...

A Generative Approach for Wikipedia-Scale Visual Entity Recognition
  • Citing Conference Paper
  • June 2024

... Subsequent works (Kim et al. 2024) explored incorporating additional data or modalities as external knowledge, such as transcribed speech or external captions, to enhance performance. Another study (Zhou et al. 2024) addressed online dense video captioning, processing videos frame by frame instead of analyzing them as a whole. More recently, the emergence of Video Large Language Models has inspired efforts to evaluate these models' fine-grained video understanding capabilities with the task of DVC. ...

Streaming Dense Video Captioning
  • Citing Conference Paper
  • June 2024

... Given a query image and an optional prompt specifying the keypoint of interest, our goal is to generate textual descriptions and keypoint locations that convey fine-grained keypoint information within the image. Recognizing the exceptional ability of LLMs in handling multimodal tokens for different perception tasks [76,10,77,73,78], we further leverage LLM for keypoint comprehension, which could effectively process various inputs: (1) the visual tokens z q of the query image, (2) the prompt tokens z p , and (3) a sequence of language tokens t, which depend on the three semantic keypoint comprehension scenarios. ...

Pixel Aligned Language Models
  • Citing Conference Paper
  • June 2024

... To tackle the above issue, we propose the Low-rank Parallel Adaptation method inspired by previous studies [51], [52]. Instead of inserting tunable parameters inside the backbone, we design a parallel network that does not require backpropagation gradients through the frozen backbone during training as shown in Fig. 3 (c). ...

Time-, Memory- and Parameter-Efficient Visual Adaptation
  • Citing Conference Paper
  • June 2024

... MV-MWM [65], 3D-MVP [60], and SPA [96] leverage multi-view MAE to learn 3D visual representation. SUGAR [9] and Point Cloud Matters [95] introduce point cloud-based 3D representations, showing that these observations often improve policy performance and generalization capabilities. DPR [77] leverages depth information as auxiliary knowledge for pretraining. ...

SUGAR : Pre-training 3D Visual Representations for Robotics
  • Citing Conference Paper
  • June 2024