Tinne Tuytelaars’s research while affiliated with KU Leuven and other places


Publications (466)


Animate Your Motion: Turning Still Images into Dynamic Videos
  • Chapter

November 2024 · 4 Reads · 3 Citations

Mingxiao Li · Bo Wan · … · Tinne Tuytelaars

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

November 2024 · 1 Read

Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multimodal content. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs for video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage the powerful pre-trained image LLM. In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM. The findings lead to our method TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video, we select a few equidistant frames from all input frames to construct a Thumbnail image as a detailed visual cue, complemented by Sampled visual tokens from all input frames. Our method establishes new state-of-the-art performance among training-free video LLMs on various benchmarks. Notably, our 34B model outperforms GPT-4V on the MVBench benchmark, and achieves performance comparable to the 72B training-based video LLM, Video-LLaMA2, on the challenging MLVU benchmark. Code is available at https://github.com/tingyu215/TS-LLaVA.
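
To make the Thumbnail-and-Sampling idea concrete, here is a minimal sketch of how such a visual token set could be assembled. It is not the released implementation; `encode_image`, the 2×2 mosaic layout, and the sampling stride are assumptions for illustration.

```python
# Hypothetical sketch of a Thumbnail-and-Sampling token builder (not the
# authors' code). `encode_image` stands for a frozen image encoder that maps
# one image to a (num_patches, dim) token grid.
import torch

def thumbnail_and_sampling_tokens(frames, encode_image, num_thumb=4, keep_every=4):
    """frames: list of (C, H, W) tensors, one per sampled video frame."""
    T = len(frames)
    # 1) Thumbnail: pick equidistant frames and tile them into a 2x2 mosaic
    #    (in practice the mosaic would be resized to the encoder's input size).
    idx = torch.linspace(0, T - 1, num_thumb).long().tolist()
    rows = [torch.cat([frames[i] for i in idx[r:r + 2]], dim=2)   # side by side
            for r in range(0, num_thumb, 2)]
    thumbnail = torch.cat(rows, dim=1)                            # stack rows
    thumb_tokens = encode_image(thumbnail)                        # (P, D)

    # 2) Sampling: keep a strided subset of tokens from every input frame.
    sampled = [encode_image(f)[::keep_every] for f in frames]
    sampled_tokens = torch.cat(sampled, dim=0)

    # Concatenated visual tokens handed to the (frozen) image LLM.
    return torch.cat([thumb_tokens, sampled_tokens], dim=0)
```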




Figure captions (preview of the preprint below):

Figure 1: Top: Venn diagram of the relationships between types of collapse, viz. dimensional collapse (D), cluster collapse (C), intra-cluster collapse (I) and representation (full) collapse of embeddings. Bottom: on the left, failure-proof data embeddings; on the right, examples of collapse for each single type (collapsed clusters are highlighted in red).
Figure 2: In FALCON, minimizing the proposed objective together with the corresponding projector ensures that the embedding representations are clustered and, at the same time, that their features are decorrelated. This guarantees that the representations are failure-free, meaning that dimensional, cluster, intra-cluster and representation collapses are prevented.
Figure 3: Top: Illustration of WᵀW obtained by randomly sampling W. Bottom: normalized histograms of the elements of WᵀW. Figs. 3a-3c have fixed f = 50, whereas Figs. 3d-3f have fixed f = 100. WᵀW has diagonal values at 1 and random off-diagonal values centered around zero. For larger c and fixed f, the variance remains constant and quasi-orthogonality is preserved.
Figure 5: Analysis of downstream generalization on the CIFAR-10 test dataset: clustering (left) and linear evaluation results (right).
Figure 7: Collapse analysis on CIFAR-10 test data for different dictionary sizes c. Results are averaged over 5 training runs with random initialization seeds. Left: singular values of the embedding covariance, in sorted order and logarithmic scale; the curve rises for very large values of c, avoiding zero singular values. Right: number of mixture components, in logarithmic scale; the curve rises for very large values of c for all numbers of mixture components.

Failure-Proof Non-Contrastive Self-Supervised Learning
  • Preprint
  • File available

October 2024 · 33 Reads

We identify sufficient conditions to avoid known failure modes, including representation, dimensional, cluster and intra-cluster collapses, occurring in non-contrastive self-supervised learning. Based on these findings, we propose a principled design for the projector and loss function. We theoretically demonstrate that this design introduces an inductive bias that promotes learning representations that are both decorrelated and clustered without explicitly enforcing these properties, leading to improved generalization. To the best of our knowledge, this is the first solution that achieves robust training with respect to these failure modes while guaranteeing enhanced generalization performance in downstream tasks. We validate our theoretical findings on image datasets including SVHN, CIFAR10, CIFAR100 and ImageNet-100, and show that our solution, dubbed FALCON, outperforms existing feature decorrelation and cluster-based self-supervised learning methods in terms of generalization to clustering and linear classification tasks.
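
FALCON's actual projector and objective are specified in the preprint; as a loose illustration of the ingredients named above (feature decorrelation paired with clustering), the stand-in below combines a Barlow-Twins-style decorrelation term with a prototype-based clustering term. It is not the FALCON loss; all names, weights and the anti-collapse term are assumptions.

```python
# Illustrative stand-in only (NOT the FALCON objective): a generic way to pair
# feature decorrelation with clustering in non-contrastive SSL.
import torch
import torch.nn.functional as F

def decorrelation_plus_clustering_loss(z1, z2, prototypes, temp=0.1):
    """z1, z2: (N, f) embeddings of two views; prototypes: (f, c) projector."""
    # Feature decorrelation (Barlow-Twins-style cross-correlation penalty),
    # counteracting dimensional collapse.
    z1n = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2n = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    corr = z1n.T @ z2n / z1.shape[0]                        # (f, f)
    on_diag = (torch.diagonal(corr) - 1).pow(2).sum()
    off_diag = (corr - torch.diag(torch.diagonal(corr))).pow(2).sum()
    decorrelation = on_diag + 0.005 * off_diag

    # Clustering: make the two views agree on soft prototype assignments.
    logits1, logits2 = z1 @ prototypes / temp, z2 @ prototypes / temp
    p2 = F.softmax(logits2, dim=1)
    clustering = F.kl_div(F.log_softmax(logits1, dim=1), p2, reduction="batchmean")

    # Discourage cluster collapse: keep the average assignment high-entropy.
    mean_assign = p2.mean(0)
    anti_collapse = (mean_assign * (mean_assign + 1e-8).log()).sum()

    return decorrelation + clustering + anti_collapse
```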


Analysis of Spatial augmentation in Self-supervised models in the purview of training and test distributions

September 2024 · 1 Read

In this paper, we present an empirical study of typical spatial augmentation techniques used in self-supervised representation learning methods (both contrastive and non-contrastive), namely random crop and cutout. Our contributions are: (a) we dissociate random cropping into two separate augmentations, overlap and patch, and provide a detailed analysis of the effect of overlap area and patch size on the accuracy of downstream tasks. (b) We offer an insight into why cutout augmentation does not learn good representations, as reported in earlier literature. Finally, based on these analyses, (c) we propose adding a distance-based margin to the invariance loss for learning scene-centric representations for downstream tasks on object-centric distributions, showing that a margin as simple as one proportional to the pixel distance between the two spatial views of a scene-centric image can improve the learned representation. Our study furthers the understanding of spatial augmentations and of the effect of the domain gap between the training augmentations and the test distribution.
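
As a rough illustration of the proposed distance-based margin (the exact formulation in the paper may differ; the cosine form and the scale factor below are assumptions), the invariance term can be relaxed by a slack that grows with the pixel distance between the two crops:

```python
# Hypothetical sketch of an invariance loss with a distance-based margin.
import torch
import torch.nn.functional as F

def margin_invariance_loss(z1, z2, centers1, centers2, alpha=1e-3):
    """z1, z2: (N, D) view embeddings; centers1/2: (N, 2) crop centers in pixels."""
    pixel_dist = (centers1 - centers2).float().norm(dim=1)      # per-pair distance
    margin = alpha * pixel_dist                                 # proportional slack
    dissim = 1.0 - F.cosine_similarity(z1, z2, dim=1)           # invariance term
    # Far-apart crops of a scene-centric image are not forced to be identical.
    return torch.clamp(dissim - margin, min=0.0).mean()
```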


Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation

September 2024 · 6 Reads

This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task. It aims to diagnose current models for various instruction categories at a finer-grained level. The framework is structured around the context-free grammar (CFG) of the task. The CFG serves as the basis for the problem decomposition and the core premise of the instruction category design. We propose a semi-automatic method for CFG construction with the help of Large Language Models (LLMs). Then, we induct and generate data spanning five principal instruction categories (i.e. direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation of numerical comprehension, heavy selective biases over directional concepts, and other interesting findings contribute to the development of future language-guided navigation systems.
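
For intuition only, a toy grammar in the spirit of the categories above might look as follows; the paper's CFG is built semi-automatically with LLMs and is far richer, and every rule and terminal here is made up.

```python
# Toy context-free grammar for navigation instructions, illustration only.
import random

TOY_VLN_CFG = {
    "INSTR":         [["ACTION"], ["ACTION", "then", "INSTR"]],
    "ACTION":        [["DIRECTION"], ["LANDMARK_MOVE"], ["VERTICAL"], ["NUMERIC"]],
    "DIRECTION":     [["turn", "left"], ["turn", "right"], ["go", "straight"]],   # direction change
    "LANDMARK_MOVE": [["walk", "past", "LANDMARK"], ["stop", "at", "LANDMARK"]],  # landmark recognition
    "LANDMARK":      [["the", "sofa"], ["the", "staircase"], ["the", "doorway"]],
    "VERTICAL":      [["go", "upstairs"], ["go", "downstairs"]],                  # vertical movement
    "NUMERIC":       [["take", "COUNT", "steps"]],                                # numerical comprehension
    "COUNT":         [["two"], ["three"], ["four"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a flat list of word tokens."""
    if symbol not in TOY_VLN_CFG:
        return [symbol]                                  # terminal word
    production = rng.choice(TOY_VLN_CFG[symbol])
    return [tok for part in production for tok in expand(part, rng)]

# Example: " ".join(expand("INSTR", random.Random(0))) yields one sampled instruction.
```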


Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering

September 2024 · 3 Reads

Novel view synthesis of indoor scenes can be achieved by capturing a monocular video sequence of the environment. However, redundant information caused by artificial movements in the input video data reduces the efficiency of scene modeling. In this work, we tackle this challenge from the perspective of camera selection. We begin by constructing a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images. Based on this matrix, we use the Intra-List Diversity (ILD) metric to assess camera redundancy, formulating the camera selection task as an optimization problem. Then we apply a diversity-based sampling algorithm to optimize the camera selection. We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments, closely mimicking real-world scenarios. Experimental results demonstrate that our strategy outperforms other approaches under time and memory constraints. Remarkably, our method achieves performance comparable to models trained on the full dataset, while using only an average of 15% of the frames and 75% of the allotted time.
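
A rough sketch of diversity-based selection over a precomputed similarity matrix is given below; the greedy rule, the starting view and the ILD normalization are assumptions for illustration, not the paper's exact optimization.

```python
# Hypothetical sketch of redundancy-aware camera selection from a pairwise
# similarity matrix `sim` (mixing pose distance and image-feature similarity).
import numpy as np

def intra_list_diversity(sim, selected):
    """Mean pairwise dissimilarity (1 - similarity) within the selected list."""
    s = np.asarray(selected)
    if len(s) < 2:
        return 0.0
    sub = sim[np.ix_(s, s)]                      # diagonal contributes zero (1 - 1)
    n = len(s)
    return float((1.0 - sub).sum() / (n * (n - 1)))

def greedy_select(sim, budget):
    """Greedily add the view least similar to the ones already chosen."""
    n = sim.shape[0]
    selected = [int(np.argmin(sim.sum(axis=1)))]          # most distinctive view first
    while len(selected) < budget:
        remaining = [i for i in range(n) if i not in selected]
        nxt = min(remaining, key=lambda i: sim[i, selected].max())
        selected.append(nxt)
    return selected           # quality of the list can be checked with intra_list_diversity
```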


Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors

August 2024 · 12 Reads

This paper proposes a self-learning framework to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after deployment on ultra-low-power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to newly recorded audio frames based on a similarity score with respect to a few user recordings. By experimenting with multiple KWS models with up to 0.5M parameters on two public datasets, we show an accuracy improvement of up to +19.2% and +16.0% vs. the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient Microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real time with an average power cost of up to 8.2 mW. On the same platform, we estimate an energy cost for on-device training 10x lower than the labeling energy if sampling a new utterance every 5 s or 16.4 s with a DS-CNN-S or a DS-CNN-M model, respectively. Our empirical results pave the way to self-adaptive personalized KWS sensors at the extreme edge.
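
The pseudo-labeling step can be pictured with a minimal sketch like the one below; the threshold, the embedding model and the normalization are assumptions, not the paper's exact on-device pipeline.

```python
# Hypothetical sketch of similarity-based pseudo-labeling for personalized KWS.
import numpy as np

def pseudo_label(new_embedding, enrolled, threshold=0.7):
    """new_embedding: (D,) L2-normalized utterance embedding;
    enrolled: dict mapping keyword -> (K, D) L2-normalized user recordings."""
    best_kw, best_score = None, -1.0
    for keyword, refs in enrolled.items():
        score = float(np.max(refs @ new_embedding))   # cosine similarity
        if score > best_score:
            best_kw, best_score = keyword, score
    if best_score < threshold:
        return None, best_score                       # too uncertain: drop the frame
    return best_kw, best_score                        # use as a fine-tuning label
```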


Implicit Gaussian Splatting with Efficient Multi-Level Tri-Plane Representation

August 2024 · 9 Reads

Recent advancements in photo-realistic novel view synthesis have been significantly driven by Gaussian Splatting (3DGS). Nevertheless, the explicit nature of 3DGS data entails considerable storage requirements, highlighting a pressing need for more efficient data representations. To address this, we present Implicit Gaussian Splatting (IGS), an innovative hybrid model that integrates explicit point clouds with implicit feature embeddings through a multi-level tri-plane architecture. This architecture features 2D feature grids at various resolutions across different levels, facilitating continuous spatial domain representation and enhancing spatial correlations among Gaussian primitives. Building upon this foundation, we introduce a level-based progressive training scheme, which incorporates explicit spatial regularization. This method capitalizes on spatial correlations to enhance both the rendering quality and the compactness of the IGS representation. Furthermore, we propose a novel compression pipeline tailored for both point clouds and 2D feature grids, considering the entropy variations across different levels. Extensive experimental evaluations demonstrate that our algorithm can deliver high-quality rendering using only a few MBs, effectively balancing storage efficiency and rendering fidelity, and yielding results that are competitive with the state-of-the-art.
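
To visualize what a multi-level tri-plane query involves, here is a minimal sketch; the summation across planes and levels, the tensor layout and the use of bilinear `grid_sample` are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a multi-level tri-plane feature lookup for 3D points.
import torch
import torch.nn.functional as F

def query_triplanes(points, levels):
    """points: (N, 3) coordinates in [-1, 1].
    levels: list (coarse to fine) of dicts with 'xy', 'xz', 'yz' planes,
    each a (1, C, H_l, W_l) feature grid; all levels share channel count C."""
    pairs = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}
    feats = 0
    for planes in levels:
        for name, (a, b) in pairs.items():
            coords = points[:, [a, b]].view(1, -1, 1, 2)               # (1, N, 1, 2)
            sampled = F.grid_sample(planes[name], coords,
                                    mode="bilinear", align_corners=True)
            feats = feats + sampled.view(planes[name].shape[1], -1).T  # (N, C)
    return feats   # per-point features, summed over the three planes and all levels
```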


Citations (35)


... Video motion control research has developed along two primary paths: explicit control through bounding boxes and motion transfer from reference videos. Explicit control methods include AnimateAnyone [12], Boximator [29], Peekaboo [9], and Trailblazer [15]. Another significant line of work focuses on transferring motion from reference videos. ...

Reference:

MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance
Animate Your Motion: Turning Still Images into Dynamic Videos
  • Citing Chapter
  • November 2024

... The slightly reduced resolution from the Thumbnail image can be mitigated by existing high-resolution operations in image LLMs [22], which feed image patch features to the LLM. There is also potential in exploring combining vision encoders through feature routing [32]. We leave the above for future work. ...

Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
  • Citing Chapter
  • November 2024

... Conversely, other approaches (Fang et al. 2022; Işık et al. 2023; Fridovich-Keil et al. 2023; Cao and Johnson 2023a; Li et al. 2022a; Shao et al. 2023) extend the radiance field into a 4D spatio-temporal domain, facilitating faster training and rendering at the cost of increased storage demands. Several studies (Wu et al. 2024; Zheng et al. 2024b,a) use residual radiance fields to represent long-sequence dynamic scenes, leveraging compact motion grids and residual feature grids to exploit inter-frame feature similarity. Our compact tri-plane residual-based dynamic modeling method is designed for inter-frame modeling in extended sequences, which effectively captures high-dimensional appearance features within compact planes. ...

TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video
  • Citing Conference Paper
  • June 2024

... NeRFactor [53], NeRD [3], Neural-PIL [4], NeRV [40], Neural Transfer Field [27], InvRender [54] and TensoIR [15] use a density field as the geometry representation and an environment tensor with Monte Carlo sampling for the light reconstruction. To solve the ambiguity of the base color and environment light, [8,19] show the importance of adding a material prior to inverse rendering. However, these methods fail to reconstruct the material properties of glossy objects. ...

Unveiling the Ambiguity in Neural Inverse Rendering: A Parameter Compensation Analysis

... To tackle this problem, early endeavors developed combinatorial learning, enabling the prediction of HOI triplets during inference [2,21,22,36]. Inspired by the rapid advancements in visual-language pre-trained models like CLIP [39], more recent work has tapped into the zero-shot generalization capabilities of these models to transfer prior knowledge for identifying unseen HOI categories [3,28,34,37,38,43,44,49]. Specifically, these methods employ the CLIP image encoder to extract feature embeddings related to human-object pairs. ...

Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels
  • Citing Conference Paper
  • January 2024

... This motivates us to explore natural and more dynamic visual stimuli: videos. EEG has a high temporal resolution such that it can capture the fast dynamics of neural responses elicited by the time-varying features within the video [17]. Using EEG signals, we aim to decode the attended object in videos with two moving objects (persons) that spatially overlap. ...

Identifying temporal correlations between natural single-shot videos and EEG signals
  • Citing Article
  • January 2024

... Knowledge distillation, first introduced by Hinton et al. (Hinton, 2015), is an effective model compression technique that transfers knowledge from a complex teacher model to a simpler student model. This method has demonstrated remarkable success in various domains (Rusu et al., 2015; Joshi et al., 2024), including but not limited to computer vision (Li et al., 2024; Fan et al., 2024), natural language processing (Sanh, 2019; Sun et al., 2019; Jiao et al., 2019; Tang et al., 2019; Gu et al., 2024), and multimodal learning (Wang et al., 2020; Li et al., 2021; Radevski et al., 2023; Li et al., 2023; Shen et al., 2023). ...

Multimodal Distillation for Egocentric Action Recognition
  • Citing Conference Paper
  • October 2023

... In summary, KD leverages various strategies such as output layer distillation [40], intermediate layer distillation [41], self-distillation [42], adversarial distillation [43], and multi-teacher distillation to effectively enhance the accuracy and efficiency of student models. Despite the increased training time and cost associated with pre-training a large teacher model, the benefits in terms of improved performance and reduced computational demands make KD an essential technique for model compression. ...

Adaptive Similarity Bootstrapping for Self-Distillation based Representation Learning
  • Citing Conference Paper
  • October 2023

... Miah and Wang implemented in [9] a keyword spotting system using MFCC for feature extraction on an STM32F769NI microcontroller and on a Jetson Nano single-board computer to compare its performance on both platforms. Rusci and Tuytelaars deployed several keyword spotting models using speaker-personalization techniques on a GAP9 microcontroller, using MFCC for feature extraction [10]. Li et al. combined MFCC extraction with a Discrete Wavelet Transform for feature extraction, validating their proposal on an ESP32 microcontroller [11]. ...

On-Device Customization of Tiny Deep Learning Models for Keyword Spotting With Few Examples

IEEE Micro