Thomas Brox’s research while affiliated with University of Freiburg and other places


Publications (407)


Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models
  • Preprint

May 2025 · 4 Reads

Sassan Mokhtar · [...] · Thomas Brox

Recent advances in visual industrial anomaly detection have demonstrated exceptional performance in identifying and segmenting anomalous regions while maintaining fast inference speeds. However, anomaly classification (distinguishing different types of anomalies) remains largely unexplored despite its critical importance in real-world inspection tasks. To address this gap, we propose VELM, a novel LLM-based pipeline for anomaly classification. Given the critical importance of inference speed, we first apply an unsupervised anomaly detection method as a vision expert to assess the normality of an observation. If an anomaly is detected, the LLM then classifies its type. A key challenge in developing and evaluating anomaly classification models is the lack of precise annotations of anomaly classes in existing datasets. To address this limitation, we introduce MVTec-AC and VisA-AC, refined versions of the widely used MVTec-AD and VisA datasets, which include accurate anomaly class labels for rigorous evaluation. Our approach achieves a state-of-the-art anomaly classification accuracy of 80.4% on MVTec-AD, exceeding the prior baselines by 5%, and 84% on MVTec-AC, demonstrating the effectiveness of VELM in understanding and categorizing anomalies. We hope our methodology and benchmark inspire further research in anomaly classification, helping bridge the gap between detection and comprehensive anomaly characterization.
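As a rough illustration of the detect-then-classify flow described in the abstract, the sketch below gates an expensive LLM call behind a fast unsupervised detector. The detector, LLM client, prompt, and class names are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the detect-then-classify flow: a fast unsupervised
# anomaly detector gates a slower LLM-based classifier. `detect_anomaly` and
# `query_llm` are placeholder callables, not the authors' code.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class InspectionResult:
    is_anomalous: bool
    anomaly_class: Optional[str] = None

def classify_observation(
    image,
    detect_anomaly: Callable[[object], float],  # returns an anomaly score in [0, 1]
    query_llm: Callable[[object, str], str],    # returns a predicted anomaly class
    candidate_classes: List[str],
    threshold: float = 0.5,
) -> InspectionResult:
    """Run the cheap vision expert first; only call the LLM if an anomaly is found."""
    if detect_anomaly(image) < threshold:
        return InspectionResult(is_anomalous=False)
    prompt = (
        "The image shows a defective industrial part. "
        f"Choose the defect type from: {', '.join(candidate_classes)}."
    )
    return InspectionResult(is_anomalous=True, anomaly_class=query_llm(image, prompt))
```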


Figure: We create a dataset for vision-language pretraining: first, we extract entities from knowledge graphs, then generate attributes and natural types for them. We search for different combinations of entities, attributes, and types in image search engines and collect alt texts for each image. Finally, we train our model on the combined data.
Figure: We demonstrate how to harvest datasets for training CLIP models with an improved quality-cost trade-off, either for a generic domain (top) or an expert domain (bottom).
Using Knowledge Graphs to harvest datasets for efficient CLIP model training
  • Preprint
  • File available

May 2025 · 3 Reads

Training high-quality CLIP models typically requires enormous datasets, which limits the development of domain-specific models, especially in areas that even the largest CLIP models do not cover well, and drives up training costs. This poses challenges for scientific research that needs fine-grained control over the training procedure of CLIP models. In this work, we show that by employing smart web search strategies enhanced with knowledge graphs, a robust CLIP model can be trained from scratch with considerably less data. Specifically, we demonstrate that an expert foundation model for living organisms can be built using just 10M images. Moreover, we introduce EntityNet, a dataset comprising 33M images paired with 46M text descriptions, which enables the training of a generic CLIP model in significantly reduced time.
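To make the harvesting pipeline in the figure caption above more concrete, here is a minimal sketch of turning knowledge-graph entities, attributes, and types into image-search queries. The entity data and query format are invented for illustration and do not reflect the actual EntityNet construction.

```python
# Illustrative sketch: combine knowledge-graph entities with generated
# attributes and natural types into web image-search queries. The example
# entity and query format are assumptions, not the EntityNet recipe.
from itertools import product
from typing import Dict, List

def build_search_queries(entities: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """`entities` maps an entity name to its lists of 'types' and 'attributes'."""
    queries = []
    for name, meta in entities.items():
        for typ, attr in product(meta.get("types", [""]), meta.get("attributes", [""])):
            queries.append(" ".join(part for part in (attr, name, typ) if part))
    return queries

queries = build_search_queries(
    {"monarch butterfly": {"types": ["insect"], "attributes": ["on a flower", "migrating"]}}
)
# Each query is sent to an image search engine; the returned images and their
# alt texts form (image, text) pairs for CLIP training.
```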




Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters

March 2025 · 7 Reads

LiDAR semantic segmentation models are typically trained from random initialization as universal pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird's-eye-view LiDAR encoding mechanisms, which we combine through a novel 2D-3D adapter. While the range-view features are processed through a frozen image backbone, our bird's-eye-view branch enhances them through multiple cross-attention interactions. Thereby, we continuously improve the vision network with domain-dependent knowledge, resulting in a strong label-efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods in small data regimes. We make the code and models publicly available at: http://balvit.cs.uni-freiburg.de.
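The cross-attention idea described above could look roughly like the following: learnable bird's-eye-view tokens attend to range-view features from a frozen image backbone. Shapes, dimensions, and module choices are illustrative assumptions, not the released BALViT code.

```python
# Minimal sketch (assumed shapes/modules) of a 2D-3D adapter: learnable BEV
# tokens are enriched by cross-attending to frozen range-view features.
import torch
import torch.nn as nn

class BEVCrossAttentionAdapter(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, num_bev_tokens: int = 1024):
        super().__init__()
        self.bev_tokens = nn.Parameter(torch.randn(1, num_bev_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, range_view_feats: torch.Tensor) -> torch.Tensor:
        # range_view_feats: (B, N, dim) tokens from a frozen image backbone.
        bev = self.bev_tokens.expand(range_view_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(query=bev, key=range_view_feats, value=range_view_feats)
        return self.norm(bev + fused)  # enriched BEV features for a segmentation head

# Only the adapter (and a downstream head) would be trained; the image backbone
# that produces `range_view_feats` stays frozen.
```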


Realistic Evaluation of Deep Active Learning for Image Classification and Semantic Segmentation

February 2025 · 47 Reads

International Journal of Computer Vision

Active learning aims to reduce the high labeling cost involved in training machine learning models on large datasets by efficiently labeling only the most informative samples. Recently, deep active learning has shown success on various tasks. However, the conventional evaluation schemes are either incomplete or below par. This study critically assesses various active learning approaches, identifying key factors essential for choosing the most effective active learning method. It includes a comprehensive guide to obtain the best performance for each case in image classification and semantic segmentation. For image classification, the AL methods improve by a large margin when integrated with data augmentation and semi-supervised learning, but barely perform better than the random baseline. In this work, we evaluate them under more realistic settings and propose a more suitable evaluation protocol. For semantic segmentation, previous academic studies focused on diverse datasets with substantial annotation resources. In contrast, data collected in many driving scenarios is highly redundant, and most medical applications are subject to very constrained annotation budgets. The study evaluates active learning techniques under various conditions, including data redundancy, the use of semi-supervised learning, and differing annotation budgets. As an outcome of our study, we provide a comprehensive usage guide to obtain the best performance for each case.
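For readers unfamiliar with the setup being evaluated, the sketch below shows a generic pool-based active-learning loop; the model, acquisition function, and budget values are placeholders rather than any specific method from the study.

```python
# Generic pool-based active-learning loop (placeholder model and acquisition
# function); budgets and round counts are arbitrary example values.
import random
from typing import Callable, List, Set

def active_learning_loop(
    pool: List[object],
    train: Callable[[List[int]], object],   # trains a model on the labeled indices
    score: Callable[[object, int], float],  # informativeness of one pool sample
    rounds: int = 5,
    budget_per_round: int = 100,
    seed_size: int = 100,
) -> Set[int]:
    labeled: Set[int] = set(random.sample(range(len(pool)), seed_size))
    for _ in range(rounds):
        model = train(sorted(labeled))
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        # Query the samples the acquisition function deems most informative.
        unlabeled.sort(key=lambda i: score(model, i), reverse=True)
        labeled.update(unlabeled[:budget_per_round])
    return labeled
```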



When and How Does CLIP Enable Domain and Compositional Generalization?

February 2025 · 3 Reads

The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires learning of shared representations already in intermediate layers and shared circuitry.
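A toy version of the controlled training distributions described above might be built as follows: a fully held-out domain probes domain generalization, while held-out (domain, class) pairs inside partially seen domains probe compositional generalization. Domain and class names here are invented examples, not the paper's actual splits.

```python
# Toy construction of controlled splits (invented domain/class names):
# a fully held-out domain tests domain generalization; held-out (domain, class)
# pairs within seen domains test compositional generalization.
from itertools import product

domains = ["photo", "sketch", "painting", "cartoon"]
classes = ["dog", "car", "chair", "bird"]

held_out_domain = "cartoon"                                   # never seen in training
held_out_pairs = {("sketch", "bird"), ("painting", "chair")}  # unseen class-domain combos

train_cells = [
    (d, c)
    for d, c in product(domains, classes)
    if d != held_out_domain and (d, c) not in held_out_pairs
]
test_domain_generalization = [(held_out_domain, c) for c in classes]
test_compositional_generalization = sorted(held_out_pairs)
```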


Figure 4. Qualitative Novel View Synthesis Comparison on SEED4D Test Set. Comparison of large-baseline novel view synthesis under sparse observation conditions. Six ego-centric input frames (top row) with limited overlap serve as reference views. We evaluate each method's ability to reconstruct exo-centric views with a large offset from the input views.
Figure 5. Qualitative Novel View Synthesis Comparison on nuScenes Test Set. Visualization of multi-view synthesis results using six reference views captured at t=0. We compare novel views reconstructed at temporal differences of TD=2, 3, and 4 (1 s, 1.5 s, and 2 s, respectively).
sshELF: Single-Shot Hierarchical Extrapolation of Latent Features for 3D Reconstruction from Sparse-Views

February 2025 · 25 Reads

Reconstructing unbounded outdoor scenes from sparse outward-facing views poses significant challenges due to minimal view overlap. Previous methods often lack cross-scene understanding and their primitive-centric formulations overload local features to compensate for missing global context, resulting in blurriness in unseen parts of the scene. We propose sshELF, a fast, single-shot pipeline for sparse-view 3D scene reconstruction via hierarchical extrapolation of latent features. Our key insight is that disentangling information extrapolation from primitive decoding allows efficient transfer of structural patterns across training scenes. Our method: (1) learns cross-scene priors to generate intermediate virtual views to extrapolate to unobserved regions, (2) offers a two-stage network design separating virtual view generation from 3D primitive decoding for efficient training and modular model design, and (3) integrates a pre-trained foundation model for joint inference of latent features and texture, improving scene understanding and generalization. sshELF can reconstruct 360-degree scenes from six sparse input views and achieves competitive results on synthetic and real-world datasets. We find that sshELF faithfully reconstructs occluded regions, supports real-time rendering, and provides rich latent features for downstream applications. The code will be released.
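Schematically, the two-stage design described in the abstract separates latent-feature extrapolation from primitive decoding, roughly as in the placeholder sketch below; all module names and shapes are assumptions, not the released implementation.

```python
# Schematic two-stage pipeline (all submodules are placeholders): a backbone
# encodes the sparse input views, a stage-1 network extrapolates latents to
# virtual views, and a stage-2 decoder turns all latents into 3D primitives.
import torch
import torch.nn as nn

class TwoStageSparseViewReconstructor(nn.Module):
    def __init__(self, backbone: nn.Module, view_extrapolator: nn.Module, primitive_decoder: nn.Module):
        super().__init__()
        self.backbone = backbone                    # pre-trained foundation model
        self.view_extrapolator = view_extrapolator  # stage 1: virtual-view latents
        self.primitive_decoder = primitive_decoder  # stage 2: 3D primitive decoding

    def forward(self, sparse_views: torch.Tensor):
        # sparse_views: (B, V, C, H, W), e.g. six outward-facing input frames.
        b, v = sparse_views.shape[:2]
        latents = self.backbone(sparse_views.flatten(0, 1)).unflatten(0, (b, v))
        virtual_latents = self.view_extrapolator(latents)   # extrapolate to unseen regions
        return self.primitive_decoder(torch.cat([latents, virtual_latents], dim=1))
```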


What Matters for In-Context Learning: A Balancing Act of Look-up and In-Weight Learning

January 2025 · 1 Read

Large Language Models (LLMs) have demonstrated impressive performance in various tasks, including In-Context Learning (ICL), where the model performs new tasks by conditioning solely on the examples provided in the context, without updating the model's weights. While prior research has explored the roles of pretraining data and model architecture, the key mechanism behind ICL remains unclear. In this work, we systematically uncover properties present in LLMs that support the emergence of ICL. To disambiguate these factors, we conduct a study with a controlled dataset and data sequences using a deep autoregressive model. We show that conceptual repetitions in the data sequences are crucial for ICL, more so than previously indicated training data properties like burstiness or long-tail distribution. Conceptual repetitions could refer to n-gram repetitions in textual data or exact image copies in image sequence data. Such repetitions also offer other previously overlooked benefits such as reduced transiency in ICL performance. Furthermore, we show that the emergence of ICL depends on balancing the in-weight learning objective with the in-context solving ability during training.
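As a toy illustration of "conceptual repetition" as opposed to mere burstiness, the snippet below builds sequences in which an earlier exemplar (e.g. an exact image copy or a repeated n-gram) sometimes reappears verbatim; the data format is invented for illustration.

```python
# Toy sequence builder illustrating conceptual repetition: with some
# probability an exemplar that already occurred in the sequence is repeated
# exactly, instead of sampling a fresh exemplar of a concept.
import random

def make_sequence_with_repetition(concepts, length=16, repeat_prob=0.5):
    seq = []
    for _ in range(length):
        if seq and random.random() < repeat_prob:
            seq.append(random.choice(seq))              # exact repeat of an earlier item
        else:
            concept = random.choice(concepts)
            seq.append((concept, random.randint(0, 9))) # fresh exemplar id for a concept
    return seq

example = make_sequence_with_repetition(["cat", "dog", "car"])
```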


Citations (46)


... frequently face significant performance degradation when subjected to adversarial attacks, as illustrated in Figure 1 [13,62,73,75,83,92]. These attacks involve introducing subtle, imperceptible perturbations into the input data, misleading models to misclassify normal samples as anomalies or to overlook actual anomalies [2,26,54,74,91]. ...

Reference:

PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies
Anomaly Detection with Conditioned Denoising Diffusion Models
  • Citing Chapter
  • April 2025

... Although Vision transformers trained on large datasets for metric depth prediction can provide useful priors (Bhat et al., 2023;Yin et al., 2023;Ke et al., 2024), the resulting depth maps, when combined with pixel information, are inadequate for generating complete 3D representations (Guizilini et al., 2022;Yang et al., 2024d;Kästingschäfer et al., 2025). ...

SEED4D: A Synthetic Ego-Exo Dynamic 4D Data Generator, Driving Dataset and Benchmark
  • Citing Conference Paper
  • February 2025

... However, nonparametric approaches (including CoGap) tend to scale poorly with the dataset size [32]. Finally, there are one-shot imitation methods, based on Dynamical Systems (Elastic-DS) [33] and object-centric trajectories (DITTO) [34]. Yet, because they only leverage one demonstration, they need to make stronger assumptions regarding the trajectory distribution. ...

DITTO: Demonstration Imitation by Trajectory Transformation
  • Citing Conference Paper
  • October 2024

... As video generation becomes increasingly popular, various methods have emerged to evaluate its quality. Traditional video quality assessment techniques for user-generated content videos (Tu et al., 2021;Ging et al., 2024) and AIGC videos (Huang et al., 2024;Fan et al., 2024;Liu et al., 2023) heavily utilize computer vision methods, offering quantitative scores that partially capture the perceived quality of videos. However, these scores fall short of identifying areas of divergence from human preferences or areas needing enhancement. ...

Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

... Recently, a trend is to repurpose video [3,14] and image [28,34] diffusion models for generation of 3D scenes and objects, by generating multiple views from different camera poses [10,39,50], with or without given images as conditioning. In comparison to direct generation of 3D representations [6,24,30], such multi-view generative models can be trained on images and videos and their pixel-aligned representation allows for more efficient models and better scalability. However, they have only a weak to non-existent inductive bias to produce actually 3D-consistent results. ...

Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation
  • Citing Conference Paper
  • June 2024

... The method is particularly beneficial for preoperative functional cortical mapping in patients [11; 12; 13; 14; 15; 16; 51; 52; 53; 54] and for exploring higher-order cortical functions in cognitive neuroscience studies involving healthy participants [55]. In these applications, streamlining mapping approaches that reduce participation time while maintaining data quality are essential and may, for example, enable functional mapping of complex cognitive processes such as internal world models [56]. ...

Internal world models in humans, animals, and AI
  • Citing Article
  • August 2024

Neuron

... Further work suggests augmenting the RL state with Lagrange multipliers [186] or with the currently used safety budget [187]. Still, a common problem is that safe RL policies become either too conservative or, otherwise, may violate constraints [188]. Thus, combining MPC and RL is highly desirable for constraint satisfaction. ...

Constrained Reinforcement Learning with Smoothed Log Barrier Function

... Internal representations have recently attracted the attention of scientists not only in connection with the development of cognitive sciences [1,2], but also in the context of large language models [3,4]. Representation in a broad sense is how a particular object is presented in the space of internal states of the perceiving subject. ...

Internal world models in humans, animals, and AI
  • Citing Article
  • July 2024

Neuron

... Currently, there are multiple co-existing works aiming to address the source of this problem. In the literature, one can find approaches that minimize human intervention by selecting the most informative samples for labeling, like active learning [18,43], or approaches that use less precise labels to reduce the annotation effort, as proposed in weakly supervised learning [38]. It is also common to use model predictions with the highest confidence levels to automatically label previously unlabeled samples and then train and improve the previous model's performance. ...

Best Practices in Active Learning for Semantic Segmentation
  • Citing Chapter
  • March 2024

Lecture Notes in Computer Science

... Video Understanding remains a fundamental challenge in computer vision, encompassing tasks such as action recognition (Zhu et al., 2020), object localization (Fan et al., 2023;Bai et al., 2025), tracking (Zhao et al., 2023b), temporal grounding (Lin et al., 2023b), captioning (Wang et al., 2020;Bai et al., 2021), and, more recently, AI-Generated video detection (Ye et al., 2024). Video Large Language Models (Video-LLMs) (Tang et al., 2023), powered by Large Language Models (LLMs) (Zhao et al., 2023a), leverage language as a universal interface to facilitate a wide range of video-related tasks. ...

Object-Centric Multiple Object Tracking
  • Citing Conference Paper
  • October 2023