Bernt Schiele’s research while affiliated with Max Planck Institute for Informatics and other places


Publications (624)


Good Teachers Explain: Explanation-Enhanced Knowledge Distillation
  • Chapter
  • December 2024 · 5 Reads · 2 Citations
  • Amin Parchami-Araghi · Moritz Böhle · … · Bernt Schiele


Number it: Temporal Grounding Videos like Flipping Manga

November 2024 · 4 Reads

Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
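
The frame-numbering step at the core of NumPro can be pictured with a short sketch. The snippet below is a minimal illustration only, assuming frames arrive as RGB PIL images; the font, size, color, and placement actually used by NumPro are documented in the linked repository.

```python
# Minimal sketch (not the NumPro implementation): stamp a unique 1-based index
# onto every sampled frame so a Vid-LLM can "read" the timeline.
from PIL import ImageDraw, ImageFont

def number_frames(frames):
    """frames: list of RGB PIL.Image objects; returns numbered copies."""
    font = ImageFont.load_default()  # in practice a larger TrueType font is preferable
    numbered = []
    for idx, frame in enumerate(frames, start=1):
        frame = frame.copy()
        draw = ImageDraw.Draw(frame)
        w, h = frame.size
        # bottom-right placement and red color are illustrative choices only
        draw.text((w - 40, h - 20), str(idx), fill=(255, 0, 0), font=font)
        numbered.append(frame)
    return numbered
```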



B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable

November 2024 · 6 Reads

B-cos Networks have been shown to be effective for obtaining highly human-interpretable explanations of model decisions by architecturally enforcing stronger alignment between inputs and weights. B-cos variants of convolutional networks (CNNs) and vision transformers (ViTs), which primarily replace linear layers with B-cos transformations, perform competitively with their respective standard variants while also yielding explanations that are faithful by design. However, it has so far been necessary to train these models from scratch, which is increasingly infeasible in the era of large, pre-trained foundation models. In this work, inspired by the architectural similarities between standard DNNs and B-cos networks, we propose 'B-cosification', a novel approach to transform existing pre-trained models into inherently interpretable ones. We perform a thorough study of the design choices for this conversion, both for convolutional neural networks and vision transformers. We find that B-cosification can yield models that are on par with B-cos models trained from scratch in terms of interpretability, while often outperforming them in classification performance at a fraction of the training cost. Subsequently, we apply B-cosification to a pre-trained CLIP model and show that, even with limited data and compute, we obtain a B-cosified version that is highly interpretable and competitive in zero-shot performance across a variety of datasets. We release our code and pre-trained model weights at https://github.com/shrebox/B-cosification.
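
For intuition, a B-cos transformation scales each linear response by the alignment between the input and the corresponding weight vector. The sketch below is a simplified PyTorch illustration of that idea (a single linear layer with unit-norm weights and B = 2 by default); it is not the authors' reference implementation, which lives in the linked repository.

```python
# Simplified sketch of a B-cos linear layer: the linear response w_hat^T x is
# scaled by |cos(x, w_hat)|^(B-1), suppressing outputs of weight vectors that
# are poorly aligned with the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    def __init__(self, in_features, out_features, b=2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.b = b

    def forward(self, x):                                      # x: (..., in_features)
        w_hat = F.normalize(self.weight, dim=1)                # unit-norm weight rows
        linear = F.linear(x, w_hat)                            # w_hat^T x
        cos = linear / (x.norm(dim=-1, keepdim=True) + 1e-6)   # cosine of x and each w_hat
        return cos.abs().pow(self.b - 1) * linear              # |cos|^(B-1) * (w_hat^T x)
```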



TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

October 2024

Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.
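
The token-parameter attention described above can be sketched as follows. The normalization and the initialization of newly added parameter tokens are simplified assumptions here; the authors' implementation in the linked repository differs in those details.

```python
# Simplified sketch of token-parameter attention: input tokens act as queries,
# model parameters act as learnable key/value tokens, and capacity grows by
# appending new key/value pairs instead of resizing linear projections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParameterAttention(nn.Module):
    def __init__(self, dim, num_param_tokens):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)

    def forward(self, x):                                  # x: (batch, seq, dim)
        scores = x @ self.param_keys.t() / x.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)                   # TokenFormer uses a modified normalization
        return attn @ self.param_values                    # (batch, seq, dim)

    @torch.no_grad()
    def grow(self, extra_tokens):
        """Append new key/value parameter pairs (zero-initialized here)."""
        dim = self.param_keys.shape[1]
        device = self.param_keys.device
        new_k = torch.zeros(extra_tokens, dim, device=device)
        new_v = torch.zeros(extra_tokens, dim, device=device)
        # with the paper's non-softmax normalization, zero-init leaves the
        # pre-trained function unchanged; with plain softmax it is only approximate
        self.param_keys = nn.Parameter(torch.cat([self.param_keys, new_k]))
        self.param_values = nn.Parameter(torch.cat([self.param_values, new_v]))
```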





Citations (52)


... Furthermore, Cheng et al. [25] reinterpreted vanilla KD from a new perspective by defining and adding the knowledge of layer features in neural networks, explaining the success mechanism of knowledge distillation algorithms. Parchami-Araghi et al. [26] proposed a cosine-similarity loss that enhances the consistency between teacher and student explanations for the same sample, significantly improving both the accuracy of the student model and its consistency with the teacher. ...

Reference: Collaborative multi-knowledge distillation under the influence of softmax regression representation

Good Teachers Explain: Explanation-Enhanced Knowledge Distillation
  • Citing Chapter
  • December 2024
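
As a rough illustration of the explanation-matching idea the excerpt above describes, the sketch below combines a standard logit-distillation term with a cosine-similarity term between teacher and student explanations of the same sample, using plain input gradients as a stand-in explanation method. The explanation method and weighting used in the cited chapter differ; this only shows the loss structure.

```python
# Rough sketch of explanation-enhanced distillation: a standard KD term on
# softened logits plus a cosine-similarity term between teacher and student
# explanations of the same sample (input gradients used as a stand-in).
import torch
import torch.nn.functional as F

def input_gradient(model, x, class_idx, create_graph=False):
    """Explanation stand-in: gradient of the chosen class score w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    score = model(x).gather(1, class_idx[:, None]).sum()
    (grad,) = torch.autograd.grad(score, x, create_graph=create_graph)
    return grad

def explanation_kd_loss(student, teacher, x, temperature=4.0, lam=1.0):
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    # standard KD term on softened logits
    kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                  F.softmax(t_logits / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2
    # explanation-consistency term: cosine similarity of flattened explanations
    cls = t_logits.argmax(dim=1)
    e_t = input_gradient(teacher, x, cls).flatten(1)
    e_s = input_gradient(student, x, cls, create_graph=True).flatten(1)  # keep graph for backprop
    expl = 1.0 - F.cosine_similarity(e_s, e_t, dim=1).mean()
    return kd + lam * expl
```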

... OmniSplat shows the best trade-off compared to the original feed-forward networks for perspective images and to models with optimization designed for omnidirectional images. Recent developments have focused on feed-forward scene generation networks [2,4,29], which are capable of generating 3D Gaussian splatting (3DGS) representations directly from a few input images without scene-wise optimization [18]. These models estimate plausible 3D Gaussian parameters by leveraging priors learned from large-scale datasets and are more than 30 times faster than optimization-based methods. ...

LatentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction
  • Citing Chapter
  • November 2024

... Recent developments in foundation models (FMs) have shown the potential of Transformers (Vaswani et al., 2017) as a universal computational architecture. Thanks to their flexibility and scalability, Transformers have achieved state-of-the-art performance across various domains, including natural language processing (NLP) (Radford et al., 2018; Alec et al., 2019), visual modeling (Dosovitskiy et al., 2021; Liu et al., 2021), vision-language (Liu et al., 2023; Wang et al., 2024), graph representation (Ying et al., 2021), and 3D vision (Wang et al., 2023a;b). ...

GiT: Towards Generalist Vision Transformer Through Universal Language Interface
  • Citing Chapter
  • November 2024

... These key steps can then be used for visual instruction generation [53]. Recently, the improved capabilities of large language models [1,3,17,19,33] allowed for solely using video narrations to produce temporal captions, key steps, and instructions [36,37,50]. We build on these works to automatically obtain key steps from videos; however, in contrast to the related works, our dataset is constructed completely automatically, is composed of individual instruction frames instead of temporal intervals, and contains significantly fewer errors. ...

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
  • Citing Chapter
  • October 2024

... While CBMs can be trained end-to-end with supervision on both concepts and classes, their practical application is often limited by the need for human annotations of concepts, which can be costly and time-consuming. Newer CBM models such as Label-free CBM [28] and DN-CBM [33] allow mapping inputs to concepts without a training set of labeled concepts by leveraging the shared embedding space of vision-language models such as CLIP [32]. In histopathology, vision-language foundation models like PLIP [19] and CONCH [26] have recently been proposed and have shown promising results in various tasks. ...

Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
  • Citing Chapter
  • October 2024
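
The label-free concept-bottleneck idea mentioned in the excerpt above can be sketched with CLIP-style embeddings: each image is scored against a list of concept names in the shared embedding space, and only a small linear head over those scores is trained. The concept list, the open_clip backbone used as a stand-in for CLIP, and the class count below are illustrative assumptions, not the setup of Label-free CBM or DN-CBM.

```python
# Illustrative sketch of a label-free concept bottleneck: score each image
# against concept names via CLIP-style embeddings, then classify from the scores.
import torch
import torch.nn as nn
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

concepts = ["striped fur", "long beak", "metallic surface"]   # placeholder concepts
with torch.no_grad():
    concept_emb = model.encode_text(tokenizer(concepts))
    concept_emb = concept_emb / concept_emb.norm(dim=-1, keepdim=True)

classifier = nn.Linear(len(concepts), 10)   # concept scores -> 10 classes (assumed)

def predict(image_tensor):                  # image_tensor: preprocessed, shape (1, 3, H, W)
    with torch.no_grad():
        img_emb = model.encode_image(image_tensor)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        scores = img_emb @ concept_emb.T    # cosine similarity to each concept
    return classifier(scores)               # only this linear head needs training
```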

... Open-vocabulary 3D segmentation (OV3DS) methods use natural language descriptions to segment objects in 3D scenes. These approaches address both semantic segmentation [18,26] and instance segmentation [25,31,32,35]. Since OV3DS benchmarks provide both RGBD images and 3D point clouds, OV3DS methods are designed to fully exploit both 2D and 3D data. ...

Open-Vocabulary 3D Semantic Segmentation with Foundation Models
  • Citing Conference Paper
  • June 2024

... In this structure, all classes are represented as maximally separated points, effectively reducing the overlap between class features. While NC improves feature separation in intra-domain tasks [10]-[12], its performance diminishes in inter-domain tasks. As new tasks with significant domain differences are introduced, the projection onto the ETF degrades, disrupting the structure learned from earlier tasks and causing a performance drop across domains. ...

OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning
  • Citing Conference Paper
  • June 2024
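
The "maximally separated points" referenced in the excerpt above are typically realized as a simplex equiangular tight frame (ETF), in which all K class prototypes are unit-norm with identical pairwise cosine of -1/(K-1). A minimal construction is sketched below; how OrCo and related methods project features onto this structure is not shown.

```python
# Construction of a simplex equiangular tight frame (ETF): K unit-norm class
# prototypes in R^dim that are maximally and equally separated.
import torch

def simplex_etf(num_classes, dim):
    assert dim >= num_classes, "this construction needs at least K orthonormal directions"
    u, _ = torch.linalg.qr(torch.randn(dim, num_classes))     # orthonormal columns
    k = num_classes
    m = (k / (k - 1)) ** 0.5 * u @ (torch.eye(k) - torch.ones(k, k) / k)
    return m                                                  # columns are the prototypes

M = simplex_etf(num_classes=10, dim=64)
gram = M.t() @ M   # diagonal ~ 1, off-diagonal ~ -1/9
```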

... S4Former [10] introduces three novel modifications that add regularization at the image, feature, and output levels. For image regularization, it uses PatchShuffle, an augmentation technique explicitly designed for the self-attention of ViTs. ...

Training Vision Transformers for Semi-Supervised Semantic Segmentation
  • Citing Conference Paper
  • June 2024
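
A patch-shuffling augmentation of the kind the excerpt above attributes to S4Former can be sketched as below: the image is split into a grid of patches whose positions are randomly permuted. The patch size and where in the pipeline it is applied are assumptions here, not S4Former's settings.

```python
# Illustrative patch-shuffling augmentation: cut each image into a grid of
# patches and randomly permute their positions (same permutation across the batch).
import torch

def patch_shuffle(images, patch_size=16):
    """images: (B, C, H, W) with H and W divisible by patch_size."""
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # split into a (gh x gw) grid of patches
    patches = images.reshape(b, c, gh, patch_size, gw, patch_size)
    patches = patches.permute(0, 2, 4, 1, 3, 5).reshape(b, gh * gw, c, patch_size, patch_size)
    perm = torch.randperm(gh * gw, device=images.device)
    patches = patches[:, perm]                                  # shuffle patch positions
    # reassemble the shuffled grid back into an image
    patches = patches.reshape(b, gh, gw, c, patch_size, patch_size)
    return patches.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
```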

... Several authors [52][53][54][55][56] have considered the use of additional data for class-incremental learning by leveraging datasets of real images belonging to classes different from the ones that the model has to learn. Most of these approaches assume that the additional dataset is unlabeled, restricting the use of those additional data samples to the distillation loss [54][55][56]. ...

Wakening Past Concepts without Past Data: Class-Incremental Learning from Online Placebos
  • Citing Conference Paper
  • January 2024

... Although efforts have been made to express neural networks as a single matrix, existing methods face notable limitations. For instance, B-cos [9] requires specialized architectures that exclude biases, while FullGrad [53] accommodates biases but is restricted to networks with only ReLU or LeakyReLU activation functions. In contrast, OMENN addresses this, supporting a wider range of architectures and activation functions. ...

B-Cos Alignment for Inherently Interpretable CNNs and Vision Transformers
  • Citing Article
  • January 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence