Adam Kortylewski’s research while affiliated with Max Planck Institute for Informatics and other places


Publications (117)


Escaping Plato's Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes
  • Preprint

March 2025

Nhi Pham · Bernt Schiele · Adam Kortylewski · Jonas Fischer

With the rise of neural networks, especially in high-stakes applications, these networks need two properties to ensure their safety: (i) robustness and (ii) interpretability. Recent advances in classifiers with 3D volumetric object representations have demonstrated greatly enhanced robustness on out-of-distribution data. However, these 3D-aware classifiers have not been studied from the perspective of interpretability. We introduce CAVE (Concept Aware Volumes for Explanations), a new direction that unifies interpretability and robustness in image classification. We design an inherently interpretable and robust classifier by extending existing 3D-aware classifiers with concepts extracted from their volumetric representations for classification. Across an array of quantitative interpretability metrics, we compare against concept-based approaches from the explainable AI literature and show that CAVE discovers well-grounded concepts that are used consistently across images, while achieving superior robustness.
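A minimal sketch of the general recipe the abstract describes: cluster the volumetric features of a 3D-aware classifier into concepts, then classify images from their concept activations. This is an illustration under my own assumptions, not the authors' CAVE implementation; all function names and parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def learn_concepts(volume_features: np.ndarray, n_concepts: int = 32) -> np.ndarray:
    """Cluster per-voxel feature vectors (N_voxels x D) into concept centroids."""
    km = KMeans(n_clusters=n_concepts, n_init=10, random_state=0)
    km.fit(volume_features)
    return km.cluster_centers_  # (n_concepts, D)

def concept_activations(image_features: np.ndarray, concepts: np.ndarray) -> np.ndarray:
    """Score a 2D feature map (H*W x D) against each concept by max cosine similarity."""
    f = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    c = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
    return (f @ c.T).max(axis=0)  # strongest evidence per concept

# An interpretable linear classifier then operates on concept activations only,
# so each prediction decomposes into contributions of named concepts:
# clf = LogisticRegression().fit(train_activations, train_labels)
```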




PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing
  • Preprint
  • File available

November 2024


Modeling a human avatar that deforms plausibly under articulation is an active area of research. We present PocoLoco -- the first template-free, point-based, pose-conditioned generative model for 3D humans in loose clothing. We motivate our work by noting that most methods require a parametric model of the human body to ground pose-dependent deformations. Consequently, they are restricted to modeling clothing that is topologically similar to the naked body and do not extend well to loose clothing. The few methods that attempt to model loose clothing typically require either canonicalization or a UV-parameterization, and must address the challenging problem of explicitly estimating correspondences for the deforming clothing. In this work, we formulate avatar clothing deformation as a conditional point-cloud generation task within the denoising diffusion framework. Crucially, our framework operates directly on unordered point clouds, eliminating the need for a parametric model or a clothing template. This also enables a variety of practical applications, such as point-cloud completion and pose-based editing -- important features for virtual human animation. As current datasets for human avatars in loose clothing are far too small for training diffusion models, we release a dataset of two subjects performing various poses in loose clothing, with a total of 75K point clouds. By contributing towards tackling the challenging task of effectively modeling loose clothing and expanding the available data for training these models, we aim to set the stage for further innovation in digital humans. The source code is available at https://github.com/sidsunny/pocoloco.
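The core formulation is a denoising diffusion objective defined directly on unordered point clouds, conditioned on pose. A minimal DDPM-style training-step sketch under my own assumptions; the model interface and noise schedule here are illustrative, not PocoLoco's actual architecture.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model: nn.Module, x0: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
    """x0: (B, N, 3) clothed-human point clouds; pose: (B, P) pose parameters."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a = alphas_bar.to(x0.device)[t].view(b, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise  # forward process q(x_t | x_0)
    pred = model(x_t, t, pose)                      # network predicts the added noise
    return nn.functional.mse_loss(pred, noise)
```

Because the loss is defined pointwise on unordered sets, no template mesh or UV layout is required, which is what allows the same model to cover loose clothing.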





OOD-CV-v2: An Extended Benchmark for Robustness to Out-of-Distribution Shifts of Individual Nuisances in Natural Images

September 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited: they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV-v2, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context, and weather conditions, and that enables benchmarking of models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: 1) some nuisance factors have a much stronger negative effect on performance than others, depending also on the vision task; 2) current approaches to enhance robustness have only marginal effects, and can even reduce robustness; 3) we do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich test bed to study robustness and will help push forward research in this area. Our dataset is publicly available at https://genintel.mpi-inf.mpg.de/ood-cv-v2.html.
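The benchmark's key feature is that test images isolate individual nuisance factors, so robustness can be reported per factor. A minimal evaluation-loop sketch; the loader layout and nuisance keys are my assumptions, not the benchmark's actual API.

```python
import torch

NUISANCES = ["pose", "shape", "texture", "context", "weather"]

@torch.no_grad()
def per_nuisance_accuracy(model, loaders_by_nuisance):
    """loaders_by_nuisance: dict mapping nuisance name -> DataLoader of (image, label) batches."""
    model.eval()
    results = {}
    for nuisance in NUISANCES:
        correct = total = 0
        for images, labels in loaders_by_nuisance[nuisance]:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
        results[nuisance] = correct / max(total, 1)
    return results  # per-factor accuracy, e.g. {"pose": 0.61, "weather": 0.48, ...}
```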


[Figures: ablations of the ControlNet conditioning scale; the windowed-root timestep annealing strategy over 10k iterations, where the timestep t is sampled within the shown window and, per Eq. 7, for t > k the scores of both the pre-trained and the personalized LDM are used, otherwise only the pre-trained LDM's; qualitative results showing text-guided edits and free-viewpoint renderings. (+2 more figures.)]

TEDRA: Text-based Editing of Dynamic and Photoreal Actors

August 2024


In recent years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans. However, a core remaining challenge is the fine-grained, user-friendly editing of clothing styles by means of textual descriptions. To this end, we present TEDRA, the first method allowing text-based edits of an avatar that maintains the avatar's high fidelity, space-time coherency, and dynamics, and enables skeletal pose and view control. We begin by training a model to create a controllable and high-fidelity digital replica of the real actor. Next, we personalize a pretrained generative diffusion model by fine-tuning it on various frames of the real character captured from different camera angles, ensuring the digital representation faithfully captures the dynamics and movements of the real person. This two-stage process lays the foundation for our approach to dynamic human avatar editing. Utilizing this personalized diffusion model, we modify the dynamic avatar based on a provided text prompt using our Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework. Additionally, we propose a timestep annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in functionality and visual quality.
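A minimal sketch of the timestep scheduling logic described in the figure captions above: the sampling window is annealed over training, and the personalized model's score is only mixed in for timesteps above a threshold k (paraphrasing the paper's Eq. 7). The linear window here is a stand-in for the paper's window-root schedule; all constants and the score functions are illustrative assumptions.

```python
import random

T_MAX, T_MIN, K = 980, 20, 200   # assumed diffusion range and threshold k
N_ITERS = 10_000                 # annealing horizon, per the figure caption

def sample_timestep(it: int, width: int = 150) -> int:
    """Sample t from a window whose upper edge moves toward T_MIN as training proceeds."""
    hi = int(T_MAX - (it / N_ITERS) * (T_MAX - T_MIN))
    lo = max(T_MIN, hi - width)
    return random.randint(lo, max(lo + 1, hi))

def guidance_score(t: int, score_pretrained, score_personalized):
    """Per Eq. 7 as paraphrased: both scores for large t, only the pre-trained LDM for small t."""
    if t > K:
        return 0.5 * (score_pretrained(t) + score_personalized(t))
    return score_pretrained(t)
```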



Citations (44)


... Recent methods [18,20,32,40] leverage the 3DGS representation for relighting due to its ability to reconstruct fine details and its interaction with CG engines. For human performance relighting, researchers [5,6,39,43,71,83] extend mesh-based and neural relighting methods by incorporating body pose priors [29,41]. However, avatars relying on skeletal priors struggle with complex clothing, wrinkles, and human-object interactions. ...

Reference:

BEAM: Bridging Physically-based Rendering and Gaussian Modeling for Relightable Volumetric Video
Relightable Neural Actor with Intrinsic Decomposition and Pose Control
  • Citing Chapter
  • November 2024

... Generative models [4, 13-15, 22, 23, 37, 38] have produced state-of-the-art results in tasks related to unconditional and conditional image generation. Inspired by their powerful image generation capabilities, many attempts have been made to utilize features from pretrained generative models for downstream tasks, as these features are known to contain richer information than the original RGB images [1,3,9,10,28,32,33,45,47,50,51,53]. However, neither utilizing tri-plane features nor utilizing such features for 3D-aware face editing has been explored. ...

DatasetNeRF: Efficient 3D-Aware Data Factory with Generative Radiance Fields
  • Citing Chapter
  • October 2024

... Recent research has shifted towards learning-based methods [7,25,38,50,82], which generate 3D content through feedforward networks, reducing the latency to a few seconds per output. Unfortunately, existing 3D datasets [16-18, 83, 127] are much smaller than those used in training text-to-image generation models, while the 3D data therein suffer from poor texture quality and inconsistent object poses [62]. Consequently, these approaches struggle to produce high-quality 3D outputs. ...

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
  • Citing Conference Paper
  • June 2024

... In parallel to our challenge of unsupervised prior learning from videos for pose estimation, unsupervised pose estimation leverages the abundant, unannotated visual information available in large image and video datasets to extract pose information (Hu & Ahuja, 2021;Sommer et al., 2024;Chen et al., 2019;He et al., 2022a;Schmidtke et al., 2021). The use of pose priors can provide valuable guidance in this process. ...

Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos
  • Citing Conference Paper
  • June 2024

... For instance, COCO-C [42] evaluates model performance by applying synthetic corruptions, such as JPEG compression and Gaussian noise, to the COCO test set. Similarly, OOD-CV [68] and its extended version, OOD-CV-v2 [69], include OOD examples across 10 object categories from PASCAL VOC and ImageNet, spanning variations in pose, shape, texture, context, and weather conditions. These datasets enable benchmarking across multiple tasks like image classification, object detection, and 3D pose estimation. ...

OOD-CV-v2: An Extended Benchmark for Robustness to Out-of-Distribution Shifts of Individual Nuisances in Natural Images
  • Citing Article
  • September 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... The method is particularly beneficial for preoperative functional cortical mapping in patients [11; 12; 13; 14; 15; 16; 51; 52; 53; 54] and for exploring higher-order cortical functions in cognitive neuroscience studies involving healthy participants [55]. In these applications, streamlined mapping approaches that reduce participation time while maintaining data quality are essential and may, for example, enable functional mapping of complex cognitive processes such as internal world models [56]. ...

Internal world models in humans, animals, and AI
  • Citing Article
  • August 2024

Neuron

... Internal representations have recently attracted the attention of scientists not only in connection with the development of cognitive sciences [1,2], but also in the context of large language models [3,4]. Representation in a broad sense is how a particular object is presented in the space of internal states of the perceiving subject. ...

Internal world models in humans, animals, and AI
  • Citing Article
  • July 2024

Neuron

... Animal pose estimation involves identifying the spatial coordinates of keypoints on an animal's body from visual input, with methods spanning both 2D [16,17,38] and 3D [10,30,36] domains. Despite recent progress, the scarcity of large-scale, high-quality datasets remains a major bottleneck, particularly for 2D pose estimation. ...

Robust Category-Level 3D Pose Estimation from Diffusion-Enhanced Synthetic Data
  • Citing Conference Paper
  • January 2024

... Attempts to close this performance gap have included data augmentation [16] and innovative architectural designs, such as the analysis-by-synthesis approach [23]. Along this line of research, neural mesh models have recently emerged as a family of models [36, 53-55] that learn a 3D pose-conditioned model of neural features and predict 3D pose and object class [20] by minimizing the reconstruction error between actual and rendered feature maps using render-and-compare. Such models have been shown to be significantly more robust to occlusions and OOD data. ...

Neural Textured Deformable Meshes for Robust Analysis-by-Synthesis
  • Citing Conference Paper
  • January 2024
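The render-and-compare idea from the excerpt above can be sketched as a small optimization loop: adjust pose parameters until features rendered from the neural mesh match the observed feature map. Here `render_features` stands in for a differentiable feature renderer and is an assumption, not any specific paper's API.

```python
import torch

def estimate_pose(render_features, target_feats: torch.Tensor, init_pose: torch.Tensor,
                  steps: int = 100, lr: float = 0.05) -> torch.Tensor:
    """target_feats: (C, H, W) features of the input image; init_pose: e.g. (azimuth, elevation, ...)."""
    pose = init_pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = render_features(pose)                 # (C, H, W), differentiable in pose
        loss = (rendered - target_feats).pow(2).mean()   # feature reconstruction error
        loss.backward()
        opt.step()
    return pose.detach()
```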

... Many studies have applied Convolutional Neural Networks (CNNs) and other deep architectures to extract occlusion-robust representations [35,62,66,76]. These approaches use deep models to capture complex patterns and variations in visual data, making the learned features resilient to occlusions; such features have proven valuable for many computer vision applications, including action recognition [17,88], pose estimation [62,95], and object detection [12,36]. The exploration of occlusion-robust representations in visual tracking has also demonstrated great success [1,6,27,34,39,58,59,61,94]. ...

3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation
  • Citing Conference Paper
  • October 2023