Christian Theobalt’s research while affiliated with Max Planck Center for Visual Computing and Communication and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (599)


Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures
  • Preprint

December 2024 · 2 Reads

Guoxing Sun · Rishabh Dabral · [...]

Real-time free-view human rendering from sparse-view RGB inputs is a challenging task due to sensor scarcity and the tight time budget. To ensure efficiency, recent methods leverage 2D CNNs operating in texture space to learn rendering primitives. However, they either jointly learn geometry and appearance, or completely ignore sparse image information for geometry estimation, significantly harming visual quality and robustness to unseen body poses. To address these issues, we present Double Unprojected Textures, which at its core disentangles coarse geometric deformation estimation from appearance synthesis, enabling robust and photorealistic 4K rendering in real time. Specifically, we first introduce a novel image-conditioned template deformation network, which estimates the coarse deformation of the human template from a first unprojected texture. This updated geometry is then used to apply a second and more accurate texture unprojection. The resulting texture map has fewer artifacts and better alignment with input views, which benefits our learning of finer-level geometry and appearance represented by Gaussian splats. We validate the effectiveness and efficiency of the proposed method in quantitative and qualitative experiments, where it significantly surpasses other state-of-the-art methods.
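
The two-pass unprojection described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the projection math, network sizes, and the 11-channel Gaussian parameterization are placeholder assumptions, and the re-projection after the coarse deformation is mocked by perturbing the texel-to-pixel correspondences.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureCNN(nn.Module):
    """Tiny stand-in for a texture-space 2D CNN (hypothetical)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(cin, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, cout, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

def unproject(images, uv_to_pixel):
    """Gather image colors into texture (UV) space.
    images:      (V, 3, H, W) sparse-view RGB frames
    uv_to_pixel: (V, Ht, Wt, 2) normalized image coordinates per texel,
                 assumed given by calibration and the current geometry."""
    texels = [F.grid_sample(img[None], uv[None], align_corners=True)
              for img, uv in zip(images, uv_to_pixel)]
    return torch.cat(texels, dim=1)                  # (1, 3*V, Ht, Wt)

views = torch.rand(4, 3, 256, 256)                   # four sparse input views
uv1 = torch.rand(4, 128, 128, 2) * 2 - 1             # correspondences from the template geometry
tex1 = unproject(views, uv1)                         # first unprojected texture

deform_net = TextureCNN(cin=12, cout=3)              # predicts coarse per-texel template offsets
offsets = deform_net(tex1)

# The deformed geometry would normally be re-projected to obtain new texel-to-pixel
# correspondences; here that step is mocked by perturbing the old ones.
uv2 = (uv1 + 0.01 * offsets[:, :2].permute(0, 2, 3, 1).expand(4, -1, -1, -1)).clamp(-1, 1)
tex2 = unproject(views, uv2)                         # second, better-aligned texture

gaussian_net = TextureCNN(cin=12, cout=11)           # e.g. offset, scale, rotation, opacity, color
gaussian_params = gaussian_net(tex2)
print(gaussian_params.shape)                         # (1, 11, 128, 128): one Gaussian per texel
```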


Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis

December 2024 · 6 Reads

Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. We therefore present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.
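
As a rough illustration of training-free retrieval guidance, the toy sampler below blends a retrieved exemplar clip into the denoising trajectory with a user-controlled strength. The denoiser, the retrieval database, all shapes, and all names are hypothetical stand-ins; the actual method uses DDIM inversion of real exemplar gestures rather than the crude blending shown here.

```python
import torch
import torch.nn.functional as F

def retrieve_exemplar(speech_features, database):
    """Return the motion clip whose key is most similar to the speech features."""
    keys = torch.stack([key for key, _ in database])
    sims = F.cosine_similarity(keys, speech_features[None], dim=1)
    return database[int(sims.argmax())][1]

def denoiser(x_t, t):
    """Stand-in for a trained gesture diffusion model (predicts a clean clip)."""
    return x_t * (1.0 - t)

def sample_with_retrieval_guidance(exemplar, steps=50, guidance=0.3):
    x = torch.randn_like(exemplar)
    for i in reversed(range(steps)):
        t = i / steps
        x0_pred = denoiser(x, t)
        # Retrieval guidance: pull the prediction toward the exemplar. `guidance`
        # is the user-controlled strength (0 = ignore exemplar, 1 = copy it).
        x0_pred = (1 - guidance) * x0_pred + guidance * exemplar
        x = x0_pred + t * torch.randn_like(x)        # crude re-noising step
    return x

database = [(torch.randn(64), torch.randn(120, 57)) for _ in range(10)]  # (linguistic key, motion)
speech_features = torch.randn(64)
exemplar = retrieve_exemplar(speech_features, database)
gesture = sample_with_retrieval_guidance(exemplar)
print(gesture.shape)                                 # (120, 57): frames x pose parameters
```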


Dynamic EventNeRF: Reconstructing General Dynamic Scenes from Multi-view Event Cameras

December 2024

Volumetric reconstruction of dynamic scenes is an important problem in computer vision. It is especially challenging in poor lighting and with fast motion, partly due to the limitations of RGB cameras: to capture fast motion without much blur, the frame rate must be increased, which in turn requires more lighting. In contrast, event cameras, which record changes in pixel brightness asynchronously, are much less dependent on lighting, making them more suitable for recording fast motion. We hence propose the first method to spatiotemporally reconstruct a scene from sparse multi-view event streams and sparse RGB frames. We train a sequence of cross-faded time-conditioned NeRF models, one per short recording segment. The individual segments are supervised with a set of event- and RGB-based losses and sparse-view regularisation. We assemble a real-world multi-view camera rig with six static event cameras around the object and record a benchmark multi-view event stream dataset of challenging motions. Our work outperforms RGB-based baselines, producing state-of-the-art results, and opens up the topic of multi-view event-based reconstruction as a new path for fast scene capture beyond RGB cameras. The code and the data will be released soon at https://4dqv.mpi-inf.mpg.de/DynEventNeRF/
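
The cross-fading of per-segment, time-conditioned models can be sketched as below. The tiny MLPs, the segment length, and the overlap window are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TimeConditionedNeRF(nn.Module):
    """Dummy time-conditioned radiance model for one short segment."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 4))
    def forward(self, xyz, t):
        t_col = torch.full_like(xyz[:, :1], t)
        return self.mlp(torch.cat([xyz, t_col], dim=1))   # (N, 4): rgb + density

def query_crossfaded(models, segment_len, overlap, xyz, t):
    """Blend the segment model covering t with its predecessor near a boundary."""
    idx = min(int(t // segment_len), len(models) - 1)
    out = models[idx](xyz, t)
    local = t - idx * segment_len
    if idx > 0 and local < overlap:
        w = local / overlap                # 0 -> previous model, 1 -> current model
        out = (1 - w) * models[idx - 1](xyz, t) + w * out
    return out

models = [TimeConditionedNeRF() for _ in range(4)]     # four short recording segments
xyz = torch.rand(1024, 3)
rgb_sigma = query_crossfaded(models, segment_len=1.0, overlap=0.2, xyz=xyz, t=2.1)
print(rgb_sigma.shape)                                 # (1024, 4)
```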


Figure 3 (hand representation): each frame of the hand pose is parameterized by J surface keypoints sampled from the surface of the hand; in addition to position, the direction vector from each keypoint to the nearest object surface is used as an additional feature.
BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects
  • Preprint
  • File available

December 2024 · 2 Reads

We present BimArt, a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. Unlike prior works, we do not rely on a reference grasp, a coarse hand trajectory, or separate modes for grasping and articulating. To achieve this, we first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation, revealing rich bimanual patterns for manipulation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation. Our work offers key insights into feature representation and contact prior for articulated objects, demonstrating their effectiveness in taming the complex, high-dimensional space of bimanual hand-object interactions. Through comprehensive quantitative experiments, we demonstrate a clear step towards simplified and high-quality hand-object animations that excel over the state-of-the-art in motion quality and diversity.
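
A minimal sketch of the two-stage cascade (a contact prior followed by a contact-conditioned motion generator) is given below, with placeholder MLPs and assumed feature sizes; the actual networks, object features, and hand parameterization differ.

```python
import torch
import torch.nn as nn

T, P = 60, 512              # frames, sampled object surface points
OBJ_FEAT = 7                # per-frame object pose + articulation state (assumed)
HAND_DOF = 2 * 61           # two-hand parameterization (assumed)

# Stage 1: object trajectory -> distance-based contact maps (one per hand).
contact_prior = nn.Sequential(
    nn.Linear(OBJ_FEAT, 128), nn.ReLU(), nn.Linear(128, 2 * P))
# Stage 2: contact maps + object trajectory -> bimanual hand motion.
motion_generator = nn.Sequential(
    nn.Linear(OBJ_FEAT + 2 * P, 256), nn.ReLU(), nn.Linear(256, HAND_DOF))

object_traj = torch.randn(T, OBJ_FEAT)
contact_maps = contact_prior(object_traj)              # (T, 2*P)
hand_motion = motion_generator(torch.cat([object_traj, contact_maps], dim=1))
print(hand_motion.shape)                               # (T, 122)
```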


Figure 4 (converting 2D contact regions to 3D contact points): rays emitted from contact-mask pixels intersect the object and hand geometries; contact-point candidates are constrained to the extremal ray intersections, i.e. the nearest or farthest points relative to the camera for the object and the palmar-side extremal points for the hand.
Figure 5 (ablation studies): outputs are examined at each processing stage, showing progressive improvements from the reconstruction results of foundational models through Camera Setup, HOI Contact Alignment, and Hand Parameter Refinement.
Figure 8: given an input image, a predefined prompt is used to reason about the segmentation of the hand and the object.
EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

November 2024 · 33 Reads

Our work aims to reconstruct hand-object interactions from a single-view image, which is a fundamental but ill-posed task. Unlike methods that reconstruct from videos, multi-view images, or predefined 3D templates, single-view reconstruction faces significant challenges due to inherent ambiguities and occlusions. These challenges are further amplified by the diverse nature of hand poses and the vast variety of object shapes and sizes. Our key insight is that current foundational models for segmentation, inpainting, and 3D reconstruction robustly generalize to in-the-wild images, which could provide strong visual and geometric priors for reconstructing hand-object interactions. Specifically, given a single image, we first design a novel pipeline to estimate the underlying hand pose and object shape using off-the-shelf large models. Furthermore, with the initial reconstruction, we employ a prior-guided optimization scheme, which optimizes hand pose to comply with 3D physical constraints and the 2D input image content. We perform experiments across several datasets and show that our method consistently outperforms baselines and faithfully reconstructs a diverse set of hand-object interactions. Project page: https://lym29.github.io/EasyHOI-page/
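
The prior-guided optimization stage can be illustrated with a toy refinement loop: starting from initial hand and object estimates, a few physically motivated terms (contact, anti-penetration, closeness to the prior) are minimized. The geometry, loss terms, and weights below are simplified placeholders, not the paper's formulation.

```python
import torch

object_pts = torch.rand(2000, 3)              # object surface points from the 3D prior
hand_pts_init = torch.rand(778, 3) + 0.5      # initial hand vertices from the hand prior
offset = torch.zeros(3, requires_grad=True)   # simplistic refinement variable (global shift)
opt = torch.optim.Adam([offset], lr=1e-2)

for step in range(200):
    hand_pts = hand_pts_init + offset
    d = torch.cdist(hand_pts, object_pts)                  # (778, 2000) pairwise distances
    nearest = d.min(dim=1).values                          # hand-to-object distances
    loss_contact = nearest.min()                           # encourage at least one touching point
    loss_penetration = torch.relu(0.01 - nearest).mean()   # crude stand-in; a real term needs signed distances
    loss_prior = offset.pow(2).sum()                       # stay close to the initial estimate
    loss = loss_contact + 10.0 * loss_penetration + 0.1 * loss_prior
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss_contact))                                 # contact distance after refinement
```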



GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations

November 2024 · 11 Reads · 1 Citation

ACM Transactions on Graphics

Real-time rendering of human head avatars is a cornerstone of many computer graphics applications, such as augmented reality, video games, and films, to name a few. Recent approaches address this challenge with computationally efficient geometry primitives in a carefully calibrated multi-view setup. Albeit producing photorealistic head renderings, they often fail to represent complex motion changes, such as the mouth interior and strongly varying head poses. We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real time. At the core of our method is a hierarchical representation of head models that can capture the complex dynamics of facial expressions and head movements. First, with rich facial features extracted from raw input frames, we learn to deform the coarse facial geometry of the template mesh. We then initialize 3D Gaussians on the deformed surface and refine their positions in a fine step. We train this coarse-to-fine facial avatar model along with the head pose as learnable parameters in an end-to-end framework. This enables not only controllable facial animation via video inputs but also high-fidelity novel view synthesis of challenging facial expressions, such as tongue deformations and fine-grained teeth structure under large motion changes. Moreover, it encourages the learned head avatar to generalize towards new facial expressions and head poses at inference time. We demonstrate the performance of our method with comparisons against the related methods on different datasets, spanning challenging facial expression sequences across multiple identities. We also show the potential application of our approach by demonstrating a cross-identity facial performance transfer application. We make the code available on our project page.
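
The coarse-to-fine layering described above can be sketched as follows, assuming tiny placeholder networks: an image-driven code deforms a template mesh, Gaussians are initialized on the deformed surface, and a fine network refines their positions. Shapes and network sizes are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

V = 5000                                           # template mesh vertices
template = torch.rand(V, 3)

encoder = nn.Sequential(                           # raw frame -> expression/pose code
    nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 128))
coarse_deform = nn.Linear(128, V * 3)              # code -> per-vertex template offsets
fine_refine = nn.Sequential(                       # per-Gaussian position refinement
    nn.Linear(3 + 128, 64), nn.ReLU(), nn.Linear(64, 3))

frame = torch.rand(3, 64, 64)                      # one (downsampled) input view
code = encoder(frame.flatten()[None])              # (1, 128)

deformed = template + coarse_deform(code).view(V, 3)   # coarse facial geometry
gaussian_centers = deformed                             # initialize Gaussians on the surface
delta = fine_refine(torch.cat([gaussian_centers, code.expand(V, -1)], dim=1))
gaussian_centers = gaussian_centers + delta             # fine refinement step
print(gaussian_centers.shape)                           # (5000, 3)
```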



Citations (55)


... IDMRF Loss. Following prior works [4,11,81], we additionally use the IDMRF Loss [76] for the perceptual regularization and encourage high-frequency details. ...

Reference: Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures

EgoAvatar: Egocentric View-Driven and Photorealistic Full-body Avatars
  • Citing Conference Paper
  • December 2024

... Moreover, these methods generally fail under sparse camera assumptions, i.e. four or less cameras, due to inherent ambiguities and lack of observation. To solve such ambiguities, some works [66,75,86] learn priors from data, and perform expensive fine-tuning on novel images, making them inappropriate for live inference. Other works [34,50,54,61] efficiently predict rendering primitives in 2D texture space relying on a texture unprojection step [61]. ...

MetaCap: Meta-learning Priors from Multi-view Imagery for Sparse-View Human Performance Capture and Rendering
  • Citing Chapter
  • October 2024

... Image-based relighting aims to alter the lighting in photographs post-capture. Specialized methods have been proposed for relighting isolated objects [18,21,30,51,62,64,66,72,79], human portraits [14,26,31,47,49,52,54,59], human bodies [12,46,63], outdoor scenes [23,32,56,68,71], and indoor scenes [6,33,45,74,78]. The focus of our work, indoor scene relighting, is especially challenging due to the mixture of natural and artificial light sources, occlusions, and intricate light interactions in a cluttered scene creating cast shadows, strong highlights, and interreflections. ...

Relightable Neural Actor with Intrinsic Decomposition and Pose Control
  • Citing Chapter
  • November 2024

... By employing a color module that modulates the color of the Gaussians based on position and the output of a style encoder, Saroha et al. can stylize a pretrained 3DGS to an arbitrary style. Liu et al. [35] embed VGG features onto the Gaussians and use an AdaIN layer for the style transfer. While this approach is similar in spirit to ours, they only consider static scenes. ...

StyleGaussian: Instant 3D Style Transfer with Gaussian Splatting
  • Citing Conference Paper
  • November 2024

... Multi-view Personalized Avatars. Volumetric primitives, combined with multi-view training, are highly effective for modeling human heads [22,27,30,52,58,59,70,82] as they capture intricate details like hair and subsurface scattering [53], outperforming traditional textured meshes. VolTeMorph [20] embeds a NeRF within tetrahedral cages that guide volumetric deformation. ...

GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations
  • Citing Article
  • November 2024

ACM Transactions on Graphics

... Early approaches generated motions based on action categories [20]–[23], past motions [24]–[29], trajectories [30]–[34], and scene context [35]–[45]. Recent works have enabled direct generation of human motions from textual inputs [14], [46]–[66], extending to multi-person [67]–[69] and human-scene interactions [41], [70], [71]. However, generating collaborative human-object-human interactions remains largely unexplored. ...

REMOS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions
  • Citing Chapter
  • October 2024

... Moreover, optimization-based methods usually take hours to generate a single object, which hinders its application in real life. On the other hand, feed-forward methods [33,62,67,70,74] directly learn diffusion model for 3D representations, such as points [37,68], voxels [45], meshes [33,65], and implicit neural representations [20,36,51]. These methods can generate 3D objects in seconds. ...

Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models
  • Citing Chapter
  • October 2024

... It takes global image features to predict the skeletal motion and template deformation. DDC [20] further develops this idea to a motion-conditioned deformable avatar, widely used in animatable avatars [21,33,94]. Meanwhile, a series of neural implicit methods [16,71,74,79] also deform canonical SDF fields with motion conditions. ...

TriHuman: A Real-time and Controllable Tri-plane Representation for Detailed Human Geometry and Appearance Synthesis
  • Citing Article
  • September 2024

ACM Transactions on Graphics

... Neural Human Modeling. In the domain of digital human neural representation, various approaches [Lin et al. 2022; Shetty et al. 2024; Sun et al. 2021; Suo et al. 2021; Xiang et al. 2022] have been proposed to address this challenge. A collection of studies [Pumarola et al. 2021; Tretschk et al. 2021; Xian et al. 2021] model time as an additional latent variable in the NeRF's MLP. ...

Holoported Characters: Real-Time Free-Viewpoint Rendering of Humans from Sparse RGB Cameras
  • Citing Conference Paper
  • June 2024

... However, applying image diffusion models to generate multi-view images separately poses significant challenges in maintaining consistency across different views. To address multi-view inconsistency, multi-view attentions and camera pose controls are adopted to fine-tune pre-trained image diffusion models, enabling the simultaneous synthesis of multi-view images [Long et al. 2024; Shi et al. 2023a,b], though these methods might result in compromised geometric consistency due to the lack of inherent 3D biases. To ensure both global semantic consistency and detailed local alignment in multi-view diffusion models, 3D-adapters [Chen et al. 2024a] propose a plug-in module designed to infuse 3D geometry awareness. ...

Wonder3D: Single Image to 3D Using Cross-Domain Diffusion
  • Citing Conference Paper
  • June 2024