Christian Theobalt

Max Planck Institute for Informatics, Saarbrücken, Saarland, Germany


Publications (119) · 59.18 Total Impact Points

  • ABSTRACT: This paper introduces compressed eigenfunctions of the Laplace-Beltrami operator on 3D manifold surfaces. They constitute a novel functional basis, called the compressed manifold basis, where each function has local support. We derive an algorithm, based on the alternating direction method of multipliers (ADMM), to compute this basis on a given triangulated mesh. We show that compressed manifold modes identify key shape features, yielding an intuitive understanding of the basis for a human observer, where a shape can be processed as a collection of parts. We evaluate compressed manifold modes for potential applications in shape matching and mesh abstraction. Our results show that this basis has distinct advantages over existing alternatives, indicating high potential for a wide range of use-cases in mesh processing.
    Computer Graphics Forum 08/2014; 33(5). · 1.64 Impact Factor
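The ADMM computation described above can be illustrated with a small hedged sketch. The splitting below (one auxiliary variable for the l1 sparsity term, one for the orthogonality constraint) follows the general compressed-modes formulation; the parameters `mu` and `rho`, the dense solver, and the path-graph Laplacian standing in for a mesh's Laplace-Beltrami operator are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def compressed_modes(L, k, mu=0.1, rho=1.0, iters=300):
    """ADMM sketch for compressed (sparse, spatially localized) eigenmodes:
        minimize  tr(Phi^T L Phi) + mu * ||Phi||_1   s.t.  Phi^T Phi = I
    using the splitting Phi = E (sparsity) and Phi = S (orthogonality)."""
    n = L.shape[0]
    rng = np.random.default_rng(0)
    Phi = np.linalg.qr(rng.standard_normal((n, k)))[0]
    E, S = Phi.copy(), Phi.copy()
    U = np.zeros((n, k))  # scaled dual variable for Phi = E
    V = np.zeros((n, k))  # scaled dual variable for Phi = S
    A = 2.0 * L + 2.0 * rho * np.eye(n)  # normal matrix of the Phi-subproblem
    for _ in range(iters):
        # Phi-step: quadratic subproblem -> one dense linear solve
        Phi = np.linalg.solve(A, rho * (E - U + S - V))
        # E-step: soft-thresholding, the prox operator of the l1 term
        X = Phi + U
        E = np.sign(X) * np.maximum(np.abs(X) - mu / rho, 0.0)
        # S-step: closest orthonormal matrix via SVD (polar projection)
        P, _, Qt = np.linalg.svd(Phi + V, full_matrices=False)
        S = P @ Qt
        U += Phi - E  # dual ascent
        V += Phi - S
    return S  # exactly orthonormal by construction of the S-step

# usage: k localized modes of a 1D path-graph Laplacian, a toy stand-in
# for the cotangent Laplace-Beltrami operator of a triangulated mesh
n = 20
Lap = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
modes = compressed_modes(Lap, k=4)
```

The S-step returns an exactly orthonormal basis at every iteration, which is why the sketch returns `S` rather than `Phi`.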
  • ABSTRACT: It is now possible to capture the 3D motion of the human body on consumer hardware and to puppet skeleton-based virtual characters in real time. However, many characters do not have humanoid skeletons. Characters such as spiders and caterpillars have no boned skeletons at all, and their shapes and motions differ greatly. In general, character control under arbitrary shape and motion transformations is unsolved — how might these motions be mapped? We control characters with a method which avoids the rigging-skinning pipeline — source and target characters do not have skeletons or rigs. We use interactively-defined sparse pose correspondences to learn a mapping between arbitrary 3D point source sequences and mesh target sequences. Then, we puppet the target character in real time. We demonstrate the versatility of our method through results on diverse virtual characters with different input motion controllers. Our method provides a fast, flexible, and intuitive interface for arbitrary motion mapping, opening new ways to control characters for real-time animation.
    Computer Graphics Forum 05/2014; 33(2). · 1.64 Impact Factor
  • ABSTRACT: GrabCut is a segmentation technique for 2D still color images based on iterative energy minimization, where the energy function rests mainly on a probabilistic model of the pixel color distribution. GrabCut may therefore produce unacceptable results when the contrast between foreground and background colors is low. This paper presents a modified GrabCut technique for segmenting human faces from images of full humans. The modified technique introduces a new face location model into the GrabCut energy minimization function, in addition to the existing color model. This location model considers the distribution of pixel distances from the silhouette boundary of a 3D morphable head model fitted to the image. Experimental results demonstrate that the modified GrabCut achieves better segmentation robustness and accuracy than the original GrabCut for human face segmentation.
    Ain Shams Engineering Journal. 01/2014;
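The location model above can be sketched as an extra unary term combined with GrabCut's color unary. This is a hedged illustration: the blending weight `lam`, the scale `sigma`, and the linear distance penalty are hypothetical choices, and the paper models the distance distribution differently.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def face_unary(color_fg_nll, head_mask, lam=0.5, sigma=20.0):
    """Sketch: augment the color-based GrabCut foreground unary with a
    location term derived from the distance to the silhouette of a fitted
    3D-morphable-model head.
    color_fg_nll : per-pixel negative log-likelihood under the fg color GMM
    head_mask    : boolean image, True inside the projected head silhouette"""
    # Euclidean distance of each pixel to the silhouette region (0 inside it)
    dist = distance_transform_edt(~head_mask)
    location_nll = dist / sigma  # penalty grows as we move away from the face
    return (1.0 - lam) * color_fg_nll + lam * location_nll

# usage: a toy 9x9 image with the "head" in the center; with a flat color
# term, the unary is zero on the face and grows toward the corners
mask = np.zeros((9, 9), dtype=bool)
mask[3:6, 3:6] = True
u = face_unary(np.zeros((9, 9)), mask)
```

In a full pipeline this unary would replace the foreground data term inside the graph-cut energy, leaving the pairwise smoothness term unchanged.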
  • ABSTRACT: In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted research in monocular full-body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses and poses with body part occlusions. In this paper, we present a novel sensor fusion approach for real-time full-body tracking that succeeds in such difficult situations. It takes inspiration from previous tracking solutions and combines a generative tracker with a discriminative tracker that retrieves the closest poses in a database. In contrast to previous work, both trackers employ data from a small number of inexpensive body-worn inertial sensors, which provide reliable and complementary information when the monocular depth information alone is insufficient. We also contribute new algorithmic solutions for fusing depth and inertial data in both trackers: a new visibility model that determines global body pose, occlusions, and usable depth correspondences, and that decides which data modality to use for discriminative tracking; a new inertial-based pose retrieval; and an adapted late fusion step to calculate the final body pose.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
  • Kwang In Kim, James Tompkin, Christian Theobalt
    ABSTRACT: One fundamental assumption in object recognition, as well as in other computer vision and pattern recognition problems, is that the data lie on a manifold and that the generation process respects the intrinsic geometry of that manifold. This assumption underlies several successful algorithms for diffusion and regularization, in particular graph-Laplacian-based algorithms. We claim that the performance of existing algorithms can be improved if we additionally account for how the manifold is embedded within the ambient space, i.e., if we consider the extrinsic geometry of the manifold. We present a procedure for characterizing the extrinsic (as well as intrinsic) curvature of a manifold M which is described by a sampled point cloud in a high-dimensional Euclidean space. Once estimated, we use this characterization in general diffusion and regularization on M, and form a new regularizer on a point cloud. The resulting re-weighted graph Laplacian demonstrates superior performance over the classical graph Laplacian in semi-supervised learning and spectral clustering.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
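The semi-supervised setting above can be illustrated with the standard graph-Laplacian regularizer in closed form. This is a hedged baseline sketch: the paper's contribution would correspond to re-weighting `W` with the estimated extrinsic curvature before building the Laplacian, which is not reproduced here.

```python
import numpy as np

def laplacian_ssl(W, y, labeled, gamma=1.0):
    """Sketch of graph-Laplacian regularized semi-supervised learning:
        f = argmin  sum over labeled i of (f_i - y_i)^2  +  gamma * f^T L f,
    which has the closed-form solution below.  W is a symmetric affinity
    matrix; the curvature re-weighting of the paper would modify W first."""
    D = np.diag(W.sum(axis=1))
    L = D - W                            # combinatorial graph Laplacian
    J = np.diag(labeled.astype(float))   # selects the labeled vertices
    return np.linalg.solve(J + gamma * L, J @ y)

# usage: two tight two-node clusters joined by weak edges; one label per
# cluster propagates its sign to the unlabeled neighbor
W = np.array([[0.00, 1.00, 0.01, 0.00],
              [1.00, 0.00, 0.01, 0.00],
              [0.01, 0.01, 0.00, 1.00],
              [0.00, 0.00, 1.00, 0.00]])
f = laplacian_ssl(W, np.array([1.0, 0.0, 0.0, -1.0]),
                  np.array([True, False, False, True]))
```

Because `J + gamma * L` is positive definite whenever the graph is connected and at least one vertex is labeled, the solve is well posed.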
  • ABSTRACT: Modeling realistic skin deformations due to underlying muscle bulging has a wide range of applications in medicine, entertainment, and art. Current acquisition systems based on dense markers and multiple synchronized cameras are able to record and reproduce fine-scale skin deformations with sufficient quality. However, the complexity and high cost of these systems severely limit their applicability. In this paper, we propose a method for reconstructing fine-scale arm muscle deformations using the Kinect depth camera. The captured depth data has no temporal contiguity and suffers from noise and sensory artifacts, making it unsuitable by itself for potential applications in visual media production or biomechanics. We process the noisy depth input to obtain spatio-temporally consistent 3D mesh reconstructions showing fine-scale muscle bulges over time. Our main contribution is the incorporation of statistical deformation priors into the spatio-temporal mesh registration process. We obtain these priors from a previous dataset of a limited number of physiologically different actors captured using a high-fidelity acquisition setup; the priors provide a better initialization for the final non-rigid surface refinement that models deformations beyond the range of the previous dataset. Our method is thus an easily scalable framework for bootstrapping the statistical muscle deformation model by extending the set of subjects through a Kinect-based acquisition process. We validate our spatio-temporal surface registration method on several arm movements performed by people of different body shapes.
    Proceedings of the 10th European Conference on Visual Media Production; 11/2013
  • ABSTRACT: Capturing the skeleton motion and detailed time-varying surface geometry of multiple, closely interacting people is a very challenging task, even in a multicamera setup, due to frequent occlusions and ambiguities in feature-to-person assignments. To address this task, we propose a framework that exploits multiview image segmentation. To this end, a probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Given the articulated template models of each person and the labeled pixels, a combined optimization scheme, which splits the skeleton pose optimization problem into a local one and a lower-dimensional global one, is applied to each individual in turn, followed by surface estimation to capture detailed nonrigid deformations. We show on various sequences that our approach can capture the 3D motion of humans accurately even if they move rapidly, if they wear wide apparel, and if they are engaged in challenging multiperson motions, including dancing, wrestling, and hugging.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 11/2013; 35(11):2720-35.
  • ACM Transactions on Graphics 11/2013; 32(6):201. · 3.36 Impact Factor
  • ABSTRACT: We propose a method that extracts sparse and spatially localized deformation modes from an animated mesh sequence. To this end, we propose a new way to extend the theory of sparse matrix decompositions to 3D mesh sequence processing, and further contribute an automatic way to ensure spatial locality of the decomposition in a new optimization framework. The extracted dimensions often have an intuitive and clearly interpretable meaning. Our method optionally accepts user constraints to guide the process of discovering the underlying latent deformation space. The capabilities of our efficient, versatile, and easy-to-implement method are extensively demonstrated on a variety of data sets and application contexts. We demonstrate its power for user-friendly, intuitive editing of captured mesh animations, such as faces, full-body motion, cloth animations, and muscle deformations. We further show its benefit for statistical geometry processing and biomechanically meaningful animation editing. We also show qualitatively and quantitatively that our method outperforms other unsupervised decomposition methods and other animation parameterization approaches in the above use cases.
    ACM Transactions on Graphics (TOG). 11/2013; 32(6).
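A crude alternating sketch conveys the flavor of such a sparse decomposition. This is not the paper's algorithm: the simple soft-thresholding step below only encourages sparse component support, whereas the paper additionally enforces spatial locality explicitly; the matrix layout and the parameters `lam` and `iters` are illustrative assumptions.

```python
import numpy as np

def sparse_deformation_modes(X, k, lam=0.01, iters=200):
    """Alternating least-squares sketch of X ~ W @ C for an animated mesh
    sequence, with soft-thresholding pushing each component row of C
    toward sparse vertex support.
    X : (frames, 3*num_vertices) vertex displacements from the rest pose
    W : (frames, k) per-frame activation weights
    C : (k, 3*num_vertices) deformation components"""
    rng = np.random.default_rng(0)
    C = rng.standard_normal((k, X.shape[1]))
    for _ in range(iters):
        W = X @ np.linalg.pinv(C)            # weights given components
        G = np.linalg.pinv(W) @ X            # least-squares components
        C = np.sign(G) * np.maximum(np.abs(G) - lam, 0.0)  # prox of l1
    W = X @ np.linalg.pinv(C)                # final weight refit
    return W, C

# usage: synthetic sequence built from two ground-truth components with
# disjoint vertex support; the decomposition should reconstruct it well
rng = np.random.default_rng(1)
C_true = np.zeros((2, 12))
C_true[0, :6] = 1.0   # component acting on the first "vertices"
C_true[1, 6:] = 1.0   # component acting on the rest
X = rng.standard_normal((30, 2)) @ C_true
W, C = sparse_deformation_modes(X, k=2)
```

With tiny `lam` this reduces to an alternating low-rank fit; larger values trade reconstruction accuracy for sparser, more localized components.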
  • ABSTRACT: Video collections of places show contrasts and changes in our world, but current interfaces to video collections make it hard for users to explore these changes. Recent state-of-the-art interfaces attempt to solve this problem for 'outside->in' collections, but cannot connect 'inside->out' collections of the same place which do not visually overlap. We extend the focus+context paradigm to create a video-collections+context interface by embedding videos into a panorama. We build a spatio-temporal index and tools for fast exploration of the space and time of the video collection. We demonstrate the flexibility of our representation with interfaces for desktop and mobile flat displays, and for a spherical display with joypad and tablet controllers. We study with users the effect of our video-collections+context system on spatio-temporal localization tasks, and find significant improvements in accuracy and completion time in visual search tasks compared to existing systems. We measure the usability of our interface with the System Usability Scale (SUS) and task-specific questionnaires, and find that our system scores higher.
    Proceedings of the 26th annual ACM symposium on User interface software and technology; 10/2013
  • ABSTRACT: Emerging interfaces for video collections of places attempt to link similar content with seamless transitions. However, the automatic computer vision techniques that enable these transitions have many failure cases which lead to artifacts in the final rendered transition. Under these conditions, which transitions are preferred by participants and which artifacts are most objectionable? We perform an experiment with participants comparing seven transition types, from movie cuts and dissolves to image-based warps and virtual camera transitions, across five scenes in a city. This document describes how we condition this experiment on slight and considerable view change cases, and how we analyze the feedback from participants to find their preference for transition types and artifacts. We discover that transition preference varies with view change, that automatic rendered transitions are significantly preferred even with some artifacts, and that dissolve transitions are comparable to less-sophisticated rendered transitions. This leads to insights into what visual features are important to maintain in a rendered transition, and to an artifact ordering within our transitions.
    ACM Transactions on Applied Perception (TAP). 08/2013; 10(3).
  • ABSTRACT: We present an algorithm for creating free-viewpoint video of interacting humans using three handheld Kinect cameras. Our method reconstructs deforming surface geometry and temporally varying texture of humans through estimation of human poses and camera poses for every time step of the RGBZ video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem, which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Finally, texture recovery is achieved through joint optimization on spatio-temporal RGB data using matrix completion. As opposed to previous methods, our algorithm succeeds on free-viewpoint video of human actors in general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.
    IEEE Transactions on Cybernetics 07/2013.
  • ABSTRACT: We describe a method for 3D object scanning by aligning depth scans that were taken from around an object with a Time-of-Flight (ToF) camera. These ToF cameras can measure depth scans at video rate. Thanks to their comparatively simple technology, they bear potential for economical production in large volumes. Our easy-to-use, cost-effective scanning solution, which is based on such a sensor, could make 3D scanning technology more accessible to everyday users. The algorithmic challenge we face is that the sensor's level of random noise is substantial and there is a nontrivial systematic bias. In this paper, we show the surprising result that 3D scans of reasonable quality can nevertheless be obtained with a sensor of such low data quality. Established filtering and scan alignment techniques from the literature fail to achieve this goal. In contrast, our algorithm is based on a new combination of a 3D superresolution method with a probabilistic scan alignment approach that explicitly takes into account the sensor's noise characteristics.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 05/2013; 35(5):1039-50.
  • ABSTRACT: We present a comprehensive data-driven statistical model for skin and muscle deformation of the human shoulder-arm complex. Skin deformations arise from complex bio-physical effects such as non-linear elasticity of muscles, fat, and connective tissue, and vary with the physiological constitution of the subjects and the external forces applied during motion. They are thus hard to model by direct physical simulation. Our alternative approach is based on learning deformations from multiple subjects performing different exercises under varying external forces. We capture the training data through a novel multi-camera approach that is able to reconstruct fine-scale muscle detail in motion. The resulting reconstructions from several people are aligned into one common shape parametrization and learned using a semi-parametric non-linear method. Our learned data-driven model is fast, compact, and controllable with a small set of intuitive parameters – pose, body shape, and external forces – through which a novice artist can interactively produce complex muscle deformations. Our method is able to capture and synthesize fine-scale muscle bulge effects to a greater level of realism than achieved previously. We provide quantitative and qualitative validation of our method.
    Computer Graphics Forum 05/2013; 32(2pt3). · 1.64 Impact Factor
  • ABSTRACT: We present a novel approach to create relightable free-viewpoint human performances from multi-view video recorded under general uncontrolled and uncalibrated illumination. We first capture a multi-view sequence of an actor wearing arbitrary apparel and reconstruct a spatio-temporally coherent coarse 3D model of the performance using a marker-less tracking approach. Using these coarse reconstructions, we estimate the low-frequency component of the illumination in a spherical harmonics (SH) basis as well as the diffuse reflectance, and then utilize them to estimate the dynamic geometry detail of human actors based on shading cues. Given the high-quality time-varying geometry, the estimated illumination is extended to the all-frequency domain by re-estimating it in the wavelet basis. Finally, the high-quality all-frequency illumination is utilized to reconstruct the spatially-varying BRDF of the surface. The recovered time-varying surface geometry and spatially-varying non-Lambertian reflectance allow us to generate high-quality model-based free-viewpoint videos of the actor under novel illumination conditions. Our method enables plausible reconstruction of relightable dynamic scene models without a complex controlled lighting apparatus, and opens up a path towards relightable performance capture in less constrained environments and using less complex acquisition setups.
    Computer Graphics Forum 05/2013; 32(2pt3). · 1.64 Impact Factor
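The low-frequency illumination estimate above reduces, under a Lambertian shading assumption, to a linear least-squares fit of 9 spherical-harmonics coefficients. The sketch below illustrates that step in isolation; the constant factors are the standard real SH basis values, while the synthetic unit-albedo setup is purely illustrative.

```python
import numpy as np

def sh_basis(n):
    """First 9 real spherical harmonics evaluated at unit normals n (N, 3)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([
        np.full_like(x, 0.282095),                 # Y_0,0
        0.488603 * y, 0.488603 * z, 0.488603 * x,  # band 1
        1.092548 * x * y, 1.092548 * y * z,        # band 2
        0.315392 * (3.0 * z**2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x**2 - y**2),
    ], axis=1)

def estimate_sh_lighting(normals, intensities, albedo):
    """Sketch of low-frequency illumination estimation: with Lambertian
    shading I = albedo * (B(n) @ l), the 9 lighting coefficients l (with
    the cosine-lobe convolution folded in) follow from least squares."""
    B = sh_basis(normals) * albedo[:, None]
    l, *_ = np.linalg.lstsq(B, intensities, rcond=None)
    return l

# usage: recover known lighting from synthetic shading at random normals
rng = np.random.default_rng(0)
nrm = rng.standard_normal((200, 3))
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
l_true = rng.standard_normal(9)
I_obs = sh_basis(nrm) @ l_true          # unit albedo for the toy example
l_est = estimate_sh_lighting(nrm, I_obs, np.ones(200))
```

In the paper's pipeline the normals come from the coarse tracked geometry and the fit runs over all surface points of the sequence, after which shading cues refine the geometry detail.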
  • ABSTRACT: Reconstructing a three-dimensional representation of human motion in real time constitutes an important research topic with applications in sports sciences, human-computer interaction, and the movie industry. In this paper, we contribute a robust algorithm for estimating a personalized human body model from just two sequentially captured depth images that is more accurate and runs an order of magnitude faster than the current state-of-the-art procedure. We then employ the estimated body model to track the pose in real time from a stream of depth images, using a tracking algorithm that combines local pose optimization and a stabilizing database look-up. Together, this enables pose tracking that is more accurate than previous approaches. As a further contribution, we evaluate and compare our algorithm to previous work on a comprehensive benchmark dataset containing more than 15 minutes of challenging motions. This dataset comprises calibrated marker-based motion capture data, depth data, as well as ground-truth tracking results, and is publicly available for research purposes.
    2013 International Conference on 3D Vision (3DV); 01/2013
  • ABSTRACT: Recent progress in passive facial performance capture has shown impressively detailed results on highly articulated motion. However, most methods rely on complex multi-camera set-ups, controlled lighting or fiducial markers. This prevents them from being used in general environments, outdoor scenes, during live action on a film set, or by freelance animators and everyday users who want to capture their digital selves. In this paper, we therefore propose a lightweight passive facial performance capture approach that is able to reconstruct high-quality dynamic facial geometry from only a single pair of stereo cameras. Our method succeeds under uncontrolled and time-varying lighting, and also in outdoor scenes. Our approach builds upon and extends recent image-based scene flow computation, lighting estimation and shading-based refinement algorithms. It integrates them into a pipeline that is specifically tailored towards facial performance reconstruction from challenging binocular footage under uncontrolled lighting. In an experimental evaluation, the strong capabilities of our method become explicit: We achieve detailed and spatio-temporally coherent results for expressive facial motion in both indoor and outdoor scenes -- even from low quality input images recorded with a hand-held consumer stereo camera. We believe that our approach is the first to capture facial performances of such high quality from a single stereo rig and we demonstrate that it brings facial performance capture out of the studio, into the wild, and within the reach of everybody.
    ACM Transactions on Graphics (TOG). 11/2012; 31(6).
  • Michal Richter, Kiran Varanasi, Nils Hasler, Christian Theobalt
    ABSTRACT: We present a system for real-time deformation of the shape and appearance of people who are standing in front of a depth+RGB camera, such as the Microsoft Kinect. Our system allows manipulating human body shape parameters such as height, muscularity, weight, waist girth, and leg length, and displays the manipulated appearance in real time. Thus, instead of posing in front of a real mirror to visualize their appearance, users can pose in front of a 'virtual mirror' and visualize themselves in different body shapes. Our system is made possible by a morphable model of 3D human shape that was learnt from a large database of 3D scans of people in various body shapes and poses. In an initialization step, which lasts a couple of seconds, this model is fitted to the 3D shape parameters of the people as observed in the depth data. Then, a succession of pose tracking, body segmentation, shape deformation, and image warping steps is performed -- in real time and independently for multiple people. We present a variety of results in the paper and the video, showing the interactive virtual mirror cabinet experience.
    Proceedings of the 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission; 10/2012
  • ABSTRACT: We present an algorithm for marker-less performance capture of interacting humans using only three hand-held Kinect cameras. Our method reconstructs human skeletal poses, deforming surface geometry and camera poses for every time step of the depth video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Only the combination of geometric and photometric correspondences and the integration of human pose and camera pose estimation enables reliable performance capture with only three sensors. As opposed to previous performance capture methods, our algorithm succeeds on general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.
    Proceedings of the 12th European conference on Computer Vision - Volume Part II; 10/2012

Publication Stats

1k Citations
59.18 Total Impact Points


  • 1998–2014
    • Max Planck Institute for Informatics
      Saarbrücken, Saarland, Germany
  • 2013
    • Tsinghua University
Beijing, China
    • Universität des Saarlandes
      Saarbrücken, Saarland, Germany
  • 2005–2009
    • Stanford University
      Palo Alto, California, United States
  • 2007
    • Zhejiang University
Hangzhou, Zhejiang, China
  • 2003–2007
    • Max Planck Society
      München, Bavaria, Germany
  • 2002
    • The University of Edinburgh
      • School of Informatics
      Edinburgh, SCT, United Kingdom