Christian Theobalt

Tsinghua University, Beijing, China

Publications (143) · 160.17 Total Impact Points

  • ACM Transactions on Graphics 10/2015; 34(6):1-14. DOI:10.1145/2816795.2818056 · 4.10 Impact Factor

  • ACM Transactions on Graphics 10/2015; 34(6):1-12. DOI:10.1145/2816795.2818082 · 4.10 Impact Factor
  • Dan Casas · Christian Richardt · John Collomosse · Christian Theobalt · Adrian Hilton ·
    ABSTRACT: We introduce the concept of 4D model flow for the precomputed alignment of dynamic surface appearance across 4D video sequences of different motions reconstructed from multi-view video. Precomputed 4D model flow allows the efficient parametrization of surface appearance from the captured videos, which enables efficient real-time rendering of interpolated 4D video sequences whilst accurately reproducing visual dynamics, even when using a coarse underlying geometry. We estimate the 4D model flow using an image-based approach that is guided by available geometry proxies. We propose a novel representation in surface texture space for efficient storage and online parametric interpolation of dynamic appearance. Our 4D model flow overcomes previous requirements for computationally expensive online optical flow computation for data-driven alignment of dynamic surface appearance by precomputing the appearance alignment. This leads to an efficient rendering technique that enables the online interpolation between 4D videos in real time, from arbitrary viewpoints and with visual quality comparable to the state of the art.
    Computer Graphics Forum 10/2015; 34(7). DOI:10.1111/cgf.12756 · 1.64 Impact Factor
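    A minimal sketch of the kind of texture-space interpolation described above, assuming precomputed flow and aligned texture atlases (Python/NumPy; all names and parameters here are illustrative, not taken from the paper):

    ```python
    # Hypothetical sketch: interpolate appearance along a precomputed
    # texture-space flow field, then cross-fade the two warped textures.
    import numpy as np
    from scipy.ndimage import map_coordinates

    def _warp(tex, dy, dx):
        """Sample tex at (y + dy, x + dx) with bilinear interpolation."""
        h, w = tex.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
        out = np.empty_like(tex)
        for c in range(tex.shape[2]):
            out[..., c] = map_coordinates(tex[..., c], [ys + dy, xs + dx],
                                          order=1, mode='nearest')
        return out

    def interpolate_appearance(tex_a, tex_b, flow_ab, t):
        """Blend two (H, W, 3) aligned texture maps at parameter t in [0, 1]
        using a precomputed (H, W, 2) texture-space flow from A to B."""
        fy, fx = flow_ab[..., 1], flow_ab[..., 0]
        warped_a = _warp(tex_a, -t * fy, -t * fx)            # A pulled forward
        warped_b = _warp(tex_b, (1 - t) * fy, (1 - t) * fx)  # B pulled back
        return (1 - t) * warped_a + t * warped_b
    ```

    Because the flow is precomputed offline, the per-frame cost at render time is two warps and a blend, which is what makes real-time interpolation between 4D videos feasible.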
  • Younghee Kwon · Kwang In Kim · James Tompkin · Jin Hyung Kim · Christian Theobalt ·
    ABSTRACT: Improving the quality of degraded images is a key problem in image processing, but the breadth of the problem leads to domain-specific approaches for tasks such as super-resolution and compression artifact removal. Recent approaches have shown that a general approach is possible by learning application-specific models from examples; however, learning models sophisticated enough to generate high-quality images is computationally expensive, and so specific per-application or per-dataset models are impractical. To solve this problem, we present an efficient semi-local approximation scheme to large-scale Gaussian processes. This allows efficient learning of task-specific image enhancements from example images without reducing quality. As such, our algorithm can be easily customized to specific applications and datasets, and we show the efficiency and effectiveness of our approach across five domains: single-image super-resolution for scene, human face, and text images, and artifact removal in JPEG- and JPEG 2000-encoded images.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 09/2015; 37(9):1-1. DOI:10.1109/TPAMI.2015.2389797 · 5.78 Impact Factor
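    The paper's semi-local approximation is not reproduced here, but the following minimal sketch (data layout and parameters assumed) shows the exact Gaussian-process regressor it approximates, applied to vectorized image patches:

    ```python
    # Exact GP regression with an RBF kernel: the baseline predictor that a
    # large-scale approximation scheme would stand in for (details assumed).
    import numpy as np

    def gp_predict(X_train, y_train, X_test, lengthscale=1.0, noise=1e-2):
        """X_train: (N, D) degraded-patch features; y_train: (N,) targets,
        e.g. the center pixel of the corresponding clean patch."""
        def rbf(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-0.5 * d2 / lengthscale ** 2)

        K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
        alpha = np.linalg.solve(K, y_train)      # K^{-1} y
        return rbf(X_test, X_train) @ alpha      # posterior mean k_* K^{-1} y
    ```

    Exact inference costs O(N^3) in the number of training patches, which is precisely the bottleneck that motivates a semi-local approximation.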

  • ACM Transactions on Graphics 07/2015; 34(4):96:1-96:14. DOI:10.1145/2766887 · 4.10 Impact Factor
  • ABSTRACT: In many countries, foreign movies and TV productions are dubbed, i.e., the original voice of an actor is replaced with a translation that is spoken by a dubbing actor in the country's own language. Dubbing is a complex process that requires specific translations and accurately timed recitations such that the new audio at least coarsely adheres to the mouth motion in the video. However, since the sequence of phonemes and visemes in the original and the dubbing language are different, the video-to-audio match is never perfect, which is a major source of visual discomfort. In this paper, we propose a system to alter the mouth motion of an actor in a video, so that it matches the new audio track. Our paper builds on high-quality monocular capture of 3D facial performance, lighting and albedo of the dubbing and target actors, and uses audio analysis in combination with a space-time retrieval method to synthesize a new photo-realistically rendered and highly detailed 3D shape model of the mouth region to replace the target performance. We demonstrate plausible visual quality of our results compared to footage that has been professionally dubbed in the traditional way, both qualitatively and through a user study.
    Computer Graphics Forum 05/2015; 34(2). DOI:10.1111/cgf.12552 · 1.64 Impact Factor
  • Leonid Pishchulin · Stefanie Wuhrer · Thomas Helten · Christian Theobalt · Bernt Schiele ·
    ABSTRACT: Statistical models of 3D human shape and pose learned from scan databases have developed into valuable tools to solve a variety of vision and graphics problems. Unfortunately, most publicly available models are of limited expressiveness, as they were learned on very small databases that hardly reflect the true variety in human body shapes. In this paper, we contribute by rebuilding a widely used statistical body representation from the largest commercially available scan database, and by making the resulting model available to the community. As preprocessing several thousand scans for learning the model is a challenge in itself, we contribute by developing robust best-practice solutions for scan alignment that quantitatively lead to the best learned models. We also make implementations of these preprocessing steps publicly available. We extensively evaluate the improved accuracy and generality of our new model, and show its improved performance for human body reconstruction from sparse input data.
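    To make the model family concrete, here is a minimal PCA shape-model sketch (Python/NumPy; an illustration of a linear statistical body model in general, not the paper's pipeline), assuming scans already in dense vertex-wise correspondence:

    ```python
    # Sketch of a PCA-based statistical shape model over registered scans.
    import numpy as np

    def learn_shape_model(scans, n_modes=10):
        """scans: (S, V, 3) array of S registered scans with V vertices."""
        X = scans.reshape(len(scans), -1)         # flatten to (S, 3V)
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        modes = Vt[:n_modes]                      # principal shape directions
        stddev = s[:n_modes] / np.sqrt(len(scans) - 1)
        return mean, modes, stddev

    def synthesize(mean, modes, stddev, coeffs):
        """Generate a body shape from low-dimensional shape coefficients."""
        return (mean + (coeffs * stddev) @ modes).reshape(-1, 3)
    ```

    Establishing the dense correspondence that this sketch takes for granted is exactly the preprocessing challenge the paper addresses.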
  • S. Sridhar · H. Rhodin · H.-P. Seidel · A. Oulasvirta · C. Theobalt ·
    ABSTRACT: Real-time marker-less hand tracking is of increasing importance in human-computer interaction. Robust and accurate tracking of arbitrary hand motion is a challenging problem due to the many degrees of freedom, frequent self-occlusions, fast motions, and uniform skin color. In this paper, we propose a new approach that tracks the full skeleton motion of the hand from multiple RGB cameras in real time. The main contributions include a new generative tracking method which employs an implicit hand shape representation based on a Sum of Anisotropic Gaussians (SAG), and a pose-fitting energy that is smooth and analytically differentiable, making fast gradient-based pose optimization possible. This shape representation, together with a full perspective projection model, enables more accurate hand modeling than a related baseline method from the literature. Our method achieves better accuracy than previous methods and runs at 25 fps. We show these improvements both qualitatively and quantitatively on publicly available datasets.
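    A minimal sketch of a smooth, analytically differentiable overlap energy of the kind described above, using isotropic Gaussians for brevity (the paper uses anisotropic Gaussians and a full perspective projection model):

    ```python
    # Closed-form overlap of isotropic 3D Gaussians, negated and summed to
    # give a smooth alignment energy amenable to gradient-based optimization.
    import numpy as np

    def overlap(mu1, s1, mu2, s2):
        """Integral of the product of two isotropic 3D Gaussians."""
        ssum = s1 ** 2 + s2 ** 2
        d2 = np.sum((mu1 - mu2) ** 2)
        return (2 * np.pi * s1 ** 2 * s2 ** 2 / ssum) ** 1.5 \
            * np.exp(-0.5 * d2 / ssum)

    def alignment_energy(model_mus, model_sigmas, obs_mus, obs_sigmas):
        """Negative total overlap between model and observation Gaussians;
        smooth in the model means, so its gradient is available in closed
        form for fast pose optimization."""
        e = 0.0
        for m, sm in zip(model_mus, model_sigmas):
            for o, so in zip(obs_mus, obs_sigmas):
                e -= overlap(m, sm, o, so)
        return e
    ```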
  • A. Elhayek · C. Stoll · K. I. Kim · C. Theobalt ·
    ABSTRACT: We present a method for capturing the skeletal motions of humans using a sparse set of potentially moving cameras in an uncontrolled environment. Our approach is able to track multiple people even in front of cluttered and non-static backgrounds, and with unsynchronized cameras of varying image quality and frame rate. We rely completely on optical information and do not make use of additional sensor information (e.g. depth images or inertial sensors). Our algorithm simultaneously reconstructs the skeletal pose parameters of multiple performers and the motion of each camera. This is facilitated by a new energy functional that captures the alignment of the model and the camera positions with the input videos in an analytic way. The approach can be adopted in many practical applications to replace complex and expensive motion-capture studios with a few consumer-grade cameras, even in uncontrolled outdoor scenes. We demonstrate this on challenging multi-view video sequences captured with unsynchronized and moving (e.g. mobile-phone or GoPro) cameras.
    Computer Graphics Forum 12/2014; 34(6). DOI:10.1111/cgf.12519 · 1.64 Impact Factor
  • T. Neumann · K. Varanasi · C. Theobalt · M. Magnor · M. Wacker ·
    ABSTRACT: This paper introduces compressed eigenfunctions of the Laplace-Beltrami operator on 3D manifold surfaces. They constitute a novel functional basis, called the compressed manifold basis, where each function has local support. We derive an algorithm, based on the alternating direction method of multipliers (ADMM), to compute this basis on a given triangulated mesh. We show that compressed manifold modes identify key shape features, yielding an intuitive understanding of the basis for a human observer, where a shape can be processed as a collection of parts. We evaluate compressed manifold modes for potential applications in shape matching and mesh abstraction. Our results show that this basis has distinct advantages over existing alternatives, indicating high potential for a wide range of use-cases in mesh processing.
    Computer Graphics Forum 08/2014; 33(5). DOI:10.1111/cgf.12429 · 1.64 Impact Factor
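    The following sketch shows an ADMM solver for the compressed-modes objective min tr(Phi^T L Phi) + mu * ||Phi||_1 subject to Phi^T Phi = I, given a precomputed (e.g. graph or cotangent) Laplacian L. The splitting below is a simplified, generic one; the paper's exact formulation may differ:

    ```python
    # Simplified ADMM for sparse, near-orthonormal Laplacian modes.
    import numpy as np

    def compressed_modes(L, k, mu=0.1, rho=1.0, iters=200):
        n = L.shape[0]
        rng = np.random.default_rng(0)
        Phi = np.linalg.qr(rng.standard_normal((n, k)))[0]
        S, E = Phi.copy(), Phi.copy()            # sparse / orthonormal copies
        Us, Ue = np.zeros((n, k)), np.zeros((n, k))
        A = 2 * L + 2 * rho * np.eye(n)          # fixed system matrix
        for _ in range(iters):
            Phi = np.linalg.solve(A, rho * (S - Us + E - Ue))
            T = Phi + Us                         # soft-threshold -> sparsity
            S = np.sign(T) * np.maximum(np.abs(T) - mu / rho, 0.0)
            W, _, Vt = np.linalg.svd(Phi + Ue, full_matrices=False)
            E = W @ Vt                           # nearest orthonormal frame
            Us += Phi - S
            Ue += Phi - E
        return S
    ```

    The L1 term is what gives each mode local support, in contrast to the globally supported eigenfunctions of the Laplace-Beltrami operator.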
  • ABSTRACT: We present a combined hardware and software solution for marker-less reconstruction of non-rigidly deforming physical objects of arbitrary shape in real time. Our system uses a single self-contained stereo camera unit built from off-the-shelf components and consumer graphics hardware to generate spatio-temporally coherent 3D models at 30 Hz. A new stereo matching algorithm estimates real-time RGB-D data. We start by scanning a smooth template model of the subject as they move rigidly. This geometric surface prior avoids strong scene assumptions, such as a kinematic human skeleton or a parametric shape model. Next, a novel GPU pipeline performs non-rigid registration of live RGB-D data to the smooth template using an extended non-linear as-rigid-as-possible (ARAP) framework. High-frequency details are fused onto the final mesh using a linear deformation model. The system is an order of magnitude faster than state-of-the-art methods, while matching the quality and robustness of many offline algorithms. We show precise real-time reconstructions of diverse scenes, including: large deformations of users' heads, hands, and upper bodies; fine-scale wrinkles and folds of skin and clothing; and non-rigid interactions performed by users on flexible objects such as toys. We demonstrate how acquired models can be used for many interactive scenarios, including re-texturing, online performance capture and preview, and real-time shape and motion re-targeting.
    ACM Transactions on Graphics 07/2014; 33(4):1-12. DOI:10.1145/2601097.2601165 · 4.10 Impact Factor
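    For reference, here is a minimal sketch of the classic ARAP energy that the paper's registration pipeline extends (uniform edge weights assumed; the extended non-linear formulation and GPU solver are not reproduced):

    ```python
    # Core as-rigid-as-possible energy: per-vertex best-fit rotations via the
    # Kabsch algorithm, then summed squared deviation from local rigidity.
    import numpy as np

    def best_rotation(P, Q):
        """Rotation R minimizing sum ||Q_k - R P_k||^2 over edge vectors."""
        U, _, Vt = np.linalg.svd(P.T @ Q)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                 # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        return R

    def arap_energy(rest, deformed, neighbors):
        """rest, deformed: (V, 3) vertex arrays; neighbors: list of index
        lists giving the one-ring of each vertex."""
        e = 0.0
        for i, nbrs in enumerate(neighbors):
            P = rest[nbrs] - rest[i]             # rest-pose one-ring edges
            Q = deformed[nbrs] - deformed[i]     # deformed one-ring edges
            R = best_rotation(P, Q)
            e += np.sum((Q - P @ R.T) ** 2)
        return e
    ```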
  • Dina Khattab · Christian Theobalt · Ashraf S. Hussein · Mohamed F. Tolba ·
    ABSTRACT: GrabCut is a segmentation technique for 2D still color images that is based mainly on iterative energy minimization. The energy function of the GrabCut optimization algorithm rests mainly on a probabilistic model of pixel color distribution, so GrabCut may produce unacceptable results when the contrast between foreground and background colors is low. To address this, this paper presents a modified GrabCut technique for segmenting human faces from images of full humans. The modified technique introduces a new face-location model into the GrabCut energy-minimization function, in addition to the existing color model. This location model considers the distribution of pixel distances from the silhouette boundary of a 3D morphable head model fitted to the image. Experimental results demonstrate that the modified GrabCut achieves better segmentation robustness and accuracy than the original GrabCut for human face segmentation.
    Ain Shams Engineering Journal 06/2014; 5(4). DOI:10.1016/j.asej.2014.04.012
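    A sketch of the general idea of augmenting GrabCut's color-based unary energies with a location term derived from the fitted head silhouette. The softplus weighting and parameters below are illustrative assumptions, not the paper's exact terms:

    ```python
    # Hypothetical combination of GrabCut color energies with a smooth
    # silhouette-distance prior for face segmentation.
    import numpy as np

    def combined_unary(color_fg_nll, color_bg_nll, dist_to_silhouette,
                       sigma=20.0, lam=0.5):
        """color_*_nll: (H, W) negative log-likelihoods from the color GMMs;
        dist_to_silhouette: (H, W) signed pixel distance to the silhouette of
        the fitted 3D head model (negative inside, positive outside)."""
        # Penalize labeling far-outside pixels as foreground and far-inside
        # pixels as background; softplus keeps the term smooth.
        loc_fg = np.logaddexp(0.0, dist_to_silhouette / sigma)
        loc_bg = np.logaddexp(0.0, -dist_to_silhouette / sigma)
        fg_energy = color_fg_nll + lam * loc_fg
        bg_energy = color_bg_nll + lam * loc_bg
        return fg_energy, bg_energy
    ```

    The combined energies would then feed the usual GrabCut graph-cut step in place of the purely color-based unaries.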
  • ABSTRACT: It is now possible to capture the 3D motion of the human body on consumer hardware and to puppet skeleton-based virtual characters in real time. However, many characters do not have humanoid skeletons. Characters such as spiders and caterpillars have no boned skeletons at all, and their shapes and motions are very different. In general, character control under arbitrary shape and motion transformations is unsolved: how should these motions be mapped? We control characters with a method that avoids the rigging-skinning pipeline entirely: source and target characters need neither skeletons nor rigs. We use interactively defined sparse pose correspondences to learn a mapping between arbitrary 3D point source sequences and mesh target sequences, and then puppet the target character in real time. We demonstrate the versatility of our method through results on diverse virtual characters with different input motion controllers. Our method provides a fast, flexible, and intuitive interface for arbitrary motion mapping, offering new ways to control characters for real-time animation.
    Computer Graphics Forum 05/2014; 33(2). DOI:10.1111/cgf.12325 · 1.64 Impact Factor
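    One plausible way to realize such a mapping from sparse pose correspondences is scattered-data interpolation; the sketch below uses RBF regression, which is an illustrative choice rather than necessarily the paper's model:

    ```python
    # RBF regression from flattened source poses to flattened target meshes,
    # learned from K interactively defined pose correspondences.
    import numpy as np

    def fit_rbf_map(src_poses, tgt_poses, gamma=1.0, reg=1e-6):
        """src_poses: (K, Ds) source point configurations; tgt_poses:
        (K, Dt) corresponding target mesh vertex configurations."""
        d2 = ((src_poses[:, None] - src_poses[None, :]) ** 2).sum(-1)
        K = np.exp(-gamma * d2)
        return np.linalg.solve(K + reg * np.eye(len(K)), tgt_poses)

    def apply_rbf_map(src_poses, W, query, gamma=1.0):
        """Map one new source pose (Ds,) to a target mesh configuration."""
        d2 = ((query[None, :] - src_poses) ** 2).sum(-1)
        return np.exp(-gamma * d2) @ W
    ```

    Evaluating apply_rbf_map once per captured frame keeps the per-frame cost low enough for real-time puppetry.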
  • Yebin Liu · Genzhi Ye · Yangang Wang · Qionghai Dai · Christian Theobalt ·
    ABSTRACT: Capturing real performances of human actors has been an important topic in computer graphics and computer vision over the last few decades. The reconstructed 3D performance can be used for character animation and free-viewpoint video. While most available performance-capture approaches rely on a 3D video studio with tens of RGB cameras, this chapter presents a method for marker-less performance capture of single or multiple human characters using only three hand-held Kinects. Compared with RGB-camera approaches, the proposed method is more convenient with respect to data acquisition, requiring far fewer cameras and allowing hand-held capture. The method reconstructs human skeletal poses, deforming surface geometry, and camera poses for every time step of the depth video. It succeeds on general uncontrolled indoor scenes with potentially dynamic backgrounds, even for the reconstruction of multiple closely interacting characters.
    Computer Vision and Machine Learning with RGB-D Sensors, 01/2014: pages 91-108;
  • Thomas Helten · Meinard Müller · Hans-Peter Seidel · Christian Theobalt ·
    ABSTRACT: In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted research in monocular full-body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses and poses with body-part occlusions. In this paper, we present a novel sensor-fusion approach for real-time full-body tracking that succeeds in such difficult situations. It takes inspiration from previous tracking solutions and combines a generative tracker with a discriminative tracker that retrieves the closest poses in a database. In contrast to previous work, both trackers employ data from a small number of inexpensive body-worn inertial sensors, which provide reliable and complementary information when the monocular depth information alone is insufficient. We also contribute new algorithmic solutions for fusing depth and inertial data in both trackers: a new visibility model that determines global body pose, occlusions, and usable depth correspondences, and that decides which data modality to use for discriminative tracking; a new inertial-based pose retrieval; and an adapted late-fusion step to calculate the final body pose.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
  • Kwang In Kim · James Tompkin · Christian Theobalt ·
    ABSTRACT: One fundamental assumption in object recognition, as well as in other computer vision and pattern recognition problems, is that the data generation process lies on a manifold and respects the intrinsic geometry of that manifold. This assumption is held in several successful algorithms for diffusion and regularization, in particular graph-Laplacian-based algorithms. We claim that the performance of existing algorithms can be improved if we additionally account for how the manifold is embedded within the ambient space, i.e., if we consider the extrinsic geometry of the manifold. We present a procedure for characterizing the extrinsic (as well as intrinsic) curvature of a manifold M described by a sampled point cloud in a high-dimensional Euclidean space. Once estimated, we use this characterization in general diffusion and regularization on M, forming a new regularizer on the point cloud. The resulting re-weighted graph Laplacian demonstrates superior performance over the classical graph Laplacian in semi-supervised learning and spectral clustering.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
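    For context, the sketch below shows standard graph-Laplacian label propagation, the kind of algorithm whose affinity matrix the paper re-weights using estimated extrinsic curvature (the curvature estimation itself is not reproduced here):

    ```python
    # Laplacian-regularized semi-supervised classification on a point cloud.
    import numpy as np

    def label_propagation(W, y, labeled, lam=1.0):
        """W: (N, N) symmetric affinity matrix; y: (N,) labels in {-1, +1}
        (arbitrary for unlabeled points); labeled: (N,) boolean mask."""
        L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
        C = np.diag(labeled.astype(float))      # fit only the labeled points
        # Minimize sum_labeled (f_i - y_i)^2 + lam * f^T L f.
        f = np.linalg.solve(C + lam * L, C @ y)
        return np.sign(f)
    ```

    The paper's contribution amounts to replacing L with a re-weighted Laplacian whose weights reflect how the manifold curves within the ambient space.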
  • Srinath Sridhar · Antti Oulasvirta · Christian Theobalt ·
    ABSTRACT: Tracking the articulated 3D motion of the hand has important applications, for example, in human-computer interaction and teleoperation. We present a novel method that can capture a broad range of articulated hand motions at interactive rates. Our hybrid approach combines, in a voting scheme, a discriminative part-based pose retrieval method with a generative pose estimation method based on local optimization. Color information from a multi-view RGB camera setup, along with a person-specific hand model, is used by the generative method to find the pose that best explains the observed images. In parallel, our discriminative pose estimation method uses fingertips detected on depth data to estimate a complete or partial pose of the hand by adopting a part-based pose retrieval strategy. This part-based strategy drastically reduces the search space in comparison to a global pose retrieval strategy. Quantitative results show that our method achieves state-of-the-art accuracy on challenging sequences and near-real-time performance of 10 fps on a desktop computer.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
  • Nadia Robertini · Thomas Neumann · Kiran Varanasi · Christian Theobalt ·
    ABSTRACT: Modeling realistic skin deformations due to underlying muscle bulging has a wide range of applications in medicine, entertainment, and art. Current acquisition systems based on dense markers and multiple synchronized cameras are able to record and reproduce fine-scale skin deformations with sufficient quality. However, the complexity and high cost of these systems severely limit their applicability. In this paper, we propose a method for reconstructing fine-scale arm-muscle deformations using the Kinect depth camera. The captured data from the depth camera has no temporal contiguity and suffers from noise and sensor artifacts, and is thus unsuitable by itself for potential applications in visual media production or biomechanics. We process the noisy depth input to obtain spatio-temporally consistent 3D mesh reconstructions showing fine-scale muscle bulges over time. Our main contribution is the incorporation of statistical deformation priors into the spatio-temporal mesh registration process. We obtain these priors from an earlier dataset of a limited number of physiologically different actors captured with a high-fidelity acquisition setup; these priors provide a better initialization for the final non-rigid surface refinement, which models deformations beyond the range of the earlier dataset. Our method is thus an easily scalable framework for bootstrapping the statistical muscle deformation model by extending the set of subjects through a Kinect-based acquisition process. We validate our spatio-temporal surface registration method on several arm movements performed by people of different body shapes.
    Proceedings of the 10th European Conference on Visual Media Production; 11/2013
  • Pablo Garrido · Levi Valgaerts · Chenglei Wu · Christian Theobalt ·
    ABSTRACT: Detailed facial performance geometry can be reconstructed using dense camera and light setups in controlled studios. However, a wide range of important applications cannot employ these approaches, including all movie productions shot from a single principal camera. For post-production, these require dynamic monocular face capture for appearance modification. We present a new method for capturing face geometry from monocular video. Our approach captures detailed, dynamic, spatio-temporally coherent 3D face geometry without the need for markers. It works under uncontrolled lighting, and it successfully reconstructs expressive motion including high-frequency face detail such as folds and laugh lines. After simple manual initialization, the capturing process is fully automatic, which makes it versatile, lightweight and easy-to-deploy. Our approach tracks accurate sparse 2D features between automatically selected key frames to animate a parametric blend shape model, which is further refined in pose, expression and shape by temporally coherent optical flow and photometric stereo. We demonstrate performance capture results for long and complex face sequences captured indoors and outdoors, and we exemplify the relevance of our approach as an enabling technology for model-based face editing in movies and video, such as adding new facial textures, as well as a step towards enabling everyone to do facial performance capture with a single affordable camera.
    ACM Transactions on Graphics 11/2013; 32(6):1-10. DOI:10.1145/2508363.2508380 · 4.10 Impact Factor
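    A minimal sketch of the blend-shape fitting step at the core of such pipelines, under simplifying assumptions (3D landmark targets and fixed head pose; the paper additionally refines pose, expression and fine-scale shape using optical flow and photometric stereo):

    ```python
    # Ridge-regularized least-squares fit of blend-shape weights to tracked
    # landmarks (all shapes and parameters here are assumptions).
    import numpy as np

    def fit_blendshape_weights(neutral, blendshapes, targets, reg=1e-3):
        """neutral: (P, 3) neutral-face landmarks; blendshapes: (B, P, 3)
        per-blend-shape landmark offsets; targets: (P, 3) tracked landmarks."""
        A = blendshapes.reshape(len(blendshapes), -1).T   # (3P, B)
        b = (targets - neutral).ravel()                   # (3P,)
        w = np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ b)
        return np.clip(w, 0.0, 1.0)             # keep weights in [0, 1]
    ```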
  • Miguel Granados · Kwang In Kim · James Tompkin · Christian Theobalt ·
    ABSTRACT: High dynamic range reconstruction of dynamic scenes requires careful handling of dynamic objects to prevent ghosting. However, in a recent review, Srikantha et al. [2012] conclude that "there is no single best method and the selection of an approach depends on the user's goal". We attempt to solve this problem with a novel approach that models the noise distribution of color values. We estimate the likelihood that a pair of colors in different images are observations of the same irradiance, and we use a Markov random field prior to reconstruct irradiance from pixels that are likely to correspond to the same static scene object. Dynamic content is handled by selecting a single low dynamic range source image and hand-held capture is supported through homography-based image alignment. Our noise-based reconstruction method achieves better ghost detection and removal than state-of-the-art methods for cluttered scenes with large object displacements. As such, our method is broadly applicable and helps move the field towards a single method for dynamic scene HDR reconstruction.
    ACM Transactions on Graphics 11/2013; 32(6):201. DOI:10.1145/2508363.2508410 · 4.10 Impact Factor
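    A simplified sketch of noise-aware HDR merging in the spirit of the entry: per-pixel irradiance estimates are averaged only over observations consistent with a reference exposure. The fixed-sigma consistency test below is a crude stand-in for the paper's color noise distribution and MRF prior:

    ```python
    # Consistency-weighted merge of linear exposures into an irradiance map.
    import numpy as np

    def merge_hdr(images, exposures, ref_idx=0, sigma=0.02, k=3.0):
        """images: list of (H, W) linear-response exposures in [0, 1];
        exposures: matching exposure times; sigma: assumed noise std."""
        ref_irr = images[ref_idx] / exposures[ref_idx]
        num = np.zeros_like(images[0])
        den = np.zeros_like(images[0])
        for img, t in zip(images, exposures):
            # Keep observations statistically consistent with the reference
            # irradiance, rejecting pixels that moved between exposures.
            consistent = np.abs(img - ref_irr * t) < k * sigma
            w = consistent * t ** 2             # longer exposure, less noise
            num += w * (img / t)
            den += w
        return num / np.maximum(den, 1e-12)
    ```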

Publication Stats

3k Citations
160.17 Total Impact Points


  • 2014
    • Tsinghua University
      • Department of Automation
      Beijing, China
  • 1998-2014
    • Max Planck Institute for Informatics
      Saarbrücken, Saarland, Germany
  • 2013
    • University College London
      London, England, United Kingdom
    • Universität des Saarlandes
      Saarbrücken, Saarland, Germany
  • 2012
    • Evangelische Hochschule Freiburg, Germany
      Freiburg, Baden-Württemberg, Germany
  • 2007-2009
    • Stanford University
      • Department of Computer Science
      Stanford, California, United States
    • Bulgarian Academy of Sciences
      Sofia, Bulgaria