Christian Theobalt

Universität des Saarlandes, Saarbrücken, Saarland, Germany

Publications (105) · 42.9 Total Impact

  • ACM Transactions on Graphics 11/2013; 32(6):201. · 3.36 Impact Factor
  • ABSTRACT: Capturing the skeleton motion and detailed time-varying surface geometry of multiple, closely interacting people is a very challenging task, even in a multicamera setup, due to frequent occlusions and ambiguities in feature-to-person assignments. To address this task, we propose a framework that exploits multiview image segmentation. To this end, a probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Given the articulated template models of each person and the labeled pixels, a combined optimization scheme, which splits the skeleton pose optimization problem into a local one and a lower-dimensional global one, is applied to each individual in turn, followed by surface estimation to capture detailed nonrigid deformations. We show on various sequences that our approach can capture the 3D motion of humans accurately even if they move rapidly, if they wear wide apparel, and if they are engaged in challenging multiperson motions, including dancing, wrestling, and hugging.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 11/2013; 35(11):2720-35.
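    A minimal sketch of the pixel-labeling step described in the abstract above: each pixel is assigned to the person whose color model explains it best. The per-person Gaussian color models and flat priors are illustrative assumptions; the paper uses a richer probabilistic shape and appearance model.

      import numpy as np

      def assign_pixels(image, color_means, color_covs, priors):
          """Assign each pixel uniquely to the person with the highest
          posterior under a per-person Gaussian color model (hypothetical
          stand-in for the paper's shape and appearance model).
          image: (H, W, 3) RGB floats; color_means: list of (3,) means;
          color_covs: list of (3, 3) covariances; priors: per-person priors."""
          H, W, _ = image.shape
          pix = image.reshape(-1, 3)
          log_post = np.empty((len(color_means), pix.shape[0]))
          for p, (mu, cov, pi) in enumerate(zip(color_means, color_covs, priors)):
              diff = pix - mu
              inv = np.linalg.inv(cov)
              maha = np.einsum('ni,ij,nj->n', diff, inv, diff)
              log_post[p] = np.log(pi) - 0.5 * (maha + np.linalg.slogdet(cov)[1])
          return log_post.argmax(axis=0).reshape(H, W)  # one label per pixel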
  • ABSTRACT: We present an algorithm for creating free-viewpoint video of interacting humans using three handheld Kinect cameras. Our method reconstructs deforming surface geometry and temporally varying texture of humans through estimation of human poses and camera poses for every time step of the RGBZ video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem, which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Finally, texture recovery is achieved through joint optimization over spatio-temporal RGB data using matrix completion. As opposed to previous methods, our algorithm succeeds on free-viewpoint video of human actors in general uncontrolled indoor scenes with potentially dynamic backgrounds, and it succeeds even if the cameras are moving.
    IEEE Transactions on Cybernetics 07/2013;
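    The joint energy minimization named above can be pictured as one optimization over a single stacked parameter vector of skeleton and camera poses. The sketch below shows only that structure, with toy quadratic stand-ins for the three terms (geometric correspondences, implicit segmentation, image features); the names, weights, and dimensions are assumptions.

      import numpy as np
      from scipy.optimize import minimize

      def e_geometry(x):      # toy geometric-correspondence term
          return np.sum((x[:10] - 1.0) ** 2)

      def e_segmentation(x):  # toy implicit-segmentation term
          return np.sum(x[10:20] ** 2)

      def e_features(x):      # toy image-feature term
          return np.sum((x[20:] + 0.5) ** 2)

      def joint_energy(x, w=(1.0, 0.5, 0.5)):
          # One vector stacks skeleton and camera parameters, so a single
          # minimization aligns human shape templates and cameras jointly.
          return w[0] * e_geometry(x) + w[1] * e_segmentation(x) + w[2] * e_features(x)

      x0 = np.zeros(30)  # stacked skeleton + camera parameters, one time step
      result = minimize(joint_energy, x0, method='L-BFGS-B')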
  • ABSTRACT: We describe a method for 3D object scanning by aligning depth scans that were taken from around an object with a Time-of-Flight (ToF) camera. These ToF cameras can measure depth scans at video rate. Due to their comparably simple technology, they bear potential for economical production in large volumes. Our easy-to-use, cost-effective scanning solution, which is based on such a sensor, could make 3D scanning technology more accessible to everyday users. The algorithmic challenge we face is that the sensor's level of random noise is substantial and there is a nontrivial systematic bias. In this paper, we show the surprising result that 3D scans of reasonable quality can also be obtained with a sensor of such low data quality. Established filtering and scan alignment techniques from the literature fail to achieve this goal. In contrast, our algorithm is based on a new combination of a 3D superresolution method with a probabilistic scan alignment approach that explicitly takes into account the sensor's noise characteristics.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 05/2013; 35(5):1039-50.
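    A sketch of the superresolution-style fusion idea: average many registered, noisy ToF depth scans, weighting each sample by a noise proxy. Using squared amplitude as the inverse-variance proxy is an assumption standing in for the paper's calibrated noise model, and the probabilistic scan alignment itself is omitted.

      import numpy as np

      def fuse_depth_scans(depths, amplitudes, eps=1e-6):
          """Per-pixel weighted average of registered ToF depth scans.
          depths, amplitudes: (K, H, W) arrays of aligned scans. ToF depth
          noise roughly shrinks with signal amplitude, so weight each
          sample by amplitude**2 (illustrative noise model)."""
          w = amplitudes ** 2
          return (w * depths).sum(axis=0) / (w.sum(axis=0) + eps)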
  • 01/2013;
  • ABSTRACT: Reconstructing a three-dimensional representation of human motion in real time constitutes an important research topic with applications in sports sciences, human-computer interaction, and the movie industry. In this paper, we contribute a robust algorithm for estimating a personalized human body model from just two sequentially captured depth images that is more accurate and runs an order of magnitude faster than the current state-of-the-art procedure. Then, we employ the estimated body model to track the pose in real time from a stream of depth images, using a tracking algorithm that combines local pose optimization and a stabilizing database lookup. Together, this enables pose tracking that is more accurate than previous approaches. As a further contribution, we evaluate and compare our algorithm to previous work on a comprehensive benchmark dataset containing more than 15 minutes of challenging motions. This dataset comprises calibrated marker-based motion capture data, depth data, as well as ground-truth tracking results, and is publicly available for research purposes.
    2013 International Conference on 3D Vision (3DV); 01/2013
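    The tracking idea above combines a local optimizer with a stabilizing database lookup. A hedged sketch, with hypothetical per-frame depth descriptors and a plain linear blend in place of the paper's fusion:

      import numpy as np
      from scipy.spatial import cKDTree

      class HybridTracker:
          """Local pose refinement plus nearest-neighbor database lookup."""
          def __init__(self, features, poses, blend=0.7):
              self.tree = cKDTree(features)  # database of pose descriptors
              self.poses = poses             # corresponding known poses
              self.blend = blend

          def track(self, prev_pose, frame_feature, refine):
              local = refine(prev_pose, frame_feature)  # local optimization
              _, idx = self.tree.query(frame_feature)   # global retrieval
              retrieved = self.poses[idx]               # rescues tracking failures
              return self.blend * local + (1 - self.blend) * retrieved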
  • Computer Graphics Forum 05/2012; 31(2pt1):219-228. · 1.64 Impact Factor
  • International Journal of Computer Vision 01/2012; 97:1. · 3.62 Impact Factor
  • ABSTRACT: We present a new spatio-temporal method for markerless motion capture. We reconstruct the pose and motion of a character from a multi-view video sequence without requiring the cameras to be synchronized and without aligning captured frames in time. By formulating the model-to-image similarity measure as a temporally continuous functional, we are also able to reconstruct motion in much higher temporal detail than was possible with previous synchronized approaches. By purposefully running the cameras unsynchronized, we can capture even very fast motion using off-the-shelf, high-quality cameras.
    Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on; 01/2012
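    The key ingredient above is a pose representation that is continuous in time, so the similarity measure can be evaluated at each camera's own exposure times rather than at shared frame indices. A minimal sketch, with hypothetical joint-angle trajectories represented as splines:

      import numpy as np
      from scipy.interpolate import CubicSpline

      rng = np.random.default_rng(0)
      knots = np.linspace(0.0, 1.0, 20)        # spline knot times (seconds)
      angles = rng.normal(size=(20, 3))        # toy trajectories of 3 joint angles
      pose = CubicSpline(knots, angles)        # theta(t), continuous in time

      cam_times = [np.arange(0.000, 1.0, 1 / 25),   # camera A at 25 Hz
                   np.arange(0.013, 1.0, 1 / 25)]   # camera B, offset shutter
      for ts in cam_times:
          theta = pose(ts)  # model poses exactly at this camera's timestamps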
  • Proceedings of the 12th European Conference on Computer Vision - Volume Part I, Florence, Italy; 01/2012
  • James Tompkin, Kwang In Kim, Jan Kautz, Christian Theobalt
    ABSTRACT: The abundance of mobile devices and digital cameras with video capture makes it easy to obtain large collections of video clips that contain the same location, environment, or event. However, such an unstructured collection is difficult to comprehend and explore. We propose a system that analyzes collections of unstructured but related video data to create a Videoscape: a data structure that enables interactive exploration of video collections by visually navigating -- spatially and/or temporally -- between different clips. We automatically identify transition opportunities, or portals. From these portals, we construct the Videoscape, a graph whose edges are video clips and whose nodes are portals between clips. Once structured, the videos can be interactively explored by walking the graph or via a geographic map. Given this system, we gauge preference for different video transition styles in a user study and derive heuristics that automatically choose an appropriate transition style. We evaluate our system in three further user studies, which allow us to conclude that Videoscapes provide significant benefits over related methods. Our system leads to previously unseen ways of interactive spatio-temporal exploration of casually captured videos, and we demonstrate this on several video collections.
    ACM Transactions on Graphics (TOG) 01/2012;
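    The Videoscape itself is a small graph structure: portals are nodes, clips are edges. A toy sketch with hypothetical clip and portal names:

      from collections import defaultdict
      from itertools import islice

      class Videoscape:
          def __init__(self):
              self.adj = defaultdict(list)  # portal -> [(clip_id, next_portal)]

          def add_clip(self, clip_id, portal_in, portal_out):
              self.adj[portal_in].append((clip_id, portal_out))

          def walk(self, portal, choose):
              # Interactive exploration: repeatedly pick an outgoing clip.
              while self.adj[portal]:
                  clip, portal = choose(self.adj[portal])
                  yield clip

      vs = Videoscape()
      vs.add_clip("clip_courtyard", "portal_gate", "portal_fountain")
      vs.add_clip("clip_alley", "portal_fountain", "portal_gate")
      tour = list(islice(vs.walk("portal_gate", lambda opts: opts[0]), 4))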
  • ABSTRACT: Multi-view stereo methods reconstruct 3D geometry from images well for sufficiently textured scenes, but often fail to recover high-frequency surface detail, particularly for smoothly shaded surfaces. On the other hand, shape-from-shading methods can recover fine detail from shading variations. Unfortunately, it is non-trivial to apply shape-from-shading alone to multi-view data, and most shading-based estimation methods only succeed under very restricted or controlled illumination. We present a new algorithm that combines multi-view stereo and shading-based refinement for high-quality reconstruction of 3D geometry models from images taken under constant but otherwise arbitrary illumination. We have tested our algorithm on several scenes captured under general and unknown lighting conditions, and we show that our final reconstructions rival laser range scans.
    Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on; 07/2011
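    The shading-based refinement hinges on a residual between observed intensity and predicted shading; perturbing geometry (and hence normals) shrinks it. The single directional light below is a simplification for illustration, not the paper's general illumination model:

      import numpy as np

      def shading_residual(normals, albedo, light, observed):
          """Lambertian residual I - albedo * max(n . l, 0) per vertex.
          normals: (V, 3) unit normals; albedo: (V,); light: (3,) direction;
          observed: (V,) image intensities sampled at the vertices."""
          shading = albedo * np.clip(normals @ light, 0.0, None)
          return observed - shading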
  • ABSTRACT: We present an approach for modeling the human body by Sums of spatial Gaussians (SoG), allowing us to perform fast and high-quality markerless motion capture from multi-view video sequences. The SoG model is equipped with a color model to represent the shape and appearance of the human and can be reconstructed from a sparse set of images. Similar to the human body, we also represent the image domain as a SoG that models color-consistent image blobs. Based on the SoG models of the image and the human body, we introduce a novel continuous and differentiable model-to-image similarity measure that can be used to estimate the skeletal motion of a human at 5-15 frames per second, even for many camera views. In our experiments, we show that our method, which does not rely on silhouettes or training data, offers a good balance between accuracy and computational cost.
    IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011; 01/2011
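    What makes the SoG similarity continuous and differentiable is that the overlap integral of two Gaussian blobs has a closed form, so the measure is a smooth sum of such terms. For two unnormalized isotropic 2D Gaussians:

      import numpy as np

      def gaussian_overlap(mu_i, s_i, mu_j, s_j):
          """Integral over the image plane of
          exp(-|x-mu_i|^2/(2 s_i^2)) * exp(-|x-mu_j|^2/(2 s_j^2)),
          which is again Gaussian in the distance between the blobs."""
          s2 = s_i**2 + s_j**2
          d2 = np.sum((np.asarray(mu_i) - np.asarray(mu_j)) ** 2)
          return 2.0 * np.pi * s_i**2 * s_j**2 / s2 * np.exp(-d2 / (2.0 * s2))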
  • ABSTRACT: We present a markerless motion capture approach that reconstructs the skeletal motion and detailed time-varying surface geometry of two closely interacting people from multi-view video. Due to ambiguities in feature-to-person assignments and frequent occlusions, it is not feasible to directly apply single-person capture approaches to the multi-person case. We therefore propose a combined image segmentation and tracking approach to overcome these difficulties. A new probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Thereafter, a single-person markerless motion and surface capture approach can be applied to each individual, either one-by-one or in parallel, even under strong occlusions. We demonstrate the performance of our approach on several challenging multi-person motions, including dance and martial arts, and also provide a reference dataset for multi-person motion capture with ground truth.
    The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011; 01/2011
  • ABSTRACT: In recent years, depth cameras have become a widely available sensor type that captures depth images at real-time frame rates. Even though recent approaches have shown that 3D pose estimation from monocular 2.5D depth images has become feasible, there are still challenging problems due to strong noise in the depth data and self-occlusions in the motions being captured. In this paper, we present an efficient and robust pose estimation framework for tracking full-body motions from a single depth image stream. Following a data-driven hybrid strategy that combines local optimization with global retrieval techniques, we contribute several technical improvements that lead to speed-ups of an order of magnitude compared to previous approaches. In particular, we introduce a variant of Dijkstra's algorithm to efficiently extract pose features from the depth data and describe a novel late-fusion scheme based on an efficiently computable sparse Hausdorff distance to combine local and global pose estimates. Our experiments show that the combination of these techniques facilitates real-time tracking with stable results even for fast and complex motions, making it applicable to a wide range of interactive scenarios.
    IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011; 01/2011
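    The Dijkstra-based feature extraction can be pictured as follows: build a neighborhood graph on the depth point cloud, compute geodesic distances from the body centroid, and take the farthest points, which tend to be head, hands, and feet. Graph construction is omitted and the details differ from the paper's variant:

      import heapq
      import numpy as np

      def geodesic_extrema(n_points, edges, source, k=5):
          """Plain Dijkstra; edges[u] yields (v, length) pairs. Returns the
          indices of the k geodesically farthest reachable points."""
          dist = np.full(n_points, np.inf)
          dist[source] = 0.0
          heap = [(0.0, source)]
          while heap:
              d, u = heapq.heappop(heap)
              if d > dist[u]:
                  continue  # stale heap entry
              for v, w in edges[u]:
                  nd = d + w
                  if nd < dist[v]:
                      dist[v] = nd
                      heapq.heappush(heap, (nd, v))
          reachable = np.where(np.isfinite(dist))[0]
          return reachable[np.argsort(dist[reachable])[-k:]]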
  • ABSTRACT: We present a method to synthesize plausible video sequences of humans according to user-defined body motions and viewpoints. We first capture a small database of multi-view video sequences of an actor performing various basic motions. This database needs to be captured only once and serves as the input to our synthesis algorithm. We then apply a marker-less model-based performance capture approach to the entire database to obtain pose and geometry of the actor in each database frame. To create novel video sequences of the actor from the database, a user animates a 3D human skeleton with novel motion and viewpoints. Our technique then synthesizes a realistic video sequence of the actor performing the specified motion based only on the initial database. For instance, this enables us to easily create video sequences of actors performing dangerous stunts without them being placed in harm's way. The first key component of our approach is a new efficient retrieval strategy to find appropriate spatio-temporally coherent database frames from which to synthesize target video frames. The second key component is a warping-based texture synthesis approach that uses the retrieved most-similar database frames to synthesize spatio-temporally coherent target video frames. We show through a variety of result videos and a user study that we can synthesize realistic videos of people, even if the target motions and camera views are different from the database content.
    ACM Transactions on Graphics 01/2011; 30:32. · 3.36 Impact Factor
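    The retrieval strategy above must balance similarity to the target pose against temporal coherence with what was already chosen. A greedy sketch with illustrative descriptors and weights (the paper's strategy is more elaborate):

      import numpy as np

      def retrieve_frames(targets, database, coherence=0.3):
          """targets: (T, D) target pose descriptors; database: (N, D)
          descriptors of consecutive database frames. Penalize jumps in
          database frame index to favor spatio-temporal coherence."""
          picks, prev = [], None
          for t in targets:
              cost = np.linalg.norm(database - t, axis=1)
              if prev is not None:
                  cost += coherence * np.abs(np.arange(len(database)) - prev)
              prev = int(np.argmin(cost))
              picks.append(prev)
          return picks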
  • ABSTRACT: We present an approach to add true fine-scale spatio-temporal shape detail to dynamic scene geometry captured from multi-view video footage. Our approach exploits shading information to recover the millimeter-scale surface structure, but in contrast to related approaches, succeeds under general unconstrained lighting conditions. Our method starts from a set of multi-view video frames and an initial series of reconstructed coarse 3D meshes that lack any surface detail. In a spatio-temporal maximum a posteriori probability (MAP) inference framework, our approach first estimates the incident illumination and the spatially-varying albedo map on the mesh surface for every time instant. Thereafter, albedo and illumination are used to estimate the true geometric detail visible in the images and add it to the coarse reconstructions. The MAP framework uses weak temporal priors on lighting, albedo, and geometry, which improve reconstruction quality yet allow for temporal variations in the data.
    IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011; 01/2011
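    The illumination estimate at each time instant can be sketched as Lambertian inverse rendering: fit low-order spherical-harmonics lighting coefficients to observed intensities by least squares. The MAP framework and temporal priors are omitted here, and basis scaling factors are folded into the recovered coefficients:

      import numpy as np

      def estimate_lighting(normals, intensities, albedo):
          """normals: (V, 3) unit normals; intensities, albedo: (V,).
          Returns 9 lighting coefficients; predicted shading is
          albedo * (B @ coeffs)."""
          x, y, z = normals.T
          B = np.stack([np.ones_like(x), x, y, z,           # orders 0-1
                        x * y, x * z, y * z,                # order 2
                        x**2 - y**2, 3 * z**2 - 1], axis=1)
          coeffs, *_ = np.linalg.lstsq(albedo[:, None] * B, intensities,
                                       rcond=None)
          return coeffs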
  • ABSTRACT: Given a multi-exposure sequence of a scene, our aim is to recover the absolute irradiance falling onto a linear camera sensor. The established approach is to perform a weighted average of the scaled input exposures. However, there is no clear consensus on the appropriate weighting to use. We propose a weighting function that produces statistically optimal estimates under the assumption of compound-Gaussian noise. Our weighting is based on a calibrated camera model that accounts for all noise sources. This model also allows us to simultaneously estimate the irradiance and its uncertainty. We evaluate our method on simulated and real-world photographs, and show that we consistently improve the signal-to-noise ratio over previous approaches. Finally, we show the effectiveness of our model for optimal exposure sequence selection and HDR image denoising.
    The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010; 01/2010
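    The statistically optimal weighting has a compact form: each exposure i yields an irradiance sample z_i / t_i with variance var_i / t_i^2, and the minimum-variance combination weights each sample by the inverse of that variance. A sketch, where `var` stands for the calibrated per-pixel noise variances from the camera model:

      import numpy as np

      def estimate_irradiance(z, t, var):
          """z, var: (K, H, W) raw exposures and their noise variances;
          t: (K,) exposure times. Returns the fused irradiance estimate
          and its per-pixel standard deviation."""
          t = np.asarray(t, dtype=float)
          w = t[:, None, None] ** 2 / var       # inverse-variance weights
          x = z / t[:, None, None]              # per-exposure irradiance samples
          est = (w * x).sum(axis=0) / w.sum(axis=0)
          sigma = 1.0 / np.sqrt(w.sum(axis=0))  # uncertainty of the estimate
          return est, sigma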
  • ABSTRACT: We present a new performance capture approach that incorporates a physically-based cloth model to reconstruct a rigged, fully animatable virtual double of a real person in loose apparel from multi-view video recordings. Our algorithm requires only a minimum of manual interaction. Without the use of optical markers in the scene, our algorithm first reconstructs skeleton motion and detailed time-varying surface geometry of a real person from a reference video sequence. These captured reference performance data are then analyzed to automatically identify non-rigidly deforming pieces of apparel on the animated geometry. For each piece of apparel, parameters of a physically-based real-time cloth simulation model are estimated, and surface geometry of occluded body regions is approximated. The reconstructed character model comprises a skeleton-based representation for the actual body parts and a physically-based simulation model for the apparel. In contrast to previous performance capture methods, we can now also create new real-time animations of actors captured in general apparel.
    ACM Transactions on Graphics 01/2010; 29:139. · 3.36 Impact Factor
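    The apparel-identification step can be caricatured as comparing the captured surface against a purely skeleton-driven prediction: vertices that persistently deviate move non-rigidly and are candidates for the cloth model. The threshold test below is illustrative, not the paper's criterion:

      import numpy as np

      def flag_apparel_vertices(tracked, skinned, tol=0.01):
          """tracked, skinned: (T, V, 3) captured vs. skeleton-skinned
          vertex positions over T frames; tol in scene units.
          Returns a boolean mask over the V vertices."""
          deviation = np.linalg.norm(tracked - skinned, axis=2)
          return deviation.mean(axis=0) > tol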
  • ABSTRACT: We describe a method for 3D object scanning by aligning depth scans that were taken from around an object with a time-of-flight (ToF) camera. These ToF cameras can measure depth scans at video rate. Due to their comparably simple technology, they bear potential for low-cost production in large volumes. Our easy-to-use, cost-effective scanning solution based on such a sensor could make 3D scanning technology more accessible to everyday users. The algorithmic challenge we face is that the sensor's level of random noise is substantial and there is a non-trivial systematic bias. In this paper, we show the surprising result that 3D scans of reasonable quality can also be obtained with a sensor of such low data quality. Established filtering and scan alignment techniques from the literature fail to achieve this goal. In contrast, our algorithm is based on a new combination of a 3D superresolution method with a probabilistic scan alignment approach that explicitly takes into account the sensor's noise characteristics.
    The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010; 01/2010

Publication Stats

1k Citations
343 Downloads
42.90 Total Impact Points

Institutions

  • 2013
    • Universität des Saarlandes
      Saarbrücken, Saarland, Germany
    • Tsinghua University
Beijing, China
  • 2002–2010
    • Max Planck Institute for Informatics
      Saarbrücken, Saarland, Germany
  • 2005–2009
    • Stanford University
      Palo Alto, California, United States
  • 2007
    • Max Planck Institute for Empirical Aesthetics
      Frankfurt, Hesse, Germany
    • Zhejiang University
Hangzhou, Zhejiang, China
  • 2003
    • Max Planck Society
      München, Bavaria, Germany