Preprint

Capture Dense: Markerless Motion Capture Meets Dense Pose Estimation


Abstract

We present a method that combines markerless motion capture and dense pose feature estimation in a single framework. We demonstrate that dense pose information can help multi-view and single-view motion capture, and that multi-view motion capture can in turn help collect a high-quality dataset for training the dense pose detector. Specifically, we first introduce a novel markerless motion capture method that takes advantage of the dense parsing capability provided by the dense pose detector. Thanks to this dense human parsing capability, our method is demonstrably more efficient and accurate than the available state-of-the-art markerless motion capture approach. Second, we improve the performance of the available dense pose detector using multi-view markerless motion capture data. Such data benefit dense pose training because they are denser, more accurate, and more consistent, and they compensate for corner cases such as unusual viewpoints. We quantitatively demonstrate the improved performance of our dense pose detector over the available DensePose. Our dense pose dataset and detector will be made public.
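To make the coupling between dense pose estimation and model fitting concrete, here is a minimal, hypothetical sketch (not the authors' released code) of how per-pixel dense correspondences could enter a markerless motion capture objective as an extra data term; the camera parameters, vertex indices, and function names are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' implementation): dense-pose correspondences
# used as an extra data term when fitting a parametric body model to one view.
import numpy as np

def densepose_term(verts_3d, cam_K, cam_R, cam_t, pixels, vert_ids):
    """verts_3d: (V,3) posed model vertices; pixels: (N,2) image points that the
    dense pose detector mapped to template vertices vert_ids (N,). Returns mean 2D error."""
    corr = verts_3d[vert_ids]                      # (N,3) corresponding surface points
    cam_pts = corr @ cam_R.T + cam_t               # world -> camera coordinates
    proj = cam_pts @ cam_K.T                       # camera -> homogeneous image coordinates
    proj = proj[:, :2] / proj[:, 2:3]              # perspective divide
    return np.mean(np.sum((proj - pixels) ** 2, axis=1))
```

In practice such a term would be combined with 2D keypoint, silhouette, and prior terms and minimized over the body model's pose and shape parameters.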


References
Article
We propose DoubleFusion, a new real-time system that combines volumetric non-rigid reconstruction with data-driven template fitting to simultaneously reconstruct detailed surface geometry, large non-rigid motion, and the optimized human body shape from a single depth camera. One of the key contributions of this method is a double-layer representation consisting of a complete parametric body model inside, and a gradually fused detailed surface outside. A pre-defined node graph on the body parameterizes the non-rigid deformations near the body, and a free-form dynamically changing graph parameterizes the outer surface layer far from the body, which allows more general reconstruction. We further propose a joint motion tracking method based on the double-layer representation to enable robust and fast motion tracking performance. Moreover, the inner parametric body is optimized online and forced to fit inside the outer surface layer as well as the live depth input. Overall, our method enables increasingly denoised, detailed and complete surface reconstructions, fast motion tracking performance and plausible inner body shape reconstruction in real-time. Experiments and comparisons show improved fast motion tracking and loop closure performance on more challenging scenarios. Two extended applications including body measurement and shape retargeting show the potential of our system in terms of practical use.
Article
This paper describes how to obtain accurate 3D body models and texture of arbitrary people from a single, monocular video in which a person is moving. Based on a parametric body model, we present a robust processing pipeline achieving 3D model fits with 5mm accuracy also for clothed people. Our main contribution is a method to nonrigidly deform the silhouette cones corresponding to the dynamic human silhouettes, resulting in a visual hull in a common reference frame that enables surface reconstruction. This enables efficient estimation of a consensus 3D shape, texture and implanted animation skeleton based on a large number of frames. We present evaluation results for a number of test subjects and analyze overall performance. Requiring only a smartphone or webcam, our method enables everyone to create their own fully animatable digital double, e.g., for social VR applications or virtual try-on for online fashion shopping.
Article
We describe Human Mesh Recovery (HMR), an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. In contrast to most current methods that compute 2D or 3D joint locations, we produce a richer and more useful mesh representation that is parameterized by shape and 3D joint angles. The main objective is to minimize the reprojection loss of keypoints, which allows our model to be trained using in-the-wild images that only have ground-truth 2D annotations. However, reprojection loss alone is highly under-constrained. In this work we address this problem by introducing an adversary trained to tell whether a human body parameter is real or not using a large database of 3D human meshes. We show that HMR can be trained with and without using any coupled 2D-to-3D supervision. We do not rely on intermediate 2D keypoint detection and infer 3D pose and shape parameters directly from image pixels. Our model runs in real time given a bounding box containing the person. We demonstrate our approach on various images in the wild and outperform previous optimization-based methods that output 3D meshes, and show competitive results on tasks such as 3D joint location estimation and part segmentation.
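As an illustration of the reprojection objective described above, the following sketch evaluates a confidence-weighted keypoint reprojection loss under a weak-perspective camera; the parameterization and names are assumptions for illustration, not HMR's actual code.

```python
# Minimal sketch of a keypoint reprojection loss under a weak-perspective
# camera (s, tx, ty); names and parameterization are illustrative.
import numpy as np

def reprojection_loss(joints_3d, cam, joints_2d, vis):
    """joints_3d: (K,3) model joints; cam = (s, tx, ty); joints_2d: (K,2) annotations; vis: (K,) 0/1."""
    s, tx, ty = cam
    proj = s * joints_3d[:, :2] + np.array([tx, ty])   # orthographic projection + scale + shift
    err = np.abs(proj - joints_2d).sum(axis=1)          # L1 residual per joint
    return (vis * err).sum() / max(vis.sum(), 1)        # average over visible joints only
```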
Article
Humans move their hands and bodies together to communicate and solve tasks. Capturing and replicating such coordinated activity is critical for virtual characters that behave realistically. Surprisingly, most methods treat the 3D modeling and tracking of bodies and hands separately. Here we formulate a model of hands and bodies interacting together and fit it to full-body 4D sequences. When scanning or capturing the full body in 3D, hands are small and often partially occluded, making their shape and pose hard to recover. To cope with low-resolution, occlusion, and noise, we develop a new model called MANO (hand Model with Articulated and Non-rigid defOrmations). MANO is learned from around 1000 high-resolution 3D scans of hands of 31 subjects in a wide variety of hand poses. The model is realistic, low-dimensional, captures non-rigid shape changes with pose, is compatible with standard graphics packages, and can fit any human hand. MANO provides a compact mapping from hand poses to pose blend shape corrections and a linear manifold of pose synergies. We attach MANO to a standard parameterized 3D body shape model (SMPL), resulting in a fully articulated body and hand model (SMPL+H). We illustrate SMPL+H by fitting complex, natural, activities of subjects captured with a 4D scanner. The fitting is fully automatic and results in full body models that move naturally with detailed hand motions and a realism not seen before in full body performance capture. The models and data are freely available for research purposes at http://mano.is.tue.mpg.de.
Article
Designing and simulating realistic clothing is challenging. Previous methods addressing the capture of clothing from 3D scans have been limited to single garments and simple motions, lack detail, or require specialized texture patterns. Here we address the problem of capturing regular clothing on fully dressed people in motion. People typically wear multiple pieces of clothing at a time. To estimate the shape of such clothing, track it over time, and render it believably, each garment must be segmented from the others and the body. Our ClothCap approach uses a new multi-part 3D model of clothed bodies, automatically segments each piece of clothing, estimates the minimally clothed body shape and pose under the clothing, and tracks the 3D deformations of the clothing over time. We estimate the garments and their motion from 4D scans; that is, high-resolution 3D scans of the subject in motion at 60 fps. ClothCap is able to capture a clothed person in motion, extract their clothing, and retarget the clothing to new body shapes; this provides a step towards virtual try-on.
Article
We present the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. Our method combines a new convolutional neural network (CNN) based pose regressor with kinematic skeleton fitting. Our novel fully-convolutional pose formulation regresses 2D and 3D joint positions jointly in real time and does not require tightly cropped input frames. A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton. This makes our approach the first monocular RGB method usable in real-time applications such as 3D character control---thus far, the only monocular methods for such applications employed specialized RGB-D cameras. Our method's accuracy is quantitatively on par with the best offline 3D monocular RGB pose estimation methods. Our results are qualitatively comparable to, and sometimes better than, results from monocular RGB-D approaches, such as the Kinect. However, we show that our approach is more broadly applicable than RGB-D solutions, i.e. it works for outdoor scenes, community videos, and low quality commodity RGB cameras.
Conference Paper
We propose a unified formulation for the problem of 3D human pose estimation from a single raw RGB image that reasons jointly about 2D joint estimation and 3D pose reconstruction to improve both tasks. We take an integrated approach that fuses probabilistic knowledge of 3D human pose with a multi-stage CNN architecture and uses the knowledge of plausible 3D landmark locations to refine the search for better 2D locations. The entire process is trained end-to-end, is extremely efficient and obtains state-of-the-art results on Human3.6M, outperforming previous approaches both on 2D and 3D errors.
Article
In this paper we propose to learn a mapping from image pixels into a dense template grid through a fully convolutional network. We formulate this task as a regression problem and train our network by leveraging upon manually annotated facial landmarks "in-the-wild". We use such landmarks to establish a dense correspondence field between a three-dimensional object template and the input image, which then serves as the ground-truth for training our regression system. We show that we can combine ideas from semantic segmentation with regression networks, yielding a highly-accurate "quantized regression" architecture. Our system, called DenseReg, allows us to estimate dense image-to-template correspondences in a fully convolutional manner. As such our network can provide useful correspondence information as a stand-alone system, while when used as an initialization for Statistical Deformable Models we obtain landmark localization results that largely outperform the current state-of-the-art on the challenging 300W benchmark. We thoroughly evaluate our method on a host of facial analysis tasks, and also demonstrate its use for other correspondence estimation tasks, such as modelling of the human ear. DenseReg code is made available at http://alpguler.com/DenseReg.html along with supplementary materials.
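The "quantized regression" idea can be illustrated with a small sketch: classify each pixel into one of K coordinate bins, regress a residual within the chosen bin, and combine the two into a continuous template coordinate. The function below is a hypothetical decoding step, not the DenseReg implementation.

```python
# Hypothetical decoding of quantized regression: discrete bin + continuous residual.
import numpy as np

def dequantize(bin_idx, residual, n_bins):
    """bin_idx: (H,W) int in [0, n_bins); residual: (H,W) in [0,1) predicted per pixel.
    Returns a continuous template coordinate in [0,1) per pixel."""
    bin_width = 1.0 / n_bins
    return bin_idx * bin_width + residual * bin_width
```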
Article
To look human, digital full-body avatars need to have soft tissue deformations like those of real people. Current methods for physics simulation of soft tissue lack realism, are computationally expensive, or are hard to tune. Learning soft tissue motion from example, however, has been limited by the lack of dense, high-resolution, training data. We address this using a 4D capture system and a method for accurately registering 3D scans across time to a template mesh. Using over 40,000 scans of ten subjects, we compute how soft tissue motion causes mesh triangles to deform relative to a base 3D body model and learn a low-dimensional linear subspace approximating this soft-tissue deformation. Our model, called Dyna, relates the linear coefficients of this body surface deformation to the changing pose of the body. We learn a second-order auto-regressive model that predicts soft-tissue deformations based on previous deformations, the velocity and acceleration of the body, and the angular velocities and accelerations of the limbs. Dyna also models how deformations vary with a person’s body mass index (BMI), producing different deformations for people with different shapes. Dyna realistically represents the dynamics of soft tissue for previously unseen subjects and motions. Finally, we provide tools for animators to vary BMI to produce different effects, to selectively control the location and intensity of soft-tissue motions, and to apply the model to new, stylized characters.
Conference Paper
We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior pose accuracy with respect to the state of the art.
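A rough sketch of the kind of joint-based data term described above: a confidence-weighted, robustified distance between projected model joints and detected 2D joints. The robust penalty and names are illustrative; the full objective additionally includes pose and shape priors and an interpenetration term.

```python
# Illustrative sketch of a SMPLify-style joint data term (not the released code).
import numpy as np

def gm_rho(sq_residual, sigma=100.0):
    """Geman-McClure robust penalty applied to squared residuals."""
    return (sigma ** 2 * sq_residual) / (sigma ** 2 + sq_residual)

def joint_data_term(model_joints_3d, project_fn, joints_2d, conf):
    """project_fn maps (K,3) model joints to (K,2) image points; conf: detector confidences."""
    proj = project_fn(model_joints_3d)                 # projected model joints
    sq = np.sum((proj - joints_2d) ** 2, axis=1)       # squared 2D residual per joint
    return np.sum(conf * gm_rho(sq))                   # confidence-weighted robust sum
```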
Article
We contribute a new pipeline for live multi-view performance capture, generating temporally coherent high-quality reconstructions in real-time. Our algorithm supports both incremental reconstruction, improving the surface estimation over time, as well as parameterizing the nonrigid scene motion. Our approach is highly robust to both large frame-to-frame motion and topology changes, allowing us to reconstruct extremely challenging scenes. We demonstrate advantages over related real-time techniques that either deform an online generated template or continually fuse depth data nonrigidly into a single reference model. Finally, we show geometric reconstruction results on par with offline methods which require orders of magnitude more processing time and many more RGBD cameras.
Article
We present the first end-to-end solution to create high-quality free-viewpoint video encoded as a compact data stream. Our system records performances using a dense set of RGB and IR video cameras, generates dynamic textured surfaces, and compresses these to a streamable 3D video format. Four technical advances contribute to high fidelity and robustness: multimodal multi-view stereo fusing RGB, IR, and silhouette information; adaptive meshing guided by automatic detection of perceptually salient areas; mesh tracking to create temporally coherent subsequences; and encoding of tracked textured meshes as an MPEG video stream. Quantitative experiments demonstrate geometric accuracy, texture fidelity, and encoding efficiency. We release several datasets with calibrated inputs and processed results to foster future research.
Conference Paper
We present an algorithm for marker-less performance capture of interacting humans using only three hand-held Kinect cameras. Our method reconstructs human skeletal poses, deforming surface geometry and camera poses for every time step of the depth video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Only the combination of geometric and photometric correspondences and the integration of human pose and camera pose estimation enables reliable performance capture with only three sensors. As opposed to previous performance capture methods, our algorithm succeeds on general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.
Conference Paper
We propose a new method to quickly and accurately predict human pose---the 3D positions of body joints---from a single depth image, without depending on information from preceding frames. Our approach is strongly rooted in current object recognition strategies. By designing an intermediate representation in terms of body parts, the difficult pose estimation problem is transformed into a simpler per-pixel classification problem, for which efficient machine learning techniques exist. By using computer graphics to synthesize a very large dataset of training image pairs, one can train a classifier that estimates body part labels from test images invariant to pose, body shape, clothing, and other irrelevances. Finally, we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs in under 5ms on the Xbox 360. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state-of-the-art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.
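The per-pixel classification rests on simple depth-comparison features whose offsets are normalized by the depth at the pixel, giving rough scale invariance. A hypothetical version (names, units, and the background convention are assumptions) looks like this:

```python
# Sketch of a depth-comparison feature used as a split test in a decision forest.
import numpy as np

def depth_feature(depth, x, u, v, background=1e6):
    """depth: (H,W) depth map in mm; x: (row, col) integer pixel; u, v: 2D offsets in pixel*mm."""
    d = depth[x]
    def probe(offset):
        r = int(x[0] + offset[0] / d)
        c = int(x[1] + offset[1] / d)
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return depth[r, c]
        return background                     # out-of-image probes read as far background
    return probe(u) - probe(v)                # feature value compared against a learned threshold
```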
Conference Paper
Markerless tracking of human pose is a hard yet relevant problem. In this paper, we derive an efficient filtering algorithm for tracking human pose using a stream of monocular depth images. The key idea is to combine an accurate generative model - which is achievable in this setting using programmable graphics hardware - with a discriminative model that provides data-driven evidence about body part locations. In each filter iteration, we apply a form of local model-based search that exploits the nature of the kinematic chain. As fast movements and occlusion can disrupt the local search, we utilize a set of discriminatively trained patch classifiers to detect body parts. We describe a novel algorithm for propagating this noisy evidence about body part locations up the kinematic chain using the unscented transform. The resulting distribution of body configurations allows us to reinitialize the model-based search. We provide extensive experimental results on 28 real-world sequences using automatic ground-truth annotations from a commercial motion capture system.
Article
Details in mesh animations are difficult to generate but they have great impact on visual quality. In this work, we demonstrate a practical software system for capturing such details from multi-view video recordings. Given a stream of synchronized video images that record a human performance from multiple viewpoints and an articulated template of the performer, our system captures the motion of both the skeleton and the shape. The output mesh animation is enhanced with the details observed in the image silhouettes. For example, a performance in casual loose-fitting clothes will generate mesh animations with flowing garment motions. We accomplish this with a fast pose tracking method followed by nonrigid deformation of the template to fit the silhouettes. The entire process takes less than sixteen seconds per frame and requires no markers or texture cues. Captured meshes are in full correspondence making them readily usable for editing operations including texturing, deformation transfer, and deformation model learning.
Article
We present a framework and algorithms for robust geometry and motion reconstruction of complex deforming shapes. Our method makes use of a smooth template that provides a crude approximation of the scanned object and serves as a geometric and topological prior for reconstruction. Large-scale motion of the acquired object is recovered using a novel space-time adaptive, non-rigid registration method. Fine-scale details such as wrinkles and folds are synthesized with an efficient linear mesh deformation algorithm. Subsequent spatial and temporal filtering of detail coefficients allows transfer of persistent geometric detail to regions not observed by the scanner. We show how this two-scale process allows faithful recovery of small-scale shape and motion features leading to a high-quality reconstruction. We illustrate the robustness and generality of our algorithm on a variety of examples composed of different materials and exhibiting a large range of dynamic deformations. © 2009 ACM.
Conference Paper
We present a vision system for the 3-D model-based tracking of unconstrained human movement. Using image sequences acquired simultaneously from multiple views, we recover the 3-D body pose at each time instant without the use of markers. The pose-recovery problem is formulated as a search problem and entails finding the pose parameters of a graphical human model whose synthesized appearance is most similar to the actual appearance of the real human in the multi-view images. The models used for this purpose are acquired from the images. We use a decomposition approach and a best-first technique to search through the high-dimensional pose parameter space. A robust variant of chamfer matching is used as a fast similarity measure between synthesized and real edge images. We present initial tracking results from a large new Humans-in-Action (HIA) database containing more than 2500 frames in each of four orthogonal views. The sequences show subjects involved in a variety of activities of varying complexity, ranging from simple one-person hand waving to challenging two-person close interaction in the Argentine Tango.
Conference Paper
In this work we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an inpainting network that can fill in missing ground truth values and report improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter. We further improve accuracy through cascading, obtaining a system that delivers highly-accurate results at multiple frames per second on a single gpu. Supplementary materials, data, code, and videos are provided on the project page http://densepose.org.
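A typical way to consume such a detector's output is as a per-pixel (I, U, V) map: a body-part index plus chart coordinates on the template surface. The sketch below, with an assumed array layout, gathers pixel-to-surface correspondences from that map.

```python
# Minimal sketch of reading a DensePose-style IUV output into correspondences.
import numpy as np

def collect_correspondences(iuv):
    """iuv: (3,H,W) array with part index in channel 0 and U, V in [0,1] in channels 1 and 2.
    Returns (N,2) pixel coordinates and (N,3) (part, u, v) surface coordinates for foreground pixels."""
    part = iuv[0]
    ys, xs = np.nonzero(part > 0)                    # part index 0 marks background
    pixels = np.stack([xs, ys], axis=1)
    surface = np.stack([part[ys, xs], iuv[1, ys, xs], iuv[2, ys, xs]], axis=1)
    return pixels, surface
```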
Conference Paper
The sixteen-volume set comprising the LNCS volumes 11205-11220 constitutes the refereed proceedings of the 15th European Conference on Computer Vision, ECCV 2018, held in Munich, Germany, in September 2018. The 776 revised papers presented were carefully reviewed and selected from 2439 submissions. The papers are organized in topical sections on learning for vision; computational photography; human analysis; human sensing; stereo and reconstruction; optimization; matching and recognition; video attention; and poster sessions.
Article
We present a unified deformation model for the markerless capture of multiple scales of human movement, including facial expressions, body motion, and hand gestures. An initial model is generated by locally stitching together models of the individual parts of the human body, which we refer to as the "Frankenstein" model. This model enables the full expression of part movements, including face and hands by a single seamless model. Using a large-scale capture of people wearing everyday clothes, we optimize the Frankenstein model to create "Adam". Adam is a calibrated model that shares the same skeleton hierarchy as the initial model but can express hair and clothing geometry, making it directly usable for fitting people as they normally appear in everyday life. Finally, we demonstrate the use of these models for total motion tracking, simultaneously capturing the large-scale body movements and the subtle face and hand motion of a social group of people.
Conference Paper
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets.
Article
We present the first marker-less approach for temporally coherent 3D performance capture of a human with general clothing from monocular video. Our approach reconstructs articulated human skeleton motion as well as medium-scale non-rigid surface deformations in general scenes. Human performance capture is a challenging problem due to the large range of articulation, potentially fast motion, and considerable non-rigid deformations, even from multi-view data. Reconstruction from monocular video alone is drastically more challenging, since strong occlusions and the inherent depth ambiguity lead to a highly ill-posed reconstruction problem. We tackle these challenges by a novel approach that employs sparse 2D and 3D human pose detections from a convolutional neural network using a batch-based pose estimation strategy. Joint recovery of per-batch motion allows us to resolve the ambiguities of the monocular reconstruction problem based on a low-dimensional trajectory subspace. In addition, we propose refinement of the surface geometry based on fully automatically extracted silhouettes to enable medium-scale non-rigid alignment. We demonstrate state-of-the-art performance capture results that enable exciting applications such as video editing and free viewpoint video, previously infeasible from monocular video. Our qualitative and quantitative evaluation demonstrates that our approach significantly outperforms previous monocular methods in terms of accuracy, robustness and scene complexity that can be handled.
Article
We present a realtime approach for multi-person 2D pose estimation that predicts vector fields, which we refer to as Part Affinity Fields (PAFs), that directly expose the association between anatomical parts in an image. The architecture is designed to jointly learn part locations and their association, via two branches of the same sequential prediction process. The sequential prediction enables the part confidence maps and the association fields to encode global context, while allowing an efficient bottom-up parsing step that maintains tractable runtime complexity. Our method has set the state-of-the-art performance on the inaugural MSCOCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
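The association step can be summarized by the line-integral score below, which samples the predicted limb vector field along the segment joining two candidate joints and averages its alignment with the segment direction; the array layout and names are illustrative, not the released implementation.

```python
# Sketch of a Part Affinity Field association score between two candidate joints.
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, n_samples=10):
    """paf_x, paf_y: (H,W) limb vector field components; p1, p2: (x, y) candidate joint locations."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    norm = np.linalg.norm(d)
    if norm < 1e-6:
        return 0.0
    d = d / norm                                             # unit limb direction
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = (p1 + t * (p2 - p1)).round().astype(int)      # sample point on the segment
        score += paf_x[y, x] * d[0] + paf_y[y, x] * d[1]     # dot(field, direction)
    return score / n_samples
```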
Conference Paper
This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
Article
We present a fast, automatic method for accurately capturing full-body motion data using a single depth camera. At the core of our system lies a realtime registration process that accurately reconstructs 3D human poses from single monocular depth images, even in the case of significant occlusions. The idea is to formulate the registration problem in a Maximum A Posteriori (MAP) framework and iteratively register a 3D articulated human body model with monocular depth cues via linear system solvers. We integrate depth data, silhouette information, full-body geometry, temporal pose priors, and occlusion reasoning into a unified MAP estimation framework. Our 3D tracking process, however, requires manual initialization and recovery from failures. We address this challenge by combining 3D tracking with 3D pose detection. This combination not only automates the whole process but also significantly improves the robustness and accuracy of the system. Our whole algorithm is highly parallel and is therefore easily implemented on a GPU. We demonstrate the power of our approach by capturing a wide range of human movements in real time and achieve state-of-the-art accuracy in our comparison against alternative systems such as Kinect [2012].
Article
Capturing the skeleton motion and detailed time-varying surface geometry of multiple, closely interacting people is a very challenging task, even in a multicamera setup, due to frequent occlusions and ambiguities in feature-to-person assignments. To address this task, we propose a framework that exploits multiview image segmentation. To this end, a probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Given the articulated template models of each person and the labeled pixels, a combined optimization scheme, which splits the skeleton pose optimization problem into a local one and a lower dimensional global one, is applied one by one to each individual, followed by surface estimation to capture detailed nonrigid deformations. We show on various sequences that our approach can capture the 3D motion of humans accurately even if they move rapidly, if they wear wide apparel, and if they are engaged in challenging multiperson motions, including dancing, wrestling, and hugging.
Article
Shape-From-Silhouette (SFS) is a shape reconstruction method which constructs a 3D shape estimate of an object using silhouette images of the object. The output of a SFS algorithm is known as the Visual Hull (VH). Traditionally SFS is either performed on static objects, or separately at each time instant in the case of videos of moving objects. In this paper we develop a theory of performing SFS across time: estimating the shape of a dynamic object (with unknown motion) by combining all of the silhouette images of the object over time. We first introduce a one dimensional element called a Bounding Edge to represent the Visual Hull. We then show that aligning two Visual Hulls using just their silhouettes is in general ambiguous and derive the geometric constraints (in terms of Bounding Edges) that govern the alignment. To break the alignment ambiguity, we combine stereo information with silhouette information and derive a Temporal SFS algorithm which consists of two steps: (1) estimate the motion of the objects over time (Visual Hull Alignment) and (2) combine the silhouette information using the estimated motion (Visual Hull Refinement). The algorithm is first developed for rigid objects and then extended to articulated objects. In Part II of this paper we apply our temporal SFS algorithm to two human-related applications: (1) the acquisition of detailed human kinematic models and (2) marker-less motion tracking.
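The core SFS operation, leaving aside the temporal alignment discussed above, is plain voxel carving: a voxel is kept only if it projects inside the silhouette of every calibrated view. A minimal sketch, assuming 3x4 projection matrices and binary masks:

```python
# Sketch of Shape-From-Silhouette by voxel carving over calibrated views.
import numpy as np

def visual_hull(voxels, silhouettes, projections):
    """voxels: (N,3) world points; silhouettes: list of (H,W) binary masks;
    projections: list of (3,4) camera matrices. Returns (N,) occupancy mask."""
    keep = np.ones(len(voxels), dtype=bool)
    hom = np.hstack([voxels, np.ones((len(voxels), 1))])     # homogeneous coordinates
    for mask, P in zip(silhouettes, projections):
        uvw = hom @ P.T
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        inside = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]] > 0
        keep &= hit                                          # carve away voxels missing any silhouette
    return keep
```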
Conference Paper
We present an approach for modeling the human body by Sums of spatial Gaussians (SoG), allowing us to perform fast and high-quality markerless motion capture from multi-view video sequences. The SoG model is equipped with a color model to represent the shape and appearance of the human and can be reconstructed from a sparse set of images. Similar to the human body, we also represent the image domain as SoG that models color-consistent image blobs. Based on the SoG models of the image and the human body, we introduce a novel continuous and differentiable model-to-image similarity measure that can be used to estimate the skeletal motion of a human at 5-15 frames per second even for many camera views. In our experiments, we show that our method, which does not rely on silhouettes or training data, offers a good balance between accuracy and computational cost.
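The model-to-image similarity is continuous and differentiable because the overlap integral of two isotropic 2D Gaussians has a closed form. The sketch below shows the structure of such a measure; the constants and the color weighting are simplified assumptions, not the paper's exact formulation.

```python
# Sketch of a Sums-of-Gaussians similarity: sum of pairwise Gaussian overlaps,
# weighted by color similarity between image blobs and projected body Gaussians.
import numpy as np

def sog_similarity(img_mu, img_sigma, img_col, mdl_mu, mdl_sigma, mdl_col):
    """img_mu: (M,2), mdl_mu: (N,2) 2D means; *_sigma: per-Gaussian std devs; *_col: (.,3) colors."""
    total = 0.0
    for i in range(len(img_mu)):
        for j in range(len(mdl_mu)):
            s2 = img_sigma[i] ** 2 + mdl_sigma[j] ** 2
            d2 = np.sum((img_mu[i] - mdl_mu[j]) ** 2)
            # closed-form integral of the product of two unnormalized isotropic 2D Gaussians
            overlap = 2 * np.pi * img_sigma[i] ** 2 * mdl_sigma[j] ** 2 / s2 * np.exp(-d2 / (2 * s2))
            col_sim = max(0.0, 1.0 - np.linalg.norm(img_col[i] - mdl_col[j]))   # simplified color term
            total += col_sim * overlap
    return total
```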
Article
This paper demonstrates a new visual motion estimation technique that is able to recover high degree-of-freedom articulated human body configurations in complex video sequences. We introduce the use and integration of a mathematical technique, the product of exponential maps and twist motions, into a differential motion estimation. This results in solving simple linear systems, and enables us to recover robustly the kinematic degrees-of-freedom in noise and complex self occluded configurations. A new factorization technique lets us also recover the kinematic chain model itself. We are able to track several human walk cycles, several wallaby hop cycles, and two walk cycles of the famous movements of Eadweard Muybridge's motion studies from the last century. To the best of our knowledge, this is the first computer vision based system that is able to process such challenging footage.
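The product-of-exponentials machinery can be illustrated with a small forward-kinematics sketch: each joint contributes the SE(3) exponential of its twist, and the transforms are chained along the kinematic chain. This is the standard textbook construction, not the paper's code; names are illustrative.

```python
# Sketch of forward kinematics with the product of exponential maps of twists.
import numpy as np

def twist_exp(omega, v, theta):
    """Exponential of a twist (omega, v) with |omega| = 1, scaled by angle theta -> 4x4 SE(3) matrix."""
    wx = np.array([[0, -omega[2], omega[1]],
                   [omega[2], 0, -omega[0]],
                   [-omega[1], omega[0], 0]])
    R = np.eye(3) + np.sin(theta) * wx + (1 - np.cos(theta)) * (wx @ wx)       # Rodrigues' formula
    t = (np.eye(3) - R) @ np.cross(omega, v) + np.outer(omega, omega) @ v * theta
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def forward_kinematics(twists, thetas, point_h):
    """Apply the product of exponentials to a homogeneous point attached to the last link."""
    T = np.eye(4)
    for (omega, v), theta in zip(twists, thetas):
        T = T @ twist_exp(np.asarray(omega, float), np.asarray(v, float), theta)
    return T @ point_h
```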
Article
We present a method for markerless tracking of complex human motions from multiple camera views. In the absence of markers, the task of recovering the pose of a person during such motions is challenging and requires strong image features and robust tracking. We propose a solution which integrates multiple image cues such as edges, color information and volumetric reconstruction. We show that a combination of multiple image cues helps the tracker to overcome ambiguous situations such as limbs touching or strong occlusions of body parts. Following a model-based approach, we match an articulated body model built from superellipsoids against these image cues. Stochastic Meta Descent (SMD) optimization is used to find the pose which best matches the images. Stochastic sampling makes SMD robust against local minima and lowers the computational costs as a small set of predicted image features is sufficient for optimization. The power of SMD is demonstrated by comparing it to the commonly used Levenberg–Marquardt method. Results are shown for several challenging sequences showing complex motions and full articulation, with tracking of 24 degrees of freedom in ≈1 frame per second.
M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
A. O. Balan, L. Sigal, M. J. Black, J. E. Davis, and H. W. Haussecker. Detailed human shape and pose from images. In CVPR, 2007.
Z. Zhang. Iterative closest point (ICP). In Computer Vision: A Reference Guide, pages 433-434, 2014.