Article

Articulated distance fields for ultra-fast tracking of hands interacting


Abstract

The state of the art in articulated hand tracking has been greatly advanced by hybrid methods that fit a generative hand model to depth data, leveraging both temporally and discriminatively predicted starting poses. In this paradigm, the generative model is used to define an energy function and a local iterative optimization is performed from these starting poses in order to find a "good local minimum" (i.e. a local minimum close to the true pose). Performing this optimization quickly is key to exploring more starting poses, performing more iterations and, crucially, exploiting high frame rates that ensure that temporally predicted starting poses are in the basin of convergence of a good local minimum. At the same time, a detailed and accurate generative model tends to deepen the good local minima and widen their basins of convergence. Recent work, however, has largely had to trade off such a detailed hand model against one that facilitates such rapid optimization. We present a new implicit model of hand geometry that mostly avoids this compromise and leverage it to build an ultra-fast hybrid hand tracking system. Specifically, we construct an articulated signed distance function that, for any pose, yields a closed form calculation of both the distance to the detailed surface geometry and the necessary derivatives to perform gradient based optimization. There is no need to introduce or update any explicit "correspondences", yielding a simple algorithm that maps well to parallel hardware such as GPUs. As a result, our system can run at extremely high frame rates (e.g. up to 1000fps). Furthermore, we demonstrate how to detect, segment and optimize for two strongly interacting hands, recovering complex interactions at extremely high frame rates. In the absence of publicly available datasets of sufficiently high frame rate, we leverage a multiview capture system to create a new 180fps dataset of one and two hands interacting together or with objects.
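To make the key idea concrete, the sketch below is a minimal illustration of an articulated signed distance function, not the authors' implementation (which uses a far more detailed implicit surface and closed-form derivatives): each query point is mapped into every bone's local frame, a per-bone primitive distance is evaluated, and the minimum over bones gives the distance to the posed model. Names such as `capsule_sdf` and the capsule parameterization are illustrative assumptions, and finite differences stand in for the closed-form gradients.

```python
import numpy as np

def capsule_sdf(p, a, b, r):
    """Signed distance from point p to a capsule with axis a->b and radius r."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    closest = a + t * ab
    return np.linalg.norm(p - closest) - r

def articulated_sdf(p, bone_transforms, bones):
    """Distance to the posed hand: the query point is mapped into every bone's
    local frame and the minimum primitive distance is returned.
    bone_transforms: list of 4x4 world-from-bone matrices for the current pose.
    bones: list of (a, b, r) capsule parameters in bone-local coordinates."""
    best = np.inf
    for T, (a, b, r) in zip(bone_transforms, bones):
        # transform the query point into the bone's local frame
        T_inv = np.linalg.inv(T)
        p_local = T_inv[:3, :3] @ p + T_inv[:3, 3]
        best = min(best, capsule_sdf(p_local, a, b, r))
    return best

def sdf_gradient(p, bone_transforms, bones, eps=1e-4):
    """Gradient w.r.t. the query point via central differences (the paper
    derives this in closed form; numerical differences are used for brevity)."""
    g = np.zeros(3)
    for i in range(3):
        d = np.zeros(3); d[i] = eps
        g[i] = (articulated_sdf(p + d, bone_transforms, bones)
                - articulated_sdf(p - d, bone_transforms, bones)) / (2 * eps)
    return g
```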

... To ease the problem, many previous works on 3D hand pose estimation use special depth cameras providing partial 3D information. Nevertheless, many of them focused on tracking a single isolated hand, with only a few exceptions that are able to handle object interactions [Panteleris et al. 2015; Sridhar et al. 2016; Tzionas and Gall 2015] or interactions with a second hand [Mueller et al. 2019; Taylor et al. 2017]. In recent years, the research focus has shifted towards methods that use a single RGB camera since these sensors are ubiquitous [Mueller et al. 2018; Zimmermann et al. 2019]. ...
... Other methods, like [Sridhar et al. 2016;Tzionas et al. 2016], jointly reconstruct hand and object motion, and are thus able to exploit mutual constraints like physically stable grasps. Pose estimation methods for two hands often have a trade-off between real-time runtime [Taylor et al. 2017] and accurate collision resolution [Kyriazis and Argyros 2014;Oikonomidis et al. 2012;Tzionas et al. 2016]. The most recent method by [Mueller et al. 2019] runs in real time while providing coarse interpenetration avoidance. ...
... We present an overview of our approach in Figure 2. Given a monocular RGB image that depicts a two-hand interaction scenario, our goal is to recover the global 3D pose and 3D surface geometry by fitting a parametric hand model to both hands in the input image, as described in Sec. 4. Such a model-fitting task requires information extracted from the input image to be used as a fitting target, which however represents a major challenge when using monocular RGB data only. Previous methods that rely on depth data [Mueller et al. 2019;Taylor et al. 2017] are implicitly provided with a much richer input (i.e., global depth), which is the fundamental ingredient for an accurate 3D pose and shape fit. Per-pixel estimation of correct 3D hand depth from a single RGB image is very challenging. ...
Full-text available
Preprint
Tracking and reconstructing the 3D pose and geometry of two hands in interaction is a challenging problem that has a high relevance for several human-computer interaction applications, including AR/VR, robotics, or sign language recognition. Existing works are either limited to simpler tracking settings (e.g., considering only a single hand or two spatially separated hands), or rely on less ubiquitous sensors, such as depth cameras. In contrast, in this work we present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera that explicitly considers close interactions. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions, together with newly proposed intra-hand relative depth and inter-hand distance maps. These predictions are subsequently used in a generative model fitting framework in order to estimate pose and shape parameters of a 3D hand model for both hands. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline through an extensive ablation study. Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions. Moreover, our method even performs on-par with depth-based real-time methods.
Full-text available
Preprint
We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands. Our approach is the first two-hand tracking solution that combines an extensive list of favorable properties, namely it is marker-less, uses a single consumer-level depth camera, runs in real time, handles inter- and intra-hand collisions, and automatically adjusts to the user's hand shape. In order to achieve this, we embed a recent parametric hand pose and shape model and a dense correspondence predictor based on a deep neural network into a suitable energy minimization framework. For training the correspondence prediction network, we synthesize a two-hand dataset based on physical simulations that includes both hand pose and shape annotations while at the same time avoiding inter-hand penetrations. To achieve real-time rates, we phrase the model fitting in terms of a nonlinear least-squares problem so that the energy can be optimized based on a highly efficient GPU-based Gauss-Newton optimizer. We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work, including tight two-hand grasps, significant inter-hand occlusions, and gesture interaction.
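The abstract above phrases model fitting as a nonlinear least-squares problem solved with a highly efficient GPU-based Gauss-Newton optimizer. As a rough, hedged sketch of what one damped Gauss-Newton step looks like (CPU-only toy code with a numerical Jacobian purely for illustration; the actual system uses analytic Jacobians of its energy terms):

```python
import numpy as np

def numerical_jacobian(residuals, theta, eps=1e-6):
    """Forward-difference Jacobian of the residual vector w.r.t. theta."""
    r0 = residuals(theta)
    J = np.zeros((r0.size, theta.size))
    for j in range(theta.size):
        d = np.zeros_like(theta); d[j] = eps
        J[:, j] = (residuals(theta + d) - r0) / eps
    return J

def gauss_newton_step(residuals, theta, damping=1e-4):
    """One damped Gauss-Newton update for E(theta) = sum_i r_i(theta)^2."""
    r = residuals(theta)
    J = numerical_jacobian(residuals, theta)
    H = J.T @ J + damping * np.eye(theta.size)   # damped normal equations
    delta = np.linalg.solve(H, -J.T @ r)
    return theta + delta

# usage: fit a 2-parameter line to toy data
if __name__ == "__main__":
    xs = np.linspace(0, 1, 20)
    ys = 2.0 * xs + 0.5
    res = lambda th: th[0] * xs + th[1] - ys
    theta = np.zeros(2)
    for _ in range(10):
        theta = gauss_newton_step(res, theta)
    print(theta)  # approximately [2.0, 0.5]
```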
... Methods need to run robustly and in real time in uncontrolled environments, for example in a cluttered living room and not only in a research lab. This necessity is made even more crucial by the recent advances in virtual and augmented reality [Taylor et al., 2017; Taylor et al., 2016]. (Figure 1.1: Real-time 3D hand motion capture and reconstruction enables diverse applications in virtual reality and gaming.) ...
... The first class of methods, the so-called generative methods, assumes the availability of a generative model of the hand, ranging from meshes and collections of geometric primitives to implicit functions, as depicted in Figure 2.1 (Heap and Hogg, 1996; Oikonomidis et al., 2011a; Tagliasacchi et al., 2015; Taylor et al., 2016, 2017; Tkach et al., 2016). During pose optimization, the image formation model is employed to compare the hand model at its current pose to the input image and this discrepancy is minimized. ...
... Only few methods estimate a detailed hand shape automatically. Khamis et al. learn a model of hand shape and pose which can be used for generative model fitting. (Table: hand shape representations with example methods: geometric primitives [Oikonomidis et al., 2012]; meshes [Tzionas et al., 2016]; sphere meshes [Tkach et al., 2016]; articulated distance functions [Taylor et al., 2017]; subdivision surfaces [Taylor et al., 2016]; sum of Gaussians [Sridhar et al., 2014].) Generative methods usually enforce temporal consistency but are therefore prone to both propagating errors over time and getting stuck in poor local optima. ...
Full-text available
Article
Tracking and reconstructing the 3D pose and geometry of two hands in interaction is a challenging problem that has a high relevance for several human-computer interaction applications, including AR/VR, robotics, or sign language recognition. Existing works are either limited to simpler tracking settings ( e.g. , considering only a single hand or two spatially separated hands), or rely on less ubiquitous sensors, such as depth cameras. In contrast, in this work we present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera that explicitly considers close interactions. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions, together with newly proposed intra-hand relative depth and inter-hand distance maps. These predictions are subsequently used in a generative model fitting framework in order to estimate pose and shape parameters of a 3D hand model for both hands. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline through an extensive ablation study. Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions. Moreover, our method even performs on-par with depth-based real-time methods.
... With the rapidly growing usage of wearable technologies generating massive volumes of egocentric image data [4, 3, 2, 1], the ability for machines to understand human hands becomes crucial for applications such as human-computer interaction (HCI), activity logging, gesture/sign language recognition and VR/AR. Consequently, hand detection and segmentation are fundamental in areas such as 2D/3D hand pose estimation [36,27,23] and gesture recognition [7,17]. However, hand segmentation on images in the wild is extremely challenging due to numerous factors: vastness of the color space, different skin color/texture, complex background noise, motion blur, lighting type/color, shadow features, speed and model size requirements, etc. ...
... Early works [37,35,33] utilized Randomized Decision Forests (RDF) on depth image to obtain the hand segmentation, which allows multicore parallelization with fast inference time suitable for real-time applications. [36] introduced a Fully Convolutional Network (FCN) that segments the left and right hand for fast tracking of two interacting hands in egocentric viewpoint. Similarly, [9] proposed a hybrid encoder-decoder architecture with skip-connections for two-hand segmentation from a third-person viewpoint. ...
... Recently, [22] extended the segmentation task to 8 classes to include arms and objects and trained a FCN on synthetic data with a level of generalization on real depth data. Following [36], [27] used a Correspondence Regression Network to estimate two-hand segmentation prior to hand pose estimation, which shows the significance of separate segmentation of the two hands for pose estimation as it provides information on how interacting hands occlude each other. Note that for depth-based approaches, segmentation ground truth is obtained by color thresholding, requiring subjects to wear thin colored gloves. ...
Full-text available
Preprint
Hand segmentation and detection in truly unconstrained RGB-based settings is important for many applications. However, existing datasets are far from sufficient both in terms of size and variety due to the infeasibility of manual annotation of large amounts of segmentation and detection data. As a result, current methods are limited by many underlying assumptions such as constrained environment, consistent skin color and lighting. In this work, we present a large-scale RGB-based egocentric hand segmentation/detection dataset Ego2Hands that is automatically annotated and a color-invariant compositing-based data generation technique capable of creating unlimited training data with variety. For quantitative analysis, we manually annotated an evaluation set that significantly exceeds existing benchmarks in quantity, diversity and annotation accuracy. We show that our dataset and training technique can produce models that generalize to unseen environments without domain adaptation. We introduce Convolutional Segmentation Machine (CSM) as an architecture that better balances accuracy, size and speed and provide thorough analysis on the performance of state-of-the-art models on the Ego2Hands dataset.
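The compositing-based data generation described above can be illustrated with a minimal sketch (function names and details are assumptions, not the Ego2Hands code): a segmented hand crop is alpha-blended onto an arbitrary background, which yields both a training image and a free segmentation mask.

```python
import numpy as np

def composite_hand(background, hand_rgb, hand_mask, top, left):
    """background: HxWx3 float image in [0,1]; hand_rgb: hxwx3 hand crop;
    hand_mask: hxw float alpha in [0,1]; (top, left): paste location.
    Assumes the crop fits entirely inside the background."""
    out = background.copy()
    h, w = hand_mask.shape
    region = out[top:top + h, left:left + w]
    alpha = hand_mask[..., None]
    # alpha-blend the hand crop over the chosen background region
    out[top:top + h, left:left + w] = alpha * hand_rgb + (1 - alpha) * region
    # the segmentation label comes for free from the compositing mask
    full_mask = np.zeros(background.shape[:2], dtype=np.float32)
    full_mask[top:top + h, left:left + w] = hand_mask
    return out, full_mask
```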
... Finally, it is not trivial to design efficient network architectures that can formulate the occlusion and interaction of two hands while meeting the requirement of low latency on mobile hardware. Promising results have been produced by monocular depth-based two-hand tracking methods [20,29,30,38,40]. Although these depth-based frameworks have been studied for years, their algorithmic complexity limits the ubiquitous application of the methods. ...
... A recent multi-view tracking based method provides a solution to reconstruct high-quality interactive hand motions; however, its hardware setup is expensive and the algorithm is time-consuming. Other monocular kinematic tracking based methods are sensitive to fast motion and possible tracking failure regardless of whether a depth sensor [20,29,30,38,40] or an RGB camera [41] is incorporated. However, their dense mapping strategy queries correspondences between hand vertices and image pixels. ...
Preprint
Hand reconstruction has achieved great success in real-time applications such as virtual reality and augmented reality, while interacting two-hand reconstruction through efficient transformers has been left unexplored. In this paper, we propose a method called lightweight attention hand (LWA-HAND) to reconstruct hands with low FLOPs from a single RGB image. To solve the occlusion and interaction challenges in efficient attention architectures, we introduce three mobile attention modules. The first module is a lightweight feature attention module that extracts both local occlusion representation and global image patch representation in a coarse-to-fine manner. The second module is a cross image and graph bridge module which fuses image context and hand vertices. The third module is a lightweight cross-attention mechanism that uses element-wise operations for cross attention of two hands in linear complexity. The resulting model achieves comparable performance on the InterHand2.6M benchmark in comparison with the state-of-the-art models. Simultaneously, it reduces the flops to $0.47GFlops$ while the state-of-the-art models have heavy computations between $10GFlops$ and $20GFlops$.
... Vision-based hand tracking is a topic of interest to several researchers. Most work on hand tracking focuses on the use of depth cameras [26] or RGB [12]. Depth-based approaches present results that are superior to RGB-based approaches. ...
... Depth-based approaches present results that are superior to RGB-based approaches. A depth camera provides hand geometry in terms of a 2.5D point cloud, and the model-based approaches can reliably fit a hand mesh to the reconstructed point cloud [26]. Using hand tracking input with mobile technology is a problem mainly due to the high energy consumption. ...
Conference Paper
Extended Reality as a consolidated game platform has always been a dream for both final consumers and game producers. While on one side this technology has enchanted and drawn attention due to its possibilities, on the other side many challenges and difficulties have delayed its proliferation and mass adoption. This paper intends to raise and discuss aspects and considerations related to these challenges and solutions. We try to cover the most relevant research topics and to anticipate how XR games should look in the near future. We divide the challenges into 7 topics, based on extensive literature reviews: Cybersickness, User Experience, Displays, Rendering, Movements, Body Tracking and External World Information.
... Han et al. [17] make use of marker gloves while Simon et al. [5] employ multiple view setups. Taylor et al. [18], [19] use a high frame-rate depth camera to jointly optimize the pose and correspondences of a subdivision surface model. Mueller et al. [20] present a realtime two hand reconstruction method using single commodity depth camera. ...
... There are a few existing methods that try to address the 3D multi-hand pose estimation task. Taylor et al. [19] and Mueller et al. [20] track two hands in real-time using the extra depth sensor. Simon et al. [5] propose the first 3D markerless hand motion capture system with multi-view setups. ...
Preprint
In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from single 2D image. Previous studies either focus on single hand reconstruction or solve this problem in a multi-stage way. Moreover, the conventional two-stage pipeline firstly detects hand areas, and then estimates 3D hand pose from each cropped patch. To reduce the computational redundancy in preprocessing and feature extraction, we propose a concise but efficient single-stage pipeline. Specifically, we design a multi-head auto-encoder structure for multi-hand reconstruction, where each head network shares the same feature map and outputs the hand center, pose and texture, respectively. Besides, we adopt a weakly-supervised scheme to alleviate the burden of expensive 3D real-world data annotations. To this end, we propose a series of losses optimized by a stage-wise training scheme, where a multi-hand dataset with 2D annotations is generated based on the publicly available single hand datasets. In order to further improve the accuracy of the weakly supervised model, we adopt several feature consistency constraints in both single and multiple hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks including FreiHAND, HO3D, InterHand2.6M and RHD demonstrate that our method outperforms the state-of-the-art model-based methods in both weakly-supervised and fully-supervised manners.
... Second, the interaction context between two hands is difficult to formulate effectively during network design and training. Monocular depth-based two-hand tracking [22,27,28,38-40] has been studied for years and promising results have been demonstrated. However, the energy demand and algorithm complexity restrict the ubiquitous application of depth-based methods. ...
... A recent multiview tracking based method [34] could reconstruct high-quality interactive hand motions, however, its hardware setup is expensive, and the algorithm is timeconsuming. Monocular kinematic tracking based two-hand motion estimation methods, regardless of whether a depth sensor [22,27,28,38,39] or an RGB camera [40] is incorporated, are sensitive to fast motion and possible tracking failure. However, their dense mapping strategy, which queries correspondences between hand vertices and image pixels, inspires us to seek mesh-image alignment using dense features. ...
Preprint
Graph convolutional network (GCN) has achieved great success in single hand reconstruction task, while interacting two-hand reconstruction by GCN remains unexplored. In this paper, we present Interacting Attention Graph Hand (IntagHand), the first graph convolution based network that reconstructs two interacting hands from a single RGB image. To solve occlusion and interaction challenges of two-hand reconstruction, we introduce two novel attention based modules in each upsampling step of the original GCN. The first module is the pyramid image feature attention (PIFA) module, which utilizes multiresolution features to implicitly obtain vertex-to-image alignment. The second module is the cross hand attention (CHA) module that encodes the coherence of interacting hands by building dense cross-attention between two hand vertices. As a result, our model outperforms all existing two-hand reconstruction methods by a large margin on InterHand2.6M benchmark. Moreover, ablation studies verify the effectiveness of both PIFA and CHA modules for improving the reconstruction accuracy. Results on in-the-wild images further demonstrate the generalization ability of our network. Our code is available at https://github.com/Dw1010/IntagHand.
... Capturing 3D interacting hand motion from monocular single RGB images can facilitate numerous downstream tasks including AR/VR [8,47] and social signal understanding [13,26]. Previous works on motion capture of two interacting hands [2,25,28,39,43,46] mainly rely on depth images, multi-view images or image sequences as input. These methods cannot be easily applied to monocular RGB images. ...
... Tzionas et al. [43] extend the idea of discriminative salient points by introducing physics simulation. Taylor et al. [39] introduce a new implicit hand model to facilitate simultaneous optimization over both hand poses and hand surfaces. Smith et al. [34] use an elastic volume deformation and a collision response model to recover dense hand surface from multi-view sequences. ...
Preprint
3D interacting hand reconstruction is essential to facilitate human-machine interaction and human behaviors understanding. Previous works in this field either rely on auxiliary inputs such as depth images or they can only handle a single hand if monocular single RGB images are used. Single-hand methods tend to generate collided hand meshes, when applied to closely interacting hands, since they cannot model the interactions between two hands explicitly. In this paper, we make the first attempt to reconstruct 3D interacting hands from monocular single RGB images. Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions. This is made possible via a two-stage framework. Specifically, the first stage adopts a convolutional neural network to generate coarse predictions that tolerate collisions but encourage pose-accurate hand meshes. The second stage progressively ameliorates the collisions through a series of factorized refinements while retaining the preciseness of 3D poses. We carefully investigate potential implementations for the factorized refinement, considering the trade-off between efficiency and accuracy. Extensive quantitative and qualitative results on large-scale datasets such as InterHand2.6M demonstrate the effectiveness of the proposed approach.
... Vision-based hand tracking is a topic of interest to several researchers. Most work on hand tracking focuses on the use of depth cameras [28] or RGB [21]. Depth-based approaches present results that are superior to RGB-based approaches. ...
... Depth-based approaches present results that are superior to RGB-based approaches. A depth camera provides hand geometry in terms of a 2.5D point cloud, and the model-based approaches can reliably fit a hand mesh to the reconstructed point cloud [28]. Using hand tracking input with mobile technology is a problem mainly due to the high energy consumption. ...
Conference Paper
Extended Reality as a consolidated game platform has always been a dream for both final consumers and game producers. While on one side this technology has enchanted and drawn attention due to its possibilities, on the other side many challenges and difficulties have delayed its proliferation and mass adoption. This workshop intends to raise and discuss aspects and considerations related to these challenges and solutions. We try to cover the most relevant research topics and to anticipate how XR games should look in the near future. We divide the challenges into 7 topics: Cybersickness, User Experience, Displays, Rendering, Movements, Body Tracking and External World Information.
... As such, despite their generally good performance, these methods may produce bio-mechanically implausible poses [38]. An alternative to learning-based approaches are model-based hand tracking methods, such as [15], [27], [32], [35], among others. These methods use generative hand models to recover the pose that best explains the image through an analysis-by-synthesis strategy. ...
... Previous work demonstrated that energies based on articulated, rigid, part-based models of the hand can be optimized to provide good tracking [20], [15]. Additional 3D hand representations, including continuous subdivision surfaces [31], collection of Gaussians [26], [28], sphere meshes [34], and articulated signed distance functions [32], have been proposed with the goal of creating detailed models that are still fast to optimize. ...
Preprint
We propose to use a model-based generative loss for training hand pose estimators on depth images based on a volumetric hand model. This additional loss allows training of a hand pose estimator that accurately infers the entire set of 21 hand keypoints while only using supervision for 6 easy-to-annotate keypoints (fingertips and wrist). We show that our partially-supervised method achieves results that are comparable to those of fully-supervised methods which enforce articulation consistency. Moreover, for the first time we demonstrate that such an approach can be used to train on datasets that have erroneous annotations, i.e. "ground truth" with notable measurement errors, while obtaining predictions that explain the depth images better than the given "ground truth".
... Self-Contact. Most previous work on self-contact (Tzionas et al. 2016; Tzionas and Gall 2013; Taylor et al. 2017) applies to the interaction of human extremities, such as hands. Tzionas et al. (2016) introduce a method for modelling 3D hand-to-hand or hand-to-object interactions based on RGB-D data. ...
Article
Monocular estimation of three dimensional human self-contact is fundamental for detailed scene analysis including body language understanding and behaviour modeling. Existing 3d reconstruction methods do not focus on body regions in self-contact and consequently recover configurations that are either far from each other or self-intersecting, when they should just touch. This leads to perceptually incorrect estimates and limits impact in those very fine-grained analysis domains where detailed 3d models are expected to play an important role. To address such challenges we detect self-contact and design 3d losses to explicitly enforce it. Specifically, we develop a model for Self-Contact Prediction (SCP), that estimates the body surface signature of self-contact, leveraging the localization of self-contact in the image, during both training and inference. We collect two large datasets to support learning and evaluation: (1) HumanSC3D, an accurate 3d motion capture repository containing 1,032 sequences with 5,058 contact events and 1,246,487 ground truth 3d poses synchronized with images collected from multiple views, and (2) FlickrSC3D, a repository of 3,969 images, containing 25,297 surface-to-surface correspondences with annotated image spatial support. We also illustrate how more expressive 3d reconstructions can be recovered under self-contact signature constraints and present monocular detection of face-touch as one of the multiple applications made possible by more accurate self-contact models.
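As a hedged illustration of how a detected self-contact signature can be turned into a 3D loss (a generic formulation for illustration, not necessarily the paper's exact loss), vertex pairs predicted to be in contact can simply be penalized for any residual gap beyond a small tolerance:

```python
import numpy as np

def self_contact_loss(vertices, contact_pairs, tol=0.005):
    """vertices: (V, 3) reconstructed surface vertices (metres);
    contact_pairs: (P, 2) integer indices of vertex pairs whose surface
    regions were predicted to be in contact; tol: allowed residual gap."""
    vi = vertices[contact_pairs[:, 0]]
    vj = vertices[contact_pairs[:, 1]]
    gaps = np.linalg.norm(vi - vj, axis=-1)
    # zero loss once the pair is within the tolerance, quadratic otherwise
    return np.mean(np.maximum(gaps - tol, 0.0) ** 2)
```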
... For two-hand 3D tracking, segmentation and detection of both hands are commonly required prior to pose estimation [11,14,18]. Recently, [10] introduced Ego2Hands for the task of two-hand segmentation and detection in the wild. ...
Full-text available
Preprint
Color-based two-hand 3D pose estimation in the global coordinate system is essential in many applications. However, there are very few datasets dedicated to this task and no existing dataset supports estimation in a non-laboratory environment. This is largely attributed to the sophisticated data collection process required for 3D hand pose annotations, which also leads to difficulty in obtaining instances with the level of visual diversity needed for estimation in the wild. Progressing towards this goal, a large-scale dataset Ego2Hands was recently proposed to address the task of two-hand segmentation and detection in the wild. The proposed composition-based data generation technique can create two-hand instances with quality, quantity and diversity that generalize well to unseen domains. In this work, we present Ego2HandsPose, an extension of Ego2Hands that contains 3D hand pose annotation and is the first dataset that enables color-based two-hand 3D tracking in unseen domains. To this end, we develop a set of parametric fitting algorithms to enable 1) 3D hand pose annotation using a single image, 2) automatic conversion from 2D to 3D hand poses and 3) accurate two-hand tracking with temporal consistency. We provide incremental quantitative analysis on the multi-stage pipeline and show that training on our dataset achieves state-of-the-art results that significantly outperform other datasets for the task of egocentric two-hand global 3D pose estimation.
... Reconstructing 3D hand surfaces from RGB or depth observations is a well-studied problem [28]. Existing work can generally be classified into two paradigms: discriminative approaches [18,12,71,41,7] directly estimate hand shape and pose parameters from the observation, while generative approaches [55,57,59] iteratively optimize a parametric hand model so that its projection matches the observation. Recently, more challenging settings such as reconstructing two interacting hands [64,44,53] are also explored. ...
Full-text available
Preprint
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous work focuses on static grasps and contacts. At the core of our method are TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. The key component is a point-wise object-centric representation which encodes the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for 1) correcting erroneous reconstruction results from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising, and 3) grasp transfer across objects. We will release our code and trained model on our project page at http://virtualhumans.mpi-inf.mpg.de/toch/
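A heavily simplified sketch of the point-wise, object-centric idea behind TOCH fields (an assumed form for illustration; the actual representation is richer and temporal) is to record, for each sampled object point, whether a hand point lies within a correspondence radius and the offset to its nearest hand point:

```python
import numpy as np

def toch_like_field(object_points, hand_points, radius=0.01):
    """object_points: (M, 3) samples on/near the object surface;
    hand_points: (N, 3) samples on the hand surface.
    Returns (M,) correspondence flags and (M, 3) offsets to the nearest hand point."""
    diffs = object_points[:, None, :] - hand_points[None, :, :]   # (M, N, 3)
    dists = np.linalg.norm(diffs, axis=-1)                        # (M, N)
    nearest = dists.argmin(axis=1)
    offsets = hand_points[nearest] - object_points                # object-centric offsets
    corr = dists.min(axis=1) < radius                             # in-contact / corresponding?
    return corr.astype(np.float32), offsets
```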
... Generative methods often rely heavily on tracking and are prone to drift whereas discriminative methods tend to generalize poorly to unseen images [1]. Hybrid approaches [4,47,51,45,48,32,55,7,12,43,54] try to combine the best of these two worlds by using discriminative methods to detect visual cues in the image followed by model fitting. ...
Preprint
We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. Our method starts by extracting a set of potential 2D locations for the joints of both hands as extrema of a heatmap. We do not require that all locations correctly correspond to a joint, nor that all the joints are detected. We use appearance and spatial encodings of these locations as input to a transformer, and leverage the attention mechanisms to sort out the correct configuration of the joints and output the 3D poses of both hands. Our approach thus allies the recognition power of a Transformer with the accuracy of heatmap-based methods. We also show it can be extended to estimate the 3D pose of an object manipulated by one or two hands. We evaluate our approach on the recent and challenging InterHand2.6M and HO-3D datasets. We obtain 17% improvement over the baseline. Moreover, we introduce the first dataset made of action sequences of two hands manipulating an object fully annotated in 3D and will make it publicly available.
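The first stage described above extracts potential 2D joint locations as extrema of a heatmap. A minimal sketch of that step (generic heatmap peak picking, not the paper's code) could look like this:

```python
import numpy as np

def heatmap_peaks(heatmap, thresh=0.3):
    """Return (row, col, score) tuples for local maxima above thresh.
    heatmap: 2D float array of per-pixel joint likelihoods."""
    H, W = heatmap.shape
    peaks = []
    for r in range(1, H - 1):
        for c in range(1, W - 1):
            v = heatmap[r, c]
            # a peak must dominate (or tie) its 3x3 neighborhood
            if v > thresh and v == heatmap[r - 1:r + 2, c - 1:c + 2].max():
                peaks.append((r, c, float(v)))
    return peaks
```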
... However, it only provided a rough approximation of the human hand and lacked accuracy. Many models were proposed to improve the performance, such as [5], [18]. The most popular of them is MANO [6], which is easy to understand and produces accurate results. ...
Article
Due to its flexible joints and self-occlusion, representation and reconstruction of 3D human hand is a very challenging problem. Although some parametric models have been proposed to alleviate this problem, these representation models have limited representation ability, like not being able to represent complex gestures. In this paper, we presented a new 3D hand model with powerful representation ability and applied it to high accuracy monocular RGB-D/RGB 3D hand reconstruction. To achieve this, we firstly build a large scale high-quality hand mesh data set based on MANO with a novel mesh deformation method. We train a VAE based on this data set, and get the low-dimensional representation of hand meshes. By using our HandVAE model, we can recover a 3D human hand by giving a code within this latent space. We also build a framework to recover 3D hand mesh from RGB-D/RGB data. Experimental results have demonstrated the powerfulness of our hand model in terms of the reconstruction accuracy and the application for RGB-D/RGB reconstruction. We believe that our 3D hand representation could be further used in other related human hand applications.
... The dataset is then leveraged for 3D hand pose estimation from RGB images by fitting the MANO model to the predicted 2D hand joints. Taylor et al. [54] proposed an approach for hand tracking from a new custom-built depth sensor. Their custom depth camera supports 180 fps which is significantly faster than commodity depth cameras (30-60fps) but still far from the temporal resolution of an event camera. ...
Preprint
3D hand pose estimation from monocular videos is a long-standing and challenging problem, which is now seeing a strong upturn. In this work, we address it for the first time using a single event camera, i.e., an asynchronous vision sensor reacting on brightness changes. Our EventHands approach has characteristics previously not demonstrated with a single RGB or depth camera such as high temporal resolution at low data throughputs and real-time performance at 1000 Hz. Due to the different data modality of event cameras compared to classical cameras, existing methods cannot be directly applied to and re-trained for event streams. We thus design a new neural approach which accepts a new event stream representation suitable for learning, which is trained on newly-generated synthetic event streams and can generalise to real data. Experiments show that EventHands outperforms recent monocular methods using a colour (or depth) camera in terms of accuracy and its ability to capture hand motions of unprecedented speed. Our method, the event stream simulator and the dataset will be made publicly available.
... Discriminative methods learn a mapping from input to parameters of hand keypoints such as coordinates or joint angles [9][10][11][14][15][16][17][18][19][20][21][22][23][24][25]. Hybrid methods generally use a model from generative methods to optimize the initial estimation from discriminative methods [2,12,[40][41][42]. ...
Article
3D hand pose estimation taking a point cloud as input has received more and more attention recently. In this paper, a new module for point cloud processing, named Local-aware Point Processing Module (LPPM), is designed. With the ability to extract local information, it is permutation invariant w.r.t. neighboring points in the input point cloud and is an independent module that is easy to implement and flexible for constructing point cloud networks. Based on this module, a LPPM-Net is constructed to estimate 3D hand pose. In order to normalize orientations of the point cloud as well as to maintain diversity properly in a controllable manner, we transform the point cloud into an oriented bounding box coordinate system (OBB C.S.) and then rotate it randomly around the principal axis when training. In addition, a simple but effective technique called sampling ensemble is used in the test stage, which compensates for the resolution degradation caused by downsampling and improves the performance without extra parameters. We evaluate the proposed method on three public hand datasets: NYU, ICVL, and MSRA. Results show that our approach has a competitive performance on the three datasets.
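The OBB normalization and principal-axis augmentation described above can be sketched as follows (an illustrative PCA-based version; the paper's exact construction may differ):

```python
import numpy as np

def obb_normalize(points, rng=None):
    """points: (N, 3) hand point cloud. Returns points expressed in an
    oriented-bounding-box frame derived from the principal axes; optionally
    applies a random rotation about the principal axis as augmentation."""
    center = points.mean(axis=0)
    centered = points - center
    # principal axes of the cloud (rows of vt, sorted by variance)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    local = centered @ vt.T                      # rotate into the OBB frame
    if rng is not None:                          # random roll about the principal axis
        angle = rng.uniform(0, 2 * np.pi)
        c, s = np.cos(angle), np.sin(angle)
        roll = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
        local = local @ roll.T
    return local

# usage: normalized = obb_normalize(cloud, rng=np.random.default_rng(0))
```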
... An energy over correspondence and model parameters can be minimized directly with non-linear optimization (LMICP [30]), which requires differentiating distance transforms for efficiency [9,73]. In general, the distance transform needs to change with model parameters, which is hard and needs to be approximated [77] or learned [26]. A few works are differentiable both wrt. ...
Full-text available
Preprint
We address the problem of fitting 3D human models to 3D scans of dressed humans. Classical methods optimize both the data-to-model correspondences and the human model parameters (pose and shape), but are reliable only when initialized close to the solution. Some methods initialize the optimization based on fully supervised correspondence predictors, which is not differentiable end-to-end, and can only process a single scan at a time. Our main contribution is LoopReg, an end-to-end learning framework to register a corpus of scans to a common 3D human model. The key idea is to create a self-supervised loop. A backward map, parameterized by a Neural Network, predicts the correspondence from every scan point to the surface of the human model. A forward map, parameterized by a human model, transforms the corresponding points back to the scan based on the model parameters (pose and shape), thus closing the loop. Formulating this closed loop is not straightforward because it is not trivial to force the output of the NN to be on the surface of the human model - outside this surface the human model is not even defined. To this end, we propose two key innovations. First, we define the canonical surface implicitly as the zero level set of a distance field in R3, which in contrast to more common UV parameterizations, does not require cutting the surface, does not have discontinuities, and does not induce distortion. Second, we diffuse the human model to the 3D domain R3. This allows to map the NN predictions forward, even when they slightly deviate from the zero level set. Results demonstrate that we can train LoopReg mainly self-supervised - following a supervised warm-start, the model becomes increasingly more accurate as additional unlabelled raw scans are processed. Our code and pre-trained models can be downloaded for research.
... Early depth based works [28,21,31,7,37,41] estimate hand pose by fitting a generative model onto a depth image. Some works [35,32,40,36,43] additionally leveraged discriminative predictions for initialization and regularization. Recently, deep learning methods have been applied to this area. ...
Preprint
We present a novel method for monocular hand shape and pose estimation at unprecedented runtime performance of 100fps and at state-of-the-art accuracy. This is enabled by a new learning based architecture designed such that it can make use of all the sources of available hand training data: image data with either 2D or 3D annotations, as well as stand-alone 3D animations without corresponding image data. It features a 3D hand joint detection module and an inverse kinematics module which regresses not only 3D joint positions but also maps them to joint rotations in a single feed-forward pass. This output makes the method more directly usable for applications in computer vision and graphics compared to only regressing 3D joint positions. We demonstrate that our architectural design leads to a significant quantitative and qualitative improvement over the state of the art on several challenging benchmarks. Our model is publicly available for future research.
... By knowing the scene structure we could reason about what is visible and what is not. Another interesting direction would be the unification of the self-penetration and the body-scene interpenetration by employing the implicit formulation of [65] for the whole body. Future work can exploit recent deep networks to estimate the scene directly from monocular RGB images. ...
... In recent years, some works [14], [41], [25] have focused on tracking two interacting hands. They all adopted the strategy of 'left/right-hand segmentation + pose/shape optimization'. ...
Full-text available
Article
In this paper, we present a novel approach for 3D hand tracking in real-time from a set of depth images. In each frame, our approach initializes the hand pose with learning and then jointly optimizes the hand pose and shape. For pose initialization, we propose a gesture classification and root location network (GCRL), which can capture the meaningful topological structure of the hand to estimate the gesture and root location of the hand. With the per-frame initialization, our approach can rapidly recover from tracking failures. For optimization, unlike most existing methods that use a fixed-size hand model or manual calibration, we propose a hand gesture-guided optimization strategy to estimate pose and shape iteratively, which makes the tracking results more accurate. Experiments on three challenging datasets show that our proposed approach achieves similar accuracy as state-of-the-art approaches, while running on low computational resources (without a GPU).
Article
Human hand gestures are the most important tools for interacting with the real environment. Capturing hand motion is critical for a wide range of applications in Augmented Reality (AR)/Virtual Reality (VR), Human-Computer Interfaces (HCI), and many other disciplines. This paper presents a 3-module pipeline for effective hand gesture detection in real time at a speed of 100 frames per second (fps). Various hand gestures can be captured by a simple RGB camera and then processed to first detect the palm and then find essential 3D landmarks, which helps in creating a skeletal representation of the hand. In order to form a 3D mesh around the skeletal hand, 2D and 3D annotations of hand gestures are merged, and in the final module 3D animated hand gestures are presented using an advanced neural network. The 3D representation of hand gestures ensures a greater understanding of the depth ambiguity problem in monocular pose estimation and can be effectively used in computer vision and graphics applications. The proposed design is compared with several benchmarks to highlight improvements in the results achieved over conventional methods.
Chapter
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a correspondence based prior learnt directly from data. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous works focus on static grasps and contacts. The core of our method are TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. TOCH fields are a point-wise, object-centric representation, which encode the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for correcting erroneous sequences from off-the-shelf RGB/RGB-D hand-object reconstruction methods and transferring grasps across objects. Our code and model are available at [1]. Keywords: Hand-object interaction, Motion refinement, Hand prior
Full-text available
Preprint
Reconstructing two-hand interactions from a single image is a challenging problem due to ambiguities that stem from projective geometry and heavy occlusions. Existing methods are designed to estimate only a single pose, despite the fact that there exist other valid reconstructions that fit the image evidence equally well. In this paper we propose to address this issue by explicitly modeling the distribution of plausible reconstructions in a conditional normalizing flow framework. This allows us to directly supervise the posterior distribution through a novel determinant magnitude regularization, which is key to varied 3D hand pose samples that project well into the input image. We also demonstrate that metrics commonly used to assess reconstruction quality are insufficient to evaluate pose predictions under such severe ambiguity. To address this, we release the first dataset with multiple plausible annotations per image called MultiHands. The additional annotations enable us to evaluate the estimated distribution using the maximum mean discrepancy metric. Through this, we demonstrate the quality of our probabilistic reconstruction and show that explicit ambiguity modeling is better-suited for this challenging problem.
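The maximum mean discrepancy metric mentioned above, used to compare a set of predicted poses against the multiple plausible annotations, can be sketched as follows (the Gaussian kernel and bandwidth here are assumptions for illustration):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian kernel between two sample sets of shape (n, d) and (m, d)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd(X, Y, sigma=1.0):
    """Biased MMD^2 estimate between samples X ~ P and Y ~ Q."""
    kxx = gaussian_kernel(X, X, sigma).mean()
    kyy = gaussian_kernel(Y, Y, sigma).mean()
    kxy = gaussian_kernel(X, Y, sigma).mean()
    return kxx + kyy - 2 * kxy
```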
Article
Depth-based hand pose estimation has received increasing attention in the fields of human-computer interaction and virtual reality. A comprehensive survey and analysis of recent work on depth-based hand pose estimation is conducted. First, the definition and difficulties of this problem are explained, and the widely used sensors and public datasets are introduced. Then, the works in this field are divided into three categories: model-driven, data-driven, and hybrid methods. The model-driven methods perform a model fitting between the model and the depth points. The data-driven methods learn a function which maps the depth image to the pose. The hybrid methods combine model-driven and data-driven approaches to recover the hand pose. Throughout, we focus on the problems that have been solved and the shortcomings that remain. Finally, the works are compared in terms of accuracy, suitability, and robustness. Future research in this direction is also discussed.
Article
Reconstructing hand-object interactions is a challenging task due to strong occlusions and complex motions. This article proposes a real-time system that uses a single depth stream to simultaneously reconstruct hand poses, object shape, and rigid/non-rigid motions. To achieve this, we first train a joint learning network to segment the hand and object in a depth image, and to predict the 3D keypoints of the hand. With most layers shared by the two tasks, computation cost is saved for the real-time performance. A hybrid dataset is constructed here to train the network with real data (to learn real-world distributions) and synthetic data (to cover variations of objects, motions, and viewpoints). Next, the depth of the two targets and the keypoints are used in a uniform optimization to reconstruct the interacting motions. Benefitting from a novel tangential contact constraint, the system not only solves the remaining ambiguities but also keeps the real-time performance. Experiments show that our system handles different hand and object shapes, various interactive motions, and moving cameras.
Article
Wearable devices, such as smartwatches and head-mounted devices (HMD), demand new input devices for a natural, subtle, and easy-to-use way to input commands and text. In this paper, we propose and investigate ViFin, a new technique for command input and text entry, which harnesses finger-movement-induced vibration to track continuous micro finger-level writing with a commodity smartwatch. Inspired by the recurrent neural aligner and transfer learning, ViFin recognizes continuous finger writing, works across different users, and achieves an accuracy of 90% and 91% for recognizing numbers and letters, respectively. We quantify our approach's accuracy through real-time system experiments in different arm positions, writing speeds, and smartwatch position displacements. Finally, a real-time writing system and two user studies on real-world tasks are implemented and assessed.
Article
Many of the actions that we take with our hands involve self-contact and occlusion: shaking hands, making a fist, or interlacing our fingers while thinking. This use of our hands illustrates the importance of tracking hands through self-contact and occlusion for many applications in computer vision and graphics, but existing methods for tracking hands and faces are not designed to treat the extreme amounts of self-contact and self-occlusion exhibited by common hand gestures. By extending recent advances in vision-based tracking and physically based animation, we present the first algorithm capable of tracking high-fidelity hand deformations through highly self-contacting and self-occluding hand gestures, for both single hands and two hands. By constraining a vision-based tracking algorithm with a physically based deformable model, we obtain an algorithm that is robust to the ubiquitous self-interactions and massive self-occlusions exhibited by common hand gestures, allowing us to track two-hand interactions and some of the most difficult possible configurations of a human hand.
Chapter
3D hand reconstruction from images is a widely-studied problem in computer vision and graphics, and has a particularly high relevance for virtual and augmented reality. Although several 3D hand reconstruction approaches leverage hand models as a strong prior to resolve ambiguities and achieve more robust results, most existing models account only for the hand shape and poses and do not model the texture. To fill this gap, in this work we present HTML, the first parametric texture model of human hands. Our model spans several dimensions of hand appearance variability (e.g., related to gender, ethnicity, or age) and only requires a commodity camera for data acquisition. Experimentally, we demonstrate that our appearance model can be used to tackle a range of challenging problems such as 3D hand reconstruction from a single monocular image. Furthermore, our appearance model can be used to define a neural rendering layer that enables training with a self-supervised photometric loss. We make our model publicly available.
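Parametric appearance models of this kind are typically linear (PCA-style) models. As a hedged sketch of what sampling from such a model involves (dimensions and scaling below are illustrative assumptions, not the released HTML model):

```python
import numpy as np

def sample_texture(mean_tex, basis, stddevs, coeffs):
    """mean_tex: (T,) flattened mean texture; basis: (T, K) principal
    components; stddevs: (K,) per-component standard deviations;
    coeffs: (K,) low-dimensional appearance parameters."""
    return mean_tex + basis @ (stddevs * coeffs)

# usage: draw a random plausible texture from a toy model
rng = np.random.default_rng(0)
T, K = 1024, 10
mean_tex = rng.random(T)
basis = rng.standard_normal((T, K))
stddevs = np.ones(K)
tex = sample_texture(mean_tex, basis, stddevs, rng.standard_normal(K))
```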
Chapter
Efficient representation of articulated objects such as human bodies is an important problem in computer vision and graphics. To efficiently simulate deformation, existing approaches represent 3D objects using polygonal meshes and deform them using skinning techniques. This paper introduces neural articulated shape approximation (NASA), an alternative framework that enables representation of articulated deformable objects using neural indicator functions that are conditioned on pose. Occupancy testing using NASA is straightforward, circumventing the complexity of meshes and the issue of water-tightness. We demonstrate the effectiveness of NASA for 3D tracking applications, and discuss other potential extensions.
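A toy sketch of the pose-conditioned, part-based occupancy idea behind NASA (with an analytic ball standing in for the learned per-part networks) shows how occupancy testing reduces to a maximum over bone-local evaluations:

```python
import numpy as np

def part_occupancy(p_local, radius=0.05):
    """Stand-in for a learned per-part function: inside-ness of a small ball."""
    return 1.0 if np.linalg.norm(p_local) < radius else 0.0

def articulated_occupancy(p, world_from_bone, radius=0.05):
    """p: query point in world space; world_from_bone: list of 4x4 pose
    transforms for the current articulation. Occupancy is the max over parts."""
    occ = 0.0
    for T in world_from_bone:
        T_inv = np.linalg.inv(T)
        p_local = T_inv[:3, :3] @ p + T_inv[:3, 3]   # query in the part's frame
        occ = max(occ, part_occupancy(p_local, radius))
    return occ
```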
Chapter
Realtime perceptual and interaction capabilities in mixed reality require a range of 3D tracking problems to be solved at low latency on resource-constrained hardware such as head-mounted devices. Indeed, for devices such as HoloLens 2 where the CPU and GPU are left available for applications, multiple tracking subsystems are required to run on a continuous, real-time basis while sharing a single Digital Signal Processor. To solve model-fitting problems for HoloLens 2 hand tracking, where the computational budget is approximately 100 times smaller than an iPhone 7, we introduce a new surface model: the ‘Phong surface’. Using ideas from computer graphics, the Phong surface describes the same 3D shape as a triangulated mesh model, but with continuous surface normals which enable the use of lifting-based optimization, providing significant efficiency gains over ICP-based methods. We show that Phong surfaces retain the convergence benefits of smoother surface models, while triangle meshes do not.
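The distinguishing property of the 'Phong surface' is that normals vary continuously across each triangle. A minimal sketch of that interpolation (illustrative only, not the HoloLens implementation) is:

```python
import numpy as np

def phong_point_and_normal(tri_verts, tri_normals, bary):
    """tri_verts: (3, 3) triangle vertex positions; tri_normals: (3, 3)
    per-vertex unit normals; bary: (3,) barycentric coordinates summing to 1.
    The position is piecewise linear, but the normal is barycentrically
    interpolated, so it varies smoothly within (and across) faces."""
    point = bary @ tri_verts
    normal = bary @ tri_normals
    return point, normal / np.linalg.norm(normal)
```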
Chapter
We aim to recover the dense 3D surface of the hand from depth maps and propose a network that can predict mesh vertices, transformation matrices for every joint and joint coordinates in a single forward pass. Using fully convolutional architectures, we first map depth image features to the mesh grid and then regress the mesh coordinates into real-world 3D coordinates. The final mesh is found by sampling from the mesh grid and refitting it in closed form to an articulated template mesh. When trained with supervision from sparse keypoints, our accuracy is comparable with the state of the art on the NYU dataset for keypoint localization, all while recovering mesh vertices and dense correspondences. Under multi-view settings for training, our framework can also learn through self-supervision by minimizing a set of data-fitting terms and kinematic priors. Our approach is competitive with strongly supervised methods and showcases the potential for self-supervision in dense mesh estimation.
Article
We present a system for real-time hand-tracking to drive virtual and augmented reality (VR/AR) experiences. Using four fisheye monochrome cameras, our system generates accurate and low-jitter 3D hand motion across a large working volume for a diverse set of users. We achieve this by proposing neural network architectures for detecting hands and estimating hand keypoint locations. Our hand detection network robustly handles a variety of real world environments. The keypoint estimation network leverages tracking history to produce spatially and temporally consistent poses. We design scalable, semi-automated mechanisms to collect a large and diverse set of ground truth data using a combination of manual annotation and automated tracking. Additionally, we introduce a detection-by-tracking method that increases smoothness while reducing the computational cost; the optimized system runs at 60Hz on PC and 30Hz on a mobile processor. Together, these contributions yield a practical system for capturing a user's hands, and it is the default hand-tracking feature on the Oculus Quest VR headset, powering input and social presence.
Full-text available
Article
We present an approach for real-time, robust and accurate hand pose estimation from moving egocentric RGB-D cameras in cluttered real environments. Existing methods typically fail for hand-object interactions in cluttered scenes imaged from egocentric viewpoints, common for virtual or augmented reality applications. Our approach uses two subsequently applied Convolutional Neural Networks (CNNs) to localize the hand and regress 3D joint locations. Hand localization is achieved by using a CNN to estimate the 2D position of the hand center in the input, even in the presence of clutter and occlusions. The localized hand position, together with the corresponding input depth value, is used to generate a normalized cropped image that is fed into a second CNN to regress relative 3D hand joint locations in real time. For added accuracy, robustness and temporal stability, we refine the pose estimates using a kinematic pose tracking energy. To train the CNNs, we introduce a new photorealistic dataset that uses a merged reality approach to capture and synthesize large amounts of annotated data of natural hand interaction in cluttered scenes. Through quantitative and qualitative evaluation, we show that our method is robust to self-occlusion and occlusions by objects, particularly in moving egocentric perspectives.
Full-text available
Article
State-of-the-art methods for 3D hand pose estimation from depth images require large amounts of annotated training data. We propose to model the statistical relationships of 3D hand poses and corresponding depth images using two deep generative models with a shared latent space. By design, our architecture allows for learning from unlabeled image data in a semi-supervised manner. Assuming a one-to-one mapping between a pose and a depth map, any given point in the shared latent space can be projected into both a hand pose and a corresponding depth map. Regressing the hand pose can then be done by learning a discriminator to estimate the posterior of the latent pose given some depth map. To improve generalization and to better exploit unlabeled depth maps, we jointly train a generator and a discriminator. At each iteration, the generator is updated with the back-propagated gradient from the discriminator to synthesize realistic depth maps of the articulated hand, while the discriminator benefits from an augmented training set of synthesized and unlabeled samples. The proposed discriminator network architecture is highly efficient and runs at 90 FPS on the CPU, with accuracies comparable to or better than the state of the art on three publicly available benchmarks.
Full-text available
Article
We propose an entirely data-driven approach to estimating the 3D pose of a hand given a depth image. We show that we can correct the mistakes made by a Convolutional Neural Network trained to predict an estimate of the 3D pose by using a feedback loop. The components of this feedback loop are also Deep Networks, optimized using training data. They remove the need for fitting a 3D model to the input data, which requires both a carefully designed fitting function and algorithm. We show that our approach outperforms state-of-the-art methods, and is efficient as our implementation runs at over 400 fps on a single GPU.
Full-text available
Article
Previous learning-based hand pose estimation methods do not fully exploit the prior information in hand model geometry. Instead, they usually rely on a separate model-fitting step to generate valid hand poses. Such post-processing is inconvenient and sub-optimal. In this work, we propose a model-based deep learning approach that adopts a forward-kinematics-based layer to ensure the geometric validity of estimated poses. For the first time, we show that embedding such a non-linear generative process in deep learning is feasible for hand pose estimation. Our approach is verified on challenging public datasets and achieves state-of-the-art performance.
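A minimal forward-kinematics sketch in the spirit of such a kinematic layer: joint angles are mapped to 3D joint positions by chaining rigid transforms, so any predicted angle vector yields a geometrically valid pose. The single-axis joints and bone lengths below are simplifying assumptions, not the paper's hand model.

```python
import numpy as np

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(angles, bone_lengths):
    """Return 3D positions of the joints of a planar finger-like chain."""
    T_R, T_t = np.eye(3), np.zeros(3)      # accumulated rotation / translation
    joints = [T_t.copy()]
    for theta, length in zip(angles, bone_lengths):
        T_R = T_R @ rot_z(theta)                            # rotate joint
        T_t = T_t + T_R @ np.array([length, 0.0, 0.0])      # advance along the bone
        joints.append(T_t.copy())
    return np.stack(joints)

# Three-segment "finger" flexing by 20, 30 and 40 degrees (lengths in metres).
print(forward_kinematics(np.radians([20, 30, 40]), [0.05, 0.03, 0.02]))
```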
Full-text available
Article
We contribute a new pipeline for live multi-view performance capture, generating temporally coherent high-quality reconstructions in real-time. Our algorithm supports both incremental reconstruction, improving the surface estimation over time, as well as parameterizing the nonrigid scene motion. Our approach is highly robust to both large frame-to-frame motion and topology changes, allowing us to reconstruct extremely challenging scenes. We demonstrate advantages over related real-time techniques that either deform an online generated template or continually fuse depth data nonrigidly into a single reference model. Finally, we show geometric reconstruction results on par with offline methods which require orders of magnitude more processing time and many more RGBD cameras.
Full-text available
Conference Paper
We present a new real-time hand tracking system based on a single depth camera. The system can accurately reconstruct complex hand poses across a variety of subjects. It also allows for robust tracking, rapidly recovering from any temporary failures. Most uniquely, our tracker is highly flexible, dramatically improving upon previous approaches which have focused on front-facing close-range scenarios. This flexibility opens up new possibilities for human-computer interaction with examples including tracking at distances from tens of centimeters through to several meters (for controlling the TV at a distance), supporting tracking using a moving depth camera (for mobile scenarios), and arbitrary camera placements (for VR headsets). These features are achieved through a new pipeline that combines a multi-layered discriminative reinitialization strategy for per-frame pose estimation, followed by a generative model-fitting stage. We provide extensive technical details and a detailed qualitative and quantitative analysis.
Full-text available
Conference Paper
In this paper we present the Latent Regression Forest (LRF), a novel framework for real-time, 3D hand pose estimation from a single depth image. In contrast to prior forest-based methods, which take dense pixels as input, classify them independently and then estimate joint positions afterwards, our method can be considered as a structured coarse-to-fine search, starting from the centre of mass of a point cloud until locating all the skeletal joints. The searching process is guided by a learnt Latent Tree Model which reflects the hierarchical topology of the hand. Our main contributions can be summarised as follows: (i) Learning the topology of the hand in an unsupervised, data-driven manner. (ii) A new forest-based, discriminative framework for structured search in images, as well as an error regression step to avoid error accumulation. (iii) A new multi-view hand pose dataset containing 180K annotated images from 10 different subjects. Our experiments show that the LRF out-performs state-of-the-art methods in both accuracy and efficiency.
Full-text available
Article
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.
Full-text available
Conference Paper
We present a novel method for real-time continuous pose recovery of markerless complex articulable objects from a single depth image. Our method consists of the following stages: a randomized decision forest classifier for image segmentation, a robust method for labeled dataset generation, a convolutional network for dense feature extraction, and finally an inverse kinematics stage for stable real-time pose recovery. As one possible application of this pipeline, we show state-of-the-art results for real-time puppeteering of a skinned hand-model.
Full-text available
Article
We introduce and evaluate several architectures for Convolutional Neural Networks to predict the 3D joint locations of a hand given a depth map. We first show that a prior on the 3D pose can be easily introduced and significantly improves the accuracy and reliability of the predictions. We also show how to use context efficiently to deal with ambiguities between fingers. These two contributions allow us to significantly outperform the state-of-the-art on several challenging benchmarks, both in terms of accuracy and computation times.
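A sketch of one common way such a pose prior can be realised, a linear subspace learned from training poses: the network only has to predict a few subspace coefficients, and the full joint configuration is reconstructed through the fixed basis. The random "training poses" and the 6-dimensional subspace below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
train_poses = rng.normal(size=(500, 42))        # e.g. 14 joints x 3 coordinates

mean = train_poses.mean(axis=0)
U, S, Vt = np.linalg.svd(train_poses - mean, full_matrices=False)
basis = Vt[:6]                                   # 6D linear pose prior

def reconstruct(coeffs):
    """Map predicted low-dimensional coefficients back to a full pose."""
    return mean + coeffs @ basis

coeffs = rng.normal(size=6)                      # stand-in for a CNN output
print(reconstruct(coeffs).shape)                 # (42,) full pose vector
```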
Full-text available
Conference Paper
Capturing the motion of two hands interacting with an object is a very challenging task due to the large number of degrees of freedom, self-occlusions, and similarity between the fingers, even in the case of multiple cameras observing the scene. In this paper we propose to use discriminatively learned salient points on the fingers and to estimate the finger-salient point associations simultaneously with the estimation of the hand pose. We introduce a differentiable objective function that also takes edges, optical flow and collisions into account. Our qualitative and quantitative evaluations show that the proposed approach achieves very accurate results for several challenging sequences containing hands and objects in action.
Full-text available
Chapter
We propose a method that relies on markerless visual observations to track the full articulation of two hands that interact with each other in a complex, unconstrained manner. We formulate this as an optimization problem whose 54-dimensional parameter space represents all possible configurations of two hands, each represented as a kinematic structure with 26 Degrees of Freedom (DoFs). To solve this problem, we employ Particle Swarm Optimization (PSO), an evolutionary, stochastic optimization method with the objective of finding the two-hands configuration that best explains observations provided by an RGB-D sensor. To the best of our knowledge, the proposed method is the first to attempt and achieve the articulated motion tracking of two strongly interacting hands. Extensive quantitative and qualitative experiments with simulated and real world image sequences demonstrate that an accurate and efficient solution of this problem is indeed feasible.
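A minimal particle swarm optimisation sketch, illustrating the kind of stochastic search applied to such a 54-dimensional pose space. The quadratic toy objective and the PSO constants are assumptions; a real tracker would score rendered two-hand hypotheses against RGB-D observations instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(objective, dim, n_particles=32, iters=100, w=0.72, c1=1.5, c2=1.5):
    x = rng.uniform(-1.0, 1.0, size=(n_particles, dim))    # particle positions
    v = np.zeros_like(x)                                    # particle velocities
    pbest, pbest_f = x.copy(), np.array([objective(p) for p in x])
    gbest = pbest[np.argmin(pbest_f)]
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Pull each particle towards its personal best and the global best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        f = np.array([objective(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[np.argmin(pbest_f)]
    return gbest, pbest_f.min()

target = rng.uniform(-0.5, 0.5, size=54)                    # stand-in "true pose"
best, err = pso(lambda p: np.sum((p - target) ** 2), dim=54)
print(err)
```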
Full-text available
Article
A novel model-based approach to 3D hand tracking from monocular video is presented. The 3D hand pose, the hand texture, and the illuminant are dynamically estimated through minimization of an objective function. Derived from an inverse problem formulation, the objective function enables explicit use of temporal texture continuity and shading information while handling important self-occlusions and time-varying illumination. The minimization is done efficiently using a quasi-Newton method, for which we provide a rigorous derivation of the objective function gradient. Particular attention is given to terms related to the change of visibility near self-occlusion boundaries that are neglected in existing formulations. To this end, we introduce new occlusion forces and show that using all gradient terms greatly improves the performance of the method. Qualitative and quantitative experimental results demonstrate the potential of the approach.
Article
We present a new algorithm for real-time hand tracking on commodity depth-sensing devices. Our method does not require a user-specific calibration session, but rather learns the geometry as the user performs live in front of the camera, thus enabling seamless virtual interaction at the consumer level. The key novelty in our approach is an online optimization algorithm that jointly estimates pose and shape in each frame, and determines the uncertainty in such estimates. This knowledge allows the algorithm to integrate per-frame estimates over time, and build a personalized geometric model of the captured user. Our approach can easily be integrated in state-of-the-art continuous generative motion tracking software. We provide a detailed evaluation that shows how our approach achieves accurate motion tracking for real-time applications, while significantly simplifying the workflow of accurate hand performance capture. We also provide quantitative evaluation datasets at http://gfx.uvic.ca/datasets/handy
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
Article
Tracking the full skeletal pose of the hands and fingers is a challenging problem that has a plethora of applications for user interaction. Existing techniques either require wearable hardware, add restrictions to user pose, or require significant computation resources. This research explores a new approach to tracking hands, or any articulated model, by using an augmented rigid body simulation. This allows us to phrase 3D object tracking as a linear complementarity problem with a well-defined solution. Based on a depth sensor's samples, the system generates constraints that limit motion orthogonal to the rigid body model's surface. These constraints, along with prior motion, collision/contact constraints, and joint mechanics, are resolved with a projected Gauss-Seidel solver. Due to camera noise properties and attachment errors, the numerous surface constraints are impulse capped to avoid overpowering mechanical constraints. To improve tracking accuracy, multiple simulations are spawned at each frame and fed a variety of heuristics, constraints and poses. A 3D error metric selects the best-fit simulation, helping the system handle challenging hand motions. Such an approach enables real-time, robust, and accurate 3D skeletal tracking of a user's hand on a variety of depth cameras, while only utilizing a single x86 CPU core for processing.
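A minimal projected Gauss-Seidel sketch for a linear complementarity problem (find z >= 0 with A z + b >= 0 and z^T (A z + b) = 0), the class of solver named above for resolving surface, contact and joint constraints. The small random positive-definite system is an illustrative assumption; the impulse capping mentioned in the abstract would appear as an upper bound in the clamp.

```python
import numpy as np

def projected_gauss_seidel(A, b, iters=200):
    z = np.zeros(len(b))
    for _ in range(iters):
        for i in range(len(b)):
            residual = A[i] @ z + b[i]
            z[i] = max(0.0, z[i] - residual / A[i, i])   # clamp keeps z >= 0
    return z

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = M @ M.T + 1e-3 * np.eye(6)          # symmetric positive definite matrix
b = rng.normal(size=6)
z = projected_gauss_seidel(A, b)
print("complementarity z^T(Az+b):", z @ (A @ z + b))
```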
Article
Modern systems for real-time hand tracking rely on a combination of discriminative and generative approaches to robustly recover hand poses. Generative approaches require the specification of a geometric model. In this paper, we propose the use of sphere-meshes as a novel geometric representation for real-time generative hand tracking. How tightly this model fits a specific user heavily affects tracking precision. We derive an optimization to non-rigidly deform a template model to fit the user data in a number of poses. This optimization jointly captures the user's static and dynamic hand geometry, thus facilitating high-precision registration. At the same time, the limited number of primitives in the tracking template allows us to retain excellent computational performance. We confirm this by embedding our models in an open source real-time registration algorithm to obtain a tracker steadily running at 60Hz. We demonstrate the effectiveness of our solution by qualitatively and quantitatively evaluating tracking precision on a variety of complex motions. We show that the improved tracking accuracy at high frame-rate enables stable tracking of extended and complex motion sequences without the need for per-frame re-initialization. To enable further research in the area of high-precision hand tracking, we publicly release source code and evaluation datasets.
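A sketch of the geometric idea behind a sphere-mesh element: a hand part is the volume swept by a sphere whose centre and radius are interpolated along a segment, and the signed distance of a data point to it is the minimum over the sweep of the per-sphere distances. The dense sampling over the sweep parameter is a simplification for clarity; an actual tracker would use closed-form projections.

```python
import numpy as np

def sphere_mesh_distance(p, c0, c1, r0, r1, samples=256):
    """Approximate signed distance from point p to the volume swept by a
    sphere moving from (c0, r0) to (c1, r1)."""
    t = np.linspace(0.0, 1.0, samples)[:, None]
    centres = (1.0 - t) * c0 + t * c1              # swept sphere centres
    radii = (1.0 - t[:, 0]) * r0 + t[:, 0] * r1    # interpolated radii
    d = np.linalg.norm(p - centres, axis=1) - radii
    return d.min()                                  # negative inside the volume

p = np.array([0.03, 0.02, 0.0])
print(sphere_mesh_distance(p, np.array([0.0, 0.0, 0.0]),
                           np.array([0.05, 0.0, 0.0]), 0.012, 0.008))
```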
Conference Paper
Discriminative methods often generate kinematically implausible hand poses, so generative methods are then used to correct (or verify) these results in a hybrid method. Estimating 3D hand pose in a hierarchy, where the high-dimensional output space is decomposed into smaller ones, has been shown effective. Existing hierarchical methods mainly focus on the decomposition of the output space while the input space remains almost the same along the hierarchy. In this paper, a hybrid hand pose estimation method is proposed by applying the kinematic hierarchy strategy to the input space (as well as the output space) of the discriminative method via a spatial attention mechanism, and to the optimization of the generative method by hierarchical Particle Swarm Optimization (PSO). The spatial attention mechanism integrates cascaded and hierarchical regression into a CNN framework by transforming both the input (and feature space) and the output space, which greatly reduces the viewpoint and articulation variations. Between the levels in the hierarchy, the hierarchical PSO enforces the kinematic constraints on the results of the CNNs. The experimental results show that our method significantly outperforms four state-of-the-art methods and three baselines on three public benchmarks.
Conference Paper
We present a novel approach for the reconstruction of dynamic geometric shapes using a single hand-held consumer-grade RGB-D sensor at real-time rates. Our method builds up the scene model from scratch during the scanning process, thus it does not require a pre-defined shape template to start with. Geometry and motion are parameterized in a unified manner by a volumetric representation that encodes a distance field of the surface geometry as well as the non-rigid space deformation. Motion tracking is based on a set of extracted sparse color features in combination with a dense depth constraint. This enables accurate tracking and drastically reduces drift inherent to standard model-to-depth alignment. We cast finding the optimal deformation of space as a non-linear regularized variational optimization problem by enforcing local smoothness and proximity to the input constraints. The problem is tackled in real-time at the camera’s capture rate using a data-parallel flip-flop optimization strategy. Our results demonstrate robust tracking even for fast motion and scenes that lack geometric features.
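A minimal sketch of integrating one depth map into a truncated signed distance field (TSDF), the kind of volumetric surface representation this line of work builds on (the non-rigid warp of the voxel grid is omitted here). Camera intrinsics, grid size and truncation distance are illustrative assumptions.

```python
import numpy as np

def integrate_depth(tsdf, weight, depth, K, voxel_size, origin, trunc=0.02):
    nx, ny, nz = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                             indexing="ij")
    pts = origin + voxel_size * np.stack([ii, jj, kk], axis=-1)   # voxel centres
    uvw = pts @ K.T                                               # pinhole projection
    u = (uvw[..., 0] / uvw[..., 2]).round().astype(int)
    v = (uvw[..., 1] / uvw[..., 2]).round().astype(int)
    h, w = depth.shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (pts[..., 2] > 0)
    d_obs = np.where(valid, depth[v.clip(0, h - 1), u.clip(0, w - 1)], 0.0)
    sdf = d_obs - pts[..., 2]                           # distance along the view ray
    update = valid & (d_obs > 0) & (sdf > -trunc)
    sdf = np.clip(sdf / trunc, -1.0, 1.0)
    # Running weighted average of the truncated signed distance per voxel.
    tsdf[update] = (tsdf[update] * weight[update] + sdf[update]) / (weight[update] + 1)
    weight[update] += 1
    return tsdf, weight

K = np.array([[300.0, 0, 64], [0, 300.0, 48], [0, 0, 1.0]])
depth = np.full((96, 128), 0.5)                         # flat wall at 0.5 m
tsdf = np.ones((32, 32, 32))
weight = np.zeros_like(tsdf)
tsdf, weight = integrate_depth(tsdf, weight, depth, K, voxel_size=0.01,
                               origin=np.array([-0.16, -0.16, 0.35]))
print(tsdf.min(), tsdf.max())
```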
Article
In this paper we present the latent regression forest (LRF), a novel framework for real-time, 3D hand pose estimation from a single depth image. Prior discriminative methods often fall into two categories: holistic and patch-based. Holistic methods are efficient but less flexible due to their nearest neighbour nature. Patch-based methods can generalise to unseen samples by considering local appearance only. However, they are complex because each pixel needs to be classified or regressed during testing. In contrast to these two baselines, our method can be considered as a structured coarse-to-fine search, starting from the centre of mass of a point cloud until locating all the skeletal joints. The searching process is guided by a learnt latent tree model which reflects the hierarchical topology of the hand. Our main contributions can be summarised as follows: (i) Learning the topology of the hand in an unsupervised, data-driven manner. (ii) A new forest-based, discriminative framework for structured search in images, as well as an error regression step to avoid error accumulation. (iii) A new multi-view hand pose dataset containing 180K annotated images from 10 different subjects. Our experiments on two datasets show that the LRF outperforms baselines and prior arts in both accuracy and efficiency.
Article
Fully articulated hand tracking promises to enable fundamentally new interactions with virtual and augmented worlds, but the limited accuracy and efficiency of current systems has prevented widespread adoption. Today's dominant paradigm uses machine learning for initialization and recovery followed by iterative model-fitting optimization to achieve a detailed pose fit. We follow this paradigm, but make several changes to the model-fitting, namely using: (1) a more discriminative objective function; (2) a smooth-surface model that provides gradients for non-linear optimization; and (3) joint optimization over both the model pose and the correspondences between observed data points and the model surface. While each of these changes may actually increase the cost per fitting iteration, we find a compensating decrease in the number of iterations. Further, the wide basin of convergence means that fewer starting points are needed for successful model fitting. Our system runs in real-time on CPU only, which frees up the commonly over-burdened GPU for experience designers. The hand tracker is efficient enough to run on low-power devices such as tablets. We can track up to several meters from the camera to provide a large working volume for interaction, even using the noisy data from current-generation depth cameras. Quantitative assessments on standard datasets show that the new approach exceeds the state of the art in accuracy. Qualitative results take the form of live recordings of a range of interactive experiences enabled by this new approach.
Conference Paper
Articulated hand pose estimation plays an important role in human-computer interaction. Despite the recent progress, the accuracy of existing methods is still not satisfactory, partially due to the difficulty of the embedded high-dimensional and non-linear regression problem. Most existing discriminative methods regress the hand pose directly from a single depth image, which cannot fully utilize the depth information. In this paper, we propose a novel multi-view CNNs based approach for 3D hand pose estimation. The query depth image is projected onto multiple planes, and multi-view CNNs are trained to learn the mapping from projected images to 2D heat-maps which estimate 2D joint positions on each plane. These multi-view heat-maps are then fused to produce the final 3D hand pose estimation with learned pose priors. Experimental results show that the proposed method is superior to several state-of-the-art methods on two challenging datasets. Moreover, a quantitative cross-dataset experiment and a qualitative experiment also demonstrate the good generalization ability of the proposed method.
Article
Markerless tracking of hands and fingers is a promising enabler for human-computer interaction. However, adoption has been limited because of tracking inaccuracies, incomplete coverage of motions, low framerate, complex camera setups, and high computational requirements. In this paper, we present a fast method for accurately tracking rapid and complex articulations of the hand using a single depth camera. Our algorithm uses a novel detection-guided optimization strategy that increases the robustness and speed of pose estimation. In the detection step, a randomized decision forest classifies pixels into parts of the hand. In the optimization step, a novel objective function combines the detected part labels and a Gaussian mixture representation of the depth to estimate a pose that best fits the depth. Our approach needs comparably less computational resources which makes it extremely fast (50 fps without GPU support). The approach also supports varying static, or moving, camera-to-scene arrangements. We show the benefits of our method by evaluating on public datasets and comparing against previous work.
Article
We present a novel approach for the reconstruction of dynamic geometric shapes using a single hand-held consumer-grade RGB-D sensor at real-time rates. Our method does not require a pre-defined shape template to start with and builds up the scene model from scratch during the scanning process. Geometry and motion are parameterized in a unified manner by a volumetric representation that encodes a distance field of the surface geometry as well as the non-rigid space deformation. Motion tracking is based on a set of extracted sparse color features in combination with a dense depth-based constraint formulation. This enables accurate tracking and drastically reduces drift inherent to standard model-to-depth alignment. We cast finding the optimal deformation of space as a non-linear regularized variational optimization problem by enforcing local smoothness and proximity to the input constraints. The problem is tackled in real-time at the camera's capture rate using a data-parallel flip-flop optimization strategy. Our results demonstrate robust tracking even for fast motion and scenes that lack geometric features.
Article
Matrix factorization (or low-rank matrix completion) with missing data is a key computation in many computer vision and machine learning tasks, and is also related to a broader class of nonlinear optimization problems such as bundle adjustment. The problem has received much attention recently, with renewed interest in variable-projection approaches, yielding dramatic improvements in reliability and speed. However, on a wide class of problems, no one approach dominates, and because the various approaches have been derived in a multitude of different ways, it has been difficult to unify them. This paper provides a unified derivation of a number of recent approaches, so that similarities and differences are easily observed. We also present a simple meta-algorithm which wraps any existing algorithm, yielding 100% success rate on many standard datasets. Given 100% success, the focus of evaluation must turn to speed, as 100% success is trivially achieved if we do not care about speed. Again our unification allows a number of generic improvements applicable to all members of the family to be isolated, yielding a unified algorithm that outperforms our re-implementation of existing algorithms, which in some cases already outperform the original authors' publicly available codes.
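A minimal alternating least-squares sketch for low-rank matrix completion with missing data, a simpler relative of the variable-projection approaches unified above: with U fixed, each row of V has a closed-form least-squares solution over its observed entries, and vice versa. The problem size, rank and random observation mask are illustrative assumptions, and this is plain alternation rather than the paper's meta-algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 40, 30, 3
M = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # ground-truth low-rank matrix
mask = rng.random((m, n)) < 0.6                          # observed entries

U = rng.normal(size=(m, r))
V = rng.normal(size=(n, r))
for _ in range(50):
    # Solve for each row of V given U, using only the observed entries.
    for j in range(n):
        rows = mask[:, j]
        V[j] = np.linalg.lstsq(U[rows], M[rows, j], rcond=None)[0]
    # Solve for each row of U given V.
    for i in range(m):
        cols = mask[i, :]
        U[i] = np.linalg.lstsq(V[cols], M[i, cols], rcond=None)[0]

err = np.linalg.norm((U @ V.T - M)[mask]) / np.linalg.norm(M[mask])
print("relative error on observed entries:", err)
```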