TwinFusion: High Framerate Non-Rigid Fusion
through Fast Correspondence Tracking
Kaiwen Guo, Jonathan Taylor, Sean Fanello, Andrea Tagliasacchi,
Mingsong Dou, Philip Davidson, Adarsh Kowdle, Shahram Izadi
Google Inc.
Abstract
Real time non-rigid reconstruction pipelines are extremely computationally expensive and easily saturate the highest end GPUs currently available. This requires careful strategic choices to be made about a set of highly interconnected parameters that divide up the limited compute. At the same time, offline systems prove the value of increasing voxel resolution, more iterations, and higher frame rates. To this end, we demonstrate a set of remarkably simple but effective modifications to these algorithms that significantly reduce the average per-frame computation cost, allowing these parameters to be increased. Specifically, we divide the depth stream into sub-frames and fusion-frames, disabling both model accumulation (fusion) and non-rigid alignment (model tracking) on the former. Instead, we efficiently track point correspondences across neighboring sub-frames. We then leverage these correspondences to initialize the standard non-rigid alignment to a fusion-frame, where data can then be accumulated into the model. As a result, compute resources in the modified non-rigid reconstruction pipeline can be immediately repurposed to increase voxel resolution, use more iterations, or increase the frame rate. To demonstrate the latter, we leverage recent high frame rate depth algorithms to build a novel "twin" sensor consisting of a low-res/high-fps sub-frame camera and a second low-fps/high-res fusion camera.
1. Introduction
The use of cameras to digitize the geometry, texture, light-
ing and motion of arbitrary scenes is a fundamental problem
in computer vision. General monocular solutions remain
elusive, but practical algorithms have been developed that
leverage motion, shape or appearance priors, and/or require
instrumentation of the scene using motion markers or multi-
ple calibrated cameras.
Figure 1. (top) Our TwinPod hybrid depth camera captures high-
speed performance with a pair of high (resp. low) framerate and
low (resp. high) resolution sensors. (bottom-left) The reconstruc-
tion in the canonical frame obtained by our re-implementation of
DynamicFusion on the 30FPS stream. (bottom-right) Our TwinFu-
sion algorithm efficiently resolves motion from the high-framerate
camera and exploits this information to guide the high-resolution
non-rigid reconstruction.
Digitizing a rigid world
As a large portion of the world
remains effectively static, the assumption of global rigid-
ity has proven to be highly applicable: multiple viewpoints
taken at different times can be treated as if they were taken
at the same time and without scene deformation. By leverag-
ing rigidity, researchers have been successful in developing
offline systems capable of sparsely reconstructing city-scale environments from crowd-sourced RGB images [1],
or simultaneously mapping and localizing a camera in an
environment [21]. At the heart of such systems is the multi-
view triangulation of a sparse set of keypoints, each iden-
tified by local feature descriptors that have been carefully
hand crafted [19] or learned [39] to be invariant to scale,
orientation, and lighting conditions. Putting these keypoint
observations into correspondence is achieved by a carefully
initialized local optimization [20], or via a global combinato-
rial optimization [13]. The main limitations of such systems
are that (1) the images must be sufficiently rich in visual fea-
tures [3], and (2) the computed reconstructions are typically
sparse [31].
Depth to the rescue
The advent of real-time depth cam-
eras has allowed for the dense reconstruction of scenes,
including those with large untextured areas [23]. As these
processes rely on local geometric registration, high accu-
racy frame-to-frame tracking is imperative to avoid error
accumulation, as this would likely lead to loss of tracking.
Thanks to the limited number of DOFs, standard 30 Hz depth cameras have proven effective for scene reconstruction [24,38,5,37], but these results still assume slow camera motion in
order to avoid tracking failure. Unfortunately, as the number
of degrees of freedom increases, the problem is exacerbated.
For this reason, application-specific priors and re-initializers
are typically introduced to compensate for the limited frame-
rate; see [26,29] for hand (26+5 dof), [36,33] for face (25+50 dof), and [17,2] for body tracking (72+10 dof).
Digitizing generic non-rigid geometry
General non-
rigid reconstruction techniques such as [22,15,7] lack a
fixed model, so incrementally accumulating a model through
frame-to-frame tracking remains the only viable option. It is
thus not surprising that these techniques, employing tens of
thousands of parameters to model the general non-rigid scene
deformation, are extremely brittle at low framerates; see Fig-
ure 1. The impressive results showcased in these works
often rely on some combination of carefully orchestrated
motions [22], multi-view cameras [7], and model resetting
strategies [8]. State-of-the-art real time non-rigid reconstruc-
tion pipelines are computationally expensive, and easily satu-
rate the highest end GPUs currently available. This requires
careful strategic choices of parameters (e.g. frame-rate, num-
ber of solver iterations, voxel resolution) that determine the
allotment of the limited compute. Additionally, these parameters are so highly interconnected (e.g. a higher framerate requires fewer iterations) that choosing their values is somewhat of an art. Further, offline systems that crank these dials to their max [17] show that we are nowhere close to realizing the full potential of these algorithms in real-time, unless we are willing to wait years for advances in GPU performance.
High framerate tracking
The introduction of high fram-
erate depth cameras, such as the 200Hz sensor in [12],
promises to dramatically increase the robustness of com-
plex real-time tracking problems. For example, Taylor et al.
[32] leveraged this camera to robustly track a hand model
with negligible reliance on discriminative reinitialization.
In contrast to such work, running non-rigid reconstruction
techniques at higher framerates would require diverting com-
pute resources to processing more frames by reducing voxel
resolution, decreasing the expressiveness of the deformation
model, or reducing the number of iterations in the non-rigid
alignment. Any one of these modifications would likely
erase any accuracy gains made by increasing the framer-
ate. This again elucidates the frustrating zero-sum game one finds oneself in when trying to tune the various parameters in non-rigid reconstruction pipelines. In response, we
demonstrate a set of remarkably simple, but effective mod-
ifications to the standard non-rigid reconstruction pipeline
that allows the average per-frame computation cost to be
significantly reduced. In particular, we divide the sequence
into fusion-frames and sub-frames, and only enable non-rigid
tracking and model accumulation (i.e. fusion) on the fusion-
frames. Further, we use the sub-frames only to perform an efficient frame-to-frame tracking of point correspondences that is highly effective under the assumption of small
inter-frame motion. These correspondences then allow a
robust bootstrapping of the typical non-rigid alignment to
the fusion-frame. Lastly, we notice that the resolution of the
sub-frames can be dramatically reduced allowing for further
computational savings from any upstream depth algorithm.
As a result of these simple modifications, the majority of the
non-rigid reconstruction algorithms can immediately free
up and repurpose compute by simply tagging an increasing
number of frames as sub-frames. In particular, we lever-
age this to significantly increase the framerate of the depth
images that can be processed, unlocking the advantages of
recent high framerate depth cameras.
Introducing the TwinPod
Currently, there is no available consumer hardware that can take full advantage of these modifications (most RGBD sensors have a framerate of at most 30 Hz). Hence, in this paper we further in-
troduce a novel hybrid RGBD capture sensor specifically
designed to provide this algorithm the ideal input. Our sen-
sor consists of a pair of depth cameras that capture data at
complementary framerates and resolutions; see Figure 2. A
fusion camera streams high resolution images at a low fram-
erate (1280x1024@30Hz) – data from this source is used to
fuse detailed geometric information into the model over time.
As this process requires the current deformation parameters
of the model, a separate sub-frame camera streams low res-
olution images at very high framerate (640x512@500Hz) –
Figure 2. (left) The tracking camera produces high-fps (500 Hz) low-resolution (640×512) depth maps, whereas the fusion camera outputs high-resolution (1280×1024) depth at low-fps (30 Hz). (right) Example of a capture sequence with keyframe and sub-frame notation. Notice how the large frame-to-frame motion in the high resolution capture (bottom) is compensated by the high speed camera (top).
data from this source is used to efficiently track point cor-
respondences through time. Notice we do not perform any
non-rigid alignment in the sub-frames: the actual non-rigid
tracking happens only in the fusion-frames, leveraging the
tracked correspondences in the sub-frames. With a mere 2 ms of compute time available between sub-frames, we propose a
method to track point correspondences by performing a fast
local search between neighboring frames. The overall result
is the first fusion system leveraging the recent availability of
high speed depth cameras to more accurately track non-rigid
geometry in rapid motion.
2. Related works
The pioneering DynamicFusion work of Newcombe et
al. [22] demonstrated how high-quality reconstructions can
be obtained without strong assumptions about the geom-
etry being observed. This is achieved by representing
the reconstructed surface via a Truncated Signed Distance
Field (TSDF) [4], and non-rigidly deforming this model onto
the data via an embedded deformation graph [28,6]. Once in
good alignment, similarly to KinectFusion [23], the current
depth frame can then be incrementally fused into the canoni-
cal model. As an alternative to deformation graphs, Innmann
et al. [15] encode transformations in the same canonical voxel space used to store the TSDF. With the addition of suitable regularizers, Slavcheva et al. [27] claim this even allows topological changes to be correctly handled.
Coping with low framerates
Follow-up works such as [7]
and [8] extended this approach to multi-camera setups to-
wards the general target of free viewpoint rendering [25]. To
avoid loss of tracking due to the large inter-frame motions
caused by low framerate acquisition, these methods heavily
rely on semantic correspondences computed via discrimina-
tive methods [35] or spectral embeddings [16]. Other ways to
improve tracking include accounting for SIFT keypoints [15],
shading constraints [14], or skeletal shape priors [40]. Unfor-
tunately, when motions between adjacent frames are larger
than some threshold, all these systems start to struggle; see
Figure 1. The best one can hope for is that, upon tracking
failure, the system resets the reconstruction to the current
depth map [7], but this leads to unsightly artifacts. Overall,
the framerate at which these systems run limits their overall
robustness beyond carefully orchestrated motions; e.g. [22].
Framerate to the rescue
In our work, we tackle these
problems with an end-to-end solution encompassing both
hardware and software. We contribute along three major
directions by introducing:
• a hybrid depth camera system for simultaneous high-framerate / high-quality capture,
• a highly efficient non-rigid tracking/fusion algorithm capable of exploiting high framerate data,
• the first quantitative evaluation framework (data and metrics) for non-rigid tracking/fusion.
3. Method
Our algorithm processes two streams of real-time RGBD
data, and reconstructs the geometry being observed. The
first stream, which we call the fusion-frame stream, is
high resolution (1280x1024) but low-framerate (30Hz). The
second stream, which we call the sub-frame stream, is at a
low resolution (640x512) but operates at a high-framerate
(500Hz), and is synchronized to the fusion-frame stream so as to provide S sub-frames per fusion-frame (see Figure 2).
[Figure panels: Ground Truth | F-ICP | S-ICP with λnorm = 0 | S-ICP with λnorm > 0]
Figure 3. (left) Given ground-truth correspondences between keyframes the optimization (1) converges perfectly, enabling accurate fusion. (middle-left) When motion between keyframes is too significant, closest point (CP) correspondences can drive registration towards a local minimum. (middle-right) By locally tracking closest-point correspondences through time (λnorm = 0 in Eq. 5) we can leverage, to some extent, sub-frame information. (right) By incorporating normals in the local lookup (λnorm > 0 in Eq. 5), correspondences close to ground truth can be recovered with minimal computational effort.
Non-rigid registration – Section 3.1
Our reconstruction algorithm non-rigidly registers the model in the previous fusion-frame to the current fusion-frame. Then, it fuses the deformed model together with the data and generates an updated model. To enable fusion, we represent the model M in each keyframe as a TSDF; see [4].
Sub-frame matching – Section 3.2
Between each pair of fusion-frames, we efficiently process the corresponding sub-frames to summarize the observed motion. This information is then used to guide the non-rigid registration between fusion-frames. As the framerate of the sub-frame stream is very high, the magnitude of the motion is very small, and a simple yet very efficient technique can be used for the task at hand. The efficiency of this step is key: in order to avoid dropping frames, at most 2 ms of processing time is available. For convenience, in what follows we drop the dependence on the fusion-frame index in the sub-frames S1, ..., SS, and refer to the previous fusion-frame as F̄ and to the current fusion-frame as F.
3.1. Non-rigid registration
Following the state of the art in non-rigid reconstruc-
tion [22,7,8], we extract a triangular mesh representation of
the model M via marching cubes [18]. The deformation of the mesh, M(θ) = {vm(θ)}, is encoded by a deformation graph [28] parameterized by a low-dimensional vector θ. The mesh of the previous fusion-frame is the starting point of our iterative optimization. Our primary goal is to register this mesh with the target frame F by iteratively solving the optimization

arg min_θ  λdata Edata(θ) + λreg Ereg(θ) + λcorr Ecorr(θ)    (1)
Data fitting
The term Edata ensures that the deformed mesh is in alignment with the data in the fusion-frame:

Edata(θ) = Σm ρ(⟨nm, pm − vm(θ)⟩)    (2)

where (pm, nm) = ΠF(vm(θ)) are respectively the position and normal of the projective correspondence of vm(θ) in the depth map associated with the fusion-frame F, and ρ is a robust kernel allowing us to be resilient to outliers.
The term Ereg is a weighted sum of several regularizers. Briefly, these terms encourage smoothly varying as-rigid-as-possible deformations that respect visual hull constraints; see [8] for more details.
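As a concrete illustration, the point-to-plane data term with a robust kernel can be sketched in NumPy. This is a minimal sketch: the Huber kernel and its threshold are assumptions (the paper does not name its kernel here), and the projective lookup ΠF is assumed to have already produced the correspondence arrays.

```python
import numpy as np

def huber(r, delta=0.01):
    """Robust kernel (assumed choice): quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def data_energy(verts, target_pts, target_nrm, delta=0.01):
    """Point-to-plane residuals <n_m, p_m - v_m(theta)> under a robust kernel.

    verts:      (M, 3) deformed model vertices v_m(theta)
    target_pts: (M, 3) projective correspondences p_m in the fusion-frame
    target_nrm: (M, 3) unit normals n_m at those correspondences
    """
    # Signed point-to-plane distance for every vertex, then robustified sum.
    r = np.einsum('ij,ij->i', target_nrm, target_pts - verts)
    return huber(r, delta).sum()
```

The robust kernel is what keeps a handful of badly matched pixels from dominating the Gauss-Newton steps: beyond the threshold, residuals contribute only linearly.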
Landmark correspondences
In low frame-rate tracking,
to mitigate the effect of large motions, it is typical to
use a set of correspondences to guide the iterative opti-
mization towards the correct local basin of convergence.
For example, [15] employs SIFT features, [7] computes matches via a learnt hash function, while [8] computes them by mapping the geometry into an isometry-invariant latent space. Regardless of the process, a set of correspondences C = {(vp(θ), cp)}p∈P, with P ⊆ {1, . . . , M}, can be introduced. These can be accounted for in the optimization by the following energy:

Ecorr(θ) = Σp∈P ψ(‖vp(θ) − cp‖)    (3)

where ψ is a robust kernel allowing us to be robust to outliers caused by poorly tracked correspondences; see Section 6. In contrast to other methods, and as discussed in the following section, we compute these correspondences by leveraging the sub-frames between two subsequent fusion-frames.
3.2. Sub-frame matching
Since we expect the motion between two sub-frames to be small, it should be possible to perform a local search to find good correspondences between two adjacent sub-frames. In particular, we use an efficient local projective search. For a 3D point x ∈ Ss, let its projective neighborhood on the sub-frame Ss+1 be defined by the set of 3D points:

N(x; Ss+1) = {n ∈ Ss+1 : ‖Π(x) − Π(n)‖ < ε}    (4)

For this query point, we find an optimal correspondence by solving, via exhaustive search,

arg min_{n ∈ N(x; Ss+1)}  Eicp(x, n) + λnorm Enorm(x, n)    (5)

where

Eicp(x, n) = ‖n − x‖²    (6)
Enorm(x, n) = ‖NSs+1(n) − NSs(x)‖²    (7)

and NS(·) returns the unit normal at a point of sub-frame S. The term Eicp encourages a match with a nearby target point, while Enorm encourages the matches to have similar normals.
Fusion-frame to fusion-frame correspondences
For those vertices that are visible in the previous fusion-frame, we then build the set of correspondences C = {(vp(θ), np)}. We track each point by initializing it at its location in the first keyframe, and looping over each sub-frame while applying the optimization in (5).
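The local projective search and the chaining of matches across sub-frames can be sketched as follows. This is a minimal CPU sketch under stated assumptions: `project`, its intrinsics, and the thresholds `eps` and `lam_norm` are hypothetical, and a real implementation would run the search in parallel on the GPU over depth-map pixels rather than in a Python loop.

```python
import numpy as np

def project(x, fx=500.0, fy=500.0, cx=320.0, cy=256.0):
    """Pinhole projection Pi(x) into pixel coordinates (hypothetical intrinsics)."""
    return np.array([fx * x[0] / x[2] + cx, fy * x[1] / x[2] + cy])

def best_match(x, nx, pts, nrms, eps=8.0, lam_norm=0.1):
    """Exhaustive search over the projective neighborhood, minimizing the
    point-distance term plus a weighted normal-similarity term."""
    px = project(x)
    best_i, best_cost = -1, np.inf
    for i, (n, nn) in enumerate(zip(pts, nrms)):
        if np.linalg.norm(px - project(n)) >= eps:
            continue  # outside the projective neighborhood
        cost = np.sum((n - x) ** 2) + lam_norm * np.sum((nn - nx) ** 2)
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i

def track_through_subframes(x0, n0, subframes):
    """Chain matches across consecutive sub-frames to propagate a point
    from the previous fusion-frame to the current one."""
    x, nx = x0, n0
    for pts, nrms in subframes:
        i = best_match(x, nx, pts, nrms)
        if i < 0:
            return None  # track lost
        x, nx = pts[i], nrms[i]
    return x
```

Because the inter-frame motion at 500 Hz is tiny, the neighborhood stays small and the exhaustive search over it is cheap, which is what makes the 2 ms budget plausible.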
3.3. Optimization schedule
To align the model to the current fusion-frame we first perform a Gauss-Newton optimization of (1) with λdata = 0 and λcorr = 1, starting from the deformation parameters of the previous frame, until convergence. We then refine the parameters by setting λdata = 1 and λcorr = 0. Importantly, the minimization of the energy (1) takes place only on the fusion-frames.
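The two-phase schedule can be illustrated on a toy one-dimensional quadratic stand-in for energy (1), where each Gauss-Newton solve has a closed form. All terms, targets, and weights here are illustrative, not the paper's actual energies.

```python
def solve(theta0, d, c, lam_data, lam_corr, lam_reg=0.1):
    """Minimize lam_data*(t-d)^2 + lam_corr*(t-c)^2 + lam_reg*(t-theta0)^2.
    The toy energy is quadratic, so one Gauss-Newton step reaches the
    optimum: the weighted mean of the three targets."""
    w = lam_data + lam_corr + lam_reg
    return (lam_data * d + lam_corr * c + lam_reg * theta0) / w

def align_to_fusion_frame(theta_prev, d, c):
    # Phase 1: correspondences only (lam_data=0, lam_corr=1) pull the
    # parameters into the right basin of convergence.
    theta = solve(theta_prev, d, c, lam_data=0.0, lam_corr=1.0)
    # Phase 2: dense data term only (lam_data=1, lam_corr=0) refines.
    theta = solve(theta, d, c, lam_data=1.0, lam_corr=0.0)
    return theta
```

The point of the split is that tracked correspondences are reliable but sparse, so they steer the solve, while the dense data term supplies the final accuracy once the solution is near the right basin.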
4. Capture setup
Our method assumes as input two streams of RGBD data,
low-res/high-fps (640×512 @ 500 Hz) and high-res/low-fps (1280×1024 @ 30 Hz). Although many commercial depth
sensors are available, none of them is capable of running at the required high resolution or high frame-rate. However, a
number of recent algorithmic contributions have made high-
framerate depth estimation a reality [10,9,12,30]. Inspired
by this work, we resort to active stereo to compute disparity
maps, where a spatially unique texture is projected into the
environment to simplify the task of correspondence search.
Camera hardware
The basic setup consists of two IR
cameras and a laser projector that provides active texture in
the scene. Extending the work of [12], we chose the OnSemi
Python1300 packaged as a USB3 camera module by Ximea.
These modules are already commercially available and rep-
resent a good tradeoff between price and quality. Moreover, they expose an input trigger port that is crucial for synchronization. This sensor is capable of achieving the framerates required by both of our streams: 30 fps at the full 1280×1024 spatial resolution and 500 fps at the reduced 640×512 resolution. We also coupled an RGB camera
for passive texture. The second hardware component consists of the IR illumination that facilitates the task of any stereo
algorithm. We leverage a VCSEL-based illuminator that is
commercialized by Heptagon.
TwinPod design
To implement the twin RGBD stream
used in this work, we built a TwinPod, consisting of multiple IR and RGB cameras; see Fig. 2. All the cameras are calibrated and synced. Given the large field of view of our lenses (≈80 deg), we use multiple projectors to cover the full scene. To
compute disparity maps at high framerate, we use the Hash-
Match algorithm described in [11]. This method employs a
learnt representation to remap the IR images into a binary
space and then performs fast CRF inference in this new fea-
ture space. This technique runs in real time at both the high (1280×1024) and the low (640×512) resolution on an NVIDIA Titan X GPU. In Figure 2 we show an example
of both the fast framerate and high resolution streams this
device can capture. Notice how between two fusion-frames
there is considerable non-rigid motion. Our track-
ing algorithm leverages instead the small motion between
consecutive frames in the high speed capture stream.
5. Evaluation
We extensively evaluate TwinFusion on multiple synthetic
and real world sequences. We provide both quantitative and
qualitative evaluations, showing how our method outper-
forms the state-of-the-art. Further evaluations can be found
in the supplementary video.
Dealing with tracking failures
Non-rigid reconstruction
pipelines rely on a non-rigid alignment via tracking in order
to accumulate detail and average out noise. When there is
a misalignment, standard fusion pipelines quickly contami-
nate the accumulated model with erroneous geometry; see
Figure 1. At that point it is nearly impossible to align an
incorrect model to the next frame and catastrophic failure
occurs. Of course, tracking failures are bound to occur at
some point, so current methods appear to take one of two
approaches to obtain reasonable results. One approach is
[Figure panels: Baseline / DynamicFusion / TwinFusion under full-sequence and sub-sequence tracking, Frames 11 and 108]
Figure 4. Qualitative evaluation of tracking results on the synthetic bouncing sequence from Vlasic et al. [34] (i.e. beginning and middle of sequence as highlighted in Figure 5). These results are better appreciated in the supplementary video.
to carefully capture a slow moving sequence with little oc-
clusion, and choose parameters carefully so that tracking
succeeds [22]. Another approach is to constantly check if
the model aligns with the data and perform a partial reset
in regions where registration fails [8]. In both approaches,
tracking failures appear hidden, despite good tracking being
critical for building robust systems that accumulate model
detail. This further validates work that attempts to isolate
a single component (i.e. tracking) as a means to realize a
universally improved system.
Evaluation methodology
Therefore, it seems that a robust
and accurate system will necessarily have to perform partial
resetting to erase erroneous geometry and double surfaces
but also track well in order to accumulate detail. As partial
resetting can be added to any system for increased robust-
ness, we focus on evaluating tracking. To avoid the problem
of tracking failing, we instead divide all of our sequences up
into a collection of short sub-sequences. As shown below,
even maintaining tracking for these very short subsequences
is challenging, but our algorithm performs much better than
the state-of-the-art. We use DynamicFusion [22] as an ex-
emplar of a non-rigid reconstruction pipeline, which we approximate by running our TwinFusion algorithm while ignoring correspondences estimated from sub-frames (i.e. with λcorr = 0
). We also compare against a variant that performs
non-rigid alignment on both sub-frames and fusion-frames.
While this method is far too expensive to execute in real-time,
it represents an upper bound on tracking performance, and
we refer to it as our Baseline. Finally, we asked the authors
of [8] to run their Motion2Fusion pipeline in a tracking-only
mode on our dataset.
5.1. Synthetic evaluation dataset
Non-rigid fusion pipelines are complicated systems con-
taining a myriad of interacting subcomponents: the accu-
mulation of a model (fusion), the non-rigid deformation
function (parameterization), and the ability to find parame-
ters for that deformation function that non-rigidly aligns the
deformed model to the data in the current frame (tracking).
Although the quality of a reconstruction perceived by a user
may be the ultimate test for a system, little insight can be
gleaned as to what components are performing well. Quanti-
fying the performance of one of these components is crucial
to this work as our main contribution is towards improving
the tracking component. Towards this goal, we use a 3D
mesh of an object that dynamically changes its shape by
performing fast motions. We place a synthetic camera facing
a moving object, and render 2D depth maps at both our high
and low acquisition framerates. Our dataset uses the 4D
sequences acquired by Vlasic et al. [34], which consists of
dynamic surface mesh sequences with consistent topology
at 30fps. Using those as fusion-frames, we then temporally
interpolate the mesh surface over the original sequence to
generate an artificial high-framerate sub-frame sequence at
480fps, and render both into sequences of depth maps to
match the input formats generated by the TwinPod.
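The sub-frame synthesis step above can be sketched as per-vertex linear interpolation between consecutive fusion-frame meshes. This is a sketch under stated assumptions: the paper does not specify its interpolation scheme, and the default of 16 sub-frames per interval simply reflects the 480/30 framerate ratio.

```python
import numpy as np

def interpolate_subframes(verts_a, verts_b, n_sub=16):
    """Linearly blend vertex positions between two consecutive 30 fps
    meshes with consistent topology, yielding n_sub meshes that cover
    the interval [a, b) at the high framerate."""
    ts = np.linspace(0.0, 1.0, n_sub, endpoint=False)
    return [(1.0 - t) * verts_a + t * verts_b for t in ts]
```

Consistent topology across the Vlasic sequences is what makes this per-vertex blend well defined: vertex i in mesh a corresponds to vertex i in mesh b.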
Figure 5. We compare the impact on the accuracy of our entire system that results from the varying tracking accuracies elucidated in Figure 4. In particular, TwinFusion, DynamicFusion and an offline Baseline are run with fusion (i.e. TSDF update) turned off. We consider running on each sequence in its entirety (left) and on a set of subsequences with ground-truth reinitialization (right).
To quantify tracking/fusion error, we first warp each (current) model vertex vp(θ) back into the reference pose, obtaining v̄p, and find the closest ground truth vertex gq in the first (i.e. template) frame as:

h(p) = arg min_q ‖v̄p − gq‖    (8)

Then, the average error is computed as follows:

E(θ) = (1/M) Σp ‖v̄p − g_h(p)‖    (9)
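The error metric described above (warp each model vertex back to the reference pose, match it to the closest ground-truth template vertex, average the distances) can be sketched in a few lines of NumPy. Variable names are illustrative, and the brute-force nearest-neighbor step would be replaced by a spatial index for large meshes.

```python
import numpy as np

def tracking_error(model_verts_ref, gt_template_verts):
    """Average tracking error: each model vertex (already warped back to
    the reference pose) is matched to its closest ground-truth template
    vertex, and the resulting distances are averaged."""
    # Pairwise distances between model and template vertices.
    d = np.linalg.norm(
        model_verts_ref[:, None, :] - gt_template_verts[None, :, :], axis=-1)
    h = d.argmin(axis=1)  # h(p): closest template vertex per model vertex
    return d[np.arange(len(h)), h].mean()
```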
Tracking evaluation – Figure 4 and Figure 5
We analyze the per-frame non-rigid alignment error across all methods by disabling fusion (i.e. the TSDF is
initialized with ground truth and never updated). Note how
a tracking failure early on in the sequence can completely
spoil the remaining results (Figure 5, left). Hence, we also
consider dividing each sequence into a set of shorter sub-
sequences (Figure 5, right). It is clear from these results that
TwinFusion significantly outperforms the standard Dynam-
icFusion approach. Moreover, the TwinFusion results are
very close to the (offline) Baseline, demonstrating how our
algorithm provides a viable way to unlock the benefits of
higher framerate depth streams. Motion2Fusion [8], without its resetting strategy, behaves similarly to DynamicFusion:
this is expected since in this particular case it cannot take
advantage of multiple views like in the original algorithm.
Figure 6. We compare the impact on the accuracy (average error from Eq. 9, in meters) of our entire system that results from the varying tracking accuracies elucidated in Figure 5. In particular, TwinFusion, DynamicFusion and an offline Baseline are run with fusion (i.e. model accumulation) turned on.
Fusion evaluation – Figure 6 and Figure 7
Having quan-
tified the increased non-rigid alignment accuracy we obtain
through TwinFusion, we now seek to elucidate how tracking
accuracy interacts with a fusion system. To this end, we
perform a similar experiment but with fusion enabled, so
that the system is trying to accumulate a model as it tracks.
Note how we do not attempt to fuse the entire sequence, but
[Figure panels: Baseline / DynamicFusion / TwinFusion at Frames 100, 101, 105, 107, 109]
Figure 7. An example of the impact of tracking on fusion results. Frames are selected to represent the range of motion between two restarts; see Figure 6. Tracking keyframes at low framerate (DynamicFusion) cannot keep up with the fast motion of the actor landing his jump, resulting in significant ghost geometry. The correspondence tracking in TwinFusion allows us to reach results comparable to those of the Baseline, but at a fraction of the computational budget, and in real-time.
Figure 8. Qualitative results on a challenging real scene recorded
by our TwinPod. We show the fused models for multiple frames
processed with the TwinFusion and DynamicFusion. TwinFusion
achieves very compelling results in real-time, under a tractable
computational budget. Conversely, DynamicFusion cannot cope
with the fast motion. Please see the supplementary video for more results.
instead only consider short sub-sequences. Indeed, since
tracking with a template eventually fails (see previous dis-
cussion), accumulating a model only hastens failure and
thus such an experiment would be of no value. Notice in
Figure 6 and Figure 7 how TwinFusion is always very comparable with the expensive, offline baseline method, whereas
DynamicFusion dramatically and rapidly fails.
5.2. Live captured data – Figure 8
We finally use our TwinPod to live capture sequences of
subjects performing fast actions. Here we demonstrate the
benefits of TwinFusion in challenging, real scenarios using
a single camera. As shown in Figure 8, the system robustly
tracks highly deformable objects, such as a scarf, whereas
DynamicFusion fails – please see supplementary video.
6. Conclusions
In this paper, we recognize non-rigid registration via track-
ing as a crucial part of modern non-rigid reconstruction
pipelines that hope to accumulate model detail. We also recognize that the robustness and accuracy of tracking have not been carefully examined in the literature. In-
stead, the most seemingly robust systems [8] are frequently
performing partial resets of the misaligned model, deleting
phantom surfaces and erroneous geometry, but also deleting
accumulated detail. Systems that do not partially reset the
model appear to track only a small set of sequences, which is
not surprising since one should expect tracking to eventually
fail in general. Thus by focusing on and improving tracking
accuracy and robustness, we can either increase the number
of sequences the latter systems will run on or, more desirably,
increase the amount of detail that robust resetting systems
such as [8] can accumulate. To this end, we have introduced
a set of simple but surprisingly effective modifications to
any standard non-rigid tracking pipeline. The modifications
rely on tracking point correspondences, leveraging the small
inter-frame motion between sub-frames. In future work, we
plan to explore even faster depth streams to push the tracking precision even further; e.g. the higher-framerate system of [30].
One limitation of our method is the possibility that corre-
spondences slip during tangential motion, and we leave it as
future work to examine leveraging color constraints or regu-
larizers that might alleviate this problem. Nonetheless, we
find through synthetic and qualitative experiments that we
obtain better tracking accuracy than other real time methods.
Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Cur-
less, B., Seitz, S. M. & Szeliski, R. Building rome in a
day. Communications of the ACM (2011).
Bogo, F., Kanazawa, A., Lassner, C., Gehler, P.,
Romero, J. & Black, M. J. Keep it SMPL: Automatic
estimation of 3D human pose and shape from a single
image in Proc. of the European Conf. on Comp. Vision
Crivellaro, A., Rad, M., Verdie, Y., Yi, K. M., Fua, P.
& Lepetit, V. Robust 3D Object Tracking from Monoc-
ular Images using Stable Parts. IEEE Transactions on
Pattern Analysis and Machine Intelligence (2017).
Curless, B. & Levoy, M. A volumetric method for building complex models from range images in SIGGRAPH (1996).
Dai, A., Nießner, M., Zollhöfer, M., Izadi, S. & Theobalt, C. BundleFusion: Real-time Globally Consistent 3D Reconstruction using On-the-fly Surface Re-integration. ACM Transactions on Graphics (TOG) (2017).
Dou, M., Taylor, J., Fuchs, H., Fitzgibbon, A. & Izadi,
S. 3D Scanning Deformable Objects with a Single
RGBD Sensor in CVPR (2015).
Dou, M., Khamis, S., Degtyarev, Y., Davidson, P., Fanello, S. R., Kowdle, A., Orts-Escolano, S., Rhemann, C., Kim, D., Taylor, J., Kohli, P., Tankovich, V. & Izadi, S. Fusion4D: Real-time Performance Capture of Challenging Scenes. ACM TOG (2016).
Dou, M., Davidson, P., Fanello, S. R., Khamis, S., Kow-
dle, A., Rhemann, C., Tankovich, V., & Izadi, S. Mo-
tion2Fusion: Real-time Volumetric Performance Cap-
ture. ACM TOG (SIGGRAPH Asia) (2017).
Fanello, S. R., Rhemann, C., Tankovich, V., Kowdle, A., Escolano, S. O., Kim, D. & Izadi, S. HyperDepth: Learning depth from structured light without matching in CVPR (2016).
Fanello, S. R., Keskin, C., Izadi, S., Kohli, P., Kim, D.,
Sweeney, D., Criminisi, A., Shotton, J., Kang, S. B. &
Paek, T. Learning to be a depth camera for close-range
human capture and interaction in ACM Transactions
on Graphics (TOG) (2014).
Fanello, S. R., Valentin, J., Kowdle, A., Rhemann, C.,
Tankovich, V., Ciliberto, C., Davidson, P. & Izadi, S.
Low Compute and Fully Parallel Computer Vision with
HashMatch in ICCV (2017).
Fanello, S. R., Valentin, J., Rhemann, C., Kowdle,
A., Tankovich, V. & Izadi, S. UltraStereo: Efficient
Learning-based Matching for Active Stereo Systems in
CVPR (2017).
Fischler, M. A. & Bolles, R. C. Random sample con-
sensus: a paradigm for model fitting with applications
to image analysis and automated cartography. Commu-
nications of the ACM (1981).
Guo, K., Xu, F., Yu, T., Liu, X., Dai, Q. & Liu, Y. Real-
time Geometry, Albedo and Motion Reconstruction
Using a Single RGBD Camera. ACM Transactions on
Graphics (TOG) (2017).
Innmann, M., Zollhöfer, M., Nießner, M., Theobalt, C. & Stamminger, M. VolumeDeform: Real-time volumetric non-rigid reconstruction in ECCV (2016).
Jain, V. & Zhang, H. Robust 3D Shape Correspondence
in the Spectral Domain in SMA (2006).
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G.
& Black, M. J. SMPL: A skinned multi-person linear
model. ACM Transactions on Graphics (TOG) (2015).
Lorensen, W. E. & Cline, H. E. Marching cubes: A high resolution 3D surface construction algorithm in ACM SIGGRAPH Computer Graphics (1987).
Lowe, D. G. Distinctive Image Features from Scale-
Invariant Keypoints. IJCV (2004).
Lucas, B. D. & Kanade, T. An iterative image registration technique with an application to stereo vision in Proc. IJCAI (1981).
Mur-Artal, R., Montiel, J. M. M. & Tardos, J. D. ORB-
SLAM: a versatile and accurate monocular SLAM
system. IEEE Transactions on Robotics (2015).
Newcombe, R. A., Fox, D. & Seitz, S. M. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohi, P., Shotton, J., Hodges, S. & Fitzgibbon, A. KinectFusion: Real-time dense surface mapping and tracking in Proc. ISMAR (2011).
Nießner, M., Zollhöfer, M., Izadi, S. & Stamminger, M. Real-time 3D reconstruction at scale using voxel hashing. ACM TOG (2013).
Orts-Escolano, S., Rhemann, C., Fanello, S., Chang,
W., Kowdle, A., Degtyarev, Y., Kim, D., Davidson,
P. L., Khamis, S., Dou, M., et al. Holoportation: Virtual
3D Teleportation in Real-time in Proc. UIST (2016).
Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton,
J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A.,
Wei, Y., et al. Accurate, Robust, and Flexible Real-time
Hand Tracking in Proc. CHI (2015).
Slavcheva, M., Baust, M., Cremers, D. & Ilic, S.
KillingFusion: Non-Rigid 3D Reconstruction Without
Correspondences in The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR) (2017).
Sumner, R. W., Schmid, J. & Pauly, M. Embedded de-
formation for shape manipulation. ACM TOG (2007).
Tagliasacchi, A., Schroeder, M., Tkach, A., Bouaziz,
S., Botsch, M. & Pauly, M. Robust Articulated-ICP for
Real-Time Hand Tracking. Computer Graphics Forum
(Proc. Symposium on Geometry Processing) (2015).
Tankovich, V., Schoenberg, M., Fanello, S. R., Kowdle,
A., Rhemann, C., Dzitsiuk, M., Schmidt, M., Valentin,
J. & Izadi, S. SOS: Stereo Matching in O(1) with
Slanted Support Windows in IROS (2018).
Tanskanen, P., Kolev, K., Meier, L., Camposeco, F.,
Saurer, O. & Pollefeys, M. Live metric 3d reconstruc-
tion on mobile phones in Proceedings of the IEEE
International Conference on Computer Vision (2013).
Taylor, J., Tankovich, V., Tang, D., Keskin, C., Kim,
D., Davidson, P., Kowdle, A. & Izadi, S. Articulated
Distance Fields for Ultra-Fast Tracking of Hands Inter-
acting. ACM Trans. on Graphics (Proc. of SIGGRAPH
Asia) (2017).
Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C. & Nießner, M. Face2Face: Real-time Face Capture and Reenactment of RGB Videos in CVPR (2016).
Vlasic, D., Baran, I., Matusik, W. & Popović, J. Articulated mesh animation from multi-view silhouettes. ACM TOG (Proc. SIGGRAPH) (2008).
Wang, S., Fanello, S. R., Rhemann, C., Izadi, S. &
Kohli, P. The Global Patch Collider in CVPR (2016).
Weise, T., Bouaziz, S., Li, H. & Pauly, M. Realtime performance-based facial animation. ACM TOG (2011).
Whelan, T., Leutenegger, S., Salas-Moreno, R., Glocker, B. & Davison, A. ElasticFusion: Dense SLAM without a pose graph in Proc. Robotics: Science and Systems (2015).
Whelan, T., Kaess, M., Fallon, M., Johannsson, H., Leonard, J. & McDonald, J. Kintinuous: Spatially extended KinectFusion (2012).
Yi, K. M., Trulls, E., Lepetit, V. & Fua, P. LIFT:
Learned Invariant Feature Transform in Proceedings of
the European Conference on Computer Vision (2016).
Yu, T., Guo, K., Xu, F., Dong, Y., Su, Z., Zhao, J., Li, J., Dai, Q. & Liu, Y. BodyFusion: Real-time Capture of Human Motion and Surface Geometry Using a Single Depth Camera in Proc. of the Intern. Conf. on Comp. Vision (2017).