
TwinFusion: High Framerate Non-Rigid Fusion

through Fast Correspondence Tracking

Kaiwen Guo, Jonathan Taylor, Sean Fanello, Andrea Tagliasacchi,

Mingsong Dou, Philip Davidson, Adarsh Kowdle, Shahram Izadi

Google Inc.

Abstract

Real time non-rigid reconstruction pipelines are extremely

computationally expensive and easily saturate the highest

end GPUs currently available. This requires careful strate-

gic choices to be made about a set of highly interconnected

parameters that divide up the limited compute. At the same

time, offline systems prove the value of increasing voxel res-

olution, more iterations, and higher frame rates. To this end,

we demonstrate a set of remarkably simple but effective mod-

iﬁcations to these algorithms that signiﬁcantly reduce the av-

erage per-frame computation cost allowing these parameters

to be increased. Speciﬁcally, we divide the depth stream into

sub-frames and fusion-frames, disabling both model accu-

mulation (fusion) and non-rigid alignment (model tracking)

on the former. Instead, we efﬁciently track point correspon-

dences across neighboring sub-frames. We then leverage

these correspondences to initialize the standard non-rigid

alignment to a fusion-frame where data can then be accu-

mulated into the model. As a result, compute resources in

the modiﬁed non-rigid reconstruction pipeline can be imme-

diately repurposed to increase voxel resolution, use more
iterations, or increase the frame rate. To demonstrate the

latter, we leverage recent high frame rate depth algorithms

to build a novel “twin” sensor consisting of a low-res/high-

fps sub-frame camera and a second low-fps/high-res fusion

camera.

1. Introduction

The use of cameras to digitize the geometry, texture, light-

ing and motion of arbitrary scenes is a fundamental problem

in computer vision. General monocular solutions remain

elusive, but practical algorithms have been developed that

leverage motion, shape or appearance priors, and/or require

instrumentation of the scene using motion markers or multi-

ple calibrated cameras.

Twin Pod

DynamicFusion TwinFusion

Figure 1. (top) Our TwinPod hybrid depth camera captures high-

speed performance with a pair of high (resp. low) framerate and

low (resp. high) resolution sensors. (bottom-left) The reconstruc-

tion in the canonical frame obtained by our re-implementation of

DynamicFusion on the 30FPS stream. (bottom-right) Our TwinFu-

sion algorithm efﬁciently resolves motion from the high-framerate

camera and exploits this information to guide the high-resolution

non-rigid reconstruction.

Digitizing a rigid world

As a large portion of the world

remains effectively static, the assumption of global rigid-

ity has proven to be highly applicable: multiple viewpoints

taken at different times can be treated as if they were taken

at the same time and without scene deformation. By leverag-

ing rigidity, researchers have been successful in developing

offline systems capable of sparsely reconstructing city-scale
environments from crowd-sourced RGB images [1],

or simultaneously mapping and localizing a camera in an

environment [21]. At the heart of such systems is the multi-

view triangulation of a sparse set of keypoints, each iden-

tiﬁed by local feature descriptors that have been carefully

hand crafted [19] or learned [39] to be invariant to scale,

orientation, and lighting conditions. Putting these keypoint

observations into correspondence is achieved by a carefully

initialized local optimization [20], or via a global combinato-

rial optimization [13]. The main limitations of such systems

are that (1) the images must be sufﬁciently rich in visual fea-

tures [3], and (2) the computed reconstructions are typically

sparse [31].

Depth to the rescue

The advent of real-time depth cam-

eras has allowed for the dense reconstructions of scenes,

including those with large untextured areas [23]. As these

processes rely on local geometric registration, high accu-

racy frame-to-frame tracking is imperative to avoid error

accumulation, as this would likely lead to loss of tracking.

Thanks to the limited number of DOFs, 30 Hz depth cameras

have proven effective for scene reconstruction [24,38,5,

37], but these results still assume slow camera motion in

order to avoid tracking failure. Unfortunately, as the number

of degrees of freedom increases, the problem is exacerbated.

For this reason, application-speciﬁc priors and re-initializers

are typically introduced to compensate for the limited frame-

rate; see [26,29] for hand (≈26+5 dof), [36,33] for face (≈25+50 dof), and [17,2] for body tracking (≈72+10 dof).

Digitizing generic non-rigid geometry

General non-

rigid reconstruction techniques such as [22,15,7] lack a

ﬁxed model, so incrementally accumulating a model through

frame-to-frame tracking remains the only viable option. It is

thus not surprising that these techniques, employing tens of

thousands of parameters to model the general non-rigid scene

deformation, are extremely brittle at low framerates; see Fig-

ure 1. The impressive results showcased in these works

often rely on some combination of carefully orchestrated

motions [22], multi-view cameras [7], and model resetting

strategies [8]. State-of-the-art real time non-rigid reconstruc-

tion pipelines are computationally expensive, and easily satu-

rate the highest end GPUs currently available. This requires

careful strategic choices of parameters (e.g. frame-rate, num-

ber of solver iterations, voxel resolution) that determine the

allotment of the limited compute. Additionally, these pa-

rameters are so highly interconnected (e.g. higher framerate

requires fewer iterations) that choosing their values is some-

what of an art. Further, ofﬂine systems that crank these dials

to their max [17] show that we are nowhere close to realizing

the full potential of these algorithms in real time unless we
wait years for advances in GPU performance.

High framerate tracking

The introduction of high fram-

erate depth cameras, such as the 200Hz sensor in [12],

promises to dramatically increase the robustness of com-

plex real-time tracking problems. For example, Taylor et al.

[32] leveraged this camera to robustly track a hand model

with negligible reliance on discriminative reinitialization.

In contrast to such work, running non-rigid reconstruction

techniques at higher framerates would require diverting com-

pute resources to processing more frames by reducing voxel

resolution, decreasing the expressiveness of the deformation

model, or reducing the number of iterations in the non-rigid

alignment. Any one of these modiﬁcations would likely

erase any accuracy gains made by increasing the framer-

ate. This again elucidates the frustrating zero-sum game

one ﬁnds oneself in as they try to tune the various parame-

ters in non-rigid reconstruction pipelines. In response, we

demonstrate a set of remarkably simple, but effective mod-

iﬁcations to the standard non-rigid reconstruction pipeline

that allows the average per-frame computation cost to be

signiﬁcantly reduced. In particular, we divide the sequence

into fusion-frames and sub-frames, and only enable non-rigid

tracking and model accumulation (i.e. fusion) on the fusion-

frames. Further, we only use the sub-frames to perform a

highly efﬁcient frame-to-frame tracking of point correspon-

dences that is highly effective under the assumption of small

inter-frame motion. These correspondences then allow a

robust bootstrapping of the typical non-rigid alignment to

the fusion-frame. Lastly, we notice that the resolution of the

sub-frames can be dramatically reduced allowing for further

computational savings from any upstream depth algorithm.

As a result of these simple modiﬁcations, the majority of the

non-rigid reconstruction algorithms can immediately free

up and repurpose compute by simply tagging an increasing

number of frames as sub-frames. In particular, we lever-

age this to signiﬁcantly increase the framerate of the depth

images that can be processed, unlocking the advantages of

recent high framerate depth cameras.

Introducing the TwinPod

Currently, no consumer hardware is available that can take
full advantage of these modifications (most RGBD sensors
have a framerate ≤ 90 Hz). Hence, in this paper we further in-

troduce a novel hybrid RGBD capture sensor speciﬁcally

designed to provide this algorithm the ideal input. Our sen-

sor consists of a pair of depth cameras that capture data at

complementary framerates and resolutions; see Figure 2. A

fusion camera streams high resolution images at a low fram-

erate (1280x1024@30Hz) – data from this source is used to

fuse detailed geometric information into the model over time.

As this process requires the current deformation parameters

of the model, a separate sub-frame camera streams low res-

olution images at very high framerate (640x512@500Hz) –


Figure 2. (left) The tracking camera produces high-fps (500 Hz) low-resolution (640×512) depth maps, whereas the fusion camera outputs high-resolution (1280×1024) depth at low-fps (30 Hz). (right) Example of a capture sequence with keyframe and subframe notation. Notice how the large frame-to-frame motion in the high resolution capture (bottom) is compensated by the high speed camera (top).

data from this source is used to efﬁciently track point cor-

respondences through time. Notice we do not perform any

non-rigid alignment in the sub-frames: the actual non-rigid

tracking happens only in the fusion-frames, leveraging the

tracked correspondences in the sub-frames. With a mere 2 ms
of compute time available between sub-frames, we propose a

method to track point correspondences by performing a fast

local search between neighboring frames. The overall result

is the ﬁrst fusion system leveraging the recent availability of

high speed depth cameras to more accurately track non-rigid

geometry in rapid motion.

2. Related works

The pioneering DynamicFusion work of Newcombe et

al. [22] demonstrated how high-quality reconstructions can

be obtained without strong assumptions about the geom-

etry being observed. This is achieved by representing

the reconstructed surface via a Truncated Signed Distance

Field (TSDF) [4], and non-rigidly deforming this model onto

the data via an embedded deformation graph [28,6]. Once in

good alignment, similarly to KinectFusion [23], the current

depth frame can then be incrementally fused into the canoni-

cal model. As an alternative to deformation graphs, Innmann

et al. [15] encode transformations in the same canonical

voxel space used to store the TSDF. With the addition of

suitable regularizers, Slavcheva et al. [27] claim this even

allows for topological changes to be correctly handled.

Coping with low framerates

Follow-up works such as [7]

and [8] extended this approach to multi-camera setups to-

wards the general target of free viewpoint rendering [25]. To

avoid loss of tracking due to the large inter-frame motions

caused by low framerate acquisition, these methods heavily

rely on semantic correspondences computed via discrimina-

tive methods [35] or spectral embeddings [16]. Other ways to

improve tracking include accounting for SIFT keypoint [15],

shading constraints [14], or skeletal shape priors [40]. Unfor-

tunately, when motions between adjacent frames are bigger

than some threshold, all these systems start to struggle; see

Figure 1. The best one can hope for is that, upon tracking

failure, the system resets the reconstruction to the current

depth map [7], but this leads to unsightly artifacts. Overall,

the framerate at which these systems run limits their overall

robustness beyond carefully orchestrated motions; e.g. [22].

Framerate to the rescue

In our work, we tackle these

problems with an end-to-end solution encompassing both

hardware and software. We contribute along three major

directions by introducing:

•

a hybrid depth camera system for simultaneous high-

framerate / high-quality capture,

•

a highly efficient non-rigid tracking/fusion algorithm
capable of exploiting high framerate data,

•

the ﬁrst quantitative evaluation framework (data and

metrics) for non-rigid tracking/fusion.

3. Method

Our algorithm processes two streams of real-time RGBD

data, and reconstructs the geometry being observed. The

first stream {F_t}, which we call the fusion-frame stream, is
high resolution (1280×1024) but low framerate (30 Hz). The
second stream, which we call the sub-frame stream, is at a
low resolution (640×512) but operates at a high framerate
(500 Hz), and is synchronized to the fusion-frame stream so
as to provide S sub-frames {S_t^s}_{s=1}^{S} per fusion-frame (see
Figure 2).

Figure 3. Panels, left to right: Ground Truth, F-ICP, S-ICP with λ_norm = 0, S-ICP with λ_norm > 0. (left) Given ground-truth correspondences between keyframes, the optimization (1) converges perfectly, enabling accurate fusion. (middle-left) When the motion between keyframes is too significant, closest-point (CP) correspondences can drive registration towards a local minimum. (middle-right) By locally tracking closest-point correspondences through time (λ_norm = 0 in Eq. 5) we can leverage, to some extent, sub-frame information. (right) By incorporating normals in the local lookup (λ_norm > 0 in Eq. 5), correspondences close to ground truth can be recovered with minimal computational effort.

Non-rigid registration – Section 3.1
Our reconstruction algorithm non-rigidly registers the model in the previous fusion-frame M_{t−1} to the current fusion-frame F_t. Then, it fuses the deformed model together with the data and generates an updated model M_t. To enable fusion, we represent the model M in each keyframe as a TSDF; see [4].

Sub-frame matching – Section 3.2
Between each pair of fusion-frames (F_{t−1}, F_t), we efficiently process the corresponding sub-frames {S_{t−1}^s}_{s=1}^{S} to summarize the observed motion. This information is then used to guide the non-rigid registration between fusion-frames. As the framerate of {S_t} is very high, the magnitude of the motion is very small, and a simple yet very efficient technique can be used for the task at hand. The efficiency of this step is key: to avoid dropping frames, at most 2 ms of processing time is available. For convenience, in what follows we drop the dependence on t in the subsequent sub-frames S^1, ..., S^S, and refer to the previous fusion-frame as F̄ and to the current fusion-frame as F.

3.1. Non-rigid registration

Following the state of the art in non-rigid reconstruction [22,7,8], we extract a triangular mesh representation of the model M via marching cubes [18]. The deformation of its M vertices, M(θ) = {v_m(θ)}, is encoded by a deformation graph [28] parameterized by a low-dimensional vector θ ∈ R^d. The mesh of the previous fusion-frame M̄ is represented as M(θ̄), which is the starting point of our iterative optimization. Our primary goal is to register this mesh with the target frame F by iteratively solving the optimization problem:

\[
\arg\min_{\theta} \; \lambda_{\text{data}} E_{\text{data}}(\theta) + \lambda_{\text{reg}} E_{\text{reg}}(\theta) + \lambda_{\text{corr}} E_{\text{corr}}(\theta) \tag{1}
\]

Data fitting
The term E_data ensures that the deformed mesh is in alignment with the data in the fusion-frame:

\[
E_{\text{data}}(\theta) = \sum_{m=1}^{M} \rho\big(\langle n_m,\; p_m - v_m(\theta) \rangle\big) \tag{2}
\]

where (p_m, n_m) = Π_F(v_m(θ)) are, respectively, the position and normal of the projective correspondence of v_m(θ) in the depth map associated with the fusion-frame F, and ρ(·) is a robust kernel allowing us to be resilient to outliers.
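As a concrete illustration, the point-to-plane residual of Eq. (2) can be sketched as follows. This is a minimal sketch, not the paper's implementation: the projective lookup Π_F is assumed to have been evaluated already (the arrays p and n are hypothetical inputs), and a Geman-McClure penalty is assumed for the unspecified robust kernel ρ(·):

```python
import numpy as np

def geman_mcclure(r, sigma=0.01):
    """Robust kernel rho(.); the paper does not specify its form, so a
    Geman-McClure penalty is assumed here purely for illustration."""
    r2 = r * r
    return r2 / (r2 + sigma * sigma)

def e_data(v, p, n, sigma=0.01):
    """Point-to-plane data term of Eq. (2).

    v: (M, 3) deformed model vertices v_m(theta)
    p: (M, 3) positions of the projective correspondences Pi_F(v_m(theta))
    n: (M, 3) unit normals of those correspondences
    """
    # signed point-to-plane residual <n_m, p_m - v_m(theta)>
    r = np.einsum('ij,ij->i', n, p - v)
    return geman_mcclure(r, sigma).sum()
```

The robust kernel saturates for large residuals, so outlier correspondences contribute a bounded amount to the energy rather than dominating it.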

Regularizers
The term E_reg(θ) is a weighted sum of the typical E_rot, E_smooth, and E_hull regularizers. Briefly, these terms encourage smoothly varying as-rigid-as-possible deformations that respect visual hull constraints; see [8] for more details.

Landmark correspondences
In low frame-rate tracking, to mitigate the effect of large motions, it is typical to use a set of correspondences to guide the iterative optimization towards the correct local basin of convergence. For example, [15] employs SIFT features, [7] computes matches via learnt hash functions, while [8] computes them by mapping the geometry in an isometry-invariant latent space. Regardless of the process, a set of correspondences C = {(v_p(θ), c_p)}_{p∈P}, where P ⊆ {1, ..., M} and c_p ∈ R^3, can be introduced. These can be accounted for in the optimization by the following energy:

\[
E_{\text{corr}}(\theta) = \sum_{p \in \mathcal{P}} \rho\big(\| v_p(\theta) - c_p \|\big) \tag{3}
\]

where ρ(·) is a robust kernel allowing us to be robust to outliers caused by poorly tracked correspondences; see Section 6. In contrast to other methods, and as discussed in the following section, we compute these correspondences by leveraging the sub-frames {S^s} between two subsequent fusion-frames.

3.2. Sub-frame matching

Since we expect the motion between two sub-frames S^s and S^{s+1} to be small, it should be possible to perform a local search to find good correspondences between two adjacent sub-frames. In particular, we use an efficient local projective search. For a 3D point x ∈ S^s, let its projective neighborhood on the sub-frame S^{s+1} be defined by the set of 3D points:

\[
\mathcal{N}(x; \mathcal{S}^{s+1}) = \{\, n \in \mathcal{S}^{s+1} : \| \Pi(x) - \Pi(n) \|_\infty < \epsilon \,\} \tag{4}
\]

For this query point, we find an optimal correspondence by solving, via exhaustive search,

\[
\arg\min_{n \in \mathcal{N}(x)} \; E_{\text{icp}}(x, n) + \lambda_{\text{norm}} E_{\text{norm}}(x, n) \tag{5}
\]
\[
E_{\text{icp}}(x, n) = \| n - x \|^2 \tag{6}
\]
\[
E_{\text{norm}}(x, n) = \| \mathcal{S}^{s+1}_{\perp}(n) - \mathcal{S}^{s}_{\perp}(x) \|^2 \tag{7}
\]

where S_⊥(·) returns the unit normal at the point. The term E_icp encourages a match with a nearby target point, while E_norm encourages the matches to have similar normals.

Fusion-frame to fusion-frame correspondences
For those vertices that are visible in the fusion-frame F̄, we then build the set of correspondences C = {(v_p(θ), n_p)}. We track each point n_p by initializing it with v_p(θ) in the first keyframe, and looping over each sub-frame while applying the optimization in (5).
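A minimal sketch of this correspondence tracker (Eqs. 4–7 plus the chaining over sub-frames) is given below. The pinhole intrinsics, the organized point/normal-map representation of each sub-frame, and the search-window half-width eps are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

# hypothetical pinhole intrinsics for a 640x512 sub-frame camera
FX, FY, CX, CY = 500.0, 500.0, 320.0, 256.0

def project(x):
    """Pi(.): project a 3D point to integer (row, col) pixel coordinates."""
    return (int(round(FY * x[1] / x[2] + CY)),
            int(round(FX * x[0] / x[2] + CX)))

def local_search(x, nx, pts, nrm, eps=2, lam_norm=0.1):
    """Eqs. (4)-(7): exhaustive search over the projective neighborhood of x
    in the next sub-frame, scoring E_icp + lambda_norm * E_norm.

    pts, nrm: (H, W, 3) organized point and unit-normal maps of S^{s+1}.
    Returns the pixel index (row, col) of the best-matching point."""
    H, W, _ = pts.shape
    r0, c0 = project(x)
    best, best_e = None, np.inf
    for r in range(max(0, r0 - eps), min(H, r0 + eps + 1)):
        for c in range(max(0, c0 - eps), min(W, c0 + eps + 1)):
            e = (np.sum((pts[r, c] - x) ** 2)                 # E_icp
                 + lam_norm * np.sum((nrm[r, c] - nx) ** 2))  # E_norm
            if e < best_e:
                best, best_e = (r, c), e
    return best

def track_point(x, nx, subframes):
    """Chain the local search across sub-frames S^1..S^S, starting from the
    vertex position in the previous fusion-frame."""
    for pts, nrm in subframes:
        r, c = local_search(x, nx, pts, nrm)
        x, nx = pts[r, c], nrm[r, c]
    return x
```

With eps = 2 the search window is only 5×5 pixels, which is what keeps the per-sub-frame cost within the 2 ms budget mentioned above.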

3.3. Optimization schedule

To align the model to the current fusion-frame we first perform a Gauss-Newton optimization of (1) with λ_data = 0 and λ_corr = 1, starting from the deformation parameters of the previous frame, until convergence. We then refine the parameters by setting λ_data = 1 and λ_corr = 0. Importantly, the minimization of the energy (1) takes place only on the fusion-frames.
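The two-stage schedule can be sketched as follows. This is not the paper's solver: gauss_newton is a deliberately minimal stand-in, and the residual/Jacobian callbacks (r_data, r_corr, r_reg and their Jacobians) are hypothetical placeholders for the energy terms of Eq. (1):

```python
import numpy as np

def gauss_newton(res, jac, theta, iters=10):
    """Minimal Gauss-Newton loop (a toy stand-in for the real solver)."""
    for _ in range(iters):
        # solve the linearized least-squares problem for the update step
        step = np.linalg.lstsq(jac(theta), res(theta), rcond=None)[0]
        theta = theta - step
    return theta

def align(theta0, r_data, J_data, r_corr, J_corr, r_reg, J_reg):
    """Section 3.3 schedule: first lambda_data = 0, lambda_corr = 1
    (coarse alignment from tracked correspondences), then lambda_data = 1,
    lambda_corr = 0 (refinement against the fusion-frame depth).
    The regularizers stay active in both stages."""
    cat = lambda a, b: (lambda t: np.concatenate([a(t), b(t)]))
    rows = lambda a, b: (lambda t: np.vstack([a(t), b(t)]))
    theta = gauss_newton(cat(r_corr, r_reg), rows(J_corr, J_reg), theta0)
    theta = gauss_newton(cat(r_data, r_reg), rows(J_data, J_reg), theta)
    return theta
```

The point of the first stage is purely to land the parameters in the correct basin of convergence, so that the projective data term of the second stage can refine rather than mis-associate.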

4. Capture setup

Our method assumes as input two streams of RGBD data,
low-res/high-fps (640×512 at 500 Hz) and high-res/low-fps
(1280×1024 at 30 Hz). Although many commercial depth

sensors are available, none of them are capable of running at
the required high resolution or high frame-rate. However, a
number of recent algorithmic contributions have made high-

framerate depth estimation a reality [10,9,12,30]. Inspired

by this work, we resort to active stereo to compute disparity

maps, where a spatially unique texture is projected into the

environment to simplify the task of correspondence search.

Camera hardware

The basic setup consists of two IR

cameras and a laser projector that provides active texture in

the scene. Extending the work of [12], we chose the OnSemi

Python1300 packaged as a USB3 camera module by Ximea.

These modules are already commercially available and rep-

resent a good tradeoff between price and quality. Moreover

they expose an input trigger port that is crucial for synchro-

nization. This sensor is capable of achieving a framerate of
210 fps at a spatial resolution of 1280×1024 and over 800 fps
at 640×512 resolution. We also coupled an RGB camera
for passive texture. The second hardware component consists

of the IR illumination that facilitates the goal of any stereo

algorithm. We leverage a VCSEL-based illuminator that is

commercialized by Heptagon.

TwinPod design

To implement the twin RGBD stream

used in this work, we built a TwinPod, consisting of 4 IR and
2 RGB cameras; see Fig. 2. All the cameras are calibrated
and synced. Given the large field of view of our lenses
(80 deg), we use 4 projectors to cover the full scene. To

compute disparity maps at high framerate, we use the Hash-

Match algorithm described in [11]. This method employs a

learnt representation to remap the IR images into a binary

space and then performs fast CRF inference in this new fea-

ture space. This technique runs in 1 ms at high resolution
(1280×1024) and in 0.25 ms at low resolution (640×512) on
an NVIDIA Titan X GPU. In Figure 2 we show an example

of both the fast framerate and high resolution streams this

device can capture. Notice how between two fusion-frames

F1

and

F2

there is considerable non-rigid motion. Our track-

ing algorithm leverages instead the small motion between

consecutive frames in the high speed capture stream.

5. Evaluation

We extensively evaluate TwinFusion on multiple synthetic

and real world sequences. We provide both quantitative and

qualitative evaluations, showing how our method outper-

forms the state-of-the-art. Further evaluations can be found

in the supplementary video.

Dealing with tracking failures

Non-rigid reconstruction

pipelines rely on a non-rigid alignment via tracking in order

to accumulate detail and average out noise. When there is

a misalignment, standard fusion pipelines quickly contami-

nate the accumulated model with erroneous geometry; see

Figure 1. At that point it is nearly impossible to align an

incorrect model to the next frame and catastrophic failure

occurs. Of course, tracking failures are bound to occur at

some point, so current methods appear to take one of two

approaches to obtain reasonable results. One approach is


Figure 4.

Tracking/Qualitative:

evaluation of tracking results on the synthetic bouncing sequence from Vlasic et al. [34] (i.e. beginning

and middle of sequence as highlighted in Figure 5). These results are better appreciated in the supplementary video.

to carefully capture a slow moving sequence with little oc-

clusion, and choose parameters carefully so that tracking

succeeds [22]. Another approach is to constantly check if

the model aligns with the data and perform a partial reset

in regions where registration fails [8]. In both approaches,

tracking failures appear to be hidden, despite good tracking being

critical for building robust systems that accumulate model

detail. This further validates work that attempts to isolate

a single component (i.e. tracking) as a means to realize a

universally improved system.

Evaluation methodology

Therefore, it seems that a robust

and accurate system will necessarily have to perform partial

resetting to erase erroneous geometry and double surfaces

but also track well in order to accumulate detail. As partial

resetting can be added to any system for increased robust-

ness, we focus on evaluating tracking. To avoid the problem

of tracking failing, we instead divide all of our sequences up

into a collection of short sub-sequences. As shown below,

even maintaining tracking for these very short subsequences

is challenging, but our algorithm performs much better than

the state-of-the-art. We use DynamicFusion [22] as an exemplar of a non-rigid reconstruction pipeline, which we approximate by running our TwinFusion algorithm while ignoring the correspondences estimated from sub-frames (i.e. with λ_corr = 0). We also compare against a variant that performs

non-rigid alignment on both sub-frames and fusion-frames.

While this method is far too expensive to execute in real-time,

it represents an upper bound on tracking performance, and

we refer to it as our Baseline. Finally, we asked the authors

of [8] to run their Motion2Fusion pipeline in a tracking-only

mode on our dataset.

5.1. Synthetic evaluation dataset

Non-rigid fusion pipelines are complicated systems con-

taining a myriad of interacting subcomponents: the accu-

mulation of a model (fusion), the non-rigid deformation

function (parameterization), and the ability to ﬁnd parame-

ters for that deformation function that non-rigidly aligns the

deformed model to the data in the current frame (tracking).

Although the quality of a reconstruction perceived by a user

may be the ultimate test for a system, little insight can be

gleaned as to what components are performing well. Quanti-

fying the performance of one of these components is crucial

to this work as our main contribution is towards improving

the tracking component. Towards this goal, we use a 3D

mesh of an object that dynamically changes its shape by

performing fast motions. We place a synthetic camera facing

a moving object, and render 2D depth maps at both our high

and low acquisition framerates. Our dataset uses the 4D

sequences acquired by Vlasic et al. [34], which consists of

dynamic surface mesh sequences with consistent topology

at 30fps. Using those as fusion-frames, we then temporally

interpolate the mesh surface over the original sequence to

generate an artiﬁcial high-framerate sub-frame sequence at

480fps, and render both into sequences of depth maps to

match the input formats generated by the TwinPod.

Figure 5.

Tracking/Quantitative:

We compare the impact in ac-

curacy on our entire system that results from the varying tracking

accuracies elucidated in Figure 4. In particular, TwinFusion, Dy-

namicFusion and an ofﬂine Baseline are run with fusion (i.e. TSDF

update) turned off. We consider running on each sequence in its

entirety (left) and on a set of subsequences with ground-truth reini-

tialization (right).

To quantify tracking/fusion error, we first take each (current) model vertex v_m(θ) back into the reference pose θ̂, and find the closest ground truth vertex in the first (i.e. template) frame as:

\[
h^{*}(p) = \arg\min_{m} \| p - \tilde{v}^{1}_{m} \| \tag{8}
\]

Then, the average error is computed as follows:

\[
E(\theta) = \frac{1}{M} \sum_{m=1}^{M} \big\| v_m(\theta) - \tilde{v}^{t}_{h^{*}(v_m(\hat{\theta}))} \big\| \tag{9}
\]
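Under the assumption that the ground-truth meshes and the warped-back model vertices are available as plain arrays, the metric of Eqs. (8)–(9) can be computed as in the brute-force sketch below (the array names are illustrative, not from the paper):

```python
import numpy as np

def tracking_error(v_cur, v_ref, gt_template, gt_t):
    """Average per-vertex error of Eqs. (8)-(9).

    v_cur       : (M, 3) current model vertices v_m(theta)
    v_ref       : (M, 3) the same vertices warped back to the reference pose
    gt_template : (N, 3) ground-truth vertices of the first (template) frame
    gt_t        : (N, 3) ground-truth vertices at the current frame t
    """
    # Eq. (8): closest ground-truth template vertex for each model vertex
    d = np.linalg.norm(v_ref[:, None, :] - gt_template[None, :, :], axis=2)
    h = np.argmin(d, axis=1)
    # Eq. (9): mean distance to the corresponding ground-truth vertex at t
    return float(np.mean(np.linalg.norm(v_cur - gt_t[h], axis=1)))
```

The nearest-neighbor assignment is computed once in the reference pose, so the metric penalizes drift of correspondences over time rather than just surface-to-surface distance.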

Tracking evaluation – Figure 4 and Figure 5
We analyze the per-frame non-rigid alignment error across all methods by disabling fusion (i.e. the TSDF is

initialized with ground truth and never updated). Note how

a tracking failure early on in the sequence can completely

spoil the remaining results (Figure 5, left). Hence, we also

consider dividing each sequence into a set of shorter sub-

sequences (Figure 5, right). It is clear from these results that

TwinFusion signiﬁcantly outperforms the standard Dynam-

icFusion approach. Moreover, the TwinFusion results are

very close to the (ofﬂine) Baseline, demonstrating how our

algorithm provides a viable way to unlock the beneﬁts of

higher framerate depth streams. Motion2Fusion [8], without
its resetting strategy, behaves similarly to DynamicFusion:

this is expected since in this particular case it cannot take

advantage of multiple views as in the original algorithm.

Figure 6.

Fusion/Quantitative:

We compare the impact in accu-

racy (average error from Eq. 9, in meters) on our entire system that

results from the varying tracking accuracies elucidated in Figure 5.

In particular, TwinFusion, DynamicFusion and an ofﬂine Baseline

are run with Fusion (i.e. model accumulation) turned on.

Fusion evaluation – Figure 6 and Figure 7

Having quan-

tiﬁed the increased non-rigid alignment accuracy we obtain

through TwinFusion, we now seek to elucidate how tracking

accuracy interacts with a fusion system. To this end, we

perform a similar experiment but with fusion enabled, so

that the system is trying to accumulate a model as it tracks.

Note how we do not attempt to fuse the entire sequence, but

Figure 7. Fusion/Qualitative: An example of the impact of tracking on fusion results (Baseline, DynamicFusion, TwinFusion at frames 100, 101, 105, 107, 109). Frames are selected to represent the range of motion between two restarts; see Figure 6. Tracking keyframes at low framerate (DynamicFusion) cannot keep up with the fast motion of the actor landing his jump, resulting in significant ghost geometry. The correspondence tracking in TwinFusion allows us to reach results comparable to those of the Baseline, but at a fraction of the computational budget, and in real-time.

Figure 8. Qualitative results on a challenging real scene recorded by our TwinPod. We show the fused models for multiple frames processed with TwinFusion and DynamicFusion. TwinFusion achieves very compelling results in real-time, under a tractable computational budget. Conversely, DynamicFusion cannot cope with the fast motion. Please see the supplementary video for more comparisons.

instead only consider short sub-sequences. Indeed, since

tracking with a template eventually fails (see previous dis-

cussion), accumulating a model only hastens failure and

thus such an experiment would be of no value. Notice in

Figure 6 and Figure 7 how TwinFusion is always very com-

parable with the expensive, ofﬂine baseline method, whereas

DynamicFusion dramatically and rapidly fails.

5.2. Live captured data – Figure 8

We ﬁnally use our TwinPod to live capture sequences of

subjects performing fast actions. Here we demonstrate the

beneﬁts of TwinFusion in challenging, real scenarios using

a single camera. As shown in Figure 8, the system robustly

tracks highly deformable objects, such as a scarf, whereas

DynamicFusion fails – please see supplementary video.

6. Conclusions

In this paper, we recognize non-rigid registration via track-

ing as a crucial part of modern non-rigid reconstruction

pipelines that hope to accumulate model detail. In addi-

tion, we also recognize that the robustness and accuracy of

tracking has not been carefully examined in the literature. In-

stead, the most seemingly robust systems [8] are frequently

performing partial resets of the misaligned model, deleting

phantom surfaces and erroneous geometry, but also deleting

accumulated detail. Systems that do not partially reset the

model appear to track only a small set of sequences, which is

not surprising since one should expect tracking to eventually

fail in general. Thus by focusing on and improving tracking

accuracy and robustness, we can either increase the number

of sequences the latter systems will run on or, more desirably,

increase the amount of detail that robust resetting systems

such as [8] can accumulate. To this end, we have introduced

a set of simple but surprisingly effective modiﬁcations to

any standard non-rigid tracking pipeline. The modiﬁcations

rely on tracking point correspondences, leveraging the small

inter-frame motion between sub-frames. In future work, we

plan to explore even faster depth streams to push the track-

ing precision even further; e.g. the 4000 fps system in [30].

One limitation of our method is the possibility that corre-

spondences slip during tangential motion, and we leave it as

future work to examine leveraging color constraints or regu-

larizers that might alleviate this problem. Nonetheless, we

ﬁnd through synthetic and qualitative experiments that we

obtain better tracking accuracy than other real time methods.

References

1. Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Curless, B., Seitz, S. M. & Szeliski, R. Building Rome in a day. Communications of the ACM (2011).

2.

Bogo, F., Kanazawa, A., Lassner, C., Gehler, P.,

Romero, J. & Black, M. J. Keep it SMPL: Automatic

estimation of 3D human pose and shape from a single

image in Proc. of the European Conf. on Comp. Vision

(2016).

3.

Crivellaro, A., Rad, M., Verdie, Y., Yi, K. M., Fua, P.

& Lepetit, V. Robust 3D Object Tracking from Monoc-

ular Images using Stable Parts. IEEE Transactions on

Pattern Analysis and Machine Intelligence (2017).

4.

Curless, B. & Levoy, M. A volumetric method for build-

ing complex models from range images in SIGGRAPH

(1996).

5.

Dai, A., Nießner, M., Zoll

¨

ofer, M., Izadi, S. &

Theobalt, C. BundleFusion: Real-time Globally Con-

sistent 3D Reconstruction using On-the-ﬂy Surface

Re-integration. ACM Transactions on Graphics 2017

(TOG) (2017).

6.

Dou, M., Taylor, J., Fuchs, H., Fitzgibbon, A. & Izadi,

S. 3D Scanning Deformable Objects with a Single

RGBD Sensor in CVPR (2015).

7.

Dou, M., Khamis, S., Degtyarev, Y., Davidson, P.,

Fanello, S. R., Kowdle, A., Orts Escolano, S., Rhe-

mann, C., Kim, D., Taylor, J., Kohli, P., Tankovich, V.

& Izadi, S. Fusion4D: Real-time Performance Capture

of Challenging Scenes. ACM TOG (2016).

8.

Dou, M., Davidson, P., Fanello, S. R., Khamis, S., Kow-

dle, A., Rhemann, C., Tankovich, V., & Izadi, S. Mo-

tion2Fusion: Real-time Volumetric Performance Cap-

ture. ACM TOG (SIGGRAPH Asia) (2017).

9.

Fanello, S. R., Rhemann, C., Tankovich, V., Kowdle,

A, Escolano, S. O., Kim, D & Izadi, S. Hyperdepth:

Learning depth from structured light without matching

in CVPR (2016).

10.

Fanello, S. R., Keskin, C., Izadi, S., Kohli, P., Kim, D.,

Sweeney, D., Criminisi, A., Shotton, J., Kang, S. B. &

Paek, T. Learning to be a depth camera for close-range

human capture and interaction in ACM Transactions

on Graphics (TOG) (2014).

11.

Fanello, S. R., Valentin, J., Kowdle, A., Rhemann, C.,

Tankovich, V., Ciliberto, C., Davidson, P. & Izadi, S.

Low Compute and Fully Parallel Computer Vision with

HashMatch in ICCV (2017).

12.

Fanello, S. R., Valentin, J., Rhemann, C., Kowdle,

A., Tankovich, V. & Izadi, S. UltraStereo: Efﬁcient

Learning-based Matching for Active Stereo Systems in

CVPR (2017).

13.

Fischler, M. A. & Bolles, R. C. Random sample con-

sensus: a paradigm for model ﬁtting with applications

to image analysis and automated cartography. Commu-

nications of the ACM (1981).

14.

Guo, K., Xu, F., Yu, T., Liu, X., Dai, Q. & Liu, Y. Real-

time Geometry, Albedo and Motion Reconstruction

Using a Single RGBD Camera. ACM Transactions on

Graphics (TOG) (2017).

15.

Innmann, M., Zollh

¨

ofer, M., Nießner, M., Theobalt, C.

& Stamminger, M. VolumeDeform: Real-time volumet-

ric non-rigid reconstruction in ECCV (2016).

16.

Jain, V. & Zhang, H. Robust 3D Shape Correspondence

in the Spectral Domain in SMA (2006).

17.

Loper, M., Mahmood, N., Romero, J., Pons-Moll, G.

& Black, M. J. SMPL: A skinned multi-person linear

model. ACM Transactions on Graphics (TOG) (2015).

18.

Lorensen, W. E. & Cline, H. E. Marching cubes: A

high resolution 3D surface construction algorithm in

ACM siggraph computer graphics (1987).

19.

Lowe, D. G. Distinctive Image Features from Scale-

Invariant Keypoints. IJCV (2004).

20.

Lucas, B. D., Kanade, T., et al. An iterative image

registration technique with an application to stereo

vision (1981).

21.

Mur-Artal, R., Montiel, J. M. M. & Tardos, J. D. ORB-

SLAM: a versatile and accurate monocular SLAM

system. IEEE Transactions on Robotics (2015).

22. Newcombe, R. A., Fox, D. & Seitz, S. M. Dynamicfu-

sion: Reconstruction and tracking of non-rigid scenes

in real-time in Proceedings of the IEEE conference on

computer vision and pattern recognition (2015).

23.

Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux,

D., Kim, D., Davison, A. J., Kohi, P., Shotton, J.,

Hodges, S. & Fitzgibbon, A. KinectFusion: Real-time

dense surface mapping and tracking in Proc. ISMAR

(2011).

24.

Nießner, M., Zollh

¨

ofer, M., Izadi, S. & Stamminger,

M. Real-time 3D reconstruction at scale using voxel

hashing. ACM TOG (2013).

25.

Orts-Escolano, S., Rhemann, C., Fanello, S., Chang,

W., Kowdle, A., Degtyarev, Y., Kim, D., Davidson,

P. L., Khamis, S., Dou, M., et al. Holoportation: Virtual

3D Teleportation in Real-time in Proc. UIST (2016).

26.

Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton,

J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A.,

Wei, Y., et al. Accurate, Robust, and Flexible Real-time

Hand Tracking in Proc. CHI (2015).

27.

Slavcheva, M., Baust, M., Cremers, D. & Ilic, S.

KillingFusion: Non-Rigid 3D Reconstruction Without

Correspondences in The IEEE Conference on Com-

puter Vision and Pattern Recognition (CVPR) (2017).

28.

Sumner, R. W., Schmid, J. & Pauly, M. Embedded de-

formation for shape manipulation. ACM TOG (2007).

29.

Tagliasacchi, A., Schroeder, M., Tkach, A., Bouaziz,

S., Botsch, M. & Pauly, M. Robust Articulated-ICP for

Real-Time Hand Tracking. Computer Graphics Forum

(Proc. Symposium on Geometry Processing) (2015).

30.

Tankovich, V., Schoenberg, M., Fanello, S. R., Kowdle,

A., Rhemann, C., Dzitsiuk, M., Schmidt, M., Valentin,

J. & Izadi, S. SOS: Stereo Matching in O(1) with

Slanted Support Windows in IROS (2018).

31.

Tanskanen, P., Kolev, K., Meier, L., Camposeco, F.,

Saurer, O. & Pollefeys, M. Live metric 3d reconstruc-

tion on mobile phones in Proceedings of the IEEE

International Conference on Computer Vision (2013).

32.

Taylor, J., Tankovich, V., Tang, D., Keskin, C., Kim,

D., Davidson, P., Kowdle, A. & Izadi, S. Articulated

Distance Fields for Ultra-Fast Tracking of Hands Inter-

acting. ACM Trans. on Graphics (Proc. of SIGGRAPH

Asia) (2017).

33.

Thies, J., Zollh

¨

ofer, M., Stamminger, M., Theobalt, C.

& Nießner, M. Face2Face: Real-time Face Capture

and Reenactment of RGB Videos in CVPR (2016).

34.

Vlasic, D., Baran, I., Matusik, W. & Popovi

´

c, J. Ar-

ticulated mesh animation from multi-view silhouettes.

ACM TOG (Proc. SIGGRAPH) (2008).

35.

Wang, S., Fanello, S. R., Rhemann, C., Izadi, S. &

Kohli, P. The Global Patch Collider in CVPR (2016).

36.

Weise, T., Bouaziz, S., Li, H. & Pauly, M. Real-

time performance-based facial animation. ACM TOG

(2011).

37.

Whelan, T., Leutenegger, S., Salas-Moreno, R, Glocker,

B. & Davison, A. ElasticFusion: Dense SLAM without

a pose graph in (2015).

38.

Whelan, T., Kaess, M., Fallon, M., Johannsson, H.,

Leonard, J. & McDonald, J. Kintinuous: Spatially ex-

tended kinectfusion (2012).

39.

Yi, K. M., Trulls, E., Lepetit, V. & Fua, P. LIFT:

Learned Invariant Feature Transform in Proceedings of

the European Conference on Computer Vision (2016).

40.

Yu, T., Guo, K., Xu, F., Dong, Y., Su, Z., Zhao, J., Li,

J., Dai, Q. & Liu, Y. BodyFusion: Real-time Capture of

Human Motion and Surface Geometry Using a Single

Depth Camera in Proc. of the Intern. Conf. on Comp.

Vision (ACM, 2017).