Fusion4D: Real-time Performance Capture of Challenging Scenes

Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello,
Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor
Microsoft Research
Figure 1: We present a new method for real-time high quality 4D (i.e. spatio-temporally coherent)
performance capture, allowing for incremental nonrigid reconstruction from noisy input from
multiple RGBD cameras. Our system demonstrates unprecedented reconstructions of challenging
nonrigid sequences, at real-time rates, including robust handling of large frame-to-frame
motions and topology changes.
Abstract
We contribute a new pipeline for live multi-view performance cap-
ture, generating temporally coherent high-quality reconstructions in
real-time. Our algorithm supports both incremental reconstruction,
improving the surface estimation over time, as well as parameter-
izing the nonrigid scene motion. Our approach is highly robust to
both large frame-to-frame motion and topology changes, allowing
us to reconstruct extremely challenging scenes. We demonstrate
advantages over related real-time techniques that either deform an
online generated template or continually fuse depth data nonrigidly
into a single reference model. Finally, we show geometric recon-
struction results on par with ofﬂine methods which require orders of
magnitude more processing time and many more RGBD cameras.
Keywords: nonrigid, real-time, 4D reconstruction, multi-view
Concepts: Computing methodologies → Motion capture;
1 Introduction
Whilst real-time 3D reconstruction has “come of age” in recent years
with the ubiquity of RGBD cameras, the majority of systems still
focus on static, non-moving, scenes. This is due to computational
and algorithmic challenges in reconstructing scenes under nonrigid
motion. In contrast to rigid scenes where motion is encoded by
a single 6DoF (six degrees of freedom) pose, the nonrigid case
requires solving for orders of magnitude more parameters in real-
time. Whereas both tasks must deal with noisy or missing data, and
handle occlusions and large frame-to-frame motions, the nonrigid
case is further complicated by changing scene topology – e.g. a
person removing a worn jacket or interlocked hands separating.
Authors contributed equally to this work
Corresponding author: shahrami@microsoft.com
SIGGRAPH ’16 Technical Paper, July 24-28, 2016, Anaheim, CA,
ISBN: 978-1-4503-4279-7/16/07
DOI: http://dx.doi.org/10.1145/2897824.2925969
Despite these challenges, there is clear value in reconstructing non-
rigid motion and surface deformations in real-time. In particular,
performance capture, where multiple cameras are used to reconstruct
human motion and shape, and even object interactions, is currently
constrained to ofﬂine processing: people interact in a scene and
then expect hours of processing time before seeing the ﬁnal result.
What if this processing could happen live in real-time directly as the
performance is happening? This can lead to new real-time experi-
ences such as the ability to watch a remote concert or sporting event
live in full 3D, or even the ability to communicate in real-time with
remotely captured people using immersive AR/VR displays.
However, despite remarkable progress in offline performance capture over the years
(see [Theobalt et al. 2010; Ye et al. 2013; Smolic 2011] for surveys), real-time
approaches have been incredibly rare, especially when considering high quality
reconstruction of general shape and motion, i.e. without a strong prior on the human body.
Recent work has demonstrated compelling real-time reconstructions of general nonrigid
scenes using a single depth camera [Zollhöfer et al. 2014; Newcombe et al. 2015]. Our
motivation, however, differs from these systems as we focus on robust real-time performance
capture across multiple views. As quantified later in this paper, this prior work cannot
meet our requirements for real-time performance capture for two main reasons. First, these
systems rely on a reference model that is used for model fitting, e.g. Zollhöfer et al. [2014]
use a statically captured reference model, i.e. template, and Newcombe et al. [2015] use a
volumetric model that is incrementally updated with new depth input. Ultimately, this
reference model regularizes the model fitting, but can also overly constrain it so that major
changes in shape and topology are hard to accommodate. Second, these systems find
correspondences by assuming small frame-to-frame motions, which makes the nonrigid
estimation brittle in the presence of large movements.
We contribute Fusion4D, a new pipeline for live multi-view perfor-
mance capture, generating temporally coherent high-quality recon-
structions in real-time, with several unique capabilities over this
prior work: (1) We make no prior assumption regarding the captured
scene, operating without a skeleton or template model, allowing
reconstruction of arbitrary scenes; (2) We are highly robust to both
large frame-to-frame motion and topology changes, allowing recon-
struction of extremely challenging scenes; (3) We scale to multi-view
capture from multiple RGBD cameras, allowing for performance
capture at qualities never before seen in real-time systems.
Fusion4D combines the concept of volumetric fusion with estima-
tion of a smooth deformation ﬁeld across RGBD views. This enables
both incremental reconstruction, improving the surface estimation
over time, as well as parameterization of nonrigid scene motion. Our
approach robustly handles large frame-to-frame motion by using a
novel, fully parallelized, nonrigid registration framework, including
a learning-based RGBD correspondence matching regime. It also
robustly handles topology changes, by switching between reference
models to better explain the data over time, and robustly blending
between data and reference volumes based on correspondence esti-
mation and alignment error. We compare to related work and show
several clear improvements over real-time approaches that either
track an online generated template or fuse depth data into a single
reference model incrementally. Further, we show geometric recon-
struction results on-par with ofﬂine methods which require orders of
magnitude more processing time and many more RGBD cameras.
2 Related Work
Multi-view Performance Capture: Many compelling offline performance capture systems have
been proposed. Some specifically model complex human motion and dynamic geometry, including
people with general clothing, possibly along with pose parameters of an underlying kinematic
skeleton (see [Theobalt et al. 2010] for a full review). Some methods employ variants of
shape-from-silhouette [Waschbüsch et al. 2005] or active or passive stereo [Starck and
Hilton 2007]. Template-based approaches deform a static shape model such that it matches a
human [de Aguiar et al. 2008; Vlasic et al. 2008; Gall et al. 2009] or a person's clothing
[… 2008]. Vlasic et al. [2009] use a sophisticated photometric stereo light stage with
multiple high-speed cameras to capture geometry of a human at high detail. Dou et al. [2013]
capture precise surface deformations using an eight-Kinect rig, by deforming a human
template, generated from a KinectFusion scan, using embedded deformation [Sumner et al. 2007].
Other methods jointly track a skeleton and the nonrigidly deforming surface [Vlasic
et al. 2008; Gall et al. 2009].
Whilst compelling, these multi-camera approaches require considerable compute and are orders
of magnitude slower than real-time, also requiring dense camera setups in controlled studios,
with sophisticated lighting and/or chroma-keying for background subtraction. Perhaps the
high-end nature of these systems is exemplified by [Collet et al. 2015], which uses over 30
RGBD cameras and a large studio setting with green screen and controlled lighting, producing
extremely high quality results, but at approximately 30 seconds per frame. We compare to this
system later, and demonstrate comparable results in real-time with a greatly reduced set of
RGBD cameras.
Accommodating General Scenes: The approach of [Li et al. 2009] uses a coarse approximation
of the scanned object as a shape prior to obtain high quality nonrigid reconstructions of
general scenes. Others also treat the template as a generally deformable shape without
skeleton and use volumetric [de Aguiar et al. 2008] or patch-based deformation methods
[Cagniart et al. 2010]. Other nonrigid techniques remove the need for a shape or template
prior, but assume small and smooth motions [Zeng et al. 2013; Wand et al. 2009; Mitra
et al. 2007]; or deal with topology changes in the input data (e.g., the fusing and then
separation of hands) but suffer from drift and over-smoothing of results for longer
sequences [Tevs et al. 2012; Bojsen-Hansen et al. 2012]. [Guo et al. 2015; Collet et al. 2015]
introduce the notion of keyframe-like transitions in offline nonrigid reconstructions, to
accommodate topology changes and tracking failures. [Dou et al. 2015] demonstrate a
compelling offline system with nonrigid variants of loop closure and bundle adjustment to
create compelling scans of arbitrary scenes without a prior human or template model. All
these more general techniques are far from real-time, ranging from seconds to hours per frame.
Real-time Approaches: Only recently have we seen real-time nonrigid reconstruction systems
appear. Approaches fall into three categories. Single object parametric approaches focus on
a single object of interest, e.g. face, hand, or body, which is parametrized ahead of time in
an offline manner, and tracked or deformed to fit the data in real-time. Compelling real-time
reconstructions of nonrigid articulated motion (e.g. [Ye et al. 2013; Stoll et al. 2011;
Zhang et al. 2014]) and shape (e.g. [Ye et al. 2013; Ye and Yang 2014]) have been
demonstrated. However, by their very nature, these approaches rely on strong priors based on
either pre-learned statistical models, articulated skeletons, or morphable shape models,
prohibiting capture of arbitrary scenes or objects. Often the parametric model is not rich
enough to capture challenging poses or all types of shape variation. For human bodies, even
with extremely rich offline shape and pose models [Bogo et al. 2015], reconstructions can
suffer from the uncanny valley effect [Mori et al. 2012]; and clothing or hair can prove
problematic [Bogo et al. 2015].
Recently, real-time template-based reconstruction of more diverse nonrigidly moving objects
was demonstrated [Zollhöfer et al. 2014]. Here an online template model was captured
statically, and deformed in real-time to fit the data captured from a novel RGBD sensor.
Additionally, displacements on this tracked surface model were computed from the input data
and fused over time. Despite impressive real-time results, this work still requires a
template to be first acquired rigidly, making it impractical for capturing children, animals
or other objects that rarely hold still. Furthermore, the template model is fixed and so any
scene topology change will break the fitting. Such approaches also rely heavily on closest
point correspondences [Rusinkiewicz and Levoy 2001] and are not robust to large
frame-to-frame motions. Finally, in both template-based and single object parametric
approaches the model is fixed, and the aim is to deform or articulate the model to explain
the data rather than incrementally reconstruct the scene. This means that new input data
does not refine the reconstructed model over time.
DynamicFusion [Newcombe et al. 2015] addresses some of the challenges inherent in
template-based reconstruction techniques by demonstrating compelling results of nonrigid
volumetric fusion using a single Kinect sensor. The reference surface model is incrementally
updated based on new depth measurements, refining and completing the model over time. This
is achieved by warping a reference volume nonrigidly to each new input frame, and fusing
depth samples into the model. However, as shown in the supplementary video of this work, the
frame-to-frame motions are slow and carefully orchestrated, again due to reliance on closest
point correspondences. Also, the reliance on a single volume registered to a single point in
time means that the current data being captured cannot represent a scene dramatically
different from the model. This makes fitting the model to the data and incorporating it back
into the model more challenging. Gross inconsistencies between the reference volume and data
can result in tracking failures. For example, if the reference model is built with a user's
hands fused together, estimation of the deformation field will fail when the hands are seen
to separate in the data. In practice, these types of topology changes occur often as people
interact in the scene.
3 System Overview
Our work, Fusion4D, attempts to bring aspects inherent in multi-
view performance capture systems to real-time scenarios. In so
doing, we need to design a new pipeline that addresses the limita-
tions outlined in current real-time nonrigid reconstruction systems.
Namely, we need to be robust to fast motions and topology changes
and support multi-view input, whilst still maintaining real-time rates.
Figure 2: The Fusion4D pipeline. Please see text in Sec. 3 for details.
Fig. 2 shows the main system pipeline. We accumulate our 3D reconstruction in a hierarchical
voxel grid and employ volumetric fusion [Curless and Levoy 1996] to denoise the surface over
time (Sec. 6). Unlike existing real-time approaches, we use the concept of key volumes to
deal with radically different surface topologies over time (Sec. 6). A key volume is a voxel
grid that maintains the reference model, and ensures smooth nonrigid motions within the key
volume sequence, but allows more drastic changes across key volumes. This is conceptually
similar to the concept of a keyframe or anchor frame used in nonrigid tracking [Guo
et al. 2015; Collet et al. 2015; Beeler et al. 2011], but this concept is extended for online
nonrigid volumetric reconstruction.
We take multiple RGBD frames as input and first estimate a segmentation mask per camera
(Sec. 4). A dense correspondence field is estimated per separate RGBD frame using a novel
learning-based technique (Sec. 5.2.4). This correspondence field is used to initialize the
nonrigid alignment, and allows for robustness to fast motions, a failure case when closest
point correspondences are assumed as in [Zollhöfer et al. 2014; Newcombe et al. 2015].
Next is nonrigid alignment, where we estimate a deformation field to warp the current key
volume to the data. We cover the details of this step in Sec. 5. In addition to fusing data
into the key (or reference) volume as in [Newcombe et al. 2015], we also fuse the currently
accumulated model into the data volume by warping and resampling the key volume. This allows
Fusion4D to be more responsive to new data, whilst allowing more conservative model updates.
Nonrigid alignment error and the estimated correspondence fields can be used to guide the
fusion process, allowing for new data to appear very quickly when occluded regions, topology
changes, or tracking failures occur, but also allowing fusion into the model over time.
4 Raw Depth Acquisition and Preprocessing
In terms of acquisition, our setup is similar to [Collet et al. 2015], but with a reduced
number of cameras and no green screen. Our system, in its most general form, produces $N$
depthmaps using $2N$ monocular infrared (IR) cameras and $N$ RGB images used only to provide
texture information. Whereas the setup in [Collet et al. 2015] consists of 106 cameras
producing 24 depthmaps, our setup uses only 24 cameras, producing $N = 8$ depthmaps and RGB
images. All of our cameras are in a trinocular configuration and have a 1 megapixel output
resolution. Depth estimation is carried out using the PatchMatch Stereo algorithm [Bleyer
et al. 2011], which runs in real-time on GPU hardware (see [Zollhöfer et al. …; … et al. 2013]
for more details).
A segmentation step follows the depth computation algorithm, where
2D silhouettes of the regions of interest are produced. The segmenta-
tion mask plays a crucial role in estimating the visual hull constraint
(see Sec. 5.2.3) that helps ameliorate issues with missing data in the
input depth and ensures that foreground data is not deleted from the
model. Our segmentation also avoids the need for a green screen
setup as in [Collet et al
.
2015] and allows capture in natural and
realistic settings. In our pipeline we employed a simple background
model (using both RGB and depth cues) that does not take into
account temporal consistency. This background model is used to
compute unary potentials by considering pixel-wise differences with
the current scene observation. We then use a dense Conditional Random Field (CRF)
[Krähenbühl and Koltun 2011] model to enforce smoothness constraints between neighboring
pixels. Due to our real-time requirements, we use an approximate GPU implementation
similar to [Vineet et al. 2012].
5 Nonrigid Motion Field Estimation
In each frame we observe $N$ depthmaps $\{D_n\}_{n=1}^{N}$ and $N$ foreground masks
$\{S_n\}_{n=1}^{N}$. As is common [Curless and Levoy 1996; Newcombe et al. 2011; Newcombe
et al. 2015], we accumulate this depth data into a non-parametric surface represented
implicitly by a truncated signed distance function (TSDF) or volume $\mathcal{V}$ in some
"reference frame" (which we denote as key volume). This allows efficient alignment and
allows for all the data to be averaged into a complete surface with greatly reduced noise.
Further, the zero crossings of the TSDF can be easily located to extract a high quality mesh
(footnote 1) $\mathbf{V} = \{v_m\}_{m=1}^{M} \subseteq \mathbb{R}^3$ with corresponding
normals $\{n_m\}_{m=1}^{M}$. The goal of this section is to show how to estimate a
deformation field that warps the key volume $\mathcal{V}$ or the mesh $\mathbf{V}$ to align
with the raw depth maps $\{D_n\}_{n=1}^{N}$. We typically refer to $\mathcal{V}$ or
$\mathbf{V}$ as model, and $\{D_n\}_{n=1}^{N}$ as data.
Footnote 1: A triangulation is also extracted, which we use for rendering.
5.1 Deformation Model
Following [Li et al. 2009] and [Dou et al. 2015], we choose the embedded deformation (ED)
model of [Sumner et al. 2007] to parameterize the nonrigid deformation field. Before
processing each new frame, we begin by uniformly sampling a set of $K$ "ED nodes" within the
reference volume by sampling locations $\{g_k\}_{k=1}^{K} \subseteq \mathbb{R}^3$ from the
mesh $\mathbf{V}$ extracted from this volume. Every vertex $v_m$ in that mesh is then
"skinned" to its closest ED nodes $\mathcal{S}_m \subseteq \{1, ..., K\}$ using a set of
fixed skinning weights $\{w_k^m : k \in \mathcal{S}_m\} \subseteq [0, 1]$ calculated as
$$w_k^m = \frac{1}{Z} \exp\left(-\frac{\|v_m - g_k\|^2}{2\sigma^2}\right),$$
where $Z$ is a normalization constant ensuring that, for each vertex, these weights sum to
one. $\sigma$ defines the effective radius of the ED nodes, which we set as $\sigma = 0.5d$,
where $d$ is the average distance between neighboring ED nodes after the uniform sampling.
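The skinning weights can be sketched as follows. The number of nearest nodes per vertex (`k_nearest`) and the function name are assumptions made for illustration; the text does not state how many ED nodes each vertex is attached to.

```python
import math


def skinning_weights(vertex, ed_nodes, sigma, k_nearest=4):
    """Return {node_index: weight} for the nearest ED nodes of one vertex.

    Each weight is exp(-||v - g||^2 / (2 sigma^2)) / Z, where Z makes the
    weights of this vertex sum to one. k_nearest is an assumed setting.
    """
    # Squared distance from the vertex to every ED node.
    d2 = [sum((v - g) ** 2 for v, g in zip(vertex, node)) for node in ed_nodes]
    # Indices of the closest nodes.
    nearest = sorted(range(len(ed_nodes)), key=d2.__getitem__)[:k_nearest]
    raw = {k: math.exp(-d2[k] / (2.0 * sigma ** 2)) for k in nearest}
    z = sum(raw.values())
    return {k: w / z for k, w in raw.items()}


ed_nodes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
# Neighboring nodes are about 1 unit apart here, so sigma = 0.5 * d = 0.5.
weights = skinning_weights((0.1, 0.0, 0.0), ed_nodes, sigma=0.5, k_nearest=3)
```

Because the weights are fixed once the nodes are sampled, they can be computed once per key volume and reused for every solver iteration.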
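We then represent the local deformation around each ED node $g_k$ using an affine
transformation $A_k \in \mathbb{R}^{3 \times 3}$ and a translation $t_k \in \mathbb{R}^3$.
In addition, a global rotation $R \in SO(3)$ and translation $T \in \mathbb{R}^3$ are added.
The set $G = \{R, T\} \cup \{A_k, t_k\}_{k=1}^{K}$ fully parameterizes the deformation that
warps any point $v \in \mathbb{R}^3$ to
$$\mathcal{T}(v_m; G) = R \sum_{k \in \mathcal{S}_m} w_k^m \left[ A_k (v_m - g_k) + g_k + t_k \right] + T. \tag{1}$$
Equally, a normal $n$ will be transformed to
$$\mathcal{T}^{\perp}(n_m; G) = R \sum_{k \in \mathcal{S}_m} w_k^m A_k^{-\top} n_m, \tag{2}$$
and normalization is applied afterwards.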
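A small dependency-free sketch of the warp for points and normals is given below. It assumes that normals are mapped with the inverse transpose of each affine matrix, which is the standard rule for normals under affine maps; the helper names and the dictionary-based weight layout are illustrative choices.

```python
import math


def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]


def inverse_transpose(a):
    """A^{-T} of a 3x3 matrix: its cofactor matrix divided by its determinant."""
    cof = [cross(a[(i + 1) % 3], a[(i + 2) % 3]) for i in range(3)]
    det = sum(a[0][j] * cof[0][j] for j in range(3))
    return [[c / det for c in row] for row in cof]


def warp_vertex(v, weights, nodes, affines, translations, R, T):
    """Point warp: R * sum_k w_k [A_k (v - g_k) + g_k + t_k] + T."""
    acc = [0.0, 0.0, 0.0]
    for k, w in weights.items():
        local = matvec(affines[k], [v[i] - nodes[k][i] for i in range(3)])
        for i in range(3):
            acc[i] += w * (local[i] + nodes[k][i] + translations[k][i])
    return [r + t for r, t in zip(matvec(R, acc), T)]


def warp_normal(n, weights, affines, R):
    """Normal warp: R * sum_k w_k A_k^{-T} n, renormalized afterwards."""
    acc = [0.0, 0.0, 0.0]
    for k, w in weights.items():
        local = matvec(inverse_transpose(affines[k]), n)
        for i in range(3):
            acc[i] += w * local[i]
    out = matvec(R, acc)
    length = math.sqrt(sum(c * c for c in out))
    return [c / length for c in out]


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
nodes = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
moved = warp_vertex((1.0, 0.0, 0.0), {0: 0.5, 1: 0.5}, nodes, [I3, I3],
                    [(0.0, 0.0, 0.0)] * 2, ROT_Z_90, (1.0, 0.0, 0.0))

# A stretch along x: the warped normal must stay perpendicular to the warped tangent.
stretch = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
s = 1.0 / math.sqrt(2.0)
normal = warp_normal((s, s, 0.0), {0: 1.0}, [stretch], I3)
```

With identity affines and zero node translations the blend reduces to the global rigid motion, which makes a convenient sanity check.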
5.2 Energy Function
To estimate the parameters $G$, we formulate an energy function $E(G)$ that penalizes the
misalignment between our model and the observed data, regularizes the types of allowed
deformations and encodes other priors and constraints. The energy function
$$E(G) = \lambda_{\text{data}} E_{\text{data}}(G) + \lambda_{\text{hull}} E_{\text{hull}}(G) + \lambda_{\text{corr}} E_{\text{corr}}(G) + \lambda_{\text{rot}} E_{\text{rot}}(G) + \lambda_{\text{smooth}} E_{\text{smooth}}(G) \tag{3}$$
consists of a variety of terms that we systematically define below.
5.2.1 Data Term
The most crucial portion of our energy formulation is a data term that penalizes
misalignments between the deformed model and the data. In its most natural form, this term
would be written as
$$\hat{E}_{\text{data}}(G) = \sum_{n=1}^{N} \sum_{m=1}^{M} \min_{x \in \mathcal{P}(D_n)} \|\mathcal{T}(v_m; G) - x\|^2 \tag{4}$$
where $\mathcal{P}(D_n) \subseteq \mathbb{R}^3$ extracts a point cloud from depth map $D_n$.
We, however, approximate this using a projective point-to-plane term as
$$E_{\text{data}}(G) = \sum_{n=1}^{N} \sum_{m \in \mathcal{V}_n(G)} \left( \tilde{n}_m(G)^{\top} \left( \tilde{v}_m(G) - \Gamma_n(\tilde{v}_m(G)) \right) \right)^2 \tag{5}$$
where $\tilde{n}_m(G) = \mathcal{T}^{\perp}(n_m; G)$ and $\tilde{v}_m(G) = \mathcal{T}(v_m; G)$
(with slight notational abuse we simply use $\tilde{v}$ and $\tilde{n}$ to represent the
warped points and normals); $\Gamma_n(v) = P_n(\Pi_n(v))$, with
$\Pi_n : \mathbb{R}^3 \to \mathbb{R}^2$ projecting a point into the $n$'th depth map and
$P_n : \mathbb{R}^2 \to \mathbb{R}^3$ back-projecting the corresponding pixel in $D_n$ into
3D; and $\mathcal{V}_n(G) \subseteq \{1, ..., M\}$ are vertex indices that are considered to
be "visible" in view $n$ when the model is deformed using $G$. In particular, we consider a
vertex to be visible if
$\Pi_n(\tilde{v}_m)$ is a valid and visible pixel in view $n$, and
$\|\tilde{v}_m - P_n(\Pi_n(\tilde{v}_m))\| \le \epsilon_d$, and
$\tilde{n}_m^{\top} P_n^{\perp}(\Pi_n(\tilde{v}_m)) < \epsilon_n$,
where $P_n^{\perp} : \mathbb{R}^2 \to \mathbb{R}^3$ maps pixels to normal vectors estimated
from $D_n$; $\epsilon_d$ and $\epsilon_n$ are the truncation thresholds for depth and normal
respectively.
Although (5) is an approximation to (4), it offers a variety of key benefits. First, the use
of a point-to-plane term is a well known strategy to speed up convergence [Chen and Medioni
1992]. Second, the use of a "projective correspondence" avoids the expensive minimization in
(4). Lastly, the visibility set $\mathcal{V}_n(G)$ is explicitly computed to be robust to
outliers, which avoids employing a robust data term here that often slows Gauss-Newton like
methods [Zach 2014]. Interestingly, the last two points interfere with the differentiability
of (5), as $P_n(\Pi_n(\tilde{v}))$ jumps as the projection crosses pixel boundaries and
$\mathcal{V}_n(G)$ undergoes discrete modifications as $G$ changes. Nonetheless, we use a
further approximation (see Sec. 5.3) at each Gauss-Newton iteration whose derivative both
exists everywhere and is more efficient to compute.
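The projective point-to-plane residual can be sketched for a single view as below. This is an illustration under stated assumptions: a pinhole camera at the origin looking along +z with intrinsics `(fx, fy, cx, cy)`, nearest-pixel lookup, and only the validity and depth-truncation parts of the visibility test. The normal-compatibility test is omitted because it needs per-pixel data normals.

```python
import math


def project(p, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point to integer pixel coordinates."""
    return (int(round(fx * p[0] / p[2] + cx)), int(round(fy * p[1] / p[2] + cy)))


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with its stored depth back to a 3D point."""
    z = depth[v][u]
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def data_residuals(points, normals, depth, intr, eps_d):
    """Point-to-plane residuals n^T (v - Gamma(v)) over the visible vertices."""
    fx, fy, cx, cy = intr
    h, w = len(depth), len(depth[0])
    residuals = []
    for p, n in zip(points, normals):
        if p[2] <= 0.0:
            continue  # behind the camera
        u, v = project(p, fx, fy, cx, cy)
        if not (0 <= u < w and 0 <= v < h) or depth[v][u] <= 0.0:
            continue  # outside the image or invalid depth pixel
        q = backproject(u, v, depth, fx, fy, cx, cy)
        diff = [a - b for a, b in zip(p, q)]
        if math.sqrt(sum(d * d for d in diff)) > eps_d:
            continue  # beyond the depth truncation threshold
        residuals.append(sum(a * b for a, b in zip(n, diff)))
    return residuals


def data_energy(points, normals, depth, intr, eps_d):
    return sum(r * r for r in data_residuals(points, normals, depth, intr, eps_d))


depth = [[2.0] * 5 for _ in range(5)]
depth[0][0] = 0.0  # one invalid pixel
intr = (4.0, 4.0, 2.0, 2.0)
points = [(0.0, 0.0, 1.9), (0.5, 0.0, 2.05), (0.0, 0.0, 1.0), (10.0, 0.0, 2.0), (-1.0, -1.0, 2.0)]
normals = [(0.0, 0.0, -1.0)] * len(points)
res = data_residuals(points, normals, depth, intr, eps_d=0.2)
energy = data_energy(points, normals, depth, intr, eps_d=0.2)
```

Only the first two points contribute: the third is further than `eps_d` from the observed surface, the fourth projects outside the image, and the fifth lands on the invalid pixel.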
5.2.2 Regularization Terms
As the deformation model above could easily represent unreasonable deformations, we follow
[Dou et al. 2015] by deploying two regularization terms to restrict the class of allowed
deformations. The first term
$$E_{\text{rot}}(G) = \sum_{k=1}^{K} \| A_k^{\top} A_k - I \|_F + \sum_{k=1}^{K} \left( \det(A_k) - 1 \right)^2 \tag{6}$$
encourages each local deformation to be close to a rigid transform. The second encourages
the neighboring affine transformations to be similar as
$$E_{\text{smooth}}(G) = \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_k} w_{jk} \, \rho\left( \| A_j (g_k - g_j) + g_j + t_j - (g_k + t_k) \|^2 \right) \tag{7}$$
where $w_{jk} = \exp(-\|g_k - g_j\|^2 / 2\sigma^2)$ is a smoothness weight that decreases
with the distance between two neighboring ED nodes, and $\sigma$ is set to be the average
distance between all pairs of neighboring ED nodes. Here $\mathcal{N}_k$ denotes the set of
ED nodes neighboring node $k$, and $\rho(\cdot)$ is a robustifier to allow for
discontinuities in the deformation field.
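Both regularizers are straightforward to evaluate, as the following sketch shows. The robustifier is left as a pluggable function that defaults to the identity, since the text does not specify its form; the function names and the neighbor-list layout are illustrative.

```python
import math


def det3(a):
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def rot_energy(affines):
    """Sum over nodes of ||A^T A - I||_F + (det(A) - 1)^2."""
    total = 0.0
    for a in affines:
        frob2 = 0.0
        for i in range(3):
            for j in range(3):
                ata = sum(a[r][i] * a[r][j] for r in range(3))
                frob2 += (ata - (1.0 if i == j else 0.0)) ** 2
        total += math.sqrt(frob2) + (det3(a) - 1.0) ** 2
    return total


def smooth_energy(nodes, affines, translations, neighbors, sigma, rho=lambda s: s):
    """Smoothness term; rho defaults to the identity (no robustification)."""
    total = 0.0
    for k, g_k in enumerate(nodes):
        for j in neighbors[k]:
            g_j = nodes[j]
            d = [g_k[i] - g_j[i] for i in range(3)]
            w = math.exp(-sum(x * x for x in d) / (2.0 * sigma ** 2))
            # Where node j's transform sends g_k, versus where node k sends itself.
            pred = [sum(affines[j][i][c] * d[c] for c in range(3)) + g_j[i] + translations[j][i]
                    for i in range(3)]
            err = sum((pred[i] - (g_k[i] + translations[k][i])) ** 2 for i in range(3))
            total += w * rho(err)
    return total


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
SCALE2 = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
nodes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
neighbors = {0: [1], 1: [0]}
rigid = smooth_energy(nodes, [I3, I3], [(0.5, 0.0, 0.0)] * 2, neighbors, sigma=1.0)
torn = smooth_energy(nodes, [I3, I3], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], neighbors, sigma=1.0)
```

A shared translation costs nothing, while moving one node relative to its neighbor is penalized in both directions of the neighbor relation.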
5.2.3 Visual Hull Term
The data term above only constrains the deformation when the
warped model is close to the data. To see why this is problem-
atic, let us assume momentarily the best case scenario where we
happen to have a perfect model that should be able to fully “explain”
the data. If a piece of the model is currently being deformed to a
location outside the truncation threshold of depth maps, the gradient
will be zero. Another, more fundamental issue, is that a piece of
the model that is currently unobserved (e.g. a hand hidden behind a
user’s back) is allowed to enter free-space. This occurs despite the
fact that we know that free-space should not be occupied as we have
observed it to be free. Up until now, other methods [Newcombe et al. 2015] have generally
ignored this constraint, or equivalently their model has only been forced to explain the
foreground data while ignoring "negative" background data. We therefore add an energy term
that encodes the constraint that the deformed model lies within the visual hull. The visual
hull is a concept used in shape-from-silhouette and space-carving reconstruction techniques
[Kutulakos and Seitz 2000]. Typically in 2D it is defined as the intersection of the cones
cut out by the back-projection of an object's silhouette into free-space.
The near-camera side of the back-projected cone that each silhouette generates is cut using
the observed depth data (see Fig. 3) before being intersected. In the single viewpoint
scenario, where there is much occlusion, the constraint helps push portions of the deformed
model corresponding to the visual hull of the true scene into occluded regions. In the
multiview case (see Fig. 3), occlusions are less ubiquitous, and the term is able to provide
constraints in free space where data is missing. For example, depth sensors often struggle
to observe data on a user's hair and yet a multi-view visual hull constraint will still
provide a tight bounding box on where the head should lie. Without this term, misalignment
will be more pronounced, making the accumulation of highly noisy data prohibitive.
Figure 3: An illustration of visual hull in our optimization. Left: the first camera's
visual hull (shaded region) is defined by the foreground segmentation and the observed data
(red line on the surface). In this case, a hole in the foreground causes the hull to extend
all the way to the camera. Our energy penalizes the surface (e.g., those drawn in black)
from erroneously moving outside of the visual hull into known free-space. Middle: A second
camera can be added which gives a different visual hull constraint. Right: The intersection
of multiple visual hulls yields increasingly strong constraints on where the entire model
must lie.
The visual hull can be represented as an occupancy volume $H$ with values of 1 inside the
visual hull and 0 outside. Each voxel of $H$ is projected to each depthmap and set to 0 if it
is in the background mask or closer to the camera than the depth pixel that it is projected
onto. To be conservative, we set a voxel as occupied if it is in front of an invalid
foreground depth pixel. To apply the visual hull constraint in the form of a cost function
term, we first calculate an approximate distance transform $\mathcal{H}$ to the visual hull,
where the distance would be 0 for space inside the hull. The visual hull term is written as
$$E_{\text{hull}}(G) = \sum_{m=1}^{M} \mathcal{H}(\mathcal{T}(v_m; G))^2. \tag{8}$$
The exact computation of $\mathcal{H}$ is computationally expensive and unsuitable for a
real-time setting. Instead, we approximate $\mathcal{H}$ by applying Gaussian blur to the
occupancy volume (footnote 2), which is implemented efficiently on the GPU.
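The blur-based approximation can be illustrated on a 2D slice of the occupancy volume. This is a simplified sketch under several assumptions: two dimensions instead of three, a separable Gaussian with clamped borders, a penalty defined as `scale * (1 - blurred occupancy)`, and nearest-cell lookup. The kernel size, `sigma`, and `scale` are made-up values.

```python
import math


def gaussian_kernel(sigma, radius):
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [x / total for x in k]


def blur_rows(grid, kernel):
    """Blur every row with a 1D kernel, clamping indices at the borders."""
    r = len(kernel) // 2
    n = len(grid[0])
    return [[sum(kernel[t + r] * row[min(max(x + t, 0), n - 1)] for t in range(-r, r + 1))
             for x in range(n)] for row in grid]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def hull_penalty(occupancy, sigma=1.0, radius=3, scale=1.0):
    """Approximate distance-to-hull field: scale * (1 - blurred occupancy)."""
    k = gaussian_kernel(sigma, radius)
    blurred = transpose(blur_rows(transpose(blur_rows(occupancy, k)), k))
    return [[scale * (1.0 - b) for b in row] for row in blurred]


def hull_energy(points, penalty):
    """Sum of squared penalties, looked up at the nearest cell of each (x, y) point."""
    return sum(penalty[int(round(y))][int(round(x))] ** 2 for x, y in points)


size = 15
occupancy = [[1.0 if 4 <= x <= 10 and 4 <= y <= 10 else 0.0 for x in range(size)]
             for y in range(size)]
penalty = hull_penalty(occupancy)
inside = hull_energy([(7.0, 7.0)], penalty)
outside = hull_energy([(0.0, 0.0)], penalty)
```

The resulting field is close to zero deep inside the hull and rises smoothly toward the boundary, so vertices that drift into free space receive a nonzero gradient pulling them back.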
5.2.4 Correspondence Term
Finding the 3D motion field of nonrigid surfaces is an extremely challenging task.
Approaches relying on non-convex optimization can easily end up in erroneous local optima
due to bad starting points caused by noisy and inconsistent input data, e.g. due to large
motions. A key role is played by the initial alignment of the current input data $D_n$ and
the model. Our aim is therefore to find point-wise correspondences to provide a robust
initialization for the solver. Finding reliable matches between images has been exhaustively
studied; recently, deep learning techniques have shown superior performance [Weinzaepfel
et al. 2013; Revaud et al. 2015; Wei et al. 2015]. However, these are computationally
expensive, and currently prohibitive for real-time scenarios (even with GPU implementations).
Footnote 2: Followed with postprocessing, i.e., applying $1.0 - H$ and scaling.
In this paper we extend the recently proposed Global Patch Collider (GPC) [Wang et al. 2016]
framework to efficiently generate accurate correspondences for RGBD data. GPC finds
correspondences in linear time, avoiding the computation of costly distance functions among
all the possible candidates. The method relies on decision trees, which have the advantage
of being fully parallelizable. Training is performed offline on held-out annotated data, and
at test time, the correspondence estimation is fully integrated in the real-time system
pipeline. Note that no user subject training is required.
Given two consecutive images $I_s$ and $I_t$, our target is to find local correspondences
between pixel positions. We consider a local patch $x$ with center coordinate $p$ from an
image $I$, which is passed through a decision tree until it reaches one terminal node (leaf).
The leaf node can be interpreted as a hash key for the image patch. The GPC returns as
matches only pixels which end up in the same terminal node. To increase recall, multiple
trees are run and matches are selected as unique intersections over all the terminal nodes
(see [Wang et al. 2016] for details). Correspondence estimation with decision trees is also
used in [Pons-Moll et al. 2015; Shotton et al. 2013]. A key difference is that this prior
work computes the correspondences with respect to a template model and only for the
segmented object of interest. We, on the other hand, do not require a template model and
compute the correspondences between two image frames, at a local patch level, and
consequently we are agnostic to the specific objects in the scene at both training and test
time.
In [Wang et al. 2016] the authors rely on multi-scale image descriptors in order to ensure
robustness to scale and perspective transformation. In this work we extend their method by
making use of depth, which gives scale invariance. We also use a different strategy for the
match retrieval phase based on a voting scheme. Formally, our split node contains a set of
learned parameters $\delta = (u, v, \theta)$, where $(u, v)$ are 2D pixel offsets and
$\theta$ represents a threshold value. The split function $f$ is evaluated at pixel $p$ as
$$f(p; \theta) = \begin{cases} L & \text{if } I_s(p + u/d_s) - I_t(p + v/d_t) < \theta \\ R & \text{otherwise} \end{cases} \tag{9}$$
where $I_s$ and $I_t$ are the two input RGB images and $d_s = D_s(p)$ and $d_t = D_t(p)$ are
the depth values at the pixel coordinate $p$. Normalizing these offsets by the depth of the
current pixel provides invariance to scaling factors. This kind of pixel difference test is
commonly used with decision forest classifiers due to its efficiency and discriminative
power [Wang et al. 2016].
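The depth-normalized pixel difference test can be sketched as follows. The code transcribes the split function as printed, using single-channel intensity images, nearest-pixel sampling, and border clamping, all of which are simplifying assumptions; the function names are hypothetical.

```python
def sample(img, x, y):
    """Nearest-pixel lookup with coordinates clamped to the image bounds."""
    h, w = len(img), len(img[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return img[yi][xi]


def goes_left(img_s, img_t, depth_s, depth_t, p, u, v, theta):
    """True if the patch centred at p is routed to the left child, else False."""
    d_s = sample(depth_s, p[0], p[1])
    d_t = sample(depth_t, p[0], p[1])
    # Offsets shrink with depth, so the probes cover the same metric extent.
    a = sample(img_s, p[0] + u[0] / d_s, p[1] + u[1] / d_s)
    b = sample(img_t, p[0] + v[0] / d_t, p[1] + v[1] / d_t)
    return a - b < theta


width, height = 10, 5
img = [[float(x) for x in range(width)] for _ in range(height)]  # horizontal ramp
near = [[2.0] * width for _ in range(height)]
far = [[4.0] * width for _ in range(height)]
p, u, v = (4, 2), (4.0, 0.0), (-4.0, 0.0)

left_near = goes_left(img, img, near, near, p, u, v, theta=5.0)   # 6 - 2 = 4 < 5
right_near = goes_left(img, img, near, near, p, u, v, theta=3.0)  # 4 < 3 fails
left_far = goes_left(img, img, far, far, p, u, v, theta=3.0)      # 5 - 3 = 2 < 3
```

At twice the depth the probe offsets halve in pixels, which is how the test keeps looking at the same physical neighborhood as the surface moves toward or away from the camera.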
During training, we select the split functions to maximize the weighted harmonic mean
between precision and recall of the patch correspondences. Ground truth correspondences for
training the split function parameters of the decision trees are obtained via the offline
but accurate nonrigid bundle adjustment method proposed by [Dou et al. 2015]. We tested
different configurations of the algorithm and empirically found that 5 trees with 15 […]
give the best trade-off between precision and recall. At test time, when simple pixel
differences are used as features, the intersection strategy proposed in [Wang et al. 2016]
is not robust due to perspective transformations of RGB images. A single tree does not have
the ability to handle all possible image patch transformations. Intersection across multiple
trees (as proposed in [Wang et al. 2016]) also fails to retrieve the correct match in the
case of RGBD data. Only a few correspondences, usually belonging to small-motion regions,
are estimated.
We address this by taking the union over all the trees, thus modeling all image transformations. However, a simple union strategy generates many false positives. We solve this problem by proposing a voting scheme. Each tree with a unique collision (i.e. a leaf with only two candidates) votes for a possible match, and the one with the highest number of votes is returned. This approach generates much denser and more reliable correspondences even when large motion is present. We evaluate this method in Sec. 7.3.
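The voting scheme can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `vote_matches` and the input layout (one list of leaf ids per tree and per frame) are hypothetical, and a "unique collision" is interpreted here as a leaf reached by exactly one patch from each frame.

```python
from collections import Counter, defaultdict

def vote_matches(leaves_prev, leaves_curr):
    """leaves_prev[t][i] is the leaf reached in tree t by patch i of the previous frame;
    leaves_curr[t][j] is the same for patch j of the current frame.
    Returns {previous patch index: current patch index with the most votes}."""
    votes = defaultdict(Counter)
    for tree_prev, tree_curr in zip(leaves_prev, leaves_curr):
        by_leaf = defaultdict(lambda: ([], []))
        for i, leaf in enumerate(tree_prev):
            by_leaf[leaf][0].append(i)
        for j, leaf in enumerate(tree_curr):
            by_leaf[leaf][1].append(j)
        for prev_ids, curr_ids in by_leaf.values():
            # Unique collision: exactly one candidate from each frame shares this leaf.
            if len(prev_ids) == 1 and len(curr_ids) == 1:
                votes[prev_ids[0]][curr_ids[0]] += 1
    # Each previous-frame patch keeps the candidate that collected the most votes.
    return {i: counter.most_common(1)[0][0] for i, counter in votes.items()}
```

Taking the union lets any single tree propose a match, while the vote count filters out proposals that only one tree supports.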
This method gives us, in the $n$'th view, a set of $F_n$ matches $\{u^{\text{prev}}_{nf}, u_{nf}\}_{f=1}^{F_n}$ between pixels in the current frame and the previous frame. For each match $(u^{\text{prev}}_{nf}, u_{nf})$ we can find a corresponding point $q_{nf} \in \mathbb{R}^3$ in the reference frame using

$$q_{nf} = \operatorname*{argmin}_{v \in V} \left\| \Pi_n\left(\mathcal{T}(v; G^{\text{prev}})\right) - u^{\text{prev}}_{nf} \right\| \tag{10}$$

where $G^{\text{prev}}$ are the parameters that deform the reference surface $V$ to the previous frame. We would then like to encourage these model points to deform to their 3D correspondences. To this end, we employ the energy term

$$E_{\text{corr}}(G) = \sum_{n=1}^{N} \sum_{f=1}^{F_n} \rho\left( \left\| \mathcal{T}(q_{nf}; G) - P_n(u_{nf}) \right\|^2 \right) \tag{11}$$

where $\rho(\cdot)$ is a robustifier to handle correspondence outliers.
5.3 Optimization
In this section, we show how to rapidly and robustly minimize $E(G)$ on the GPU to obtain an alignment between the model and the current frame. To this end, we let $X \in \mathbb{R}^D$ represent the concatenation of all the parameters and let each entry of $f(X) \in \mathbb{R}^C$ contain each of the $C$ unsquared terms (i.e. the residuals) from the energy above so that $E(G) = f(X)^\top f(X)$. In this form, the problem of minimizing $E(G)$ can be seen as a standard sparse non-linear least squares problem which can be solved by approaches based on the Gauss-Newton algorithm. We handle the robust terms using the square-rooting technique described in [Engels et al. 2006; Zach 2014].
For each frame we initialize all the parameters from the motion field of the previous frame. We then fix the ED node parameters $\{A_k, t_k\}_{k=1}^{K}$ and estimate the global rigid motion parameters $\{R, T\}$ using projective iterative closest point (ICP) [Rusinkiewicz and Levoy 2001]. Next we fix the global rigid motion parameters and estimate the ED node parameters. The details of the optimization are presented in the following sections.
5.3.1 Computing a Step Direction
We compute a step direction $h \in \mathbb{R}^D$ in the style of the Levenberg-Marquardt (LM) solver on the GPU. At any point $X$ in the search space we solve

$$(J^\top J + \mu I)\, h = -J^\top f \tag{12}$$

for a step direction $h$, where $\mu$ is a damping factor, $J \in \mathbb{R}^{C \times D}$ is the Jacobian of $f(X)$, and $f$ is simply an abbreviation for $f(X)$. If the update will lower the energy (i.e. $E(X + h) < E(X)$), the step is accepted (i.e. $X \leftarrow X + h$) and the damping factor is lowered to be more aggressive. When the step is rejected, as it would raise the energy, the damping factor is raised and (12) is solved again. This behaviour can be interpreted as interpolating between an aggressive Gauss-Newton minimization and a robust gradient descent search, as raising the damping factor implicitly downscales the update as a back-tracking line search would.
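The accept/reject logic of this damped step can be sketched on a small dense problem. This is a CPU illustration only: the function name `lm_minimize`, the initial damping value, and the factor-of-ten damping updates are assumptions, and the dense `np.linalg.solve` call stands in for the GPU solver described in the paper.

```python
import numpy as np

def lm_minimize(residual, jacobian, x0, mu=1e-3, iters=5):
    """Minimize E(x) = f(x)^T f(x) with damped steps from (J^T J + mu I) h = -J^T f."""
    x = np.asarray(x0, dtype=float)
    f = residual(x)
    energy = f @ f
    for _ in range(iters):
        J = jacobian(x)
        h = np.linalg.solve(J.T @ J + mu * np.eye(x.size), -(J.T @ f))
        f_new = residual(x + h)
        energy_new = f_new @ f_new
        if energy_new < energy:
            # Step accepted: move and lower the damping to be more aggressive.
            x, f, energy = x + h, f_new, energy_new
            mu *= 0.1  # assumed update factor
        else:
            # Step rejected: raise the damping; the next pass re-solves at the same x.
            mu *= 10.0  # assumed update factor
    return x
```

With a small `mu` the step approaches the Gauss-Newton step, while a large `mu` yields a short step along the negative gradient direction $-J^\top f$.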
Per-Iteration Approximation
In order to deal with the non-differentiability of $E_{\text{data}}(G)$ and improve performance, at the start of each iteration we can take a copy of the current set of parameters $G_0 \leftarrow G$ to create a differentiable approximation to $E_{\text{data}}(G)$ as

$$\tilde{E}_{\text{data}}(G) = \sum_{n=1}^{N} \sum_{m \in V_n(G_0)} \left( \tilde{n}_m(G_0)^\top \left( \tilde{v}_m(G) - \Gamma_n(\tilde{v}_m(G_0)) \right) \right)^2. \tag{13}$$

In addition to being differentiable, the independence of $\tilde{n}_m$ greatly simplifies the necessary derivative calculations as the derivative with respect to any parameter in $G$ is the same for any view.
Evaluation of $J^\top J$ and $J^\top f$
In order to make this algorithm tractable for the large number of parameters we must handle, we bypass the traditional approach of evaluating and storing $J$ so that it can be reused in the computation of $J^\top J$ and $J^\top f$. Instead, we directly evaluate both $J^\top J$ and $J^\top f$ given the current parameters $X$. In our scenario, this approach results in a dramatically smaller memory footprint while simultaneously minimizing global memory reads and writes. This is because the number of residuals in our problem is orders of magnitude larger than the number of parameters (i.e. $C \gg D$) and therefore the size of the Jacobian $J \in \mathbb{R}^{C \times D}$ dwarfs that of $J^\top J \in \mathbb{R}^{D \times D}$.
Further, $J^\top J$ itself is a sparse matrix composed of non-zero blocks $\{h_{ij} \in \mathbb{R}^{12 \times 12} : i, j \in \{1, ..., K\} \wedge i \sim j\}$ created by ordering parameter blocks from $K$ ED nodes, where $i \sim j$ denotes that the $i$'th and $j$'th ED nodes simultaneously contribute to at least one residual. The $(i, j)$'th block can be computed as

$$h_{ij} = \sum_{c \in I_{ij}} j_{ci}^\top j_{cj} \tag{14}$$

where $I_{ij}$ is the collection of residuals dependent on both parameter block $i$ and $j$, and $j_{ci}$ is the gradient of the $c$'th residual, $f_c$, w.r.t. the $i$'th parameter block. Note that each $I_{ij}$ will not change during a step calculation (due to our approximation) so we only need to calculate each index set once. Further, the cheap derivatives of the approximation in (13) ensure that the complexity of computing $J^\top J$, although linearly proportional to the number of surface vertices, is independent of the number of cameras.
To avoid atomic operations on the GPU global memory, we let each CUDA block handle one $J^\top J$ block and perform reduction on the GPU shared memory. Similarly, $J^\top f \in \mathbb{R}^{D \times 1}$ can be divided into $K$ segments, $\{(J^\top f)_i \in \mathbb{R}^{12 \times 1}\}_{i=1}^{K}$, with the $i$'th segment calculated as

$$(J^\top f)_i = \sum_{c \in I_i} j_{ci}^\top f_c \tag{15}$$

where $I_i$ contains all the constraints related to ED node $i$. We assign one GPU block per $(J^\top f)_i$ and again perform the reduction on shared memory.
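Eqs. (14) and (15) can be illustrated with a small CPU sketch that accumulates the blocks of $J^\top J$ and the segments of $J^\top f$ without ever forming $J$. The function name `accumulate_normal_equations` and the input layout (each residual paired with a dictionary of its per-node gradients) are hypothetical, and the sequential loop only mirrors the arithmetic, not the per-block CUDA reduction described above.

```python
import numpy as np
from collections import defaultdict

def accumulate_normal_equations(residuals, block_size=12):
    """residuals: iterable of (f_c, {node index i: gradient j_ci of length block_size}).
    Returns the non-zero blocks h_ij of J^T J and the segments (J^T f)_i."""
    JtJ = defaultdict(lambda: np.zeros((block_size, block_size)))
    Jtf = defaultdict(lambda: np.zeros(block_size))
    for f_c, grads in residuals:
        for i, j_ci in grads.items():
            Jtf[i] += j_ci * f_c                         # Eq. (15)
            for j, j_cj in grads.items():
                JtJ[(i, j)] += np.outer(j_ci, j_cj)      # Eq. (14)
    return dict(JtJ), dict(Jtf)
```

Only node pairs that share at least one residual ever receive a block, so the sparsity pattern $i \sim j$ emerges directly from the accumulation.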
Linear Equations Solver
Minimizing the cost function in Eq. (3) amounts to a series of linear solves of the normal equations (Eq. (12)). DynamicFusion [Newcombe et al. 2015] uses a direct sparse Cholesky decomposition. Given their approximation of the data term component of $J^\top J$ as a block diagonal matrix this still results in a real-time system. However, we do not wish to compromise the fidelity of the reconstruction by approximating $J^\top J$ if we can still optimize the cost function in real-time, so we chose to iteratively solve using preconditioned conjugate gradient (PCG). The diagonal blocks of $J^\top J$ are used as the preconditioner.
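A minimal dense sketch of PCG with a block-diagonal (block-Jacobi) preconditioner is given below. It is illustrative only: the function name `block_jacobi_pcg`, the dense matrix storage, the zero initial guess, and the relative residual tolerance are assumptions; the paper's solver operates on the sparse block structure with custom GPU kernels.

```python
import numpy as np

def block_jacobi_pcg(A, b, block_size=12, iters=10, tol=1e-12):
    """Solve A x = b for symmetric positive definite A using PCG,
    preconditioned by the inverses of A's diagonal blocks."""
    n = b.size
    starts = range(0, n, block_size)
    # Invert every diagonal block once, up front.
    inv_blocks = [np.linalg.inv(A[s:s + block_size, s:s + block_size]) for s in starts]

    def precondition(r):
        return np.concatenate([M @ r[s:s + block_size] for M, s in zip(inv_blocks, starts)])

    x = np.zeros(n)
    r = b.astype(float).copy()
    z = precondition(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(iters):
        if np.linalg.norm(r) <= tol * b_norm:
            break  # converged; also avoids dividing by zero below
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = precondition(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

Each iteration needs one matrix-vector product with $J^\top J$, which is why the sparse matrix-vector multiplication is a core routine of the system.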
Our approach to the linear solver is akin to the approach taken by [Zollhöfer et al. 2014], but instead of implementing our solver in terms of $Jf$ and $J^\top f$, we use terms $J^\top J$ and $J^\top f$. Both approaches can effectively handle a prohibitively large number of residuals, but while the template-based approach of [Zollhöfer et al. 2014] must scale to a large number of parameters, our approach requires considerably fewer Jacobian evaluations and is therefore significantly faster. To perform sparse matrix-vector multiplication, a core routine in our system, we use a custom warp-level optimized kernel.
Figure 4: Solver convergence over a sequence for a fixed number of iterations. The green dashed line shows an approximate evaluation of $J^\top J$. The red line shows an exact Cholesky solve. Our method is shown in yellow and exhibits convergence behavior similar to the exact method, with improvements over approximate approaches.
5.4 Implementation Details
In our experiments, we set the volume resolution to be 4 mm. Marching cubes then extracts a mesh with around 250K vertices. In the multi-camera capture system, each surface vertex might be observed by more than one camera (observed ~3 times in our case). In total the number of residuals $C$ in our experiment is around 1 million, with the data terms and visual hull terms constituting the majority. We sample one ED node every 4 cm, which leads to ~2K ED nodes in total, and thus the number of parameters $D \approx 24$K.
The sparsity of $J^\top J$ is largely determined by two parameters: $|S_m|$, the number of neighboring ED nodes that a surface vertex $m$ is skinned to, and $|N_k|$, the number of neighboring ED nodes that an ED node $k$ is connected to for the regularization cost term. We let $|S_m| = 4 \;\forall m$ and $|N_k| = 8 \;\forall k$ in our experiments, resulting in ~15K non-zero $J^\top J$ blocks.
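As a quick consistency check on the sizes quoted above, the following back-of-the-envelope arithmetic assumes 12 parameters per ED node, which matches the $12 \times 12$ blocks of Sec. 5.3.1. The dense comparison is included only for scale and is not a figure from the paper.

```latex
D \approx 12 \times 2\text{K} = 24\text{K}
\qquad
15\text{K blocks} \times (12 \times 12) = 2.16\text{M entries stored for } J^\top J
\qquad
D^2 \approx (24\text{K})^2 = 576\text{M entries if } J^\top J \text{ were stored densely}
```

Under these assumptions the block-sparse layout stores well under 1% of the entries a dense $J^\top J$ would need.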
We run 5 iterations of the LM solver to estimate all the nonrigid parameters, and for each iteration of LM the PCG solver is run for 10 iterations. As shown in Fig. 4, our PCG solver with 10 iterations achieves the same alignment performance as an exact Cholesky solver. It also shows that using the full $J^\top J$ rather than the approximate evaluation (as in [Newcombe et al. 2015]) is important for convergence.
6 Data Fusion
The nonrigid matching stage estimates a deformation field which can be applied to either a volume or a surface to align with the input data in a frame. This alignment can be used, for example, to fuse that data into the volumetric model in order to denoise the model or to deform the model into the current frame for rendering. Indeed, prior work [Dou et al. 2015; Newcombe et al. 2015] defined the first frame as the reference frame (or model), and then incrementally aligned with and fused the data from all subsequent frames. The model is warped into each frame to provide a temporal sequence of reconstructions. This strategy works very well for simple examples (e.g., slow motion, small deformation), but our experiments show that it fails for realistic situations, as shown in our results and supplementary video.
It is difficult, and often impossible, to use a single reference model to explain every possible frame. First, in an unconstrained and realistic setting, the later frames might introduce dramatic deformations or even have completely different surface topology (e.g., surfaces that split or merge). These approaches will then struggle as currently used deformation fields do not allow for the discontinuities needed to model this behaviour. Second, it is unrealistic to expect that the nonrigid tracking would never fail, at which point the warped model would not be true to the data.
We approach this problem by redesigning the fusion pipeline. Our
gold standard is that the temporal information from the estimated
model should never downgrade the quality of the observed data.
Put another way, the accumulated model should “upgrade” the data
frame, when deemed feasible, by adding accumulated detail or ﬁlling
in holes caused by occlusion or sensor failures. With this standard
in mind, we designed a data fusion pipeline aimed at improving
the quality and ﬁdelity of the reconstruction at the data frame by
robustly handling realistic surface deformations and tracking failure.
There are two key features in our pipeline that tackle this goal:
1. Data Volume. While previous work maintained a volume for the reference (or the model), which we refer to as $V_r$, we also maintain a volume at the "data frame" $V_d$. Following the nonrigid alignment we then fuse the data from the current frame into the reference volume $V_r$ as in [Newcombe et al. 2011]. We also, however, fuse the reference volume back into the data frame volume $V_d$. The fusion into $V_d$ is very selective as to which data from the previously accumulated reference volume is integrated. This allows us to guarantee that the quality of the fused data is never lower than the quality of the observed data in the current frame, even with a poor quality alignment from the reference volume. We then use the fused data volume to extract a high quality reconstruction of the current frame for output, or to reset the reference volume as described below.
2. Key Volumes. The key volume strategy allows us to consistently maintain a high quality reference model that handles tracking failures. Instead of simply fixing the reference frame to the first frame, we explicitly handle drastic misalignments by periodically resetting the reference to a fused data volume which we then call a key volume. In addition, we detect model-data misalignments and refresh the misaligned voxels using the corresponding voxels from the data volume. Voxel refreshing within a subsequence corresponding to a key volume fixes small scale tracking failures and keeps small data changes from being ignored (e.g., clothes wrinkling). However, when a larger tracking failure occurs (e.g., losing track of an entire arm), refreshing the voxels in the key volume would only replace the arm voxels with empty space. Further, the arm in the data frame will not be reflected in the key volume because no motion field is estimated there to warp the data to the reference. In this case, resetting the reference volume (i.e. as a new key volume) would re-enable the tracking and data fusion for the regions that previously lost tracking.
6.1 Fusion at the Data Frame
6.1.1 Volume Warping
We represent the volume as a two level hierarchy similar to [Chen et al. 2013]. As in [Curless and Levoy 1996], each voxel at location $x \in \mathbb{R}^3$ has a signed distance value and a weight $\langle d, w \rangle$ associated with it, i.e., $V = (D, W)$.
At any given iteration we start by sampling a new data volume $V_d$ from the depth maps. We next warp the current reference volume $V_r$ to this data volume and fuse with the data using the estimated deformation field (see Sec. 5.1 for the details). The ED graph aligns the reference surface $V_r$ to the data frame. The same forward warping function in Eq. (1) can also be applied to a voxel $x_r$ in the reference to compute the warped voxel $\tilde{x}_r = \mathcal{T}(x_r; G)$. The warped voxel then gets to cast a weighted vote for (i.e., accumulate) its data $\langle d_r, w_r \rangle$ at neighboring voxels within some distance $\tau$ on the regular lattice of the data volume. Every data voxel $x_d$ would then calculate the weighted average of the accumulated data $\langle \bar{d}_r, w_r \rangle$, both SDF value and SDF weight, using the weight $\exp(-\|\tilde{x}_r - x_d\|^2 / 2\sigma^2)$.

Figure 5: Left: reference surface. Middle and Right: surfaces from warped volume without and with voxel collision detection.

Figure 6: Volume blending. Left to right: reference surface; nonrigid alignment residual showing topology change; extracted surface at the warped reference; extracted surface from final blended volume.
Note, this blending (or averaging) is bound to cause some geometric blur. To ameliorate this effect, each reference voxel $x_r$ does not directly vote for the SDF value it is carrying (i.e., $d_r$) but for the corrected value $\bar{d}_r$ using the gradient field of the SDF, i.e.,

$$\bar{d}_r = d_r + (\tilde{x}_r - x_d)^\top \tilde{\nabla},$$

where $\nabla$ is the gradient of the SDF at $x_r$ in the reference. $\tilde{\nabla}$ is $\nabla$ warped using Eq. (2) and approximates the gradient field at the data volume. In other words, $\bar{d}_r$ is the prediction of the SDF value at $x_d$ given the SDF value and gradient at $\tilde{x}_r$.
6.1.2 Selective Fusion
To ensure a high-fidelity reconstruction at the data frame, we need to ensure that each warped reference voxel $\tilde{x}_r$ will not corrupt the reconstructed result. To this end we perform two tests before fusing in a warped voxel and reject its vote if it fails either.
Voxel Collision. When two model parts move towards each other (e.g., clapping hands), the reference voxels contributing to different surface areas might collide after warping, and averaging the SDF values voted for by these voxels is problematic: in the worst case, the voxels with a higher absolute SDF value will overwhelm the voxels at the zero crossing, leading to a hole in the model (Fig. 5). To deal with this voxel collision problem, we perform the fusion in two passes. In the first pass, and for any given data voxel $x_d$, we evaluate all the reference voxels voting at its location and save the reference voxel $\dot{x}_r$ with the smallest absolute SDF value. In the second pass we reject the vote of any reference voxel $x_r$ at this location if $|x_r - \dot{x}_r| > \eta$.
Voxel Misalignment. We also need to evaluate a proxy error at each reference voxel $x_r$ to detect if the nonrigid tracking failed so we are able to similarly reject its vote. To do this we first calculate an alignment error at each warped model vertex $\tilde{x}_r$

$$e_{\tilde{x}_r} = \begin{cases} |D_d(\tilde{x}_r)| & \text{if } H_d(\tilde{x}_r) = 0 \\ \min\left( |D_d(\tilde{x}_r)|,\, H_d(\tilde{x}_r) \right) & \text{otherwise} \end{cases} \tag{16}$$

where $D_d$ is the fused TSDF at the data frame, and $H_d$ is the visual hull distance transform (Sec. 5.2.3). We then aggregate this error at the ED nodes by averaging the errors from the vertices associated with the same ED node. This aggregation process reduces the influence of the noise in the depth data on the alignment error. Finally, we reject any reference voxel if any of its neighboring ED nodes has an alignment error beyond a certain threshold. The extracted surface from $\tilde{V}_r$ is illustrated in Fig. 6.
6.1.3 Volume Blending
After we fuse the depth maps into a data volume $V_d$ and warp the reference volume to the data frame forming $\tilde{V}_r$, the next step is to blend the two volumes $V_d$ and $\tilde{V}_r$ to get the final fused volume $\bar{V}_d$, used for the reconstructed output (see footnote 3).
Even after the conservative selective fusion described in the previous section, simply taking a weighted average of the two volumes (i.e., $\bar{d}_d = \frac{\tilde{d}_r \tilde{w}_r + d_d w_d}{\tilde{w}_r + w_d}$) leads to artifacts. This naive blending does not guarantee that the SDF band around the zero-crossing will have a smooth transition of values. This is because boundary voxels that survived the rejection phase will suppress any zero-crossings coming from the data, causing artifacts and lowering the quality at the output.
To handle this problem, we start by projecting the reference surface vertices $V_r$ to the depth maps. We can then calculate a per-pixel depth alignment error as the difference between the vertex depth $d$ and its projective depth $d_{\text{proj}}$, normalized by a maximum $d_{\max}$. Put together, we calculate

$$e_{\text{pixel}} = \begin{cases} \min\left( 1.0,\, |d - d_{\text{proj}}| / d_{\max} \right) & \text{if } d_{\text{proj}} \text{ is valid} \\ 1.0 & \text{otherwise.} \end{cases} \tag{17}$$

Each voxel in the data volume $V_d$ can then have an aggregated average depth alignment error $e_{\text{voxel}}$ when projecting it to depth maps. Finally, instead of using the naive blending described above, we use the blending function

$$\bar{d}_d = \frac{\tilde{d}_r \tilde{w}_r (1.0 - e_{\text{voxel}}) + d_d w_d}{\tilde{w}_r (1.0 - e_{\text{voxel}}) + w_d}, \tag{18}$$

downweighting the reference voxel data by its depth misalignment.
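Eqs. (17) and (18) can be sketched with NumPy arrays as follows. The function names are hypothetical, and two conventions are assumptions: an invalid projective depth is encoded as NaN or a non-positive value, and a voxel with zero total weight falls back to the data value.

```python
import numpy as np

def pixel_error(d, d_proj, d_max):
    """Eq. (17): normalized depth alignment error, saturating at 1.0."""
    valid = np.isfinite(d_proj) & (d_proj > 0)  # assumed validity convention
    err = np.minimum(1.0, np.abs(d - d_proj) / d_max)
    return np.where(valid, err, 1.0)

def blend(d_r, w_r, d_d, w_d, e_voxel):
    """Eq. (18): blend warped reference and data SDF values, downweighting the
    reference contribution by its depth misalignment e_voxel in [0, 1]."""
    w_eff = w_r * (1.0 - e_voxel)
    denom = w_eff + w_d
    safe = np.where(denom > 0, denom, 1.0)
    # Assumption: where nothing carries weight, keep the data value.
    return np.where(denom > 0, (d_r * w_eff + d_d * w_d) / safe, d_d)
```

With `e_voxel = 0` this reduces to the naive weighted average, and with `e_voxel = 1` the reference contribution vanishes and the output equals the data value.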
6.2 Fusion at the Reference Frame
As in [Newcombe et al. 2015], to update the reference model we warp each reference voxel $x_r$ to the data frame, project it to the depth maps, and update the TSDF value and weight. This avoids an explicit data-to-model warp. Additionally, we also know the reference voxels $\tilde{x}_r$ not aligned well to the data from Eq. (16). For these voxels we discard their data and refresh it from the data in the current data frame. Finally, we reset the entire volume periodically to the fused data volume $\bar{V}_d$ (i.e., key volumes) to handle large misalignments that cannot be recovered from by the per-voxel refresh.
Footnote 3: Marching cubes is applied to this volume to extract the final mesh representation.
Figure 8:
A comparison of the input data as a point cloud (left), the
fused live data without nonrigid alignment (center), and the output
of our system (right).
Figure 9:
Our system is robust to many complex topology changes.
Figure 10: Our approach is robust to fast motions.
7 Results
We now provide results, experiments and comparisons of our real-
time performance capture method.
7.1 Live Performance Capture
Our system is fully implemented on the GPU using CUDA. Results of live multi-view scene captures for our test scenes are shown in Figures 1 and 7 as well as in the supplementary material. It is important to stress that all these sequences were captured online and in real-time, including depth estimation and full nonrigid reconstruction. Furthermore, these sequences are captured over long time periods comprising many minutes. We make a strong case for nonrigid alignment in Fig. 8. While volumetrically fusing the live data does produce a more aesthetically appealing result compared to the input point cloud, it cannot resolve issues arising from missing data (holes) or noise. On the other hand, these issues are significantly ameliorated in the reconstructed mesh with Fusion4D by leveraging temporal information.

Figure 11: Quantitative comparison with [Collet et al. 2015].
We captured a variety of diverse and challenging nonrigidly moving scenes. This includes multiple people interacting, deforming objects, topology changes and fast motions. Fig. 7 shows multiple examples for each of these scenes. Our reconstruction algorithm is able to deal with extremely fast motion, where most online nonrigid systems would fail. Fig. 10 depicts typical situations where the small motion assumption does not hold. This robustness is due to the ability to estimate fast RGBD correspondences, allowing for robust initialization of the ED graph, and also the ability to recover from misalignment errors. In Fig. 9 we show a number of challenging topology changes that our system can cope with in a robust manner. This includes hands being initially reconstructed on the hips of the performer and then moved, and items of clothing being removed, such as a jacket or scarf.
Other examples of reconstructions in Fig. 7 and the supplementary video depict clothing changes, taekwondo moves, dancing, animals, moving hair and interaction with objects. For any of these situations the algorithm automatically retrieves the nonrigid reconstruction with real-time performance. Notice also that the method has no shape prior of the object of interest and can easily generalize to non-human models, for example animals or objects.
7.2 Computational Time
Similar to [Collet et al. 2015] the input RGBD and segmentation data is generated on dedicated PCs. Each machine has an Intel Core i7 3.4GHz CPU, 16GB of RAM and two NVIDIA Titan X GPUs. Each PC processes two depthmaps and two segmentation masks in parallel. The total time is 21 ms and 4 ms for the stereo matching and segmentation, respectively. Correspondence estimation requires 5 ms with a parallel GPU implementation. In total each machine uses no more than 30 ms to generate the input for the nonrigid reconstruction pipeline. RGBD frames are generated in parallel to the nonrigid pipeline, but do introduce 1 frame of latency.
A master PC (another Intel Core i7 3.4GHz CPU, 16GB of RAM, with a single NVIDIA Titan X) aggregates and synchronizes all the depthmaps, segmentation masks and correspondences. Once the RGBD inputs are available, the average processing time to nonrigidly reconstruct is 32 ms (i.e., 31 fps) with 3 ms for preprocessing (10% of the overall pipeline), 2 ms (7%) for rigid pose estimation (on average 4 iterations), 20 ms (64%) for the nonrigid registration (5 LM iterations, with 10 PCG iterations), and 6 ms (19%) for fusion.

Figure 7: Real-time results captured with Fusion4D, showing a variety of challenging sequences. Please also see accompanying video.

Figure 12: Quantitative comparisons of different correspondence methods: SIFT, FAST+SIFT, DeepMatch, EpicFlow and Global Patch Collider. We computed the residual error and reported the percentage of vertices with error > 5 mm. The method proposed in Sec. 5.2.4 achieved the best score with only 29% outliers.
7.3 Correspondence Evaluation
In Sec. 5.2.4 we described our approach to estimating RGBD correspondences. We now evaluate its robustness compared to other state-of-the-art methods. One sequence with very fast motions is considered. In order to compare different correspondence algorithms, we only minimize the $E_{\text{corr}}(G)$ term in Eq. 3 and we compute the residual error. We report results as percentage of alignment error between the current observation and the model. In particular, we show the percentage of vertices with error > 5 mm. We compared different methods: standard SIFT detector and descriptors [Lowe 2004], a FAST detector [Rosten and Drummond 2005] followed by SIFT descriptors, DeepMatch [Weinzaepfel et al. 2013], EpicFlow [Revaud et al. 2015] and our extension of Global Patch Collider [Wang et al. 2016] described in Sec. 5.2.4. Quantitative results on this fast motion sequence are reported in Fig. 12. The best results are obtained by our method with 29% outliers, then SIFT (34%), FAST+SIFT (34%), DeepMatch (36%) and EpicFlow (36%). Most of the error occurred in regions where very large motion is present: a qualitative comparison is depicted in Fig. 13.
7.4 Nonrigid Reconstruction Comparisons
In Fig. 15, we compare to the dataset of [Collet et al. 2015] for a sequence with extremely high motions. The figure compares renderings of the original meshes and multiple reconstructions, where red corresponds to a fitting error of 15 mm. In particular, we compare our method with [Zollhöfer et al. 2014] and [Newcombe et al. 2015], showing our superior reconstructions in these challenging situations. We also show results and distance metrics for the method of [Collet et al. 2015], which is an offline technique with a runtime of about 30 minutes per frame on the CPU, and runs with 30 more cameras than our system. In a more quantitative analysis (Fig. 11) we plot the error over the input mesh for our method and [Collet et al. 2015], which shows that our algorithm can match the motion and fine scale details exhibited in this sequence. Our approach shows qualitatively similar results but with a system that is about 4 orders of magnitude faster, allowing for true real-time performance capture.
Finally, multiple qualitative comparisons among different state of the art methods are shown in Fig. 14. These sequences exhibit all the classical situations where online methods fail, such as large motions and topology changes. Again our real-time reconstruction method correctly retrieves the nonrigid shapes for any of these scenarios. Please also see the accompanying video.
Figure 13: Qualitative comparisons of correspondence algorithms. We show the detected correspondences (green lines) between the previous frame (yellow points) and current frame (cyan points). GPC shows less residual error in fast motion regions, whereas current state of the art algorithms (DeepMatch, EpicFlow) and traditional correspondence methods (SIFT, FAST) show higher error, due either to a higher percentage of false positives (FAST, DeepMatch, EpicFlow) or to poor recall (SIFT).
Figure 14:
Qualitative comparisons with state of the art approaches.
8 Limitations
Even though we demonstrated one of the first methods for real-time nonrigid reconstruction from multiple views, showing reconstruction of challenging scenes, our system is not without limitations. Given the tight real-time constraint (33 ms/frame) of our approach, we rely on temporal coherence of the RGBD input stream, making processing at 30 Hz a necessity. If the frame rate is too low or frame-to-frame motion is too large, either the frame-to-frame correspondences would be inaccurately estimated or the nonrigid alignment would fail to converge given the tight time budget. In either case our method might lose tracking, and our system then falls back to the live fused data. However, as shown in Fig. 16, the volume blending can look noisy as new data is first being fused. Another issue in our current work is robustness to segmentation errors. Large segmentation errors, if there is missing depth data for instance, can lead to incorrect visual hull estimation. This can cause some noise to be integrated into the model as shown in Fig. 16. Finally, any small nonrigid alignment errors can cause slight oversmoothing of the model at times, e.g. Fig. 16. We deal with topology change by refreshing correspondence voxels. This strategy works in general, but has artifacts when one object slides over another surface, e.g., unzipping a jacket. To solve the topology problem intrinsically, a nonrigid matching algorithm that explicitly handles topology changes needs to be designed.

Figure 15: Qualitative comparisons with the high quality offline system of [Collet et al. 2015].

Figure 16: Current limitations of our system. From left to right: Noisy data when tracking is lost. Holes due to segmentation errors. Oversmoothing due to alignment errors.
9 Conclusions
We have demonstrated Fusion4D, the first real-time multi-view nonrigid reconstruction system for live performance capture. We have contributed a new pipeline for live multi-view performance capture, generating high-quality reconstructions in real-time, with several unique capabilities over prior work. As shown, our reconstruction algorithm enables both incremental reconstruction, improving the surface estimation over time, as well as parameterizing the nonrigid scene motion. We also demonstrated how our approach robustly handles both large frame-to-frame motion and topology changes. This was achieved using a novel real-time solver, correspondence algorithm, and fusion method. We believe our work can enable new types of live performance capture experiences, such as broadcasting live events including sports and concerts in 3D, and also the ability to capture humans live and have them re-rendered in other geographic locations to enable high fidelity immersive telepresence.
References
BEELER, T., HAHN, F., BRADLEY, D., BICKEL, B., BEARDSLEY, P., GOTSMAN, C., SUMNER, R. W., AND GROSS, M. 2011. High-quality passive facial performance capture using anchor frames. ACM Transactions on Graphics (TOG) 30, 4, 75.
BLEYER, M., RHEMANN, C., AND ROTHER, C. 2011. Patchmatch stereo: Stereo matching with slanted support windows. In Proc. BMVC, vol. 11, 1–11.
BOGO, F., BLACK, M. J., LOPER, M., AND ROMERO, J. 2015. Detailed full-body reconstructions of moving people from monocular RGB-D sequences. In ICCV, 2300–2308.
BOJSEN-HANSEN, M., LI, H., AND WOJTAN, C. 2012. Tracking surfaces with evolving topology. ACM Trans. Graph. 31, 4, 53.
BRADLEY, D., POPA, T., SHEFFER, A., HEIDRICH, W., AND BOUBEKEUR, T. 2008. Markerless garment capture. ACM TOG (Proc. SIGGRAPH) 27, 3, 99.
CAGNIART, C., BOYER, E., AND ILIC, S. 2010. Free-form mesh tracking: a patch-based approach. In Proc. CVPR.
CHEN, Y., AND MEDIONI, G. 1992. Object modelling by registration of multiple range images. CVIU 10, 3, 144–155.
CHEN, J., BAUTEMBACH, D., AND IZADI, S. 2013. Scalable real-time volumetric surface reconstruction. ACM TOG.
COLLET, A., CHUANG, M., SWEENEY, P., GILLETT, D., EVSEEV, D., CALABRESE, D., HOPPE, H., KIRK, A., AND SULLIVAN, S. 2015. High-quality streamable free-viewpoint video. ACM TOG 34, 4, 69.
CURLESS, B., AND LEVOY, M. 1996. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, ACM, 303–312.
DE AGUIAR, E., STOLL, C., THEOBALT, C., AHMED, N., SEIDEL, H.-P., AND THRUN, S. 2008. Performance capture from sparse multi-view video. ACM TOG (Proc. SIGGRAPH) 27, 1–10.
DOU, M., FUCHS, H., AND FRAHM, J.-M. 2013. Scanning and tracking dynamic objects with commodity depth cameras. In Proc. ISMAR, IEEE, 99–106.
DOU, M., TAYLOR, J., FUCHS, H., FITZGIBBON, A., AND IZADI, S. 2015. 3d scanning deformable objects with a single rgbd sensor. In CVPR.
ENGELS, C., STEWÉNIUS, H., AND NISTÉR, D. 2006. Bundle adjustment rules. Photogrammetric computer vision 2, 124–131.
GALL, J., STOLL, C., DE AGUIAR, E., THEOBALT, C., ROSENHAHN, B., AND SEIDEL, H.-P. 2009. Motion capture using joint skeleton tracking and surface estimation. In Proc. CVPR, IEEE, 1746–1753.
GUO, K., XU, F., WANG, Y., LIU, Y., AND DAI, Q. 2015. Robust non-rigid motion tracking and surface reconstruction using l0 regularization. In ICCV, 3083–3091.
KRÄHENBÜHL, P., AND KOLTUN, V. 2011. Efficient inference in fully connected crfs with gaussian edge potentials. NIPS.
KUTULAKOS, K. N., AND SEITZ, S. M. 2000. A theory of shape by space carving. IJCV.
LI, H., ADAMS, B., GUIBAS, L. J., AND PAULY, M. 2009. Robust single-view geometry and motion reconstruction. ACM TOG.
LOWE, D. G. 2004. Distinctive image features from scale-invariant keypoints. IJCV.
MITRA, N. J., FLÖRY, S., OVSJANIKOV, M., GELFAND, N., GUIBAS, L. J., AND POTTMANN, H. 2007. Dynamic geometry registration. In Proc. SGP, 173–182.
MORI, M., MACDORMAN, K. F., AND KAGEKI, N. 2012. The uncanny valley [from the field]. Robotics & Automation Magazine, IEEE 19, 2, 98–100.
NEWCOMBE, R. A., IZADI, S., HILLIGES, O., MOLYNEAUX, D., KIM, D., DAVISON, A. J., KOHLI, P., SHOTTON, J., HODGES, S., AND FITZGIBBON, A. 2011. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 127–136.
NEWCOMBE, R. A., FOX, D., AND SEITZ, S. M. 2015. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In CVPR, 343–352.
PONS-MOLL, G., TAYLOR, J., SHOTTON, J., HERTZMANN, A.,
AND FITZGIBBON, A. 2015. Metric regression forests for
correspondence estimation. IJCV 113, 3, 163–175.
PRADEEP, V., RHEMANN, C., IZADI, S., ZACH, C., BLEYER,
M., AND BATHICHE, S. 2013. MonoFusion: Real-time 3D
reconstruction of small scenes with a single web camera. In Proc.
ISMAR, IEEE, 83–88.
REVAUD, J., WEINZAEPFEL, P., HARCHAOUI, Z., AND SCHMID,
C. 2015. EpicFlow: Edge-preserving interpolation of
correspondences for optical flow. CVPR.
ROSTEN, E., AND DRUMMOND, T. 2005. Fusing points and lines
for high performance tracking. In ICCV.
RUSINKIEWICZ, S., AND LEVOY, M. 2001. Efficient variants of
the ICP algorithm. In 3DIM, 145–152.
SHOTTON, J., GLOCKER, B., ZACH, C., IZADI, S., CRIMINISI,
A., AND FITZGIBBON, A. 2013. Scene coordinate regression
forests for camera relocalization in RGB-D images. In CVPR.
SMOLIC, A. 2011. 3D video and free viewpoint video: from capture
to display. Pattern recognition 44, 9, 1958–1968.
STARCK, J., AND HILTON, A. 2007. Surface capture for
performance-based animation. Computer Graphics and Applications
27, 3, 21–31.
STOLL, C., HASLER, N., GALL, J., SEIDEL, H., AND THEOBALT,
C. 2011. Fast articulated motion tracking using a sums of Gaussians
body model. In Proc. ICCV, IEEE, 951–958.
SUMNER, R. W., SCHMID, J., AND PAULY, M. 2007. Embedded
deformation for shape manipulation. ACM TOG 26, 3, 80.
TEVS, A., BERNER, A., WAND, M., IHRKE, I., BOKELOH, M.,
KERBER, J., AND SEIDEL, H.-P. 2012. Animation cartography:
intrinsic reconstruction of shape and motion. ACM TOG.
THEOBALT, C., DE AGUIAR, E., STOLL, C., SEIDEL, H.-P., AND
THRUN, S. 2010. Performance capture from multi-view video. In
Image and Geometry Processing for 3D-Cinematography, R. Ronfard
and G. Taubin, Eds. Springer, 127ff.
VINEET, V., WARRELL, J., AND TORR, P. H. S. 2012. Filter-based
mean-field inference for random fields with higher-order terms
and product label-spaces. In ECCV.
VLASIC, D., BARAN, I., MATUSIK, W., AND POPOVIĆ, J. 2008.
Articulated mesh animation from multi-view silhouettes. ACM
TOG (Proc. SIGGRAPH).
VLASIC, D., PEERS, P., BARAN, I., DEBEVEC, P., POPOVIĆ, J.,
RUSINKIEWICZ, S., AND MATUSIK, W. 2009. Dynamic shape
capture using multi-view photometric stereo. ACM TOG (Proc.
SIGGRAPH Asia) 28, 5, 174.
WAND, M., ADAMS, B., OVSJANIKOV, M., BERNER, A.,
BOKELOH, M., JENKE, P., GUIBAS, L., SEIDEL, H.-P., AND
SCHILLING, A. 2009. Efficient reconstruction of nonrigid shape
and motion from real-time 3D scanner data. ACM TOG.
WANG, S., FANELLO, S. R., RHEMANN, C., IZADI, S., AND
KOHLI, P. 2016. The global patch collider. CVPR.
WASCHBÜSCH, M., WÜRMLIN, S., COTTING, D., SADLO, F.,
AND GROSS, M. 2005. Scalable 3D video of dynamic scenes. In
Proc. Pacific Graphics, 629–638.
WEI, L., HUANG, Q., CEYLAN, D., VOUGA, E., AND LI, H.
2015. Dense human body correspondences using convolutional
networks. arXiv preprint arXiv:1511.05904.
WEINZAEPFEL, P., REVAUD, J., HARCHAOUI, Z., AND SCHMID,
C. 2013. DeepFlow: Large displacement optical flow with deep
matching. In ICCV.
YE, M., AND YANG, R. 2014. Real-time simultaneous pose
and shape estimation for articulated objects using a single depth
camera. In CVPR, IEEE.
YE, M., ZHANG, Q., WANG, L., ZHU, J., YANG, R., AND GALL,
J. 2013. A survey on human motion analysis from depth data.
In Time-of-Flight and Depth Imaging. Sensors, Algorithms, and
Applications. Springer, 149–187.
ZACH, C. 2014. Robust bundle adjustment revisited. In Computer
Vision–ECCV 2014. Springer, 772–787.
ZENG, M., ZHENG, J., CHENG, X., AND LIU, X. 2013. Templateless
quasi-rigid shape modeling with implicit loop-closure. In
Proc. CVPR, IEEE, 145–152.
ZHANG, Q., FU, B., YE, M., AND YANG, R. 2014. Quality
dynamic human body modeling using a single low-cost depth
camera. In CVPR, IEEE, 676–683.
ZOLLHÖFER, M., NIESSNER, M., IZADI, S., RHEMANN, C.,
ZACH, C., FISHER, M., WU, C., FITZGIBBON, A., LOOP, C.,
THEOBALT, C., ET AL. 2014. Real-time non-rigid reconstruction
using an RGB-D camera. ACM TOG.
... The demand for accurate reconstruction of three-dimensional (3D) objects has been increasing recently in various fields [1][2][3][4][5][6][7], such as computer vision, computer graphics, robotics, and image processing. However, 3D and four-dimensional (4D) scanning devices that accurately reconstruct 3D objects are still prohibitively expensive for widespread use. ...
... These examinations provide helpful guidelines for high-quality 3D surfaces in the overall pipeline for 3D reconstruction using active stereo sensors. We believe our analyses, benchmarks, and guidelines will help people build their own studios and further the research related to 3D reconstruction [1,40]. ...
It is possible to construct cost-efficient three-dimensional (3D) or four-dimensional (4D) scanning systems using multiple affordable off-the-shelf RGB-D sensors to produce high-quality reconstructions of 3D objects. However, the quality of these systems’ reconstructions is sensitive to a number of factors in reconstruction pipelines, such as multi-view calibration, depth estimation, 3D reconstruction, and color mapping accuracy, because the successive pipelines to reconstruct 3D meshes from multiple active stereo sensors are strongly correlated with each other. This paper categorizes the pipelines into sub-procedures and analyzes various factors that can significantly affect reconstruction quality. Thus, this paper provides analytical and practical guidelines for high-quality 3D reconstructions with off-the-shelf sensors. For each sub-procedure, this paper shows comparisons and evaluations of several methods using data captured by 18 RGB-D sensors and provides analyses and discussions towards robust 3D reconstruction. Through various experiments, it has been demonstrated that significantly more accurate 3D scans can be obtained with the considerations along the pipelines. We believe our analyses, benchmarks, and guidelines will help anyone build their own studio and further their research on 3D reconstruction.
... It has several applications in the domains of fashion, AR/VR, sports and healthcare. Traditional stereo/multiview (including RGB and depth sensor) based reconstruction solutions (Gall et al, 2009; Shotton et al, 2011; Wei et al, 2012; Baak et al, 2011; Newcombe et al, 2015; Dou et al, 2016; Bogo et al, 2017) typically require studio environments with controlled lighting and multiple synchronized and calibrated cameras. Thus, recent approaches have shifted their focus on in-the-wild 3D reconstruction of humans. ...
... Reconstruction: Recovering 3D human body from multi-camera setup requires traditional techniques like voxel carving, triangulation, multi-view stereo, shape-from-X (Azevedo et al, 2009; Dou et al, 2016; Bogo et al, 2017; Mulayim et al, 2003). Stereo cameras and consumer RGBD sensors are highly susceptible to noise. ...
Recent advancements in deep learning have enabled 3D human body reconstruction from a monocular image, which has broad applications in multiple domains. In this paper, we propose SHARP (SHape Aware Reconstruction of People in loose clothing), a novel end-to-end trainable network that accurately recovers the 3D geometry and appearance of humans in loose clothing from a monocular image. SHARP uses a sparse and efficient fusion strategy to combine parametric body prior with a non-parametric 2D representation of clothed humans. The parametric body prior enforces geometrical consistency on the body shape and pose, while the non-parametric representation models loose clothing and handles self-occlusions as well. We also leverage the sparseness of the non-parametric representation for faster training of our network while using losses on 2D maps. Another key contribution is 3DHumans, our new life-like dataset of 3D human body scans with rich geometrical and textural details. We evaluate SHARP on 3DHumans and other publicly available datasets and show superior qualitative and quantitative performance than existing state-of-the-art methods.
... Depth estimation Fusion4D (Dou et al. 2016) introduced a method for real-time online reconstruction for a video sequence from RGB image, depth and high-quality segmentation as input (Dou et al. 2016) and is restricted to relative simple indoor scenes. The proposed method only need RGB images as input and works for crowded indoor and outdoor scenes with multiple people. ...
We introduce the first approach to solve the challenging problem of automatic 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (≈40%) improvement in semantic segmentation, reconstruction and scene flow accuracy. In addition to the evaluation on several indoor and outdoor scenes, the proposed joint 4D scene understanding framework is applied to challenging outdoor sports scenes in the wild captured with manually operated wide-baseline broadcast cameras.
... Various spatial constraints have been studied, such as isometric mapping [12,13], as-rigid-as-possible [14][15][16][17][18], membrane model [19], and conformal mapping [20][21][22]. Spatial constraints are also used in the case of registering multiple frames [23,24]. In the present paper, the proposed method uses a spatial constraint based on conformal mapping because this constraint is applicable to various deformations, as compared to other constraints. ...
Estimation of muscle activity is very important as it can be a cue to assess a person’s movements and intentions. If muscle activity states can be obtained through non-contact measurement, through visual measurement systems, for example, muscle activity will provide data support and help for various study fields. In the present paper, we propose a method to predict human muscle activity from skin surface strain. This requires us to obtain a 3D reconstruction model with a high relative accuracy. The problem is that reconstruction errors due to noise on raw data generated in a visual measurement system are inevitable. In particular, the independent noise between each frame on the time series makes it difficult to accurately track the motion. In order to obtain more precise information about the human skin surface, we propose a method that introduces a temporal constraint in the non-rigid registration process. We can achieve more accurate tracking of shape and motion by constraining the point cloud motion over the time series. Using surface strain as input, we build a multilayer perceptron artificial neural network for inferring muscle activity. In the present paper, we investigate simple lower limb movements to train the network. As a result, we successfully achieve the estimation of muscle activity via surface strain.
... Many previous pioneer works fall short of these goals. There has been attempts either rely on deploying depth sensors for high-quality geometry reconstruction before rendering [10,12], or building dense camera arrays to capture the changing appearances from different viewing angles [11]. The requirement of professional equipment limits the application of such technologies in personal and daily usage scenarios. ...
This work targets at using a general deep learning framework to synthesize free-viewpoint images of arbitrary human performers, only requiring a sparse number of camera views as inputs and skirting per-case fine-tuning. The large variation of geometry and appearance, caused by articulated body poses, shapes and clothing types, are the key bottlenecks of this task. To overcome these challenges, we present a simple yet powerful framework, named Generalizable Neural Performer (GNR), that learns a generalizable and robust neural body representation over various geometry and appearance. Specifically, we compress the light fields for novel view human rendering as conditional implicit neural radiance fields from both geometry and appearance aspects. We first introduce an Implicit Geometric Body Embedding strategy to enhance the robustness based on both parametric 3D human body model and multi-view images hints. We further propose a Screen-Space Occlusion-Aware Appearance Blending technique to preserve the high-quality appearance, through interpolating source view appearance to the radiance fields with a relax but approximate geometric guidance. To evaluate our method, we present our ongoing effort of constructing a dataset with remarkable complexity and diversity. The dataset GeneBody-1.0, includes over 360M frames of 370 subjects under multi-view cameras capturing, performing a large variety of pose actions, along with diverse body shapes, clothing, accessories and hairdos. Experiments on GeneBody-1.0 and ZJU-Mocap show better robustness of our methods than recent state-of-the-art generalizable methods among all cross-dataset, unseen subjects and unseen poses settings. We also demonstrate the competitiveness of our model compared with cutting-edge case-specific ones. Dataset, code and model will be made publicly available.
... Some attempts have tried to lower the demanding requirements of multi-camera settings, where many sensors and lighting sources are required. One solution is enhancing the data obtained with a single depth sensor [10,33,42,75,91] or multiple depth sensors [21,34,96]. Other approaches try to reduce the number of cameras in operation and still obtain reasonably accurate results [98,99]. ...
We present a dataset of 1000 video sequences of human portraits recorded in real and uncontrolled conditions by using a handheld smartphone accompanied by an external high-quality depth camera. The collected dataset contains 200 people captured in different poses and locations and its main purpose is to bridge the gap between raw measurements obtained from a smartphone and downstream applications, such as state estimation, 3D reconstruction, view synthesis, etc. The sensors employed in data collection are the smartphone's camera and Inertial Measurement Unit (IMU), and an external Azure Kinect DK depth camera software synchronized with sub-millisecond precision to the smartphone system. During the recording, the smartphone flash is used to provide a periodic secondary source of lighting. Accurate mask of the foremost person is provided as well as its impact on the camera alignment accuracy. For evaluation purposes, we compare multiple state-of-the-art camera alignment methods by using a Motion Capture system. We provide a smartphone visual-inertial benchmark for portrait capturing, where we report results for multiple methods and motivate further use of the provided trajectories, available in the dataset, in view synthesis and 3D reconstruction tasks.
... Three-dimensional human character reconstruction is traditionally the very first step towards human avatar modeling. Previous studies focused on using multi-view images [39,70,74,76,77] or RGB(D) image sequences [3,4,7,12,13,21,23,80,81,[83][84][85]89] for human model reconstruction. Extremely high-quality reconstruction results have also been demonstrated with tens or even hundreds of cameras [10]. ...
It is extremely challenging to create an animatable clothed human avatar from RGB videos, especially for loose clothes due to the difficulties in motion modeling. To address this problem, we introduce a novel representation on the basis of recent neural scene rendering techniques. The core of our representation is a set of structured local radiance fields, which are anchored to the pre-defined nodes sampled on a statistical human body template. These local radiance fields not only leverage the flexibility of implicit representation in shape and appearance modeling, but also factorize cloth deformations into skeleton motions, node residual translations and the dynamic detail variations inside each individual radiance field. To learn our representation from RGB data and facilitate pose generalization, we propose to learn the node translations and the detail variations in a conditional generative latent space. Overall, our method enables automatic construction of animatable human avatars for various types of clothes without the need for scanning subject-specific templates, and can generate realistic images with dynamic details for novel poses. Experiments show that our method outperforms state-of-the-art methods both qualitatively and quantitatively.
Non‐rigid registration computes an alignment between a source surface with a target surface in a non‐rigid manner. In the past decade, with the advances in 3D sensing technologies that can measure time‐varying surfaces, non‐rigid registration has been applied for the acquisition of deformable shapes and has a wide range of applications. This survey presents a comprehensive review of non‐rigid registration methods for 3D shapes, focusing on techniques related to dynamic shape acquisition and reconstruction. In particular, we review different approaches for representing the deformation field, and the methods for computing the desired deformation. Both optimization‐based and learning‐based methods are covered. We also review benchmarks and datasets for evaluating non‐rigid registration methods, and discuss potential future research directions.
Synthesizing novel views of dynamic humans from stationary monocular cameras is a specialized but desirable setup. This is particularly attractive as it does not require static scenes, controlled environments, or specialized capture hardware. In contrast to techniques that exploit multi‐view observations, the problem of modeling a dynamic scene from a single view is significantly more under‐constrained and ill‐posed. In this paper, we introduce Neural Motion Consensus Flow (MoCo‐Flow), a representation that models dynamic humans in stationary monocular cameras using a 4D continuous time‐variant function. We learn the proposed representation by optimizing for a dynamic scene that minimizes the total rendering error, over all the observed images. At the heart of our work lies a carefully designed optimization scheme, which includes a dedicated initialization step and is constrained by a motion consensus regularization on the estimated motion flow. We extensively evaluate MoCo‐Flow on several datasets that contain human motions of varying complexity, and compare, both qualitatively and quantitatively, to several baselines and ablated variations of our methods, showing the efficacy and merits of the proposed approach. Pretrained model, code, and data will be released for research purposes upon paper acceptance.
We present a 3D scanning system for deformable objects that uses only a single Kinect sensor. Our work allows considerable amount of nonrigid deformations during scanning, and achieves high quality results without heavily constraining user or camera motion. We do not rely on any prior shape knowledge, enabling general object scanning with freeform deformations. To deal with the drift problem when nonrigidly aligning the input sequence, we automatically detect loop closures, distribute the alignment error over the loop, and finally use a bundle adjustment algorithm to optimize for the latent 3D shape and nonrigid deformation parameters simultaneously. We demonstrate high quality scanning results in some challenging sequences, comparing with state of art nonrigid techniques, as well as ground truth data.
In this paper we investigate the status of bundle adjustment as a component of a real-time camera tracking system and show that with current computing hardware a significant amount of bundle adjustment can be performed every time a new frame is added, even under stringent real-time constraints. We also show, by quantifying the failure rate over long video sequences, that the bundle adjustment is able to significantly decrease the rate of gross failures in the camera tracking. Thus, bundle adjustment does not only bring accuracy improvements. The accuracy improvements also suppress error buildup in a way that is crucial for the performance of the camera tracker. Our experimental study is performed in the setting of tracking the trajectory of a calibrated camera moving in 3D for various types of motion, showing that bundle adjustment should be considered an important component for a state-of-the-art real-time camera tracking system.
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.