Real-time Compression and Streaming of 4D Performances

DANHANG TANG, MINGSONG DOU, PETER LINCOLN, PHILIP DAVIDSON, KAIWEN GUO, JONATHAN
TAYLOR, SEAN FANELLO, CEM KESKIN, ADARSH KOWDLE, SOFIEN BOUAZIZ, SHAHRAM IZADI,
and ANDREA TAGLIASACCHI, Google Inc.
[Figure 1 image; overlaid labels: 15 Mbps, 30 Hz.]
Fig. 1. We acquire a performance via a rig consisting of 8× RGBD cameras in a panoptic configuration. The 4D geometry is reconstructed via a state-of-the-art
non-rigid fusion algorithm, and then compressed in real-time by our algorithm. The compressed data is streamed over a medium bandwidth connection
(20 Mbps), and decoded in real-time on consumer-level devices (Razer Blade Pro w/ GTX 1080m graphics card). Our system opens the door towards real-time
telepresence for Virtual and Augmented Reality; see the accompanying video.
We introduce a realtime compression architecture for 4D performance capture
that is two orders of magnitude faster than current state-of-the-art
techniques, yet achieves comparable visual quality and bitrate. We note how
much of the algorithmic complexity in traditional 4D compression arises
from the necessity to encode geometry using an explicit model (i.e. a triangle
mesh). In contrast, we propose an encoder that leverages an implicit
representation (namely a Signed Distance Function) to represent the observed
geometry, as well as its changes through time. We demonstrate how SDFs,
when defined over a small local region (i.e. a block), admit a low-dimensional
embedding due to the innate geometric redundancies in their representation.
We then propose an optimization that takes a Truncated SDF (i.e. a
TSDF), such as those found in most rigid/non-rigid reconstruction pipelines,
and efficiently projects each TSDF block onto the SDF latent space. This
results in a collection of low entropy tuples that can be effectively
quantized and symbolically encoded. On the decoder side, to avoid the typical
artifacts of block-based coding, we also propose a variational optimization
that compensates for quantization residuals in order to penalize unsightly
discontinuities in the decompressed signal. This optimization is expressed
in the SDF latent embedding, and hence can also be performed efficiently.
We demonstrate our compression/decompression architecture by realizing,
to the best of our knowledge, the first system for streaming a real-time
captured 4D performance on consumer-level networks.
Authors’ address: Danhang Tang; Mingsong Dou; Peter Lincoln; Philip Davidson;
Kaiwen Guo; Jonathan Taylor; Sean Fanello; Cem Keskin; Adarsh Kowdle; Sofien
Bouaziz; Shahram Izadi; Andrea Tagliasacchi, Google Inc.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
©2018 Copyright held by the owner/author(s).
0730-0301/2018/11-ART256
https://doi.org/10.1145/3272127.3275096
CCS Concepts: • Theory of computation → Data compression; • Computing methodologies → Computer vision; Volumetric models;
Additional Key Words and Phrases: 4D compression, free viewpoint video.
ACM Reference Format:
Danhang Tang, Mingsong Dou, Peter Lincoln, Philip Davidson, Kaiwen
Guo, Jonathan Taylor, Sean Fanello, Cem Keskin, Adarsh Kowdle, Sofien
Bouaziz, Shahram Izadi, and Andrea Tagliasacchi. 2018. Real-time Compression
and Streaming of 4D Performances. ACM Trans. Graph. 37, 6, Article 256
(November 2018), 11 pages. https://doi.org/10.1145/3272127.3275096
1 INTRODUCTION
Recent commercialization of AR/VR headsets, such as the Oculus
Rift and Microsoft Hololens, has enabled the consumption of 4D data
from unconstrained viewpoints (i.e. free-viewpoint video). In turn,
this has created a demand for algorithms to capture [Collet et al. 2015],
store [Prada et al. 2017], and transfer [Orts-Escolano et al. 2016]
4D data for user consumption. These datasets are acquired
in capture setups comprised of hundreds of traditional color cameras,
resulting in computationally intensive processing to generate
models capable of free viewpoint rendering. While pre-processing
can take place offline, the resulting “temporal sequence of colored
meshes” is far from compressed, with bitrates reaching 100 Mbps.
In comparison, commercial solutions for internet video streaming,
such as Netflix, typically cap the maximum bandwidth at a modest
16 Mbps. Hence, state-of-the-art algorithms focus on offline compression,
but with computational costs of 30 s/frame on a cluster
comprised of 70 high-end workstations [Collet et al. 2015], these
solutions are difficult to employ in scenarios beyond a professional
capture studio. Further, recent advances in VR/AR telepresence
not only require the transfer of 4D data over limited bandwidth
connections to a client with limited compute, but also require this
to happen in a streaming fashion (i.e. real-time compression).
ACM Trans. Graph., Vol. 37, No. 6, Article 256. Publication date: November 2018.
In direct contrast to Collet et al. [2015], we approach these challenges
by revising the 4D pipeline end-to-end. First, instead of requiring
100 RGB and IR cameras, we only employ a dozen RGB+D
cameras, which are sufficient to provide a panoptic view of the performer.
We then make the fundamental observation that much of
the algorithmic/computational complexity of previous approaches
revolves around the use of triangular meshes (i.e. an explicit representation),
and their reliance on a UV parameterization for texture
mapping. The state-of-the-art method by Prada et al. [2017] encodes
differentials in topology (i.e. connectivity changes) and geometry
(i.e. vertex displacements) with mesh operations. However, substantial
effort needs to be invested in the maintenance of the UV atlas,
so that video texture compression remains effective. In contrast,
our geometry is represented in implicit form as a truncated signed
distance function (TSDF) [Curless and Levoy 1996]. Our novel TSDF
encoder achieves geometry bitrates that are lower than
those in the state of the art (7 Mbps vs. 8 Mbps of [Prada et al. 2017]),
at a framerate that is two orders of magnitude faster than current
mesh-based methods (30 fps vs. 1 min/frame of [Prada et al. 2017]).
This is made possible thanks to the implicit TSDF representation of
the geometry, which can be efficiently encoded in a low dimensional
manifold.
2 RELATED WORKS
We give an overview of the state-of-the-art on 3D/4D data compression,
and highlight our contributions in contrast to these methods.
Mesh-based compression. There is a significant body of work on
compression of triangular meshes; see the surveys by Peng et al. [2005]
and by Maglo et al. [2015]. Lossy compression of a static mesh
using a Fourier transform and manifold harmonics is possible [Karni
and Gotsman 2000; Lévy and Zhang 2010], but these methods hardly
scale to real-time applications due to the computational complexity
of spectral decomposition of the Laplacian operator. Furthermore,
not only does the connectivity need to be compressed and
transferred to the client, but lossy quantization typically results in
low-frequency aberrations to the signal [Sorkine et al. 2003]. Under
the assumption of temporally coherent topology, these techniques
can be generalized to compression of animations [Karni and Gotsman 2004].
However, quantization still results in low-frequency
aberrations that introduce very significant visual artifacts such as
foot-skate effects [Vasa and Skala 2011]. Overall, all these methods,
including the [MPEG4/AFX 2008] standard, assume temporally
consistent topology/connectivity, making them unsuitable for the
scenarios we seek to address. Another class of geometric compressors,
derived from the EdgeBreaker encoders [Rossignac 1999], lies
at the foundation of the modern Google/Draco library [Galligan
et al. 2018]. This method is compared with ours in Section 4.
Compression of volumetric sequences. The work by Collet et al. [2015]
pioneered volumetric video compression, with the primary
focus of compressing hundreds of RGB camera data in a single
texture map. Given a keyframe, they first non-rigidly deform this
template to align with a set of subsequent subframes using embedded
deformations [Sumner et al. 2007]. Keyframe geometry is
compressed via vertex position quantization, whereas connectivity
information is delta-encoded. Differentials with respect to a linear
motion predictor are then quantized and compressed with
minimal entropy. Optimal keyframes are selected via heuristics, and
post motion-compensation residuals are implicitly assumed to be
zero. As the texture atlas remains constant in subframes (the set
of frames between adjacent keyframes), the texture image stream
is temporally coherent, making it easy to compress by traditional
video encoders. As reported in [Prada et al. 2017, Tab. 2], this results
in texture bitrates ranging from 1.8 to 6.5 Mbps, and geometry
bitrates ranging from 6.6 to 21.7 Mbps. The work by Prada et al. [2017]
is a direct extension of [Collet et al. 2015] that uses a differential
encoder. However, these differentials are explicit mesh editing
operations correcting the residuals that motion compensation failed
to explain. This results in better texture compression, as it enables
the atlases to remain temporally coherent across entire sequences
(13% texture bitrate), and better geometry bitrates, as only mesh
differentials, rather than entire keyframes, need to be compressed
(30% geometry bitrate); see [Prada et al. 2017, Tab. 2]. Note how
neither [Collet et al. 2015] nor [Prada et al. 2017] is suitable for
streaming compression, as they both have a runtime complexity
that is on the order of 1 min/frame.
Volume compression. Implicit 3D representations of a volume, such
as the signed distance function (SDF), have also been explored for 3D data
compression. The work in [Schelkens et al. 2003] is an example
based on 3D wavelets that form the foundation of the JPEG2000/jp3D
standard. Dictionary learning and sparse coding have been used in
the context of local surface patch encoding [Ruhnke et al. 2013],
but the method is far from real-time. Similar to this work, Canelhas
et al. [2017] chose to split the volume into blocks, and perform KLT
compression to efficiently encode the data. In their method, discontinuities
at the TSDF boundaries are dealt with by allowing blocks
to overlap, resulting in very high bitrate. Moreover, the encoding
of a single slice of 100 blocks (10% of volume) of size $16^3$ takes
45 ms, making the method unsuitable for real-time scenarios.
Compression as 3D shape approximation. Shape approximation
can also be interpreted as a form of lossy compression. In this domain
we find several variants, from traditional decimation [Cohen-Steiner
et al. 2004; Garland and Heckbert 1997] to its level-of-detail variants
[Hoppe 1996; Valette and Prost 2004]. Another form of compression
replaces the original geometry through a set of approximating
proxies, such as bounding cages [Calderon and Boubekeur 2017]
and sphere-meshes [Thiery et al. 2013]. The latter are somewhat
relevant due to their recent extension to 4D geometry [Thiery et al. 2016],
but, like [MPEG4/AFX 2008], they only support temporally coherent
topology.
Contributions. In summary, our main contribution is a novel end-to-end
architecture for real-time 4D data streaming on consumer-level
hardware, which includes the following elements:
[Figure 2 diagram labels. ENCODER: m2f, Dicer, KLT, Quantizer, Range Encoder, NVENC. DECODER: Range Decoder, iQuantizer, iKLT, iDicer, Marching Cubes, NVDEC, Texture Mapping. Inputs: 8x{RGB}, 8x{Depth}. Bitstreams: .3 Mbps, 7 Mbps, 8 Mbps. Intermediate signals: (blocks), (discontinuous), (smooth).]
Fig. 2. The architecture of our real-time 4D compression/decompression pipeline; we visualize 2D blocks for illustrative purposes only. From the depth
images, a TSDF volume is created by motion2fusion [Dou et al. 2017]; each of the 8×8×8 blocks of the TSDF is then embedded in a compact latent space,
quantized and then symbolically encoded. The decompressor inverts this process, where an optional inverse quantization (iQ) filter can be used to remove the
discontinuities caused by the fact that the signal compresses adjacent blocks independently. A triangular mesh is then extracted from the decompressed TSDF
via iso-surfacing. Given our limited number of RGB cameras, the compression of color information is performed per-channel and without the need of any UV
mapping; given the camera calibrations, the color streams can then be mapped on the surface projectively within a programmable shader.
• A new efficient compression architecture leveraging implicit
representations instead of triangular meshes.
• We show how we can reduce the number of coefficients
needed to encode a volume by solving, at run-time, a collection
of small least squares problems.
• We propose an optimization technique to enforce smoothness
among neighboring blocks, removing visually unpleasant
block-artifacts.
• These changes allow us to achieve state-of-the-art compression
performance, which we evaluate in terms of rate vs
distortion analysis.
• We propose a GPU compression architecture that, to the best
of our knowledge, is the first to achieve real-time compression
performance.
3 TECHNICAL OUTLINE
We overview our compression architecture in Figure 2. Our pipeline
assumes a stream of depth and color image sets $\{D_t^n, C_t^n\}$ from a
multiview capture system that are used to generate a volumetric
reconstruction (Section 3.1). The encoder/decoder has two streams
of computation executed in parallel, one for geometry (Section 3.2),
one for color (Section 3.5), which are transferred via a peer-to-peer
network channel (Section 3.6). We do not build a texture atlas per-frame
(as in [Dou et al. 2017]), nor do we update it over time (as in
[Orts-Escolano et al. 2016]), but rather we apply standard video
compression to stream the 8 RGB views to a client; the known
camera calibrations are then used to color the rendered model via
projective-texturing.
3.1 Panoptic 4D capture and reconstruction
We built a capture setup with 8 RGBD cameras placed around
the performer capable of capturing a panoptic view of the scene;
see Figure 3. Each of these cameras consists of 1 RGB camera, 2 IR
cameras that have been mutually calibrated, and an uncalibrated IR
pattern projector. We leverage the state-of-the-art disparity estimation
system by Kowdle et al. [2018] to compute depth very efficiently.
The depth maps coming from different viewpoints are then merged
following modern extensions [Dou et al. 2017] of traditional fusion
paradigms [Dou et al. 2016; Innmann et al. 2016; Newcombe et al. 2015]
that aggregate observations from all cameras into a single
representation.
Implicit geometry representation. An SDF is an implicit representation
of the 3D volume and it can be expressed as a function
$\phi(z): \mathbb{R}^3 \to \mathbb{R}$ that encodes a surface by storing at each grid point
[Figure 3 image; labels: Volumetric Data Capture Setup, x8 RGBD cameras.]
Fig. 3. Our panoptic capture rig, examples of the RGB and depth views and
resulting projective texture mapped model.
the signed distance from the surface $S$ (positive if outside, and negative
otherwise). The SDF is typically represented as a dense uniform
3D grid of signed distance values with resolution $W \times H \times D$. A triangular
mesh representation of the surface
$S = \{z \in \mathbb{R}^3 \mid \phi(z) = 0\}$
can be easily extracted via iso-surfacing [de Araujo et al. 2015];
practical and efficient implementations of these methods currently
exist on mobile consumer devices. In practical applications, where
resources such as compute power and memory are scarce, a Truncated
SDF is employed: an SDF variant that is only defined in the
neighborhood of the surface corresponding to the SDF zero crossing,
$\Phi(z): \{z \in \mathbb{R}^3 \mid \lvert\phi(z)\rvert < \varepsilon\} \to \mathbb{R}$.
Although surfaces are most commonly represented explicitly using a
triangular/quad mesh, implicit
representations make dealing with topology changes easier.
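To make the two definitions concrete, the following minimal sketch samples the SDF of a sphere on a dense grid and truncates it to a band around the zero crossing. The sphere, the grid size, the band width `eps`, and the use of NaN to mark unobserved voxels are all illustrative assumptions; they are not taken from the capture pipeline described here.

```python
import numpy as np

def sphere_sdf(shape, center, radius, voxel_size=1.0):
    """Signed distance to a sphere on a dense uniform grid (positive outside)."""
    axes = [np.arange(n) * voxel_size for n in shape]
    zx, zy, zz = np.meshgrid(*axes, indexing="ij")
    dist_to_center = np.sqrt(
        (zx - center[0]) ** 2 + (zy - center[1]) ** 2 + (zz - center[2]) ** 2
    )
    return dist_to_center - radius

def truncate(sdf, eps):
    """TSDF: keep values only where |phi| < eps; NaN marks unobserved voxels (assumption)."""
    tsdf = sdf.copy()
    tsdf[np.abs(sdf) >= eps] = np.nan
    return tsdf

phi = sphere_sdf((32, 32, 32), center=(16.0, 16.0, 16.0), radius=10.0)
Phi = truncate(phi, eps=2.0)

# Fraction of voxels that carry a value after truncation.
observed_fraction = float(np.mean(~np.isnan(Phi)))
```

Only a thin shell of voxels survives truncation, which is why the later stages operate on the observed blocks rather than on the full volume.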
Using our reimplementation of [Dou et al. 2017] we transform
the input stream into a sequence of TSDFs $\{\Phi_t\}$ and corresponding
sets of RGB images $\{C^n\}_t$ for each frame $t$.
3.2 Signal compression – Preliminaries
Before diving into the actual 3D volume compression algorithm, we
briefly describe some preliminary tools that are used in the rest of
the pipeline, namely KLT and range encoding.
3.2.1 Karhunen-Loève Transform. Given a collection $\{x_m \in \mathbb{R}^n\}$ of
$M$ training signals, we can learn a transformation matrix $P$ and offset $\mu$
that define a forward KLT transform, as well as its inverse iKLT,
of an input signal $x$ (a transform based on PCA decomposition):
$$X = \mathrm{KLT}(x) = P^T (x - \mu) \quad \text{(forward transform)} \tag{1}$$
$$x = \mathrm{iKLT}(X) = P X + \mu \quad \text{(inverse transform)} \tag{2}$$
Given $\mu = E_m[x_m]$, the matrix $P$ is computed via eigendecomposition
of the covariance matrix $C = E_m[(x_m - \mu)(x_m - \mu)^T]$,
resulting in the matrix factorization $C = P \Sigma P^T$, where $\Sigma$ is a
diagonal matrix whose entries $\{\sigma_n\}$ are the eigenvalues of $C$, and
the columns of $P$ are the corresponding eigenvectors. In particular,
if the signal to be compressed is drawn from a linear manifold of
dimensionality $d < n$, the trailing $(n - d)$ entries of $X$ will be zero.
Consequently, we can encode a signal $x \in \mathbb{R}^n$ in a lossless way by
only storing its transform coefficients $X \in \mathbb{R}^d$. Note that to invert
the process, the decoder only needs the transformation matrix
$P$ and the mean vector $\mu$. In practical applications, our signals are
not strictly drawn from a low-dimensional linear manifold, but the
entries of the diagonal matrix $\Sigma$ represent how the energy of the
signal is distributed in the latent space. In more detail, we could
use only the first $d$ basis vectors to encode/decode the signal; more
formally:
$$\hat{X} = \mathrm{KLT}_d(x) = (P I_d)^T (x - \mu) \quad \text{(lossy encoder)} \tag{3}$$
$$\hat{x} = \mathrm{iKLT}_d(\hat{X}) = P I_d \hat{X} + \mu \quad \text{(lossy decoder)} \tag{4}$$
where $I_d$ is a diagonal matrix in which only the first $d$ diagonal elements are
set to one. The reconstruction error is bounded by:
$$\epsilon(d) = E_m \lVert x_m - \hat{x}_m \rVert_2^2 = \sum_{i=d+1}^{n} \sigma_i \tag{5}$$
For linear latent spaces of order $d$, not only does the optimality
theorem of the KLT basis ensure that the error above is zero (i.e. lossless
compression), but also that the transformed signal $\hat{X}$ has minimal
entropy [Jorgensen and Song 2007]. In the context of compression,
this is significant, as entropy captures the average number of bits
necessary to encode the signal. Furthermore, similar to the Fourier
Transform (FT), as the covariance matrix is symmetric, the matrix $P$
is orthonormal. Uncorrelated bases ensure that adding more dimensions
always adds more information, and this is particularly useful
when dealing with variable bitrate streaming. Note that, in contrast
to compression based on the FT, the $\mathrm{KLT}_d$ compression bases are
signal-dependent: we train $P$ and then stream $P_d$, containing the
first $d$ columns of $P$, to the client. Note this operation could be done
offline as part of a codec installation process.
In practice, the full covariance matrix of an entire frame with
millions of voxels would be extremely large, and populating it accurately
would require a dataset several orders of magnitude larger
than ours. Further, dropping eigenvectors would also have a detrimental
global smoothing effect. Besides, a TSDF is a sparse signal
with interesting values centered around the implicit surface only,
which is why we apply KLT to individual, non-overlapping blocks
of size 8×8×8, which is large enough to contain interesting local
surface characteristics, and small enough to capture its variation
using our dataset. More specifically, in our implementation we use
5 mm voxel resolution, so a block has a volume of 40×40×40 mm³.
The learned eigenbases from these blocks need to be transmitted
only once to the decoder.
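A minimal sketch of how a dense volume could be diced into non-overlapping 8×8×8 blocks and reassembled, assuming the volume dimensions are multiples of the block size. The helper names `dice` and `undice` are hypothetical, and this reshape-based version ignores the sparsity handling (transmitting only observed blocks) that the actual dicer module performs.

```python
import numpy as np

BLOCK = 8           # voxels per block side
VOXEL_MM = 5.0      # voxel resolution stated in the text

def dice(volume, block=BLOCK):
    """Split a (W, H, D) volume into non-overlapping block^3 cubes, one per row."""
    W, H, D = volume.shape
    assert W % block == 0 and H % block == 0 and D % block == 0
    v = volume.reshape(W // block, block, H // block, block, D // block, block)
    v = v.transpose(0, 2, 4, 1, 3, 5)   # block-grid axes first, in-block axes last
    return v.reshape(-1, block ** 3)

def undice(blocks, shape, block=BLOCK):
    """Inverse of dice: reassemble the rows into a (W, H, D) volume."""
    W, H, D = shape
    v = blocks.reshape(W // block, H // block, D // block, block, block, block)
    v = v.transpose(0, 3, 1, 4, 2, 5)
    return v.reshape(W, H, D)

volume = np.arange(32 * 16 * 24, dtype=np.float32).reshape(32, 16, 24)
blocks = dice(volume)                   # one 512-vector per block
block_side_mm = BLOCK * VOXEL_MM        # 8 voxels at 5 mm each
```

Each row of `blocks` is a signal $x \in \mathbb{R}^{512}$ of the kind the KLT above operates on.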
3.2.2 Range Encoding. The $d$ coefficients obtained through KLT
projection can be further compressed using a range encoder, which
is a lossless compression technique specialized for compressing
entire sequences rather than individual symbols as in the case of
Huffman coding. A range encoder computes the cumulative distribution
function (CDF) from the probabilities of a set of discrete
symbols, and recursively subdivides the range of a fixed precision
integer accordingly to map an entire sequence of symbols into
an integer [Martin 1979]. However, to apply range encoding to a
sequence of real valued descriptors as in our case, one needs to
quantize the symbols first. As such, despite range encoding being a
lossless compression method on discrete symbols, the combination
results in lossy compression due to quantization error. In particular,
we treat each of the $d$ components as a separate message, as treating
them jointly may not lend itself to efficient compression. This is due
to the fact that the distribution of symbols in each latent dimension
can be vastly different and that the components are uncorrelated
because of the KLT. Hence, we compute the CDF, quantize the values,
and encode the sequence separately for each latent dimension.
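The benefit of one CDF per latent dimension can be illustrated without implementing a range coder. The sketch below only computes empirical CDFs and Shannon entropies, i.e. the ideal code length that an entropy coder such as a range encoder approaches; it is not a range encoder, and the two synthetic symbol streams are assumptions chosen to have very different statistics.

```python
import numpy as np

def symbol_cdf(symbols, num_symbols):
    """Empirical symbol probabilities and the CDF a range coder would subdivide."""
    counts = np.bincount(symbols, minlength=num_symbols).astype(np.float64)
    probs = counts / counts.sum()
    return probs, np.concatenate([[0.0], np.cumsum(probs)])

def entropy_bits(probs):
    """Shannon entropy in bits per symbol (ideal average code length)."""
    p = probs[probs > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
N, K = 10000, 16
# Two toy latent dimensions with very different symbol statistics.
dim0 = rng.integers(0, K, size=N)                          # close to uniform
dim1 = np.minimum(rng.geometric(0.6, size=N) - 1, K - 1)   # strongly peaked at 0

probs0, cdf0 = symbol_cdf(dim0, K)
probs1, cdf1 = symbol_cdf(dim1, K)

# Ideal total bits when each dimension has its own model vs. one pooled model.
separate_bits = (entropy_bits(probs0) + entropy_bits(probs1)) * N
pooled_probs, _ = symbol_cdf(np.concatenate([dim0, dim1]), K)
pooled_bits = entropy_bits(pooled_probs) * 2 * N
```

Because entropy is concave, the pooled model can never need fewer ideal bits than the two separate models, which is consistent with encoding each latent dimension as its own message.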
3.3 Oline Training of KLT / antization
The tools in the previous section can be applied directly to transform
a signal, such as an SDF, into a bit stream. The ability, however,
to actually compress the signal greatly depends on how exactly
they are applied. In particular, applying these tools to the entire
signal (e.g. the full TSDF) does not expose repetitive structures that
would enable effective compression. As such, inspired by image
compression techniques, we look at 8×8×8 sized blocks where we
[Figure 4 panels: Neumann TSDF, Dirichlet TSDF, Traditional TSDF; each with an exemplar, a cross section along the segment x, and a KLT spectrum plot.]
Fig. 4. (left) A selection of random exemplars for several variants of TSDF
training signals that are used to train the KLT. (middle) A cross section of the
function along the segment $x$, highlighting the different levels of continuity
in the training signal. (right) When trained on 10000 randomly generated
32×32 images, the KLT spectra are heavily affected by the smoothness in
the training dataset.
are likely to see repetitive structure (e.g. planar structures). Further,
it turns out that leveraging a training set to choose an effective set
of KLT parameters (i.e. $P$ and $\mu$) is considerably nuanced, and one
of our contributions is showing below how to do that effectively.
Finally we demonstrate how to use this training set to form an
effective quantization scheme.
Learning $P, \mu$ from 2D data. The core of our compression architecture
exploits the repetitive geometric structure seen in typical
SDFs. For example, the walls of an empty room, if axis aligned, can
largely be explained by a small number of basis vectors that represent
axis aligned blocks. When non-axis aligned planar surfaces are
introduced, one can reconstruct an arbitrary block containing such
a surface by linearly blending a basis of blocks containing planes
at a sparse sampling of orientations. Even for surfaces, the SDF is
an identity function along the surface normal, indicating that the
amount of variation in an SDF is much lower than for an arbitrary 3D
function. We thus use KLT to find a latent space where the variance
on each component is known, which we will later exploit in order
to encourage bits to be assigned to components of high variance.
As we will see, however, applying this logic naively to generate a
training set of blocks sampled from a TSDF can have disastrous
effects. Specifically, one must carefully consider how to deal with
blocks that contain unobserved voxels; the block pixels colored in
black in Figure 2. This is complicated by the fact that although real
world objects have a well defined inside and outside, it is not obvious
how to discern this for the unobserved voxels outside the truncation
band. The three choices that we consider are:
• Traditional TSDF: Assign some arbitrary constant value
to all such unassigned voxels. This will create sharp discontinuities
at the edge of the truncation band; this is presumably
the representation used by Canelhas et al. [2017].
• Dirichlet TSDF: Only consider “unambiguous” blocks where
using the values of voxels on the truncation boundary (i.e.
with a value nearly ±1) to flood-fill any unobserved voxels
does not introduce large $C^0$ discontinuities.
• Neumann TSDF: Increase the truncation band and consider
blocks that never contain an unobserved value. These blocks
will look indistinguishable from SDF blocks at training time,
but require additional care at test time when the truncation
band is necessarily reduced for performance reasons; see Equation 9.
[Figure 5: three eigenvalue spectrum plots; horizontal axes 0 to 40, vertical axes 0.2 to 1.0.]
Fig. 5. The (sorted) eigenvalues (diagonal of $\Sigma$) for a training signal derived
from a collection of 4D sequences. Note the decay rate is slower than that
of Figure 4, as these eigenbases need to represent surface details, and not
just the orientation of a planar patch.
To elucidate the benefits of these increasingly complicated approaches,
we use an illustrative 2D example that naturally generalizes
to 3D; see Figure 4. In particular, we consider training a basis
on 32×32 blocks extracted from a 2D TSDF containing a single
planar surface (i.e. a line). Notice that for the truly SDF-like data
resulting from the Neumann strategy, the eigenvalues of the
eigendecomposition of our signal quickly fall to zero, faithfully allowing
the zero-crossing to be reconstructed with a small number of basis
vectors. In contrast, the continuous but kinked signals resulting
from the Dirichlet strategy extend the spectra of the signal as
higher frequency components are required to reconstruct the sharp
transition to the constant valued plateaus. Finally, the Traditional
strategy massively reduces the compactness of the spectra as the
decomposition tries to model the sharp discontinuities in the signal.
In more detail, to capture 99% of the energy in the signal, Traditional
TSDFs require $d = 94$ KLT basis vectors, Dirichlet TSDFs
require $d = 14$ KLT basis vectors, and Neumann TSDFs require $d = 3$
KLT basis vectors. Note how Neumann is able to discover the true
dimensionality of the input data (i.e. $d = 3$).
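The qualitative effect can be reproduced with a small experiment in the spirit of the 2D example, though not with its exact setup: the sampling of line orientations and offsets, the truncation band width, and the number of training blocks below are assumptions, so the counts reported in the text (94, 14, 3) should not be expected to match exactly. The untruncated signed distance to a line is an affine function of the pixel coordinates, so it lives in a space spanned by at most three basis images, whereas assigning a constant outside the band breaks this structure.

```python
import numpy as np

def random_line_sdf(rng, size=32):
    """Signed distance to a random line through a size x size block (affine in x, y)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    offset = rng.uniform(-size / 4, size / 4)
    coords = np.arange(size) - size / 2
    ys, xs = np.meshgrid(coords, coords, indexing="ij")
    return np.cos(theta) * xs + np.sin(theta) * ys - offset

def components_for_energy(signals, energy=0.99):
    """Number of KLT basis vectors needed to capture `energy` of the variance."""
    centered = signals - signals.mean(axis=0)
    # Squared singular values are proportional to the covariance eigenvalues.
    s = np.linalg.svd(centered, compute_uv=False) ** 2
    cumulative = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(0)
sdf = np.stack([random_line_sdf(rng).ravel() for _ in range(1000)])

band = 3.0  # truncation band in pixels (assumption)
# "Traditional" variant: an arbitrary constant (here 1.0) outside the band.
traditional = np.where(np.abs(sdf) < band, sdf / band, 1.0)

d_sdf = components_for_energy(sdf)
d_traditional = components_for_energy(traditional)
```

Here `d_sdf` cannot exceed 3, while `d_traditional` is considerably larger, mirroring the gap between the Neumann and Traditional spectra.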
Learning $P, \mu$ from 3D captured data. Extending the example above
to 3D, we train our KLT encoder on a collection of 3D spatial blocks
extracted from 4D sequences we acquired using the capture system
described in Section 3.1. To deal with unobserved voxels, we apply
the Neumann strategy at training time and carefully design our test
time encoder to deal with unobserved voxels that occur with the
smaller truncation band required to run at real time; see Equation 9.
Note that these training sequences are excluded in the quantitative
evaluations of Section 4. In contrast to the blocks extracted by the
dicer module at test time, the training blocks need not be
non-overlapping, so instead we extract them by sliding-window over
every training frame $\{\Phi_m\}$. The result of this operation is summarized
by the KLT spectra in Figure 5. As detailed in Section 4,
the number $d$ of employed basis vectors is chosen according to the
desired trade-off between bitrate and compression quality.
Learning the quantization scheme. To be able to employ symbolic
encoders, we need to develop a quantization function $Q(\cdot)$ that maps
the latent coefficients $\hat{X}$ to a vector of quantized coefficients $\hat{Q}$ as
$\hat{Q} = Q(\hat{X})$. To this end, we quantize each dimension $h$ of the latent
space individually by choosing $K_h$ quantization bins¹
$\{\eta_k^h\}_{k=1}^{K_h} \subset \mathbb{R}$
and defining the $h$'th component of the quantization function as
$$Q^h(\hat{X}) = \arg\min_{k \in \{1, 2, \dots, K_h\}} (\hat{X}^h - \eta_k^h)^2. \tag{6}$$
To ensure there are bins near coefficients that are likely to occur we
minimize
$$\arg\min_{\{w_{k,n}^h\}, \{\eta_k^h\}} \sum_{k=1}^{K_h} \sum_{n=1}^{N} w_{k,n}^h (\hat{X}_n^h - \eta_k^h)^2, \tag{7}$$
where $\hat{X}_n^h$ is the $h$'th coefficient derived from the $n$'th training example
in our training set. Each point has a boolean vector affiliating
it to a bin, $W_n^h = [w_{1,n}^h, \dots, w_{K_h,n}^h]^T \in \mathbb{B}^{K_h}$, where only a single
value is set to 1 such that $\sum_i w_{i,n}^h = 1$. This is a 1-dimensional form
of the classic K-means problem and can be solved exactly
using dynamic programming [Grønlund et al. 2017]. The eigenvalues
corresponding to the principal components of the data are a
measure of variance in the projected space. We leverage this by
assigning a varied number of bins to each component based on their
respective eigenvalues, which range between $[0, V_{\text{total}}]$. In particular,
we adopt a linear mapping between the standard deviation of the
data and number of bins, resulting in $K_h = K_{\text{total}} V_h / V_{\text{total}}$ bins for
eigenbasis $h$. The components receiving a single bin are not encoded
since their values are fixed for all the samples, and they are simply
added as a bias to the reconstructed result in the decompression
stage. As such, the choice of $K_{\text{total}}$ partially controls the trade-off
between the reconstruction accuracy and bitrate.
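The three ingredients of this scheme (learning bin centres, nearest-centre quantization as in Equation 6, and allocating bins across components) can be sketched as follows. Two simplifications are made and should be read as assumptions: the bin centres are fitted with plain Lloyd iterations rather than the exact dynamic-programming solver cited above, and $V_h$ is interpreted as the standard deviation of component $h$ with $V_{\text{total}}$ as the sum over components, which the text does not spell out. The function names are hypothetical.

```python
import numpy as np

def lloyd_1d(values, k, iters=50):
    """Approximate 1D K-means via Lloyd iterations (not the exact DP solver)."""
    # Spread the initial centres over the data using quantiles.
    centres = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        assign = np.argmin((values[:, None] - centres[None, :]) ** 2, axis=1)
        for j in range(k):
            members = values[assign == j]
            if members.size:
                centres[j] = members.mean()
    return np.sort(centres)

def quantize(x, centres):
    """Eq. (6): index of the nearest bin centre."""
    return int(np.argmin((x - centres) ** 2))

def allocate_bins(eigenvalues, k_total):
    """K_h = K_total * V_h / V_total, with V_h taken as the standard deviation (assumption)."""
    v = np.sqrt(eigenvalues)
    return np.maximum(1, np.round(k_total * v / v.sum()).astype(int))

rng = np.random.default_rng(0)
# Toy coefficients for one latent dimension: two well separated clusters.
coeffs = np.concatenate([rng.normal(-5.0, 0.1, 200), rng.normal(5.0, 0.1, 200)])
centres = lloyd_1d(coeffs, k=2)

symbol = quantize(4.7, centres)                              # index into `centres`
bins_per_component = allocate_bins(np.array([9.0, 4.0, 1.0]), k_total=12)
```

A component that ends up with a single bin carries no information per sample, which matches the remark that such components are not encoded and only contribute a bias at decompression.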
3.4 Run-time 3D volume compression/decompression
Encoder. At run time, the encoding architecture entails the following
steps: A variant of KLT, designed to deal with unobserved
voxels, maps each of the $m = 1 \dots M$ blocks to a vector of coefficients
$\{\hat{X}_m\}$ in a low-dimensional latent space (Section 3.4.2). We
compute histograms on coefficients of each block, and then quantize
them into the set $\{\hat{Q}_m\}$; these histograms are re-used to compute the
optimal quantization thresholds (Section 3.4.3). To further reduce
bitrate allocations, the lossless range encoder module compresses
the symbols into a bitstream.
Decoder. At run time, the decoding architecture mostly inverts the
process described above, although due to quantization the signal can
only be approximately reconstructed; see Section 3.4.4. The decoding
process produces a set of coefficients $\{\hat{X}_m\}$. However, at low bitrates,
this results in blocky artifacts typical of JPEG compression. Since
we know that the compressed signal is a manifold, these artifacts
can be removed by means of an optimization designed to remove
$C^1$ discontinuities from the signal (Section 3.4.5). Finally, we extract
the zero crossing of $\Phi$ via a GPU implementation of [Lorensen and
Cline 1987]. This extracts a triangular mesh representation of $S$
that can be texture mapped.
¹ Note that what we refer to as “bins” are really just real numbers that implicitly define
a set of intervals on the real line through (6).
We now detail all the above steps, which can run in real-time
even on mobile platforms.
3.4.1 Dicer. As we stream the content of a TSDF, only the observed blocks are transmitted. To reconstruct the volume we therefore need to encode and transmit the block contents as well as the block indices. The blocks are transmitted with indices sorted in ascending order, and we use delta encoding to convert the vector of indices $[i, j, k, \ldots]$ into a vector $[i, j - i, k - j, \ldots]$. This conversion makes the vector entropy-encoder friendly, as it becomes a sequence of small, frequently repeated integers. Unlike the block data, we want lossless reconstruction of the indices, so we cannot train the range encoder beforehand. This means that for each frame we need to calculate the CDF and the code book to compress the index vector.
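As a minimal sketch of the delta encoding step, assuming the block indices are plain integers already sorted in ascending order (the function names `delta_encode` and `delta_decode` are illustrative, not taken from the paper):

```python
from itertools import accumulate


def delta_encode(sorted_indices):
    """[i, j, k, ...] -> [i, j - i, k - j, ...] for ascending indices."""
    previous = [0] + sorted_indices[:-1]
    return [b - a for a, b in zip(previous, sorted_indices)]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


block_indices = [17, 18, 19, 21, 40, 41]
deltas = delta_encode(block_indices)      # [17, 1, 1, 2, 19, 1]
restored = delta_decode(deltas)           # [17, 18, 19, 21, 40, 41]
```

Because observed blocks tend to be spatially clustered, most deltas are small and repeat often, which is what makes the resulting vector cheap to entropy code.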
3.4.2 KLT – compression. The full volume $\Phi$ is decomposed into a set $\{x_m\}$ of non-overlapping $8 \times 8 \times 8$ blocks. In this stage, we seek to leverage the learnt basis $P$ and $\mu$ to map each $x \in \mathbb{R}^n$ to a manifold with much lower dimensionality $d$. We again, however, need to deal with the possibility of blocks containing unobserved voxels. At training time, it was possible to apply the Neumann strategy. At test time, however, we cannot increase the truncation region as it would reduce the frame rate of our non-rigid reconstruction pipeline. To this end, we instead minimize the objective:

$$\underset{\hat{X}}{\arg\min} \; \left\| W \left[ x - (P I_d \hat{X} + \mu) \right] \right\|. \tag{8}$$

The matrix $W \in \mathbb{R}^{n \times n}$ is a diagonal matrix where the $i$-th diagonal entry is set to 1 for observed voxels, and 0 otherwise. The least-squares solution of the above optimization problem can be computed in closed form as:

$$\hat{X} = (I_d^T P^T W P I_d)^{-1} \left[ (P I_d)^T W (x - \mu) \right]. \tag{9}$$

Note that when all the voxels in a block are observed this simplifies to Equation 3. Computing $\hat{X}$ requires the solution of a small $d \times d$ dimensional linear problem, which can be computed efficiently.
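The masked projection of Equation 9 can be illustrated with NumPy. This is a sketch under assumptions: `project_block` is a hypothetical helper, `basis` holds the first $d$ columns of $P$ (that is, $P I_d$), and the observed voxels are given as a boolean mask that plays the role of the diagonal of $W$.

```python
import numpy as np


def project_block(x, basis, mean, observed):
    """Solve (B^T W B) X = B^T W (x - mean) for the d latent coefficients.

    x, mean: (n,) arrays; basis: (n, d) array standing for P I_d;
    observed: (n,) boolean mask, i.e. the diagonal of W.
    """
    w = observed.astype(float)
    # Unobserved entries may hold garbage (e.g. NaN), so zero them explicitly.
    residual = np.where(observed, x - mean, 0.0)
    weighted_basis = basis * w[:, None]       # W B (W is diagonal: scale rows)
    lhs = basis.T @ weighted_basis            # B^T W B, a small d x d matrix
    rhs = weighted_basis.T @ residual         # B^T W (x - mean)
    return np.linalg.solve(lhs, rhs)


rng = np.random.default_rng(0)
n, d = 8, 2
basis, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal columns
mean = rng.standard_normal(n)
true_coeffs = np.array([1.5, -0.5])
x = mean + basis @ true_coeffs
observed = np.ones(n, dtype=bool)
observed[[2, 5]] = False
x[~observed] = np.nan                                   # unobserved voxels
coeffs = project_block(x, basis, mean, observed)
```

When every voxel is observed and the basis is orthonormal, `lhs` is the identity and the solve reduces to the plain projection `basis.T @ (x - mean)`.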
3.4.3 Quantization and encoding. We now apply our quantization scheme to the input coefficients $\{\hat{X}_m\}$ via $\hat{Q}_m = Q(\hat{X}_m)$ to obtain the set of quantized coefficients $\{\hat{Q}_m\}$. For each dimension $h$, we apply the range encoder to those coefficients $\{\hat{Q}^h_m\}$, across all blocks, as if they were symbols so as to losslessly compress them, leveraging any low entropy empirical distributions that appear over these symbols. The bitstream from the range encoder is sent directly to the client.
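To make the link between quantization and low-entropy symbols concrete, here is a small sketch. It assumes that $Q$ maps each coefficient to the index of its nearest bin center (a stand-in for Equation 6, which is not reproduced here), and it reports the empirical Shannon entropy as an idealized lower bound on the bits per symbol a range encoder could reach; no actual range coder is implemented, and all names are illustrative.

```python
import math
from bisect import bisect_left
from collections import Counter


def quantize(value, centers):
    """Index of the bin center nearest to value (centers must be sorted)."""
    pos = bisect_left(centers, value)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(centers)]
    return min(candidates, key=lambda i: abs(value - centers[i]))


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


centers = [-1.0, 0.0, 1.0, 2.0]
coefficients = [0.1, -0.2, 0.05, 0.9, 0.0, 0.3, -0.9, 0.2]
symbols = [quantize(c, centers) for c in coefficients]   # [1, 1, 1, 2, 1, 1, 0, 1]
bits = entropy_bits(symbols)
```

With four bins, a fixed-length code would spend 2 bits per symbol; in this toy example the skewed distribution brings the entropy down to roughly 1.06 bits.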
3.4.4 iKLT – decompression. The decompressed signal can be efficiently obtained via Eq. 4, namely: $\hat{x} = P I_d \hat{X} + \mu$. If enough bins were used during quantization, this will result in a highly accurate, but possibly high bitrate, reconstruction.
ACM Trans. Graph., Vol. 37, No. 6, Article 256. Publication date: November 2018.
Fig. 6. (top) A smooth signal is to be compressed by representing each non-overlapping block of 100 samples. (middle) By compressing with KLT4 (or 99% of the variance explained), the signal contains discontinuities at block boundaries. (bottom) Our variational optimization recovers the quantized KLT coefficients, and generates a $C^1$-continuous signal.
3.4.5 iQuantizer – block artifacts removal. Due to our block compression strategy, even when using a number $d$ of KLT basis vectors that explain 99% of the variance in the training data, the decompressed signal presents $C^0$ discontinuities at block boundaries; see Figure 6 for a 2D example. If not enough bins are used during quantization, these can become quite pronounced. To address this problem, we introduce a set of corrections $\tilde{X}$ to the quantized coefficients and consider the corrected reconstruction

$$\tilde{x}_m(\tilde{X}_m) = P I_d (\hat{Q}_m + \tilde{X}_m) + \mu. \tag{10}$$

We know that the corrections lie in the intervals defined by the quantization coefficients. That is,

$$\tilde{X}_m \in \mathcal{C} = [-\epsilon^1_m, \epsilon^1_m] \times \cdots \times [-\epsilon^d_m, \epsilon^d_m], \tag{11}$$

where, for dimension $h \in \{1, \ldots, d\}$, $\epsilon^h_m \in \mathbb{R}$ is half the width of the interval that (6) assigns to bin center $\hat{Q}^h_m$. To find the corrections, we penalize the presence of discontinuities across blocks in the reconstructed signal. We first abstract the iDicer functional block by parameterizing the reconstructed TSDF in the entire volume as $\Phi(z) = \tilde{x}(z \mid \{\tilde{X}_m\}) : \mathbb{R}^3 \to \mathbb{R}$. We can then recover the corrections $\{\tilde{X}_m\}$ via a variational optimization that penalizes discontinuities in the reconstructed signal $\tilde{x}$:

$$\underset{\{\tilde{X}_m\}}{\arg\min} \; \sum_{i,j,k} \left\| \Delta \Phi_{ijk}(z)(\{\tilde{X}_m\}) \right\|^2 \quad \text{where } \{\tilde{X}_m\} \subseteq \mathcal{C}. \tag{12}$$

Here $\Delta$ is the Laplacian operator of a volume, and $ijk$ indexes a voxel in the volume. In practice, the constraints in (11) are hard to deal with, so we relax them as

$$\underset{\{\tilde{X}_m\}}{\arg\min} \; \sum_{i,j,k} \left\| \Delta \Phi_{ijk}(z)(\{\tilde{X}_m\}) \right\|^2 + \lambda \sum_{m} \sum_{h=1}^{d} \left( \frac{\tilde{X}^h_m}{\epsilon^h_m} \right)^2, \tag{13}$$

a linear least squares problem where $\lambda$ is a weight indicating how strongly the relaxed constraint is enforced. See Figure 2 for a 2D illustration of this process, and Figure 7 to see its effects on the rendered mesh model.
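A 1-D analogue of the relaxed problem (13) can be written as a stacked linear least-squares solve. The sketch below makes several assumptions that are not from the paper: the signal is one-dimensional, $\mu = 0$, orthonormalised DCT vectors stand in for the learned KLT basis, quantization is uniform with step `step` (so $\epsilon$ is `step / 2` for every coefficient), and the Laplacian is the discrete second difference. The helper name `deblock_1d` is hypothetical.

```python
import numpy as np


def deblock_1d(q_hat, basis, eps, lam=0.1):
    """Corrections minimising |Laplacian(signal)|^2 + lam * |c / eps|^2.

    q_hat: (M, d) quantised coefficients; basis: (n, d); eps: half bin width.
    Returns the corrections, the uncorrected signal, and the two operators.
    """
    m, d = q_hat.shape
    n = basis.shape[0]
    synth = np.kron(np.eye(m), basis)             # all coefficients -> signal
    before = synth @ q_hat.ravel()                # uncorrected reconstruction
    lap = np.diff(np.eye(m * n), n=2, axis=0)     # 1-D second-difference operator
    lhs = np.vstack([lap @ synth, np.sqrt(lam) / eps * np.eye(m * d)])
    rhs = np.concatenate([-lap @ before, np.zeros(m * d)])
    c, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return c.reshape(m, d), before, synth, lap


n, d, m = 16, 3, 4
t = np.arange(n)
# First d DCT-II vectors, orthonormalised, stand in for the learned basis.
basis, _ = np.linalg.qr(np.cos(np.pi * (t[:, None] + 0.5) * np.arange(d) / n))
signal = np.sin(np.linspace(0.0, np.pi, m * n))
step = 0.5
q_hat = step * np.round(signal.reshape(m, n) @ basis / step)
corrections, before, synth, lap = deblock_1d(q_hat, basis, eps=step / 2)
after = synth @ (q_hat + corrections).ravel()
rough_before = np.sum((lap @ before) ** 2)
rough_after = np.sum((lap @ after) ** 2)
```

Since zero corrections are always admissible, the optimum can never be rougher than the uncorrected signal. Note that the relaxation only discourages, and does not forbid, corrections that leave the interval of (11); a larger `lam` keeps them closer to the quantized values.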
Fig. 7. (left) A low-bitrate TSDF is received by the decoder and visualized as a mesh triangulating its zero crossing. In order to exaggerate the block artifacts for visualization, we choose $K_{total} = 512$ in this example. (middle) Our filtering technique removes these structured artifacts via variational optimization. (right) The ground truth triangulated TSDF.
3.5 Color Compression
Although the scope and contribution of this paper is geometry compression, we also provide a viable solution to deal with texturing. Since we are targeting real-time applications, building a consistent texture atlas [Collet et al. 2015] would violate the real-time requirement. We recall that in our system we use only 8 RGB views; we therefore employ traditional video encoding algorithms to compress all the images and stream them to the client. The receiver collects those images and applies projective texturing to render the final reconstructed mesh.
In particular, we employ a separate thread to process the $N$ streams from the RGB cameras via the real-time H.264 GPU encoder (NVIDIA NVENC). This can be done efficiently on an NVIDIA Quadro P6000, which has two dedicated chips to off-load the video encoding, and therefore does not interfere with the rest of the reconstruction pipeline. In terms of run time, it takes around 5 ms to compress all of the data when both encoding chips on the GPU are used in parallel. The same run time is needed on the client side to decode. The bandwidth required to achieve a level of quality with negligible loss is around 8 Mbps. Note that the RGB streams coming from the $N$ RGB cameras are already temporally consistent, enabling standard H.264 encoding.
3.6 Networking and live demo setup
Our live demonstration system consists of two main parts, connected by a low speed link: a high performance backend collection of capture/fusion workstations (10× HP Z840), a WiFi 802.11a router (Netgear AC1900), and a remote display laptop (Razer Blade Pro). The backend system divides the capture load into a hub and spoke model. Cameras are grouped into pods, each consisting of two IR cameras for depth and one color camera for texture; each camera operates at 1280×1024 @ 30 Hz. As detailed in Section 3.1, each pod also has an active IR dot pattern emitter to support active stereo.
Fig. 8. Measuring the impact of the number of employed KLT basis vectors on (top-left) reconstruction quality and (top-right) bitrate. (bottom) Qualitative examples when retaining different numbers of basis vectors. Note we keep $K_{total} = 2048$ fixed in this experiment. [Plot axes: number of retained bases (0 to 96) versus reconstruction error (mm) and bitrate (Mbps); qualitative panels: GT, 16, 32, 48 and 64 bases.]
In our demo system, there are eight pods divided among the capture workstations. Each capture workstation performs synchronization, rectification, color processing, active stereo matching, and segmentation. Each capture workstation is directly connected to a central fusion workstation using 40 GigE links. Capture workstations transmit segmented color (24 bit RGB) and depth (16 bit) images over TCP using a request/serve loop, with run-length compression to skip over discarded regions of the view. On average, this requires around 380 Mbps per pod, a fraction of the worst case of 1572 Mbps. A “fusion box” workstation combines the eight depth maps and color images into a single textured mesh for local previewing. To support real-time streaming of the volume and remote viewing, the fusion box also compresses the TSDF volume representation and the eight color textures. The compressed representation, using the process described in Section 3.4 for the mesh and hardware H.264 encoding for the combined textures, is computed for every fused frame at 30 Hz. On average each compressed frame requires about 20 KB. The client laptop, connected via Wi-Fi, uses a request/serve loop to receive the compressed data and performs decompression and display locally.
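The worst-case per-pod figure quoted in this section can be reproduced with simple arithmetic, assuming that each pod sends one uncompressed 24-bit color image and one 16-bit depth image per frame at full resolution. The conversion of the 20 KB average frame size into a stream bitrate additionally assumes 1 KB = 1000 bytes; both are our reading of the numbers rather than statements from the text.

```python
WIDTH, HEIGHT, FPS = 1280, 1024, 30
COLOR_BITS, DEPTH_BITS = 24, 16

# Worst case: every pixel of the color and depth images is sent uncompressed.
pixels_per_second = WIDTH * HEIGHT * FPS
worst_case_mbps = pixels_per_second * (COLOR_BITS + DEPTH_BITS) / 1e6  # 1572.864

# Average compressed frame streamed to the client (assumes 1 KB = 1000 bytes).
FRAME_KB = 20
stream_mbps = FRAME_KB * 1000 * 8 * FPS / 1e6                           # 4.8
```

Under these assumptions the capture links carry on the order of a gigabit per pod in the worst case, whereas the compressed stream to the client stays within a few Mbps.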
4 EVALUATION
In this section, we perform quantitative and qualitative analysis of our algorithm, and compare it to our re-implementation of state-of-the-art techniques. We urge the reader to watch the accompanying video, where further comparisons are provided, as well as a live recording of our demo streaming a 4D performance in real-time via WiFi to a consumer-level laptop. Note that in our experiments we focus our analysis around two core aspects:
• the requirement for real-time (30 fps in both encoder/decoder) performance
• the compression of geometry (i.e. not color/texture)
We acquire three RGBD datasets by capturing the performance of three different subjects at 30 Hz. Each sequence, 500 frames in length, was recorded with the setup described in Section 3.6. To train the KLT basis of Section 3.2 we randomly selected 50 frames from Sequence 1. We then use this basis to test on the remaining 1450 frames.
We generate high quality reconstructions that are then used as ground-truth meshes. To do so we attempted to re-implement the FVV pipeline of Collet et al. [2015]. We first aggregate the depth maps into a shared coordinate frame, and then register the multiple surfaces to each other via local optimization [Bouaziz et al. 2013] to generate a merged (oriented) point cloud. We then generate a watertight surface by solving a Poisson problem [Kazhdan and Hoppe 2013], where screening constraints enforce that parts of the volume be outside as expressed by the visual hull. We regard these frames as ground truth reconstructions. Note that this process does not run in real-time, with an average execution budget of 2 min/frame. We use it only for quantitative evaluations, and it is not employed in our demo system.
Typically, to decide the degree to which two geometric representations are identical, the Hausdorff distance is employed [Cignoni et al. 1998]. However, as a single outlier might bias our metric, in our experiments we employ an $\ell_1$ relaxation of the Hausdorff metric [Tkach et al. 2016, Eq. 15]. Given our decompressed model $X$, and a ground truth mesh $Y$, our metric is the (average, and area weighted) $\ell_1$ medians of (mutual) closest point distances:

$$H(X, Y) = \frac{1}{2A(X)} \sum_{x \in X} A(x) \inf_{y \in Y} d(x, y) + \frac{1}{2A(Y)} \sum_{y \in Y} A(y) \inf_{x \in X} d(x, y)$$

where $x$ is a vertex in the mesh $X$, and $A(\cdot)$ returns the area of a vertex, or of the entire mesh according to its argument.
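A brute-force sketch of this metric is given below. It assumes that the closest point is searched only among the vertices of the other mesh (rather than on its surface), that per-vertex areas are supplied by the caller, and that $A(X)$ is the sum of those vertex areas; the function names are illustrative.

```python
import math


def one_sided(points_a, areas_a, points_b):
    """Area-weighted mean distance from each vertex of A to its closest vertex of B."""
    total = sum(area * min(math.dist(p, q) for q in points_b)
                for p, area in zip(points_a, areas_a))
    return total / sum(areas_a)


def relaxed_hausdorff(points_x, areas_x, points_y, areas_y):
    """Symmetric metric H(X, Y): the average of the two one-sided terms."""
    return 0.5 * (one_sided(points_x, areas_x, points_y)
                  + one_sided(points_y, areas_y, points_x))


X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
Y = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
h = relaxed_hausdorff(X, [1.0, 1.0], Y, [2.0, 2.0])   # 1.0
```

The quadratic cost of the nested loops is acceptable for a toy example; on meshes with hundreds of thousands of vertices a spatial acceleration structure would be needed.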
4.2 Impact of compression parameters
In our algorithm, reconstruction error, as measured by $H$, can be attributed to two different parameters: (Section 4.2.1) the number of learned KLT basis vectors $d$; and (Section 4.2.2) the number of quantization bins $K_{total}$.
4.2.1 Impact of KLT dimensionality – Figure 8. We show the impact of reducing the number $d$ of basis vectors on the reconstruction error as well as on the bitrate. Although the bitrate increases almost linearly in $d$, the reconstruction error plateaus for $d > 48$. In the qualitative examples, when $d = 16$, blocky artifacts remain even after our global solve attempts to remove them. When $d = 32$, the reconstruction error is still high, but the final global solve manages to remove most of the blocky artifacts, and projective texturing makes the results visually acceptable. When $d \geq 48$ it is difficult to notice any differences from the ground truth.
4.2.2 Impact of quantization – Figure 9. We now investigate the impact of quantization when fixing the number of retained basis vectors. As shown in the figure, in contrast to $d$, the bitrate changes sub-linearly with the total number of assigned bins $K_{total}$. When $K_{total} = 2048$, the compressor produces a satisfying visual output with a bandwidth comparable to high-definition (HD) broadcast.
Fig. 9. Measuring the impact of quantization on (top-left) reconstruction quality and (top-right) bitrate. (bottom) Qualitative examples when employing a different number of quantization bins. Note we keep $d = 48$ fixed in this experiment. [Plot axes: number of total quantization bins (1024 to 4096) versus reconstruction error (mm) and bitrate (Mbps); qualitative panels: GT, 1024, 2048, 3072 and 4096 bins.]
Table 1. Average timings for encoding a single mesh of 300k faces (175k vertices) across different methods. On the encoder side, our compressor (with $K_{total} = 2048$, $d = 48$) is orders of magnitude faster than existing techniques. Since our implementation of FVV [Collet et al. 2015] is not efficient, we report the encoding speed from their paper when 70 workstations were used; note that no decoding framerate was reported.
Method              Encoding          Decoding
Ours (CPU / GPU)    17 ms / 2 ms      19 ms / 3 ms
Draco (qp=8 / 14)   180 ms / 183 ms   81 ms / 98 ms
FVV                 25 seconds        N/A
JP3D                14 seconds        25 seconds
4.3 Comparisons to the state-of-the-art – Figure 10
We compare with two state-of-the-art geometry compression algorithms: FVV [Collet et al. 2015] and Draco [Galligan et al. 2018]. Whilst Draco is tailored for quasi-lossless static mesh compression, FVV is designed to compress 4D acquired data similar to the scenario we are targeting. Qualitative comparisons are reported in Figure 11, as well as in the supplementary video.
The Google Draco [Galligan et al. 2018] mesh compression library implements a modern variant of EdgeBreaker [Rossignac 1999], where a triangle spanning tree is built by traversing vertices one at a time. As such, this architecture is difficult to parallelize, and therefore not well-suited for real-time applications; see Table 1. Draco uses the parameter qp to control the level of location quantization, i.e., the trade-off between compression rate and loss. With our dataset, we observe artifacts caused by collapsing vertices when qp ≤ 8, hence we only vary qp between 8 and 14 (the recommended setting). The computational time increases with this parameter. At the same level of reconstruction error, their output file size is 8× larger than ours; see Figure 10 (left).
Fig. 10. (left) Comparison with prior art on our dataset. Note that since some results are orders of magnitude apart, semilogy axes are used. To generate a curve for Draco [Galligan et al. 2018], we vary the quantization bits parameter (qp). (right) Comparison with FVV [Collet et al. 2015] on their BREAKDANCERS sequence. FVV’s bitrate was obtained from their paper (by removing the declared texture bitrate). Our rate-distortion curve has been generated by varying the number of KLT basis vectors $d$. [Plot axes: bitrate (Mbps) versus reconstruction error (mm); left legend: Ours, FVV (tracked), FVV (untracked), Draco, JP3D; right legend: Ours, FVV (tracked).]
As detailed in Section 4.1, we re-implemented Free Viewpoint Video [Collet et al. 2015] in order to generate high quality, uncompressed reconstructions and compare their method on our acquired dataset. The untracked, per-frame reconstructions are used as ground-truth meshes. The method in Collet et al. [2015] tracks meshes over time and performs an adaptive decimation to preserve the details on the subject’s face. We compare our system with FVV tracked, where both tracking and decimation are employed, and with FVV untracked, where we only use the smart decimation and avoid the expensive non-rigid tracking of the meshes.
Our method outperforms FVV (tracked) in terms of reconstruction error and bitrate. FVV (untracked), which is still decimated, generally has lower error but prohibitive bitrates (around 128 Mbps). Figure 10 shows that our method ($K_{total} = 2048$, $d = 32$) achieves similar error to the untracked version of FVV with 24× lower bitrate. Notice that whereas FVV focuses on exploiting dynamics in the scene to achieve compelling compression performance, our method mostly focuses on encoding and transmitting the keyframes; we could therefore potentially combine the two to achieve better compression rates. Finally, it is important to highlight again that FVV is much slower than our method. Collet et al. [2015] report that the tracked version runs at 30 minutes per frame on a single workstation.
We also compare our approach with JP3D [Schelkens et al. 2006], an extension of the JPEG2000 codec to volumetric data. Similar to our approach, JP3D decomposes volumetric data into blocks and processes them independently. However, while our algorithm uses KLT as a data-driven transformation, JP3D uses a generic discrete wavelet transform to process the blocks. The wavelet coefficients are then quantized before being compressed with an arithmetic encoder. In the experiment, we use the implementation in OpenJPEG, where 3 successive layers are chosen with compression ratios 20, 10, and 5. The other parameters are kept as default. Since JP3D assumes a dense 3D volume as input, even though it does not use expensive temporal tracking, encoding each frame still takes 14 seconds while decoding takes 25 seconds. With similar reconstruction error, their output bitrate is 131 Mbps, which is still one order of magnitude larger than ours.
5 LIMITATIONS AND FUTURE WORK
We have introduced the first compression architecture for dynamic geometry capable of reaching real-time encoding/decoding performance. Our fundamental contribution is to pivot away from triangular meshes towards implicit representations, and, in particular, the TSDFs commonly used for non-rigid tracking. Thanks to the simple memory layout of a TSDF, the volume can be decomposed into equally-sized blocks, and encoding/decoding efficiently performed on each of them independently, in parallel.
In contrast to [Collet et al. 2015], which utilizes temporal information, our work focuses on encoding an entire TSDF volume. Nonetheless, the compression of post motion-compensation residuals truly lies at the foundation of the excellent compression performance of modern video encoders [Richardson 2011]. An interesting direction for future work would be to leverage the $SE(3)$ cues provided by motion2fusion [Dou et al. 2017] to encode post motion-compensation residuals. While designed for efficiency, our KLT compression scheme is linear, and not particularly suited to represent piecewise $SE(3)$ transformations of rest-pose geometry. An interesting avenue for future work would be the investigation of non-linear latent embeddings that could produce more compact, and hence easier to compress, encodings. Despite their evaluation in real-time remaining a challenge, convolutional auto-encoders seem to be a perfect choice for the task at hand [Rippel and Bourdev 2017].
REFERENCES
Sofien Bouaziz, Andrea Tagliasacchi, and Mark Pauly. 2013. Sparse Iterative Closest Point. Computer Graphics Forum (Symposium on Geometry Processing) (2013).
Stéphane Calderon and Tamy Boubekeur. 2017. Bounding proxies for shape approximation. ACM Transactions on Graphics (TOG) 36, 4 (2017), 57.
Daniel-Ricao Canelhas, Erik Schaffernicht, Todor Stoyanov, Achim J Lilienthal, and Andrew J Davison. 2017. Compressed Voxel-Based Mapping Using Unsupervised Learning. Robotics (2017).
Paolo Cignoni, Claudio Rocchini, and Roberto Scopigno. 1998. Metro: Measuring error on simplified surfaces. (1998).
David Cohen-Steiner, Pierre Alliez, and Mathieu Desbrun. 2004. Variational shape approximation. ACM Trans. on Graphics (Proc. of SIGGRAPH) (2004).
Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. 2015. High-quality streamable free-viewpoint video. ACM Trans. on Graphics (TOG) (2015).
Brian Curless and Marc Levoy. 1996. A volumetric method for building complex models from range images. In SIGGRAPH.
B. R. de Araujo, Daniel S. Lopes, Pauline Jepp, Joaquim A. Jorge, and Brian Wyvill. 2015. A Survey on Implicit Surface Polygonization. Comput. Surveys (2015).
Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. 2017. Motion2fusion: real-time volumetric performance capture. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) (2017).
Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. 2016. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG) (2016).
Frank Galligan, Michael Hemmer, Ondrej Stava, Fan Zhang, and Jamieson Brettle. 2018. Google/Draco: a library for compressing and decompressing 3D geometric meshes and point clouds. https://github.com/google/draco. (2018).
Michael Garland and Paul S Heckbert. 1997. Surface simplification using quadric error metrics. In Proc. of ACM SIGGRAPH.
Allan Grønlund, Kasper Green Larsen, Alexander Mathiasen, Jesper Sindahl Nielsen, Stefan Schneider, and Mingzhou Song. 2017. Fast exact k-means, k-medians and bregman divergence clustering in 1d. arXiv preprint arXiv:1701.07204 (2017).
Hugues Hoppe. 1996. Progressive Meshes. In Proc. of ACM SIGGRAPH.
Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc Stamminger. 2016. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proc. of the European Conf. on Comp. Vision. 362–379.
Palle ET Jorgensen and Myung-Sin Song. 2007. Entropy encoding, Hilbert space, and Karhunen-Loeve transforms. J. Math. Phys. 48, 10 (2007), 103503.
Zachi Karni and Craig Gotsman. 2000. Spectral compression of mesh geometry. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co.
Zachi Karni and Craig Gotsman. 2004. Compression of soft-body animation sequences. Computers & Graphics (2004).
Michael Kazhdan and Hugues Hoppe. 2013. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG) (2013).
Adarsh Kowdle, Christoph Rhemann, Sean Fanello, Andrea Tagliasacchi, Jonathan Taylor, Philip Davidson, Mingsong Dou, Kaiwen Guo, Cem Keskin, Sameh Khamis, David Kim, Danhang Tang, Vladimir Tankovich, Julien Valentin, and Shahram Izadi. 2018. The Need 4 Speed in Real-Time Dense Visual Tracking. (2018).
Bruno Lévy and Hao (Richard) Zhang. 2010. Spectral Mesh Processing. In ACM SIGGRAPH Courses.
William E Lorensen and Harvey E Cline. 1987. Marching cubes: A high resolution 3D surface construction algorithm. In Computer Graphics (Proc. SIGGRAPH), Vol. 21. 163–169.
Adrien Maglo, Guillaume Lavoué, Florent Dupont, and Céline Hudelot. 2015. 3d mesh compression: Survey, comparisons, and emerging trends. Comput. Surveys (2015).
G Nigel N Martin. 1979. Range encoding: an algorithm for removing redundancy from a digitised message. In Proc. IERE Video & Data Recording Conf., 1979.
MPEG4/AFX. 2008. ISO/IEC 14496-16: MPEG-4 Part 16, Animation Framework eXtension (AFX). Technical Report. The Moving Picture Experts Group. https://mpeg.chiariglione.org/standards/mpeg-4/animation-framework-extension-afx
Richard A Newcombe, Dieter Fox, and Steven M Seitz. 2015. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. of Comp. Vision and Pattern Recognition (CVPR).
Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L Davidson, Sameh Khamis, Mingsong Dou, et al. 2016. Holoportation: Virtual 3d teleportation in real-time. In Proc. of the Symposium on User Interface Software and Technology.
Jingliang Peng, Chang-Su Kim, and C-C Jay Kuo. 2005. Technologies for 3D mesh compression: A survey. Journal of Visual Communication and Image Representation (2005).
Fabián Prada, Misha Kazhdan, Ming Chuang, Alvaro Collet, and Hugues Hoppe. 2017. Spatiotemporal atlas parameterization for evolving meshes. ACM Trans. on Graphics (TOG) (2017).
Iain E Richardson. 2011. The H.264 advanced video compression standard. John Wiley & Sons.
Oren Rippel and Lubomir Bourdev. 2017. Real-time adaptive image compression. arXiv preprint arXiv:1705.05823 (2017).
Jarek Rossignac. 1999. Edgebreaker: Connectivity compression for triangle meshes. IEEE Transactions on Visualization and Computer Graphics (1999).
Michael Ruhnke, Liefeng Bo, Dieter Fox, and Wolfram Burgard. 2013. Compact RGBD Surface Models Based on Sparse Coding. (2013).
Peter Schelkens, Adrian Munteanu, Joeri Barbarien, Mihnea Galca, Xavier Giro-Nieto, and Jan Cornelis. 2003. Wavelet coding of volumetric medical datasets. IEEE Transactions on Medical Imaging (2003).
Peter Schelkens, Adrian Munteanu, Alexis Tzannes, and Christopher M. Brislawn. 2006. JPEG2000 Part 10 - Volumetric data encoding. (2006).
Olga Sorkine, Daniel Cohen-Or, and Sivan Toledo. 2003. High-Pass Quantization for Mesh Encoding. In Proc. of the Symposium on Geometry Processing.
Robert W Sumner, Johannes Schmid, and Mark Pauly. 2007. Embedded deformation for shape manipulation. ACM Trans. on Graphics (TOG) (2007).
Jean-Marc Thiery, Emilie Guy, and Tamy Boubekeur. 2013. Sphere-Meshes: Shape Approximation using Spherical Quadric Error Metrics. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) (2013).
Jean-Marc Thiery, Émilie Guy, Tamy Boubekeur, and Elmar Eisemann. 2016. Animated Mesh Approximation With Sphere-Meshes. ACM Trans. on Graphics (TOG) (2016).
Anastasia Tkach, Mark Pauly, and Andrea Tagliasacchi. 2016. Sphere-Meshes for Real-Time Hand Modeling and Tracking. ACM Transaction on Graphics (Proc. SIGGRAPH Asia) (2016).
Sébastien Valette and Rémy Prost. 2004. Wavelet-based progressive compression scheme for triangle meshes: Wavemesh. IEEE Transactions on Visualization and Computer Graphics 10, 2 (2004), 123–129.
Libor Vasa and Vaclav Skala. 2011. A perception correlated comparison method for dynamic meshes. IEEE Transactions on Visualization and Computer Graphics (2011).
Fig. 11. Qualitative comparisons on three test sequences. We compare our approach with Draco [Galligan et al. 2018] (qp=14) and FVV [Collet et al. 2015]. Notice how FVV over-smooths details, especially in the body regions. Draco requires bitrates that are 10 times larger than ours, with results that are comparable to ours in terms of visual quality. [Panel columns: Draco [Galligan et al. 2018], FVV [Collet et al. 2015], Proposed Method.]
ACM Trans. Graph., Vol. 37, No. 6, Article 256. Publication date: November 2018.
... to capture and stream this data [7,10,16,53,69] require cumbersome capture technology, such as a large number of calibrated cameras or depth sensors, and the expert knowledge to install and deploy these systems. Videoconferencing, on the other hand, simply requires a single video camera, such as those found on common consumer devices, e.g. ...
Chapter
Full-text available
We present the first approach to volumetric performance capture and novel-view rendering at real-time speed from monocular video, eliminating the need for expensive multi-view systems or cumbersome pre-acquisition of a personalized template model. Our system reconstructs a fully textured 3D human from each frame by leveraging Pixel-Aligned Implicit Function (PIFu). While PIFu achieves high-resolution reconstruction in a memory-efficient manner, its computationally expensive inference prevents us from deploying such a system for real-time applications. To this end, we propose a novel hierarchical surface localization algorithm and a direct rendering method without explicitly extracting surface meshes. By culling unnecessary regions for evaluation in a coarse-to-fine manner, we successfully accelerate the reconstruction by two orders of magnitude from the baseline without compromising the quality. Furthermore, we introduce an Online Hard Example Mining (OHEM) technique that effectively suppresses failure modes due to the rare occurrence of challenging examples. We adaptively update the sampling probability of the training data based on the current reconstruction accuracy, which effectively alleviates reconstruction artifacts. Our experiments and evaluations demonstrate the robustness of our system to various challenging angles, illuminations, poses, and clothing styles. We also show that our approach compares favorably with the state-of-the-art monocular performance capture. Our proposed approach removes the need for multi-view studio settings and enables a consumer-accessible solution for volumetric capture.
... to capture and stream this data [7,10,16,53,69] require cumbersome capture technology, such as a large number of calibrated cameras or depth sensors, and the expert knowledge to install and deploy these systems. Videoconferencing, on the other hand, simply requires a single video camera, such as those found on common consumer devices, e.g. ...
... to capture and stream this data [7,10,16,53,69] require cumbersome capture technology, such as a large number of calibrated cameras or depth sensors, and the expert knowledge to install and deploy these systems. Videoconferencing, on the other hand, simply requires a single video camera, such as those found on common consumer devices, e.g. ...
Preprint
Full-text available
We present the first approach to volumetric performance capture and novel-view rendering at real-time speed from monocular video, eliminating the need for expensive multi-view systems or cumbersome pre-acquisition of a personalized template model. Our system reconstructs a fully textured 3D human from each frame by leveraging Pixel-Aligned Implicit Function (PIFu). While PIFu achieves high-resolution reconstruction in a memory-efficient manner, its computationally expensive inference prevents us from deploying such a system for real-time applications. To this end, we propose a novel hierarchical surface localization algorithm and a direct rendering method without explicitly extracting surface meshes. By culling unnecessary regions for evaluation in a coarse-to-fine manner, we successfully accelerate the reconstruction by two orders of magnitude from the baseline without compromising the quality. Furthermore, we introduce an Online Hard Example Mining (OHEM) technique that effectively suppresses failure modes due to the rare occurrence of challenging examples. We adaptively update the sampling probability of the training data based on the current reconstruction accuracy, which effectively alleviates reconstruction artifacts. Our experiments and evaluations demonstrate the robustness of our system to various challenging angles, illuminations, poses, and clothing styles. We also show that our approach compares favorably with the state-of-the-art monocular performance capture. Our proposed approach removes the need for multi-view studio settings and enables a consumer-accessible solution for volumetric capture.
... In the literature, many works have addressed the compression of 3D meshes and point clouds [38][39][40]. Spectral compression methods use the subspace of the eigenvector matrix U_c[i] to encode the geometry of a 3D mesh. ...
Article
Recently, spectral methods have been extensively used in the processing of 3D meshes. They usually take advantage of unique properties of the eigenvalues and eigenvectors of the decomposed Laplacian matrix. However, despite their superior performance, they suffer from high computational complexity, especially as the number of vertices of the model increases. In this work, we suggest a fast and efficient spectral processing approach for dense static and dynamic 3D meshes that is well suited to real-time denoising and compression applications. To increase the computational efficiency of the method, we exploit potential spectral coherence between adjacent parts of a mesh and then apply an orthogonal iteration approach to track the graph Laplacian eigenspaces. Additionally, we present a dynamic version that automatically identifies the optimal subspace size that satisfies a given reconstruction quality threshold. In this way, we overcome the perceptual distortions caused by using a single fixed subspace size for all the separated parts. Extensive simulations carried out using different 3D models in different use cases (i.e., compression and denoising) showed that the proposed approach is very fast, especially in comparison with SVD-based spectral processing approaches, while the quality of the reconstructed models is similar or even better. The experimental analysis also showed that the proposed approach could be used by other denoising methods as a preprocessing step, in order to improve the reconstruction quality of their results and decrease their computational complexity, since they then need fewer iterations to converge.
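The orthogonal iteration mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic 6x6 symmetric matrix, and the perturbation size are all assumptions made for illustration, and the sketch tracks the dominant eigenspace of a generic symmetric matrix (how the paper maps the graph Laplacian onto such a matrix is not stated in the abstract). The warm start at the end mimics the idea of reusing the subspace of one mesh part as the starting point for a spectrally similar neighbor.

```python
import numpy as np

def orthogonal_iteration(A, Q, num_iters):
    """Refine an orthonormal basis Q (n x k) toward the dominant
    k-dimensional invariant subspace of the symmetric matrix A."""
    for _ in range(num_iters):
        Q, _ = np.linalg.qr(A @ Q)  # multiply, then re-orthonormalize
    return Q

def subspace_error(Q, V):
    """Frobenius distance between the projectors onto span(Q) and span(V)."""
    return np.linalg.norm(Q @ Q.T - V @ V.T)

# Synthetic symmetric matrix with known eigenvectors (illustrative only).
rng = np.random.default_rng(0)
n, k = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag([10.0, 9.0, 1.0, 0.5, 0.1, 0.05]) @ U.T
V_A = U[:, :k]  # exact dominant eigenvectors of A

# Cold start: random orthonormal basis, many iterations.
Q0, _ = np.linalg.qr(rng.standard_normal((n, k)))
Q_A = orthogonal_iteration(A, Q0, num_iters=50)
cold_err = subspace_error(Q_A, V_A)

# Warm start: a slightly perturbed matrix stands in for an adjacent part;
# reusing Q_A means only a few iterations are needed.
M = rng.standard_normal((n, n))
B = A + 1e-3 * (M + M.T) / 2
_, eigvecs = np.linalg.eigh(B)   # eigenvalues in ascending order
V_B = eigvecs[:, -k:]            # dominant eigenvectors of B
Q_B = orthogonal_iteration(B, Q_A, num_iters=3)
warm_err = subspace_error(Q_B, V_B)
```

Each iteration costs one matrix product and one thin QR factorization, which is why a good initial subspace translates directly into less work than a fresh decomposition.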
... Accordingly, the capture and transmission of FVV requires a large number of cameras and expensive bandwidth [3]. In [4], a realtime compression architecture for 4D performance capture is proposed to realize real-time transmission of 3D data sets while achieving comparable visual quality and bitrate. An encoder that leverages an implicit representation is introduced to represent the observed geometry, as well as its changes through time. ...
Article
Depth image-based rendering (DIBR) plays an important role in 3D video and free viewpoint video synthesis. However, artifacts might occur in the synthesized view due to viewpoint changes and stereo depth estimation errors. Holes are usually out-of-field regions and disocclusions, and filling them appropriately becomes a challenge. In this paper, a virtual view synthesis approach based on asymmetric bidirectional DIBR is proposed. A depth image preprocessing method is applied to detect and correct unreliable depth values around the foreground edges. For the primary view, all pixels are warped to the virtual view by the modified DIBR method. For the auxiliary view, only the selected regions are warped, which contain the contents that are not visible in the primary view. This approach reduces the computational cost and prevents irrelevant foreground pixels from being warped to the holes. During the merging process, a color correction approach is introduced to make the result appear more natural. In addition, a depth-guided inpainting method is proposed to handle the remaining holes in the merged image. Experimental results show that, compared with bidirectional DIBR, the proposed rendering method can reduce rendering time by about 37% and achieve 97% hole reduction. In terms of visual quality and objective evaluation, our approach performs better than the previous methods.
Article
We present Motion2Fusion, a state-of-the-art 360 performance capture system that enables real-time reconstruction of arbitrary non-rigid scenes. We provide three major contributions over prior work: 1) a new non-rigid fusion pipeline allowing for far more faithful reconstruction of high frequency geometric details, avoiding the over-smoothing and visual artifacts observed previously; 2) a high speed pipeline coupled with a machine learning technique for 3D correspondence field estimation, reducing tracking errors and artifacts that are attributed to fast motions; 3) a backward and forward non-rigid alignment strategy that more robustly deals with topology changes but is still free from scene priors. Our novel performance capture system demonstrates real-time results nearing 3x speed-up from previous state-of-the-art work on the exact same GPU hardware. Extensive quantitative and qualitative comparisons show more precise geometric and texturing results with fewer artifacts due to fast motions or topology changes than prior art.
Article
In order to deal with the scaling problem of volumetric map representations, we propose spatially local methods for high-ratio compression of 3D maps, represented as truncated signed distance fields. We show that these compressed maps can be used as meaningful descriptors for selective decompression in scenarios relevant to robotic applications. As compression methods, we compare PCA-derived low-dimensional bases with nonlinear auto-encoder networks. Selecting two application-oriented performance metrics, we evaluate the impact of different compression rates on reconstruction fidelity as well as on the task of map-aided ego-motion estimation. It is demonstrated that lossily reconstructed distance fields used as cost functions for ego-motion estimation can outperform the original maps in challenging scenarios from standard RGB-D (color plus depth) data sets due to the rejection of high-frequency noise content.
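The PCA-derived low-dimensional basis described in this abstract can be sketched as follows. This is a hedged illustration, not the paper's pipeline: the 8x8x8 block size, the helper names (`fit_pca_basis`, `encode`, `decode`), and the synthetic data are assumptions. For simplicity the blocks hold untruncated signed distances to random planes, which are linear in the voxel coordinates and therefore compress exactly into four coefficients; real truncated fields would only be approximated.

```python
import numpy as np

def fit_pca_basis(blocks, k):
    """Fit PCA to flattened blocks; returns (mean, basis) with k orthonormal rows."""
    X = blocks.reshape(len(blocks), -1)
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def encode(blocks, mean, basis):
    """Project each flattened block onto the basis: one k-vector per block."""
    X = blocks.reshape(len(blocks), -1)
    return (X - mean) @ basis.T

def decode(codes, mean, basis, shape):
    """Reconstruct blocks of the given shape from their codes."""
    return (codes @ basis + mean).reshape((-1,) + shape)

# Synthetic training set: signed distance to a random plane, sampled on a grid.
rng = np.random.default_rng(0)
B = 8  # block edge length in voxels (assumed)
g = np.arange(B, dtype=float)
zz, yy, xx = np.meshgrid(g, g, g, indexing="ij")
normals = rng.standard_normal((500, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
offsets = rng.uniform(-B, B, size=500)
blocks = (normals[:, 0, None, None, None] * xx
          + normals[:, 1, None, None, None] * yy
          + normals[:, 2, None, None, None] * zz
          + offsets[:, None, None, None])

mean, basis = fit_pca_basis(blocks, k=4)
codes = encode(blocks, mean, basis)            # 512 voxels -> 4 coefficients
recon = decode(codes, mean, basis, (B, B, B))
max_err = np.abs(recon - blocks).max()
```

In this toy setting each block shrinks from 512 voxel values to 4 coefficients with reconstruction error at floating-point level. Lowering `k` or using truncated, noisy data trades fidelity for compression rate, which is the trade-off the abstract evaluates.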
Article
In this paper, we describe a novel approach to construct compact colored 3D environment models representing local surface attributes via sparse coding. Our method decomposes a set of colored point clouds into local surface patches and encodes them based on an overcomplete dictionary. Instead of storing the entire point cloud, we store a dictionary, surface patch positions, and a sparse code description of the depth and RGB attributes for every patch. The dictionary is learned in an unsupervised way from surface patches sampled from indoor maps. We show that better dictionaries can be learned by extending the K-SVD method with a binary weighting scheme that ignores undefined surface cells. Through experimental evaluation on real world laser and RGBD datasets we demonstrate that our method produces compact and accurate models. Furthermore, we clearly outperform an existing state of the art method in terms of compactness, accuracy, and computation time. Additionally, we demonstrate that our sparse code descriptions can be utilized for other important tasks such as object detection.
Conference Paper
The advent of consumer depth cameras has incited the development of a new cohort of algorithms tackling challenging computer vision problems. The primary reason is that depth provides direct geometric information that is largely invariant to texture and illumination. As such, substantial progress has been made in human and object pose estimation, 3D reconstruction and simultaneous localization and mapping. Most of these algorithms naturally benefit from the ability to accurately track the pose of an object or scene of interest from one frame to the next. However, commercially available depth sensors (typically running at 30fps) can allow for large inter-frame motions to occur that make such tracking problematic. A high frame rate depth camera would thus greatly ameliorate these issues, and further increase the tractability of these computer vision problems. Nonetheless, the depth accuracy of recent systems for high-speed depth estimation [Fanello et al. 2017b] can degrade at high frame rates. This is because the active illumination employed produces a low SNR and thus a high exposure time is required to obtain a dense accurate depth image. Furthermore, in the presence of rapid motion, longer exposure times produce artifacts due to motion blur and necessitate a lower frame rate, which introduces large inter-frame motion that often yields tracking failures. In contrast, this paper proposes a novel combination of hardware and software components that avoids the need to compromise between a dense accurate depth map and a high frame rate. We document the creation of a full 3D capture system for high speed and quality depth estimation, and demonstrate its advantages in a variety of tracking and reconstruction tasks. We extend the state of the art active stereo algorithm presented in Fanello et al. [2017b] by adding a space-time feature in the matching phase.
We also propose a machine learning based depth refinement step that is an order of magnitude faster than traditional postprocessing methods. We quantitatively and qualitatively demonstrate the benefits of the proposed algorithms in the acquisition of geometry in motion. Our pipeline executes in 1.1ms leveraging modern GPUs and off-the-shelf cameras and illumination components. We show how the sensor can be employed in many different applications, from [non-]rigid reconstructions to hand/face tracking. Further, we show many advantages over existing state of the art depth camera technologies beyond framerate, including latency, motion artifacts, multi-path errors, and multi-sensor interference.
Article
Many computer graphics applications use simpler yet faithful approximations of complex shapes to reliably conduct part of their computations. Some tasks, such as physical simulation, collision detection, occlusion queries or free-form deformation, require the simpler proxy to strictly enclose the input shape. While there are algorithms that can output such bounding proxies on simple input shapes, most of them fail at generating a proper coarse approximant on real-world complex shapes, which may contain multiple components and have a high genus. We advocate that, before reducing the number of primitives to describe a shape, one needs to regularize it while maintaining the strict enclosing property, to avoid any geometric aliasing that makes the decimation unreliable. Depending on the scale of the desired approximation, the topology of the shape itself may indeed have to be first simplified, to let the subsequent geometric optimization be free from topological locks. We propose a new bounding shape approximation algorithm which takes as input an arbitrary surface mesh, with potentially complex multi-component structures, and automatically generates a bounding proxy which is tightened on the input and can match even the coarsest levels of approximation. To sustain the nonlinear approximation process that may eventually abstract both geometry and topology, we propose to use an intermediate regularized representation in the form of a shape closing, computed in real time using a new fast morphological framework designed for efficient parallel execution. Once the desired level of approximation is reached in the shape closing, a coarse, tight and bounding polygonization of the proxy geometry is extracted using an adaptive meshing scheme.
Our underlying representation is both geometry- and topology-adaptive and can be optionally controlled accurately by a user, through sizing and orientation fields, yielding an intuitive brush metaphor within an interactive proxy design environment. We provide extensive experiments on various kinds of input meshes and illustrate the potential applications of our method in scenarios that benefit greatly from coarse, tight bounding substitutes to the actual high resolution geometry of the original 3D model, including freeform deformation, physical simulation and level of detail generation for rendering.
Article
We convert a sequence of unstructured textured meshes into a mesh with incrementally changing connectivity and atlas parameterization. Like prior work on surface tracking, we seek temporally coherent mesh connectivity to enable efficient representation of surface geometry and texture. Like recent work on evolving meshes, we pursue local remeshing to permit tracking over long sequences containing significant deformations or topological changes. Our main contribution is to show that both goals are realizable within a common framework that simultaneously evolves both the set of mesh triangles and the parametric map. Sparsifying the remeshing operations allows the formation of large spatiotemporal texture charts. These charts are packed as prisms into a 3D atlas for a texture video. Reducing tracking drift using mesh-based optical flow helps improve compression of the resulting video stream.
Article
We present a machine learning-based approach to lossy image compression which outperforms all existing codecs, while running in real-time. Our algorithm typically produces files 2.5 times smaller than JPEG and JPEG 2000, 2 times smaller than WebP, and 1.7 times smaller than BPG on datasets of generic images across all quality levels. At the same time, our codec is designed to be lightweight and deployable: for example, it can encode or decode the Kodak dataset in around 10ms per image on GPU. Our architecture is an autoencoder featuring pyramidal analysis, an adaptive coding module, and regularization of the expected codelength. We also supplement our approach with adversarial training specialized towards use in a compression setting: this enables us to produce visually pleasing reconstructions for very low bitrates.
Article
The $k$-Means clustering problem on $n$ points is NP-Hard for any dimension $d\ge 2$; however, for the 1D case there exist exact polynomial time algorithms. The current state of the art is an $O(kn^2)$ dynamic programming algorithm that uses $O(nk)$ space. We present a new algorithm improving this to $O(kn \log n)$ time and optimal $O(n)$ space. We generalize our algorithm to work for the absolute distance instead of squared distance and to work for any Bregman Divergence as well.
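The baseline $O(kn^2)$ dynamic program mentioned in this abstract can be sketched as below. This is a minimal illustration of the classical baseline, not the paper's faster $O(kn \log n)$ algorithm; the function name `kmeans_1d` and the sample points are assumptions. It relies on the fact that optimal 1D clusters are contiguous runs of the sorted points, and it uses prefix sums so that the cost of any run is available in constant time.

```python
def kmeans_1d(points, k):
    """Exact 1D k-means via dynamic programming: O(k n^2) time, O(n k) space.

    Returns (total_cost, clusters), where clusters are lists of sorted points.
    """
    xs = sorted(points)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(points)")

    # Prefix sums of x and x^2 give the cost of any contiguous cluster in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i, j):
        """Sum of squared distances to the mean for xs[i:j], with j > i."""
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # dp[m][j]: best cost of splitting the first j sorted points into m clusters.
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # last cluster is xs[i:j]
                c = dp[m - 1][i] + cost(i, j)
                if c < dp[m][j]:
                    dp[m][j] = c
                    cut[m][j] = i

    # Backtrack through the stored cut positions to recover the clusters.
    clusters = []
    j = n
    for m in range(k, 0, -1):
        i = cut[m][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return dp[k][n], clusters


total, clusters = kmeans_1d([1.0, 2.0, 10.0, 11.0, 20.0], k=3)
# clusters == [[1.0, 2.0], [10.0, 11.0], [20.0]], total == 1.0
```

The triple loop is where the $O(kn^2)$ time comes from, and the `dp` and `cut` tables account for the $O(nk)$ space that the abstract reports improving to $O(n)$.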
Article
Modern systems for real-time hand tracking rely on a combination of discriminative and generative approaches to robustly recover hand poses. Generative approaches require the specification of a geometric model. In this paper, we propose the use of sphere-meshes as a novel geometric representation for real-time generative hand tracking. How tightly this model fits a specific user heavily affects tracking precision. We derive an optimization to non-rigidly deform a template model to fit the user data in a number of poses. This optimization jointly captures the user's static and dynamic hand geometry, thus facilitating high-precision registration. At the same time, the limited number of primitives in the tracking template allows us to retain excellent computational performance. We confirm this by embedding our models in an open source real-time registration algorithm to obtain a tracker steadily running at 60Hz. We demonstrate the effectiveness of our solution by qualitatively and quantitatively evaluating tracking precision on a variety of complex motions. We show that the improved tracking accuracy at high frame-rate enables stable tracking of extended and complex motion sequences without the need for per-frame re-initialization. To enable further research in the area of high-precision hand tracking, we publicly release source code and evaluation datasets.