
Real-time Compression and Streaming of 4D Performances

DANHANG TANG, MINGSONG DOU, PETER LINCOLN, PHILIP DAVIDSON, KAIWEN GUO, JONATHAN TAYLOR, SEAN FANELLO, CEM KESKIN, ADARSH KOWDLE, SOFIEN BOUAZIZ, SHAHRAM IZADI, and ANDREA TAGLIASACCHI, Google Inc.

[Figure annotations: 15 Mbps, 30 Hz.]

Fig. 1. We acquire a performance via a rig consisting of 8× RGBD cameras in a panoptic configuration. The 4D geometry is reconstructed via a state-of-the-art non-rigid fusion algorithm, and then compressed in real-time by our algorithm. The compressed data is streamed over a medium-bandwidth connection (≈20 Mbps), and decoded in real-time on consumer-level devices (Razer Blade Pro w/ GTX 1080m graphics card). Our system opens the door towards real-time telepresence for Virtual and Augmented Reality – see the accompanying video.

We introduce a realtime compression architecture for 4D performance capture that is two orders of magnitude faster than current state-of-the-art techniques, yet achieves comparable visual quality and bitrate. We note how much of the algorithmic complexity in traditional 4D compression arises from the necessity to encode geometry using an explicit model (i.e. a triangle mesh). In contrast, we propose an encoder that leverages an implicit representation (namely a Signed Distance Function) to represent the observed geometry, as well as its changes through time. We demonstrate how SDFs, when defined over a small local region (i.e. a block), admit a low-dimensional embedding due to the innate geometric redundancies in their representation. We then propose an optimization that takes a Truncated SDF (i.e. a TSDF), such as those found in most rigid/non-rigid reconstruction pipelines, and efficiently projects each TSDF block onto the SDF latent space. This results in a collection of low-entropy tuples that can be effectively quantized and symbolically encoded. On the decoder side, to avoid the typical artifacts of block-based coding, we also propose a variational optimization that compensates for quantization residuals in order to penalize unsightly discontinuities in the decompressed signal. This optimization is expressed in the SDF latent embedding, and hence can also be performed efficiently. We demonstrate our compression/decompression architecture by realizing, to the best of our knowledge, the first system for streaming a real-time captured 4D performance on consumer-level networks.

Authors’ address: Danhang Tang; Mingsong Dou; Peter Lincoln; Philip Davidson; Kaiwen Guo; Jonathan Taylor; Sean Fanello; Cem Keskin; Adarsh Kowdle; Sofien Bouaziz; Shahram Izadi; Andrea Tagliasacchi, Google Inc.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2018 Copyright held by the owner/author(s).

0730-0301/2018/11-ART256

https://doi.org/10.1145/3272127.3275096

CCS Concepts: • Theory of computation → Data compression; • Computing methodologies → Computer vision; Volumetric models;

Additional Key Words and Phrases: 4D compression, free viewpoint video.

ACM Reference Format:
Danhang Tang, Mingsong Dou, Peter Lincoln, Philip Davidson, Kaiwen Guo, Jonathan Taylor, Sean Fanello, Cem Keskin, Adarsh Kowdle, Sofien Bouaziz, Shahram Izadi, and Andrea Tagliasacchi. 2018. Real-time Compression and Streaming of 4D Performances. ACM Trans. Graph. 37, 6, Article 256 (November 2018), 11 pages. https://doi.org/10.1145/3272127.3275096

1 INTRODUCTION

Recent commercialization of AR/VR headsets, such as the Oculus Rift and Microsoft Hololens, has enabled the consumption of 4D data from unconstrained viewpoints (i.e. free-viewpoint video). In turn, this has created a demand for algorithms to capture [Collet et al. 2015], store [Prada et al. 2017], and transfer [Orts-Escolano et al. 2016] 4D data for user consumption. These datasets are acquired in capture setups comprised of hundreds of traditional color cameras, resulting in computationally intensive processing to generate models capable of free viewpoint rendering. While pre-processing can take place offline, the resulting “temporal sequence of colored meshes” is far from compressed, with bitrates reaching ≈100 Mbps. In comparison, commercial solutions for internet video streaming, such as Netflix, typically cap the maximum bandwidth at a modest 16 Mbps. Hence, state-of-the-art algorithms focus on offline compression, but with computational costs of 30 s/frame on a cluster comprised of ≈70 high-end workstations [Collet et al. 2015], these

ACM Trans. Graph., Vol. 37, No. 6, Article 256. Publication date: November 2018.


solutions are a struggle to employ in scenarios beyond a professional capture studio. Further, recent advances in VR/AR telepresence not only require the transfer of 4D data over limited bandwidth connections to a client with limited compute, but also require this to happen in a streaming fashion (i.e. real-time compression).

In direct contrast to Collet et al. [2015], we approach these challenges by revising the 4D pipeline end-to-end. First, instead of requiring ≈100 RGB and IR cameras, we only employ a dozen RGB+D cameras, which are sufficient to provide a panoptic view of the performer. We then make the fundamental observation that much of the algorithmic/computational complexity of previous approaches revolves around the use of triangular meshes (i.e. an explicit representation), and their reliance on a UV parameterization for texture mapping. The state-of-the-art method by Prada et al. [2017] encodes differentials in topology (i.e. connectivity changes) and geometry (i.e. vertex displacements) with mesh operations. However, substantial efforts need to be invested in the maintenance of the UV atlas, so that video texture compression remains effective. In contrast, our geometry is represented in implicit form as a truncated signed distance function (TSDF) [Curless and Levoy 1996]. Our novel TSDF encoder achieves geometry compression rates that are lower than those in the state of the art (7 Mbps vs. 8 Mbps of [Prada et al. 2017]), at a framerate that is two orders of magnitude faster than current mesh-based methods (30 fps vs. 1 min/frame of [Prada et al. 2017]). This is made possible thanks to the implicit TSDF representation of the geometry, which can be efficiently encoded in a low-dimensional manifold.

2 RELATED WORKS

We give an overview of the state of the art in 3D/4D data compression, and highlight our contributions in contrast to these methods.

Mesh-based compression. There is a significant body of work on compression of triangular meshes; see the surveys by Peng et al. [2005] and by Maglo et al. [2015]. Lossy compression of a static mesh using a Fourier transform and manifold harmonics is possible [Karni and Gotsman 2000; Lévy and Zhang 2010], but these methods hardly scale to real-time applications due to the computational complexity of the spectral decomposition of the Laplacian operator. Furthermore, not only does the connectivity need to be compressed and transferred to the client, but lossy quantization typically results in low-frequency aberrations to the signal [Sorkine et al. 2003]. Under the assumption of temporally coherent topology, these techniques can be generalized to compression of animations [Karni and Gotsman 2004]. However, quantization still results in low-frequency aberrations that introduce very significant visual artifacts such as foot-skate effects [Vasa and Skala 2011]. Overall, all these methods, including the [MPEG4/AFX 2008] standard, assume temporally consistent topology/connectivity, making them unsuitable for the scenarios we seek to address. Another class of geometric compressors, derived from the EdgeBreaker encoders [Rossignac 1999], lies at the foundation of the modern Google/Draco library [Galligan et al. 2018]. This method is compared with ours in Section 4.

Compression of volumetric sequences. The work by Collet et al. [2015] pioneered volumetric video compression, with the primary focus of compressing hundreds of RGB camera streams into a single texture map. Given a keyframe, they first non-rigidly deform this template to align with a set of subsequent subframes using embedded deformations [Sumner et al. 2007]. Keyframe geometry is compressed via vertex position quantization, whereas connectivity information is delta-encoded. Differentials with respect to a linear motion predictor are then quantized and compressed with minimal entropy. Optimal keyframes are selected via heuristics, and post motion-compensation residuals are implicitly assumed to be zero. As the texture atlas remains constant in subframes (the set of frames between adjacent keyframes), the texture image stream is temporally coherent, making it easy to compress with traditional video encoders. As reported in [Prada et al. 2017, Tab. 2], this results in texture bitrates ranging from 1.8 to 6.5 Mbps, and geometry bitrates ranging from 6.6 to 21.7 Mbps. The work by Prada et al. [2017] is a direct extension of [Collet et al. 2015] that uses a differential encoder. However, these differentials are explicit mesh editing operations correcting the residuals that motion compensation failed to explain. This results in better texture compression, as it enables the atlases to remain temporally coherent across entire sequences (−13% texture bitrate), and better geometry bitrates, as only mesh differentials, rather than entire keyframes, need to be compressed (−30% geometry bitrate); see [Prada et al. 2017, Tab. 2]. Note how neither [Collet et al. 2015] nor [Prada et al. 2017] is suitable for streaming compression, as they both have a runtime complexity on the order of 1 min/frame.

Volume compression. Implicit 3D representations of a volume, such as the signed distance function (SDF), have also been explored for 3D data compression. The work in [Schelkens et al. 2003] is an example based on 3D wavelets that forms the foundation of the JPEG2000/jp3D standard. Dictionary learning and sparse coding have been used in the context of local surface patch encoding [Ruhnke et al. 2013], but the method is far from real-time. Similar to this work, Canelhas et al. [2017] chose to split the volume into blocks, and perform KLT compression to efficiently encode the data. In their method, discontinuities at the TSDF boundaries are dealt with by allowing blocks to overlap, resulting in very high bitrates. Moreover, the encoding of a single slice of ≈100 blocks (10% of the volume) of size 16³ takes ≈45 ms, making the method unsuitable for real-time scenarios.

Compression as 3D shape approximation. Shape approximation can also be interpreted as a form of lossy compression. In this domain we find several variants, from traditional decimation [Cohen-Steiner et al. 2004; Garland and Heckbert 1997] to its level-of-detail variants [Hoppe 1996; Valette and Prost 2004]. Another form of compression replaces the original geometry with a set of approximating proxies, such as bounding cages [Calderon and Boubekeur 2017] and sphere-meshes [Thiery et al. 2013]. The latter are somewhat relevant due to their recent extension to 4D geometry [Thiery et al. 2016], but, like [MPEG4/AFX 2008], they only support temporally coherent topology.

Contributions. In summary, our main contribution is a novel end-to-end architecture for real-time 4D data streaming on consumer-level hardware, which includes the following elements:


[Pipeline diagram. Encoder modules: Dicer, KLT, Quantizer, Range Encoder, NVENC, operating on 8×{Depth} and 8×{RGB} inputs via m2f; decoder modules: Range Decoder, iQuantizer, iKLT, iDicer, Marching Cubes, NVDEC, Texture Mapping; bitstreams of ≈0.3, 7, and 8 Mbps.]

Fig. 2. The architecture of our real-time 4D compression/decompression pipeline; we visualize 2D blocks for illustrative purposes only. From the depth images, a TSDF volume is created by motion2fusion [Dou et al. 2017]; each of the 8×8×8 blocks of the TSDF is then embedded in a compact latent space, quantized, and then symbolically encoded. The decompressor inverts this process, where an optional inverse quantization (iQ) filter can be used to remove the discontinuities caused by the fact that the encoder compresses adjacent blocks independently. A triangular mesh is then extracted from the decompressed TSDF via iso-surfacing. Given our limited number of RGB cameras, the compression of color information is performed per-channel and without the need of any UV mapping; given the camera calibrations, the color streams can then be mapped onto the surface projectively within a programmable shader.

• A new efficient compression architecture leveraging implicit representations instead of triangular meshes.
• We show how we can reduce the number of coefficients needed to encode a volume by solving, at run-time, a collection of small least-squares problems.
• We propose an optimization technique to enforce smoothness among neighboring blocks, removing visually unpleasant block artifacts.
• These changes allow us to achieve state-of-the-art compression performance, which we evaluate in terms of rate vs. distortion analysis.
• We propose a GPU compression architecture that, to the best of our knowledge, is the first to achieve real-time compression performance.

3 TECHNICAL OUTLINE

We overview our compression architecture in Figure 2. Our pipeline assumes a stream of depth and color image sets $\{D^n_t, C^n_t\}$ from a multiview capture system that are used to generate a volumetric reconstruction (Section 3.1). The encoder/decoder has two streams of computation executed in parallel, one for geometry (Section 3.2), one for color (Section 3.5), which are transferred via a peer-to-peer network channel (Section 3.6). We do not build a texture atlas per frame (as in [Dou et al. 2017]), nor do we update it over time (as in [Orts-Escolano et al. 2016]); rather, we apply standard video compression to stream the 8 RGB views to a client; the known camera calibrations are then used to color the rendered model via projective texturing.

3.1 Panoptic 4D capture and reconstruction

We built a capture setup with 8× RGBD cameras placed around the performer, capable of capturing a panoptic view of the scene; see Figure 3. Each of these cameras consists of 1× RGB camera and 2× IR cameras that have been mutually calibrated, and an uncalibrated IR pattern projector. We leverage the state-of-the-art disparity estimation system by Kowdle et al. [2018] to compute depth very efficiently.

The depth maps coming from different viewpoints are then merged following modern extensions [Dou et al. 2017] of traditional fusion paradigms [Dou et al. 2016; Innmann et al. 2016; Newcombe et al. 2015] that aggregate observations from all cameras into a single representation.

Implicit geometry representation. An SDF is an implicit representation of the 3D volume; it can be expressed as a function $\phi(z): \mathbb{R}^3 \to \mathbb{R}$ that encodes a surface by storing at each grid point

Fig. 3. Our panoptic capture rig (8× RGBD cameras), examples of the RGB and depth views, and the resulting projective-texture-mapped model.


the signed distance from the surface $\mathcal{S}$ – positive if outside, and negative otherwise. The SDF is typically represented as a dense uniform 3D grid of signed distance values with resolution $W \times H \times D$. A triangular mesh representation of the surface $\mathcal{S} = \{z \in \mathbb{R}^3 \mid \phi(z) = 0\}$ can be easily extracted via iso-surfacing [de Araujo et al. 2015]; practical and efficient implementations of these methods currently exist on mobile consumer devices. In practical applications, where resources such as compute power and memory are scarce, a Truncated SDF is employed: an SDF variant that is only defined in the neighborhood of the surface corresponding to the SDF zero crossing, $\Phi(z): \{z \in \mathbb{R}^3 \mid |\phi(z)| < \varepsilon\} \to \mathbb{R}$. Although surfaces are most commonly represented explicitly using a triangular/quad mesh, implicit representations make dealing with topology changes easier.

Using our reimplementation of [Dou et al. 2017], we transform the input stream into a sequence of TSDFs $\{\Phi_t\}$ and corresponding sets of RGB images $\{C^n_t\}$ for each frame $t$.
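As a concrete illustration of the representation above — not part of the paper's pipeline — the following sketch builds a dense TSDF grid for a sphere. The function name, grid size, and the simple clamp to ±trunc for the truncation band are illustrative assumptions:

```python
import numpy as np

def sphere_tsdf(shape=(32, 32, 32), center=(16, 16, 16), radius=10.0, trunc=3.0):
    """Dense TSDF grid for a sphere: signed distance (positive outside,
    negative inside), clamped to [-trunc, trunc] outside the band."""
    grid = np.indices(shape).transpose(1, 2, 3, 0).astype(float)  # (W,H,D,3)
    sdf = np.linalg.norm(grid - np.asarray(center, dtype=float), axis=-1) - radius
    return np.clip(sdf, -trunc, trunc)

vol = sphere_tsdf()
```

The zero crossing of `vol` traces the sphere's surface; iso-surfacing (e.g. marching cubes) would recover a triangle mesh from it.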

3.2 Signal compression – Preliminaries

Before diving into the actual 3D volume compression algorithm, we briefly describe some preliminary tools that are used in the rest of the pipeline, namely the KLT and range encoding.

3.2.1 Karhunen–Loève Transform. Given a collection $\{x_m \in \mathbb{R}^n\}$ of $M$ training signals, we can learn a transformation matrix $P$ and offset $\mu$ that define a forward KLT transform, as well as its inverse iKLT, of an input signal $x$ (a transform based on PCA decomposition):

$$X = \mathrm{KLT}(x) = P^T (x - \mu) \qquad \text{(forward transform)} \qquad (1)$$
$$x = \mathrm{iKLT}(X) = P X + \mu \qquad \text{(inverse transform)} \qquad (2)$$

Given $\mu = \mathbb{E}_m[x_m]$, the matrix $P$ is computed via eigendecomposition of the covariance matrix $C = \mathbb{E}_m[(x_m - \mu)(x_m - \mu)^T]$, resulting in the matrix factorization $C = P \Sigma P^T$, where $\Sigma$ is a diagonal matrix whose entries $\{\sigma_n\}$ are the eigenvalues of $C$, and the columns of $P$ are the corresponding eigenvectors. In particular, if the signal to be compressed is drawn from a linear manifold of dimensionality $d \ll n$, the trailing $(n - d)$ entries of $X$ will be zero. Consequently, we can encode a signal $x \in \mathbb{R}^n$ in a lossless way by only storing its $X \in \mathbb{R}^d$ transform coefficients. Note that to invert the process, the decoder only requires the transformation matrix $P$ and the mean vector $\mu$. In practical applications, our signals are not strictly drawn from a low-dimensional linear manifold, but the entries of the diagonal matrix $\Sigma$ represent how the energy of the signal is distributed in the latent space. In more detail, we can use only the first $d$ basis vectors to encode/decode the signal; more formally:

$$\hat{X} = \mathrm{KLT}_d(x) = (P I_d)^T (x - \mu) \qquad \text{(lossy encoder)} \qquad (3)$$
$$\hat{x} = \mathrm{iKLT}_d(\hat{X}) = P I_d \hat{X} + \mu \qquad \text{(lossy decoder)} \qquad (4)$$

where $I_d$ is a diagonal matrix in which only the first $d$ elements are set to one. The reconstruction error is bounded by:

$$\epsilon(d) = \mathbb{E}_m \|x_m - \hat{x}_m\|_2^2 = \sum_{i=d+1}^{n} \sigma_i \qquad (5)$$

For linear latent spaces of order $d$, not only does the optimality theorem of the KLT basis ensure that the error above is zero (i.e. lossless compression), but also that the transformed signal $\hat{X}$ has minimal entropy [Jorgensen and Song 2007]. In the context of compression, this is significant, as entropy captures the average number of bits necessary to encode the signal. Furthermore, similar to the Fourier Transform (FT), as the covariance matrix is symmetric, the matrix $P$ is orthonormal. Uncorrelated bases ensure that adding more dimensions always adds more information, which is particularly useful when dealing with variable bitrate streaming. Note that, in contrast to compression based on the FT, the $\mathrm{KLT}_d$ compression bases are signal-dependent: we train $P$ and then stream $P_d$, containing the first $d$ columns of $P$, to the client. Note this operation could be done offline as part of a codec installation process.

In practice, the full covariance matrix of an entire frame with millions of voxels would be extremely large, and populating it accurately would require a dataset several orders of magnitude larger than ours. Further, dropping eigenvectors would also have a detrimental global smoothing effect. Moreover, a TSDF is a sparse signal with interesting values centered around the implicit surface only, which is why we apply the KLT to individual, non-overlapping blocks of size $8 \times 8 \times 8$: large enough to contain interesting local surface characteristics, and small enough to capture its variation using our dataset. More specifically, in our implementation we use a 5 mm voxel resolution, so a block has a volume of $40 \times 40 \times 40\,\mathrm{mm}^3$. The learned eigenbases from these blocks need to be transmitted only once to the decoder.
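The training and transform steps of this subsection can be sketched in a few lines of NumPy. The function names are our own, and the toy data below stands in for flattened 8×8×8 TSDF blocks; this is a sketch of Equations 1–5, not the paper's implementation:

```python
import numpy as np

def fit_klt(blocks):
    """Learn the KLT basis P and offset mu from (M, n) flattened blocks."""
    mu = blocks.mean(axis=0)
    centered = blocks - mu
    # C = E_m[(x_m - mu)(x_m - mu)^T]; eigh returns ascending eigenvalues.
    C = centered.T @ centered / len(blocks)
    sigma, P = np.linalg.eigh(C)
    order = np.argsort(sigma)[::-1]   # re-sort descending, as in Eq. (5)
    return P[:, order], mu, sigma[order]

def klt_d(x, P, mu, d):
    """Lossy forward transform (Eq. 3): keep only the first d coefficients."""
    return P[:, :d].T @ (x - mu)

def iklt_d(X_hat, P, mu, d):
    """Lossy inverse transform (Eq. 4)."""
    return P[:, :d] @ X_hat + mu
```

When the training data truly lies on a $d$-dimensional linear manifold, the round trip is lossless, matching the error bound of Equation 5.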

3.2.2 Range Encoding. The $d$ coefficients obtained through KLT projection can be further compressed using a range encoder, a lossless compression technique specialized for compressing entire sequences rather than individual symbols, as in the case of Huffman coding. A range encoder computes the cumulative distribution function (CDF) from the probabilities of a set of discrete symbols, and recursively subdivides the range of a fixed-precision integer to map an entire sequence of symbols into an integer [Martin 1979]. However, to apply range encoding to a sequence of real-valued descriptors, as in our case, one needs to quantize the symbols first. As such, despite range encoding being a lossless compression method on discrete symbols, the combination results in lossy compression due to quantization error. In particular, we treat each of the $d$ components as a separate message, as treating them jointly may not lend itself to efficient compression. This is due to the fact that the distribution of symbols in each latent dimension can be vastly different and that the components are uncorrelated because of the KLT. Hence, we compute the CDF, quantize the values, and encode the sequence separately for each latent dimension.
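A minimal sketch of this per-dimension treatment: nearest-bin quantization (cf. Equation 6) followed by the symbol CDF that would drive a range encoder. The range coder itself [Martin 1979] is omitted, and the function names are our own:

```python
import numpy as np

def quantize_dim(coeffs, centers):
    """Map each real coefficient to the index of its nearest bin center."""
    coeffs = np.asarray(coeffs, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return np.abs(coeffs[:, None] - centers[None, :]).argmin(axis=1)

def symbol_cdf(symbols, num_symbols):
    """Cumulative distribution over quantized symbols, one per latent
    dimension, as used to drive a range encoder."""
    counts = np.bincount(symbols, minlength=num_symbols).astype(float)
    probs = counts / counts.sum()
    return np.concatenate([[0.0], np.cumsum(probs)])
```

Each latent dimension gets its own CDF, since the symbol distributions across dimensions can be vastly different.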

3.3 Offline Training of KLT / Quantization

The tools in the previous section can be applied directly to transform a signal, such as an SDF, into a bit stream. The ability, however, to actually compress the signal greatly depends on how exactly they are applied. In particular, applying these tools to the entire signal (e.g. the full TSDF) does not expose repetitive structures that would enable effective compression. As such, inspired by image compression techniques, we look at $8 \times 8 \times 8$-sized blocks where we


Fig. 4. (left) A selection of random exemplars for several variants of TSDF training signals (Neumann TSDF, Dirichlet TSDF, Traditional TSDF) that are used to train the KLT. (middle) A cross section of the function along the segment x, highlighting the different levels of continuity in the training signal. (right) When trained on 10000 randomly generated 32×32 images, the KLT spectrum is heavily affected by the smoothness of the training dataset.

are likely to see repetitive structure (e.g. planar structures). Further, it turns out that leveraging a training set to choose an effective set of KLT parameters (i.e. $P$ and $\mu$) is considerably nuanced, and one of our contributions is showing below how to do that effectively. Finally, we demonstrate how to use this training set to form an effective quantization scheme.

Learning $P, \mu$ from 2D data. The core of our compression architecture exploits the repetitive geometric structure seen in typical SDFs. For example, the walls of an empty room, if axis-aligned, can largely be explained by a small number of basis vectors that represent axis-aligned blocks. When non-axis-aligned planar surfaces are introduced, one can reconstruct an arbitrary block containing such a surface by linearly blending a basis of blocks containing planes at a sparse sampling of orientations. Even for surfaces, the SDF is an identity function along the surface normal, indicating that the amount of variation in an SDF is much lower than for an arbitrary 3D function. We thus use the KLT to find a latent space where the variance of each component is known, which we will later exploit in order to encourage bits to be assigned to components of high variance. As we will see, however, applying this logic naively to generate a training set of blocks sampled from a TSDF can have disastrous effects. Specifically, one must carefully consider how to deal with blocks that contain unobserved voxels; the block pixels colored in black in Figure 2. This is complicated by the fact that, although real-world objects have a well-defined inside and outside, it is not obvious how to discern this for the unobserved voxels outside the truncation band. The three choices that we consider are:

Traditional TSDF. Assign some arbitrary constant value to all such unassigned voxels. This will create sharp discontinuities at the edge of the truncation band; i.e. presumably the representation used by Canelhas et al. [2017].

Dirichlet TSDF. Only consider “unambiguous” blocks where using the values of voxels on the truncation boundary (i.e. with a value nearly ±1) to flood-fill any unobserved voxels does not introduce large C⁰ discontinuities.

Fig. 5. The (sorted) eigenvalues (diagonal of Σ) for a training signal derived from a collection of 4D sequences. Note the decay rate is slower than that of Figure 4, as these eigenbases need to represent surface details, and not just the orientation of a planar patch.

Neumann TSDF. Increase the truncation band and consider only blocks that never contain an unobserved value. These blocks will look indistinguishable from SDF blocks at training time, but require additional care at test time, when the truncation band is necessarily reduced for performance reasons; see Equation 9.

To elucidate the benefits of these increasingly complicated approaches, we use an illustrative 2D example that naturally generalizes to 3D; see Figure 4. In particular, we consider training a basis on 32×32 blocks extracted from a 2D TSDF containing a single planar surface (i.e. a line). Notice that for the truly SDF-like data resulting from the Neumann strategy, the eigenvalues of the eigendecomposition of our signal quickly fall to zero, faithfully allowing the zero-crossing to be reconstructed with a small number of basis vectors. In contrast, the continuous but kinked signals resulting from the Dirichlet strategy extend the spectrum of the signal, as higher-frequency components are required to reconstruct the sharp transition to the constant-valued plateaus. Finally, the Traditional strategy massively reduces the compactness of the spectrum, as the decomposition tries to model the sharp discontinuities in the signal. In more detail, to capture 99% of the energy in the signal, Traditional TSDFs require d = 94 KLT basis vectors, Dirichlet TSDFs require d = 14 KLT basis vectors, and Neumann TSDFs require d = 3 KLT basis vectors. Note how Neumann is able to discover the true dimensionality of the input data (i.e. d = 3).
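The 2D experiment can be reproduced in spirit with a few lines of NumPy: fully observed (Neumann-like, untruncated) 32×32 blocks, each the SDF of a random line, reveal a 3-dimensional latent structure. The block count and sampling ranges below are illustrative choices, not those of the paper:

```python
import numpy as np

def line_sdf_block(theta, c, size=32):
    """A size x size SDF block containing a single line: phi(p) = n . p - c,
    with unit normal n = (cos theta, sin theta); no truncation applied."""
    ii, jj = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    return np.cos(theta) * ii + np.sin(theta) * jj - c

rng = np.random.default_rng(0)
blocks = np.stack([
    line_sdf_block(rng.uniform(0.0, 2.0 * np.pi), rng.uniform(-10.0, 40.0)).ravel()
    for _ in range(1000)
])
centered = blocks - blocks.mean(axis=0)
sigma = np.linalg.eigvalsh(centered.T @ centered / len(blocks))[::-1]  # descending
energy = np.cumsum(sigma) / sigma.sum()
d99 = int(np.searchsorted(energy, 0.99)) + 1   # basis vectors for 99% energy
```

Because each block is a linear combination of only three fixed images (the row ramp, the column ramp, and a constant), `d99` comes out at most 3, matching the text.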

Learning $P, \mu$ from 3D captured data. Extending the example above to 3D, we train our KLT encoder on a collection of 3D spatial blocks extracted from 4D sequences we acquired using the capture system described in Section 3.1. To deal with unobserved voxels, we apply the Neumann strategy at training time and carefully design our test-time encoder to deal with unobserved voxels that occur with the smaller truncation band required to run in real time; see Equation 9. Note that these training sequences are excluded from the quantitative


evaluations of Section 4. In contrast to the blocks extracted by the dicer module at test time, the training blocks need not be non-overlapping, so instead we extract them by a sliding window over every training frame $\{\Phi_m\}$. The result of this operation is summarized by the KLT spectra in Figure 5. As detailed in Section 4, the number $d$ of employed basis vectors is chosen according to the desired trade-off between bitrate and compression quality.

Learning the quantization scheme. To be able to employ symbolic encoders, we need to develop a quantization function $Q(\cdot)$ that maps the latent coefficients $\hat{X}$ to a vector of quantized coefficients $\hat{Q}$ as $\hat{Q} = Q(\hat{X})$. To this end, we quantize each dimension $h$ of the latent space individually by choosing $K_h$ quantization bins¹ $\{\eta^h_k\}_{k=1}^{K_h} \subseteq \mathbb{R}$ and defining the $h$'th component of the quantization function as

$$Q_h(\hat{X}) = \arg\min_{k \in \{1,2,\ldots,K_h\}} (\hat{X}_h - \eta^h_k)^2. \qquad (6)$$

To ensure there are bins near coefficients that are likely to occur, we minimize

$$\arg\min_{\{w^h_{k,n}\},\{\eta^h_k\}} \; \sum_{k=1}^{K_h} \sum_{n=1}^{N} w^h_{k,n} (\hat{X}^h_n - \eta^h_k)^2, \qquad (7)$$

where $\hat{X}^h_n$ is the $h$'th coefficient derived from the $n$'th training example in our training set. Each point has a boolean vector affiliating it to a bin, $W^h_n = [w^h_{1,n}, \ldots, w^h_{K_h,n}]^T \in \mathbb{B}^{K_h}$, where only a single value is set to 1, such that $\sum_i w^h_{i,n} = 1$. This is a 1-dimensional form of the classic K-means problem and can be solved in closed form using dynamic programming [Grønlund et al. 2017]. The eigenvalues corresponding to the principal components of the data are a measure of variance in the projected space. We leverage this by assigning a varied number of bins to each component based on their respective eigenvalues, which range between $[0, V_{\mathrm{total}}]$. In particular, we adopt a linear mapping between the standard deviation of the data and the number of bins, resulting in $K_h = K_{\mathrm{total}} \sqrt{V_h} / \sqrt{V_{\mathrm{total}}}$ bins for eigenbasis $h$. The components receiving a single bin are not encoded, since their values are fixed for all the samples; they are simply added as a bias to the reconstructed result in the decompression stage. As such, the choice of $K_{\mathrm{total}}$ partially controls the trade-off between reconstruction accuracy and bitrate.
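A sketch of the bin-learning step, under stated simplifications: we substitute plain Lloyd iterations for the exact 1-D dynamic-programming solver of [Grønlund et al. 2017], and implement the allocation rule $K_h = K_{\mathrm{total}} \sqrt{V_h}/\sqrt{V_{\mathrm{total}}}$ directly. Function names are our own:

```python
import numpy as np

def kmeans_1d(values, k, iters=50):
    """Approximate 1-D k-means (Eq. 7) via Lloyd iterations; the paper
    instead uses an exact dynamic-programming solver."""
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))  # spread initial bins
    for _ in range(iters):
        assign = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            sel = values[assign == j]
            if len(sel):
                centers[j] = sel.mean()
    return np.sort(centers)

def allocate_bins(eigenvalues, k_total):
    """K_h = K_total * sqrt(V_h) / sqrt(V_total): more bins go to
    high-variance latent dimensions; every dimension keeps >= 1 bin."""
    eig = np.asarray(eigenvalues, dtype=float)
    k = np.round(k_total * np.sqrt(eig) / np.sqrt(eig.sum()))
    return np.maximum(1, k.astype(int))
```

Note that, as in the paper's formula, the per-dimension counts need not sum exactly to $K_{\mathrm{total}}$; it only partially controls the accuracy/bitrate trade-off.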

3.4 Run-time 3D volume compression/decompression

Encoder. At run time, the encoding architecture entails the following steps. A variant of the KLT, designed to deal with unobserved voxels, maps each of the $m = 1 \ldots M$ blocks to a vector of coefficients $\{\hat{X}_m\}$ in a low-dimensional latent space (Section 3.4.2). We compute histograms of the coefficients of each block, and then quantize them into the set $\{\hat{Q}_m\}$; these histograms are re-used to compute the optimal quantization thresholds (Section 3.4.3). To further reduce bitrate allocations, the lossless range encoder module compresses the symbols into a bitstream.

Decoder. At run time, the decoding architecture mostly inverts the process described above, although due to quantization the signal can only be approximately reconstructed; see Section 3.4.4. The decode process produces a set of coefficients $\{\hat{X}_m\}$. However, at low bitrates, this results in blocky artifacts typical of JPEG compression. Since we know that the compressed signal lies on a manifold, these artifacts can be removed by means of an optimization designed to remove C¹ discontinuities from the signal (Section 3.4.5). Finally, we extract the zero crossing of $\Phi$ via a GPU implementation of [Lorensen and Cline 1987]. This extracts a triangular mesh representation of $\mathcal{S}$ that can be texture mapped.

We now detail all the above steps, which can run in real-time even on mobile platforms.

¹Note that what we refer to as “bins” are really just real numbers that implicitly define a set of intervals on the real line through (6).

3.4.1 Dicer. As we stream the content of a TSDF, only the observed blocks are transmitted. To reconstruct the volume we therefore need to encode and transmit the block contents as well as the block indices. The blocks are transmitted with indices sorted in ascending order, and we use delta encoding to convert the vector of indices [i, j, k, ...] into a vector [i, j−i, k−j, ...]. This conversion makes the vector friendly to the entropy encoder, as it becomes a set of small, frequently repeated integers. Unlike the block data, we want lossless reconstruction of the indices, so we cannot train the range encoder beforehand. This means that for each frame we need to calculate the CDF and the code book to compress the index vector.
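This index delta-coding step can be sketched as follows (function names are ours):

```python
def delta_encode(indices):
    """Turn ascending block indices [i, j, k, ...] into
    [i, j-i, k-j, ...]: a stream of small, repeated integers
    that a range/entropy coder compresses well."""
    return [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]

def delta_decode(deltas):
    """Lossless inverse: a running sum restores the original indices."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```

For example, the sorted indices [3, 5, 6, 10] become [3, 2, 1, 4], and the running sum recovers them exactly.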

3.4.2 KLT – compression. The full volume Φ is decomposed into a set {x_m} of non-overlapping 8×8×8 blocks. In this stage, we seek to leverage the learnt basis P and mean µ to map each x ∈ R^n to a manifold with much lower dimensionality d. We again, however, need to deal with the possibility of blocks containing unobserved voxels. At training time, it was possible to apply the Neumann strategy. At test time, however, we cannot increase the truncation region, as doing so would reduce the frame rate of our non-rigid reconstruction pipeline. To this end, we instead minimize the objective:

arg min_{X̂} ‖W[x − (P I_d X̂ + µ)]‖.   (8)

The matrix W ∈ R^{n×n} is a diagonal matrix whose i-th diagonal entry is set to 1 for observed voxels, and 0 otherwise. The least-squares solution of the above optimization problem can be computed in closed form as:

X̂ = (I_d^T P^T W P I_d)^{−1} (P I_d)^T W (x − µ).   (9)

Note that when all the voxels of a block are observed this simplifies to Equation 3. Computing X̂ requires the solution of a small d×d linear problem, which can be efficiently computed.
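Equation 9 amounts to solving a small d×d normal system per block. A NumPy sketch (the function name and argument layout are ours; P is the n×n KLT basis and mu the learned mean):

```python
import numpy as np

def project_block(x, w, P, mu, d):
    """Closed-form solution of Eq. 9: weighted least-squares projection
    of a partially observed block x (length n) onto the first d basis
    vectors, where w is 1 for observed voxels and 0 otherwise."""
    Pd = P[:, :d]          # P @ I_d: keep the first d columns
    PdW = Pd.T * w         # (P I_d)^T W, with W = diag(w)
    A = PdW @ Pd           # small d x d normal matrix
    b = PdW @ (x - mu)
    return np.linalg.solve(A, b)
```

When all voxels are observed (w is all ones) and P is orthonormal, A is the identity and the projection reduces to the plain inner products of Equation 3.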

3.4.3 Quantization and encoding. We now apply our quantization scheme to the input coefficients {X̂_m} via Q̂_m = Q(X̂_m) to obtain the set of quantized coefficients {Q̂_m}. For each dimension h, we apply the range encoder to the coefficients {Q̂_m^h}, across all blocks, as if they were symbols, so as to losslessly compress them, leveraging any low-entropy empirical distribution that appears over these symbols. The bitstream from the range encoder is sent directly to the client.
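A nearest-center quantizer of this kind can be sketched as follows. Note this is a simplification: the paper derives the bin centers from per-dimension coefficient histograms, whereas here they are simply given.

```python
import numpy as np

def quantize(coeffs, centers):
    """Map each coefficient to its nearest bin center. The nearest-center
    rule implicitly defines the quantization intervals, and the returned
    indices are the symbols fed to the range encoder."""
    idx = np.argmin(np.abs(coeffs[:, None] - centers[None, :]), axis=1)
    return idx, centers[idx]

idx, q = quantize(np.array([-0.9, 0.2, 0.6]), np.array([-1.0, 0.0, 1.0]))
```

Each coefficient snaps to the closest of the three centers, so the symbol stream above is [0, 1, 2] and the dequantized values are [-1.0, 0.0, 1.0].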

3.4.4 iKLT – decompression. The decompressed signal can be efficiently obtained via Eq. 4, namely: x̂ = P I_d X̂ + µ. If enough bins were used during quantization, this will result in a highly accurate, but possibly high bitrate, reconstruction.

ACM Trans. Graph., Vol. 37, No. 6, Article 256. Publication date: November 2018.


Fig. 6. (top) A smooth signal is to be compressed by representing each non-overlapping block of 100 samples. (middle) By compressing with KLT4 (or 99% of the variance explained), the signal contains discontinuities at block boundaries. (bottom) Our variational optimization recovers the quantized KLT coefficients, and generates a C1 continuous signal.

3.4.5 iQuantizer – block artifacts removal. Due to our block compression strategy, even when using a number d of KLT basis vectors that explains 99% of the variance in the training data, the decompressed signal presents C0 discontinuities at block boundaries; see Figure 6 for a 2D example. If not enough bins are used during quantization, these can become quite pronounced. To address this problem, we apply a set of corrections X̃ to the quantized coefficients and consider the corrected reconstruction:

x̃_m(X̃_m) = P I_d (Q̂_m + X̃_m) + µ.   (10)

We know that the corrections lie in the intervals defined by the quantization coefficients, that is:

X̃_m ∈ C = [−ε_m^1, ε_m^1] × ... × [−ε_m^d, ε_m^d],   (11)

where for dimension h ∈ {1, ..., d}, ε_m^h ∈ R is half the width of the interval that (6) assigns to bin center Q̂_m^h. To find the corrections, we penalize the presence of discontinuities across blocks in the reconstructed signal. We first abstract the iDicer functional block by parameterizing the reconstructed TSDF in the entire volume as Φ(z) = x̃(z | {X̃_m}) : R^3 → R. We can then recover the corrections {X̃_m} via a variational optimization that penalizes discontinuities in the reconstructed signal x̃:

arg min_{{X̃_m}} Σ_{i,j,k} ‖ΔΦ_{ijk}({X̃_m})‖²   where {X̃_m} ⊆ C.   (12)

Here Δ is the Laplacian operator of a volume, and ijk indexes a voxel in the volume. In practice, the constraints in (11) are hard to deal with, so we relax them as:

arg min_{{X̃_m}} Σ_{i,j,k} ‖ΔΦ_{ijk}({X̃_m})‖² + λ Σ_m Σ_{h=1}^d (X̃_m^h / ε_m^h)²,   (13)

a linear least-squares problem where λ is a weight indicating how strongly the relaxed constraint is enforced. See Figure 2 for a 2D illustration of this process, and Figure 7 for its effects on the rendered mesh model.
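The relaxed problem (13) is an ordinary linear least-squares solve in the corrections. The 1D sketch below illustrates the idea with a DCT-like stand-in basis and a uniform quantizer; both are our simplifications (the paper uses a learned KLT basis and histogram-derived bins), but the structure of the solve is the same: a Laplacian penalty over the assembled signal plus a soft box penalty on the corrections.

```python
import numpy as np

n, M, d = 10, 6, 3                    # block size, #blocks, retained bases
N = n * M

# orthonormal per-block basis (DCT-like stand-in for the learned KLT)
t = np.arange(n)
P = np.stack([np.cos(np.pi * (t + 0.5) * k / n) for k in range(n)], axis=1)
P /= np.linalg.norm(P, axis=0)
Pd = P[:, :d]

# smooth ground-truth signal, compressed block by block
x = np.sin(2 * np.pi * np.arange(N) / N)
X = np.stack([Pd.T @ x[m*n:(m+1)*n] for m in range(M)])     # M x d coeffs
eps = 0.25                            # coarse uniform quantization step
Q = eps * np.round(X / eps)           # quantized coefficients
x0 = np.concatenate([Pd @ Q[m] for m in range(M)])          # blocky decode

# linear map A from coefficient corrections to the assembled signal
A = np.zeros((N, M * d))
for m in range(M):
    A[m*n:(m+1)*n, m*d:(m+1)*d] = Pd

# discrete Laplacian over the whole signal (second differences)
L = (np.diag(np.ones(N - 1), 1) - 2 * np.eye(N)
     + np.diag(np.ones(N - 1), -1))[1:-1]

# Eq. (13) analogue: min_c ||L (x0 + A c)||^2 + lam * ||c / (eps/2)||^2
lam = 1e-3
reg = (np.sqrt(lam) / (eps / 2)) * np.eye(M * d)
lhs = np.vstack([L @ A, reg])
rhs = np.concatenate([-L @ x0, np.zeros(M * d)])
c, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
x_fixed = x0 + A @ c                  # de-blocked reconstruction
```

Because the corrected signal's Laplacian term is part of the minimized objective, the corrected reconstruction is never rougher than the blocky decode it starts from.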

Fig. 7. (left) A low-bitrate TSDF is received by the decoder and visualized as a mesh triangulating its zero crossing. In order to exaggerate the block artifacts for visualization, we choose K_total = 512 in this example. (middle) Our filtering technique removes these structured artifacts via variational optimization. (right) The ground truth triangulated TSDF.

3.5 Color Compression

Although the scope and contribution of this paper is geometry compression, we also provide a viable solution to deal with texturing. Since we are targeting real-time applications, building a consistent texture atlas would break this requirement [Collet et al. 2015]. We recall that our system uses only 8 RGB views; we therefore decided to employ traditional video encoding algorithms to compress all the images and stream them to the client. The receiver collects those images and applies projective texturing to render the final reconstructed mesh.

In particular, we employ a separate thread to process the N streams from the RGB cameras via the real-time H.264 GPU encoder (NVIDIA NVENC). This can be efficiently done on an NVIDIA Quadro P6000, which has two dedicated chips to off-load the video encoding, therefore not interfering with the rest of the reconstruction pipeline. In terms of run time, it takes around 5 ms to compress all of the data when both encoding chips present on the GPU are used in parallel. The same run time is needed on the client side to decode. The bandwidth required to achieve a level of quality with negligible loss is around 8 Mbps. Note that the RGB streams coming from the N RGB cameras are already temporally consistent, enabling standard H.264 encoding.

3.6 Networking and live demo setup

Our live demonstration system consists of two main parts connected by a low-speed link (a WiFi 802.11ac router, Netgear AC1900): a high-performance backend collection of capture/fusion workstations (10× HP Z840), and a remote display laptop (Razer Blade Pro).

The backend system divides the capture load into a hub-and-spoke model. Cameras are grouped into pods, each consisting of two IR cameras for depth and one color camera for texture; each camera operates at 1280×1024 @ 30 Hz. As detailed in Section 3.1, each pod also has an active IR dot-pattern emitter to support active stereo.

In our demo system, there are eight pods divided among the capture workstations. Each capture workstation performs synchronization, rectification, color processing, active stereo matching, and


Fig. 8. Measuring the impact of the number of employed KLT basis vectors on (top-left) reconstruction quality and (top-right) bitrate. (bottom) Qualitative examples when retaining different numbers of basis vectors. Note we keep K_total = 2048 fixed in this experiment.

segmentation. Each capture workstation is directly connected to a central fusion workstation using 40 GigE links. Capture workstations transmit segmented color (24-bit RGB) and depth (16-bit) images, using run-length compression to skip over discarded regions of the view, over TCP using a request/serve loop. On average, this requires only a fraction of the worst-case 1572 Mbps required per pod, around 380 Mbps. A "fusion box" workstation combines the eight depth maps and color images into a single textured mesh for local previewing. To support real-time streaming of the volume and remote viewing, the fusion box also compresses the TSDF volume representation and the eight color textures. The compressed representation, using the process described in Section 3.4 for the mesh and hardware H.264 encoding for the combined textures, is computed for every fused frame at 30 Hz. On average each compressed frame requires about 20 KB. The client laptop, connected via WiFi, uses a request/serve loop to receive the compressed data and performs decompression and display locally.
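The quoted worst case per pod follows directly from the stream parameters (24-bit segmented color plus 16-bit depth at 1280×1024, 30 Hz):

```python
w, h, fps = 1280, 1024, 30
bits_per_pixel = 24 + 16           # segmented RGB color + 16-bit depth
worst_case_mbps = w * h * bits_per_pixel * fps / 1e6
# 1572.864 Mbps, matching the 1572 Mbps worst case quoted per pod
```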

4 EVALUATION

In this section, we perform quantitative and qualitative analysis of our algorithm, and compare it to our re-implementation of state-of-the-art techniques. We urge the reader to watch the accompanying video, where further comparisons are provided, as well as a live recording of our demo streaming a 4D performance in real-time via WiFi to a consumer-level laptop. Note that in our experiments we focus our analysis on two core aspects:

• the requirement for real-time (30 fps in both encoder and decoder) performance
• the compression of geometry (i.e. not color/texture)

4.1 Datasets and metrics

We acquire three RGBD datasets by capturing the performance of three different subjects at 30 Hz. Each sequence, ≈500 frames in length, was recorded with the setup described in Section 3.6. To train the KLT basis of Section 3.2 we randomly selected 50 frames from Sequence 1. We then use this basis to test on the remaining ≈1450 frames.

We generate high-quality reconstructions that are then used as ground-truth meshes. To do so we re-implemented the FVV pipeline of Collet et al. [2015]. We first aggregate the depth maps into a shared coordinate frame, and then register the multiple surfaces to each other via local optimization [Bouaziz et al. 2013] to generate a merged (oriented) point cloud. We then generate a watertight surface by solving a Poisson problem [Kazhdan and Hoppe 2013], where screening constraints enforce that parts of the volume be outside, as expressed by the visual hull. We regard these frames as ground truth reconstructions. Note that this process does not run in real-time, with an average execution budget of 2 min/frame. We use it only for quantitative evaluations; it is not employed in our demo system.

Typically, to decide the degree to which two geometric representations are identical, the Hausdorff distance is employed [Cignoni et al. 1998]. However, as a single outlier might bias our metric, in our experiments we employ an ℓ1 relaxation of the Hausdorff metric [Tkach et al. 2016, Eq. 15]. Given our decompressed model X, and a ground truth mesh Y, our metric is the (average, area-weighted) ℓ1 median of (mutual) closest-point distances:

H(X, Y) = 1/(2A(X)) Σ_{x∈X} A(x) inf_{y∈Y} d(x, y) + 1/(2A(Y)) Σ_{y∈Y} A(y) inf_{x∈X} d(x, y),

where x is a vertex in the mesh X, and A(·) returns the area of a vertex, or of the entire mesh, according to its argument.
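The metric can be sketched directly from the formula. This brute-force NumPy version (function name and argument layout are ours) takes vertex positions and per-vertex areas for both meshes; a spatial index would replace the dense distance matrix for large meshes.

```python
import numpy as np

def h_metric(VX, AX, VY, AY):
    """Area-weighted average of mutual closest-point distances between
    vertex sets VX, VY (k x 3 arrays) with per-vertex areas AX, AY;
    A(X) and A(Y) are taken as the sums of the vertex areas."""
    D = np.linalg.norm(VX[:, None, :] - VY[None, :, :], axis=2)
    term_xy = (AX * D.min(axis=1)).sum() / (2 * AX.sum())
    term_yx = (AY * D.min(axis=0)).sum() / (2 * AY.sum())
    return term_xy + term_yx
```

Identical meshes score 0, and a rigid translation of one mesh by distance t scores t, since both one-sided terms contribute t/2.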

4.2 Impact of compression parameters

In our algorithm, reconstruction error, as measured by H, can be attributed to two different parameters: (Section 4.2.1) the number of learned KLT basis vectors d; and (Section 4.2.2) the number of quantization bins K_total.

4.2.1 Impact of KLT dimensionality – Figure 8. We show the impact of reducing the number d of basis vectors on the reconstruction error as well as on the bitrate. Although the bitrate increases almost linearly in d, the reconstruction error plateaus for d > 48. In the qualitative examples, when d = 16, blocky artifacts remain even after our global solve attempts to remove them. When d = 32, the reconstruction error is still high, but the final global solve manages to remove most of the blocky artifacts, and projective texturing makes the results visually acceptable. When d ≥ 48 it is difficult to notice any differences from the ground truth.

4.2.2 Impact of quantization – Figure 9. We now investigate the impact of quantization when fixing the number of retained basis vectors. As shown in the figure, in contrast to d, the bitrate changes


Fig. 9. Measuring the impact of quantization on (top-left) reconstruction quality and (top-right) bitrate. (bottom) Qualitative examples when employing a different number of quantization bins. Note we keep d = 48 fixed in this experiment.

Table 1. Average timings for encoding a single frame of 300k faces (175k vertices) across different methods. On the encoder side, our compressor (with K_total = 2048, d = 48) is orders of magnitude faster than existing techniques. Since our implementation of FVV [Collet et al. 2015] is not efficient, we report the encoding speed from their paper, where ≈70 workstations were used – note no decoding framerate was reported.

Method              Encoding          Decoding
Ours (CPU / GPU)    17 ms / 2 ms      19 ms / 3 ms
Draco (qp=8 / 14)   180 ms / 183 ms   81 ms / 98 ms
FVV                 25 seconds        N/A
JP3D                14 seconds        25 seconds

sub-linearly with the total number of assigned bins K_total. When K_total = 2048, the compressor produces a satisfying visual output with a bandwidth comparable to high-definition (HD) broadcast.

4.3 Comparisons to the state-of-the-art – Figure 10

We compare with two state-of-the-art geometry compression algorithms: FVV [Collet et al. 2015] and Draco [Galligan et al. 2018]. Whilst Draco is tailored for quasi-lossless static mesh compression, FVV is designed to compress 4D acquired data similar to the scenario we are targeting. Qualitative comparisons are reported in Figure 11, as well as in the supplementary video.

The Google Draco [Galligan et al. 2018] mesh compression library implements a modern variant of EdgeBreaker [Rossignac 1999], where a triangle spanning tree is built by traversing vertices one at a time – as such, this architecture is difficult to parallelize, and therefore not well-suited for real-time applications; see Table 1. Draco uses the parameter qp to control the level of location quantization, i.e., the trade-off between compression rate and loss. With our dataset, we observe artifacts caused by vertices collapsing when qp ≤ 8,

Fig. 10. (left) Comparison with prior art on our dataset. Note that since some results are orders of magnitude apart, semi-log axes are used. To generate a curve for Draco [Galligan et al. 2018], we vary the quantization bits parameter (qp). (right) Comparison with FVV [Collet et al. 2015] on their BREAKDANCERS sequence. FVV's bitrate was obtained from their paper (by removing the declared texture bitrate). Our rate-distortion curve has been generated by varying the number of KLT bases d.

hence we only vary qp between 8 and 14 (the recommended setting). The computational time increases along with this parameter. At the same level of reconstruction error, their output file size is 8× larger than ours; see Figure 10 (left).

As detailed in Section 4, we re-implemented Free Viewpoint Video [Collet et al. 2015] in order to generate high-quality, uncompressed reconstructions and compare their method on our acquired dataset. The untracked, per-frame reconstructions are used as ground-truth meshes. The method in Collet et al. [2015] tracks meshes over time and performs an adaptive decimation to preserve the details on the subject's face. We compare our system with FVV tracked, where both tracking and decimation are employed, and with FVV untracked, where we only use the smart decimation and avoid the expensive non-rigid tracking of the meshes.

Our method outperforms FVV (tracked) in terms of reconstruction error and bitrate. FVV (untracked, but decimated) generally has lower error but prohibitive bitrates (around 128 Mbps). Figure 10 shows that our method (K_total = 2048, d = 32) achieves similar error to the untracked version of FVV with ≈24× lower bitrate. Notice that whereas FVV focuses on exploiting dynamics in the scene to achieve compelling compression performance, our method mostly focuses on encoding and transmitting keyframes; therefore we could potentially combine the two to achieve better compression rates. Finally, it is important to highlight again that FVV is much slower than our method: Collet et al. [2015] describe the tracked version as having a speed of ≈30 minutes per frame on a single workstation.

We also compare our approach with JP3D [Schelkens et al. 2006], an extension of the JPEG2000 codec to volumetric data. Similar to our approach, JP3D decomposes volumetric data into blocks and processes them independently. However, while our algorithm uses the KLT as a data-driven transformation, JP3D uses a generic discrete wavelet transform to process the blocks. The wavelet coefficients are then quantized before being compressed with an arithmetic encoder. In our experiment, we use the implementation in OpenJPEG, where 3 successive layers are chosen with compression ratios 20, 10, and 5. The other parameters are kept at their defaults. Since JP3D assumes a dense 3D volume as input, even though it does not use expensive


temporal tracking, encoding each frame still takes 14 seconds, while decoding takes 25 seconds. At similar reconstruction error, their output bitrate is 131 Mbps, which is still one order of magnitude larger than ours.

5 LIMITATIONS AND FUTURE WORK

We have introduced the first compression architecture for dynamic geometry capable of reaching real-time encoding/decoding performance. Our fundamental contribution is to pivot away from triangular meshes towards implicit representations, and, in particular, the TSDFs commonly used for non-rigid tracking. Thanks to the simple memory layout of a TSDF, the volume can be decomposed into equally-sized blocks, and encoding/decoding efficiently performed on each of them independently, in parallel.

In contrast to [Collet et al. 2015], which utilizes temporal information, our work focuses on encoding an entire TSDF volume. Nonetheless, the compression of post motion-compensation residuals truly lies at the foundation of the excellent compression performance of modern video encoders [Richardson 2011]. An interesting direction for future work would be how to leverage the SE(3) cues provided by Motion2fusion [Dou et al. 2017] to encode post motion-compensation residuals. While designed for efficiency, our KLT compression scheme is linear, and not particularly suited to represent piecewise SE(3) transformations of rest-pose geometry. An interesting avenue for future work would be the investigation of non-linear latent embeddings that could produce more compact, and hence easier to compress, encodings. Despite their real-time evaluation remaining a challenge, convolutional auto-encoders seem to be a perfect choice for the task at hand [Rippel and Bourdev 2017].

REFERENCES
Sofien Bouaziz, Andrea Tagliasacchi, and Mark Pauly. 2013. Sparse Iterative Closest Point. Computer Graphics Forum (Symposium on Geometry Processing) (2013).
Stéphane Calderon and Tamy Boubekeur. 2017. Bounding proxies for shape approximation. ACM Transactions on Graphics (TOG) 36, 4 (2017), 57.
Daniel-Ricao Canelhas, Erik Schaffernicht, Todor Stoyanov, Achim J. Lilienthal, and Andrew J. Davison. 2017. Compressed Voxel-Based Mapping Using Unsupervised Learning. Robotics (2017).
Paolo Cignoni, Claudio Rocchini, and Roberto Scopigno. 1998. Metro: Measuring error on simplified surfaces. (1998).
David Cohen-Steiner, Pierre Alliez, and Mathieu Desbrun. 2004. Variational shape approximation. ACM Trans. on Graphics (Proc. of SIGGRAPH) (2004).
Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. 2015. High-quality streamable free-viewpoint video. ACM Trans. on Graphics (TOG) (2015).
Brian Curless and Marc Levoy. 1996. A volumetric method for building complex models from range images. In SIGGRAPH.
B. R. de Araujo, Daniel S. Lopes, Pauline Jepp, Joaquim A. Jorge, and Brian Wyvill. 2015. A Survey on Implicit Surface Polygonization. Comput. Surveys (2015).
Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. 2017. Motion2fusion: real-time volumetric performance capture. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) (2017).
Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. 2016. Fusion4D: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG) (2016).
Frank Galligan, Michael Hemmer, Ondrej Stava, Fan Zhang, and Jamieson Brettle. 2018. Google/Draco: a library for compressing and decompressing 3D geometric meshes and point clouds. https://github.com/google/draco. (2018).
Michael Garland and Paul S. Heckbert. 1997. Surface simplification using quadric error metrics. In Proc. of ACM SIGGRAPH.
Allan Grønlund, Kasper Green Larsen, Alexander Mathiasen, Jesper Sindahl Nielsen, Stefan Schneider, and Mingzhou Song. 2017. Fast exact k-means, k-medians and Bregman divergence clustering in 1d. arXiv preprint arXiv:1701.07204 (2017).
Hugues Hoppe. 1996. Progressive Meshes. In Proc. of ACM SIGGRAPH.
Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc Stamminger. 2016. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proc. of the European Conf. on Comp. Vision. 362–379.
Palle E. T. Jorgensen and Myung-Sin Song. 2007. Entropy encoding, Hilbert space, and Karhunen-Loève transforms. J. Math. Phys. 48, 10 (2007), 103503.
Zachi Karni and Craig Gotsman. 2000. Spectral compression of mesh geometry. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co.
Zachi Karni and Craig Gotsman. 2004. Compression of soft-body animation sequences. Computers & Graphics (2004).
Michael Kazhdan and Hugues Hoppe. 2013. Screened Poisson surface reconstruction. ACM Transactions on Graphics (TOG) (2013).
Adarsh Kowdle, Christoph Rhemann, Sean Fanello, Andrea Tagliasacchi, Jonathan Taylor, Philip Davidson, Mingsong Dou, Kaiwen Guo, Cem Keskin, Sameh Khamis, David Kim, Danhang Tang, Vladimir Tankovich, Julien Valentin, and Shahram Izadi. 2018. The Need 4 Speed in Real-Time Dense Visual Tracking. (2018).
Bruno Lévy and Hao (Richard) Zhang. 2010. Spectral Mesh Processing. In ACM SIGGRAPH Courses.
William E. Lorensen and Harvey E. Cline. 1987. Marching cubes: A high resolution 3D surface construction algorithm. In Computer Graphics (Proc. SIGGRAPH), Vol. 21. 163–169.
Adrien Maglo, Guillaume Lavoué, Florent Dupont, and Céline Hudelot. 2015. 3D mesh compression: Survey, comparisons, and emerging trends. Comput. Surveys (2015).
G. Nigel N. Martin. 1979. Range encoding: an algorithm for removing redundancy from a digitised message. In Proc. IERE Video & Data Recording Conf.
MPEG4/AFX. 2008. ISO/IEC 14496-16: MPEG-4 Part 16, Animation Framework eXtension (AFX). Technical Report. The Moving Picture Experts Group. https://mpeg.chiariglione.org/standards/mpeg-4/animation-framework-extension-afx
Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. 2015. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. of Comp. Vision and Pattern Recognition (CVPR).
Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L. Davidson, Sameh Khamis, Mingsong Dou, et al. 2016. Holoportation: Virtual 3D teleportation in real-time. In Proc. of the Symposium on User Interface Software and Technology.
Jingliang Peng, Chang-Su Kim, and C.-C. Jay Kuo. 2005. Technologies for 3D mesh compression: A survey. Journal of Visual Communication and Image Representation (2005).
Fabián Prada, Misha Kazhdan, Ming Chuang, Alvaro Collet, and Hugues Hoppe. 2017. Spatiotemporal atlas parameterization for evolving meshes. ACM Trans. on Graphics (TOG) (2017).
Iain E. Richardson. 2011. The H.264 Advanced Video Compression Standard. John Wiley & Sons.
Oren Rippel and Lubomir Bourdev. 2017. Real-time adaptive image compression. arXiv preprint arXiv:1705.05823 (2017).
Jarek Rossignac. 1999. Edgebreaker: Connectivity compression for triangle meshes. IEEE Transactions on Visualization and Computer Graphics (1999).
Michael Ruhnke, Liefeng Bo, Dieter Fox, and Wolfram Burgard. 2013. Compact RGBD Surface Models Based on Sparse Coding. (2013).
Peter Schelkens, Adrian Munteanu, Joeri Barbarien, Mihnea Galca, Xavier Giro-Nieto, and Jan Cornelis. 2003. Wavelet coding of volumetric medical datasets. IEEE Transactions on Medical Imaging (2003).
Peter Schelkens, Adrian Munteanu, Alexis Tzannes, and Christopher M. Brislawn. 2006. JPEG2000 Part 10 – Volumetric data encoding. (2006).
Olga Sorkine, Daniel Cohen-Or, and Sivan Toledo. 2003. High-Pass Quantization for Mesh Encoding. In Proc. of the Symposium on Geometry Processing.
Robert W. Sumner, Johannes Schmid, and Mark Pauly. 2007. Embedded deformation for shape manipulation. ACM Trans. on Graphics (TOG) (2007).
Jean-Marc Thiery, Émilie Guy, and Tamy Boubekeur. 2013. Sphere-Meshes: Shape Approximation using Spherical Quadric Error Metrics. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) (2013).
Jean-Marc Thiery, Émilie Guy, Tamy Boubekeur, and Elmar Eisemann. 2016. Animated Mesh Approximation With Sphere-Meshes. ACM Trans. on Graphics (TOG) (2016).
Anastasia Tkach, Mark Pauly, and Andrea Tagliasacchi. 2016. Sphere-Meshes for Real-Time Hand Modeling and Tracking. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) (2016).
Sébastien Valette and Rémy Prost. 2004. Wavelet-based progressive compression scheme for triangle meshes: Wavemesh. IEEE Transactions on Visualization and Computer Graphics 10, 2 (2004), 123–129.
Libor Vasa and Vaclav Skala. 2011. A perception correlated comparison method for dynamic meshes. IEEE Transactions on Visualization and Computer Graphics (2011).


Draco [Galligan et al. 2018] | FVV [Collet et al. 2015] | Proposed Method

Fig. 11. Qualitative comparisons on three test sequences. We compare our approach with Draco [Galligan et al. 2018] (qp=14) and FVV [Collet et al. 2015]. Notice how FVV over-smooths details, especially in the body regions. Draco requires bitrates that are 10 times larger than ours, with results that are comparable to ours in terms of visual quality.
