To appear in ACM TOG 34(1).
Driving High-Resolution Facial Scans with Video Performance Capture
Graham Fyffe Andrew Jones Oleg Alexander Ryosuke Ichikari Paul Debevec
USC Institute for Creative Technologies
Figure 1: (a) High resolution geometric and reflectance information from multiple static expression scans is automatically combined with (d)
dynamic video frames to recover (b) matching animated high resolution performance geometry that can be (c) relit under novel illumination
from a novel viewpoint. In this example, the performance is recovered using only the single camera viewpoint in (d).
Abstract
We present a process for rendering a realistic facial performance
with control of viewpoint and illumination. The performance is
based on one or more high-quality geometry and reflectance scans
of an actor in static poses, driven by one or more video streams of
a performance. We compute optical flow correspondences between
neighboring video frames, and a sparse set of correspondences be-
tween static scans and video frames. The latter are made possible
by leveraging the relightability of the static 3D scans to match the
viewpoint(s) and appearance of the actor in videos taken in arbitrary
environments. As optical flow tends to compute proper correspon-
dence for some areas but not others, we also compute a smoothed,
per-pixel confidence map for every computed flow, based on nor-
malized cross-correlation. These flows and their confidences yield
a set of weighted triangulation constraints among the static poses
and the frames of a performance. Given a single artist-prepared
face mesh for one static pose, we optimally combine the weighted
triangulation constraints, along with a shape regularization term,
into a consistent 3D geometry solution over the entire performance
that is drift-free by construction. In contrast to previous work, even
partial correspondences contribute to drift minimization, for exam-
ple where a successful match is found in the eye region but not
the mouth. Our shape regularization employs a differential shape
term based on a spatially varying blend of the differential shapes
of the static poses and neighboring dynamic poses, weighted by the
associated flow confidences. These weights also permit dynamic
reflectance maps to be produced for the performance by blending
the static scan maps. Finally, as the geometry and maps are rep-
resented on a consistent artist-friendly mesh, we render the result-
ing high-quality animated face geometry and animated reflectance
maps using standard rendering tools.
CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional
Graphics and Realism—Animation
e-mail:{fyffe,jones,oalexander,debevec}@ict.usc.edu
Keywords: Facial performance capture, temporal correspondence
1 Introduction
Recent facial geometry scanning techniques can capture very high
resolution geometry, including high-frequency details such as skin
pores and wrinkles. When animating these highly detailed faces,
highly accurate temporal correspondence is required. At present,
the highest quality facial geometry is produced by static scanning
techniques, where the subject holds a facial pose for several sec-
onds. This permits the use of high-resolution cameras for accu-
rate stereo reconstruction and active illumination to recover pore-
level resolution surface details. Such techniques also capture high-
quality surface reflectance maps, enabling realistic rendering of the
captured faces. Alternatively, static facial poses may be captured
using facial casts combined with detail acquired from surface im-
prints. Unfortunately, dynamic scanning techniques are unable to
provide the same level of detail as static techniques, even when
high-speed cameras and active illumination are employed.
The classic approach to capturing facial motion is to use markers
or face paint to track points on the face. However, such techniques
struggle to capture the motion of the eyes and mouth, and rely on
a high-quality facial rig to provide high-frequency skin motion and
wrinkling. The best results are achieved when the rig is based on
high-resolution static scans of the same subject. A second approach
is to capture a performance with one or more passive video cameras.
Such setups are lightweight as they use environmental illumination
and off-the-shelf video cameras. As the camera records the entire
face, it should be possible to recover eye and mouth motion missed
by sparse markers. Still, by itself, passive video cannot match the
resolution of static scans. While it is possible to emboss some video
texture on the face [Bradley et al. 2010][Beeler et al. 2011][Val-
gaerts et al. 2012], many facial details appear only in specular re-
flections and are not visible under arbitrary illumination.
We present a technique for creating realistic facial animation from
a set of high-resolution scans of an actor’s face, driven by passive
video of the actor from one or more viewpoints. The videos can be
shot under existing environmental illumination using off-the-shelf
HD video cameras. The static scans can come from a variety of
sources including facial casts, passive stereo, or active illumination
techniques. High-resolution detail and relightable reflectance prop-
erties in the static scans can be transferred to the performance using
generated per-pixel weight maps. We operate our algorithm on a
performance flow graph that represents dense correspondences be-
tween dynamic frames and multiple static scans, leveraging GPU-
based optical flow to efficiently construct the graph. Besides a sin-
gle artist remesh of a scan in neutral pose, our method requires no
rigging, no training of appearance models, no facial feature detec-
tion, and no manual annotation of any kind. As a byproduct of our
method we also obtain a non-rigid registration between the artist
mesh and each static scan. Our principal contributions are:
- An efficient scheme for selecting a sparse subset of image pairs for optical flow computation for drift-free tracking.
- A fully coupled 3D tracking method with differential shape regularization using multiple locally weighted target shapes.
- A message-passing-based optimization scheme leveraging lazy evaluation of energy terms, enabling fully coupled optimization over an entire performance.
2 Related Work
As many systems have been built for capturing facial geometry and
reflectance, we will restrict our discussion to those that establish
some form of dense temporal correspondence over a performance.
Many existing algorithms compute temporal correspondence for a
sequence of temporally inconsistent geometries generated by e.g.
structured light scanners or stereo algorithms. These algorithms op-
erate using only geometric constraints [Popa et al. 2010] or by de-
forming template geometry to match each geometric frame [Zhang
et al. 2004]. The disadvantage of this approach is that the per-
frame geometry often contains missing regions or erroneous ge-
ometry which must be filled or filtered out, and any details that are
missed in the initial geometry solution are non-recoverable.
Other methods operate on video footage of facial performances.
Methods employing frame-to-frame motion analysis are subject
to the accumulation of error or “drift” in the tracked geometry,
prompting many authors to seek remedies for this issue. We there-
fore limit our discussion to methods that make some effort to ad-
dress drift. Li et al. [1993] compute animated facial blendshape
weights and rigid motion parameters to match the texture of each
video frame to a reference frame, within a local minimum deter-
mined by a motion prediction step. Drift is avoided whenever a
solid match can be made back to the reference frame. [DeCarlo
and Metaxas 1996] solves for facial rig control parameters to agree
with sparse monocular optical flow constraints, applying forces to
pull model edges towards image edges in order to combat drift.
[Guenter et al. 1998] tracks motion capture dots in multiple views to
deform a neutral facial scan, increasing the realism of the rendered
performance by projecting video of the face (with the dots digitally
removed) onto the deforming geometry. The "Universal Capture"
system described in [Borshukov et al. 2003] dispenses with the dots
and uses dense multi-view optical flow to propagate vertices from
an initial neutral expression. User intervention is required to cor-
rect drift when it occurs. [Hawkins et al. 2004] uses performance
tracking to automatically blend between multiple high-resolution
facial scans per facial region, achieving realistic multi-scale facial
deformation without the need for reprojecting per-frame video, but
uses dots to avoid drift. Bradley et al. [2010] track motion us-
ing dense multi-view optical flow, with a final registration step be-
tween the neutral mesh and every subsequent frame to reduce drift.
Beeler et al. [2011] explicitly identify anchor frames that are similar
to a manually chosen reference pose using a simple image differ-
ence metric, and track the performance bidirectionally between an-
chor frames. Non-sequential surface tracking [Klaudiny and Hilton
2012] finds a minimum-cost spanning tree over the frames in a per-
formance based on sparse feature positions, tracking facial geome-
try across edges in the tree with an additional temporal fusion step.
Valgaerts et al. [2012] apply scene flow to track binocular passive
video with a regularization term to reduce drift.
One drawback to all such optical flow tracking algorithms is that the
face is tracked from one pose to another as a whole, and success of
the tracking depends on accurate optical flow between images of the
entire face. Clearly, the human face is capable of repeating differ-
ent poses over different parts of the face asynchronously, which the
holistic approaches fail to model. For example, if the subject is talk-
ing with eyebrows raised and later with eyebrows lowered, a holis-
tic approach will fail to exploit similarities in mouth poses when
eyebrow poses differ. In contrast, our approach constructs a graph
considering similarities over multiple regions of the face across the
performance frames and a set of static facial scans, removing the
need for sparse feature tracking or anchor frame selection.
Blend-shape based animation rigs are also used to reconstruct dy-
namic poses based on multiple face scans. The company Image
Metrics (now Faceware) has developed commercial software for
driving a blend-shape rig with passive video based on active appear-
ance models [Cootes et al. 1998]. Weise et al. [2011] automatically
construct a personalized blend shape rig and drive it with Kinect
depth data using a combination of as-rigid-as-possible constraints
and optical flow. In both cases, the quality of the resulting tracked
performance is directly related to the quality of the rig. Each
tracked frame is a linear combination of the input blend-shapes,
so any performance details that lie outside the domain spanned by
the rig will not be reconstructed. Huang et al. [2011] automati-
cally choose a minimal set of blend shapes to scan based on previ-
ously captured performance with motion capture markers. Recre-
ating missing detail requires artistic effort to add corrective shapes
and cleanup animation curves [Alexander et al. 2009]. There has
been some research into other non-traditional rigs incorporating
scan data. Ma et al. [2008] fit a polynomial displacement map to
dynamic scan training data and generate detailed geometry from
sparse motion capture markers. Bickel et al. [2008] locally inter-
polate a set of static poses using radial basis functions driven by
motion capture markers. Our method combines the shape regu-
larization advantages of blendshapes with the flexibility of optical
flow based tracking. Our optimization algorithm leverages 3D in-
formation from static scans without constraining the result to lie
only within the linear combinations of the scans. At the same time,
we obtain per-pixel blend weights that can be used to produce per-
frame reflectance maps.
3 Data Capture and Preparation
We capture high-resolution static geometry using multi-view stereo
and gradient-based photometric stereo [Ghosh et al. 2011]. The
scan set includes around 30 poses largely inspired by the Facial
Action Coding System (FACS) [Ekman and Friesen 1978], selected
to span nearly the entire range of possible shapes for each part of
the face. For efficiency, we capture some poses with the subject
combining FACS action units from the upper and lower half of the
face. For example, combining eyebrows raise and cheeks puff into
a single scan. Examples of the input scan geometry can be seen in
Fig. 2. A base mesh is defined by an artist for the neutral pose scan.
The artist mesh has an efficient layout with edge loops following
the wrinkles of the face. The non-neutral poses are represented as
raw scan geometry, requiring no artistic topology or remeshing.
We capture dynamic performances using up to six Canon 1DX
DSLR cameras under constant illumination. In the simplest case,
we use the same cameras that were used for the static scans and
switch to 1920×1080 30p movie mode. We compute a sub-frame-
accurate synchronization offset between cameras using a correla-
tion analysis of the audio tracks. This could be omitted if cam-
eras with hardware synchronization are employed. Following each
performance, we capture a video frame of a calibration target to
calibrate camera intrinsics and extrinsics. We relight (and when
necessary, repose) the static scan data to resemble the illumination
conditions observed in the performance video. In the simplest case,
the illumination field resembles one of the photographs taken dur-
ing the static scan process and no relighting is required.
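As an illustration of the audio-based synchronization step described above, the following sketch (ours, not the authors' implementation) estimates a sub-frame offset between two cameras by cross-correlating their audio tracks and refining the peak with parabolic interpolation; the function name, parameters, and peak refinement are assumptions.

```python
import numpy as np

def subframe_offset(audio_a, audio_b, sample_rate, fps=30.0):
    """Estimate the relative offset between two cameras' audio tracks, in video frames.

    A minimal sketch assuming mono tracks sampled at the same rate. A positive
    result means track B leads track A. For long tracks an FFT-based correlation
    would be preferable; np.correlate is O(n^2) but keeps the sketch short.
    """
    a = (audio_a - audio_a.mean()) / (audio_a.std() + 1e-9)
    b = (audio_b - audio_b.mean()) / (audio_b.std() + 1e-9)
    corr = np.correlate(a, b, mode="full")          # lag k is stored at index k + len(b) - 1
    k = int(np.argmax(corr))
    # Parabolic refinement around the integer peak for sub-sample (sub-frame) precision.
    if 0 < k < len(corr) - 1:
        y0, y1, y2 = corr[k - 1], corr[k], corr[k + 1]
        k += 0.5 * (y0 - y2) / (y0 - 2.0 * y1 + y2 + 1e-12)
    lag_samples = k - (len(b) - 1)
    return lag_samples * fps / sample_rate          # convert samples to video frames
```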
Figure 2: Sample static scans (showing geometry only).
4 The Performance Flow Graph
Optical-flow-based tracking algorithms such as [Bradley et al.
2010][Beeler et al. 2011][Klaudiny and Hilton 2012] relate frames
of a performance to each other based on optical flow correspon-
dences over a set of image pairs selected from the performance.
These methods differ in part by the choice of the image pairs to be
employed. We generalize this class of algorithms using a structure
we call the performance flow graph, which is a complete graph with
edges representing dense 2D correspondences between all pairs of
images, with each edge having a weight, or confidence, of the as-
sociated estimated correspondence field. The graphs used in previ-
ous works, including anchor frames [Beeler et al. 2011] and non-
sequential alignment with temporal fusion [Klaudiny and Hilton
2012], can be represented as a performance flow graph having unit
weight for the edges employed by the respective methods, and zero
weight for the unused edges. We further generalize the performance
flow graph to include a dense confidence field associated with each
correspondence field, allowing the confidence to vary spatially over
the image. This enables our technique to exploit relationships be-
tween images where only a partial correspondence was able to be
computed (for example, a pair of images where the mouth is similar
but the eyes are very different). Thus our technique can be viewed
as an extension of anchor frames or minimum spanning trees to
minimize drift independently over different regions of the face.
A performance capture system that considers correspondences be-
tween all possible image pairs naturally minimizes drift. However,
this would require an exorbitant number of graph edges, so we in-
stead construct a graph with a reduced set of edges that approxi-
mates the complete graph, in the sense that the correspondences are
representative of the full set with respect to confidence across the
regions of the face. Our criterion for selecting the edges to include
in the performance flow graph is that any two images having a high
Figure 3: performance flow graph showing optical flow correspon-
dences between static and dynamic images. Red lines represent op-
tical flow between neighboring frames within a performance. Blue,
green, and orange lines represent optical flow between dynamic and
static images. Based on initial low-resolution optical flow, we con-
struct a sparse graph requiring only a small subset of high resolu-
tion flows to be computed between static scans and dynamic frames.
confidence correspondence between them in the complete graph of
possible correspondences ought to have a path between them (a
concatenation of one or more correspondences) in the constructed
graph with nearly as high confidence (including the reduction in
confidence from concatenation). We claim that correspondences
between temporally neighboring dynamic frames are typically of
high quality, and no concatenation of alternative correspondences
can be as confident, therefore we always include a graph edge be-
tween each temporally neighboring pair of dynamic frames. Cor-
respondences between frames with larger temporal gaps are well-
approximated by concatenating neighbors, but decreasingly so over
larger temporal gaps (due to drift). We further claim that whenever
enough drift accumulates to warrant including a graph edge over
the larger temporal gap, there exists a path with nearly as good con-
fidence that passes through one of the predetermined static scans
(possibly a different static scan for each region of the face). We jus-
tify this claim by noting the 30 static poses based on FACS ought
to span the space of performances well enough that any region of
any dynamic frame can be corresponded to some region in some
static scan with good confidence. Therefore we do not include
any edges between non-neighboring dynamic frames, and instead
consider only edges between a static scan and a dynamic frame as
candidates for inclusion (visualized in Fig. 3). Finally, as the drift
accumulated from the concatenation described above warrants ad-
ditional edges only sparsely over time, we devise a coarse-to-fine
graph construction strategy using only a sparse subset of static-to-
dynamic graph edges. We detail this strategy in Section 4.1.
4.1 Constructing the Performance Flow Graph
The images used in our system consist of one or more dynamic
sequences of frames captured from one or more viewpoints, and
roughly similar views of a set of high-resolution static scans. The
nodes in our graph represent static poses (associated with static
scans) and dynamic poses (associated with dynamic frames from
one or more sequences). We construct the performance flow graph
by computing a large set of static-to-dynamic optical flow corre-
spondences at a reduced resolution for only a single viewpoint, and
then omit redundant correspondences using a novel voting algo-
rithm to select a sparse set of correspondences that is representative
of the original set. We then compute high-quality optical flow cor-
respondences at full resolution for the sparse set, and include all
viewpoints. The initial set of correspondences consists of quarter-
resolution optical flows from each static scan to every nth dynamic frame. For most static scans we use every 5th dynamic frame, while
for the eyes-closed scan we use every dynamic frame in order to
catch rapid eye blinks. We then compute normalized cross corre-
lation fields between the warped dynamic frames and each original
static scan to evaluate the confidence of the correspondences. These
correspondences may be computed in parallel over multiple com-
puters, as there is no sequential dependency between them. We find
that at quarter resolution, flow-based cross correlation correctly as-
signs low confidence to incorrectly matched facial features, for ex-
ample when flowing disparate open and closed mouth shapes. To
reduce noise and create a semantically meaningful metric, we av-
erage the resulting confidence over twelve facial regions (see Fig.
4). These facial regions are defined once on the neutral pose, and
are warped to all other static poses using rough static-to-static opti-
cal flow. Precise registration of regions is not required, as they are
only used in selecting the structure of the performance graph. In
the subsequent tracking phase, per-pixel confidence is used.
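A minimal sketch of this confidence computation, assuming grayscale float images, a flow field that stores per-pixel displacements from the static image into the dynamic frame, and precomputed boolean region masks; the window size and helper names are our own choices rather than the paper's.

```python
import numpy as np
from scipy.ndimage import uniform_filter, map_coordinates

def warp_by_flow(dynamic, flow):
    """Warp the dynamic image onto the static image's pixel grid.
    flow[y, x] = (dx, dy): displacement from static pixel (x, y) to its match
    in the dynamic frame (an assumed convention)."""
    h, w = dynamic.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    return map_coordinates(dynamic, [ys + flow[..., 1], xs + flow[..., 0]], order=1)

def ncc_confidence(static, warped, win=15):
    """Per-pixel normalized cross-correlation inside a win x win window."""
    mu_s, mu_w = uniform_filter(static, win), uniform_filter(warped, win)
    cov = uniform_filter(static * warped, win) - mu_s * mu_w
    var_s = uniform_filter(static ** 2, win) - mu_s ** 2
    var_w = uniform_filter(warped ** 2, win) - mu_w ** 2
    return cov / np.sqrt(np.maximum(var_s * var_w, 1e-8))

def region_confidences(static, dynamic, flow, region_masks):
    """Average the per-pixel confidence over the precomputed facial region masks."""
    conf = ncc_confidence(static, warp_by_flow(dynamic, flow))
    return {name: float(conf[mask].mean()) for name, mask in region_masks.items()}
```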
Figure 4: We compute an initial low-resolution optical flow be-
tween a dynamic image (a) and static image (b). We then com-
pute normalized cross correlation between the static image (b) and
the warped dynamic image (c) to produce the per-pixel confidence
shown in (d). We average these values for 12 regions (e) to obtain
a per-region confidence value (f). This example shows correlation
between the neutral scan and a dynamic frame with the eyebrows
raised and the mouth slightly open. The forehead and mouth re-
gions are assigned appropriately lower confidences.
Ideally we want the performance flow graph to be sparse. Besides
temporally adjacent poses, dynamic poses should only connect to
similar static poses and edges should be evenly distributed over time
to avoid accumulation of drift. We propose an iterative greedy vot-
ing algorithm based on the per-region confidence measure to iden-
tify good edges. The confidence of correspondence between the dy-
namic frames and any region of any static facial scan can be viewed
as a curve over time (depicted in Fig. 5). In each iteration we iden-
tify the maximum confidence value over all regions, all scans, and
all frames. We add an edge between the identified dynamic pose
[Plot: per-region confidence (0-0.8) vs. frame number (0-400); curves for the Forehead and Mouth regions, with sampled frames marked at 10, 70, 130, 180, 270, and 320.]
Figure 5: A plot of the per-region confidence metric over time.
Higher numbers indicate greater correlation between the dynamic
frames and a particular static scan. The cyan curve represents the
center forehead region of a brows-raised static scan which is ac-
tive throughout the later sequence. The green curve represents the
mouth region for an extreme mouth-open scan which is active only
when the mouth opens to its fullest extent. The dashed lines indicate
the timing of the sampled frames shown on the bottom row.
and static pose to the graph. We then adjust the recorded confidence
of the identified region by subtracting a hat function scaled by the
maximum confidence and centered around the maximum frame, in-
dicating that the selected edge has been accounted for, and temporal
neighbors partly so. All other regions are adjusted by subtracting
similar hat functions, scaled by the (non-maximal) per-region con-
fidence of the identified flow. This suppresses any other regions that
are satisfied by the flow. The slope of the hat function represents a
loss of confidence as this flow is combined with adjacent dynamic-
to-dynamic flows. We then iterate and choose the new highest con-
fidence value, until all confidence values fall below a threshold. The
two parameters (the slope of the hat function and the final thresh-
old value) provide intuitive control over the total number of graph
edges. We found a reasonable hat function falloff to be a 4% re-
duction for every temporal flow and a threshold value that is 20%
of the initial maximum confidence. After constructing the graph,
a typical 10-20 second performance flow graph will contain 100-
200 edges between dynamic and static poses. Again, as the change
between sequential frames is small, we preserve all edges between
neighboring dynamic poses.
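The greedy voting could be sketched as follows, with the confidence values stored as an array indexed by static scan, facial region, and sampled dynamic frame. The clamping to zero, the hat-function width of 1/falloff steps, and the restriction of the suppression to regions of the selected scan are our readings of details the text leaves open.

```python
import numpy as np

def select_graph_edges(conf, falloff=0.04, stop_frac=0.20):
    """Greedy selection of static-to-dynamic edges for the performance flow graph.

    conf[s, r, t]: per-region confidence of the low-resolution flow from static
    scan s to sampled dynamic frame t, averaged over facial region r (Fig. 4/5).
    Returns a list of (scan, frame) edges.
    """
    conf = conf.copy()
    threshold = stop_frac * conf.max()              # stop at 20% of the initial maximum
    n_regions, n_frames = conf.shape[1], conf.shape[2]
    width = int(np.ceil(1.0 / falloff))             # hat reaches zero after 1/falloff steps
    edges = []
    while conf.max() > threshold:
        s, r, t = np.unravel_index(np.argmax(conf), conf.shape)
        peak = conf[s, r, t]
        edges.append((int(s), int(t)))
        frames = np.arange(max(0, t - width), min(n_frames, t + width + 1))
        hat = np.maximum(0.0, 1.0 - falloff * np.abs(frames - t))
        # Suppress the selected region by the peak value, and every other region
        # of the same scan by its own (non-maximal) confidence for this flow.
        for rr in range(n_regions):
            scale = peak if rr == r else conf[s, rr, t]
            conf[s, rr, frames] = np.maximum(0.0, conf[s, rr, frames] - scale * hat)
    return edges
```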
After selecting the graph edges, final HD resolution optical flows
are computed for all active cameras and for all retained graph edges.
We directly load video frames using NVIDIA's H.264 GPU decoder
and feed them to the FlowLib implementation of GPU optical flow
[Werlberger 2012]. Running on an NVIDIA GTX 680, computation
of quarter-resolution flows for graph construction takes less than one
second per flow. Full-resolution HD flows for dynamic-to-dynamic
images take 8 seconds per flow, and full-resolution flows between
static and dynamic images take around 23 seconds per flow due to
a larger search window. More sophisticated correspondence esti-
mation schemes could be employed within our framework, but our
intention is that the framework be agnostic to this choice and ro-
bust to imperfections in the pairwise correspondences. After com-
puting optical flows and confidences, we synchronize all the flow
sequences to a primary camera by warping each flow frame for-
ward or backward in time based on the sub-frame synchronization
offsets between cameras.
We claim that an approximate performance flow graph constructed
in this manner is more representative of the complete set of possible
correspondences than previous methods that take an all-or-nothing
approach to pair selection, while still employing a number of opti-
cal flow computations on the same order as previous methods (i.e.
temporal neighbors plus additional sparse image pairs).
5 Fully Coupled Performance Tracking
The performance flow graph is representative of all the constraints
we could glean from 2D correspondence analysis of the input im-
ages, and now we aim to put those constraints to work. We formu-
late an energy function in terms of the 3D vertex positions of the
artist mesh as it deforms to fit all of the dynamic and static poses in
the performance flow graph in a common head coordinate system,
as well as the associated head-to-world rigid transforms. We collect
the free variables into a vector $\theta = (x^p_i, R_p, t_p \mid p \in D \cup S,\ i \in V)$, where $x^p_i$ represents the 3D vertex position of vertex $i$ at pose $p$ in the common head coordinate system, $R_p$ and $t_p$ represent the rotation matrix and translation vector that rigidly transform pose $p$ from the common head coordinate system to world coordinates, $D$ is the set of dynamic poses, $S$ is the set of static poses, and $V$ is the set of mesh vertices. The energy function is then:
$$E(\theta) = \sum_{(p,q)\in F}\big(E^{pq}_{\mathrm{corr}} + E^{qp}_{\mathrm{corr}}\big) + \lambda \sum_{p\in D\cup S} |F^p|\, E^p_{\mathrm{shape}} + \zeta \sum_{p\in S} |F^p|\, E^p_{\mathrm{wrap}} + \gamma\, |F^g|\, E_{\mathrm{ground}}, \quad (1)$$
where $F$ is the set of performance flow graph edges, $F^p$ is the subset of edges connecting to pose $p$, and $g$ is the ground (neutral) static pose. This function includes:
- dense correspondence constraints $E^{pq}_{\mathrm{corr}}$ associated with the edges of the performance flow graph,
- shape regularization terms $E^p_{\mathrm{shape}}$ relating the differential shape of dynamic and static poses to their graph neighbors,
- "shrink wrap" terms $E^p_{\mathrm{wrap}}$ to conform the static poses to the surface of the static scan geometries,
- a final grounding term $E_{\mathrm{ground}}$ to prefer the vertex positions in a neutral pose to be close to the artist mesh vertex positions.
We detail these terms in sections 5.2 - 5.5. Note we do not em-
ploy a stereo matching term, allowing our technique to be robust to
small synchronization errors between cameras. As the number of
poses and correspondences may vary from one dataset to another,
the summations in (1) contain balancing factors (to the immediate
right of each Σ) in order to have comparable total magnitude (pro-
portional to |F|). The terms are weighted by tunable term weights
λ, ζ and γ, which in all examples we set equal to 1.
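Structurally, Eq. (1) can be read as the following sketch, in which `terms` is a hypothetical container supplying the individual energy callables and the balancing factors $|F^p|$ are the per-pose edge counts; the names are illustrative and not the authors' code.

```python
from collections import defaultdict

def total_energy(dynamic_poses, static_poses, ground_pose, graph_edges, terms,
                 lam=1.0, zeta=1.0, gamma=1.0):
    """Structural sketch of Eq. (1); `terms` is a hypothetical object with callables
    corr(p, q), shape(p), wrap(p) and ground()."""
    degree = defaultdict(int)                  # |F^p|: edges of F incident to each pose
    for p, q in graph_edges:
        degree[p] += 1
        degree[q] += 1

    E = 0.0
    for p, q in graph_edges:                   # correspondence terms, both directions
        E += terms.corr(p, q) + terms.corr(q, p)
    for p in list(dynamic_poses) + list(static_poses):
        E += lam * degree[p] * terms.shape(p)  # differential shape regularization
    for p in static_poses:
        E += zeta * degree[p] * terms.wrap(p)  # shrink-wrap to the raw static scans
    E += gamma * degree[ground_pose] * terms.ground()   # grounding to the artist mesh
    return E
```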
5.1 Minimization by Lazy DDMS-TRWS
In contrast to previous work, we consider the three-dimensional
coupling between all terms in our formulation, over all dynamic and
static poses simultaneously, thereby obtaining a robust estimate that
gracefully fills in missing or unreliable information. This presents
two major challenges. First, the partial matches and loops in the
performance flow graph preclude the use of straightforward mesh
propagation schemes used in previous works. Such propagation
would produce only partial solutions for many poses. Second (as a
result of the first) we lack a complete initial estimate for traditional
optimization schemes such as Levenberg-Marquardt.
To address these challenges, we employ an iterative scheme that
admits partial intermediate solutions, with pseudocode in Algo-
rithm 1. As some of the terms in (1) are data-dependent, we
adapt the outer loop of Data Driven Mean-Shift Belief Propagation
(DDMSBP) [Park et al. 2010], which models the objective function
in each iteration as an increasingly-tight Gaussian (or quadratic)
approximation of the true function. Within each DDMS loop, we
use Gaussian Tree-Reweighted Sequential message passing (TRW-
S) [Kolmogorov 2006], adapted to allow the terms in the model to
be constructed lazily as the solution progresses over the variables.
Hence we call our scheme Lazy DDMS-TRWS. We define the or-
dering of the variables to be pose-major (i.e. visiting all the vertices
of one pose, then all the vertices of the next pose, etc.), with static
poses followed by dynamic poses in temporal order. We decom-
pose the Gaussian belief as a product of 3D Gaussians over vertices
and poses, which admits a pairwise decomposition of (1) as a sum
of quadratics. We denote the current belief of a vertex i for pose p
as $\bar{x}^p_i$ with covariance $\Sigma^p_i$ (stored as inverse covariance for convenience), omitting the $i$ subscript to refer to all vertices collectively. We detail the modeling of the energy terms in sections 5.2 - 5.5, defining $\bar{y}^p_i = R_p \bar{x}^p_i + t_p$ as shorthand for world space vertex position estimates. We iterate the DDMS loop 6 times, and iterate TRW-S until 95% of the vertices converge to within 0.01 mm.
Algorithm 1 Lazy DDMS-TRWS for (1)
  ∀ p, i: (Σ^p_i)^{-1} ← 0.
  for DDMS outer iterations do
    // Reset the model:
    ∀ p, q: E^{pq}_corr, E^p_shape, E^p_wrap ← undefined (effectively 0).
    for TRW-S inner iterations do
      // Major TRW-S loop over poses:
      for each p ∈ D ∪ S in order of increasing o(p) do
        // Update model where possible:
        for each q | (p, q) ∈ F do
          if (Σ^p)^{-1} ≠ 0 and E^{pq}_corr undefined then
            E^{pq}_corr ← model fit using (2) in section 5.2.
          if (Σ^q)^{-1} ≠ 0 and E^{qp}_corr undefined then
            E^{qp}_corr ← model fit using (2) in section 5.2.
        if (Σ^p)^{-1} ≠ 0 and E^p_wrap undefined then
          E^p_wrap ← model fit using (8) in section 5.4.
        if ∃ (p, q) ∈ F : (Σ^q)^{-1} ≠ 0 and E^p_shape undefined then
          E^p_shape ← model fit using (5) in section 5.3.
        // Minor TRW-S loop over vertices:
        Pass messages based on (1) to update x̄^p, (Σ^p)^{-1}.
        Update R_p, t_p as in section 5.6.
      // Reverse TRW-S ordering:
      o(s) ← |D ∪ S| + 1 − o(s).
5.2 Modeling the Correspondence Term
The correspondence term in (1) penalizes disagreement between
optical flow vectors and projected vertex locations. Suppose we
have a 2D optical flow correspondence field between poses p and q
in (roughly) the same view c. We may establish a 3D relationship
between $x^p_i$ and $x^q_i$ implied by the 2D correspondence field, which we model as a quadratic penalty function:

$$E^{pq}_{\mathrm{corr}} = \frac{1}{|C|} \sum_{c\in C} \sum_{i\in V} (x^q_i - x^p_i - f^c_{pq,i})^T F^c_{pq,i}\, (x^q_i - x^p_i - f^c_{pq,i}), \quad (2)$$

where $C$ is the set of camera viewpoints, and $f^c_{pq,i}$, $F^c_{pq,i}$ are respectively the mean and precision matrix of the penalty, which we estimate from the current estimated positions as follows. We first project $\bar{y}^p_i$ into the image plane of view $c$ of pose $p$. We then warp the 2D image position from view $c$ of pose $p$ to view $c$ of pose $q$
using the correspondence field. The warped 2D position defines a world-space view ray that the same vertex $i$ ought to lie on in pose $q$. We transform this ray back into common head coordinates (via $t_q$, $R^T_q$) and penalize the squared distance from $x^q_i$ to this ray. Letting $r^c_{pq,i}$ represent the direction of this ray, this yields:

$$f^c_{pq,i} = \big(I - r^c_{pq,i}\, {r^c_{pq,i}}^T\big)\big(R^T_q (c^c_q - t_q) - \bar{x}^p_i\big), \quad (3)$$

where $c^c_q$ is the nodal point of view $c$ of pose $q$, and $r^c_{pq,i} = R^T_q d^c_{pq,i}$ with $d^c_{pq,i}$ the world-space direction of the ray in view $c$ of pose $q$ through the 2D image plane point $f^c_{pq}[P^c_p(\bar{y}^p_i)]$ (where square brackets represent bilinearly interpolated sampling of a field or image), $f^c_{pq}$ the optical flow field transforming an image-space point from view $c$ of pose $p$ to the corresponding point in view $c$ of pose $q$, and $P^c_p(x)$ the projection of a point $x$ into the image plane of view $c$ of pose $p$ (which may differ somewhat from pose to pose).

If we were to use the squared-distance-to-ray penalty directly, $F^c_{pq,i}$ would be $I - r^c_{pq,i}\, {r^c_{pq,i}}^T$, which is singular. To prevent the problem from being ill-conditioned and also to enable the use of monocular performance data, we add a small regularization term to produce a non-singular penalty, and weight the penalty by the confidence of the optical flow estimate. We also assume the optical flow field is locally smooth, so a large covariance $\Sigma^p_i$ inversely influences the precision of the model, whereas a small covariance $\Sigma^p_i$ does not, and weight the model accordingly. Intuitively, this weighting causes information to propagate from the ground term outward via the correspondences in early iterations, and blends correspondences from all sources in later iterations. All together, this yields:

$$F^c_{pq,i} = \min\!\big(1, \det(\Sigma^p_i)^{-\frac{1}{3}}\big)\, v^c_{p,i}\, \tau^c_{pq}[P^c_p(\bar{y}^p_i)]\, \big(I - r^c_{pq,i}\, {r^c_{pq,i}}^T + \epsilon I\big), \quad (4)$$

where $v^c_{p,i}$ is a soft visibility factor (obtained by blurring a binary vertex visibility map and modulated by the cosine of the angle between surface normal and view direction), $\tau^c_{pq}$ is the confidence field associated with the correspondence field $f^c_{pq}$, and $\epsilon$ is a small regularization constant. We use $\det(\Sigma)^{-1/3}$ as a scalar form of precision for 3D Gaussians.
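For a single vertex and a single view, the estimation of $f^c_{pq,i}$ and $F^c_{pq,i}$ in Eqs. (3)-(4) might look like the sketch below. The projection, ray-unprojection, and field-sampling helpers are assumed to exist and are not part of the paper; the determinant guard is ours.

```python
import numpy as np

def corr_penalty_params(y_world_p, x_head_p, Sigma_p, R_q, t_q, cam_center_q,
                        project_p, unproject_ray_q, flow_pq, conf_pq,
                        visibility, eps=1e-3):
    """Estimate the mean f and precision F of the correspondence penalty, Eqs. (3)-(4),
    for one vertex i and one view c of the pose pair (p, q).

    Assumed helpers (hypothetical): project_p(y) projects a world point into view c of
    pose p; flow_pq(uv) returns the corresponding image point in view c of pose q;
    unproject_ray_q(uv) returns the world-space ray direction through that pixel;
    conf_pq(uv) samples the flow confidence field.
    """
    uv_p = project_p(y_world_p)                 # P^c_p(ȳ^p_i)
    uv_q = flow_pq(uv_p)                        # warped point f^c_pq[P^c_p(ȳ^p_i)]
    d = unproject_ray_q(uv_q)                   # world-space ray direction in pose q
    r = R_q.T @ d                               # ray direction in common head coordinates
    r = r / np.linalg.norm(r)
    P_perp = np.eye(3) - np.outer(r, r)         # projector orthogonal to the ray

    # Eq. (3): component of (camera center - current estimate) orthogonal to the ray.
    f = P_perp @ (R_q.T @ (cam_center_q - t_q) - x_head_p)

    # Eq. (4): scale by scalar belief precision, visibility and flow confidence,
    # and regularize the singular rank-2 projector with a small isotropic term.
    det = max(float(np.linalg.det(Sigma_p)), 1e-12)   # guard against zero determinant
    prec = min(1.0, det ** (-1.0 / 3.0))
    F = prec * visibility * conf_pq(uv_p) * (P_perp + eps * np.eye(3))
    return f, F
```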
5.3 Modeling the Differential Shape Term
The shape term in (1) constrains the differential shape of each pose
to a spatially varying convex combination of the differential shapes
of the neighboring poses in the performance flow graph:
$$E^p_{\mathrm{shape}} = \sum_{(i,j)\in E} \big\| x^p_j - x^p_i - l^p_{ij} \big\|^2, \quad (5)$$

$$l^p_{ij} = \frac{\epsilon\,(g_j - g_i) + \sum_{q|(p,q)\in F} w^{pq}_{ij}\, (\bar{x}^q_j - \bar{x}^q_i)}{\epsilon + \sum_{q|(p,q)\in F} w^{pq}_{ij}}, \quad (6)$$

$$w^{pq}_{ij} = \frac{w^{pq}_i\, w^{pq}_j}{w^{pq}_i + w^{pq}_j}, \quad (7)$$

where $E$ is the set of edges in the geometry mesh, $w^{pq}_i = \det\!\big(\frac{1}{|C|}\sum_{c\in C} F^c_{pq,i} + F^c_{qp,i}\big)^{1/3}$ (which is intuitively the strength of the relationship between poses $p$ and $q$ due to the correspondence term), $g$ denotes the artist mesh vertex positions, and $\epsilon$ is a small regularization constant. The weights $w^{pq}_i$ additionally enable trivial synthesis of high-resolution reflectance maps for each dynamic frame of the performance by blending the static pose data.
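A sketch of the per-edge blend targets of Eqs. (5)-(7) for one pose, under an assumed data layout (dictionaries keyed by graph neighbor); the small constants are illustrative.

```python
import numpy as np

def shape_targets(mesh_edges, artist_verts, neighbor_verts, neighbor_weights, eps=1e-3):
    """Per-edge target differentials l^p_ij for one pose p (Eqs. (6)-(7)).

    mesh_edges:       list of (i, j) vertex index pairs (the set E).
    artist_verts:     (V, 3) artist mesh positions g.
    neighbor_verts:   dict q -> (V, 3) current estimates x̄^q for each graph neighbor q of p.
    neighbor_weights: dict q -> (V,) per-vertex weights w^{pq}_i.
    """
    targets = {}
    for i, j in mesh_edges:
        num = eps * (artist_verts[j] - artist_verts[i])   # ε (g_j − g_i)
        den = eps
        for q, xq in neighbor_verts.items():
            wi, wj = neighbor_weights[q][i], neighbor_weights[q][j]
            w_ij = wi * wj / (wi + wj + 1e-12)            # Eq. (7)
            num += w_ij * (xq[j] - xq[i])                 # weighted neighbor differential
            den += w_ij
        targets[(i, j)] = num / den                       # Eq. (6)
    return targets

def shape_energy(pose_verts, targets):
    """Eq. (5): squared deviation of the pose differentials from their blend targets."""
    return sum(np.sum((pose_verts[j] - pose_verts[i] - l) ** 2)
               for (i, j), l in targets.items())
```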
5.4 Modeling the Shrink Wrap Term
The shrink wrap term in (1) penalizes the distance between static
pose vertices and the raw scan geometry of the same pose. We
model this as a regularized distance-to-plane penalty:
$$E^p_{\mathrm{wrap}} = \sum_{i\in V} (x^p_i - d^p_i)^T\, g^p_i\, \big(n^p_i\, {n^p_i}^T + \epsilon I\big)\, (x^p_i - d^p_i), \quad (8)$$

where $(n^p_i, d^p_i)$ are the normal and centroid of a plane fitted to the surface of the static scan for pose $p$ close to the current estimate $\bar{x}^p_i$ in common head coordinates, and $g^p_i$ is the confidence of the planar fit. We obtain the planar fit inexpensively by projecting $\bar{y}^p_i$ into each camera view, and sampling the raw scan surface via a set of precomputed rasterized views of the scan. (Alternatively, a 3D search could be employed to obtain the samples.) Each surface sample (excluding samples that are occluded or outside the rasterized scan) provides a plane equation based on the scan geometry and surface normal. We let $n^p_i$ and $d^p_i$ be the weighted average values of the plane equations over all surface samples:

$$n^p_i = \sum_{c\in C} \omega^c_{p,i}\, R^T_p\, \hat{n}^c_p[P^c_p(\bar{y}^p_i)] \;\;\text{(normalized)}, \quad (9)$$

$$d^p_i = \Big(\sum_{c\in C} \omega^c_{p,i}\Big)^{-1} \sum_{c\in C} \omega^c_{p,i}\, R^T_p \big(\hat{d}^c_p[P^c_p(\bar{y}^p_i)] - t_p\big), \quad (10)$$

$$g^p_i = \min\!\big(1, \det(\Sigma^p_i)^{-\frac{1}{3}}\big) \sum_{c\in C} \omega^c_{p,i}, \quad (11)$$

where $(\hat{n}^c_p, \hat{d}^c_p)$ are the world-space surface normal and position images of the rasterized scans, and $\omega^c_{p,i} = 0$ if the vertex is occluded in view $c$ or lands outside of the rasterized scan, otherwise $\omega^c_{p,i} = v^c_{p,i} \exp\!\big(-\big\|\hat{d}^c_p[P^c_p(\bar{y}^p_i)] - \bar{y}^p_i\big\|^2\big)$.
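The weighted plane fit of Eqs. (9)-(11) for a single vertex could be sketched as follows, assuming per-view samplers for the rasterized scan images and a soft visibility helper; all of these helper names are hypothetical.

```python
import numpy as np

def plane_fit(y_world, Sigma_p, R_p, t_p, views, eps_det=1e-12):
    """Weighted plane fit for one vertex of static pose p (Eqs. (9)-(11)).

    `views` is a list of per-camera records with assumed helpers: project(y) -> pixel,
    sample_normal(uv), sample_position(uv) (rasterized scan images), visibility(uv),
    and occluded(uv)/inside(uv) predicates.
    """
    n = np.zeros(3)
    d = np.zeros(3)
    w_sum = 0.0
    for cam in views:
        uv = cam.project(y_world)
        if cam.occluded(uv) or not cam.inside(uv):
            continue                                    # ω^c_{p,i} = 0
        d_hat = cam.sample_position(uv)                 # world-space scan point d̂^c_p[uv]
        w = cam.visibility(uv) * np.exp(-np.sum((d_hat - y_world) ** 2))   # ω^c_{p,i}
        n += w * (R_p.T @ cam.sample_normal(uv))        # Eq. (9), before normalization
        d += w * (R_p.T @ (d_hat - t_p))                # Eq. (10), numerator
        w_sum += w
    if w_sum == 0.0:
        return None                                     # no usable samples for this vertex
    n = n / (np.linalg.norm(n) + 1e-12)                 # normalize, Eq. (9)
    d = d / w_sum                                       # Eq. (10)
    prec = min(1.0, max(float(np.linalg.det(Sigma_p)), eps_det) ** (-1.0 / 3.0))
    g = prec * w_sum                                    # Eq. (11): confidence of the fit
    return n, d, g
```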
5.5 Modeling the Ground Term
The ground term in (1) penalizes the distance between vertex po-
sitions in the ground (neutral) pose and the artist mesh geometry:
$$E_{\mathrm{ground}} = \sum_{i\in V} \big\| x^g_i - R^T_g\, g_i \big\|^2, \quad (12)$$

where $g_i$
is the position of the vertex in the artist mesh. This term
is simpler than the shrink-wrap term since the pose vertices are in
one-to-one correspondence with the artist mesh vertices.
5.6 Updating the Rigid Transforms
We initialize our optimization scheme with all $(\Sigma^p_i)^{-1} = 0$ (and hence all $\bar{x}^p_i$ moot), fully relying on the lazy DDMS-TRWS scheme to propagate progressively tighter estimates of the vertex positions $x^p_i$ throughout the solution. Unfortunately, in our formulation the rigid transforms $(R_p, t_p)$ enjoy no such treatment as they always occur together with $x^p_i$ and would produce non-quadratic terms if they were included in the message passing domain. Therefore we must initialize the rigid transforms to some rough initial guess, and update them after each iteration. The neutral pose is an exception, where the transform is specified by the user (by rigidly posing the artist mesh to their whim) and hence not updated. In all our examples, the initial guess for all poses is simply the same as the user-specified rigid transform of the neutral pose. We update $(R_p, t_p)$ using a simple scheme that aligns the neutral artist mesh to the current result. Using singular value decomposition, we compute the closest rigid transform minimizing $\sum_{i\in V} r_i \big\| R_p g_i + t_p - \bar{R}_p \bar{x}^p_i - \bar{t}_p \big\|^2$, where $r_i$ is a rigidity weight value (high weight around the eye sockets and temples, low weight elsewhere), $g_i$ denotes the artist mesh vertex positions, and $(\bar{R}_p, \bar{t}_p)$ is the previous transform estimate.
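This rigid update is a weighted orthogonal Procrustes problem; a standard SVD-based (Kabsch-style) solution is sketched below, where `current_world_verts` stands for the current estimates $\bar{R}_p \bar{x}^p_i + \bar{t}_p$ and `rigidity` for the weights $r_i$. It is a generic solver, not necessarily the authors' exact implementation.

```python
import numpy as np

def update_rigid(artist_verts, current_world_verts, rigidity):
    """Weighted rigid alignment minimizing sum_i r_i || R g_i + t - y_i ||^2 (Section 5.6)."""
    w = rigidity / rigidity.sum()
    g_c = (w[:, None] * artist_verts).sum(axis=0)        # weighted centroids
    y_c = (w[:, None] * current_world_verts).sum(axis=0)
    G = artist_verts - g_c
    Y = current_world_verts - y_c
    H = (w[:, None] * G).T @ Y                           # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ S @ U.T                                   # rotation taking artist mesh to target
    t = y_c - R @ g_c
    return R, t
```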
5.7 Accelerating the Solution Using Keyframes
Minimizing the energy in (1) over the entire sequence requires mul-
tiple iterations of the TRW-S message passing algorithm, and mul-
tiple iterations of the DDMS outer loop. We note that the perfor-
mance flow graph assigns static-to-dynamic flows to only a sparse
subset of performance frames, which we call keyframes. Corre-
spondences among the spans of frames in between keyframes are
reliably represented using concatenation of temporal flows. There-
fore to reduce computation time we first minimize the energy at
only the keyframes and static poses, using concatenated temporal
flows in between keyframes. Each iteration of this reduced problem
is far cheaper than the full problem, so we may obtain a satisfac-
tory solution of the performance keyframes and static poses more
quickly. Next, we keep the static poses and keyframe poses fixed,
and solve the spans of in-between frames, omitting the shrink-wrap
and grounding terms as they affect only the static poses. This sub-
sequent minimization requires only a few iterations to reach a sat-
isfactory result, and each span of in-between frames may be solved
independently (running on multiple computers, for example).
6 Handling Arbitrary Illumination and Motion
Up to now, we have assumed that lighting and overall head motion
in the static scans closely matches that in the dynamic frames. For
performances in uncontrolled environments, the subject may move
or rotate their head to face different cameras, and lighting may be
arbitrary. We handle such complex cases by taking advantage of
the 3D geometry and relightable reflectance maps in the static scan
data. For every 5th performance frame, we compute a relighted ren-
dering of each static scan with roughly similar rigid head motion
and lighting environment as the dynamic performance. These ren-
derings are used as the static expression imagery in our pipeline.
The rigid head motion estimate does not need to be exact as the
optical flow computation is robust to a moderate degree of mis-
alignment. In our results, we (roughly) rigidly posed the head by
hand, though automated techniques could be employed [Zhu and
Ramanan 2012]. We also assume that an HDR light probe measure-
ment [Debevec 1998] exists for the new lighting environment, how-
ever, lighting could be estimated from the subject’s face [Valgaerts
et al. 2012] or eyes [Nishino and Nayar 2004].
The complex backgrounds in real-world uncontrolled environments
pose a problem, as optical flow vectors computed on background
pixels close to the silhouette of the face may confuse the correspon-
dence term if the current estimate of the facial geometry slightly
overlaps the background. This results in parts of the face “stick-
ing” to the background as the subject’s face turns from side to side
(Fig. 6). To combat this, we weight the correspondence confidence
field by a simple soft segmentation of head vs. background. Since
head motion is largely rigid, we fit a 2D affine transform to the op-
tical flow vectors in the region of the current head estimate. Then,
we weight optical flow vectors by how well they agree with the
fitted transform. We also assign high weight to the region deep
inside the current head estimate using a simple image-space ero-
sion algorithm, to prevent large jaw motions from being discarded.
The resulting soft segmentation effectively cuts the head out of the
background whenever the head is moving, thus preventing the opti-
cal flow vectors of the background from polluting the edges of the
face. When the head is not moving against the background the seg-
mentation is poor, but in this case the optical flow vectors of the
face and background agree and pollution is not damaging.
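A sketch of this soft segmentation: fit a 2D affine motion model to the flow inside the current head region by least squares, down-weight flow vectors that disagree with it, and restore full weight deep inside the head. The residual scale and the erosion-based interior region are our assumptions.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def head_flow_weights(flow, head_mask, sigma=2.0, interior_iters=15):
    """Soft head-vs-background weighting of an optical flow field (Section 6 sketch).

    flow:      (H, W, 2) per-pixel flow vectors (dx, dy).
    head_mask: (H, W) boolean mask of the current head estimate projected into the image.
    """
    h, w = head_mask.shape
    ys, xs = np.nonzero(head_mask)
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)   # rows [x y 1]
    U = flow[ys, xs, :].astype(np.float64)                                # observed head flow
    # Least-squares affine motion model: flow ≈ [x y 1] @ M, with M a 3x2 matrix.
    M, *_ = np.linalg.lstsq(A, U, rcond=None)

    # Residual of every pixel's flow against the fitted (largely rigid) head motion.
    yy, xx = np.mgrid[0:h, 0:w]
    Afull = np.stack([xx, yy, np.ones_like(xx)], axis=-1).astype(np.float64)
    resid = np.linalg.norm(flow - Afull @ M, axis=-1)
    weights = np.exp(-(resid / sigma) ** 2)              # agree with head motion -> high weight

    # Keep full confidence deep inside the head so large jaw motions are not discarded.
    interior = binary_erosion(head_mask, iterations=interior_iters)
    weights[interior] = 1.0
    return weights
```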
Figure 6: (a, b) Two frames of a reconstructed performance in
front of a cluttered background, where the subject turns his head
over the course of ten frames. The silhouette of the jaw “sticks”
to the background because the optical flow vectors close to the jaw
are stationary. (c, d) A simple segmentation of the optical flow field
to exclude the background resolves the issue.
7 Results
We ran our technique on several performances from three differ-
ent subjects. Each subject had 30 static facial geometry scans cap-
tured before the performance sessions, though the performance flow
graph construction often employs only a fraction of the scans. An
artist produced a single face mesh for each subject based on their
neutral static facial scan.
7.1 Performances Following Static Scan Sessions
We captured performances of three subjects directly following their
static scan sessions. The performances were recorded from six
camera views in front of the subject with a baseline of approxi-
mately 15 degrees. Our method produced the performance anima-
tion results shown in Fig. 19 without any further user input.
7.2 Performances in Other Locations
We captured a performance of a subject using four consumer HD
video cameras in an office environment. An animator rigidly posed
a head model roughly aligned to every 5th frame of the performance,
to produce the static images for our performance flow graph. Im-
portantly, this rigid head motion does not need to be very accurate
for our method to operate, and we intend that an automated tech-
nique could be employed. A selection of video frames from one of
the views is shown in Fig. 7, along with renderings of the results of
our method. Despite the noisy quality of the videos and the smaller
size of the head in the frame, our method is able to capture stable
facial motion including lip synching and brow wrinkles.
7.3 High-Resolution Detail Transfer
After tracking a performance, we transfer the high-resolution re-
flectance maps from the static scans onto the performance result.
As all results are registered to the same UV parameterization by
our method, the transfer is a simple weighted blend using the cross-
correlation-based confidence weights w
pq
i
of each vertex, interpo-
lated bilinearly between vertices. We also compute values for w
pq
i
for any dynamic-to-static edge pq that was not present in the per-
formance flow graph, to produce weights for every frame of the
performance. This yields detailed reflectance maps for every per-
formance frame, suitable for realistic rendering and relighting. In
addition to transferring reflectance, we also transfer geometric de-
tails in the form of a displacement map, allowing the performance
tracking to operate on a medium-resolution mesh instead of the
full scan resolution. Fig. 8 compares transferring geometric details
Figure 7: A performance captured in an office environment with uncontrolled illumination, using four HD consumer video cameras and
seven static expression scans. Top row: a selection of frames from one of the camera views. Middle row: geometry tracked using the
proposed method, with reflectance maps automatically assembled from static scan data, shaded using a high-dynamic-range light probe. The
reflectance of the top and back of the head were supplemented with artist-generated static maps. The eyes and inner mouth are rendered as
black as our method does not track these features. Bottom row: gray-shaded geometry for the same frames, from a novel viewpoint. Our
method produces stable animation even with somewhat noisy video footage and significant head motion. Dynamic skin details such as brow
wrinkles are transferred from the static scans in a manner faithful to the video footage.
Figure 8: High-resolution details may be transferred to a medium-
resolution tracked model to save computation time. (a) medium-
resolution tracked geometry using six views. (b) medium-resolution
geometry with details automatically transferred from the high-
resolution static scans. (c) high-resolution tracked geometry. The
transferred details in (b) capture most of the dynamic facial details
seen in (c) at a reduced computational cost.
from the static scans onto a medium-resolution reconstruction to di-
rectly tracking a high-resolution mesh. As the high-resolution solve
is more expensive, we first perform the medium-resolution solve
and use it to prime the DDMS-TRWS belief in the high-resolution
solve, making convergence more rapid. In all other results, we show
medium-resolution tracking with detail transfer, as the results are
satisfactory and far cheaper to compute.
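Concretely, the reflectance transfer of Section 7.3 reduces to a normalized weighted blend of the static scans' maps in the shared UV space, as in this small sketch (our own layout; the weight maps stand in for the $w^{pq}_i$ confidences interpolated from vertices into UV space):

```python
import numpy as np

def blend_reflectance(static_maps, weight_maps, eps=1e-6):
    """Blend static-scan reflectance maps into a single per-frame map.

    static_maps: (S, H, W, C) reflectance maps of the S static scans in shared UV space.
    weight_maps: (S, H, W) per-texel blend weights for the current frame.
    """
    w = np.clip(weight_maps, 0.0, None)
    norm = w.sum(axis=0) + eps                    # per-texel normalization
    return (w[..., None] * static_maps).sum(axis=0) / norm[..., None]
```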
Figure 9: Results using only a single camera view, showing the last
four frames from Fig. 7. Even under uncontrolled illumination and
significant head motion, tracking is possible from a single view, at
somewhat reduced fidelity.
7.4 Monocular vs. Binocular vs. Multi-View
Our method operates on any number of camera views, producing
a result from even a single view. Fig. 9 shows results from a sin-
gle view for the same uncontrolled-illumination sequence as Fig.
7. Fig. 10 shows the incremental improvement in facial detail for
a controlled-illumination sequence using one, two, and six views.
Our method is applicable to a wide variety of camera and lighting
setups, with graceful degradation as less information is available.
7.5 Influence of Each Energy Term
The core operation of our method is to propagate a known facial
pose (the artist mesh) to a set of unknown poses (the dynamic
frames and other static scans) via the ground term and correspon-
dence terms in our energy formulation. The differential shape term
and shrink wrap term serve to regularize the shape of the solution.
We next explore the influence of these terms on the solution.
Figure 10: Example dynamic performance frame reconstructed
from (a) one view, (b) two views and (c) six views. Our method
gracefully degrades as less information is available.
Figure 11: The artist mesh is non-rigidly registered to each of the
other static expression scans as a byproduct of our method. The
registered artist mesh is shown for a selection of scans from two
different subjects. Note the variety of mouth shapes, all of which
are well-registered by our method without any user input.
Correspondence Term The correspondence term produces a
consistent parameterization of the geometry suitable for texturing
and other editing tasks. As our method computes a coupled solu-
tion of performance frames using static poses to bridge larger tem-
poral gaps, the artist mesh is non-rigidly registered to each of the
static scans as a byproduct of the optimization. (See Fig. 11 for ex-
amples.) Note especially that our method automatically produces a
complete head for each expression, despite only having static facial
scan geometry for the frontal face surface. As shown in Fig. 12, this
consistency is maintained even when the solution is obtained from
a different performance. Fig. 13 illustrates that the use of multiple
static expression scans in the performance flow graph produces a
more expressive performance, with more accentuated facial expres-
sion features, as there are more successful optical flow regions in
the face throughout the performance.
Differential Shape Term In our formulation, the differential
shape of a performance frame or pose is tied to a blend of its neigh-
bors on the performance flow graph. This allows details from mul-
tiple static poses to propagate to related poses. Even when only one
Figure 12: Top row: neutral mesh with checker visualization of tex-
ture coordinates, followed by three non-rigid registrations to other
facial scans as a byproduct of tracking a speaking performance.
Bottom row: the same, except the performance used was a series
of facial expressions with no speaking. The non-rigid registration
obtained from the performance-graph-based tracking is both con-
sistent across expressions and across performances. Note, e.g. the
consistent locations of the checkers around the contours of the lips.
static pose is used (i.e. neutral), allowing temporal neighbors to in-
fluence the differential shape provides temporal smoothing without
overly restricting the shape of each frame. Fig. 13 (c, d) illustrates
the loss of detail when temporal neighbors are excluded from the
differential shape term (compare to a, b).
Shrink Wrap Term The shrink wrap term conforms the static
poses to the raw geometry scans (Fig. 14). Without this term, subtle
details in the static scans cannot be propagated to the performance
result, and the recovered static poses have less fidelity to the scans.
7.6 Comparison to Previous Work
We ran our method on the data from [Beeler et al. 2011], using their
recovered geometry from the first frame (frame 48) as the “artist”
mesh in our method. For expression scans, we used the geome-
try from frames 285 (frown) and 333 (brow raise). As our method
makes use of the expression scans only via image-space operations
on camera footage or rasterized geometry, any point order infor-
mation present in the scans is entirely ignored. Therefore in this
test, it is as if the static scans were produced individually by the
method of [Beeler et al. 2010]. We constructed a simple UV pro-
jection on the artist mesh for texture visualization purposes, and
projected the video frames onto each frame’s geometry to produce a
per-frame UV texture map. To measure the quality of texture align-
ment over the entire sequence, we computed the temporal variance
of each pixel in the texture map (shown in Fig.15 (a, b)), using
contrast normalization to disregard low-frequency shading varia-
tion. The proposed method produces substantially lower temporal
texture variance, indicating a more consistent alignment throughout
the sequence, especially around the mouth. Examining the geome-
try in Fig.15 (c-f), the proposed method has generally comparable
quality as the previous work, with the mouth-closed shape recov-
ered more faithfully (which is consistent with the variance analysis).
Figure 13: Using multiple static expressions in the performance
flow graph produces more detail than using just a neutral static
expression. Multiple static expressions are included in the perfor-
mance flow graph in (a, c), whereas only the neutral expression is
included in (b, d). By including temporal neighbors and static scans
in determining the differential shape, details from the various static
scans can be propagated throughout the performance. Differential
shape is determined by the static expression(s) and temporal neigh-
bors in (a, b), whereas temporal neighbors are excluded from the
differential shape term in (c, d). Note the progressive loss of detail
in e.g. the brow region from (a) to (d).
We also compared to [Klaudiny and Hilton 2012] in a similar
manner, using frame 0 as the artist mesh, and frames 25, 40, 70,
110, 155, 190, 225, 255 and 280 as static expressions. Again, no
point order information is used. Fig. 16 again shows an overall
lower temporal texture variance from the proposed method.
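The alignment metric used in these comparisons can be sketched as follows: locally contrast-normalize each frame's UV texture, then take the per-texel variance over time. The window size and normalization constant are our choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def temporal_texture_variance(textures, win=31, eps=1e-4):
    """Per-texel temporal variance of contrast-normalized textures (Section 7.6 sketch).

    textures: (T, H, W) grayscale UV texture maps, one per frame, in a shared parameterization.
    Returns an (H, W) map; lower values indicate more consistent alignment over time.
    """
    normed = []
    for tex in textures:
        mu = uniform_filter(tex, win)                       # local mean
        sigma = np.sqrt(np.maximum(uniform_filter(tex ** 2, win) - mu ** 2, 0.0))
        normed.append((tex - mu) / (sigma + eps))           # discard low-frequency shading
    return np.var(np.stack(normed, axis=0), axis=0)
```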
7.7 Performance Timings
We report performance timings in Fig. 17 for various sequences,
running on a 16-core 2.4 GHz Xeon E5620 workstation (some oper-
ations are multithreaded across the cores). All tracked meshes have
65 thousand vertices, except Fig. 8(c) and Fig. 15 which have one
million vertices. We report each stage of the process: “Graph” for
the performance graph construction, “Flow” for the high-resolution
optical flow calculations, “Key” for the performance tracking solve
on key frames, and “Tween” for the performance tracking solve
in between key frames. We mark stages that could be parallelized
over multiple machines with an asterisk (*). High-resolution solves
(Fig. 8(c) and Fig. 15) take longer than medium-resolution solves.
Sequences with uncontrolled illumination (Fig. 7 and Fig. 9) take
longer for the key frames to converge since the correspondence ty-
ing the solution to the static scans has lower confidence.
7.8 Discussion
Our method produces a consistent geometry animation on an artist-
created neutral mesh. The animation is expressive and lifelike, and
the subject is free to make natural head movements within a cer-
tain degree. Fig. 18 shows renderings from such a facial perfor-
mance rendered using advanced skin and eye shading techniques
as described in [Jimenez et al. 2012]. One notable shortcoming of
our performance flow graph construction algorithm is the neglect
of eye blinks. This results in a poor representation of the blinks
in the final animation. Our method requires one artist-generated
mesh per subject to obtain results that are immediately usable in
production pipelines. Automatic generation of this mesh could
be future work, or use existing techniques for non-rigid registra-
tion. Omitting this step would still produce a result, but would re-
quire additional cleanup around the edges as in e.g. [Beeler et al.
2011][Klaudiny and Hilton 2012].
Figure 14: The shrink wrap term conforms the artist mesh to the
static scan geometry, and also improves the transfer of expressive
details to the dynamic performance. The registered artist mesh is
shown for two static poses in (a) and (b), and a dynamic pose that
borrows brow detail from (a) and mouth detail from (b) is shown in
(c). Without the shrink wrap term, the registration to the static poses
suffers (d, e) and the detail transfer to the dynamic performance is
less successful (f). Fine-scale details are still transferred via dis-
placement maps, but medium-scale expressive details are lost.
8 Future Work
One of the advantages of our technique is that it relates a dynamic performance back to facial shape scans using per-pixel weight maps. It would be desirable to further factor our results to create multiple localized blend shapes that are more semantically meaningful and artist-friendly. Also, our algorithm does not explicitly track eye or mouth contours. Eye and mouth tracking could be further refined with additional constraints to capture eye blinks and more subtle mouth behavior such as “sticky lips” [Alexander et al. 2009]. Another useful direction would be to retarget performances from one subject to another. Given a set of static scans for both subjects, it should be possible to clone one subject’s performance to the second subject as in [Seol et al. 2012]; providing more meaningful control over this transfer remains a subject for future research. Finally, as our framework is agnostic to the particular method employed for estimating 2D correspondences, we would like to try more recent optical flow algorithms such as the top performers on the Middlebury benchmark [Baker et al. 2011]. Usefully, the quality of our performance tracking improves whenever a better optical flow library becomes available.
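As an illustration of how a drop-in correspondence estimator can be paired with a per-pixel confidence, the sketch below combines OpenCV's Farnebäck flow with a windowed normalized-cross-correlation check between one frame and the other frame warped back by the flow. The specific flow algorithm, window size, and smoothing are assumptions for illustration only; they are not the flow library or confidence formula used to produce the results in this paper.

import cv2
import numpy as np

def flow_confidence(img_a, img_b, flow, win=11, eps=1e-6):
    """Per-pixel confidence for a dense flow field from img_a to img_b
    (grayscale images), based on windowed normalized cross-correlation
    (NCC) between img_a and img_b warped back into img_a's frame."""
    h, w = img_a.shape[:2]
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (gx + flow[..., 0]).astype(np.float32)
    map_y = (gy + flow[..., 1]).astype(np.float32)
    warped_b = cv2.remap(img_b, map_x, map_y, cv2.INTER_LINEAR)

    a = img_a.astype(np.float32)
    b = warped_b.astype(np.float32)
    box = lambda x: cv2.blur(x, (win, win))   # local (windowed) mean
    mu_a, mu_b = box(a), box(b)
    cov = box(a * b) - mu_a * mu_b
    var_a = box(a * a) - mu_a * mu_a
    var_b = box(b * b) - mu_b * mu_b
    ncc = cov / np.sqrt(np.maximum(var_a * var_b, eps))
    # Smooth and clamp so low-texture regions receive modest confidence
    # rather than spurious extremes.
    return cv2.GaussianBlur(np.clip(ncc, 0.0, 1.0), (0, 0), 3.0)

# Any dense flow estimator can be substituted, e.g.:
# flow = cv2.calcOpticalFlowFarneback(img_a, img_b, None,
#                                     0.5, 3, 15, 3, 5, 1.2, 0)
# conf = flow_confidence(img_a, img_b, flow)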
Acknowledgements
The authors thank the following people for their support and assistance: Ari Shapiro, Sin-Hwa Kang, Matt Trimmer, Koki Nagano, Xueming Yu, Jay Busch, Paul Graham, Kathleen Haase, Bill Swartout, Randall Hill and Randolph Hall. We thank the authors of [Beeler et al. 2010] and [Klaudiny et al. 2010] for graciously providing the data for the comparisons in Figs. 15 and 16, respectively. We thank Jorge Jimenez, Etienne Danvoye, and Javier von der Pahlen at Activision R&D for the renderings in Fig. 18. This work was sponsored by the University of Southern California Office of the Provost and the U.S. Army Research, Development, and Engineering Command (RDECOM). The content of the information does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.

Figure 15: Top row: Temporal variance of contrast-normalized texture (false color, where blue is lowest and red is highest), with (a) the proposed method and (b) the method of [Beeler et al. 2011]. The variance of the proposed method is substantially lower, indicating a more consistent texture alignment throughout the sequence. Bottom row: Geometry for frames 120 and 330 of the sequence, with (c, d) the proposed method and (e, f) the prior work.

Figure 16: Temporal variance of contrast-normalized texture (false color, where blue is lowest and red is highest), with (a) the proposed method and (b) the method of [Klaudiny et al. 2010]. As in Fig. 15, the variance of the proposed method is generally lower.
References
ALEXANDER, O., ROGERS, M., LAMBETH, W., CHIANG, M., AND DEBEVEC, P. 2009. Creating a photoreal digital actor: The Digital Emily Project. In Conference for Visual Media Production (CVMP '09), 176–187.
BAKER, S., SCHARSTEIN, D., LEWIS, J. P., ROTH, S., BLACK, M. J., AND SZELISKI, R. 2011. A database and evaluation methodology for optical flow. International Journal of Computer Vision 92, 1 (Mar.), 1–31.
BEELER, T., BICKEL, B., BEARDSLEY, P., SUMNER, B., AND
GROSS, M. 2010. High-quality single-shot capture of facial
geometry. ACM Trans. on Graphics (Proc. SIGGRAPH) 29, 3,
40:1–40:9.
BEELER, T., HAHN, F., BRADLEY, D., BICKEL, B., BEARDS-
LEY, P., GOTSMAN, C., SUMNER, R. W., AND GROSS, M.
2011. High-quality passive facial performance capture using
anchor frames. In ACM SIGGRAPH 2011 papers, ACM, New
York, NY, USA, SIGGRAPH ’11, 75:1–75:10.
BICKEL, B., LANG, M., BOTSCH, M., OTADUY, M. A.,
AND GROSS, M. 2008. Pose-space animation and trans-
fer of facial details. In Proceedings of the 2008 ACM SIG-
GRAPH/Eurographics Symposium on Computer Animation, Eu-
rographics Association, Aire-la-Ville, Switzerland, Switzerland,
SCA ’08, 57–66.
BORSHUKOV, G., PIPONI, D., LARSEN, O., LEWIS, J. P., AND TEMPELAAR-LIETZ, C. 2003. Universal capture: image-based facial animation for "The Matrix Reloaded". In SIGGRAPH, ACM, A. P. Rockwood, Ed.
BRADLEY, D., HEIDRICH, W., POPA, T., AND SHEFFER, A.
2010. High resolution passive facial performance capture. In
ACM SIGGRAPH 2010 papers, ACM, New York, NY, USA,
SIGGRAPH ’10, 41:1–41:10.
COOTES, T. F., EDWARDS, G. J., AND TAYLOR, C. J. 1998. Ac-
tive appearance models. In IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, Springer, 484–498.
DEBEVEC, P. 1998. Rendering synthetic objects into real scenes:
Bridging traditional and image-based graphics with global illu-
mination and high dynamic range photography. In Proceedings
of the 25th Annual Conference on Computer Graphics and In-
teractive Techniques, ACM, New York, NY, USA, SIGGRAPH
’98, 189–198.
DECARLO, D., AND METAXAS, D. 1996. The integration of
optical flow and deformable models with applications to human
face shape and motion estimation. In Proceedings of the 1996
Conference on Computer Vision and Pattern Recognition (CVPR
’96), IEEE Computer Society, Washington, DC, USA, CVPR
’96, 231–238.
EKMAN, P., AND FRIESEN, W. 1978. Facial Action Coding Sys-
tem: A Technique for the Measurement of Facial Movement.
Consulting Psychologists Press, Palo Alto.
GHOSH, A., FYFFE, G., TUNWATTANAPONG, B., BUSCH, J., YU, X., AND DEBEVEC, P. 2011. Multiview face capture using polarized spherical gradient illumination. In Proceedings of the 2011 SIGGRAPH Asia Conference, ACM, New York, NY, USA, SA '11, 129:1–129:10.
GUENTER, B., GRIMM, C., WOOD, D., MALVAR, H., AND
PIGHIN, F. 1998. Making faces. In Proceedings of the 25th
annual conference on Computer graphics and interactive tech-
niques, ACM, New York, NY, USA, SIGGRAPH ’98, 55–66.
HAWKINS, T., WENGER, A., TCHOU, C., GARDNER, A., GÖRANSSON, F., AND DEBEVEC, P. 2004. Animatable facial reflectance fields. In Rendering Techniques 2004: 15th Eurographics Workshop on Rendering, 309–320.
HUANG, H., CHAI, J., TONG, X., AND WU, H.-T. 2011. Leverag-
ing motion capture and 3d scanning for high-fidelity facial per-
formance acquisition. ACM Trans. Graph. 30, 4 (July), 74:1–
74:10.
JIMENEZ, J., JARABO, A., GUTIERREZ, D., DANVOYE, E., AND
VON DER PAHLEN, J. 2012. Separable subsurface scattering
and photorealistic eyes rendering. In ACM SIGGRAPH 2012
Courses, ACM, New York, NY, USA, SIGGRAPH 2012.
KLAUDINY, M., AND HILTON, A. 2012. High-detail 3d capture
and non-sequential alignment of facial performance. In 3DIMPVT.
KLAUDINY, M., HILTON, A., AND EDGE, J. 2010. High-detail
3d capture of facial performance. In 3DPVT.
KOLMOGOROV, V. 2006. Convergent tree-reweighted message
passing for energy minimization. IEEE Trans. Pattern Anal.
Mach. Intell. 28, 10, 1568–1583.
LI, H., ROIVAINEN, P., AND FORCHEIMER, R. 1993. 3-d motion
estimation in model-based facial image coding. IEEE Trans. Pat-
tern Anal. Mach. Intell. 15, 6 (June), 545–555.
MA, W.-C., JONES, A., CHIANG, J.-Y., HAWKINS, T., FRED-
ERIKSEN, S., PEERS, P., VUKOVIC, M., OUHYOUNG, M.,
AND DEBEVEC, P. 2008. Facial performance synthesis using
deformation-driven polynomial displacement maps. ACM Trans.
Graph. 27, 5 (Dec.), 121:1–121:10.
NISHINO, K., AND NAYAR, S. K. 2004. Eyes for relighting. ACM
Trans. Graph. 23, 3, 704–711.
PARK, M., KASHYAP, S., COLLINS, R., AND LIU, Y. 2010. Data driven mean-shift belief propagation for non-Gaussian MRFs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3547–3554.
POPA, T., SOUTH-DICKINSON, I., BRADLEY, D., SHEFFER, A.,
AND HEIDRICH, W. 2010. Globally consistent space-time re-
construction. Computer Graphics Forum (Proc. SGP).
SEOL, Y., LEWIS, J., SEO, J., CHOI, B., ANJYO, K., AND NOH,
J. 2012. Spacetime expression cloning for blendshapes. ACM
Trans. Graph. 31, 2 (Apr.), 14:1–14:12.
VALGAERTS, L., WU, C., BRUHN, A., SEIDEL, H.-P., AND
THEOBALT, C. 2012. Lightweight binocular facial performance
capture under uncontrolled lighting. ACM Trans. Graph. 31, 6
(Nov.), 187:1–187:11.
WEISE, T., BOUAZIZ, S., LI, H., AND PAULY, M. 2011. Realtime
performance-based facial animation. In ACM SIGGRAPH 2011
papers, ACM, New York, NY, USA, SIGGRAPH ’11, 77:1–
77:10.
WERLBERGER, M. 2012. Convex Approaches for High Perfor-
mance Video Processing. PhD thesis, Institute for Computer
Graphics and Vision, Graz University of Technology, Graz, Aus-
tria.
ZHANG, L., SNAVELY, N., CURLESS, B., AND SEITZ, S. M.
2004. Spacetime faces: high resolution capture for modeling and
animation. In SIGGRAPH ’04: ACM SIGGRAPH 2004 Papers,
ACM, New York, NY, USA, 548–558.
ZHU, X., AND RAMANAN, D. 2012. Face detection, pose esti-
mation, and landmark localization in the wild. In CVPR, 2879–
2886.
Figure 19: Three tracked performances with different subjects, using six camera views and six to eight static expression scans per subject.
Shown are alternating rows of selected frames from the performance video, and gray-shaded tracked geometry for the same frames. Our
method produces a consistent geometry animation on an artist-created neutral mesh. The animation is expressive and lifelike, and the subject is free to make natural head movements to a certain degree.