Blendshapes from Commodity RGB-D Sensors
Dan Casas1, Oleg Alexander1, Andrew W. Feng1, Graham Fyffe1, Ryosuke Ichikari1, Paul Debevec1, Rhuizhe Wang2,
Evan Suma1, and Ari Shapiro1
1Institute for Creative Technologies, University of Southern California
2University of Southern California
CR Categories: I.3.6 [Computer Graphics]: Methodology and
Techniques—Interaction Techniques; I.3.7 [Computer Graphics]:
Three-Dimensional Graphics and Realism—Animation;
Keywords: face animation, RGB-D, depth sensors, blendshapes
1 Exposition
Creating and animating a realistic 3D human face is an important task in computer graphics. The capability of capturing the 3D face of a human subject and reanimating it quickly will find many applications in games, training simulations, and interactive 3D graphics. We demonstrate a system to capture photorealistic 3D faces and generate blendshape models automatically using only a single commodity RGB-D sensor. Our method can rapidly generate a set of expressive facial poses from a single depth sensor, such as a Microsoft Kinect version 1, and requires no artistic expertise to process those scans. The system takes only a matter of seconds to capture and produce a 3D facial pose and only a few minutes of processing time to transform it into a blendshape-compatible model. Our main contributions are an end-to-end pipeline for capturing and generating face blendshape models automatically, and a registration method that solves for dense correspondences between two face scans using facial landmark detection and optical flow. We demonstrate the effectiveness of the proposed method by capturing different human subjects and puppeteering their 3D faces in an animation system with real-time facial performance retargeting.
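Once every expression scan shares the neutral scan's vertex topology, the captured poses can be driven as a standard delta blendshape rig. The sketch below shows the usual linear blendshape evaluation such a rig implies; the array names and toy weights are assumptions for illustration, not the paper's retargeting runtime.

```python
# Minimal sketch of a delta blendshape model built from registered expression scans.
# Names and the toy weights are illustrative assumptions, not the paper's code.
import numpy as np

def evaluate_blendshapes(neutral, expressions, weights):
    """neutral:     (V, 3) vertices of the neutral scan.
    expressions: (K, V, 3) expression scans registered to the same topology.
    weights:     (K,) blend weights, e.g. streamed from a face tracker."""
    deltas = expressions - neutral[None, :, :]       # per-expression vertex offsets
    return neutral + np.tensordot(weights, deltas, axes=1)

# Toy example: blend 60% of expression 0 with 30% of expression 1.
neutral = np.zeros((4, 3))
expressions = np.random.default_rng(0).normal(scale=0.01, size=(2, 4, 3))
posed = evaluate_blendshapes(neutral, expressions, np.array([0.6, 0.3]))
print(posed.shape)  # (4, 3)
```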
Since our method is low cost, fast, and requires little or no expertise on the part of the operator, it has the potential to expand the number of photoreal, individualized faces available for simulation. Our method is data agnostic and can utilize scan input from any depth sensor; as the scan quality of such sensors improves, the quality of the resulting blendshapes will also improve. For example, we demonstrate photorealistic results suitable for close-up encounters when using the Occipital Structure Sensor instead of the Kinect v1.
A key to our method is the proper alignment of both texture and geometry. Rather than storing our scans as separate geometry and textures, we choose instead to store our scans as images.
Figure 1: Blendshape results using our method with data from a commodity RGB-D sensor (Occipital Structure Sensor).
Figure 2: Real-time puppeteering of a digital face. Our shapes preserve the photorealistic facial quality generated from the original scans.
Each scan is stored as a 32-bit float EXR texture map image and a high-resolution point cloud. Our texture and geometry alignment algorithm has three steps and runs in 2D. First, we construct a Delaunay triangulation over the user-supplied points and apply affine triangle warps to roughly pre-warp the source diffuse texture to the target. Second, we use GPU-accelerated optical flow to compute a dense warp field from the pre-warped source diffuse texture to the target. Finally, we apply the dense warp to each of our source texture maps, including diffuse, specular, specular normals, and the point cloud. The result is the source scan warped into the target UV space. The sub-millimeter correspondence is able to align individual pores across the majority of the face.
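The following sketch illustrates this three-step 2D alignment using off-the-shelf OpenCV on the CPU. It is a minimal illustration under assumptions: Farneback flow stands in for the GPU-accelerated flow used in the paper, and all function names, map names, and landmark arrays are invented for the example rather than taken from the actual implementation.

```python
# Sketch of the three-step alignment: (1) piecewise-affine pre-warp from landmarks,
# (2) dense optical flow to the target diffuse, (3) warp every source map
# (diffuse, specular, point-cloud-as-image) into the target UV space.
import cv2
import numpy as np

def to_gray8(img):
    """Convert a float texture to 8-bit grayscale for the flow solver."""
    g = cv2.cvtColor(img.astype(np.float32), cv2.COLOR_BGR2GRAY)
    return cv2.normalize(g, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def piecewise_affine_prewarp(src_img, pts_src, pts_dst):
    """Roughly pre-warp src_img so landmarks pts_src (N,2) land on pts_dst (N,2)."""
    h, w = src_img.shape[:2]
    warped = np.zeros_like(src_img)
    subdiv = cv2.Subdiv2D((0, 0, w, h))            # Delaunay triangulation
    for x, y in pts_dst:
        subdiv.insert((float(x), float(y)))
    for t in subdiv.getTriangleList().reshape(-1, 3, 2):
        if (t < 0).any() or (t[:, 0] >= w).any() or (t[:, 1] >= h).any():
            continue                               # skip virtual outer triangles
        idx = [int(np.argmin(np.linalg.norm(pts_dst - v, axis=1))) for v in t]
        tri_src = pts_src[idx].astype(np.float32)
        tri_dst = pts_dst[idx].astype(np.float32)
        M = cv2.getAffineTransform(tri_src, tri_dst)
        patch = cv2.warpAffine(src_img, M, (w, h))
        mask = np.zeros((h, w), np.uint8)
        cv2.fillConvexPoly(mask, tri_dst.astype(np.int32), 255)
        warped[mask > 0] = patch[mask > 0]         # outside the landmark hull stays empty
    return warped

def align_scan(src_maps, dst_diffuse, pts_src, pts_dst):
    """src_maps: dict of float32 images sharing one UV layout,
    e.g. {'diffuse': ..., 'specular': ..., 'points': ...}."""
    # Step 1: affine-triangle pre-warp of every source map.
    pre = {k: piecewise_affine_prewarp(v, pts_src, pts_dst) for k, v in src_maps.items()}
    # Step 2: dense flow field from the pre-warped source diffuse to the target diffuse.
    flow = cv2.calcOpticalFlowFarneback(to_gray8(dst_diffuse), to_gray8(pre["diffuse"]),
                                        None, 0.5, 4, 31, 5, 7, 1.5, 0)
    h, w = dst_diffuse.shape[:2]
    gx, gy = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    map_x, map_y = gx + flow[..., 0], gy + flow[..., 1]
    # Step 3: apply the dense warp to every pre-warped map (textures and point cloud).
    return {k: cv2.remap(v, map_x, map_y, cv2.INTER_LINEAR) for k, v in pre.items()}
```

Because the point cloud is stored as a float image in the same UV layout, the identical warp field aligns geometry and texture together, which is the point of running the registration entirely in 2D.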
Recent template-based face scanning technologies deform a generic template model toward raw face scans, so the resulting face model represents only an approximation of the original geometry rather than an exact reconstruction. Moreover, since the template model is deformed under geometric constraints, there is no guarantee of exact texture alignment between different facial expressions. By contrast, our method starts from the raw face scans and directly extracts a consistent mesh from them, so the geometric representations are exact to the original shapes. Our method also unwraps both geometric and texture information into 2D images and solves for alignment in UV space using facial feature detection and optical flow. Thus, template-based methods tend to produce results that are similar across scanned subjects, since their template is shared, while our method preserves the unique appearance of each face, resulting in greater photorealism.