
Video Rewrite: Driving Visual Speech with Audio

Authors: Christoph Bregler, Michele Covell, Malcolm Slaney (Interval Research Corporation). ACM SIGGRAPH 97.
ABSTRACT
Video Rewrite uses existing footage to create automatically new
video of a person mouthing words that she did not speak in the
original footage. This technique is useful in movie dubbing, for
example, where the movie sequence can be modified to sync the
actors’ lip motions to the new soundtrack.
Video Rewrite automatically labels the phonemes in the train-
ing data and in the new audio track. Video Rewrite reorders the
mouth images in the training footage to match the phoneme
sequence of the new audio track. When particular phonemes are
unavailable in the training footage, Video Rewrite selects the clos-
est approximations. The resulting sequence of mouth images is
stitched into the background footage. This stitching process auto-
matically corrects for differences in head position and orientation
between the mouth images and the background footage.
Video Rewrite uses computer-vision techniques to track points
on the speaker’s mouth in the training footage, and morphing tech-
niques to combine these mouth gestures into the final video
sequence. The new video combines the dynamics of the original
actor’s articulations with the mannerisms and setting dictated by
the background footage. Video Rewrite is the first facial-animation
system to automate all the labeling and assembly tasks required to
resync existing footage to a new soundtrack.
CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation—Morphing; I.4.6 [Image Processing]: Segmentation—Feature Detection; I.3.8 [Computer Graphics]: Applications—Facial Synthesis; I.4.10 [Image Processing]: Applications—Feature Transformations.
Additional Keywords: Facial Animation, Lip Sync.
1 WHY AND HOW WE REWRITE VIDEO
We are very sensitive to the synchronization between speech and
lip motions. For example, the special effects in Forrest Gump are
compelling because the Kennedy and Nixon footage is lip synched
to the movie’s new soundtrack. In contrast, close-ups in dubbed
movies are often disturbing due to the lack of lip sync. Video
Rewrite is a system for automatically synthesizing faces with
proper lip sync. It can be used for dubbing movies, teleconferenc-
ing, and special effects.
Video Rewrite automatically pieces together from old footage
a new video that shows an actor mouthing a new utterance. The
results are similar to labor-intensive special effects in Forrest
Gump. These effects are successful because they start from actual
film footage and modify it to match the new speech. Modifying
and reassembling such footage in a smart way and synchronizing it
to the new sound track leads to final footage of realistic quality.
Video Rewrite uses a similar approach but does not require labor-
intensive interaction.
Our approach allows Video Rewrite to learn from example
footage how a person’s face changes during speech. We learn what
a person’s mouth looks like from a video of that person speaking
normally. We capture the dynamics and idiosyncrasies of her artic-
ulation by creating a database of video clips. For example, if a
woman speaks out of one side of her mouth, this detail is recreated
accurately. In contrast, most current facial-animation systems rely
on generic head models that do not capture the idiosyncrasies of an
individual speaker.
To model a new person, Video Rewrite requires a small num-
ber (26 in this work) of hand-labeled images. This is the only
human intervention that is required in the whole process. Even this
level of human interaction is not a fundamental requirement: We
could use face-independent models instead [Kirby90, Covell96].
Video Rewrite shares its philosophy with concatenative speech
synthesis [Moulines90]. Instead of modeling the vocal tract, con-
catenative speech synthesis analyzes a corpus of speech, selects
examples of phonemes, and normalizes those examples. Phonemes
are the distinct sounds within a language, such as the /IY/ and /P/
in “teapot.” Concatenative speech synthesis creates new sounds by con-
catenating the proper sequence of phonemes. After the appropriate
warping of pitch and duration, the resulting speech is natural
sounding. This approach to synthesis is data driven: The algo-
rithms analyze and resynthesize sounds using little hand-coded
knowledge of speech. Yet they are effective at implicitly capturing
the nuances of human speech.
Video Rewrite uses a similar approach to create new sequences
of visemes. Visemes are the visual counterpart to phonemes.
Visemes are visually distinct mouth, teeth, and tongue articulations
for a language. For example, the phonemes /B/ and /P/ are visually
indistinguishable and are grouped into a single viseme class.
Christoph Bregler, Michele Covell, Malcolm Slaney
Interval Research Corporation, 1801 Page Mill Road, Building C, Palo Alto, CA 94304.
E-mail: bregler@cs.berkeley.edu, covell@interval.com, malcolm@interval.com. See the SIGGRAPH Video Proceedings or http://www.interval.com/papers/1997-012/ for the latest animations.
Figure 1: Overview of analysis stage. Video Rewrite uses the audio track to segment the video into triphones. Vision techniques find the orientation of the head, and the shape and position of the mouth and chin in each image. In the synthesis stage, Video Rewrite selects from this video model to synchronize new lip videos to any given audio.
Video Rewrite creates new videos using two steps: analysis of
a training database and synthesis of new footage. In the analysis
stage, Video Rewrite automatically segments into phonemes the
audio track of the training database. We use these labels to segment
the video track as well. We automatically track facial features in
this segmented footage. The phoneme and facial labels together
completely describe the visemes in the training database. In the
synthesis stage, our system uses this video database, along with a
new utterance. It automatically retrieves the appropriate viseme
sequences, and blends them into a background scene using mor-
phing techniques. The result is a new video with lip and jaw move-
ments that synchronize to the new audio. The steps used in the
analysis stage are shown in Figure 1; those of the synthesis stage
are shown in Figure 2.
In the remainder of this paper, we first review other approaches
to synthesizing talking faces (Section 2). We then describe the
analysis and synthesis stages of Video Rewrite. In the analysis
stage (Section 3), a collection of video is analyzed and stored in a
database that matches sounds to video sequences. In the synthesis
stage (Section 4), new speech is labeled, and the appropriate
sequences are retrieved from the database. The final sections of
this paper describe our results (Section 5), future work (Section 6),
and contributions (Section 7).
2 SYNTHETIC VISUAL SPEECH
Facial-animation systems build a model of what a person’s speech
sounds and looks like. They use this model to generate a new out-
put sequence, which matches the (new) target utterance. On the
model-building side (analysis), there are typically three distin-
guishing choices: how the facial appearance is learned or
described, how the facial appearance is controlled or labeled, and
how the viseme labels are learned or described. For output-
sequence generation (synthesis), the distinguishing choice is how
the target utterance is characterized. This section reviews a repre-
sentative sample of past research in these areas.
2.1 Source of Facial Appearance
Many facial-animation systems use a generic 3D mesh model of a
face [Parke72, Lewis91, Guiard-Marigny94], sometimes adding
texture mapping to improve realism [Morishima91, Cohen93,
Waters95]. Another synthetic source of face data is hand-drawn
images [Litwinowicz94]. Other systems use real faces for their
source examples, including approaches that use 3D scans
[Williams90] and still images [Scott94]. We use video footage to
train Video Rewrite’s models.
2.2 Facial Appearance Control
Once a facial model is captured or created, the control parameters
that exercise that model must be defined. In systems that rely on a
3D mesh model for appearance, the control parameters are the
allowed 3D mesh deformations. Most of the image-based systems
label the positions of specific facial locations as their control
parameters. Of the systems that use facial-location labels, most
rely on manual labeling of each example image [Scott94,
Litwinowicz94]. Video Rewrite creates its video model by auto-
matically labeling specific facial locations.
2.3 Viseme Labels
Many facial-animation systems label different visual configura-
tions with an associated phoneme. These systems then match these
phoneme labels with their corresponding labels in the target utter-
ance. With synthetic images, the phoneme labels are artificial or
are learned by analogy [Morishima91]. For natural images, taken
from a video of someone speaking, the phonemic labels can be
generated manually [Scott94] or automatically. Video Rewrite
determines the phoneme labels automatically (Section 3.1).
2.4 Output-Sequence Generation
The goal of facial animation is to generate an image sequence that
matches a target utterance. When phoneme labels are used, those
for the target utterance can be entered manually [Scott94] or com-
puted automatically [Lewis91, Morishima91]. Another option for
phoneme labeling is to create the new utterance with synthetic
speech [Parke72, Cohen93, Waters95]. Approaches that do not use
phoneme labels include motion capture of facial locations that are
artificially highlighted [Williams90, Guiard-Marigny94] and man-
ual control by an animator [Litwinowicz94]. Video Rewrite uses a
combination of phoneme labels (from the target utterance) and
facial-location labels (from the video-model segments). Video
Rewrite derives all these labels automatically.
Video Rewrite is the first facial-animation system to automate
all these steps and to generate realistic lip-synched video from nat-
ural speech and natural images.
3 ANALYSIS FOR VIDEO MODELING
As shown in Figure 1, the analysis stage creates an annotated data-
base of example video clips, derived from unconstrained footage.
We refer to this collection of annotated examples as a video model.
This model captures how the subject’s mouth and jaw move during
speech. These training videos are labeled automatically with the
phoneme sequence uttered during the video, and with the locations
of fiduciary points that outline the lips, teeth, and jaw.
As we shall describe, the phonemic labels are from a time-
aligned transcript of the speech, generated by a hidden Markov
model (HMM). Video Rewrite uses the phonemic labels from the
HMM to segment the input footage into short video clips, each
showing three phonemes or a triphone. These triphone videos, with
the fiduciary-point locations and the phoneme labels, are stored in
the video model.
In Sections 3.1 and 3.2, we describe the visual and acoustic
analyses of the video footage. In Section 4, we explain how to use
this model to synthesize new video.
3.1 Annotation Using Image Analysis
Video Rewrite uses any footage of the subject speaking. As her
face moves within the frame, we need to know the mouth position
and the lip shapes at all times. In the synthesis stage, we use this
information to warp overlapping videos such that they have the
same lip shapes, and to align the lips with the background face.
Figure 2: Overview of synthesis stage. Video Rewrite segments new audio and uses it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face.
Manual labeling of the fiduciary points around the mouth and
jaw is error prone and tedious. Instead, we use computer-vision
techniques to label the face and to identify the mouth and its shape.
A major hurdle to automatic annotation is the low resolution of the
images. In a typical scene, the lip region has a width of only 40
pixels. Conventional contour-tracking algorithms [Kass87,
Yuille89] work well on high-contrast outer lip boundaries with
some user interaction, but fail on inner lip boundaries at this reso-
lution, due to the low signal-to-noise ratios. Grayscale-based algo-
rithms, such as eigenimages [Kirby90, Turk91], work well at low
resolutions, but estimate only the location of the lips or jaw, rather
than estimating the desired fiduciary points. Eigenpoints
[Covell96], and other extensions of eigenimages [Lanitis95], esti-
mate control points reliably and automatically, even in such low-
resolution images. As shown in Figure 3, eigenpoints learns how
fiduciary points move as a function of the image appearance, and
then uses this model to label new footage.
Video Rewrite labels each image in the training video using a
total of 54 eigenpoints: 34 on the mouth (20 on the outer boundary,
12 on the inner boundary, 1 at the bottom of the upper teeth, and 1
at the top of the lower teeth) and 20 on the chin and jaw line. There
are two separate eigenpoint analyses. The first eigenspace controls
the placement of the 34 fiduciary points on the mouth, using
50 × 40 pixels around the nominal mouth location, a region that
covers the mouth completely. The second eigenspace controls the
placement of the 20 fiduciary points on the chin and jaw line, using
100 × 75 pixels around the nominal chin location, a region that
covers the upper neck and the lower part of the face.
We created the two eigenpoint models for locating the fidu-
ciary points from a small number of images. We hand annotated
only 26 images (of 14,218 images total; about 0.2%). We extended
the hand-annotated dataset by morphing pairs of annotated images
to form intermediate images, expanding the original 26 to 351
annotated images without any additional manual work. We then
derived eigenpoints models using this extended data set.
We use eigenpoints to find the mouth and jaw and to label their
contours. The derived eigenpoint models locate the facial features
using six basis vectors for the mouth and six different vectors for
the jaw. Eigenpoints then places the fiduciary points around the
feature locations: 32 basis vectors place points around the lips and
64 basis vectors place points around the jaw.
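To make the eigenpoints step concrete, the following sketch illustrates the general idea of a coupled subspace model: image pixels and fiduciary-point coordinates are stacked into one vector, a low-dimensional basis is learned from the hand-annotated examples, and, for a new image, the point coordinates are recovered from the image portion alone. The plain PCA and least-squares recovery below are our simplification for illustration, not the exact estimator of [Covell96].

```python
import numpy as np

def train_eigenpoints(images, points, k=6):
    """Learn a joint subspace over (pixels, fiduciary points).

    images: (n, p) array of vectorized grayscale patches
    points: (n, 2m) array of fiduciary-point coordinates
    k:      number of basis vectors to keep
    """
    joint = np.hstack([images, points])             # couple appearance and shape
    mean = joint.mean(axis=0)
    u, s, vt = np.linalg.svd(joint - mean, full_matrices=False)
    basis = vt[:k]                                  # (k, p + 2m) joint basis
    return mean, basis, images.shape[1]

def apply_eigenpoints(image, model):
    """Estimate fiduciary points for a new image patch."""
    mean, basis, p = model
    img_basis, pt_basis = basis[:, :p], basis[:, p:]
    # Solve for subspace coefficients using only the image portion...
    coeffs, *_ = np.linalg.lstsq(img_basis.T, image - mean[:p], rcond=None)
    # ...then reconstruct the point portion from those coefficients.
    return mean[p:] + coeffs @ pt_basis
```

The key design point is the coupling: because appearance and point locations share one basis, coefficients inferred from pixels alone carry the information needed to place the points.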
Eigenpoints assumes that the features (the mouth or the jaw)
are undergoing pure translational motion. It does a comparatively
poor job at modeling rotations and scale changes. Yet, Video
Rewrite is designed to use unconstrained footage. We expect rota-
tions and scale changes. Subjects may lean toward the camera or
turn away from it, tilt their heads to the side, or look up from under
their eyelashes.
To allow for a variety of motions, we warp each face image
into a standard reference plane, prior to eigenpoints labeling. We
find the global transform that minimizes the mean-squared error
between a large portion of the face image and a facial template. We
currently use an affine transform [Black95]. The mask shown in
Figure 4 defines the support of the minimization integral. Once the
best global mapping is found, it is inverted and applied to the
image, putting that face into the standard coordinate frame. We
then perform eigenpoints analysis on this pre-warped image to find
the fiduciary points. Finally, we back-project the fiduciary points
through the global warp to place them on the original face image.
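The masked, template-based registration described above can be prototyped as a search over the six affine parameters that minimizes the masked mean-squared error. The sketch below uses a general-purpose optimizer for clarity; the paper's estimator follows [Black95], so this is an illustration of the objective rather than of the authors' solver.

```python
import numpy as np
from scipy.ndimage import affine_transform
from scipy.optimize import minimize

def masked_affine_register(image, template, mask):
    """Find the affine warp that best matches `image` to `template`
    over the masked (upper-face) region."""
    def objective(params):
        a, b, c, d, tx, ty = params
        warped = affine_transform(image, np.array([[a, b], [c, d]]),
                                  offset=[tx, ty], order=1)
        diff = (warped - template) * mask           # restrict the error to the mask
        return np.mean(diff ** 2)

    x0 = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])   # start from the identity warp
    res = minimize(objective, x0, method="Powell")
    return res.x                                    # affine parameters of the best warp
```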
The labels provided by eigenpoints allow us automatically to
(1) build the database of example lip configurations, and (2) track
the features in a background scene that we intend to modify.
Section 4.2 describes how we match the points we find in step 1 to
each other and to the points found in step 2.
3.2 Annotation Using Audio Analysis
All the speech data in Video Rewrite (and their associated video)
are segmented into sequences of phonemes. Although single pho-
nemes are a convenient representation for linguistic analysis, they
are not appropriate for Video Rewrite. We want to capture the
visual dynamics of speech. To do so correctly, we must consider
coarticulation, which causes the lip shapes for many phonemes to
be modified based on the phoneme’s context. For example, the
/T/ in “beet” looks different from the /T/ in “boot.”
Therefore, Video Rewrite segments speech and video into tri-
phones: collections of three sequential phonemes. The word “tea-
pot” is split into the sequence of triphones /SIL-T-IY/ (see footnote 1),
/T-IY-P/, /IY-P-AA/, /P-AA-T/, and /AA-T-SIL/. When we synthesize a
video, we emphasize the middle of each triphone. We cross-fade
the overlapping regions of neighboring triphones. We thus ensure
that the precise transition points are not critical, and that we can
capture effectively many of the dynamics of both forward and
backward coarticulation.
Video Rewrite uses HMMs [Rabiner89] to label the training
footage with phonemes. We trained the HMMs using the TIMIT
speech database [Lamel86], a collection of 4200 utterances with
phonemic transcriptions that gives the uttered phonemes and their
timing. Each of the 61 phoneme categories in TIMIT is modeled
with a separate three-state HMM. The emission probabilities of
each state are modeled with mixtures of eight Gaussians with diag-
onal covariances. For robustness, we split the available data by
gender and train two speaker-independent, gender-specific sys-
tems, one based on 1300 female utterances, and one based on 2900
male utterances.
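The acoustic-model configuration described above (one three-state HMM per TIMIT phone, with eight diagonal-covariance Gaussians per state) maps directly onto off-the-shelf HMM toolkits. The sketch below uses the hmmlearn package as a stand-in; the library choice and the acoustic front end are our assumptions, since the paper does not specify its implementation.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_phone_models(features_by_phone):
    """Train one 3-state, 8-mixture, diagonal-covariance HMM per phone.

    features_by_phone: dict mapping a phone label (e.g. 'IY') to a list of
    (frames, n_features) arrays, one per training segment of that phone.
    """
    models = {}
    for phone, segments in features_by_phone.items():
        X = np.vstack(segments)                      # stack all segments
        lengths = [len(seg) for seg in segments]     # per-segment boundaries
        hmm = GMMHMM(n_components=3, n_mix=8,
                     covariance_type="diag", n_iter=20)
        hmm.fit(X, lengths)
        models[phone] = hmm
    return models
```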
We used these gender-specific HMMs to create a fine-grained
phonemic transcription of our input footage, using forced Viterbi
1. /SIL/ indicates silence. Two /SIL/ in a row are used at the
beginnings and ends of utterances to allow all segments—
including the beginning and end—to be treated as triphones.
Figure 3: Overview of eigenpoints. A small set of hand-labeled facial images is used to train subspace models. Given a new image, the eigenpoint models tell us the positions of points on the lips and jaw.
Figure 4: Mask used to estimate the global warp. Each
image is warped to account for changes in the head’s
position, size, and rotation. The transform minimizes the
difference between the transformed images and the face
template. The mask (left) forces the minimization to
consider only the upper face (right).
search [Viterbi67]. Forced Viterbi uses unaligned sentence-level
transcriptions and a phoneme-level pronunciation dictionary to
create a time-aligned phoneme-level transcript of the speech. From
this transcript, Video Rewrite segments the video automatically
into triphone videos, labels them, and includes them in the video
model.
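Given the time-aligned transcript, cutting the footage into labeled triphone clips is mostly bookkeeping. The sketch below shows one plausible arrangement; the clip data structure, the frame rate, and the assumption that the transcript is already padded with /SIL/ entries are ours.

```python
from dataclasses import dataclass

@dataclass
class TriphoneClip:
    label: tuple      # e.g. ('T', 'IY', 'P')
    start_frame: int  # first video frame of the clip
    end_frame: int    # last video frame of the clip

def segment_triphones(aligned_phones, fps=29.97):
    """Cut a time-aligned transcript into labeled triphone clips.

    aligned_phones: ordered list of (phone, start_sec, end_sec) covering the
    utterance, already padded with two /SIL/ entries at each end so that every
    real phoneme is the center of some triphone.
    """
    clips = []
    for i in range(1, len(aligned_phones) - 1):
        left, center, right = aligned_phones[i - 1:i + 2]
        clips.append(TriphoneClip(
            label=(left[0], center[0], right[0]),
            start_frame=int(left[1] * fps),          # clip spans all three phonemes
            end_frame=int(right[2] * fps)))
    return clips
```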
4 SYNTHESIS USING A VIDEO MODEL
As shown in Figure 2, Video Rewrite synthesizes the final lip-
synced video by labeling the new speech track, selecting a
sequence of triphone videos that most accurately matches the new
speech utterance, and stitching these images into a background
video.
The background video sets the scene and provides the desired
head position and movement. The background sequence in Video
Rewrite includes most of the subject’s face as well as the scene
behind the subject. The frames of the background video are taken
from the source footage in the same order as they were shot. The
head tilts and the eyes blink, based on the background frames.
In contrast, the different triphone videos are used in whatever
order is needed. They simply show the motions associated with
articulation. For all the animations in this paper, the triphone
images include the mouth, chin, and part of the cheeks, so that the
chin and jaw move and the cheeks dimple appropriately as the
mouth articulates. We use illumination-matching techniques
[Burt83] to avoid visible seams between the triphone and back-
ground images.
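The seam-avoidance step can be approximated with the multiresolution (Laplacian-pyramid) blending of [Burt83]: each image is decomposed into band-pass levels, the levels are mixed with a progressively blurred mask, and the result is recomposed. The sketch below is a compact, assumption-laden version of that idea, not the paper's production code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid(img, levels=4):
    """Build a Laplacian pyramid: band-pass images plus a low-pass residual."""
    pyramid, current = [], img.astype(float)
    for _ in range(levels):
        low = gaussian_filter(current, sigma=2.0)
        small = low[::2, ::2]                                  # downsample
        up = zoom(small, 2.0, order=1)[:current.shape[0], :current.shape[1]]
        pyramid.append(current - up)                           # band-pass detail
        current = small
    pyramid.append(current)                                    # low-pass residual
    return pyramid

def blend_multiresolution(a, b, mask, levels=4):
    """Blend images a and b with a soft mask, band by band [Burt83]."""
    pa, pb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    pm = [mask.astype(float)]
    for _ in range(levels):
        pm.append(gaussian_filter(pm[-1], sigma=2.0)[::2, ::2])
    out = pm[-1] * pa[-1] + (1 - pm[-1]) * pb[-1]              # blend residuals
    for la, lb, m in zip(reversed(pa[:-1]), reversed(pb[:-1]), reversed(pm[:-1])):
        up = zoom(out, 2.0, order=1)[:la.shape[0], :la.shape[1]]
        out = up + m * la + (1 - m) * lb                       # blend each band
    return out
```

Blending each frequency band separately hides both the seam and moderate illumination differences, which is why the paper cites this family of techniques for illumination matching.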
The first step in synthesis (Figure 2) is labeling the new
soundtrack. We label the new utterance with the same HMM that
we used to create the video-model phoneme labels. In Sections 4.1
and 4.2, we describe the remaining steps: selecting triphone videos
and stitching them into the background.
4.1 Selection of Triphone Videos
The new speech utterance determines the target sequence of
speech sounds, marked with phoneme labels. We would like to find
a sequence of triphone videos from our database that matches this
new speech utterance. For each triphone in the new utterance, our
goal is to find a video example with exactly the transition we need,
and with lip shapes that match the lip shapes in neighboring tri-
phone videos. Since this goal often is not reachable, we compromise by choosing a sequence of clips that approximates the
desired transitions and shape continuity.
Given a triphone in the new speech utterance, we compute a
matching distance to each triphone in the video database. The
matching metric has two terms: the phoneme-context distance,
Dp, and the distance between lip shapes in overlapping visual tri-
phones, Ds. The total error is

error = α · Dp + (1 − α) · Ds,

where the weight, α, is a constant that trades off the two factors.
The phoneme-context distance, Dp, is based on categorical
distances between phoneme categories and between viseme
classes. Since Video Rewrite does not need to create a new
soundtrack (it needs only a new video track), we can cluster pho-
nemes into viseme classes, based on their visual appearance.
We use 26 viseme classes. Ten are consonant classes: (1)
/CH/, /JH/, /SH/, /ZH/; (2) /K/, /G/, /N/, /L/; (3) /T/, /D/,
/S/, /Z/; (4) /P/, /B/, /M/; (5) /F/, /V/; (6) /TH/, /DH/; (7)
/W/, /R/; (8) /HH/; (9) /Y/; and (10) /NG/. Fifteen are vowel
classes: one each for /EH/, /EY/, /ER/, /UH/, /AA/,
/AO/, /AW/, /AY/, /UW/, /OW/, /OY/, /IY/, /IH/, /AE/, /AH/.
One class is for silence, /SIL/.
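The viseme grouping and the combined matching error translate directly into a small lookup table and a weighted sum, as sketched below. The intraclass distance value and the weight α are placeholders; the paper takes the former from published confusion matrices [Owens85] and does not state the latter.

```python
VISEME_CLASSES = [
    {"CH", "JH", "SH", "ZH"}, {"K", "G", "N", "L"}, {"T", "D", "S", "Z"},
    {"P", "B", "M"}, {"F", "V"}, {"TH", "DH"}, {"W", "R"}, {"HH"}, {"Y"}, {"NG"},
    {"EH"}, {"EY"}, {"ER"}, {"UH"}, {"AA"}, {"AO"}, {"AW"}, {"AY"},
    {"UW"}, {"OW"}, {"OY"}, {"IY"}, {"IH"}, {"AE"}, {"AH"}, {"SIL"},
]

def viseme_class(phone):
    return next(i for i, c in enumerate(VISEME_CLASSES) if phone in c)

def phone_distance(p, q, intraclass=0.5):
    """Categorical distance between two phonemes: 0 = same phoneme,
    1 = different viseme classes, in between = same class, different phoneme."""
    if p == q:
        return 0.0
    if viseme_class(p) != viseme_class(q):
        return 1.0
    return intraclass        # placeholder for the confusion-matrix value

def matching_error(Dp, Ds, alpha=0.5):
    """Total matching error for a candidate triphone video."""
    return alpha * Dp + (1.0 - alpha) * Ds
```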
The phoneme-context distance, Dp, is the weighted sum of
phoneme distances between the target phonemes and the video-
model phonemes within the context of the triphone. If the phone-
mic categories are the same (for example, /P/ and /P/), then this
distance is 0. If they are in different viseme classes (/P/ and /IY/),
then the distance is 1. If they are in different phonemic categories
but are in the same viseme class (/P/ and /B/), then the distance is
a value between 0 and 1. The intraclass distances are derived from
published confusion matrices [Owens85].
In Dp, the center phoneme of the triphone has the largest
weight, and the weights drop smoothly from there. Although the
video model stores only triphone images, we consider the triph-
one’s original context when picking the best-fitting sequence. In
current animations, this context covers the triphone itself, plus one
phoneme on either side.
The second term, Ds, measures how closely the mouth con-
tours match in overlapping segments of adjacent triphone videos.
In synthesizing the mouth shapes for “teapot” we want the con-
tours for the /IY/ and /P/ in the lip sequence used for /T-IY-P/ to
match the contours for the /IY/ and /P/ in the sequence used for
/IY-P-AA/. We measure this similarity by computing the Euclid-
ean distance, frame by frame, between four-element feature vec-
tors containing the overall lip width, overall lip height, inner lip
height, and height of visible teeth.
The lip-shape distance (Ds) between two triphone videos is
minimized with the correct time alignment. For example, consider
the overlapping contours for the /P/ in /T-IY-P/ and /IY-P-AA/.
The /P/ phoneme includes both a silence, when the lips are
pressed together, and an audible release, when the lips move rap-
idly apart. The durations of the initial silence within the /P/ pho-
neme may be different. The phoneme labels do not provide us with
this level of detailed timing. Yet, if the silence durations are differ-
ent, the lip-shape distance for two otherwise-well-matched videos
will be large. This problem is exacerbated by imprecision in the
HMM phonemic labels.
We want to find the temporal overlap between neighboring tri-
phones that maximizes the similarity between the two lip shapes.
We shift the two triphones relative to each other to find the best
temporal offset and duration. We then use this optimal overlap both
in computing the lip-shape distance, Ds, and in cross-fading the
triphone videos during the stitching step. The optimal overlap is
the one that minimizes Ds while still maintaining a minimum-
allowed overlap.
Since the fitness measure for each triphone segment depends
on that segment’s neighbors in both directions, we select the
sequence of triphone segments using dynamic programming over
the entire utterance. This procedure ensures the selection of the
optimal segments.
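Because each segment's fitness depends on its neighbors in both directions, the selection is naturally expressed as a shortest-path problem over the candidate clips for each target triphone. A minimal Viterbi-style dynamic program is sketched below; context_distance and lip_shape_distance stand for the Dp and Ds terms described above and are assumed to be supplied by the caller.

```python
import numpy as np

def select_triphone_sequence(candidates, context_distance, lip_shape_distance,
                             alpha=0.5):
    """Pick one candidate clip per target triphone, minimizing the total
    error alpha*Dp + (1-alpha)*Ds over the whole utterance.

    candidates: list (one entry per target triphone) of lists of clips.
    context_distance(clip, t): Dp for a clip used at target position t.
    lip_shape_distance(prev_clip, clip): Ds over the overlapping frames.
    """
    n = len(candidates)
    cost = [np.full(len(c), np.inf) for c in candidates]
    back = [np.zeros(len(c), dtype=int) for c in candidates]

    for j, clip in enumerate(candidates[0]):
        cost[0][j] = alpha * context_distance(clip, 0)

    for t in range(1, n):
        for j, clip in enumerate(candidates[t]):
            dp = alpha * context_distance(clip, t)
            trans = [cost[t - 1][i] +
                     (1 - alpha) * lip_shape_distance(prev, clip)
                     for i, prev in enumerate(candidates[t - 1])]
            best = int(np.argmin(trans))
            cost[t][j] = trans[best] + dp
            back[t][j] = best

    # Trace back the optimal path of clip indices.
    path = [int(np.argmin(cost[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [candidates[t][j] for t, j in zip(range(n), reversed(path))]
```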
4.2 Stitching It Together
Video Rewrite produces the final video by stitching together the
appropriate entries from the video database. At this point, we have
already selected a sequence of triphone videos that most closely
matches the target audio. We need to align the overlapping lip
images temporally. This internally time-aligned sequence of vid-
eos is then time aligned to the new speech utterance. Finally, the
resulting sequences of lip images are spatially aligned and are
stitched into the background face. We describe each step in turn.
4.2.1 Time Alignment of Triphone Videos
We have a sequence of triphone videos that we must combine to
form a new mouth movie. In combining the videos, we want to
maintain the dynamics of the phonemes and their transitions. We
need to time align the triphone videos carefully before blending
them. If we are not careful in this step, the mouth will appear to
flutter open and closed inappropriately.
We align the triphone videos by choosing a portion of the over-
lapping triphones where the two lip shapes are as similar as possi-
ble. We make this choice when we evaluate Ds to choose the
sequence of triphone videos (Section 4.1). We use the overlap
duration and shift that provide the minimum value of Ds for the
given videos.
4.2.2 Time Alignment of the Lips to the Utterance
We now have a self-consistent temporal alignment for the triphone
videos. We have the correct articulatory motions, in the correct
order to match the target utterance, but these articulations are not
yet time aligned with the target utterance.
We align the lip motions with the target utterance by compar-
ing the corresponding phoneme transcripts. The starting time of
the center phone in the triphone sequence is aligned with the corre-
sponding label in the target transcript. The triphone videos are then
stretched or compressed such that they fit the time needed between
the phoneme boundaries in the target utterance.
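Stretching or compressing a triphone video to fit the target phoneme boundaries amounts to resampling its frame indices, as in the sketch below; the frame rate and nearest-frame rounding are our assumptions.

```python
import numpy as np

def retime_clip(frames, target_duration_sec, fps=29.97):
    """Stretch or compress a list of video frames to last `target_duration_sec`.

    Frame content is not interpolated; source frames are repeated or skipped
    so the clip fills the time between the target phoneme boundaries.
    """
    n_out = max(1, int(round(target_duration_sec * fps)))
    src_idx = np.linspace(0, len(frames) - 1, n_out)
    return [frames[int(round(i))] for i in src_idx]
```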
4.2.3 Combining of the Lips and the Background
The remaining task is to stitch the triphone videos into the back-
ground sequence. The correctness of the facial alignment is critical
to the success of the recombination. The lips and head are con-
stantly moving in the triphone and background footage. Yet, we
need to align them all so that the new mouth is firmly planted on
the face. Any error in spatial alignment causes the mouth to jitter
relative to the face—an extremely disturbing effect.
We again use the mask from Figure 4 to help us find the opti-
mal global transform to register the faces from the triphone videos
with the background face. The combined transforms from the
mouth and background images to the template face (Section 3.1)
give our starting estimate in this search. Re-estimating the global
transform by directly matching the triphone images to the back-
ground improves the accuracy of the mapping.
We use a replacement mask to specify which portions of the
final video come from the triphone images and which come from
the background video. This replacement mask warps to fit the new
mouth shape in the triphone image and to fit the jaw shape in the
background image. Figure 5 shows an example replacement mask,
applied to triphone and background images.
Local deformations are required to stitch the shape of the
mouth and jaw line correctly. These two shapes are handled differ-
ently. The mouth’s shape is completely determined by the triphone
images. The only changes made to these mouth shapes are
imposed to align the mouths within the overlapping triphone
images: The lip shapes are linearly cross-faded between the shapes
in the overlapping segments of the triphone videos.
The jaw’s shape, on the other hand, is a combination of the
background jaw line and the two triphone jaw lines. Near the ears,
we want to preserve the background video’s jaw line. At the center
of the jaw line (the chin), the shape and position are determined
completely by what the mouth is doing. The final image of the jaw
must join smoothly together the motion of the chin with the motion
near the ears. To do this, we smoothly vary the weighting of the
background and triphone shapes as we move along the jawline
from the chin towards the ears.
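One way to realize this smoothly varying weighting is to give each jaw-line point a blend weight that is 1 at the chin and falls to 0 at the ears, then mix the background and triphone contours with those weights. The cosine profile in the sketch below is our own choice; the paper requires only that the weighting vary smoothly.

```python
import numpy as np

def blend_jaw_contours(background_jaw, triphone_jaw):
    """Blend two jaw contours, keeping the background shape near the ears
    and the triphone shape at the chin.

    Both inputs: (n_points, 2) arrays ordered from one ear, through the chin,
    to the other ear.
    """
    n = len(background_jaw)
    # Weight is 0 at both ends (ears) and 1 in the middle (chin).
    w = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(n) / max(n - 1, 1)))
    w = w[:, None]
    return (1.0 - w) * background_jaw + w * triphone_jaw
```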
The final stitching process is a three-way tradeoff in shape and
texture among the fade-out lip image, the fade-in lip image, and
the background image. As we move from phoneme to phoneme,
the relative weights of the mouth shapes associated with the over-
lapping triphone-video images are changed. Within each frame, the
relative weighting of the jaw shapes contributed by the background
image and of the triphone-video images are varied spatially.
The derived fiduciary positions are used as control points in
morphing. All morphs are done with the Beier-Neely algorithm
[Beier92]. For each frame of the output image we need to warp
four images: the two triphones, the replacement mask, and the
background face. The warping is straightforward since we auto-
matically generate high-quality control points using the eigen-
points algorithm.
5 RESULTS
We have applied Video Rewrite to several different training data-
bases. We recorded one video dataset specifically for our evalua-
tions. Section 5.1 describes our methods to collect this data and
create lip-sync videos. Section 5.2 evaluates the resulting videos.
We also trained video models using truncated versions of our
evaluation database. Finally, we used old footage of John F.
Kennedy. We present the results from these experiments in Section
5.3.
5.1 Methods
We recorded about 8 minutes of video, containing 109 sentences,
of a subject narrating a fairy tale. During the reading, the subject
was asked to directly face the camera for some parts (still-head
video) and to move and glance around naturally for others (mov-
ing-head video). We use these different segments to study the
errors in local deformations separately from the errors in global
spatial registration. The subject was also asked to wear a hat during
the filming. We use this landmark to provide a quantitative evalua-
tion of our global alignment. The hat is strictly outside all our
alignment masks and our eigenpoints models. Thus, having the
subject wear the hat does not affect the magnitude or type of errors
that we expect to see in the animations—it simply provides us with
a reference marker for the position and movement of her head.
To create a video model, we trained the system on all the still-
head footage. Video Rewrite constructed and annotated the video
model with just under 3500 triphone videos automatically, using
HMM labeling of triphones and eigenpoint labeling of facial con-
tours.
Video Rewrite was then given the target sentence, and was
asked to construct the corresponding image sequence. To avoid
unduly optimistic results, we removed from the database the tri-
phone videos from training sentences similar to the target. A train-
ing sentence was considered similar to the target if the two shared
a phrase two or more words long. Note that Video Rewrite would
not normally pare the database in this manner: Instead, it would
take advantage of these coincidences. We remove the similar sen-
tences to avoid biasing our results.
We evaluated our output footage both qualitatively and quanti-
tatively. Our qualitative evaluation was done informally, by a panel
Figure 5: Facial fading mask. This mask determines
which portions of the final movie frames come from the
background frame, and which come from the triphone
database. The mask should be large enough to include the
mouth and chin. These images show the replacement
mask applied to a triphone image, and its inverse applied
to a background image. The mask warps according to the
mouth and chin motions.
of observers. There are no accepted metrics for evaluating lip-
synced footage. Instead, we were forced to rely on the qualitative
judgements listed in Section 5.2.
Only the (global) spatial registration is evaluated quantita-
tively. Since our subject wore a hat that moved rigidly with her
upper head, we were able to measure quantitatively our global-reg-
istration error on this footage. We did so by first warping the full
frame (instead of just the mouth region) of the triphone image into
the coordinate frame of the background image. If this global trans-
formation is correct, it should overlay the two images of the hat
exactly on top of one another. We measured the error by finding the
offset of the correlation peak for the image regions corresponding
to the front of the hat. The offset of the peak is the registration
error (in pixels).
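This measurement can be reproduced with a normalized cross-correlation between the hat regions of the warped triphone frame and the background frame, taking the offset of the correlation peak as the residual registration error. The region selection and normalization details below are assumptions.

```python
import numpy as np
from scipy.signal import correlate2d

def registration_error(warped_hat, background_hat):
    """Return the (dy, dx) offset of the correlation peak between two image
    patches covering the front of the hat; (0, 0) means the global warp
    registered them exactly."""
    a = warped_hat - warped_hat.mean()
    b = background_hat - background_hat.mean()
    corr = correlate2d(a, b, mode="same")
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    center = (np.array(corr.shape) - 1) // 2          # approximate zero-lag position
    return dy - center[0], dx - center[1]
```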
5.2 Evaluation
Examples of our output footage can be seen at http://www.inter-
val.com/papers/1997-012/. The top row of Figure 6 shows example
frames, extracted from these videos. This section describes our
evaluation criteria and the results.
5.2.1 Lip and Utterance Synchronization
How well are the lip motions synchronized with the audio? We
evaluate this measure on the still-head videos. There occasionally
are visible timing errors in plosives and stops.
5.2.2 Triphone-Video Synchronization
Do the lips flutter open and closed inappropriately? This artifact
usually is due to synchronization error in overlapping triphone vid-
eos. We evaluated this measure on the still-head videos. We do not
see any artifacts of this type.
5.2.3 Natural Articulation
Assuming that neither of the artifacts from Sections 5.2.1 or 5.2.2
appear, do the lip and teeth articulations look natural? Unnatural-
looking articulation can result if the desired sequence of phonemes
is not available in the database, and thus another sequence is used
in its place. In our experiments, this replacement occurred on 31
percent of the triphone videos. We evaluated this measure on the
still-head videos. We do not see this type of error when we use the
full video model. Additional experiments in this area are described
in Section 5.3.1.
5.2.4 Fading-Mask Visibility and Extent
Does the fading mask show? Does the animation have believable
texture and motion around the lips and chin? Do the dimples move
in sync with the mouth? We evaluated this measure on all the out-
put videos. The still-head videos better show errors associated with
the extent of the fading mask, whereas the moving-head videos
better show errors due to interactions between the fading mask and
the global transformation. Without illumination correction, we see
artifacts in some of the moving-head videos, when the subject
looked down so that the lighting on her face changed significantly.
These artifacts disappear with adaptive illumination correction
[Burt83].
5.2.5 Background Warping
Do the outer edges of the jaw line and neck, and the upper portions
of the cheeks look realistic? Artifacts in these areas are due to
incorrect warping of the background image or to a mismatch
between the texture and the warped shape of the background
image. We evaluated this measure on all the output videos. In some
segments, we found minor artifacts near the outer edges of the jaw.
5.2.6 Spatial Registration
Does the mouth seem to float around on the face? Are the teeth rig-
idly attached to the skull? We evaluated this measure on the mov-
ing-head videos. No registration errors are visible.
We evaluated this error quantitatively as well, using the hat-
registration metric described in Section 5.1. The mean, median,
and maximum errors in the still-head videos were 0.6, 0.5, and 1.2
pixels (standard deviation 0.3); those in the moving-head videos
were 1.0, 1.0, and 2.0 pixels (standard deviation 0.4). For compari-
son, the face covers approximately 85 × 120 pixels.
5.2.7 Overall Quality
Is the lip-sync believable? We evaluated this measure on all the
output videos. We judged the overall quality as excellent.
Figure 6: Examples of synthesized output frames. These frames show the quality of our output after triphone segments have been
stitched into different background video frames.
5.3 Other Experiments
In this section, we examine our performance using steadily smaller
training databases (Section 5.3.1) and using historic footage (Sec-
tion 5.3.2).
5.3.1 Reduction of Video Model Size
We wanted to see how the quality fell off as the amount of data
available in the video model was reduced. With the 8 minutes of
video, we have examples of approximately 1700 different tri-
phones (of around 19,000 naturally occurring triphones); our ani-
mations used triphones other than the target triphones 31 percent
of the time. What happens when we have only 1 or 2 minutes of
data? We truncated our video database to one-half, one-quarter,
and one-eighth of its original size, and then reanimated our target
sentences. The percentage of mismatched triphones increased by
about 15 percentage points with each halving of the database (that is, 46, 58,
and 74 percent of the triphones were replaced in the reduced
datasets). The perceptual quality also degraded smoothly as the
database size was reduced. The videos from the reduced datasets are
shown on our web site.
5.3.2 Reanimation of Historic Footage
We also applied Video Rewrite to public-domain footage of John F.
Kennedy. For this application, we digitized 2 minutes (1157 tri-
phones) of Kennedy speaking during the Cuban missile crisis.
Forty-five seconds of this footage are from a close-up camera,
about 30 degrees to Kennedy's left. The remaining images are
medium shots from the same side. The size ratio is approximately
5:3 between the close-up and medium shots. During the footage,
Kennedy moves his head about 30 degrees vertically, reading his
speech from notes on the desk and making eye contact with a cen-
ter camera (which we do not have).
We used this video model to synthesize new animations of
Kennedy saying, for example, “Read my lips” and “I never met
Forrest Gump.” These animations combine the footage from both
camera shots and from all head poses. The resulting videos are
shown on our web site. The bottom row of Figure 6 shows example
frames, extracted from these videos.
In our preliminary experiments, we were able to find the cor-
rect triphone sequences just 6% of the time. The lips are reliably
synchronized to the utterance. The fading mask is not visible, nor
is the background warping. However, the overall animation quality
is not as good as our earlier results. The animations include some
lip fluttering, because of the mismatched triphone sequences.
Our quality is limited for two reasons. The available viseme
footage is distributed over a wide range of vertical head rotations.
If we choose triphones that match the desired pose, then we cannot
find good matches for the desired phoneme sequence. If we choose
triphones that are well matched to the desired phoneme sequence,
then we need to dramatically change the pose of the lip images. A
large change in pose is difficult to model with our global (affine)
transform. The lip shapes are distorted because we assumed,
implicitly in the global transform, that the lips lie on a flat plane.
Both the limited-triphone and pose problems can be avoided with
additional data.
6 FUTURE WORK
There are many ways in which Video Rewrite could be extended
and improved. The phonemic labeling of the triphone and back-
ground footage could consider the mouth- and jaw-shape informa-
tion, as well as acoustic data [Bregler95]. Additional lip-image
data and multiple eigenpoints models could be added, allowing
larger out-of-plane head rotations.
The acoustic data could be used
in selecting the triphone videos, because facial expressions affect
voice qualities (you can hear a smile). The synthesis could be
made real-time, with low latency.
In Sections 6.1 through 6.3, we explore extensions that we
think are most promising and interesting.
6.1 Alignment Between Lips and Target
We currently use the simplest approach to time aligning the lip
sequences with the target utterance: We rely on the phoneme
boundaries. This approach provides a rough alignment between the
motions in the lip sequence and the sounds in the target utterance.
As we mentioned in Section 4.1, however, the phoneme boundaries
are both imprecise (the HMM alignment is not perfect) and coarse
(significant visual and auditory landmarks occur within single pho-
nemes).
A more accurate way to time align the lip motions with the tar-
get utterance uses dynamic time warping of the audio associated
with each triphone video to the corresponding segment of the tar-
get utterance. This technique would allow us to time align the audi-
tory landmarks from the triphone videos with those of the target
utterance, even if the landmarks occur at subphoneme resolution.
This time alignment, when applied to the triphone image sequence,
would then align the visual landmarks of the lip sequence with the
auditory landmarks of the target utterance.
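A standard dynamic-time-warping recursion over per-frame acoustic feature vectors would provide the subphoneme alignment discussed here; the feature choice (for example, cepstral frames) is left open in this sketch.

```python
import numpy as np

def dtw_path(a, b):
    """Align two sequences of acoustic feature vectors, shapes (n, d) and (m, d).
    Returns a list of (i, j) index pairs along the optimal warping path."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # Trace back from the end to recover the alignment path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]
```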
The overlapping triphone videos would provide overlapping
and conflicting time warpings. Yet we want to keep fixed the time
alignment of the overlapping triphone videos, as dictated by the
visual distances (Section 4.1 and 4.2). Research is needed in how
best to trade off these potentially conflicting time-alignment maps.
6.2 Animation of Facial Features
Another promising extension is animation of other facial parts,
based on simple acoustic features or other criteria. The simplest
version of this extension would change the position of the eye-
brows with pitch [Ohala94]. A second extension would index the
video model by both triphone and expression labels. Using such
labels, we would select smiling or frowning lips, as desired. Alter-
natively, we could impose the desired expression on a neutral
mouth shape, for those times when the appropriate combinations
of triphones and expression are not available. To do this imposition
correctly, we must separate which deformations are associated
with articulations, and which are associated with expressions, and
how the two interact. This type of factorization must be learned
from examples [Tenenbaum97].
6.3 Perception of Lip Shapes
In doing this work, we solved many problems—automatic label-
ing, matching, and stitching—yet we found many situations where
we did not have sufficient knowledge of how people perceive
speaking faces. We would like to know more about how important
the correct lip shapes and motions are in lip synching. For exam-
ple, one study [Owens85] describes the confusability of consonants
in vowel–consonant–vowel clusters. The clustering of consonants
into viseme classes depends on the surrounding vowel context.
Clearly, we need more sophisticated distance metrics within and
between viseme classes.
7 CONTRIBUTIONS
Video Rewrite is a facial animation system that is driven by audio
input. The output sequence is created from real video footage. It
combines background video footage, including natural facial
movements (such as eye blinks and head motions), with natural
footage of mouth and chin motions. Video Rewrite is the first
facial-animation system to automate all the audio- and video-label-
ing tasks required for this type of reanimation.
Video Rewrite can use images from unconstrained footage
both to create the video model of the mouth and chin motions and
to provide a background sequence for the final output footage. It
preserves the individual characteristics of the subject in the origi-
nal footage, even while the subject appears to mouth a completely
new utterance. For example, the temporal dynamics of John F.
Kennedy’s articulatory motions can be preserved, reorganized, and
reimposed on Kennedy’s face.
Since Video Rewrite retains most of the background frame,
modifying only the mouth area, it is well suited to applications
such as movie dubbing. The setting and action are provided by the
background video. Video Rewrite maintains an actor’s visual man-
nerisms, using the dynamics of the actor’s lips and chin from the
video model for articulatory mannerisms, and using the back-
ground video for all other mannerisms. It maintains the correct
timing, using the action as paced by the background video and
speech as paced by the new soundtrack. It undertakes the entire
process without manual intervention. The actor convincingly
mouths something completely new.
ACKNOWLEDGMENTS
Many colleagues helped us. Ellen Tauber and Marc Davis gra-
ciously submitted to our experimental manipulation. Trevor Dar-
rell and Subutai Ahmad contributed many good ideas to the
algorithm development. Trevor, Subutai, John Lewis, Bud Lassiter,
Gaile Gordon, Kris Rahardja, Michael Bajura, Frank Crow, Bill
Verplank, and John Woodfill helped us to evaluate our results and
the description. Bud Lassiter and Chris Seguine helped us with the
video production. We offer many thanks to all.
REFERENCES
[Beier92] T. Beier, S. Neely. Feature-based image metamorphosis.
Computer Graphics,
26(2):35–42, 1992. ISSN 0097-8930.
[Black95] M.J. Black, Y. Yacoob. Tracking and recognizing rigid
and non-rigid facial motions using local parametric models of
image motion.
Proc. IEEE Int. Conf. Computer Vision,
Cam-
bridge, MA, pp. 374–381, 1995. ISBN 0-8186-7042-8.
[Bregler95] C. Bregler, S. Omohundro. Nonlinear manifold learn-
ing for visual speech recognition.
Proc. IEEE Int. Conf. Com-
puter Vision,
Cambridge, MA, pp. 494–499, 1995. ISBN 0-
8186-7042-8.
[Burt83] P.J. Burt, E.H. Adelson. A multiresolution spline with
application to image mosaics.
ACM Trans. Graphics,
2(4):
217–236, 1983. ISSN 0730-0301.
[Cohen93] M.M. Cohen, D.W. Massaro. Modeling coarticulation
in synthetic visual speech. In
Models and Techniques in Com-
puter Animation,
ed. N.M Thalman, D. Thalman, pp. 139–156,
Tokyo: Springer-Verlag, 1993. ISBN 0-3877-0124-9.
[Covell96] M. Covell, C. Bregler. Eigenpoints.
Proc. Int. Conf.
Image Processing,
Lausanne, Switzerland
,
Vol. 3, pp. 471–
474, 1996. ISBN 0-7803-3258-x.
[Guiard-Marigny94] T. Guiard-Marigny, A. Adjoudani, C. Benoit.
A 3-D model of the lips for visual speech synthesis.
Proc.
ESCA/IEEE Workshop on Speech Synthesis
, New Paltz, NY,
pp. 49–52, 1994.
[Kass87] M. Kass, A. Witkin, D. Terzopoulos. Snakes: Active con-
tour models.
Int. J. Computer Vision,
1(4):321–331, 1987.
ISSN 0920-5691.
[Kirby90] M. Kirby, L. Sirovich. Application of the Karhunen-
Loeve procedure for the characterization of human faces.
IEEE
PAMI
, 12(1):103–108, Jan. 1990. ISSN 0162-8828.
[Lamel86] L. F. Lamel, R. H. Kessel, S. Seneff. Speech database
development: Design and analysis of the acoustic-phonetic
corpus.
Proc. Speech Recognition Workshop (DARPA)
, Report
#SAIC-86/1546, pp. 100–109, McLean VA: Science Applica-
tions International Corp., 1986.
[Lanitis95] A. Lanitis, C.J. Taylor, T.F. Cootes. A unified approach
for coding and interpreting face images.
Proc. Int. Conf. Com-
puter Vision,
Cambridge, MA, pp. 368–373, 1995. ISBN 0-
8186-7042-8.
[Lewis91] J. Lewis. Automated lip-sync: Background and tech-
niques. J. Visualization and Computer Animation, 2(4):118–
122, 1991. ISSN 1049-8907.
[Litwinowicz94] P. Litwinowicz, L. Williams. Animating images
with drawings. SIGGRAPH 94, Orlando, FL, pp. 409–412,
1994. ISBN 0-89791-667-0.
[Morishima91] S. Morishima, H. Harashima. A media conversion
from speech to facial image for intelligent man-machine inter-
face. IEEE J Selected Areas Communications, 9 (4):594–600,
1991. ISSN 0733-8716.
[Moulines90] E. Moulines, P. Emerard, D. Larreur, J. L. Le Saint
Milon, L. Le Faucheur, F. Marty, F. Charpentier, C. Sorin. A
real-time French text-to-speech system generating high-quality
synthetic speech. Proc. Int. Conf. Acoustics, Speech, and Sig-
nal Processing, Albuquerque, NM, pp. 309–312, 1990.
[Ohala94] J.J. Ohala. The frequency code underlies the sound
symbolic use of voice pitch. In Sound Symbolism, ed. L. Hin-
ton, J. Nichols, J. J. Ohala, pp. 325–347, Cambridge UK:
Cambridge Univ. Press, 1994. ISBN 0-5214-5219-8.
[Owens85] E. Owens, B. Blazek. Visemes observed by hearing-
impaired and normal-hearing adult viewers. J. Speech and
Hearing Research, 28:381–393, 1985. ISSN 0022-4685.
[Parke72] F. Parke. Computer generated animation of faces. Proc.
ACM National Conf., pp. 451–457, 1972.
[Rabiner89] L. R. Rabiner. A tutorial on hidden markov models
and selected applications in speech recognition. In Readings in
Speech Recognition, ed. A. Waibel, K. F. Lee, pp. 267–296,
San Mateo, CA: Morgan Kaufmann Publishers, 1989. ISBN 1-
5586-0124-4.
[Scott94] K.C. Scott, D.S. Kagels, S.H. Watson, H. Rom, J.R.
Wright, M. Lee, K.J. Hussey. Synthesis of speaker facial
movement to match selected speech sequences. Proc. Austra-
lian Conf. Speech Science and Technology, Perth Australia, pp.
620–625, 1994. ISBN 0-8642-2372-2.
[Tenenbaum97] J. Tenenbaum, W. Freeman. Separable mixture
models: Separating style and content. In Advances in Neural
Information Processing 9, ed. M. Jordan, M. Mozer, T.
Petsche, Cambridge, MA: MIT Press, (in press).
[Turk91] M. Turk, A. Pentland. Eigenfaces for recognition. J. Cog-
nitive Neuroscience, 3(1):71–86, 1991. ISSN 0898-929X
[Viterbi67] A. J. Viterbi. Error bounds for convolutional codes and
an asymptotically optimal decoding algorithm. IEEE Trans.
Informat. Theory, IT-13:260–269, 1967. ISSN 0018-9448.
[Waters95] K. Waters, T. Levergood. DECface: A System for Syn-
thetic Face Applications. J. Multimedia Tools and Applica-
tions, 1 (4):349–366, 1995. ISSN 1380-7501.
[Williams90] L. Williams. Performance-Driven Facial Animation.
Computer Graphics (Proceedings of SIGGRAPH 90),
24(4):235–242, 1990. ISSN 0097-8930.
[Yuille89] A.L. Yuille, D.S. Cohen, P.W. Hallinan. Feature extrac-
tion from faces using deformable templates. Proc. IEEE Com-
puter Vision and Pattern Recognition, San Diego, CA, pp.
104–109, 1989. ISBN 0-8186-1952-x.
... A similar approach is used by many speech-driven techniques, which is to directly map an acoustic feature sequence into a visual feature sequence. The underlying face model can also be used to categorize the above approaches, into model-based [3,11,12,13,14,15] and image-based [6,10,16,17,18,19]. The most popular methods used in each approach are presented below. ...
... Yamamoto et al. (1998) use a similar approach to measure the lip variables series [5]. The Video Rewrite methodology (Bregler et al. 1997) produces an array of triphones that are used to scan a repository for mouth photos by employing the same standards [6]. Ultimately, the observed solution is built by real-time synchronizing the voice with the frames, after which, spatially binding the jaw pieces to the background face is done. ...
... Yamamoto et al. (1998) use a similar approach to measure the lip variables series [5]. The Video Rewrite methodology (Bregler et al. 1997) produces an array of triphones that are used to scan a repository for mouth photos by employing the same standards [6]. Ultimately, the observed solution is built by real-time synchronizing the voice with the frames, after which, spatially binding the jaw pieces to the background face is done. ...
... Talking face generation has been studied since 1990s [15][16][17][18], and it was mainly used in cartoon animation [15] or visual-speech perception experiments [16]. With the advancement of computer technology and the popularization of network services, new application scenarios emerged. ...
... To extract essential information that is useful for talking face animation, one would require robust methods to analyze and comprehend the underlying speech signal [7,12,22,[25][26][27][28]. As the target of talking face generation, face modeling and analysis are also important. Models that characterize human faces have been proposed and applied to various tasks [17,22,23,[29][30][31][32][33]. As the bridge that joins audio and face, audio-to-face animation is the key component in talking face generation. ...
... 2D Models. 2D facial representations like 2D landmarks [17,22,[29][30][31], action units (AUs) [32], and reference face images [23,31,33] are commonly used in talking face generation. Facial landmark detection is defined as the task of localizing and representing salient regions of the face. ...
Chapter
Full-text available
Talking face generation aims at synthesizing coherent and realistic face sequences given an input speech. The task enjoys a wide spectrum of downstream applications, such as teleconferencing, movie dubbing, and virtual assistant. The emergence of deep learning and cross-modality research has led to many interesting works that address talking face generation. Despite great research efforts in talking face generation, the problem remains challenging due to the need for fine-grained control of face components and the generalization to arbitrary sentences. In this chapter, we first discuss the definition and underlying challenges of the problem. Then, we present an overview of recent progress in talking face generation. In addition, we introduce some widely used datasets and performance metrics. Finally, we discuss open questions, potential future directions, and ethical considerations in this task.
... Prior to the popularity of deep learning, many researchers mainly adopted crossmodal retrieval methods [1][2][3][4] and Hidden Markov Model (HMM) to solve this problem [5]. However, cross-modal retrieval methods based on the mapping relationship between morphemes and visemes do not consider the contextual semantic information of the speech. ...
Article
Full-text available
Virtual human is widely employed in various industries, including personal assistance, intelligent customer service, and online education, thanks to the rapid development of artificial intelligence. An anthropomorphic digital human can quickly contact people and enhance user experience in human-computer interaction. Hence, we design the human-computer interaction system framework, which includes speech recognition, text-to-speech, dialogue systems, and virtual human generation. Next, we classify the model of talking-head video generation by the virtual human deep generation framework. Meanwhile, we systematically review the past five years' worth of technological advancements and trends in talking-head video generation, highlight the critical works and summarize the dataset.
... Various approaches have been proposed to achieve realistic and well-synchronized talking portrait videos. Conventional methods [6,7] define a set of phoneme-mouth correspondence rules and use stitching-based techniques to modify mouth shapes. Deep learning enables image-based methods [14,16,24,42,43,48,52,56,59,[64][65][66] by synthesizing images corresponding to the audio inputs. ...
Preprint
While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesizing of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Both modules focus on efficiency under the premise of good rendering quality. Extensive experiments demonstrate that our method can generate realistic and audio-lips synchronized talking portrait videos, while also being highly efficient compared to previous methods.
... The beginnings of this automation of forgery can be traced to (Bregler, Covell, and Slaney 1997). The search for coherence in the image, based on the absence or low frequency of blinking in the eyes of the depicted body, became a paradigmatic cue in the early (human) detection of deepfakes. This formal defect makes plain that the GANs producing deepfakes feed on an ecosystem in which images are too uniform (faces depicted with eyes always open), and it exposes the scarcity of certain kinds of material (photographs of faces with closed eyes) that would be needed to reach a higher degree of verisimilitude. ...
Article
The machines that produce deepfakes do not need the real world in order to produce plausible representations. This should bring about a crisis of the old contract of veridiction that we maintain with images. The emergence of the deepfake turns the totality of images into raw material for artistic agency. Nevertheless, institutions such as the justice system maintain their trust in the image as a device of knowledge. In a social context of deep suspicion toward any kind of representation, the image is still used to determine what is true and what is false. Since the appearance of the deepfake, measures and countermeasures have followed one another, increasingly dependent on artificial intelligence, aimed at keeping images part of the epistemological apparatus we use to know the world. In this article, through various case studies, bibliography, and relevant documents on the subject, we trace a line through the effort to verify the authenticity of images as faithful representations of reality. This line connects the old fake pictures with the new phenomenon of the deepfake. We conclude by asking whether images are now merely a reflection of our desire to see, while we have abandoned and entirely handed over to machines our critical apparatus for determining truth. In this context, what is real is settled in a conversation between machines (GANs) on the basis of its plausible appearance, not its content.
... Before there were deep neural networks, GANs, massive data sets, and unlimited compute cycles, Chris Bregler and colleagues created what would now be called lip-sync deep fakes (Bregler, Covell, and Slaney 1997). In this seminal video-rewrite work, a video of a person speaking is automatically modified to create a video of them saying things not found in the original footage. ...
Article
Synthetic media—so-called deep fakes—have captured the imagination of some and struck fear in others. Although they vary in their form and creation, deep fakes refer to text, image, audio, or video that has been automatically synthesized by a machine-learning system. Deep fakes are the latest in a long line of techniques used to manipulate reality, yet their introduction poses new opportunities and risks due to the democratized access to what would have historically been the purview of Hollywood-style studios. This review describes how synthetic media is created, how it is being used and misused, and if (and how) it can be perceptually and forensically distinguished from reality.
Article
Deep person generation has attracted extensive research attention due to its wide applications in virtual agents, video conferencing, online shopping, and art/movie production. With the advancement of deep learning, the visual appearance (face, pose, cloth) of a person image can be easily generated on demand. In this survey, we first summarize the scope of person generation, and then systematically review recent progress and technical trends in identity-preserving deep person generation, covering three major tasks: talking-head generation (face), pose-guided person generation (pose), and garment-oriented person generation (cloth). More than two hundred papers are covered for a thorough overview, and the milestone works are highlighted to mark the major technical breakthroughs. Based on these fundamental tasks, many applications are investigated, e.g., virtual fitting, digital humans, and generative data augmentation. We hope this survey can shed some light on the future prospects of identity-preserving deep person generation and provide a helpful foundation for full applications of digital humans.
Book
This comprehensive text explores the relationship between identity, subjectivity and digital communication, providing a strong starting point for understanding how fast-changing communication technologies, platforms, applications and practices have an impact on how we perceive ourselves, others, relationships and bodies. Drawing on critical studies of identity, behaviour and representation, Identity and Digital Communication demonstrates how identity is shaped and understood in the context of significant and ongoing shifts in online communication. Chapters cover a range of topics including advances in social networking, the development of deepfake videos, intimacies of everyday communication, the emergence of cultures based on algorithms, the authenticities of TikTok and online communication’s setting as a site for hostility and hate speech. Throughout the text, author Rob Cover shows how the formation and curation of self-identity is increasingly performed and engaged with through digital cultural practices, affirming that these practices must be understood if we are to make sense of identity in the 2020s and beyond. Featuring critical accounts, everyday examples and analysis of key platforms such as TikTok, this textbook is an essential primer for scholars and students in media studies, psychology, cultural studies, sociology, anthropology, computer science, as well as health practitioners, mental health advocates and community members.
Conference Paper
The work described here extends the power of 2D animation with a form of texture mapping conveniently controlled by line drawings. By tracing points, line segments, spline curves, or filled regions on an image, the animator defines features which can be used to animate the image. Animations of the control features deform the image smoothly. This development is in the tradition of “skeleton”-based animation, and “feature”-based image metamorphosis. By employing numerics developed in the computer vision community for rapid visual surface estimation, several important advantages are realized. Skeletons are generalized to include curved “bones,” the interpolating surface is better behaved, the expense of computing the animation is decoupled from the number of features in the drawing, and arbitrary holes or cuts in the interpolated surface can be accommodated. The same general scattered data interpolation technique is applied to the problem of mapping animation from one image and set of features to another, generalizing the prescriptive power of animated sequences and encouraging reuse of animated motion.
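The scattered-data interpolation that drives this kind of feature-based image animation can be sketched with SciPy's radial basis function interpolator: displacements known only at a few control features are interpolated into a dense warp field. The control points and displacements below are arbitrary placeholders, not data from the cited work.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Sketch of feature-driven warping: sparse control-point displacements
# are interpolated to a dense per-pixel displacement field.
h, w = 240, 320

# Placeholder control features (x, y) and their animated displacements (dx, dy).
control_pts = np.array([[50, 60], [200, 80], [120, 180], [280, 200]], float)
displacements = np.array([[5, -3], [-4, 2], [0, 6], [3, 3]], float)

# Thin-plate-spline interpolation, in the spirit of "skeleton"/"feature"
# based image animation.
field = RBFInterpolator(control_pts, displacements, kernel="thin_plate_spline")

ys, xs = np.mgrid[0:h, 0:w]
grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
dense = field(grid).reshape(h, w, 2)    # per-pixel (dx, dy) warp field

print(dense.shape, dense[60, 50])       # displacement at the first control point
```

Because the interpolation cost depends on the number of control points rather than the image size, adding features does not make each animated frame much more expensive, which matches the decoupling the abstract emphasizes.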
Article
We have developed a near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals. The computational approach taken in this system is motivated by both physiology and information theory, as well as by the practical requirements of near-real-time performance and accuracy. Our approach treats the face recognition problem as an intrinsically two-dimensional (2-D) recognition problem rather than requiring recovery of three-dimensional geometry, taking advantage of the fact that faces are normally upright and thus may be described by a small set of 2-D characteristic views. The system functions by projecting face images onto a feature space that spans the significant variations among known face images. The significant features are known as "eigenfaces," because they are the eigenvectors (principal components) of the set of faces; they do not necessarily correspond to features such as eyes, ears, and noses. The projection operation characterizes an individual face by a weighted sum of the eigenface features, and so to recognize a particular face it is necessary only to compare these weights to those of known individuals. Some particular advantages of our approach are that it provides for the ability to learn and later recognize new faces in an unsupervised manner, and that it is easy to implement using a neural network architecture.
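The eigenface pipeline summarized above reduces to a PCA projection followed by a nearest-neighbour comparison of the projection weights. Below is a minimal NumPy sketch on randomly generated "images"; the image sizes and data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder training set: 20 face images of 32x32 pixels, flattened.
faces = rng.random((20, 32 * 32))
mean_face = faces.mean(axis=0)
centered = faces - mean_face

# Eigenfaces = principal components of the face set.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt[:8]                          # keep the top 8 components

weights = centered @ eigenfaces.T            # each known face as 8 weights

# Recognize a probe image by comparing its weights to the known ones.
probe = faces[3] + 0.01 * rng.random(32 * 32)    # noisy copy of face 3
probe_w = (probe - mean_face) @ eigenfaces.T
match = np.argmin(np.linalg.norm(weights - probe_w, axis=1))
print("closest known face:", match)          # expected: 3
```

As the abstract notes, the eigenfaces are directions of maximal variance in image space and need not correspond to intuitive features such as eyes or noses.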
Article
A system is described which allows for the synthesis of a video sequence of a realistic-appearing talking human head. A phonetic-based approach is used to describe facial motion; image processing rather than physical modeling techniques are used to create the video frames.
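A phonetic, image-based synthesis of this kind can be caricatured as a lookup from phoneme labels to stored mouth frames, concatenated in time. The phoneme-to-viseme table and the frame store below are invented for illustration and are not from the cited system.

```python
import numpy as np

# Toy phoneme-to-viseme lookup and frame assembly; all data are placeholders.
phoneme_to_viseme = {"b": "closed", "m": "closed",
                     "aa": "open", "ae": "open",
                     "uw": "round", "ow": "round"}

# One stored mouth image (here a random 64x64 array) per viseme class.
frame_bank = {v: np.random.rand(64, 64) for v in set(phoneme_to_viseme.values())}

def assemble(phoneme_seq, frames_per_phoneme=3):
    """Return a (T, 64, 64) image sequence for a phoneme sequence."""
    clips = []
    for p in phoneme_seq:
        frame = frame_bank[phoneme_to_viseme[p]]
        clips.extend([frame] * frames_per_phoneme)   # hold each viseme briefly
    return np.stack(clips)

video = assemble(["b", "aa", "uw", "m"])
print(video.shape)    # (12, 64, 64)
```

Real systems additionally blend or morph between the stored frames so that the mouth does not jump discontinuously from one viseme to the next.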
Article
This paper describes the representation, animation and data collection techniques that have been used to produce "realistic" computer generated half-tone animated sequences of a human face changing expression. It was determined that approximating the surface of a face with a polygonal skin containing approximately 250 polygons defined by about 400 vertices is sufficient to achieve a realistic face. Animation was accomplished using a cosine interpolation scheme to fill in the intermediate frames between expressions. This approach is good enough to produce realistic facial motion. The three-dimensional data used to describe the expressions of the face was obtained photogrammetrically using pairs of photographs.
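The cosine interpolation scheme mentioned in this abstract eases each vertex smoothly from one key expression to the next. A minimal sketch, with made-up vertex data and the roughly 400-vertex face the abstract describes:

```python
import numpy as np

# Two facial expressions as 400 vertices of (x, y, z); placeholder data.
neutral = np.random.rand(400, 3)
smile = neutral + 0.05 * np.random.rand(400, 3)

def cosine_interpolate(a, b, t):
    """Ease from expression a (t=0) to expression b (t=1) along a cosine curve."""
    w = (1 - np.cos(np.pi * t)) / 2      # rises 0 -> 1 with zero slope at both ends
    return (1 - w) * a + w * b

# Fill in the intermediate frames between the two key expressions.
frames = [cosine_interpolate(neutral, smile, t) for t in np.linspace(0, 1, 10)]
print(len(frames), frames[0].shape)      # 10 frames of (400, 3) vertices
```

The cosine weighting starts and ends each transition slowly, which is what makes the motion between expressions look less mechanical than straight linear interpolation.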
Article
DECface is a system that facilitates the development of applications requiring a real-time lip-synchronized synthetic face. Based on the X Window System and the audio facilities of DECtalk and AF, DECface has been built with a simple interface protocol to support the development of face-related applications. This paper describes our approach to face synthesis, the face and audio protocol, and some sample code examples.
Conference Paper
As computer graphics technique rises to the challenge of rendering lifelike performers, more lifelike performance is required. The techniques used to animate robots, arthropods, and suits of armor have been extended to flexible surfaces of fur and flesh. Physical models of muscle and skin have been devised. But more complex databases and sophisticated physical modeling do not directly address the performance problem. The gestures and expressions of a human actor are not the solution to a dynamic system. This paper describes a means of acquiring the expressions of real faces, and applying them to computer-generated faces. Such an "electronic mask" offers a means for the traditional talents of actors to be flexibly incorporated in digital animations. Efforts in a similar spirit have resulted in servo-controlled "animatrons," high-technology puppets, and CG puppetry [1]. The manner in which the skills of actors and puppeteers as well as animators are accommodated in such systems may point the way for a more general incorporation of human nuance into our emerging computer media. The ensuing description is divided into two major subjects: the construction of a highly resolved human head model with photographic texture mapping, and the concept demonstration of a system to animate this model by tracking and applying the expressions of a human performer.
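At its simplest, the performance-driven idea sketched in that abstract amounts to measuring how a performer's tracked feature points move away from a rest pose and applying the same (suitably scaled) displacements to corresponding control points on the computer-generated face. The sketch below uses placeholder tracking data and an assumed performer-to-model scale factor.

```python
import numpy as np

# Placeholder tracked 2D feature points of a performer: rest pose and current frame.
performer_rest = np.random.rand(20, 2) * 100
performer_now = performer_rest + np.random.randn(20, 2)

# Corresponding control points on the computer-generated face (same ordering).
model_rest = np.random.rand(20, 2) * 100
scale = 1.5                                  # assumed performer-to-model size ratio

# Performance-driven update: reuse the performer's displacements on the model.
displacement = performer_now - performer_rest
model_now = model_rest + scale * displacement
print(model_now.shape)                       # updated control points, (20, 2)
```

The displaced control points would then drive the deformation of the textured head model, frame by frame, so that the actor's expression is carried over to the synthetic face.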