One Shot Audio to Animated Video Generation
Neeraj Kumar
Hike Private Limited
neerajku@hike.in
Srishti Goel
Hike Private Limited
srishtig@hike.in
Ankur Narang
Hike Private Limited
ankur@hike.in
Brejesh Lall
IIT Delhi
brejesh@ee.iitd.ac.in
Mujtaba Hasan
Hike Private Limited
mujtaba@hike.in
Pranshu Agarwal
Hike Private Limited
pranshu@hike.in
Dipankar Sarkar
Hike Private Limited
dipankars@hike.in
Abstract
We consider the challenging problem of audio to animated video generation. We
propose a novel method OneShotAu2AV to generate an animated video of arbitrary
length using an audio clip and a single unseen image of a person as an input.
The proposed method consists of two stages. In the first stage, OneShotAu2AV
generates the talking-head video in the human domain given an audio clip and a
person's image. In the second stage, the talking-head video from the human domain is
converted to the animated domain. The model architecture of the first stage consists
of spatially adaptive normalization based multi-level generator and multiple multi-
level discriminators along with multiple adversarial and non-adversarial losses.
The second stage leverages attention based normalization driven GAN architecture
along with temporal predictor based recycle loss and blink loss coupled with lip-
sync loss, for unsupervised generation of animated video. In our approach, the
input audio clip is not restricted to any specific language, which gives the method
multilingual applicability. OneShotAu2AV can generate animated videos that have:
(a) lip movements that are in sync with the audio, (b) natural facial expressions such
as blinks and eyebrow movements, and (c) head movements. Experimental evaluation
demonstrates superior performance of OneShotAu2AV as compared to U-GAT-IT
and RecycleGan on multiple quantitative metrics including KID (Kernel Inception
Distance), word error rate, and blinks/sec.
1 Introduction
Audio to Video generation has numerous applications across industry verticals including film making,
multi-media, marketing, education and others. In the film industry, it can help by automatically
generating video from voice acting and by reconstructing occluded parts of the face. Additionally, it can help in
limited bandwidth visual communication by using audio to auto-generate the entire visual content
or by filling in dropped frames. High-quality video generation with expressive facial movements
is a challenging problem that involves complex learning steps. Most of the work in this field has
been centered towards the mapping of audio features (MFCCs, phonemes) to visual features (Facial
landmarks, visemes etc.) (Aneja and Li, 2019; Tian et al., 2019; Cappelletta and Harte, 2012; Lee
and Yook, 2002). Further computer graphics techniques select frames of a specific person from the
database to generate expressive faces. Few techniques which attempt to generate the video using raw
audio focus for the reconstruction of the mouth area only (Chung et al., 2017). Due to a complete
focus on lip-syncing, the aim of capturing human expression is ignored. Further, such methods lack
smooth transitions between frames, which makes the final video look unnatural. Regardless of
which approach is used, the methods described above are either subject dependent (Suwajanakorn
et al., 2017; Kim and Ganapath, 2016), generate unnatural videos (Wiles et al., 2018) due to lack of
smooth transitions, or require high compute time to generate video for a new unseen speaker image
while ensuring high-quality output (Zakharov et al., 2019). Recent unsupervised approaches such as
RecycleGAN (Bansal et al., 2018) and U-GAT-IT (Kim et al., 2019) either generate low-quality videos
or have low expressiveness due to the lack of eye blinks, eyebrow movement and head movement. We propose
a novel approach based on two stages that converts an audio clip and a single image of a person to
an animated video. The first stage generates a speaker-independent and language-independent high-
quality natural-looking talking head video from a single unseen image and an audio clip. It captures
the word embeddings from the audio clip using a pretrained deepspeech2 model(Amodei et al., 2015)
trained on Librispeech corpus(Panayotov et al., 2015). These embeddings and the image are then
fed to the multi-level generator network which is based on the Spatially-Adaptive Normalization
architecture (Park et al., 2019). Multiple multi-level discriminators (Wang et al., 2018) are used to
ensure synchronized and realistic human video generation. A multi-level temporal discriminator is
modeled which ensures temporal smoothening along with spatial consistency. Finally, to ensure lip
synchronization we use SyncNet architecture (Assael et al., 2017) based discriminator applied to the
lower half of the image. To make the generator input-time independent, a sliding window approach is
used. Since, the generator needs to finally learn to generate multiple facial component movements
along with high video quality, multiple loss functions both adversarial and non-adversarial are used in
a curriculum learning fashion. For fast, low-cost adaptation to an unseen image, a few output update
epochs suffice to provide one-shot learning capability to our approach. The second stage couples an
attention-based normalization driven GAN architecture with temporal predictor based recycle loss
and blink loss and lip-sync loss to generate high quality animated video from human video obtained
from the first stage.
Specifically, we make the following contributions:
(a) We present a novel approach, OneShotAu2AV, that uses two independently trained stages to
convert audio and a single image into an animated video.
(b) The first stage takes audio and a single image of a person as input and leverages curriculum
learning to simultaneously learn movements of expressive facial components and generate a high-quality
talking-head video of the given person. The stage feeds the features generated from the audio input
directly into a generative adversarial network, and it adapts to any given unseen selfie by applying
one-shot learning with only a few output update epochs.
(c) The second stage leverages an attention-based normalization driven GAN architecture along with
temporal predictor based recycle loss and blink loss coupled with lip-sync loss, for unsupervised gen-
eration of animated video that demonstrates eye blinks, eyebrow movements and lip-synchronization
with audio.
(d) Experimental evaluation demonstrates superior performance of OneShotAu2AV as compared
to U-GAT-IT and RecycleGan on multiple quantitative metrics including KID (Kernel Inception
Distance), word error rate, and blinks/sec.
2 Related Work
A lot of work has been done in synthesizing realistic videos in the human domain or animated domain
from an audio clip and an image as an input. Speech is a combination of content and expression and
there is a perceptual variability of speech that exists in the form of various languages, dialects, and
accents.
Audio to Human Domain Video Generation
The earliest methods for generating videos relied
on Hidden Markov Models which captured the dynamics of audio and video sequences. Simons
and Cox (Simons and Cox, 1990) used the Viterbi algorithm to calculate the most likely sequence of
mouth shapes given particular utterances. Such methods are not capable of generating high-quality
videos and lack emotions.
CNN based models have been used to generate realistic videos given audio and single image as an
input. Audio2Face (Tian et al., 2019) model uses the CNN method to generate an image from audio
signals. Speech2Vid (Chung et al., 2017) uses an encoder-decoder based approach for generating
realistic videos. Other approaches, such as Synthesizing Obama (Suwajanakorn et al., 2017), learn lip
sync from audio but are trained for a single subject. LumiereNet (Kim and Ganapath, 2016)
uses LSTM, DensePose (Guler et al., 2018) and Pix2Pix (Isola et al., 2017) for generating videos.
These, however, have limitations either in terms of expressions such as lip-sync, eye-blink and emotions,
or they are specifically trained for a single image and are not generalizable. We propose a spatially
adaptive generator along with multiple discriminators which generates high-quality, lip-synchronized
video with expressions such as eye-blink, etc.
(Zakharov et al., 2019) uses meta-learning to create videos of unseen images. Few shot Video to
Video Synthesis (Wang et al., 2019) is able to generate videos on unseen images given a video as
an input by using a network weight generation module for extracting the pattern. Such a method
is computationally expensive compared to our approach which is a one-shot approach for video
generation. Realistic Speech-Driven Facial Animation with GANs (RSDGAN) (Vougioukas et al.,
2004) uses a GAN based approach to produce quality videos. They have used identity encoder,
context encoder and frame decoder to generate images and used various discriminators to take care of
different aspects of video generation. They have used frame discriminator to distinguish real and fake
images, sequence discriminator to distinguish real and fake videos, and synchronization discriminator
for better lip synchronization in videos. We introduce spatially adaptive normalization along with a
one-shot approach and implement curriculum learning to produce better results. This is explained
in Sections 3 and 4.
Video to Animated Video generation
Initially, phonemes and visemes based methods were used
to create the stylized characters. (Aneja and Li, 2019) has used an LSTM based approach to generate
live lip synchronization on a 2D animated character. Some of these methods target rigged 3D
characters or meshes with predefined mouth blend shapes that correspond to speech sounds (Stef
et al., 2018; Karras et al., 2017; Taylor et al., 2012; Edwards et al., 2016; Mattheyses and Verhelst,
2014; Suwajanakorn et al., 2017), while others generate 2D motion trajectories that can be used to
deform facial images to produce continuous mouth motions (Brand, 1999; Cao et al., 2005). These
methods are primarily focused on mouth motions only and do not show emotions such as eye-blink,
eye-brow movements, etc.
Several works have been done on expression and facial action units classification and mapping it to
the animated version of a person such as (Leone et al., 2012; Santos Pérez et al., 2011; Gilbert
et al., 2018; Poggi et al., 2005). They cover a finite space in terms of expression, movements and are
not personalized to specific people. The proposed model is able to capture these various aspects such
as facial expressions, lip-syncing, eye-blinks etc., due to attention-based generator and discriminator.
(Aneja et al., 2019) uses facial action coding system (Dailey et al., 2002), ensures lip syncing using
phoneme classifier (Huggins Daines et al., 2006), expression control using (Kring and Sloan, 2007)
and bone control units to create an avatar. They use the Unreal Engine (Games, 2007) for generating
the avatar. They use a classification-based approach to create an avatar which covers the finite space
in terms of facial details in a video. On the other hand, the proposed method uses an unsupervised
generative approach to create animated videos with various facial details.
The recent introduction of GAN (Ian J. Goodfellow, 2014) has shifted the focus of the machine
learning community towards generative models. Several works have been done in the image to image
translation as well as video to video translation. Techniques such as Pix2Pix (Isola et al., 2017),
Pix2PixHD (Wang et al., 2017), SPADE (Park et al., 2019) work in image to image translation, but
require a paired form of training. CycleGan (Creswell et al., 2017), which uses a cycle consistency loss,
deals with the unpaired form of training but lacks the ability to preserve temporal information while generating
animated videos.
For Video to Video style transfer, RecycleGan (Bansal et al., 2018) uses unpaired but ordered streams
of data for both domains. This method uses recycle loss apart from adversarial and cycle loss to
handle temporal information. Due to the Unet (Ronneberger et al., 2015) generator and lack of
attention-based architecture, it is not able to generate high-quality animated video. The proposed
method uses adaptive layer and instance normalization (AdaLin) and attention-based networks along
with a temporal discriminator, which yield superior quality animated videos compared to RecycleGan.
U-GAT-IT (Kim et al., 2019) uses AdaLin and attention maps for translating an image from one
domain to another. However, this architecture is not able to capture the temporal information and
lacks lip synchronization and expressions such as eye-blink, eyebrow movement as well as head
movement. We leverage an AdaLin and attention map based architecture along with a temporal
predictor using recycle loss, blink loss and lip-sync loss for better expression capture in animated
videos (refer to Section 3.2) in an unsupervised fashion. OneShotAu2AV is able to synthesize a personalized
animated video from an audio clip and a single image of the person.
3 Architectural Design
OneShotAu2AV consists of 2 stages: Stage 1 to generate realistic human domain videos given an
audio and a single unseen image as input and Stage 2 to generate animated videos from the realistic
human-domain videos generated in Stage 1.
Figure 1: (a) Left side: Stage 1 of OneShotAu2AV with a generator and three discriminators for
generating human-domain video. (b) Right side: Stage 2 of OneShotAu2AV with a generator,
temporal predictor and a discriminator for generating a high quality animated video.
3.1 Stage 1
It consists of a single generator and 3 discriminators as shown in Figure 1(a).
3.1.1 Generator
Spatially Adaptive Generator:
The initial layers of the generator G use deepspeech2 (Amodei et al.,
2015) layers, followed by spatially-adaptive normalization similar to the SPADE architecture (Park
et al., 2019). Instead of using a semantic map as the input, we use the real image as the input to the
SPADE generator. This helps in minimizing the loss of information due to normalization.
Audio features using deepspeech2 model:
The MFCC coefficients of the audio signal are fed to the
deepspeech2 model to extract content-related information from the audio. The outputs of the initial few layers
are passed to the generator. This helps in achieving better lip synchronization in the video
and a lower word error rate.
3.1.2 Discriminator
We have used 3 discriminators namely a multi-scale frame discriminator, a multi-scale temporal
discriminator and a synchronization discriminator.
Multi-scale Frame Discriminator
A multi-scale discriminator (Wang et al., 2018), D, is used in the
proposed model to distinguish coarser and finer details between real and fake images. Adversarial
training with the discriminator helps in generating realistic frames. To generate high-resolution
frames, we need an architecture with a larger receptive field. A deeper network can cause
overfitting; to avoid that, multi-scale discriminators are used.
Multi-scale Temporal Discriminator
Every frame in a video is dependent on its previous frames.
To capture the temporal property along with a spatial one, we have used a multi-scale temporal
discriminator (Kim and Ganapath, 2016). This discriminator is modeled to ensure a smooth transition
between consecutive frames and achieve a natural-looking video sequence. The multi-scale temporal
discriminator is described as
L(T, G, D) = \sum_{i=t-L}^{t} \big[ \log(D(x_i)) + \log(1 - D(G(z_i))) \big]   (1)

where t is the current time instance of the audio and L is the length of the time interval over which the adversarial
loss is computed.
Synchronization Discriminator
To have coherent lip synchronization, the proposed model uses
SyncNet architecture proposed in Lip Sync in the Wild (Chung and Zisserman, 2016). The input
to the discriminator is an audio signal spanning a 200 ms interval (5 audio segments of 40 ms each) and 5
frames of the video. The lower half of each frame, resized to (224, 224, 3), is fed as input.
3.1.3 Losses
We have used an adversarial loss L_GAN, a temporal adversarial loss L(T, G, D), a feature-matching loss L_FM, a recon-
struction loss L_RL, a perceptual loss L_PL, a contrastive loss L_CL and a blink loss L_BL to generate high
quality output. The objective function is given below; a detailed explanation is given in the supplementary
material.
\min_G \Big( \max_{D_1, D_2, D_3} \sum_{k=1,2,3} L_{GAN}(G, D_k) + \max_{D_1, D_2, D_3} \sum_{k=1,2,3} L(T, G, D_k)
+ \lambda_{FM} \sum_{k=1,2,3} L_{FM}(G, D_k) + \lambda_{PL} L_{PL} + \lambda_{CL} L_{CL} + \lambda_{BL} L_{BL} \Big)

where \lambda_{FM}, \lambda_{PL}, \lambda_{CL}, \lambda_{BL} are the hyperparameters that control the importance of the various loss
functions in the above objective function.
Curriculum Learning
We have trained the model in multiple phases so that it can produce better
results. In the first phase we have used a multi-scale frame discriminator and applied the adversarial
loss, feature matching loss and perceptual loss to learn the higher-level features of the image. When
these losses stabilize, we move to the second phase in which we have added a multi-scale temporal
discriminator and synchronization discriminator and used reconstruction loss, Contrastive loss and
temporal adversarial loss to get a better quality image near mouth region and coherent lip synchronized
high-quality videos. After the stabilization of the above losses, we have added blink loss in the third
phase to generate a more realistic image capturing emotions such as eye movement and eye blinks.
Few shot learning
To achieve sharper and better image quality for an unseen subject,
we use a one-shot approach based on the perceptual loss at inference time. Our approach is
computationally less expensive compared to (Zakharov et al., 2019; Wang et al., 2019), which we
described in Section 2, and because of the spatially adaptive nature of the generator architecture, we
are able to achieve high-quality video. We run the model for 5 update epochs at inference time to get
high-quality video frames.
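To make this adaptation concrete, the following is a minimal sketch of the inference-time update, assuming a pretrained Stage 1 generator that maps (audio features, reference image) to a batch of frames; the ImageNet-pretrained VGG19 features, the chosen layers and the learning rate are illustrative assumptions and stand in for the perceptual loss used in the paper.

import torch
import torch.nn.functional as F
from torchvision import models

def one_shot_adapt(generator, audio_feats, ref_image, epochs=5, lr=1e-4):
    """Hypothetical sketch: adapt a pretrained generator to one unseen image
    by minimizing a VGG-based perceptual loss for a few update epochs.
    ref_image: (1, 3, H, W); generator output assumed to be (T, 3, H, W)."""
    vgg = models.vgg19(weights="IMAGENET1K_V1").features[:16].eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(epochs):
        frames = generator(audio_feats, ref_image)     # assumed interface
        target = ref_image.expand_as(frames)           # broadcast the single reference image
        loss = F.l1_loss(vgg(frames), vgg(target))     # perceptual (feature-space) distance
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator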
3.2 Stage 2
Stage 2, as shown in Figure 1(b), consists of a generator, a temporal predictor and a discriminator
to generate high-quality animated videos.
3.2.1 Generator
The generator G_{s->t} consists of an encoder E_s and an attention-based normalization driven decoder
D_s. The attention-based adaptive instance and layer normalization (AdaLin) is inspired by the Class
Activation Map (CAM) (Zhou et al., 2015), which is trained to learn weights of the feature maps of the
source domain using global average pooling and global max pooling. This helps the generator
to focus on the source image regions that are more discriminative from the target domain, such as
eyes and mouth. AdaLin adjusts the ratio of IN and LN in the decoder according to the source and
target domain distributions so that the output has the features of the source domain as well as the style of the target
domain.
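For illustration, a minimal sketch of an AdaLin layer in the spirit of U-GAT-IT (Kim et al., 2019): instance and layer statistics are mixed by a learnable per-channel ratio rho, while gamma and beta are supplied externally (e.g., predicted from the CAM-attended features). Tensor shapes and the rho initialization are assumptions for illustration.

import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Adaptive layer-instance normalization: mixes IN and LN statistics with a
    learnable ratio rho; gamma/beta are provided externally."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        # rho in [0, 1] controls the IN vs. LN mix, one value per channel
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance-norm statistics: per sample, per channel
        in_mean = x.mean(dim=[2, 3], keepdim=True)
        in_var = x.var(dim=[2, 3], keepdim=True, unbiased=False)
        # Layer-norm statistics: per sample, over all channels and pixels
        ln_mean = x.mean(dim=[1, 2, 3], keepdim=True)
        ln_var = x.var(dim=[1, 2, 3], keepdim=True, unbiased=False)

        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)
        rho = self.rho.clamp(0.0, 1.0)
        out = rho * x_in + (1.0 - rho) * x_ln
        # gamma, beta of shape (N, C), predicted from the attention features
        return out * gamma.unsqueeze(2).unsqueeze(3) + beta.unsqueeze(2).unsqueeze(3)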
3.2.2 Discriminator
The discriminator D_t consists of an encoder E_{D_t} and an auxiliary classifier n_{D_t}. The discriminator
concentrates its attention on determining whether the target image is real or fake using local
and global attention maps, which helps the generator capture the global structure (e.g., the face area
and the region around the eyes) as well as the local regions.
3.2.3 Temporal Predictor
We use unpaired data but have ordered streams of frames, (x_1, ..., x_n) and (y_1, ..., y_n), for the source
and target domains. To learn a better mapping from the source to the target domain, we focus on learning the
temporal information. We introduce a temporal predictor P_x, whose architecture is the same as
UNet (Ronneberger et al., 2015), which predicts the future frame given past frames as input. This
is trained with an L2 loss.
3.2.4 Losses
The adversarial loss L_GAN(G, D), identity loss and CAM loss L_cam are used for the
domain transfer, and the recycle loss L_recycle, lip-sync loss L_lip and blink loss L_BL are used to extract the spatial
and temporal information from a video, which helps in generating high-quality expressive animated
videos. A detailed explanation is given in the supplementary material. The objective function is given below:

\min_G \Big( \max_D L_{GAN}(G, D) + \lambda_{cam} L_{cam}^{D_t} + \lambda_{recycle} L_{recycle}(G_x, G_y, P_y)
+ \lambda_{identity} L_{identity} + \lambda_{cam} L_{cam}^{s->t} + \lambda_{lip} L_{lip} + \lambda_{BL} L_{BL} \Big)

where \lambda_{cam}, \lambda_{recycle}, \lambda_{identity}, \lambda_{lip}, \lambda_{BL} are the hyperparameters used to control the importance of the
various loss functions in the above objective function.
4 Experiments and Results
4.1 Datasets & Training
Our model is implemented in PyTorch and takes approximately 4 days to train on 4 Nvidia V100 GPUs.
Around 5000 and 1200 videos of the GRID dataset are used for training and testing
purposes. We have taken 3000 and 600 videos of the LOMBARD GRID and CREMA-D datasets for
training and testing purposes respectively. The frames are extracted at 25 fps. We have taken 16 kHz as the
sampling frequency for the audio signals and used 13 MFCC coefficients for 0.2 s of overlapping audio
for experimentation.
We have used the GRID dataset (Cooke, 2006), LOMBARD GRID (Najwa Alghamdi and Brown,
2018) and CREMA-D (Cao et al., 2014) for the experimentation and evaluation of different metrics
for stage 1.
We have used the GRID dataset (Cooke, 2006) and hikemoji animated videos for experimentation
and evaluation of different metrics. We have used around 100 videos for training and 30 videos for
testing purposes in both the domains respectively for Stage 2.
4.2 Metrics
To quantify the quality of the final generated video, we use the following metrics: PSNR (Peak Signal
to Noise Ratio), SSIM (Structural Similarity Index), CPBD (Cumulative Probability Blur Detection),
ACD (Average Content Distance) and KID (Kernel Inception Distance). KID (Bińkowski et al.,
2018) computes the squared Maximum Mean Discrepancy between the feature representations of real
and generated images. PSNR, SSIM, and CPBD measure the quality of the generated image in terms
of the presence of noise, perceptual degradation, and blurriness respectively. ACD (Tulyakov et al.,
2018) is used for the identification of the speaker from the generated frames by using OpenPose (Cao
et al., 2018). Along with image quality metrics, we also calculate WER(Word Error Rate) using
pretrained LipNet architecture (Assael et al., 2017) and Blinks/sec using (Soukupova and Cech,
2016) to evaluate our performance of speech recognition and eye-blink reconstruction respectively.
4.3 Qualitative Results
OneShotAu2AV produces natural-looking high-quality animated videos of an unseen input image
and audio signals. The videos are lip-synchronized to the sentences provided to them and
also exhibit natural expressions such as head movements, eye-blinks and eyebrow movements. Videos were
generated for different languages, showing that the proposed method is language independent and
can generate videos for any linguistic community.
Figure 2 and Figure 5 display different examples of generated lip-synchronized video for male
and female test cases for the human and animated domains. As observed, the opening and closing of
the mouth is in sync with the audio signal. Figure 5 also displays a slight head movement of the
animated person (between frames 3 and 4). Figure 3 and Figure 6 display eye-blinks and facial
expressions such as frowns in the videos of both domains. Figure 4 displays different examples of
generated lip-synchronized video for people uttering Hindi and Bengali words,
such as 'Modi' and 'aache' respectively. Figure 7 displays the same for the animated domain.
Figure 2: Left side: Female uttering the word "now"; Right side: Male uttering the word "bin"
Figure 3: Left side: Blinking of eyes of the person while speaking; Right side: Man with facial
frowns.
Figure 4: Left side: Generated output for a Hindi audio clip ("Modi"); Right side: Generated output
for a Bengali audio clip ("aache").
Figure 5: Left side: Animated output speaking ’now’; Right side: Head movement of female anime.
Figure 6: Left side: Animated output with eye blinks; Right side: Eyebrow movements of male while
speaking.
Figure 7: Left side: Anime speaking the Hindi word ’modi’; Right side: Anime speaking the Bengali
word ’aache’.
4.4 Quantitative Results
Stage 1
The proposed model performs better on image reconstruction metrics including PSNR
and SSIM for both GRID and CREMA-D datasets as compared to Realistic Speech-Driven Facial
Table 1: Comparison of OneShotAu2AV with U-GAT-IT and RecycleGAN: generated frames for a subject speaking 'now', shown for the input frames, OneShotAu2AV, U-GAT-IT and RecycleGAN (image rows omitted).
Animation with GANs(RSDGAN) (Vougioukas et al., 2004) and Speech2Vid (Chung et al., 2017)
as shown in Table 2. The table also displays the performance of OneShotAu2AV trained on the
LOMBARD GRID dataset (Najwa Alghamdi and Brown, 2018). The improved performance of
the proposed method is achieved with the use of spatially adaptive normalization in the generator
architecture along with training of the proposed model in curriculum learning fashion with appropriate
adversarial and non-adversarial losses.
Method SSIM PSNR CPBD WER ACD-C ACD-E
OneShotAu2AV(GRID) 0.881 28.571 0.262 27.5 0.005 0.09
RSDGAN(GRID) 0.818 27.100 0.268 23.1 - 1.47x10-4
Speech2Vid(GRID) 0.720 22.662 0.255 58.2 0.007 1.48x10-4
OneShotAu2AV(CREMA-D) 0.773 24.057 0.184 NA 0.006 0.96
RSDGAN(CREMA-D) 0.700 23.565 0.216 NA - 1.40x10-4
Speech2Vid(CREMA-D) 0.700 22.190 0.217 NA 0.008 1.73x10-4
OneShotAu2AV(lombard) 0.922 28.978 0.453 26.1 0.002 0.064
Speech2Vid(lombard) 0.782 26.784 0.406 53.1 0.004 0.069
Table 2: Comparison of OneShotAu2AV with RSDGAN and Speech2Vid on the GRID, LOMBARD
GRID and CREMA-D datasets for SSIM, PSNR, CPBD, WER and ACD, where ACD is computed as the cosine
distance (ACD-C, should be 0.02 and below) and the Euclidean distance (ACD-E, should be 0.2 and
below).
Stage 2:
The proposed model shows better results in terms of the image translation metric: its KID
is 2x and 8x better than U-GAT-IT and RecycleGan respectively, as displayed in Table 3. This is
achieved by adding the temporal predictor based recycle loss and the lip-sync loss to the networks, which
help in the reconstruction of high-quality animated output. OneShotAu2AV also performs better on the lip
synchronization metric (WER), which is 2x and 8x better than U-GAT-IT and RecycleGan respectively.
Method KID×100±std.×100 WER Blink/sec
OneShotAu2AV 5.02±0.03 31.97 0.546
U-GAT-IT 10.37±0.17 68.85 0.046
RecycleGan 42.54±0.68 240.51 0.0
Table 3: Comparison of OneShotAu2AV with U-GAT-IT and RecycleGAN for KID, WER and
Blink/sec.
4.5 Ablation Study
We conducted detailed ablation studies. In Stage 1, the addition of the contrastive loss and the multi-scale
temporal adversarial loss leads to improvement of the SSIM (0.867 to 0.873), PSNR (27.996 to
28.327) and CPBD (0.213 to 0.259) scores when measured on the GRID dataset. Adding the blink loss leads
to further improvement in the SSIM (0.873 to 0.881), PSNR (28.327 to 28.571) and CPBD (0.259 to 0.262)
scores. Similar improvements are observed on the LOMBARD GRID dataset as well.
In Stage 2, the addition of predictive model based recycle loss and lip-sync loss helped in the
improvement of KID (10.37 to 5.02) and WER (68.85 to 31.97). For further details, kindly refer to
the supplementary material.
Psychophysical assessment:
For video attachments and Psychophysical assessment(results of
Turing test and user ratings), kindly refer to the supplementary material.
5 Conclusion and Future Work
In this paper, we have presented a novel approach, OneShotAu2AV, to convert an audio and single
image of a person to an animated video. Using two stages in our multi-level generators and dis-
criminators based architecture and appropriate adversarial and non-adversarial losses, we are able
to achieve synced lip movements, blinks, and eye-brow movements in the output. Experimental
evaluation demonstrates superior performance of OneShotAu2AV as compared to U-GAT-IT and
RecycleGan on multiple quantitative metrics including KID (Kernel Inception Distance), word error
rate, and blinks/sec. In the future, we will look at techniques to further enhance the expressiveness of the
generated animated videos.
References
Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl
Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen,
Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen,
and Zhenyao Zhu. Deep speech 2: End-to-end speech recognition in english and mandarin. 12
2015.
Deepali Aneja and Wilmot Li. Real-time lip sync for live 2d animation. 2019.
Deepali Aneja, Daniel McDuff, and Shital Shah. A high-fidelity open embodied avatar with lip
syncing and expression capabilities. pages 69–73, 10 2019. ISBN 978-1-4503-6860-5. doi:
10.1145/3340555.3353744.
Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. Lipnet: End-to-end
sentence-level lipreading. GPU Technology Conference, 2017. URL
https://github.com/
Fengdalu/LipNet-PyTorch.
Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. Recycle-gan: Unsupervised video
retargeting. In ECCV, 2018.
Mikołaj Bińkowski, DJ Sutherland, M Arbel, and A Gretton. Demystifying mmd gans. 01 2018.
Matthew Brand. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics
and interactive techniques. ACM Press/Addison-Wesley Publishing Co, pages 21–28, 1999.
Houwei Cao, David Cooper, Michael Keutmann, Ruben Gur, Ani Nenkova, and Ragini Verma.
Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective
computing, 5:377–390, 10 2014. doi: 10.1109/TAFFC.2014.2336244.
Yong Cao, Wen Tien, Petros Faloutsos, and Frederic Pighin. Expressive speech-driven facial
animation. ACM Trans. Graph., 24:1283–1302, 10 2005. doi: 10.1145/1095878.1095881.
Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: realtime
multi-person 2D pose estimation using Part Affinity Fields. In arXiv preprint arXiv:1812.08008,
2018.
Luca Cappelletta and Naomi Harte. Phoneme-to-viseme mapping for visual speech recognition.
Proceedings of the International Conference on Pattern Recognition Applications and Methods
(ICPRAM 2012), 2, 05 2012.
J. S. Chung and A. Zisserman. Out of time: automated lip sync in the wild. In Workshop on Multi-view
Lip-reading, ACCV, 2016.
Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. You said that? In British Machine Vision
Conference, 2017.
M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and auto-
matic speech recognition. The Journal of the Acoustical Society of America, 120(5):2421–2424,
2006.
Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil
Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35,
10 2017. doi: 10.1109/MSP.2017.2765202.
Matthew Dailey, Michael Lyons, Miyuki Kamachi, H Ishi, Jiro Gyoba, and Garrison Cottrell. Cultural
differences in facial expression classification. Proc. Cognitive Neuroscience Society, 9th Annual
Meeting, San Francisco CA, page 153, 06 2002.
Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. Jali: an animator-centric viseme
model for expressive lip synchronization. ACM Transactions on Graphics, 35:1–11, 07 2016. doi:
10.1145/2897824.2925984.
Epic Games. Unreal engine , online: https://www. unrealengine. com. 2007.
Michaël Gilbert, Samuel Demarchi, and Isabel Urdapilleta. Facshuman a software to create
experimental material by modeling 3d facial expression. pages 333–334, 11 2018. doi: 10.1145/
3267851.3267865.
Riza Guler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in
the wild. pages 7297–7306, 06 2018. doi: 10.1109/CVPR.2018.00762.
David Huggins Daines, M. Kumar, A. Chan, A.W. Black, M. Ravishankar, and Alexander Rudnicky.
Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices.
volume 1, pages I – I, 06 2006. doi: 10.1109/ICASSP.2006.1659988.
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial nets, 2014.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei Efros. Image-to-image translation with
conditional adversarial networks. pages 5967–5976, 07 2017. doi: 10.1109/CVPR.2017.632.
Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial
animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics, 36:
1–12, 07 2017. doi: 10.1145/3072959.3073658.
Byung-Hak Kim and Varun Ganapath. Lumièrenet: Lecture video synthesis from audio. 2016.
Junho Kim, Minjae Kim, Hyeon-Woo Kang, and Kwanghee Lee. U-gat-it: Unsupervised generative
attentional networks with adaptive layer-instance normalization for image-to-image translation, 07
2019.
Ann Kring and Denise Sloan. The facial expression coding system (faces): Development, validation,
and utility. Psychological assessment, 19:210–24, 06 2007. doi: 10.1037/1040-3590.19.2.210.
Soonkyu Lee and Dongsuk Yook. Audio-to-visual conversion using hidden markov models. pages
563–570, 08 2002. doi: 10.1007/3-540-45683-X_60.
Giuseppe Riccardo Leone, Giulio Paci, and Piero Cosi. Lucia: An open source 3d expressive avatar
for multimodal h.m.i. volume 78, pages 193–202, 01 2012. doi: 10.1007/978-3-642-30214-5_21.
Wesley Mattheyses and Werner Verhelst. Audiovisual speech synthesis: An overview of the state-of-
the-art. Speech Communication, 66, 11 2014. doi: 10.1016/j.specom.2014.11.001.
Najwa Alghamdi, Steve Maddock, Ricard Marxer, Jon Barker, and Guy J. Brown. A corpus of audio-
visual lombard speech with frontal and profile views. The Journal of the Acoustical Society of America,
143, EL523 (2018); https://doi.org/10.1121/1.5042758, 2018.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus
based on public domain audio books. pages 5206–5210, 04 2015. doi: 10.1109/ICASSP.2015.
7178964.
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with
spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2019.
Isabella Poggi, Catherine Pelachaud, F. Rosis, Valeria Carofiglio, and Berardina Carolis. Greta. A
Believable Embodied Conversational Agent, pages 3–25. 01 2005. doi: 10.1007/1-4020-3051-7_1.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. volume 9351, pages 234–241, 10 2015. ISBN 978-3-319-24573-7. doi:
10.1007/978-3-319-24574-4_28.
Marcos Santos-Pérez, Eva González-Parada, and Jose Manuel Cano-Garcia. Avatar: An open
source architecture for embodied conversational agents in smart environments. volume 6693, pages
109–115, 06 2011. doi: 10.1007/978-3-642-21303-8_15.
A. Simons and Stephen Cox. Generation of mouthshapes for a synthetic talking head. Proceedings of
the Institute of Acoustics, Autumn Meeting, 01 1990.
Tereza Soukupova and Jan Cech. Real-time eye blink detection using facial landmarks, 2016.
Andreea Stef, Kaveen Perera, Hubert Shum, and Edmond Ho. Synthesizing expressive facial and
speech animation by text-to-ipa translation with emotion control. pages 1–8, 12 2018. doi:
10.1109/SKIMA.2018.8631536.
Supasorn Suwajanakorn, Steven Seitz, and Ira Kemelmacher. Synthesizing obama: learning lip sync
from audio. ACM Transactions on Graphics, 36:1–13, 07 2017. doi: 10.1145/3072959.3073640.
Sarah Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. Dynamic units of visual
speech. pages 275–284, 07 2012.
Guanzhong Tian, Yi Yuan, and Yong Liu. Audio2face: Generating speech/face animation from single
audio with attention-based bidirectional lstm networks. 05 2019.
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion
and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 1526–1535, 2018.
Konstantinos Vougioukas, Stavros Petridi, and Maja Pantic. End-to-end speech-driven facial anima-
tion with temporal gans. Journal of Foo, 14(1):234–778, 2004.
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-
resolution image synthesis and semantic manipulation with conditional gans. 11 2017.
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-
resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. Few-shot
video-to-video synthesis. In Conference on Neural Information Processing Systems (NeurIPS),
2019.
O. Wiles, A.S. Koepke, and A. Zisserman. X2face: A network for controlling face generation by
using images, audio, and pose codes. In European Conference on Computer Vision, 2018.
Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial
learning of realistic neural talking head models, 05 2019.
Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep
features for discriminative localization. 12 2015.
One Shot Audio to Animated Video Generation -
Supplementary Material
1 Architectural Design
1.1 Stage 1
This stage converts an audio clip and an unseen person's image into a human-domain video.
1.1.1 Audio Pre processing
An audio input of 200 ms is given along with the image to produce a single frame of the video. Each
audio window overlaps with the previous one by 0.16 s.
Every audio window is centered around a single video frame. To do that, zero padding is applied before
and after the audio signal, and the stride is computed with the following formula:

stride = audio sampling rate / video frames per sec

The MFCC values of the audio segment are fed into the deepspeech2 model to extract the content-related
features, which then go to the Stage 1 generator. The pipeline is shown in Figure 1.
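A small sketch of this windowing, assuming the 16 kHz audio, 25 fps video and 13 MFCC coefficients stated in the main paper; the librosa call and the exact padding choice are illustrative assumptions.

import numpy as np
import librosa

def audio_windows(wav, sr=16000, fps=25, win_sec=0.2, n_mfcc=13):
    """Slice audio into 200 ms windows, one per video frame, hopping by
    stride = sr / fps samples (so consecutive windows overlap by 0.16 s)."""
    stride = int(sr / fps)                       # 640 samples at 16 kHz / 25 fps
    win = int(win_sec * sr)                      # 3200 samples per 200 ms window
    pad = (win - stride) // 2                    # zero-pad so windows are centered on frames
    wav = np.pad(wav, (pad, pad))
    windows = []
    for start in range(0, len(wav) - win + 1, stride):
        chunk = wav[start:start + win]
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc)
        windows.append(mfcc)
    return np.stack(windows)                     # (num_frames, n_mfcc, time_steps)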
Figure 1: Audio signals are fed into deep speech2 architecture to extract content embeddings
1.1.2 Synchronization Discriminator
The architecture of the discriminator is given in Figure 2(a). Its input is an audio signal
spanning a 200 ms interval (5 audio segments of 40 ms each) and 5 frames of the video. The lower half of
each frame, resized to (224, 224, 3), is fed as input. The resulting loss is fed back to the generator of our
model to learn coherent lip synchronization.
1.1.3 Multi-scale discriminator (Wang et al., 2018)
The multi-scale discriminator is used as both the multi-scale frame discriminator and the multi-scale temporal dis-
criminator. It consists of 3 discriminators that have an identical network structure but operate
at different image scales. These discriminators are referred to as D1, D2, and D3. Specifically, we
downsample the real and synthesized high-resolution images by factors of 2 and 4 to create an image
pyramid of 3 scales. The discriminators D1, D2, and D3 are then trained to differentiate real and
synthesized images at the 3 different scales, respectively. The discriminators operate from coarse to
fine level and help the generator to produce high-quality images.

Figure 2: (a) SyncNet architecture for better lip synchronization, which is trained on the GRID dataset
with a contrastive loss and whose loss is then used in our proposed architecture. (b) Description of the 6 eye
points; p_i denotes the eye landmark points.
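As an illustration of the multi-scale scheme described above, a minimal wrapper in the spirit of pix2pixHD (Wang et al., 2018); the base discriminator constructor is an assumed placeholder.

import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminator(nn.Module):
    """Runs three copies of the same discriminator (D1, D2, D3) on an
    image pyramid built by downsampling the input by factors of 1, 2 and 4."""
    def __init__(self, make_discriminator, num_scales=3):
        super().__init__()
        # make_discriminator() is a placeholder returning one PatchGAN-style discriminator
        self.discriminators = nn.ModuleList([make_discriminator() for _ in range(num_scales)])

    def forward(self, x):
        outputs = []
        for i, disc in enumerate(self.discriminators):
            if i > 0:
                # Halve the resolution for each coarser scale
                x = F.avg_pool2d(x, kernel_size=3, stride=2, padding=1, count_include_pad=False)
            outputs.append(disc(x))
        return outputs  # list of per-scale real/fake predictions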
1.1.4 Losses
OneShotAu2AV is trained with different losses to generate realistic videos as explained below.
Adversarial Loss
The adversarial loss is used in an adversarial training setup to
ensure the generation of high-quality images for the video. The loss is defined as:

L_{GAN}(G, D) = \mathbb{E}_{x \sim P_d}[\log(D(x))] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))]

where G tries to minimize this objective against an adversarial D that tries to maximize it, z is drawn from the
source (image) distribution and x from the distribution of real video frames.
Reconstruction loss
Reconstruction loss (Li et al., 2018) is used on the lower half of the image to
improve the reconstruction in the mouth area. An L1 loss is used for this purpose, as described below:

L_{RL} = \sum_{n \in [0, W] \times [H/2, H]} |R_n - G_n|

where R_n and G_n are pixels of the real and generated frames respectively, and W and H represent the width
and height of an image.
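A minimal sketch of this lower-half L1 term, assuming image tensors of shape (N, C, H, W) with the mouth region in the bottom half.

import torch.nn.functional as F

def lower_half_l1(real, fake):
    """L1 reconstruction loss restricted to the lower half of the frame
    (rows H/2..H), which contains the mouth region."""
    h = real.shape[2]
    return F.l1_loss(fake[:, :, h // 2:, :], real[:, :, h // 2:, :])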
Feature-Matching Loss
The feature-matching loss (Wang et al., 2018) ensures the generation of natural-looking,
high-quality frames. We take the L1 distance between discriminator features of generated and real images for the different
scale discriminators and then sum them. We extract features from multiple layers of the discriminator
and learn to match these intermediate representations of the real and the synthesized image. This
helps in stabilizing the training of the generator. The feature-matching loss L_{FM}(G, D_k) is given by:

L_{FM}(G, D_k) = \mathbb{E}_{(x,z)} \sum_{i=1}^{T} \frac{1}{N_i} \big\| D_k^{(i)}(x) - D_k^{(i)}(G(z)) \big\|_1

where T is the total number of layers and N_i denotes the number of elements in each layer.
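A sketch of this term, assuming each scale discriminator can return its list of intermediate feature maps.

import torch

def feature_matching_loss(disc_feats_real, disc_feats_fake):
    """L1 feature-matching loss between intermediate discriminator features
    of a real frame and a generated frame (one tensor per layer)."""
    loss = 0.0
    for f_real, f_fake in zip(disc_feats_real, disc_feats_fake):
        # torch.mean normalizes by the number of elements in the layer (1/N_i)
        loss = loss + torch.mean(torch.abs(f_real.detach() - f_fake))
    return loss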
Perceptual Loss
The perceptual similarity is calculated between the generated frame and
the real frame. This is done by using features of a VGG19 (Simonyan and Zisserman, 2014) model
trained for ILSVRC classification and on the VGGFace (Mei and Deng, 2018) dataset. The perceptual
loss (Justin Johnson and Fei-Fei, 2016), L_{PL}, is defined as:

L_{PL} = \lambda \sum_{i=1}^{N} \frac{1}{M_i} \big\| F^{(i)}(x) - F^{(i)}(G(z)) \big\|_1

where \lambda is the weight for the perceptual loss and F^{(i)} is the i-th layer of the VGG19 network with M_i
elements.
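A minimal sketch of a VGG19-based perceptual loss; the choice of feature layers and the ImageNet weights are assumptions for illustration (the paper also uses VGGFace-trained features).

import torch
import torch.nn as nn
from torchvision import models

class VGGPerceptualLoss(nn.Module):
    """L1 distance between VGG19 features of a generated and a real frame,
    summed over a few intermediate layers (layer choice is illustrative)."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):
        super().__init__()
        vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, fake, real):
        loss, x, y = 0.0, fake, real
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + torch.mean(torch.abs(x - y))
            if i >= max(self.layer_ids):
                break
        return loss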
Contrastive Loss
For coherent lip synchronization, we use the synchronization discriminator with a
contrastive loss. The training objective is that the outputs of the audio and the video networks are
similar for genuine pairs and different for false pairs.
The contrastive loss L_{CL} is given by the following equation:

L_{CL} = \frac{1}{2N} \sum_{n=1}^{N} \big[ y_n d_n^2 + (1 - y_n) \max(\mathrm{margin} - d_n, 0)^2 \big],
\qquad d_n = \| v_n - a_n \|_2

where v_n and a_n are the fc_7 vectors for the video and audio inputs respectively, and y_n \in \{0, 1\} is the binary
similarity label for the video-audio pair.
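A sketch of this contrastive objective, assuming the video and audio branches already produce embedding vectors; the margin value is an illustrative default.

import torch

def contrastive_loss(video_emb, audio_emb, labels, margin=1.0):
    """Contrastive loss on video/audio embedding pairs:
    labels == 1 for genuine (synchronized) pairs, 0 for false pairs."""
    d = torch.norm(video_emb - audio_emb, dim=1)                 # d_n = ||v_n - a_n||_2
    pos = labels * d.pow(2)                                      # pull genuine pairs together
    neg = (1 - labels) * torch.clamp(margin - d, min=0).pow(2)   # push false pairs apart
    return 0.5 * (pos + neg).mean()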
Blink loss
We have used the eye aspect ratio (EAR) from Real-Time Eye Blink Detection
using Facial Landmarks (Soukupova and Cech, 2016) to calculate the blink loss. A blink is detected
at the location where a sharp drop occurs in the EAR signal. The loss is defined as:

m = \frac{\|p_2 - p_6\| + \|p_3 - p_5\|}{\|p_1 - p_4\|}, \qquad L_{BL} = \| m_r - m_g \|

where the points p_i are described in Figure 2(b). We take the L1 loss between the eye aspect ratio of the
real frame, m_r, and that of the synthesized frame, m_g.
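A small sketch computing the EAR from the six eye landmarks of Figure 2(b) and the corresponding blink loss between a real and a generated frame.

import torch

def eye_aspect_ratio(p):
    """p: tensor of shape (6, 2) holding the eye landmarks p1..p6 of Figure 2(b)."""
    vertical = torch.norm(p[1] - p[5]) + torch.norm(p[2] - p[4])  # ||p2-p6|| + ||p3-p5||
    horizontal = torch.norm(p[0] - p[3])                          # ||p1-p4||
    return vertical / horizontal

def blink_loss(landmarks_real, landmarks_fake):
    """L1 distance between the EAR of the real and the synthesized frame."""
    return torch.abs(eye_aspect_ratio(landmarks_real) - eye_aspect_ratio(landmarks_fake))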
1.2 Stage 2
This stage consists of converting a human video into an animated video.
1.2.1 Temporal Predictor
The temporal predictor used in the second stage captures the temporal information of the
video and is trained with an L2 loss, given below. The temporal predictor has a
UNet architecture, shown in Figure 3:

L_{P_x} = \sum_{t} \| x_{t+1} - P_x(x_{1:t}) \|_2

where x_{1:t} denotes (x_1, x_2, ..., x_t).
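A sketch of this prediction loss, assuming the predictor takes the last k frames stacked along the channel dimension; the window size k is an illustrative assumption.

import torch

def temporal_prediction_loss(predictor, frames, k=2):
    """L2 loss of a frame predictor P_x that predicts frame t+1 from the
    previous k frames; frames has shape (T, C, H, W)."""
    loss = 0.0
    for t in range(k, frames.shape[0]):
        # Stack the last k frames along channels as the predictor input
        past = frames[t - k:t].reshape(1, -1, *frames.shape[2:])
        pred = predictor(past)
        loss = loss + torch.mean((frames[t:t + 1] - pred) ** 2)
    return loss / max(frames.shape[0] - k, 1)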
1.2.2 Losses
Different loss functions are used to create the animated videos capturing various aspects such as
lip-syncing, eye blinking, eyebrows movements, and head movements.
Figure 3: Unet architecture
Adversarial Loss: This is used to improve the mapping of a translated image to the target domain.

L_{GAN}(G, D) = \mathbb{E}_{y \sim P_t}[(D(y))^2] + \mathbb{E}_{x \sim P_s}[(1 - D(G(x)))^2]

where G tries to minimize this objective against an adversarial D that tries to maximize it, and x and y
belong to the source and target domains respectively.
Recycle loss:
We use the temporal predictor P_y to define the recycle loss, which preserves temporal
coherency across the generated animated videos and avoids perceptual mode collapse, i.e., the case where
different input frames of a real video produce the same animated output frame.

L_{recycle}(G_x, G_y, P_y) = \sum_{t} \| x_{t+1} - G_x(P_y(G_y(x_{1:t}))) \|_2

where x and y are the source and target domains respectively and G_y(x_{1:t}) denotes (G_y(x_1), G_y(x_2), ..., G_y(x_t)).
Intuitively, this forces the sequence of frames to map back to themselves across domains and time.
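A sketch of the recycle term under the same assumptions, with per-frame translators G_y (source to target) and G_x (target to source) and a target-domain predictor P_y; the two-frame prediction window is illustrative.

import torch

def recycle_loss(G_y, G_x, P_y, frames, k=2):
    """Recycle loss: translate source frames to the target domain with G_y,
    predict the next target frame with P_y, map it back with G_x, and compare
    against the true next source frame. frames: (T, C, H, W)."""
    translated = G_y(frames)                      # per-frame source -> target translation
    loss = 0.0
    for t in range(k, frames.shape[0]):
        past = translated[t - k:t].reshape(1, -1, *frames.shape[2:])
        next_target = P_y(past)                   # predicted next frame in the target domain
        recon = G_x(next_target)                  # map the prediction back to the source domain
        loss = loss + torch.mean((frames[t:t + 1] - recon) ** 2)
    return loss / max(frames.shape[0] - k, 1)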
Identity loss:
This is used to ensure that the color distributions of the source image and the translated image are
similar.

L_{identity} = \mathbb{E}_{x \sim P_s}\big[ \| x - G_{s->t}(x) \|_1 \big]
CAM loss:
By exploiting the information from the auxiliary classifiers n_s and n_{D_t}, G_{s->t} and D_t
learn where they need to improve and what makes the biggest difference between the two domains.

L_{cam}^{s->t} = \mathbb{E}_{y \sim P_t}[\log(n_s(y))] + \mathbb{E}_{x \sim P_s}[\log(1 - n_s(x))]

L_{cam}^{D_t} = \mathbb{E}_{y \sim P_t}[\log(n_{D_t}(y))] + \mathbb{E}_{x \sim P_s}[\log(1 - n_{D_t}(G_{s->t}(x)))]
Lip-sync loss:
We apply a cycle consistency loss on the lower half of the frame between x and
G_{t->s}(G_{s->t}(x)):

L_{lip} = \sum_{n \in [0, W] \times [H/2, H]} | x - G_{t->s}(G_{s->t}(x)) |_1
Blink loss:
We have used the eye aspect ratio (EAR) from Real-Time Eye Blink Detection
using Facial Landmarks (Soukupova and Cech, 2016) to calculate the blink loss. The loss is defined as:

m = \frac{\|p_2 - p_6\| + \|p_3 - p_5\|}{\|p_1 - p_4\|}, \qquad L_{BL} = \| m_x - m_{G_{t->s}(G_{s->t}(x))} \|

where the points p_i are described in Figure 2(b). We take the L1 loss between the EAR of the input frame, m_x,
and that of the reconstructed frame, m_{G_{t->s}(G_{s->t}(x))}.
2 Experiments
2.1 Metrics
1. PSNR- Peak Signal to Noise Ratio:
It computes the peak signal to noise ratio between two
images. The higher the PSNR the better the quality of the reconstructed image.
2. SSIM- Structural Similarity Index:
It is a perceptual metric that quantifies image quality
degradation. The larger the value the better the quality of the reconstructed image.
3. CPBD- Cumulative Probability Blur Detection:
It is a perceptual, no-reference objective
image sharpness metric based on the cumulative probability of blur detection.
4. WER- Word error rate:
It is a metric to evaluate the performance of speech recognition in a
given video. We have used LipNet architecture (Assael et al., 2017) which is pre-trained on the GRID
dataset for evaluating the WER. On the GRID dataset, LipNet achieves 95.2 percent accuracy, which
surpasses experienced human lipreaders.
5. ACD- Average Content Distance( (Tulyakov et al., 2018)):
It is used for the identification
of speakers from the generated frames using OpenPose (Cao et al., 2018). We have calculated the
Cosine distance and Euclidean distance of representation of the generated image and the actual image
from Openpose. The distance threshold for the OpenPose model should be 0.02 for Cosine distance
and 0.20 for Euclidean distance (Long Zhao et al., 2018). The smaller the distances, the more
similar the generated and actual images.
6. KID - Kernel Inception Distance (Bińkowski et al., 2018):
It computes the squared Maxi-
mum Mean Discrepancy between the feature representations of real and generated images. In contrast
to the Fréchet Inception Distance (Heusel et al., 2017), KID has an unbiased estimator, which makes it
more reliable, especially when there are fewer test images than the dimensionality of the Inception
features. A lower KID indicates more shared visual similarity between real and generated
images (Alami Mejjati et al., 2018).
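For reference, a minimal sketch of the unbiased KID estimator with the commonly used cubic polynomial kernel on Inception features; the kernel and scaling follow the usual formulation and are stated here as assumptions rather than taken from the paper.

import torch

def kid(real_feats, fake_feats):
    """Unbiased squared-MMD estimate between Inception features of real and
    generated images, using the polynomial kernel k(a, b) = (a.b/d + 1)^3.
    Requires at least two samples in each set."""
    d = real_feats.shape[1]

    def poly_kernel(a, b):
        return (a @ b.t() / d + 1.0) ** 3

    k_rr = poly_kernel(real_feats, real_feats)
    k_ff = poly_kernel(fake_feats, fake_feats)
    k_rf = poly_kernel(real_feats, fake_feats)
    m, n = real_feats.shape[0], fake_feats.shape[0]
    # Unbiased estimator: drop the diagonal terms of the within-set kernels
    term_rr = (k_rr.sum() - k_rr.diag().sum()) / (m * (m - 1))
    term_ff = (k_ff.sum() - k_ff.diag().sum()) / (n * (n - 1))
    return term_rr + term_ff - 2.0 * k_rf.mean()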
7. Blinks/sec:
To capture the blinks in the video, we are calculating the blinks/sec so that we can
better understand the quality of animated videos. We have used SVM and eye landmarks along with
Eye aspect ratio used in Real-Time Eye Blink Detection using Facial Landmarks (Soukupova and
Cech, 2016) to detect the blinks in a video.
2.2 Detailed Description of Datasets
The GRID dataset is a large multi-talker audiovisual sentence corpus. It consists of high-quality
audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16
female). The LOMBARD GRID dataset is a bi-view audiovisual Lombard speech corpus that can be
used to support joint computational-behavioral studies in speech perception. The corpus includes
54 talkers, with 100 utterances per talker (50 Lombard and 50 plain utterances). It consists of 5400
videos from 54 talkers, comprising 30 female talkers and 24 male talkers. CREMA-D is a data
set of 7,442 original clips from 91 actors with six different emotions (Anger, Disgust, Fear, Happy,
Neutral, and Sad). The hikemoji dataset is used for the style transfer and for creating the animated videos.
2.3 Detailed Training
2.3.1 Stage 1
The aligned face is generated for every speaker using a facial landmark detector (Asthana et al., 2014),
and HopeNet (Ruiz et al., 2018) is used to calculate the yaw, pitch and roll angles to obtain the most aligned
face for every speaker as input.
We use the Adam optimizer (Kingma and Ba, 2014) with learning rate = 0.002, β1 = 0.0 and
β2 = 0.90 for the generator and discriminators. The learning rate of the generator and discriminators is
constant for 50 epochs and after that decays to zero over the next 100 epochs.
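A sketch of this optimizer setup and decay schedule; the linear shape of the decay is an assumption, since the text only states that the rate is constant for 50 epochs and then decays to zero over the next 100.

import torch

def make_optimizer_and_scheduler(params, lr=0.002, betas=(0.0, 0.90),
                                 const_epochs=50, decay_epochs=100):
    """Adam optimizer with a constant learning rate for const_epochs, then a
    decay to zero over decay_epochs (a linear ramp is assumed here)."""
    opt = torch.optim.Adam(params, lr=lr, betas=betas)

    def lr_lambda(epoch):
        if epoch < const_epochs:
            return 1.0
        # Linearly interpolate from 1 down to 0 over the decay phase
        return max(0.0, 1.0 - (epoch - const_epochs) / float(decay_epochs))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
    return opt, sched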
2.3.2 Stage 2
We have used λcam = 2000, λrecycle = 100, λidentity = 10, λlip = 100 and λBL = 100 as the hyperparam-
eters that control the importance of the various loss functions in the objective function. We use the Adam
optimizer (Kingma and Ba, 2014) with learning rate = 0.0001, β1 = 0.5 and β2 = 0.999 for the
generator, discriminators and temporal predictor.
2.4 Ablation Study
2.4.1 Stage 1
We studied the incremental impact of various loss functions on the LOMBARD GRID dataset and
the GRID dataset. We have provided corresponding videos in supplementary data for better visual
understanding. As mentioned in the curriculum learning discussion of the main paper, each loss has a different impact on
the final output video. Table 1 and Table 2 depict the impact of the different losses on both datasets.
The base model includes the adversarial GAN loss, feature-matching loss, and perceptual loss.
The addition of the contrastive loss and the multi-scale temporal adversarial loss in the sequence discriminator
helps in achieving coherent lip-synchronized videos and improves the SSIM, PSNR, and CPBD
values. The further addition of the blink loss ensures improved quality of the final video.
Method SSIM PSNR CPBD
Base Model(BM) 0.869 27.996 0.213
BM + CL +TAL 0.873 28.327 0.258
BM + CL + TAL+ BL 0.881 28.571 0.262
Table 1: Ablation study on the GRID dataset, where CL is the contrastive loss, TAL is the multi-scale
temporal adversarial loss and BL is the blink loss.
Method SSIM PSNR CPBD
Base Model(BM) 0.909 28.656 0.386
BM + CL + TAL 0.913 28.712 0.390
BM + CL + TAL+ BL 0.922 28.978 0.453
Table 2: Ablation study on the LOMBARD GRID dataset, where CL is the contrastive loss, TAL is
the multi-scale temporal adversarial loss and BL is the blink loss.
The use of deepspeech2 to generate content embeddings from audio helped improve the WER
and lets us reach performance comparable to RSDGAN.
Table 3 gives a brief about the videos attached related to the ablation study.
Video Name Model Description
Video_11.mp4 Output with Base model as explained in Section 6.6
Video_12.mp4 Output with Base model as explained in Section 6.6
Video_13.mp4 Output with contrastive loss (CL)
Video_14.mp4 Output with blink loss (BL)
Video_15.mp4 Output with blink loss (BL)
Table 3: Ablation Study Video Description
2.4.2 Stage 2
We studied the incremental result of various loss functions on the GRID dataset and the hikemoji
dataset. The addition of recycle loss, lip synchronization loss, and blink loss helps in the improvement
of different aspects of an animated video such as lip movement, eye-blink, eyebrows, and head
movements.
Method KID×100±std.×100 WER blinks/sec
Base Model(BM) 10.37±0.17 68.85 0.046
BM + RL+ LSL 6.43±0.92 32.62 0.153
BM + RL + LSL+ BL 5.02±0.03 31.97 0.546
Table 4: Ablation study of Stage 2 on the GRID dataset, where the base model is the U-GAT-IT architecture,
RL is the recycle loss, LSL is the lip synchronization loss and BL is the blink loss.
3 Psychophysical assessment
3.1 Stage 1
Results are visually rated (on a scale of 5) individually by 25 persons on three aspects: lip synchro-
nization, eye blinks and eyebrow raises, and quality of video. The subjects were shown anonymized
videos at the same time for the different audio clips for side-by-side comparison. Table 5 shows
that OneShotAu2AV performs better on video quality, which is of prime importance in videos, while remaining
competitive on lip synchronization and eye-blink.
Method Lip-Sync Eye-blink Quality
OneShotAu2AV 90.8 88.5 76.2
RSDGAN 92.8 90.2 74.3
Speech2Vid 90.7 87.7 72.2
Table 5: Psychophysical Evaluation (in percentages) based on users rating
To test the naturalism of the generated videos we conduct an online Turing test 1. Each test consists
of 25 questions with 13 fake and 12 real videos. The user is asked to label a video real or fake based
on the aesthetics and naturalism of the video. Responses from approximately 300 users were collected, and
the distribution of their ability to spot fake videos is displayed in Figure 4.
Figure 4: Distribution of user scores for the online Turing test
1https://forms.gle/JEk1u5ahc9gny7528
3.2 Stage 2
Table 6 clearly shows that OneShotAu2AV performs significantly better in quality, lip synchronization
and facial expressions, which are of prime importance in videos.
Method Lip-Sync Eye-blink Head-Move Quality User Rating
OneShotAu2AV 90.8 88.5 87.5 94.2 8.3
U-GAT-IT 72.8 76.2 76.5 85.3 5.4
RecycleGan 50.7 56.7 55.5 62.7 5.2
Table 6: Psychophysical Evaluation (in percentages) based on users rating
To test the naturalism of the generated animated videos we conduct an online feedback test 2. Each
test consists of 15 questions with 5 videos each generated using the proposed method, RecycleGan and
U-GAT-IT. The user is asked to rate a video based on the aesthetics, naturalism and lip-sync of the
video. Responses from approximately 300 users were collected, and we observe a high score for the proposed method
and lower scores for U-GAT-IT and RecycleGan in the User Rating column of Table 6.
4 Generated Videos
4.1 Stage 1
Below is the description of the videos attached with the supplementary data (Table 7). The description
covers the text spoken and the language of the input audio.
The test subjects and audio are from the GRID dataset. A few audio clips were recorded by us
as well.
Video Name Text Language
Video_1.mp4 Hi how are you? English
Video_2.mp4 Bin blue at e seven please English
Video_3.mp4 Bin blue by f nine again English
Video_4.mp4 Set y with v zero now English
Video_5.mp4 Modi hai toh mumkin hai Hindi
Video_6.mp4 Lay white in y 9 again English
Video_7.mp4 Lay blue in q zero now English
Video_8.mp4 Place blue at I 6 please English
Video_9.mp4 Place blue at P 1 again English
Video_10.mp4 Bin red with t 5 soon English
Table 7: Video Description
Few videos to highlight eye-blink: Video_3.mp4, Video_8.mp4, Video_15.mp4
4.2 Stage 2
Below is the description of the videos attached with the supplementary data (Table 8). The description
covers the text spoken, the language of the input audio, and the model used to generate the
video. We have used the proposed method, RecycleGan and U-GAT-IT to generate these videos.
Few videos to highlight eye-blink: Video_24.mp4, Video_25.mp4, Video_26.mp4
Few videos to highlight head movement and facial expressions: Video_18.mp4,
Video_27.mp4
2https://forms.gle/ZuH5uoUianrpMrm68
Video Name Text Language Generation Method
Video_16.mp4 Bin green with b 5 soon English Proposed Method
Video_17.mp4 Bin white in k 3 now English Proposed Method
Video_18.mp4 Bin white in d 9 now English Proposed Method
Video_19.mp4 Bin green in t 6 please English Proposed Method
Video_20.mp4 Modi aache to Shambhaw aache Bengali Proposed Method
Video_21.mp4 Modi hai toh mumkin hai Hindi Proposed Method
Video_22.mp4 Lay green by k 2 soon English Proposed Method
Video_23.mp4 Bin white at g 4 now English Proposed Method
Video_24.mp4 Lay green by e 0 again English Proposed Method
Video_25.mp4 Lay green by d 7 now English Proposed Method
Video_26.mp4 Bin red by t 2 please English Proposed Method
Video_27.mp4 Bin red at m 3 soon English Proposed Method
Video_30.mp4 Bin green in t 6 please English RecycleGan
Video_31.mp4 Bin green in n 3 again English RecycleGan
Video_32.mp4 Bin green in t 5 soon English RecycleGan
Video_33.mp4 Bin red by g 3 soon English RecycleGan
Video_34.mp4 Bin red in p 3 soon English RecycleGan
Video_35.mp4 Bin green in n 3 again English U-GAT-IT
Video_36.mp4 Bin red by m 6 now English U-GAT-IT
Video_37.mp4 Bin red by m 9 again English U-GAT-IT
Video_38.mp4 Bin red by a 0 please English U-GAT-IT
Video_39.mp4 Bin green in t 6 please English U-GAT-IT
Video_40.mp4 Bin red by g 3 soon English U-GAT-IT
Video_41.mp4 wie ist das Wetter heute German Proposed Method
Video_42.mp4 cómo está el clima hoy Spanish Proposed Method
Video_43.mp4 Quel temps fait-il aujourd’hui French Proposed Method
Table 8: Video Description
References
Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang Kim. Unsu-
pervised attention-guided image to image translation, 06 2018.
Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. Lipnet: End-to-end
sentence-level lipreading. GPU Technology Conference, 2017. URL
https://github.com/
Fengdalu/LipNet-PyTorch.
Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, and Maja Pantic. Incremental face alignment
in the wild. pages 1859–1866, 06 2014. doi: 10.1109/CVPR.2014.240.
Mikołaj Bińkowski, DJ Sutherland, M Arbel, and A Gretton. Demystifying mmd gans. 01 2018.
Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: realtime
multi-person 2D pose estimation using Part Affinity Fields. In arXiv preprint arXiv:1812.08008,
2018.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans
trained by a two time-scale update rule converge to a local nash equilibrium. 12 2017.
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and
super-resolution. 2016.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International
Conference on Learning Representations, 12 2014.
Yanchun Li, Nanfeng Xiao, and Wanli Ouyang. Improved generative adversarial networks with
reconstruction loss. Neurocomputing, 323, 10 2018. doi: 10.1016/j.neucom.2018.10.014.
Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris Metaxas. Learning to forecast and
refine residual motion for image-to-video generation, 2018.
Wang Mei and Weihong Deng. Deep face recognition: A survey. 04 2018.
Nataniel Ruiz, Eunji Chong, and James M. Rehg. Fine-grained head pose estimation without key-
points. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops,
June 2018.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv 1409.1556, 09 2014.
Tereza Soukupova and Jan Cech. Real-time eye blink detection using facial landmarks, 2016.
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion
and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 1526–1535, 2018.
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-
resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2018.