A Simple Approach to Animating Virtual Characters
by Facial Expressions Reenactment
Zechen Bai1,2* Naiming Yao1 Lu Liu1 Hui Chen1,2† Hongan Wang1,2
Institute of Software, Chinese Academy of Sciences, China1
University of Chinese Academy of Sciences, China2
*E-mail: zechen2019@iscas.ac.cn
†Corresponding Author: chenhui@iscas.ac.cn
ABSTRACT
Animating virtual characters is one of the core problems in virtual reality. Facial animations can intuitively convey the emotions and attitudes of virtual characters. However, creating facial animations is a non-trivial task: it relies either on expensive motion capture devices or on human designers spending considerable time and effort tuning animation parameters. In this work, we propose a learning-based approach that animates virtual characters by reenacting facial expressions from abundant image data. The approach is simple yet effective and generalizes to various 3D characters. Preliminary evaluation results demonstrate its effectiveness and its potential to accelerate the development of VR applications.
Index Terms: Virtual Human—Facial Animation—Face
Reenactment—Blendshape
1 INTRODUCTION
Virtual reality (VR) technology is attracting increasing attention
as it provides users with an immersive experience. When building
virtual characters, it is of great significance to create vivid facial
animations, because they can intuitively convey emotion and feel-
ing, which is crucial in many VR applications [1]. However, it is
non-trivial to create diverse facial animations conveniently. Tradi-
tional methods usually use motion capture devices to track the key
points of human actors and replicate them on virtual characters. The requirement for dedicated devices limits the applicability of this approach. Recently, most 3D virtual characters are animated by a linear combination of muscle movements controlled by corresponding coefficients. A popular implementation of this scheme is the blendshape model. Under this scheme, creating animations usually requires human designers to tune the coefficients in a trial-and-error manner, which is tedious and time-consuming.
In this paper, instead of creating animations from scratch, we pro-
pose to address this problem by facial expression reenactment based
on abundant image data. Previous methods of face reenactment mainly focus on the 2D image/video domain, while few explore the 2D-to-3D scenario. Under the popular
blendshape scheme, we postulate that the generation of facial ani-
mation is essentially an estimation of the blendshape coefficients.
Therefore, we propose to achieve facial expression reenactment by
estimating blendshape coefficients.
The approach is designed to be simple in theory and practice in
order to enhance usability. As illustrated in Fig. 1, the approach
contains two models. We first pre-train a base model that estimates
generic 3D facial parameters from the given image. Based on the
frozen base model, we train a lightweight adapter model to adapt the
generic parameters into the desired blendshape coefficients of the
target virtual character. During testing, the pipeline takes human face images as input and estimates the corresponding blendshape coefficients.
This approach is able to reduce the workload of human designers
in creating facial animations. Besides, the paradigm of a frozen
base model plus a trainable lightweight adapter makes this approach
generalizable. On one hand, once trained, it is generalizable to 3D
characters with the same blendshape topology, even with different
texture appearances. On the other hand, the adapter model is a
lightweight framework that can be easily retrained to adapt to a new
blendshape topology. Preliminary evaluation results demonstrate its
effectiveness and its potential to accelerate the development of VR
applications. We will make our solution public.
2 APPROACH
2.1 Model Details
The linear 3D Morphable Model (3DMM) is employed as the generic 3D face representation, in which the face shape $S$ and texture $T$ in 3D space are represented as
$$S = \bar{S} + \alpha B_{id} + \beta B_{exp}, \qquad T = \bar{T} + \sigma B_{tex},$$
where $\bar{S}$ and $\bar{T}$ are the mean face shape and texture, respectively, and $B_{id}$, $B_{exp}$, and $B_{tex}$ denote the PCA bases of identity, expression, and texture, respectively. In addition, we introduce an illumination model and a camera model to define the light coefficient $\gamma$ and the pose coefficient $p$, respectively. All the coefficients mentioned above are concatenated into a single vector $v = (\alpha, \beta, \sigma, \gamma, p)$. The base model is implemented as a ResNet-50 with a modified projection head that fits the output dimension. The model takes human face images as input and predicts the 3DMM coefficient vector.
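A minimal PyTorch sketch of such a base model is given below. It is illustrative only: the backbone and the replaced projection head follow the description above, while the per-coefficient dimensions in COEFF_DIMS are assumptions that depend on the specific 3DMM bases used.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Assumed coefficient dimensions; the actual sizes depend on the 3DMM bases.
COEFF_DIMS = {"alpha": 80, "beta": 64, "sigma": 80, "gamma": 27, "pose": 6}

class BaseModel(nn.Module):
    def __init__(self, coeff_dims=COEFF_DIMS):
        super().__init__()
        self.coeff_dims = coeff_dims
        backbone = resnet50(weights=None)
        # Replace the classification layer with a projection head whose output
        # dimension matches the concatenated 3DMM coefficient vector v.
        backbone.fc = nn.Linear(backbone.fc.in_features, sum(coeff_dims.values()))
        self.backbone = backbone

    def forward(self, images):
        # images: (B, 3, H, W) face crops; returns (alpha, beta, sigma, gamma, pose).
        v = self.backbone(images)
        return torch.split(v, list(self.coeff_dims.values()), dim=1)
```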
For facial expression reenactment, the adapter model takes the expression coefficient $\beta$ and the pose coefficient $p$ as input and predicts the blendshape coefficients $\hat{y}$. The adapter is implemented as a lightweight Multi-Layer Perceptron (MLP). Specifically, a $\mathrm{Clamp}$ operator after the last layer truncates the output values to the range $[0, 1]$, which forces the model to output valid coefficient values.
Generally, the pipeline can be formulated as
$$\alpha, \beta, \sigma, \gamma, p = \mathrm{BaseModel}(x), \tag{1}$$
$$\hat{y} = \mathrm{Adapter}(\beta, p), \tag{2}$$
where $x$ denotes the input face image.
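A sketch of the adapter and of the full pipeline in Eqs. (1)-(2) is shown below, again as an illustration rather than the released implementation; the hidden width and the number of blendshapes are placeholder values.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Lightweight MLP mapping (beta, pose) to blendshape coefficients in [0, 1].
    def __init__(self, exp_dim=64, pose_dim=6, num_blendshapes=52, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(exp_dim + pose_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_blendshapes),
        )

    def forward(self, beta, pose):
        y_hat = self.mlp(torch.cat([beta, pose], dim=1))
        return torch.clamp(y_hat, 0.0, 1.0)  # the Clamp operator keeps coefficients valid

# Full pipeline, following Eqs. (1) and (2):
#   alpha, beta, sigma, gamma, pose = base_model(images)
#   y_hat = adapter(beta, pose)
```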
2.2 Training Strategy
The two models are trained sequentially. Once trained, the base model serves as a foundation, and the training of the adapter model relies on the pre-trained base model.
We use a public real-world face dataset to train the base model in a self-supervised way. The ResNet backbone estimates the 3DMM parameter vector $\hat{v}$, which is used to reconstruct the 3D face and render it back to the image plane as $\hat{x}$ with a differentiable renderer. The model is optimized by minimizing the reconstruction error between $x$ and $\hat{x}$. In addition, we also employ the perception loss and landmark loss introduced in the 3D face reconstruction work [2].
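The loss combination can be sketched as follows. This is a hedged illustration: the renderer interface, the feature network used for the perception loss, and the loss weights are assumptions, not values reported in the paper or in [2].

```python
import torch.nn.functional as F

def base_model_loss(base_model, renderer, feature_net, x, lm_gt,
                    w_photo=1.0, w_perc=0.2, w_lm=1e-3):
    # Estimate 3DMM coefficients and re-render the face with a differentiable renderer.
    alpha, beta, sigma, gamma, pose = base_model(x)
    x_hat, lm_hat = renderer(alpha, beta, sigma, gamma, pose)

    loss_photo = F.l1_loss(x_hat, x)                            # photometric reconstruction error
    loss_perc = F.mse_loss(feature_net(x_hat), feature_net(x))  # perception loss on deep features
    loss_lm = F.mse_loss(lm_hat, lm_gt)                         # 2D landmark loss
    return w_photo * loss_photo + w_perc * loss_perc + w_lm * loss_lm
```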
Figure 1: The base model estimates generic 3D facial parameters and, once trained, is kept frozen. The adapter model estimates blendshape coefficients from the expression and pose coefficients. The adapter is a lightweight model that can be adapted to various characters.

To train the adapter model, we prepare a dedicated dataset for the target 3D character. The dataset consists of: 1) randomly generated blendshape coefficients; and 2) the corresponding rendered images of the virtual character's face. Note that the dataset preparation is an automatic process carried out with 3D software such as Maya. When generating the blendshape coefficients, a filtering mechanism discards contradictory facial blendshapes; for example, a person can hardly move their lips in two opposite directions simultaneously. Empirical studies show that a set of manually designed rules is powerful enough to filter out a large portion of contradictory samples. The adapter model is trained on this dataset by minimizing the Mean Squared Error (MSE) between the ground-truth blendshape coefficients $y$ and the estimated blendshape coefficients $\hat{y}$.
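As an illustration of the data preparation, the sketch below samples random blendshape coefficients and applies a rule-based filter; the blendshape names and the contradictory pairs are hypothetical examples, since the paper does not list the actual rules.

```python
import random

# Hypothetical pairs of blendshapes that should not be strongly active together.
CONTRADICTORY_PAIRS = [("jawOpen", "mouthClose"), ("mouthSmileLeft", "mouthFrownLeft")]

def sample_coefficients(blendshape_names, max_active=8):
    # Activate a random subset of blendshapes with random intensities in [0, 1].
    coeffs = {name: 0.0 for name in blendshape_names}
    for name in random.sample(blendshape_names, k=max_active):
        coeffs[name] = random.random()
    return coeffs

def is_valid(coeffs, threshold=0.3):
    # Reject samples in which both members of a contradictory pair are strongly activated.
    return all(min(coeffs.get(a, 0.0), coeffs.get(b, 0.0)) < threshold
               for a, b in CONTRADICTORY_PAIRS)

def generate_dataset(blendshape_names, n_samples):
    samples = []
    while len(samples) < n_samples:
        coeffs = sample_coefficients(blendshape_names)
        if is_valid(coeffs):
            samples.append(coeffs)  # each sample is later rendered in 3D software to get the paired image
    return samples
```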
2.3 Usage
Given the pre-trained models, the way of creating animations depends on the type of source data. If the source is a still image, users can estimate the blendshape coefficients of the given image, treat the resulting expression as the peak intensity, and create a facial animation by a smooth transition from the neutral expression to the peak-intensity expression. If the source is a video, users can estimate the blendshape coefficients of each frame and generate the animation directly.
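For the still-image case, the transition can be as simple as interpolating the coefficients from the neutral pose to the estimated peak, as in the hedged sketch below; the frame count and the smoothstep easing are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def still_image_animation(peak_coeffs, num_frames=30):
    # peak_coeffs: (num_blendshapes,) array estimated from the input image.
    neutral = np.zeros_like(peak_coeffs)
    t = np.linspace(0.0, 1.0, num_frames)
    t = 3 * t**2 - 2 * t**3  # smoothstep easing for a gradual ramp-up
    return [(1.0 - ti) * neutral + ti * peak_coeffs for ti in t]  # per-frame coefficients
```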
The two models in the pipeline play different roles. The base model provides generic 3D facial parameters and is character-agnostic, i.e., its training and use do not involve any specific 3D virtual character. In contrast, the adapter model is character-dependent: its training and use are coupled to a specific blendshape topology.
The generalization of the approach is two-fold. First, once trained, it generalizes well within the family of 3D characters that share the same blendshape topology, even when their texture appearances differ. This covers a large portion of usage scenarios. Second, when adapting the approach to a new blendshape topology, users only need to retrain the adapter model on the automatically generated dataset for the target 3D character. In addition, thanks to the simple and lightweight design of the adapter model, retraining requires few computing resources and little time, which enhances its usability.
3 EVALUATION
3.1 Qualitative Examples
In Fig. 2, we show examples of virtual characters with various genders and appearances. The reenacted facial expressions vividly replicate the source facial expressions, validating the effectiveness of our approach. Even though some subtle muscle movements are not reproduced perfectly, the estimated coefficients still provide a reasonable initial state.

Figure 2: Examples of virtual characters with different genders and appearances.
3.2 Comparison with Human Designers
We compare the proposed approach with human designers in terms of satisfaction score and time usage. The satisfaction score, ranging from 1 to 10, was collected from 14 volunteers who rated 7 randomly picked samples. The results are shown in Tab. 1. Although the human designer achieves a higher satisfaction score than the proposed approach, the margin is small. Moreover, the standard deviation indicates a relatively large variation in the scores, which reflects the subjectivity of judging facial expressions. Regarding the time needed to generate or adjust blendshape coefficients, the proposed approach takes only 0.41 s to run the whole testing pipeline, while the human designer needs over 236 s to tune the blendshape coefficient values into the desired state. This large gap shows the effectiveness of our approach in reducing designers' workload without losing much fidelity.
Table 1: Comparison with human designers.

            Satisfaction Score    Inference Time
Algorithm   6.36 ± 1.80           0.41 ± 0.06 s
Designer    6.92 ± 1.91           236.12 ± 48.61 s
4 CONCLUSION
In this work, we propose a simple yet effective approach to ease the
burden of creating facial animations of virtual characters for VR
applications. The task is formulated as a facial expression reenactment problem and is addressed by estimating blendshape coefficients
based on image data. This approach not only generalizes well within
the character family with the same blendshape topology but is also easy to adapt to other customized characters. Evaluation re-
sults have shown promising performance on this task and verified
the effectiveness of our approach. We hope this work will inspire
research on automatically animating virtual characters.
ACKNOWLEDGMENTS
This work was supported by the National Key R&D Program of
China under Grant 2020YFC2004100; and the Open Research Fund
of Guangxi Key Lab of Human-machine Interaction and Intelligent
Decision under Grant GXHIID2201.
REFERENCES
[1] Z. Bai, N. Yao, N. Mishra, H. Chen, H. Wang, and N. Magnenat Thalmann. Enhancing Emotional Experience by Building Emotional Virtual Characters in VR Volleyball Games. Computer Animation and Virtual Worlds, p. e2008, 2021.
[2] Y. Deng, J. Yang, S. Xu, D. Chen, Y. Jia, and X. Tong. Accurate 3D Face Reconstruction With Weakly-Supervised Learning: From Single Image to Image Set. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 285-295. IEEE, 2019.