Synthesis of Compositional Animations from Textual Descriptions
Anindita Ghosh *1,3, Noshaba Cheema1,2,3 , Cennet Oguz1, Christian Theobalt2,3, and Philipp Slusallek1,3
1German Research Center for Artificial Intelligence (DFKI)
2Max-Planck Institute for Informatics
3Saarland Informatics Campus
Abstract
“How can we animate 3D-characters from a movie
script or move robots by simply telling them what we would
like them to do?” “How unstructured and complex can we
make a sentence and still generate plausible movements
from it?” These are questions that need to be answered in
the long-run, as the field is still in its infancy. Inspired
by these problems, we present a new technique for gen-
erating compositional actions, which handles complex in-
put sentences. Our output is a 3D pose sequence depict-
ing the actions in the input sentence. We propose a hierar-
chical two-stream sequential model to explore a finer joint-
level mapping between natural language sentences and 3D
pose sequences corresponding to the given motion. We
learn two manifold representations of the motion – one each
for the upper body and the lower body movements. Our
model can generate plausible pose sequences for short sen-
tences describing single actions as well as long composi-
tional sentences describing multiple sequential and super-
imposed actions. We evaluate our proposed model on the
publicly available KIT Motion-Language Dataset contain-
ing 3D pose data with human-annotated sentences. Exper-
imental results show that our model advances the state-of-
the-art on text-based motion synthesis in objective evalua-
tions by a margin of 50%. Qualitative evaluations based on
a user study indicate that our synthesized motions are per-
ceived to be the closest to the ground-truth motion captures
for both short and compositional sentences.
1. Introduction
Manually creating realistic animations of humans performing complex motions is challenging. Motion synthesis based on textual descriptions substantially simplifies this task and has a wide range of applications, including language-based task planning for robotics and virtual assistants [3], designing instructional videos, creating public safety demonstrations [39], and visualizing movie scripts [27]. However, mapping natural language text descriptions to 3D pose sequences for human motions is non-trivial. The input texts may describe single actions with sequential information (e.g., “a person walks four steps forward”), or may not correspond to the discrete time steps of the pose sequences to be generated, in case of superimposed actions (e.g., “a person is spinning around while walking”). This necessitates a machine-level understanding of the syntax and the semantics of the text descriptions to generate the desired motions [4].

*Corresponding Author: anindita.ghosh@dfki.de

Figure 1: Overview of our proposed method to generate motion from complex natural language sentences.
While translating a sentence to a pose sequence, we need
to identify the different parts of speech in the given sentence
and how they impact the output motion. A verb in the sen-
tence describes the type of action, whereas an adverb may
provide information on the direction, place, frequency, and
other circumstances of the denoted action. These need to
be mapped into the generated pose sequence in the correct
order, laying out additional challenges for motion model-
ing systems. Existing text-to-motion mapping methods can
either handle sentences describing one action only [52]
or produce incorrect results for descriptions of composi-
tional actions [4]. They fail to translate long-range depen-
dencies and correlations in complex sentences and do not
generalize well to other types of motions outside of loco-
motion [4]. We propose a method to handle complex sen-
tences, meaning sentences that describe a person perform-
ing multiple actions either sequentially or simultaneously.
For example, the input sentence “a person is stretching his
arms, taking them down, walking forwards for four steps
and raising them again” describes multiple sequential ac-
tions such as raising the arms, taking down the arms, and
walking, as well as the direction and number of steps for
the action. To the best of our knowledge, our method is
the first to synthesize plausible motions from such varieties
of complex textual descriptions, which is an essential next
step to improve the practical applicability of text-based mo-
tion synthesis systems. To achieve this goal, we propose
a hierarchical, two-stream, sequential network that synthe-
sizes 3D pose sequences of human motions by parsing the
long-range dependencies of complex sentences, preserving
the essential details of the described motions in the process.
Our output is a sequence of 3D poses generating the anima-
tion described in the sentence (Fig. 1). Our main contribu-
tions in this paper are as follows:
Hierarchical joint embedding space. In contrast to [4],
we separate our intermediate pose embeddings into two em-
beddings, one each for the upper body and the lower body.
We further separate these embeddings hierarchically into limb
embeddings. Our model learns the semantic variations in a
sentence ascribing speed, direction, and frequency of motion,
and maps them to temporal pose sequences by decoding the
combined embeddings. This results in the synthesis of pose
sequences that correlate strongly with the descriptions given
in the input sentences.
Sequential two-stream network. We introduce a se-
quential two-stream network with an autoencoder architec-
ture, with different layers focusing on different parts of the
body, and combine them hierarchically into two representa-
tions for the pose in the manifold space – one for the upper
body and the other for the lower body. This reduces the
smoothing of upper body movements (such as wrist move-
ments for playing violin) in the generated poses and makes
the synthesized motion more robust.
Contextualized BERT embeddings. In contrast to pre-
vious approaches [4, 52], which do not use any contextu-
alized language model, we use the state-of-the-art BERT
model [16] with handpicked word feature embeddings to
improve text understanding.
Additional loss terms and pose discriminator. We add
a set of loss terms to the network training to better condition
the learning of the velocity and the motion manifold [36].
We also add a pose discriminator with an adversarial loss to
further improve the plausibility of the synthesized motions.
Experimental results show that our method outperforms
the state-of-the-art methods of Ahuja et al. [4] and Lin et
al. [43] significantly on both the quantitative metrics we dis-
cuss in Section 4.3 and on qualitative evaluations.
2. Related Work
This section briefly summarizes prior works in the re-
lated areas of data-driven human motion modeling and text-
based motion synthesis.
2.1. Human Motion modeling
Data-driven motion synthesis is widely used to generate
realistic human motion for digital human models [33, 31,
17]. Different strategies have been implemented over the
years using temporal convolutional neural networks [14, 40,
10], graph convolution networks [5, 49] and recurrent neu-
ral networks [46, 26, 67, 37]. Pose forecasting attempts to
generate short [20, 50] and long-term motions [23, 42, 61]
by predicting future sequences of poses given their history.
Prior works encode the observed information of poses to
latent variables and perform predictions based on the la-
tent variables [36, 35]. Holden et al. [34] used a feed-
forward network to map high-level parameters to charac-
ter movement. Xu et al. [69] proposed a hierarchical style
transfer-based motion generation, where they explored a
self-supervised learning method to decompose a long-range
generation task hierarchically. Aristidou et al. [6] break
whole motion sequences into short-term movements defin-
ing motion words and cluster them in a high-dimensional
feature space. Generative adversarial networks [24] have
also gained considerable attention in the field of unsuper-
vised learning-based motion prediction [8, 38]. Li et al. [41]
used a convolutional discriminator to model human motion
sequences to predict realistic poses. Gui et al. [25] present
the adversarial geometry-aware encoder-decoder (AGED)
framework, where two global recurrent discriminators dis-
tinguish the predicted pose from the ground-truth. Cui et
al. [15] propose a generative model for pose modeling based
on graph networks and adversarial learning.
Related work also includes pixel-level prediction using human pose as an intermediate variable [65, 66] and locomotion trajectory forecasting [29, 28, 45]. Various audio-, speech-, and image-conditioned forecasting approaches [7] have also been explored for predicting poses. For instance, [19] explores generating skeleton pose sequences for dance movements from audio, [9, 68] aim at predicting future pose sequences from static images, and [2] has linked pose prediction with speech and audio. Takeuchi et al. [60] tackled speech-conditioned forecasting for only the upper body, modeling non-verbal behaviors such as head nods, pose switches, and hand waving for a character without providing knowledge of the character's next movements. [11] relies solely on the history of poses to predict what kind of motion will follow.
2.2. Text-based Motion Synthesis
A subset of prior works have opted to train deep learn-
ing models to translate linguistic instructions to actions for
virtual agents [30, 32, 47, 71]. Takano et al. describe a
system that learns a mapping between human motion and
word labels using Hidden Markov Models in [59, 56]. They
also used statistical approaches [57, 58] based on bigram
models for natural language to generate motions. Yamada et
al. [70] use separate autoencoders for text and animations
with a shared latent space to generate animations from text.
Ahn et al. [1] generate actions from natural language de-
scriptions for video data. However, their method only ap-
plies to upper-body joints (neck, shoulders, elbows, and
wrist joints) with a static root. Recent methods mentioned
in [52, 43, 4] used RNN based sequential networks to map
text inputs to motion. Plappert et al. [52] propose a bidirec-
tional RNN network to map text input to a series of Gaus-
sian distributions representing the joint angles of the skele-
ton. However, their input sequence is encoded into a single
one-hot vector that cannot scale as the input sequence be-
comes longer. Lin et al. [43] use an autoencoder architec-
ture to train on mocap data without language descriptions
first, and then use an RNN to map descriptions into these
motion representations. Ahuja et al. [4] learn a joint em-
bedding space for both pose and language using a curricu-
lum learning approach. Training a model jointly with both
pose and sentence inputs improves the generative power of
the model. However, these methods are limited to synthe-
size motion from simple sentences. Our model, by contrast,
handles long sentences describing multiple actions.
3. Proposed Method
We train our model end-to-end with a hierarchical two-
stream pose autoencoder, a sentence encoder, and pose dis-
criminator as shown in Fig. 2. Our model learns a joint
embedding between the natural language and the poses
of the upper body and the lower body. Our input motion $P = [P_0, \ldots, P_{T-1}]$ is a sequence of $T$ poses, where $P_t \in \mathbb{R}^{J \times 3}$ is the pose at the $t$-th time step, and $J \times 3$ indicates the $J$ joints of the skeleton with the $(x, y, z)$ coordinates of each joint. Our hierarchical two-stream pose encoder $pe$ encodes the ground-truth pose sequence $P$ into two manifold vectors,
$$pe(P) = (Z^p_{ub}, Z^p_{lb}), \quad (1)$$
where $Z^p_{ub}, Z^p_{lb} \in \mathbb{R}^h$ represent the features for the upper body and the lower body, respectively, and $h$ denotes the dimension of the latent space.
Our input sentence $S = [S_1, S_2, \ldots, S_W]$ is a sequence of $W$ words converted to word embeddings $\tilde{S}_w$ using the pre-trained BERT model [16]. $\tilde{S}_w \in \mathbb{R}^K$ represents the word embedding vector of the $w$-th word in the sentence, and $K$ is the dimension of the word embedding vector used. Our two-stream sentence encoder $se$ encodes the word embeddings and maps them to the latent space such that we have two latent vectors,
$$se(S) = (Z^s_{ub}, Z^s_{lb}), \quad (2)$$
where $Z^s_{ub}, Z^s_{lb} \in \mathbb{R}^h$ represent the sentence embeddings for the upper body and the lower body, respectively. Using an appropriate loss (see Section 3.2), we ensure that $(Z^p_{ub}, Z^p_{lb})$ and $(Z^s_{ub}, Z^s_{lb})$ lie close in the joint embedding space and carry similar information.
Our hierarchical two-stream pose decoder $de$ learns to generate poses from these two manifold vectors. As an initial input, the pose decoder uses the initial pose $P_t$ of time step $t = 0$ to generate the pose $\hat{P}_t$, which it uses recursively as input to generate the next pose $\hat{P}_{t+1}$. $\hat{P} \in \mathbb{R}^{T \times J \times 3}$ denotes a generated pose sequence. The output of our decoder module is a sequence of $T$ poses $\hat{P}^p \in \mathbb{R}^{T \times J \times 3}$ generated from the pose embeddings, and $\hat{P}^s \in \mathbb{R}^{T \times J \times 3}$ generated from the language embeddings:
$$\hat{P}^p = de(Z^p_{ub}, Z^p_{lb}) \quad (3)$$
$$\hat{P}^s = de(Z^s_{ub}, Z^s_{lb}). \quad (4)$$
We use a pose prediction loss term to ensure that $\hat{P}^p$ and $\hat{P}^s$ are similar (Section 3.2). $\hat{P} = \hat{P}^s$ is our final output pose sequence for a given sentence.
3.1. Network Architecture
The three main modules in our network are the two-
stream hierarchical pose encoder, the two-stream sentence
encoder and the two-stream hierarchical pose decoder. We
explain the architecture of all these modules.
3.1.1 Two-Stream Hierarchical Pose Encoder
We structure the pose encoder such that it learns features
from the different components of the body. Individual parts
are then combined hierarchically. We decompose the hu-
man skeleton into the five major parts - left arm, right arm,
trunk, left leg, and right leg as done in [18]. Our hierar-
chical pose encoder, as shown in Fig. 2, encodes these five parts using five linear layers with output dimension $h_1$. We combine the trunk representation with that of the left arm, right arm, left leg, and right leg and pass them through another set of linear layers to obtain combined representations of (left arm, trunk), (right arm, trunk), (left leg, trunk), and (right leg, trunk), each of dimension $h_2$. Two separate GRUs [12] encode the combined representations for the arms with trunk and the legs with trunk, respectively, thus creating two manifold representations – one for the upper body $(Z^p_{ub} \in \mathbb{R}^h)$ and the other for the lower body $(Z^p_{lb} \in \mathbb{R}^h)$. The outputs of the GRUs give these two manifold representations of dimension $h$.

Figure 2: Structure of our hierarchical two-stream model along with the pose discriminator. The model learns a joint embedding for both pose and language. The embedding has separate representations for the upper body and lower body movements.
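As an illustration of the encoder structure described above, a minimal PyTorch-style sketch is given below. The per-part input dimensions, the ReLU activations, and the exact concatenation order of the limb-with-trunk features are assumptions not specified in the text; only the hierarchy and the dimensions $h_1$, $h_2$, and $h$ follow the description.

```python
import torch
import torch.nn as nn

class TwoStreamPoseEncoder(nn.Module):
    """Sketch of the hierarchical two-stream pose encoder (part sizes are assumptions)."""
    def __init__(self, part_dims, h1=32, h2=128, h=512):
        super().__init__()
        # part_dims: per-part input sizes, e.g. {'left_arm': 12, 'trunk': 15, ...} (assumed keys)
        self.part_fc = nn.ModuleDict({p: nn.Linear(d, h1) for p, d in part_dims.items()})
        # Combine the trunk features with each limb: (limb, trunk) -> h2
        self.limb_trunk_fc = nn.ModuleDict({
            p: nn.Linear(2 * h1, h2) for p in ('left_arm', 'right_arm', 'left_leg', 'right_leg')})
        # One GRU per stream: arms + trunk -> upper body, legs + trunk -> lower body
        self.gru_ub = nn.GRU(2 * h2, h, batch_first=True)
        self.gru_lb = nn.GRU(2 * h2, h, batch_first=True)

    def forward(self, parts):
        # parts: dict of tensors with shape (batch, T, part_dim)
        f = {p: torch.relu(fc(parts[p])) for p, fc in self.part_fc.items()}
        comb = {p: torch.relu(self.limb_trunk_fc[p](torch.cat([f[p], f['trunk']], dim=-1)))
                for p in self.limb_trunk_fc}
        upper = torch.cat([comb['left_arm'], comb['right_arm']], dim=-1)   # (batch, T, 2*h2)
        lower = torch.cat([comb['left_leg'], comb['right_leg']], dim=-1)
        _, z_ub = self.gru_ub(upper)     # final hidden state: (1, batch, h)
        _, z_lb = self.gru_lb(lower)
        return z_ub[-1], z_lb[-1]        # Z_p_ub, Z_p_lb, each of shape (batch, h)
```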
3.1.2 Two-Stream Sentence Encoder
To represent the text input, we use the pre-trained large-cased model of BERT [16] as a contextualized language model. It comprises 24 subsequent layers, each representing different linguistic notions of syntax or semantics [13]. To find the layers that focus on local context (e.g., adverbs of a verb) [62], we use the attention visualization tool [64] with
randomly selected samples of the KIT Motion Language
dataset [51]. Thus, we select the layers 12 (corresponding
to subject(s)), 13 (adverb(s)), 14 (verb(s)) and 15 (preposi-
tional object(s)) and concatenate the hidden states of these
layers in order to represent the corresponding word. Formally, $\tilde{S}_w \in \mathbb{R}^K$ represents the word embedding vector of the $w$-th word in the sentence $S$, and $K$ is the dimension of the word embedding vector used. Our sentence encoder $se$ uses Long Short-Term Memory units (LSTMs) [53] to capture the long-range dependencies of a complex sentence. We input the word embeddings to a two-layer LSTM, which generates $Z^s \in \mathbb{R}^{2h}$, where
$$LSTM(\tilde{S}_w) = Z^s = [Z^s_{ub}, Z^s_{lb}] \quad (5)$$
is the latent embedding of the whole sentence, with $\tilde{S}_w = BERT(S_w)$. We use the first half of this embedding as $Z^s_{ub} \in \mathbb{R}^h$ to represent the upper body and the second half as $Z^s_{lb} \in \mathbb{R}^h$ to represent the lower body.
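A minimal sketch of this two-stream sentence encoder is shown below. It assumes the HuggingFace transformers API for BERT and assumes that the final LSTM hidden state is taken as the sentence embedding $Z^s$; both choices go beyond what the text above specifies.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer  # assumption: HuggingFace transformers API

def bert_word_embeddings(sentence, tokenizer, bert, layers=(12, 13, 14, 15)):
    """Concatenate the hidden states of the selected BERT layers (4 x 1024 = 4096 = K)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs, output_hidden_states=True)
    hidden = out.hidden_states                   # tuple: embedding layer + 24 encoder layers
    return torch.cat([hidden[l] for l in layers], dim=-1)   # (1, W, K)

class TwoStreamSentenceEncoder(nn.Module):
    def __init__(self, K=4096, h=512):
        super().__init__()
        # Two-layer LSTM mapping word embeddings to a 2h-dimensional sentence embedding.
        self.lstm = nn.LSTM(K, 2 * h, num_layers=2, batch_first=True)
        self.h = h

    def forward(self, word_embeddings):          # (batch, W, K)
        _, (h_n, _) = self.lstm(word_embeddings)
        z_s = h_n[-1]                            # top-layer final hidden state: (batch, 2h)
        return z_s[:, :self.h], z_s[:, self.h:]  # (Z_s_ub, Z_s_lb)

# Usage sketch:
# tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
# bert = BertModel.from_pretrained("bert-large-cased")
# emb = bert_word_embeddings("a person walks forward", tokenizer, bert)
# z_ub, z_lb = TwoStreamSentenceEncoder()(emb)
```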
3.1.3 Two-Stream Hierarchical Pose Decoder
We can conceptually unfold our pose decoder as a series of $T$ hierarchical decoder units, each constructing the output pose $\hat{P}_t$, $t = 0, \ldots, T$, in a recurrent fashion by taking in the generated pose at the corresponding previous time step. We add a residual connection between the input and the output of the individual decoder units as shown in Fig. 2. Each decoder unit consists of two GRUs and a series of linear layers structured hierarchically. The hierarchical structure of the linear layers in the decoder unit mirrors that of the pose encoder. Conditioned on the latent space vector representing the previous frames, the GRUs and the hierarchical linear layers Hier (as shown in Fig. 2) output the reconstructed pose $\hat{P}_{t+1}$ at the $(t+1)$-th frame given its previous pose $\hat{P}_t$.
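The recurrent decoding step can be sketched as follows. The internal layout of the hierarchical layers (here a stand-in MLP named hier), the use of the manifold vectors as initial GRU hidden states, and the way the two streams are merged into a full pose are assumptions; the GRU recurrence and the residual connection follow the description above.

```python
import torch
import torch.nn as nn

class TwoStreamPoseDecoder(nn.Module):
    """Sketch of the recurrent two-stream decoder with a residual connection."""
    def __init__(self, pose_dim, h=512):
        super().__init__()
        self.gru_ub = nn.GRUCell(pose_dim, h)
        self.gru_lb = nn.GRUCell(pose_dim, h)
        # Stand-in for 'Hier': maps the two hidden states back to a full pose (assumption).
        self.hier = nn.Sequential(nn.Linear(2 * h, h), nn.ReLU(), nn.Linear(h, pose_dim))

    def forward(self, p0, z_ub, z_lb, T):
        # p0: initial pose (batch, pose_dim); z_ub, z_lb: manifold vectors (batch, h)
        poses, p_t = [], p0
        h_ub, h_lb = z_ub, z_lb            # condition the GRUs on the manifold vectors
        for _ in range(T - 1):
            h_ub = self.gru_ub(p_t, h_ub)
            h_lb = self.gru_lb(p_t, h_lb)
            delta = self.hier(torch.cat([h_ub, h_lb], dim=-1))
            p_t = p_t + delta              # residual connection between input and output
            poses.append(p_t)
        return torch.stack([p0] + poses, dim=1)   # (batch, T, pose_dim)
```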
3.2. Optimizing the Training Procedure
We train our model end-to-end with a hierarchical two-
stream pose autoencoder along with a sentence encoder as
shown in Fig. 2. Our model learns a joint embedding space
between the natural language and the poses of the upper
body and the lower body. Our decoder is trained twice in
each pass: once with $(Z^p_{ub}, Z^p_{lb})$ obtained from $pe$ to generate the pose sequence $\hat{P}^p$, and the second time with the $(Z^s_{ub}, Z^s_{lb})$ obtained from $se$, which generates the pose sequence $\hat{P} = \hat{P}^s$.
Loss functions. We use the smooth $\ell_1$ loss as a distance metric to train our model. The smooth $\ell_1$ loss is less sensitive to outliers than the smoother $\ell_2$ loss, and more stable than the $\ell_1$ loss as it is differentiable near $x = 0$ for all $x \in \mathbb{R}$ [4]. We use the following losses while training the
whole model:
Pose Prediction loss: It minimizes the difference between the input ground-truth motion $P$ and the predicted motions $\hat{P} = \hat{P}^s$ and $\hat{P}^p$. We measure it as
$$L_R = L(\hat{P}^s, P) + L(\hat{P}^p, P), \quad (6)$$
where $L$ denotes the smooth $\ell_1$ loss between the two terms.
Manifold reconstruction loss: This encourages a reciprocal mapping between the generated motions and the manifold representations to improve the manifold space [36]. For that, we reconstruct the manifold representations from the generated poses as $(\hat{Z}^p_{ub}, \hat{Z}^p_{lb}) = pe(\hat{P})$ and compare them with the manifold representations obtained from the input pose sequence. We compute the loss as
$$L_M = L(\hat{Z}^p_{ub}, Z^p_{ub}) + L(\hat{Z}^p_{lb}, Z^p_{lb}). \quad (7)$$
Velocity reconstruction loss: We minimize the difference between the velocity of the reconstructed motion $\hat{P}_{vel}$ and the velocity of the input motion $P_{vel}$. We compute the velocity of the $t$-th frame of a pose $P$ as $P_{vel}(t) = P(t+1) - P(t)$. We compute $L_V$ as
$$L_V = L(\hat{P}_{vel}, P_{vel}). \quad (8)$$
Embedding similarity loss: We use this loss to ensure that the manifold representations $Z^s_{ub}$ and $Z^s_{lb}$ generated by the sentence encoder are similar to the manifold representations $Z^p_{ub}$ and $Z^p_{lb}$ generated by the pose encoder. We measure it as
$$L_E = L(Z^p_{ub}, Z^s_{ub}) + L(Z^p_{lb}, Z^s_{lb}). \quad (9)$$
Adversarial loss: We further employ a binary cross-entropy discriminator $D$ to distinguish between the real and generated poses. We compute the corresponding discriminator and “generator” losses as
$$L_D = L_2(D(\hat{P}), 0) + L_2(D(P), 1) \quad (10)$$
$$L_G = L_2(D(\hat{P}), 1), \quad (11)$$
where $L_2$ denotes the binary cross-entropy loss, and the “generator” is the decoder of the autoencoder.
We train the model end-to-end with the pose autoencoder, the sentence encoder, and the discriminator modules on a weighted sum of these loss terms as
$$\min_{pe, se, de} \left( L_R + \lambda_M L_M + \lambda_V L_V + \lambda_E L_E + \lambda_G L_G \right)$$
$$\min_{D} \left( \lambda_G L_D \right), \quad (12)$$
where $\lambda_M = 0.001$, $\lambda_V = 0.1$, $\lambda_E = 0.1$, and $\lambda_G = 0.001$ are weight parameters, obtained experimentally.
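For concreteness, one training step over the loss terms of Eqs. (6)–(12) could be sketched as below, treating $pe$, $se$, $de$, and $D$ as black-box callables. The tensor shapes, the use of $\hat{P}^s$ for the manifold, velocity, and adversarial terms, and a sigmoid output for $D$ are assumptions.

```python
import torch
import torch.nn.functional as F

def training_losses(pe, se, de, D, P, S_emb, lambdas=(0.001, 0.1, 0.1, 0.001)):
    """Generator-side and discriminator-side losses of Eq. (12) for one batch.
    P: ground-truth poses (batch, T, pose_dim); S_emb: BERT word embeddings (batch, W, K)."""
    lam_M, lam_V, lam_E, lam_G = lambdas
    T, P0 = P.shape[1], P[:, 0]

    z_p_ub, z_p_lb = pe(P)                    # pose manifold vectors, Eq. (1)
    z_s_ub, z_s_lb = se(S_emb)                # sentence manifold vectors, Eq. (2)
    P_hat_p = de(P0, z_p_ub, z_p_lb, T)       # decoded from pose embeddings, Eq. (3)
    P_hat_s = de(P0, z_s_ub, z_s_lb, T)       # decoded from sentence embeddings, Eq. (4)

    L = F.smooth_l1_loss
    L_R = L(P_hat_s, P) + L(P_hat_p, P)                               # Eq. (6)
    z_hat_ub, z_hat_lb = pe(P_hat_s)          # re-encode the generated motion (assumption)
    L_M = L(z_hat_ub, z_p_ub) + L(z_hat_lb, z_p_lb)                   # Eq. (7)
    L_V = L(P_hat_s[:, 1:] - P_hat_s[:, :-1], P[:, 1:] - P[:, :-1])   # Eq. (8)
    L_E = L(z_p_ub, z_s_ub) + L(z_p_lb, z_s_lb)                       # Eq. (9)

    bce = F.binary_cross_entropy              # assumes D ends in a sigmoid
    d_real, d_fake = D(P), D(P_hat_s.detach())
    L_D = bce(d_fake, torch.zeros_like(d_fake)) + bce(d_real, torch.ones_like(d_real))  # Eq. (10)
    d_gen = D(P_hat_s)
    L_G = bce(d_gen, torch.ones_like(d_gen))                          # Eq. (11)

    gen_loss = L_R + lam_M * L_M + lam_V * L_V + lam_E * L_E + lam_G * L_G
    disc_loss = lam_G * L_D
    return gen_loss, disc_loss                # minimized by separate optimizers, Eq. (12)
```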
4. Experiments
This section describes the dataset we use for our experiments and reports the quantitative and qualitative perfor-
mance of our method. We also highlight the benefits of the
different components of our method via ablation studies.
4.1. Dataset
We evaluate our model on the publicly available KIT
Motion-Language Dataset [51] which consists of 3,911
recordings of human whole-body motion in MMM repre-
sentation [63, 44] with natural language descriptions cor-
responding to each motion. There is a total of 6,278 annotations in natural language, with each motion recording having one or multiple annotations describing the task.
The sentences range from describing simple actions such as
walking forwards or waving the hand to describing motions
with complicated movements such as waltzing. Moreover,
there are longer, more descriptive sentences describing a se-
quence of multiple actions, e.g., “A human walks forwards
two steps, pivots 180 degrees and walks two steps back to
where they started”. We split the whole dataset into random
samples in the ratio of 0.6, 0.2, and 0.2 for training, vali-
dation, and test sets. For better comparison with the state-
of-the-art [4, 43], we pre-process the given motion data in
the same manner as done in [4, 43]. Following the method
of Holden et al. [34], we use the character’s joint positions
with respect to the local coordinate frame and the charac-
ter’s trajectory of movement in the global coordinate frame.
We have $J = 21$ joints, each having $(x, y, z)$ coordinates, and a separate dimension for representing the global trajectory for the root joint. Similar to [4, 43], we sub-sample the motion sequences to a frequency of 12.5 Hz from 100 Hz.
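As a reference, the sub-sampling (100 Hz to 12.5 Hz corresponds to keeping every 8th frame) and the random 0.6/0.2/0.2 split could be sketched as below; the random-seed handling is an assumption.

```python
import numpy as np

def subsample(motion, src_hz=100.0, dst_hz=12.5):
    """Keep every (src_hz / dst_hz)-th frame, i.e. every 8th frame for 100 Hz -> 12.5 Hz."""
    step = int(round(src_hz / dst_hz))
    return motion[::step]

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Random 0.6 / 0.2 / 0.2 split into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_train, n_val = int(ratios[0] * len(samples)), int(ratios[1] * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test
```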
4.2. Implementation Details
We trained our model for 350 epochs using the Adam optimizer. Total training time was approximately 20 hours using an NVIDIA Tesla V100. The dimensions of our hidden layers in the hierarchical autoencoder are $h_1 = 32$, $h_2 = 128$, and $h = 512$. We used a batch size of 32 and a learning rate of 0.001 with exponential decay. For training the sentence encoder, we converted the given sentences to word embeddings of dimension $K = 4{,}096$ using selected layers of the pre-trained BERT-large-cased model (details in Section 3.1.2). We encoded these embeddings to a dimension of 1024 through the sentence encoder and split the result to obtain two manifold representations $Z^s_{ub} \in \mathbb{R}^h$ and $Z^s_{lb} \in \mathbb{R}^h$, each of dimension $h = 512$.

Figure 3: Comparison of consecutive frames of the animations generated by our method (top row), Lin et al. [43] (middle row), and JL2P [4] (bottom row) for the given sentences. Our method generates clear kicking and dancing motions, in contrast to JL2P and Lin et al., which do not show any prominent movements. The perplexity values of the sentences are according to [51].
4.3. Quantitative Evaluation Metrics
To quantitatively evaluate the correctness of our motion,
we use the Average Position Error (APE). APE measures the average positional difference for a joint $j$ between the generated poses and the ground-truth pose sequence as
$$APE[j] = \frac{1}{NT} \sum_{n \in N} \sum_{t \in T} \left\| P_t[j] - \hat{P}_t[j] \right\|_2, \quad (13)$$
where $T$ is the total number of time steps and $N$ is the total number of samples in our test dataset.
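Eq. (13) translates directly into the short NumPy sketch below; the (N, T, J, 3) array layout is an assumption.

```python
import numpy as np

def average_position_error(P, P_hat):
    """APE per joint, Eq. (13). P, P_hat: arrays of shape (N, T, J, 3) in mm."""
    diff = np.linalg.norm(P - P_hat, axis=-1)     # Euclidean distance per joint: (N, T, J)
    return diff.mean(axis=(0, 1))                 # average over samples and time: (J,)
```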
Given our setting of natural language descriptions and
corresponding free-form movements, it is naturally diffi-
cult to find a quantitative measure that does justice to both
modalities. For example, in a walking setting, sentences
that do not mention any direction correspond to a wider va-
riety of plausible motions, while specifying a direction nar-
rows the possibilities. To account for such discrepancies,
we separate the APEs between the local joint positions and
the global root trajectory. The former corresponds to the er-
ror of the overall poses, while the latter corresponds to the
overall direction and trajectory of the motion.
However, the average position of each joint only captures the mean of the generated motion relative to the dataset. To capture the full statistics of the overall distribution compared to the dataset, we also compute the Average Variance Error (AVE), which measures the difference between the variances of individual joints of the generated poses and of the ground-truth poses. We calculate the variance of an individual joint $j$ for a pose sequence $P$ with $T$ time steps as
$$\sigma[j] = \frac{1}{T-1} \sum_{t \in T} \left( P_t[j] - \tilde{P}[j] \right)^2, \quad (14)$$
where $\tilde{P}[j]$ is the mean pose over $T$ time steps for the joint $j$. Calculating the variance for all joints of the ground-truth poses and the generated poses, we use their root mean square error as the AVE metric:
$$AVE[j] = \frac{1}{N} \sum_{n \in N} \left\| \sigma[j] - \hat{\sigma}[j] \right\|_2, \quad (15)$$
where $\sigma$ refers to the ground-truth pose variance and $\hat{\sigma}$ to the generated pose variance.
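Eqs. (14) and (15) can be implemented as below, under the same assumed array layout as for APE.

```python
import numpy as np

def joint_variance(P):
    """Per-joint variance over time, Eq. (14). P: (N, T, J, 3)."""
    mean_pose = P.mean(axis=1, keepdims=True)                      # \tilde{P}[j]: mean over T
    return ((P - mean_pose) ** 2).sum(axis=1) / (P.shape[1] - 1)   # (N, J, 3)

def average_variance_error(P, P_hat):
    """AVE per joint, Eq. (15): RMS difference of the per-joint variances."""
    sigma, sigma_hat = joint_variance(P), joint_variance(P_hat)
    return np.linalg.norm(sigma - sigma_hat, axis=-1).mean(axis=0)  # (J,)
```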
However, even this measure does not account for any
information regarding the sentences or sentence encodings
themselves. Therefore, we propose a Content Encoding Error (CEE), which corresponds to the embedding similarity loss $L_E$ in Eq. 9 and measures the effectiveness of the embedding space. We calculate CEE as the difference between the manifold representations $Z^p = [Z^p_{ub}, Z^p_{lb}]$ (obtained by encoding the input poses $P$ through the pose encoder $pe$) and the manifold representations $Z^s = [Z^s_{ub}, Z^s_{lb}]$ (obtained by encoding the corresponding input sentences using the sentence encoder $se$). We write it as
$$CEE(S, P) = \frac{1}{MN} \sum_{n \in N} \sum_{m \in M} \left\| Z^s - Z^p \right\|_2, \quad (16)$$
where $M$ is the number of features in the manifold representation and $N$ is the total number of samples. The idea is to measure how well the joint embedding space correlates the latent embeddings of poses with the latent embeddings of the corresponding sentences.
To further account for style factors in the motion and the sentences, we propose a Style Encoding Error (SEE). SEE compares summary statistics of the sentence embeddings $Z^s$ and the pose embeddings $Z^p$ to account for general style information. We compute the Gram matrix [22, 21] $G$ on the corresponding embeddings:
$$G^s = Z^s \cdot Z^{s\top} \quad (17)$$
$$G^p = Z^p \cdot Z^{p\top} \quad (18)$$
We compute SEE as
$$SEE(S, P) = \frac{1}{MN} \sum_{n \in N} \sum_{m \in M} \left\| G^s - G^p \right\|_2, \quad (19)$$
where $M$ is the number of features in the manifold representation and $N$ is the total number of samples.
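Both encoding errors compare latent representations. A combined sketch of Eqs. (16)–(19) is given below, assuming the embeddings are stored as (N, M) arrays, that the norm inside the double sum is applied element-wise, and that the Gram matrix is computed per sample as an outer product of the embedding with itself.

```python
import numpy as np

def content_encoding_error(Z_s, Z_p):
    """CEE, Eq. (16): average difference between sentence and pose embeddings (N, M)."""
    return np.abs(Z_s - Z_p).mean()

def style_encoding_error(Z_s, Z_p):
    """SEE, Eq. (19): same comparison on the Gram matrices of Eqs. (17)-(18)."""
    G_s = np.einsum('nm,nk->nmk', Z_s, Z_s)    # per-sample Gram matrices: (N, M, M)
    G_p = np.einsum('nm,nk->nmk', Z_p, Z_p)
    return np.abs(G_s - G_p).mean()
```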
4.4. Ablation Studies
We compare the performance of our model with the following four ablated versions of itself:
Ablation 1: Two-stream hierarchical model with-
out jointly training the embedding space (w/o JT).
Instead of end-to-end training of the model, we trained
the hierarchical pose encoder and decoder first, using
the loss terms $L_R$, $L_M$, $L_V$, $L_G$, and $L_D$ (discussed in Section 3.2). We then trained the model with the sentence encoder and the pose decoder with $L_R$ and $L_E$. This indicates that the model is not learning a joint
embedding space for pose and language but learns the
embedding space for poses first and then fine-tunes to
map the sentences.
Ablation 2: Hierarchical model without the two-
stream representation (w/o 2-St). We used a single
manifold representation for the whole body instead of
separating the upper and lower body and trained the
model jointly on language and pose inputs.
Ablation 3: Training the hierarchical two-stream
model without the extra losses (w/o Lo). We dis-
carded the additional loss terms introduced in the pa-
per in Section 3.2 and only used the pose prediction
loss LRto train our model.
Ablation 4: Using a pre-trained language model in-
stead of selected layers of BERT (w/o BERT). We
used a pre-trained Word2Vec model [48] as done in [4]
to convert the input sentence into word embeddings in-
stead of selecting layers of BERT as mentioned in Sec-
tion 3.1.2. This ablation shows how BERT, as a contextualized language model, helps the model focus on the local context within a sentence.
4.5. User Study
To evaluate our ablation studies, we conducted a user
study to observe the subjective judgment of the quality of
our generated motions compared to the quality of motions
generated from the ablations described in Section 4.4. We
asked 23 participants to rank 14 motion videos from the five
methods and from the ground-truth motion-captures, based
on whether the motion corresponds to the input text, and by
the quality and naturalness of the motions. The five meth-
ods include our method and the four ablations of our model
– ‘w/o JT’, ‘w/o 2-St’, ‘w/o Lo’, and ‘w/o BERT’. We quan-
tified the user study with two preference scores – the first
one describing if the participants found the motions to cor-
respond to the input sentence (“yes/no”), and the second one
rating the overall quality of the motion in terms of naturalness (from 1 = “most natural” to 6 = “least natural”, which we then scaled to the range 0 to 1 and inverted). We observe that our method has a preference score of 40% in both cases, second only to the ground-truth motion, as seen in Fig. 5.¹
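For clarity, the mapping from the 1–6 naturalness ranking to a preference score in [0, 1] could be done as sketched below; the exact linear scaling is an assumption, as the text only states that ratings were scaled to 0 and 1 and inverted.

```python
def naturalness_score(rating, worst=6, best=1):
    """Map a rating (1 = most natural, 6 = least natural) to [0, 1], higher is better."""
    return (worst - rating) / (worst - best)

# Example: naturalness_score(1) == 1.0, naturalness_score(6) == 0.0
```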
5. Results and Discussion
We compare our method with the state-of-the-art Joint
Language to Pose (JL2P) method [4], and the proposed ap-
proach by Lin et al. [43]. We have used the pre-trained
models for both JL2P and Lin et al.’s approach, provided
by Ahuja et al. [4], to calculate the quantitative results. We
computed all the results on the test dataset.
5.1. Objective Evaluation
Fig. 4 shows the improvement of our method compared
to JL2P and Lin et al. for all the metrics discussed in Sec-
tion 4.3. Our method shows an improvement of 55.4% in
the mean APE calculated for all local joints compared to
JL2P and by 58.4% compared to Lin et al. When included
with the global trajectory, our method still shows an im-
provement of 55.7% in mean APE compared to JL2P and
an improvement of 58.7% in mean APE compared to Lin et
al. (Fig. 4 left).²
We also observe that a high error in the root joint leads to either foot sliding in the motion or an averaging-out of the whole motion. Improvement in the error values for the root joint indicates high-quality animations without any artifacts like
foot sliding. Furthermore, our method shows closer resem-
blances to the variance of the ground truth motion compared
to the state-of-the-art models (Fig. 4 center). Our method
1We decided to exclude [4] and [43] from the user study, based on
overwhelming feedback from participants that our method beats the state-
of-the-art in the most obvious ways. We provide additional qualitative
results for these in the supplementary material.
2We note that our reported numbers for the state-of-the-art methods in
the APE metric are different from the original paper. However, we were
unable to replicate the numbers in the original paper using the code and the
pre-trained model provided by the authors.
Figure 4: Plots showing the APE (left), AVE (middle), and CEE and SEE (right) in mm for our model compared to JL2P [4] and Lin et al. [43]. The dark blue line denotes our method, grey denotes JL2P, and light blue denotes the method of Lin et al. Lower values are better. Our method improves over the state-of-the-art by over 50% on all benchmarks.
Figure 5: Semantic accuracy in percent, denoting how well the motion visually corresponds to the input sentences (left), and motion quality in percent, showing how good the overall quality of the motion is in terms of naturalness (right). Higher values are better. Our method's score, denoted in red, has the highest percentage compared to the ablations.
has an improvement of 50.4% in the AVE over the mean
of all joints with global trajectory compared to JL2P, and
an improvement of 50.6% over the mean of all joints with
global trajectory compared to Lin et al. We provide detailed
APE and AVE values of individual joints in the supplemen-
tary material.
We also show improvements of 50% in the CEE and SEE
metrics compared to JL2P. Compared to Lin et al., we show
improvements of 72.3% and 83.1% in the CEE and SEE,
respectively (Fig. 4 right). These results show that the joint
embedding space learned by our method can correlate the
poses and corresponding sentences better than the state-of-
the-art methods.
5.2. Qualitative Results
To qualitatively compare our best model against the
state-of-the-art methods [4, 43], we examine the generated
motions from all the methods. Fig. 3 shows two motions
with rather high perplexity [51] compared to the average
movements in the dataset. Our method (top row) accurately
generates the kicking action with the correct foot and right
arm positions as described in the sentence, while the bench-
mark models fail to generate a kick at all (left). Fig. 3
(right) further shows that the Waltz dance is more promi-
nent in our model, compared to both benchmarks where arm
movements seem to be missing completely, and the skeleton tends to slide rather than actually step. Fig. 6 shows screen-
shots with motions generated from rather complex sentence
semantics. Our method (top row) accurately synthesizes a
trajectory that matches the semantics of the sentence. Al-
though Ahuja et al. [4] generate a circular trajectory (bottom
right), their walking direction does not match the semantics
of the sentence, while Lin et al. [43] fail to generate a cir-
cular trajectory at all. Both methods also cannot synthesize
correct turning motions (Fig. 6 left and center columns).
6. Limitations, Future Work and Conclusion
We presented a novel framework that advances the state-
of-the-art methods on text-based motion synthesis on quali-
tative evaluations and several objective benchmarks. While
our model accurately synthesizes superimposed actions it encountered during training, generalization to novel superimpositions is not always successful. We intend
to extend our model into a zero- or few-shot paradigm [55]
such that it generates simultaneous actions from input sen-
tences without being trained on those specific combina-
tions. We also plan to experiment with narration-based tran-
scripts that describe long sequences of step-by-step actions
involving multiple people, e.g., narration-based paragraphs
depicting step-by-step movements for performing complex
actions such as dance, work-outs, or professional training
videos. To this end, a different embedding that explicitly models the sequential nature of the task may be more suitable. However, that may reduce the model’s ability to synthesize actions not described in an exact sequential manner. Furthermore, general motion quality, in terms of foot sliding, limb constraints, and biomechanical plausibility, can be improved by introducing physical constraints [54] into the model.

Figure 6: Comparison of the animations generated by our method (top row) with Lin et al. [43] (middle row) and Ahuja et al. [4] (bottom row) for long sentences indicating direction and number of steps. The orange cross denotes the starting point and the green cross the end point of the motion. The blue line on the plane is the trajectory, and the black dots represent the footsteps. Our method is clearly able to follow the semantics of the sentences, while the state-of-the-art methods fail.
Being able to model a variety of motions and handle such
complex sentence structures is an essential next step in gen-
erating realistic animations for mixtures of actions in the
long-term and improving the practical applicability of text-
based motion synthesis systems. To the best of our knowl-
edge, this is the first work to achieve this quality of motion
synthesis on the benchmark dataset and is an integral step
towards script-based animations.
Acknowledgements. We would like to thank all the par-
ticipants in our user study, as well as the XAINES part-
ners for their valuable feedback. This research was funded
by the BMBF grants XAINES (01IW20005) and IMPRESS (01IS20076), as well as the EU Horizon 2020 grant
Carousel+ (101017779) and an IMPRS-CS Fellowship.
References
[1] H. Ahn, T. Ha, Y. Choi, H. Yoo, and S. Oh. Text2action:
Generative adversarial synthesis from language to action. In
2018 IEEE International Conference on Robotics and Au-
tomation (ICRA), pages 5915–5920, 2018.
[2] Chaitanya Ahuja. Coalescing narrative and dialogue for
grounded pose forecasting. In 2019 International Confer-
ence on Multimodal Interaction, pages 477–481, 2019.
[3] Chaitanya Ahuja, Dong Won Lee, Yukiko I Nakano, and
Louis-Philippe Morency. Style transfer for co-speech gesture
animation: A multi-speaker conditional-mixture approach.
In European Conference on Computer Vision, pages 248–
265. Springer, 2020.
[4] C. Ahuja and L. Morency. Language2pose: Natural lan-
guage grounded pose forecasting. In 2019 International
Conference on 3D Vision (3DV), pages 719–728, 2019.
[5] Emre Aksan, Manuel Kaufmann, and Otmar Hilliges. Struc-
tured prediction helps 3d human motion modelling. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 7144–7153, 2019.
[6] Andreas Aristidou, Daniel Cohen-Or, Jessica K Hodgins,
Yiorgos Chrysanthou, and Ariel Shamir. Deep motifs and
motion signatures. ACM Transactions on Graphics (TOG),
37(6):1–13, 2018.
Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe
Morency. Multimodal machine learning: A survey and tax-
onomy. IEEE transactions on pattern analysis and machine
intelligence, 41(2):423–443, 2018.
[8] Emad Barsoum, John Kender, and Zicheng Liu. Hp-gan:
Probabilistic 3d human motion prediction via gan. In Pro-
ceedings of the IEEE conference on computer vision and pat-
tern recognition workshops, pages 1418–1427, 2018.
[9] Yu-Wei Chao, Jimei Yang, Brian Price, Scott Cohen, and Jia
Deng. Forecasting human dynamics from static images. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 548–556, 2017.
[10] Noshaba Cheema, Somayeh Hosseini, Janis Sprenger, Erik
Herrmann, Han Du, Klaus Fischer, and Philipp Slusallek.
Dilated temporal fully-convolutional network for seman-
tic segmentation of motion capture data. arXiv preprint
arXiv:1806.09174, 2018.
[11] Hsu-kuang Chiu, Ehsan Adeli, Borui Wang, De-An Huang,
and Juan Carlos Niebles. Action-agnostic human pose fore-
casting. In 2019 IEEE Winter Conference on Applications of
Computer Vision (WACV), pages 1423–1432. IEEE, 2019.
Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau,
and Yoshua Bengio. On the properties of neural machine
translation: Encoder-decoder approaches. arXiv preprint
arXiv:1409.1259, 2014.
[13] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christo-
pher D Manning. What does bert look at? an analysis of
bert’s attention. In Proceedings of the 2019 ACL Workshop
BlackboxNLP: Analyzing and Interpreting Neural Networks
for NLP, pages 276–286, 2019.
[14] Qiongjie Cui, Huaijiang Sun, Yue Kong, Xiaoqian Zhang,
and Yanmeng Li. Efficient human motion prediction using
temporal convolutional generative adversarial network. In-
formation Sciences, 545:427–447, 2021.
[15] Qiongjie Cui, Huaijiang Sun, and Fei Yang. Learning dy-
namic relationships for 3d human motion prediction. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 6519–6527, 2020.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
[17] Han Du, Erik Herrmann, Janis Sprenger, Noshaba Cheema,
Somayeh Hosseini, Klaus Fischer, and Philipp Slusallek.
Stylistic locomotion modeling with conditional variational
autoencoder. In Eurographics (Short Papers), pages 9–12,
2019.
[18] Yong Du, Wei Wang, and Liang Wang. Hierarchical recur-
rent neural network for skeleton based action recognition. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 1110–1118, 2015.
[19] Joao P Ferreira, Thiago M Coutinho, Thiago L Gomes,
José F Neto, Rafael Azevedo, Renato Martins, and Erick-
son R Nascimento. Learning to dance: A graph convolu-
tional adversarial network to generate realistic dance mo-
tions from audio. Computers & Graphics, 94:11–21, 2020.
[20] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Ji-
tendra Malik. Recurrent network models for human dynam-
ics. In Proceedings of the IEEE International Conference on
Computer Vision, pages 4346–4354, 2015.
[21] Leon A Gatys, Alexander S Ecker, and Matthias Bethge.
Texture synthesis using convolutional neural networks. arXiv
preprint arXiv:1505.07376, 2015.
[22] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Im-
age style transfer using convolutional neural networks. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2414–2423, 2016.
[23] Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges.
Learning human motion models for long-term predictions.
In 2017 International Conference on 3D Vision (3DV), pages
458–466. IEEE, 2017.
[24] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial networks. arXiv
preprint arXiv:1406.2661, 2014.
[25] Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and
José MF Moura. Adversarial geometry-aware human mo-
tion prediction. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 786–803, 2018.
[26] Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe
Yearsley, and Taku Komura. A recurrent variational autoen-
coder for human motion synthesis. In 28th British Machine
Vision Conference, 2017.
[27] Eva Hanser, Paul Mc Kevitt, Tom Lunney, and Joan Con-
dell. Scenemaker: Intelligent multimodal visualisation of
natural language scripts. In Irish Conference on Artificial In-
telligence and Cognitive Science, pages 144–153. Springer,
2009.
[28] Irtiza Hasan, Francesco Setti, Theodore Tsesmelis, Vasileios
Belagiannis, Sikandar Amin, Alessio Del Bue, Marco
Cristani, and Fabio Galasso. Forecasting people trajectories
and head poses by jointly reasoning on tracklets and vislets.
IEEE transactions on pattern analysis and machine intelli-
gence, 2019.
[29] Irtiza Hasan, Francesco Setti, Theodore Tsesmelis, Alessio
Del Bue, Marco Cristani, and Fabio Galasso. “Seeing is believing”: Pedestrian trajectory forecasting using visual frus-
tum of attention. In 2018 IEEE Winter Conference on Ap-
plications of Computer Vision (WACV), pages 1178–1185.
IEEE, 2018.
[30] Jun Hatori, Yuta Kikuchi, Sosuke Kobayashi, Kuniyuki
Takahashi, Yuta Tsuboi, Yuya Unno, Wilson Ko, and Jethro
Tan. Interactively picking real-world objects with uncon-
strained spoken language instructions. In 2018 IEEE In-
ternational Conference on Robotics and Automation (ICRA),
pages 3774–3781. IEEE, 2018.
[31] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow.
Moglow: Probabilistic and controllable motion synthesis
using normalising flows. ACM Transactions on Graphics
(TOG), 39(6):1–14, 2020.
[32] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin
Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wo-
jciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin,
et al. Grounded language learning in a simulated 3d world.
arXiv preprint arXiv:1706.06551, 2017.
[33] Daniel Holden, Taku Komura, and Jun Saito. Phase-
functioned neural networks for character control. ACM
Transactions on Graphics (TOG), 36(4):1–13, 2017.
[34] Daniel Holden, Jun Saito, and Taku Komura. A deep learning
framework for character motion synthesis and editing. ACM
Transactions on Graphics (TOG), 35(4):1–11, 2016.
[35] Daniel Holden, Jun Saito, Taku Komura, and Thomas
Joyce. Learning motion manifolds with convolutional au-
toencoders. In SIGGRAPH Asia 2015 Technical Briefs,
pages 1–4. 2015.
[36] Deok-Kyeong Jang and Sung-Hee Lee. Constructing hu-
man motion manifold with sequential networks. In Computer
Graphics Forum, volume 39, pages 314–324. Wiley Online
Library, 2020.
[37] Jogendra Nath Kundu, Himanshu Buckchash, Priyanka
Mandikal, Anirudh Jamkhandi, Venkatesh Babu RAD-
HAKRISHNAN, et al. Cross-conditioned recurrent networks
for long-term synthesis of inter-person human motion inter-
actions. In Proceedings of the IEEE/CVF Winter Confer-
ence on Applications of Computer Vision, pages 2724–2733,
2020.
[38] Jogendra Nath Kundu, Maharshi Gor, and R Venkatesh
Babu. Bihmp-gan: Bidirectional 3d human motion predic-
tion gan. In Proceedings of the AAAI conference on artificial
intelligence, volume 33, pages 8553–8560, 2019.
[39] Jasper LaFortune and Kristen L Macuga. Learning move-
ments from a virtual instructor: Effects of spatial orientation,
immersion, and expertise. Journal of Experimental Psychol-
ogy: Applied, 24(4):521, 2018.
[40] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and
Gregory D Hager. Temporal convolutional networks for ac-
tion segmentation and detection. In proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 156–165, 2017.
[41] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. Con-
volutional sequence to sequence model for human dynamics.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 5226–5234, 2018.
[42] Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, Zeng Huang,
and Hao Li. Auto-conditioned recurrent networks for ex-
tended complex human motion synthesis. arXiv preprint
arXiv:1707.05363, 2017.
[43] Angela S Lin, Lemeng Wu, Rodolfo Corona, Kevin Tai, Qix-
ing Huang, and Raymond J Mooney. Generating animated
videos of human activities from natural language descrip-
tions. Learning, 2018:2, 2018.
Christian Mandery, Ömer Terlemez, Martin Do, Nikolaus
Vahrenkamp, and Tamim Asfour. Unifying representations
and large-scale whole-body motion databases for studying
human motion. IEEE Transactions on Robotics, 32(4):796–
809, 2016.
[45] Karttikeya Mangalam, Ehsan Adeli, Kuan-Hui Lee, Adrien
Gaidon, and Juan Carlos Niebles. Disentangling human dy-
namics for pedestrian locomotion forecasting with noisy su-
pervision. In Proceedings of the IEEE/CVF Winter Confer-
ence on Applications of Computer Vision, pages 2784–2793,
2020.
[46] Julieta Martinez, Michael J Black, and Javier Romero. On
human motion prediction using recurrent neural networks.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 2891–2900, 2017.
[47] Hongyuan Mei, Mohit Bansal, and Matthew Walter. Listen,
attend, and walk: Neural mapping of navigational instruc-
tions to action sequences. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 30, 2016.
[48] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado,
and Jeffrey Dean. Distributed representations of words
and phrases and their compositionality. arXiv preprint
arXiv:1310.4546, 2013.
[49] Abduallah Mohamed, Kun Qian, Mohamed Elhoseiny, and
Christian Claudel. Social-stgcnn: A social spatio-temporal
graph convolutional neural network for human trajectory
prediction. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 14424–
14432, 2020.
[50] Dario Pavllo, David Grangier, and Michael Auli. Quater-
net: A quaternion-based recurrent model for human motion.
arXiv preprint arXiv:1805.06485, 2018.
[51] Matthias Plappert, Christian Mandery, and Tamim Asfour.
The KIT motion-language dataset. Big Data, 4(4):236–252,
dec 2016.
[52] Matthias Plappert, Christian Mandery, and Tamim Asfour.
Learning a bidirectional mapping between human whole-
body motion and natural language using deep recurrent neu-
ral networks. Robotics and Autonomous Systems, 109:13 –
26, 2018.
Jürgen Schmidhuber and Sepp Hochreiter. Long short-term
memory. Neural Comput, 9(8):1735–1780, 1997.
[54] Soshi Shimada, Vladislav Golyanik, Weipeng Xu, and Chris-
tian Theobalt. Physcap: Physically plausible monocular 3d
motion capture in real time. ACM Transactions on Graphics
(TOG), 39(6):1–16, 2020.
[55] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bas-
tani, Christopher D Manning, and Andrew Y Ng. Zero-
shot learning through cross-modal transfer. arXiv preprint
arXiv:1301.3666, 2013.
[56] Wataru Takano, Dana Kulic, and Yoshihiko Nakamura. In-
teractive topology formation of linguistic space and motion
space. In 2007 IEEE/RSJ International Conference on Intel-
ligent Robots and Systems, pages 1416–1422. IEEE, 2007.
[57] Wataru Takano and Yoshihiko Nakamura. Bigram-based nat-
ural language model and statistical motion symbol model for
scalable language of humanoid robots. In 2012 IEEE In-
ternational Conference on Robotics and Automation, pages
1232–1237. IEEE, 2012.
[58] Wataru Takano and Yoshihiko Nakamura. Statistical mu-
tual conversion between whole body motion primitives and
linguistic sentences for human motions. The International
Journal of Robotics Research, 34(10):1314–1328, 2015.
[59] Wataru Takano and Yoshihiko Nakamura. Symbolically
structured database for human whole body motions based
on association between motion symbols and motion words.
Robotics and Autonomous Systems, 66:75–85, 2015.
[60] Kenta Takeuchi, Dai Hasegawa, Shinichi Shirakawa, Naoshi
Kaneko, Hiroshi Sakuta, and Kazuhiko Sumi. Speech-to-
gesture generation: A challenge in deep learning approach
with bi-directional lstm. In Proceedings of the 5th Interna-
tional Conference on Human Agent Interaction, pages 365–
369, 2017.
[61] Yongyi Tang, Lin Ma, Wei Liu, and Weishi Zheng. Long-
term human motion prediction by modeling motion con-
text and enhancing motion dynamic. arXiv preprint
arXiv:1805.02513, 2018.
[62] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam
Poliak, R Thomas McCoy, Najoung Kim, Benjamin
Van Durme, Samuel R Bowman, Dipanjan Das, et al. What
do you learn from context? probing for sentence struc-
ture in contextualized word representations. arXiv preprint
arXiv:1905.06316, 2019.
[63] ¨
Omer Terlemez, Stefan Ulbrich, Christian Mandery, Mar-
tin Do, Nikolaus Vahrenkamp, and Tamim Asfour. Master
motor map (mmm)—framework and toolkit for capturing,
representing, and reproducing human motion on humanoid
robots. In 2014 IEEE-RAS International Conference on Hu-
manoid Robots, pages 894–901. IEEE, 2014.
[64] Jesse Vig. A multiscale visualization of attention in the trans-
former model. arXiv preprint arXiv:1906.05714, 2019.
[65] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn,
Xunyu Lin, and Honglak Lee. Learning to generate long-
term future via hierarchical prediction. In international
conference on machine learning, pages 3560–3569. PMLR,
2017.
[66] Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial
Hebert. The pose knows: Video forecasting by generating
pose futures. In Proceedings of the IEEE international con-
ference on computer vision, pages 3332–3341, 2017.
[67] Zhiyong Wang, Jinxiang Chai, and Shihong Xia. Combining
recurrent neural networks and adversarial training for human
motion synthesis and control. IEEE transactions on visual-
ization and computer graphics, 27(1):14–28, 2019.
[68] Erwin Wu and Hideki Koike. Real-time human motion fore-
casting using a rgb camera. In 2019 IEEE Conference on
Virtual Reality and 3D User Interfaces (VR), pages 1575–
1577. IEEE, 2019.
[69] Jingwei Xu, Huazhe Xu, Bingbing Ni, Xiaokang Yang,
Xiaolong Wang, and Trevor Darrell. Hierarchical style-
based networks for motion synthesis. arXiv preprint
arXiv:2008.10162, 2020.
[70] Tatsuro Yamada, Hiroyuki Matsunaga, and Tetsuya Ogata.
Paired recurrent autoencoders for bidirectional translation
between robot actions and linguistic descriptions. IEEE
Robotics and Automation Letters, 3(4):3441–3448, 2018.
[71] Tatsuro Yamada, Shingo Murata, Hiroaki Arie, and Tetsuya
Ogata. Dynamical integration of language and behavior in a
recurrent neural network for human–robot interaction. Fron-
tiers in neurorobotics, 10:5, 2016.
Appendix: More Results on Quantitative Eval-
uation Metrics
We show the average positional error (APE) values for
individual joints in Table 1. We compare our method with
the two state-of-the-art methods [4, 43] and also with the
four ablations of our method: ‘w/o BERT’, ‘w/o JT’, ‘w/o
2-St’, ‘w/o Lo’, as described in Section 4.4 of our paper.
We observe that a high error in the root joint leads to either foot sliding in the motion or an averaging-out of the whole motion. Improvement in the error values for the root joint in-
dicates high-quality animations without any such artifacts.
When compared to the ablations of our model, we find that
the APE calculated over the mean of all the joints with the
global trajectory is marginally better for the ablations com-
pared to our method (best for the ablation ‘w/o 2-St’ show-
ing an improvement of 1.96% over our method). This is be-
cause the motions get averaged out in the ablations, bringing
the joint positions closer to the mean but reducing the rel-
evant joint movements. However, our method has the low-
est APE for the root joint, implying that the overall motion
quality is better. Using the additional metric of the average
variance error (AVE) for evaluating the variability of the
motions further shows that the joint movements are reduced
in the ablations. Our method has the lowest AVE for the
root joint as well as the mean of all the joints with and with-
out the global trajectory, as shown in Table 2. Our method
also performs the best in terms of the content encoding error
(CEE) and the style encoding error (SEE) compared to the
ablations and the state-of-the-art methods as seen in Table 3.
Table 1: Average Positional Error (APE) in mm for our model compared to the JL2P [4], Lin et al. [43], and four ablations
of our method described in Section 4.4 of our paper. Although the overall APE is lower for our ablation studies, we find their overall motion quality to be poorer than that of our final method due to larger errors in the root. Please refer to Section 5.1 of our paper for details.
JL2P Lin et al. w/o BERT w/o JT w/o 2-St w/o Lo Ours
Trajectory 4.12 4.52 1.21 1.27 1.22 1.23 1.22
Root 7.28 7.78 3.23 3.50 3.22 3.23 3.21
Torso 13.18 14.93 5.84 5.71 5.71 5.91 5.90
Pelvis 14.92 16.10 6.49 6.54 6.52 6.67 6.60
Neck 33.01 36.03 14.88 14.50 14.69 15.04 15.01
Left Arm 37.37 41.71 16.54 16.79 16.09 16.79 16.94
Right Arm 37.91 42.33 16.41 16.56 15.81 16.25 16.40
Left Hip 13.50 14.33 6.02 6.12 6.14 6.18 6.21
Right Hip 13.39 14.05 6.00 6.15 6.15 6.20 6.22
Left Foot 38.38 38.84 16.78 16.63 16.84 16.25 16.97
Right Foot 39.66 40.31 17.12 17.15 17.24 16.78 17.22
Mean w/o trajectory 24.86 26.64 10.93 10.96 10.84 10.93 11.07
Mean 22.97 24.63 10.04 10.08 9.97 10.05 10.17
Table 2: Average Variance Error (AVE) for our model compared to the JL2P [4], Lin et al. [43], and the four ablations of our
method described in Section 4.4 of our paper. Our method has the lowest AVE for the root joint as well as the mean of all the
joints with and without the global trajectory.
JL2P Lin et al. w/o BERT w/o JT w/o 2-St w/o Lo Ours
Trajectory 18.55 19.00 10.87 10.52 11.20 9.75 10.29
Root 4.70 5.46 2.45 2.42 2.32 2.30 2.19
Torso 21.44 22.61 12.65 12.20 13.22 11.85 11.87
Pelvis 23.79 24.51 13.66 13.25 13.99 12.73 12.58
Neck 45.05 36.03 26.24 25.26 27.37 24.78 24.65
Left Arm 32.66 41.71 16.59 16.42 16.86 15.66 15.20
Right Arm 29.15 42.34 15.18 14.54 15.05 14.31 13.95
Left Hip 27.79 28.73 16.01 15.45 15.82 14.35 14.71
Right Hip 26.73 27.05 14.46 14.13 14.92 13.31 13.40
Left Foot 48.34 38.84 24.63 24.03 23.67 22.27 21.57
Right Foot 47.23 40.31 23.04 23.10 22.80 20.72 20.87
Mean w/o Trajectory 30.69 30.75 16.49 16.08 16.60 15.22 15.09
Mean 29.58 29.69 15.98 15.57 16.11 14.73 14.66
Table 3: Content Encoding Error (CEE) and Style Encoding Error (SEE) for our model compared to the JL2P [4], Lin et
al. [43], and the four ablations of our method described in Section 4.4 of our paper. Lower values are better. Our method has
the lowest CEE and SEE.
Method JL2P Lin et al. w/o BERT w/o JT w/o 2-St w/o Lo Ours
CEE 1.06 1.92 1.10 0.99 0.67 1.04 0.53
SEE 0.38 1.13 0.80 0.76 0.46 0.77 0.19