
Dynamic Future Net: Diversified Human Motion Generation

Wenheng Chen

NetEase Fuxi AI Lab

chenwenheng@corp.netease.com

He Wang

University of Leeds

H.E.Wang@leeds.ac.uk

Yi Yuan

NetEase Fuxi AI Lab

yuanyi@corp.netease.com

Tianjia Shao

State Key Lab of CAD&CG, Zhejiang University

tjshao@zju.edu.cn

Kun Zhou

State Key Lab of CAD&CG, Zhejiang University

kunzhou@acm.org

Figure 1: Given a 20-frame walking motion prefix (white), our model can generate diversified motions: walking (yellow), walking-to-running (blue), walking-to-boxing (green), and walking-to-dancing (red), with arbitrary duration. The corresponding animation can be found in teaser.mp4 in the supplementary video.

ABSTRACT

Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality. Acquiring high-quality skeletal motions is difficult due to the need for specialized equipment and laborious manual post-processing, which necessitates maximizing the use of existing data to synthesize new data. However, this is challenging due to the intrinsic stochasticity of human motion dynamics, manifested in both the short and long term. In the short term, there is strong randomness within a couple of frames, e.g. one frame can be followed by multiple possible frames leading to different motion styles; in the long term, there are non-deterministic action transitions. In this paper, we present Dynamic Future Net, a new deep learning model where we explicitly focus on the aforementioned motion stochasticity by constructing a generative


model with non-trivial modelling capacity in temporal stochasticity. Given limited amounts of data, our model can generate a large number of high-quality motions with arbitrary duration and visually convincing variations in both space and time. We evaluate our model on a wide range of motions and compare it with state-of-the-art methods. Both qualitative and quantitative results show the superiority of our method in terms of robustness, versatility and quality.

CCS CONCEPTS

• Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability.

KEYWORDS

human motion, neural networks, generative models

ACM Reference Format:

Wenheng Chen, He Wang, Yi Yuan, Tianjia Shao, and Kun Zhou. 2020. Dynamic Future Net: Diversified Human Motion Generation. In 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3413669

1 INTRODUCTION

Modeling natural human motions is a central topic in several fields such as computer animation, bio-mechanics and virtual reality, where high-quality motion data is a necessity. Despite the improved accuracy and lowered costs of motion capture systems, it is still highly desirable to make full use of existing data to generate diversified new data. One key challenge in motion generation is dynamics modelling, where it has been shown that a latent space can be found due to the high coordination of body motions [21, 36, 41]. However, as much as the spatial aspect has been studied, dynamics modelling, especially with the aim of diversified motion generation, remains an open problem.

Human motion dynamics manifest several levels of short-term and long-term stochasticity. Given a homogeneous discretization of motions in time, the short-term stochasticity refers to the randomness in the next one or few frames (pose transitions), while the long-term stochasticity refers to the randomness in the next one or few actions (action transitions). Traditional methods model them with Finite State Machines over carefully organized data [33], which have limited model capacity for large amounts of data and require extensive pre-processing. Recent deep learning methods either ignore them [21] or do not model them explicitly [41]. Very recently, dynamics modelling for diversified generation has been investigated [42], but only from the perspective of the overall dynamics, rather than the detailed short/long-term stochasticity.

In this paper, we propose a new deep learning model, Dynamic Future Net, or DFN, for automatic and diversified high-quality motion generation based on limited amounts of data. Given a motion, we assume that it can be discretized homogeneously in time and represented by a series of postures and instantaneous velocities. Following the observation that it is easier to learn the dynamics in a natural motion latent space [21], we first embed features in the data space into a latent space. Next, DFN explicitly learns the history, current and future states at any time, where we also model several conditional distributions for the influences of the history and future states on the current state. This state-wise bidirectional modelling (extending into both the past and the future) separates DFN from existing methods and endows us with the ability to model both the short-term (next-frame) randomness and the long-term (next-action) randomness. Last, for inference purposes, we propose new loss functions based on distributional similarities as opposed to point-wise estimation [41, 44], which capture the dynamics accurately while keeping the randomness that is crucial for diversified motion generation.

We show extensive experimental results to demonstrate DFN's robustness, versatility and high quality. Unlike existing methods which have to be trained on one type of motion at a time [42, 44], DFN can be trained on either a single type of motion or mixed motions, which shows DFN's ability to capture multi-modal dynamics and therefore its versatility in diversified motion generation. Visual evaluation shows that DFN can generate high-quality motions with different dynamics.

In summary, our contributions include:

(1) a new deep learning model, Dynamic Future Net, for automatic, diversified and high-quality human motion generation;

(2) a new dynamic model that captures the transition stochasticity of the past, current and future states in motions;

(3) insights into the importance of both short-term and long-term dynamics in human motion modelling.

2 BACKGROUND AND RELATED WORKS

2.1 Human pose and motion embedding

Given a human motion sequence, it is useful to find a low-dimensional representation of the whole sequence. Holden et al. [22] were the first to use a convolutional neural network to project the entire sequence into a low-dimensional embedding space. Using this more abstract representation, one can blend motions or remove noise from corrupted motions. In [20, 21], the authors further exploit the learned motion manifold and decoder to synthesize motions with constraints. Another important application of motion embedding is motion style generation [8], in which the embedding code can be tuned to match a desired style. Although modeling a motion sequence with auto-encoders is straightforward, how it can model the dynamics of human motion is not clear. In [28], the authors model a motion sequence as a trajectory in a pose embedding manifold, then use a bidirectional RNN to encode the trajectory and model its dynamics, which improves motion classification results. Moreover, they design a graph-like network to better represent the human body components.

The existing methods focus on the embedding of the poses and dynamics. However, they do not explicitly model the distributions of these latent variables, which govern the stochasticity of the dynamics. In this paper, we go a level deeper and learn the latent variable distributions for the embedded poses and dynamics.

2.2 Deterministic human motion prediction and synthesis

In the effort of modelling motion dynamics, many methods employ deterministic transitions [2, 4, 9, 13, 17, 19, 24, 30, 31, 38], especially in human motion prediction or generation. They either focus on short-term dynamics modeling or on the spatial-temporal information of the overall dynamics. In [44], the authors propose a training technique for RNNs to generate very long human motions. Although this technique solves the freezing phenomenon of RNNs, their model is deterministic, which makes training difficult: given a past state, if multiple possible future motions are present in the data, the network will average them, which is a common problem in many human motion prediction methods.

One solution to this problem is to introduce control signals [14, 39]. These works design several networks and make the character follow a given trajectory in real time. In [35], the control signal becomes the 3D human pose predicted by neural networks, serving as a reference for an agent to imitate. In [1], the authors co-embed language and corresponding motions into a shared manifold, ignoring the fact that language-to-motion is a one-to-many mapping. Even with a specific control signal, such as a 2D human skeleton, one can still expect different motions or different poses corresponding to the same control signal [29], essentially indicating the multi-modal nature of human motion dynamics.

Different from the existing methods, our paper focuses on the explicit modelling of the multi-modal nature of motion transitions in human motions. Further, we also aim to learn the stochasticity in those transitions.

2.3 Stochastic human motion synthesis

In [42] the authors combine RNNs and Generative Adversarial Networks (GANs) to generate stochastic human motions. They use a mixture density layer to model the stochasticity, and use an adversarial discriminator to judge whether the generated motion is natural or not. In MoGlow [18], the authors for the first time use normalizing flows to directly model the next-frame distribution. One advantage of this method is that it can capture complex distributions without learning an explicit latent space. Given the same initial poses and the same control signals or constraints, the model still generates different motion sequences. Chen et al. [5] combine dynamic movement primitives and variational Bayesian filtering to model human motion dynamics. They show that the latent representations are self-clustering after training. However, the transitions require information about the whole sequence, which prevents it from being a pure generative model.

Our method differs from existing approaches in its treatment of the relations between the past, current and future states of human motions. Unlike the aforementioned methods, we explicitly model the current state based on both the past and the future. Also, we further model their randomness in a latent space that captures the transition multi-modality.

2.4 Stochastic RNN model

Modelling the stochasticity in time-series data has been a long-standing problem, with applications such as music, handwriting and human voice [11, 25, 40]. The VRNN [7] for the first time combines the Variational AutoEncoder (VAE) [27] and recurrent neural networks for this purpose. Later, in [23], the authors disentangle the latent variables of the observation and the dynamics, with the observation latent being used to recover the full observation information and the dynamic latent capturing the dynamics.

A key modelling choice in stochastic RNN models is the relation between the past, current and future. In early work, the posterior of the current state is inferred from past information only, which makes it unable to foresee the future. In [3, 10, 37], the authors show that the performance can be improved by incorporating the future state with a backward RNN in the inference stage. In [12], the authors design a model that goes beyond step-by-step modelling and predicts multiple steps up to a given horizon. Similar efforts have also been made in reinforcement learning, where the reward function takes the discounted future reward into consideration [12, 15]. In [16], the authors go further and design a model that can predict multiple future scenarios, then choose the one with the highest predicted reward among all the possibilities.

We observe that human motions follow a similar philosophy: the current state is a result of the past motion but also a particular choice for a certain planned future. Our research is inspired by stochastic RNN models but focuses on human motion transition stochasticity.

3 METHOD OVERVIEW

Our method takes a homogeneous series of human pose representations as input. This representation contains the 3D joint coordinates relative to the root, the root translation velocity over the ground plane and the rotation velocity around the y-axis. We propose the Dynamic Future Net to model the motion dynamics as a future-guided transition and to generate random natural human motions that transit between different actions.

As illustrated in Fig. 2, DFN is composed of three modules: a pose encoder, a pose trajectory encoder and a stochastic latent RNN. The pose encoder (Section 5.1) maps the high-dimensional human pose to a latent space, while the pose trajectory encoder (Section 5.2) embeds the trajectory in the latent pose space into a code. Such compact representations of pose sequences facilitate the learning process [28]. As the key module, the stochastic latent RNN (Section 5.3) deploys a stochastic latent state and a deterministic latent state to learn two latent distributions, one for the pose embedding and one for the future trajectory embedding. Such explicit learning of two different latent distributions on the one hand forces the model to learn strong temporal correlations and on the other hand generates motions with varied and natural transitions. During inference we combine the past, current and future states to infer the current latent state distribution, and we combine the past and future to infer the future latent state distribution. In the generation stage, unlike existing methods [10] where the current state is generated from the past state only, we first generate the future state and combine it with the past state to generate the current latent state prior, from which we sample the current latent state, then decode it to the pose embedding and recover the current pose and velocity. We regard this process as a self-driving motion generation process guided by the envisioned dynamic future. In this way, the model can learn and generate rich and varied natural motions.

Figure 2: Overview of the proposed Dynamic Future Network. During learning, the network takes a human motion sequence as input and predicts the long-term distribution and the short-term (next-frame) distribution.

4 DATA PREPARATION

We train our model on the CMU human motion capture dataset. As the skeletons in the original dataset differ, we first retarget all motions to a chosen skeleton as in [21]. The skeleton contains 24 joints. We extract the global X and Z coordinates of the root and rotate the human pose to the Y-axis direction as in [22]; the global position and orientation of the human pose can then be recovered from the X-Z velocity and the rotation velocity around the Y axis. Finally, the human pose vector contains 76 degrees of freedom: 72 for the 3D joint positions and 4 for the global translation and rotation velocities.
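To make the representation concrete, below is a minimal NumPy sketch of how such per-frame pose-velocity vectors could be assembled from retargeted global joint positions. The exact encoding of the 4 velocity channels and the facing-direction convention are assumptions of this sketch, not the paper's specification.

```python
import numpy as np

def pose_velocity_features(joints, root_xz, root_yaw):
    """Assemble per-frame 76-D pose-velocity vectors (Section 4).

    joints:   (T, 24, 3) retargeted global joint positions.
    root_xz:  (T, 2) global X-Z position of the root joint.
    root_yaw: (T,) global rotation of the body around the Y axis, in radians.

    Returns (T-1, 76): 72 root-relative joint coordinates plus 4 velocity
    channels (X-Z translation velocity and Y-axis rotation velocity).
    """
    feats = []
    for t in range(len(joints) - 1):
        # express joints relative to the root and remove the global yaw
        rel = joints[t] - joints[t, 0:1]
        c, s = np.cos(root_yaw[t]), np.sin(root_yaw[t])
        unyaw = np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])  # Ry(-yaw)
        rel = rel @ unyaw.T
        # finite-difference global velocities; integrating them at generation
        # time recovers the global position and facing angle
        vx, vz = root_xz[t + 1] - root_xz[t]
        dyaw = root_yaw[t + 1] - root_yaw[t]
        vel = np.array([vx, vz, np.sin(dyaw), np.cos(dyaw)])  # assumed 4-channel encoding
        feats.append(np.concatenate([rel.reshape(-1), vel]))
    return np.stack(feats)
```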

Figure 3: The pose-velocity auto-encoder network. The input of the encoder is the 76-dimensional pose-velocity vector. The encoder outputs two codes, one for the pose and one for the velocity. The pose code and velocity code are fed into a quaternion decoder and a velocity decoder separately. The 3D joint positions are recovered from the quaternions by the Forward Kinematics (FK) layer.

5 METHODOLOGY

Formally, we describe a motion as a homogeneous time series $\{X_0, \ldots, X_T\}$, where $X_t$ is the motion frame at time $t$ and contains the joint positions and global velocities. Starting with a joint distribution $P(X_{<t}, X_t, X_{t+1:t+H})$, we model the influence of the past frames $X_{<t}$ and the future frames $X_{t+1:t+H}$ on the current frame $X_t$ by the transition probabilistic distributions $P(X_t \mid X_{t+1:t+H})$ and $P(X_t \mid X_{<t})$, where $H$ is the duration of a short-horizon future. The key reason for such a modelling choice is based on two observations: the current frame is a result of the past motion and therefore conditioned on it, captured by $P(X_t \mid X_{<t})$. Meanwhile, the current frame is also a choice made for a certain planned future, e.g. needing to stop swinging the legs for a transition from walking to standing, captured by $P(X_t \mid X_{t+1:t+H})$. In addition, since the past motion also limits the possibilities of the future motion, there is an impact of the past on the future, $P(X_{t+1:t+H} \mid X_{<t})$. Overall, the joint probability is:

$$P(X_{<t}, X_t, X_{t+1:t+H}) \propto P(X_t \mid X_{t+1:t+H}) \, P(X_{t+1:t+H}, X_t \mid X_{<t}) \quad (1)$$

Note that the two probabilities on the right-hand side play different roles. $P(X_{t+1:t+H}, X_t \mid X_{<t})$ is the probability of unrolling from the past to the future. Given a known past, this is a joint probability of both the current and the future, containing all possible transitions. On top of it, $P(X_t \mid X_{t+1:t+H})$ dictates that if the future is also known, then the current can be inferred. This explicit modelling of the transition probabilistic distributions between the past, current and future helps capture the transition stochasticity, which facilitates diversified motion generation as shown in the experiments.

Learning the transitional probabilities in the data space, however, is difficult due to the curse of dimensionality. We therefore project motions into a latent space, which involves embedding the frames as well as the dynamics. We then learn the transition distributions in the latent space. During inference, we recover motions from sampled states in the latent space back to the original data space. DFN is naturally divided into three components: spatial (frame) embedding, dynamics embedding and dynamics modelling.

5.1 Spatial Embedding

We use an auto-encoder for frame embedding, $z_t = PoseEnc(X_t)$ and $\hat{X}_t = PoseDec(z_t)$, shown in Figure 3. $PoseEnc$ is a multi-layer perceptron that projects the data into the latent space. We then separate the latent feature into two components representing the pose code and the global velocity code. $PoseDec$ contains two components, a quaternion decoder and a velocity decoder. The quaternion decoder takes the pose latent feature as input and outputs joint angles (represented by quaternions), and the velocity decoder takes the latent velocity feature as input and outputs the velocity. The quaternion decoder is essentially a differential Inverse Kinematics module. As stated in [34], using joint rotations instead of joint positions maintains the bone lengths. After the reconstruction, we use a Forward Kinematics layer to compute the 3D joint positions. To train the auto-encoder, we use a Mean Squared Error loss:

$$L_{sl} = \frac{1}{T} \sum_{t=0}^{T} \| X_t - \hat{X}_t \|_2^2 \quad (2)$$

where $T$ is the number of frames in a motion.
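As a concrete illustration, the following PyTorch sketch shows one possible implementation of the pose-velocity auto-encoder of Figure 3, including a differentiable FK layer. The layer sizes, the quaternion layout (w, x, y, z) and the 16/4 split of the latent code are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def quat_mul(a, b):
    """Hamilton product of quaternions a, b with layout (w, x, y, z)."""
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw], dim=-1)

def quat_rotate(q, v):
    """Rotate vectors v (..., 3) by unit quaternions q (..., 4)."""
    w, xyz = q[..., :1], q[..., 1:]
    t = 2.0 * torch.cross(xyz, v, dim=-1)
    return v + w * t + torch.cross(xyz, t, dim=-1)

class ForwardKinematics(nn.Module):
    """Accumulates local joint rotations along the kinematic chain to obtain
    root-relative 3D joint positions (the FK layer in Figure 3)."""
    def __init__(self, parents, offsets):
        super().__init__()
        self.parents = parents                      # e.g. [-1, 0, 1, ...], root-first ordering
        self.register_buffer("offsets", offsets)    # (J, 3) constant bone offsets

    def forward(self, quats):                       # (B, J, 4) local rotations
        quats = F.normalize(quats, dim=-1)
        B, J, _ = quats.shape
        pos, rot = [quats.new_zeros(B, 3)], [quats[:, 0]]
        for j in range(1, J):
            p = self.parents[j]
            rot.append(quat_mul(rot[p], quats[:, j]))
            pos.append(pos[p] + quat_rotate(rot[p], self.offsets[j].expand(B, 3)))
        return torch.stack(pos, dim=1)              # (B, J, 3)

class PoseAutoEncoder(nn.Module):
    """76-D frame -> (pose code, velocity code) -> quaternions + velocity,
    with positions recovered through FK; trained with the MSE of Eq. 2."""
    def __init__(self, fk, in_dim=76, pose_dim=16, vel_dim=4, hidden=256):
        super().__init__()
        self.fk, self.pose_dim = fk, pose_dim
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, pose_dim + vel_dim))
        self.quat_dec = nn.Sequential(nn.Linear(pose_dim, hidden), nn.LeakyReLU(),
                                      nn.Linear(hidden, 24 * 4))
        self.vel_dec = nn.Sequential(nn.Linear(vel_dim, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, vel_dim))

    def forward(self, x):                           # x: (B, 76)
        z = self.encoder(x)
        z_pose, z_vel = z[:, :self.pose_dim], z[:, self.pose_dim:]
        quats = self.quat_dec(z_pose).view(-1, 24, 4)
        joints = self.fk(quats).reshape(-1, 72)     # differentiable rotations-to-positions
        x_hat = torch.cat([joints, self.vel_dec(z_vel)], dim=-1)
        return x_hat, z                             # loss = F.mse_loss(x_hat, x), i.e. L_sl
```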

5.2 Dynamics Embedding

Figure 4: A seq-to-seq network for trajectory embedding.

After learning the pose latent space, we project the dynamics as trajectories in this space using a Recurrent Neural Network, shown in Figure 4. We employ a sequence-to-sequence architecture as it forces the model to learn long-term dynamics. The RNN consists of Gated Recurrent Units (GRU) [6]; it encodes a sequence of encoded frames $\{z_t, \ldots, z_{t+H}\}$ into a latent representation $m_t$, and then unrolls to reconstruct the same sequence $\{z'_{t+1}, \ldots, z'_{t+H}\}$ from $m_t$ given $z_t$. While $z_t = PoseEnc(X_t)$ encodes a single frame, $m_t$ is a future summary over multiple frames. We use the following loss function:

$$L_{tl} = L_{rec} + L_{smooth} \quad (3)$$

$$\text{where } L_{rec} = \frac{1}{T} \sum_{t=0}^{T} \| z_t - z'_t \|_2^2, \qquad L_{smooth} = \frac{1}{T} \sum_{t=0}^{T} \| V_t - \hat{V}_t \|_2^2$$

where $T$ is the number of frames in a motion, and $V_t$ and $\hat{V}_t$ are the original and reconstructed joint velocities. To facilitate training, we use Eq. 2 to pre-train the posture auto-encoder and fix its weights when training the RNN module.
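A minimal sketch of such a sequence-to-sequence trajectory embedding could look as follows (PyTorch). The GRU sizes and the single-layer encoder are assumptions; the smoothness term $L_{smooth}$ would be computed on joint velocities after decoding the reconstructed codes back through the pre-trained pose decoder.

```python
import torch
import torch.nn as nn

class TrajectoryEmbedding(nn.Module):
    """Seq-to-seq GRU (Figure 4): summarize a window of pose codes
    {z_t, ..., z_{t+H}} into one vector m_t, then reconstruct the window."""
    def __init__(self, z_dim=16, m_dim=128):
        super().__init__()
        self.encoder = nn.GRU(z_dim, m_dim, batch_first=True)
        self.decoder_cell = nn.GRUCell(z_dim, m_dim)
        self.readout = nn.Linear(m_dim, z_dim)

    def forward(self, z_window):                  # (B, H+1, z_dim): z_t ... z_{t+H}
        _, m_t = self.encoder(z_window)
        m_t = m_t.squeeze(0)                      # (B, m_dim): future summary
        h, inp, recon = m_t, z_window[:, 0], []   # decoder seeded with z_t
        for _ in range(z_window.shape[1] - 1):    # predict z'_{t+1} ... z'_{t+H}
            h = self.decoder_cell(inp, h)
            inp = self.readout(h)
            recon.append(inp)
        # L_rec compares torch.stack(recon, 1) with z_window[:, 1:]
        return m_t, torch.stack(recon, dim=1)
```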

5.3 Dynamics modelling

Generative Model. After embedding the poses and dynamics into a latent space, we now explain the dynamics modelling, which is the key technical contribution of this paper. We propose a new dynamics model that captures the joint distribution $P(X_{<t}, X_t, X_{t+1:t+H})$. Rather than directly learning the distribution in the data space, we aim to learn the latent joint distribution $P(z_{<t}, z_t, z_{t+1:t+H})$, where we abstract the past, current and future features separately. First, given the Markov property, we assume that all past information $z_{<t}$ is encoded into $h_t$, which is a deterministic (known) past state. Next, we assume that the future information $z_{t+1:t+H}$ can be summarized into a future state $f_t$, where $f_t$ is drawn from a distribution over all possible future states conditioned on $h_t$. Last, we also assume that there is a current state $s_t$ which captures the current information $z_t$, where $s_t$ is drawn from a distribution over all possible current states. We can therefore write:

$$\begin{aligned}
P(z_{<t}, z_t, z_{t+1:t+H}) &= P(z_{t+1:t+H}, z_t \mid z_{<t}) \, P(z_{<t}) \\
&\propto P(z_t \mid z_{t+1:t+H}) \, P(z_t \mid z_{<t}) \, P(z_{t+1:t+H} \mid z_{<t}) \\
&= P(s_t \mid f_t) \, P(s_t \mid h_t) \, P(f_t \mid h_t) \\
&= P(s_t \mid f_t, h_t) \, P(f_t \mid h_t)
\end{aligned} \quad (4)$$

where we directly use $s_t$, $h_t$ and $f_t$ to replace the corresponding $z$ variables by assuming mappings between them, which will be explained later. Different from existing methods [10, 37], our direct conditioning of the current state on the future and past states, $P(s_t \mid f_t, h_t)$, and of the future state on the past state, $P(f_t \mid h_t)$, allows us great flexibility in modelling the stochasticity of transitions. The generation model is shown in Fig. 5a.

Future Feature Prior. Given $h_t$, we first predict the future state via $P(f_t \mid h_t)$. Here, we assume a diagonal multivariate Gaussian prior over $f_t$ [5, 14, 26]:

$$p(f_t \mid h_t) = \mathcal{N}(f_t; \mu^f_t, \sigma^f_t), \quad \text{where } [\mu^f_t, \sigma^f_t] = g^p_f(h_t) \quad (5)$$

where $\mu^f_t$ and $\sigma^f_t$ are the mean and covariance. $g^p_f$ is a three-layer MLP with hidden dimension 256 and LeakyReLU activation. This prior covers all the possible future states given the past. It can represent a goal or a driving signal for the generative process. It also forces the model to learn rich motion transitions and long-term correlations, overcoming the freezing problem of traditional RNNs [44].

Current Feature Prior. Next, we explain $P(s_t \mid f_t, h_t)$. Although $h_t$ is a known (deterministic) past, $f_t$ is random. We therefore first sample a specific future state $f_t$, then decode it into an unrolled future summary $m_t$, and finally condition the current state $s_t$ on $h_t$ and $m_t$. We therefore have:

$$P(s_t \mid f_t, h_t) \propto P(s_t \mid m_t, h_t) \, P(m_t \mid f_t, h_t) \quad (6)$$

$$P(s_t \mid m_t, h_t) = \mathcal{N}(s_t; \mu^s_t, \sigma^s_t), \quad [\mu^s_t, \sigma^s_t] = g^p_s(h_t, m^p_t) \quad (7)$$

where $P(m_t \mid f_t, h_t)$ is parameterized by a four-layer MLP with hidden dimension 128 and LeakyReLU activation, and $\mu^s_t$ and $\sigma^s_t$ are the mean and covariance. $g^p_s$ is a two-layer MLP with hidden dimension 128. Once we can sample the current state $s_t$, we compute the current feature $z_t$ via $z_t = \mathrm{MLP}(s_t, h_t)$, where the MLP has three hidden layers with 128 dimensions and LeakyReLU activation.

Finally, given the current and future states, the past state is updated as follows (Fig. 5c):

$$h_{t+1} = \mathrm{GRU}(h_t, s_t, f_t) \quad (8)$$

where the GRU has two stacked layers and a hidden state of dimension 128. Now the generation model in Fig. 5a is complete.
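To make the generation pass concrete, the sketch below strings the pieces of Fig. 5a together for one step. The latent sizes for $s_t$ and $f_t$, the use of log standard deviations, and the exact way the two-layer GRU consumes $(s_t, f_t)$ are assumptions of this sketch. At generation time, $h_t$ and the GRU state would be initialized by running the motion prefix through the model, and the decoded code $z_t$ would be mapped back to a pose and velocity by the pre-trained pose decoder.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    """Stack of Linear + LeakyReLU layers, with no activation after the last Linear."""
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.LeakyReLU()]
    return nn.Sequential(*layers[:-1])

class DFNGenerator(nn.Module):
    """One future-guided generation step (Fig. 5a): sample f_t from its prior,
    decode a future summary m_t, sample s_t given (h_t, m_t), decode the pose
    code z_t, and update the deterministic past state (Eq. 8)."""
    def __init__(self, z_dim=16, h_dim=128, s_dim=32, f_dim=32, m_dim=128):
        super().__init__()
        self.f_prior = mlp([h_dim, 256, 256, 2 * f_dim])            # g^p_f: 3 layers, width 256
        self.m_dec   = mlp([f_dim + h_dim, 128, 128, 128, m_dim])   # p(m_t | f_t, h_t): 4 layers, width 128
        self.s_prior = mlp([h_dim + m_dim, 128, 2 * s_dim])         # g^p_s: 2 layers, width 128
        self.z_dec   = mlp([s_dim + h_dim, 128, 128, 128, z_dim])   # z_t = MLP(s_t, h_t): 3 hidden layers
        self.gru     = nn.GRU(s_dim + f_dim, h_dim, num_layers=2)   # h_{t+1} = GRU(h_t, s_t, f_t)

    @staticmethod
    def sample(stats):
        # reparameterized draw; predicting log-sigma is an assumption for stability
        mu, log_sigma = stats.chunk(2, dim=-1)
        return mu + torch.exp(log_sigma) * torch.randn_like(mu)

    def step(self, h, gru_state):
        f = self.sample(self.f_prior(h))                            # envisioned future state f_t
        m = self.m_dec(torch.cat([f, h], dim=-1))                   # unrolled future summary m_t
        s = self.sample(self.s_prior(torch.cat([h, m], dim=-1)))    # current state s_t
        z = self.z_dec(torch.cat([s, h], dim=-1))                   # current pose-velocity code z_t
        _, gru_state = self.gru(torch.cat([s, f], dim=-1).unsqueeze(0), gru_state)
        return z, gru_state[-1], gru_state                          # top-layer state serves as h_{t+1}
```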

Different from existing methods [7, 10], where the prior of the current state is a function of the past state $h_t$ only and where the future state is shared with the current state, we let the model learn two different distributions for the current and future states. The prior of the current state is also a function of the future state, which forces the model to make use of the future information.

5.4 Inference

In the generation model in Fig. 5a, the key variables to be inferred are $s_t$ and $f_t$, shown in Fig. 5b. The posterior of the future state $f_t$ depends on the past state $h_t$ and its unrolled future summary $m_t$. The posterior of the current state $s_t$ depends on the feature $z_t$, the past state $h_t$ and the future summary $m_t$. We first factorize the dynamics as follows:

$$q(s_{\le T} \mid z_{\le T}) \approx \prod_{t=1}^{T} q(s_t \mid z_{\le t-1}, z_t, z_{\le t+H}) = \prod_{t=1}^{T} q(s_t \mid h_t, z_t, m_t) \quad (9)$$

where $T$ is the total length of the motion. Here we approximate $q(s_t \mid z_{\le T})$ with $q(s_t \mid z_{\le t+H})$, as the correlation between $s_t$ and the far future is likely to be small, so we only consider frames up to $t+H$. Then, for each time step, we use an MLP to parameterize the posterior:

$$q(s_t \mid h_t, z_t, m_t) = \mathcal{N}(\mu^s_t, \sigma^s_t), \quad [\mu^s_t, \sigma^s_t] = \mathrm{MLP}(h_t, z_t, m_t) \quad (10)$$

where the MLP has two hidden layers with 32 dimensions and LeakyReLU activation. For the future state we approximate its posterior as follows:

$$q(f_t \mid h_t, m_t) = \mathcal{N}(\mu^f_t, \sigma^f_t), \quad [\mu^f_t, \sigma^f_t] = \mathrm{MLP}(h_t, m_t) \quad (11)$$

where the MLP has two hidden layers with 512 dimensions and LeakyReLU activation.

Figure 5: The stochastic latent RNN. Legend: $h_t$ is the deterministic latent for the history state, $z_t$ the code for the pose and velocity, $m_t$ the code for the future sequence (length 8/16), $s_t$ the stochastic latent for the current state, and $f_t$ the stochastic latent for the future state. a) Generation model. The current pose embedding feature $z_t$ depends on the current and past latent states $s_t$ and $h_t$; $s_t$ depends on the past state $h_t$ and the future summary $m_t$, which in turn depends on the future latent state $f_t$ and the past state $h_t$. b) Inference of $s_t$ and $f_t$. c) Transition of $h_t$.
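A minimal sketch of the two inference networks of Eqs. 10-11 is shown below, with the hidden widths stated in the text (32 for the posterior of $s_t$, 512 for the posterior of $f_t$); the latent sizes are assumptions.

```python
import torch
import torch.nn as nn

class DFNPosteriors(nn.Module):
    """Inference networks of Fig. 5b: Gaussian posteriors over the current
    state s_t given (h_t, z_t, m_t) and the future state f_t given (h_t, m_t)."""
    def __init__(self, z_dim=16, h_dim=128, m_dim=128, s_dim=32, f_dim=32):
        super().__init__()
        self.q_s = nn.Sequential(nn.Linear(h_dim + z_dim + m_dim, 32), nn.LeakyReLU(),
                                 nn.Linear(32, 32), nn.LeakyReLU(),
                                 nn.Linear(32, 2 * s_dim))
        self.q_f = nn.Sequential(nn.Linear(h_dim + m_dim, 512), nn.LeakyReLU(),
                                 nn.Linear(512, 512), nn.LeakyReLU(),
                                 nn.Linear(512, 2 * f_dim))

    def forward(self, h, z, m):
        mu_s, log_sigma_s = self.q_s(torch.cat([h, z, m], dim=-1)).chunk(2, dim=-1)
        mu_f, log_sigma_f = self.q_f(torch.cat([h, m], dim=-1)).chunk(2, dim=-1)
        return (mu_s, log_sigma_s), (mu_f, log_sigma_f)
```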

5.5 Temporal dierence loss

Besides inferring $s_t$ and $f_t$, we also constrain the dynamics of $s$. We assume a relation between two states $s_{t_1}$ and $s_{t_2}$ at times $t_1$ and $t_2$ with $t_1 < t_2$, similar to [12]:

$$p(s_{t_1}, s_{t_2}, t_1, t_2) \propto p(s_{t_2} \mid s_{t_1}, h_{t_1}, h_{t_2}) \quad (12)$$

where we parameterize the posterior:

$$q(s_{t_1} \mid s_{t_2}, h_{t_1}, h_{t_2}) = \mathcal{N}(s_{t_1}; \mu_{s_{t_1}}, \sigma_{s_{t_1}}), \quad [\mu_{s_{t_1}}, \sigma_{s_{t_1}}] = \mathrm{MLP}(s_{t_2}, h_{t_1}, h_{t_2}) \quad (13)$$

where $\mu_{s_{t_1}}$ and $\sigma_{s_{t_1}}$ are the mean and covariance. Here the MLP has two hidden layers with 32 dimensions and LeakyReLU activation. This way we can sample $s^{posterior}_{t_1}$ during inference. Meanwhile, we hope to reconstruct $s_{t_2}$ from it and the time difference $\delta_t = |t_1 - t_2|$:

$$s^{rec}_{t_2} = \mathrm{MLP}(s^{posterior}_{t_1}, \delta_t) \quad (14)$$

where the MLP for skip prediction has three hidden layers, each with 32 dimensions and LeakyReLU activation.
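As a sketch, the skip posterior (Eq. 13) and skip prediction (Eq. 14) could be implemented as follows; the latent sizes and the way $\delta_t$ is fed in as a scalar feature are assumptions.

```python
import torch
import torch.nn as nn

class TemporalDifference(nn.Module):
    """Temporal-difference terms of Section 5.5: infer s_{t1} from a later state
    s_{t2} and the two past states, then re-predict s_{t2} from the sampled
    s_{t1} and the time gap delta_t."""
    def __init__(self, s_dim=32, h_dim=128):
        super().__init__()
        self.q_skip = nn.Sequential(nn.Linear(s_dim + 2 * h_dim, 32), nn.LeakyReLU(),
                                    nn.Linear(32, 32), nn.LeakyReLU(),
                                    nn.Linear(32, 2 * s_dim))
        self.skip_pred = nn.Sequential(nn.Linear(s_dim + 1, 32), nn.LeakyReLU(),
                                       nn.Linear(32, 32), nn.LeakyReLU(),
                                       nn.Linear(32, 32), nn.LeakyReLU(),
                                       nn.Linear(32, s_dim))

    def forward(self, s_t2, h_t1, h_t2, delta_t):       # delta_t: (B, 1)
        mu, log_sigma = self.q_skip(torch.cat([s_t2, h_t1, h_t2], dim=-1)).chunk(2, dim=-1)
        s_t1 = mu + torch.exp(log_sigma) * torch.randn_like(mu)   # sampled posterior s_{t1}
        s_t2_rec = self.skip_pred(torch.cat([s_t1, delta_t], dim=-1))
        return (mu, log_sigma), s_t2_rec
```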

5.6 Learning

Finally, we compose all terms of the loss function:

$$\max_{\Phi} \; L_d + L_T \quad (15)$$

where $\Phi$ is the set of all learnable parameters in our networks and

$$L_d = \sum_{t=1}^{T} \big[ -KL(q(s_t \mid h_t, z_t, m_t) \,\|\, p(s_t \mid h_t, m_t)) - KL(q(f_t \mid h_t, m_t) \,\|\, p(f_t \mid h_t)) + \log p(z_t \mid s_t, h_t) + \log p(m_t \mid f_t, h_t) \big]$$

$$L_T = \sum_{t_1, t_2} \big[ -KL(q(s_{t_1} \mid s_{t_2}, h_{t_1}, h_{t_2}) \,\|\, p(s_{t_1})) + p(s_{t_2} \mid s_{t_1}, t_1, t_2) \big]$$

where $KL$ is the Kullback–Leibler divergence.

After training the pose auto-encoder and the sequence auto-

encoder (Section 5.1-5.2), we freeze their parameters and train the

dynamics model (Section 5.3).
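For the KL terms in $L_d$, the closed-form KL between two diagonal Gaussians can be used. A minimal sketch, assuming the networks return means and log standard deviations, is:

```python
import torch

def gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians."""
    var_q, var_p = torch.exp(2 * log_sigma_q), torch.exp(2 * log_sigma_p)
    return 0.5 * torch.sum(
        2 * (log_sigma_p - log_sigma_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        dim=-1)

def dfn_step_terms(post_s, prior_s, post_f, prior_f, log_p_z, log_p_m):
    """Per-step terms of L_d (summed over t and maximized): two KL penalties
    pulling the posteriors of s_t and f_t towards their priors, plus the
    reconstruction log-likelihoods of z_t and m_t. The (mu, log_sigma)
    argument packing is an assumption of this sketch."""
    kl_s = gaussian_kl(*post_s, *prior_s)   # KL(q(s_t|h_t,z_t,m_t) || p(s_t|h_t,m_t))
    kl_f = gaussian_kl(*post_f, *prior_f)   # KL(q(f_t|h_t,m_t)     || p(f_t|h_t))
    return -kl_s - kl_f + log_p_z + log_p_m
```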

6 EXPERIMENT AND RESULTS

For all our experiments, we use the CMU MoCap database (http://mocap.cs.cmu.edu/). The CMU dataset is a high-quality dataset acquired with optical motion capture systems, containing 2605 trials in 6 categories and 23 subcategories. Its high quality serves our purposes well as it provides good data 'seeds' for motion generation. Also, the tremendous effort that went into capturing the data shows the need for tools such as DFN for data augmentation. To carefully evaluate DFN, we select different motion classes with different features and dynamics, shown in Tab. 1, to show that DFN can generate new high-quality motions with arbitrary lengths using different motion prefixes. Next, we evaluate DFN on data with single and mixed motion classes to test its ability to learn the different transition stochasticity on data with a single type of dynamics and with mixed types of dynamics. Last, we push the limits of DFN by reducing the training data, to show that DFN can make use of a small amount of data to generate high-quality and diversified data, which is crucial for data augmentation. More examples can be found in the supplementary video.

6.1 Open-loop Motion Generation

We first show open-loop motion generation, where we do not moderate accumulative errors. We use an 8- to 20-frame motion prefix to start motion generation and produce 900 frames (dfn_run2box_2char and dfn_boxing_3char in the video). The motion stability indicates that DFN does not suffer from the cumulative-error problem that is common in time-series generation [41]. Given the same prefix, the diversity shows in the transitions between different postures (short-term) and different actions (long-term).

6.2 Dynamics Multi-modality

We investigate how well DFN can capture the different transition stochasticity in different motions, using several types of motions with different properties (shown in Tab. 1). We first train DFN on them separately, then jointly. The results can be found in dfn_walk_top, dfn_walk_close, walking1-walking2, running1-running3, dancing1 and boxing1 in the video. We observe that DFN learns the transition stochasticity well when trained on a single type of motion.


Motion  | Cyclic | Main Body Part | Rhythmic | Dynamics
Walking | Yes    | Lower          | No       | Low
Boxing  | No     | Upper          | No       | High
Dancing | No     | Full           | Yes      | High
Running | Yes    | Lower          | No       | High

Table 1: Motion types and their features.

The diversity can be found in short-term and long-term transitions, which are two levels of multi-modality captured well by DFN. In walking (dfn_walk_top and dfn_walk_close in the video), the short-term stochasticity shows as within-cycle motion randomness, which enriches the walking style. The long-term stochasticity shows when a turn is generated. Action-level transitions are also captured and generated. Similar observations hold for the other motions.

When trained on mixed data (combining all motions in Tab. 1), DFN learns higher-level action transitions between different motion classes. We can see examples (action_transition in the video and frame-level images in the supplementary file) that transit from dancing to running, from boxing to dancing, or from slow walking to running, showing the modelling capacity at both levels. Within a single action, diversified styles are learned well. Between different actions, transitions are learned well too. This demonstrates the benefit of explicitly modelling the randomness between the past, current and future states; without it, it would be hard to capture the multi-modality and the model would average over all types of dynamics, resulting in meaningless mean poses and motions.

6.3 Diversied Generation

Although it is visually clear that DFN can generate diversified motions, we also show the diversity numerically, especially when the duration of generation becomes long. First, we randomly select training data of different classes and show their latent feature trajectories in Fig. 6, with an embedding dimension of 16. We then use a PCA model to project the embedding features to 2D. The trajectories are continuous and smooth without extra constraints on the auto-encoder. This shows that the motion dynamics are captured well in the latent space, which is critical for motion generation. Next, we show a group of generated motions in Fig. 7. Even with the same motion prefix, motions start to diversify from the beginning, which is a distinct property lacking in the deterministic generators of most action-prediction models such as [32]; moreover, our sequences are in 3D, which is more difficult than 2D [43].
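The latent-trajectory visualizations of Figs. 6-7 can be reproduced with a simple PCA projection of the 16-D pose codes, e.g.:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_trajectories(code_seqs, labels):
    """Project 16-D pose-embedding trajectories to 2D with PCA and plot them.

    code_seqs: list of (T_i, 16) arrays of pose codes z;
    labels: one class name per sequence (used only for the legend)."""
    pca = PCA(n_components=2).fit(np.concatenate(code_seqs, axis=0))
    for z, lab in zip(code_seqs, labels):
        xy = pca.transform(z)
        plt.plot(xy[:, 0], xy[:, 1], alpha=0.7, label=str(lab))
        plt.scatter(xy[0, 0], xy[0, 1], marker='o')   # circle marks the first frame
    plt.legend()
    plt.show()
```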

Distributional Shift in Time. The motion diversity increases over time. To show that there is a distributional shift of poses, we randomly pick an initial sequence from the training data, then randomly generate 4096 sequences, each with 128 frames. We visualize the current latent state $s$ and latent feature $z$ at $t=32$ and $t=128$ in Fig. 8. Note that the distributions of $s$ and $z$ capture the stochasticity at two different levels, one at the stochastic state level and the other at the latent feature level. The red dots represent them at $t=32$ and the yellow dots at $t=128$.

For both $z$ and $s$, the red dots ($t=32$) are more concentrated, showing that the 4096 generated motions are still somewhat similar to each other in the early stage.

Figure 6: Randomly selected training motions in the latent space. Colors indicate different motion classes. Smooth trajectories are universally obtained by the embedding.

Figure 7: Pose embedding trajectories of randomly generated sequences given the same 20-frame initialization. The circle marks the first frame. As time goes on, the trajectories depart from each other.

However, the yellow dots ($t=128$) show that the generated motions diversify later on. Not only do they shift out of the original red region, indicating that they now lie in different pose regions, they also start to diverge more, shown by the different modes in the yellow areas, meaning they have diverged into several different pose regions.

Distribution Matching in Time. Another way to test the diversity of generated motions is to measure their statistical similarity to the training motions. Since the motion prefix comes from one particular motion, the more similar the generated motions are to the whole training dataset, the more diverse they are, because the generated motions have left the original motion region the prefix belongs to.

We employ the mean-distance distribution as a measure, as in [43]. For each time step, we calculate the mean pose of all generated motions, then calculate the Euclidean distances between the mean pose and all other poses at that time step. We then plot the mean distance and variance in Figure 9. The blue background indicates the mean and variance of the mean-distance distribution of the training dataset.

Figure 8: Four groups of motions generated from four different motion prefixes, each group with 4096 motions, and their $z$ (left, pose embedding distribution) and $s$ (right, pose embedding latent distribution) at $t=32$ and $t=128$. The earlier distributions are more concentrated and diverge quickly as time passes.

Figure 9: Four groups of motions generated from four different motion prefixes, each group with 4096 motions. The x axis represents time and the y axis represents the mean distance to the average pose at each time step. The band represents the variation.

As time passes, the mean-distance distribution of the generated poses gradually matches that of the training data. This further shows the generation diversity.
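A sketch of the mean-distance measure used above (per-step distances of every generated pose to the per-step mean pose) is:

```python
import numpy as np

def mean_distance_distribution(motions):
    """Diversity measure of Section 6.3 (after [43]). motions: (N, T, D) array
    of N generated motions. For every time step, compute the mean pose over
    the N motions and the Euclidean distance of every pose to that mean;
    return the per-step mean and standard deviation of those distances."""
    mean_pose = motions.mean(axis=0, keepdims=True)          # (1, T, D)
    dists = np.linalg.norm(motions - mean_pose, axis=-1)     # (N, T)
    return dists.mean(axis=0), dists.std(axis=0)             # curves over time
```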

6.4 Generation on Limited Training Data

DFN aims to alleviate data scarcity, so it should require as little data as possible for generation. We therefore push DFN to its limit by reducing the training data, to find the minimal amount of data needed. To investigate each individual type of motion, we train DFN on walking, running and boxing data separately. We start from the full training data, where the longest sequence lasts around 10 minutes, and gradually reduce the duration by sampling until the quality of the generated motions starts to deteriorate. Although DFN responds to reduced training data slightly differently on different motions, we are finally able to reduce the training data to a tiny amount, with the longest sequence being only 15 seconds (12 seconds for walking, 15 seconds for boxing and 7 seconds for running). DFN can still generate stable motions even when trained on merely a 7-second motion (see reduced_data in the video). The impact of reducing the training data is mainly on the diversity of the motions, although the supplementary video shows that the generated boxing motions still retain a certain degree of diversity. Less training data contains fewer transition diversities (both short-term and long-term), so the generated motions are less diverse. This is understandable, as DFN cannot deviate too much from the original data distribution without compromising motion quality.

6.5 Comparison

To the best of our knowledge, the only directly comparable work is [42], which also focuses on diversified motion generation. The biggest difference is that DFN explicitly models the influence of the future on the current state. This enables DFN to explicitly model the transition randomness at different stages and levels, and is the key reason why DFN can be trained well on multiple types of motions, separately and jointly, which has not been shown in [42]. However, a direct numerical comparison is difficult due to the lack of widely accepted metrics for diversified motion generation. In addition, the method in [42] uses heavy post-processing while DFN does not.

7 CONCLUSION AND DISCUSSIONS

In this paper, we propose a new generative model, DFN, for diversified human motion generation. DFN can generate motions of arbitrary length. It successfully captures the transition stochasticity in the short and long term, and is capable of learning the multi-modal randomness in different motions. The training data needed is small. We have conducted extensive evaluations to show DFN's robustness, versatility and diversity in motion generation.

There are two main limitations of our method: there is no control signal, and it can sometimes overly smooth high-frequency motions. We will address them in future work. Our explicit modelling of the future makes it convenient to introduce a desired future as a control signal, while replacing some of the Gaussian components with multi-modal priors might mitigate the over-smoothing issue.

ACKNOWLEDGMENTS

We thank anonymous reviewers for their valuable comments. This

work is partially supported by the National Key Research & Devel-

opment Program of China (No. 2016YFB1001403), NSF China (No.

61772462, No. 61572429, No. U1736217), the 100 Talents Program

of Zhejiang University, Strategic Priorities Fund Research England,

and EPSRC (Ref:EP/R031193/1).

REFERENCES

[1] Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2Pose: Natural Language Grounded Pose Forecasting. In 2019 International Conference on 3D Vision (3DV). IEEE, 719–728.

[2] Federico Bartoli, Giuseppe Lisanti, Lamberto Ballan, and Alberto Del Bimbo. 2018. Context-aware trajectory prediction. In 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 1941–1946.

[3] Justin Bayer and Christian Osendorfer. 2014. Learning Stochastic Recurrent Networks. stat 1050 (2014), 27.

[4] Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom. 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6158–6166.

[5] Nutan Chen, Maximilian Karl, and Patrick Van Der Smagt. 2016. Dynamic movement primitives in latent space of time-dependent variational autoencoders. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids). IEEE, 629–636.

[6] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, 103–111. https://doi.org/10.3115/v1/W14-4012

[7] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems. 2980–2988.

[8] Han Du, Erik Herrmann, Janis Sprenger, Noshaba Cheema, Somayeh Hosseini, Klaus Fischer, and Philipp Slusallek. 2019. Stylistic Locomotion Modeling with Conditional Variational Autoencoder. In Eurographics (Short Papers). 9–12.

[9] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.

[10] Anirudh Goyal Alias Parth Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. 2017. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems. 6713–6723.

[11] Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1308.html#Graves13

[12] Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and Theophane Weber. 2019. Temporal Difference Variational Auto-Encoder. In International Conference on Learning Representations. https://openreview.net/forum?id=S1x4ghC9tQ

[13] Xiao Guo and Jongmoo Choi. 2019. Human Motion Prediction via Learning Local Structure Representations and Temporal Dependencies. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 2580–2587.

[14] Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, Taku Komura, Jun Saito, Ikuo Kusajima, Xi Zhao, Myung-Geol Choi, Ruizhen Hu, et al. 2017. A Recurrent Variational Autoencoder for Human Motion Synthesis. In BMVC.

[15] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. 2020. Dream to Control: Learning Behaviors by Latent Imagination. In International Conference on Learning Representations. https://openreview.net/forum?id=S1lOTC4tDS

[16] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. 2019. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning. 2555–2565.

[17] Félix G Harvey and Christopher Pal. 2018. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs. 1–4.

[18] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2019. MoGlow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598 (2019).

[19] Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. 2019. Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE International Conference on Computer Vision. 7134–7143.

[20] Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–13.

[21] Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–11.

[22] Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs. 1–4.

[23] Wei-Ning Hsu, Yu Zhang, and James Glass. 2017. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems. 1878–1889.

[24] Junfeng Hu, Zhencheng Fan, Jun Liao, and Li Liu. 2019. Predicting Long-Term Skeletal Motions by a Spatio-Temporal Hierarchical Recurrent Network. arXiv preprint arXiv:1911.02404 (2019).

[25] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music Transformer. In International Conference on Learning Representations. https://openreview.net/forum?id=rJe4ShAcF7

[26] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. 2017. Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. stat 1050 (2017), 3.

[27] Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. stat 1050 (2014), 1.

[28] Jogendra Nath Kundu, Maharshi Gor, Phani Krishna Uppala, and Venkatesh Babu Radhakrishnan. 2019. Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1459–1467.

[29] Chen Li and Gim Hee Lee. 2019. Generating multiple hypotheses for 3D human pose estimation with mixture density network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9887–9895.

[30] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234.

[31] Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. 2019. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE International Conference on Computer Vision. 9489–9497.

[32] Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.

[33] Jianyuan Min and Jinxiang Chai. 2012. Motion Graphs++: A Compact Generative Model for Semantic Motion Analysis and Synthesis. ACM Trans. Graph. 31, 6, Article 153 (Nov. 2012), 12 pages. https://doi.org/10.1145/2366145.2366172

[34] Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A Quaternion-based Recurrent Model for Human Motion.

[35] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–14.

[36] A. Safonova, Jessica Hodgins, and Nancy Pollard. 2004. Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. ACM Transactions on Graphics (TOG) (2004).

[37] Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, and Yoshua Bengio. 2018. Twin Networks: Matching the Future for Sequence Generation. In International Conference on Learning Representations. https://openreview.net/forum?id=BydLzGb0Z

[38] Xiangbo Shu, Liyan Zhang, Guo-Jun Qi, Wei Liu, and Jinhui Tang. 2019. Spatiotemporal Co-attention Recurrent Neural Networks for Human-Skeleton Motion Prediction.

[39] Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.

[40] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. [n.d.]. WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop. 125–125.

[41] He Wang, Edmond SL Ho, Hubert PH Shum, and Zhanxing Zhu. 2019. Spatio-temporal Manifold Learning for Human Motions via Long-horizon Modeling. IEEE Transactions on Visualization and Computer Graphics (2019).

[42] Zhiyong Wang, Jinxiang Chai, and Shihong Xia. 2019. Combining recurrent neural networks and adversarial training for human motion synthesis and control. IEEE Transactions on Visualization and Computer Graphics (2019).

[43] Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. 2019. Learning Diverse Stochastic Human-Action Generators by Learning Smooth Latent Transitions. CoRR abs/1912.10150 (2019). http://arxiv.org/abs/1912.10150

[44] Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. 2018. Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis. In International Conference on Learning Representations. https://openreview.net/forum?id=r11Q2SlRW