Dynamic Future Net: Diversified Human Motion Generation
Wenheng Chen
NetEase Fuxi AI Lab
chenwenheng@corp.netease.com
He Wang
University of Leeds
H.E.Wang@leeds.ac.uk
Yi Yuan
NetEase Fuxi AI Lab
yuanyi@corp.netease.com
Tianjia Shao
State Key Lab of CAD&CG, Zhejiang
University
tjshao@zju.edu.cn
Kun Zhou
State Key Lab of CAD&CG, Zhejiang
University
kunzhou@acm.org
Figure 1: Given a 20-frame walking motion prefix (white), our model can generate diversified motions: walking (yellow), walking-to-running (blue), walking-to-boxing (green), and walking-to-dancing (red), with arbitrary duration. The corresponding animation can be found in teaser.mp4 in the supplementary video.
ABSTRACT
Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality. Acquiring high-quality skeletal motions is difficult due to the need for specialized equipment and laborious manual post-processing, which necessitates maximizing the use of existing data to synthesize new data. However, this is challenging due to the intrinsic stochasticity of human motion dynamics, manifested in both the short and long term. In the short term, there is strong randomness within a couple of frames, e.g. one frame can be followed by multiple possible frames leading to different motion styles; in the long term, there are non-deterministic action transitions. In this paper, we present Dynamic Future Net, a new deep learning model in which we explicitly focus on the aforementioned motion stochasticity by constructing a generative
model with non-trivial modelling capacity for temporal stochasticity. Given limited amounts of data, our model can generate a large number of high-quality motions with arbitrary duration and visually convincing variations in both space and time. We evaluate our model on a wide range of motions and compare it with state-of-the-art methods. Both qualitative and quantitative results show the superiority of our method in terms of robustness, versatility and quality.
CCS CONCEPTS
• Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability.
KEYWORDS
human motion, neural networks, generative models
ACM Reference Format:
Wenheng Chen, He Wang, Yi Yuan, Tianjia Shao, and Kun Zhou. 2020. Dynamic Future Net: Diversified Human Motion Generation. In 28th ACM International Conference on Multimedia (MM '20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3413669
1 INTRODUCTION
Modeling natural human motion is a central topic in several fields such as computer animation, bio-mechanics and virtual reality, where high-quality motion data is a necessity. Despite the improved accuracy and lowered costs of motion capture systems, it is still highly desirable to make full use of existing data to generate diversified new data. One key challenge in motion generation is dynamics modelling, where it has been shown that a latent space can be found thanks to the high coordination of body motions [21, 36, 41]. However, while the spatial aspect has been well studied, dynamics modelling, especially with the aim of diversified motion generation, remains an open problem.
Human motion dynamics manifest several levels of short-term and long-term stochasticity. Given a homogeneous discretization of motions in time, short-term stochasticity refers to the randomness in the next one or few frames (pose transitions), while long-term stochasticity refers to the randomness in the next one or few actions (action transitions). Traditional methods model them with Finite State Machines built on carefully organized data [33], which have limited model capacity for large amounts of data and require extensive pre-processing. Newer deep learning methods either ignore them [21] or do not model them explicitly [41]. Very recently, dynamics modelling for diversified generation has been investigated [42], but only from the perspective of the overall dynamics, rather than the detailed short/long-term stochasticity.
In this paper, we propose a new deep learning model, Dynamic Future Net, or DFN, for automatic and diversified high-quality motion generation based on limited amounts of data. Given a motion, we assume that it can be discretized homogeneously in time and represented by a series of postures and instantaneous velocities. Following the observation that it is easier to learn the dynamics in a natural motion latent space [21], we first embed features in the data space into a latent space. Next, DFN explicitly learns the history, current and future states at any given time, where we also model several conditional distributions for the influence of the history and future states on the current state. This state-wise bidirectional modelling (extending into both the past and the future) separates DFN from existing methods and endows it with the ability to model both short-term (next-frame) and long-term (next-action) randomness. Last, for inference purposes, we propose new loss functions based on distributional similarities, as opposed to point-wise estimation [41, 44], which capture the dynamics accurately while keeping the randomness that is crucial for diversified motion generation.
We present extensive experimental results to demonstrate DFN's robustness, versatility and quality. Unlike existing methods which have to be trained on one type of motion at a time [42, 44], DFN can be trained on either a single type of motion or mixed motions, which shows its ability to capture multi-modal dynamics and therefore its versatility in diversified motion generation. Visual evaluation shows that DFN can generate high-quality motions with different dynamics.
In summary, our formal contributions include:
(1) a new deep learning model, Dynamic Future Net, for automatic, diversified and high-quality human motion generation;
(2) a new dynamic model that captures the transition stochasticity of the past, current and future states in motions;
(3) insights into the importance of both short-term and long-term dynamics in human motion modelling.
2 BACKGROUND AND RELATED WORKS
2.1 Human pose and motion embedding
Given a human motion sequence, it is useful to find a low-dimensional representation of the whole sequence. Holden et al. [22] first used a convolutional neural network to project the entire sequence to a low-dimensional embedding space. Using this more abstract representation, one can blend motions or remove noise from corrupted motions. In [20, 21], the authors further exploit the power of the learned motion manifold and decoder to synthesize motions with constraints. Another important application of motion embedding is motion style generation [8], in which the embedding code can be tuned to match the desired style. Although modeling a motion sequence with auto-encoders is straightforward, how such models capture the dynamics of human motion is not clear. In [28], the authors model a motion sequence as a trajectory in a pose embedding manifold, then use a bidirectional RNN to encode the trajectory and model its dynamics, which improves motion classification results. Moreover, they design a graph-like network to better represent the human body components.
The existing methods focus on the embedding of poses and dynamics. However, they do not explicitly model the distributions of these latent variables, which govern the stochasticity of the dynamics. In this paper, we go a level deeper and learn the latent variable distributions for the embedded poses and dynamics.
2.2 Deterministic human motion prediction and synthesis
In the effort of modelling motion dynamics, many methods employ deterministic transitions [2, 4, 9, 13, 17, 19, 24, 30, 31, 38], especially in human motion prediction or generation. They focus either on short-term dynamics modeling or on the spatio-temporal information of the overall dynamics. In [44], the authors propose a training technique for RNNs to generate very long human motions. Although this technique solves the freezing phenomenon of RNNs, their model is deterministic, which makes training difficult: given a past state, if multiple possible future motions are present in the data, the network will average them, a common problem in many human motion prediction methods.
One solution to this problem is to introduce control signals [14, 39]. These works design several networks that make the character follow a given trajectory in real time. In [35], the control signal becomes the 3D human pose predicted by neural networks, used as a reference for an agent to imitate. In [1], the authors co-embed language and the corresponding motions into a shared manifold, ignoring the fact that language-to-motion is a one-to-many mapping. Even with a specific control signal, such as a 2D human skeleton, one can still expect different motions or different poses corresponding to the same control signal [29], essentially indicating the multi-modal nature of human motion dynamics.
Different from existing methods, our paper focuses on explicitly modelling the multi-modal nature of motion transitions in human motions. Further, we also aim to learn the stochasticity in those transitions.
2.3 Stochastic human motion synthesis
In [42], the authors combine RNNs and Generative Adversarial Networks (GANs) to generate stochastic human motions. They use a mixture density layer to model the stochasticity, and an adversarial discriminator to judge whether the generated motion is natural. In MoGlow [18], the authors for the first time use normalizing flows to directly model the next-frame distribution. One advantage of this method is that it can capture complex distributions without learning an explicit latent space. Given the same initial poses and the same control signals or constraints, the model still generates different motion sequences. Chen et al. [5] combine dynamic movement primitives and variational Bayesian filtering to model human motion dynamics. They show that the latent representations self-cluster after training. However, the transitions require information from the whole sequence, which prevents it from being a pure generative model.
Our method differs from existing approaches in its treatment of the relations between the past, current and future states of human motions. Unlike the aforementioned methods, we explicitly model the current state based on both the past and the future. Also, we further model their randomness in a latent space that captures the transition multi-modality.
2.4 Stochastic RNN model
Modelling the stochasticity in time-series data such as music, handwriting and human voice has been a long-standing problem [11, 25, 40]. The VRNN [7] for the first time combines the Variational AutoEncoder (VAE) [27] and recurrent neural networks for this purpose. Later, in [23], the authors disentangle the latent variables of the observation and the dynamics, with the observation latent being used to recover the full observation information, and the dynamics latent capturing the dynamics.
A key modelling choice in stochastic RNN models is the relation between the past, current and future. In early work, the posterior of the current state is inferred from the past information only, which makes it unable to foresee the future. In [3, 10, 37], the authors show that performance can be improved by incorporating the future state with a backward RNN in the inference stage. In [12], the authors design a model that goes beyond step-by-step modelling and predicts multiple steps up to a given horizon. Similar efforts have also been made in reinforcement learning, where the reward function takes the discounted future reward into consideration [12, 15]. In [16], the authors go further and design a model that predicts multiple future scenarios, then chooses the one with the highest predicted reward among all the possibilities.
We observe that human motions follow a similar philosophy: the current state is a result of the past motion but also a particular choice for a certain planned future. Our research is inspired by stochastic RNN models but focuses on human motion transition stochasticity.
3 METHOD OVERVIEW
Our method takes a homogeneous series of human pose representations as input. This representation contains the 3D joint coordinates relative to the root, the root translation velocity over the ground plane, and the rotation velocity around the y-axis. We propose Dynamic Future Net to model the motion dynamics as a future-guided transition and to generate random natural human motions that transition between different actions.
As illustrated in Fig. 2, DFN is composed of three modules: a pose encoder, a pose trajectory encoder and a stochastic latent RNN. The pose encoder (Section 5.1) maps the high-dimensional human pose to a latent space, while the pose trajectory encoder (Section 5.2) embeds a trajectory in the latent pose space into a code. Such compact representations of pose sequences facilitate the learning process [28]. As the key module, the stochastic latent RNN (Section 5.3) deploys a stochastic latent state and a deterministic latent state to learn two latent distributions, one for the pose embedding and one for the future trajectory embedding. Explicitly learning two different latent distributions on the one hand forces the model to learn strong temporal correlations and on the other hand generates motions with varied and natural transitions. During inference we combine the past, current and future states to infer the current latent state distribution, and we combine the past and future to infer the future latent state distribution. In the generation stage, unlike existing methods [10] where the current state is generated from the past state only, we first generate the future state and combine it with the past state to generate the current latent state prior, from which we sample the current latent state and then decode it into the pose embedding to recover the current pose and velocity. We regard this process as a self-driven motion generation process guided by the envisioned dynamic future. In this way, the model can learn and generate rich and varied natural motions.
Figure 2: Overview of the proposed Dynamic Future Network. During learning, the network takes a human motion sequence as input and predicts the long-term distribution and the short-term (next-frame) distribution.
4 DATA PREPARATION
We train our model on the CMU human motion capture dataset. As the skeletons in the original dataset differ, we first retarget all motions to a chosen skeleton as in [21]. The skeleton contains 24 joints. We first extract the global X and Z coordinates of the root and rotate the human pose to face the Y-axis direction as in [22]; the global position and orientation of the pose can then be recovered from the X-Z velocity and the rotation velocity around the Y axis. The resulting pose vector contains 76 degrees of freedom: 72 for the 3D joint positions and 4 for the global translation and rotation velocities.
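For concreteness, the sketch below shows one plausible way to assemble this per-frame feature vector. The function name, the exact split of the 4 velocity components, and the cos/sin encoding of the yaw increment are assumptions made for illustration; the paper only specifies 72 joint-position values plus 4 translation/rotation velocity values.

```python
import numpy as np

def build_frame_features(joints_xyz, prev_root_xz, root_xz, prev_yaw, yaw):
    """One plausible layout of the 76-D pose-velocity vector (a sketch).

    joints_xyz : (24, 3) joint positions relative to the root, already
                 rotated to face the Y-axis direction (72 values).
    The remaining 4 values are the root translation velocity on the ground
    plane (X, Z) and the rotation velocity around the Y axis, here encoded
    as (cos, sin) of the yaw increment to avoid angle wrap-around.
    """
    joint_part = np.asarray(joints_xyz).reshape(-1)          # 72-D local pose
    vel_xz = np.asarray(root_xz) - np.asarray(prev_root_xz)  # 2-D ground velocity
    dyaw = yaw - prev_yaw
    vel_rot = np.array([np.cos(dyaw), np.sin(dyaw)])         # 2-D rotation velocity
    return np.concatenate([joint_part, vel_xz, vel_rot])     # 76-D frame vector
```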
Figure 3: The pose-velocity auto-encoder network. The input of the encoder is the 76-dimensional pose-velocity vector. The encoder outputs two codes, one for the pose and the other for the velocity. The pose code and velocity code are fed into a quaternion decoder and a velocity decoder, respectively. The 3D joint positions are recovered from the quaternions by the Forward Kinematics (FK) layer.
5 METHODOLOGY
Formally, we describe a motion as a homogeneous time series {X_0, ..., X_T}, where X_t is the motion frame at time t and contains the joint positions and global velocities. Starting from the joint distribution P(X_{<t}, X_t, X_{t+1:t+H}), we model the influence of the past frames X_{<t} and the future frames X_{t+1:t+H} on the current frame X_t by the transition distributions P(X_t | X_{t+1:t+H}) and P(X_t | X_{<t}), where H is the duration of a short-horizon future. This modelling choice is based on two observations. First, the current frame is a result of the past motion and therefore conditioned on it, captured by P(X_t | X_{<t}). Second, the current frame is also a choice made for a certain planned future, e.g. one needs to stop swinging the legs to transition from walking to standing, captured by P(X_t | X_{t+1:t+H}). In addition, since the past motion also limits the possibilities of the future motion, the past has an impact on the future, P(X_{t+1:t+H} | X_{<t}). Overall, the joint probability is:

P(X_{<t}, X_t, X_{t+1:t+H}) ∝ P(X_t | X_{t+1:t+H}) P(X_{t+1:t+H}, X_t | X_{<t})    (1)
Note that the two probabilities on the right-hand side play different roles. P(X_{t+1:t+H}, X_t | X_{<t}) is the probability of unrolling from the past into the future. Given a known past, it is a joint probability of both the current and the future, containing all possible transitions. On top of it, P(X_t | X_{t+1:t+H}) dictates that if the future is also known, the current can be inferred. This explicit modelling of the transition distributions between the past, current and future helps capture the transition stochasticity, which facilitates diversified motion generation, as shown in the experiments.
Learning the transition probabilities in the data space, however, is difficult due to the curse of dimensionality. We therefore project motions into a latent space, which involves embedding both the frames and the dynamics, and learn the transition distributions in that latent space. During inference, we recover motions from sampled latent states back in the original data space. DFN is naturally divided into three components: spatial (frame) embedding, dynamics embedding and dynamics modelling.
5.1 Spatial Embedding
We use an auto-encoder for frame embedding, z_t = PoseEnc(X_t) and X̂_t = PoseDec(z_t), shown in Figure 3. PoseEnc is a multi-layer perceptron that projects the data into the latent space. We then split the latent feature into two components representing the pose code and the global velocity code. PoseDec contains two components, a quaternion decoder and a velocity decoder. The quaternion decoder takes the pose latent feature as input and outputs joint angles (represented by quaternions), and the velocity decoder takes the latent velocity feature as input and outputs the velocity. The quaternion decoder is essentially a differential Inverse Kinematics module. As stated in [34], using joint rotations instead of joint positions maintains the bone lengths. After the reconstruction, we use a Forward Kinematics layer to compute the 3D joint positions. To train the auto-encoder, we use a mean squared error loss:

L_sl = (1/T) Σ_{t=0}^{T} ||X_t − X̂_t||_2²    (2)

where T is the number of frames in a motion.
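A minimal PyTorch sketch of this pose-velocity auto-encoder is given below. The layer widths, the latent split (16-D pose code, 4-D velocity code), the w-first quaternion convention and the names PoseAE and FKLayer are assumptions for illustration; the paper only specifies the 76-D input, the pose/velocity split, the quaternion decoder and the FK layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- quaternion helpers (w, x, y, z convention, assumed) -------------------
def quat_mul(a, b):
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([aw*bw - ax*bx - ay*by - az*bz,
                        aw*bx + ax*bw + ay*bz - az*by,
                        aw*by - ax*bz + ay*bw + az*bx,
                        aw*bz + ax*by - ay*bx + az*bw], dim=-1)

def quat_rotate(q, v):
    qv = F.pad(v, (1, 0))                                   # (0, vx, vy, vz)
    qc = q * torch.tensor([1., -1., -1., -1.], device=q.device)
    return quat_mul(quat_mul(q, qv), qc)[..., 1:]

class FKLayer(nn.Module):
    """Forward kinematics: per-joint local quaternions -> root-relative positions.
    parents must be topologically ordered (parent index < child index)."""
    def __init__(self, parents, offsets):                   # offsets: (J, 3) bone offsets
        super().__init__()
        self.parents = parents
        self.register_buffer("offsets", offsets)

    def forward(self, quats):                                # (B, J, 4), unit-normalised
        pos, rot = [quats.new_zeros(quats.shape[0], 3)], [quats[:, 0]]
        for j in range(1, len(self.parents)):
            p = self.parents[j]
            rot.append(quat_mul(rot[p], quats[:, j]))        # accumulate global rotation
            pos.append(pos[p] + quat_rotate(rot[p], self.offsets[j]))
        return torch.stack(pos, dim=1)                       # (B, J, 3)

class PoseAE(nn.Module):
    """Sketch of the pose-velocity auto-encoder (all sizes are assumptions)."""
    def __init__(self, parents, offsets, d_pose=16, d_vel=4):
        super().__init__()
        J = len(parents)
        self.enc = nn.Sequential(nn.Linear(76, 256), nn.LeakyReLU(),
                                 nn.Linear(256, d_pose + d_vel))
        self.quat_dec = nn.Sequential(nn.Linear(d_pose, 256), nn.LeakyReLU(),
                                      nn.Linear(256, J * 4))
        self.vel_dec = nn.Linear(d_vel, 4)
        self.fk = FKLayer(parents, offsets)
        self.d_pose = d_pose

    def forward(self, x):                                    # x: (B, 76) frame vectors
        z = self.enc(x)
        z_pose, z_vel = z[:, :self.d_pose], z[:, self.d_pose:]
        q = F.normalize(self.quat_dec(z_pose).view(x.shape[0], -1, 4), dim=-1)
        joints = self.fk(q).flatten(1)                       # (B, 72) reconstructed joints
        return torch.cat([joints, self.vel_dec(z_vel)], dim=-1), z

# reconstruction loss of Eq. (2): mean squared error over frames
def recon_loss(x, x_hat):
    return F.mse_loss(x_hat, x)
```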
5.2 Dynamics Embedding
Figure 4: A seq-to-seq network for trajectory embedding.
After learning the pose latent space, we model the dynamics as trajectories in this space using a recurrent neural network, shown in Figure 4. We employ a sequence-to-sequence architecture as it forces the model to learn long-term dynamics. The RNN consists of Gated Recurrent Units (GRU) [6]; it encodes a sequence of encoded frames {z_t, ..., z_{t+H}} into a latent representation m_t, and then unrolls to reconstruct the same sequence {z'_{t+1}, ..., z'_{t+H}} from m_t given z_t. In contrast to z_t = PoseEnc(X_t), which encodes a single frame, m_t is a future summary over multiple frames. We use the following loss:

L_tl = L_rec + L_smooth    (3)

where L_rec = (1/T) Σ_{t=0}^{T} ||z_t − z'_t||_2² and L_smooth = (1/T) Σ_{t=0}^{T} ||V_t − V̂_t||_2²,

where T is the number of frames in a motion, and V_t and V̂_t are the original and reconstructed joint velocities. To facilitate training, we use Eq. 2 to pre-train the posture auto-encoder and fix its weights when training the RNN module.
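A sketch of this sequence-to-sequence trajectory embedding follows. The dimensions (a 20-D pose-velocity code, a 128-D summary m_t) and the use of a GRUCell decoder that feeds its own predictions back are assumptions; the paper only specifies a GRU-based encoder-decoder that reconstructs {z'_{t+1}, ..., z'_{t+H}} from m_t given z_t.

```python
import torch
import torch.nn as nn

class TrajectoryAE(nn.Module):
    """Sketch of the seq-to-seq trajectory embedding (layer sizes are assumptions)."""
    def __init__(self, d_z=20, d_m=128):
        super().__init__()
        self.encoder = nn.GRU(d_z, d_m, batch_first=True)
        self.decoder_cell = nn.GRUCell(d_z, d_m)
        self.readout = nn.Linear(d_m, d_z)

    def forward(self, z_seq):                  # z_seq: (B, H+1, d_z) = {z_t, ..., z_{t+H}}
        _, m_t = self.encoder(z_seq)           # m_t: (1, B, d_m) future summary
        h = m_t.squeeze(0)
        z_in, recons = z_seq[:, 0], []         # start unrolling from z_t
        for _ in range(z_seq.shape[1] - 1):
            h = self.decoder_cell(z_in, h)
            z_in = self.readout(h)             # predict z'_{t+1}, ..., z'_{t+H}
            recons.append(z_in)
        return m_t.squeeze(0), torch.stack(recons, dim=1)

# L_rec of Eq. (3); L_smooth would compare decoded joint velocities analogously
def traj_rec_loss(z_seq, z_rec):
    return ((z_seq[:, 1:] - z_rec) ** 2).mean()
```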
5.3 Dynamics modelling
Generative Model. After embedding the poses and dynamics into a latent space, we now explain the dynamics modelling, which is the key technical contribution of this paper. We propose a new dynamics model that captures the joint distribution P(X_{<t}, X_t, X_{t+1:t+H}). Rather than directly learning the distribution in the data space, we aim to learn the latent joint distribution P(z_{<t}, z_t, z_{t+1:t+H}), where we abstract the past, current and future features separately. First, given the Markov property, we assume that all past information z_{<t} is encoded into h_t, a deterministic (known) past state. Next, we assume that the future information z_{t+1:t+H} can be summarized into a future state f_t, drawn from a distribution over all possible future states conditioned on h_t. Last, we assume that there is a current state s_t which captures the current information z_t and is drawn from a distribution over all possible current states. We can therefore write:

P(z_{<t}, z_t, z_{t+1:t+H}) = P(z_{t+1:t+H}, z_t | z_{<t}) P(z_{<t})
    ∝ P(z_t | z_{t+1:t+H}) P(z_t | z_{<t}) P(z_{t+1:t+H} | z_{<t})
    = P(s_t | f_t) P(s_t | h_t) P(f_t | h_t)
    = P(s_t | f_t, h_t) P(f_t | h_t)    (4)

where we directly use s_t, h_t and f_t in place of the corresponding z variables by assuming mappings between them, which will be explained later. Different from existing methods [10, 37], our direct conditioning of the current state on the future and past states, P(s_t | f_t, h_t), and of the future state on the past state, P(f_t | h_t), gives us great flexibility in modelling the stochasticity of transitions. The generative model is shown in Fig. 5a.
Future Feature Prior. Given h_t, we first predict the future state via P(f_t | h_t). Here, we assume a diagonal multivariate Gaussian prior over f_t [5, 14, 26]:

p(f_t | h_t) = Gaussian(f_t; μ^f_t, σ^f_t), where [μ^f_t, σ^f_t] = g^p_f(h_t)    (5)

where μ^f_t and σ^f_t are the mean and covariance, and g^p_f is a three-layer MLP with hidden dimension 256 and LeakyReLU activation. This prior covers all possible future states given the past. It can represent a goal or a driving signal for the generative process. It also forces the model to learn rich motion transitions and long-term correlations, overcoming the freezing problem of traditional RNNs [44].
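As a concrete illustration, the Gaussian head below parameterizes such a prior with a reparameterized sample so that gradients flow through the sampling step. The paper specifies a three-layer MLP of width 256 with LeakyReLU; predicting log σ for numerical stability, the output-head layout and the names GaussianHead and sample_gaussian are assumptions.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Diagonal Gaussian head, usable for priors/posteriors such as p(f_t | h_t)."""
    def __init__(self, d_in, d_out, d_hidden=256, n_layers=3):
        super().__init__()
        layers, d = [], d_in
        for _ in range(n_layers):
            layers += [nn.Linear(d, d_hidden), nn.LeakyReLU()]
            d = d_hidden
        self.body = nn.Sequential(*layers)
        self.mu = nn.Linear(d_hidden, d_out)
        self.log_sigma = nn.Linear(d_hidden, d_out)    # predict log std for stability

    def forward(self, *inputs):
        x = self.body(torch.cat(inputs, dim=-1))
        return self.mu(x), self.log_sigma(x).exp()

def sample_gaussian(mu, sigma):
    """Reparameterised sample, keeping the operation differentiable."""
    return mu + sigma * torch.randn_like(sigma)

# e.g. the future prior g^p_f of Eq. (5): three layers, hidden width 256
# prior_f = GaussianHead(d_in=128, d_out=32, d_hidden=256, n_layers=3)
# mu_f, sigma_f = prior_f(h_t); f_t = sample_gaussian(mu_f, sigma_f)
```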
Current Feature Prior. Next, we explain P(s_t | f_t, h_t). Although h_t is a known (deterministic) past, f_t is random. We therefore first sample a specific future state f_t, then decode it into an unrolled future summary m_t, and finally condition the current state s_t on h_t and m_t:

P(s_t | f_t, h_t) ∝ P(s_t | m_t, h_t) P(m_t | f_t, h_t)    (6)

P(s_t | m_t, h_t) = Gaussian(s_t; μ^s_t, σ^s_t), [μ^s_t, σ^s_t] = g^p_s(h_t, m_t)    (7)

where P(m_t | f_t, h_t) is parameterized by a four-layer MLP with hidden dimension 128 and LeakyReLU activation, μ^s_t and σ^s_t are the mean and covariance, and g^p_s is a two-layer MLP with hidden dimension 128. Once the current state s_t is sampled, we compute the current feature z_t via z_t = MLP(s_t, h_t), where the MLP has three hidden layers of 128 dimensions and LeakyReLU activation.
Finally, given the current and future states, the past state is updated as follows (Fig. 5c):

h_{t+1} = GRU(h_t, s_t, f_t)    (8)

where the GRU has two stacked layers and a hidden state of dimension 128. The generative model in Fig. 5a is now complete.
Different from existing methods [7, 10], where the prior of the current state is a function of the past state h_t only and the future state is shared with the current state, we let the model learn two different distributions for the current and future states. The prior of the current state is also a function of the future state, which forces the model to make use of the future information.
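The sketch below wires Eqs. (5)-(8) into a single generation step: envision a future f_t from h_t, decode it to a summary m_t, sample the current state s_t, decode the current feature z_t, and update the recurrent past state. The hidden widths follow the numbers quoted above where given; the latent dimensions, the mlp helper and the class name DFNGenerator are assumptions.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_hidden, d_out, n_hidden):
    layers, d = [], d_in
    for _ in range(n_hidden):
        layers += [nn.Linear(d, d_hidden), nn.LeakyReLU()]
        d = d_hidden
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

class DFNGenerator(nn.Module):
    """One generation step of the latent dynamics (dimensions are assumptions)."""
    def __init__(self, d_z=20, d_s=32, d_f=32, d_m=128, d_h=128):
        super().__init__()
        self.prior_f = mlp(d_h, 256, 2 * d_f, 3)           # p(f_t | h_t), Eq. (5)
        self.future_dec = mlp(d_f + d_h, 128, d_m, 4)      # m_t from (f_t, h_t)
        self.prior_s = mlp(d_h + d_m, 128, 2 * d_s, 2)     # p(s_t | h_t, m_t), Eq. (7)
        self.z_dec = mlp(d_s + d_h, 128, d_z, 3)           # z_t from (s_t, h_t)
        self.trans = nn.GRU(d_s + d_f, d_h, num_layers=2)  # h_{t+1} update, Eq. (8)

    @staticmethod
    def sample(stats):
        mu, log_sigma = stats.chunk(2, dim=-1)
        return mu + log_sigma.exp() * torch.randn_like(mu)

    def step(self, h_t, gru_state):
        f_t = self.sample(self.prior_f(h_t))               # envision a future
        m_t = self.future_dec(torch.cat([f_t, h_t], -1))   # unroll it to a summary
        s_t = self.sample(self.prior_s(torch.cat([h_t, m_t], -1)))
        z_t = self.z_dec(torch.cat([s_t, h_t], -1))        # current pose-velocity code
        _, gru_state = self.trans(torch.cat([s_t, f_t], -1).unsqueeze(0), gru_state)
        return z_t, gru_state[-1], gru_state               # z_t, h_{t+1}, full GRU state
```

At generation time, the GRU state would be initialized by running the motion prefix through the inference networks, after which step can be called autoregressively for an arbitrary number of frames, with z_t decoded back to a pose by the pose auto-encoder.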
5.4 Inference
In the generative model in Fig. 5a, the key variables to be inferred are s_t and f_t, shown in Fig. 5b. The posterior of the future state f_t depends on the past state h_t and its unrolled future summary m_t. The posterior of the current state s_t depends on the feature z_t, the past state h_t and the future summary m_t. We first factorize the dynamics as follows:

q(s_{≤T} | z_{≤T}) ≈ Π_{t=1}^{T} q(s_t | z_{≤t−1}, z_t, z_{≤t+H}) = Π_{t=1}^{T} q(s_t | h_t, z_t, m_t)    (9)

where T is the total length of the motion. Here we approximate q(s_t | z_{≤T}) with q(s_t | z_{≤t+H}), as the correlation between s_t and the far future is likely to be small, so we only consider frames up to t+H. Then, for each time step, we use an MLP to parameterize the posterior:

q(s_t | h_t, z_t, m_t) = Gaussian(μ^s_t, σ^s_t), [μ^s_t, σ^s_t] = MLP(h_t, z_t, m_t)    (10)

where the MLP has two hidden layers with 32 dimensions and LeakyReLU activation. For the future state we approximate its posterior as follows:

q(f_t | h_t, m_t) = Gaussian(μ^f_t, σ^f_t), [μ^f_t, σ^f_t] = MLP(h_t, m_t)    (11)

where the MLP has two hidden layers with 512 dimensions and LeakyReLU activation.
Figure 5: The stochastic latent RNN. a) Generative model: the current pose embedding feature z_t depends on the current and past latent states s_t and h_t; s_t depends on the past state h_t and the future summary m_t, which in turn depends on the future latent state f_t and the past state h_t. b) Inference of s_t and f_t. c) Transition of h_t. (Legend: h_t, deterministic latent for the past; z_t, pose-velocity code; m_t, code for a future sequence of length 8/16; s_t, stochastic latent for the current state; f_t, stochastic latent for the future state.)
5.5 Temporal difference loss
Besides inferring s_t and f_t, we also constrain the dynamics of s. We assume a relation between two states s_{t1} and s_{t2} at times t1 and t2, with t1 < t2, similar to [12]:

p(s_{t1}, s_{t2}, t1, t2) ∝ p(s_{t2} | s_{t1}, h_{t1}, h_{t2})    (12)

where we parameterize the posterior:

q(s_{t1} | s_{t2}, h_{t1}, h_{t2}) = Gaussian(s_{t1}; μ_{s_{t1}}, σ_{s_{t1}}), where [μ_{s_{t1}}, σ_{s_{t1}}] = MLP(s_{t2}, h_{t1}, h_{t2})    (13)

where μ_{s_{t1}} and σ_{s_{t1}} are the mean and covariance. Here the MLP has two hidden layers with 32 dimensions and LeakyReLU activation. This way we can sample s^posterior_{t1} during inference. Meanwhile, we reconstruct s_{t2} from it and the time difference δt = |t1 − t2|:

s^rec_{t2} = MLP(s^posterior_{t1}, δt)    (14)

where the MLP for skip prediction has three hidden layers, each with 32 dimensions and LeakyReLU activation.
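A minimal sketch of this temporal-difference branch follows. The hidden widths of 32 match the text above; the latent sizes, feeding δt as a single scalar input, and the function names are assumptions.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_hidden, d_out, n_hidden):
    layers, d = [], d_in
    for _ in range(n_hidden):
        layers += [nn.Linear(d, d_hidden), nn.LeakyReLU()]
        d = d_hidden
    return nn.Sequential(*layers, nn.Linear(d, d_out))

d_s, d_h = 32, 128                                    # assumed latent sizes
q_skip = mlp(d_s + 2 * d_h, 32, 2 * d_s, 2)            # q(s_t1 | s_t2, h_t1, h_t2), Eq. (13)
skip_pred = mlp(d_s + 1, 32, d_s, 3)                   # s_t2 from (s_t1, dt), Eq. (14)

def temporal_difference_step(s_t2, h_t1, h_t2, t1, t2):
    mu, log_sigma = q_skip(torch.cat([s_t2, h_t1, h_t2], -1)).chunk(2, -1)
    s_t1 = mu + log_sigma.exp() * torch.randn_like(mu)        # posterior sample of s_t1
    dt = torch.full_like(s_t1[:, :1], float(abs(t1 - t2)))    # time gap as an extra input
    s_t2_rec = skip_pred(torch.cat([s_t1, dt], -1))           # reconstruct s_t2 across dt
    return s_t1, s_t2_rec, (mu, log_sigma.exp())
```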
5.6 Learning
Finally, we compose all terms into the loss function:

max_Φ  L_d + L_T    (15)

where Φ is the set of all learnable parameters in our networks, and

L_d = Σ_{t=1}^{T} [ −KL(q(s_t | h_t, z_t, m_t) || p(s_t | h_t, m_t)) − KL(q(f_t | h_t, m_t) || p(f_t | h_t)) + log p(z_t | s_t, h_t) + log p(m_t | f_t, h_t) ]

L_T = Σ_{t1,t2} [ −KL(q(s_{t1} | s_{t2}, h_{t1}, h_{t2}) || p(s_{t1})) + p(s_{t2} | s_{t1}, t1, t2) ]

where KL is the Kullback–Leibler divergence.
After training the pose auto-encoder and the sequence auto-encoder (Sections 5.1–5.2), we freeze their parameters and train the dynamics model (Section 5.3).
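For reference, a minimal sketch of the per-step term of L_d is given below. Treating the two reconstruction terms as diagonal-Gaussian log-likelihoods, as well as the function names, are assumptions; the paper states the KL and log-likelihood terms but not their exact parameterization.

```python
import math
import torch

def kl_diag_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians."""
    return (torch.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5).sum(-1)

def gaussian_log_prob(x, mu, sigma):
    return (-0.5 * ((x - mu) / sigma) ** 2
            - torch.log(sigma) - 0.5 * math.log(2 * math.pi)).sum(-1)

def step_elbo(z_t, m_t, post_s, prior_s, post_f, prior_f, dec_z, dec_m):
    """Per-step term of L_d; each post/prior/dec_* argument is a (mu, sigma) pair."""
    term = (-kl_diag_gaussian(*post_s, *prior_s)       # KL for the current state
            - kl_diag_gaussian(*post_f, *prior_f)      # KL for the future state
            + gaussian_log_prob(z_t, *dec_z)           # reconstruction of z_t
            + gaussian_log_prob(m_t, *dec_m))          # reconstruction of m_t
    return term.mean()                                 # maximized as part of Eq. (15)
```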
6 EXPERIMENT AND RESULTS
For all our experiments, we use the CMU MoCap database¹. The CMU dataset is a high-quality dataset acquired with optical motion capture systems, containing 2605 trials in 6 categories and 23 subcategories. Its high quality serves our purposes well, as it provides good data 'seeds' for motion generation. Also, the tremendous effort that went into capturing the data shows the need for tools such as DFN for data augmentation. To carefully evaluate DFN, we select different motion classes with different features and dynamics, shown in Tab. 1, and show that DFN can generate new high-quality motions of arbitrary length from different motion prefixes. Next, we evaluate DFN on data with single and mixed motion classes to test its ability to learn the transition stochasticity of both a single type of dynamics and mixed types of dynamics. Last, we push the limits of DFN by reducing the training data, to show that DFN can use a small amount of data to generate high-quality and diversified data, which is crucial for data augmentation. More examples can be found in the supplementary video.
6.1 Open-loop Motion Generation
We first show open-loop motion generation, where we do not moderate accumulated errors. We use an 8- to 20-frame motion prefix to start the generation and produce 900 frames (dfn_run2box_2char and dfn_boxing_3char in the video). The motion stability indicates that DFN does not suffer from the cumulative-error problem that is common in time-series generation [41]. Given the same prefix, the diversity is shown in the transitions between different postures (short-term) and different actions (long-term).
6.2 Dynamics Multi-modality
We investigate how well DFN can capture the different transition stochasticity of different motions, using several types of motions with different properties (shown in Tab. 1). We first train DFN on them separately, then jointly. The results can be found in dfn_walk_top, dfn_walk_close, walking1-walking2, running1-running3, dancing1 and boxing1 in the video. We observe that DFN learns the transition stochasticity well when trained on a single type of motion.
¹ http://mocap.cs.cmu.edu/
Motion  | Cyclic | Main Body Part | Rhythmic | Dynamics
Walking | Yes    | Lower          | No       | Low
Boxing  | No     | Upper          | No       | High
Dancing | No     | Full           | Yes      | High
Running | Yes    | Lower          | No       | High
Table 1: Motion types and their features.
The diversity appears in both short-term and long-term transitions, which are two levels of multi-modality captured well by DFN. In walking (dfn_walk_top and dfn_walk_close in the video), the short-term stochasticity shows up as within-cycle motion randomness, which enriches the walking style. The long-term stochasticity shows up when a turn is generated. Action-level transitions are also captured and generated. Similar observations hold for the other motions.
When trained on mixed data (combining all motions in Tab. 1), DFN learns higher-level action transitions between different motion classes. We can see examples (action_transition in the video and frame-level images in the supplementary file) that transition from dancing to running, from boxing to dancing, or from slow walking to running, showing the modelling capacity at both levels. Within a single action, diversified styles are learned well. Between different actions, transitions are learned well too. This demonstrates the benefit of explicitly modelling the randomness between the past, current and future states; without it, the model would struggle to capture the multi-modality and would average over all types of dynamics, resulting in meaningless mean poses and motions.
6.3 Diversified Generation
Although it is visually clear that DFN can generate diversified motions, we also show the diversity numerically, especially when the duration of generation becomes long. First, we randomly select training data of different classes and show their latent feature trajectories in Fig. 6, with an embedding dimension of 16. We then use a PCA model to project the embedding features to 2D. The trajectories are continuous and smooth without extra constraints on the auto-encoder, showing that the motion dynamics are captured well in the latent space, which is critical for motion generation. Next, we show a group of generated motions in Fig. 7. Even with the same motion prefix, motions start to diversify from the beginning, which is a distinct property lacking in the deterministic generators of most action-prediction models such as [32]; moreover, our sequences are in 3D, which is more difficult than 2D [43].
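The 2D visualization is a straightforward projection. A sketch is given below under the assumption that scikit-learn's PCA is used (the paper does not name the implementation); the function name and array shapes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_trajectories(latent_trajs):
    """Project latent pose trajectories to 2D for plots like Fig. 6 and Fig. 7.

    latent_trajs: list of arrays, each (T_i, 16) -- pose embeddings of one motion.
    A single PCA is fitted on all frames so the trajectories share one 2D basis.
    """
    all_frames = np.concatenate(latent_trajs, axis=0)
    pca = PCA(n_components=2).fit(all_frames)
    return [pca.transform(traj) for traj in latent_trajs]
```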
Distributional Shift in Time. The motion diversity increases over time. To show that there is a distributional shift of poses, we randomly pick an initial sequence from the training data, then randomly generate 4096 sequences, each with 128 frames. We visualize the current latent state s and latent feature z at t = 32 and t = 128 in Fig. 8. Note that the distributions of s and z capture the stochasticity at two different levels, one at the stochastic state level and the other at the latent feature level. The red dots represent them at t = 32 and the yellow dots at t = 128.
For both z and s, the red dots (t = 32) are more concentrated, showing that the 4096 generated motions are still somewhat similar at this early stage. However, the yellow dots (t = 128) show that the generated motions diversify later: not only do they shift out of the original red region, indicating that they are now in different pose regions, they also diverge more, as shown by the multiple modes in the yellow areas, meaning the motions have spread into several different pose regions.
Figure 6: Randomly selected training motions in the latent space. Colors indicate different motion classes. Smooth trajectories are universally obtained by the embedding.
Figure 7: Pose embedding trajectories of randomly generated sequences given the same 20-frame initialization. The circle marks the first frame. As time goes on, the trajectories depart from each other.
Distribution Matching in Time. Another way to test the diversity of generated motions is to measure their statistical similarity to the training motions. Since the motion prefix comes from one particular motion, the more similar the generated motions are to the whole training dataset, the more diverse they are, because they have left the original motion region where the prefix lies. We employ the mean-distance distribution as a measure, as in [43]. For each time step, we calculate the mean pose of all generated motions, then calculate the Euclidean distances between the mean pose and all other poses at that time step. We plot the mean distance and variance in Figure 9. The blue background indicates the mean and variance of the mean-distance distribution of the training dataset. It shows that, as time goes on, the mean-distance distribution of generated poses gradually matches that of the training data, which further demonstrates the generation diversity.
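A minimal sketch of this measure follows; the array shapes and the function name are assumptions.

```python
import numpy as np

def mean_distance_curve(poses):
    """Mean-distance statistic used for Fig. 9 (a sketch of the measure from [43]).

    poses: (N, T, D) array -- N generated motions, T frames, D-dimensional poses.
    For each time step, compute the mean pose over the N motions, then the
    Euclidean distance of every pose to that mean; return the per-step mean
    and standard deviation of those distances.
    """
    mean_pose = poses.mean(axis=0, keepdims=True)        # (1, T, D)
    dists = np.linalg.norm(poses - mean_pose, axis=-1)   # (N, T)
    return dists.mean(axis=0), dists.std(axis=0)         # curves over time
```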
Figure 8: Four groups of motions generated from four different motion prefixes, each group with 4096 motions, and their z (left) and s (right) at t = 32 and t = 128. The earlier distributions are more concentrated and diverge quickly as time passes.
Figure 9: Four groups of motions generated from four different motion prefixes, each group with 4096 motions. The x-axis represents time, and the y-axis represents the mean distance to the average pose at each time step. The band represents the variation.
6.4 Generation on Limited Training Data
DFN aims to alleviate the problem of data scarcity, so it should require as little data as possible for generation. We therefore push DFN to its limit by reducing the training data, to see the minimal amount of data needed. To investigate each individual type of motion, we train DFN on walking, running and boxing data separately. We start from the full training data, where the longest sequence lasts around 10 minutes, and gradually reduce the duration by subsampling until the quality of the generated motions starts to deteriorate. Although DFN responds to reduced training data slightly differently on different motions, we are finally able to reduce the training data to a tiny amount, with the longest sequence being only 15 seconds (12 seconds for walking, 15 seconds for boxing and 7 seconds for running). DFN can still generate stable motions even when trained on merely a 7-second-long motion (see reduced_data in the video). The impact of reducing the training data is mainly on the diversity of the motion (though, as the supplementary video shows, the generated boxing motion still retains a certain amount of diversity). Less training data contains fewer transition diversities (both short-term and long-term), so the generated motions are less diverse. This is understandable, as DFN cannot deviate too much from the original data distribution without compromising motion quality.
6.5 Comparison
To the best of our knowledge, the only paper similar to ours is [42], which also focuses on diversified motion generation. However, the biggest difference is that DFN explicitly models the influence of the future on the current state. This enables DFN to explicitly model the transition randomness at different stages and levels, and is the key reason why DFN can be trained well on multiple types of motions, separately and jointly, which has not been shown in [42]. A direct numerical comparison is difficult due to the lack of widely accepted metrics for diversified motion generation. In addition, the method in [42] uses heavy post-processing while DFN does not.
7 CONCLUSION AND DISCUSSIONS
In this paper, we propose a new generative model, DFN, for diversified human motion generation. DFN can generate motions of arbitrary length. It successfully captures the transition stochasticity in the short and long term, and is capable of learning the multi-modal randomness in different motions. The training data needed is small. We have conducted extensive evaluations to show DFN's robustness, versatility and diversity in motion generation.
There are two main limitations of our method: there is no control signal, and it can sometimes overly smooth high-frequency motions. We will address these in future work. Our explicit modelling of the future makes it convenient to introduce a desired future as a control signal, while replacing some of the Gaussian components with multi-modal priors might mitigate the over-smoothing issue.
ACKNOWLEDGMENTS
We thank the anonymous reviewers for their valuable comments. This
work is partially supported by the National Key Research & Devel-
opment Program of China (No. 2016YFB1001403), NSF China (No.
61772462, No. 61572429, No. U1736217), the 100 Talents Program
of Zhejiang University, Strategic Priorities Fund Research England,
and EPSRC (Ref:EP/R031193/1).
REFERENCES
[1] Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2Pose: Natural Language Grounded Pose Forecasting. In 2019 International Conference on 3D Vision (3DV). IEEE, 719–728.
[2] Federico Bartoli, Giuseppe Lisanti, Lamberto Ballan, and Alberto Del Bimbo. 2018. Context-aware trajectory prediction. In 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 1941–1946.
[3] Justin Bayer and Christian Osendorfer. 2014. Learning Stochastic Recurrent Networks. stat 1050 (2014), 27.
[4] Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom. 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6158–6166.
[5] Nutan Chen, Maximilian Karl, and Patrick Van Der Smagt. 2016. Dynamic movement primitives in latent space of time-dependent variational autoencoders. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids). IEEE, 629–636.
[6] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, 103–111. https://doi.org/10.3115/v1/W14-4012
[7] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems. 2980–2988.
[8] Han Du, Erik Herrmann, Janis Sprenger, Noshaba Cheema, Somayeh Hosseini, Klaus Fischer, and Philipp Slusallek. 2019. Stylistic Locomotion Modeling with Conditional Variational Autoencoder. In Eurographics (Short Papers). 9–12.
[9] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
[10] Anirudh Goyal Alias Parth Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. 2017. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems. 6713–6723.
[11] Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1308.html#Graves13
[12] Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and Theophane Weber. 2019. Temporal Difference Variational Auto-Encoder. In International Conference on Learning Representations. https://openreview.net/forum?id=S1x4ghC9tQ
[13] Xiao Guo and Jongmoo Choi. 2019. Human Motion Prediction via Learning Local Structure Representations and Temporal Dependencies. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 2580–2587.
[14] Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, Taku Komura, Jun Saito, Ikuo Kusajima, Xi Zhao, Myung-Geol Choi, Ruizhen Hu, et al. 2017. A Recurrent Variational Autoencoder for Human Motion Synthesis. In BMVC.
[15] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. 2020. Dream to Control: Learning Behaviors by Latent Imagination. In International Conference on Learning Representations. https://openreview.net/forum?id=S1lOTC4tDS
[16] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. 2019. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning. 2555–2565.
[17] Félix G Harvey and Christopher Pal. 2018. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs. 1–4.
[18] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2019. MoGlow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598 (2019).
[19] Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. 2019. Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE International Conference on Computer Vision. 7134–7143.
[20] Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–13.
[21] Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–11.
[22] Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs. 1–4.
[23] Wei-Ning Hsu, Yu Zhang, and James Glass. 2017. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems. 1878–1889.
[24] Junfeng Hu, Zhencheng Fan, Jun Liao, and Li Liu. 2019. Predicting Long-Term Skeletal Motions by a Spatio-Temporal Hierarchical Recurrent Network. arXiv preprint arXiv:1911.02404 (2019).
[25] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music Transformer. In International Conference on Learning Representations. https://openreview.net/forum?id=rJe4ShAcF7
[26] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. 2017. Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. stat 1050 (2017), 3.
[27] Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. stat 1050 (2014), 1.
[28] Jogendra Nath Kundu, Maharshi Gor, Phani Krishna Uppala, and Venkatesh Babu Radhakrishnan. 2019. Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1459–1467.
[29] Chen Li and Gim Hee Lee. 2019. Generating multiple hypotheses for 3d human pose estimation with mixture density network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9887–9895.
[30] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234.
[31] Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. 2019. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE International Conference on Computer Vision. 9489–9497.
[32] Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
[33] Jianyuan Min and Jinxiang Chai. 2012. Motion Graphs++: A Compact Generative Model for Semantic Motion Analysis and Synthesis. ACM Trans. Graph. 31, 6, Article 153 (Nov. 2012), 12 pages. https://doi.org/10.1145/2366145.2366172
[34] Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A Quaternion-based Recurrent Model for Human Motion.
[35] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–14.
[36] A. Safonova, Jessica Hodgins, and Nancy Pollard. 2004. Synthesizing physically realistic human motion in low dimensional. ACM Transactions on Graphics - TOG (01 2004).
[37] Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, and Yoshua Bengio. 2018. Twin Networks: Matching the Future for Sequence Generation. In International Conference on Learning Representations. https://openreview.net/forum?id=BydLzGb0Z
[38] Xiangbo Shu, Liyan Zhang, Guo-Jun Qi, Wei Liu, and Jinhui Tang. 2019. Spatiotemporal Co-attention Recurrent Neural Networks for Human-Skeleton Motion Prediction.
[39] Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
[40] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. [n.d.]. WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop. 125–125.
[41] He Wang, Edmond SL Ho, Hubert PH Shum, and Zhanxing Zhu. 2019. Spatio-temporal Manifold Learning for Human Motions via Long-horizon Modeling. IEEE Transactions on Visualization and Computer Graphics (2019).
[42] Zhiyong Wang, Jinxiang Chai, and Shihong Xia. 2019. Combining recurrent neural networks and adversarial training for human motion synthesis and control. IEEE Transactions on Visualization and Computer Graphics (2019).
[43] Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. 2019. Learning Diverse Stochastic Human-Action Generators by Learning Smooth Latent Transitions. CoRR abs/1912.10150 (2019). arXiv:1912.10150 http://arxiv.org/abs/1912.10150
[44] Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. 2018. Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis. In International Conference on Learning Representations. https://openreview.net/forum?id=r11Q2SlRW