Dynamic Future Net: Diversified Human Motion Generation

Wenheng Chen
NetEase Fuxi AI Lab
He Wang
University of Leeds
Yi Yuan
NetEase Fuxi AI Lab
Tianjia Shao
State Key Lab of CAD&CG, Zhejiang
Kun Zhou
State Key Lab of CAD&CG, Zhejiang
Figure 1: Given a 20-frame walking motion prefix (white), our model can generate diversified motion: walking (yellow), walking-to-running (blue), walking-to-boxing (green), and walking-to-dancing (red), with arbitrary duration. The corresponding animation can be found in teaser.mp4 in the supplementary video.
Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality. Acquiring high-quality skeletal motions is difficult due to the need for specialized equipment and laborious manual post-processing, which necessitates maximizing the use of existing data to synthesize new data. However, this is challenging due to the intrinsic stochasticity of human motion dynamics, manifested in both the short and long terms. In the short term, there is strong randomness within a couple of frames, e.g. one frame followed by multiple possible frames leading to different motion styles; while in the long term, there are non-deterministic action transitions. In this paper, we present Dynamic Future Net, a new deep learning model in which we explicitly focus on the aforementioned motion stochasticity by constructing a generative model with non-trivial modelling capacity in temporal stochasticity. Given limited amounts of data, our model can generate a large number of high-quality motions with arbitrary duration, and visually convincing variations in both space and time. We evaluate our model on a wide range of motions and compare it with state-of-the-art methods. Both qualitative and quantitative results show the superiority of our method in terms of robustness, versatility and quality.

MM ’20, October 12–16, 2020, Seattle, WA, USA. © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-7988-5/20/10.
CCS Concepts: Computer systems organization → Embedded systems; Redundancy; Robotics; Networks → Network reliability.
Keywords: human motion, neural networks, generative models
ACM Reference Format:
Wenheng Chen, He Wang, Yi Yuan, Tianjia Shao, and Kun Zhou. 2020. Dynamic Future Net: Diversified Human Motion Generation. In 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages.
Modeling natural human motions is a central topic in several fields such as computer animation, biomechanics and virtual reality, where high-quality motion data is a necessity. Despite the improved accuracy and lowered costs of motion capture systems, it is still highly desirable to make full use of existing data to generate diversified new data. One key challenge in motion generation is dynamics modelling, where it has been shown that a latent space can be found due to the high coordination of body motions [ ]. However, as much as the spatial aspect has been studied, dynamics modelling, especially with the aim of diversified motion generation, remains an open problem.
Human motion dynamics manifest several levels of short-term and long-term stochasticity. Given a homogeneous discretization of motions in time, the short-term stochasticity refers to the randomness in the next one or few frames (pose transition), while the long-term one refers to the randomness in the next one or few actions (action transition). Traditional methods model them with Finite State Machines built on carefully organized data [ ], which have limited model capacity for large amounts of data and require extensive pre-processing work. New deep learning methods either ignore them [ ] or do not explicitly model them [ ]. Very recently, dynamics modelling for diversified generation has been investigated [ ], but only from the perspective of the overall dynamics, rather than the detailed short/long-term stochasticity.
In this paper, we propose a new deep learning model, Dynamic Future Net, or DFN, for automatic and diversified high-quality motion generation based on limited amounts of data. Given a motion, we assume that it can be discretized homogeneously in time and represented by a series of postures and instantaneous velocities. Following the observation that it is easier to learn the dynamics in a natural motion latent space [ ], we first embed features in the data space into a latent space. Next, DFN explicitly learns the history, current and future states at any given time, where we also model several conditional distributions for the influences of the history and future states on the current state. The state-wise bidirectional modelling (extending into both the past and future) separates DFN from existing methods and endows it with the ability to model the short-term (next-frame) randomness and long-term (next-action) randomness. Last, for inference purposes, we propose new loss functions based on distributional similarities as opposed to point-wise estimation [ ], which capture the dynamics accurately but also keep the randomness that is crucial for diversified motion generation.
We present extensive experimental results to show DFN’s robustness, versatility and high quality. Unlike existing methods, which have to be trained on one type of motion at a time [ ], DFN can be trained on either a single type of motion or mixed motions, which shows DFN’s ability to capture multi-modal dynamics and therefore its versatility in diversified motion generation. Visual evaluation shows that DFN can generate high-quality motions with different dynamics.
In summary, our formal contributions include:
• a new deep learning model, Dynamic Future Net, for automatic, diversified and high-quality human motion generation;
• a new dynamic model that captures the transition stochasticity of the past, current and future states in motions;
• insights into the importance of both short-term and long-term dynamics in human motion modelling.
2.1 Human pose and motion embedding
Given a human motion sequence, it is useful to find a low-dimensional representation of the whole sequence. Holden et al. [ ] were the first to use a convolutional neural network to project the entire sequence to a low-dimensional embedding space. Using this more abstract representation, one can blend motions or remove noise from corrupted motions. In [ ], the authors further make use of the power of the learned motion manifold and decoder to synthesize motions with constraints. Another important application of motion embedding is motion style generation [ ], in which the embedding code can be tuned to match the desired style. Although modelling a motion sequence with auto-encoders is straightforward, how it can model the dynamics of human motion is not clear. In [ ], the authors model a motion sequence as a trajectory in a pose embedding manifold, then use a bidirectional RNN to encode the trajectory and model its dynamics, which can improve motion classification results. Moreover, they design a graph-like network to better represent the human body components.
The existing methods focus on the embedding of poses and dynamics. However, they do not explicitly model the distributions of these latent variables, which govern the stochasticity of the dynamics. In this paper, we go a level deeper and learn the latent variable distributions for the embedded poses and dynamics.
2.2 Deterministic human motion prediction and synthesis
In the effort to model motion dynamics, many methods employ deterministic transitions [ ], especially in human motion prediction or generation. They focus on either short-term dynamics modelling or spatio-temporal information of the overall dynamics. In [ ], the authors propose a training technique for RNNs to generate very long human motions. Although this technique solves the freezing problem of RNNs, their model is deterministic, which makes training difficult: given a past state, if multiple possible future motions are present in the data, the network will average them, which is a common problem in many human motion prediction methods.
One solution to this problem is to introduce control signals [ ]. These works design several networks and make the character follow a given trajectory in real time. In [ ], the control signal becomes the 3D human pose predicted by neural networks as a reference for an agent to imitate. In [ ], the authors co-embed language and the corresponding motions into a shared manifold, ignoring the fact that language-to-motion is a one-to-many mapping. Even with a specific control signal, such as a 2D human skeleton, one can still expect different motions or different poses to correspond to the same control signal [ ], which indicates the multi-modal nature of human motion dynamics.
Different from the existing methods, our paper focuses on the explicit modelling of the multi-modal nature of motion transitions in human motions. Further, we also aim to learn the stochasticity in those transitions.
2.3 Stochastic human motion synthesis
In [ ], the authors combine RNNs and Generative Adversarial Networks (GANs) to generate stochastic human motions. They use a mixture density layer to model the stochastic property, and an adversarial discriminator to judge whether the generated motion is natural or not. In MoGlow [ ], the authors are the first to use normalizing flows to directly model the next-frame distribution. One advantage of this method is that it can capture complex distributions without learning an apparent latent space. Given the same initial poses and the same control signals or constraints, the model still generates different motion sequences. Chen et al. [ ] combine dynamic motion primitives and a variational Bayesian filter to model human motion dynamics. They show that the latent representations are self-clustering after training. However, in the transitions, their model needs the information of the whole sequence, which prevents it from being a purely generative model.
Our method differs from existing approaches in its treatment of the relations between the past, current and future states of human motions. Unlike the aforementioned methods, we explicitly model the current state based on both the past and the future. We further model their randomness in the latent space, which captures the transition multi-modality.
2.4 Stochastic RNN model
Modelling the stochasticity in time-series data, such as music, handwriting and human voice [ ], has been a long-standing problem. The VRNN [ ] was the first to combine the Variational AutoEncoder (VAE) [ ] and recurrent neural networks for this purpose. Later in [ ], the authors disentangle the latent variables of the observation and the dynamics, with the observation latent being used to recover the full observation information, and the dynamics latent capturing the dynamics.
A key modelling choice in stochastic RNN models is the relation between the past, current and future. In early work, the posterior of the current state is inferred from past information only, so it lacks the ability to foresee the future. In [ ], the authors show that the performance can be improved by incorporating the future state with a backward RNN in the inference stage. In [ ], the authors design a model that can go beyond step-by-step modelling and predict multiple steps up to a given horizon. Similar efforts have also been made in reinforcement learning, where the reward function takes the discounted future reward into consideration [ ]. In [ ], the authors go further and design a model that can predict multiple future scenarios, then choose the one with the highest predicted reward among all possibilities.
We observe that human motions follow a similar philosophy: the current state is a result of the past motion but also a particular choice for a certain planned future. Our research is inspired by stochastic RNN models but focuses on human motion transitions.
Our method takes a homogeneous series of human pose representations as input. This representation contains the 3D joint coordinates relative to the root, the root translation velocity over the ground plane, and the rotation velocity around the y-axis. We propose the Dynamic Future Net to model the motion dynamics as a future-guided transition and generate random natural human motions that transit between different actions.
As illustrated in Fig. 2, DFN is composed of three modules: a pose encoder, a pose trajectory encoder and a stochastic latent RNN. The pose encoder (Section 5.1) maps the high-dimensional human pose to a latent space, while the pose trajectory encoder (Section 5.2) embeds the trajectory in the latent pose space into a code. Such compact representations of pose sequences can facilitate the learning process [ ]. As a key module, the stochastic latent RNN (Section 5.3) deploys a stochastic latent state and a deterministic latent state to learn two latent distributions, one for the pose embedding and one for the future trajectory embedding. Such explicit learning of two different latent distributions on the one hand forces the model to learn strong temporal correlations and on the other hand generates motions with varied and natural transitions. During inference, we combine the past, current and future states to infer the current latent state distribution, and we combine the past and future to infer the future latent state distribution. In the generation stage, unlike existing methods [ ] where the current state is generated from the past state only, we first generate the future state and combine it with the past state to generate the current latent state prior, from which we sample the current latent state, then decode it to the pose embedding and recover the current pose and velocity. We regard this process as a self-driving motion generation process guided by the envisioned dynamic future. In this way, the model can learn and generate rich and varied natural motions.
Figure 2: Overview of the proposed Dynamic Future Network. In the learning process, the network takes a human motion sequence as input and predicts the long-term distribution and the short-term (next-frame) distribution.
We train our model on the CMU human motion capture dataset. As the skeletons in the original dataset differ, we first retarget all motions to a chosen skeleton as in [ ]. The skeleton contains 24 joints. We first extract the X and Z global coordinates of the root and rotate the human pose to the Y-axis direction as in [ ]; the global position and angle of the human pose can then be recovered from the X-Z velocity and the rotation velocity around the Y axis. Finally, the original human pose vector contains 76 degrees of freedom: 72 for the 3D joint positions and 4 for the global translation and rotation velocities.
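The recovery of the global root trajectory from per-frame velocities can be illustrated with a minimal sketch. It assumes, purely for illustration, that each frame stores a local-frame translation `(vx, vz)` and a yaw increment `yaw_rate` in radians per frame; the function name `integrate_root` and this three-channel layout are hypothetical and are not taken from the paper, whose velocity block has 4 dimensions.

```python
import math

def integrate_root(velocities, x0=0.0, z0=0.0, yaw0=0.0):
    """Accumulate per-frame root velocities into a global trajectory.

    velocities: list of (vx, vz, yaw_rate) tuples, one per frame, with the
    translation expressed in the character's local frame (assumed layout).
    Returns a list of (x, z, yaw) global root states, one per frame.
    """
    x, z, yaw = x0, z0, yaw0
    trajectory = []
    for vx, vz, yaw_rate in velocities:
        # Rotate the local translation into the world frame (rotation about Y).
        c, s = math.cos(yaw), math.sin(yaw)
        x += c * vx + s * vz
        z += -s * vx + c * vz
        # Apply the heading change after moving, ready for the next frame.
        yaw += yaw_rate
        trajectory.append((x, z, yaw))
    return trajectory
```

Because the pose itself is stored relative to the root, only this small integration step is needed to place a generated sequence back into world space.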
Figure 3: The pose-velocity auto-encoder network. The input of the encoder is the 76-dimensional pose-velocity vector. The encoder outputs two codes, one for pose and the other for velocity. The pose code and velocity code are fed into a quaternion decoder and a velocity decoder, respectively. The 3D joint positions are recovered from the quaternions by the Forward Kinematics (FK) layer.
Formally, we start by describing a motion as a homogeneous time series $\{X_0, \dots, X_T\}$, where $X_t$ is the motion frame at time $t$ and contains the joint positions and global velocities. Starting with a joint distribution $P(X_{<t}, X_t, X_{t+1:t+H})$, we model the influence of the past frames $X_{<t}$ and future frames $X_{t+1:t+H}$ on the current frame $X_t$ by the transition probability distributions $P(X_t \mid X_{<t})$ and $P(X_t \mid X_{t+1:t+H})$, where $H$ is the duration of a short-horizon future.
Such a modelling choice is based on two observations. First, the current frame is a result of the past motion and is therefore conditioned on it, captured by $P(X_t \mid X_{<t})$. Meanwhile, the current frame is also a choice made for a certain planned future, e.g. needing to stop swinging the legs in preparation for a transition from walking to standing, captured by $P(X_t \mid X_{t+1:t+H})$. In addition, since the past motion also limits the possibilities of the future motion, there is also an impact of the past on the future, $P(X_{t+1:t+H} \mid X_{<t})$. Overall, the joint probability is:
$$P(X_{<t}, X_t, X_{t+1:t+H}) \propto P(X_t \mid X_{t+1:t+H})\, P(X_{t+1:t+H}, X_t \mid X_{<t}) \qquad (1)$$
Note that the two probabilities on the right-hand side play different roles. $P(X_{t+1:t+H}, X_t \mid X_{<t})$ is the probability of unrolling from the past to the future. Given a known past, this is a joint probability of both the current and the future, containing all the possible transitions. On top of it, $P(X_t \mid X_{t+1:t+H})$ dictates that if the future is also known, then the current can be inferred. This explicit modelling of the transition probability distributions between the past, current and future helps capture the transition stochasticity, which facilitates diversified motion generation as shown in the experiments.
Learning the transition probabilities in the data space, however, is difficult due to the curse of dimensionality. We therefore project motions to a latent space, which involves embedding the frames as well as the dynamics. We then learn the transition distributions in the latent space. During inference, we recover motions from sampled states in the latent space to the original data space. DFN is naturally divided into three components: spatial (frame) embedding, dynamics embedding and dynamics modelling.
5.1 Spatial Embedding
We use an auto-encoder for frame embedding, $z_t = PoseEnc(X_t)$ and $\hat{X}_t = PoseDec(z_t)$, shown in Figure 3. $PoseEnc$ is a multi-layer perceptron network that projects the data into the latent space. We then separate the latent feature into two components to represent the pose code and the global velocity code. $PoseDec$ contains two components, the quaternion decoder and the velocity decoder. The quaternion decoder takes the pose latent feature as input and outputs joint angles (represented by quaternions), and the velocity decoder takes the latent velocity feature as input and outputs the velocity. The quaternion decoder is essentially a differential Inverse Kinematics module. As stated in [ ], using joint rotations instead of joint positions maintains the bone lengths. After the reconstruction, we use a Forward Kinematics layer to compute the 3D joint positions. To train the auto-encoder, we use a Mean Squared Error loss function:
$$L_{sl} = \frac{1}{T} \sum_{t} \| X_t - \hat{X}_t \|^2 \qquad (2)$$
where $T$ is the number of frames in a motion.
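To illustrate why decoding joint rotations and then applying Forward Kinematics keeps bone lengths fixed, the following is a minimal sketch of an FK pass over a skeleton hierarchy. The function names, the `(w, x, y, z)` quaternion layout and the parent-before-child joint ordering are assumptions made for this example, not details given in the paper.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_rotate(q, v):
    """Rotate the 3D vector v by the unit quaternion q (computes q v q*)."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0,) + tuple(v)), conj)
    return (rx, ry, rz)

def normalize(q):
    """Scale a raw decoder output to a unit quaternion."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def forward_kinematics(parents, offsets, local_quats):
    """Compute root-relative joint positions from local joint rotations.

    parents[i] is the parent index of joint i (-1 for the root); parents are
    assumed to appear before their children. offsets[i] is the fixed bone
    vector from the parent to joint i in the parent's local frame.
    """
    global_quats, positions = [], []
    for i, parent in enumerate(parents):
        q = normalize(local_quats[i])
        if parent < 0:
            global_quats.append(q)
            positions.append((0.0, 0.0, 0.0))
        else:
            pq = global_quats[parent]
            global_quats.append(quat_mul(pq, q))
            # The bone vector is only rotated, never scaled, so its length is kept.
            off = quat_rotate(pq, offsets[i])
            px, py, pz = positions[parent]
            positions.append((px + off[0], py + off[1], pz + off[2]))
    return positions
```

Since each bone offset is only rotated by a unit quaternion, the distance between a joint and its parent equals the offset length regardless of what the decoder outputs, which is the property the text attributes to rotation-based decoding.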
5.2 Dynamics Embedding
Figure 4: A seq-to-seq network for trajectory embedding.
After learning the pose latent space, we project the dynamics as trajectories in this space using a Recurrent Neural Network shown in Figure 4. We employ a sequence-to-sequence architecture as it forces the model to learn long-term dynamics. The RNN, which consists of Gated Recurrent Units (GRUs) [ ], encodes a sequence of encoded frames $\{z_t, \dots, z_{t+H}\}$ into a latent representation $m_t$, and then unrolls to reconstruct the same sequence $\{\hat{z}_{t+1}, \dots, \hat{z}_{t+H}\}$. Whereas $z_t = PoseEnc(X_t)$ encodes a single frame, $m_t$ is a future summary over multiple frames. We use the following loss function:
$$L_{tl} = L_{rec} + L_{smooth} \qquad (3)$$
where $L_{rec}$ is a squared-error reconstruction term and $L_{smooth}$ is a smoothness term computed on the original and reconstructed joint velocities, both accumulated over the $T$ frames of a motion. To facilitate training, we use Eq. 2 to pre-train the posture auto-encoder and fix its weights when training the RNN module.
5.3 Dynamics modelling
Generative Model. After embedding the poses and dynamics into a latent space, we now explain the dynamics modelling, which is the key technical contribution of this paper. We propose a new dynamics model that captures the joint distribution $P(X_{<t}, X_t, X_{t+1:t+H})$. Rather than directly learning the distribution in the data space, we aim to learn the latent joint distribution $P(z_{<t}, z_t, z_{t+1:t+H})$, where we abstract the past, current and future features separately. First, given the Markov property, we assume that all past information is encoded into $h_t$, which is a deterministic (known) past state. Next, we assume that the future information $z_{t+1:t+H}$ can be summarized into a future state $f_t$, and $f_t$ is drawn from a distribution over all possible future states conditioned on $h_t$. Last, we also assume that there is a current state $s_t$ which captures the current information $z_t$, and $s_t$ is drawn from a distribution of all possible current states. We can therefore assume:
$$P(z_{<t}, z_t, z_{t+1:t+H}) = P(z_{t+1:t+H}, z_t \mid z_{<t})\, P(z_{<t}) = P(s_t \mid f_t, h_t)\, P(f_t \mid h_t) \qquad (4)$$
where we directly use $s_t$, $f_t$ and $h_t$ to replace the corresponding $z$ variables by assuming mappings between them, which will be explained later. Different from existing methods [ ], our direct conditioning of the current state on the future and past states, $P(s_t \mid f_t, h_t)$, and of the future state on the past state, $P(f_t \mid h_t)$, gives us great flexibility in modelling the stochasticity in transitions. The generation model is shown in Fig. 5a.
Future Feature Prior. Given $h_t$, we first predict the future state, $P(f_t \mid h_t)$. Here, we assume a diagonal multivariate Gaussian prior over $f_t$ [5, 14, 26]:
$$P(f_t \mid h_t) = Gaussian(f_t; \mu^f_t, \sigma^f_t), \qquad [\mu^f_t, \sigma^f_t] = MLP(h_t) \qquad (5)$$
where $[\mu^f_t, \sigma^f_t]$ are the mean and covariance, and the MLP has three layers with hidden dimension 256 and LeakyReLU activation. This prior covers all the possible future states given the past. It can represent a goal or a driving signal for the generative process. It also forces the model to learn rich motion transitions and long-term correlations, overcoming the freezing problem of traditional RNNs [44].
Current Feature Prior. Next, we explain $P(s_t \mid f_t, h_t)$. Although $h_t$ is a known (deterministic) past, $f_t$ is random. We therefore first sample a specific future state $f_t$, then decode it to an unrolled future summary $m_t$, and finally condition the current state $s_t$ on $m_t$. We therefore have:
$$P(s_t \mid f_t, h_t) \propto P(s_t \mid m_t, h_t)\, P(m_t \mid f_t, h_t) \qquad (6)$$
$$P(s_t \mid m_t, h_t) = Gaussian(s_t; \mu^s_t, \sigma^s_t), \qquad [\mu^s_t, \sigma^s_t] = MLP(m_t, h_t) \qquad (7)$$
$P(m_t \mid f_t, h_t)$ is parameterized by a four-layer MLP with hidden dimension 128 and LeakyReLU activation. $[\mu^s_t, \sigma^s_t]$ are the mean and covariance, produced by a two-layer MLP with hidden dimension 128. After sampling the current state $s_t$, we can compute the current feature $z_t = MLP(s_t, h_t)$, where the MLP has three hidden layers with 128 dimensions and LeakyReLU activation.
Finally, given the current and future states, the past state is updated as follows (Fig. 5c):
$$h_{t+1} = GRU(h_t, s_t, f_t) \qquad (8)$$
where the GRU has two stacked layers and a hidden state of dimension 128. Now the generation model in Fig. 5a is complete.
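The order of operations in one generation step (Eqs. 5 to 8) can be summarised with a minimal sketch. Everything named here (`generate`, `toy_nets`, `sample_diag_gaussian`) is hypothetical: the toy functions are bounded placeholders standing in for the trained MLPs and GRU, $\sigma$ is treated as a per-dimension standard deviation, and $m_t$ is decoded deterministically from $(f_t, h_t)$. None of this reproduces the authors' implementation.

```python
import math
import random

def sample_diag_gaussian(mu, sigma, rng):
    """Draw one sample from a diagonal Gaussian with per-dimension std sigma."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def generate(h0, steps, nets, rng):
    """Roll out latent pose features z_t, following the future-first ordering.

    nets is a dict of callables standing in for the trained networks.
    """
    h, features = h0, []
    for _ in range(steps):
        mu_f, sigma_f = nets["future_prior"](h)       # P(f_t | h_t), Eq. 5
        f = sample_diag_gaussian(mu_f, sigma_f, rng)
        m = nets["future_decoder"](f, h)              # future summary m_t, Eq. 6
        mu_s, sigma_s = nets["current_prior"](m, h)   # P(s_t | m_t, h_t), Eq. 7
        s = sample_diag_gaussian(mu_s, sigma_s, rng)
        z = nets["feature_decoder"](s, h)             # z_t = MLP(s_t, h_t)
        h = nets["transition"](h, s, f)               # h_{t+1} = GRU(h_t, s_t, f_t), Eq. 8
        features.append(z)
    return features

def toy_nets(dim):
    """Placeholder 'networks': bounded element-wise mixes, not learned models."""
    def mix(*vectors):
        return [math.tanh(sum(v[i] for v in vectors)) for i in range(dim)]
    return {
        "future_prior": lambda h: (mix(h), [0.5] * dim),
        "future_decoder": lambda f, h: mix(f, h),
        "current_prior": lambda m, h: (mix(m, h), [0.1] * dim),
        "feature_decoder": lambda s, h: mix(s, h),
        "transition": lambda h, s, f: mix(h, s, f),
    }
```

Calling `generate([0.0] * 4, 10, toy_nets(4), random.Random(0))` yields ten latent features; re-running with a different seed gives a different rollout from the same initial state, which mirrors how sampling $f_t$ and $s_t$ produces diversified motions from one prefix.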
Different from existing methods [ ], where the prior of the current state is a function of the past state $h_t$ only and where the future state is shared with the current state, we let the model learn two different distributions for the current and future states. The prior of the current state is also a function of the future state, which forces the model to make use of the future information.
5.4 Inference
In the generation model in Fig. 5a, the key variables to be inferred are $s_t$ and $f_t$, shown in Fig. 5b. The posterior of the future state $f_t$ depends on the past state $h_t$ and its unrolled future summary $m_t$. The posterior of the current state $s_t$ depends on the feature $z_t$, the past state $h_t$ and the future summary $m_t$. We first factorize the dynamics as follows:
$$q(s_{\le T} \mid z_{\le T}) \approx \prod_{t} q(s_t \mid z_{\le t-1}, z_t, z_{\le t+H}) = \prod_{t} q(s_t \mid h_t, z_t, m_t) \qquad (9)$$
where $T$ is the total length of the motion. Here we approximate the posterior with a finite future horizon: as the correlation between $s_t$ and the far future is likely to be small, we only consider frames up to $t+H$. Then, for each time step, we use an MLP to parameterize the posterior:
$$q(s_t \mid h_t, z_t, m_t) = Gaussian(\mu^s_t, \sigma^s_t), \qquad [\mu^s_t, \sigma^s_t] = MLP(h_t, z_t, m_t) \qquad (10)$$
where the MLP has two hidden layers with 32 dimensions and LeakyReLU activation. For the future state, we approximate its posterior as follows:
$$q(f_t \mid h_t, m_t) = Gaussian(\mu^f_t, \sigma^f_t), \qquad [\mu^f_t, \sigma^f_t] = MLP(h_t, m_t) \qquad (11)$$
where the MLP has two hidden layers with 512 dimensions and LeakyReLU activation.

Figure 5: The stochastic latent RNN, with panels a) Generation, b) Inference and c) Transition (GRU). Legend: $h_t$: deterministic latent for history; $z_t$: code for pose-v; $m_t$: code for future; $s_t$: stochastic latent for current state; $f_t$: stochastic latent for future state. a) Generation model. The current pose embedding feature $z_t$ depends on the current and past latent states $s_t$ and $h_t$. $s_t$ depends on the past state $h_t$ and the future summary $m_t$, which depends on the future latent state $f_t$ and the past state $h_t$. b) Inference on $s_t$ and $f_t$. c) Transition of $h_t$.
5.5 Temporal difference loss
Besides inferring $s_t$, we also constrain the dynamics of $s_t$. We assume a relation between two states $s_{t_1}$ and $s_{t_2}$ with $t_1 < t_2$, similar to [12], in which the joint $p(s_{t_1}, s_{t_2}, h_{t_1}, h_{t_2})$ is modelled through the transition
$$p(s_{t_2} \mid s_{t_1}, h_{t_1}, h_{t_2}) \qquad (12)$$
where we parameterize the posterior:
$$q(s_{t_1} \mid s_{t_2}, h_{t_1}, h_{t_2}) = Gaussian(s_{t_1}; \mu_{s_{t_1}}, \sigma_{s_{t_1}}) \qquad (13)$$
where $[\mu_{s_{t_1}}, \sigma_{s_{t_1}}] = MLP(s_{t_2}, h_{t_1}, h_{t_2})$ are the mean and covariance. Here the MLP has two hidden layers with 32 dimensions and LeakyReLU activation. This way we can sample $s^{posterior}_{t_1}$ during inference. Meanwhile, we hope to reconstruct $s_{t_2}$ with the time difference $\delta t = |t_1 - t_2|$:
$$\hat{s}_{t_2} = MLP(s^{posterior}_{t_1}, \delta t) \qquad (14)$$
where the MLP for skip prediction has three hidden layers, each with 32 dimensions and LeakyReLU activation.
5.6 Learning
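Finally, we compose all terms for the loss function:
$$\mathcal{L}(\theta) = \sum_{t} \Big[ -KL\big(q(s_t \mid h_t, z_t, m_t) \,\|\, p(s_t \mid h_t, m_t)\big) - KL\big(q(f_t \mid h_t, m_t) \,\|\, p(f_t \mid h_t)\big) + \log p(z_t \mid s_t, h_t) + \log p(m_t \mid f_t, h_t) \Big] + \sum_{t_1 < t_2} \Big[ -KL\big(q(s_{t_1} \mid s_{t_2}, h_{t_1}, h_{t_2}) \,\|\, p(s_{t_1})\big) + p(s_{t_2} \mid s_{t_1}, h_{t_1}, h_{t_2}) \Big] \qquad (15)$$
where $\theta$ is the set of all learnable parameters in our networks and $KL$ is the Kullback–Leibler divergence.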
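The KL terms in the loss compare Gaussian posteriors against Gaussian priors. As a minimal sketch, assuming both distributions are diagonal and that $\sigma$ denotes a per-dimension standard deviation (the text calls it the covariance), the divergence has the closed form $\sum_i \big[\log(\sigma_{p,i}/\sigma_{q,i}) + (\sigma_{q,i}^2 + (\mu_{q,i} - \mu_{p,i})^2) / (2\sigma_{p,i}^2) - 1/2\big]$. The function name `kl_diag_gaussians` is hypothetical.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for two diagonal Gaussians given per-dimension mean and std."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (
            math.log(sp / sq)
            + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp)
            - 0.5
        )
    return total
```

The divergence is zero when posterior and prior coincide and grows as the posterior mean moves away from the prior or as their spreads differ, so minimising it pulls the inferred $s_t$ and $f_t$ distributions towards what the generative priors can produce.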
After training the pose auto-encoder and the sequence auto-
encoder (Section 5.1-5.2), we freeze their parameters and train the
dynamics model (Section 5.3).
For all our experiments, we use the CMU MoCap database. The dataset is a high-quality dataset acquired using optical motion capture systems, containing 2605 trials in 6 categories and 23 subcategories. Its high quality serves our purposes well as it provides good data ‘seeds’ for motion generation. Also, the tremendous effort that went into capturing the data shows the need for tools such as DFN for data augmentation. To carefully evaluate DFN, we select different motion classes with different features and dynamics, shown in Tab. 1, to show that DFN can generate new high-quality motions with arbitrary lengths using different motion prefixes. Next, we evaluate DFN on data with single and mixed motion classes to see its ability to learn the different transition stochasticity on data with a single type of dynamics and mixed types of dynamics. Last, we push the limit of DFN by reducing the training data, to show that DFN can make use of a small amount of data to generate high-quality and diversified data, which is crucial for data augmentation. More examples can be found in the supplementary video.
6.1 Open-loop Motion Generation
We first show open-loop motion generation, where we do not moderate accumulative errors. We use an 8- to 20-frame motion prefix to start motion generation and generate 900 frames (dfn_run2box_2char and dfn_boxing_3char in the video). The motion stability indicates that DFN does not suffer from the problem of cumulative error that is common in time-series generation [ ]. Given the same prefix, the diversity is shown in the transitions between different postures (short-term) and different actions (long-term).
6.2 Dynamics Multi-modality
We investigate how well DFN can capture different transition stochasticity in different motions, using several types of motions with different properties (shown in Tab. 1). We first train DFN on them separately, then jointly. The results can be found in dfn_walk_top, dfn_walk_close, walking1-walking2, running1-running3, dancing1 and boxing1 in the video. We observe that DFN can learn the transition stochasticity well when trained on a single type of motion.
Motion    Cyclic   Main Body Part   Rhythmic   Dynamics
Walking   Yes      Lower            No         Low
Boxing    No       Upper            No         High
Dancing   No       Full             Yes        High
Running   Yes      Lower            No         High
Table 1: Motion types and their features.
The diversity can be found in short-term and long-term transitions, which are two levels of multi-modality captured well by DFN. In walking (dfn_walk_top and dfn_walk_close in the video), the short-term stochasticity is shown in within-cycle motion randomness, which enriches the walking style. The long-term stochasticity is shown when a turn is generated. The action-level transition has also been captured and generated. Similar observations are also found in other motions.
When trained on mixed data (combining all motions in Tab. 1), DFN learns higher-level action transitions between different motion classes. We can see examples (action_transition in the video and frame-level images in the supplementary file) that transit from dancing to running, from boxing to dancing or from slow walking to running, showing the modelling capacity at two levels. Within a single action, diversified styles are learned well. Between different actions, transitions are learned well too. This demonstrates the benefits of modelling randomness explicitly between the past, current and future states; without it, a model would struggle to capture the multi-modality and would average over all types of dynamics, resulting in meaningless mean poses and motions.
6.3 Diversified Generation
Although it is visually clear that DFN can generate diversified motions, we also show the diversity numerically, especially when the duration of generation becomes long. First, we randomly select training data of different classes and show their latent feature trajectories in Fig. 6, with an embedding dimension of 16. We use a PCA model to project the embedding features to 2D. The trajectories are continuous and smooth without extra constraints on the auto-encoder. This shows that the motion dynamics are captured well in the latent space, which is critical in motion generation. Next, we show a group of generated motions in Fig. 7. Even with the same motion prefix, motions start to diversify from the beginning, which is a distinct property lacking in the deterministic generators of most action-prediction models such as [ ]; moreover, our sequences are in 3D, which is more difficult than 2D [43].
Distributional Shift in Time. The motion diversity increases in time. To show that there is a distributional shift of poses, we randomly pick an initial sequence from the training data, then randomly generate 4096 sequences, each with 128 frames. We visualize the current latent state $s_t$ and latent feature $z_t$ at $t = 32$ and $t = 128$ in Fig. 8. Note that the distributions of $s_t$ and $z_t$ capture the stochasticity at two different levels, one at the stochastic state level and the other at the latent feature level. The red dots represent them at $t = 32$ and the yellow dots at $t = 128$.
For both $s_t$ and $z_t$, the red dots ($t = 32$) are more concentrated, showing that the 4096 generated motions are still somewhat similar in the early stage. However, the yellow dots ($t = 128$) show that the generated motions start to diversify later. Not only do they shift out of the original red region, indicating that they are now in different pose regions, they also start to diverge more, shown by the different modes in the yellow areas, meaning they have diverged into several different pose regions.

Figure 6: Randomly selected training motions in the latent space. Colors indicate different motion classes. Smooth trajectories are universally obtained by embedding.

Figure 7: Pose embedding trajectories of randomly generated sequences given the same 20-frame initialization. The circle marks the first frame. As time passes, these trajectories depart from each other.
[Figure 8 panels: "Pose embedding distribution" and "Pose embedding latent distribution"; legend: t=32, t=128]
Figure 8: Four groups of motions generated from four different
motion prefixes, each group with 4096 motions, and
their 𝑧 (Left) and 𝑠 (Right) at 𝑡 = 32 and 𝑡 = 128. We can see
that the earlier distributions are more concentrated and diverge
fast as time passes.

Figure 9: Four groups of motions generated from four different
motion prefixes, each group with 4096 motions. The
x axis represents the time dimension, and the y axis represents
the mean distance to the average pose at each time step. The
band represents the variations.

Distribution Matching in Time. Another way to test the diversity
of generated motions is to see their statistical similarity to
the training motions. Since the motion prefix is from one particular
motion, the more similar the generated motions are to the whole
training dataset, the more diverse they are, because the generated
motions have left the original motion region where the motion
prefix is.
We employ the mean-distance distribution as a measure, as in
[ ]. For each time step, we calculate the mean pose of all generated
motions, then calculate the Euclidean distances between the mean
pose and all other poses at that time step. We then plot the mean
distance and variance in Figure 9. The blue background indicates the
mean and variance of the mean-distance distribution of the training
dataset. It shows that as time goes on, the mean-distance distribution
of generated poses gradually matches that of the training data. This
further shows the generation diversity.
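
The mean-distance computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the motions are stacked in an array of shape (motions, frames, pose dimensions), and the function name `mean_distance_stats` and the random-walk stand-in data are hypothetical.

```python
import numpy as np


def mean_distance_stats(poses):
    """Per-frame mean and variance of distances to the mean pose.

    poses: array of shape (num_motions, num_frames, pose_dim).
    Returns two arrays of shape (num_frames,).
    """
    poses = np.asarray(poses, dtype=float)
    mean_pose = poses.mean(axis=0, keepdims=True)        # (1, T, D)
    dists = np.linalg.norm(poses - mean_pose, axis=-1)   # (N, T)
    return dists.mean(axis=0), dists.var(axis=0)


# Hypothetical stand-in data: 256 random-walk "motions", 64 frames, 30-D poses.
rng = np.random.default_rng(0)
generated = np.cumsum(rng.normal(size=(256, 64, 30)), axis=1)

mean_dist, var_dist = mean_distance_stats(generated)
print(mean_dist[0], mean_dist[-1])
```

Applying the same function to the training poses would give the reference band against which the generated curve is compared.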
6.4 Generation on Limited Training Data
DFN aims to solve the problem of data scarcity, so it should
require as little data as possible for generation. We therefore push
DFN to its limit by reducing the training data, to find the minimal
amount of data needed. To investigate each individual type of
motion, we train DFN on walking, running, and boxing data separately.
We start from the full training data, where the longest sequence
lasts for around 10 minutes, and gradually reduce the duration by
sampling until the quality of the generated motions starts to deteriorate.
Although DFN responds to reduced training data slightly
differently for different motions, we are finally able to reduce the training
data to a tiny amount, with the longest sequence being only
15 seconds (12 seconds for walking, 15 seconds for boxing and 7 seconds
for running). DFN can still generate stable motions even when
trained on merely a 7-second-long motion. (The result can be seen
in reduced_data in the video.) The impact of reducing the training data
is mainly on the diversity of the motion. (However, we can see in the
supplementary video that the generated boxing motion still has a
certain degree of diversity.) Less training data contains fewer transition
diversities (both short-term and long-term). The generated motions
are therefore less diverse. This is understandable, as DFN must
stay close to the original data distribution to ensure the
motion quality.
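
One way to set up such a data-reduction experiment is sketched below. The text does not specify how the duration is reduced, so this assumes a random contiguous window is taken from each clip. The frame rate, the schedule of duration budgets, the clip shapes and the function name `crop_clip` are all hypothetical.

```python
import numpy as np


def crop_clip(clip, fps, max_seconds, rng):
    """Return a random contiguous window of at most max_seconds.

    clip: array of shape (num_frames, pose_dim); shorter clips are returned whole.
    """
    clip = np.asarray(clip)
    max_frames = int(round(max_seconds * fps))
    if len(clip) <= max_frames:
        return clip
    start = int(rng.integers(0, len(clip) - max_frames + 1))
    return clip[start:start + max_frames]


FPS = 30                                  # hypothetical frame rate
BUDGETS_S = [600, 300, 120, 60, 30, 15]   # hypothetical schedule, 10 min down to 15 s

rng = np.random.default_rng(0)
# Hypothetical stand-in clips: one 10-minute and one 20-second sequence.
clips = [np.zeros((600 * FPS, 63)), np.zeros((20 * FPS, 63))]

for budget in BUDGETS_S:
    reduced = [crop_clip(c, FPS, budget, rng) for c in clips]
    longest_s = max(len(c) for c in reduced) / FPS
    # A model would be retrained on `reduced` here and its output inspected.
    print(f"budget {budget:>3d} s -> longest clip {longest_s:.0f} s")
```

Keeping each window contiguous preserves the frame-to-frame transitions that the model learns from, which is why the sketch crops rather than drops individual frames.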
6.5 Comparison
To the best of our knowledge, the only paper similar to ours is [ ], which
also focuses on diversified motion generation. However, the biggest
difference is that DFN explicitly models the influence of the future
on the current. This enables DFN to explicitly model the transition
randomness at different stages and levels. This is the key reason why
DFN can be trained well on multiple types of motions, separately
and jointly, which has not been shown in [ ]. However, a direct
numerical comparison is difficult due to the lack of widely accepted
metrics for diversified motion generation. In addition, the method
in [42] uses heavy post-processing while DFN does not.
In this paper, we propose a new generative model, DFN, for diversified
human motion generation. DFN can generate motions with
arbitrary lengths. It successfully captures the transition stochasticity
in the short and long term, and is capable of learning the multi-modal
randomness in different motions. The training data needed is small.
We have conducted extensive evaluation to show DFN’s robustness,
versatility and diversity in motion generation.
There are two main limitations in our method. There is no control
signal, and sometimes it can overly smooth high-frequency motions.
We will address them in the future. Our explicit modelling of the
future makes it convenient to introduce a desired future as a control
signal, while replacing some of the Gaussian components with
multi-modal priors might mitigate the over-smoothing issue.
We thank the anonymous reviewers for their valuable comments. This
work is partially supported by the National Key Research & Development
Program of China (No. 2016YFB1001403), NSF China (No.
61772462, No. 61572429, No. U1736217), the 100 Talents Program
of Zhejiang University, Strategic Priorities Fund Research England,
and EPSRC (Ref: EP/R031193/1).
Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2Pose: Natural Language Grounded Pose Forecasting. In 2019 International Conference on 3D Vision (3DV). IEEE, 719–728.
Federico Bartoli, Giuseppe Lisanti, Lamberto Ballan, and Alberto Del Bimbo. 2018. Context-aware trajectory prediction. In 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 1941–1946.
Justin Bayer and Christian Osendorfer. 2014. Learning Stochastic Recurrent Networks. stat 1050 (2014), 27.
Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom. 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Nutan Chen, Maximilian Karl, and Patrick Van Der Smagt. 2016. Dynamic movement primitives in latent space of time-dependent variational autoencoders. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids). IEEE, 629–636.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, 103–111.
Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In Advances in neural information processing systems. 2980–2988.
Han Du, Erik Herrmann, Janis Sprenger, Noshaba Cheema, Somayeh Hosseini, Klaus Fischer, and Philipp Slusallek. 2019. Stylistic Locomotion Modeling with Conditional Variational Autoencoder. In Eurographics (Short Papers). 9–12.
Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
Anirudh Goyal Alias Parth Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. 2017. Z-forcing: Training stochastic recurrent networks. In Advances in neural information processing systems. 6713–
Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013). http://dblp.uni-
Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and Theophane Weber. 2019. Temporal Difference Variational Auto-Encoder. In International Conference on Learning Representations.
Xiao Guo and Jongmoo Choi. 2019. Human Motion Prediction via Learning Local Structure Representations and Temporal Dependencies. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 2580–2587.
Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, Taku Komura, Jun Saito, Ikuo Kusajima, Xi Zhao, Myung-Geol Choi, Ruizhen Hu, et al. 2017. A Recurrent Variational Autoencoder for Human Motion Synthesis. In BMVC.
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. 2020. Dream to Control: Learning Behaviors by Latent Imagination. In International Conference on Learning Representations.
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. 2019. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning. 2555–2565.
Félix G Harvey and Christopher Pal. 2018. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs. 1–4.
Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2019. Moglow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598 (2019).
Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. 2019. Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE International Conference on Computer Vision. 7134–7143.
Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36, 4 (2017),
Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–11.
Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs. 1–4.
Wei-Ning Hsu, Yu Zhang, and James Glass. 2017. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems. 1878–1889.
Junfeng Hu, Zhencheng Fan, Jun Liao, and Li Liu. 2019. Predicting Long-Term Skeletal Motions by a Spatio-Temporal Hierarchical Recurrent Network. arXiv preprint arXiv:1911.02404 (2019).
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music Transformer. In International Conference on Learning Representations.
Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. STATE SPACE MODELS FROM RAW DATA. stat 1050 (2017), 3.
Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. stat 1050 (2014), 1.
Jogendra Nath Kundu, Maharshi Gor, Phani Krishna Uppala, and Venkatesh Babu Radhakrishnan. 2019. Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1459–1467.
Chen Li and Gim Hee Lee. 2019. Generating multiple hypotheses for 3d human pose estimation with mixture density network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9887–9895.
Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234.
Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. 2019. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE International Conference on Computer Vision. 9489–9497.
Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
Jianyuan Min and Jinxiang Chai. 2012. Motion Graphs++: A Compact Generative Model for Semantic Motion Analysis and Synthesis. ACM Trans. Graph. 31, 6, Article 153 (Nov. 2012), 12 pages.
Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A Quaternion-based Recurrent Model for Human Motion.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–14.
A. Safonova, Jessica Hodgins, and Nancy Pollard. 2004. Synthesizing physically realistic human motion in low dimensional. ACM Transactions on Graphics - TOG (01 2004).
Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, and Yoshua Bengio. 2018. Twin Networks: Matching the Future for Sequence Generation. In International Conference on Learning Representations. https:
Xiangbo Shu, Liyan Zhang, Guo-Jun Qi, Wei Liu, and Jinhui Tang. 2019. Spatiotemporal Co-attention Recurrent Neural Networks for Human-Skeleton Motion Prediction.
Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. [n.d.]. WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop. 125–125.
He Wang, Edmond SL Ho, Hubert PH Shum, and Zhanxing Zhu. 2019. Spatio-temporal Manifold Learning for Human Motions via Long-horizon Modeling. IEEE transactions on visualization and computer graphics (2019).
Zhiyong Wang, Jinxiang Chai, and Shihong Xia. 2019. Combining recurrent neural networks and adversarial training for human motion synthesis and control. IEEE transactions on visualization and computer graphics (2019).
Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. 2019. Learning Diverse Stochastic Human-Action Generators by Learning Smooth Latent Transitions. CoRR abs/1912.10150 (2019).
Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. 2018. Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis. In International Conference on Learning Representations. https://
... The latter methods generate motion from a control signal [25]. This control signal can be any partial future information, for example, the direction of the movement, speed, type of an action being performed, coordinates of a particular joint, or any combination of the above [25,52,12,18,28,72,66]. However, these methods can handle only simple motions a body, such as walking, running, or sidestepping, and cannot generate fine-grained motions of each of the joints. ...
... By the explicit control, we express a situation when one draws a trajectory of either the whole body or a particular joint and expects that the generated character will follow that trajectory. This area has developed significantly in recent years [54,53,52,25,28,12,59,72,16]. Nevertheless, such methods are trained to handle a single trajectory, often represented as a target pelvis coordinate projected onto the floor's plane, or as a tuple of velocities for each axis. ...
Full-text available
The generation of plausible and controllable 3D human motion animations is a long-standing problem that often requires a manual intervention of skilled artists. Existing machine learning approaches try to semi-automate this process by allowing the user to input partial information about the future movement. However, they are limited in two significant ways: they either base their pose prediction on past prior frames with no additional control over the future poses or allow the user to input only a single trajectory that precludes fine-grained control over the output. To mitigate these two issues, we reformulate the problem of future pose prediction into pose completion in space and time where trajectories are represented as poses with missing joints. We show that such a framework can generalize to other neural networks designed for future pose prediction. Once trained in this framework, a model is capable of predicting sequences from any number of trajectories. To leverage this notion, we propose a novel transformer-like architecture, TrajeVAE, that provides a versatile framework for 3D human animation. We demonstrate that TrajeVAE outperforms trajectory-based reference approaches and methods that base their predictions on past poses in terms of accuracy. We also show that it can predict reasonable future poses even if provided only with an initial pose.
... However, to handle arbitrary in-between motions and target frames, the size of needed data in memory grows exponentially [Harvey et al. 2020]. In the era of deep learning, in-between motions can be interpreted as a motion manifold learning problem [Chen et al. 2020;Holden et al. 2016;Wang et al. 2021], or a control problem [Ling et al. 2020] if dense temporal control signs are available. Compared with previous data-driven methods, deep neural networks can leverage compressed data representations, but cannot be easily converted into in-between motion generators [Harvey et al. 2020]. ...
... However, since the control or constraints can be diverse, the size of needed data in memory to cover all situations grows exponentially [Harvey et al. 2020], leading to unaffordable space complexity. Recently, in-between motions have been interpreted as a motion manifold learning problem [Chen et al. 2020;Holden et al. 2016;Li et al. 2021;Petrovich et al. 2021;Rempe et al. 2021;Wang et al. 2021], or a control problem [Ling et al. 2020] in deep learning. Compared with previous data-driven methods, deep neural networks can leverage compressed data representation [Holden et al. 2020]. ...
Full-text available
Real-time in-between motion generation is universally required in games and highly desirable in existing animation pipelines. Its core challenge lies in the need to satisfy three critical conditions simultaneously: quality, controllability and speed, which renders any methods that need offline computation (or post-processing) or cannot incorporate (often unpredictable) user control undesirable. To this end, we propose a new real-time transition method to address the aforementioned challenges. Our approach consists of two key components: motion manifold and conditional transitioning. The former learns the important low-level motion features and their dynamics; while the latter synthesizes transitions conditioned on a target frame and the desired transition duration. We first learn a motion manifold that explicitly models the intrinsic transition stochasticity in human motions via a multi-modal mapping mechanism. Then, during generation, we design a transition model which is essentially a sampling strategy to sample from the learned manifold, based on the target frame and the aimed transition duration. We validate our method on different datasets in tasks where no post-processing or offline computation is allowed. Through exhaustive evaluation and comparison, we show that our method is able to generate high-quality motions measured under multiple metrics. Our method is also robust under various target frames (with extreme cases).
... Previously, statistical approaches broadly fall within the scope of latent variable models, which leverage Gaussian processes [29] and hidden Markov models [15] to capture the temporal dynamic of human motions. Recently, the success of deep learning methods in various fields [2,9,13,19,23,30,31,34] lead to diverse neural network designs for modeling motion contexts [5,10,18,20]. The general framework encompasses a sequence-to-sequence model, where the observed pose sequence is encoded to a latent motion context that is decoded to output future sequence. ...
Full-text available
Motion prediction is a classic problem in computer vision, which aims at forecasting future motion given the observed pose sequence. Various deep learning models have been proposed, achieving state-of-the-art performance on motion prediction. However, existing methods typically focus on modeling temporal dynamics in the pose space. Unfortunately, the complicated and high dimensionality nature of human motion brings inherent challenges for dynamic context capturing. Therefore, we move away from the conventional pose based representation and present a novel approach employing a phase space trajectory representation of individual joints. Moreover, current methods tend to only consider the dependencies between physically connected joints. In this paper, we introduce a novel convolutional neural model to effectively leverage explicit prior knowledge of motion anatomy, and simultaneously capture both spatial and temporal information of joint trajectory dynamics. We then propose a global optimization module that learns the implicit relationships between individual joint features. Empirically, our method is evaluated on large-scale 3D human motion benchmark datasets (i.e., Human3.6M, CMU MoCap). These results demonstrate that our method sets the new state-of-the-art on the benchmark datasets. Our code will be available at
... Earlier works generally resorted to latent-variables models, such as hidden Markov models [21] to adress the stochasticity of human motion. Fueled by the advances of deep learning, researchers adopted Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) [11], Gated Recurrent Unit (GRU) [31] or refined hierarchical RNN architectures [4,29] that tackled the motion prediction task from a time series perspective. Kinematic trees and Graph Convolutional Networks (GCNs) have been employed to better model the structural correlations among skeletal joints [7,23,30,37,39]. ...
Full-text available
Human motion understanding and prediction is an integral aspect in our pursuit of machine intelligence and human-machine interaction systems. Current methods typically pursue a kinematics modeling approach, relying heavily upon prior anatomical knowledge and constraints. However, such an approach is hard to generalize to different skeletal model representations, and also tends to be inadequate in accounting for the dynamic range and complexity of motion, thus hindering predictive accuracy. In this work, we propose a novel approach in modeling the motion prediction problem based on stochastic differential equations and path integrals. The motion profile of each skeletal joint is formulated as a basic stochastic variable and modeled with the Langevin equation. We develop a strategy of employing GANs to simulate path integrals that amounts to optimizing over possible future paths. We conduct experiments in two large benchmark datasets, Human 3.6M and CMU MoCap. It is highlighted that our approach achieves a 12.48% accuracy improvement over current state-of-the-art methods in average.
... When it comes to imperceptibility, the perceived motion naturalness is vital and not all derivatives are at the same level of importance [40]. Inspired by the work in character animation [39,41,3], we propose a new perceptual loss: ...
Action recognition has been heavily employed in many applications such as autonomous vehicles, surveillance, etc, where its robustness is a primary concern. In this paper, we examine the robustness of state-of-the-art action recognizers against adversarial attack, which has been rarely investigated so far. To this end, we propose a new method to attack action recognizers that rely on 3D skeletal motion. Our method involves an innovative perceptual loss that ensures the imperceptibility of the attack. Empirical studies demonstrate that our method is effective in both white-box and black-box scenarios. Its generalizability is evidenced on a variety of action recognizers and datasets. Its versatility is shown in different attacking strategies. Its deceitfulness is proven in extensive perceptual studies. Our method shows that adversarial attack on 3D skeletal motions, one type of time-series data, is significantly different from traditional adversarial attack problems. Its success raises serious concern on the robustness of action recognizers and provides insights on potential improvements.
... We thus project them back to the manifold. Natural pose manifold can be obtained in two ways: explicit modeling [51] or implicit learning [52,9]. Using implicit learning would require to train a data-driven model then use it for projection [52], breaking BASAR into a two-step system. ...
Full-text available
Skeletal motion plays a vital role in human activity recognition as either an independent data source or a complement. The robustness of skeleton-based activity recognizers has been questioned recently, which shows that they are vulnerable to adversarial attacks when the full-knowledge of the recognizer is accessible to the attacker. However, this white-box requirement is overly restrictive in most scenarios and the attack is not truly threatening. In this paper, we show that such threats do exist under black-box settings too. To this end, we propose the first black-box adversarial attack method BASAR. Through BASAR, we show that adversarial attack is not only truly a threat but also can be extremely deceitful, because on-manifold adversarial samples are rather common in skeletal motions, in contrast to the common belief that adversarial samples only exist off-manifold. Through exhaustive evaluation and comparison, we show that BASAR can deliver successful attacks across models, data, and attack modes. Through harsh perceptual studies, we show that it achieves effective yet imperceptible attacks. By analyzing the attack on different activity recognizers, BASAR helps identify the potential causes of their vulnerability and provides insights on what classifiers are likely to be more robust against attack.
... When it comes to imperceptibility, the perceived motion naturalness is vital and not all derivatives are at the same level of importance [40]. Inspired by the work in character animation [39,41,3], we propose a new perceptual loss: ...
Conference Paper
Full-text available
Action recognition has been heavily employed in many applications such as autonomous vehicles, surveillance, etc, where its robustness is a primary concern. In this paper, we examine the robustness of state-of-the-art action recognizers against adversarial attack, which has been rarely investigated so far. To this end, we propose a new method to attack action recognizers which rely on the 3D skeletal motion. Our method involves an innovative perceptual loss which ensures the imperceptibility of the attack. Empirical studies demonstrate that our method is effective in both white-box and black-box scenarios. Its generalizability is evidenced on a variety of action recognizers and datasets. Its versatility is shown in different attacking strategies. Its deceitfulness is proven in extensive perceptual studies. Our method shows that adversarial attack on 3D skeletal motions, one type of time-series data, is significantly different from traditional adversarial attack problems. Its success raises serious concern on the robustness of action recognizers and provides insights on potential improvements.
Real-time in-between motion generation is universally required in games and highly desirable in existing animation pipelines. Its core challenge lies in the need to satisfy three critical conditions simultaneously: quality, controllability and speed , which renders any methods that need offline computation (or post-processing) or cannot incorporate (often unpredictable) user control undesirable. To this end, we propose a new real-time transition method to address the aforementioned challenges. Our approach consists of two key components: motion manifold and conditional transitioning. The former learns the important low-level motion features and their dynamics; while the latter synthesizes transitions conditioned on a target frame and the desired transition duration. We first learn a motion manifold that explicitly models the intrinsic transition stochasticity in human motions via a multi-modal mapping mechanism. Then, during generation, we design a transition model which is essentially a sampling strategy to sample from the learned manifold, based on the target frame and the aimed transition duration. We validate our method on different datasets in tasks where no post-processing or offline computation is allowed. Through exhaustive evaluation and comparison, we show that our method is able to generate high-quality motions measured under multiple metrics. Our method is also robust under various target frames (with extreme cases).
3D animation production for storytelling requires essential manual processes of virtual scene composition, character creation, and motion editing, etc. Although professional artists can favorably create 3D animations using software, it remains a complex and challenging task for novice users to handle and learn such tools for content creation. In this paper, we present Write-An-Animation, a 3D animation system that allows novice users to create, edit, preview, and render animations, all through text editing. Based on the input texts describing virtual scenes and human motions in natural languages, our system first parses the texts as semantic scene graphs, then retrieves 3D object models for virtual scene composition and motion clips for character animation. Character motion is synthesized with the combination of generative locomotions using neural state machine as well as template action motions retrieved from the dataset. Moreover, to make the virtual scene layout compatible with character motion, we propose an iterative scene layout and character motion optimization algorithm that jointly considers character-object collision and interaction. We demonstrate the effectiveness of our system with customized texts and public film scripts. Experimental results indicate that our system can generate satisfactory animations from texts.
Full-text available
Human-motion generation is a long-standing challenging task due to the requirement of accurately modeling complex and diverse dynamic patterns. Most existing methods adopt sequence models such as RNN to directly model transitions in the original action space. Due to high dimensionality and potential noise, such modeling of action transitions is particularly challenging. In this paper, we focus on skeleton-based action generation and propose to model smooth and diverse transitions on a latent space of action sequences with much lower dimensionality. Conditioned on a latent sequence, actions are generated by a frame-wise decoder shared by all latent action-poses. Specifically, an implicit RNN is defined to model smooth latent sequences, whose randomness (diversity) is controlled by noise from the input. Different from standard action-prediction methods, our model can generate action sequences from pure noise without any conditional action poses. Remarkably, it can also generate unseen actions from mixed classes during training. Our model is learned with a bi-directional generative-adversarial-net framework, which can not only generate diverse action sequences of a particular class or mix classes, but also learns to classify action sequences within the same model. Experimental results show the superiority of our method in both diverse action-sequence generation and classification, relative to existing methods.
Conference Paper
Full-text available
3D human pose estimation from a monocular image or 2D joints is an ill-posed problem because of depth ambiguity and occluded joints. We argue that 3D human pose estimation from a monocular input is an inverse problem where multiple feasible solutions can exist. In this paper, we propose a novel approach to generate multiple feasible hypotheses of the 3D pose from 2D joints. In contrast to existing deep learning approaches which minimize a mean square error based on an unimodal Gaussian distribution, our method is able to generate multiple feasible hypotheses of 3D pose based on a multimodal mixture density networks. Our experiments show that the 3D poses estimated by our approach from an input of 2D joints are consistent in 2D reprojections, which supports our argument that multiple solutions exist for the 2D-to-3D inverse problem. Furthermore , we show state-of-the-art performance on the Hu-man3.6M dataset in both best hypothesis and multi-view settings, and we demonstrate the generalization capacity of our model by testing on the MPII and MPI-INF-3DHP datasets. Our code is available at the project website 1 .
Human motion prediction aims to generate future motions based on the observed human motions. Witnessing the success of Recurrent Neural Networks in modeling the sequential data, recent works utilize RNN to model human-skeleton motion on the observed motion sequence and predict future human motions. However, these methods disregarded the existence of the spatial coherence among joints and the temporal evolution among skeletons, which reflects the crucial characteristics of human motion in spatiotemporal space. To this end, we propose a novel Skeleton-joint Co-attention Recurrent Neural Networks to capture the spatial coherence among joints, and the temporal evolution among skeletons simultaneously on a skeleton-joint co-attention feature map in spatiotemporal space. First, a skeleton-joint feature map is constructed as the representation of the observed motion sequence. Second, we design a new Skeleton-joint Co-Attention mechanism to dynamically learn a skeleton-joint co-attention feature map of this skeleton-joint feature map, which can refine the useful observed motion information to predict one future motion. Third, a variant of GRU embedded with SCA collaboratively models the human-skeleton motion and human-joint motion in spatiotemporal space by regarding the skeleton-joint co-attention feature map as the motion context. Experimental results on motion prediction demonstrate the proposed method outperforms the related methods.
We propose Neural State Machine, a novel data-driven framework to guide characters to achieve goal-driven actions with precise scene interactions. Even a seemingly simple task such as sitting on a chair is notoriously hard to model with supervised learning. The difficulty arises because such a task involves complex planning with periodic and non-periodic motions that react to the scene geometry to precisely position and orient the character. Our proposed deep auto-regressive framework enables modeling of multi-modal scene interaction behaviors purely from data. Given high-level instructions such as the goal location and the action to be launched there, our system computes a series of movements and transitions to reach the goal in the desired state. To allow characters to adapt to a wide range of geometry such as different shapes of furniture and obstacles, we incorporate an efficient data augmentation scheme to randomly switch the 3D geometry while maintaining the context of the original motion. To increase the precision of reaching the goal at runtime, we introduce a control scheme that combines egocentric inference and goal-centric inference. We demonstrate the versatility of our model with various scene interaction tasks such as sitting on a chair, avoiding obstacles, opening and entering through a door, and picking up and carrying objects, all generated in real time from a single model.
Human motion prediction from motion capture data is a classical problem in computer vision, and conventional methods take the holistic human body as input. These methods ignore the fact that, in various human activities, different body components (limbs and the torso) have distinctive characteristics in terms of the moving pattern. In this paper, we argue that local representations of different body components should be learned separately and, based on this idea, propose a network, Skeleton Network (SkelNet), for long-term human motion prediction. Specifically, at each time-step, local structure representations of the input (human body) are obtained via SkelNet's branches of component-specific layers, then the shared layer uses the local spatial representations to predict the future human pose. Our SkelNet is the first to use local structure representations for predicting human motion. Then, for short-term human motion prediction, we propose a second network, named Skeleton Temporal Network (Skel-TNet). Skel-TNet consists of three components: SkelNet and a Recurrent Neural Network, which have advantages in learning spatial and temporal dependencies for predicting human motion, respectively, and a feed-forward network that outputs the final estimation. Our methods achieve promising results on the Human3.6M dataset and the CMU motion capture dataset, and the code is publicly available.
This paper introduces a new generative deep learning network for human motion synthesis and control. Our key idea is to combine recurrent neural networks (RNNs) and adversarial training for human motion modeling. We first describe an efficient method for training an RNN model from prerecorded motion data. We implement RNNs with long short-term memory (LSTM) cells because they are capable of addressing the nonlinear dynamics and long-term temporal dependencies present in human motions. Next, we train a refiner network using an adversarial loss, similar to generative adversarial networks (GANs), such that a discriminative network cannot distinguish refined motion sequences from real mocap data. The resulting model is appealing for motion synthesis and control because it is compact, contact-aware, and can generate an infinite number of natural-looking motions with infinite lengths. Our experiments show that motions generated by our deep learning model are always highly realistic and comparable to high-quality motion capture data. We demonstrate the power and effectiveness of our models by exploring a variety of applications, ranging from random motion synthesis to online/offline motion control and motion filtering. We show the superiority of our generative model by comparison against baseline models.
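The adversarial training of a refiner against a discriminative network can be sketched with the standard GAN binary cross-entropy losses. This is an assumption: the abstract only says the loss is "similar to" GANs, so the exact objective may differ. The function names are hypothetical, and the inputs are the discriminator's probabilities that a sequence is real mocap data.

```python
import math


def binary_cross_entropy(prob, label):
    """Binary cross-entropy for one probability, clamped to avoid log(0)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def discriminator_loss(real_probs, refined_probs):
    """Discriminator objective: label real mocap as 1 and refined motion as 0."""
    real = sum(binary_cross_entropy(p, 1.0) for p in real_probs) / len(real_probs)
    fake = sum(binary_cross_entropy(p, 0.0) for p in refined_probs) / len(refined_probs)
    return real + fake


def refiner_adversarial_loss(refined_probs):
    """Refiner objective (non-saturating form): make refined motion score as real."""
    return sum(binary_cross_entropy(p, 1.0) for p in refined_probs) / len(refined_probs)
```

The two losses pull in opposite directions on the refined sequences: the discriminator is rewarded for scoring them near 0, while the refiner is rewarded when those same scores approach 1.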