Learned Neural Physics Simulation for
Articulated 3D Human Pose Reconstruction
Mykhaylo Andriluka¹, Baruch Tabanpour¹, C. Daniel Freeman²∗, and Cristian Sminchisescu¹
¹Google DeepMind   ²Anthropic
∗Work done while the author was with Google DeepMind.
Abstract. We propose a novel neural network approach to model the
dynamics of articulated human motion with contact. Our goal is to de-
velop a faster and more convenient alternative to traditional physics
simulators for use in computer vision tasks such as human motion re-
construction from video. To that end we introduce a training procedure
and model components that support the construction of a recurrent neu-
ral architecture to accurately learn to simulate articulated rigid body
dynamics. Our neural architecture (LARP) supports features typically
found in traditional physics simulators, such as modeling of joint mo-
tors, variable dimensions of body parts, contact between body parts and
objects, yet it is differentiable, and an order of magnitude faster than
traditional systems when multiple simulations are run in parallel. To
demonstrate the value of our approach we use it as a drop-in replace-
ment for a state-of-the-art classical non-differentiable simulator in an
existing video-based 3D human pose reconstruction framework [13] and
show comparable or better accuracy.
1 Introduction
We introduce a neural network approach to modeling the rigid body dynamics [10] often required for physics simulation of articulated human motion, with the objective of lowering the barrier to using physics-based reasoning in human reconstruction and synthesis. Towards that goal, we propose a physically grounded articulated motion model that is comparable in accuracy to state-of-the-art classical physics simulators (e.g. [8,37]) but is significantly faster. Since it is built from standard deep learning building blocks, it enables easy integration with other modern optimization and learning components.
Our approach can be interpreted as a neural simulator for a subset of rigid body dynamics and is closely related to an established line of work on neural simulation [1,12,30]. We refer to our approach as LARP (Learned Articulated Rigid body Physics). As is common in the literature, we train LARP on examples provided by a standard off-the-shelf physics simulator, essentially learning to approximate rigid body dynamics by means of a neural network. This neural network approximation has several desirable properties. (1) It can be significantly
Fig. 1: Left: Examples of articulated 3d human pose reconstructions obtained with LARP on public benchmarks [22,38] and real-world video. Right: Comparison to com-
mon physics simulators [8,11,29,37] in terms of simulation speed. The x-axis shows
the number of parallel simulations, whereas the y-axis shows the total time taken to
advance all the simulations to the next step.
faster to execute on parallel hardware, as it is composed of simple building blocks well suited for parallelization. Traditional physics simulators typically solve an optimization problem at each simulation step to compute velocities that satisfy collision and joint constraints (see Chapter 11 in [10]). Neural simulation can learn to approximate such calculations: it amortizes the solution of such optimization problems at training time, and allows for faster computation at test time. Our work is also related to approaches in which neural networks learn to directly predict the solution of optimization problems, e.g. for articulated pose estimation [23,35]. (2) Another
advantage of neural simulators is their construction in terms of standard end-
to-end differentiable deep learning components. This is in contrast to some of
the established physics simulators that are non-differentiable and not natively
amenable to parallel simulation [8,37]. Finally, neural physics engines can rely on
parameters estimated directly from real-world data [1,36], which enables realistic
simulation even when analytic formulations cannot be obtained.
We model the motion of a set of articulated bodies using a recurrent neural
network (RNN) with an explicit state aggregating the physical parameters of
each body part. For each articulated object type we define a neural network
that updates its state given the previous state, external forces and internal joint
torques applied at the current step. Such networks, implemented as multi-layer
perceptrons (MLP), are applied recursively over time. We use different MLPs
for each object, e.g. the person and the ball in fig. 2. To model interactions between objects we define a collision sub-network that computes additional inputs for the MLPs associated with each object, via sum-pooling over the other objects in the scene, similar to collision handling in graph neural networks [30].
Contributions. Our main contribution is a neural network architecture and
training procedure that results in a model of rigid body dynamics (LARP) that
can compute accurate human motion trajectories up to an order of magnitude
faster than traditional physics simulators (fig. 1). We measure accuracy both
directly as well as in the context of physics-based 3D pose reconstruction from
video by integrating LARP into the framework of [13].
Fig. 2: Left: Overview of our approach (LARP). At time $t$ the input to the neural simulator is given by a state of the scene $S_t$ and joint control targets $Q^p_t$. Here the state is composed of the state of the person $S^p_t$ and of the ball $S^b_t$. LARP propagates the state through time by recurrently applying contact and dynamics networks. We visualize the state of each rigid component of the articulated body using rectangles with a color matching the scene structures. Right: Scenarios used to evaluate our approach: chain of linked capsules (a), two colliding capsule chains (b), articulated pose reconstruction from video (c), human-ball and human-capsule collision handling (d). See Supp. Mat. for videos.
Even though the architecture of LARP is seemingly simple, the results signif-
icantly depend on training details such as data augmentation, gradient clipping,
input features and length of sequences used for training (§4.1). As an additional
contribution, we perform a detailed analysis of the model components and the
impact of training parameters, identifying those ingredients that make the model
perform well. We plan to make our implementation and pre-trained models pub-
licly available upon publication.
2 Related Work
Our approach can be viewed as a special type of neural physics simulator focused
on articulated human motion. Neural physics is an established research area with early work going back at least to [16]. Neural simulation has been applied to
phenomena as diverse as weather forecasting [24], simulation of liquids, clothing
and deformable objects [6,26,31], as well as groups of rigid bodies [2,4,17].
Particularly relevant to our work is the literature focused on rigid body dy-
namics. The paradigm in this area is to formulate neural simulation as a graph-
convolutional neural network (GCN) [2,4,17]. Recent work has shown the ability to simulate a large number of objects and correctly handle collisions even for objects with complex shapes [2,4]. It has been demonstrated that in certain cases neural simulation can improve over standard rigid body simulators by directly training on real data [4,36]. We believe that LARP is complementary to these works. Whereas they primarily focus on simulating a large number of rigid objects passively moving through the scene, we instead focus on modeling objects with a large number of components connected by joints that are actively controlled by motors. To the best of our knowledge the only work that applies GCNs to modeling the dynamics of complex articulated objects is [33]. LARP is methodologically similar to [33], but addresses the more complex task of modeling human motion and interaction with scene objects (see fig. 6).
Closely related to our work is SuperTrack [12]. As in LARP, [12] trains a fully connected neural network that updates the dynamic state of human body parts. LARP can be seen as a generalization of [12] that enables multiple objects, object collisions, improved performance on long sequences, and support for bodies with variable dimensions of the body parts. We demonstrate (see fig. 7) that compared to SuperTrack, LARP can generate long-term motion trajectories without loss of consistency of the simulation state (see Sec. 7 and Fig. 14 in [12] for a discussion of limitations).
LARP can be seen as an autoregressive motion model that can generate
physically plausible human motion conditioned on the control parameters of
the body joints. The key difference from methods such as [14,19,20,27,32] is
that LARP is trained on motion trajectories generated by physics simulation
and does not include a probabilistic model for the joint control parameters.
In contrast, models such as [32] are trained on motion capture data and focus
on representing statistical dependencies in the articulated motion, but do not
directly incorporate physics-related motion features. We see these approaches
as complementary and hope to incorporate probabilistic motion control in the
future.
3 Methodology
Overview. The architecture of LARP is shown in fig. 2. We represent the scene
by a set of articulated objects, each corresponding to a tree of rigid compo-
nents connected by joints. We refer to such components as “links” using physics
simulation terminology [9]. Each joint is optionally equipped with a motor that
can generate torque to actuate the object. The dynamic state of each object
is given by position, orientation, and (rotational) velocity of its links, given in
world coordinates. At each time step $t$ LARP takes the state of the scene $S_t$ and optionally a set of motor control targets $Q_t$ as input, and produces the state of the scene at the next time step. This process is applied recurrently to obtain
motion trajectories over longer time horizons.
Notation. Let us assume that the scene is composed of $M$ articulated bodies, each consisting of $N$ links. Let $x^i_{tb} \in \mathbb{R}^3$ denote the position of link $i \in [1 \ldots N]$ of the body $b \in [1 \ldots M]$ at time $t$. We use $q^i_{tb} \in \mathbb{R}^4$ for the quaternion representing body orientation, and $v^i_{tb} \in \mathbb{R}^3$ and $\omega^i_{tb} \in \mathbb{R}^3$ for linear and angular velocities, respectively. The torque applied at time $t$ is $\tau^i_{tb} \in \mathbb{R}^3$, set to zero if no external torque was provided. The quaternions representing the joint targets are $Q^i_{tb} \in \mathbb{R}^4$, and are also set to zero if joint targets are not specified. We assume that in our world representation gravity is aligned with the z-axis, and the ground plane corresponds to $z = 0$.
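For concreteness, the notation above maps naturally onto a simple array-based state container. The following is a minimal JAX sketch with illustrative names and array layouts; it is not our released implementation.

```python
# Minimal sketch of a scene-state container matching the notation above
# (names and array layout are illustrative, not taken from the paper's code).
from typing import NamedTuple
import jax.numpy as jnp

class SceneState(NamedTuple):
    x: jnp.ndarray      # link positions,      shape [M, N, 3]
    q: jnp.ndarray      # link orientations,   shape [M, N, 4] (quaternions)
    v: jnp.ndarray      # linear velocities,   shape [M, N, 3]
    omega: jnp.ndarray  # angular velocities,  shape [M, N, 3]
    tau: jnp.ndarray    # external torques,    shape [M, N, 3] (zeros if none)
    Q: jnp.ndarray      # joint target quats,  shape [M, N, 4] (zeros if none)

def initial_state(M: int, N: int) -> SceneState:
    """All links at rest at the origin with identity orientation."""
    identity_quat = jnp.tile(jnp.array([1.0, 0.0, 0.0, 0.0]), (M, N, 1))
    zeros3 = jnp.zeros((M, N, 3))
    return SceneState(x=zeros3, q=identity_quat, v=zeros3,
                      omega=zeros3, tau=zeros3, Q=jnp.zeros((M, N, 4)))
```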
3.1 Dynamics Network
The main component of LARP is a per-object dynamics network fd. The network
takes a concatenated vector of features encoding the state of the body links, and
outputs the linear and rotational velocity of each link in the next time step.
The dynamics network is implemented as a densely connected neural network with $L = 12$ layers and ELU nonlinearities [7]. In the following we describe the features used to represent the state of each link. We drop the timestep and the articulated body index to avoid clutter.
Dynamic state. The features encoding the dynamic state of the link correspond to the root-relative position of each link $x^i_{rrel}$, the z-position of each link $x^i_z$, the world orientation of each link represented as a flattened rotation matrix ($q^i_{9d}$),
and linear and angular velocities. We experimented with other encodings of
orientation such as quaternions or 6d representation and found that directly
passing the rotation matrix works best. This representation is invariant to shifts
parallel to the ground plane which we assume orthogonal to the z-axis. Note
that we do not encode rotation invariance along the z-axis explicitly and instead
induce it via data augmentation (§3.4). We experimentally found that taking
shift and rotation invariances into account is essential in making the approach
work.
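As an illustration of this feature encoding, the sketch below assembles the per-link dynamic-state features (root-relative position, z height, flattened rotation matrix, and velocities) for one articulated body; the function names are ours, and the exact feature ordering in our implementation may differ.

```python
# Sketch of the per-link dynamic-state features described above; assumes
# quaternions in (w, x, y, z) order. Names and ordering are illustrative.
import jax
import jax.numpy as jnp

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q
    return jnp.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def dynamic_state_features(x, q, v, omega, root_index=0):
    """x: [N, 3] positions, q: [N, 4] quaternions, v/omega: [N, 3] velocities."""
    x_rrel = x - x[root_index]                                  # root-relative positions
    x_z = x[:, 2:3]                                             # height above ground plane
    r9d = jax.vmap(quat_to_rotmat)(q).reshape(q.shape[0], 9)    # flattened rotation matrices
    return jnp.concatenate([x_rrel, x_z, r9d, v, omega], axis=-1)  # [N, 3+1+9+3+3]
```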
Geometric information. For each link we include its length $l^i$ and radius $r^i$, as well as the displacement error $d^i$ of the joint to the parent link. Our state representation independently encodes the positions of all links in world coordinates. This is often referred to as a maximal coordinates representation in the physics simulation literature [11,21]. Since the positions of all links are updated independently, they might float apart and disagree on the position of the mutual joint after a number of update steps. To mitigate this effect we explicitly provide the difference between the positions of the joint computed based on the state of the child and parent link, and add a loss function to penalize such displacement during training (§3.4). We show experimentally that these components are essential for good performance (fig. 4 and fig. 3).
Control. We include the target orientation $Q^i$ of the link relative to the parent. For traditional simulators such as [9,37] target orientation is used by the control algorithm (e.g. PD-control) to compute the torque applied by the joint motor, whereas LARP is supposed to learn from data how to transform control targets into updates of the link state. In addition we also provide an external torque $\tau^i$, which is used in the experiments in §4.1 to diversify the motion trajectories.
Contact. We include a feature vector $\hat{p}^i \in \mathbb{R}^6$ computed by the contact network $f_c$ that encodes the information about the interaction with the links of other objects. The computation of the contact network $f_c$ is described in §3.2.
3.2 Contact Network
Given two articulated bodies $m$ and $b$, the contact feature $\hat{p}^i_m$ is given by
$$\hat{p}^i_m = \sum_j f_c\big(\phi^j_b, \phi^i_m, c(\phi^j_b, \phi^i_m), \theta_c\big). \qquad (1)$$
The contact feature for link $i$ is the summation of $f_c$ over all links $j$, where $f_c$ is a neural network with $K = 6$ fully-connected layers, and the vector $\phi^i_b$ contains the position $x^i_b$, quaternion orientation $q^i_b$, velocity $v^i_b$, angular velocity $\omega^i_b$, capsule length $l^i_b$, capsule radius $r^i_b$, and mass $m^i_b$ of link $i$. The function $c(\cdot)$ computes properties of the collision between two links: the link-relative contact point, normal, and penetration distance. We use the capsule-capsule contact implementation in Brax [11], which is implemented in JAX and is differentiable [5]. Before feeding features to the contact network, they are normalized by their mean and standard deviation obtained from a single batch of offline training data. As motivated empirically in §4.1, we apply a stop-gradient on the contact feature before feeding it into the dynamics network to stabilize training. Notice that the summation of $f_c$ in (1) is similar in spirit to the summation of forces to obtain a net force (e.g. Newton's second law), and is also similar to the graph neural networks (GNN) used in neural physics simulators [3]; our approach is a special case of a GNN in which the graph between articulated objects is fully connected.

ALGORITHM 1: Training of the dynamics and contact models given two articulated bodies.
Input: dynamics network parameters $\theta_d$, contact network parameters $\theta_c$, a batch of sub-sequences $S$, dynamics network feature normalization $\mu_d$, $\sigma_d$, contact network feature normalization $\mu_c$, $\sigma_c$, dynamics network output normalization $\mu_{do}$, $\sigma_{do}$, dynamics feature function $f_x$, contact feature function $f_y$, $dt = 0.01$, $M = 2$, $N_h = 20$.
Output: $\theta_d$, $\theta_c$.
repeat
    $\hat{S}_0 = S_0$
    for $t = 1$ to $N_h$ do
        $Y_{t,1} = (f_y(\mathrm{stopgrad}(\hat{S}_{t,1})) - \mu_c)/\sigma_c$
        $Y_{t,2} = (f_y(\mathrm{stopgrad}(\hat{S}_{t,2})) - \mu_c)/\sigma_c$
        $\hat{p}^{ij}_{12} = f_c(Y^i_{t,1}, Y^j_{t,2}, h(Y^i_{t,1}, Y^j_{t,2}), \theta_c)$
        $\hat{p}^{ji}_{21} = f_c(Y^j_{t,2}, Y^i_{t,1}, h(Y^j_{t,2}, Y^i_{t,1}), \theta_c)$
        $\hat{p}^i_1 = \sum_j \hat{p}^{ij}_{12}$;  $\hat{p}^j_2 = \sum_i \hat{p}^{ji}_{21}$
        for $b = 1$ to $M$ do
            $X_t = (f_x(\hat{S}_t, \hat{p}^1_b, \ldots, \hat{p}^N_b) - \mu_d)/\sigma_d$
            $(\hat{v}_{t+1}, \hat{\omega}_{t+1}) = f_d(X_t, \theta_d)$
            $(\hat{v}_{t+1}, \hat{\omega}_{t+1}) = \sigma_{do}\,(\hat{v}_{t+1}, \hat{\omega}_{t+1}) + \mu_{do}$
            $(\hat{x}_{t+1}, \hat{q}_{t+1}) = \mathrm{Integrate}(\hat{S}_t, \hat{v}_{t+1}, \hat{\omega}_{t+1})$
            $\hat{S}_{t+1} = (\hat{x}_{t+1}, \hat{q}_{t+1}, \hat{v}_{t+1}, \hat{\omega}_{t+1})$
        end
    end
    Compute loss $L$ from Equation (5) using $\hat{S}$ and $S$
    $(\theta_c, \theta_d) \leftarrow \mathrm{Adam}(\theta_c, \theta_d, L)$
until the number of epochs is reached
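The sum-pooling in Eq. (1) can be written compactly with vmap. The sketch below is illustrative: `contact_mlp` and `collision_properties` are placeholders for the learned network $f_c$ and the analytic capsule-capsule collision routine $c(\cdot)$, and are assumed to be JAX-traceable callables.

```python
# Minimal sketch of the contact-feature pooling in Eq. (1). `contact_mlp` and
# `collision_properties` stand in for f_c and c(.) and are assumptions here.
import jax
import jax.numpy as jnp

def contact_features(phi_m, phi_b, contact_mlp, collision_properties, theta_c):
    """phi_m: [N_m, F] link features of body m; phi_b: [N_b, F] of body b.
    Returns one 6-d contact feature per link of body m."""
    def per_pair(phi_i, phi_j):
        c_ij = collision_properties(phi_j, phi_i)   # contact point, normal, penetration
        return contact_mlp(jnp.concatenate([phi_j, phi_i, c_ij]), theta_c)  # [6]

    def per_link(phi_i):
        pair_feats = jax.vmap(lambda phi_j: per_pair(phi_i, phi_j))(phi_b)   # [N_b, 6]
        return pair_feats.sum(axis=0)               # sum-pool over the other body's links

    return jax.vmap(per_link)(phi_m)                # [N_m, 6]
```

In Alg. 1 the inputs to $f_c$ are additionally normalized and computed from a stop-gradient of the simulated state.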
3.3 Integrator
To compute the position and orientation of all links in the next timestep, we integrate the velocity outputs of the dynamics network $f_d$ over a timestep $dt$. The world position and orientation of each link are given by $\hat{x}_{t+1} = x_t + \hat{v}_t\,dt$ and $\hat{q}_{t+1} = q_t + g(\hat{\omega}_t, q_t)\,dt$, where $g(\hat{\omega}_t, q_t) = 0.5\,\mathrm{quat}(\hat{\omega}_t) \otimes q_t$ computes the quaternion derivative from the rotational velocity, and $\otimes$ is the quaternion product.
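A direct transcription of this update rule is given below; the re-normalisation of the updated quaternion is our addition for numerical safety and is not part of the formula above.

```python
# Sketch of the explicit integration step of §3.3; quaternions are (w, x, y, z).
import jax.numpy as jnp

def quat_mul(p, q):
    """Hamilton product of two quaternions."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return jnp.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])

def integrate(x, q, v_hat, omega_hat, dt=0.01):
    """Advance one link: position via explicit Euler, orientation via the
    quaternion derivative 0.5 * quat(omega) ⊗ q."""
    x_next = x + v_hat * dt
    omega_quat = jnp.concatenate([jnp.zeros(1), omega_hat])   # quat(omega) = (0, wx, wy, wz)
    q_next = q + 0.5 * quat_mul(omega_quat, q) * dt
    q_next = q_next / jnp.linalg.norm(q_next)                 # re-normalise (our addition)
    return x_next, q_next
```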
3.4 Training
We estimate the parameters of the dynamics and contact networks on a dataset of training examples corresponding to sequences of scene states comprising the position, orientation, and 6d velocity of all $N$ links of all $M$ articulated bodies over a period of $T$ steps. Let us denote a given ground-truth sequence of states as $S = \{S_t \mid t = 1, \ldots, T\}$, and a sequence of states produced by the model starting from the initial state $S_0$ by $\hat{S}$. The loss used to estimate the parameters consists of a position, rotation, and joint displacement loss over a rollout of length $T$. The position loss is given by
$$L_p = \frac{1}{NMT} \sum_{i,j,t} (x^{ij}_t - \hat{x}^{ij}_t)^2. \qquad (2)$$
The rotation loss is
$$L_r = \frac{1}{NMT} \sum_{i,j,t} \big(1 - |q^{ij}_t \cdot \hat{q}^{ij}_t|\big), \qquad (3)$$
and the joint displacement loss is the mean-squared error between the predicted child and parent joint positions
$$L_d = \frac{1}{NMT} \sum_{i,j,t} (\hat{x}^{ij}_{p,t} - \hat{x}^{ij}_{c,t})^2. \qquad (4)$$
The joint displacement loss encourages joint constraints to be respected. Notice that if joint constraints are not violated, $L_d = 0$. The total loss is
$$L = w_p L_p + w_r L_r + w_d L_d, \qquad (5)$$
where $w_p = 1$, $w_r = 1$, and $w_d = 0.1$ are hyper-parameters that balance the loss terms to have similar weight during training. We summarize the details of model training for the case of $M = 2$ articulated bodies in Alg. 1.
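For reference, the rollout loss of Eqs. (2)-(5) can be computed as in the sketch below; the tensor names and shapes are our assumptions (leading dimensions: time $T$, bodies $M$, links $N$).

```python
# Minimal sketch of the rollout loss in Eqs. (2)-(5); shapes are assumptions.
import jax.numpy as jnp

def rollout_loss(x_gt, q_gt, x_pred, q_pred, joint_child, joint_parent,
                 w_p=1.0, w_r=1.0, w_d=0.1):
    """x_*: [T, M, N, 3] positions; q_*: [T, M, N, 4] unit quaternions;
    joint_child/joint_parent: [T, M, N, 3] joint positions predicted from the
    child and the parent link, respectively."""
    L_p = jnp.mean(jnp.sum((x_gt - x_pred) ** 2, axis=-1))              # Eq. (2)
    L_r = jnp.mean(1.0 - jnp.abs(jnp.sum(q_gt * q_pred, axis=-1)))      # Eq. (3)
    L_d = jnp.mean(jnp.sum((joint_parent - joint_child) ** 2, axis=-1)) # Eq. (4)
    return w_p * L_p + w_r * L_r + w_d * L_d                            # Eq. (5)
```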
Implementation details. Depending on the dataset and setting we generate state trajectories with a length between 80 and 200 steps. At training time we then subdivide them into batches of shorter subsequences of length $T$. Note that $T = 1$ amounts to doing “teacher forcing” on every step, meaning that the model always gets one of the ground-truth states as input. We experimentally observe that models trained on short subsequences ($T < 5$) perform poorly (see §4.1). This is likely because there are subtle differences between ground-truth states and states generated by the model, and the model needs to be exposed to sufficiently diverse generated states at training time. To stabilize the training for larger $T$ we add gradient norm clipping and drop training batches whose gradient norm is above the threshold of 0.3. Note that the dynamics network outputs link velocities, but the loss in eq. (5) compares link positions and orientations. We experimented with a loss that compares velocities directly, but found that models trained with such a loss performed poorly. Finally, models did not perform well unless we augmented the training data by randomly rotating each batch around an axis orthogonal to the ground plane (the z-axis in our scenes).
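The z-axis rotation augmentation mentioned above amounts to rotating all positions (and, correspondingly, velocities and orientations) of a batch by one shared random angle around the gravity axis. A simplified sketch for the vector-valued quantities is shown below; rotating the quaternion orientations would additionally require composing them with the same rotation, which we omit here.

```python
# Sketch of the random z-rotation augmentation; helper names are illustrative.
import jax
import jax.numpy as jnp

def rotate_about_z(vecs, angle):
    """vecs: [..., 3] world-frame vectors; rotate by `angle` radians about z."""
    c, s = jnp.cos(angle), jnp.sin(angle)
    Rz = jnp.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return vecs @ Rz.T

def augment_batch(key, positions, velocities):
    """Apply one shared random z-rotation to all positions and velocities in a batch."""
    angle = jax.random.uniform(key, (), minval=0.0, maxval=2.0 * jnp.pi)
    return rotate_about_z(positions, angle), rotate_about_z(velocities, angle)
```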
4 Results
Fig. 3: Example of a generated sequence
obtained with a model that includes dis-
placement features and displacement loss
(bottom row), and without either of these
elements (middle row). We highlight incon-
sistencies of the joint positions with the red
circles.
In §4.1 we first analyze the impor-
tance of various components and the
effect of training parameters using
simple articulated objects correspond-
ing to capsule chains (fig. 2(a, b)).
Equipped with the best settings from
this analysis, in §4.2 we evaluate
LARP on a more complex humanoid
model with joint motors (fig. 2(c,
d)). Specifically, we compare LARP to an off-the-shelf physics simulator on the task of reconstructing human motion from monocular video. To that end we use the physics-based articulated pose estimation approach from [13] that employs the Bullet simulator [8], and replace Bullet with LARP. We
further evaluate the accuracy of sim-
ulating collisions between a human and an external object represented by a ball
(or capsule), quantitatively compare LARP with related work and benchmark
LARP simulation speed against several established physics simulation engines.
4.1 Analysis of the model components
Datasets and training. We begin our evaluation using the setting shown in
fig. 2 (a), in which four capsules linked by joints fall to the ground subject to a random initial state and a randomly applied initial torque. The capsules can
freely rotate at the joint, and can slide and roll on the ground following impact.
This setting is simple enough to enable quick experimentation yet is generally
non-trivial since the motion of each capsule needs to obey gravity, ground colli-
sion, and joint constraints. We generate 500k sequences of length 200 for training,
and 25k sequences for testing and validation. Each sequence starts with a unique
random position, orientation, and velocity. We apply a random torque to the top
capsule of the chain for the first 10 steps, and randomize capsule lengths and
Fig. 4: Experiments with the dynamics network: training subsequence length $N_h$ (a), ablations of the joint displacement feature and loss (b), ablations of non-linearity type and learning rate schedule (c), and evaluation of variants for the contact network (d). Error bars show one standard deviation calculated over 5 runs. The x-axis sweeps the time window over which metrics are computed.
radii for every link. We show an example of a generated sequence in fig. 3 (top row). We train the models for 200 epochs using a batch size of 30k and a learning rate of 1e-3. To reduce the effect of random initialization of the neural network parameters, in the following experiments we train each network 5 times and use the validation set to pick the best model.
Analysis of the dynamics model.
Our major finding is that the combination of training on longer sequences, using position, rotation, and joint displacement-based features and losses (eq. 4), and applying gradient clipping leads to significantly better results than reported in related work (e.g. [12]).
We observe that results improve considerably when the dynamics is unrolled for a larger number of steps during training (see fig. 4 (a)). The position error is reduced from 1.1 m to 11 cm when increasing the training sequence length from 2 to 20 steps. Training on longer sequences leads to a somewhat larger joint displacement error, but consistently reduces the global position and rotation error. Given the setting of $N_h = 20$ steps for the training sequence length, we evaluate the importance of the displacement loss and displacement features in fig. 4 (b). We observe that removing these components has a significant negative impact: the position error increases from 11 to 15 cm, and the joint displacement error increases from nearly 0 to 6.1 cm. Gradient clipping was essential for training on longer sequences with $N_h > 10$. This is consistent with observations from training recurrent neural
          Nh=10   Nh=40   Nh=80   Nh=190
1 link    0.008   0.021   0.048   0.173
2 link    0.007   0.023   0.047   0.137
(a) Average difference between Bullet and LARP trajectories (in meters) for two colliding 1-link and 2-link capsule chains.

          Nh=10   Nh=38   Nh=60
Human     0.010   0.031   0.048
Ball      0.006   0.033   0.061
Human     0.010   0.031   0.050
Capsule   0.010   0.042   0.069
(b) Average difference between Bullet and LARP trajectories (in meters) for a human kicking a ball/capsule for sequences of 10, 38 and 60 simulation steps.
Fig. 5: Evaluation of LARP on datasets with colliding objects.
networks where gradient clipping is common in stabilizing training [15]. We also
found that supervising our neural simulator for positions instead of velocities is
essential for good performance. Note that our simulator outputs velocities that
are used to compute positions via integration. Replacing the position loss (eq. 2)
with a corresponding velocity loss degrades position accuracy from 11 cm. to
95 cm. Our explanation is that the network must occasionally output small de-
viations from target velocities to correct the drift in the body joints and using
the velocity loss hinders such desirable behavior. In fig. 4 (c) we verify that, as previously reported in [12] and [36], using a soft activation function such as ELU or GELU is favorable compared to the more standard ReLU in the context of neural dynamics simulation. Finally, in fig. 4 (c) we show that a learning rate schedule helps to improve accuracy.
Analysis of the contact model. We analyze the variants of the contact net-
work using a diagnostic environment with two chains composed of two connected
capsules shown in fig. 2(b). As in experiments with the dynamics network we
report position, rotation, and joint displacement error for each variant of the
contact model. We consider the following variants: (1) our full “contact feature” model shown in Alg. 1, where the output of the contact network is fed as a feature to the dynamics network; (2) an alternative “contact impulse” variant, where the output of the contact network is used directly in the integrator as an additive factor to the linear and rotational velocities $v$ and $\omega$. For the primary variant (1) we evaluate versions with and without stop-gradient. Results in fig. 4 (d) indi-
cate that all variants are able to meaningfully handle the contact, and that the
“contact impulse” formulation performs somewhat worse than “contact feature”
variant. Note that adding stop-gradient is important for good performance and
that without stop-gradient the “contact feature” model exhibits high variance
across retraining runs.
We show results for the best variant on the held-out test dataset in tab. 5a.
4.2 Simulation of articulated human motion
The goal of this paper is to propose a component for modeling articulated motion dynamics suitable for a variety of tasks (e.g. as a prior for physics-based reconstruction or motion synthesis). We demonstrate one such application by using LARP as a
replacement for Bullet in a physics-based human motion reconstruction pipeline
[13], which reconstructs human motion via trajectory optimization. Given an
Dataset     Model           MPJPE-G  MPJPE  MPJPE-PA  MPJPE-2d  Velocity  Foot skate
H3.6M       DiffPhy [18]      139      82      56        13        -         7.4
H3.6M       [13] + Bullet     143      84      56        13       0.24       4
H3.6M       [13] + LARP       143      85      56        13       0.25       5.4
AIST-easy   DiffPhy [18]      150     106      66        12        -        19.6
AIST-easy   [13] + Bullet     154     113      69        13       0.41       4
AIST-easy   [13] + LARP       150     113      70        13       0.37       5
AIST-hard   [13] + Bullet     654     437      83        16       0.17       8.5
AIST-hard   [13] + LARP       643     442      73        14       0.13       7.2
Table 1: Quantitative results obtained with LARP on the Human3.6M [22] and AIST [38] datasets and comparison to related work.
Fig. 6: Left: Reconstructed 3d poses on four consecutive video frames from the AIST-hard dataset. The middle row shows results obtained with the kinematic pipeline from [13,18] that LARP uses for initialization. The bottom row shows results obtained with LARP integrated into [13]. Middle: Motion sequence with person-ball collision simulated with
LARP (bottom) and comparison to Bullet engine [8] (top). Right: Examples of gener-
ated human motion sequences of a person kicking a ball for three different ball targets.
In each image we show position of the ball right after the kick and at the end of the
sequence. Note that the person pose differs considerably depending on the ball target.
input video, [13] infers a sequence of control parameters that results in physically
simulated human motion that agrees well with observations (e.g. locations of 2d
body joints in the frames of the video input). We refer to [13] for the details and
focus on our results below.
Datasets. Following [13] we evaluate on Human3.6M [22] and AIST [38]. Hu-
man3.6M is a large dataset of videos of people performing common everyday
activities with ground-truth 3D poses acquired using marker-based motion cap-
ture. We follow the protocol proposed in [34] that excludes “Seating” and “Eating”
activities. This leaves a test set of 20 videos with 19,690 frames in total. AIST includes diverse videos of dancing people with motions that are arguably more complex and dynamic compared to Human3.6M. The evaluation on AIST uses
pseudo-ground truth 3d joint positions computed by triangulation from multi-
ple camera views as defined by [25]. We use the subset of 15 AIST videos as
Fig. 7: Left and middle: average mean per-joint position error and joint displacement
error computed over all test sequences of the Human3.6M dataset. Right: displacement
loss for the “S9-WalkDog” sequence. The units for y-axis are meters.
Fig. 8: Example results obtained with LARP on real-world videos.
used for evaluation in [13,18]. In addition, we evaluate on 15 AIST videos with
more challenging motions that involve fast turns and rotations. We refer to these
subsets of AIST as “AIST-easy” and “AIST-hard” in tab. 1.
Model training. We employ the physics-based human model introduced in [13]
which is based on the GHUM model of human shape and pose [39]. The model is
composed of 26 rigid components represented as capsules. We generate training
data for LARP by running the sampling-based optimization from [13] on the
training set and recording a subset of human motion samples and corresponding
joint torques. This produces a significantly more diverse set of examples com-
pared to original motion capture sequences, including examples of people falling
on the ground and various self-collisions. We use the best settings for LARP as
identified in the experiments in §4.1.
Metrics. We use standard mean per joint position (MPJPE) metrics for evalua-
tion. The MPJPE-G metric measures mean joint error in the world coordinates
after aligning estimated and ground-truth 3d pose sequences with respect to
pelvis position in the first frame. MPJPE does pelvis alignment independently
per-frame, whereas MPJPE-PA relies on Procrustes to align both position and
orientation for each frame. Finally MPJPE-2D measures 2d joint localization
accuracy. In addition we use two metrics to measure physical motion plausibility. “Velocity” measures the velocity error between the estimated motion and the ground truth, and is high when the estimated motion is “jittery”. “Footskate” is implemented as in [18] and corresponds to the percentage of frames where the
foot moves between adjacent frames by more than 2cm, while in contact with
the ground. “Footskate” measures the presence of a common artifact in video-
based pose estimation where the positions of a person’s feet unreasonably shift
between nearby frames.
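For clarity, the two position metrics can be summarized by the following sketch; it follows the descriptions above, and the alignment conventions of the official benchmark code may differ in detail.

```python
# Illustrative computation of MPJPE and MPJPE-G as described above.
import jax.numpy as jnp

def mpjpe(pred, gt, pelvis=0):
    """pred, gt: [T, J, 3] joint positions; per-frame pelvis alignment."""
    pred_rel = pred - pred[:, pelvis:pelvis + 1]
    gt_rel = gt - gt[:, pelvis:pelvis + 1]
    return jnp.mean(jnp.linalg.norm(pred_rel - gt_rel, axis=-1))

def mpjpe_g(pred, gt, pelvis=0):
    """World-coordinate error after aligning the pelvis position in the first frame only."""
    offset = gt[0, pelvis] - pred[0, pelvis]
    return jnp.mean(jnp.linalg.norm((pred + offset) - gt, axis=-1))
```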
Results. Our results are presented in tab. 1and indicate that reconstructions us-
ing LARP are similar or better compared to Bullet. For example, LARP achieves
150 mm. MPJPE-G on AIST-easy and 643 mm. on AIST-hard, compared to 154
mm. and 654 mm. for Bullet. Note that [13] performs in general slightly worse than DiffPhy [18], which makes use of a differentiable physics simulator and gradient-based optimization. In principle our approach should work in combination with [18], and we plan to explore this in future work.
Fig. 9: Example of es-
timated pose from “S9-
WalkDog” seq. after 11
sec. of input. Left: in-
put frame, middle: re-
sult obtained with Super-
Track trained on longer
sequences, right: result
obtained with LARP.
In fig. 7 (left, center) we present plots of the position error (MPJPE) and joint displacement error averaged over all test sequences of the Human3.6M dataset (when computing the average we truncate the output to the shortest sequence), and compare to the SuperTrack approach of [12] (we use a reimplementation of SuperTrack in this comparison since the original paper did not make the code or pre-trained models public). We show the joint displacement error for one of the longer sequences, “S9-WalkDog”, in fig. 7 (right). We observe that for LARP the simulation indeed accumulates inconsistencies over time, albeit rather slowly: the joint displacement error grows to about 2 cm after 1500 simulation steps. This appears to have negligible effect on
pose estimation accuracy: note that MPJPE does not
increase with the simulation steps. Our observation
is that instability (e.g. inconsistent simulation state
with high joint displacement error) is not necessar-
ily a function of the number of simulation steps, but
rather happens during complex motions that are un-
derrepresented in the training data. For example in
fig. 7(right) the displacement error changes little for
most of the 25 sec. long sequence, but jumps up during the abrupt transition
from walking into rushing forward (see fig. 9). We believe that training on a larger and more diverse dataset might mitigate this.
We show examples of the human pose reconstructions on AIST-easy in fig. 1
(left plot, middle column) and AIST-hard in fig. 6 (left). In fig. 6 (left) we compare the output of LARP (bottom row) to the 3d pose reconstruction that does not employ physics-based constraints and is used by LARP for initialization (middle row). Note that inference with LARP was able to correct the physically implausible leaning of the person and estimate the correct pose of the lower body. Finally, in fig. 8 and fig. 1 (left, third column) we include a few examples of 3d pose reconstructions obtained with LARP on real-world videos. These results confirm that LARP generalizes beyond the AIST and Human3.6M datasets, is able to handle motions different from those in the training set (e.g. the karate kick in fig. 8), and can handle complex contact with the ground (e.g. the safety roll in fig. 1).
Simulation speed. We measure the speed of our approach using the humanoid environment in fig. 2 (c) and compare it to Bullet [9], MuJoCo [37], and Brax [11]. Bullet and MuJoCo were run on an AMD 48-core CPU for best performance. Brax, MuJoCo XLA (MJX) [29], and LARP are implemented to run on SIMD architectures and are thus run on a Tesla V100 Nvidia GPU for best performance (Brax PBD is an implementation of position-based dynamics [28], which is unstable for the given mass/inertia in the humanoid model, but is included for speed comparison). For each simulator we measure the time required to perform a simulation step for different numbers of simulations run in parallel. We run open-loop rollouts of 100 steps averaged over 5 runs for each number of parallel simulations. The compilation time is excluded. The results are shown in fig. 1. Overall, we observe that LARP exhibits better performance already when performing a single simulation (0.07 ms for LARP vs. 0.19 ms for MuJoCo and 1.8 ms for Bullet). When running multiple simulations in parallel we improve over the other simulators by an order of magnitude, e.g. 1.3 ms for 4,096 parallel simulations compared to about 20 ms for MJX and MuJoCo.
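The measurement protocol can be reproduced with a small harness along the lines of the sketch below (our scaffold, not the script used for fig. 1): the per-step function is vmapped over parallel simulations, jit-compiled once, and timed with block_until_ready so that compilation is excluded.

```python
# Sketch of a timing harness for batched simulator steps; `step_fn` maps a
# single-simulation state pytree to the next state.
import time
import jax

def time_parallel_steps(step_fn, init_states, num_steps=100):
    """init_states: pytree whose leaves have a leading batch dimension
    equal to the number of parallel simulations."""
    batched_step = jax.jit(jax.vmap(step_fn))
    states = batched_step(init_states)                                   # warm-up / compile
    jax.tree_util.tree_map(lambda a: a.block_until_ready(), states)
    start = time.perf_counter()
    for _ in range(num_steps):
        states = batched_step(states)
    jax.tree_util.tree_map(lambda a: a.block_until_ready(), states)
    return (time.perf_counter() - start) / num_steps                     # seconds per step
```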
Human-object collision handling. We include two types of experiments to
evaluate accuracy of human object collisions. In the first experiment a ball or a
capsule is added in a random position and orientation in front of the humanoid.
We then run the simulation in the Bullet engine [8] and in LARP and measure the difference between the trajectories of the ball and the humanoid. We show an example of simulation results in fig. 6 (middle) and present a quantitative evaluation in tab. 5b. Overall we observe that LARP fairly accurately approximates the output
of a Bullet simulation. For example, the difference in the ball position after
60 simulation steps is about 6cm. on average. In the second experiment we
demonstrate that we can use LARP to control a humanoid in order to kick
the ball to a particular target. The control parameters are inferred via model
predictive control by minimizing residuals of: (1) the distance between the ball
trajectory and the target, and (2) a term that keeps the resulting articulated
human motion close to the reference kicking motion. We obtain an average error
between the ball and the target of 5.7cm. The qualitative results are shown in
fig. 6(right).
5 Conclusion
We have presented a methodology (LARP) to train neural networks that can sim-
ulate the complex physical motion of an articulated human body. LARP supports
features found in classical rigid body dynamics simulators, such as joint motors,
variable dimensions of the body component volumes, and contact between body
parts or objects. Our experiments demonstrate that neural physics predictions produce results comparable in accuracy to traditional simulation, while being considerably simpler architecturally and much more efficient computationally. Our neural modeling replaces the complex, computationally expensive
operations in traditional physics simulators with efficient forward state propa-
gation in recurrent neural networks. We discuss and illustrate the capability of
LARP in challenging scenarios involving reconstruction of human motion from
video and collisions of articulated bodies, with promising results.
References
1. Allen, K.R., Guevara, T.L., Rubanova, Y., Stachenfeld, K., Sanchez-Gonzalez, A.,
Battaglia, P., Pfaff, T.: Graph network simulators can learn discontinuous, rigid
contact dynamics. In: Liu, K., Kulic, D., Ichnowski, J. (eds.) Proceedings of The
6th Conference on Robot Learning. Proceedings of Machine Learning Research,
vol. 205, pp. 1157–1167. PMLR (14–18 Dec 2023), https://proceedings.mlr.
press/v205/allen23a.html 1,2
2. Allen, K.R., Guevara, T.L., Rubanova, Y., Stachenfeld, K., Sanchez-Gonzalez, A.,
Battaglia, P., Pfaff, T.: Graph network simulators can learn discontinuous, rigid
contact dynamics. In: Liu, K., Kulic, D., Ichnowski, J. (eds.) Proceedings of The
6th Conference on Robot Learning. Proceedings of Machine Learning Research,
vol. 205, pp. 1157–1167. PMLR (14–18 Dec 2023) 3
3. Allen, K.R., Guevara, T.L., Rubanova, Y., Stachenfeld, K., Sanchez-Gonzalez, A.,
Battaglia, P., Pfaff, T.: Graph network simulators can learn discontinuous, rigid
contact dynamics. In: Conference on Robot Learning. pp. 1157–1167. PMLR (2023)
6
4. Allen, K.R., Rubanova, Y., Lopez-Guevara, T., Whitney, W.F., Sanchez-Gonzalez,
A., Battaglia, P., Pfaff, T.: Learning rigid dynamics with face interaction graph
networks. International Conference on Learning Representations (ICLR) (2023) 3
5. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M.J., Leary, C., Maclaurin, D.,
Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., Zhang, Q.: JAX:
composable transformations of Python+NumPy programs (2018), http://github.
com/google/jax 6
6. Cai, X., Coevoet, E., Jacobson, A., Kry, P.: Active learning neural c-space signed
distance fields for reduced deformable self-collision. In: Graphics Interface 2022
(2022) 3
7. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network
learning by exponential linear units (elus). In: Bengio, Y., LeCun, Y. (eds.) ICLR
(2016) 5
8. Coumans, E.: Bullet physics simulation. In: ACM SIGGRAPH 2015 Courses, p. 1
(2015) 1,2,8,11,14
9. Coumans, E., Bai, Y.: Pybullet, a python module for physics simulation for games,
robotics and machine learning (2016) 4,5,13
10. Featherstone, R.: Rigid Body Dynamics Algorithms. Springer-Verlag, Berlin, Hei-
delberg (2007) 1,2
11. Freeman, C.D., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., Bachem, O.: Brax–a
differentiable physics engine for large scale rigid body simulation. arXiv preprint
arXiv:2106.13281 (2021) 2,5,6,13
12. Fussell, L., Bergamin, K., Holden, D.: Supertrack: Motion tracking for physically
simulated characters using supervised learning. ACM Transactions on Graphics
(TOG) 40(6), 1–13 (2021) 1,4,9,10,13
13. Gärtner, E., Andriluka, M., Xu, H., Sminchisescu, C.: Trajectory optimization for
physics-based reconstruction of 3d human pose from monocular video. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 13106–13115 (2022) 1,2,8,10,11,12,13
14. Ghorbani, S., Wloka, C., Etemad, A., Brubaker, M.A., Troje, N.F.: Probabilistic
character motion synthesis using a hierarchical deep latent variable model. Com-
puter Graphics Forum 39 (2020) 4
15. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016), http:
//www.deeplearningbook.org 10
16. Grzeszczuk, R., Terzopoulos, D., Hinton, G.: Neuroanimator: Fast neural network
emulation and control of physics-based models. In: Proceedings of the 25th annual
conference on Computer graphics and interactive techniques. pp. 9–20 (1998) 3
17. Guo, M., Jiang, Y., Spielberg, A.E., Wu, J., Liu, K.: Benchmarking rigid body
contact models. In: Matni, N., Morari, M., Pappas, G.J. (eds.) Proceedings of
The 5th Annual Learning for Dynamics and Control Conference. Proceedings of
Machine Learning Research, vol. 211, pp. 1480–1492. PMLR (15–16 Jun 2023) 3
18. Gärtner, E., Andriluka, M., Coumans, E., Sminchisescu, C.: Differentiable dy-
namics for articulated 3d human motion reconstruction. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) 11,
12,13
19. Habibie, I., Holden, D., Schwarz, J., Yearsley, J., Komura, T.: A recurrent varia-
tional autoencoder for human motion synthesis. In: Proceedings of the 28th British
Machine Vision Conference (2017) 4
20. Henter, G.E., Alexanderson, S., Beskow, J.: Moglow: Probabilistic and controllable
motion synthesis using normalising flows. ACM Transactions on Graphics (TOG)
39(6), 1–14 (2020) 4
21. Howell, T., Le Cleac’h, S., Bruedigam, J., Kolter, Z., Schwager, M., Manchester,
Z.: Dojo: A differentiable simulator for robotics. arXiv preprint arXiv:2203.00806
(2022), https://arxiv.org/abs/2203.00806 5
22. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale
datasets and predictive methods for 3d human sensing in natural environments.
IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7), 1325–1339
(jul 2014) 2,11
23. Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct
3d human pose and shape via model-fitting in the loop. In: Proceedings of the
IEEE International Conference on Computer Vision (ICCV) (2019) 2
24. Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet,
F., Ravuri, S., Ewalds, T., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland,
G., Vinyals, O., Stott, J., Pritzel, A., Mohamed, S., Battaglia, P.: Learning skillful
medium-range global weather forecasting. Science 0(0), eadi2336 (2023). https:
//doi.org/10.1126/science.adi2336,https://www.science.org/doi/abs/10.
1126/science.adi2336 3
25. Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Learn to dance with aist++: Music
conditioned 3d dance generation (2021) 11
26. Li, Y., Wu, J., Tedrake, R., Tenenbaum, J.B., Torralba, A.: Learning particle
dynamics for manipulating rigid bodies, deformable objects, and fluids. In: Inter-
national Conference on Learning Representations (ICLR) (2019) 3
27. Ling, H.Y., Zinno, F., Cheng, G., van de Panne, M.: Character controllers using
motion vaes. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH)
39 (2020) 4
28. Macklin, M., Müller, M., Chentanez, N.: Xpbd: position-based simulation of com-
pliant constrained dynamics. In: Proceedings of the 9th International Conference
on Motion in Games. pp. 49–54 (2016) 14
29. MuJoCo team: Mujoco xla (2023), https://mujoco.readthedocs.io/en/stable/
mjx.html 2,14
30. Mrowca, D., Zhuang, C., Wang, E., Haber, N., Fei-Fei, L.F., Tenenbaum, J.,
Yamins, D.L.: Flexible neural representation for physics prediction. Advances in
neural information processing systems 31 (2018) 1,2
31. Pfaff, T., Fortunato, M., Sanchez-Gonzalez, A., Battaglia, P.: Learning mesh-based
simulation with graph networks. In: International Conference on Learning Repre-
sentations (ICLR) (2021), outstanding Paper 3
32. Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: Hu-
mor: 3d human motion model for robust pose estimation. In: Proceedings of the
International Conference on Computer Vision (ICCV) (2021) 4
33. Sanchez-Gonzalez, A., Heess, N., Springenberg, J.T., Merel, J., Riedmiller, M.,
Hadsell, R., Battaglia, P.: Graph networks as learnable physics engines for infer-
ence and control. In: Proceedings of the 35th International Conference on Machine
Learning (ICML) (2018) 3
34. Shimada, S., Golyanik, V., Xu, W., Theobalt, C.: Physcap: Physically plausible
monocular 3d motion capture in real time. ACM Transactions on Graphics 39(6)
(dec 2020) 11
35. Song, J., Chen, X., Hilliges, O.: Human body model fitting by learned gradient de-
scent. In: European Conference on Computer Vision. pp. 744–760. Springer (2020)
2
36. Sukhija, B., Kohler, N., Zamora, M., Zimmermann, S., Curi, S., Krause, A., Coros,
S.: Gradient-based trajectory optimization with learned dynamics. arXiv preprint
arXiv:2204.04558 (2022) 2,3,10
37. Todorov, E., Erez, T., Tassa, Y.: Mujoco: A physics engine for model-based control.
In: 2012 IEEE/RSJ international conference on intelligent robots and systems. pp.
5026–5033. IEEE (2012) 1,2,5,13
38. Tsuchida, S., Fukayama, S., Hamasaki, M., Goto, M.: Aist dance video database:
Multi-genre, multi-dancer, and multi-camera database for dance information pro-
cessing. In: Proceedings of the 20th International Society for Music Information
Retrieval Conference, ISMIR 2019. pp. 501–510. Delft, Netherlands (Nov 2019) 2,
11
39. Xu, H., Bazavan, E.G., Zanfir, A., Freeman, W.T., Sukthankar, R., Sminchisescu,
C.: GHUM & GHUML: Generative 3d human shape and articulated pose models.
pp. 6184–6193 (2020) 12