A Neural Network Approach to Missing Marker Reconstruction
in Human Motion Capture
Taras Kucherenko
Robotics, Perception and Learning
KTH Royal Institute of Technology
Sweden
tarask@kth.se
Jonas Beskow
Speech, Music and Hearing
KTH Royal Institute of Technology
Sweden
beskow@kth.se
Hedvig Kjellström
Robotics, Perception and Learning
KTH Royal Institute of Technology
Sweden
hedvig@kth.se
ABSTRACT
Optical motion capture systems have become a widely used technology in various fields, such as augmented reality, robotics, movie production, etc. Such systems use a large number of cameras to triangulate the position of optical markers. The marker positions are estimated with high accuracy. However, especially when tracking articulated bodies, a fraction of the markers in each timestep is missing from the reconstruction.
In this paper, we propose to use a neural network approach to learn how human motion is temporally and spatially correlated, and to reconstruct missing marker positions through this model. We experiment with two different models, one LSTM-based and one time-window-based. Both methods produce state-of-the-art results, while working online, as opposed to most of the alternative methods, which require the complete sequence to be known. The implementation is publicly available at https://github.com/Svito-zar/NN-for-Missing-Marker-Reconstruction.
CCS CONCEPTS
• Computing methodologies → Machine learning; Motion processing; Neural networks; Motion capture;
KEYWORDS
Motion capture, Missing markers, Neural Networks, Deep Learning
1 INTRODUCTION
Often a digital representation of human motion is needed. This representation is useful in a wide range of scenarios: mapping an actor's performance to a virtual avatar (in movie production or in the game industry); predicting or classifying a motion (in robotics); trying on clothes in a digital mirror; etc.
A common way to obtain this digital representation is a marker-based optical motion capture (mocap) system. Such systems use a large number of cameras to triangulate the positions of optical markers. These are then used to reconstruct the motion of the objects to which the markers are attached.
All motion capture systems suer to a higher or lower degree
from missing marker detections, due to occlusion problems (less
than two cameras see the marker) or marker detection failures.
In this paper, we propose a method for reconstruction of missing
markers to create a more complete pose estimate (see Figure 1). The
method exploits knowledge about spatial and temporal correlation
in human motion, learned from data examples to remove position
noise and ll in missing parts of the pose estimate (Section 3).
[Figure 1 diagram: training data feeds a NN, which maps a noisy estimation of the observed motion to the reconstructed motion, compared against the ground truth motion.]
Figure 1: Illustration of our method for missing marker reconstruction. Due to errors in the capturing process, some markers are not captured. The proposed method exploits spatial and temporal correlations to reconstruct the pose of the missing markers.
A number of methods have been proposed within the Graphics community to address the problem of missing marker reconstruction. The traditional approach [3, 15] is interpolation within the current sequence. Wang et al. proposed a method which exploits motion examples to learn typical correlations [21]. The novelty of our method with respect to theirs is that while they learn linear dependencies, we employ a Neural Network (NN) methodology which enables modeling of more complicated spatial and temporal correlations in sequences of human poses.
In Section 6 we demonstrate the effectiveness of our network, showing that our method outperforms the state of the art in missing marker reconstruction in various conditions.
Finally, we discuss our results in Section 7.
2 RELATED WORK
The task of modeling human motion from mocap data has been
studied quite extensively in the past. We here give a short review
of the works most related to ours.
2.1 Missing Marker Reconstruction
It is possible to do 3D pose estimation even with affordable sensors such as Kinect. However, all motion capture systems suffer to some degree from missing data. This has created a need for methods for missing marker reconstruction.
The missing marker problem has traditionally been formulated as a matrix completion task. Peng et al. [15] solve it by non-negative matrix factorization, using the hierarchy of the body to break the motion into blocks. Wang et al. [21] follow the idea of decomposing the motion and do dictionary learning for each body part. They train their system separately for each type of motion. Burke and Lasenby [3] apply PCA first and do Kalman smoothing afterward, in the lower-dimensional space. Gløersen and Federolf [7] used weighted PCA to reconstruct markers. Taylor et al. [18] applied a Conditional Restricted Boltzmann Machine based on binary variables. Both Taylor [18] and Gløersen [7] were limited to cyclic motions, such as walking and running. All those methods are based on linear algebra. They make strong assumptions about the data: each marker is often assumed to be present at least at one time-step in the sequence. Moreover, due to the linear models, they often struggle to reconstruct irregular and complex motion.
The limitations discussed above motivate the neural network approach to the missing marker problem. Mall et al. [14] successfully applied a deep neural network based on a bidirectional LSTM to denoise human motion and recover missing markers. Our approach is similar to theirs, but we use a simpler network, which requires less data and fewer computational resources. We also experiment with two different ways to handle the sequential character of the problem, while they choose just one approach.
2.2 Denoising
Another related task which can be tackled with our networks is removing additive noise from the marker data. Recently, Holden [10] used a similar approach to this problem. He employed a neural network that took noisy markers as input and returned a clean body position as output. The main difference is that our network is also capable of reconstructing missing values and takes sequential information into account.
2.3 Prediction
A highly related problem is to predict human motion some time
into the future.
State-of-the-art methods try to ensure continuity either by using Recurrent Neural Networks (RNNs) [6, 12] or by feeding many time-frames at the same time [4]. While our focus is not on prediction, our network architectures are inspired by those methods. Since our application is not prediction, our architecture is slightly different.
Another related paper is the work of Bütepage et al. [4], who use a sliding window and a Fully Connected Neural Network (FCNN) to do motion prediction and classification. Again, since our problem is different, we modify their network, using a much shorter window length, fewer layers, and no bottleneck.
3 METHOD OVERVIEW
In the following section, we give a mathematical problem formula-
tion and an overview of the proposed approach.
3.1 Missing Markers
Missing markers in real life correspond to the failure of a sensor in
the motion capture system.
In our experiments, we use mocap data without missing markers. Missing markers are emulated by nullifying some marker positions in each frame. This process can be mathematically formulated as a multiplication of the mocap frame x_t by a binary matrix M_t:

\hat{x}_t = C(x_t) = M_t x_t, \quad (1)

where M_t \in \{0, 1\}^{3n \times 3n} is such that all 3 coordinates of any marker are either missing or present at the same time.
Every marker is missing over a few time-frames. The percentage of missing values is usually referred to as the missing rate.
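To make Eq. (1) concrete, here is a minimal NumPy sketch of how missing markers can be emulated; realizing M_t as a per-coordinate binary mask (equivalent to a diagonal 3n × 3n matrix) and all names are our own illustration.

```python
import numpy as np

def corrupt(x_t, missing_rate, rng):
    """x_t: pose vector of shape (3n,); returns the masked copy from Eq. (1)."""
    n = x_t.shape[0] // 3
    missing = rng.random(n) < missing_rate           # choose markers to drop
    mask = np.repeat(~missing, 3).astype(x_t.dtype)  # all 3 coordinates drop together
    return mask * x_t                                # M_t x_t with diagonal M_t

rng = np.random.default_rng(0)
pose = rng.normal(size=3 * 41)                       # n = 41 markers in the CMU data
corrupted = corrupt(pose, missing_rate=0.2, rng=rng)
```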
3.2 Missing Marker Reconstruction as
Function Approximation
Missing marker reconstruction is defined in the following way: given a human motion sequence x̂ corrupted by missing markers, the goal is to reconstruct the true pose x_t for every frame t.
We approach missing marker reconstruction as a function approximation problem: the goal is to learn a reconstruction function R that approximates the inverse of the corruption function C in Eq. (1). This function would map the sequence of corrupted poses to an approximation of the true poses:

x = R(\hat{x}) \approx C^{-1}(\hat{x}) \quad (2)
The mapping C is under-determined, so it is not invertible. However, it can be approximated by learning spatial and temporal correlations in human motion in general, from a set of other pose sequences.
We propose to use a Neural Network (NN) approach to learn R, as NNs are well known to be a powerful tool for function approximation [11]. We employ two different types of neural network models, which are described in the following sections. Both of them use the principle of the Denoising Autoencoder [19]: during training, Gaussian additive noise is injected into the input:

\hat{x}_t = \hat{C}(x_t) = M_t (x_t + \mathcal{N}(0, \sigma(X)\,\alpha)), \quad (3)

where \sigma(X) is the standard deviation of the training dataset and \alpha is a coefficient of proportionality, which we call the noise parameter. It was experimentally set to the value of 0.3.
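A sketch of the training-time corruption Ĉ of Eq. (3); it assumes σ(X) has been computed once over the training set (as a scalar or per dimension), and the function name is illustrative:

```python
import numpy as np

def corrupt_for_training(x_t, mask, sigma_X, alpha=0.3, rng=None):
    """Eq. (3): add scaled Gaussian noise, then apply the missing-marker mask."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.normal(0.0, sigma_X * alpha, size=x_t.shape)
    return mask * (x_t + noise)   # at test time the noise term is omitted
```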
Denoising is commonly used to regularize encoder-decoder NNs. Experiments proved it to be beneficial in our application as well. The network learns to remove noise at the same time as it reconstructs missing values. During testing, no noise is injected. Our two methods are compared to each other and to the state of the art in missing marker reconstruction in Section 6.
4 NEURAL NETWORK ARCHITECTURES
In this section, the two versions of the method are explained.
[Figure 2 diagram: (a) stacked LSTM layers map the pose at each time t to its reconstruction; (b) fully connected (FC) layers map a window of poses from t−Δt to t to the reconstructed window.]
Figure 2: Illustration of the two architecture types. (a) LSTM-based architecture (Section 4.1). (b) Window-based architecture (Section 4.2).
4.1 LSTM-Based Neural Network Architecture
Long Short-Term Memory (LSTM) [9] is a special type of Recurrent Neural Network (RNN). It was designed as a solution to the vanishing gradient problem [8] and has become a default choice for many problems that involve sequence-to-sequence mapping [5, 16, 17].
Our network is based on the LSTM and is illustrated in Figure 2a. The input layer receives a corrupted pose x̂_t, and the output LSTM layer produces the corresponding true pose x_t.
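The following Keras sketch instantiates this variant with the width, depth, and learning rate from Table 1; the exact layer wiring, the mean-squared-error loss, and the omission of the dropout of 0.9 listed in Table 1 are our assumptions, not the authors' exact implementation.

```python
import tensorflow as tf

POSE_DIM = 123  # 3 coordinates x 41 markers (Section 5.2)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, POSE_DIM)),          # sequence of corrupted poses
    tf.keras.layers.LSTM(1024, return_sequences=True),
    tf.keras.layers.LSTM(1024, return_sequences=True),
    tf.keras.layers.Dense(POSE_DIM),                 # reconstructed pose per frame
])
model.compile(optimizer=tf.keras.optimizers.Adam(2e-4), loss="mse")
```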
4.2 Window-Based Neural Network Architecture
An alternative approach is to use a range of previous time-steps explicitly, and to train a regular Fully Connected Neural Network (FCNN) with the current pose along with a short history, i.e., a window of poses over the time interval (t−Δt):t.
This network is illustrated in Figure 2b. The input layer is a window of concatenated corrupted poses [x̂_{t−Δt}^T, ..., x̂_t^T]^T. The output layer is the corresponding window of true poses [x_{t−Δt}^T, ..., x_t^T]^T. In between, there are a few hidden fully connected layers.
This structure is inspired by the sliding time-window-based method of Bütepage et al. [4], but is adapted to pose reconstruction. For example, there is no bottleneck middle layer and there are fewer layers in general, to create a tighter coupling between the corrupted and real pose, rather than learning a high-level and holistic mapping of a pose. We also use window length T = 10, instead of 100, based on the performance on the validation dataset.
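A corresponding Keras sketch of the window-based variant, using T = 10 and the width and depth from Table 1; the ReLU activation and the loss are our assumptions.

```python
import tensorflow as tf

T, POSE_DIM = 10, 123  # window length and pose dimensionality

model = tf.keras.Sequential([
    tf.keras.Input(shape=(T * POSE_DIM,)),           # concatenated corrupted poses
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(T * POSE_DIM),             # corresponding window of true poses
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
```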
5 DATASET
We evaluate our method on the popular benchmark CMU Mocap dataset [1]. This database contains 2235 mocap sequences of 144 different subjects. We use the recordings of 25 subjects, sampled at a rate of 120 Hz, covering a wide range of activities, such as boxing, dancing, acrobatics and running.
Figure 3: Marker placement in the CMU Mocap dataset (mocap.cs.cmu.edu).
5.1 Preprocessing
We start preprocessing by transforming every mocap sequence into the hips-center coordinate system. First, the joint angles from the BVH file are transformed into the 3D coordinates of the joints. The 3D coordinates are then translated to the center of the hips by subtracting the hip coordinates from each marker. We then normalize the data into the range [-1, 1] by subtracting the mean pose over the whole dataset and dividing all values by the absolute maximal value in the dataset.
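A sketch of this preprocessing, assuming poses are given as (frames, n, 3) arrays and that the hip is marker index 0 (an illustrative choice); in practice the mean pose and scale would be computed once over the whole training set and reused:

```python
import numpy as np

def preprocess(poses, hip_idx=0, mean_pose=None, scale=None):
    """poses: (frames, n, 3) marker positions derived from the BVH file."""
    centered = poses - poses[:, hip_idx:hip_idx + 1, :]  # hip-center each frame
    flat = centered.reshape(len(poses), -1)              # (frames, 3n)
    if mean_pose is None:                                # fit stats on training data
        mean_pose = flat.mean(axis=0)
        scale = np.abs(flat - mean_pose).max()
    return (flat - mean_pose) / scale, mean_pose, scale
```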
5.2 Data Explanation
The CMU dataset contains the 3D positions of a set of markers, which were recorded by the mocap system at CMU. An example of the marker placement during capture can be seen in Figure 3. All details can be found in the dataset description [1].
The human pose at each time-frame t is represented as a vector of the marker 3D coordinates: x_t = [x_{i,t}, y_{i,t}, z_{i,t}]_{i=1:n}, where n denotes the number of markers used during the mocap collection. In the CMU data, n = 41, and the dimensionality of a pose is 3n = 123. A sequence of poses is denoted x = [x_t]_{t=1:T}.
5.3 Training, Validation, and Test Data Configurations
The validation dataset contains 2 sequences from each of the following motions: pantomime (subjects 32 and 54), sports (subjects 86 and 127), jumping (subject 118), and general motions (subject 143). The test dataset contains basketball (102_03), boxing (14_01), and jump-turn (85_02) sequences. The training dataset contains all the sequences not used for validation or testing, from 25 different folders in the CMU Mocap Dataset, such as 6, 14, 32, 40, 141, and 143, which include the testing motion types as well.
Table 1: Hyperparameters for our NNs. α is the initial learning rate, t is the sequence length.

NN-type   Width   Depth   Dropout   α        t
LSTM      1024    2       0.9       0.0002   64
Window    512     2       0.9       0.0001   20
Subjects from the training dataset were also present in the test and validation datasets. Generalization to novel subjects and motion types was tested experimentally.
6 EXPERIMENTS
We use the commonly used [3, 15, 21] Root Mean Squared Error (RMSE) over the missing markers to measure the reconstruction error.
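A sketch of this metric, computing the RMSE over the missing coordinates only; the mask convention (1 = observed, 0 = missing) is ours:

```python
import numpy as np

def rmse_missing(x_true, x_pred, mask):
    """RMSE over missing coordinates only."""
    missing = mask == 0
    err = x_true[missing] - x_pred[missing]
    return np.sqrt(np.mean(err ** 2))
```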
Implementation Details: All methods were implemented using TensorFlow [2]. The code is publicly available at https://github.com/Svito-zar/NN-for-Missing-Marker-Reconstruction.
For training purposes, we extract short sequences from the dataset with a sliding window, then shuffle them and feed them to the network. The training was done using the Adam optimizer [13] with a batch size of 32.
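A sketch of this training data pipeline; the 50% window overlap is our assumption, while the sequence lengths (64 for the LSTM, 20 for the window-based model) and the batch size of 32 come from Table 1 and the text:

```python
import numpy as np

def make_batches(sequences, seq_len, batch_size, rng):
    """Cut each recording into chunks with a sliding window, shuffle, and batch."""
    stride = max(1, seq_len // 2)                    # overlap is our choice
    chunks = [s[i:i + seq_len] for s in sequences
              for i in range(0, len(s) - seq_len + 1, stride)]
    rng.shuffle(chunks)
    chunks = np.asarray(chunks)                      # (num_chunks, seq_len, 3n)
    for i in range(0, len(chunks) - batch_size + 1, batch_size):
        yield chunks[i:i + batch_size]
```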
The hyperparameters for both architectures were optimized with respect to the validation dataset (Section 5.3) using grid search. Table 1 contains the main hyperparameters.
6.1 Comparison to the State of the Art
First, the models presented above are evaluated in the same setting as most of the other random missing marker reconstruction methods. A specific amount of random markers (10%, 20%, or 30%) is removed over a few time-frames, and each method is applied to recover them. The length of the gap was sampled from a Gaussian distribution with mean 10 and standard deviation 5, following the state-of-the-art settings [21]. The reconstruction error is measured.
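A sketch of one way to realize this protocol: a fixed fraction of markers is dropped, each over a gap drawn from a Gaussian with mean 10 and standard deviation 5 frames, following [21]; clipping gaps to at least one frame and the uniform choice of gap start are our assumptions.

```python
import numpy as np

def make_gap_mask(frames, n_markers, missing_frac, rng):
    """Mask of shape (frames, 3n): 1 = observed, 0 = missing."""
    mask = np.ones((frames, 3 * n_markers))
    n_drop = int(missing_frac * n_markers)
    for m in rng.choice(n_markers, n_drop, replace=False):
        gap = max(1, int(round(rng.normal(10.0, 5.0))))  # N(10, 5) frames
        start = int(rng.integers(0, max(1, frames - gap)))
        mask[start:start + gap, 3 * m:3 * m + 3] = 0.0   # drop all 3 coordinates
    return mask
```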
There is randomness in the system: in the initialization of the network weights and in the choice of missing markers. Therefore, every experiment is repeated 3 times, and the error mean and standard deviation are measured.
Table 2 provides a comparison of the performance of our system with 3 state-of-the-art papers and with the simplest solution (linear interpolation) as a baseline, on 3 action classes from the CMU Mocap dataset. The experiments from [3] were repeated by us, using the same hyperparameters as in their original paper. The results of the Wang method [20] were taken from the diagram in their paper. Last, the error measures of the Peng method [15] were rescaled, since their original paper averages the error over all the markers, whereas we average only over the missing markers.
Table 2 shows that standard interpolation outperforms all the state-of-the-art methods, including ours. A probable reason is that the duration of the gap is short (less than 0.1 s), so it is easy to interpolate between existing frames. We will, therefore, study a more challenging scenario, in which markers are missing over longer periods and more markers are missing. In the following experiments we compare only against the Burke method, because it is the only one with a publicly available implementation.
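For reference, a minimal sketch of the linear-interpolation baseline: each missing coordinate is filled in linearly between its nearest observed neighbours, and it can consume the masks produced above (np.interp clamps to the boundary values at the sequence ends).

```python
import numpy as np

def interpolate_gaps(x, mask):
    """x: (frames, dims); mask: 1 = observed, 0 = missing coordinate."""
    out = x.copy()
    t = np.arange(len(x))
    for d in range(x.shape[1]):
        obs = mask[:, d] > 0
        if obs.any() and not obs.all():
            out[~obs, d] = np.interp(t[~obs], t[obs], x[obs, d])
    return out
```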
Table 2: Comparison to the state of the art in missing marker reconstruction. RMSE in marker position is in cm. The training set comprises all activities. The numbers from [20] were extracted from a diagram.

(a) 10% of the markers in each input frame are missing.

Method           Basketball    Boxing        Jump turn
Interpolation    0.64±0.03     1.06±0.12     1.74±0.3
Wang [20]        0.4           0.5           n.a.
Peng [15]        n.a.          n.a.          n.a.
Burke [3]        4.56±0.17     3.47±0.19     15.97±1.34
Window (ours)    2.34±0.27     2.61±0.21     4.4±0.5
LSTM (ours)      1.21±0.02     1.44±0.02     2.52±0.3

(b) 20% of the markers in each input frame are missing.

Method           Basketball    Boxing        Jump turn
Interpolation    0.67±0.04     1.09±0.07     1.91±0.31
Wang [20]        1.6           1.5           n.a.
Peng [15]        n.a.          4.94          5.12
Burke [3]        4.18±0.48     3.98±0.07     27.1±1.21
Window (ours)    2.42±0.32     2.77±0.13     4.3±0.75
LSTM (ours)      1.34±0.01     1.58±0.04     2.67±0.2

(c) 30% of the markers in each input frame are missing.

Method           Basketball    Boxing        Jump turn
Interpolation    0.7±0.1       1.21±0.14     2.29±0.3
Wang [20]        0.9           0.9           n.a.
Peng [15]        n.a.          4.36          4.9
Burke [3]        4.23±0.57     4.01±0.26     34.9±2.55
Window (ours)    2.33±0.13     2.63±0.08     4.53±0.48
LSTM (ours)      1.48±0.03     1.75±0.07     3.1±0.25
Figure 4: Dependency on the duration of the gap. Basketball
motion. 5 missing markers.
6.2 Gap duration analysis
In the next experiments, we varied the length of the gap and kept the number of missing markers fixed at 5. As before, we averaged the performance over 3 experiments.
Table 3: Generalization test for the LSTM network. 20% of markers missing. The complete dataset contains motions from all the subjects and of all types. Then, all motions with the same subject as in testing were removed. Finally, all the recordings with the same type of motion were removed. The reconstruction error is measured in cm.

Motion / dataset    Basketball     Boxing
Complete            7.9 ±0.14      2.08 ±0.5
w/o the subject     9.93 ±0.96     2.13 ±0.42
w/o the motion      8.54 ±1.04     2.57 ±1.18
Table 4: Generalization test for the Window-based network.
20% of markers missing. The same setup as in Table 3.
Motion / dataset Basketball Boxing
Complete 5.59 ±0.29 3.54 ±0.15
w/o the subject 5.68 ±0.48 4.37 ±1.13
w/o the motion 6.52 ±0.54 4.58 ±1.51
Figure 4 shows that our methods can be applied for any gap length, while the performance of the other methods degrades steadily as the gap grows longer. Interpolation-based methods struggle to reconstruct markers when gaps become long. Our method, in contrast, can propagate information about the marker positions through its hidden state, and is hence robust to long gaps.
6.3 Very long gaps
In the following experiment, the same markers were missing over a
long period of time. The measurement period started at 1.5 seconds
into the clip to avoid artifacts. Then for 1 second, all markers were
present, followed by a 5 second window where certain markers
were missing for the entire time. Afterwards, all the markers were
present again. Each experiment was repeated 5 times and the mean
result was registered.
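A sketch of this long-gap protocol at the 120 Hz frame rate of the data; the helper and its defaults mirror the timings above, while the choice of which markers to drop is left to the caller:

```python
import numpy as np

FPS = 120  # CMU recordings are sampled at 120 Hz (Section 5)

def long_gap_mask(frames, n_markers, dropped, start_s=1.5, lead_s=1.0, gap_s=5.0):
    """Fully observed, then the `dropped` markers are missing for gap_s seconds."""
    mask = np.ones((frames, 3 * n_markers))
    gap_start = int((start_s + lead_s) * FPS)
    gap_end = gap_start + int(gap_s * FPS)
    for m in dropped:                                # e.g. 3 or 30 marker indices
        mask[gap_start:gap_end, 3 * m:3 * m + 3] = 0.0
    return mask
```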
We can clearly see in Figure 5 that while interpolation and Burke [3] quickly lose track of the markers, our methods stay stable and accurate. This holds for all the scenarios. Figure 5(b,d) illustrates that all methods except interpolation degrade significantly when most of the markers are missing. This indicates that those methods use information about the other markers, not only about the past or the future of a particular marker.
6.4 Visualization of the Results
Figure 6 illustrates the reconstruction results for one of the test sequences, boxing. The subject is boxing with their right arm. The observed marker cloud (Figure 6b) misses 15 markers. Our reconstruction result (Figure 6d) is visually close to the ground truth (Figure 6a), which is also supported by the numerical errors.

(a) Ground truth markers. (b) 15 (out of 41) markers are missing. (c) Burke [3] reconstruction result. (d) LSTM (ours) reconstruction result.
Figure 6: Three keyframes from the boxing test sequence; illustration of the reconstruction using the Burke [3] and LSTM (ours) methods.
6.5 Generalization
Up to now, our models, as well as the baselines, have been trained on all motions and all individuals.
In the final experiments, we evaluated the generalization capability with respect to motion type and individual. To this end, we removed all the recordings of the test subject from the training data.
(a) Basketball: 3 markers missing. (b) Basketball: 30 markers missing. (c) Boxing: 3 markers missing. (d) Boxing: 30 markers missing.
Figure 5: A few markers were missing over 5 seconds. All the markers were present for 1 s before and after the gap.
Furthermore, for each test motion, we created a training set where this motion was removed. We then evaluated our networks with 20% of the markers missing, over gaps of 100 frames (almost 1 second).
Table 3 shows the results for the LSTM-based network and Table 4 for the window-based method. We can observe that the performance drop is not dramatic: it is less than 25% and depends on the motion type and the network architecture. It is important to note that the variance is significantly higher in the "generalization" scenarios, meaning that the system is less stable.
This experiment indicates that our systems can recover unseen
motions, performed by unseen individuals, albeit with slightly
worse performance.
7 DISCUSSION AND CONCLUSION
The experiments presented above show that the proposed methods can compensate for missing markers better than the state of the art when the gap is long, especially when the motion is complex.
Our method does not rely on future frames, unlike most of the alternatives. That property makes it suitable for online usage, where the markers are reconstructed as they are collected. Another notable property of the proposed method is that it can recover markers which are missing over a long period of time.
According to our experiments, the LSTM-based architecture models the correlations in the human body better than the window-based architecture, and hence recovers missing markers more accurately.
In summary, the proposed methods can be used to recover markers over many frames in an accurate and stable way.
8 ACKNOWLEDGEMENTS
The authors would like to thank Simon Alexanderson and Judith Bütepage for the useful discussions. This PhD project is supported by the Swedish Foundation for Strategic Research, Grant No. RIT15-0107. The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217.
REFERENCES
[1] 2003. Carnegie-Mellon Mocap Database. http://mocap.cs.cmu.edu/
[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[3] Michael Burke and Joan Lasenby. 2016. Estimating missing marker positions using low dimensional Kalman smoothing. Journal of Biomechanics 49, 9 (2016), 1854–1858.
[4] Judith Bütepage, Michael Black, Danica Kragic, and Hedvig Kjellström. 2017. Deep representation learning for human motion prediction and classification. In IEEE Conference on Computer Vision and Pattern Recognition.
[5] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition.
[6] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In IEEE International Conference on Computer Vision.
[7] Øyvind Gløersen and Peter Federolf. 2016. Predicting missing marker trajectories in human motion data using marker intercorrelations. PLoS ONE 11, 3 (2016), e0152616.
[8] Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6, 2 (1998), 107–116.
[9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9 (1997), 1735–1780.
[10] Daniel Holden. 2018. Robust solving of optical motion capture data by denoising. ACM Transactions on Graphics 38, 1 (2018).
[11] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.
[12] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In IEEE Conference on Computer Vision and Pattern Recognition.
[13] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[14] Utkarsh Mall, G Roshan Lal, Siddhartha Chaudhuri, and Parag Chaudhuri. 2017. A deep recurrent framework for cleaning motion capture data. arXiv preprint arXiv:1712.03380 (2017).
[15] Shu-Juan Peng, Gao-Feng He, Xin Liu, and Hua-Zhen Wang. 2015. Hierarchical block-based incomplete human mocap data recovery using adaptive nonnegative matrix factorization. Computers & Graphics 49 (2015), 10–23.
[16] Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 11 (2017), 2298–2304.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Neural Information Processing Systems.
[18] Graham W Taylor, Geoffrey E Hinton, and Sam T Roweis. 2007. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems. 1345–1352.
[19] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning.
[20] Zhao Wang, Yinfu Feng, Shuang Liu, Jun Xiao, Xiaosong Yang, and Jian J Zhang. 2016. A 3D human motion refinement method based on sparse motion bases selection. In International Conference on Computer Animation and Social Agents.
[21] Zhao Wang, Shuang Liu, Rongqiang Qian, Tao Jiang, Xiaosong Yang, and Jian J Zhang. 2016. Human motion data refinement unitizing structural sparsity and spatial-temporal information. In IEEE International Conference on Signal Processing.