Predicting People’s 3D Poses from Short Sequences
Bugra Tekin (a), Xiaolu Sun (a), Xinchao Wang (a), Vincent Lepetit (a,b), Pascal Fua (a)
(a) Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne (EPFL)
(b) Institute for Computer Graphics and Vision, Graz University of Technology
{bugra.tekin, xiaolu.sun, xinchao.wang, pascal.fua}@epfl.ch, lepetit@icg.tugraz.at
Abstract
We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Instead of computing candidate poses in individual frames and then linking them, as is often done, we regress directly from a spatio-temporal block of frames to a 3D pose in the central one. We will demonstrate that this approach allows us to effectively overcome ambiguities and to improve upon the state-of-the-art on challenging sequences.
1. Introduction
In recent years, impressive motion capture results have been demonstrated using depth cameras, but 3D body pose recovery from ordinary video sequences remains extremely challenging. Nevertheless, there is great interest in doing so, both because cameras are becoming ever cheaper and more prevalent and because there are many potential applications, including athletic training, surveillance, entertainment, and electronic publishing.
Most early approaches to monocular 3D pose tracking involved recursive frame-to-frame tracking and were found to be brittle, due to distractions and occlusions from other people or objects in the scene. Since then, the focus has shifted to "tracking by detection," which involves detecting human pose more or less independently in every frame and then linking the detections across frames [26, 2], and which is much more robust to algorithmic failures in isolated frames. In such approaches, motion information is exploited only a posteriori by the linking procedure, which essentially eliminates erroneous poses by selecting compatible candidates over consecutive frames. If there are no or few correct candidates among them, nothing can be done.
Recently, [16] proposed an effective single-frame approach by learning a regressor from a kernel embedding of 2D HoG features to 3D poses. Since then, excellent results have also been reported using a Convolutional Neural Net [22]. Even for such state-of-the-art approaches that depend on 2D appearance information, many detection errors remain, caused by the inherent ambiguities of the projection from 3D to 2D, including self-occlusion and mirroring, as depicted in Fig. 1. When such errors happen frequently for several frames in a row, enforcing temporal consistency after the fact is not enough.

(a) eχ²-HoG (b) Ours (c) eχ²-HoG (d) Ours
Figure 1. 3D pose estimation on the Human3.6m dataset. The recovered 3D skeletons are reprojected into the images in the top row and shown by themselves in the bottom one. (a,c) Results obtained using a single-frame method [16] are penalized by self-occlusions and mirror ambiguities. (b,d) By contrast, our approach can reliably recover 3D poses in such cases by collecting appearance and motion evidence from multiple frames simultaneously. Best viewed in color.
In this paper, we show that we can overcome these limitations by using appearance and motion information simultaneously and regressing directly from short sequences of frames to 3D poses in the central one. To this end, we find bounding boxes around the potential subjects, shift them so that the person inside them remains centered, compute 3D HoG features, and learn a mapping from these features to 3D poses. This prevents errors from which the above-mentioned methods cannot recover, and we will show that it yields substantial overall performance increases over state-of-the-art methods on challenging test datasets.
Spatiotemporal image cubes have been used before for action recognition [20, 39], person detection [24], and 2D pose estimation [11] but, to the best of our knowledge, not for discriminative 3D human pose estimation. The only such method we know of is that of [40], which is computationally very expensive. Besides, it is not discriminative, and its 3D accuracy has not been reported.
The contribution of this paper is therefore a novel approach to combining appearance and motion cues very early in the 3D pose estimation process. Because we regress directly from a spatio-temporal volume to a 3D pose, we have little need for complex uncertainty estimation and propagation schemes to handle the ambiguities of the human pose, and we achieve superior performance. However, as we will show, the ability to properly align successive image windows from consecutive frames is key to obtaining it.
In the remainder of the paper, we first review related work in Section 2. Then, in Section 3, we describe our method. Finally, in Section 4, we present the results obtained on challenging datasets and show that we improve upon the state-of-the-art.
2. Related Work
Approaches to estimating the 3D human pose can be classified into two main categories, depending on whether they rely on still images or image sequences. We briefly review both kinds below. In the results section, we will demonstrate that we outperform state-of-the-art representatives of each.
3D Human Pose Estimation in Single Images. Early approaches tended to rely on generative models to search the state space for a plausible configuration of the skeleton that would align with the image evidence [9, 35, 32, 38, 12]. These methods remain competitive provided that a good enough initialization can be supplied. More recent ones [6, 3] extend 2D pictorial structure approaches [10] into the 3D domain. However, in addition to their high computational cost, they tend to have difficulty localizing people's arms accurately because the corresponding appearance cues are weak and easily confused with the background [27].
By contrast, discriminative regression-based approaches [1, 34, 30, 4, 15] build a direct mapping from image evidence to 3D poses. They have been shown to be effective, especially if a large training dataset [16] is available. Within this context, rich features encoding depth [29] and body part information [15, 22] have been shown to increase the estimation accuracy.
3D Human Pose Estimation in Image Sequences. Such approaches also fall into two main classes.
The first class involves frame-to-frame tracking and dynamical models that rely on Markov dependencies between consecutive frames. Their main weakness is that they require initialization and cannot recover from tracking failures.
To address these shortcomings, the second class focuses on detecting candidate poses in individual frames and then linking them across frames in a temporally consistent manner. For example, in [2], initial pose estimates are refined using 2D tracklet-based estimates. In [28], the pairwise relationships of joints within and between frames are modeled by an ensemble of tractable submodels. [13] integrates single-frame pose recovery with K-best trajectories and model texture adaptation. In [41], dense optical flow is used to link articulated shape models in adjacent frames. Non-maximum suppression is then employed in [7] to merge pose estimates across frames. By contrast to these approaches, we capture the temporal information earlier in the process by extracting spatiotemporal features from image cubes of short sequences and regressing to 3D poses.
While spatiotemporal features have long been used for action recognition [20, 39], person detection [24], and 2D pose estimation [11], they have been underused for 3D body pose estimation. The only recent approach we are aware of is that of [40], which builds a set of point trajectories corresponding to high joint responses and matches them to motion capture data. One drawback of this approach is its very high computational cost. Also, while the 2D results look promising, no quantitative 3D results are provided in the paper and no code is available for comparison purposes.
3. Method
Our approach involves finding bounding boxes around people in consecutive frames, aligning these bounding boxes to form spatiotemporal volumes, and learning a mapping from these volumes to a 3D pose in their central frame. In the remainder of this section, we first introduce our formalism and then describe each individual step, as depicted by Fig. 2.
3.1. Formalism
In this work, we represent 3D body poses in terms of skeletons, such as those shown in Fig. 1, and the 3D locations of their D joints relative to that of a root node. Like several authors before us [4, 16], we chose this representation because it is well adapted to regression and does not require us to know a priori the exact body proportions of our subjects. It suffers from not being orientation invariant, but using temporal information provides enough evidence to overcome this difficulty.
Let I_i be the i-th image of a sequence containing a subject and let Y_i ∈ R^{3×D} be a vector that encodes the corresponding 3D joint locations. Regression-based discriminative approaches to inferring Y_i typically involve a parametric [1, 5, 17] or non-parametric [37, 23] model of the conditional distribution p(Y | X), where X_i = Ψ(I_i; m_i) is a feature vector computed over the bounding box or the foreground mask m_i of the person in I_i. The distribution parameters are usually learned from a labeled set of N training examples, T = {(X_i, Y_i)}_{i=1}^{N}. As discussed in Section 2, in such a setting, reliably estimating the 3D pose is hard due to the inherent ambiguities of 3D human pose estimation, such as self-occlusion and mirror ambiguity.

(a) Image stack (b) Motion compensation (c) Spatiotemporal block (d) Dense HOG3D (e) Regression to central 3D pose
Figure 2. Overview of our approach to 3D pose estimation. (a) A person is detected in several consecutive frames. (b) The corresponding image windows are shifted so that the subject remains centered. (c) A data volume is formed by concatenating these aligned windows. (d) A pyramid of 3D HOG features is extracted densely over the volume. (e) The 3D pose in the central frame is obtained by regression.
Instead, we model the posterior distribution conditioned on a data volume consisting of a sequence of T frames centered at image i, V_i = [I_{i−T/2+1}, I_{i−T/2+2}, ..., I_{i+T/2}], that is, p(Y | Z), where Z_i = ξ(V_i; m_{i−T/2+1}, m_{i−T/2+2}, ..., m_{i+T/2}) is a feature vector computed over the data volume. The training set becomes T = {(Z_i, Y_i)}_{i=1}^{N}, where Y_i is the pose of the central frame in the image stack. In practice, we collect every block of T consecutive frames across all training videos to obtain the data volumes V_i. We will show in the results section that this significantly improves performance and that the best results are obtained for sequences of T = 24 to 48 images, that is, 0.5 to 1 second given the 50 fps of the sequences of the Human3.6m [16] dataset.
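To make the construction of these training pairs concrete, here is a minimal sketch of the sliding-window assembly described above, assuming the frames and central-frame poses are already loaded as arrays; the helper name and the extract_features callable are hypothetical stand-ins for the descriptor of Section 3.2:

```python
import numpy as np

def build_training_pairs(frames, poses, T=24, extract_features=None):
    """Assemble (Z_i, Y_i) pairs from one video.

    frames: sequence of images; poses: array of shape (num_frames, 3 * D)
    holding the 3D joint locations of each frame.
    extract_features: callable mapping a T-frame volume to a feature
    vector Z (e.g., the multi-scale 3D HoG descriptor of Section 3.2).
    """
    Z, Y = [], []
    half = T // 2
    # Slide a window of T frames over the video; the window centered at
    # frame i spans I_{i-T/2+1}, ..., I_{i+T/2}.
    for i in range(half - 1, len(frames) - half):
        volume = np.stack(frames[i - half + 1 : i + half + 1])
        Z.append(extract_features(volume))
        Y.append(poses[i])  # the target is the pose of the central frame
    return np.array(Z), np.array(Y)
```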
3.2. Spatiotemporal Features
Our feature vector Z is based on the 3D HoG descriptor described in [39]. It is computed by first subdividing a data volume, such as the one depicted by Fig. 2(c), into equally spaced bins. For each bin, the histogram of oriented 3D spatio-temporal gradients [18] is then computed. To increase the descriptive power, we use a multi-scale approach and compute several HoG descriptors using different bin sizes. In practice, we use 3 levels in the spatial dimensions (2×2, 4×4, and 8×8) and set the temporal bin size to a small value (4 frames for 50 fps videos) to capture fine temporal details. Our final feature vector Z is obtained by concatenating these HoG descriptors into a single vector.
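The exact descriptor of [18] quantizes 3D gradient orientations with a regular polyhedron; the simplified sketch below conveys the multi-scale binning and gradient-histogram structure using a plain spatial-angle binning instead, so it illustrates the idea rather than being a faithful HOG3D implementation:

```python
import numpy as np

def hog3d_descriptor(volume, spatial_bins=(2, 4, 8), t_bin=4, n_orient=8):
    """Simplified multi-scale 3D HoG over a (T, H, W) grayscale volume.

    Real HOG3D [18] quantizes the full 3D gradient orientation with a
    regular polyhedron; here we simply bin the spatial gradient angle,
    weighted by gradient magnitude, to keep the sketch short.
    """
    gt, gy, gx = np.gradient(volume.astype(np.float32))
    mag = np.sqrt(gx**2 + gy**2 + gt**2)          # includes temporal gradient
    ang = np.arctan2(gy, gx)                      # spatial angle in [-pi, pi]
    orient = ((ang + np.pi) / (2 * np.pi) * n_orient).astype(int) % n_orient

    T, H, W = volume.shape
    feats = []
    for s in spatial_bins:                        # spatial pyramid levels
        for tb in range(0, T - t_bin + 1, t_bin): # small temporal cells
            for yb in range(s):
                for xb in range(s):
                    ys, xs = H // s, W // s
                    co = orient[tb:tb+t_bin, yb*ys:(yb+1)*ys, xb*xs:(xb+1)*xs]
                    cm = mag[tb:tb+t_bin, yb*ys:(yb+1)*ys, xb*xs:(xb+1)*xs]
                    hist = np.bincount(co.ravel(), weights=cm.ravel(),
                                       minlength=n_orient)
                    feats.append(hist / (hist.sum() + 1e-8))
    return np.concatenate(feats)  # the final feature vector Z
```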
A strength of the 3D HoG descriptor is that it simultaneously encodes appearance and motion information in a relatively straightforward way. Not only does it preserve location- and time-dependent appearance information by concatenating the blocks into a single vector, as the 2D HoG does for appearance information only, but it also encodes motion information by computing temporal gradients. An alternative to encoding motion information in this way would have been to explicitly track body parts in the spatiotemporal volume, as is done in many of the methods we discussed in Section 2. However, tracking body parts in 2D is subject to ambiguities caused by the projection from 3D to 2D, and we believe that not having to do this is a contributing factor to the good results we will show in Section 4.
3.3. Alignment
For the 3D HoG descriptors introduced above to be representative of the person's pose, the temporal bins must correspond to specific body parts, which implies that the person should remain centered from frame to frame in the bounding boxes used to build the image volume. Therefore, to go from detected bounding boxes, such as those depicted by Fig. 2(a), to a useful spatiotemporal volume, such as the one of Fig. 2(c), we need to shift the bounding boxes as shown in Fig. 2(b). In Fig. 3, we illustrate this requirement by showing heat maps of the gradients across a sequence without and with motion compensation. Without it, the gradients are dispersed across the region of interest, which reduces the feature stability.
(a) No compensation (b) Motion compensation
Figure 3. Heat maps of the gradients across all frames of the Greeting action (a) without motion compensation and (b) with motion compensation. When the motion is compensated for, body parts become covariant with the HOG3D bins across frames, and the extracted spatiotemporal features thus become more part-centric and stable.

As will be discussed in more detail in Section 4, when testing our approach on the HumanEva [31] and Human3.6m [16] datasets, we use the background subtraction masks or code that they provide to find people's locations and position a bounding box around them. We resize all the images to the same size while keeping the aspect ratio of the person in them. Concatenating these bounding boxes does not guarantee a sufficiently good alignment across frames. To remedy this, we have explored the following two competing approaches.
1. We use a state-of-the-art optical flow-based motion stabilization algorithm [24].
2. When background subtraction is possible, we have implemented the simple scheme depicted by Fig. 4, which exploits the foreground mask produced by background subtraction. We first compute the histogram of pixel occurrences along each bounding box column. The histogram is then clipped with a cut-off value. Finally, we center the bounding box on the column that corresponds to the mean of the maximum values in the clipped histogram, which are temporally smoothed with a Kalman filter. A sketch of this scheme is given below.
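As a rough illustration of the second scheme, here is a minimal sketch assuming binary foreground masks are available; the clipping ratio and the scalar Kalman filter parameters are assumed values, not the ones used in the paper:

```python
import numpy as np

def center_columns(masks, clip_ratio=0.5):
    """Estimate the person's horizontal center in each binary mask.

    masks: array of shape (T, H, W) with 1 for foreground pixels.
    The column histogram is clipped with a cut-off, and the center is
    the mean index of the columns reaching the clipped maximum.
    """
    centers = []
    for m in masks:
        hist = m.sum(axis=0).astype(float)                # pixels per column
        hist = np.minimum(hist, clip_ratio * hist.max())  # clip with cut-off
        centers.append(np.flatnonzero(hist == hist.max()).mean())
    return np.array(centers)

def kalman_smooth(z, q=1e-3, r=1.0):
    """Scalar constant-position Kalman filter to smooth centers over time."""
    x, p, out = z[0], 1.0, []
    for zi in z:
        p += q                # predict: inflate the state variance
        k = p / (p + r)       # Kalman gain
        x += k * (zi - x)     # update with the new measurement
        p *= (1 - k)
        out.append(x)
    return np.array(out)
```

Each bounding box would then be shifted horizontally so that its center matches the smoothed estimate, e.g. `kalman_smooth(center_columns(masks))`.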
In situations where background subtraction can be used, we will see that the second scheme, while simpler than the first, yields better results. However, the first one has the clear advantage that it does not depend on background subtraction and is therefore more widely applicable.
3.4. Regression
We cast 3D pose estimation as finding a mapping Z ↦ f(Z) = Y, where Z is the descriptor computed over a spatiotemporal volume and Y is the 3D pose in its central frame. Following [16], we considered both an unstructured regression method, Kernel Ridge Regression [14], and a structured one, Kernel Dependency Estimation [8], to learn f.

Figure 4. Motion compensation scheme. Histograms of foreground masks are computed for consecutive frames in a data volume and clipped with a threshold value. The mean of the maximum values in the clipped histogram is taken as the center of the person. Clipping provides robustness against background subtraction artifacts and different articulations of the body. To ensure stability across frames, the centers are temporally smoothed with a Kalman filter.
Kernel Ridge Regression (KRR) trains a model for each dimension of the pose vector separately. It first solves a regularized least-squares problem of the form

$$\operatorname*{argmin}_{\mathbf{w}} \; \sum_i \left\| Y_i - w_i\, k(\mathbf{Z}, \mathbf{Z}_i) \right\|_2^2 + \lambda \|\mathbf{w}\|_2^2 \;, \qquad (1)$$

where the (Z_i, Y_i) are training pairs, w = [w_1, ..., w_n]^⊤ are the parameters of the model, k(Z_1, Z_2) = exp(−γ χ²(Z_1, Z_2)) is the exponential-χ² kernel [21], and λ is a regularization parameter. This can be done in closed form by solving

$$\mathbf{w} = (K + \lambda I)^{-1} Y \;, \qquad (2)$$

where K_{i,j} = k(Z_i, Z_j). Pose prediction is then carried out by computing, for each Z,

$$f(\mathbf{Z}) = \sum_i w_i\, k(\mathbf{Z}, \mathbf{Z}_i) \;. \qquad (3)$$
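For concreteness, here is a compact numpy sketch of Eqs. (1)-(3), assuming non-negative histogram features such as the descriptors above; the pairwise χ² computation is memory-hungry (O(N²d)) and only meant for small examples:

```python
import numpy as np

def exp_chi2_kernel(A, B, gamma=1.0):
    """k(z1, z2) = exp(-gamma * chi2(z1, z2)) for non-negative features,
    with chi2(a, b) = sum_j (a_j - b_j)^2 / (a_j + b_j)."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2 /
         (A[:, None, :] + B[None, :, :] + 1e-10)).sum(-1)
    return np.exp(-gamma * d)

def krr_fit(Z, Y, lam=1e-3, gamma=1.0):
    """Closed-form solution of Eq. (2): w = (K + lambda I)^-1 Y,
    solved jointly for every output dimension (column of Y)."""
    K = exp_chi2_kernel(Z, Z, gamma)
    return np.linalg.solve(K + lam * np.eye(len(Z)), Y)

def krr_predict(Z_train, w, Z_test, gamma=1.0):
    """Eq. (3): f(Z) = sum_i w_i k(Z, Z_i)."""
    return exp_chi2_kernel(Z_test, Z_train, gamma) @ w
```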
Kernel Dependency Estimation (KDE) is a structured regressor that accounts for correlations in 3D pose space, unlike conventional multiple-output regression models such as KRR. As argued in [16], it is more suitable for 3D human pose estimation because the physical constraints and the regularity of human dynamics produce strong dependencies between joint positions.

To learn the regressor, the input and output vectors are first lifted into high-dimensional Hilbert spaces using kernel mappings Φ_Z and Φ_Y, respectively [8]. The dependency between the high-dimensional input and output spaces is modeled as a linear function. The corresponding matrix W is computed by standard kernel ridge regression, that is, by solving

$$\operatorname*{argmin}_{W} \; \sum_i \left\| \Phi_Y(Y_i) - W \Phi_Z(Z_i) \right\|_2^2 + \lambda \|W\|_2^2 \;, \qquad (4)$$

where λ is a regularization coefficient. To produce the final prediction Y, the difference between the prediction and the mapping of the output into the high-dimensional Hilbert space is minimized by finding

$$\operatorname*{argmin}_{Y} \; \left\| W^{\top} \Phi_Z(Z) - \Phi_Y(Y) \right\|_2^2 \;. \qquad (5)$$

Although this problem is non-linear and non-convex, it can nevertheless be solved accurately given the KRR predictors for the individual outputs to initialize the process.

In practice, we use an input kernel embedding based on 15,000-dimensional random feature maps corresponding to an exponentiated-χ² kernel, as in [21], and a 4000-dimensional output embedding corresponding to a radial basis function kernel.
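As an illustration only, the sketch below substitutes standard random Fourier features for an RBF kernel on both sides (not the exponentiated-χ² embedding of [21]) and approximates the pre-image problem of Eq. (5) by scoring a finite candidate pose set, which is a common simplification rather than the paper's initialized optimization:

```python
import numpy as np

def rff(dim_in, dim_out, sigma=1.0, rng=np.random.default_rng(0)):
    """Random Fourier feature map approximating an RBF kernel."""
    W = rng.normal(0.0, 1.0 / sigma, (dim_out // 2, dim_in))
    def phi(X):
        P = X @ W.T
        return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(dim_out // 2)
    return phi

def kde_fit(Z, Y, phi_z, phi_y, lam=1e-3):
    """Ridge solution of Eq. (4) in the explicit feature spaces."""
    Pz, Py = phi_z(Z), phi_y(Y)
    A = Pz.T @ Pz + lam * np.eye(Pz.shape[1])
    return np.linalg.solve(A, Pz.T @ Py)   # maps phi_Z(.) to phi_Y(.)

def kde_predict(W, phi_z, phi_y, Z_test, candidates):
    """Approximate Eq. (5) by picking, for each test input, the candidate
    pose whose output embedding is closest to the predicted embedding."""
    pred = phi_z(Z_test) @ W               # predicted output embeddings
    Pc = phi_y(candidates)                 # embeddings of candidate poses
    d = ((pred[:, None, :] - Pc[None, :, :]) ** 2).sum(-1)
    return candidates[d.argmin(axis=1)]
```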
4. Results
In this section, we first introduce the datasets and parameters used in our experiments. Then, we describe the baselines we compare against and discuss our results.
4.1. Datasets
We evaluate our approach on two standard benchmarks for 3D human pose estimation:

Human3.6m is a recently released large-scale motion capture dataset that comprises 3.6 million images with corresponding 3D poses and covers complex motion scenarios. 11 subjects perform 15 different actions under 4 different viewpoints. The scenarios include a diverse set of motions typical of human activities, such as walking, eating, greeting, and discussion.

HumanEva-I/II provide synchronized images and motion capture data and are standard benchmarks for 3D human pose estimation. On these datasets, 3D joint positions are transformed to camera coordinates to produce monocular human pose estimation results.
For the experiments we carried out on these datasets, we use the average Euclidean distance between the ground-truth and predicted joint positions as the evaluation metric. For the sake of completeness, the parameters of the method described in Section 3 are tabulated in Table 1.
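For reference, the metric itself reduces to a few lines, assuming predictions and ground truth are arrays of per-joint 3D coordinates in millimeters:

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Average Euclidean distance between predicted and ground-truth
    joints; both arrays have shape (num_frames, num_joints, 3), in mm."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```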
4.2. Baselines
Parameter | Value
Spatial cell divisions | 2 × 2, 4 × 4, 8 × 8
Temporal cell size | 4 frames
Temporal window size | 24 frames
Input kernel dimensionality | 15000
Output kernel dimensionality | 4000
Table 1. The parameters used throughout our experiments.

To demonstrate the effectiveness of our approach, we compare it against several state-of-the-art algorithms. We chose them to be representative of different approaches to 3D human pose estimation, as discussed in Section 2. For those whose code we do not have access to, we used the published performance numbers and ran our own method on the corresponding data. We list them below.
Single-frame based approaches rely on 2D appearance information for 3D human pose estimation. We compare against the state-of-the-art 2D HoG-based approach of [16] and Convolutional Neural Networks [22] on the Human3.6m dataset. We provide further comparisons to 3D Pictorial Structures [3] and Regression Forests [19] on the HumanEva-I dataset.

Frame-to-Frame Tracking Approaches use dynamical models to estimate transitions between consecutive frames and assume priors on the motion of the person. Among these approaches, we compare against Conditional Restricted Boltzmann Machines [36] and the loose-limbed body model of [33], which relies on probabilistic graphical models of the human pose and motion, on the HumanEva-I dataset.

Tracking-by-Detection Approaches refine single-frame detections by linking them into consistent object trajectories. Among these methods, we compare against the one of [2] on the HumanEva-II dataset.
4.3. Evaluation on the Human3.6m Dataset
To quantitatively evaluate the performance of our approach, we first used the recently released large-scale Human3.6m [16] motion capture dataset. On that dataset, the regression-based method of [16] performed best at the time, and we therefore use it as a baseline. That method relies on a Fourier approximation of 2D HoG features using the χ² comparison metric, and we will refer to it as "eχ²-HoG + KRR" or "eχ²-HoG + KDE", depending on whether it uses the KRR or KDE regressor introduced in Section 3.4. Since then, even better results have been reported for some of the actions by using Convolutional Neural Nets (CNNs) [22]. Given the current interest in Deep Learning approaches, we therefore use this method as a second baseline and refer to it as CNN-Regression.
Method | Directions | Discussion | Eating | Greeting | Phone Talk | Posing | Buying | Sitting
eχ²-HoG + KRR [16] | 140.00 | 189.36 | 157.20 | 167.65 | 173.72 | 159.25 | 214.83 | 193.81
eχ²-HoG + KDE [16] | 132.71 | 183.55 | 132.37 | 164.39 | 162.12 | 150.61 | 171.31 | 151.57
CNN-Regression [22] | - | 148.79 | 104.01 | 127.17 | - | - | - | -
Ours + KRR | 118.66 | 159.51 | 112.65 | 143.78 | 144.65 | 136.11 | 166.09 | 178.21
Ours + KDE | 102.39 | 158.52 | 87.95 | 126.83 | 118.37 | 114.69 | 107.61 | 136.15

Method | Sitting Down | Smoking | Taking Photo | Waiting | Walking | Walking Dog | Walking Pair | Average
eχ²-HoG + KRR [16] | 279.07 | 169.59 | 211.31 | 174.27 | 108.37 | 192.26 | 139.76 | 178.03
eχ²-HoG + KDE [16] | 243.03 | 162.14 | 205.94 | 170.69 | 96.60 | 177.13 | 127.88 | 162.14
CNN-Regression [22] | - | - | 189.08 | - | 77.60 | 146.59 | - | -
Ours + KRR | 246.40 | 139.33 | 191.83 | 156.66 | 71.05 | 151.84 | 91.66 | 147.23
Ours + KDE | 205.65 | 118.21 | 185.02 | 146.66 | 65.86 | 128.11 | 77.21 | 125.28

Table 2. 3D joint position errors (in mm), measured as the average Euclidean distance between the ground-truth and predicted joint positions, comparing our results to those of the two baselines [16, 22] when using the different regressors described in Section 3.4.
The authors of [22] reported results on subjects S9 and S11, whereas those of [16] made their code available. To compare our results to those of both baselines, we therefore trained our regressors and those of [16] for 15 different actions. We used 5 subjects (S1, S5, S6, S7, S8) for training purposes and 2 (S9 and S11) for testing. Training and testing were carried out in all camera views for each separate action, as described in [16]. Recall from Section 3.1 that 3D body poses are represented by skeletons with 17 joints. Their 3D locations are expressed relative to that of a root node in the coordinate system of the camera that captured the images.
Table 2 summarizes our results on Human3.6m. Our method outperforms eχ²-HoG + KDE [16] significantly for all actions, with the mean error reduced by about 23%. It also improves on CNN-Regression [22] for the actions on which those authors reported accuracy numbers, except for "Discussion". Unsurprisingly, the improvement is particularly marked for actions, such as walking and eating, that involve substantial amounts of predictable motion. The discussion sequences, by contrast, involve motions that are very irregular and differ from sequence to sequence as well as within sequences, which reduces the usefulness of temporal information. Example 3D pose reconstructions of our method on Human3.6m can be seen in Fig. 5. Further visualizations for the datasets are provided in the supplementary material.
Importance of Motion Compensation. To highlight the importance of motion compensation, we recomputed our features without it. As discussed in Section 3.3, we also tried two different approaches to performing it, using either a recent optical-flow based motion stabilization algorithm [24] or our own algorithm that exploits background subtraction results.

As shown in Table 3, motion compensation significantly improves performance for the two actions for which we performed this comparison. Furthermore, our approach to implementing it is more effective than the optical-flow method of [24].
Method | Greeting | Walking Dog
eχ²-HoG [16] | 164.39 | 177.13
Ours + No Compensation | 144.48 | 138.66
Ours + Optical Flow [24] | 140.97 | 134.98
Ours + Centering | 126.83 | 128.11

Table 3. Importance of motion compensation. We compare the results of [16] against those obtained using our method without motion compensation, with motion compensation using the algorithm of [24], and with motion compensation using the algorithm of Section 3.3. 3D joint position errors are given in millimeters.
Method | Walking | Eating
eχ²-HoG + KDE [16] | 96.60 | 132.37
Ours (12 frames) | 69.11 | 94.32
Ours (24 frames) | 65.86 | 87.95
Ours (36 frames) | 64.07 | 87.64
Ours (48 frames) | 64.75 | 85.77

Table 4. Influence of temporal window size. We compare the results of [16] against those obtained using our method with increasing temporal window sizes.
Influence of Temporal Window Size. In Table 4, we report the effect of changing the size of our temporal windows from 12 to 48 frames, again for two different actions. Since eχ²-HoG also relies on HoG features, its output can be thought of as the result for single-frame windows. Using temporal information clearly helps, and the best results are obtained in the range of 24 to 48 frames, which corresponds to 0.5 to 1 second at 50 fps. For the experiments we carried out on Human3.6m, we use 24 frames, as this provides both accurate reconstructions and efficiency.

(a) Ground Truth (b) Our method (c) Ionescu et al. [16]
Figure 5. Example 3D pose estimation results for challenging cases of the Buying, Discussion, Eating and Walking Pair actions in the Human3.6m dataset. The recovered 3D poses and their projections onto the orthogonal plane are shown in (b) and (c) for our method and [16], respectively, along with the ground-truth joint positions in (a). Our method recovers the 3D pose of the person more accurately than [16] in these challenging scenarios, which involve a significant amount of self-occlusion and orientation ambiguity. Best viewed in color.
Activity Independent Regression. We further evaluate our approach in an experimental setting where all the motions from all the subjects are considered together for training. As in the activity-dependent setting, we use S1, S5, S6, S7, S8 for training and S9, S11 for testing. We obtain an average error of 121.62 mm across all test videos regardless of the action class, whereas the method of [16] yields an error of 154.43 mm. Interestingly, for both methods, this is better than the mean of the activity-dependent regressors, as can be seen in the last column of Table 2. This is due to the fact that the increase in the number of training samples compensates for the increase in the variance of the 3D poses.
4.4. Evaluation on HumanEva Datasets
We further evaluated our approach on the HumanEva-I and HumanEva-II datasets. Most of the early research in the field has reported results on this standard benchmark. We therefore also carried out experiments on it to compare our approach to several state-of-the-art 3D human pose estimation and body tracking approaches. The baselines we considered include tracking-based approaches that impose dynamical priors on the motion [36, 33] and the tracking-by-detection framework of [2]. We demonstrate in Table 5 and Table 6 that using temporal information earlier in the inference process, in a discriminative bottom-up fashion, yields more accurate results than the above-mentioned approaches that enforce top-down temporal priors on the motion.
HumanEva-I: For the experiments we carried out on the HumanEva-I dataset, we train our regressor on the training sequences of Subjects 1, 2 and 3 and evaluate on the "validation" sequences in the same manner as the baselines we compare against [36, 33, 19, 3]. We report the performance of our approach on cyclic and acyclic motions, i.e., Walking and Boxing, in Table 5. The results show that our proposed method outperforms the state-of-the-art approaches.
Method | Walking (S1) | Walking (S2) | Walking (S3) | Boxing (S1)
Taylor et al. [36] | 48.8 | 47.4 | 49.8 | 75.35
Sigal et al. [33] | 66.0 | 69.0 | - | -
Kostrikov et al. [19] | 44.0 | 30.9 | 41.7 | -
Belagiannis et al. [3] | 68.3 | - | - | 62.70
Ours | 32.6 | 24.1 | 45.4 | 46.18

Table 5. Results on the Walking and Boxing sequences of the HumanEva-I dataset. We compare our approach against methods that rely on discriminative regression [19], 3D pictorial structures [3], and top-down temporal priors [36, 33].
HumanEva-II: On HumanEva-II, we compare against [2], as it has the best monocular pose estimation results on this dataset. Following the same experimental procedure, we use subjects S1, S2 and S3 from the HumanEva-I dataset for training and report pose estimation results on the first 350 frames of the sequence containing subject S2. Whereas [2] uses additional training data from the "People" [25] and "Buffy" [11] datasets, we use only the training data from HumanEva-I. We evaluated our approach using the official online evaluation tool. The comparison is given in Table 6. For both camera views, our method achieves significantly higher performance.

Method | S2/C1 | S2/C2
Andriluka et al. [2] | 107 | 101
Ours | 81.8 | 87.46

Table 6. Results on the Combo sequence of the HumanEva-II dataset. We compare our approach against the tracking-by-detection framework of [2].
5. Conclusion
In this paper, we have formulated a novel discriminative approach to 3D human pose estimation using spatiotemporal features. We have shown that gathering evidence from multiple frames disambiguates difficult poses involving self-occlusion and mirror ambiguity, which brings about a significant increase in accuracy over state-of-the-art methods. Furthermore, we have demonstrated that taking motion information into account very early in the modeling process yields substantial performance improvements over doing it a posteriori by linking detections in individual frames and only later imposing temporal consistency as part of the linking process.
References
[1] A. Agarwal and B. Triggs. 3D Human Pose from Silhouettes by Relevance Vector Regression. In CVPR, 2004.
[2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3D Pose Estimation and Tracking by Detection. In CVPR, 2010.
[3] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic. 3D Pictorial Structures for Multiple Human Pose Estimation. In CVPR, 2014.
[4] L. Bo and C. Sminchisescu. Twin Gaussian Processes for Structured Prediction. IJCV, 2010.
[5] L. Bo, C. Sminchisescu, A. Kanaujia, and D. Metaxas. Fast Algorithms for Large Scale Conditional 3D Prediction. In CVPR, June 2008.
[6] M. Burenius, J. Sullivan, and S. Carlsson. 3D Pictorial Structures for Multiple View Articulated Pose Estimation. In CVPR, 2013.
[7] X. Burgos-Artizzu, D. Hall, P. Perona, and P. Dollár. Merging Pose Estimates Across Space and Time. In BMVC, 2013.
[8] C. Cortes, M. Mohri, and J. Weston. A General Regression Technique for Learning Transductions. In ICML, 2005.
[9] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000.
[10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI, 32(9), 2010.
[11] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive Search Space Reduction for Human Pose Estimation. In CVPR, 2008.
[12] J. Gall, B. Rosenhahn, T. Brox, and H.-P. Seidel. Optimization and Filtering for Human Motion Capture. IJCV, 2010.
[13] M. Hofmann and D. M. Gavrila. Multi-view 3D Human Pose Estimation in Complex Environment. IJCV, 2012.
[14] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel Methods in Machine Learning. The Annals of Statistics, 36(3):1171-1220, 2008.
[15] C. Ionescu, J. Carreira, and C. Sminchisescu. Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation. In CVPR, 2014.
[16] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. PAMI, 2014.
[17] A. Kanaujia, C. Sminchisescu, and D. N. Metaxas. Semi-supervised Hierarchical Models for 3D Human Pose Reconstruction. In CVPR, 2007.
[18] A. Kläser, M. Marszałek, and C. Schmid. A Spatio-Temporal Descriptor Based on 3D-Gradients. In BMVC, 2008.
[19] I. Kostrikov and J. Gall. Depth Sweep Regression Forests for Estimating 3D Human Pose from Images. In BMVC, 2014.
[20] I. Laptev. On Space-Time Interest Points. IJCV, 64(2-3):107-123, 2005.
[21] F. Li, G. Lebanon, and C. Sminchisescu. Chebyshev Approximations to the Histogram χ² Kernel. In CVPR, 2012.
[22] S. Li and A. B. Chan. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Network. In ACCV, 2014.
[23] R. Memisevic, L. Sigal, and D. J. Fleet. Shared Kernel Information Embedding for Discriminative Inference. PAMI, pages 778-790, April 2012.
[24] D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár. Exploring Weak Stabilization for Motion Feature Extraction. In CVPR, 2013.
[25] D. Ramanan. Learning to Parse Images of Articulated Bodies. In NIPS, 2006.
[26] D. Ramanan, A. Forsyth, and A. Zisserman. Strike a Pose: Tracking People by Finding Stylized Poses. In CVPR, 2005.
[27] B. Sapp, A. Toshev, and B. Taskar. Cascaded Models for Articulated Pose Estimation. In ECCV, 2010.
[28] B. Sapp, D. J. Weiss, and B. Taskar. Parsing Human Motion with Stretchable Models. In CVPR, 2011.
[29] J. Shotton, A. Fitzgibbon, M. Cook, and A. Blake. Real-Time Human Pose Recognition in Parts from a Single Depth Image. In CVPR, 2011.
[30] L. Sigal, A. Balan, and M. J. Black. Combined Discriminative and Generative Articulated Pose and Non-rigid Shape Estimation. In NIPS, 2007.
[31] L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion. IJCV, 87(1-2):4-27, 2010.
[32] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking Loose-limbed People. In CVPR, 2004.
[33] L. Sigal, M. Isard, H. W. Haussecker, and M. J. Black. Loose-limbed People: Estimating 3D Human Pose and Motion Using Non-parametric Belief Propagation. IJCV, 2012.
[34] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative Density Propagation for 3D Human Motion Estimation. In CVPR, 2005.
[35] C. Sminchisescu and B. Triggs. Covariance Scaled Sampling for Monocular 3D Body Tracking. In CVPR, 2001.
[36] G. W. Taylor, L. Sigal, D. J. Fleet, and G. E. Hinton. Dynamical Binary Latent Variable Models for 3D Human Pose Tracking. In CVPR, 2010.
[37] R. Urtasun and T. Darrell. Sparse Probabilistic Regression for Activity-Independent Human Pose Inference. In CVPR, 2008.
[38] R. Urtasun, D. Fleet, and P. Fua. 3D People Tracking with Gaussian Process Dynamical Models. In CVPR, 2006.
[39] D. Weinland, M. Ozuysal, and P. Fua. Making Action Recognition Robust to Occlusions and Viewpoint Changes. In ECCV, September 2010.
[40] F. Zhou and F. De la Torre. Spatio-Temporal Matching for Human Detection in Video. In ECCV, 2014.
[41] S. Zuffi, J. Romero, C. Schmid, and M. J. Black. Estimating Human Pose with Flowing Puppets. In ICCV, 2013.