Articulated pose estimation with flexible mixtures-of-parts
ABSTRACT We describe a method for human pose estimation in static images based on a novel representation of part models. Notably, we do not use articulated limb parts, but rather capture orientation with a mixture of templates for each part. We describe a general, flexible mixture model for capturing contextual co-occurrence relations between parts, augmenting standard spring models that encode spatial relations. We show that such relations can capture notions of local rigidity. When co-occurrence and spatial relations are tree-structured, our model can be efficiently optimized with dynamic programming. We present experimental results on standard benchmarks for pose estimation that indicate our approach is the state-of-the-art system for pose estimation, outperforming past work by 50% while being orders of magnitude faster.
Yi Yang Deva Ramanan
Dept. of Computer Science, University of California, Irvine
{yyang8,dramanan}@ics.uci.edu
1. Introduction
We examine the task of human pose estimation in static images. A working technology would immediately impact many key vision tasks such as image understanding and activity recognition. An influential approach is the pictorial structure framework [7, 12] which decomposes the appearance of objects into local part templates, together with geometric constraints on pairs of parts, often visualized as springs. When parts are parameterized by pixel location and orientation, the resulting structure can model articulation. This has been the dominant approach to human pose estimation. In contrast, traditional models for object recognition use parts parameterized solely by location, which simplifies both inference and learning. Such models have been shown to be very successful for object recognition [2, 9]. In this work, we introduce a novel unified representation for both models that produces state-of-the-art results for human pose estimation.
Figure 1: On the left, we show the classic articulated limb model of Marr and Nishihara [20]. In the middle, we show different orientation and foreshortening states of a limb, each of which is evaluated separately in classic articulated body models. On the right, we approximate these transformations with a mixture of non-oriented pictorial structures, in this case tuned to represent near-vertical and near-horizontal limbs.
Representations for articulated pose: Full-body pose estimation is difficult because of the many degrees of freedom to be estimated. Moreover, limbs vary greatly in appearance due to changes in clothing and body shape, as well as changes in viewpoint manifested in in-plane orientations and foreshortening. These difficulties complicate inference since one must typically search images with a large number of rotated and foreshortened templates. We address these problems by introducing a novel but simple representation for modeling a family of affinely-warped templates: a mixture of non-oriented pictorial structures (Fig.1). We empirically demonstrate that such approximations can outperform explicitly articulated parts because mixture models can capture orientation-specific statistics of background features (Fig.2).
Representations for objects: Current object recognition systems are built on relatively simple structures encoding mixtures of star models defined over tens of parts [9], or implicitly-defined shape models built on hundreds of parts [19, 2]. In order to model the varied appearance of objects (due to deformation, viewpoint, etc.), we argue that one will need vocabularies of hundreds or thousands of parts, where only a subset are instanced at a time. We augment classic spring models with co-occurrence constraints that favor particular combinations of parts. Such constraints can capture notions of local rigidity; for example, two parts on the same limb should be constrained to have the same orientation state (Fig.1). We show that one can embed such constraints in a tree relational graph that preserves tractability. An open challenge is that of learning such complex representations from data. As in [2], we conclude that supervision is a key ingredient for learning structured relational models.
[Figure 2 plot: polar histogram with radial ticks at 0.05, 0.1 and 0.15; directions marked horizontal, diagonal and vertical]
Figure 2: We plot the average HOG feature as a polar histogram over 18 gradient orientation channels as computed from the entire PASCAL 2010 dataset [6]. We see that on average, images contain more horizontal gradients than vertical gradients, and much stronger horizontal gradients as compared to diagonal gradients. This means that gradient statistics are not orientation invariant. In practical terms, we argue that it is easier to find diagonal limbs (as opposed to horizontal ones) because one is less likely to be confused by diagonal background clutter. Articulated limb models obtained by rotating a single template cannot exploit such orientation-specific cues. On the other hand, our mixture models are tuned to detect parts at particular orientations, and so can exploit such statistics.
We demonstrate results on the difficult task of pose esti-
mation. We use two standard benchmark datasets [23, 10].
We outperform all published past work on both datasets, re-
ducing error by up to 50%. We do so with a novel but sim-
ple representation that is orders of magnitude faster than
previous approaches. Our model requires roughly 1 sec-
ond to process a typical benchmark image, allowing for the
possibility of real-time performance with further speedups
(such as cascaded or parallelized implementations).
2. Related Work
Pose estimation has typically been addressed in the video domain, dating back to classic model-based approaches of O’Rourke and Badler [22], Hogg [13], Rohr [25]. Recent work has examined the problem for static images, assuming that such techniques will be needed to initialize video-based articulated trackers. Probabilistic formulations are common. One area of research is the encoding of spatial structure. Tree models allow for efficient inference [7], but are plagued by the well-known phenomenon of double-counting. Loopy models require approximate inference strategies such as importance sampling [7, 18], loopy belief propagation [28], or iterative approximations [33]. Recent work has suggested that branch and bound algorithms with tree-based lower bounds can globally solve such problems [31, 29]. Another approach to tackling the double-counting phenomenon is the use of stronger pose priors, advocated by [17]. However, such approaches may be more susceptible to overfitting to statistics of a particular dataset, as warned by [28, 32].
An alternate family of techniques has explored the trade-
off between generative and discriminative models trained
explicitly for pose estimation. Approaches include condi-
tional random fields [24] and margin-based or boosted de-
tectors [27, 16, 1, 29]. A final crucial issue is that of feature
descriptors. Past work has explored the use of superpixels
[21], contours [27, 26, 30], foreground/background color
models [23, 10], and gradient descriptors [1, 15].
In terms of object detection, our work is most similar
to pictorial structure models that reason about mixtures of
parts [5, 7, 9]. We show that our model generalizes such
representations in Sec.3.1. Our model, when instanced as a
tree, can be written as a recursive grammar of parts [8].
3. Model
Let us write $I$ for an image, $p_i = (x, y)$ for the pixel location of part $i$ and $t_i$ for the mixture component of part $i$. We write $i \in \{1, \dots, K\}$, $p_i \in \{1, \dots, L\}$ and $t_i \in \{1, \dots, T\}$. We call $t_i$ the “type” of part $i$. Our motivating examples of types include orientations of a part (e.g., a vertical versus horizontally oriented hand), but types may span semantic classes (an open versus closed hand). For notational convenience, we define the lack of subscript to indicate a set spanned by that subscript (e.g., $t = \{t_1, \dots, t_K\}$).
Co-occurrence model: To score a configuration of parts, we first define a compatibility function for part types that factors into a sum of local and pairwise scores:
$$ S(t) = \sum_{i \in V} b_i^{t_i} + \sum_{ij \in E} b_{ij}^{t_i,t_j} \qquad (1) $$
The parameter $b_i^{t_i}$ favors particular type assignments for part $i$, while the pairwise parameter $b_{ij}^{t_i,t_j}$ favors particular co-occurrences of part types. For example, if part types correspond to orientations and part $i$ and $j$ are on the same rigid limb, then $b_{ij}^{t_i,t_j}$ would favor consistent orientation assignments. We write $G = (V, E)$ for a $K$-node relational graph whose edges specify which pairs of parts are constrained to have consistent relations.
We can now write the full score associated with a configuration of part types and positions:
$$ S(I, p, t) = S(t) + \sum_{i \in V} w_i^{t_i} \cdot \phi(I, p_i) + \sum_{ij \in E} w_{ij}^{t_i,t_j} \cdot \psi(p_i - p_j) \qquad (2) $$
where $\phi(I, p_i)$ is a feature vector (e.g., HOG descriptor [3]) extracted from pixel location $p_i$ in image $I$. We write $\psi(p_i - p_j) = [\, dx \;\; dx^2 \;\; dy \;\; dy^2 \,]^T$, where $dx = x_i - x_j$ and $dy = y_i - y_j$, the relative location of part $i$ with respect to $j$. Notably, this relative location is defined with respect to the pixel grid and not the orientation of part $i$ (as in classic articulated pictorial structures [7]).
Appearance model: The first sum in (2) is an appearance model that computes the local score of placing a template $w_i^{t_i}$ for part $i$, tuned for type $t_i$, at location $p_i$.
Deformation model: The second term can be interpreted as a “switching” spring model that controls the relative placement of part $i$ and $j$ by switching between a collection of springs. Each spring is tailored for a particular pair of types $(t_i, t_j)$, and is parameterized by its rest location and rigidity, which are encoded by $w_{ij}^{t_i,t_j}$.
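As a concrete illustration of the score in (2), the following minimal sketch evaluates one fixed configuration of positions and types. It is not the authors' implementation: the array layout (`feat`, `w_app`, `w_def`, `b_unary`, `b_pair`), the function names, and the convention of storing each edge as a (child, parent) pair are assumptions made for illustration only. The helper `rest_offset` shows one way to read a spring's rest location off its weights: for a term $a\,dx + b\,dx^2$ with $b < 0$, setting the derivative to zero gives $dx = -a/(2b)$, and $|b|$ plays the role of rigidity.

```python
import numpy as np

def psi(p_i, p_j):
    """Deformation feature [dx, dx^2, dy, dy^2] of part i relative to part j."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    return np.array([dx, dx * dx, dy, dy * dy], dtype=float)

def full_score(feat, p, t, edges, b_unary, b_pair, w_app, w_def):
    """Evaluate S(I, p, t) for one configuration.

    Assumed (hypothetical) layout:
      feat[y, x]      : feature vector at pixel (x, y), a toy stand-in for HOG
      p[i], t[i]      : (x, y) location and integer type of part i
      edges           : list of (child, parent) index pairs
      b_unary[i, ti]  : type bias of part i
      b_pair[(i, j)]  : (T, T) co-occurrence biases, indexed [ti, tj]
      w_app[i, ti]    : appearance template of part i for type ti
      w_def[(i, j)]   : (T, T, 4) spring weights, indexed [ti, tj]
    """
    score = 0.0
    for i, ((x, y), ti) in enumerate(zip(p, t)):
        score += b_unary[i, ti] + w_app[i, ti] @ feat[y, x]
    for (i, j) in edges:
        score += b_pair[(i, j)][t[i], t[j]]
        score += w_def[(i, j)][t[i], t[j]] @ psi(p[i], p[j])
    return float(score)

def rest_offset(w):
    """Offset (dx, dy) that maximizes w . psi, assuming both quadratic weights are negative."""
    a, b, c, d = w
    return (-a / (2 * b), -c / (2 * d))
```

With spring weights `[2, -1, 0, -1]`, for instance, `rest_offset` places the child one pixel to the right of its parent; moving away from that offset lowers the score quadratically.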
3.1. Special cases
We now describe various special cases of our model
which have appeared in the literature. One obvious case
is T = 1, in which case our model reduces to a standard
pictorial structure. More interesting cases are below.
Semantic part models: [5] argue that part appearances
should capture semantic classes and not visual classes; this
can be done with a type model. Consider a face model with
eye and mouth parts. One may want to model different
types of eyes (open and closed) and mouths (smiling and
frowning). The spatial relationship between the two does
not likely depend on their type, but open eyes may tend to
co-occur with smiling mouths. This can be obtained as a
special case of our model by using a single spring for all
types of a particular pair of parts:
$$ w_{ij}^{t_i,t_j} = w_{ij} \qquad (3) $$
Mixtures of deformable parts: [9] define a mixture of models, where each model is a star-based pictorial structure. This can be achieved by restricting the co-occurrence model to allow for only globally-consistent types:
$$ b_{ij}^{t_i,t_j} = \begin{cases} 0 & \text{if } t_i = t_j \\ -\infty & \text{otherwise} \end{cases} \qquad (4) $$
Articulation: In our experiments, we explore a simplified version of (2) with a reduced set of springs:
$$ w_{ij}^{t_i,t_j} = w_{ij}^{t_i} \qquad (5) $$
The above simplification states that the relative location of a part with respect to its parent is dependent on part-type, but not parent-type. For example, let $i$ be a hand part, $j$ its parent elbow part, and assume part types capture orientation. The above relational model states that a sideways-oriented hand should tend to lie next to the elbow, while a downward-oriented hand should lie below the elbow, regardless of the orientation of the upper arm.
4. Inference
Inference corresponds to maximizing $S(I, p, t)$ from (2) over $p$ and $t$. When the relational graph $G = (V, E)$ is a tree, this can be done efficiently with dynamic programming. Let kids($i$) be the set of children of part $i$ in $G$. We compute the message part $i$ passes to its parent $j$ by the following:
$$ \text{score}_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} \cdot \phi(I, p_i) + \sum_{k \in \text{kids}(i)} m_k(t_i, p_i) \qquad (6) $$
$$ m_i(t_j, p_j) = \max_{t_i} \left[ b_{ij}^{t_i,t_j} + \max_{p_i} \left( \text{score}_i(t_i, p_i) + w_{ij}^{t_i,t_j} \cdot \psi(p_i - p_j) \right) \right] \qquad (7) $$
(6) computes the local score of part $i$, at all pixel locations $p_i$ and for all possible types $t_i$, by collecting messages from the children of $i$. (7) computes, for every location and possible type of part $j$, the best scoring location and type of its child part $i$. Once messages are passed to the root part ($i = 1$), $\text{score}_1(t_1, p_1)$ represents the best scoring configuration for each root position and type. One can use these root scores to generate multiple detections in image $I$ by thresholding them and applying non-maximum suppression (NMS). By keeping track of the argmax indices, one can backtrack to find the location and type of each part in each maximal configuration.
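The message passing in (6) and (7), together with the backtracking step, can be sketched with brute-force maximization, i.e. the $O(L^2T^2)$ variant before any distance-transform speedup. This is a minimal sketch under stated assumptions, not the authors' code: parts are indexed so that every parent precedes its children (the root is index 0 here rather than 1), the unary terms $b_i^{t_i} + w_i^{t_i} \cdot \phi(I, p_i)$ are assumed precomputed in `local[i][t, l]`, edge parameters are stored per child, and all function names are hypothetical. `brute_force` enumerates every configuration and is included only as a reference check for tiny problems.

```python
import itertools
import numpy as np

def psi(p_i, p_j):
    """Deformation feature [dx, dx^2, dy, dy^2] of child p_i relative to parent p_j."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    return np.array([dx, dx * dx, dy, dy * dy], dtype=float)

def config_score(local, parent, b_pair, w_def, locs, ts, ls):
    """Score of one configuration: types ts and location indices ls for every part."""
    s = sum(local[i][ts[i], ls[i]] for i in range(len(parent)))
    for i in range(1, len(parent)):
        j = parent[i]
        s += b_pair[i][ts[i], ts[j]]
        s += w_def[i][ts[i], ts[j]] @ psi(locs[ls[i]], locs[ls[j]])
    return float(s)

def infer(local, parent, b_pair, w_def, locs):
    """Max-sum dynamic programming on a tree; returns (best score, types, location indices).

    local[i]  : (T, L) unary scores of part i
    parent[i] : parent index of part i, with parent[0] = -1 and parent[i] < i
    b_pair[i] : (T, T) biases for the edge (i, parent[i]), indexed [ti, tj]
    w_def[i]  : (T, T, 4) spring weights for the same edge
    locs      : list of L candidate (x, y) locations
    """
    K = len(parent)
    T, L = local[0].shape
    score = [np.array(s, dtype=float) for s in local]  # copies; messages are added below
    back = [None] * K
    for i in range(K - 1, 0, -1):                      # children are processed before parents
        j = parent[i]
        msg = np.full((T, L), -np.inf)
        arg = np.zeros((T, L, 2), dtype=int)
        for tj, lj, ti, li in itertools.product(range(T), range(L), range(T), range(L)):
            v = (score[i][ti, li] + b_pair[i][ti, tj]
                 + w_def[i][ti, tj] @ psi(locs[li], locs[lj]))
            if v > msg[tj, lj]:
                msg[tj, lj] = v
                arg[tj, lj] = (ti, li)
        score[j] += msg                                # the parent collects the child's message
        back[i] = arg
    t0, l0 = np.unravel_index(np.argmax(score[0]), score[0].shape)
    ts = [int(t0)] + [0] * (K - 1)
    ls = [int(l0)] + [0] * (K - 1)
    for i in range(1, K):                              # backtrack from the root to the leaves
        ts[i], ls[i] = (int(v) for v in back[i][ts[parent[i]], ls[parent[i]]])
    return float(score[0][t0, l0]), ts, ls

def brute_force(local, parent, b_pair, w_def, locs):
    """Exhaustive maximum over all (LT)^K configurations; only feasible for tiny problems."""
    K = len(parent)
    T, L = local[0].shape
    return max(config_score(local, parent, b_pair, w_def, locs, ts, ls)
               for ts in itertools.product(range(T), repeat=K)
               for ls in itertools.product(range(L), repeat=K))
```

The sketch returns only the single best configuration; generating multiple detections would additionally require thresholding the root score map and applying NMS, as described above.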
Computation: The computationally taxing portion of dynamic programming is (7). One has to loop over $L \times T$ possible parent locations and types, and compute a max over $L \times T$ possible child locations and types, making the computation $O(L^2 T^2)$ for each part. When $\psi(p_i - p_j)$ is a quadratic function (as is the case for us), the inner maximization in (7) can be efficiently computed for each combination of $t_i$ and $t_j$ in $O(L)$ with a max-convolution or distance transform [7]. Since one has to perform $T^2$ distance transforms, message passing reduces to $O(LT^2)$ per part.
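To make the $O(L)$ claim concrete, here is a hedged one-dimensional sketch of the inner maximization in (7) for a single pair of types, computed with the lower-envelope generalized distance transform and checked against an $O(L^2)$ reference. The function names and the reduction to one dimension are assumptions for illustration. The two-dimensional case is assumed to follow by applying the same transform along $x$ and then along $y$, which is valid because $\psi$ separates into $dx$ and $dy$ terms. The fast routine requires a negative quadratic weight, i.e. a spring that penalizes displacement.

```python
import numpy as np

def max_transform_brute(score, w_lin, w_quad):
    """O(L^2) reference: out[q] = max_p score[p] + w_lin*(p-q) + w_quad*(p-q)^2."""
    score = np.asarray(score, dtype=float)
    idx = np.arange(len(score))
    d = idx[None, :] - idx[:, None]              # d[q, p] = p - q (child minus parent)
    vals = score[None, :] + w_lin * d + w_quad * d ** 2
    return vals.max(axis=1), vals.argmax(axis=1)

def max_transform_fast(score, w_lin, w_quad):
    """Same result in O(L) via the lower envelope of parabolas (requires w_quad < 0)."""
    a, b = -w_quad, w_lin                        # minimize f(p) + a*(q-p)^2 + b*(q-p)
    f = -np.asarray(score, dtype=float)
    n = len(f)
    v = np.zeros(n, dtype=int)                   # roots of the parabolas in the envelope
    z = np.empty(n + 1)                          # boundaries between adjacent parabolas
    z[0], z[1] = -np.inf, np.inf
    k = 0

    def intersect(p, r):
        # Abscissa where the parabolas rooted at p and r (p < r) cross.
        return ((f[r] - f[p]) - b * (r - p) + a * (r * r - p * p)) / (2 * a * (r - p))

    for q in range(1, n):
        s = intersect(v[k], q)
        while s <= z[k]:                         # new parabola hides the previous one
            k -= 1
            s = intersect(v[k], q)
        k += 1
        v[k], z[k], z[k + 1] = q, s, np.inf

    out = np.empty(n)
    arg = np.empty(n, dtype=int)
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        p = v[k]
        out[q] = -(f[p] + a * (q - p) ** 2 + b * (q - p))
        arg[q] = p                               # best child location for parent location q
    return out, arg
```

Keeping `arg` alongside `out` corresponds to storing the argmax indices needed for backtracking.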
Special cases: Model (3) maintains only a single spring per part, so message passing reduces to $O(L)$. Models (4) and (5) maintain only $T$ springs per part, reducing message passing to $O(LT)$. It is worthwhile to note that our articulated model is no more computationally complex than the deformable mixtures of parts in [9], but is considerably more flexible (as we show in our experiments). In practice, $T$ is small ($\le 6$ in our experiments) and the distance transform is quite efficient, so the computation time is dominated by computing the local scores of each type-specific appearance model $w_i^{t_i} \cdot \phi(I, p_i)$. Since this score is linear, it can be efficiently computed for all positions $p_i$ by optimized convolution routines.
5. Learning
We assume a supervised learning paradigm. Given labeled positive examples $\{I_n, p_n, t_n\}$ and negative examples $\{I_n\}$, we will define a structured prediction objective function similar to those proposed in [9, 16]. To do so, let us write $z_n = (p_n, t_n)$ and note that the scoring function (2) is linear in model parameters $\beta = (w, b)$, and so can be written as $S(I, z) = \beta \cdot \Phi(I, z)$. We would learn a model of the form:
$$ \begin{aligned} \arg\min_{\beta,\ \xi_n \ge 0} \quad & \frac{1}{2} \beta \cdot \beta + C \sum_n \xi_n \qquad (8) \\ \text{s.t.} \quad \forall n \in \text{pos} \quad & \beta \cdot \Phi(I_n, z_n) \ge 1 - \xi_n \\ \forall n \in \text{neg}, \forall z \quad & \beta \cdot \Phi(I_n, z) \le -1 + \xi_n \end{aligned} $$
The above constraints state that positive examples should score better than 1 (the margin), while negative examples, for all configurations of part positions and types, should score less than -1. The objective function penalizes violations of these constraints using slack variables $\xi_n$.
Detection vs pose estimation: Traditional structured prediction tasks do not require an explicit negative training set, and instead generate negative constraints from positive examples with mis-estimated labels $z$. This corresponds to training a model that tends to score a ground-truth pose highly and alternate poses poorly. While this translates directly to a pose estimation task, our above formulation also includes a “detection” component: it trains a model that scores highly on ground-truth poses, but generates low scores on images without people. We find the above to work well for both pose estimation and person detection.
Optimization: The above optimization is a quadratic program (QP) with an exponential number of constraints, since the space of $z$ is $(LT)^K$. Fortunately, only a small minority of the constraints will be active on typical problems (e.g., the support vectors), making them solvable in practice. This form of learning problem is known as a structural SVM, and there exist many well-tuned solvers such as the cutting plane solver of SVMStruct [11] and the stochastic gradient descent solver in [9]. We found good results by implementing our own dual coordinate-descent solver, which we will describe in an upcoming tech report.
5.1. Learning in practice
Most human pose datasets include images with labeled
joint positions [23, 10, 2]. We define parts to be located at
joints, so these provide part position labels p, but not part
type labels t. We now describe a procedure for generating
type labels for our articulated model (5).
We first manually define the edge structure E by con-
necting joint positions based on average proximity. Because
we wish to model articulation, we can assume that part
types should correspond to different relative locations of a
part with respect to its parent in E. For example, sideways-
oriented hands occur next to elbows, while downward-
facing hands occur below elbows. This means we can use
relative location as a supervisory cue to help derive type la-
bels that capture orientation.
Deriving part type from position: Assume that our $n$th training image $I_n$ has labeled joint positions $p^n$. Let $p_i^n$ be the relative position of part $i$ with respect to its parent in image $I_n$. For each part $i$, we cluster its relative position over the training set $\{p_i^n : \forall n\}$ to obtain $T$ clusters. We use K-means with $K = T$. Each cluster corresponds to a collection of part instances with consistent relative locations, and hence, consistent orientations by our arguments above. We define the type labels for parts $t_i^n$ based on cluster membership. We show example results in Fig.3.
Partial supervision: Because part type is derived heuristically above, one could treat $t_i^n$ as a latent variable that is also optimized during learning. This latent SVM problem can be solved by coordinate descent [9] or the CCP algorithm [34]. We performed some initial experiments with latent updating of part types using the coordinate descent framework of [9], but we found that type labels tend not to change over iterations. We leave such partially-supervised learning as interesting future work.
Problem size: On our training datasets, the number of positive examples varies from 200-1000 and the number of negative images is roughly 1000. We treat each possible placement of the root on a negative image as a unique negative example $x_n$, meaning we have millions of negative constraints. Furthermore, we consider models with hundreds of thousands of parameters. We found that a carefully optimized solver was necessary to manage learning at this scale.
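The type-derivation step of Sec. 5.1 (clustering each part's position relative to its parent with K-means, $K = T$) can be sketched as follows. This is a plain Lloyd's-algorithm sketch under stated assumptions, not the authors' code: the function name `derive_types`, the `(N, K, 2)` array layout, the random initialization, and the choice to leave the parentless root at type 0 are all assumptions made for illustration.

```python
import numpy as np

def derive_types(joints, parent, T, iters=50, seed=0):
    """Assign a type label to every part instance by clustering relative positions.

    joints : (N, K, 2) array of labeled joint positions for N training images
    parent : parent[i] is the parent index of part i, or -1 for the root
    T      : number of types (clusters) per part; requires N >= T
    Returns an (N, K) integer array of type labels in {0, ..., T-1}.
    """
    rng = np.random.default_rng(seed)
    N, K, _ = joints.shape
    types = np.zeros((N, K), dtype=int)
    for i in range(K):
        if parent[i] < 0:
            continue                                   # assumption: root keeps type 0
        rel = joints[:, i] - joints[:, parent[i]]      # (N, 2) positions relative to parent
        centers = rel[rng.choice(N, size=T, replace=False)].astype(float)
        for _ in range(iters):
            dist = ((rel[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            labels = dist.argmin(axis=1)               # nearest center for every instance
            new = np.array([rel[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(T)])
            if np.allclose(new, centers):              # converged: labels match the centers
                break
            centers = new
        types[:, i] = labels
    return types
```

Each resulting cluster groups instances whose offset from the parent is similar, which is the sense in which types are expected to capture orientation.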
6. Experimental Results
Datasets: We evaluate results using the Image Parse
dataset [23] and the Buffy dataset [10]. The Parse set con-
tains 305 pose-annotated images of highly-articulated full-
body images of human poses. The Buffy dataset contains
748 pose-annotated video frames over 5 episodes of a TV
show. Both datasets include a standard train/test split, and
a standardized evaluation protocol based on the probability
of a correct pose (PCP), which measures the percentage of
correctly localized body parts. Notably, Buffy is also distributed with a set of validated detection windows returned by an upper-body person detector run on the testset. Most previous work reports results on this set, as do we. Since our
model also serves as a person detector, we can also present
PCP results on the full Buffy testset. To train our models,
we use the negative training images from the INRIAPerson
database [3] as our negative training set. These images tend
to be outdoor scenes that do not contain people.
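For readers unfamiliar with PCP, the sketch below shows one way such a score could be computed. The text above only states that PCP measures the percentage of correctly localized body parts; the correctness criterion used here (both endpoints of a predicted limb within half the ground-truth limb length of the annotated endpoints) is an assumption borrowed from common practice and may differ from the standardized evaluation code distributed with the datasets. The function name and array layout are likewise hypothetical.

```python
import numpy as np

def pcp(pred, gt, thresh=0.5):
    """Percentage of limbs ("sticks") counted as correctly localized.

    pred, gt : (N, S, 2, 2) arrays: N images, S sticks, 2 endpoints, (x, y)
    thresh   : assumed tolerance as a fraction of the ground-truth stick length
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    length = np.linalg.norm(gt[:, :, 0] - gt[:, :, 1], axis=-1)   # (N, S) stick lengths
    err = np.linalg.norm(pred - gt, axis=-1)                      # (N, S, 2) endpoint errors
    correct = (err <= thresh * length[..., None]).all(axis=-1)    # both endpoints must be close
    return 100.0 * correct.mean()
```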
Models: We define a full-body skeleton for the Parse
set, and an upper-body skeleton for the Buffy set. To define
a fully labeled dataset of part locations and types, we group
parts into orientations based on their relative location with
respect to their parents (as described in Sec 5.1). We show
clustering results in Fig.3. We use the derived type labels
to construct a fully supervised dataset, from which we learn
flexible mixtures of parts. We show the full-body model
learned on the Parse dataset in Fig.5. We set all parts to
be 5 × 5 HOG cells in size. To visualize the model, we
show 4 trees generated by selecting one of the four types of
each part, and placing it at its maximum-scoring position.
Recall that each part type has its own appearance template
and spring encoding its relative location with respect to its
parent. This is because we expect part types to correspond
to orientation because of the supervised labeling shown in
Fig.3. Though we visualize 4 trees, we emphasize that there
exists an exponential number of trees that our model can
generate by composing different part types together.
Structure: We consider the effect of varying T (the
number of mixtures or types) and K (number of parts)
on the accuracy of pose estimation on the Parse dataset in
Fig.4. We experiment with a 14 part model defined at 14
joint positions (shoulder, elbow, hand, etc.) and a 27 part
model where midway points between limbs are added (mid-
upper arm, mid-lower arm, etc.) to increase coverage. Per-
formance increases with denser coverage and an increased
number of part types, presumably because additional orien-
tations are being captured. For reference, we also trained a
star model, but saw inferior performance compared to the
tree models shown in Fig.4. We saw a slight improvement
using a variable number of mixtures (5 or 6) per part, tuned
by cross validation. These are the results presented below.
Detection accuracy: We use our model as an upper
body detector on the Buffy dataset in Table 1. We correctly
detect 99.6% of the people in the testset. The dataset includes two alternate detectors based on a rigid HOG template and a mixture of star models [9], which perform at 85% and 94%, respectively. The latter is widely regarded
as a state-of-the-art system for object recognition. These
results indicate the potential of our representation and su-
pervised learning framework for general object detection.
Parse: We give quantitative results for PCP in Table 2,
and show example images in Fig.6. We refer the reader to
the captions for a detailed analysis, but our method outperforms all previously published results by a significant margin. Notably, all previous work uses articulated parts. We reduce error by 25%. We believe our high performance is due to the fact that our models leverage orientation-specific
statistics (Fig.2), and because parts and relations are simul-
taneously learned in a discriminative framework. In con-
trast, articulated models are often learned in stages (using
pre-trained, orientation-invariant part detectors) due to the
computational burden of inference.
Buffy: We give quantitative results for PCP in Table 3, and show example images in Fig.7. We refer the reader to the captions for a detailed analysis, but we outperform all past approaches, when evaluated on a subset of standardized windows or the entire testset. Notably, all previous approaches use articulated parts. Our algorithm is several orders of magnitude faster than the next-best approaches of [26, 27]. When evaluated on the entire testset, our approach reduces error by 54%.

Upper body detection on Buffy
Rigid HOG [10]    Mixtures of Def. Parts [9]    Us
85.1              93.8                          99.6
Table 1: Our model clearly outperforms past approaches for upper body detection. Notably, [9] use a star-structured model of HOG templates trained with weakly-supervised data. Our results suggest more complex object structure, when learned with supervision, can yield improved results for detection.

[Figure 4 plot: "Performance vs number of types per part"; horizontal axis: number of types per part (1 to 6); vertical axis: PCP (60 to 74); curves: 27-part model and 14-part model]
Figure 4: We show the effect of model structure on pose estimation by evaluating PCP performance on the Parse dataset. Overall, increasing the number of parts (by instancing parts at limb midpoints in addition to joints) improves performance. For both cases, increasing the number of mixture components improves performance, likely due to the fact that more orientations can be modeled.
Conclusion: We have described a simple, but flexible
extension of tree-based models of part mixtures. When part
mixture models correspond to part orientations, our representation can model articulation with greater speed and accuracy than classic approaches. Our representation provides
a general framework for modeling co-occurrence relations
between mixtures of parts as well as classic spatial relations
between the location of parts. We show that such relations
capture notions of local rigidity. We are applying this ap-
proach to the task of general object detection, but have al-
ready demonstrated impressive results for the challenging
task of human pose estimation.
Acknowledgements: Funding for this research was pro-
vided by NSF Grant 0954083, ONR-MURI Grant N00014-
10-1-0933, and support from Google and Intel.
References
[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In Proc. CVPR, volume 1, page 4, 2009.
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In International Conference on Computer Vision (ICCV), 2009.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages I: 886–893, 2005.