Pose-Sensitive Embedding by Nonlinear NCA Regression.
Graham W. Taylor, Rob Fergus, George Williams, Ian Spiro and Christoph Bregler
Courant Institute of Mathematics, New York University
New York, USA 10003
gwtaylor,fergus,spiro,bregler@cs.nyu.edu
Abstract
This paper tackles the complex problem of visually matching people in similar
pose but with different clothes, background, and other appearance changes. We
achieve this with a novel method for learning a nonlinear embedding based on
several extensions to the Neighborhood Component Analysis (NCA) framework.
Our method is convolutional, enabling it to scale to realistically-sized images. By
cheaply labeling the head and hands in large video databases through Amazon
Mechanical Turk (a crowd-sourcing service), we can use the task of localizing
the head and hands as a proxy for determining body pose. We apply our method
to challenging real-world data and show that it can generalize beyond hand localization
to infer a more general notion of body pose. We evaluate our method
quantitatively against other embedding methods. We also demonstrate that real-world
performance can be improved through the use of synthetic data.
1 Introduction
Determining the pose of a human body from one or more images is a central problem in Computer
Vision. The complex, multi-jointed nature of the body makes the determination of pose challenging,
particularly in natural settings where ambiguous and unusual configurations may be observed. The
ability to localize the hands is particularly important: they provide tight constraints on the layout of
the upper body, yielding a strong cue as to the action and intent of a person.
A huge range of techniques, both parametric and non-parametric, exist for inferring body pose from
2D images and 3D datasets [10, 39, 4, 28, 33, 8, 3, 6, 11]. We propose a non-parametric approach to
estimating body pose by localizing the hands using a parametric, nonlinear multi-layered embedding
of the raw pixel images. Unlike many other metric learning approaches, ours is designed for use with
real-world images, having a convolutional architecture that scales gracefully to large images and is
invariant to local geometric distortions.
Figure 1: Query image (in left column) and the eight nearest neighbours found by our method.
Distance in the learned embedded space is shown bottom right. Matches are based on the location
of the hands, and more generally body pose, not the individual or the background.
Our embedding, trained on both real and synthetic data, is a functional mapping that projects images
with similar head and hand positions to lie close-by in a low-dimensional output space. Efficient
nearest-neighbour search can then be performed in this space to find images in a large training
corpus that have similar pose. Specifically for this task, we have designed an interface to obtain
and verify head and hand labels for thousands of frames through Amazon Mechanical Turk with
minimal user intervention. We find that our method is able to cope with the terse and noisy labels
provided by crowd-sourcing. It succeeds in generalizing to body and hand pose when such cues are
not explicitly provided in the labels (see Fig. 1).
2 Related work
Our application domain is related to several approaches in the computer vision literature that propose
hand or body pose tracking. Many techniques rely on sliding-window part detectors based on color
and other features applied to controlled recording conditions ([10, 39, 4, 28] to name a few, we
refer to [32] for a complete survey). In our domain, hands might only occupy a few pixels, and the
only body-part that can reliably be detected is the human face ([26, 13]). Many techniques have
been proposed that extract, learn, or reason over entire body features. Some use a combination of
local detectors and structural reasoning (see [33] for coarse tracking and [8] for person-dependent
tracking). In a similar spirit, more general techniques using pictorial structures [3, 12, 35], “poselets”
[6], and other part-models [11] have received increased attention. An entire new stream of kinematic
model-based techniques based on the HumanEva dataset has been proposed [37], but this area differs
from our domain in that the images considered are of higher quality and less cluttered.
More closely related to our task are nearest-neighbour and locally-weighted regression-based techniques.
Some extract “shape-context” edge-based histograms from the human body [25, 1] or just
silhouette features [15]. Shakhnarovich et al. [36] use HOG [9] features and boosting for learning
a parameter-sensitive hash function. All these approaches rely on good background subtraction
or recordings with clear backgrounds. Our domain contains clutter, lighting variations and low
resolution such that it is impossible to separate body features from background successfully. We
instead learn relevant features directly from pixels (rather than pre-coded edge or gradient histogram
features), and implicitly discover background invariance from training data.
Several other works [36, 9, 4, 15] have used synthetically created data as a training set. We show in
this paper several experiments with challenging real video (with crowd-sourced Amazon Mechanical
Turk labels), synthetic training data, and hybrid datasets. Our final system (after training) is always
applied to the cluttered non-background subtracted real video input without any labels.
Our technique is also related to distance metric learning, an important area of machine learning
research, especially due to recent interest in analyzing complex high-dimensional data. A subset
of approaches for dimensionality reduction [17, 16] implicitly learn a distance metric by learning
a function (mapping) from high-dimensional (i.e. pixel) space to low-dimensional “feature” space
such that perceptually similar observations are mapped to nearby points on a manifold. Neighbourhood
Components Analysis (NCA) [14] proposes a solution where the transformation from input
to feature space is linear and the distance metric is Euclidean. NCA learns the transformation that
is optimal for performing KNN in the feature space. NCA has also been recently extended to the
nonlinear case [34] using MNIST class labels and to linear 1D regression for reinforcement learning
[20]. Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) [16] also learns a nonlinear
mapping. Like NCA, DrLIM uses class neighbourhood structure to drive the optimization:
observations with the same class label are driven to be close-by in feature space. Our approach
is also inspired by recent hashing methods [2, 34, 38], although those techniques are restricted to
binary codes for fast lookup.
3 Learning an invariant mapping by nonlinear embedding
We first discuss Neighbourhood Components Analysis [14] and its nonlinear variants. We then propose
an alternative objective function optimized for performing nearest neighbour (NN) regression
rather than classification. Next, we describe our convolutional architecture which maps images from
high-dimensional to low-dimensional space. Finally we introduce a related but different objective
for our model based on DrLIM.
3.1 Neighbourhood Components Analysis
NCA (both linear and nonlinear) and DrLIM do not presuppose the existence of a meaningful and
computable distance metric in the input space. They only require that neighbourhood relationships
be defined between training samples. This is well-suited for learning a metric for non-parametric
classification (e.g. KNN) on high-dimensional data. If the original data does not contain discrete
class labels, but real-valued labels (e.g. pose information for images of people) one alternative is to
define neighbourhoods based on the distance in the real-valued label space and proceed as usual.
However, if classification is not our ultimate goal, we may wish to exploit the “soft” nature of the
labels and use an alternative objective (i.e. one that does not optimize KNN performance).
Suppose we are given a set of N labeled training cases {x_i, y_i}, i = 1, 2, ..., N, where x_i ∈ R^D
and y_i ∈ R^L. Each training point, i, selects another point, j, as its neighbour with some probability
defined by normalizing distances in the transformed feature space [14]:
$$p_{ij} = \frac{\exp(-d_{ij}^2)}{\sum_{k \neq i} \exp(-d_{ik}^2)}, \qquad p_{ii} = 0, \qquad d_{ij} = \|z_i - z_j\|_2 \tag{1}$$
where we use a Euclidean distance metric dij and zi = f(xi|θ) is the mapping (parametrized
by θ) from input space to feature space. For NCA this is typically linear, but it can be extended
to be nonlinear through back-propagation (for example in [34] it is a multi-layer neural network).
NCA assumes that the labels, yi, are discrete yi∈ 1,2,...,C rather than real-valued and seeks to
maximize the expected number of correctly classified points on the training data which minimizes:
$$L_{\mathrm{NCA}} = -\sum_{i=1}^{N} \sum_{j : y_i = y_j} p_{ij}. \tag{2}$$
The parameters are found by minimizing L_NCA with respect to θ; back-propagating in the case of
a multi-layer parametrization. Instead of seeking to optimize KNN classification performance, we
can use the NCA regression (NCAR) objective [20]:
$$L_{\mathrm{NCAR}} = \sum_{i=1}^{N} \sum_{j \neq i} p_{ij} \|y_i - y_j\|_2^2. \tag{3}$$
Intuitively, this states that if, with high probability, i and j are neighbours in feature space, then
they should also lie close-by in label space. While we use the Euclidean distance in label space, our
approach generalizes to other metrics which may be more appropriate for a different domain.
Keller et al. [20] consider the linear case of NCAR, where θ is a weight matrix and y is a scalar
representing Bellman error to map states with similar Bellman errors close together. Similar to
NCA, we can extend this objective to the nonlinear, multi-layer case. We simply need to compute
the derivative of L_NCAR with respect to the output of the mapping, z_i, and backpropagate through
the remaining layers of the network. The gradient can be computed efficiently as:
$$\frac{\partial L_{\mathrm{NCAR}}}{\partial z_i} = -2 \sum_{j \neq i} (z_i - z_j) \left[ p_{ij} \left( y_{ij}^2 - \delta_i \right) + p_{ji} \left( y_{ij}^2 - \delta_j \right) \right], \tag{4}$$
where we use y_{ij}^2 = ||y_i − y_j||_2^2 and δ_i = Σ_j p_{ij} y_{ij}^2. See the supplementary material for details.
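The following is a minimal NumPy sketch of Eqs. 1, 3 and 4, added here for illustration; it is not the authors' implementation. The function names (`pairwise_sq_dists`, `neighbour_probs`, `ncar_loss_and_grad`) and the random toy data are assumptions. It treats the codes z_i as free variables and compares the analytic gradient of Eq. 4 against central finite differences of Eq. 3.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all pairs of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def neighbour_probs(Z):
    """Eq. 1: row-wise softmax of -d_ij^2 with p_ii forced to zero."""
    sq = pairwise_sq_dists(Z)
    np.fill_diagonal(sq, np.inf)                  # exp(-inf) = 0, so p_ii = 0
    logits = -sq
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def ncar_loss_and_grad(Z, Y):
    """Eq. 3 (loss) and Eq. 4 (gradient with respect to each code z_i)."""
    P = neighbour_probs(Z)
    Y2 = pairwise_sq_dists(Y)                     # y_ij^2
    loss = np.sum(P * Y2)
    delta = np.sum(P * Y2, axis=1)                # delta_i = sum_j p_ij y_ij^2
    A = P * (Y2 - delta[:, None])                 # p_ij (y_ij^2 - delta_i)
    W = A + A.T                                   # adds p_ji (y_ij^2 - delta_j)
    # sum_j (z_i - z_j) W_ij = z_i * sum_j W_ij - sum_j W_ij z_j
    grad = -2.0 * (W.sum(axis=1, keepdims=True) * Z - W @ Z)
    return loss, grad

# Toy check on random codes and labels (hypothetical data).
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))
Y = rng.normal(size=(6, 2))
loss, grad = ncar_loss_and_grad(Z, Y)

eps = 1e-6
numeric = np.zeros_like(Z)
for idx in np.ndindex(*Z.shape):
    Zp, Zm = Z.copy(), Z.copy()
    Zp[idx] += eps
    Zm[idx] -= eps
    numeric[idx] = (ncar_loss_and_grad(Zp, Y)[0] - ncar_loss_and_grad(Zm, Y)[0]) / (2 * eps)
max_err = np.max(np.abs(numeric - grad))
```

In a multi-layer model, `grad` would be the signal passed back from the output layer through the remaining layers; the sketch stops at the codes themselves.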
3.2 Convolutional architectures
As [34] points out, nonlinear NCA was originally proposed in [14] but with the exception of a
modest success with a two-layer network in extracting 2D codes that explicitly represented the
size and orientation of face images, attempts to extract more complex properties using multi-layer
feature extraction were less successful. This was due, in part, to the difficulty in training multi-layer
networks and the fact that many data pairs are required to fit the large number of network parameters.
Though both [34] and [38] were successful in learning a multi-layer nonlinear mapping of the data,
there is still a fundamental limitation of using fully-connected networks that must be addressed.
Such an architecture can only be applied to relatively small image patches (typically less than
64×64 pixels), because it does not scale well with the size of the input. Salakhutdinov and Hinton
escaped this issue by training only on the MNIST dataset (28×28 images of digits) and Torralba
et al. used a global image descriptor [29] as an initial feature representation rather than pixels.
However, to avoid such hand-crafted features which may not be suitable for the task, and to scale to
realistic sized inputs, models should take advantage of the pictorial nature of the image input. This is
addressed by convolutional architectures [21], which exploit the fact that salient motifs can appear
anywhere in the image. By employing successive stages of weight-sharing and feature-pooling,
deep convolutional architectures can achieve stable latent representations at each layer, that preserve
locality, provide invariance to small variations of the input, and drastically reduce the number of free
parameters.
Our proposed method, which we call Convolutional NCA regression (C-NCAR), is based on a standard
convolutional architecture [21, 18]: alternating convolution and subsampling layers followed
by a single fully-connected layer (see Fig. 2). It differs from typical convolutional nets in the objective
function with which it is trained (i.e. minimizing Eq. 3). Because the loss is defined on pairs of
examples, we use a siamese network [5]. Pairs of frames are processed by separate networks with
equal weights. The loss is then computed on the output of both networks. Hadsell et al. [16] also
use a siamese convolutional network with yet a different objective. They use their method for visualization
but not for any discriminative task. Mobahi et al. [24] have also recently used a convolutional
siamese network in which temporal coherence between pairs of frames drives the regularization of
the model rather than the objective. More details of training our network are given in Sec. 4.
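[Figure 2 diagram: two inputs, x_i and x_j, pass through identical networks. Input: 128×128 → Layer 1: 16×120×120 (convolutions, tanh(), abs()) → Layer 2: 16×24×24 (average pooling) → Layer 3: 32×16×16 (convolutions, tanh(), abs()) → Layer 4: 32×4×4 (average pooling) → Output: 32×1×1 (fully connected). The distance d(z_i, z_j) is computed between the two output codes.]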
Figure 2: Convolutional NCA regression (C-NCAR). Each image is processed by two convolutional
and subsampling layers and one fully-connected layer. A loss (Eq. 3) computed on the distance
between resulting codes drives parameter learning.
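The layer sizes listed in Fig. 2 can be reproduced with simple shape arithmetic. The sketch below is illustrative only: the 9×9 filter size and the 5×5 and 4×4 pooling windows are assumptions inferred from the listed feature-map sizes under "valid" convolution and non-overlapping pooling; the text does not state them. The function names are hypothetical.

```python
def valid_conv(size, kernel):
    """Output width of a 'valid' (no padding, stride 1) convolution."""
    return size - kernel + 1

def pool(size, window):
    """Output width of non-overlapping pooling; the window must tile the input."""
    if size % window != 0:
        raise ValueError("pooling window does not tile the input evenly")
    return size // window

def c_ncar_shapes(input_size=128, conv1=9, pool1=5, conv2=9, pool2=4,
                  maps1=16, maps2=32, code_dim=32):
    """Feature-map shapes (maps, height, width) for each stage in Fig. 2.

    Kernel and window sizes are assumed values chosen to match the figure.
    """
    s1 = valid_conv(input_size, conv1)   # Layer 1: convolution
    s2 = pool(s1, pool1)                 # Layer 2: average pooling
    s3 = valid_conv(s2, conv2)           # Layer 3: convolution
    s4 = pool(s3, pool2)                 # Layer 4: average pooling
    return [(maps1, s1, s1), (maps1, s2, s2),
            (maps2, s3, s3), (maps2, s4, s4),
            (code_dim, 1, 1)]            # Output: fully-connected code

shapes = c_ncar_shapes()
```

In a siamese setup, both images of a pair would pass through the same stack with shared weights, and the loss would be computed on the two resulting 32-dimensional codes.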
3.3 Adding a contrastive loss function
Like NCA, DrLIM assumes a discrete notion of similarity or dissimilarity between data pairs, x_i
and x_j. It defines both a “similarity” loss, L_S, which penalizes similar points which are far apart
in code space, and a “dissimilarity” loss, L_D, which penalizes dissimilar points which lie within a
user-defined margin, m, of each other:
$$L_S(x_i, x_j) = \frac{1}{2} d_{ij}^2, \qquad L_D(x_i, x_j) = \frac{1}{2} \{\max(0, m - d_{ij})\}^2, \tag{5}$$
where d_ij is given by Eq. 1. Let γ_ij be an indicator such that γ_ij = 1 if x_i and x_j are deemed
similar and γ_ij = 0 if x_i and x_j are deemed dissimilar. For example, if labels y_i are discrete,
y_i ∈ {1, 2, ..., C}, then γ_ij = 1 for y_i = y_j and γ_ij = 0 otherwise. The total loss is defined by:
$$L_{\mathrm{DrLIM}} = \sum_{i=1}^{N} \sum_{j \neq i} \left[ \gamma_{ij} L_S(x_i, x_j) + (1 - \gamma_{ij}) L_D(x_i, x_j) \right]. \tag{6}$$
When faced with real-valued labels, y_i, we can avoid explicitly defining similarity and dissimilarity
(e.g. via thresholding) by defining a “soft” notion of similarity:
$$\hat{\gamma}_{ij} = \frac{\exp(-\|y_i - y_j\|_2^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|_2^2)}. \tag{7}$$
Replacing the indicator variables γ_ij with γ̂_ij in Eq. 6 yields what we call the soft DrLIM loss.
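As a rough illustration of the soft DrLIM loss (Eqs. 5 to 7), the sketch below evaluates it on a small batch of codes and labels. It is a sketch under stated assumptions, not the authors' code: the function names and toy inputs are hypothetical, and the default margin m = 1.25 is the value the experiments section reports using.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all pairs of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def soft_gamma(Y):
    """Eq. 7: soft similarity from label distances; each row sums to 1 over j != i."""
    sq = pairwise_sq_dists(Y)
    np.fill_diagonal(sq, np.inf)                  # exclude k = i from the normalizer
    logits = -sq
    logits = logits - logits.max(axis=1, keepdims=True)
    G = np.exp(logits)
    return G / G.sum(axis=1, keepdims=True)

def soft_drlim_loss(Z, Y, m=1.25):
    """Eq. 6 with the indicator replaced by the soft similarity of Eq. 7."""
    G = soft_gamma(Y)
    D = np.sqrt(pairwise_sq_dists(Z))             # d_ij in code space
    L_S = 0.5 * D ** 2                            # similarity term of Eq. 5
    L_D = 0.5 * np.maximum(0.0, m - D) ** 2       # dissimilarity term of Eq. 5
    per_pair = G * L_S + (1.0 - G) * L_D
    np.fill_diagonal(per_pair, 0.0)               # the sum in Eq. 6 runs over j != i
    return float(per_pair.sum())

# Toy example: four identical codes, so only the margin term contributes.
Z_same = np.zeros((4, 2))
Y_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
loss_same = soft_drlim_loss(Z_same, Y_toy)
```

With all codes collapsed to one point, every pair lies inside the margin, so the loss reduces to 0.5 m² times the sum over pairs of (1 − γ̂_ij); this is the contrastive pressure that pushes dissimilar points apart.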
4 Experimental results
We evaluate our approach in real and synthetic environments by performing 1-nearest neighbour
(NN) regression using a variety of standard and learned metrics described below. For every query
image in a test set, we compute its distance (under the metric) to each of the training points in a
database. We then copy the label (e.g. (x,y) position of the head and hands) of the neighbour to the
query example. For evaluation, we compare the ground-truth label of the query to the label of the
nearest neighbour. Errors are reported in terms of mean pixel error over each query and each marker:
the head (if it is tracked) and each hand. Errors are absolute with respect to the original image size.
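The evaluation protocol described above can be sketched as follows. This is a minimal illustration, assuming that labels are stored as flattened (x, y) pairs per marker and that the per-marker error is the Euclidean distance in pixels; both are assumptions, as are the function names and the toy data.

```python
import numpy as np

def one_nn_regress(train_codes, train_labels, query_codes):
    """Copy to each query the label of its nearest training point (Euclidean)."""
    diff = query_codes[:, None, :] - train_codes[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)               # (num_queries, num_train)
    nearest = np.argmin(d2, axis=1)
    return train_labels[nearest]

def mean_pixel_error(pred, truth):
    """Mean Euclidean error over every query and every (x, y) marker."""
    p = pred.reshape(len(pred), -1, 2)
    t = truth.reshape(len(truth), -1, 2)
    return float(np.mean(np.linalg.norm(p - t, axis=-1)))

# Toy database: two training points with 6D labels (head and two hands).
train_codes = np.array([[0.0, 0.0], [10.0, 10.0]])
train_labels = np.array([[0.0, 0.0, 10.0, 10.0, 20.0, 20.0],
                         [100.0, 100.0, 110.0, 110.0, 120.0, 120.0]])
query_codes = np.array([[1.0, 1.0]])
query_truth = np.array([[3.0, 4.0, 10.0, 10.0, 20.0, 20.0]])

pred = one_nn_regress(train_codes, train_labels, query_codes)
err = mean_pixel_error(pred, query_truth)
```

The same two functions apply to any of the metrics compared below: only the representation passed in as `train_codes` and `query_codes` changes (raw pixels, GIST, or learned codes).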
We acknowledge that improved results could potentially be obtained by using more than one neighbour
or with more sophisticated techniques such as locally weighted regression [36]. However, we
focus on learning a good metric for performing this task rather than the regression problem. The
approaches compared are:
Pixel distance can be used to find nearest neighbours though it is not practical in real situations due
to the intractability of computing distances in such a high-dimensional space.
GIST descriptors [29] are a global representation of image content. We are motivated to use GIST
by its previous use in nonlinear NCA for image retrieval [38]. The resulting image representation
is a length-512 vector. We note that this is still too large for efficient NN search and that the GIST
features are not domain-adaptive.
Linear NCA regression (NCAR) is described in Section 3. We pre-compute GIST for each image
and use that as our input representation. We learn a 512×32 matrix of weights by minimizing
Eq. 3 using nonlinear conjugate gradients with randomly sampled mini-batches of size 512. We
perform three line-searches per mini-batch and stop learning after 500 mini-batches. We found that
our results slightly improved when we applied a form of local contrast normalization (LCN) prior
to computing GIST. Each pixel’s response was normalized by the integrated response of a 9×9
window of neighbouring pixels. For more details see [30].
Convolutional NCA regression (C-NCAR) See Fig. 2 for a summary of our architecture. Images
are pre-processed using LCN. Convolutions are followed by pixel-wise tanh and absolute value
rectification. The abs prevents cancellations in local neighbourhoods during average downsampling
[18]. Our architectural parameters (size of filters, number of filter banks, etc.) are chosen to produce
a 32-dimensional output. Derivations of parameter updates are presented as supplementary material.
Soft DrLIM (S-DrLIM) and Convolutional soft DrLIM (CS-DrLIM) We also experiment with a
variant of an alternative, energy-based method that adds an explicit contrastive loss to the objective
rather than implicitly through normalization. The contrastive loss only operates on dissimilar points
which lie within a specified margin, m, of each other. We use m = 1.25 as suggested by [16].
In both the linear and nonlinear cases, the architecture and training procedure remain the same as
NCAR and C-NCAR, respectively. We use a different objective: minimizing Eq. 6 with respect to
the parameters.
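The local contrast normalization (LCN) step mentioned in the list above divides each pixel's response by the integrated response of a 9×9 window of neighbours. The sketch below is one possible reading of that description, not the procedure of [30]: taking the "integrated response" to be the sum of absolute values, replicating edge pixels at the border, and the small `eps` term are all assumptions, as are the function names.

```python
import numpy as np

def box_sum(img, k):
    """Sum over the k x k window centred on each pixel (edges replicated)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    ii = np.pad(ii, ((1, 0), (1, 0)))
    h, w = img.shape
    return (ii[k:k + h, k:k + w] - ii[:h, k:k + w]
            - ii[k:k + h, :w] + ii[:h, :w])

def local_contrast_normalize(img, k=9, eps=1e-8):
    """Divide each pixel by the summed absolute response of its k x k window."""
    img = np.asarray(img, dtype=float)
    return img / (box_sum(np.abs(img), k) + eps)

# Toy check: on a constant image every 9 x 9 window sums to 81.
flat = np.ones((12, 12))
normalized = local_contrast_normalize(flat)
```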
4.1 Estimating 2D head and hand pose from synthetic data
We extracted 10,000 frames of training data and 5,000 frames of test data from Poser renderings
of several hours of real motion capture data. Our synthetic data is similar to that considered in [36];
however, we use a variety of backgrounds rather than a constant background. Furthermore, subjects
are free to move around the frame and are rendered at various scales. The training set contains 6
different characters superimposed on 9 different backgrounds. The test set contains 6 characters and
8 backgrounds not present in the training set. The inputs, x, are 320×240 images, and the labels,
y, are 6D vectors - the true (x,y) locations of the head and hands.
Results are shown in Table 1 (column SY). Simple linear NCAR performs well compared to the
baselines, while our nonlinear methods C-NCAR and CS-DrLIM (which are not restricted to the
GIST descriptor) significantly outperform all other approaches. Pixel-based matching (though extremely
slow) does surprisingly well. This is perhaps an artifact of the synthetic data.
4.2 Estimating 2D hand pose from real video
We digitally recorded all of the contributing and invited speakers at the Learning Workshop (Snowbird)
held in April 2010. The set consisted of 30 speakers, with talks ranging from 10-40 minutes
each. After each session of talks, blocks of 150 frames were distributed as Human Intelligence Tasks