
Pose-Sensitive Embedding

by Nonlinear NCA Regression

Graham W. Taylor, Rob Fergus, George Williams, Ian Spiro and Christoph Bregler

Courant Institute of Mathematical Sciences, New York University

New York, USA 10003

{gwtaylor, fergus, spiro, bregler}@cs.nyu.edu

Abstract

This paper tackles the complex problem of visually matching people in similar

pose but with different clothes, background, and other appearance changes. We

achieve this with a novel method for learning a nonlinear embedding based on

several extensions to the Neighborhood Component Analysis (NCA) framework.

Our method is convolutional, enabling it to scale to realistically-sized images. By

cheaply labeling the head and hands in large video databases through Amazon

Mechanical Turk (a crowd-sourcing service), we can use the task of localizing

the head and hands as a proxy for determining body pose. We apply our method

to challenging real-world data and show that it can generalize beyond hand lo-

calization to infer a more general notion of body pose. We evaluate our method

quantitatively against other embedding methods. We also demonstrate that real-

world performance can be improved through the use of synthetic data.

1 Introduction

Determining the pose of a human body from one or more images is a central problem in Computer

Vision. The complex, multi-jointed nature of the body makes the determination of pose challenging,

particularly in natural settings where ambiguous and unusual configurations may be observed. The

ability to localize the hands is particularly important: they provide tight constraints on the layout of

the upper body, yielding a strong cue as to the action and intent of a person.

A huge range of techniques, both parametric and non-parametric, exist for inferring body pose from

2D images and 3D datasets [10, 39, 4, 28, 33, 8, 3, 6, 11]. We propose a non-parametric approach to

Figure 1: Query image (left column) and the eight nearest neighbours found by our method. Distance in the learned embedded space is shown at the bottom right of each image. Matches are based on the location of the hands, and more generally on body pose, not on the individual or the background.

estimating body pose by localizing the hands using a parametric, nonlinear multi-layered embedding

of the raw pixel images. Unlike many other metric learning approaches, ours is designed for use with

real-world images, having a convolutional architecture that scales gracefully to large images and is

invariant to local geometric distortions.

Our embedding, trained on both real and synthetic data, is a functional mapping that projects images

with similar head and hand positions to lie close-by in a low-dimensional output space. Efficient

nearest-neighbour search can then be performed in this space to find images in a large training

corpus that have similar pose. Specifically for this task, we have designed an interface to obtain

and verify head and hand labels for thousands of frames through Amazon Mechanical Turk with

minimal user intervention. We find that our method is able to cope with the terse and noisy labels

provided by crowd-sourcing. It succeeds in generalizing to body and hand pose when such cues are

not explicitly provided in the labels (see Fig. 1).

2 Related work

Our application domain is related to several approaches in the computer vision literature that propose

hand or body pose tracking. Many techniques rely on sliding-window part detectors based on color

and other features applied to controlled recording conditions ([10, 39, 4, 28] to name a few, we

refer to [32] for a complete survey). In our domain, hands might only occupy a few pixels, and the

only body-part that can reliably be detected is the human face ([26, 13]). Many techniques have

been proposed that extract, learn, or reason over entire body features. Some use a combination of

local detectors and structural reasoning (see [33] for coarse tracking and [8] for person-dependent

tracking). Inasimilarspirit, moregeneraltechniquesusingpictorialstructures[3,12,35], “poselets”

[6], and other part-models [11] have received increased attention. An entire new stream of kinematic

model-based techniques based on the HumanEva dataset has been proposed [37], but this area differs

from our domain in that the images considered are of higher quality and less cluttered.

More closely related to our task are nearest-neighbour and locally-weighted regression-based tech-

niques. Some extract “shape-context” edge based histograms from the human body [25, 1] or just

silhouette features [15]. Shakhnarovich et al. [36] use HOG [9] features and boosting for learn-

ing a parameter sensitive hash function. All these approaches rely on good background subtraction

or recordings with clear backgrounds. Our domain contains clutter, lighting variations and low

resolution such that it is impossible to separate body features from background successfully. We

instead learn relevant features directly from pixels (instead of pre-coded edge or gradient histogram

features), and implicitly discover background invariance from the training data.

Several other works [36, 9, 4, 15] have used synthetically created data as a training set. We show in

this paper several experiments with challenging real video (with crowd-sourced Amazon Mechanical

Turk labels), synthetic training data, and hybrid datasets. Our final system (after training) is always

applied to the cluttered non-background subtracted real video input without any labels.

Our technique is also related to distance metric learning, an important area of machine learning

research, especially due to recent interest in analyzing complex high-dimensional data. A subset

of approaches for dimensionality reduction [17, 16] implicitly learn a distance metric by learning

a function (mapping) from high-dimensional (i.e. pixel) space to low-dimensional “feature” space

such that perceptually similar observations are mapped to nearby points on a manifold. Neighbour-

hood Components Analysis (NCA) [14] proposes a solution where the transformation from input

to feature space is linear and the distance metric is Euclidean. NCA learns the transformation that

is optimal for performing KNN in the feature space. NCA has also been recently extended to the

nonlinear case [34] using MNIST class labels and to linear 1D regression for reinforcement learning

[20]. Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) [16] also learns a non-

linear mapping. Like NCA, DrLIM uses class neighbourhood structure to drive the optimization:

observations with the same class label are driven to be close-by in feature space. Our approach

is also inspired by recent hashing methods [2, 34, 38], although those techniques are restricted to

binary codes for fast lookup.

3 Learning an invariant mapping by nonlinear embedding

We first discuss Neighbourhood Components Analysis [14] and its nonlinear variants. We then pro-

pose an alternative objective function optimized for performing nearest neighbour (NN) regression

rather than classification. Next, we describe our convolutional architecture which maps images from


high-dimensional to low-dimensional space. Finally we introduce a related but different objective

for our model based on DrLIM.

3.1 Neighbourhood Components Analysis

NCA (both linear and nonlinear) and DrLIM do not presuppose the existence of a meaningful and

computable distance metric in the input space. They only require that neighbourhood relationships

be defined between training samples. This is well-suited for learning a metric for non-parametric

classification (e.g. KNN) on high-dimensional data. If the original data does not contain discrete

class labels, but real-valued labels (e.g. pose information for images of people) one alternative is to

define neighbourhoods based on the distance in the real-valued label space and proceed as usual.

However, if classification is not our ultimate goal, we may wish to exploit the “soft” nature of the

labels and use an alternative objective (i.e. one that does not optimize KNN performance).

Suppose we are given a set of N labeled training cases {xi,yi}, i = 1,2,...,N, where xi∈ RD,

and yi∈ RL. Each training point, i, selects another point, j, as its neighbour with some probability

defined by normalizing distances in the transformed feature space [14]:

p_{ij} = \frac{\exp(-d_{ij}^2)}{\sum_{k \neq i} \exp(-d_{ik}^2)}, \qquad p_{ii} = 0, \qquad d_{ij} = \|z_i - z_j\|_2 \quad (1)

where we use a Euclidean distance metric dij and zi = f(xi|θ) is the mapping (parametrized

by θ) from input space to feature space. For NCA this is typically linear, but it can be extended

to be nonlinear through back-propagation (for example in [34] it is a multi-layer neural network).
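For illustration, the neighbour-selection probabilities of Eq. 1 can be computed in a vectorized way from a batch of already-embedded points (a minimal NumPy sketch; the function name is ours, not from the paper):

```python
import numpy as np

def neighbour_probabilities(z):
    """Neighbour-selection probabilities p_ij of Eq. 1 for embedded points
    z (an N x d array): p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2),
    with p_ii = 0 and d_ij the Euclidean distance ||z_i - z_j||_2."""
    sq = (z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)  # squared pairwise distances
    d2 = np.maximum(d2, 0.0)                          # guard against round-off
    logits = -d2
    np.fill_diagonal(logits, -np.inf)                 # enforces p_ii = 0
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to one, with an exact zero on the diagonal, matching the definition above.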

NCA assumes that the labels, y_i, are discrete, y_i ∈ {1, 2, ..., C}, rather than real-valued, and seeks to maximize the expected number of correctly classified points on the training data, which corresponds to minimizing:

L_{NCA} = -\sum_{i=1}^{N} \sum_{j : y_i = y_j} p_{ij}. \quad (2)

The parameters are found by minimizing L_{NCA} with respect to θ, back-propagating in the case of a multi-layer parametrization. Instead of seeking to optimize KNN classification performance, we can use the NCA regression (NCAR) objective [20]:

L_{NCAR} = \sum_{i=1}^{N} \sum_{j \neq i} p_{ij} \|y_i - y_j\|_2^2. \quad (3)

Intuitively, this states that if, with high probability, i and j are neighbours in feature space, then they should also lie close-by in label space. While we use the Euclidean distance in label space, our approach generalizes to other metrics which may be more appropriate for a different domain.

Keller et al. [20] consider the linear case of NCAR, where θ is a weight matrix and y is a scalar representing Bellman error, in order to map states with similar Bellman errors close together. Similar to NCA, we can extend this objective to the nonlinear, multi-layer case. We simply need to compute the derivative of L_{NCAR} with respect to the output of the mapping, z_i, and backpropagate through the remaining layers of the network. The gradient can be computed efficiently as:

\frac{\partial L_{NCAR}}{\partial z_i} = -2 \sum_{j \neq i} (z_i - z_j) \left[ p_{ij} \left( y^2_{ij} - \delta_i \right) + p_{ji} \left( y^2_{ij} - \delta_j \right) \right], \quad (4)

where we use y^2_{ij} = \|y_i - y_j\|_2^2 and \delta_i = \sum_j p_{ij} y^2_{ij}. See the supplementary material for details.
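The objective can be made concrete with a short sketch that evaluates Eq. 3 directly from embeddings and labels (NumPy; the function name is ours). In practice the gradient of Eq. 4 is backpropagated through the network, but for checking the loss itself this direct form suffices:

```python
import numpy as np

def ncar_loss(z, y):
    """NCAR objective of Eq. 3: sum_i sum_{j != i} p_ij ||y_i - y_j||^2,
    where p_ij are the neighbour probabilities of Eq. 1.
    z: (N, d) embedded points; y: (N, L) real-valued labels."""
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # exp(-inf) = 0, so p_ii = 0
    p = np.exp(-d2)
    p /= p.sum(axis=1, keepdims=True)
    y2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return (p * y2).sum()
```

Note that the loss is zero exactly when all labels coincide, and grows as probable neighbours in feature space move apart in label space, which is the intuition stated above.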

3.2 Convolutional architectures

As [34] points out, nonlinear NCA was originally proposed in [14] but with the exception of a

modest success with a two-layer network in extracting 2D codes that explicitly represented the

size and orientation of face images, attempts to extract more complex properties using multi-layer

feature extraction were less successful. This was due, in part, to the difficulty in training multi-layer

networks and the fact that many data pairs are required to fit the large number of network parameters.

Though both [34] and [38] were successful in learning a multi-layer nonlinear mapping of the data,

there is still a fundamental limitation of using fully-connected networks that must be addressed.

Such an architecture can only be applied to relatively small image patches (typically less than 64×64 pixels), because they do not scale well with the size of the input. Salakhutdinov and Hinton


escaped this issue by training only on the MNIST dataset (28×28 images of digits) and Torralba

et al. used a global image descriptor [29] as an initial feature representation rather than pixels.

However, to avoid such hand-crafted features which may not be suitable for the task, and to scale to

realistic sized inputs, models should take advantage of the pictorial nature of the image input. This is

addressed by convolutional architectures [21], which exploit the fact that salient motifs can appear

anywhere in the image. By employing successive stages of weight-sharing and feature-pooling,

deep convolutional architectures can achieve stable latent representations at each layer, that preserve

locality, provide invariance to small variations of the input, and drastically reduce the number of free

parameters.

Our proposed method, which we call Convolutional NCA regression (C-NCAR), is based on a standard convolutional architecture [21, 18]: alternating convolution and subsampling layers followed

by a single fully-connected layer (see Fig. 2). It differs from typical convolutional nets in the objec-

tive function with which it is trained (i.e. minimizing Eq. 3). Because the loss is defined on pairs of

examples, we use a siamese network [5]. Pairs of frames are processed by separate networks with

equal weights. The loss is then computed on the output of both networks. Hadsell et al. [16] also

use a siamese convolutional network with yet another objective. They use their method for visualization but not for any discriminative task. Mobahi et al. [24] have also recently used a convolutional

siamese network in which temporal coherence between pairs of frames drives the regularization of

the model rather than the objective. More details of training our network are given in Sec. 4.
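The siamese arrangement can be sketched independently of the particular layers: a single set of parameters is applied to both members of a pair, and only the distance between the two codes enters the loss. In the toy sketch below, a linear map followed by tanh stands in for the convolutional network of Fig. 2 (that substitution, and all names, are ours for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.01 * rng.normal(size=(32, 128 * 128))  # ONE shared parameter set

def embed(x, theta):
    """Map a flattened image to a 32-D code. In the paper this is the
    convolutional network of Fig. 2; a linear map + tanh stands in here."""
    return np.tanh(theta @ x)

# Siamese property: both frames of a pair pass through the SAME parameters;
# only the distance between the two codes enters the loss (Eq. 1 / Eq. 3).
xi, xj = rng.normal(size=(2, 128 * 128))
zi, zj = embed(xi, theta), embed(xj, theta)
d_ij = np.linalg.norm(zi - zj)
```

Because the two branches share θ, a single gradient step updates both halves of the pair consistently, which is what makes the pairwise losses above well-defined.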

[Figure 2 diagram: inputs x_i, x_j (128×128) → convolutions, tanh(), abs() → 16×120×120 → average pooling → 16×24×24 → convolutions, tanh(), abs() → 32×16×16 → average pooling → 32×4×4 → fully connected → 32×1×1 output codes; the loss is computed on the distance d(z_i, z_j).]

Figure 2: Convolutional NCA regression (C-NCAR). Each image is processed by two convolutional

and subsampling layers and one fully-connected layer. A loss (Eq. 3) computed on the distance

between resulting codes drives parameter learning.
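The map sizes listed in Fig. 2 are mutually consistent with “valid” convolutions and non-overlapping pooling. The specific filter and pooling widths below (9×9 convolutions, 5×5 and 4×4 pooling) are inferred from those listed sizes, not stated in the text, so treat them as an assumption:

```python
def conv_out(n, k):
    """Spatial size after a 'valid' convolution with a k x k filter."""
    return n - k + 1

def pool_out(n, p):
    """Spatial size after non-overlapping p x p average pooling."""
    return n // p

n = 128                 # input: 128 x 128
n = conv_out(n, 9)      # layer 1: 16 maps of 120 x 120
n = pool_out(n, 5)      # layer 2: 16 maps of 24 x 24
n = conv_out(n, 9)      # layer 3: 32 maps of 16 x 16
n = pool_out(n, 4)      # layer 4: 32 maps of 4 x 4
```

The final 4×4 maps are reduced to the 32×1×1 output by the fully-connected layer.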

3.3 Adding a contrastive loss function

Like NCA, DrLIM assumes a discrete notion of similarity or dissimilarity between data pairs, x_i and x_j. It defines both a “similarity” loss, L_S, which penalizes similar points that are far apart in code space, and a “dissimilarity” loss, L_D, which penalizes dissimilar points that lie within a user-defined margin, m, of each other:

L_S(x_i, x_j) = \frac{1}{2} d_{ij}^2, \qquad L_D(x_i, x_j) = \frac{1}{2} \left\{ \max(0, m - d_{ij}) \right\}^2, \quad (5)

where d_{ij} is given by Eq. 1. Let γ_{ij} be an indicator such that γ_{ij} = 1 if x_i and x_j are deemed similar and γ_{ij} = 0 if x_i and x_j are deemed dissimilar. For example, if the labels y_i are discrete, y_i ∈ {1, 2, ..., C}, then γ_{ij} = 1 for y_i = y_j and γ_{ij} = 0 otherwise. The total loss is defined by:

L_{DrLIM} = \sum_{i=1}^{N} \sum_{j \neq i} γ_{ij} L_S(x_i, x_j) + (1 - γ_{ij}) L_D(x_i, x_j). \quad (6)

When faced with real-valued labels, y_i, we can avoid explicitly defining similarity and dissimilarity (e.g. via thresholding) by defining a “soft” notion of similarity:

\hat{γ}_{ij} = \frac{\exp(-\|y_i - y_j\|_2^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|_2^2)}. \quad (7)

Replacing the indicator variables γ_{ij} with \hat{γ}_{ij} in Eq. 6 yields what we call the soft DrLIM loss.
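The pieces above fit together in a few lines (a NumPy sketch; the function names are ours, and the margin m = 1.25 follows the value used later in the text):

```python
import numpy as np

def drlim_losses(d_ij, m=1.25):
    """Per-pair similarity / dissimilarity losses of Eq. 5."""
    l_s = 0.5 * d_ij ** 2
    l_d = 0.5 * max(0.0, m - d_ij) ** 2
    return l_s, l_d

def soft_gamma(y):
    """Soft similarity weights of Eq. 7 from real-valued labels y (N, L)."""
    y2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(y2, np.inf)          # exclude k = i from the normalization
    g = np.exp(-y2)
    return g / g.sum(axis=1, keepdims=True)
```

Pairs separated by more than the margin contribute no dissimilarity loss, and the soft weights give nearby labels most of the similarity mass, which is what replaces the hard indicator.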


4 Experimental results

We evaluate our approach in real and synthetic environments by performing 1-nearest neighbour

(NN) regression using a variety of standard and learned metrics described below. For every query

image in a test set, we compute its distance (under the metric) to each of the training points in a

database. We then copy the label (e.g. (x,y) position of the head and hands) of the neighbour to the

query example. For evaluation, we compare the ground-truth label of the query to the label of the

nearest neighbour. Errors are reported in terms of mean pixel error over each query and each marker:

the head (if it is tracked) and each hand. Errors are absolute with respect to the original image size.
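The evaluation protocol just described can be sketched directly (NumPy; the array shapes and function name are our choices):

```python
import numpy as np

def one_nn_pixel_error(z_train, y_train, z_test, y_test):
    """1-NN regression as in the evaluation protocol: copy the label of the
    nearest training point (under the learned metric) to each query, then
    report mean pixel error over all queries and markers.
    z_*: (N, d) codes; y_*: (N, M, 2) marker (x, y) positions in pixels."""
    d2 = ((z_test[:, None, :] - z_train[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    predicted = y_train[nearest]                     # copied marker labels
    per_marker = np.linalg.norm(predicted - y_test, axis=-1)
    return per_marker.mean()
```

The metric being compared only changes how the codes z are produced; the 1-NN copy-and-score step is the same for every method below.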

We acknowledge that improved results could potentially be obtained by using more than one neigh-

bour or with more sophisticated techniques such as locally weighted regression [36]. However, we

focus on learning a good metric for performing this task rather than the regression problem. The

approaches compared are:

Pixel distance can be used to find nearest neighbours though it is not practical in real situations due

to the intractability of computing distances in such a high-dimensional space.

GIST descriptors [29] are a global representation of image content. We are motivated to use GIST

by its previous use in nonlinear NCA for image retrieval [38]. The resulting image representation

is a length-512 vector. We note that this is still too large for efficient NN search and that the GIST

features are not domain-adaptive.

Linear NCA regression (NCAR) is described in Section 3. We pre-compute GIST for each image

and use that as our input representation. We learn a 512×32 matrix of weights by minimizing

Eq. 3 using nonlinear conjugate gradients with randomly sampled mini-batches of size 512. We

perform three line-searches per mini-batch and stop learning after 500 mini-batches. We found that

our results slightly improved when we applied a form of local contrast normalization (LCN) prior

to computing GIST. Each pixel’s response was normalized by the integrated response of a 9×9

window of neighbouring pixels. For more details see [30].
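A plain (slow but explicit) sketch of this normalization, under our reading of the text; edge handling and the small additive constant are our assumptions (see [30] for the full details):

```python
import numpy as np

def local_contrast_normalize(img, w=9):
    """Divide each pixel by the summed response of its w x w neighbourhood,
    one simple reading of the LCN described above."""
    pad = w // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    h, width = img.shape
    for i in range(h):
        for j in range(width):
            window = padded[i:i + w, j:j + w]
            out[i, j] = img[i, j] / (window.sum() + 1e-8)
    return out
```

A vectorized implementation (e.g. via an integral image or a box filter) would be used in practice; the double loop here only makes the window arithmetic explicit.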

Convolutional NCA regression (C-NCAR) See Fig. 2 for a summary of our architecture. Images

are pre-processed using LCN. Convolutions are followed by pixel-wise tanh and absolute value

rectification. The abs prevents cancellations in local neighbourhoods during average downsampling

[18]. Our architectural parameters (size of filters, number of filter banks, etc.) are chosen to produce

a 32-dimensional output. Derivations of parameter updates are presented as supplementary material.

Soft DrLIM (S-DrLIM) and Convolutional soft DrLIM (CS-DrLIM) We also experiment with a

variant of an alternative, energy-based method that adds an explicit contrastive loss to the objective

rather than implicitly through normalization. The contrastive loss only operates on dissimilar points

which lie within a specified margin, m, of each other. We use m = 1.25 as suggested by [16].

In both the linear and nonlinear case, the architecture and training procedure remains the same as

NCAR and C-NCAR, respectively. We use a different objective: minimizing Eq. 6 with respect to

the parameters.

4.1 Estimating 2D head and hand pose from synthetic data

We extracted 10,000 frames of training data and 5,000 frames of test data from Poser renderings

of several hours of real motion capture data. Our synthetic data is similar to that considered in [36]; however, we use a variety of backgrounds rather than a constant background. Furthermore, subjects

are free to move around the frame and are rendered at various scales. The training set contains 6

different characters superimposed on 9 different backgrounds. The test set contains 6 characters and

8 backgrounds not present in the training set. The inputs, x, are 320×240 images, and the labels,

y, are 6D vectors: the true (x, y) locations of the head and hands.

Results are shown in Table 1 (column SY). Simple linear NCAR performs well compared to the

baselines, while our nonlinear methods C-NCAR and CS-DrLIM (which are not restricted to the

GIST descriptor) significantly outperform all other approaches. Pixel-based matching (though ex-

tremely slow) does surprisingly well. This is perhaps an artifact of the synthetic data.

4.2 Estimating 2D hand pose from real video


We digitally recorded all of the contributing and invited speakers at the Learning Workshop (Snow-

bird) held in April 2010. The set consisted of 30 speakers, with talks ranging from 10 to 40 minutes

each. After each session of talks, blocks of 150 frames were distributed as Human Intelligence Tasks
