Pose-Sensitive Embedding by Nonlinear NCA Regression
Graham W. Taylor, Rob Fergus, George Williams, Ian Spiro and Christoph Bregler
Courant Institute of Mathematical Sciences, New York University
New York, USA 10003
Abstract
This paper tackles the complex problem of visually matching people in similar
pose but with different clothes, background, and other appearance changes. We
achieve this with a novel method for learning a nonlinear embedding based on
several extensions to the Neighborhood Component Analysis (NCA) framework.
Our method is convolutional, enabling it to scale to realistically-sized images. By
cheaply labeling the head and hands in large video databases through Amazon
Mechanical Turk (a crowd-sourcing service), we can use the task of localizing
the head and hands as a proxy for determining body pose. We apply our method
to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method
quantitatively against other embedding methods. We also demonstrate that real-
world performance can be improved through the use of synthetic data.
1 Introduction
Determining the pose of a human body from one or more images is a central problem in Computer
Vision. The complex, multi-jointed nature of the body makes the determination of pose challenging,
particularly in natural settings where ambiguous and unusual configurations may be observed. The
ability to localize the hands is particularly important: they provide tight constraints on the layout of
the upper body, yielding a strong cue as to the action and intent of a person.
A huge range of techniques, both parametric and non-parametric, exists for inferring body pose from 2D images and 3D datasets [10, 39, 4, 28, 33, 8, 3, 6, 11]. We propose a non-parametric approach to estimating body pose by localizing the hands, using a parametric, nonlinear, multi-layered embedding of the raw pixel images. Unlike many other metric-learning approaches, ours is designed for use with real-world images, having a convolutional architecture that scales gracefully to large images and is invariant to local geometric distortions.
Figure 1: Query image (left column) and the eight nearest neighbours found by our method. The distance in the learned embedding space is shown at bottom right. Matches are based on the location of the hands and, more generally, on body pose, not on the individual or the background.
Our embedding, trained on both real and synthetic data, is a functional mapping that projects images
with similar head and hand positions to lie close-by in a low-dimensional output space. Efficient
nearest-neighbour search can then be performed in this space to find images in a large training
corpus that have similar pose. Specifically for this task, we have designed an interface to obtain
and verify head and hand labels for thousands of frames through Amazon Mechanical Turk with
minimal user intervention. We find that our method is able to cope with the terse and noisy labels
provided by crowd-sourcing. It succeeds in generalizing to body and hand pose when such cues are
not explicitly provided in the labels (see Fig. 1).
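The retrieval step described above, embedding a query image and searching for its nearest neighbours in the low-dimensional space, can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the stand-in linear projection `W` takes the place of the learned convolutional embedding, and `nearest_poses` and the synthetic data are our own constructions.

```python
import numpy as np

def nearest_poses(embed, query_img, corpus_imgs, k=8):
    """Embed a query and a corpus, then return the indices and distances
    of the k corpus images closest to the query in the embedded space."""
    q = embed(query_img[None, :])            # (1, d) embedded query
    C = embed(corpus_imgs)                   # (n, d) embedded corpus
    dists = np.linalg.norm(C - q, axis=1)    # Euclidean distances in feature space
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Stand-in "embedding": a fixed random linear projection of raw pixels.
# (The paper learns this mapping; here it is only a placeholder.)
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 32)) / 32.0
embed = lambda X: X @ W

corpus = rng.standard_normal((500, 1024))              # 500 synthetic "images"
query = corpus[42] + 0.01 * rng.standard_normal(1024)  # near-duplicate of image 42
idx, d = nearest_poses(embed, query, corpus, k=8)      # idx[0] is image 42
```

Because the embedding is low-dimensional, this exhaustive scan is cheap even for large corpora, and standard approximate nearest-neighbour structures can replace it if needed.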
2 Related work
Our application domain is related to several approaches in the computer vision literature that propose hand or body pose tracking. Many techniques rely on sliding-window part detectors based on color and other features applied under controlled recording conditions ([10, 39, 4, 28] to name a few; we refer to  for a complete survey). In our domain, hands may occupy only a few pixels, and the only body part that can be reliably detected is the human face ([26, 13]). Many techniques have been proposed that extract, learn, or reason over entire-body features. Some use a combination of local detectors and structural reasoning (see  for coarse tracking and  for person-dependent tracking). In a similar spirit, more general techniques using pictorial structures [3, 12, 35], “poselets” , and other part models  have received increased attention. A new stream of kinematic model-based techniques based on the HumanEva dataset has also been proposed , but this area differs from our domain in that the images considered are of higher quality and less cluttered.
More closely related to our task are nearest-neighbour and locally-weighted regression-based techniques. Some extract “shape-context” edge-based histograms from the human body [25, 1] or just silhouette features . Shakhnarovich et al.  use HOG  features and boosting to learn a parameter-sensitive hash function. All these approaches rely on good background subtraction or recordings with clear backgrounds. Our domain contains clutter, lighting variations, and such low resolution that it is impossible to separate body features from the background reliably. We instead learn relevant features directly from pixels (rather than from pre-coded edge or gradient histogram features) and implicitly discover background invariance from the training data.
Several other works [36, 9, 4, 15] have used synthetically created data as a training set. We show in our experiments comparisons between real training data (with Mechanical Turk labels), synthetic training data, and hybrid datasets. Our final system, after training, is always applied to cluttered, non-background-subtracted real video input without any labels.
Our technique is also related to distance metric learning, an important area of machine learning
research, especially due to recent interest in analyzing complex high-dimensional data. A subset
of approaches for dimensionality reduction [17, 16] implicitly learn a distance metric by learning
a function (mapping) from high-dimensional (i.e. pixel) space to low-dimensional “feature” space
such that perceptually similar observations are mapped to nearby points on a manifold. Neighbourhood Components Analysis (NCA)  proposes a solution in which the transformation from input to feature space is linear and the distance metric is Euclidean. NCA learns the transformation that
is optimal for performing KNN in the feature space. NCA has also been recently extended to the
nonlinear case  using MNIST class labels, and to linear 1D regression for reinforcement learning . Dimensionality Reduction by Learning an Invariant Mapping (DrLIM)  also learns a nonlinear mapping. Like NCA, DrLIM uses class neighbourhood structure to drive the optimization:
observations with the same class label are driven to be close-by in feature space. Our approach
is also inspired by recent hashing methods [2, 34, 38], although those techniques are restricted to
binary codes for fast lookup.
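The NCA criterion described above can be sketched numerically. The snippet below is a minimal NumPy illustration, not the authors' code: `nca_objective` and the toy data are our own constructions. It computes the stochastic-neighbour probabilities, with p_ij proportional to exp(-||Ax_i - Ax_j||^2), and the expected number of correctly classified points, which linear NCA maximizes with respect to the projection A.

```python
import numpy as np

def nca_objective(A, X, labels):
    """Linear NCA: project X by A, then return the expected number of
    correctly classified points under stochastic nearest neighbours,
    where p_ij is proportional to exp(-||A x_i - A x_j||^2)."""
    Y = X @ A.T                               # projected points
    diff = Y[:, None, :] - Y[None, :, :]
    sq = (diff ** 2).sum(-1)                  # pairwise squared distances
    np.fill_diagonal(sq, np.inf)              # a point never picks itself
    P = np.exp(-sq)
    P /= P.sum(axis=1, keepdims=True)         # stochastic neighbour probabilities
    same = labels[:, None] == labels[None, :]
    return (P * same).sum()                   # maximized during training

# Two well-separated classes: nearly every stochastic neighbour shares
# its point's label, so the objective approaches n = 20.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((10, 2)),
               rng.standard_normal((10, 2)) + 10.0])
labels = np.array([0] * 10 + [1] * 10)
obj = nca_objective(np.eye(2), X, labels)
```

In practice the objective is maximized by gradient ascent on A; the nonlinear variants replace the linear projection with a multi-layer network while keeping the same probabilistic neighbour structure.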
3 Learning an invariant mapping by nonlinear embedding
We first discuss Neighbourhood Components Analysis  and its nonlinear variants. We then propose an alternative objective function optimized for performing nearest-neighbour (NN) regression rather than classification. Next, we describe our convolutional architecture, which maps images from pixel space to the low-dimensional embedding space.
References
 A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):44–58, 2006.
 A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 2006.
 M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
 V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. Boostmap: A method for efficient approximate similarity rankings. CVPR, 2004.
 S. Becker and G. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161–163, 1992.
 L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009.
 J. Bouvrie. Notes on convolutional neural networks. Unpublished, 2006.
 P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching TV (using weakly aligned subtitles). CVPR, 2009.
 N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. ECCV, 2006.
 A. Farhadi, D. Forsyth, and R. White. Transfer Learning in Sign language. In CVPR, 2007.
 P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
 V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Pose search: Retrieving people using their pose. In CVPR, 2009.
 A. Frome, G. Cheung, A. Abdulkader, M. Zennaro, B. Wu, A. Bissacco, H. Adam, H. Neven, and L. Vincent. Large-scale Privacy
Protection in Google Street View. In ICCV, 2009.
 J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
 K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3d structure with a statistical image-based shape model. In ICCV, 2003.
 R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, pages 1735–1742, 2006.
 G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 K. Jarrett, K. Kavukcuoglu, M-A Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
 K. Kavukcuoglu, M-A Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition.
Technical report, NYU, 2008. CBLL-TR-2008-12-01.
 P. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement
learning. In ICML, pages 449–456, 2006.
 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
 H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical
representations. In ICML, pages 609–616, 2009.
 R. Memisevic and G. Hinton. Unsupervised learning of image transformations. In CVPR, 2007.
 H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, pages 737–744, 2009.
 G. Mori and J. Malik. Estimating human body configurations using shape context matching. ECCV, 2002.
 M. Nechyba, L. Brandy, and H. Schneiderman. Pittpatt face detection and tracking for the CLEAR 2007 evaluation. Multimodal
Technologies for Perception of Humans, 2008.
 M. Norouzi, M. Ranjbar, and G. Mori. Stacks of convolutional restricted boltzmann machines for shift-invariant feature learning. In CVPR, 2009.
 S.J. Nowlan and J.C. Platt. A convolutional neural network hand tracker. In NIPS, 1995.
 A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of
Computer Vision, 42(3):145–175, 2001.
 N. Pinto, D. Cox, and J. DiCarlo. Why is real-world visual object recognition hard? PLoS Comput Biol, 4(1), 2008.
 N. Pinto, D. Doukhan, J. DiCarlo, and David D. Cox. A high-throughput screening approach to discovering good forms of biologically
inspired visual representation. PLoS Comput Biol, 5(11), 11 2009.
 R. Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1-2):4–18, 2007.
 D. Ramanan, D. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In CVPR, 2005.
 R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, 2007.
 B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In CVPR, 2010.
 G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In ICCV, pages 750–759, 2003.
 L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation
of articulated human motion. IJCV, 87(1/2):4–27, 2010.
 A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
 C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 19(7):780–785, 1997.