
Joint Manifold Distance: a new approach to appearance based clustering

Andrew W. Fitzgibbon and Andrew Zisserman

Robotics Research Group, Department of Engineering Science,

University of Oxford, United Kingdom

http://www.robots.ox.ac.uk/~vgg

Abstract

We wish to match sets of images to sets of images where

both sets are undergoing various distortions such as view-

point and lighting changes.

To this end we have developed a Joint Manifold Dis-

tance (JMD) which measures the distance between two

subspaces, where each subspace is invariant to a desired

group of transformations, for example affine warping of

the image plane. The JMD may be seen as generaliz-

ing invariant distance metrics such as tangent distance

in two important ways. First, formally representing pri-

ors on the image distribution avoids certain difficulties

which, in previous work, have required ad-hoc correction.

The second contribution is the observation that previous

distances have been computed using what amounted to

“home-grown” nonlinear optimizers, and that more re-

liable results can be obtained by using generic optimiz-

ers which have been developed in the numerical analysis

community, and which automatically set the parameters

which home-grown methods must set by art.

The JMD is used in this work to cluster faces in video.

Sets of faces detected in contiguous frames define the sub-

spaces, and distance between the subspaces is computed

using JMD. In this way the principal cast of a movie can

be ‘discovered’ as the principal clusters. We demonstrate

the method on a feature-length movie.

1. Introduction

We would like to cluster instances of objects in a video

in an unsupervised manner in order to ‘discover’ the sig-

nificant characters, scenes, events etc. This requires that

our measure of distance between two imaged instances is ideally invariant to the changes in viewpoint and lighting that affect the image—so that our clustering is of the object, not its image.

Figure 1: Matching image subspaces. Each row is a sequence of images spanning a subspace, and the goal is to determine for a pair of sequences, whether the subspaces spanned by the sequences are the same. The images within each subspace are registered, but the transformation between the subspaces is unknown. The distance between subspaces must be invariant to the unknown registration between the sequences.

As an example of such clustering, in this paper our ob-

jectiveistoestablishmatchesbetweenthefacesthatoccur

throughout a feature length movie. This is a very chal-

lenging problem: a film typically has 100-150K frames;

and in addition to changes of lighting and viewpoint, faces

also change expression and are partially occluded—for

example by hands, telephones or spectacles. In movies

in particular, lighting and viewpoint are intentionally dra-

matically varied. This makes the clustering problem sig-

nificantly more difficult than in traditional “mugshot” ap-

plications.

One way to proceed is to construct a distance function


1063-6919/03 $17.00 © 2003 IEEE


d(x1,x2) between two image instances x1 and x2 which

is invariant to all such perturbations and deformations

that occur. For example invariance to viewpoint can be

achieved (to a first approximation) by designing the dis-

tance function to be invariant to an affine transformation

of xi, so that for example d(x1,x2) = d(x1,T(x2;a2)),

where T represents the affine transformation of the image.

Implementation of these measures via tangent distance

and its extensions [3, 5, 15, 16] allows efficient compu-

tation for classes of parametrized transformations. This

idea can be extended to any desired transformation, e.g.

for photometric changes or changes in expression, pro-

vided a parametrized model of the class of transforma-

tions is available.
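The idea of building invariance into the distance by minimizing over the transformation can be illustrated in the simplest possible case. The sketch below (Python; a global scaling stands in for the full affine warp described in the text, and the function name is ours) computes min over a of the residual to T(x2; a) = a x2, for which the minimization has a closed form.

```python
import numpy as np

# Invariant distance by explicit minimization over the transformation:
# for global scaling, T(x; a) = a * x, the optimal a is a least-squares
# projection, so the minimization over a is closed form.
def scale_invariant_distance(x1, x2):
    a = (x1 @ x2) / (x2 @ x2)        # argmin_a ||x1 - a * x2||^2
    return np.linalg.norm(x1 - a * x2)

x = np.array([1.0, 2.0, 3.0])
print(scale_invariant_distance(x, 5.0 * x))                    # ~0: equal up to scale
print(scale_invariant_distance(x, np.array([3.0, 2.0, 1.0])))  # large: genuinely different
```

Tangent distance replaces this exact minimization with one over a first-order (linearized) model of T, which is what makes richer transformation families tractable.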

An alternative is to define a distance function d(x,S)

between a point x and a (possibly infinite) set of points S,

where S contains exemplars of the perturbations and de-

formations. Often this reduces to the distance between a

point and a linear subspace. For example consider pho-

tometric invariance. A widely used approximation is that

(under restricted conditions of no shadowing, Lambertian

reflectance etc), the space of all images under all light-

ing is spanned by a four dimensional linear space [1, 13].

Higher dimensional spaces can approximate other illu-

mination effects such as self-shadowing (attached shad-

ows) [2]. So photometric invariance could be achieved

if S is the space of lighting images, e.g. acquired by an

SVD of many registered images [17]. Given training ex-

amples which exercise these variations, a transformation-

aware principal component analysis [7, 14] can compute

the subspace even in the case of unregistered sets of im-

ages.
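A lighting-style subspace of this kind, obtained by SVD of a registered image set, can be sketched as follows (Python with NumPy; the synthetic low-rank data stands in for registered face images, and all sizes are illustrative).

```python
import numpy as np

# Toy stand-in for registered images of one object under varying lighting:
# each column of X is one vectorized image (random rank-4 data).
rng = np.random.default_rng(0)
basis_true = rng.standard_normal((100, 4))      # 100 "pixels", 4 lighting modes
X = basis_true @ rng.standard_normal((4, 30))   # 30 registered images

# SVD of the image set; the top-k left singular vectors span the lighting
# subspace, following the low-dimensional linear-space approximation.
U, sv, Vt = np.linalg.svd(X, full_matrices=False)
S = U[:, :4]                                    # orthonormal subspace basis

def point_to_subspace(x, S):
    """Distance from image x to the subspace spanned by the columns of S."""
    proj = S @ (S.T @ x)                        # orthogonal projection onto span(S)
    return np.linalg.norm(x - proj)

x_in = X[:, 0]                                  # an image inside the subspace
x_out = rng.standard_normal(100)                # a generic image
print(point_to_subspace(x_in, S))               # ~0: lies in the subspace
print(point_to_subspace(x_out, S))              # large: generic point
```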

A third approach is to remove the variations before

matching, by projecting each image onto a space which

does not permit such deviations. For example, the photo-

metric variation problem can often be avoided by filtering

the images (e.g. by a high pass filter) to significantly ame-

liorate lighting effects, effectively collapsing the space to

a much lower dimension. In the limit the space is col-

lapsed to a single point and the approach reduces to com-

puting a point to point distance d(x1,x2).
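The filtering route can be sketched in a few lines (Python with SciPy). The synthetic square image and the additive shading model—lighting becomes approximately additive when working on log intensities—are our illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def highpass(img, sigma=3.0):
    # Subtract a Gaussian-smoothed copy, suppressing the slowly varying
    # component in which (log-domain) lighting changes mostly live.
    return img - gaussian_filter(img, sigma)

# Synthetic log-image with sharp structure: a bright square on a dark field.
face = np.zeros((64, 64))
face[20:44, 20:44] = 1.0

# Additive low-frequency shading: a smooth vertical gradient.
shade = np.linspace(-0.5, 0.5, 64)[:, None] * np.ones((1, 64))
lit = face + shade

d_raw = np.linalg.norm(face - lit)             # dominated by the shading
d_filt = np.linalg.norm(highpass(face) - highpass(lit))
print(d_raw, d_filt)                           # d_filt is much smaller
```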

Here we partition the deformations into those that can


be modelled or removed to some approximation (view-

point by affine transformations, photometric filtering),

and use a set of images to span the residual: the parts of

viewpoint not modelled by affinity, errors in computing registration, and—in the face case—changes in expression.

Figure 2: Affine registration. (Top) Sequence of faces obtained by the face detector. (Bottom) Affine registered sequence. There is an overall unresolved affine transformation which must be accounted for when comparing this with other such sequences.

Figure 2 shows a set before and after affine reg-

istration. In video, sets of this type are readily available

because objects do not arbitrarily disappear between con-

tiguous frames, but can be easily tracked so that clustering

over consecutive frames is straightforward [8].

There is one further development that is clearly moti-

vated by figure 1: to determine the distance between two

finite sets of size n, it is not necessary to compute the

n² distances between each point in one set and each in

the other—instead the distance between the sets d(S1,S2)

can be measured directly. This paper explores these three

distance functions, d(x1,x2), d(x,S), d(S1,S2), including an extension incorporating learnt priors on the vari-

ous transformations, describes efficient implementations,

and demonstrates their performance in the face clustering

application.

2. Classes of distance functions

In this section we discuss the three classes of distance

function: point to point, point to set, and set to set. First,

however, let us consider the observation model.

We have samples x each associated with a “true” da-

tum ˜ x which are drawn independently from the density

described by p(x|˜x). Given observations x1 and x2 our objective is to determine the likelihood p(x1,x2) that both

are samples of the same ˜x:

p(x_1,x_2) = \int d\tilde{x}\; p(x_1|\tilde{x})\, p(x_2|\tilde{x})\, p(\tilde{x})    (1)

where p(˜x) is the prior distribution on ˜x.

Figure 3: Several definitions of manifold distance between sample points x1 and x2. The distance from a datum to the hidden ‘true’ point ˜x is measured as the distance to the manifold M = {T(˜x;a) ∀a ∈ R^m}. (a) Transfer distance. When ˜x is approximated by one of the sample points, here x2, this is the “one-sided” manifold distance. (b) Two-sided manifold distance. This definition can sometimes make distances between disparate objects arbitrarily small, for example by mapping each image to a single point. (c) Symmetric transfer distance. Sometimes called “two-sided” manifold distance. (d) Manifold distance. The hidden variable ˜x is explicitly included, so the manifold to which distance is measured must move during the optimization. Section 2.1 shows how to compute a tangent approximation which accounts for the manifold movement.
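For a concrete instance of (1), take scalar data with Gaussian noise and a Gaussian prior. The marginalization over the hidden ˜x then has a closed form that can be checked numerically (the variances and observations below are arbitrary illustrative values).

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.integrate import quad

# 1-D instance of eq. (1): each observation is the true datum t plus
# Gaussian noise, p(x|t) = N(x; t, sigma^2), with prior p(t) = N(0, tau^2).
sigma, tau = 0.5, 2.0
x1, x2 = 0.3, -0.4

# Numerical marginalization over the hidden true datum t.
def integrand(t):
    return norm.pdf(x1, t, sigma) * norm.pdf(x2, t, sigma) * norm.pdf(t, 0, tau)

p_num, _ = quad(integrand, -np.inf, np.inf)

# Closed form: (x1, x2) are jointly Gaussian with covariance
# tau^2 * ones + sigma^2 * I, since both observations share the latent t.
cov = tau**2 * np.ones((2, 2)) + sigma**2 * np.eye(2)
p_closed = multivariate_normal.pdf([x1, x2], mean=[0, 0], cov=cov)
print(p_num, p_closed)   # the two values agree to numerical precision
```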

An observation x is generated by applying a transfor-

mation a to a true datum ˜ x and then adding noise. The

family of transformations is parametrized by a vector of

parameters a, and the transformation is given by

x → T(x;a).

For example, if the observations x are n-D points then for

a transformation for scaling about the origin, a will have

exactly one element a1 representing the scale and points

will transform as T(x;a) = x a1. The density p(x|˜x) may be expanded as

p(x|\tilde{x}) = \int da\; p(x|a,\tilde{x})\, p(a|\tilde{x})    (2)

where p(a|˜x) is the prior probability of the transformation a given ˜x, and it will be assumed that p(a|˜x) = p(a), i.e. that the prior is independent of the datum ˜x. The manifold itself is encoded in the prior p(a). If this prior were completely unrestricting, we would have the standard picture of the manifold as a subset of the space of images in which x lives, with the dimensionality of the manifold equal to that of the parameters a. For an affine transformation of n-pixel images, this would be a six-dimensional manifold in R^n. In practice, it is important to place priors

on a, so some points on that manifold are less likely to

be observed than others. For example, we might expect

that the transformation which shrinks the image down to

less than one pixel square is unlikely. Using (2), the joint

likelihood (1) may thus be expanded as

p(x_1,x_2) = \int d\tilde{x}\, da_1\, da_2\; p(x_1|\tilde{x},a_1)\, p(x_2|\tilde{x},a_2)\, p(a_1)\, p(a_2)\, p(\tilde{x})

The terms in this expression are summarized as follows:

p(x1|˜ x,a1) is the likelihood of x1, given the true point

˜ x, transformed by the transformation parameters a1. The

likelihood of x2 is p(x2|˜x,a2). p(ai) is the prior prob-

ability of transformation ai—this will be estimated from

training examples. p(˜ x) is the prior distribution on the

true point ˜ x. Here we set this to a broad Gaussian, which

yields a term analogous to the “spring” tangent distance

regularizer [15].

It will be assumed here that the image likelihoods can

be approximated by a distribution whose density function is of the form p(x|˜x,a) = e^{-ρ(z)}, where z is the difference image x − T(˜x;a), and ρ is a kernel function. Choices of the kernel ρ include the Gaussian model ρ(z) = ‖Σ^{-1/2} z‖² or a robust distribution; the choices will be discussed later in the paper.

The MAP estimate is obtained from the joint likelihood


as:

p(x_1,x_2) \approx p_{\mathrm{MAP}}(x_1,x_2) = \max_{a_1,a_2,\tilde{x}}\; p(x_1|\tilde{x},a_1)\, p(x_2|\tilde{x},a_2)\, p(a_1)\, p(a_2)\, p(\tilde{x})

and then the distance is defined as the negative log like-

lihood d(x1,x2) := −logpMAP(x1,x2). We will refer to

this as the manifold distance between two points.

Having derived manifold distance from a generative

model as above, we relate it to the several different defi-

nitions in the literature. The primary distinction made is

between “one-sided” and “two-sided” distances, but we

show here that neither is equivalent to the true manifold

distance. The discussion is clearer if the manifold dis-

tance is rewritten as a sum of negative log likelihoods1

d(x_1,x_2) = \min_{a_1,a_2,\tilde{x}}\; E(x_1 - T(\tilde{x};a_1)) + E(x_2 - T(\tilde{x};a_2)) + E(a_1) + E(a_2) + E(\tilde{x}).

Computation of the true manifold distance includes an

optimization over the hidden variable ˜ x as well as the

transformation parameters (a1,a2). In the case of image

matching, ˜ x is the underlying true image which is warped

and noise-corrupted to give the captured images x1and

x2. A number of alternative definitions in the literature

have eliminated ˜ x in various ways.

Variations on manifold distance: In the first, illustrated

in figure 3a, ˜ x is chosen to be equal to one of the two data

points, say x1. The manifold distance of the second point

x2is then

d_1(x_1,x_2) = \min_{a}\; E(x_2 - T(x_1;a))

This formulation—called “transfer” or “one-sided”

distance—is easy to compute, but has the disadvantage

that it causes d(·,·) to fail to be a metric, as d1(x1,x2) ≠

d1(x2,x1). It also means that priors on ˜ x cannot be incor-

porated, as ˜ x is fixed to be one of the data points.

Symmetry is addressed by defining the “two-sided” dis-

tance (figure 3b)

d_2(x_1,x_2) = \min_{a_1,a_2}\; E(T(x_2;a_2) - T(x_1;a_1))    (3)

¹To avoid clutter, there is an overloading of notation here: E(·) is not

the same negative log-likelihood function for each variable, but indicates

that the appropriate likelihood for the argument type is being computed

as above.

in which both images are transformed before comparison.

However this can yield spurious solutions, the canonical

example being that images under affine transformations

can be mapped to a single point by scaling, yielding a

low distance for any pair (x1,x2). A variant that does not

suffer this collapse, but appears not to be widely used, is

the “symmetric transfer distance” (figure 3c)

d_s(x_1,x_2) = \min_{a_1,a_2}\; E(x_2 - T(x_1;a_1)) + E(x_1 - T(x_2;a_2)).

Again, however, priors on ˜ x are not readily included. We

show in this paper that these approximations are not nec-

essary, and that the general form d(.,.) may be optimized

over ˜ x and the transformations (a1,a2) directly.
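That the general form can indeed be optimized jointly over ˜x and (a1, a2) with a generic off-the-shelf optimizer can be sketched on a toy problem (Python/SciPy; the scaling transformation and the prior weights lam, gam are our illustrative choices, not the paper's).

```python
import numpy as np
from scipy.optimize import minimize

# Toy joint manifold distance: 2-D "images", transformation is scaling
# about the origin, T(x; a) = a * x. Gaussian energies throughout.
x1 = np.array([2.0, 1.0])
x2 = np.array([4.1, 2.1])          # roughly x1 scaled by 2, plus noise
lam, gam = 1e-2, 1e-4              # priors on the scales and on x_tilde

def objective(p):
    xt, a1, a2 = p[:2], p[2], p[3]
    return (np.sum((x1 - a1 * xt) ** 2) + np.sum((x2 - a2 * xt) ** 2)
            + lam * ((a1 - 1) ** 2 + (a2 - 1) ** 2)   # E(a1) + E(a2)
            + gam * np.sum(xt ** 2))                  # E(x_tilde)

# A generic optimizer over (x_tilde, a1, a2) jointly, rather than a
# hand-rolled alternation with hand-tuned parameters.
res = minimize(objective, x0=np.array([1.0, 1.0, 1.0, 1.0]), method="BFGS")
d = res.fun                        # the negative-log-likelihood distance
print(res.x, d)
```

The recovered scales keep the ratio a2/a1 near 2, the relative scale between the two observations, while the priors fix the gauge freedom in ˜x.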

The effect of priors: Suppose for the moment there are

no priors on the transformation, and that ˜x is known. Then

in the single-sided case, the distance is minimized by the

closest point to x on the orbit through ˜ x. Similarly the

manifold distance is minimized by the points closest to

x1 and x2 respectively on the orbit through ˜x, but there

is a freedom (symmetry) to choose ˜ x as any point on the

orbit, since it is only the orbit that defines the distances.

Introducing the priors on the transformation breaks this

symmetry—the point ˜x must now lie “between” x1 and x2 in order to reduce the prior terms E(a1) + E(a2). Also the estimated point ˜x is no longer given by the closest point

on the orbit.

Discussion: Many existing variants amount to supplying

different forms for the priors and likelihoods, although to

our knowledge, no work has included all simultaneously,

or optimized over ˜ x. Previous authors have added priors

to the one and two sided distances. The regularizing term

in Schwenk et al’s constraint tangent distance [11] may

be seen as imposing a uniform prior using this term, and

Keysers et al’s probabilistic tangent distance [6] shows

how a Gaussian prior can be included. Jojic et al [5] derive

the form of the two-sided tangent distance (3) with priors

on the transformation parameters, but do not include pri-

ors on ˜ x. They do include priors on ˜ x in a generative

model learning framework, not in the distance function,

but even there only draw ˜ x from a discrete set of cluster

centres and assume zero variance.

2.1. Computing the point distance

Approximating T(x;a) by a first-order Taylor expansion

converts the manifold distance into the tangent distance.


Specifically, if a is the m-dimensional parameter vector,

then the transformation of point x under transformation a

is

T(x;a) \approx x + \frac{\partial T}{\partial a_1}(x;a)\, a_1 + \cdots + \frac{\partial T}{\partial a_m}(x;a)\, a_m = x + La

where the columns of L are the derivatives of the trans-

formation at x. If x itself is unknown, L is often approxi-

mated by computing tangents at a convenient nearby point

(of which more in §3).
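When analytic derivatives of the warp are inconvenient, the columns of L can be estimated by finite differences; a sketch (Python; the scaled-rotation transformation and the step size are illustrative choices):

```python
import numpy as np

# Numerical tangent matrix for a parametrized transformation T(x; a):
# column i of L is dT/da_i, estimated by central differences at a0.
# Here T is a toy 2-D scaled rotation with parameters a = (s, theta).
def T(x, a):
    s, th = a
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    return s * (R @ x)

def tangent_matrix(T, x, a0, h=1e-5):
    a0 = np.asarray(a0, dtype=float)
    cols = []
    for i in range(a0.size):
        e = np.zeros_like(a0)
        e[i] = h
        cols.append((T(x, a0 + e) - T(x, a0 - e)) / (2 * h))
    return np.stack(cols, axis=1)

x = np.array([1.0, 2.0])
a0 = np.array([1.0, 0.0])                 # identity transformation
L = tangent_matrix(T, x, a0)

# First-order (tangent) approximation: T(x; a0 + da) ~ T(x; a0) + L @ da.
da = np.array([0.02, 0.01])
approx = T(x, a0) + L @ da
exact = T(x, a0 + da)
print(np.abs(approx - exact).max())       # small: second order in |da|
```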

In the case of Gaussian priors, a solution to the mini-

mization in d(x1,x2) can be obtained directly. The equa-

tion to be minimized is

\min_{x,a_1,a_2}\; \underbrace{|x_1 - (x + L_1 a_1)|^2}_{-\log p(x_1|x,a_1)} \;+\; \underbrace{|x_2 - (x + L_2 a_2)|^2}_{-\log p(x_2|x,a_2)} \;+\; \underbrace{|D[a_1\, a_2]^\top + d|^2}_{-\log p(a_1)p(a_2)} \;+\; \underbrace{|Sx + s|^2}_{-\log p(x)}

where D and d encode the parameters of a single normal

distribution describing the prior probability of the trans-

formation parameters, and S and s represent the prior on

the unwarped image x. Specifically, if the prior on the

transformation parameters is N(Σa,µa), then D = Σ−1

and d = −Dµ. For clarity, the pixel values x are assumed

to have been scaled so that their noise is drawn from a

unit-variance Gaussian per pixel, although spatially vary-

ing noise is easily incorporated.

Gathering the unknowns into a single vector z = [x, a_1, a_2]^\top gives the quadratic form

\min_{x,a_1,a_2} \left\| \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} I & L_1 & 0 \\ I & 0 & L_2 \end{bmatrix} \begin{bmatrix} x \\ a_1 \\ a_2 \end{bmatrix} \right\|^2 + \left\| \begin{bmatrix} S & 0 & 0 \\ 0 & D_1 & 0 \\ 0 & 0 & D_2 \end{bmatrix} \begin{bmatrix} x \\ a_1 \\ a_2 \end{bmatrix} + \begin{bmatrix} s \\ d_1 \\ d_2 \end{bmatrix} \right\|^2

= \min_{x,a_1,a_2} \left\| \begin{bmatrix} I & L_1 & 0 \\ I & 0 & L_2 \\ S & 0 & 0 \\ 0 & D_1 & 0 \\ 0 & 0 & D_2 \end{bmatrix} \begin{bmatrix} x \\ a_1 \\ a_2 \end{bmatrix} + \begin{bmatrix} -x_1 \\ -x_2 \\ s \\ d_1 \\ d_2 \end{bmatrix} \right\|^2

This is of the form \min_z \|Gz + g\|^2 for which a closed-

form solution is readily found. Naively implemented, this

would be computationally very expensive, requiring the

pseudo-inversion of a matrix whose side length is of the

order of the number of pixels. However, the special struc-

ture of G means that the minimum can be computed with

no more complexity than the two-sided tangent distance.
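A direct, dense version of this least-squares problem is easy to set up as a sanity check; a sketch with NumPy on toy sizes (all matrices are random or illustrative, and the structure-exploiting speedup mentioned in the text is deliberately ignored):

```python
import numpy as np

# Toy instance of min_z ||G z + g||^2 with n-pixel "images" and
# m-parameter transformations; z stacks (x, a1, a2).
rng = np.random.default_rng(2)
n, m = 8, 3
I = np.eye(n)
L1, L2 = rng.standard_normal((n, m)), rng.standard_normal((n, m))
S = 0.1 * np.eye(n)                     # broad Gaussian prior on x
D1 = D2 = np.eye(m)                     # unit Gaussian priors on a1, a2
s, d1, d2 = np.zeros(n), np.zeros(m), np.zeros(m)
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

G = np.block([
    [I,                L1,                np.zeros((n, m))],
    [I,                np.zeros((n, m)), L2               ],
    [S,                np.zeros((n, m)), np.zeros((n, m))],
    [np.zeros((m, n)), D1,               np.zeros((m, m))],
    [np.zeros((m, n)), np.zeros((m, m)), D2              ],
])
g = np.concatenate([-x1, -x2, s, d1, d2])

# Closed-form minimizer of ||G z + g||^2 via linear least squares.
z, residuals, rank, sv = np.linalg.lstsq(G, -g, rcond=None)
x, a1, a2 = z[:n], z[n:n + m], z[n + m:]
dist = np.linalg.norm(G @ z + g) ** 2   # the Gaussian manifold distance
print(dist)
```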

2.2. Point to subspace distance

A linear subspace of images is defined by a mean image

m and a set of basis vectors M. An image in the space is

linearly parametrized by vector u yielding the set

S = {m + Mu | u ∈ U}

The distance from a query point x1to the space is then

d(x_1,S) = \min_{y \in S}\; d(x_1, y) = \min_{u}\; \|m + Mu - x_1\|^2

which is easily computed as minimization of a quadratic

form. In real examples, y will be subject to an unknown

transformation T(y,a), and there will be priors on a

and u. Adding these terms gives the one-sided point-to-

subspace distance

d(x_1,S) = \min_{u,a}\; \|T(m + Mu;a) - x_1\|^2 + E(a) + E(u)

Note here that the prior on u acts as a prior on the latent

image y. Denoting this prior by p(y), the subspace dis-

tance becomes

d(x_1,S) = \min_{y,a}\; \|T(y;a) - x_1\|^2 + E(a) - \log p(y)

which is an analogue of the one-sided manifold distance

where x1 is drawn from the prior distribution over y.
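This one-sided point-to-subspace distance can be sketched with a generic optimizer (Python/SciPy; the scalar-scaling transformation and the quadratic penalty weights are illustrative stand-ins for the learnt priors).

```python
import numpy as np
from scipy.optimize import minimize

# Toy point-to-subspace distance with a transformation:
# min_{u,a} ||T(m + M u; a) - x1||^2 + E(a) + E(u), with T(y; a) = a * y.
rng = np.random.default_rng(3)
n, k = 10, 2
m = rng.standard_normal(n)              # subspace mean image
M = rng.standard_normal((n, k))         # subspace basis
u_true, a_true = np.array([0.5, -1.0]), 1.3
x1 = a_true * (m + M @ u_true) + 0.01 * rng.standard_normal(n)

def objective(p):
    u, a = p[:k], p[k]
    y = m + M @ u                       # latent image in the subspace
    return (np.sum((a * y - x1) ** 2)   # data term
            + 0.01 * (a - 1.0) ** 2     # E(a): scale prior near identity
            + 0.01 * np.sum(u ** 2))    # E(u): prior on subspace coordinates

res = minimize(objective, x0=np.array([0.0, 0.0, 1.0]), method="BFGS")
print(res.x, res.fun)                   # recovers a near a_true
```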

2.3. Distance between subspaces

Finally, the problem which faces this paper is to compute

the distance between two subspaces. Defining subspaces

S and T as

S = {m + Mu | u ∈ U}

T = {n + Nv | v ∈ V},