Tracking Features on a Moving Object Using Local Image Bases
Atsuto Maki∗, Yosuke Hatanaka† and Takashi Matsuyama
Graduate School of Informatics, Kyoto University, Sakyo, Kyoto 606 8501, Japan
This paper presents a new method for tracking feature points on the textureless surface of a moving object. We employ a local image basis as a descriptor of each point in order to deal with intensity variations due to the motion of the object relative to the light source. In particular, we propose to adaptively update the basis to enhance its capability as the tracking proceeds. We show that the performance of the method further improves when it is coupled with the Harris feature detector at a very large integration scale.
Tracking feature points on a moving object is an important basic task for estimating the motion of the object or reconstructing its 3D structure from multiview images. Feature points should be specified on the object's surface in such a way that their loci are continuously identified throughout the object's motion, and they are typically detected where high image gradients are observed in various directions [4, 15]. The task of tracking is then to determine the corresponding points in subsequent images by comparing the local intensity distributions. Given an image sequence with relatively small motion from frame to frame, a popular technique for tracking is to minimise the sum of squared differences of image intensities, usually referred to as SSD.
One of the difficulties in finding the correspondence is to deal with the illumination variance which occurs due to the motion relative to the light source. The work of Jin et al.  effectively extended the algorithm of Shi and Tomasi  to take into account changes in illumination by an iterative optimisation, but the results are demonstrated to work only for a well textured object when the camera (not the object) is subject to motion. Although
one can also eliminate obvious matching failures which
are caused for example by specularities as outliers ,
spurious matchings as a group will be problematic. In particular, any point can easily drift on a textureless surface whose irradiance is mainly governed by shading.
∗Presently with Toshiba Research Europe Ltd, Cambridge Research Laboratory, UK. Email: firstname.lastname@example.org.
†Presently with Sony Ericsson Mobile Communications Japan.
The goal of this paper is to develop an efficient technique for feature tracking on textureless surfaces, both in terms of detecting feature points and of describing the neighbouring image intensities. Our strategy is to cope with the illumination variance through the adaptation of a local image basis. The idea is related to the work in  that incorporated a general illumination model1 into motion estimation of large image regions to compensate for illumination changes in the minimisation of SSD. We apply the principle of the general model to local feature points rather than to large image regions such as an entire face, as suggested in . Although that approach requires capturing extra images of the object in a static pose under various lighting conditions in advance, we propose to generate local image bases in the course of tracking. Further, we adaptively update the basis so as to improve its ability as a descriptor as the tracking proceeds. We also report our finding in choosing appropriate feature points that are suitable for tracking on a textureless surface.
Let us consider an image sequence Ij(x), with x,
the coordinates of an image point, and j, an index of
the frame number. Given that the time sampling frequency
is sufficiently high, conventional trackers assume that
small image regions are displaced but their intensities
remain unchanged. The tracker’s task is then to com-
pute the displacement, d, for a number of selected
points over successive frames in the sequence. Namely, the problem has been treated as finding the d that minimises the SSD residual in the following function (although we will introduce an alternative)
∑_{x∈W} [(Ij(x + d) − Īj) − (Ij−1(x) − Īj−1)]²    (1)

where W is a small image window centred on the point for which d is computed. Ī indicates the average intensity in the region considered, and its subtraction is to limit the effects of intensity changes between frames.

1All the images of the same Lambertian surface under different lighting conditions lie in a 3D linear subspace of the space of all possible images of the surface  in the absence of self-shadowing.

978-1-4244-2175-6/08/$25.00 ©2008 IEEE
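As an illustrative sketch of this residual (our own code, not the authors' implementation; the function names, window half-size and search range are arbitrary choices), minimising the zero-mean SSD of (1) over a discrete set of displacements can be written as:

```python
import numpy as np

def zero_mean_ssd(patch_prev, patch_curr):
    """Zero-mean SSD between two image windows W, as in (1).

    Subtracting each window's mean intensity limits the effect of
    global intensity changes between frames."""
    a = patch_prev.astype(float) - patch_prev.mean()
    b = patch_curr.astype(float) - patch_curr.mean()
    return float(np.sum((b - a) ** 2))

def best_displacement(image_prev, image_curr, x, y, half=3, search=2):
    """Exhaustively search the displacement d minimising the residual.

    (x, y): feature coordinates in the previous frame; half: window
    half-size; search: maximum displacement considered per axis."""
    ref = image_prev[y - half:y + half + 1, x - half:x + half + 1]
    best, best_d = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = image_curr[y + dy - half:y + dy + half + 1,
                              x + dx - half:x + dx + half + 1]
            r = zero_mean_ssd(ref, cand)
            if r < best:
                best, best_d = r, (dx, dy)
    return best_d
```

In practice a sub-pixel, gradient-based solver is used instead of this exhaustive search, but the residual being minimised is the same.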
The performance of a number of popular interest point detectors and descriptors has been explored in the literature for the task of matching features on weakly textured surfaces across viewpoints. For example, Maximally Stable Extremal Regions  are useful for wide-baseline matching. Also, the Harris-affine detector  followed by a SIFT  or shape-context descriptor  is reported  to be the best combination.
For tracking multiple points on a textureless moving object, we also need to first detect points that are suitable for tracking. We then determine how to describe the local neighbourhood of each point, replacing the residual in (1) with a new one that makes more stable tracking possible. In this paper, we base our detector on the Harris corner detector at a large integration scale and propose to employ a local image basis as a method for describing the varying appearances of the neighbourhood of the detected points.
3 Point Detector at Large Scale
For textureless surfaces, the problem of finding appropriate points for tracking remains difficult, although robust solutions have been proposed [4, 14, 15] for well textured objects. This is because distinctive corners tend to arise on odd-shaped parts of a surface, if not at extremal boundaries or self-shadows, and they can easily change their locations according to the object's motion. Moreover, around those points it is difficult to spatially register the local surface in approximation to a planar patch. It is thus more desirable to detect corners at points where the gradients of intensity are moderate. For this
reason, we opt for detecting points at local maxima of
Harris measure  with a large integration scale. The second moment matrix, C, describes the gradient distribution in a local neighbourhood of a point x. The Harris function is

det C − κ (tr C)²

with

C(x, σI, σD) = σD² G(σI) ∗ [ Lx²(x, σD)   LxLy(x, σD)
                             LxLy(x, σD)  Ly²(x, σD) ]

where κ is a constant, σI is the integration scale, σD the derivation scale, G the Gaussian, and L the image smoothed by a Gaussian, so that

Lx(x, σD) = ∂xG(σD) ∗ I(x) .
We then choose to use a large value for σI so that features due to moderate intensity changes can be extracted. This operation is conceptually supported by the technique of detecting scale invariant interest points  in that we can also select points at which a local measure is maximal over scales. The characteristic of these points makes them extremely suitable for tracking, as it allows us to avoid detecting points whose drastic shading variance prevents approximation by a planar patch. Figure 1 exemplifies feature points detected on a textureless surface with different values of σI: 1.0 (left) and 10.0 (right).

Figure 1. Local maxima of Harris measure. On the right are those obtained with a larger integration scale, σI, than those on the left.
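The detector described above can be sketched as follows (our illustration, assuming NumPy and SciPy; the scales and κ = 0.04 are typical but illustrative values, and the function names are ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_response(image, sigma_i=10.0, sigma_d=1.0, kappa=0.04):
    """Harris measure det(C) - kappa * tr(C)^2, with the second moment
    matrix C computed from derivatives at scale sigma_d and integrated
    over a Gaussian window at scale sigma_i."""
    img = image.astype(float)
    # Gaussian derivatives at the derivation scale sigma_d.
    Lx = gaussian_filter(img, sigma_d, order=(0, 1))
    Ly = gaussian_filter(img, sigma_d, order=(1, 0))
    # Integrate the outer products over the window G(sigma_i).
    Cxx = gaussian_filter(Lx * Lx, sigma_i)
    Cxy = gaussian_filter(Lx * Ly, sigma_i)
    Cyy = gaussian_filter(Ly * Ly, sigma_i)
    det = Cxx * Cyy - Cxy ** 2
    tr = Cxx + Cyy
    return det - kappa * tr ** 2

def local_maxima(response, size=9):
    """Feature points at positive local maxima of the Harris measure."""
    peaks = (response == maximum_filter(response, size)) & (response > 0)
    return np.argwhere(peaks)  # rows of (y, x) coordinates
```

Choosing a large sigma_i corresponds to the large integration scale advocated here: features due to moderate, shading-like intensity changes survive the integration, while sharp corners at extremal boundaries are de-emphasised.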
4 Local Image Basis as a Descriptor
Our focus is then to describe the local neighbour-
hood of the detected point in such a way that the lo-
cal image characteristics can be identified for accurate
tracking even under illumination variance. Invariably,
a 2D image patch extracted from an image is used as
the description of the corner feature and this description
plays an important role in establishing correspondence
during tracking. As is well known, however, the im-
age patch formed about a projected corner can change
dramatically over time due to the following two factors:
(a) spatial deformation from motion and projection,
(b) intensity variation from relative lighting change.
We take into account both of these factors in our tracker.
In order to accurately represent the spatial deformation
of a patch, we need its local orientation, n, and the camera projection matrix, Pj, in each frame (j is an index of the frame number). We compute n using tracked patches in the first few frames: we deform them while searching over the whole hemisphere of possible orientations to find the one that generates the most consistent intensities of the pixels brought into correspondence by that orientation2. On the other hand, in the jth frame, Pj−1 is available by solving the well known structure from motion problem by factorization , using the coordinates of the tracked points that are in correspondence up to the current frame.
2The search is quantised at every five degrees, starting with the orientation aligned with the fronto-parallel image plane.
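The quantised hemisphere search for n can be sketched as follows (our own sketch, not the authors' code; the consistency score, e.g. the summed variance of intensities brought into correspondence by a candidate orientation, is passed in as a callback since its exact form depends on the warping machinery):

```python
import numpy as np

def hemisphere_normals(step_deg=5.0):
    """Candidate unit normals on the visible hemisphere, quantised at
    step_deg, starting from the fronto-parallel orientation (0, 0, 1)."""
    normals = [np.array([0.0, 0.0, 1.0])]
    for tilt in np.arange(step_deg, 90.0, step_deg):
        for azim in np.arange(0.0, 360.0, step_deg):
            t, a = np.radians(tilt), np.radians(azim)
            normals.append(np.array([np.sin(t) * np.cos(a),
                                     np.sin(t) * np.sin(a),
                                     np.cos(t)]))
    return np.array(normals)

def estimate_normal(consistency, step_deg=5.0):
    """Pick the orientation whose warped pixels are most consistent
    across the first few frames; consistency(n) returns a score to be
    minimised (e.g. intensity variance under the warp induced by n)."""
    candidates = hemisphere_normals(step_deg)
    scores = [consistency(n) for n in candidates]
    return candidates[int(np.argmin(scores))]
```

At a five-degree quantisation this is only about 1200 candidates per point, so the brute-force search is cheap enough to run once per feature during initialisation.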
Let us represent the spatial deformation of a local patch in the jth frame by a matrix, Dj = [u v], using two 3-vectors, u and v, which define a local plane in 3D space. For each tracked point, choosing the first frame to be canonical, the deformation in an arbitrary jth frame is formulated as Dj = RjD1, where the elements of D1 are available from the computed n, and Rj, a 3 × 3 rotation matrix from the first frame to the current frame, is directly given by the parameters in Pj (we assume that the fast-sampling hypothesis justifies this approximation).
Dealing with Photometric Variation
For tracking a point x on the surface of a moving object, we propose to impose a rank three constrained approximation  on the neighbouring pixels. For this to be possible we compute the local image basis for each feature point from a small number, a minimum of four, of initial input images. The rank three approximation is valid as long as the surface of a patch is locally illuminated by a collimated light source and is somewhat three dimensional, deviating from a truly planar patch3, so that the image basis models the changes in intensity. Note that such points are exactly those typically extracted at maxima of the Harris measure with a large scale.
We proceed to track feature points by initially finding correspondences over the first m frames (m ≥ 3) with a simple correlation of the neighbouring image patch while considering the spatial deformation in terms of a projective homography, assuming that the image variation is limited at this stage. Given registered patches, each consisting of n pixels, we record the intensities at corresponding pixels in an n × m intensity matrix, Ij, each column of which contains the intensities in a single patch, where j is the index to the current frame. We generate the initial estimate of the local image basis, Bj|j=m, by a rank three approximation to the matrix Ij such that Ij = BjSj, where Bj = [b1 b2 b3] is a 3D local image basis, and b1,2,3 span the local illumination subspace. Sj is a 3 × m matrix whose columns correspond to collimated relative light sources up to an ambiguity, but it is not used in the rest of the paper.
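The rank three factorisation Ij = BjSj is readily obtained from a truncated SVD; a minimal sketch (our illustration, with an assumed function name):

```python
import numpy as np

def local_image_basis(I):
    """Rank three approximation I ~ B S of an n x m intensity matrix
    whose columns are registered local patches (n pixels, m >= 4 frames).

    B (n x 3) = [b1 b2 b3] spans the local illumination subspace;
    S (3 x m) relates to the collimated light source per frame, up to
    an ambiguity, and is not needed for tracking itself."""
    U, s, Vt = np.linalg.svd(I, full_matrices=False)
    B = U[:, :3]                    # the 3D local image basis
    S = np.diag(s[:3]) @ Vt[:3, :]  # per-frame coefficients
    return B, S
```

By the Eckart–Young theorem this truncation is the best rank three approximation of Ij in the least-squares sense, which is exactly what the 3D illumination subspace model calls for.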
Although the principle of using local image bases was first introduced in  to compose a 3D structure model from a group of bases, each basis computed at the initial stage naturally has limited capability as a descriptor by itself, since only a small variance of illumination can be encoded in the representation. Our key
advance is that we propose to automatically update the
linear image basis as we observe more variation of the
neighbouring irradiance as the tracking proceeds.
3This is in contrast to the work in , which assumes normal vectors do not change within the local patch.
We track each feature point by searching for a corresponding point while investigating its consistency to the linear image basis. In the jth frame (j > m) we deform each component of Bj−1 using Dj, and solve the problem of finding the d that minimises the rank three residual by replacing the function in (1) with

∑_{x∈W} [Ij(x + d) − Î]²

where Î, an estimated intensity, is the corresponding element of L̂j, the rank three approximation of the patch by the deformed basis, and Lj ∈ Rn represents a local patch, containing the values of Ij(x + d), which is also determined with Dj.
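Computing this rank three residual amounts to projecting the candidate patch onto the span of the (deformed) basis; a minimal sketch under that reading (our code and naming, not the authors'):

```python
import numpy as np

def rank3_residual(patch_vec, B):
    """Residual replacing (1): sum over the window of [I_j(x+d) - Ihat]^2,
    where Ihat is the orthogonal projection of the candidate patch L_j
    (flattened to a vector) onto the span of the deformed basis B."""
    Bq, _ = np.linalg.qr(B)           # orthonormalise the basis columns
    L = patch_vec.astype(float)
    L_hat = Bq @ (Bq.T @ L)           # estimated intensities Ihat
    return float(np.sum((L - L_hat) ** 2))
```

A patch whose appearance is explained by the illumination subspace yields a near-zero residual regardless of the lighting, which is what makes this criterion more stable than zero-mean SSD under shading changes.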
Updating the Local Image Basis
After finding the corresponding point in a new frame, we update the local basis, Bj, by additionally using the intensities at the neighbouring pixels in the patch if they are judged to be useful. That is, at each tracked point we revise the intensity matrix Ij by incorporating a new column, consisting of the intensities in the referred patch, and check its feasibility through the rank of Ij with the evaluation ratio, r(j), of the third and the fourth singular values: r(j) = σ3/σ4. If rank(Ij) is three, which is the ideal case, σ3 should be much larger than σ4, which should always be very small. Thus, in the jth frame, if the value of r(j) is higher than in the previous frame, we replace Bj−1 with a new one, Bj, computed by decomposing Ij. We then discard a redundant column of Ij. This procedure forces the basis to become more tolerant to the varying effects of lighting as the tracking proceeds, even if the initial basis happens to be rank deficient due to degenerate motions.
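To make the update rule concrete, here is a minimal sketch (our illustration; the function name and the small guard against division by zero are ours, and column pruning is omitted):

```python
import numpy as np

def maybe_update_basis(I, new_patch, r_prev):
    """Tentatively append the newly registered patch as a column of the
    intensity matrix I, and keep the refreshed basis only if the
    evaluation ratio r = sigma3 / sigma4 improves over the previous
    frame, i.e. if the new patch strengthens the rank three structure."""
    I_new = np.column_stack([I, new_patch.astype(float)])
    s = np.linalg.svd(I_new, compute_uv=False)
    r_new = s[2] / max(s[3], 1e-12)     # sigma3 / sigma4, guarded
    if r_new > r_prev:
        U, _, _ = np.linalg.svd(I_new, full_matrices=False)
        return I_new, U[:, :3], r_new   # accept: matrix, basis, ratio
    return I, None, r_prev              # reject: keep the previous basis
```

Patches that merely repeat an already-seen illumination, or that are corrupted by mismatches, tend to inflate σ4 relative to σ3 and are therefore rejected, while genuinely new lighting directions are absorbed into the basis.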
Figure 2 shows the tracking results for an input sequence of a moving matte statue of “Venus”. The surface is textureless, which is problematic not only for tracking but also for feature detection dependent on texture. The tracked points are indicated as the centres of white four-sided shapes which also show how the patches are registered. They are shown at every fourth frame, starting from the eighth frame (the points in the first frame are shown in Figure 1 (right)). The proposed tracker continues to track the features more accurately than does zero mean SSD, with which many points are prone to drift, e.g. those around the left eye. We have tested our tracker with other sequences of textureless objects and found it to perform stably.
Figure 2. Tracking results for “Venus” at every fourth frame. Top: Zero mean SSD. Bottom: the proposed tracker.
Figure 3 shows an example of a local linear image basis which is computed and updated at the right end of the mouth of “Venus”. The deformed local patches, Lj|j=1,...,4, and the initial basis, b1,2,3, are on the left. The basis is shown in descending order of the corresponding singular values, from left to right, and the fourth (rightmost) is thus the residual. Lj|j=1,...,4 include only small intensity variations between the frames, and the computed basis of lower order turns out to be noisy; the third base, b3, looks as much like a residual as the fourth one. On the right are new input patches, Lj|j=1,12,15,16, in which more variation of illumination is involved. The updated basis reflects the 3D aspect of the surface more effectively, as is obvious in b2 and b3. Figure 4 shows how the evaluation ratio, r(j), changes as the tracking proceeds. r(j) is initially about 2.0 but grows with the updates, indicating that the ability of the basis to account for the illumination variances is enhanced.
Figure 3. Left: Patches tracked in the first
frames (top). Computed basis (bottom).
Right: New inputs and updated basis.
Figure 4. The evaluation ratio of a local im-
age basis, r(j) = σ3/σ4, in each frame.
To evaluate the performance of the tracking quantitatively, we computed the average distances of tracked points to epipolar lines. We first computed the affine fundamental matrices by using the coordinates of the tracked points in each frame up to the 20th frame. We then drew epipolar lines in the first frame by using them and checked the average distance. Since the distance should become zero when perfect matchings are available, we can employ this value as a measure of tracking accuracy. Figure 5 shows the average distances plotted for each
frame (initial frame is numbered zero), for each case
of using linear image bases and zero mean SSD. For
comparisons, the results are shown for two different sets
of feature points detected with different values of σI:
1.0 and 10.0 (see Figure 1). In the graph, ’H’ and ’L’ stand for the two cases, respectively. We can observe that the smallest error (distance) is achieved in the case of L-Basis, i.e. when the points detected at σI = 10.0 are tracked by using the linear image basis. The error increases more drastically when using zero mean SSD, regardless of the choice of the point set.
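The accuracy measure used here, the average distance from tracked points to the epipolar lines induced by their correspondences, can be sketched as follows (our code, not the authors'; it takes any fundamental matrix F, with the convention x2ᵀ F x1 = 0):

```python
import numpy as np

def avg_epipolar_distance(F, pts1, pts2):
    """Average distance (pixels) from points in image 1 to the epipolar
    lines F^T x2 induced by their correspondences pts2 in image 2.

    Perfect matchings give zero distance, so the value serves as a
    measure of tracking accuracy."""
    x1 = np.column_stack([pts1, np.ones(len(pts1))])  # homogeneous coords
    x2 = np.column_stack([pts2, np.ones(len(pts2))])
    lines = x2 @ F                                    # i-th row is (F^T x2_i)^T
    num = np.abs(np.sum(lines * x1, axis=1))          # |l . x1| per point
    den = np.hypot(lines[:, 0], lines[:, 1])          # line normalisation
    return float(np.mean(num / den))
```

The same function applies whether F is a general or an affine fundamental matrix, as used in the experiment above.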
Figure 5. Average distances (in pixels) of tracked points from epipolar lines. (Vertical axis: distance from the epipolar line; horizontal axis: frame number, 0 to 20.)
We have tackled the ill-posed problem of tracking feature points on a textureless surface and shown promising results by (i) employing a local image basis as a descriptor of each feature point, which we (ii) detect by the Harris measure at a very large scale. In particular, we proposed to (iii) update the bases as the tracking proceeds so that they can accommodate the varying effects of illumination. Future work will be directed at evaluating the algorithm using more data, e.g. involving specularities, in comparison to other related approaches , and at selecting the optimal derivation scale  for feature detection by automatic scale selection.
This work is supported by MEXT, Japan, under a Grant-in-Aid for Scientific Research (No. 18651077), in part by ditto (No. 18049046), and by the national project on “Development of high fidelity digitization software for large-scale and intangible cultural assets”. An available implementation of the Harris operator was used in the experiments. We also wish to thank Lyndon Hill for his helpful comments on the manuscript.
 P. Belhumeur and D. Kriegman. What is the set of im-
ages of an object under all possible lighting conditions?
In CVPR, pages 270–277, 1996.
 S. Belongie, J. Malik, and J. Puzicha. Shape match-
ing and object recognition using shape contexts. IEEE-
PAMI, 24(4):509–522, 2002.
 G. D. Hager and P. N. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. IEEE-PAMI, 20(10):1025–1039, 1998.
 C. Harris and M. Stephens. A combined corner and
edge detector. In Proc. Fourth Alvey Vision Conference,
pages 147–151, 1988.
 H. Jin, P. Favaro, and S. Soatto. Real-time feature tracking and outlier rejection with changes in illumination. In ICCV, pages 684–689, 2001.
 T. Lindeberg. Feature detection with automatic scale selection. IJCV, 30(2):79–116, 1998.
 D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
 J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust
wide baseline stereo from maximally stable extremal re-
gions. In BMVC, 2002.
 K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In ICCV, pages 525–531, 2001.
 K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In ECCV (1), pages 128–142, 2002.
 P. Moreels and P. Perona. Evaluation of features detectors and descriptors based on 3d objects. In ICCV, 2005.
 M. Nishino, A. Maki, and T. Matsuyama.
based feature matching under illumination variances. In
IEA/AIE, pages 94–104, 2007.
 A. Shashua. Geometry and photometry in 3D visual
recognition. PhD thesis, Dept. Brain and Cognitive Sci-
ence, MIT, 1992.
 J. Shi and C. Tomasi. Good features to track. In CVPR,
pages 593–600, 1994.
 S. M. Smith and J. M. Brady. SUSAN – A new ap-
proach to low level image processing. Technical Report
TR95SMS1c, Chertsey, Surrey, UK, 1995.
 C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 9(2):137–154, 1992.
 T. Tommasini, A. Fusiello, E. Trucco, and V. Roberto. Making good features track better. In CVPR, 1998.
 C. Wiles, A. Maki, and N. Matsuda. Hyper-patches
for 3D model acquisition and tracking. IEEE-PAMI,