Discriminative Learning of
Local Image Descriptors
Matthew Brown, Member, IEEE, Gang Hua, Member, IEEE and Simon Winder, Member, IEEE
Abstract: In this paper we explore methods for learning local image descriptors from training data. We describe a set of building blocks for constructing descriptors which can be combined together and jointly optimized so as to minimize the error of a nearest-neighbour classifier. We consider both linear and non-linear transforms with dimensionality reduction, and make use of discriminant learning techniques such as Linear Discriminant Analysis (LDA) and Powell minimization to solve for the parameters. Using these techniques we obtain descriptors that exceed state-of-the-art performance with low dimensionality. In addition to new experiments and recommendations for descriptor learning, we are also making available a new and realistic ground truth dataset based on multi-view stereo data.
Index Terms: image descriptors, local features, discriminative learning, SIFT
I. INTRODUCTION
LOCAL feature matching has rapidly become the dominant paradigm for recognition and registration in computer vision. In traditional vision tasks such as panoramic stitching [1], [2] and structure from motion [3], [4], it has largely replaced direct methods due to its speed, robustness, and ability to work without initialization.
It is also used in many recognition problems. Vector quantizing feature descriptors into finite vocabularies of “visual words” has enabled visual recognition to scale into the millions of images [5], [6]. The statistical properties of local features and visual words have also been exploited by many researchers for object class recognition problems [7], [8], [9].
However, despite the proliferation of learning techniques that
are being employed for higher level visual tasks, the majority
of researchers still rely upon a small selection of hand coded
feature transforms for the lower level processing. A good survey
of some of the more common techniques can be found in [10],
[11]. Some exceptions to this rule and good examples of low-level feature learning include the work of Lepetit and Fua [12], Shotton et al. [13] and Babenko et al. [14]. Lepetit and Fua [12] showed that randomized trees based on simple pixel differences could be an effective low-level operation. This idea was extended by Shotton et al. [13], who demonstrated a compelling scheme for object class recognition. Babenko et al. [14] showed that boosting could be applied to learn point-based feature matching representations from a large training dataset.
level image operations is the Berkeley edge detector [15], which,
rather than being optimized for recognition performance per se,
is designed to mimic human edge labellings.
Matthew Brown is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. Email: matthew.brown@epfl.ch. Gang Hua is with Nokia Research Center Hollywood, 2400 Broadway, D-500, Santa Monica, CA 90404. Email: ganghua@gmail.com. Simon Winder is with the Interactive Visual Media group at Microsoft Research, One Microsoft Way, Redmond, WA 98052. Email: swinder@microsoft.com.
Progress in image feature matching improved rapidly following
Schmid and Mohr’s work on indexing using grey-value invariants
[16]. This represented a step forward over previous approaches to
invariant recognition that had largely been based on geometrical
entities such as edges and contours [17]. Another landmark paper
in the area was the work of Lowe [18], [19] who demonstrated
the importance of scale invariance and a non-linear, edge-based
descriptor transformation inspired by the ideas of Hubel and
Wiesel [20]. Since then small improvements have resulted, mainly
due to improved spatial pooling arrangements that are more
closely linked to the errors present in the interest point detection
process [11], [21], [22].
One criticism of the local image descriptor designs described
above has been the high dimensionality of descriptors (e.g., 128
dimensions for SIFT). Dimensionality reduction techniques can help here, and have also been used to design features.
A first attempt was PCA-SIFT [23], which used the principal
components of gradient patches to form local descriptors. Whilst
this provides some benefits in reducing noise in the descriptors, a
better approach is to find projections that actively discriminate
between classes [24], instead of just modelling the total data
variance. Such techniques have been extensively studied in the
face recognition literature [25], [26], [27].
Our work attempts to improve on the state of the art in local de-
scriptor matching by learning optimal low-level image operations
using a large and realistic training dataset. In contrast to previous
approaches that have used only planar transformations [11] or
jittered patches [12], we use actual 3D correspondences obtained
via a stereo depth map. This allows us to design descriptors that
are optimized for the non-planar transformations and illumination
changes that result from viewing a truly 3D scene. We note that
Moreels and Perona have also proposed a technique for evaluating
3D feature matches based on trifocal constraints [28]. Our work
extends this approach by giving us the ability to generate new
correspondences at arbitrary locations and also to reason about
visibility.
To generate correspondences, we leverage recent improvements
in multi-view stereo matching [29], [30]. In contrast to previous
approaches [31], this allows us to generate correspondences
for arbitrary interest points and to model true interest point
noise. We explore two methodologies for feature learning. The
first uses parametric models inspired by previous successful
feature designs, and Powell minimization [32] to solve for the
parameters. The second uses non-parametric dimensionality re-
duction techniques common in the face recognition literature.
Our training and test datasets, containing approximately 2.5 × 10^6 labelled image patches, are being made available online at http://www.cs.ubc.ca/mbrown/patchdata/patchdata.html.
A. Contributions
The main contributions of this work are as follows:
1) We present a new ground-truth dataset for descriptor learn-
ing, making use of multi-view stereo from large 3D recon-
structions. This allows us to optimize descriptors for real
interest point detections. We will be making this dataset
available to the community.
2) We extend previous work in parametric and non-parametric
descriptor learning, and provide recommendations for future
designs.
3) We conduct several new experiments, including reducing
dynamic range to minimize the number of bits used by our
feature descriptors (important for scalability) and optimiz-
ing descriptors for different types of interest point (e.g.,
Harris and DOG).
II. GROUND TRUTH DATASET
To generate ground truth data for our descriptor matching
problems, we make use of recent advances in multi-view image
recognition and correspondence. Recent improvements in wide-
baseline matching and structure from motion have made it possi-
ble to find matches and compute cameras for datasets containing
thousands of images, with greatly varying pose and illumination
conditions [33], [34]. Furthermore, advances in multi-view stereo
have made it possible to reconstruct dense surface models for
such images despite the greatly varying imaging conditions [29],
[30].
We view these 3D reconstructions as a possible source of train-
ing data for object recognition problems. Previous work [31] used
re-projections of 3D point clouds to establish correspondences
between images, adding synthetic jitter to emulate the noise
introduced in the interest point detection process. This approach,
whilst being straightforward to implement, has the disadvantage
of allowing training data to be collected only at discrete locations,
and fails to model true interest point noise.
In this work, we use dense surface models obtained via stereo
matching to establish correspondences between images. Note
that because of the epipolar and multi-view constraints, stereo
matching is a much easier problem than unconstrained 2D feature
matching. We can thus generate correspondences via local stereo matching and multi-view consistency constraints that would be very challenging to establish using wide-baseline 2D feature matching alone.
We can also learn descriptors that are optimized for actual (and
arbitrary) interest point detections, finding corresponding points
by transferring their positions via the depth maps.
We make use of camera calibration information and dense
multi-view stereo data for three datasets containing over 1000
images provided by [34] and [30]. In a similar spirit to [31], we
extract patches around each interest point and store them in a large
dataset on disk for efficient processing and learning. We detect
Difference of Gaussian (DOG) interest points with associated
position, scale and orientation in the manner of [19] (we also
experiment with multi-scale Harris corners in Section VI-E). This
results in around 1000 interest points per image.
For each interest point detected, we compute the position,
scale and orientation of the local region when mapped into each
neighbouring image. These parameters are solved for by a least-
squares procedure. We do this by creating a uniform, dense point
sampling (once per pixel) within the feature footprint in the first
image. These points are then transferred via the depth map into the
second image. In general the sampled points will not undergo an
exact similarity transform, due to depth variations and perspective
effects, so we estimate the best translation, rotation and scale
between the corresponding image regions by least squares.
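The least-squares similarity fit in this step can be written down in closed form. The sketch below is our own illustration (not the authors' code) using the standard Umeyama-style solution for the best rotation, scale and translation between two point sets; the function and variable names are hypothetical.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares similarity transform (rotation R, scale s, translation t)
    mapping 2D points src -> dst, i.e. minimizing ||s*R*src + t - dst||^2.
    src, dst: (N, 2) arrays of corresponding points (here, samples inside the
    feature footprint and their positions transferred via the depth map)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    # Cross-covariance and SVD give the optimal rotation.
    U, S, Vt = np.linalg.svd(dst_c.T @ src_c)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (src_c ** 2).sum()
    t = mu_d - s * R @ mu_s
    return R, s, t

# The rotation angle and scale give the orientation/scale change of the
# interest point region, e.g. angle = np.arctan2(R[1, 0], R[0, 0]).
```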
First, we check to see if the interest point is visible in the
neighbouring image using the visibility maps supplied by [30] (a
visibility map is defined over each neighbouring image, and each
pixel has the label 1 if the corresponding point in the reference
image is visible, and 0 otherwise). We then declare interest points
that are detected within 5 pixels of position, 0.25 octaves of scale
and π/8 radians in angle to be “matches”. Those falling outside 2× these ranges are defined to be “non-matches”. Interest point detections that fall in between these ranges are deemed ambiguous and are not used in learning or testing. We chose fairly
small ranges for position, orientation and scale tolerance to suit
our intended applications in automatic stitching and structure from
motion. However, for category recognition problems one might
choose larger ranges that should result in more position invariance
but less discriminative representations. See Figures 1 and 2 for
examples of correspondences and image patches generated by this
process.
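As a concrete illustration of the labelling rule above, the following sketch (our own, with hypothetical names) assigns match / non-match / ambiguous labels from the position, scale and orientation differences of two detections. Whether "outside 2× these ranges" applies per criterion or to all criteria jointly is our interpretation; here any single criterion falling outside twice its range marks a non-match.

```python
import numpy as np

# Thresholds from the text: 5 pixels, 0.25 octaves, pi/8 radians.
POS_T, SCALE_T, ANGLE_T = 5.0, 0.25, np.pi / 8

def label_pair(dpos, dscale_octaves, dangle):
    """Return 'match', 'non-match' or 'ambiguous' for a pair of detections.
    dpos: position difference in pixels; dscale_octaves: |log2(s1/s2)|;
    dangle: absolute orientation difference in radians (wrapped to [0, pi])."""
    within = dpos < POS_T and dscale_octaves < SCALE_T and dangle < ANGLE_T
    outside = (dpos > 2 * POS_T or dscale_octaves > 2 * SCALE_T
               or dangle > 2 * ANGLE_T)
    if within:
        return "match"
    if outside:
        return "non-match"
    return "ambiguous"   # excluded from training and testing
```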
III. DESCRIPTOR ALGORITHM
In previous work [31] we have noted that many existing
descriptors described in the literature, while appearing quite
different, can be constructed using a common modular framework
consisting of processing stages similar to Figure 3. At each stage,
different candidate block algorithms (described below) may be
swapped in and out to produce a new overall descriptor. In
addition, some candidates have free parameters that we can adjust
in order to maximize the performance of the descriptor as a whole.
Certain of these algorithmic combinations give rise to published
descriptors but many are untested. Using this structure allows us
to examine the contribution of each building block in detail and
obtain a better covering of the space of possible algorithms.
Our approach to learning descriptors is therefore to put to-
gether a combination of building blocks and then optimize the
parameters of these blocks using learning to obtain the best
match/no-match classification performance. This contrasts with
prior attempts to hand tune descriptor parameters and helps to
put each algorithm on the same footing so that we can obtain
and compare best performances.
Figure 3 shows the overall learning framework for building
robust local image descriptors. The input is a set of image patches,
which may be extracted from the neighbourhood of any interest
point detector. The processing stages consist of the following:
• G-block: Gaussian smoothing is applied to the input patch.
• T-blocks: We perform a range of non-linear transformations on the smoothed patch. These include operations such as angle-quantized gradients and rectified steerable filters, and typically resemble the “simple-cell” stage in human visual processing.
• S-blocks/E-blocks: We perform spatial pooling of the above filter responses. S-blocks use parametrized pooling regions; E-blocks are non-parametric. This stage resembles the “complex-cell” operations in visual processing.
• N-blocks: We normalize the output patch to account for photometric variations. This stage may optionally be followed by another E-block, to reduce the number of dimensions at the output.
Fig. 1. Generating ground truth correspondences. To generate the ground truth image correspondences needed as input to our algorithms, we use multi-view
stereo data provided by Goesele et al. [30]. Interest points are detected in the reference image, and transferred to each neighbouring image via the depth map.
If the projected point is visible, we look for interest points within a specified range of position, orientation and scale, and declare these to be matches. Points
lying outside of twice this range are declared to be non-matches. This is the basic input to our learning algorithms. Left to right: reference image, neighbour
image, reference matches, neighbour matches, depth map, visibility map.
In general, the T-block stage extracts useful features from the data
like edge or local frequency information, and the S-block stage
pools these features locally to make the representation insensitive
to positional shift. These stages are similar to the simple/complex cells in the human visual cortex [36]. It is important that the T-block stage introduces some non-linearity, otherwise the smoothing step amounts to simply blurring the image. Also, the N-block normalization is critical as many factors such as lighting,
reflectance and camera response have a large effect on the actual
pixel values.
These processing stages have been combined into three different
pipelines, as shown in the figure. Each stage has trainable
parameters, which are learnt using our ground truth dataset of
match/non-match pairs. In the remainder of this section, we will
take a more detailed look at the parametrization of each of these
building blocks.
A. Pre-smoothing (G-block)
We smooth the image pixels using a Gaussian kernel of
standard deviation σ_s as a pre-processing stage to allow the
descriptor to adapt to an appropriate scale relative to the interest
point scale. This stage is optional and can be included in the
T-block processing (below) if desired.
B. Transformation (T-block)
The transformation block maps the smoothed input patch onto
a grid with one length-k vector with positive elements per
output sample. In this paper, the output grid was given the same
resolution as the input patch, i.e., 64×64. Various forms of linear
or non-linear transformations or classifiers are possible and have
been described previously [31]. In this paper we restrict our choice
to the following T-blocks which were found to perform well:
[T1] We evaluate the gradient vector at each sample and
recover its magnitude m and orientation θ. We then quantize the orientation to k directions and construct a vector of length k such that m is linearly allocated to the two circularly adjacent vector elements i and i+1 representing θ_i < θ < θ_{i+1}, according to the proximity to these quantization centres. All other elements are zero. This process is equivalent to the orientation binning used in SIFT and GLOH [11]. For the T1a-variant we use k = 4 directions and for the T1b-variant we use k = 8 directions.
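A minimal sketch of this orientation binning (our own illustration, not the authors' implementation), in which each pixel's gradient magnitude is split linearly between the two circularly adjacent orientation bins; the gradient operator and function name are our own choices:

```python
import numpy as np

def t1_block(patch, k=4):
    """Angle-quantized gradients (T1): returns an (H, W, k) array in which the
    gradient magnitude at each pixel is linearly split between the two
    circularly adjacent orientation bins, as in SIFT-style binning."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)   # orientation in [0, 2*pi)
    pos = theta * k / (2 * np.pi)                   # continuous bin position
    lo = np.floor(pos).astype(int) % k              # lower adjacent bin
    hi = (lo + 1) % k                               # circularly next bin
    w_hi = pos - np.floor(pos)                      # linear interpolation weight
    out = np.zeros(patch.shape + (k,))
    idx = np.indices(patch.shape)
    out[idx[0], idx[1], lo] = mag * (1 - w_hi)
    out[idx[0], idx[1], hi] += mag * w_hi
    return out
```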
[T2] We evaluate the gradient vector at each sample and rectify
its x and y components to produce a vector of length 4 for the T2a-variant: {|∇x| − ∇x; |∇x| + ∇x; |∇y| − ∇y; |∇y| + ∇y}. This provides a natural sine-weighted quantization of orientation into 4 directions. Alternatively for T2b, we extend this to 8 directions by concatenating an additional length-4 vector computed from ∇45°, which is the gradient vector rotated through 45°.
[T3] We apply steerable filters at each sample location using n
orientations and compute the responses from quadrature pairs [37]
with rectification to give a length k = 4n vector in a similar way
to the gradient computation described above so that the positive
and negative parts of the quadrature filter responses are placed in
different vector elements. We tried two kinds of steerable filters:
those based on second derivatives provide broader scale and
orientation tuning while fourth order filters give narrow scale and
orientation tuning that can discriminate multiple orientations at
each location in the input patch. These filters were implemented
using the example coefficients given in [37]. The variants were
Fig. 2. Patch correspondences from the Liberty dataset. Top rows: reference image and depth map (left column), generated point correspondences (other
columns). Note the wide variation in viewpoints and scales. Bottom rows: patches extracted from this dataset. Patches are considered to be “matching” if the
detected interest points are within 5 pixels in position, 0.25 octaves of scale and π/8 radians in angle.
Fig. 3. Schematic showing the learning algorithms explored for building local image descriptors. Three overall pipelines have been explored: (1) uses parametric optimization of pooling (‘S’ blocks) via Powell minimization as in [31]; (2) uses optimal linear projections (‘E’ blocks), found via LDA as in [35]; and a third approach (3) combines a stage of (1) followed by the linear projection step of (2).
T3g: 2nd order, 4 orientations; T3h: 4th order, 4 orientations; T3i:
2nd order, 8 orientations; and T3j: 4th order, 8 orientations.
[T4] We compute two isotropic Difference of Gaussians (DOG)
responses with different centre scales at each location by con-
volving the already smoothed patch with three new Gaussians
(one additional larger centre and two surrounds). The two linear
DOG filter outputs are then used to generate a length 4 vector
by rectifying their responses into positive and negative parts as
described above for gradient vectors. We set the ratio between the
centre and surround space constants to 1.4. The pre-smoothing
stage sets the size of the first DOG centre and so we use one
additional parameter to set the relative size of the second DOG
centre.
Fig. 4. Examples of the different spatial summation blocks: S1, SIFT grid with bilinear weights; S2, GLOH polar grid with bilinear radial and angular weights; S3, 3×3 grid with Gaussian weights; S4, 17 polar samples with Gaussian weights. For S3 and S4, the positions of the samples and the sizes of the Gaussian summation zones were parametrized in a symmetric manner.
C. Spatial Pooling (S-block)
Many descriptor algorithms incorporate some form of his-
togramming. In our pooling stage we spatially accumulate
weighted vectors from the previous stage to give N linearly summed vectors of length k, and these are concatenated to form a descriptor of kN dimensions, where N ∈ {3, 9, 16, 17, 25}. We
now describe the different spatial arrangements of pooling and
the different forms of weighting:
[S1] We used a square grid of pooling centres (see Figure 4),
with the overall footprint size of this grid being a parameter. The
vectors from the previous stage were summed together spatially
by bilinearly weighting them according to their distance from the
pooling centres as in the SIFT descriptor [19] so that the width of
the bilinear function is dictated by the output sample spacing. We
use sub-pixel interpolation throughout as this allows continuous
control over the size of the descriptor grid. Note that all these
summation operations are performed independently for each of
the k vector elements.
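The S1 pooling can be sketched as follows (our own illustration, with hypothetical names and default values); each pixel's k-vector is distributed over the grid of pooling centres with triangular (bilinear) weights whose width equals the sample spacing, as we read the description above:

```python
import numpy as np

def s1_pool(tblock_out, grid=4, footprint=48.0):
    """Bilinear spatial pooling over a grid x grid array of centres (S1).
    tblock_out: (H, W, k) T-block output; footprint: side length in pixels of
    the pooling grid, centred on the patch. Returns a (grid*grid*k,) descriptor."""
    H, W, k = tblock_out.shape
    spacing = footprint / grid
    # Pooling-centre coordinates, symmetric about the patch centre.
    centres = (np.arange(grid) - (grid - 1) / 2.0) * spacing + (H - 1) / 2.0
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    desc = np.zeros((grid, grid, k))
    for gy, cy in enumerate(centres):
        wy = np.clip(1.0 - np.abs(ys - cy) / spacing, 0.0, None)  # bilinear weight in y
        for gx, cx in enumerate(centres):
            wx = np.clip(1.0 - np.abs(xs - cx) / spacing, 0.0, None)
            desc[gy, gx] = (tblock_out * (wy * wx)[..., None]).sum(axis=(0, 1))
    return desc.ravel()
```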
[S2] We used the spatial histogramming scheme of the GLOH
descriptor introduced by Mikolajczyk and Schmid [11]. This uses
a polar arrangement of summing regions as shown in Figure 4.
We used three variants of this arrangement with 3, 9 and 17
regions, depending on the number of angular segments in the
outer two rings (zero, 4, or 8). The radii of the centres of the
middle and outer regions and the outer edge of the outer region
were parameters that were available for learning. Input vectors
are bilinearly weighted in polar coordinates so that each vector
contributes to multiple regions. As a last step, each of the final
vectors from the N pooling regions is normalized by the area of
its summation region.
[S3] We used normalized Gaussian weighting functions to sum
input vectors over local pooling regions arranged on a 3×3, 4×4 or 5×5 grid. The sizes of each Gaussian and the positions of the
grid samples were parameters that could be learned. Figure 4
displays the symmetric 3×3 arrangement with two position
parameters and three Gaussian widths.
[S4] We tried the same approach as S3 but instead used a polar
arrangement of Gaussian pooling regions with 17 or 25 sample
centres. Parameters were used to specify the ring radii and the size
of the Gaussian kernel associated with all samples in each ring
(Figure 4). The rotational phase angle of the spatial positioning of
middle ring samples was also a parameter that could be learned.
This configuration was introduced in [31] and named the DAISY
descriptor by [38].
D. Embedding (E-block)
Embedding methods are prevalent in the face recognition
literature [24], [25], and have been used by some authors for
building local image descriptors [23], [35], [39]. Discriminative
linear embedding can identify more robust image descriptors,
whilst simultaneously reducing the number of dimensions. We
summarize the different embedding methods we have used for
E-blocks below (see also the objective functions in Section V).
[E1] We perform principal component analysis (PCA) on the
input vectors. This is a non-discriminative technique and is used
mostly for comparison purposes.
[E2] We find projections that minimize the ratio of in-class
variance for match pairs to the variance of all match pairs. This
is similar to Locality Preserving Projections (LPP) [25].
[E4] We find projections that minimize the ratio of the variance of match pair differences to that of non-match pair differences. This is similar to Local Discriminative Embedding [26].
[E6] We find projections that minimize the ratio of in-class
variance for match pairs to the total data variance. We call
this generalized local discriminative embedding (GLDE). If the
number of classes is large, this objective function will be similar
to [E2] and [E4] [35].
[E3], [E5] and [E7] are the same as [E2], [E4] and [E6] with
the addition of orthogonality constraints which ensure that the projection directions are mutually orthogonal [40], [27],
[41].
E. Post Normalization (N-block)
We use normalization to remove the descriptor dependency on
image contrast and to introduce robustness.
For parametric descriptors, we employ the SIFT-style normalization approach which involves range clipping of descriptor elements. Our slightly modified algorithm consists of four steps: (1) normalize to a unit vector, (2) clip all elements of the vector that are above a threshold κ by computing v′_i = min(v_i, κ), (3) re-normalize to a unit vector, and (4) repeat from step 2 until convergence or a maximum number of iterations has been reached. This procedure has the effect of reducing the dynamic range of the descriptor and creating a robust function for matching. The threshold κ was available for learning.
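A small sketch of this iterated clip-and-renormalize procedure (our own rendering of the four steps above; the iteration cap is an assumed value):

```python
import numpy as np

def sift_style_normalize(v, kappa, max_iters=5):
    """Iterative SIFT-style normalization: unit-normalize, clip elements at
    kappa, re-normalize, and repeat until the vector stops changing."""
    v = v / (np.linalg.norm(v) + 1e-12)
    for _ in range(max_iters):
        clipped = np.minimum(v, kappa)
        clipped /= (np.linalg.norm(clipped) + 1e-12)
        if np.allclose(clipped, v):
            break
        v = clipped
    return v
```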
In the case of the non-parametric descriptors of Figure 3(2),
we normalize the descriptor to a unit vector.
IV. LEARNING PARAMETRIC DESCRIPTORS
This section corresponds to Pipeline 1 in Figure 3. The input to the modular descriptor is a 64×64 image patch and the final output is a descriptor vector of D = kN numbers, where k is the T-block dimension and N is the number of S-block summation regions.
We evaluate descriptor performance and carry out learning
using our ground-truth data sets consisting of match and non-
match pairs. For each pair we compute the Euclidean distance
between descriptor vectors and form two histograms of this value
for all true matching and non-matching cases in the data set.
A good descriptor minimizes the amount of overlap of these
histograms. We integrate the two histograms to obtain an ROC
curve which plots correctly detected matches as a fraction of all
true matches against incorrectly detected matches as a fraction
of all true non-matches. We compute the area under the ROC
curve as a final score for descriptor performance and aim to
maximize this value. Other choices for quality measures are
possible depending on the application but we choose ROC area
as a robust and fairly generic measure. In terms of reporting our
results on the test set, however, we choose to indicate performance
in terms of the percentage of false matches present when 95% of
all correct matches are detected.
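The ROC-area objective and the 95% error-rate statistic described above can be computed directly from the two sets of descriptor distances; the sketch below is our own illustration of that computation (names are hypothetical):

```python
import numpy as np

def roc_stats(match_dists, nonmatch_dists):
    """Compute ROC area and the incorrect-match rate at 95% recall from
    descriptor distances for ground-truth match and non-match pairs."""
    thresholds = np.sort(np.concatenate([match_dists, nonmatch_dists]))
    tpr = np.array([(match_dists <= t).mean() for t in thresholds])
    fpr = np.array([(nonmatch_dists <= t).mean() for t in thresholds])
    # Trapezoidal area under the ROC curve (tpr vs. fpr).
    roc_area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
    # False-match rate at the threshold where 95% of true matches are found.
    err_at_95 = float(fpr[np.searchsorted(tpr, 0.95)])
    return roc_area, err_at_95
```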
We jointly optimized parameter values of G, T, S, and N-blocks
by using Powell’s multidimensional direction set method [32] to
maximize the ROC area. We initialized the optimization with
reasonable choices of parameters.
Each ROC area measure was evaluated using one run over the
training data set. After each run we updated the parameters and
repeated the evaluation until the change in ROC area was small.
In order to avoid over-fitting we used a careful parametrization of
the descriptors using as few parameters as possible (typically 5–11
depending on descriptor type). Once we had determined optimal
parameters, we re-ran the evaluation over our testing data set to
obtain the final ROC curves and error rates.
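In practice this kind of joint parameter search can be run with an off-the-shelf Powell implementation. The sketch below is our own scaffolding using SciPy's direction-set Powell method; the objective is a runnable toy surrogate, and the parameter vector and tolerances are assumptions rather than the authors' settings.

```python
import numpy as np
from scipy.optimize import minimize

def neg_roc_area(params):
    """Stand-in objective. In the real pipeline this would (1) build the
    descriptor from these G/T/S/N-block parameters, (2) evaluate it over the
    training pairs and (3) return minus the ROC area, so that minimizing this
    function maximizes ROC area. A smooth toy surrogate keeps it runnable."""
    target = np.array([1.2, 10.0, 20.0, 0.15])   # hypothetical optimum
    return float(np.sum((params - target) ** 2))

x0 = np.array([1.0, 12.0, 24.0, 0.1])            # reasonable initial parameters
res = minimize(neg_roc_area, x0, method="Powell",
               options={"xtol": 1e-4, "ftol": 1e-4})
print(res.x)  # optimized parameter vector
```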
V. LEARNING NON-PARAMETRIC DESCRIPTORS
This section corresponds to Pipeline 2 in Figure 3. In this
section, we attempt to learn the spatial pooling component of
the descriptor pipeline without committing to any particular
parametrization. To do this, we make use of linear embedding
techniques as described in Section III-D. Instead of using iterative numerical optimization methods such as Powell minimization to
optimize parametrized descriptors, the embedding methods solve
directly for a set of optimal linear projections. The projected
output vector in this embedding space becomes the final image
descriptor. Although Pipeline 2 also involves parameters for
T and N-blocks, these are learned independently using Powell
Minimization as described above. We leave the joint optimization
of these parameters for future work.
The input to the embedding learning algorithms is a set of
match/non-match labelled image pairs that have been processed
by different processing units (T-blocks), i.e.,
S = \{ x_i = T(p_i),\ x_j = T(p_j),\ l_{ij} \}.   (1)

In Equation 1, p_k is an input image patch, T(·) represents a composite set of different image processing units presented in Section III, x_k is the output vector of T(·), and l_{ij} takes a binary value to indicate whether patches p_i and p_j are a match (l_{ij} = 1) or a non-match (l_{ij} = 0). We now present the mathematical formulation of the different embedding learning algorithms.
A. Objective functions of different embedding methods.
Our E2 block attempts to maximize the ratio of the projected variance of all x_i in the match patch pair set to that of the difference vectors x_i − x_j. Letting w be the projection vector, we can write this mathematically as follows:

J_1(w) = \frac{\sum_{l_{ij}=1} (w^T x_i)^2}{\sum_{l_{ij}=1} \left( w^T (x_i - x_j) \right)^2}.   (2)
The intuition for this objective function is that in projection space,
we try to minimize the distance between the match pairs while
at the same time keeping the overall projected variance of all vectors in the match pair set as large as possible. This is similar to the Laplacian eigenmap adopted in previous work such as locality preserving projections [25].
Alternatively, motivated by local discriminative embedding [26], the E4 block optimizes the following objective function:

J_2(w) = \frac{\sum_{l_{ij}=0} \left( w^T (x_i - x_j) \right)^2}{\sum_{l_{ij}=1} \left( w^T (x_i - x_j) \right)^2}.   (3)
By maximizing J2(w), we are seeking the embedding space under
which the distances between match pairs are minimized and the
distances between non-match pairs are maximized.
A third objective function (E6 blocks) unifies the above two
objective functions under certain conditions [35]:
J_3(w) = \frac{\sum_{x_i \in S} (w^T x_i)^2}{\sum_{l_{ij}=1} \left( w^T (x_i - x_j) \right)^2}.   (4)
All three objective functions J_1, J_2, and J_3 can be written in matrix form as

J_i(w) = \frac{w^T A_i w}{w^T B w},   (5)

where

A_1 = \sum_{x_i \in S} \Big( \sum_j l_{ij} \Big) x_i x_i^T   (6)

A_2 = \sum_{l_{ij}=0} (x_i - x_j)(x_i - x_j)^T   (7)

A_3 = \sum_{x_i \in S} x_i x_i^T   (8)

B = \sum_{l_{ij}=1} (x_i - x_j)(x_i - x_j)^T.   (9)
In the following, for ease of presentation, we use A to represent any of A_1, A_2 and A_3. Setting the derivative of our objective function (Equation 5) to zero gives

\frac{\partial J}{\partial w} = \frac{2 A w (w^T B w) - 2 (w^T A w) B w}{(w^T B w)^2} = 0,   (10)

which implies that the optimal w is given by the solution to a generalized eigenvalue problem

A w = \lambda B w,   (11)

where \lambda = w^T A w / w^T B w. Equation 11 is solved using standard techniques, and the first K generalized eigenvectors are chosen to form the embedding space.
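A minimal sketch of this step (our own illustration), assuming NumPy/SciPy and pre-computed scatter matrices A and B as defined in Equations 6-9:

```python
import numpy as np
from scipy.linalg import eigh

def embedding_from_scatter(A, B, K, eps=1e-6):
    """Solve the generalized eigenvalue problem A w = lambda * B w and return
    the K eigenvectors with the largest eigenvalues as a (dim, K) projection
    matrix W. A small ridge eps*I keeps B positive definite."""
    B_reg = B + eps * np.eye(B.shape[0])
    evals, evecs = eigh(A, B_reg)        # eigenvalues returned in ascending order
    return evecs[:, ::-1][:, :K]         # top-K generalized eigenvectors

# Descriptors are then the projections d = W.T @ x for each T-block output x.
```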
E3, E5 and E7 blocks place orthogonality constraints on the corresponding E2, E4 and E6 blocks, respectively. The mathematical formulation is quite straightforward: suppose we have already obtained k−1 orthogonal projections for the embedding, i.e.,

W_k = [w_1, w_2, \ldots, w_{k-1}],   (12)

then, to pursue the k-th vector, we solve the following optimization problem:

\arg\max_w \frac{w^T A w}{w^T B w}   (13)

\text{s.t.}\quad w^T w_1 = 0,   (14)

w^T w_2 = 0,   (15)

\ldots   (16)

w^T w_{k-1} = 0.   (17)

By formulating the Lagrangian, it can be shown that the solution to this problem can be found by solving the following eigenvalue problem [27], [41]:

\hat{M} w = \left( (I - B^{-1} W_k Q_k^{-1} W_k^T)\, B^{-1} A \right) w = \lambda w,   (18)

where

Q_k = W_k^T B^{-1} W_k.   (19)

The optimal w_k is then the eigenvector associated with the largest eigenvalue in Equation 18. We omit the details of the derivation of the solution here but refer readers to [27], [41].
B. Power regularization
A common problem with the linear discriminative formulation
in Equation 5 is the issue of over-fitting. This occurs because
projections w which are essentially noise can appear discriminative in the absence of sufficient data. This issue is exacerbated by the high dimensional input vectors used in our experiments (typically several hundred to several thousand dimensions). To mitigate the problem, we adopt a power regularization cost function to force the discriminative projections to lie in the signal subspace. To do this, we first perform an eigenvalue decomposition of the B matrix in Equation 5, i.e., B = U \Lambda U^T. Here \Lambda is a diagonal matrix with \Lambda_{ii} = \lambda_i being the i-th eigenvalue of B and \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n. We then regularize \Lambda by clipping its diagonal elements against a minimal value \lambda_r, where

\lambda'_i = \max(\lambda_i, \lambda_r).   (20)

We choose r such that \sum_{i \geq r} \lambda_i accounts for a portion \alpha of the total power, i.e.,

r = \min k \quad \text{s.t.} \quad \frac{\sum_{i=k}^{n} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \leq \alpha.   (21)
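A short sketch of this regularization (our own illustration, under the assumption that the eigenvalues are sorted in descending order as above):

```python
import numpy as np

def power_regularize(B, alpha=0.1):
    """Clip the eigenvalue spectrum of B so that directions carrying the last
    fraction alpha of the total power cannot dominate the embedding."""
    evals, evecs = np.linalg.eigh(B)
    evals, evecs = evals[::-1], evecs[:, ::-1]          # descending order
    tail = np.cumsum(evals[::-1])[::-1] / evals.sum()   # tail power fraction from index i
    ok = tail <= alpha
    r = int(np.argmax(ok)) if ok.any() else len(evals) - 1  # smallest k with tail <= alpha
    lam_r = evals[r]
    evals_reg = np.maximum(evals, lam_r)                # clip small eigenvalues up to lambda_r
    return evecs @ np.diag(evals_reg) @ evecs.T         # regularized B
```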
Figure 5 shows the top 10 projections learnt from a set of match/non-match image patches with different power regularization rates α. The only pre-processing applied to these patches was bias-gain normalization. As α decreases from 0.2 to 0 (top to bottom), the projections become increasingly noisy.
Fig. 5. The first 10 projections learned from normalized image patches in a match/non-match image patch set using J_2(w) with different power regularization rates [35]. From top to bottom, α takes the values 0.2, 0.1, 0.02 and 0, respectively. Notice that the projections become progressively noisier as the power regularization is reduced.
VI. EXPERIMENTS
We performed experiments using the parametric and non-
parametric descriptor formulations described above, using our
new test dataset. The following results all apply to Difference
of Gaussian (DOG) interest points. For experiments using Harris
corners, see Section VI-E. In each case we have compared to
Lowe’s original implementation of SIFT. Since SIFT performs
descriptor sampling at a certain scale relative to the Difference
of Gaussian peak, we have optimized over this scaling parameter
to ensure that a fair comparison is made (see Figure 6).
For the results presented in this paper, we used three test
sets (Yosemite, Notre Dame, and Liberty) which were obtained
by extracting scale and orientation normalized 64 ×64 patches
around DOG interest points as described in Section II. Typically
four training and test set combinations were used: Yosemite–
Notre Dame, Yosemite–Liberty, Notre Dame–Yosemite, and Notre
Dame–Liberty, where the first of the pair is the training set. In
addition a “synthetic” training set was obtained which incorpo-
rated artificial geometric jitter as described in [31]. Training sets
typically contained from 10,000 to 500,000 patch pairs depending
on the application while test sets always contained 100,000 pairs.
The training and test sets contained 50% match pairs, and 50%
non-match pairs. During training and testing, we recomputed all
match/non-match descriptor distances as the descriptor transfor-
mation varied, sweeping a threshold on the descriptor distance to
generate an ROC curve. Note that using predefined match/non-
match pairs eliminates the need to recompute nearest neighbours
in the 100,000 element test set, which would be computationally
very demanding. In addition to presenting ROC curves, we give
many results in terms of the 95% error rate which is the percent
of incorrect matches obtained when 95% of the true matches are
found (Section IV).
A. Parametric Descriptors
We obtained very good results using combinations of the para-
metric descriptor blocks of Section III, exceeding the performance
of SIFT by around 1/3 in terms of 95% error rates. We chose to
focus specifically on four combinations that were shown to have
merit in [31]. These included a combination of angle quantized
gradients (T1) or steerable filters (T3) with log-polar (S2) or
Gaussian (S4) summation regions. Other combinations with T2,
T4, S1, S3 performed less well. Example ROC curves are shown
in Figure 7 and 8, and all error rates are given in Table I (all tables
show the 95% error rate with the optimal number of dimensions
given in parentheses).
[Figure 6: (a) 95% error rate vs. log2 scale for SIFT (128 dimensions); (b) ROC curves (correct vs. incorrect match fraction) for SIFT (128 dims, 28.5%), SIFT + PCA (27 dims, 28.5%) and SIFT + GLDE (19 dims, 27.5%).]
Fig. 6. Results for Lowe-SIFT descriptors: (a) shows the solution for the optimal SIFT descriptor footprint using the Liberty dataset. Note that the performance
is quite sensitive to this parameter, so it must be set carefully. (b) shows ROC curves when using this optimal patch scaling and the Yosemite dataset for
testing. We also tried using PCA and GLDE on the SIFT descriptors (shown in the other curves). GLDE gave only a small improvement in performance (1% lower error at 95% true positives) over Lowe's algorithm, but substantially reduced the number of dimensions from 128 to 19. PCA also gives a large dimensionality
reduction for only a small drop in performance.
Fig. 9. Optimal summation regions (S3-9 (3×3), S3-16 (4×4), S3-25 (5×5), S4-17 and S4-25) are foveated, despite being initialized with a rectangular arrangement in the case of S3.
On three of the four datasets, the best performance was
achieved by the T3h-S4-25 combination, which is a combination
of steerable filters with 25 Gaussian summation regions arranged
in concentric rings. We found that when optimized over our
training dataset, these summation regions tended to converge to a
foveated shape, with larger and more widely space summation
regions further from the centre (see Figure 9). This structure
is reminiscent of the geometric blur work of [22], and similar
arrangements were independently suggested and named DAISY
descriptors by [38]. Rectangular arrays of summation regions
were found to have lower performance and their results are not
included here.
Note that the performance of these parametric descriptors is
uniformly strong in comparison to SIFT, but the downside of this
method is that the number of dimensions is very large (typically
several hundred).
B. Non-Parametric Descriptors
The ROC curves for training on Yosemite and testing on Notre
Dame using Non-Parametric descriptors are shown in Figure 10.
To summarize the remaining results, we have created tables
showing the 95% error rates only.
Table II shows the best results for each T-block using the
scheme of Figure 3(2) over all subspace methods that we tried
(PCA, LDE, LPP, GLDE and orthogonal variants). Also shown
are results for applying subspace methods to raw bias-gain
normalized pixel patches and gain normalized gradients. We see
that the T3 (steerable filter) block performs the best, followed
by T1 (angle-quantized gradients) and T2 (rectified gradients). In
half of the cases the combination of T3 and E-block learning beat
SIFT. Table III shows the best results for each E-block over all T-
block filters. LPP is the clear winner when trained on Yosemite.
For Notre Dame the case is not so clear, and no one method
performs consistently well. The best results for each subspace
method are almost always using T3.
To investigate sensitivity to training data, we tested on the
Liberty set using training on both Notre Dame and Yosemite.
For the non-parametric descriptor learning it seems that the
Yosemite dataset was best for training, whereas for the parametric
descriptors the performance was comparable (within 1-2%) for
both datasets. In general the results from the E-block learning
are less strong and more variable than the parametric S-block
techniques. Certain combinations, such as T3/LPP, were able to generate SIFT-beating performance (e.g. 19.29% vs 26.10% on
the Yosemite/Notre Dame test case), but many other combinations
did not. The principal advantage of these techniques is that di-
mensionality reduction is simultaneously achieved, so the number
of dimensions is typically low (e.g. 32 dimensions in the case of
T3/LPP).
C. Dimension reduced parametric descriptors
Parametric descriptor learning yielded excellent performance
with high dimensionality, whereas the non-parametric learning
gave us a very small number of dimensions but with a slightly
inferior performance. Thus it seems natural to combine these
approaches. We did this by running a stage of non-parametric
dimensionality reduction after a stage of parametric learning. This
corresponds to Pipeline 3 in Figure 3. Note that we did not attempt
to jointly optimize for the embedding and parametric descriptors,
although this could be a good direction for future work. The
results are shown in Figure 11 and Table IV. This approach gave
us the overall best results, with typically 1-2% less error than
parametric S-blocks alone, and far fewer dimensions (30-40).
Although LDA gave much better results than PCA when applied
to raw pixel data [35], running PCA on the outputs of S-block
learning gave equal or better results than LDA. It may be that LDA is
slightly overfitting in cases where a discriminative representation
has already been found. For half the datasets, the best results were
[Figure 7 legend: SIFT (128, 28.49%), T1c-S2-17 (272, 18.30%), T3h-S2-17 (272, 16.56%), T3h-S4-25 (400, 16.36%), T3j-S2-17 (544, 15.91%).]
Fig. 7. ROC curves for parametrized descriptors. Training on Notre Dame
and testing on Yosemite.
[Figure 8 legend: SIFT (128, 35.09%), T1c-S2-17 (272, 22.76%), T3h-S4-25 (400, 22.05%), T3h-S2-17 (272, 21.85%), T3j-S2-17 (544, 21.98%).]
Fig. 8. ROC curves for parametrized descriptors. Training on Notre Dame
and testing on Liberty.
Train Test T1c-S2-17 T3h-S4-25 T3h-S2-17 T3j-S2-17 SIFT
Yosemite Notre Dame 17.90(272) 14.43(400) 15.44(272) 15.87(544) 26.10(128)
Yosemite Liberty 23.00(272) 20.48(400) 22.00(272) 22.28(544) 35.09(128)
Notre Dame Yosemite 18.30(272) 16.35(400) 16.56(272) 15.91(544) 28.50(128)
Notre Dame Liberty 22.76(272) 21.85(400) 22.05(272) 21.98(544) 35.09(128)
Synthetic Liberty 29.50(272) 24.25(400) 25.74(272) 32.36(544) 35.09(128)
TABLE I
PARAMETRIC DESCRIPTOR RESULTS. 95% ERROR RATES ARE SHOWN, WITH THE NUMBER OF DIMENSIONS IN PARENTHESES.
Training Set  Test Set  normalized pixels  normalized gradients  T1  T2  T3  T4  SIFT
Yosemite Notre Dame 37.17(14) 32.09(15) 25.68(24) 27.78(33) 19.29(32) 35.37(28) 26.10(128)
Yosemite Liberty 56.33(14) 51.63(15) 38.55(24) 41.10(20) 31.10(32) 47.74(28) 35.09(128)
Notre Dame Yosemite 43.37(27) 38.36(19) 33.59(21) 33.99(40) 31.27(19) 42.39(27) 28.50(128)
Notre Dame Liberty 55.70(27) 52.62(17) 41.37(24) 43.80(15) 36.54(19) 50.63(27) 35.09(128)
Synthetic Notre Dame 37.85(15) 39.15(24) 24.47(32) 24.47(32) 22.94(30) 34.41(28) 26.10(128)
TABLE II
BEST T-BLOCK RESULTS OVER ALL SUBSPACE METHODS.
Training Test PCA GLDE GOLDE LDE OLDE LPP OLPP SIFT
Yosemite Notre D. 40.36(29) 24.20(28) 26.24(31) 24.65(31) 25.01(27) 19.29(32) 23.71(31) 26.10(128)
Yosemite Liberty 53.20(29) 35.76(28) 43.35(31) 34.97(31) 40.15(27) 31.10(32) 39.46(31) 35.09(128)
Notre D. Yosemite 45.43(61) 32.53(45) 34.61(25) 31.27(19) 33.38(20) 33.19(46) 35.04(17) 28.50(128)
Notre D. Liberty 51.63(97) 41.66(45) 40.75(18) 36.54(19) 39.95(20) 42.68(46) 41.46(17) 35.09(128)
Synthetic Notre D. 43.78(66) 24.04(29) 26.25(29) 24.86(26) 26.10(33) 22.94(30) 26.05(34) 26.10(128)
TABLE III
BEST SUBSPACE METHOD OVER ALL T-BLOCKS.
obtained using PCA on T3h-S4-25 (rectified steerable filters with
DAISY-like Gaussian summation regions) and for the other half,
the best results were from T3j-S2-17 plus PCA (rectified steerable
filters and log-polar GLOH-like summation regions). The best
results here gave less than half the error rate of SIFT, using about
1/4 of the number of dimensions. See the “best of the best” results in Table V.
To aid in the dissemination of these results, we have cre-
ated a document detailing parameter settings for the most
successful DAISY configurations, as well as details of the
recognition performance/computation time tradeoffs. This can
be found on the same website as our patch datasets:
http://www.cs.ubc.ca/mbrown/patchdata/tutorial.pdf.
We also used this approach to perform dimensionality reduction
on SIFT itself; the results are shown in Figure 6(b). We were able
to reduce the number of dimensions significantly (to around 20),
but the matching performance of the LDA reduced SIFT descrip-
tors was only slightly better than the original SIFT descriptors
(about 1% lower error).
D. Comparisons with Synthetic Interest Point Noise
Previous work [31], [12] used synthetic jitter applied to image
patches in lieu of the position errors introduced in interest point detection.
[Figure 10 panels: (a) T1, 4 orientations; (b) T2, 4 orientations; (c) T3, 2nd order, 4 orientations; (d) T4; (e) normalized gradient; (f) bias-gain normalized image. Each panel plots correct vs. incorrect match fraction for SIFT, NSSD and the PCA/GLDE/GOLDE/LDE/OLDE/LPP/OLPP embeddings.]
Fig. 10. Testing of linear discriminant descriptors trained on Yosemite and tested on Notre Dame. The optimal number of dimensions and the associated
95% error rate is given in parentheses. NSSD: Normalized sum squared difference computed on the output of the T-block directly without embedding.
Training Test PCA GLDE GOLDE LDE OLDE LPP OLPP SIFT
Yosemite Notre D. 11.98(29) 19.12(39) 13.64(49) 18.03(60) 12.48(71) 16.77(52) 14.07(36) 26.10(128)
Yosemite Liberty 18.27(29) 26.92(32) 19.88(49) 25.20(60) 18.70(71) 25.39(32) 20.33(36) 35.09(128)
Notre D. Yosemite 13.55(36) 25.25(87) 15.67(67) 21.78(35) 15.04(99) 22.30(48) 15.56(86) 28.50(128)
Notre D. Liberty 16.85(36) 30.38(28) 20.01(53) 26.48(45) 19.80(49) 26.78(48) 19.47(48) 35.09(128)
TABLE IV
BEST SUBSPACE METHODS FOR COMPOSITE DESCRIPTORS.
Train Test Parametric Non-parametric Composite SIFT
Yosemite Notre Dame 14.43(400) 19.29(32) 11.98(29) 26.10(128)
Yosemite Liberty 20.48(400) 31.10(32) 18.27(29) 35.09(128)
Notre Dame Yosemite 15.91(544) 31.27(19) 13.55(36) 28.50(128)
Notre Dame Liberty 21.85(400) 36.54(19) 16.85(36) 35.09(128)
TABLE V
“BEST OF THE BEST” RESULTS.
In order to evaluate the effectiveness of this strategy,
we tested a number of descriptors that were trained on a dataset
with synthetic noise applied [31].
For results, see the last rows of Tables I, II and III. Here,
“synthetic” means that synthetic scale, rotation and position jitter
noise was applied to the patches, although the actual patch data
was sampled from real images as in [31]. For the parametric
descriptors, there is a clear gain of 5-10% from training using the
new non-synthetic dataset. For the LDA-based methods, smaller
gains are noticeable.
E. Learning Descriptors for Harris Corners
Using our multi-view stereo ground truth data we can easily
create optimal descriptors for any choice of interest point. To
demonstrate this, we also created a dataset of patches centred
on multi-scale Harris corner points (see Figure 12). The left
column shows the projections learnt from Harris corners and the
right column from DOG interest points, for normalized image
patches. The projections learnt from the two different types of
interest points share several similarities in appearance. They
are all centre focused, and look like Gaussian derivatives [16]
combined with geometric blur [22]. We also found that the performance ordering of the descriptors learnt from the different embedding methods is similar across the two datasets.
F. Effects of Normalization
As demonstrated in [35], the post-normalization step is very
important for the performance of the non-parametric descriptors
learnt from the synthetically jittered dataset. We observe a similar phenomenon in our new experiments with the new data.
The higher performance of the parametric descriptors when
compared to the non-parametric descriptors is in some part attributable to the use of SIFT-style clipping normalization versus simple unit-length normalization for the latter. Since parametric
descriptors maintain a direct relation between image-space and
descriptor coefficients compared with coefficients after PCA re-
duction, SIFT-style clipping, by introducing a robustness function,
can mitigate differences due to spatial occlusions and shadowing
which affect one part of the descriptor and not another. For
this reason applying SIFT-style normalization prior to dimension
reduction seems appropriate.
Figure 13 shows the effect of changing the threshold of clipping
for SIFT normalization. Error rates are significantly improved
when the clipping threshold is equal to around 1.6/√D when
tested on a wide range of parametric descriptors with different
dimensionality. This graph shows the drastic reduction in error
rate compared with simple unit normalization.
[Figure 13: error rate (%) vs. normalization threshold ratio for T1c-S2-17, T3h-S4-25, T3h-S2-17 and T3j-S2-17.]
Fig. 13. Change in error rates as the normalization clipping threshold is varied for parametric descriptors. The threshold was set to r/√D, where r is the ratio and D is the descriptor dimensionality. Unit: unit-length normalization without clipping.
G. Minimizing Bits
For certain applications, such as scalable recognition, it is im-
portant that descriptors are represented as efficiently as possible.
A natural question is: “what is the minimum number of bits
required for accurate feature descriptors?”. To address this ques-
tion we tested the recognition performance of our parametrized
descriptors as the number of bits per dimension was reduced from
8 to 1. The results are shown in Figure 14 for the parametric
descriptors. Surprisingly, there seems to be very little benefit to
using any more than 2 or 3 bits of dynamic range per dimension,
which suggests that it should be possible to create local image
descriptors with a very small memory footprint indeed. In one
case (T1c-S2-17), the performance actually degraded slightly as
more bits were added. It could be that in this case quantization
caused a small noise reduction effect. Note that this effect was small (around 1% in error rate), and was not observed for the other descriptors,
where the major change in performance came from 1 to 2 bits per
dimension, which gave around 16% change in error rate. Whilst
it would also be possible to quantize bits for dimension reduced
(embedded) descriptors, a variable number of bits per dimension
would be required as the variance on each dimension can differ
substantially across the descriptor.
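A small sketch of the per-dimension quantization used for this kind of experiment, as we understand it (our own illustration; the fixed quantization range is an assumption):

```python
import numpy as np

def quantize_descriptor(desc, bits, max_val=1.0):
    """Quantize each descriptor dimension to 2**bits levels over [0, max_val].
    Distances are then computed between the de-quantized descriptors."""
    levels = 2 ** bits - 1
    q = np.round(np.clip(desc, 0.0, max_val) / max_val * levels)
    return q / levels * max_val   # de-quantize for distance computation
```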
VII. LIMITATIONS
Here we address some limitations of the current method and
suggest ideas for future work.
A. Repetitive image structure
One caveat with our learning scheme is that distinct
3D locations are defined to be different classes, when in the
[Figure 14: error rate (%) vs. number of bits per dimension for T1c-S2-17 (272), T3h-S4-25 (400), T3h-S2-17 (272) and T3j-S2-17 (544).]
Fig. 14. Results of limiting the number of bits in each descriptor dimension.
Not many more than 2 bits are required per dimension to retain a good error
rate.
real world, they can often have the same visual appearance.
One common example would be repeated architectural structures,
such as windows or doors. Such repetitions typically cause false
positives in our matching schemes (see Figure 15). For the Notre
Dame dataset, false positives occur due to translational repetition
(e.g. the stone figures) as well as rotational repetitions (e.g. the
rose window).
B. Multi-view Stereo Data
Although there have been great improvements in stereo in
recent years [30], using multi-view stereo to train local image de-
scriptors has its limitations. Noise in the stereo reconstruction will
inevitably propagate through to the set of image correspondences,
but probably a bigger issue is that certain image correspondences,
i.e., in regions where stereo fails, will not be present at all. One
way around this problem would be to use imagery registered to
LIDAR scans as in [42].
VIII. CONCLUSIONS
We have described a scheme for learning discriminative, low-
dimensional image descriptors from realistic training data. These
techniques have state-of-the-art performance in all our test sce-
narios. The techniques described in this paper have been used to
design local feature descriptors for a robust structure from mo-
tion application called Photosynth (http://www.photosynth.com) and an automatic panoramic stitcher named ICE (Image Compositing Editor, http://research.microsoft.com/ivm/ice.html).
Recommendations
To summarize our work, we suggest a few recommendations
for practitioners in this area:
• Learn parameters from training data: Successful descriptor designs typically have many parameter choices that are difficult to optimize by hand. We recommend using realistic training datasets to optimize these parameters.
• Use foveated summation regions: Pooling regions that become larger away from the interest point are generally found to have good performance. See [38] for an efficient implementation approach.
• Use non-linear filter responses: Some form of non-linear filtering before spatial pooling is essential for the best performance. Steerable filters work well if the phase is kept. Rectified or angle-quantized gradients are also a good and simple choice.
• Use LDA for discriminative dimension reduction: LDA can be used to find discriminative, low-dimensional descriptors without imposing a choice of parameters. However, if a discriminative representation has already been found, PCA can work well for reducing the number of dimensions.
• Normalization: Thresholding normalization often provides a large boost in performance. If dimension reduction is used, normalization should come before the dimension reduction block.
ACKNOWLEDGMENT
The authors would like to thank Michael Goesele and Noah
Snavely for sharing their 3D reconstruction data with us. We’d
also like to thank David Lowe, Rick Szeliski and Sumit Basu for
helpful discussions.
REFERENCES
[1] R. Szeliski, “Image alignment and stitching: A tutorial,” Microsoft
Research, Tech. Rep. MSR-TR-2004-92, December 2004.
[2] M. Brown and D. Lowe, “Automatic panoramic image stitching using
invariant features,” International Journal of Computer Vision, vol. 74,
no. 1, pp. 59–73, 2007.
[3] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis,
J. Tops, and R. Koch, “Visual modeling with a hand-held camera,”
International Journal of Computer Vision, vol. 59, no. 3, pp. 207–232,
2004.
[4] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: Exploring photo
collections in 3D,” in SIGGRAPH Conference Proceedings. New York,
NY, USA: ACM Press, 2006, pp. 835–846.
[5] D. Nistér and H. Stewénius, “Scalable recognition with a vocabulary
tree,” in IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), vol. 2, June 2006, pp. 2161–2168.
[6] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object
retrieval with large vocabularies and fast spatial matching,” in Proceed-
ings of the International Conference on Computer Vision and Pattern
Recognition (CVPR07), 2007.
[7] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local features
and kernels for classification of texture and object categories: A com-
prehensive study,” International Journal of Computer Vision, vol. 73,
no. 2, pp. 213–238, June 2007.
[8] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative
classification with sets of image features,” in Proceedings of the IEEE
International Conference on Computer Vision, Beijing, October 2005.
[9] R. Fergus, P. Perona, and A. Zisserman, “Object class recognition by un-
supervised scale-invariant learning,” in Proceedings of the International
Conference on Computer Vision and Pattern Recognition, 2003.
[10] K. Mikolajczyk and C. Schmid, “Scale and affine invariant interest point
detectors,” International Journal of Computer Vision, vol. 60, no. 1, pp.
63–86, 2004.
[11] K. Mikolajczyk and C. Schmid, “A performance evaluation of local
descriptors,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[12] V. Lepetit and P. Fua, “Keypoint recognition using randomized trees,
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 28, no. 9, pp. 1465–1479, 2006.
[13] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for
image categorization and segmentation,” in International Conference on
Computer Vision and Pattern Recognition, June 2008.
[14] B. Babenko, P. Dollar, and S. Belongie, “Task specific local region
matching,” in International Conference on Computer Vision (ICCV07),
Rio de Janeiro, 2007.
[15] D. Martin, C. Fowlkes, and J. Malik, “Learning to detect natural image bound-
aries using local brightness, color and texture cues,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 26, no. 5, May 2004.
[Fig. 11 plot data: each panel plots correct match fraction against incorrect match fraction; legend entries give each descriptor's dimensionality and error rate.
(a) T1c-S2-17: SIFT (128, 26.10%), T1c-S2-17 (272, 17.93%), GLDE (39, 19.12%), GOLDE (65, 14.67%), LDE (59, 18.84%), OLDE (92, 15.95%), LPP (52, 16.77%), OLPP (74, 14.94%), PCA (65, 13.16%).
(b) T4h-S4-25: SIFT (128, 26.10%), T4h-S4-25 (400, 15.45%), GLDE (32, 19.87%), GOLDE (39, 15.13%), LDE (47, 18.85%), OLDE (57, 13.99%), LPP (25, 19.44%), OLPP (36, 14.07%), PCA (29, 11.98%).
(c) T3h-S2-17: SIFT (128, 26.10%), T3h-S2-17 (272, 14.41%), GLDE (32, 19.27%), GOLDE (49, 13.64%), LDE (60, 18.03%), OLDE (71, 12.48%), LPP (32, 17.48%), OLPP (23, 18.45%), PCA (22, 13.16%).
(d) T3j-S2-17: SIFT (128, 26.10%), T3j-S2-17 (544, 15.84%), GLDE (31, 19.85%), GOLDE (67, 14.20%), LDE (44, 18.67%), OLDE (68, 14.27%), LPP (32, 17.09%), OLPP (45, 14.35%), PCA (49, 12.66%).]
Fig. 11. ROC curves for composite descriptors trained on Yosemite and tested on Notre Dame.
Fig. 12. Comparison of projections on patches centred on Harris corner points (left column) and DOG points (right column). From top to bottom, we present projections learnt using the embedding blocks E2, E3, E4, E5, E6, E7 and E1, respectively.
Fig. 15. Some of the false positive, false negative, true positive and true negative image patch pairs when testing on the new Notre Dame dataset using E-blocks learnt from the new Yosemite dataset. We used a combination of T3 (steerable filters) and E2 (LPP) in this experiment. Each row shows six pairs of image patches; the two patches in each pair are shown in the same column. Note that the two images in the false positive pairs are indeed obtained from different 3D points, yet their appearances look surprisingly similar.
[16] C. Schmid and R. Mohr, “Local grayvalue invariants for image retrieval,”
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 19, no. 5, pp. 530–535, May 1997.
[17] C. Rothwell, A. Zisserman, D. Forsyth, and J. Mundy, “Canonical frames
for planar object recognition,” in European Conference on Computer
Vision, 1992, pp. 757–772.
[18] D. Lowe, “Object recognition from local scale-invariant features,” in
International Conference on Computer Vision, Corfu, Greece, September
1999, pp. 1150–1157.
[19] D. Lowe, “Distinctive image features from scale-invariant keypoints,”
International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110,
2004.
[20] D. Hubel and T. Wiesel, “Brain mechanisms of vision,” Scientific
American, pp. 150–162, September 1979.
[21] S. Belongie, J. Malik, and J. Puzicha, “Shape context: A new descriptor
for shape matching and object recognition,” in Advances in Neural
Information Processing Systems. MIT Press, Cambridge, MA, 2000.
[22] A. Berg and J. Malik, “Geometric blur and template matching,” in
International Conference on Computer Vision and Pattern Recognition,
2001, pp. I:607–614.
[23] Y. Ke and R. Sukthankar, “PCA-SIFT: a more distinctive representation
for local image descriptors,” in Proceedings of the International Con-
ference on Computer Vision and Pattern Recognition, vol. 2, July 2004,
pp. 506–513.
[24] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7,
pp. 711–720, 1997.
[25] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, “Face recognition using
Laplacianfaces,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 27, no. 3, pp. 328–340, March 2005.
[26] H. Chen, H. Chang, and T. Liu, “Local discriminant embedding and its
variants,” in Proceedings of the International Conference on Computer
Vision and Pattern Recognition, vol. 2, San Diego, CA, June 2005, pp.
846–853.
[27] J. Duchene and S. Leclercq, “An optimal transformation for discrimi-
nant and principal component analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 10, no. 6, pp. 978–983, 1988.
[28] P. Moreels and P. Perona, “Evaluation of feature detectors and de-
scriptors based on 3D objects,” in Proceedings of the International
Conference on Computer Vision, vol. 1, 2005, pp. 800–807.
[29] M. Goesele, S. Seitz, and B. Curless, “Multi-view stereo revisited,” in
International Conference on Computer Vision and Pattern Recognition,
New York, June 2006.
[30] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. Seitz, “Multi-view
stereo for community photo collections,” in International Conference on
Computer Vision, Rio de Janeiro, October 2007.
[31] S. Winder and M. Brown, “Learning local image descriptors,” in
Proceedings of the International Conference on Computer Vision and
Pattern Recognition (CVPR07), Minneapolis, June 2007.
[32] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling, Numerical
Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge
University Press, 1992.
[33] M. Brown and D. Lowe, “Unsupervised 3D object recognition and
reconstruction in unordered datasets,” in 5th International Conference
on 3D Imaging and Modelling (3DIM05), 2005.
[34] N. Snavely, S. Seitz, and R. Szeliski, “Modeling the world from internet
photo collections,” International Journal of Computer Vision, vol. 80,
no. 2, pp. 189–210, 2008.
[35] G. Hua, M. Brown, and S. Winder, “Discriminant embedding for local
image descriptors,” in Proceedings of the 11th International Conference
on Computer Vision (ICCV07), Rio de Janeiro, October 2007.
[36] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Object
recognition with cortex-like mechanisms,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411–426, 2007.
[37] W. T. Freeman and E. H. Adelson, “The design and use of steerable fil-
ters,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 13, pp. 891–906, 1991.
[38] E. Tola, V. Lepetit, and P. Fua, “A fast local descriptor for dense
matching,” in Proceedings of the International Conference on Computer
Vision and Pattern Recognition, Anchorage, June 2008.
[39] K. Mikolajczyk and J. Matas, “Improving descriptors for fast tree match-
ing by optimal linear projection,” in Proceedings of the International
Conference on Computer Vision, Rio de Janeiro, 2007.
[40] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal Laplacianfaces
for face recognition,” IEEE Transactions on Image Processing, vol. 15,
no. 11, pp. 3608–3614, November 2006.
[41] G. Hua, P. Viola, and S. Drucker, “Face recognition using discriminatively
trained orthogonal rank one tensor projections,” in Proc. of IEEE Conf.
on Computer Vision and Pattern Recognition, Minneapolis, MN, June
2007.
[42] C. Strecha, W. von Hansen, L. V. Gool, P. Fua, and U. Thoennessen,
“On benchmarking camera calibration and multi-view stereo for high
resolution imagery,” in Proceedings of the International Conference on
Computer Vision and Pattern Recognition, Anchorage, June 2008.
[43] S. Winder, G. Hua, and M. Brown, “Picking the best daisy,” in Proceed-
ings of the International Conference on Computer Vision and Pattern
Recognition (CVPR09), Miami, June 2009.
Matthew Brown is a Postdoctoral Fellow at the École Polytechnique Fédérale de Lausanne. He obtained the M.Eng. degree in Electrical and Information Sciences from Cambridge University in 2000,
and the Ph.D. degree in Computer Science from
the University of British Columbia in 2005. His
research interests include Computer Vision, Machine
Learning, Medical Imaging and Environmental In-
formatics. He worked with Microsoft Research as
an intern in Cambridge in 2002, and in Redmond
in 2003-2004. He returned to Microsoft Research
Redmond as a Postdoctoral Researcher during 2006-2007. His work there
focused on image segmentation, panoramic stitching and local feature design.
His work on panoramic stitching has been widely adopted, and appears on
the curriculum of many university courses as well as in several commercial products. He is a director and CTO of Vancouver-based Cloudburst Research
Inc.
Gang Hua is a Senior Researcher at Nokia Research
Center Hollywood. Before that, he was a Scientist at
Microsoft Live Labs Research from 2006 to 2009.
He received his Ph.D. degree in Electrical and Com-
puter Engineering from Northwestern University in
2006, and the M.S. and B.S. degrees in Electrical Engi-
neering from Xi’an Jiaotong University in 2002 and
1999, respectively. He was enrolled in the Special
Class for the Gifted Young of XJTU in 1994. During
the summers of 2005 and 2004, he was a
research intern with the Speech Technology Group,
Microsoft Research, Redmond, WA, and a research intern with the Honda
Research Institute, Mountain View, CA, respectively.
He received the Richter Fellowship and the Walter P. Murphy Fellowship
at Northwestern University in 2005 and 2002, respectively. When he was at
XJTU, he was awarded the Guanghua Fellowship, the Eastcom Fellowship,
the Most Outstanding Student Exemplar Fellowship, the Sea-star Fellowship
and the Jiangyue Fellowship in 2001, 2000, 1997, 1997 and 1995 respectively.
He was also a recipient of the University Fellowship from 1994 to 2001 at
XJTU. He is a member of both the IEEE and the ACM. As of January 2009, he holds one US patent and has 18 more pending.
Simon Winder is a Senior Developer in the Interac-
tive Visual Media group at Microsoft Research. He
obtained his Ph.D. in 1995 from the School of Math-
ematical Sciences, University of Bath, UK, studying
computational neuroscience of primate vision. Prior
to joining Microsoft, Simon obtained a B.Sc. in
1990 and an M.Eng. in 1991 from the University of
Bath, studying Electrical and Electronic Engineer-
ing. Prior employment includes work on thermal
imaging hardware at GEC Sensors, Basildon, UK,
and later work on MPEG-4 video standardization
for the Partnership in Advanced Computing Technologies, Bristol, UK. His
current research includes feature detection and descriptors for matching
and recognition with application to 3D reconstruction and real-time scene
recognition, localization, and mapping. To date he has filed 20 US patents.