Discriminative Learning of Local Image Descriptors


Matthew Brown, Member, IEEE, Gang Hua, Member, IEEE and Simon Winder, Member, IEEE
Abstract: In this paper we explore methods for learning
local image descriptors from training data. We describe a set
of building blocks for constructing descriptors which can be
combined together and jointly optimized so as to minimize the
error of a nearest-neighbour classifier. We consider both linear
and non-linear transforms with dimensionality reduction, and
make use of discriminant learning techniques such as Linear
Discriminant Analysis (LDA) and Powell minimization to solve
for the parameters. Using these techniques we obtain descriptors
that exceed state-of-the-art performance with low dimensionality.
In addition to new experiments and recommendations for
descriptor learning, we are also making available a new and
realistic ground truth dataset based on multi-view stereo data.
Index Terms: image descriptors, local features, discriminative
learning, SIFT
I. INTRODUCTION
LOCAL feature matching has rapidly emerged to become
the dominant paradigm for recognition and registration in
computer vision. In traditional vision tasks such as panoramic
stitching [1], [2] and structure from motion [3], [4], it has largely
replaced direct methods due to its speed, robustness, and the
ability to work without initialization.
It is also used in many recognition problems. Vector quantizing
feature descriptors to finite vocabularies and using the analogue
of “visual words” has enabled visual recognition to scale into
the millions of images [5], [6]. Also the statistical properties of
local features and visual words have been exploited by many
researchers for object class recognition problems [7], [8], [9].
However, despite the proliferation of learning techniques that
are being employed for higher level visual tasks, the majority
of researchers still rely upon a small selection of hand coded
feature transforms for the lower level processing. A good survey
of some of the more common techniques can be found in [10],
[11]. Some exceptions to this rule and good examples of low-level
feature learning include the work of Lepetit and Fua [12], Shotton
et al. [13] and Babenko et al. [14]. Lepetit and Fua [12] showed that
randomized trees based on simple pixel differences could be an
effective low level operation. This idea was extended by Shotton
et al. [13], who demonstrated a compelling scheme for object
class recognition. Babenko et al. [14] showed that boosting could
be applied to learn point based feature matching representations
from a large training dataset. Another example of learning low
level image operations is the Berkeley edge detector [15], which,
rather than being optimized for recognition performance per se,
is designed to mimic human edge labellings.
Matthew Brown is with the Computer Vision Laboratory, Ecole Polytechnique
Fédérale de Lausanne, 1015 Lausanne, Switzerland. Gang Hua is with Nokia
Research Center Hollywood, 2400 Broadway, D-500, Santa Monica, CA 90404.
Simon Winder is with the Interactive Visual Media group at Microsoft
Research, One Microsoft Way, Redmond, WA 98052.
Progress in image feature matching improved rapidly following
Schmid and Mohr’s work on indexing using grey-value invariants
[16]. This represented a step forward over previous approaches to
invariant recognition that had largely been based on geometrical
entities such as edges and contours [17]. Another landmark paper
in the area was the work of Lowe [18], [19] who demonstrated
the importance of scale invariance and a non-linear, edge-based
descriptor transformation inspired by the ideas of Hubel and
Wiesel [20]. Since then small improvements have resulted, mainly
due to improved spatial pooling arrangements that are more
closely linked to the errors present in the interest point detection
process [11], [21], [22].
One criticism of the local image descriptor designs described
above has been the high dimensionality of descriptors (e.g., 128
dimensions for SIFT). Dimensionality reduction techniques can
help here, and have also been used to design features.
A first attempt was PCA-SIFT [23], which used the principal
components of gradient patches to form local descriptors. Whilst
this provides some benefits in reducing noise in the descriptors, a
better approach is to find projections that actively discriminate
between classes [24], instead of just modelling the total data
variance. Such techniques have been extensively studied in the
face recognition literature [25], [26], [27].
Our work attempts to improve on the state of the art in local de-
scriptor matching by learning optimal low-level image operations
using a large and realistic training dataset. In contrast to previous
approaches that have used only planar transformations [11] or
jittered patches [12] we use actual 3D correspondences obtained
via a stereo depth map. This allows us to design descriptors that
are optimized for the non-planar transformations and illumination
changes that result from viewing a truly 3D scene. We note that
Moreels and Perona have also proposed a technique for evaluating
3D feature matches based on trifocal constraints [28]. Our work
extends this approach by giving us the ability to generate new
correspondences at arbitrary locations and also to reason about
To generate correspondences, we leverage recent improvements
in multi-view stereo matching [29], [30]. In contrast to previous
approaches [31], this allows us to generate correspondences
for arbitrary interest points and to model true interest point
noise. We explore two methodologies for feature learning. The
first uses parametric models inspired by previous successful
feature designs, and Powell minimization [32] to solve for the
parameters. The second uses non-parametric dimensionality re-
duction techniques common in the face recognition literature.
Our training and test datasets, containing approximately 2.5 × 10^6
labelled image patches, are being made available online.
A. Contributions
The main contributions of this work are as follows:
1) We present a new ground-truth dataset for descriptor learn-
ing, making use of multi-view stereo from large 3D recon-
structions. This allows us to optimize descriptors for real
interest point detections. We will be making this dataset
available to the community.
2) We extend previous work in parametric and non-parametric
descriptor learning, and provide recommendations for future
3) We conduct several new experiments, including reducing
dynamic range to minimize the number of bits used by our
feature descriptors (important for scalability) and optimiz-
ing descriptors for different types of interest point (e.g.,
Harris and DOG).
II. GROUND TRUTH DATA
To generate ground truth data for our descriptor matching
problems, we make use of recent advances in multi-view image
recognition and correspondence. Recent improvements in wide-
baseline matching and structure from motion have made it possi-
ble to find matches and compute cameras for datasets containing
thousands of images, with greatly varying pose and illumination
conditions [33], [34]. Furthermore, advances in multi-view stereo
have made it possible to reconstruct dense surface models for
such images despite the greatly varying imaging conditions [29],
We view these 3D reconstructions as a possible source of train-
ing data for object recognition problems. Previous work [31] used
re-projections of 3D point clouds to establish correspondences
between images, adding synthetic jitter to emulate the noise
introduced in the interest point detection process. This approach,
whilst being straightforward to implement, has the disadvantage
of allowing training data to be collected only at discrete locations,
and fails to model true interest point noise.
In this work, we use dense surface models obtained via stereo
matching to establish correspondences between images. Note
that because of the epipolar and multi-view constraints, stereo
matching is a much easier problem than unconstrained 2D feature
matching. We can thus generate correspondences via local stereo
matching and multi-view consistency constraints that will be very
challenging for wide baseline feature matching methods to match.
We can also learn descriptors that are optimized for actual (and
arbitrary) interest point detections, finding corresponding points
by transferring their positions via the depth maps.
We make use of camera calibration information and dense
multi-view stereo data for three datasets containing over 1000
images provided by [34] and [30]. In a similar spirit to [31], we
extract patches around each interest point and store them in a large
dataset on disk for efficient processing and learning. We detect
Difference of Gaussian (DOG) interest points with associated
position, scale and orientation in the manner of [19] (we also
experiment with multi-scale Harris corners in Section VI-E). This
results in around 1000 interest points per image.
For each interest point detected, we compute the position,
scale and orientation of the local region when mapped into each
neighbouring image. These parameters are solved for by a least-
squares procedure. We do this by creating a uniform, dense point
sampling (once per pixel) within the feature footprint in the first
image. These points are then transferred via the depth map into the
second image. In general the sampled points will not undergo an
exact similarity transform, due to depth variations and perspective
effects, so we estimate the best translation, rotation and scale
between the corresponding image regions by least squares.
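The least-squares fit of translation, rotation and scale described above can be illustrated with a short sketch. It assumes the transferred sample points are available as two N×2 arrays of (x, y) coordinates; the function name `fit_similarity` and the complex-number closed form are choices made for this illustration and are not necessarily the solver used by the authors.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src onto dst.

    src, dst: (N, 2) arrays of corresponding (x, y) points.
    Returns (scale, angle_in_radians, translation_xy).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    # Treat centred 2-D points as complex numbers: dst ~ z * src,
    # where z = scale * exp(i * angle) encodes rotation and scale.
    zs = (src[:, 0] - mu_s[0]) + 1j * (src[:, 1] - mu_s[1])
    zd = (dst[:, 0] - mu_d[0]) + 1j * (dst[:, 1] - mu_d[1])
    # np.vdot conjugates its first argument, giving the normal-equation solution.
    z = np.vdot(zs, zd) / np.vdot(zs, zs)
    # The translation maps the transformed source centroid onto the target centroid.
    t = complex(mu_d[0], mu_d[1]) - z * complex(mu_s[0], mu_s[1])
    return float(abs(z)), float(np.angle(z)), np.array([t.real, t.imag])
```

When the points do undergo an exact similarity transform, the fit recovers it; otherwise it returns the closest similarity in the squared-error sense, which is the situation described for patches with depth variation.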
First, we check to see if the interest point is visible in the
neighbouring image using the visibility maps supplied by [30] (a
visibility map is defined over each neighbouring image, and each
pixel has the label 1 if the corresponding point in the reference
image is visible, and 0 otherwise). We then declare interest points
that are detected within 5 pixels of position, 0.25 octaves of scale
and π/8 radians in angle to be “matches”. Those falling outside
2× these ranges are defined to be “non-matches”. Interest point
detections that are in between these ranges are deemed to be
ambiguous and not used in learning or testing. We chose fairly
small ranges for position, orientation and scale tolerance to suit
our intended applications in automatic stitching and structure from
motion. However, for category recognition problems one might
choose larger ranges that should result in more position invariance
but less discriminative representations. See Figures 1 and 2 for
examples of correspondences and image patches generated by this
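The match, non-match and ambiguous labelling rule above can be sketched as a small function. The thresholds come from the text; the function name `label_pair`, the inclusive comparisons, and the reading that a detection is a non-match as soon as any one of the three differences exceeds twice its tolerance are assumptions of this sketch. The visibility check is assumed to have been done beforehand.

```python
import math

POS_TOL = 5.0            # pixels
SCALE_TOL = 0.25         # octaves
ANGLE_TOL = math.pi / 8  # radians

def label_pair(dpos, dscale_octaves, dangle):
    """Label a detection relative to the projected reference interest point.

    dpos: position difference in pixels, dscale_octaves: scale difference
    in octaves, dangle: orientation difference in radians.
    """
    # Wrap the angle difference into [-pi, pi) before taking its magnitude.
    dangle = abs((dangle + math.pi) % (2.0 * math.pi) - math.pi)
    dpos, dscale = abs(dpos), abs(dscale_octaves)
    if dpos <= POS_TOL and dscale <= SCALE_TOL and dangle <= ANGLE_TOL:
        return "match"
    # Assumption: exceeding twice the tolerance in any one quantity
    # is enough to call the pair a non-match.
    if dpos > 2 * POS_TOL or dscale > 2 * SCALE_TOL or dangle > 2 * ANGLE_TOL:
        return "non-match"
    return "ambiguous"  # not used for learning or testing
```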
III. DESCRIPTOR ALGORITHM
In previous work [31] we have noted that many existing
descriptors described in the literature, while appearing quite
different, can be constructed using a common modular framework
consisting of processing stages similar to Figure 3. At each stage,
different candidate block algorithms (described below) may be
swapped in and out to produce a new overall descriptor. In
addition, some candidates have free parameters that we can adjust
in order to maximize the performance of the descriptor as a whole.
Certain of these algorithmic combinations give rise to published
descriptors but many are untested. Using this structure allows us
to examine the contribution of each building block in detail and
obtain a better covering of the space of possible algorithms.
Our approach to learning descriptors is therefore to put to-
gether a combination of building blocks and then optimize the
parameters of these blocks using learning to obtain the best
match/no-match classification performance. This contrasts with
prior attempts to hand tune descriptor parameters and helps to
put each algorithm on the same footing so that we can obtain
and compare best performances.
Figure 3 shows the overall learning framework for building
robust local image descriptors. The input is a set of image patches,
which may be extracted from the neighbourhood of any interest
point detector. The processing stages consist of the following:
G-block: Gaussian smoothing is applied to the input patch.
T-blocks: We perform a range of non-linear transformations
to the smoothed patch. These include operations such as
angle-quantized gradients and rectified steerable filters,
and typically resemble the “simple-cell” stage in human
visual processing.
S-blocks/E-blocks: We perform spatial pooling of the
above filter responses. S-blocks use parametrized pooling
regions, E-blocks are non-parametric. This stage
resembles the “complex-cell” operations in visual processing.
N-blocks: We normalize the output patch to account for
photometric variations. This stage may optionally be
followed by another E-block, to reduce the number of
dimensions at the output.
Fig. 1. Generating ground truth correspondences. To generate the ground truth image correspondences needed as input to our algorithms, we use multi-view
stereo data provided by Goesele et al [30]. Interest points are detected in the reference image, and transferred to each neighbouring image via the depth map.
If the projected point is visible, we look for interest points within a specified range of position, orientation and scale, and declare these to be matches. Points
lying outside of twice this range are declared to be non-matches. This is the basic input to our learning algorithms. Left to right: reference image, neighbour
image, reference matches, neighbour matches, depth map, visibility map.
In general, the T-block stage extracts useful features from the data
like edge or local frequency information, and the S-block stage
pools these features locally to make the representation insensitive
to positional shift. These stages are similar to the simple/complex
cells in the human visual cortex [36]. It is important that the
T-block stage introduces some non-linearity, otherwise the smoothing
step amounts to simply blurring the image. Also, the N-block
normalization is critical as many factors such as lighting,
reflectance and camera response have a large effect on the actual
pixel values.
These processing stages have been combined into 3 different
pipelines, as shown in the figure. Each stage has trainable
parameters, which are learnt using our ground truth dataset of
match/non-match pairs. In the remainder of this section, we will
take a more detailed look at the parametrization of each of these
building blocks.
A. Pre-smoothing (G-block)
We smooth the image pixels using a Gaussian kernel of
standard deviation σ_s as a pre-processing stage to allow the
descriptor to adapt to an appropriate scale relative to the interest
point scale. This stage is optional and can be included in the
T-block processing (below) if desired.
B. Transformation (T-block)
The transformation block maps the smoothed input patch onto
a grid with one length-k vector with positive elements per
output sample. In this paper, the output grid was given the same
resolution as the input patch, i.e., 64 × 64. Various forms of linear
or non-linear transformations or classifiers are possible and have
been described previously [31]. In this paper we restrict our choice
to the following T-blocks which were found to perform well:
[T1] We evaluate the gradient vector at each sample and
recover its magnitude m and orientation θ. We then quantize the
orientation to k directions and construct a vector of length k such
that m is linearly allocated to the two circularly adjacent vector
elements i and i + 1 representing θ_i < θ < θ_{i+1}, according to the
proximity to these quantization centres. All other elements are
zero. This process is equivalent to the orientation binning used in
SIFT and GLOH [11]. For the T1a-variant we use k = 4 directions
and for the T1b-variant we use k = 8 directions.
[T2] We evaluate the gradient vector at each sample and rectify
its x and y components to produce a vector of length 4 for the
T2a-variant: {|∇_x| − ∇_x; |∇_x| + ∇_x; |∇_y| − ∇_y; |∇_y| + ∇_y}. This
provides a natural sine-weighted quantization of orientation into
4 directions. Alternatively, for T2b, we extend this to 8 directions
by concatenating an additional length 4 vector using ∇_45, which
is the gradient vector rotated through 45°.
[T3] We apply steerable filters at each sample location using n
orientations and compute the responses from quadrature pairs [37]
with rectification to give a length k = 4n vector, in a similar way
to the gradient computation described above, so that the positive
and negative parts of the quadrature filter responses are placed in
different vector elements. We tried two kinds of steerable filters:
those based on second derivatives provide broader scale and
orientation tuning, while fourth order filters give narrow scale and
orientation tuning that can discriminate multiple orientations at
each location in the input patch. These filters were implemented
using the example coefficients given in [37]. The variants were
T3g: 2nd order, 4 orientations; T3h: 4th order, 4 orientations; T3i:
2nd order, 8 orientations; and T3j: 4th order, 8 orientations.
Fig. 2. Patch correspondences from the Liberty dataset. Top rows: reference image and depth map (left column), generated point correspondences (other
columns). Note the wide variation in viewpoints and scales. Bottom rows: patches extracted from this dataset. Patches are considered to be “matching” if the
detected interest points are within 5 pixels in position, 0.25 octaves of scale and π/8 radians in angle.
Fig. 3. Schematic showing the learning algorithms explored for building local image descriptors. Three overall pipelines have been explored: (1) uses
parametric optimization (‘S’ blocks) using Powell minimization as in [31]; (2) uses optimal linear projections (‘E’ blocks), found via LDA as
in [35]; and a third approach (3) combines a stage of (1) followed by the linear projection step in (2).
[T4] We compute two isotropic Difference of Gaussians (DOG)
responses with different centre scales at each location by con-
volving the already smoothed patch with three new Gaussians
(one additional larger centre and two surrounds). The two linear
DOG filter outputs are then used to generate a length 4 vector
by rectifying their responses into positive and negative parts as
described above for gradient vectors. We set the ratio between the
centre and surround space constants to 1.4. The pre-smoothing
stage sets the size of the first DOG centre and so we use one
additional parameter to set the relative size of the second DOG centre.
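As an illustration of the gradient-based T-blocks above, the following sketch implements T1 (soft orientation binning) and T2 (rectified gradient components) with NumPy. It is a minimal sketch under stated assumptions: `np.gradient` finite differences stand in for whatever gradient operator the authors used, bin centres are placed at multiples of 2π/k, the sign convention of the 45° rotation is chosen arbitrarily, and the function names are hypothetical.

```python
import numpy as np

def gradients(patch):
    """Finite-difference gradients (gx, gy) of a 2-D patch."""
    # np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x).
    gy, gx = np.gradient(np.asarray(patch, dtype=float))
    return gx, gy

def t1_block(patch, k=4):
    """T1: split the gradient magnitude between the two nearest of k orientation bins."""
    gx, gy = gradients(patch)
    m = np.hypot(gx, gy)
    theta = np.arctan2(gy, gx) % (2.0 * np.pi)
    pos = theta * k / (2.0 * np.pi)   # fractional bin index
    base = np.floor(pos)
    frac = pos - base                 # proximity to the upper bin centre
    lo = base.astype(int) % k
    hi = (lo + 1) % k                 # circularly adjacent bin
    out = np.zeros(m.shape + (k,))
    rows, cols = np.indices(m.shape)
    out[rows, cols, lo] += m * (1.0 - frac)
    out[rows, cols, hi] += m * frac
    return out

def rectify(g):
    """Negative and positive rectified parts of g, as |g| - g and |g| + g."""
    return [np.abs(g) - g, np.abs(g) + g]

def t2_block(patch, variant="a"):
    """T2a: 4 rectified gradient channels; T2b: 8, adding the 45-degree rotated gradient."""
    gx, gy = gradients(patch)
    channels = rectify(gx) + rectify(gy)
    if variant == "b":
        c = np.sqrt(0.5)
        # Gradient vector rotated through 45 degrees (rotation direction assumed).
        channels += rectify(c * (gx - gy)) + rectify(c * (gx + gy))
    return np.stack(channels, axis=-1)
```

Both functions return an array with one non-negative length-k vector per pixel, which is the form expected by the pooling stage.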
Fig. 4. Examples of the different spatial summation blocks. Panels: S1, SIFT grid with
bilinear weights; S2, GLOH polar grid with bilinear radial and angular weights; S3, 3×3 grid
with Gaussian weights; S4, 17 polar samples with Gaussian weights. For S3 and S4,
the positions of the samples and the sizes of the Gaussian summation zones
were parametrized in a symmetric manner.
C. Spatial Pooling (S-block)
Many descriptor algorithms incorporate some form of histogramming.
In our pooling stage we spatially accumulate
weighted vectors from the previous stage to give N linearly
summed vectors of length k, and these are concatenated to form
a descriptor of kN dimensions, where N ∈ {3, 9, 16, 17, 25}. We
now describe the different spatial arrangements of pooling and
the different forms of weighting:
[S1] We used a square grid of pooling centres (see Figure 4),
with the overall footprint size of this grid being a parameter. The
vectors from the previous stage were summed together spatially
by bilinearly weighting them according to their distance from the
pooling centres as in the SIFT descriptor [19] so that the width of
the bilinear function is dictated by the output sample spacing. We
use sub-pixel interpolation throughout as this allows continuous
control over the size of the descriptor grid. Note that all these
summation operations are performed independently for each of
the k vector elements.
[S2] We used the spatial histogramming scheme of the GLOH
descriptor introduced by Mikolajczyk and Schmid [11]. This uses
a polar arrangement of summing regions as shown in Figure 4.
We used three variants of this arrangement with 3, 9 and 17
regions, depending on the number of angular segments in the
outer two rings (zero, 4, or 8). The radii of the centres of the
middle and outer regions and the outer edge of the outer region
were parameters that were available for learning. Input vectors
are bilinearly weighted in polar coordinates so that each vector
contributes to multiple regions. As a last step, each of the final
vectors from the N pooling regions is normalized by the area of
its summation region.
[S3] We used normalized Gaussian weighting functions to sum
input vectors over local pooling regions arranged on a 3×3, 4×4
or 5×5 grid. The sizes of each Gaussian and the positions of the
grid samples were parameters that could be learned. Figure 4
displays the symmetric 3×3 arrangement with two position
parameters and three Gaussian widths.
[S4] We tried the same approach as S3 but instead used a polar
arrangement of Gaussian pooling regions with 17 or 25 sample
centres. Parameters were used to specify the ring radii and the size
of the Gaussian kernel associated with all samples in each ring
(Figure 4). The rotational phase angle of the spatial positioning of
middle ring samples was also a parameter that could be learned.
This configuration was introduced in [31] and named the DAISY
descriptor by [38].
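The Gaussian pooling used by the S3 and S4 blocks can be sketched as follows. This is a simplified illustration: it takes an arbitrary list of pooling centres with one Gaussian width each, so a polar (S4-style) layout differs only in the centres supplied. The single `spacing` parameter of the hypothetical helper `grid_centres` replaces the symmetric parametrization described above, and both function names are assumptions of this sketch.

```python
import numpy as np

def grid_centres(size, n, spacing):
    """Centres (y, x) of an n x n grid, symmetric about the middle of a size x size patch."""
    c = (size - 1) / 2.0
    offsets = (np.arange(n) - (n - 1) / 2.0) * spacing
    return [(c + dy, c + dx) for dy in offsets for dx in offsets]

def gaussian_pool(features, centres, sigmas):
    """Pool an (H, W, k) feature array into a descriptor of length N * k.

    Each of the N pooling regions is a normalized Gaussian window, and the
    summation is carried out independently for each of the k channels.
    """
    h, w, _ = features.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pooled = []
    for (cy, cx), sigma in zip(centres, sigmas):
        weights = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
        weights /= weights.sum()  # normalized Gaussian weighting
        # Weighted sum over both spatial axes leaves one length-k vector.
        pooled.append(np.tensordot(weights, features, axes=([0, 1], [0, 1])))
    return np.concatenate(pooled)
```

With a 3×3 grid and k = 4 channels this yields a 36-dimensional vector, in line with the kN dimensionality stated at the start of this subsection.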
D. Embedding (E-block)
Embedding methods are prevalent in the face recognition
literature [24], [25], and have been used by some authors for
building local image descriptors [23], [35], [39]. Discriminative
linear embedding can identify more robust image descriptors,
whilst simultaneously reducing the number of dimensions. We
summarize the different embedding methods we have used for
E-blocks below (see also the objective functions in Section V).
[E1] We perform principal component analysis (PCA) on the
input vectors. This is a non-discriminative technique and is used
mostly for comparison purposes.
[E2] We find projections that minimize the ratio of in-class
variance for match pairs to the variance of all match pairs. This
is similar to Locality Preserving Projections (LPP) [25].
[E4] We find projections that minimize the ratio of variance
between matched and non-matched pairs. This is similar to Local
Discriminative Embedding [26].
[E6] We find projections that minimize the ratio of in-class
variance for match pairs to the total data variance. We call
this generalized local discriminative embedding (GLDE). If the
number of classes is large, this objective function will be similar
to [E2] and [E4] [35].
[E3], [E5] and [E7] are the same as [E2], [E4] and [E6] with
the addition of orthogonality constraints which ensure that each
of the projection directions are mutually orthogonal [40], [27],
E. Post Normalization (N-block)
We use normalization to remove the descriptor dependency on
image contrast and to introduce robustness.
For parametric descriptors, we employ the SIFT style normalization
approach which involves range clipping descriptor
elements. Our slightly modified algorithm consists of four steps:
(1) normalize to a unit vector, (2) clip all the elements of
the vector that are above a threshold κ by computing
v'_i = min(v_i, κ), (3) re-normalize to a unit vector, and (4) repeat from
step 2 until convergence or a maximum number of iterations
has been reached. This procedure has the effect of reducing the
dynamic range of the descriptor and creating a robust function
for matching. The threshold κ was available for learning.
In the case of the non-parametric descriptors of Figure 3(2),
we normalize the descriptor to a unit vector.
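The four-step clipping normalization for parametric descriptors can be sketched as below. The default threshold, iteration cap and convergence tolerance are placeholder values chosen for this illustration (in the paper the threshold κ is learned), and the function name is hypothetical.

```python
import numpy as np

def sift_style_normalize(v, kappa=0.2, max_iter=10, tol=1e-6):
    """Iterated unit-normalization with clipping of elements above kappa."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v.copy()                              # nothing to normalize
    v = v / norm                                     # step 1: unit vector
    for _ in range(max_iter):                        # step 4: repeat
        clipped = np.minimum(v, kappa)               # step 2: clip at kappa
        new_v = clipped / np.linalg.norm(clipped)    # step 3: re-normalize
        converged = np.max(np.abs(new_v - v)) < tol
        v = new_v
        if converged:
            break
    return v
```

Because the last operation is always a re-normalization, the output has unit length, and its largest element approaches κ only as the iteration converges.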
IV. LEARNING PARAMETRIC DESCRIPTORS
This section corresponds to Pipeline 1 in Figure 3. The input
to the modular descriptor is a 64 × 64 image patch and the final
output is a descriptor vector of D = kN numbers, where k is the
T-block dimension and N is the number of S-block summation
We evaluate descriptor performance and carry out learning
using our ground-truth data sets consisting of match and non-
match pairs. For each pair we compute the Euclidean distance
between descriptor vectors and form two histograms of this value
for all true matching and non-matching cases in the data set.
A good descriptor minimizes the amount of overlap of these
histograms. We integrate the two histograms to obtain an ROC
curve which plots correctly detected matches as a fraction of all
true matches against incorrectly detected matches as a fraction
of all true non-matches. We compute the area under the ROC
curve as a final score for descriptor performance and aim to
maximize this value. Other choices for quality measures are
possible depending on the application but we choose ROC area
as a robust and fairly generic measure. In terms of reporting our
results on the test set, however, we choose to indicate performance
in terms of the percentage of false matches present when 95% of
all correct matches are detected.
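The evaluation just described can be sketched in a few lines: sweep a threshold over the descriptor distances, accumulate the ROC curve, and report the area together with the error rate at 95% recall. The error rate is interpreted here as the fraction of true non-matches accepted when 95% of true matches are accepted, which is an assumption about the exact definition; the function name `roc_metrics` is hypothetical.

```python
import numpy as np

def roc_metrics(match_dist, nonmatch_dist, recall=0.95):
    """Return (ROC area, false positive rate at the given recall).

    match_dist, nonmatch_dist: descriptor distances for the true match
    and true non-match pairs respectively.
    """
    match_dist = np.sort(np.asarray(match_dist, dtype=float))
    nonmatch_dist = np.sort(np.asarray(nonmatch_dist, dtype=float))
    thresholds = np.unique(np.concatenate([match_dist, nonmatch_dist]))
    # A pair is accepted as a match when its distance is <= the threshold.
    tpr = np.searchsorted(match_dist, thresholds, side="right") / match_dist.size
    fpr = np.searchsorted(nonmatch_dist, thresholds, side="right") / nonmatch_dist.size
    tpr = np.concatenate([[0.0], tpr])
    fpr = np.concatenate([[0.0], fpr])
    # Trapezoidal integration of the true positive rate over the false positive rate.
    area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    # First operating point that reaches the requested recall.
    err = float(fpr[np.argmax(tpr >= recall)])
    return area, err
```

Perfectly separated distance histograms give an area of 1 and an error rate of 0, while identical histograms give an area of 0.5.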
We jointly optimized parameter values of G, T, S, and N-blocks
by using Powell’s multidimensional direction set method [32] to
maximize the ROC area. We initialized the optimization with
reasonable choices of parameters.
Each ROC area measure was evaluated using one run over the
training data set. After each run we updated the parameters and
repeated the evaluation until the change in ROC area was small.
In order to avoid over-fitting we used a careful parametrization of
the descriptors using as few parameters as possible (typically 5–11
depending on descriptor type). Once we had determined optimal
parameters, we re-ran the evaluation over our testing data set to
obtain the final ROC curves and error rates.
V. LEARNING NON-PARAMETRIC DESCRIPTORS
This section corresponds to Pipeline 2 in Figure 3. In this
section, we attempt to learn the spatial pooling component of
the descriptor pipeline without committing to any particular
parametrization. To do this, we make use of linear embedding
techniques as described in Section III-D. Instead of using
numerical optimization methods such as Powell minimization to
optimize parametrized descriptors, the embedding methods solve
directly for a set of optimal linear projections. The projected
output vector in this embedding space becomes the final image
descriptor. Although Pipeline 2 also involves parameters for
T and N-blocks, these are learned independently using Powell
Minimization as described above. We leave the joint optimization
of these parameters for future work.
The input to the embedding learning algorithms is a set of
match/non-match labelled image pairs that have been processed
by different processing units (T-blocks), i.e.,
$$ S = \{ \mathbf{x}_i = T(\mathbf{p}_i),\ \mathbf{x}_j = T(\mathbf{p}_j),\ l_{ij} \}. \qquad (1) $$
In Equation 1, $\mathbf{p}_k$ is an input image patch, $T(\cdot)$ represents a
composite set of different image processing units presented in
Section III, $\mathbf{x}_k$ is the output vector of $T(\cdot)$, and $l_{ij}$ takes a binary
value to indicate whether patches $\mathbf{p}_i$ and $\mathbf{p}_j$ are a match ($l_{ij} = 1$) or a
non-match ($l_{ij} = 0$). We now present the mathematical formulation
of the different embedding learning algorithms.
A. Objective functions of different embedding methods.
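Our E2 block attempts to maximize the ratio of the projected
variance of all $\mathbf{x}_i$ in the match patch pair set to that of the
difference vectors $\mathbf{x}_i - \mathbf{x}_j$. Letting $\mathbf{w}$ be the projection vector,
we can write this mathematically as follows:
$$ J_1(\mathbf{w}) = \frac{\sum_{l_{ij}=1} \left( \mathbf{w}^T \mathbf{x}_i \right)^2}{\sum_{l_{ij}=1} \left( \mathbf{w}^T (\mathbf{x}_i - \mathbf{x}_j) \right)^2}. \qquad (2) $$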
The intuition for this objective function is that in projection space,
we try to minimize the distance between the match pairs while
at the same time keeping the overall projected variance of all
vectors in the match pair set as big as possible. This is similar to
the Laplacian eigen-map adopted in previous works such as the
locality preserving projections [25].
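Alternatively, motivated by local discriminative embedding [26],
the E4 block optimizes the following objective function:
$$ J_2(\mathbf{w}) = \frac{\sum_{l_{ij}=0} \left( \mathbf{w}^T (\mathbf{x}_i - \mathbf{x}_j) \right)^2}{\sum_{l_{ij}=1} \left( \mathbf{w}^T (\mathbf{x}_i - \mathbf{x}_j) \right)^2}. \qquad (3) $$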
By maximizing J2(w), we are seeking the embedding space under
which the distances between match pairs are minimized and the
distances between non-match pairs are maximized.
A third objective function (E6 blocks) unifies the above two
objective functions under certain conditions [35]:
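$$ J_3(\mathbf{w}) = \frac{\sum_{\mathbf{x}_i \in S} \left( \mathbf{w}^T \mathbf{x}_i \right)^2}{\sum_{l_{ij}=1} \left( \mathbf{w}^T (\mathbf{x}_i - \mathbf{x}_j) \right)^2}. \qquad (4) $$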
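All three objective functions $J_1$, $J_2$, and $J_3$ can be written in
matrix form as
$$ J_i(\mathbf{w}) = \frac{\mathbf{w}^T A_i \mathbf{w}}{\mathbf{w}^T B \mathbf{w}}, \qquad (5) $$
where
$$ A_1 = \sum_{l_{ij}=1} \mathbf{x}_i \mathbf{x}_i^T, \qquad (6) $$
$$ A_2 = \sum_{l_{ij}=0} (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T, \qquad (7) $$
$$ A_3 = \sum_{\mathbf{x}_i \in S} \mathbf{x}_i \mathbf{x}_i^T, \qquad (8) $$
$$ B = \sum_{l_{ij}=1} (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T. \qquad (9) $$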
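In the following, for ease of presentation, we use $A$ to represent
any of $A_1$, $A_2$ and $A_3$. Setting the derivative of our objective
function (Equation 5) to zero gives
$$ \frac{\partial J}{\partial \mathbf{w}} = \frac{2 A \mathbf{w} \left( \mathbf{w}^T B \mathbf{w} \right) - 2 \left( \mathbf{w}^T A \mathbf{w} \right) B \mathbf{w}}{\left( \mathbf{w}^T B \mathbf{w} \right)^2} = 0, \qquad (10) $$
which implies that the optimal $\mathbf{w}$ is given by the solution to a
generalized eigenvalue problem
$$ A \mathbf{w} = \lambda B \mathbf{w}, \qquad (11) $$
where $\lambda = \mathbf{w}^T A \mathbf{w} / \mathbf{w}^T B \mathbf{w}$. Equation 11 is solved using standard
techniques, and the first $K$ generalized eigenvectors are chosen
to form the embedding space.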
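As a sketch of how such an embedding might be computed in practice, the code below builds the scatter matrices of the J2 objective from labelled pairs (non-match differences in the numerator, match differences in the denominator) and solves the generalized eigenvalue problem Aw = λBw by Cholesky whitening. It assumes B is positive definite, which in practice may require the regularization discussed in the next subsection; the function names and the NumPy-only solver are choices of this illustration rather than the authors' implementation.

```python
import numpy as np

def pair_matrices(X1, X2, labels):
    """Scatter matrices (A, B) for the J2 objective.

    X1, X2: (P, n) arrays holding the two descriptors of each pair as rows.
    labels: length-P array with 1 for match pairs and 0 for non-match pairs.
    """
    labels = np.asarray(labels)
    D = np.asarray(X1, dtype=float) - np.asarray(X2, dtype=float)
    Dm, Dn = D[labels == 1], D[labels == 0]
    A = Dn.T @ Dn   # sum of outer products of non-match differences
    B = Dm.T @ Dm   # sum of outer products of match differences
    return A, B

def generalized_eig(A, B, K):
    """Top-K solutions of A w = lambda B w for symmetric A and positive definite B."""
    L = np.linalg.cholesky(B)              # B = L L^T
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T                  # symmetric standard eigenproblem
    evals, evecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:K]    # keep the K largest
    W = Linv.T @ evecs[:, order]           # map back: w = L^{-T} y
    return evals[order], W
```

A descriptor x is then embedded as `W.T @ x`, giving a K-dimensional vector.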
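E3, E5 and E7 blocks place orthogonality constraints on the
corresponding E2, E4 and E6 blocks, respectively. The mathematical
formulation is quite straightforward: Suppose we have
already obtained $k - 1$ orthogonal projections for the embedding,
$$ W_k = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{k-1}], \qquad (12) $$
to pursue the $k$th vector, we solve the following optimization:
$$ \arg\max_{\mathbf{w}} \frac{\mathbf{w}^T A \mathbf{w}}{\mathbf{w}^T B \mathbf{w}} \qquad (13) $$
$$ \text{s.t.} \quad \mathbf{w}^T \mathbf{w}_1 = 0 \qquad (14) $$
$$ \mathbf{w}^T \mathbf{w}_2 = 0 \qquad (15) $$
$$ \ldots \qquad (16) $$
$$ \mathbf{w}^T \mathbf{w}_{k-1} = 0. \qquad (17) $$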
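By formulating the Lagrangian, it can be shown that the solution
to this problem can be found by solving the following eigenvalue
problem [27], [41]:
$$ M \mathbf{w} = \left( \left( I - B^{-1} W_k Q^{-1} W_k^T \right) B^{-1} A \right) \mathbf{w} = \lambda \mathbf{w}, \qquad (18) $$
where
$$ Q = W_k^T B^{-1} W_k. \qquad (19) $$
The optimal $\mathbf{w}_k$ is then the eigenvector associated with the largest
eigenvalue in Equation 18. We omit the details of the derivation
of the solution here but refer readers to [27], [41].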
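The orthogonality-constrained variant can be sketched as an iteration that adds one projection at a time. The sketch assumes the eigenproblem takes the form M = (I − B⁻¹ W Q⁻¹ Wᵀ) B⁻¹ A with Q = Wᵀ B⁻¹ W, which is what the Lagrangian of the constrained problem yields; the function name and the use of a general (non-symmetric) eigen-solver are assumptions of this illustration.

```python
import numpy as np

def orthogonal_projections(A, B, K):
    """Greedily compute K mutually orthogonal projection directions.

    A: symmetric numerator matrix, B: symmetric positive definite
    denominator matrix. Returns an (n, K) matrix with unit-norm columns.
    """
    n = A.shape[0]
    Binv = np.linalg.inv(B)
    W = np.zeros((n, 0))
    for _ in range(K):
        M = Binv @ A
        if W.shape[1] > 0:
            # Remove the components that would violate w^T w_j = 0.
            Q = W.T @ Binv @ W
            M = (np.eye(n) - Binv @ W @ np.linalg.inv(Q) @ W.T) @ M
        evals, evecs = np.linalg.eig(M)    # M is not symmetric in general
        w = np.real(evecs[:, np.argmax(np.real(evals))])
        w = w / np.linalg.norm(w)
        W = np.column_stack([W, w])
    return W
```

Since Wᵀ M = 0 by construction, any eigenvector of M with a non-zero eigenvalue is orthogonal to the directions already found.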
B. Power regularization
A common problem with the linear discriminative formulation
in Equation 5 is the issue of over-fitting. This occurs because
projections wwhich are essentially noise can appear discrimina-
tive in the absence of sufficient data. This issue is exacerbated
by the high dimensional input vectors used in our experiments
(typically several hundred to several thousands of dimensions).
To mitigate the problem, we adopt a power regularization cost
function to force the discriminative projections to lie in the signal
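subspace. To do this, we first perform an eigenvalue decomposition
of the matrix $B$ in Equation 5, i.e., $B = U \Lambda U^T$. Here $\Lambda$ is
a diagonal matrix with $\Lambda_{ii} = \lambda_i$ being the $i$th eigenvalue of $B$
and $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$. We then regularize $\Lambda$ by clipping its
diagonal elements against a minimal value $\lambda_r$, where
$$ \lambda'_i = \max(\lambda_i, \lambda_r). \qquad (20) $$
We choose $r$ such that $\sum_{i \geq r} \lambda_i$ accounts for a portion $\alpha$ of the
total power, i.e.,
$$ r = \min k \quad \text{s.t.} \quad \frac{\sum_{i=k}^{n} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \leq \alpha. \qquad (21) $$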
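A minimal sketch of this regularization step is given below. It assumes that r is the smallest index whose tail eigenvalue sum is at most a fraction α of the total, and that the regularized matrix is rebuilt from the clipped eigenvalues and the original eigenvectors; the function name `power_regularize` is hypothetical.

```python
import numpy as np

def power_regularize(B, alpha):
    """Clip the small eigenvalues of a symmetric matrix B from below.

    alpha: fraction of the total power assigned to the clipped tail.
    """
    lam, U = np.linalg.eigh(B)               # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]           # reorder so lam[0] >= lam[1] >= ...
    # tail[k] is the sum of lam[k:], the power held by the smallest eigenvalues.
    tail = np.cumsum(lam[::-1])[::-1]
    candidates = np.nonzero(tail / lam.sum() <= alpha)[0]
    if candidates.size == 0:
        return np.array(B, dtype=float)      # alpha too small: nothing is clipped
    r = candidates[0]                        # smallest index meeting the criterion
    lam_reg = np.maximum(lam, lam[r])        # clip against the minimal value lam_r
    return (U * lam_reg) @ U.T               # U diag(lam_reg) U^T
```

The regularized matrix can then replace B in the generalized eigenvalue problem, so that directions with very little power in the match-difference scatter no longer appear artificially discriminative.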
Figure 5 shows the top 10 projections learnt from a set of
match/non-match image patches with different power regularization
rates α. The only pre-processing applied to these patches
was bias-gain normalization. As we can clearly observe, as α
decreases from 0.2 to 0 (top to bottom), the projections become
increasingly noisy.
Fig. 5. The first 10 projections learned from normalized image patches
in a match/non-match image patch set using J2(w) with different power
regularization rates [35]. From top to bottom, α takes the values 0.2, 0.1,
0.02 and 0, respectively. Notice that the projections become progressively
noisier as the power regularization is reduced.
We performed experiments using the parametric and non-
parametric descriptor formulations described above, using our
new test dataset. The following results all apply to Difference
of Gaussian (DOG) interest points. For experiments using Harris
corners, see Section VI-E. In each case we have compared to
Lowe’s original implementation of SIFT. Since SIFT performs
descriptor sampling at a certain scale relative to the Difference
of Gaussian peak, we have optimized over this scaling parameter
to ensure that a fair comparison is made (see Figure 6).
For the results presented in this paper, we used three test
sets (Yosemite, Notre Dame, and Liberty) which were obtained
by extracting scale and orientation normalized 64 ×64 patches
around DOG interest points as described in Section II. Typically
four training and test set combinations were used: Yosemite–
Notre Dame, Yosemite–Liberty, Notre Dame–Yosemite, and Notre
Dame–Liberty, where the first of the pair is the training set. In
addition a “synthetic” training set was obtained which incorpo-
rated artificial geometric jitter as described in [31]. Training sets
typically contained from 10,000 to 500,000 patch pairs depending
on the application while test sets always contained 100,000 pairs.
The training and test sets contained 50% match pairs, and 50%
non-match pairs. During training and testing, we recomputed all
match/non-match descriptor distances as the descriptor transfor-
mation varied, sweeping a threshold on the descriptor distance to
generate an ROC curve. Note that using predefined match/non-
match pairs eliminates the need to recompute nearest neighbours
in the 100,000 element test set, which would be computationally
very demanding. In addition to presenting ROC curves, we give
many results in terms of the 95% error rate which is the percent
of incorrect matches obtained when 95% of the true matches are
found (Section IV).
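The 95% error rate can be computed directly from the labelled pair distances, as in the following sketch. It assumes that "percent of incorrect matches" means the fraction of non-match pairs whose distance falls at or below the threshold that accepts 95% of the match pairs, which corresponds to the incorrect match fraction on the ROC axes. The function name is hypothetical.

```python
import numpy as np

def error_rate_at_recall(dist, is_match, recall=0.95):
    """Fraction of non-match pairs accepted when `recall` of matches are found.

    dist:     descriptor distance for each pair.
    is_match: True for match pairs, False for non-match pairs.
    """
    dist = np.asarray(dist, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    match_dist = np.sort(dist[is_match])
    # Smallest threshold that accepts at least `recall` of the true matches.
    k = int(np.ceil(recall * match_dist.size)) - 1
    threshold = match_dist[k]
    false_pos = np.count_nonzero(dist[~is_match] <= threshold)
    return false_pos / np.count_nonzero(~is_match)
```

Sweeping `recall` from 0 to 1 with the same function traces out the full ROC curve, since the pairs are fixed and only the threshold changes.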
A. Parametric Descriptors
We obtained very good results using combinations of the para-
metric descriptor blocks of Section III, exceeding the performance
of SIFT by around 1/3 in terms of 95% error rates. We chose to
focus specifically on four combinations that were shown to have
merit in [31]. These included a combination of angle quantized
gradients (T1) or steerable filters (T3) with log-polar (S2) or
Gaussian (S4) summation regions. Other combinations with T2,
T4, S1, S3 performed less well. Example ROC curves are shown
in Figures 7 and 8, and all error rates are given in Table I (all tables
show the 95% error rate with the optimal number of dimensions
given in parentheses).
[Figure 6 plots. (a) 95% error rate against log2 scale for SIFT (128). (b) ROC curves, correct match fraction against incorrect match fraction: SIFT (128, 28.5%); SIFT + PCA (27, 28.5%); SIFT + GLDE (19, 27.5%).]
Fig. 6. Results for Lowe-SIFT descriptors: (a) shows the solution for the optimal SIFT descriptor footprint using the Liberty dataset. Note that the performance
is quite sensitive to this parameter, so it must be set carefully. (b) shows ROC curves when using this optimal patch scaling and the Yosemite dataset for
testing. We also tried using PCA and GLDE on the SIFT descriptors (shown in the other curves). GLDE gave only a small improvement in performance (1%
error at 95% true positives) over Lowe’s algorithm, but substantially reduced the number of dimensions from 128 to 19. PCA also gives a large dimensionality
reduction for only a small drop in performance.
[Figure 9 panels: S3-16 (4x4), S3-9 (3x3), S3-25 (5x5), S4-17, S4-25.]
Fig. 9. Optimal summation regions are foveated, despite
initialization with a rectangular arrangement in the case of S3.
On three of the four datasets, the best performance was
achieved by the T3h-S4-25 combination, which is a combination
of steerable filters with 25 Gaussian summation regions arranged
in concentric rings. We found that when optimized over our
training dataset, these summation regions tended to converge to a
foveated shape, with larger and more widely spaced summation
regions further from the centre (see Figure 9). This structure
is reminiscent of the geometric blur work of [22], and similar
arrangements were independently suggested and named DAISY
descriptors by [38]. Rectangular arrays of summation regions
were found to have lower performance and their results are not
included here.
Note that the performance of these parametric descriptors is
uniformly strong in comparison to SIFT, but the downside of this
method is that the number of dimensions is very large (typically
several hundred).
B. Non-Parametric Descriptors
The ROC curves for training on Yosemite and testing on Notre
Dame using Non-Parametric descriptors are shown in Figure 10.
To summarize the remaining results, we have created tables
showing the 95% error rates only.
Table II shows the best results for each T-block using the
scheme of Figure 3(2) over all subspace methods that we tried
(PCA, LDE, LPP, GLDE and orthogonal variants). Also shown
are results for applying subspace methods to raw bias-gain
normalized pixel patches and gain normalized gradients. We see
that the T3 (steerable filter) block performs the best, followed
by T1 (angle-quantized gradients) and T2 (rectified gradients). In
half of the cases the combination of T3 and E-block learning beat
SIFT. Table III shows the best results for each E-block over all T-
block filters. LPP is the clear winner when trained on Yosemite.
For Notre Dame the case is not so clear, and no one method
performs consistently well. The best results for each subspace
method are almost always using T3.
To investigate sensitivity to training data, we tested on the
Liberty set using training on both Notre Dame and Yosemite.
For the non-parametric descriptor learning it seems that the
Yosemite dataset was best for training, whereas for the parametric
descriptors the performance was comparable (within 1-2%) for
both datasets. In general the results from the E-block learning
are less strong and more variable than the parametric S-block
techniques. Certain combinations, such as T3/LPP were able to
generate SIFT beating performance (e.g. 19.29% vs 26.10% on
the Yosemite/Notre Dame test case), but many other combinations
did not. The principal advantage of these techniques is that di-
mensionality reduction is simultaneously achieved, so the number
of dimensions is typically low (e.g. 32 dimensions in the case of
C. Dimension reduced parametric descriptors
Parametric descriptor learning yielded excellent performance
with high dimensionality, whereas the non-parametric learning
gave us a very small number of dimensions but with a slightly
inferior performance. Thus it seems natural to combine these
approaches. We did this by running a stage of non-parametric
dimensionality reduction after a stage of parametric learning. This
corresponds to Pipeline 3 in Figure 3. Note that we did not attempt
to jointly optimize for the embedding and parametric descriptors,
although this could be a good direction for future work. The
results are shown in Figure 11 and Table IV. This approach gave
us the overall best results, with typically 1-2% less error than
parametric S-blocks alone, and far fewer dimensions (30-40).
Although LDA gave much better results than PCA when applied
to raw pixel data [35], running PCA on the outputs of S-block
learning gave results equal to or better than those of LDA. It may be
that LDA is slightly overfitting in cases where a discriminative
representation has already been found. For half the datasets, the best results were
[Figure 7 plot: correct match fraction against incorrect match fraction. Legend (dimensions, 95% error rate): SIFT (128, 28.49%); T1c-S2-17 (272, 18.30%); T3h-S2-17 (272, 16.56%); T3h-S4-25 (400, 16.36%); T3j-S2-17 (544, 15.91%).]
Fig. 7. ROC curves for parametrized descriptors. Training on Notre Dame
and testing on Yosemite.
[Figure 8 plot: correct match fraction against incorrect match fraction. Legend (dimensions, 95% error rate): SIFT (128, 35.09%); T1c-S2-17 (272, 22.76%); T3h-S4-25 (400, 22.05%); T3h-S2-17 (272, 21.85%); T3j-S2-17 (544, 21.98%).]
Fig. 8. ROC curves for parametrized descriptors. Training on Notre Dame
and testing on Liberty.
Table I. Parametric descriptors: 95% error rate, with the optimal number of dimensions in parentheses.

Train       Test        T1c-S2-17   T3h-S4-25   T3h-S2-17   T3j-S2-17   SIFT
Yosemite    Notre Dame  17.90(272)  14.43(400)  15.44(272)  15.87(544)  26.10(128)
Yosemite    Liberty     23.00(272)  20.48(400)  22.00(272)  22.28(544)  35.09(128)
Notre Dame  Yosemite    18.30(272)  16.35(400)  16.56(272)  15.91(544)  28.50(128)
Notre Dame  Liberty     22.76(272)  21.85(400)  22.05(272)  21.98(544)  35.09(128)
Synthetic   Liberty     29.50(272)  24.25(400)  25.74(272)  32.36(544)  35.09(128)
Table II. Best results for each T-block over all subspace methods: 95% error rate, with the optimal number of dimensions in parentheses.

Training Set  Test Set    Normalized  Normalized  T1         T2         T3         T4         SIFT
                          pixels      gradients
Yosemite      Notre Dame  37.17(14)   32.09(15)   25.68(24)  27.78(33)  19.29(32)  35.37(28)  26.10(128)
Yosemite      Liberty     56.33(14)   51.63(15)   38.55(24)  41.10(20)  31.10(32)  47.74(28)  35.09(128)
Notre Dame    Yosemite    43.37(27)   38.36(19)   33.59(21)  33.99(40)  31.27(19)  42.39(27)  28.50(128)
Notre Dame    Liberty     55.70(27)   52.62(17)   41.37(24)  43.80(15)  36.54(19)  50.63(27)  35.09(128)
Synthetic     Notre Dame  37.85(15)   39.15(24)   24.47(32)  24.47(32)  22.94(30)  34.41(28)  26.10(128)
Table III. Best results for each E-block over all T-block filters: 95% error rate, with the optimal number of dimensions in parentheses.

Train      Test       PCA        GLDE       GOLDE      LDE        OLDE       LPP        OLPP       SIFT
Yosemite   Notre D.   40.36(29)  24.20(28)  26.24(31)  24.65(31)  25.01(27)  19.29(32)  23.71(31)  26.10(128)
Yosemite   Liberty    53.20(29)  35.76(28)  43.35(31)  34.97(31)  40.15(27)  31.10(32)  39.46(31)  35.09(128)
Notre D.   Yosemite   45.43(61)  32.53(45)  34.61(25)  31.27(19)  33.38(20)  33.19(46)  35.04(17)  28.50(128)
Notre D.   Liberty    51.63(97)  41.66(45)  40.75(18)  36.54(19)  39.95(20)  42.68(46)  41.46(17)  35.09(128)
Synthetic  Notre D.   43.78(66)  24.04(29)  26.25(29)  24.86(26)  26.10(33)  22.94(30)  26.05(34)  26.10(128)
obtained using PCA on T3h-S4-25 (rectified steerable filters with
DAISY-like Gaussian summation regions) and for the other half,
the best results were from T3j-S2-17 plus PCA (rectified steerable
filters and log-polar GLOH-like summation regions). The best
results here gave less than half the error rate of SIFT, using about
1/4 of the number of dimensions. See the “best of the best” results in Table V.
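The dimension-reduction stage of such a composite descriptor can be sketched with plain PCA. This is an illustrative sketch only. It assumes that the parametric descriptors of the training patches are stacked as rows of a matrix, and the function names are hypothetical. Whether any normalization is reapplied after projection is not specified here and is left out.

```python
import numpy as np

def fit_pca(train_desc, n_dims):
    """Learn a PCA basis from training descriptors (one descriptor per row)."""
    mean = train_desc.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(train_desc - mean, full_matrices=False)
    return mean, vt[:n_dims]

def reduce_descriptors(desc, mean, basis):
    """Project descriptors onto the leading principal directions."""
    return (desc - mean) @ basis.T
```

In use, the basis would be fitted once on the training set (for example Yosemite) and then applied unchanged to descriptors from the test set, with `n_dims` chosen by sweeping the number of retained dimensions and keeping the value with the lowest 95% error rate.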
To aid in the dissemination of these results, we have cre-
ated a document detailing parameter settings for the most
successful DAISY configurations, as well as details of the
recognition performance/computation time tradeoffs. This can
be found on the same website as our patch datasets:
We also used this approach to perform dimensionality reduction
on SIFT itself; the results are shown in Figure 6(b). We were able
to reduce the number of dimensions significantly (to around 20),
but the matching performance of the LDA-reduced SIFT descriptors
was only slightly better than that of the original SIFT descriptors
(1% error).
D. Comparisons with Synthetic Interest Point Noise
Previous work [31], [12] used synthetic jitter applied to image
patches in lieu of the position errors introduced in interest point
[Figure 10 plots: ROC curves (correct match fraction against incorrect match fraction), one panel per input. Legend entries give (dimensions, 95% error rate).
(a) T1, 4 orientations: SIFT (128, 26.10%); NSSD (1024, 43.936%); GLDE (28, 27.86%); GOLDE (36, 27.77%); LDE (25, 28.49%); OLDE (25, 25.99%); LPP (24, 25.68%); OLPP (36, 28.05%); PCA (29, 40.36%).
(b) T2, 4 orientations: SIFT (128, 26.10%); NSSD (1024, 44.80%); GLDE (23, 29.89%); GOLDE (35, 28.40%); LDE (27, 31.17%); OLDE (33, 27.78%); LPP (20, 28.56%); OLPP (37, 28.33%); PCA (29, 41.35%).
(c) T3, 2nd order, 4 orientations: SIFT (128, 26.10%); NSSD (4096, 48.63%); GLDE (28, 24.20%); GOLDE (31, 26.24%); LDE (31, 24.65%); OLDE (27, 25.01%); LPP (32, 19.29%); OLPP (31, 23.71%); PCA (25, 53.48%).
(d) T4: SIFT (128, 26.10%); NSSD (1296, 63.72%); GLDE (12, 46.80%); GOLDE (27, 37.59%); LDE (21, 40.50%); OLDE (28, 35.37%); LPP (12, 47.10%); OLPP (29, 38.54%); PCA (97, 61.07%).
(e) Normalized gradient: SIFT (128, 26.10%); NSSD (1800, 61.42%); GLDE (19, 33.54%); GOLDE (15, 32.09%); LDE (15, 34.20%); OLDE (9, 36.68%); LPP (14, 33.05%); OLPP (12, 33.73%); PCA (14, 48.54%).
(f) Bias-gain normalized image: SIFT (128, 26.10%); NSSD (1024, 51.05%); GLDE (14, 37.17%); GOLDE (14, 46.06%); LDE (13, 40.27%); OLDE (41, 41.45%); LPP (21, 39.11%); OLPP (19, 41.24%); PCA (26, 46.00%).]
Fig. 10. Testing of linear discriminant descriptors trained on Yosemite and tested on Notre Dame. The optimal number of dimensions and the associated
95% error rate is given in parentheses. NSSD: Normalized sum squared difference computed on the output of the T-block directly without embedding.
Table IV. Composite descriptors, best result for each subspace method: 95% error rate, with the optimal number of dimensions in parentheses.

Train     Test      PCA        GLDE       GOLDE      LDE        OLDE       LPP        OLPP       SIFT
Yosemite  Notre D.  11.98(29)  19.12(39)  13.64(49)  18.03(60)  12.48(71)  16.77(52)  14.07(36)  26.10(128)
Yosemite  Liberty   18.27(29)  26.92(32)  19.88(49)  25.20(60)  18.70(71)  25.39(32)  20.33(36)  35.09(128)
Notre D.  Yosemite  13.55(36)  25.25(87)  15.67(67)  21.78(35)  15.04(99)  22.30(48)  15.56(86)  28.50(128)
Notre D.  Liberty   16.85(36)  30.38(28)  20.01(53)  26.48(45)  19.80(49)  26.78(48)  19.47(48)  35.09(128)
Table V. “Best of the best” results: 95% error rate, with the number of dimensions in parentheses.

Train       Test        Parametric  Non-parametric  Composite  SIFT
Yosemite    Notre Dame  14.43(400)  19.29(32)       11.98(29)  26.10(128)
Yosemite    Liberty     20.48(400)  31.10(32)       18.27(29)  35.09(128)
Notre Dame  Yosemite    15.91(544)  31.27(19)       13.55(36)  28.50(128)
Notre Dame  Liberty     21.85(400)  36.54(19)       16.85(36)  35.09(128)
detection. In order to evaluate the effectiveness of this strategy,
we tested a number of descriptors that were trained on a dataset
with synthetic noise applied [31].
For results, see the last rows of Tables I, II and III. Here,
“synthetic” means that synthetic scale, rotation and position jitter
noise was applied to the patches, although the actual patch data
was sampled from real images as in [31]. For the parametric
descriptors, there is a clear gain of 5-10% from training using the
new non-synthetic dataset. For the LDA based methods smaller
gains are noticeable.
E. Learning Descriptors for Harris Corners
Using our multi-view stereo ground truth data we can easily
create optimal descriptors for any choice of interest point. To
demonstrate this, we also created a dataset of patches centred
on multi-scale Harris corner points (see Figure 12). The left
column shows the projections learnt from Harris corners and the
right column from DOG interest points, for normalized image
patches. The projections learnt from the two different types of
interest points share several similarities in appearance. They
are all centre focused, and look like Gaussian derivatives [16]
combined with geometric blur [22]. We also found that the performance
ranking of the descriptors learnt from the different embedding methods
is similar across the two datasets.
F. Effects of Normalization
As demonstrated in [35], the post-normalization step is very
important for the performance of the non-parametric descriptors
learnt from a synthetically jittered dataset. We observe a similar
phenomenon in our new experiments with the new data.
The higher performance of the parametric descriptors when
compared to the non-parametric descriptors is in some part
attributable to the use of SIFT-style clipping normalization ver-
sus simple unit-length normalization for these. Since parametric
descriptors maintain a direct relation between image-space and
descriptor coefficients compared with coefficients after PCA re-
duction, SIFT-style clipping, by introducing a robustness function,
can mitigate differences due to spatial occlusions and shadowing
which affect one part of the descriptor and not another. For
this reason applying SIFT-style normalization prior to dimension
reduction seems appropriate.
Figure 13 shows the effect of changing the clipping threshold
for SIFT normalization. Error rates are significantly improved
when the clipping threshold is around $1.6/\sqrt{D}$, when
tested on a wide range of parametric descriptors with different
dimensionality. The graph shows a drastic reduction in error
rate compared with simple unit normalization.
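A minimal sketch of clipping normalization is given below. It assumes that "SIFT-style" means the usual three steps of Lowe's SIFT (normalize to unit length, clip each element, renormalize) and that the threshold is the ratio divided by the square root of the descriptor dimensionality D. The function name and the symmetric clipping of negative values are assumptions made for illustration.

```python
import numpy as np

def clip_normalize(desc, ratio=1.6):
    """Unit-normalize, clip at ratio / sqrt(D), then renormalize.

    Assumes desc is a non-zero 1-D descriptor of dimensionality D.
    """
    v = np.asarray(desc, dtype=float)
    v = v / np.linalg.norm(v)
    threshold = ratio / np.sqrt(v.size)
    # Limit the influence of any single large coefficient.
    v = np.clip(v, -threshold, threshold)
    return v / np.linalg.norm(v)
```

For a 128-dimensional descriptor, a ratio of about 2.26 gives a threshold of 0.2, and a very large ratio reduces the function to plain unit-length normalization.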
[Figure 13 plot: error rate (%) against normalization threshold ratio, from 1 to 2.8, with unit normalization (Unit) shown for comparison.]
Fig. 13. Change in error rates as the normalization clipping threshold is varied
for parametric descriptors. The threshold was set to $r/\sqrt{D}$, where r is the
ratio and D is the descriptor dimensionality. Unit: unit length normalization
without clipping.
G. Minimizing Bits
For certain applications, such as scalable recognition, it is im-
portant that descriptors are represented as efficiently as possible.
A natural question is: “what is the minimum number of bits
required for accurate feature descriptors?”. To address this ques-
tion we tested the recognition performance of our parametrized
descriptors as the number of bits per dimension was reduced from
8 to 1. The results are shown in Figure 14 for the parametric
descriptors. Surprisingly, there seems to be very little benefit to
using any more than 2 or 3 bits of dynamic range per dimension,
which suggests that it should be possible to create local image
descriptors with a very small memory footprint indeed. In one
case (T1c-S2-17), the performance actually degraded slightly as
more bits were added. It could be that in this case quantization
caused a small noise reduction effect. Note that this effect was
small ( 1% in error rate), and not shown for the other descriptors,
where the major change in performance came from 1 to 2 bits per
dimension, which gave around 16% change in error rate. Whilst
it would also be possible to quantize bits for dimension reduced
(embedded) descriptors, a variable number of bits per dimension
would be required as the variance on each dimension can differ
substantially across the descriptor.
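Reducing the number of bits per dimension can be illustrated with a uniform quantizer. This is only a sketch under stated assumptions: descriptor values are taken to be non-negative and bounded by a known maximum `vmax`, and the same number of bits is used for every dimension. The quantizer actually used in the experiments is not specified here, and the function names are hypothetical.

```python
import numpy as np

def quantize(desc, bits, vmax):
    """Map values in [0, vmax] to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits
    scaled = np.clip(np.asarray(desc, dtype=float) / vmax, 0.0, 1.0)
    # Truncation picks the bin; the top edge is folded into the last bin.
    return np.minimum((scaled * levels).astype(np.int64), levels - 1)

def dequantize(codes, bits, vmax):
    """Map integer codes back to the centres of their bins."""
    return (np.asarray(codes) + 0.5) / (2 ** bits) * vmax
```

At 2 bits per dimension, a 272-dimensional descriptor would occupy 68 bytes, compared with 272 bytes at 8 bits per dimension.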
Here we address some limitations of the current method and
suggest ideas for future work.
A. Repetitive image structure
One caveat with our learning scheme is that distinct
3D locations are defined to be different classes, when in the
[Figure 14 plot: error rate (%) against number of bits per dimension (1 to 8) for T1c-S2-17 (272), T3h-S4-25 (400), T3h-S2-17 (272) and T3j-S2-17 (544).]
Fig. 14. Results of limiting the number of bits in each descriptor dimension.
Not many more than 2 bits are required per dimension to retain a good error
real world, they can often have the same visual appearance.
One common example would be repeated architectural structures,
such as windows or doors. Such repetitions typically cause false
positives in our matching schemes (see Figure 15). For the Notre
Dame dataset, false positives occur due to translational repetition
(e.g. the stone figures) as well as rotational repetitions (e.g. the
rose window).
B. Multi-view Stereo Data
Although there have been great improvements in stereo in
recent years [30], using multi-view stereo to train local image de-
scriptors has its limitations. Noise in the stereo reconstruction will
inevitably propagate through to the set of image correspondences,
but probably a bigger issue is that certain image correspondences,
i.e., in regions where stereo fails, will not be present at all. One
way around this problem would be to use imagery registered to
LIDAR scans as in [42].
We have described a scheme for learning discriminative, low-
dimensional image descriptors from realistic training data. These
techniques have state-of-the-art performance in all our test sce-
narios. The techniques described in this paper have been used to
design local feature descriptors for a robust structure from motion
application called Photosynth and an automatic panoramic
stitcher named ICE (Image Compositing Editor).
To summarize our work, we suggest a few recommendations
for practitioners in this area:
• Learn parameters from training data: Successful descriptor
  designs typically have many parameter choices that are
  difficult to optimize by hand. We recommend using realistic
  training datasets to optimize these parameters.
• Use foveated summation regions: Pooling regions that
  become larger away from the interest point are generally
  found to have good performance. See [38] for an efficient
  implementation approach.
• Use non-linear filter responses: Some form of non-linear
  filtering before spatial pooling is essential for the best
  performance. Steerable filters work well if the phase is kept.
  Rectified or angle-quantized gradients are also a good and
  simple choice.
• Use LDA for discriminative dimension reduction: LDA
  can be used to find discriminative, low dimensional descriptors
  without imposing a choice of parameters. However, if a
  discriminative representation has already been found, PCA
  can work well for reducing the number of dimensions.
• Normalization: Thresholding normalization often provides a
  large boost in performance. If dimension reduction is used,
  normalization should come before the dimension reduction.
The authors would like to thank Michael Goesele and Noah
Snavely for sharing their 3D reconstruction data with us. We would
also like to thank David Lowe, Rick Szeliski and Sumit Basu for
helpful discussions.
[1] R. Szeliski, “Image alignment and stitching: A tutorial,” Microsoft
Research, Tech. Rep. MSR-TR-2004-92, December 2004.
[2] M. Brown and D. Lowe, “Automatic panoramic image stitching using
invariant features,” International Journal of Computer Vision, vol. 74,
no. 1, pp. 59–73, 2007.
[3] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis,
J. Tops, and R. Koch, “Visual modeling with a hand-held camera,”
International Journal of Computer Vision, vol. 59, no. 3, pp. 207–232,
[4] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: Exploring photo
collections in 3D,” in SIGGRAPH Conference Proceedings. New York,
NY, USA: ACM Press, 2006, pp. 835–846.
[5] D. Nistér and H. Stewénius, “Scalable recognition with a vocabulary
tree,” in IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), vol. 2, June 2006, pp. 2161–2168.
[6] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object
retrieval with large vocabularies and fast spatial matching,” in Proceed-
ings of the International Conference on Computer Vision and Pattern
Recognition (CVPR07), 2007.
[7] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local features
and kernels for classification of texture and object categories: A com-
prehensive study,” International Journal of Computer Vision, vol. 73,
no. 2, pp. 213–238, June 2007.
[8] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative
classification with sets of image features,” in Proceedings of the IEEE
International Conference on Computer Vision, Beijing, October 2005.
[9] R. Fergus, P. Perona, and A. Zisserman, “Object class recognition by
unsupervised scale-invariant learning,” in Proceedings of the International
Conference on Computer Vision and Pattern Recognition, 2003.
[10] K. Mikolajczyk and C. Schmid, “Scale and affine invariant interest point
detectors,” International Journal of Computer Vision, vol. 60, no. 1, pp.
63–86, 2004.
[11] K. Mikolajczyk and C. Schmid, “A performance evaluation of local
descriptors,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[12] V. Lepetit and P. Fua, “Keypoint recognition using randomized trees,”
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 28, no. 9, pp. 1465–1479, 2006.
[13] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for
image categorization and segmentation,” in International Conference on
Computer Vision and Pattern Recognition, June 2008.
[14] B. Babenko, P. Dollar, and S. Belongie, “Task specific local region
matching,” in International Conference on Computer Vision (ICCV07),
Rio de Janeiro, 2007.
[15] D. Martin, C. Fowlkes, and J. Malik, “Learning to detect natural image
boundaries using local brightness, color and texture cues,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 26, no. 5, May 2004.
[Figure 11 plots: ROC curves (correct match fraction against incorrect match fraction), one panel per parametric descriptor. Legend entries give (dimensions, 95% error rate).
(a) T1c-S2-17: SIFT (128, 26.10%); T1c-S2-17 (272, 17.93%); GLDE (39, 19.12%); GOLDE (65, 14.67%); LDE (59, 18.84%); OLDE (92, 15.95%); LPP (52, 16.77%); OLPP (74, 14.94%); PCA (65, 13.16%).
(b) T4h-S4-25: SIFT (128, 26.10%); T4h-S4-25 (400, 15.45%); GLDE (32, 19.87%); GOLDE (39, 15.13%); LDE (47, 18.85%); OLDE (57, 13.99%); LPP (25, 19.44%); OLPP (36, 14.07%); PCA (29, 11.98%).
(c) T3h-S2-17: SIFT (128, 26.10%); T3h-S2-17 (272, 14.41%); GLDE (32, 19.27%); GOLDE (49, 13.64%); LDE (60, 18.03%); OLDE (71, 12.48%); LPP (32, 17.48%); OLPP (23, 18.45%); PCA (22, 13.16%).
(d) T3j-S2-17: SIFT (128, 26.10%); T3j-S2-17 (544, 15.84%); GLDE (31, 19.85%); GOLDE (67, 14.20%); LDE (44, 18.67%); OLDE (68, 14.27%); LPP (32, 17.09%); OLPP (45, 14.35%); PCA (49, 12.66%).]
Fig. 11. ROC curves for composite descriptors trained on Yosemite and testing on Notre Dame.
Fig. 12. Comparison of projections on patches centred on Harris corner points (left column), and DOG points (right column), respectively. From top to the
bottom, we present projections learnt using the embedding blocks of E2, E3, E4, E5, E6, E7 and E1, respectively.
Fig. 15. Some of the false positive, false negative, true positive and true negative image patch pairs when testing on the new Notre Dame dataset using
E-blocks learnt from the new Yosemite dataset. We used a combination of T3 (steerable filters) and E2 (LPP) in this experiment. Each row shows 6 pairs of
image patches and the two image patches in each pair are shown in the same column. Note that the two images in the false positive pairs are indeed obtained
from different 3D points but their appearances look surprisingly similar.
[16] C. Schmid and R. Mohr, “Local grayvalue invariants for image retrieval,”
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 19, no. 5, pp. 530–535, May 1997.
[17] C. Rothwell, A. Zisserman, D. Forsyth, and J. Mundy, “Canonical frames
for planar object recognition,” in European Conference on Computer
Vision, 1992, pp. 757–772.
[18] D. Lowe, “Object recognition from local scale-invariant features,” in
International Conference on Computer Vision, Corfu, Greece, September
1999, pp. 1150–1157.
[19] D. Lowe, “Distinctive image features from scale-invariant keypoints,”
International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110,
[20] D. Hubel and T. Wiesel, “Brain mechanisms of vision,” Scientific
American, pp. 150–162, September 1979.
[21] S. Belongie, J. Malik, and J. Puzicha, “Shape context: A new descriptor
for shape matching and object recognition,” in Advances in Neural
Information Processing Systems. MIT Press, Cambridge, MA, 2000.
[22] A. Berg and J. Malik, “Geometric blur and template matching,” in
International Conference on Computer Vision and Pattern Recognition,
2001, pp. I:607–614.
[23] Y. Ke and R. Sukthankar, “PCA-SIFT: a more distinctive representation
for local image descriptors,” in Proceedings of the International Con-
ference on Computer Vision and Pattern Recognition, vol. 2, July 2004,
pp. 506–513.
[24] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. Fisherfaces:
Recognition using class specific linear projection,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 19, no. 7,
pp. 711–720, 1997.
[25] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, “Face recognition using
Laplacianfaces,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 27, no. 3, pp. 328–340, March 2005.
[26] H. Chen, H. Chang, and T. Liu, “Local discriminant embedding and its
variants,” in Proceedings of the International Conference on Computer
Vision and Pattern Recognition, vol. 2, San Diego, CA, June 2005, pp.
[27] J. Duchene and S. Leclercq, “An optimal transformation for discriminant
and principal component analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 10, no. 6, pp. 978–983, 1988.
[28] P. Moreels and P. Perona, “Evaluation of feature detectors and de-
scriptors based on 3D objects,” in Proceedings of the International
Conference on Computer Vision, vol. 1, 2005, pp. 800–807.
[29] M. Goesele, S. Seitz, and B. Curless, “Multi-view stereo revisited,” in
International Conference on Computer Vision and Pattern Recognition,
New York, June 2006.
[30] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. Seitz, “Multi-view
stereo for community photo collections,” in International Conference on
Computer Vision, Rio de Janeiro, October 2007.
[31] S. Winder and M. Brown, “Learning local image descriptors,” in
Proceedings of the International Conference on Computer Vision and
Pattern Recognition (CVPR07), Minneapolis, June 2007.
[32] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling, Numerical
Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge
University Press, 1992.
[33] M. Brown and D. Lowe, “Unsupervised 3D object recognition and
reconstruction in unordered datasets,” in 5th International Conference
on 3D Imaging and Modelling (3DIM05), 2005.
[34] N. Snavely, S. Seitz, and R. Szeliski, “Modeling the world from internet
photo collections,” International Journal of Computer Vision, vol. 80,
no. 2, pp. 189–210, 2008.
[35] G. Hua, M. Brown, and S. Winder, “Discriminant embedding for local
image descriptors,” in Proceedings of the 11th International Conference
on Computer Vision (ICCV07), Rio de Janeiro, October 2007.
[36] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Object
recognition with cortex-like mechanisms,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411–426, 2007.
[37] W. T. Freeman and E. H. Adelson, “The design and use of steerable fil-
ters,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 13, pp. 891–906, 1991.
[38] E. Tola, V. Lepetit, and P. Fua, “A fast local descriptor for dense
matching,” in Proceedings of the International Conference on Computer
Vision and Pattern Recognition, Anchorage, June 2008.
[39] K. Mikolajczyk and J. Matas, “Improving descriptors for fast tree match-
ing by optimal linear projection,” in Proceedings of the International
Conference on Computer Vision, Rio de Janeiro, 2007.
[40] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal Laplacianfaces
for face recognition,” IEEE Transactions on Image Processing, vol. 15,
no. 11, pp. 3608–3614, November 2006.
[41] G. Hua, P. Viola, and S. Drucker, “Face recognition using discriminatively
trained orthogonal rank one tensor projections,” in Proc. of IEEE Conf.
on Computer Vision and Pattern Recognition, Minneapolis, MN, June
[42] C. Strecha, W. von Hansen, L. V. Gool, P. Fua, and U. Thoennessen,
“On benchmarking camera calibration and multi-view stereo for high
resolution imagery,” in Proceedings of the International Conference on
Computer Vision and Pattern Recognition, Anchorage, June 2008.
[43] S. Winder, G. Hua, and M. Brown, “Picking the best daisy,” in Proceed-
ings of the International Conference on Computer Vision and Pattern
Recognition (CVPR09), Miami, June 2009.
Matthew Brown is a Postdoctoral Fellow at the
École Polytechnique Fédérale de Lausanne. He obtained
the M.Eng. degree in Electrical and Information
Sciences from Cambridge University in 2000,
and the Ph.D. degree in Computer Science from
the University of British Columbia in 2005. His
research interests include Computer Vision, Machine
Learning, Medical Imaging and Environmental In-
formatics. He worked with Microsoft Research as
an intern in Cambridge in 2002, and in Redmond
in 2003-2004. He returned to Microsoft Research
Redmond as a Postdoctoral Researcher during 2006-2007. His work there
focused on image segmentation, panoramic stitching and local feature design.
His work on panoramic stitching has been widely adopted, and appears on
the curriculum of many University courses as well as in several commercial
products. He is a director and CTO of Vancouver based Cloudburst Research
Gang Hua is a Senior Researcher at Nokia Research
Center Hollywood. Before that, he was a Scientist at
Microsoft Live Labs Research from 2006 to 2009.
He received his Ph.D. degree in Electrical and Com-
puter Engineering from Northwestern University in
2006, and the M.S. and B.S. degrees in Electrical Engineering
from Xi’an Jiaotong University in 2002 and
1999, respectively. He was enrolled in the Special
Class for the Gifted Young of XJTU in 1994. During
the summers of 2005 and 2004, he was a
research intern with the Speech Technology Group,
Microsoft Research, Redmond, WA, and a research intern with the Honda
Research Institute, Mountain View, CA, respectively.
He received the Richter Fellowship and the Walter P. Murphy Fellowship
at Northwestern University in 2005 and 2002, respectively. When he was in
XJTU, he was awarded the Guanghua Fellowship, the Eastcom Fellowship,
the Most Outstanding Student Exemplar Fellowship, the Sea-star Fellowship
and the Jiangyue Fellowship in 2001, 2000, 1997, 1997 and 1995 respectively.
He was also a recipient of the University Fellowship from 1994 to 2001 at
XJTU. He is a member of both IEEE and ACM. As of January 2009, he holds 1
US patent and has 18 more patents pending.
Simon Winder is a Senior Developer in the Interac-
tive Visual Media group at Microsoft Research. He
obtained his Ph.D. in 1995 from the School of Math-
ematical Sciences, University of Bath, UK, studying
computational neuroscience of primate vision. Prior
to joining Microsoft, Simon obtained a B.Sc. in
1990 and an M.Eng. in 1991 from the University of
Bath, studying Electrical and Electronic Engineer-
ing. Prior employment includes work on thermal
imaging hardware at GEC Sensors, Basildon, UK,
and later work on MPEG-4 video standardization
for the Partnership in Advanced Computing Technologies, Bristol, UK. His
current research includes feature detection and descriptors for matching
and recognition with application to 3D reconstruction and real-time scene
recognition, localization, and mapping. To date he has filed 20 US patents.