Fast Keypoint Recognition in Ten Lines of Code
ABSTRACT While feature point recognition is a key component of modern approaches to object detection, existing approaches require computationally expensive patch preprocessing to handle perspective distortion. In this paper, we show that formulating the problem in a Naive Bayesian classification framework makes such preprocessing unnecessary and produces an algorithm that is simple, efficient, and robust. Furthermore, it scales well to handle a large number of classes. To recognize the patches surrounding keypoints, our classifier uses hundreds of simple binary features and models class posterior probabilities. We make the problem computationally tractable by assuming independence between arbitrary sets of features. Even though this is not strictly true, we demonstrate that our classifier nevertheless performs remarkably well on image datasets containing very significant perspective changes.
Mustafa Özuysal, Pascal Fua, Vincent Lepetit
Computer Vision Laboratory
École Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne, Switzerland
Email:{Mustafa.Oezuysal, Pascal.Fua, Vincent.Lepetit}@epfl.ch
1. Introduction
The ability to recognize interest points across images that may have been taken from very different viewpoints is required to address many important Computer Vision problems. They range from image registration to object detection [7, 6] and often require real-time performance. The standard approach to addressing this problem is to build affine-invariant descriptors of local image patches and to compare them across images. This usually involves fine scale selection, rotation correction, and intensity normalization [11, 10]. It results in a high computational overhead and often requires handcrafting the descriptors to achieve insensitivity to specific kinds of distortion.
It has recently been shown that casting this wide-baseline matching problem as a more generic classification problem leads to solutions that are much less computationally demanding [9]. This approach relies on an offline training phase during which multiple views of the keypoints to be matched are used to train randomized trees [2] to recognize them based on a few pairwise intensity comparisons. This yields both fast run-time performance and robustness to viewpoint and lighting changes, which has proved very effective for real-time object detection.
∗This work has been supported in part by the Swiss National Science Foundation.
In this paper, we show that using a classic Naive Bayesian framework yields an approach that is simpler, faster, and as robust as the state-of-the-art methods discussed above. We use non-hierarchical structures that we refer to as ferns to classify the patches. Each one consists of a small set of binary tests and returns the probability that a patch belongs to any one of the classes that have been learned during training. These responses are then combined in a Naive Bayesian way. As in [9], we train our classifier by synthesizing many views of the keypoints extracted from a training image as they would appear under different perspectives or scales.
The binary tests we use as classifier features are picked completely at random, which puts our approach firmly in the camp of techniques that rely on randomization to achieve good performance [1]. We will show that this is particularly effective for the specific classification task we are addressing, which requires scale and perspective invariance, involves a very large number of classes, but can tolerate significant error rates since we use robust statistical methods to exploit the information provided by the keypoints. Furthermore, our approach is particularly easy to implement, does not overfit, does not require ad hoc patch normalization, and allows fast and potentially incremental training.
2. Image Patch Classification
The importance of image patch recognition and matching across images is widely accepted for applications ranging from object recognition and image retrieval to pose estimation. Given feature points extracted from the images, two main classes of approaches have been proposed to achieve results such as those of Figure 6.
The first family involves computing local descriptors invariant to changes such as perspective and lighting [13, 10]. In particular the SIFT vector [10], computed from local histograms of gradients, works remarkably well, at least on textured images, and we will use it as a benchmark for our own approach.
A second class relies on statistical learning techniques to model the set of possible appearances of a patch. The one-shot approach of [6] uses PCA and Gaussian Mixture Models but does not account for perspective distortion. This is addressed in [9] using Randomized Trees (RTs). Since the set of possible patches around an image feature under changing perspective and lighting conditions can be seen as a class, it is possible to train a set of RTs to recognize feature points by feeding it samples of their possible appearances, synthesized by warping the patches found in a training image using randomly chosen homographies.
This approach is fast and effective at achieving the kind of object detection depicted by Figure 6. Note that, unlike in traditional classification problems, a close-to-perfect method is not required. Here it is enough to recognize some features successfully and to use a robust estimator such as RANSAC to detect the object. However, a scalable approach is still highly desirable for practical applications since the number of keypoints might become very large (typically > 400). We will show that when this happens the performance of the RTs tends to drop whereas that of the ferns does not.
Recently [12] used keypoints as visual words [14] for image retrieval in very large image databases. Keypoints are labeled by hierarchical k-means [5] clustering based on their SIFT descriptors. This allows a very large number of visual words, but the performance measure is the number of correctly retrieved documents rather than the number of correctly classified keypoints. In this work, we concentrate on localizing individual keypoints to obtain pose information, which is required in tracking and augmented reality applications.
3. Naive Bayesian Classification
It has been shown that image patches can be recognized on the basis of very simple and randomly chosen binary tests that are grouped into decision trees and recursively partition the space of all possible patches [9]. In practice, no single tree is discriminative enough when there are many classes. However, using a number of trees and averaging their votes yields good results.
In this section, we will argue that, when the tests are chosen randomly, the power of the approach derives not from the tree structure itself but from the fact that combining groups of binary tests allows improved classification rates. Therefore, replacing the trees by our non-hierarchical ferns and pooling their answers in a Naive Bayesian manner yields better results and better scalability in terms of number of classes. The naive combination strategy lets us combine many more features, which is key to an improved recognition rate.
3.1. Formulation
As discussed in Section 2, we treat the set of all possible appearances of the image patch surrounding a keypoint as a class. Therefore, given the patch surrounding a keypoint detected in an image, our task is to assign it to the most likely class. Let $c_i$, $i = 1, \ldots, H$ be the set of classes and let $f_j$, $j = 1, \ldots, N$ be the set of binary features that will be calculated over the patch we are trying to classify. Formally, we are looking for
\[ \hat{c}_i = \operatorname*{argmax}_{c_i} P(C = c_i \mid f_1, f_2, \ldots, f_N), \]
where $C$ is a random variable that represents the class. Bayes' Formula yields
\[ P(C = c_i \mid f_1, f_2, \ldots, f_N) = \frac{P(f_1, f_2, \ldots, f_N \mid C = c_i)\, P(C = c_i)}{P(f_1, f_2, \ldots, f_N)}. \]
Assuming a uniform prior $P(C)$, since the denominator is simply a scaling factor that is independent of the class, our problem reduces to finding
\[ \hat{c}_i = \operatorname*{argmax}_{c_i} P(f_1, f_2, \ldots, f_N \mid C = c_i). \tag{1} \]
In our implementation, the value of each binary feature $f_j$ only depends on the intensities of two pixel locations $d_{j,1}$ and $d_{j,2}$ of the image patch. We write
\[ f_j = \begin{cases} 1 & \text{if } I(d_{j,1}) < I(d_{j,2}) \\ 0 & \text{otherwise} \end{cases} \]
where $I$ represents the image patch. Since these features are very simple, we require many ($N \approx 300$) for accurate classification. Therefore a complete representation of the joint probability in Eq. (1) is not feasible since it would require estimating and storing $2^N$ entries for each class. One way to compress the representation is to assume independence between features. An extreme version is to assume complete independence, that is,
\[ P(f_1, f_2, \ldots, f_N \mid C = c_i) = \prod_{j=1}^{N} P(f_j \mid C = c_i). \]
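As a minimal sketch of the two ideas above, the following C++ fragment evaluates one binary feature from a pair of pixel offsets and scores a patch for a single class under the complete-independence assumption, working in the log domain to avoid numerical underflow. The function names `binaryFeature` and `naiveLogLikelihood`, the byte-buffer patch layout, and the table `pOne` are illustrative assumptions and are not taken from the original implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// f_j = 1 if I(d_{j,1}) < I(d_{j,2}), and 0 otherwise.
// d1 and d2 are offsets into the patch buffer (hypothetical layout).
inline int binaryFeature(const std::vector<std::uint8_t>& patch,
                         std::size_t d1, std::size_t d2) {
    return patch[d1] < patch[d2] ? 1 : 0;
}

// log prod_j P(f_j | C = c_i) = sum_j log P(f_j | C = c_i), for one class.
// pOne[j] is assumed to hold P(f_j = 1 | C = c_i), strictly between 0 and 1.
double naiveLogLikelihood(const std::vector<int>& features,
                          const std::vector<double>& pOne) {
    double logLikelihood = 0.0;
    for (std::size_t j = 0; j < features.size(); ++j)
        logLikelihood += std::log(features[j] ? pOne[j] : 1.0 - pOne[j]);
    return logLikelihood;
}
```

Under this extreme assumption only $N$ probabilities per class need to be stored, at the price of discarding every correlation between features.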
However, this completely ignores the correlation between features. To make the problem tractable while accounting for these dependencies, a good compromise is to partition our features into $M$ groups of size $S = \frac{N}{M}$. These groups are what we define as ferns and we compute the joint probability for features in each fern. The conditional probability becomes
\[ P(f_1, f_2, \ldots, f_N \mid C = c_i) = \prod_{k=1}^{M} P(F_k \mid C = c_i), \tag{2} \]
(Left) Pseudo-code:

    for i = 1 to H
        logP_{I|C}[i] ← 0
    end for
    for all fern F_k do
        index ← 0
        for j = 1 to S
            index ← 2 × index
            if I(d_{σ(k,j,1)}) < I(d_{σ(k,j,2)}) then
                index ← index + 1
            end if
        end for
        for i = 1 to H
            logP_{I|C}[i] ← logP_{I|C}[i] + logP_{F_k}[index, i]
        end for
    end for

(Right) C++ implementation:

    for(int i = 0; i < H; i++) P[i] = 0.;        // reset the H class scores
    for(int k = 0; k < M; k++) {                 // loop over the M ferns
      int index = 0, * d = D + k * 2 * S;        // d: the 2*S pixel offsets of fern k
      for(int j = 0; j < S; j++) {
        index <<= 1;                             // make room for the next test bit
        if (*(K + d[0]) < *(K + d[1]))           // compare the two pixels addressed by the offsets
          index++;
        d += 2;                                  // advance to the next pair of offsets
      }
      p = PF + k * shift2 + index * shift1;      // row of the table for (fern k, index)
      for(int i = 0; i < H; i++) P[i] += p[i];   // accumulate the per-class terms
    }

Figure 1. Left: The pseudo-code of the run-time algorithm that computes P(f1, f2,..., fN | C = ci) as given by Eq. (2) to classify the image patch I, where index is an integer index computed from the binary features. No image rectification, illumination normalization, or parameter tuning is required. Right: A C++ implementation of the pseudo-code. The code used for training is very similar.
where $F_k = \{f_{\sigma(k,1)}, f_{\sigma(k,2)}, \ldots, f_{\sigma(k,S)}\}$, $k = 1, \ldots, M$ represents the $k$-th fern and $\sigma(k, j)$ is a random permutation function with range $1, \ldots, N$. Hence we follow a Semi-Naive Bayesian [15] approach by modelling only some of the dependencies between features. The viability of such an approach has been shown by [8] in the context of image retrieval applications.
This form can now be handled easily since it has $M \times 2^S$ parameters, with $M$ between 30 and 50, and we show in Section 4 that a fern size $S$ around 10 gives good recognition rates, compared to the $2^N$ with $N \approx 300$ for the full joint probability representation. It is also flexible since performance/memory trade-offs can be made by changing the number of ferns and their sizes. The corresponding code is given as Figure 1 to highlight the simplicity of the resulting implementation.
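For readers who want something that compiles in isolation, here is a hedged, self-contained variant of the run-time loop of Figure 1. It is only a sketch under assumptions: the struct `FernClassifier`, the `PixelPair` type, and the memory layout of `logProb` are hypothetical choices made for this example, and the table is assumed to already contain log-probabilities produced by training.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Offsets d_{j,1} and d_{j,2} into the patch buffer (hypothetical type).
struct PixelPair { int first; int second; };

struct FernClassifier {
    int numClasses;                // H
    int fernSize;                  // S
    std::vector<PixelPair> tests;  // M * S tests, stored fern by fern
    // logProb[(k * 2^S + index) * H + i] = log P(F_k = index | C = c_i)
    std::vector<float> logProb;

    int numFerns() const { return static_cast<int>(tests.size()) / fernSize; }

    // Pack the S binary tests of fern k into an integer in [0, 2^S).
    int fernIndex(const std::vector<std::uint8_t>& patch, int k) const {
        int index = 0;
        for (int j = 0; j < fernSize; ++j) {
            const PixelPair& t = tests[k * fernSize + j];
            index = (index << 1) | (patch[t.first] < patch[t.second] ? 1 : 0);
        }
        return index;
    }

    // Return argmax_i sum_k log P(F_k | C = c_i), i.e. Eq. (2) in the log domain.
    int classify(const std::vector<std::uint8_t>& patch) const {
        std::vector<float> score(numClasses, 0.0f);
        const std::size_t leaves = std::size_t(1) << fernSize;
        for (int k = 0; k < numFerns(); ++k) {
            const std::size_t row = (k * leaves + fernIndex(patch, k)) * numClasses;
            for (int i = 0; i < numClasses; ++i) score[i] += logProb[row + i];
        }
        int best = 0;
        for (int i = 1; i < numClasses; ++i)
            if (score[i] > score[best]) best = i;
        return best;
    }
};
```

With $M = 50$ ferns of size $S = 10$, the table holds $50 \times 2^{10} = 51{,}200$ entries per class, which is why this layout remains practical where a table of $2^{300}$ entries would not.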
3.2. Training
For our experiments, we start the training by constructing a set of $H$ prominent keypoints lying on the object model. Each feature point corresponds to a class.
The fern features, that is the locations $d_{j,1}$ and $d_{j,2}$, are picked at random. The terms
\[ P(F_k \mid C = c_i), \quad k = 1, \ldots, M \]
are estimated by computing the features on training samples of each class. Here we can exploit our strong knowledge of the problem to create a virtually infinite training set: we use a small number of images and synthesize many new views of the object using simple rendering techniques such as affine deformations, and extract training patches for each class. White noise is also added for more realism. For each keypoint on the model, this gives us a fine sampling of the set of all its possible appearances under different viewing conditions.
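One possible way to synthesize such training patches is sketched below. Everything specific in it is an assumption made for illustration: the decomposition of the affine map into rotations and two scalings, the parameter ranges, nearest-neighbour resampling, the mid-gray fill value, and the names `Affine2`, `randomAffine` and `synthesizeView`. The paper only states that affine deformations and white noise are used.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// 2x2 linear part of an affine map, applied about the patch center.
struct Affine2 { double a, b, c, d; };

// Draw A = R(theta) * R(-phi) * diag(l1, l2) * R(phi).
// The decomposition and the ranges below are assumptions of this sketch.
Affine2 randomAffine(std::mt19937& rng) {
    const double pi = 3.14159265358979323846;
    std::uniform_real_distribution<double> angle(-pi, pi), scale(0.6, 1.5);
    const double theta = angle(rng), phi = angle(rng);
    const double l1 = scale(rng), l2 = scale(rng);
    const double c = std::cos(phi), s = std::sin(phi);
    // B = R(-phi) * diag(l1, l2) * R(phi) is symmetric.
    const double b11 = l1 * c * c + l2 * s * s;
    const double b12 = (l2 - l1) * s * c;
    const double b22 = l1 * s * s + l2 * c * c;
    const double ct = std::cos(theta), st = std::sin(theta);
    return { ct * b11 - st * b12, ct * b12 - st * b22,
             st * b11 + ct * b12, st * b12 + ct * b22 };
}

// Resample a square size x size patch through A (nearest neighbour) and add
// Gaussian white noise of standard deviation sigma (skipped if sigma <= 0).
// Pixels that map outside the source patch are filled with mid-gray (128).
std::vector<std::uint8_t> synthesizeView(const std::vector<std::uint8_t>& src, int size,
                                         const Affine2& A, double sigma,
                                         std::mt19937& rng) {
    std::vector<std::uint8_t> dst(src.size());
    const double center = (size - 1) / 2.0;
    std::normal_distribution<double> noise(0.0, sigma > 0.0 ? sigma : 1.0);
    for (int y = 0; y < size; ++y) {
        for (int x = 0; x < size; ++x) {
            const double dx = x - center, dy = y - center;
            const long sx = std::lround(center + A.a * dx + A.b * dy);
            const long sy = std::lround(center + A.c * dx + A.d * dy);
            double value = 128.0;
            if (sx >= 0 && sx < size && sy >= 0 && sy < size)
                value = src[sy * size + sx];
            if (sigma > 0.0) value += noise(rng);
            value = std::min(255.0, std::max(0.0, std::round(value)));
            dst[y * size + x] = static_cast<std::uint8_t>(value);
        }
    }
    return dst;
}
```

Calling `synthesizeView(patch, size, randomAffine(rng), sigma, rng)` repeatedly for the patch around each keypoint would produce the training samples of the corresponding class; a production version would more likely warp the whole training image and use interpolated resampling.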
However, even if each term $P(F_k \mid C = c_i)$ is only a part of the full joint probability of Eq. (1), their estimation still involves estimating an extremely large number of parameters, and they cannot be reliably estimated directly as empirical probabilities in practice. In order to explain how we estimate the $P(F_k \mid C = c_i)$, let us introduce the event $\Theta(F_k)$ that states that "the empirical probabilities for $F_k$ are reliable". We can then express a $P(F_k \mid C = c_i)$ term as:
\[ P(F_k \mid C = c_i) = P(F_k \mid C = c_i, \Theta(F_k))\, P(\Theta(F_k)) + P(F_k \mid C = c_i, \overline{\Theta}(F_k))\, P(\overline{\Theta}(F_k)). \tag{3} \]
$P(F_k \mid C = c_i, \Theta(F_k))$ is nothing more than the empirical probability of $P(F_k \mid C = c_i)$, and $P(F_k \mid C = c_i, \overline{\Theta}(F_k))$ should be taken as constant and is therefore equal to $\frac{1}{H}$. Let us now model $P(\Theta(F_k))$ as:
\[ P(\Theta(F_k)) = \frac{\sum_i n_{k,i}}{\sum_i (n_{k,i} + u)}, \]
where $n_{k,i}$ is the number of training samples of class $c_i$ that verify the set of features $F_k$. When the training set is truly representative of the actual variations within classes, this model makes sense since it tends to 1 when the number of training samples grows, and yields a simple way to estimate the $P(F_k \mid C = c_i, \Theta(F_k))$. It is easy to check that we then have:
\[ P(F_k \mid C = c_i) = \frac{n_{k,i} + u}{\sum_k (n_{k,i} + u)}. \]
In practice, the value of $u$ does not really influence the results as long as it is greater than 0. In all our experiments, we use $u = 1$. This factor can be interpreted as a Dirichlet prior, since the class conditional probabilities are modelled as multinomials [4].
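The smoothed estimate can be sketched in a few lines. The fragment below turns the raw counts of one fern into log-probabilities with the additive term $u$. It assumes that, for each class, the normalization runs over all $2^S$ possible outcomes of the fern, which is one reading of the sum in the formula above; the function name `smoothedLogProb` and the flat `counts` layout are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// counts[leaf * H + i]: number of training patches of class i whose tests
// produced outcome `leaf` for one fern (hypothetical flat layout).
// Returns log((n + u) / sum_over_leaves(n + u)) in the same layout.
// Assumption: the normalization for each class runs over all 2^S outcomes.
std::vector<double> smoothedLogProb(const std::vector<int>& counts,
                                    int numClasses, double u = 1.0) {
    const std::size_t leaves = counts.size() / numClasses;
    std::vector<double> logProb(counts.size());
    for (int i = 0; i < numClasses; ++i) {
        double total = 0.0;
        for (std::size_t leaf = 0; leaf < leaves; ++leaf)
            total += counts[leaf * numClasses + i] + u;
        for (std::size_t leaf = 0; leaf < leaves; ++leaf)
            logProb[leaf * numClasses + i] =
                std::log((counts[leaf * numClasses + i] + u) / total);
    }
    return logProb;
}
```

Because $u > 0$, an outcome that was never observed for a class still receives a small non-zero probability, so a single unseen fern response cannot drive the whole product for that class to zero.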
[Figure 2 plots: Number of Evaluated Class Posteriors versus Number of Ferns Evaluated, with one curve for ratio thresholds and one for simple thresholds, in one panel for 300 classes and one panel for 900 classes.]
Figure 2. The number of class posteriors that needs to be evaluated decreases very rapidly when the probabilities are thresholded by their ratio to the maximum probability. Plots show these curves when 300 and 900 classes are trained respectively.
Figure 3. Two of the images used for evaluation.
3.3. Handling Many Classes
At run-time, computing the class probabilities takes a single lookup for each fern and their final multiplication. However, this has to be repeated for each class and, as the number of classes increases, this quickly becomes burdensome. Furthermore, some of the class posteriors reach very small values after the first few terms have been multiplied, and their final value becomes irrelevant for class selection. In principle we do not have to calculate the final value of the posterior for each class as long as the selected class does not change.
Here we consider two strategies for eliminating classes during posterior evaluation as each term coming from a fern is multiplied, so that the computation can be carried out much more quickly. The first strategy is to use a simple threshold, which can be learned from the training set, on the posteriors as each term gets multiplied. This eliminates classes that are unlikely to be the correct class. The second approach is to use a threshold on the ratio of the maximum posterior to the considered class posterior at each step. This is based on the observation that a class posterior that has fallen back by a large margin is unlikely to catch up. Note that this second strategy requires the computation of the maximum posterior at the end of each step.

                      300 Classes   900 Classes
No Thresholding           93.2          87.2
Simple Thresholds         87.2          80.6
Ratio Thresholds          90.2          84.1

Table 1. Percentage of correctly classified image patches with and without thresholding. Since the thresholds are calculated using the training set they can cause misclassification on a test set. In the case of ratio thresholds this loss of performance is not significant.
Figure 2 shows the average number of class posteriors calculated at the end of each step for the two thresholding strategies. As the plots clearly show, thresholds on the ratios to the maximum probability at each step decrease the number of necessary evaluations significantly. The thresholds were chosen to be as large as possible without causing a misclassification on the training set. For these experiments, we used the two images shown in Figure 3, and the percentage of correctly classified image patches was evaluated on randomly generated views of these images. As can be observed from Table 1, the ratio thresholds decrease the classification rate only slightly and therefore generalize well.
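The ratio-threshold strategy can be illustrated with the following sketch, written in the log domain so that the ratio of the maximum posterior to a class posterior becomes a difference of scores. The function name `classifyWithRatioPruning`, the pre-looked-up table `fernLogTerms`, and the single global threshold are assumptions of this example; the text does not specify whether one threshold or one per step is learned.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// fernLogTerms[k][i]: log-probability term contributed by fern k to class i
// for the current patch (assumed already looked up).
// logRatioThreshold: log of the ratio threshold, must be >= 0.
// evaluated (optional): receives the number of class scores that were updated.
// Returns the index of the best surviving class, or -1 if the input is empty.
int classifyWithRatioPruning(const std::vector<std::vector<double>>& fernLogTerms,
                             double logRatioThreshold,
                             std::size_t* evaluated = nullptr) {
    if (fernLogTerms.empty() || fernLogTerms[0].empty()) return -1;
    const std::size_t numClasses = fernLogTerms[0].size();
    std::vector<double> score(numClasses, 0.0);
    std::vector<std::size_t> alive(numClasses);
    for (std::size_t i = 0; i < numClasses; ++i) alive[i] = i;

    std::size_t updates = 0;
    for (const std::vector<double>& term : fernLogTerms) {
        // Update only the classes that are still candidates.
        double best = -std::numeric_limits<double>::infinity();
        for (std::size_t i : alive) {
            score[i] += term[i];
            if (score[i] > best) best = score[i];
            ++updates;
        }
        // Drop every class that has fallen too far behind the current maximum.
        std::size_t kept = 0;
        for (std::size_t n = 0; n < alive.size(); ++n)
            if (best - score[alive[n]] <= logRatioThreshold) alive[kept++] = alive[n];
        alive.resize(kept);
    }
    std::size_t winner = alive[0];
    for (std::size_t i : alive)
        if (score[i] > score[winner]) winner = i;
    if (evaluated) *evaluated = updates;
    return static_cast<int>(winner);
}
```

The extra pass that finds the maximum at each step is the cost noted above for the second strategy; it pays off as soon as the list of surviving classes shrinks after the first few ferns.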
4. Comparing Ferns and Trees
Ferns and Random Trees are very similar in spirit but differ in two important respects. In trees the binary tests are organized hierarchically and the posterior distributions are computed additively. By contrast, ferns are flat and compute posteriors multiplicatively. In this section, we first compare the two approaches. We then offer a theoretical insight into why ferns appear to outperform trees, but only when the training set is sufficiently large.
[Figure 4 plots: two panels of Average Classification Rate (0 to 100) versus # of Structures (Depth/Size 10), each comparing Ferns and Random Trees.]
Figure 4. The percentage of correctly classified image patches is shown against different classifier parameters for the fern and tree based methods. The independence assumption between ferns allows the joint utilization of the features resulting in a higher classification rate. The error bars show the 95% confidence interval assuming a Gaussian distribution.
[Figure 5 plots: Average Classification Rate (0 to 100) versus # of Classes (0 to 2000); one panel compares 20 ferns of size 10 with 20 trees of depth 10, the other compares 30 ferns of size 10 with 30 trees of depth 10.]
Figure 5. Classification using ferns can handle many more classes than Random Tree based methods. For both figures ferns with size 10 and trees with depth 10 are used.
4.1. Ferns Outperform Trees
We evaluate the performance of the proposed fern based approach by comparing to the results of a Random Tree based implementation. The number of tests in the ferns and the depth of the trees are taken to be equal, and we compare the classification rate when using the same number of structures, that is of ferns or Random Trees. In particular, the same number of tests is performed on each keypoint, and the same number of joint probabilities has to be stored.
We do our tests on the images presented in Figure 3 with 500 classes and calculate the average classification rate on randomly generated test images while eliminating false matches using object geometry. Since the feature selection is random, we repeat the test 10 times and calculate the mean and variance of the classification rate, and we perform the test on the two images.
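As an illustration of how repeated runs can be summarized, the sketch below computes the mean, the variance, and a Gaussian 95% half-width from a list of classification rates. It is an assumption of this example that the unbiased sample variance is used and that the interval is taken on the mean, that is $1.96\,\sigma/\sqrt{n}$; the text does not state which convention was used. The names `RateSummary` and `summarizeRates` are hypothetical.

```cpp
#include <cmath>
#include <vector>

struct RateSummary { double mean; double variance; double halfWidth95; };

// rates: classification rates (in percent) from repeated runs, at least two.
// Uses the unbiased sample variance and a 1.96 * sigma / sqrt(n) half-width,
// i.e. a Gaussian 95% interval on the mean (an assumption of this sketch).
RateSummary summarizeRates(const std::vector<double>& rates) {
    const double n = static_cast<double>(rates.size());
    double sum = 0.0;
    for (double r : rates) sum += r;
    const double mean = sum / n;
    double squared = 0.0;
    for (double r : rates) squared += (r - mean) * (r - mean);
    const double variance = squared / (n - 1.0);
    return { mean, variance, 1.96 * std::sqrt(variance / n) };
}
```

With ten runs per configuration, `summarizeRates` would be called once per point on a curve, and `halfWidth95` would give the length of the error bar on either side of the mean.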
As depicted by Figure 4, despite the inaccuracy of the independence assumptions, the fern based classifier outperforms the combination of trees. Furthermore, as the number of ferns is increased, the random selection method does not cause large variations in the classifier performance.
We also investigate the behavior of the classification rate as the number of classes increases. Figure 5 shows that a larger number of classes does not affect the performance of ferns much, while tree based methods cannot cope with many classes. In both experiments we have trained the classifiers using classes from three different images, with up to 700 classes for each image.
4.2. Linking the Two Approaches
Here we show that the two approaches are equivalent
when the training set is small and give some insights into
why the ferns perform better when it is large.
Recall that we evaluate $P(F_k \mid C = c_i)$ of Eq. (3) as
\[ P(F_k \mid C = c_i) = P_e\, P(\Theta(F_k)) + \mu\, (1 - P(\Theta(F_k))), \]
where $P_e$ is the empirical probability and $\mu = \frac{1}{H}$. Since computing the product of such terms is the same as sum-