ORB: an efficient alternative to SIFT or SURF
Ethan Rublee Vincent Rabaud Kurt Konolige Gary Bradski
Willow Garage, Menlo Park, California
{erublee, vrabaud, konolige, bradski}@willowgarage.com
Abstract
Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments how ORB is two orders of magnitude faster than SIFT, while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch-tracking on a smart phone.
1. Introduction
The SIFT keypoint detector and descriptor [17], although over a decade old, have proven remarkably successful in a number of applications using visual features, including object recognition [17], image stitching [28], visual mapping [25], etc. However, it imposes a large computational burden, especially for real-time systems such as visual odometry, or for low-power devices such as cellphones. This has led to an intensive search for replacements with lower computation cost; arguably the best of these is SURF [2]. There has also been research aimed at speeding up the computation of SIFT, most notably with GPU devices [26].

In this paper, we propose a computationally-efficient replacement to SIFT that has similar matching performance, is less affected by image noise, and is capable of being used for real-time performance. Our main motivation is to enhance many common image-matching applications, e.g., to enable low-power devices without GPU acceleration to perform panorama stitching and patch tracking, and to reduce the time for feature-based object detection on standard PCs. Our descriptor performs as well as SIFT on these tasks (and better than SURF), while being almost two orders of magnitude faster.
Our proposed feature builds on the well-known FAST keypoint detector [23] and the recently-developed BRIEF descriptor [6]; for this reason we call it ORB (Oriented FAST and Rotated BRIEF). Both these techniques are attractive because of their good performance and low cost. In this paper, we address several limitations of these techniques vis-à-vis SIFT, most notably the lack of rotational invariance in BRIEF.

Figure 1. Typical matching result using ORB on real-world images with viewpoint change. Green lines are valid matches; red circles indicate unmatched points.

Our main contributions are:
• The addition of a fast and accurate orientation component to FAST.
• The efficient computation of oriented BRIEF features.
• Analysis of variance and correlation of oriented BRIEF features.
• A learning method for de-correlating BRIEF features under rotational invariance, leading to better performance in nearest-neighbor applications.
To validate ORB, we perform experiments that test the properties of ORB relative to SIFT and SURF, for both raw matching ability and performance in image-matching applications. We also illustrate the efficiency of ORB by implementing a patch-tracking application on a smart phone. An additional benefit of ORB is that it is free from the licensing restrictions of SIFT and SURF.
2. Related Work

Keypoints FAST and its variants [23, 24] are the method of choice for finding keypoints in real-time systems that match visual features, for example, Parallel Tracking and Mapping [13]. It is efficient and finds reasonable corner keypoints, although it must be augmented with pyramid schemes for scale [14], and in our case, a Harris corner filter [11] to reject edges and provide a reasonable score.
Many keypoint detectors include an orientation operator (SIFT and SURF are two prominent examples), but FAST does not. There are various ways to describe the orientation of a keypoint; many of these involve histograms of gradient computations, for example in SIFT [17] and the approximation by block patterns in SURF [2]. These methods are either computationally demanding, or in the case of SURF, yield poor approximations. The reference paper by Rosin [22] gives an analysis of various ways of measuring the orientation of corners, and we borrow from his centroid technique. Unlike the orientation operator in SIFT, which can have multiple values on a single keypoint, the centroid operator gives a single dominant result.
Descriptors BRIEF [6] is a recent feature descriptor that uses simple binary tests between pixels in a smoothed image patch. Its performance is similar to SIFT in many respects, including robustness to lighting, blur, and perspective distortion. However, it is very sensitive to in-plane rotation.

BRIEF grew out of research that uses binary tests to train a set of classification trees [4]. Once trained on a set of 500 or so typical keypoints, the trees can be used to return a signature for any arbitrary keypoint [5]. In a similar manner, we look for the tests least sensitive to orientation. The classic method for finding uncorrelated tests is Principal Component Analysis; for example, it has been shown that PCA for SIFT can help remove a large amount of redundant information [12]. However, the space of possible binary tests is too big to perform PCA, and an exhaustive search is used instead.

Visual vocabulary methods [21, 27] use offline clustering to find exemplars that are uncorrelated and can be used in matching. These techniques might also be useful in finding uncorrelated binary tests.

The closest system to ORB is [3], which proposes a multi-scale Harris keypoint and oriented patch descriptor. This descriptor is used for image stitching, and shows good rotational and scale invariance. It is not as efficient to compute as our method, however.
3. oFAST: FAST Keypoint Orientation

FAST features are widely used because of their computational properties. However, FAST features do not have an orientation component. In this section we add an efficiently-computed orientation.

3.1. FAST Detector

We start by detecting FAST points in the image. FAST takes one parameter, the intensity threshold between the center pixel and those in a circular ring about the center. We use FAST-9 (circular radius of 9), which has good performance.
FAST does not produce a measure of cornerness, and we have found that it has large responses along edges. We employ a Harris corner measure [11] to order the FAST keypoints. For a target number N of keypoints, we first set the threshold low enough to get more than N keypoints, then order them according to the Harris measure, and pick the top N points.

FAST does not produce multi-scale features. We employ a scale pyramid of the image, and produce FAST features (filtered by Harris) at each level in the pyramid.
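A minimal sketch of this detection scheme, using OpenCV's FAST and Harris primitives (the function names are from OpenCV's Python API; the FAST threshold, Harris parameters, and pyramid settings are illustrative assumptions, not the paper's tuned values):

```python
import cv2
import numpy as np

def ofast_keypoints(img_gray, n_target=500, n_levels=5, scale=2 ** 0.5):
    """Detect FAST corners on a scale pyramid and keep the top N by Harris score."""
    fast = cv2.FastFeatureDetector_create(threshold=10)   # low threshold -> > N candidates
    keypoints, level_img = [], img_gray
    for level in range(n_levels):
        harris = cv2.cornerHarris(np.float32(level_img), blockSize=7, ksize=3, k=0.04)
        for kp in fast.detect(level_img, None):
            x, y = int(kp.pt[0]), int(kp.pt[1])
            kp.response = float(harris[y, x])              # re-score with the Harris measure
            kp.pt = (kp.pt[0] * scale ** level,            # map back to base-image coordinates
                     kp.pt[1] * scale ** level)
            kp.octave = level
            keypoints.append(kp)
        level_img = cv2.resize(level_img, None, fx=1 / scale, fy=1 / scale,
                               interpolation=cv2.INTER_AREA)
    keypoints.sort(key=lambda k: k.response, reverse=True)
    return keypoints[:n_target]                            # pick the top N points
```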
3.2. Orientation by Intensity Centroid

Our approach uses a simple but effective measure of corner orientation, the intensity centroid [22]. The intensity centroid assumes that a corner's intensity is offset from its center, and this vector may be used to impute an orientation. Rosin defines the moments of a patch as:

$$m_{pq} = \sum_{x,y} x^p y^q I(x, y), \quad (1)$$

and with these moments we may find the centroid:

$$C = \left(\frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}}\right) \quad (2)$$

We can construct a vector from the corner's center, $O$, to the centroid, $\vec{OC}$. The orientation of the patch then simply is:

$$\theta = \mathrm{atan2}(m_{01}, m_{10}), \quad (3)$$

where atan2 is the quadrant-aware version of arctan. Rosin mentions taking into account whether the corner is dark or light; however, for our purposes we may ignore this, as the angle measures are consistent regardless of the corner type.

To improve the rotation invariance of this measure we make sure that moments are computed with $x$ and $y$ remaining within a circular region of radius $r$. We empirically choose $r$ to be the patch size, so that $x$ and $y$ run from $[-r, r]$. As $|C|$ approaches 0, the measure becomes unstable; with FAST corners, we have found that this is rarely the case.
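As a concrete illustration, the following is a minimal NumPy sketch of Equations 1–3, assuming a square grayscale patch and applying the circular mask of radius r described above (a sketch, not the paper's implementation):

```python
import numpy as np

def intensity_centroid_angle(patch):
    """Patch orientation via the intensity centroid (Eqs. 1-3).

    `patch` is a (2r+1) x (2r+1) grayscale array; moments are restricted
    to the circular region of radius r so the measure stays consistent
    under in-plane rotation of the patch.
    """
    r = patch.shape[0] // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]          # x and y run over [-r, r]
    mask = xs ** 2 + ys ** 2 <= r ** 2             # keep points inside the circle
    m01 = np.sum(ys[mask] * patch[mask])           # first-order moments (Eq. 1)
    m10 = np.sum(xs[mask] * patch[mask])
    return np.arctan2(m01, m10)                    # quadrant-aware angle (Eq. 3)
```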
We compared the centroid method with two gradient-based measures, BIN and MAX. In both cases, X and Y gradients are calculated on a smoothed image. MAX chooses the largest gradient in the keypoint patch; BIN forms a histogram of gradient directions at 10 degree intervals, and picks the maximum bin. BIN is similar to the SIFT algorithm, although it picks only a single orientation. The variance of the orientation in a simulated dataset (in-plane rotation plus added noise) is shown in Figure 2. Neither of the gradient measures performs very well, while the centroid gives a uniformly good orientation, even under large image noise.
Figure 2. Rotation measure. The intensity centroid (IC) performs best on recovering the orientation of artificially rotated noisy patches, compared to a histogram (BIN) and MAX method.
4. rBRIEF: Rotation-Aware BRIEF

In this section, we first introduce a steered BRIEF descriptor, show how to compute it efficiently, and demonstrate why it actually performs poorly with rotation. We then introduce a learning step to find less correlated binary tests, leading to the better descriptor rBRIEF, for which we offer comparisons to SIFT and SURF.
4.1. Efficient Rotation of the BRIEF Operator

Brief overview of BRIEF

The BRIEF descriptor [6] is a bit string description of an image patch constructed from a set of binary intensity tests. Consider a smoothed image patch, $p$. A binary test $\tau$ is defined by:

$$\tau(p; \mathbf{x}, \mathbf{y}) := \begin{cases} 1 & : p(\mathbf{x}) < p(\mathbf{y}) \\ 0 & : p(\mathbf{x}) \geq p(\mathbf{y}) \end{cases} \quad (4)$$

where $p(\mathbf{x})$ is the intensity of $p$ at a point $\mathbf{x}$. The feature is defined as a vector of $n$ binary tests:

$$f_n(p) := \sum_{1 \leq i \leq n} 2^{i-1} \tau(p; \mathbf{x}_i, \mathbf{y}_i) \quad (5)$$

Many different types of distributions of tests were considered in [6]; here we use one of the best performers, a Gaussian distribution around the center of the patch. We also choose a vector length $n = 256$.

It is important to smooth the image before performing the tests. In our implementation, smoothing is achieved using an integral image, where each test point is a $5 \times 5$ sub-window of a $31 \times 31$ pixel patch. These were chosen from our own experiments and the results in [6].
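A minimal sketch of this construction follows, assuming NumPy, an already-smoothed image, and test locations drawn from a Gaussian around the patch center (the spread and the fixed seed are illustrative assumptions, not the paper's exact sampling):

```python
import numpy as np

PATCH = 31                                   # patch size from the text
rng = np.random.default_rng(0)               # fixed seed, purely for reproducibility
# n = 256 tests; each row holds (x1, y1, x2, y2) offsets from the patch center.
TESTS = np.clip(rng.normal(0.0, PATCH / 5.0, size=(256, 4)),
                -(PATCH // 2), PATCH // 2).astype(int)

def brief_descriptor(smoothed, cx, cy):
    """256-bit BRIEF string for the patch centered at (cx, cy) (Eqs. 4-5)."""
    bits = np.empty(256, dtype=np.uint8)
    for i, (x1, y1, x2, y2) in enumerate(TESTS):
        # tau(p; x, y) = 1 if p(x) < p(y), else 0  (Eq. 4)
        bits[i] = smoothed[cy + y1, cx + x1] < smoothed[cy + y2, cx + x2]
    return np.packbits(bits)                 # 32 bytes holding the bit string of Eq. 5
```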
Steered BRIEF

We would like to allow BRIEF to be invariant to in-plane rotation. Matching performance of BRIEF falls off sharply for in-plane rotation of more than a few degrees (see Figure 7). Calonder [6] suggests computing a BRIEF descriptor for a set of rotations and perspective warps of each patch, but this solution is obviously expensive. A more efficient method is to steer BRIEF according to the orientation of keypoints. For any feature set of $n$ binary tests at location $(\mathbf{x}_i, \mathbf{y}_i)$, define the $2 \times n$ matrix

$$S = \begin{pmatrix} \mathbf{x}_1, \ldots, \mathbf{x}_n \\ \mathbf{y}_1, \ldots, \mathbf{y}_n \end{pmatrix}$$

Figure 3. Distribution of means for feature vectors: BRIEF, steered BRIEF (Section 4.1), and rBRIEF (Section 4.3). The X axis is the distance to a mean of 0.5.

Using the patch orientation $\theta$ and the corresponding rotation matrix $R_\theta$, we construct a "steered" version $S_\theta$ of $S$:

$$S_\theta = R_\theta S.$$

Now the steered BRIEF operator becomes

$$g_n(p, \theta) := f_n(p) \mid (\mathbf{x}_i, \mathbf{y}_i) \in S_\theta \quad (6)$$

We discretize the angle to increments of $2\pi/30$ (12 degrees), and construct a lookup table of precomputed BRIEF patterns. As long as the keypoint orientation $\theta$ is consistent across views, the correct set of points $S_\theta$ will be used to compute its descriptor.
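The lookup table can be built by rotating the test pattern once per discretized angle; the following is a minimal NumPy sketch, assuming the pattern is stored as an (n, 4) array of (x1, y1, x2, y2) offsets as in the previous sketch:

```python
import numpy as np

N_ANGLES = 30    # 2*pi/30 increments, i.e., 12 degrees, as in the text

def build_steered_patterns(tests):
    """Precompute S_theta = R_theta * S for each discretized angle."""
    patterns = []
    for k in range(N_ANGLES):
        theta = 2.0 * np.pi * k / N_ANGLES
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])                 # in-plane rotation matrix
        pts = tests.reshape(-1, 2) @ R.T                # rotate every test point
        patterns.append(np.rint(pts).astype(int).reshape(tests.shape))
    return patterns

# At run time, a keypoint with orientation theta uses the nearest pattern:
#   index = int(round(theta / (2 * np.pi / N_ANGLES))) % N_ANGLES
```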
4.2. Variance and Correlation

One of the pleasing properties of BRIEF is that each bit feature has a large variance and a mean near 0.5. Figure 3 shows the spread of means for a typical Gaussian BRIEF pattern of 256 bits over 100k sample keypoints. A mean of 0.5 gives the maximum sample variance of 0.25 for a bit feature. On the other hand, once BRIEF is oriented along the keypoint direction to give steered BRIEF, the means are shifted to a more distributed pattern (again, Figure 3). One way to understand this is that the oriented corner keypoints present a more uniform appearance to binary tests.

High variance makes a feature more discriminative, since it responds differentially to inputs. Another desirable property is to have the tests uncorrelated, since then each test will contribute to the result. To analyze the correlation and variance of tests in the BRIEF vector, we looked at the response to 100k keypoints for BRIEF and steered BRIEF. The results are shown in Figure 4. Using PCA on the data, we plot the highest 40 eigenvalues (after which the two descriptors converge). Both BRIEF and steered BRIEF exhibit high initial eigenvalues, indicating correlation among the binary tests – essentially all the information is contained in the first 10 or 15 components. Steered BRIEF has significantly lower variance, however, since the eigenvalues are lower, and thus is not as discriminative. Apparently BRIEF depends on random orientation of keypoints for good performance. Another view of the effect of steered BRIEF is shown in the distance distributions between inliers and outliers (Figure 5). Notice that for steered BRIEF, the mean for outliers is pushed left, and there is more of an overlap with the inliers.

Figure 4. Distribution of eigenvalues in the PCA decomposition over 100k keypoints of three feature vectors: BRIEF, steered BRIEF (Section 4.1), and rBRIEF (Section 4.3).

Figure 5. The dotted lines show the distances of a keypoint to outliers, while the solid lines denote the distances only between inlier matches, for three feature vectors: BRIEF, steered BRIEF (Section 4.1), and rBRIEF (Section 4.3).
4.3. Learning Good Binary Features

To recover from the loss of variance in steered BRIEF, and to reduce correlation among the binary tests, we develop a learning method for choosing a good subset of binary tests. One possible strategy is to use PCA or some other dimensionality-reduction method, and starting from a large set of binary tests, identify 256 new features that have high variance and are uncorrelated over a large training set. However, since the new features are composed from a larger number of binary tests, they would be less efficient to compute than steered BRIEF. Instead, we search among all possible binary tests to find ones that both have high variance (and means close to 0.5) and are uncorrelated.

The method is as follows. We first set up a training set of some 300k keypoints, drawn from images in the PASCAL 2006 set [8]. We also enumerate all possible binary tests drawn from a $31 \times 31$ pixel patch. Each test is a pair of $5 \times 5$ sub-windows of the patch. If we note the width of our patch as $w_p = 31$ and the width of the test sub-window as $w_t = 5$, then we have $N = (w_p - w_t)^2$ possible sub-windows. We would like to select pairs of two from these, so we have $\binom{N}{2}$ binary tests. We eliminate tests that overlap, so we end up with $M = 205590$ possible tests. The algorithm is:

1. Run each test against all training patches.
2. Order the tests by their distance from a mean of 0.5, forming the vector T.
3. Greedy search:
   (a) Put the first test into the result vector R and remove it from T.
   (b) Take the next test from T, and compare it against all tests in R. If its absolute correlation is greater than a threshold, discard it; else add it to R.
   (c) Repeat the previous step until there are 256 tests in R. If there are fewer than 256, raise the threshold and try again.

This algorithm is a greedy search for a set of uncorrelated tests with means near 0.5. The result is called rBRIEF.
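A minimal sketch of this greedy search, assuming the responses of all candidate tests have been precomputed as a 0/1 matrix (the correlation threshold and its step size are illustrative assumptions, not the paper's values):

```python
import numpy as np

def greedy_select(responses, n_select=256, corr_thresh=0.2, step=0.05):
    """Greedy selection of uncorrelated tests with means near 0.5.

    `responses` is an (M, K) 0/1 matrix: M candidate tests evaluated on K
    training patches (step 1). Returns indices of the selected tests.
    """
    means = responses.mean(axis=1)
    order = np.argsort(np.abs(means - 0.5))          # step 2: closest to 0.5 first
    while corr_thresh <= 1.0:
        chosen = [order[0]]                          # step 3a
        for t in order[1:]:                          # step 3b
            if all(abs(np.corrcoef(responses[t], responses[c])[0, 1]) < corr_thresh
                   for c in chosen):
                chosen.append(t)
            if len(chosen) == n_select:
                return chosen
        corr_thresh += step                          # step 3c: relax and retry
    raise RuntimeError("could not assemble %d uncorrelated tests" % n_select)
```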
rBRIEF shows a significant improvement in variance and correlation over steered BRIEF (see Figure 4). The eigenvalues of PCA are higher, and they fall off much less quickly. It is interesting to see the high-variance binary tests produced by the algorithm (Figure 6). There is a very pronounced vertical trend in the unlearned tests (left image), which are highly correlated; the learned tests show better diversity and lower correlation.
4.4. Evaluation
We evaluate the combination of oFAST and rBRIEF, which we call ORB, using two datasets: images with synthetic in-plane rotation and added Gaussian noise, and a real-world dataset of textured planar images captured from different viewpoints. For each reference image, we compute the oFAST keypoints and rBRIEF features, targeting 500 keypoints per image. For each test image (synthetic rotation or real-world viewpoint change), we do the same, then perform brute-force matching to find the best correspondence.
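For reference, a sketch of such a matching run using the ORB implementation later contributed to OpenCV (cv2.ORB_create and cv2.BFMatcher are OpenCV API names; the image paths are placeholders):

```python
import cv2

# Placeholder paths; any reference/test image pair will do.
img_ref = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)
img_test = cv2.imread("rotated.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)          # target 500 keypoints per image
kp1, des1 = orb.detectAndCompute(img_ref, None)
kp2, des2 = orb.detectAndCompute(img_test, None)

# Brute-force matching under the Hamming distance, with cross-checking
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
```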
Figure 6. A subset of the binary tests generated by considering
high-variance under orientation (left) and by running the learning
algorithm to reduce correlation (right). Note the distribution of the
tests around the axis of the keypoint orientation, which is pointing
up. The color coding shows the maximum pairwise correlation of
each test, with black and purple being the lowest. The learned tests
clearly have a better distribution and lower correlation.
Figure 7. Matching performance of SIFT, SURF, BRIEF with FAST, and ORB (oFAST + rBRIEF) under synthetic rotations with Gaussian noise of 10.
The results are given in terms of the percentage of correct matches against the angle of rotation.

Figure 7 shows the results for the synthetic test set with added Gaussian noise of 10. Note that the standard BRIEF operator falls off dramatically after about 10 degrees. SIFT outperforms SURF, which shows quantization effects at 45-degree angles due to its Haar-wavelet composition. ORB has the best performance, with over 70% inliers.

ORB is relatively immune to Gaussian image noise, unlike SIFT. If we plot the inlier performance vs. noise, SIFT exhibits a steady drop of 10% with each additional noise increment of 5. ORB also drops, but at a much lower rate (Figure 8).
To test ORB on real-world images, we took two sets of images, one our own indoor set of highly-textured magazines on a table (Figure 9), the other an outdoor scene. The datasets have scale, viewpoint, and lighting changes. Running a simple inlier/outlier test on this set of images, we measure the performance of ORB relative to SIFT and SURF.

Figure 8. Matching behavior under noise for SIFT and rBRIEF. The noise levels are 0, 5, 10, 15, 20, and 25. SIFT performance degrades rapidly, while rBRIEF is relatively unaffected.

Figure 9. Real-world data of a table full of magazines and an outdoor scene. The images in the first column are matched to those in the second. The last column is the resulting warp of the first onto the second.

The test is performed in the following manner:

1. Pick a reference view $V_0$.
2. For all $V_i$, find a homographic warp $H_{i0}$ that maps $V_i \to V_0$.
3. Now, use the $H_{i0}$ as ground truth for descriptor matches from SIFT, SURF, and ORB.
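A minimal NumPy sketch of the resulting inlier test under the ground-truth homography of step 2 (the pixel tolerance is an illustrative assumption):

```python
import numpy as np

def inlier_ratio(H, pts_i, pts_0, tol=5.0):
    """Fraction of matches consistent with the ground-truth warp H_i0.

    pts_i, pts_0: (N, 2) arrays of matched keypoint locations in V_i and V_0;
    tol is a reprojection tolerance in pixels.
    """
    ones = np.ones((len(pts_i), 1))
    proj = (H @ np.hstack([pts_i, ones]).T).T    # warp V_i points into V_0
    proj = proj[:, :2] / proj[:, 2:3]            # dehomogenize
    err = np.linalg.norm(proj - pts_0, axis=1)
    return float(np.mean(err < tol))
```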
Dataset     Method   inlier %   N points
Magazines   ORB      36.180     548.50
            SURF     38.305     513.55
            SIFT     34.010     584.15
Boat        ORB      45.8       789
            SURF     28.6       795
            SIFT     30.2       714
ORB outperforms SIFT and SURF on the outdoor dataset. It is about the same on the indoor set; [6] noted that blob-detection keypoints like SIFT tend to be better on graffiti-type images.
Figure 10. Two different datasets (7818 images from the PASCAL 2009 dataset [9] and 9144 low-resolution images from Caltech 101 [29]) are used to train LSH on the BRIEF, steered BRIEF and rBRIEF descriptors. The training takes less than 2 minutes and is limited by the disk IO. rBRIEF gives the most homogeneous buckets by far, thus improving the query speed and accuracy.
5. Scalable Matching of Binary Features
In this section we show that ORB outperforms
SIFT/SURF in nearest-neighbor matching over large
databases of images. A critical part of ORB is the recovery
of variance, which makes NN search more efficient.
5.1. Locality Sensitive Hashing for rBRIEF
As rBRIEF is a binary pattern, we choose Locality Sensitive Hashing [10] as our nearest-neighbor search. In LSH, points are stored in several hash tables and hashed into different buckets. Given a query descriptor, its matching buckets are retrieved and their elements compared using brute-force matching. The power of the technique lies in its ability to retrieve nearest neighbors with a high probability given enough hash tables.

For binary features, the hash function is simply a subset of the signature bits: the buckets in the hash tables contain descriptors with a common sub-signature. The distance is the Hamming distance.
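A minimal sketch of this bit-subset hashing, assuming descriptors stored as unpacked (N, 256) 0/1 arrays (the table count and sub-signature length are illustrative, not the tuned values reported below):

```python
import numpy as np

def build_lsh_tables(descriptors, n_tables=8, bits_per_key=16, seed=0):
    """Each table keys descriptors on a random subset of signature bits."""
    rng = np.random.default_rng(seed)
    tables = []
    for _ in range(n_tables):
        bit_idx = rng.choice(descriptors.shape[1], bits_per_key, replace=False)
        buckets = {}
        for i, d in enumerate(descriptors):
            key = tuple(d[bit_idx])                 # the sub-signature is the hash key
            buckets.setdefault(key, []).append(i)
        tables.append((bit_idx, buckets))
    return tables

def query(tables, descriptors, q):
    """Gather candidates from matching buckets, then brute-force by Hamming."""
    cand = {i for bit_idx, buckets in tables
            for i in buckets.get(tuple(q[bit_idx]), [])}
    return min(cand, key=lambda i: int(np.count_nonzero(descriptors[i] != q)),
               default=None)
```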
We use multi-probe LSH [18], which improves on traditional LSH by also looking at the neighboring buckets of the one into which a query descriptor falls. While this could result in more matches to check, it actually allows for a lower number of tables (and thus less RAM usage) and a longer sub-signature, and therefore smaller buckets.
5.2. Correlation and Leveling

rBRIEF improves the speed of LSH by making the buckets of the hash tables more even: as the bits are less correlated, the hash function does a better job of partitioning the data. As shown in Figure 10, buckets are much smaller on average compared to steered BRIEF or normal BRIEF.

Figure 11. Speed vs. accuracy. The descriptors are tested on warped versions of the images they were trained on. We used 1, 2 and 3 kd-trees for SIFT (the autotuned FLANN kd-tree gave worse performance), 4 to 20 hash tables for rBRIEF and 16 to 40 tables for steered BRIEF (both with a sub-signature of 16 bits). Nearest neighbors were searched over 1.6M entries for SIFT and 1.8M entries for rBRIEF.
5.3. Evaluation

We compare the performance of rBRIEF LSH with kd-trees of SIFT features using FLANN [20]. We train the different descriptors on the Pascal 2009 dataset and test them on sampled warped versions of those images, using the same affine transforms as in [1].

Our multi-probe LSH uses bitsets to speed up checking for the presence of keys in the hash maps. It also computes the Hamming distance between two descriptors using an SSE 4.2 optimized popcount.
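A table-lookup stand-in for that popcount-based Hamming distance, assuming descriptors packed into uint8 arrays (the SSE 4.2 POPCNT path itself is hardware-specific and not shown):

```python
import numpy as np

# Popcount of every possible byte value, computed once.
POPCOUNT_8 = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming(a, b):
    """Hamming distance between two packed binary descriptors (uint8 arrays)."""
    return int(POPCOUNT_8[np.bitwise_xor(a, b)].sum())
```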
Figure 11 compares the speed and accuracy of kd-trees with SIFT (SURF is equivalent) and LSH with rBRIEF. A successful match of the test image occurs when more than 50 descriptors are found in the correct database image. We notice that LSH is faster than the kd-trees, most likely thanks to its simplicity and the speed of the distance computation. LSH also gives more flexibility with regard to accuracy, which can be interesting in bag-of-features approaches [21, 27]. We also notice that steered BRIEF is much slower due to its uneven buckets.
6. Applications
6.1. Benchmarks

One emphasis for ORB is the efficiency of detection and description on standard CPUs. Our canonical ORB detector uses the oFAST detector and rBRIEF descriptor, each computed separately on five scales of the image, with a scaling factor of √2. We used an area-based interpolation for efficient decimation.

The ORB system breaks down into the following times per typical frame of size 640×480. The code was executed in a single thread running on an Intel i7 2.8 GHz processor:
ORB:        Pyramid   oFAST   rBRIEF
Time (ms)   4.43      8.68    2.12
When computing ORB on a set of 2686 images at 5 scales, it was able to detect and compute over $2 \times 10^6$ features in 42 seconds. Comparing to SIFT and SURF on the same data, for the same number of features (roughly 1000), and the same number of scales, we get the following times:
Detector              ORB    SURF    SIFT
Time per frame (ms)   15.3   217.3   5228.7
These times were averaged over 24 640×480 images from the Pascal dataset [9]. ORB is an order of magnitude faster than SURF, and over two orders faster than SIFT.
6.2. Textured object detection
We apply rBRIEF to object recognition by implementing a conventional object recognition pipeline similar to [19]: we first detect oFAST features and rBRIEF descriptors, match them to our database, and then perform PROSAC [7] and EPnP [16] to obtain a pose estimate.

Our database contains 49 household objects, each taken under 24 views with a 2D camera and a Kinect device from Microsoft. The testing data consists of 2D images of subsets of those same objects under different viewpoints and occlusions. To count as a match, we require not only that descriptors are matched but also that a pose can be computed. In the end, our pipeline retrieves 61% of the objects, as shown in Figure 12. The algorithm handles a database of 1.2M descriptors in 200 MB and has timings comparable to what we showed earlier (14 ms for detection and 17 ms for LSH matching on average). The pipeline could be sped up considerably by not matching all the query descriptors to the training data, but our goal was only to show the feasibility of object detection with ORB.
6.3. Embedded real-time feature tracking
Tracking on the phone involves matching the live frames to a previously captured keyframe. Descriptors are stored with the keyframe, which is assumed to contain a planar surface that is well textured. We run ORB on each incoming frame, and proceed with a brute-force descriptor matching against the keyframe. The putative matches from the descriptor distance are used in a PROSAC best-fit homography H.
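A sketch of this per-frame step with OpenCV primitives (cv2.findHomography exposes RANSAC rather than PROSAC, so RANSAC stands in here; frame, kp_key, and des_key are placeholders for the live frame and the stored keyframe data):

```python
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=400)      # ~400 points per image, as benchmarked below

def track_frame(frame, kp_key, des_key):
    """Match the live frame against the keyframe and fit a homography."""
    kp_f, des_f = orb.detectAndCompute(frame, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_f, des_key)
    src = np.float32([kp_f[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_key[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # robust best fit
    return H, inlier_mask
```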
Figure 12. Two images of our textured object recognition with pose estimation. The blue features are the training features superimposed on the query image, indicating that the pose of the object was found properly. Axes are also displayed for each object, as well as a pink label. The top image misses two objects; all are found in the bottom one.
While there are real-time feature trackers that can run on a cellphone [15], they usually operate on very small images (e.g., 120×160) and with very few features. Systems comparable to ours [30] typically take over 1 second per image. We were able to run ORB with 640×480 resolution at 7 Hz on a cellphone with a 1 GHz ARM chip and 512 MB of RAM. The OpenCV port for Android was used for the implementation. These are benchmarks for about 400 points per image:

            ORB    Matching   HFit
Time (ms)   66.6   72.8       20.9
7. Conclusion

In this paper, we have defined a new oriented descriptor, ORB, and demonstrated its performance and efficiency relative to other popular features. The investigation of variance under orientation was critical in constructing ORB and de-correlating its components, in order to get good performance in nearest-neighbor applications. We have also contributed a BSD-licensed implementation of ORB to the community, via OpenCV 2.3.

One of the issues that we have not adequately addressed here is scale invariance. Although we use a pyramid scheme for scale, we have not explored per-keypoint scale from depth cues, tuning the number of octaves, etc. Future work also includes GPU/SSE optimization, which could improve LSH by another order of magnitude.
References

[1] M. Aly, P. Welinder, M. Munich, and P. Perona. Scaling object recognition: Benchmark of current state of the art techniques. In First IEEE Workshop on Emergent Issues in Large Amounts of Visual Data (WS-LAVD), IEEE International Conference on Computer Vision (ICCV), September 2009.
[2] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In European Conference on Computer Vision, May 2006.
[3] M. Brown, S. Winder, and R. Szeliski. Multi-image matching using multi-scale oriented patches. In Computer Vision and Pattern Recognition, pages 510–517, 2005.
[4] M. Calonder, V. Lepetit, and P. Fua. Keypoint signatures for fast learning and recognition. In European Conference on Computer Vision, 2008.
[5] M. Calonder, V. Lepetit, K. Konolige, P. Mihelich, and P. Fua. High-speed keypoint description and matching using dense signatures. Under review, 2009.
[6] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In European Conference on Computer Vision, 2010.
[7] O. Chum and J. Matas. Matching with PROSAC – progressive sample consensus. In C. Schmid, S. Soatto, and C. Tomasi, editors, Proc. of Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 220–226, Los Alamitos, USA, June 2005. IEEE Computer Society.
[8] M. Everingham. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results. http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html.
[10] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors, VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, UK, pages 518–529. Morgan Kaufmann, 1999.
[11] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, pages 147–151, 1988.
[12] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Computer Vision and Pattern Recognition, pages 506–513, 2004.
[13] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan, November 2007.
[14] G. Klein and D. Murray. Improving the agility of keyframe-based SLAM. In European Conference on Computer Vision, 2008.
[15] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In Proc. Eighth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'09), Orlando, October 2009.
[16] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vision, 81:155–166, February 2009.
[17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[18] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 950–961. VLDB Endowment, 2007.
[19] M. Martinez, A. Collet, and S. S. Srinivasa. MOPED: A scalable and low latency object recognition and pose estimation system. In IEEE International Conference on Robotics and Automation. IEEE, 2010.
[20] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP, 2009.
[21] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
[22] P. L. Rosin. Measuring corner properties. Computer Vision and Image Understanding, 73(2):291–307, 1999.
[23] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, volume 1, 2006.
[24] E. Rosten, R. Porter, and T. Drummond. Faster and better: A machine learning approach to corner detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 32:105–119, 2010.
[25] S. Se, D. Lowe, and J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. International Journal of Robotic Research, 21:735–758, August 2002.
[26] S. N. Sinha, J.-M. Frahm, M. Pollefeys, and Y. Genc. GPU-based video feature tracking and matching. Technical report, In Workshop on Edge Computing Using New Commodity Architectures, 2006.
[27] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. International Conference on Computer Vision, page 1470, 2003.
[28] N. Snavely, S. M. Seitz, and R. Szeliski. Skeletal sets for efficient structure from motion. In Proc. Computer Vision and Pattern Recognition, 2008.
[29] G. Wang, Y. Zhang, and L. Fei-Fei. Using dependent regions for object categorization in a generative framework, 2006.
[30] A. Weimert, X. Tan, and X. Yang. Natural feature detection on mobile phones with 3D FAST. Int. J. of Virtual Reality, 9:29–34, 2010.