Robot Localisation Using SIFT
and Active Monocular Vision
(Bachelor thesis)
Michiel J. Holtkamp (holtkamp@ai.rug.nl)
Sjoerd de Jong (sdejong@ai.rug.nl)
September 7, 2006
Supervisor: Gert Kootstra
Department of Artificial Intelligence
University of Groningen
Abstract
In our experiment a robot has to be able to localise itself by recognising a previously seen
location. First the robot has to collect a set of images for every location, covering the
complete surroundings of the robot. From these images distinguishing features will be
extracted using sift (Scale Invariant Feature Transform). These features will be saved
in a database as keypoints. When the robot is placed randomly at one of the locations,
newly produced keypoints will be compared with those in the database. More matching
keypoints yield a higher location score. The use of sift in our approach to the localisation
task proved to be quite reliable.
Contents

1 Introduction
  1.1 The localisation problem
  1.2 Our research
2 Background
  2.1 Different models for visual localisation
  2.2 The SIFT algorithm
  2.3 Models using the SIFT algorithm
3 Model
  3.1 Keypoint extraction
  3.2 The matching process
  3.3 Robust keypoint selection by active vision
4 Experiment
  4.1 Creating the database
  4.2 The benchmark
  4.3 The robustness test
5 Results
  5.1 Benchmark results
  5.2 Robustness test results
  5.3 Keypoint pruning results
6 Discussion
  6.1 Multiple matching of keypoints
  6.2 The spread factor
  6.3 Conclusion
  6.4 Further Research
1 Introduction
1.1 The localisation problem
In robotics the ultimate goal is to build a fully autonomous robot, which can find its
way and act on its own. Two of the main things such a robot has to be capable of are
building a map of its surrounding environment and localising itself relative to certain
locations in the environment using that map. These two capabilities are known as slam
(Simultaneous Localisation And Mapping). In this report we will describe an experiment
in which we try to let a robot do a part of slam, namely to let it classify its present
location based on visual input. With this experiment we have investigated whether localisation
is possible with features extracted from images using sift (Scale Invariant Feature Trans-
form) [Lowe, 1999] (summarized in section 2.2). The robot uses active vision to acquire
images; to represent a location, it uses sequential images of its surroundings instead of a
single image from one viewpoint. The second part of our research consisted of investigating
the effect of pruning the sift keypoint database. In other research stereo or trinocular
vision is often used; in our research monocular vision is used instead.
1.2 Our research
In our experiment the robot, a standard Pioneer-2-DX from ActivMedia, has to be able
to localise itself by recognising a location. The robot’s representation of a location con-
sists of a set of keypoints extracted from the visual input. We will investigate whether
recognition is possible using only the comparison between sets of keypoints.
The robot has to collect a set of images for every location. In between capturing two
sequential images, the robot rotates. Every set consists of images covering the complete
surroundings of the robot. From every image distinguishing features will be extracted
using sift. These features will be saved as keypoints per location and per orientation
(relative to starting position) in which the image was captured at that location. All the
keypoints together form a descriptive database of the locations.
For the localisation task, the robot is placed randomly at one of the locations. The newly
produced keypoints from the visual input at that location will be compared with saved
keypoints. A location can be recognised when keypoints match. It is of course possible
that the new keypoints match with keypoints of more than one location. In such a case
the set of keypoints representing the current location is ambiguous. To disambiguate the
representation of the location, active vision is used. The robot will rotate and capture
an image from the next orientation. From this image, new keypoints are extracted. This
procedure will be repeated for every orientation. The localisation task will first be
benchmarked, then it will be tested for robustness to environmental changes.
The database of keypoints may grow very large when all the extracted keypoints of differ-
ent locations are saved. The matching process with a smaller database, with only the most
robust keypoints, will need less computation time than with a larger database and may
have the same matching performance. We will compare both the matching computation
time and performance between the large and small (pruned) database.
2 Background
2.1 Different models for visual localisation
The robotics research field is very broad, with many different ways of thinking. Because
of this broadness, there are different approaches to the localisation problem, ranging from
purely mathematical to more biologically plausible. A lot of
localisation is based on laser or sonar range finders, but we focus on localisation based on
visual input. In this section we will mention some research in this field.
An example of monocular vision with a mathematical approach to the localisation prob-
lem is the research of [Zhang et al., 2004]. In their research they use an underwater robot,
which has to localise itself relative to a number of visual landmarks. The robot's position
relative to the visual landmarks is computed with geometric information on these land-
marks, which are in a particular formation. When its relative position in the Camera
Centered Coordinate System (cccs) is known, the position in the World Centered Coor-
dinate System (wccs) can be computed. This last step is possible because the landmarks,
which are in fact buoys floating on the water surface, are equipped with a gps receiver, so
their exact position in the wccs is known.
[Prasser et al., 2005] proposed the rat brain inspired RatSLAM model, which is a more
biologically plausible solution for the localisation problem. In contrast to Zhang et
al., in the RatSLAM approach no geometrical information is computed and an omnidirec-
tional view is used. Only the visual appearance of the environment is used to represent
locations, by using the histograms of hue and saturation. The RatSLAM model uses the
combined information from readings from internal odometric sensors and external visual
information for localisation. RatSLAM is able to use the histogram representations to
build a representation of an outdoor environment. Localisation is done by matching the
histograms from the current location with stored histograms. Localisation is only possible
by combining the odometric sensor information and the processed external visual infor-
mation.
2.2 The SIFT algorithm
Other approaches to localisation use sift [Lowe, 1999] and [Lowe, 2004] for the repre-
sentation of the visual input. sift extracts features from an image (contrast-rich parts,
especially corners). Each feature is described by a 128-dimensional vector [Lowe, 2004].
Together with some additional data this vector forms a keypoint. To match a keypoint
with another keypoint, the Euclidean distance is calculated between the two vectors. If the
ratio of the distance to the best match and the distance to the next-best match (the
distance ratio) is below a threshold, the two keypoints match.
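As a minimal sketch of this distance-ratio test (in Python with NumPy; the language, the function name and the threshold value of 0.8 are illustrative assumptions and are not taken from our implementation):

```python
import numpy as np

def ratio_test_match(query, database, ratio_threshold=0.8):
    """Match one 128-dimensional descriptor against a set of descriptors.

    Returns the index of the best-matching database descriptor, or None if
    the best match is not clearly better than the next-best one. The 0.8
    threshold is an assumed, illustrative value.
    """
    # Euclidean distances between the query vector and every database vector
    distances = np.linalg.norm(database - query, axis=1)
    if len(distances) < 2:
        return None
    best, next_best = np.argsort(distances)[:2]
    if distances[best] / distances[next_best] < ratio_threshold:
        return int(best)
    return None
```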
To extract sift features from an image, the image is smoothed with a Gaussian ker-
nel. By subtracting the smoothed image from the original image, a difference-of-Gaussian
image is produced. This first difference-of-Gaussian image forms the lowest layer of a pyra-
mid of difference-of-Gaussian images. The image used for creating the second layer of the
pyramid is created by resampling the smoothed image to a lower resolution. This pro-
cedure is repeated until the scale of the resampled image is too small for feature detection.
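A minimal sketch of this construction is given below (Python with NumPy and SciPy; the tools are our own choice, and using a single difference-of-Gaussian layer per octave is a simplification of the full method in [Lowe, 2004]):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigma=1.6, octaves=3):
    """Build a simplified pyramid of difference-of-Gaussian images.

    Per octave: smooth the image with a Gaussian kernel, subtract the
    smoothed image from the input to obtain a difference-of-Gaussian layer,
    then resample the smoothed image to half the resolution for the next
    octave. Sigma and the number of octaves follow section 3.1.
    """
    pyramid = []
    current = np.asarray(image, dtype=float)
    for _ in range(octaves):
        smoothed = gaussian_filter(current, sigma)
        pyramid.append(current - smoothed)   # one difference-of-Gaussian layer
        current = smoothed[::2, ::2]         # resample to a lower resolution
    return pyramid
```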
Feature locations are identified by minima and maxima in the pyramid of difference-of-
Gaussian images. These extrema can be found by comparing a pixel with its surrounding
pixels within the image and with those in the next and previous scale. The 128-dimensional
vector describes these extrema. The orientation of the keypoint is found by calculating
an angle from the pixels above, below and next to the current pixel. The scale of the
image in which a feature is detected, and the feature’s orientation, X- and Y-co-ordinates
and vector are saved to form a keypoint.
2.3 Models using the SIFT algorithm
In [Se et al., 2001a], [Se et al., 2001b] and [Se et al., 2002] sift keypoints are extracted
from a trinocular camera so 3D information can be calculated for every keypoint. The key-
points with 3D location information represent a location and are used for slam. During
the robot’s driving, the keypoints are selected for ego-motion (odometry calculated from
visual input) by comparing the expected motion of the robot with the change in visual
input. These articles focus more on the mapping than on the localisation part of slam.
[Se et al., 2005] builds upon the previous articles, but focuses more on the localisation
part of slam, by dividing the 3D map into submaps and thus providing global localisation.
[Kosecka and Yang, 2004] also used the sift algorithm, but they used monocular vision.
They combined the information extracted from the visual input with a Hidden Markov
Model to exploit the potential neighbourhood relationships of visited locations. During
the driving, Kosecka and Yang split the visual input into different locations. Subsequent
locations from their experiment form a route. Location recognition worked well, but they
needed knowledge about neighbourhood relationships and the use of Hidden Markov Mod-
els to improve the overall location recognition.
3 Model
3.1 Keypoint extraction
The robot will capture a set of images that covers its complete surroundings. Images will
be captured at intervals of 10°, which fits multiple times within the camera's horizontal
field of view of 48.8°. This ensures overlap of adjoining images so that the
same keypoints can be found in more than one image. This property is used in matching
(section 3.2) and pruning (section 3.3). For every image, keypoints will be extracted using
sift and stored in a database. The database consists of multiple sets of keypoints. Every
set of keypoints represents an orientation in a location, e.g. a database for 10 locations
with 36 orientations each, will contain 360 sets of keypoints. Every set is labeled with
the location and orientation (relative to the starting orientation of the robot) in which
the keypoints were found. The sift parameter standard deviation is set to 1.6 and the
number of octaves to 3. These values are copied from [Lowe, 2004].
3.2 The matching process
To compare the current location with the known locations, the keypoints extracted from
the images are compared with the keypoints from the database. To compare two keypoints,
the Euclidean distance is calculated over the 128-dimensional vectors of the keypoints. If
this distance is under a threshold, the two keypoints match. The threshold is set to
0.35; we found this value by experimenting.
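As a minimal sketch of this matching step (Python with NumPy; the array layout is an assumption), the function below counts how many keypoints of the current image match the keypoints stored for one orientation of one location:

```python
import numpy as np

def count_matches(image_descriptors, db_descriptors, threshold=0.35):
    """Count image keypoints that match a database orientation.

    Both arguments are (N, 128) arrays of SIFT descriptor vectors. A keypoint
    matches if the Euclidean distance to its nearest database descriptor is
    below the threshold (0.35, found by experimenting, section 3.2); this
    assumes the descriptors are normalised as in our implementation.
    """
    matches = 0
    for descriptor in image_descriptors:
        distances = np.linalg.norm(db_descriptors - descriptor, axis=1)
        if distances.min() < threshold:
            matches += 1
    return matches
```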
Every keypoint extracted from the image of the current orientation is matched to every
keypoint in the database. The number of matched keypoints per orientation per location
in the database is used to calculate an orientation score, see equation 1.
$$\mathrm{orientation\ score}_{ij} =
\begin{cases}
\dfrac{m_{ij}}{k_{ij}} & \text{if } m_{ij} > \mathrm{min\ orientation\ matches} \\[2mm]
0 & \text{if } m_{ij} \le \mathrm{min\ orientation\ matches}
\end{cases} \tag{1}$$
In this equation, $m_{ij}$ is the number of keypoints matched with keypoints in orientation $i$
of location $j$ from the database, and $k_{ij}$ is the number of keypoints in orientation $i$ of location
$j$ in the database. In the case of an orientation with a small number of keypoints, it
only takes a small number of matches to yield a high orientation score. For example, if
an orientation has only one keypoint, the score will be either 0.0 or 1.0 (depending on
its matching), making it a binary score, while we want a gradual score. For situations
like this, the parameter min orientation matches (defined as 3) is used. If the number of
matches is smaller than min orientation matches, the orientation score will be 0, in effect
ignoring this orientation.
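Equation 1 translates directly into a small function; the sketch below (Python) is only illustrative:

```python
def orientation_score(m_ij, k_ij, min_orientation_matches=3):
    """Orientation score from equation 1.

    m_ij: number of matched keypoints for orientation i of location j.
    k_ij: number of keypoints stored for orientation i of location j.
    Orientations with too few matches are ignored by scoring them 0.
    """
    if m_ij > min_orientation_matches:
        return m_ij / k_ij
    return 0.0
```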
After each orientation score is calculated, the scores are summed per location in the
database to form the location score for each location. In some cases, the orientations that
score highest are not close to each other, while we expect them to be adjoining because of
the overlap mentioned in section 3.1. Figure 1 shows two locations with dots representing
orientations, the large dots representing the three largest orientation scores. In the right
image the orientations are ambiguous (because they are not adjoining) and we want to
lower the location score to compensate for this. For this purpose, we introduce the spread
factor.
Figure 1: Lowest (left image) and highest (right image) orientation
score spread. The total number of dots in the figure is symbolic, chosen
for clarity.
The largest angle between any of the three highest orientation scores is taken as a measure
of spread. A large angle means low spread (left image in figure 1). Likewise, a small angle
means high spread (right image in figure 1). The lowest spread should not influence the
score and yields a spread factor of 1.0 so we can multiply it with the location score. The
highest spread should yield a spread factor of 0.5, not 0.0, because high spread generally
means less likely matches, not unlikely matches. The spread factor is calculated with
equation 2, where $d_j$ is the maximum difference between any of the three highest scoring
orientations in location $j$, $d_{\min}$ is the smallest possible difference ($360^\circ/3$) and $d_{\max}$ is
the greatest possible difference ($360^\circ - (3-1) \cdot 10^\circ$).

$$\mathrm{spread\ factor}_j = 0.5 + 0.5 \cdot \frac{d_j - d_{\min}}{d_{\max} - d_{\min}} \tag{2}$$
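A possible implementation of equation 2 is sketched below (Python). Reading $d_j$ as the largest angular gap between consecutive orientations among the three highest-scoring ones is our interpretation; we chose it because it reproduces the stated values of $d_{\min}$ and $d_{\max}$.

```python
def spread_factor(top3_orientations_deg, d_min=120.0, d_max=340.0):
    """Spread factor from equation 2.

    top3_orientations_deg: orientations (degrees) of the three highest
    orientation scores of one location. d_j is taken as the largest angular
    gap between consecutive orientations going around the circle, so that
    three adjoining orientations give d_j = 340 (factor 1.0) and an even
    spread gives d_j = 120 (factor 0.5).
    """
    a = sorted(angle % 360 for angle in top3_orientations_deg)
    gaps = [a[1] - a[0], a[2] - a[1], 360 - (a[2] - a[0])]
    d_j = max(gaps)
    return 0.5 + 0.5 * (d_j - d_min) / (d_max - d_min)
```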
The location score is calculated with equation 3, where N is the number of orientations
from location j. For every image matched to the database, the location scores of all
locations are summed to form the cumulative location score.
$$\mathrm{location\ score}_j = \mathrm{spread\ factor}_j \cdot \sum_{i=1}^{N} \mathrm{orientation\ score}_{ij} \tag{3}$$
To conclude whether a location stands out from the rest, the scores must be compared to
each other. Simply taking the highest score is not enough: the next-highest score may be
very similar, in which case the highest score does not really stand out. To
measure this, the ratio of the next-best and best scores is taken. A decision is not based
on this ratio, but we will use the ratio to analyse the scores in section 5. In our model,
the location is classified by selecting the location with the highest score.
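To make the scoring pipeline concrete, the sketch below (Python) combines equation 3 with the classification rule and the next-best/best ratio; the data structures are illustrative assumptions, not a description of our actual program.

```python
def classify_location(orientation_scores, spread_factors):
    """Combine equation 3 with the classification rule from section 3.2.

    orientation_scores: dict mapping location id -> list of orientation
    scores (equation 1), one per orientation of that location.
    spread_factors: dict mapping location id -> spread factor (equation 2).
    Returns the winning location and the next-best/best ratio that is used
    in section 5 to judge how clearly the winner stands out.
    """
    location_scores = {
        loc: spread_factors[loc] * sum(scores)
        for loc, scores in orientation_scores.items()
    }
    ranked = sorted(location_scores, key=location_scores.get, reverse=True)
    best, next_best = ranked[0], ranked[1]
    ratio = location_scores[next_best] / location_scores[best]
    return best, ratio
```

The sketch assumes at least two candidate locations and a non-zero best score.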
3.3 Robust keypoint selection by active vision
To calculate the scores for every location, every keypoint in the image is compared to
every keypoint in the database. With large databases, this results in larger computation
time and a larger memory footprint. To decrease the size of the database, fewer keypoints
should be stored. The keypoints that do end up in the database should be robust, mean-
ing they can be found in different situations.
In [Se et al., 2002] the pruning of keypoints is done using a threshold for the miscount;
the number of times a keypoint is not used for matching between subsequent frames. This
miscount is only updated if a keypoint is expected to be visible at the current frame. If
a keypoint crosses the threshold, the keypoint is pruned. This method uses information
about the current and expected next X- and Y-co-ordinates of the keypoints. Our method
is similar but simpler: it does not use the co-ordinates of the keypoints, but instead relies
on the overlap mentioned in section 3.1. Keypoints in the current orientation should be
present in the orientation at the left, at the right, or possibly in both. In figure 2 an
example of pruning is shown.
Figure 2: An example of pruning.
In this figure, a + in the middle image represents a keypoint that matches with a keypoint
in the right image. An X represents a keypoint that matches with a keypoint in the left
image. The large dots in the middle image represent keypoints which do not match with
keypoints from the left or the right image. These keypoints are pruned.
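A minimal sketch of this pruning step (Python with NumPy; the array layout and the reuse of the matching threshold from section 3.2 are our assumptions):

```python
import numpy as np

def prune_orientation(current, left, right, threshold=0.35):
    """Keep only keypoints of one orientation that also match a keypoint in
    the adjoining orientation to the left or to the right.

    current, left, right: (N, 128) arrays of descriptors of three adjoining
    orientations of the same location. A keypoint matches a neighbour if its
    nearest descriptor there lies within the Euclidean distance threshold.
    """
    kept = []
    for descriptor in current:
        near_left = len(left) > 0 and \
            np.linalg.norm(left - descriptor, axis=1).min() < threshold
        near_right = len(right) > 0 and \
            np.linalg.norm(right - descriptor, axis=1).min() < threshold
        if near_left or near_right:
            kept.append(descriptor)
    return np.array(kept)
```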
4 Experiment
4.1 Creating the database
Ten locations are selected within our university building. These ten locations have to be
distinguished during the localisation task. The robot is placed four times at every location
with small variations in location (a couple of decimetres at most) and orientation. These
variations cause the robot to be at the same location, but with a different view. The
keypoints extracted from the 36 captured images are stored in the database. Because the
robot is placed four times at every location, there are A, B, C and D variants of every
location in the database. When the database is created, a copy is made which is pruned
as explained in section 3.3.
During the first two weeks of our project we could not let the robot turn with a constant
angle between capturing two images. The image sets captured during week 0 and week 1
(image sets 0-9 and 10-19) are not used for the experiment. Only the images captured in
week 2 are used; for this reason the locations are numbered 20 to 29.
4.2 The benchmark
In order to test the database, we made a benchmark in which K-fold cross validation is
used. The benchmark validates the localisation performance using the created database.
In the first fold all the A variants of the ten locations are matched with the other variants
(B, C and D). For all ten test locations, the location scores for the ten potentially matching
locations are stored per orientation of the current location. This procedure is repeated for
the B, C and D variants. To test the localisation performance using the pruned database,
the benchmark is repeated. This time one of the variants from the unpruned database is
matched with the other variants of the pruned database. The unpruned set is used be-
cause the output of the robot in the final experiment will not be pruned before matching.
Computation time during matching with both databases will be recorded for comparison.
4.3 The robustness test
After the benchmark the robustness of the localisation task for environmental changes will
be tested. The robot is randomly placed at the ten locations. At these locations the robot
captures new images, from which it will extract keypoints with the sift algorithm. These
keypoints will be compared with the saved keypoints in the database (this time containing
all sets, A, B, C and D). Again, for all ten test locations the location scores for the ten
potentially matching locations are stored per orientation. The robustness test is performed
twice, once using the unpruned database and once using the pruned database. The
major difference between the benchmark and the robustness test is the newly made test
set for the latter. This test set is made from sets of images from the same ten locations,
but is created approximately two months later than the original sets. Initially the robot
was supposed to run the sift algorithm on the collected images, and match the extracted
keypoints with keypoints in the database. Due to a very slow processor in the robot, the
robot only collects images. The actual keypoint extraction and matching is performed on
a different computer.
5 Results
5.1 Benchmark results
The percentual location scores (the location score divided by the sum of location scores)
for the benchmarks are shown per test location in figure 4. On the horizontal axis the
test locations are shown, on the vertical axis the matched locations are shown. In every
column of one subplot, the location scores for a test location (relative to the total score
of that test location) are shown in color values. The green dots indicate the highest value
per test location. On the left side of figure 4 the results of the benchmarks in which
matching is done with the unpruned keypoints database are shown. On the right side
of the figure the results of matching with the pruned keypoints are shown. Ideally, a
white diagonal from bottom-left to top-right should be seen if all locations are classi-
fied correctly. In figure 4 diagonals are clearly seen, which is what we expected to see.
Since the overall results are this good, we will only highlight some of the misclassifications.
In benchmarks A and B all locations have the highest location score at their own location.
This means that most of the keypoints from these locations are classified correctly. In
benchmark C some errors occur: location 20 receives the highest location score at location
23 when matching with both the unpruned and pruned database. In benchmark C loca-
tion 28 has the highest location score at location 29 when the pruned database is used.
When the unpruned database is used for benchmark D, locations 28 and 29 both have
a relatively high score at location 26. Benchmark D does not show high location scores
at wrong locations when the pruned database is used.
We will highlight location 20 from benchmark C. Examining the progress of the (cumu-
lative) location scores shows that at orientation 260, location 23 has a steep rise in score.
Further examination shows that this orientation matches strongly with location 23b,
orientation 140. Figure 3 shows this match. The two images show the two locations,
left 20c and right 23b. Keypoints are marked with white circles; matching keypoints are
connected by white lines. As can be clearly seen, one keypoint of location 23b matches
multiple keypoints of location 20c. This results in an orientation score (not yet corrected
with the spread factor) of 11/4 = 2.75, which is much higher than the intended maximum
of 1.0. This multiple matching is due to a programming error and is discussed in section 6.1.
Figure 3: Match between location 20c, orientation 260 and location
23b, orientation 140.
Figure 4: Percentual location scores for the benchmarks (top to bottom:
A, B, C and D, left: unpruned, right: pruned).
5.2 Robustness test results
Figure 5 shows the percentual location scores for the robustness test. Despite the
environmental changes, nine out of the ten locations are classified correctly.
Figure 5: Percentual location scores for the robustness test, left: un-
pruned, right: pruned.
Figure 6 shows the progress of the (cumulative) location scores of the ten locations at test
location 20 as an example. The X-axis shows the locations, the Y-axis shows the time
steps, or the orientations (0° to 350°), and the Z-axis shows the score of that location over
time. The scores of all locations grow, but the score for location 20 clearly grows faster
and ends up higher.
Figure 6: Progress of cumulative location scores at location 20, pruned
(left) and unpruned (right).
In the robustness test, location 21 is incorrectly classified as location 23. Figure 7 shows
the progress of (cumulative) location scores at location 21. In the first nine time steps,
location 21, which is the correct location, is favoured. Then in four time steps the scores of
location 23 rise substantially above the rest and remain so for the rest of the orientations.
In the last three orientations, location 21 again rises, but not nearly as high as location 23.
Figure 7: Progress of cumulative location scores at location 21, pruned
(left) and unpruned (right).
Examining the images of location 21 shows that only large wooden doors are present on
the images of the four orientations where location 23 rises substantially, doors that are
also on the images of locations 20, 21 and 24, but not on images in location 22. Further
examination of the images of location 21 shows that for the localisation task, the robot
was displaced some 2 metres in relation to the benchmark data. This caused the robot
to stand right in front of the large wooden doors instead of in front of the elevator,
which could explain why location 21 does not score as high. Also, comparing the
images of the benchmark set and of the localisation task shows that location 21 had changed. In
the time between the capture of the two sets, a fire extinguisher was placed, the copier was
moved a metre or so (and rotated), a table was added and the benches were moved to
make room for the copier. Also, some small objects standing on the floor were moved
or rotated. These changes are present in orientations 150 through 350; in figure 7 this
corresponds to the flat area where all location scores barely change, except location 21 at the
last orientations. In the last four orientations, the stairs are present (which have lots of key-
points) and coincide with a small rise in the score. The stairs are also visible in the first
orientations in which only location 21 rises.
The last observation to point out is that locations 23 and 24 are almost identical in
reality, but differ a lot according to the scores in figure 7. Examination of the benchmark
images shows that location 24 has dark marks on the doors, like shoe sole marks. These
marks produce keypoints. Neither location 20, 21 nor location 23 has these marks on the
doors, which could explain why location 24 does not score as high as location 23. The
marks in location 24 are not present anymore in the robustness test images.
5.3 Keypoint pruning results
The number of keypoints extracted from the 36 images of the four different variants of the
ten locations added up to 78449, as seen in table 1. This is an average of approximately 55
keypoints per image. After pruning the keypoint database 24265 keypoints are left. This
is an average of approximately 17 keypoints per image. The size of the database is thus
reduced by a factor of 3.2. Running the benchmark took 18 minutes using the unpruned
database, while running it with the pruned database took 5 minutes, 3.6 times as fast.
The robustness test took 23 minutes using the unpruned database, while using the pruned
database it took 7 minutes, 3.3 times as fast.
The difference in the number of keypoints found for the locations is due to the different
appearances of the locations. Some locations have a lot of different keypoint-generating
features, others do not. For example, locations 25 and 26 have a small number of key-
points, because these locations are in a small corridor with walls that are uniformly white.
Because almost half of all the images grabbed at these locations only show parts of these
uniformly white walls, the number of generated keypoints remains small.
location   unpruned   pruned
20         13488      5032
21          7595      3335
22          8720      3830
23          4068       634
24          4545      1002
25          2862       105
26          2904       159
27          6155      1682
28         11595      3864
29         16517      4622
total      78449     24265
Table 1: Number of keypoints per location, unpruned and pruned.
We are interested in the ratio between the scores of the location with the highest score
and the location with the second highest score. These next-best/best ratios are shown for
all locations and all benchmarks in figure 8. A smaller ratio means a better possibility
of recognition. Observation shows that locations with a ratio smaller than 0.6 have the
highest location score at their own (correct) location.
(Four panels, one per benchmark A to D: next-best/best score per location for locations 20 to 29, vertical axis from 0 to 1; o = unpruned, x = pruned.)
Figure 8: Comparison next-best/best ratios of unpruned and pruned
benchmarks.
The final cumulative location scores are shown in appendix A, for every test location,
pruned and unpruned. The dashed line shows the height of the next-best score. The
changes between unpruned and pruned can be described in three categories: higher scores,
almost no change in scores and lower scores. Locations 20, 28 and 29 have higher pruned
scores than unpruned scores, but the relative scores have not changed much. Locations
21, 22, 23 and 27 have almost no change in both absolute and relative scores. Finally,
locations 24, 25 and 26 have much lower scores and the relative scores did change as well,
although the locations with the highest scores did not change.
In the robustness test, the locations belonging to the first category have higher absolute
scores when matching with the pruned database, but have almost no changes in the rel-
ative scores. The absolute scores are higher due to a smaller number of total keypoints
and relatively more matching keypoints. This is expected and also desirable. Locations
belonging to the second category have almost no changes at all in the absolute scores,
while there is a smaller number of total keypoints (see table 1). This can mean two things:
either both matching and non-matching keypoints are pruned, or non-matching keypoints from
orientations that did not score anyway are pruned. We have not examined this further.
Locations belonging to the third category have lower absolute scores; because these
locations also have a smaller number of keypoints in the pruned version, this means that
matching keypoints are pruned. However, the locations are still correctly classified with
a fair next-best/best ratio.
(Next-best/best score per location for locations 20 to 29 in the robustness test, vertical axis from 0 to 1; o = unpruned, x = pruned.)
Figure 9: Comparison next-best/best ratios of experiments with the
unpruned and the pruned database.
The next-best/best ratios for the robustness test are shown in figure 9. In the benchmarks
23 out of the 40 ratios decreased or remained the same when using the pruned database.
In the robustness test 4 out of the 10 ratios decreased when the pruned database was
used. From these results no general improvement or degradation can be concluded.
6 Discussion
6.1 Multiple matching of keypoints
In section 5.1, we mentioned multiple matching. Investigating the effects of this programming
error shows that multiple keypoint matching occurs now and then, but most of the time
a keypoint is matched no more than 3 times. This means that this keypoint counts as
3 keypoints. Since only one or two keypoints out of 20 to 180 keypoints match multiple
times, this will not cause a large increase in orientation score, except in some cases. A
rerun of the robustness test with multiple matching suppressed shows that locations are
better classified. The incorrect locations receive lower scores, so the correct locations
score relatively higher. Not all incorrect locations receive lower scores, thus some in-
correct locations receive relatively higher scores. The preliminary results indicate better
classifications when multiple matching is suppressed.
6.2 The spread factor
While researching the effects of multiple keypoint matching, we also checked whether the
spread factor actually influences the final scores. The preliminary results indicate that
the spread factor influences the scores of both incorrect and correct locations, so possibly
it can be omitted from our model altogether.
6.3 Conclusion
We conclude that sift is a useful algorithm for localisation based on visual input. Con-
cerning the robustness of our system, we can conclude that it is robust against small
displacements (a couple of decimetres) in the location, but less robust to larger displace-
ments (a couple of metres). It is also robust to small changes in the location, but less
robust to moving, rotating or adding distinctive objects. Our conclusion about pruning
is that pruning speeds up the process of matching, as expected, while the percentual lo-
cation scores barely change.
6.4 Further Research
The first suggestion we have is to research the effect of altering the sift parameters.
These parameters influence the number of keypoints found and therefore also the speed
and matching performance.
Our original plan was to correct the cumulative location score with an “orientation se-
quence factor”. This factor should have represented the sequentiality of multiple orien-
tations matched to the database. For example, if an image matches best with the
orientation at 110° in location 21, the next image should also match best with the next
orientation, at 120° in location 21. If it does not, it should receive a penalty. However,
due to time constraints we have chosen not to implement this feature and instead focused
on getting the rest to work. Although we did not implement this, we feel that the spatial
relationship between orientations is useful information.
The next suggestion also concerns matching, using the spatial information of the key-
points in the image. When matching keypoints from two different images, the X- and
Y-co-ordinates should have more or less the same values, or every keypoint should be
translated to the left or right by an equal amount (this is expected when the robot is rotated
between capturing two images). If one matching keypoint differs from the rest in this
translation, it is likely that this match is incorrect. Large differences in Y-co-ordinates
between two images are odd, because this would mean that either the robot is elevated or
that the robot is much closer or further away from an object. This information can not
only be used for matching, but also to improve pruning.
Furthermore, we have some other suggestions to reduce the database even more and indi-
rectly improve the speed of the matching. The first is to use other orientation intervals,
e.g. intervals of 15° instead of 10°, so that only 24 orientations exist in the database
instead of 36, without losing the overlap of images. This would theoretically reduce the
database to two-thirds of its original size (assuming a uniform distribution of keypoints).
Because it would also reduce the number of orientations taken during the localisation
task, the number of keypoints to match is reduced to a factor of 2/3 · 2/3 = 4/9 of its
original number. Since the matching computation time is linear in the number of keypoints,
this change alone could make the matching more than twice as fast.
The last suggestion is to prune the keypoints of the newly captured images in the same
way as the database is pruned, but this means that matching can start only after at least
three images are captured. Assuming this reduces the number of keypoints to one third of
its size, as happened with the database, the matching process could be three times as fast.
Acknowledgements
We would like to thank Jelmer Ypma for making available his implementation of Lowe’s
sift algorithm.
References
[Kosecka and Yang, 2004] Kosecka, J. and Yang, X. (2004). Location recognition and
global localization based on scale-invariant keypoints. Workshop on statistical learning
in vision, ECCV.
[Lowe, 1999] Lowe, D. (1999). Object recognition from local scale-invariant features.
Proceedings of the Seventh International Conference on Computer Vision (ICCV’99).
[Lowe, 2004] Lowe, D. (2004). Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision.
[Prasser et al., 2005] Prasser, D., Milford, M., and Wyeth, G. (2005). Outdoor simulta-
neous localisation and mapping using RatSLAM. International Conference on Field and
Service Robotics, Port Douglas, Australia.
[Se et al., 2001a] Se, S., Lowe, D., and Little, J. (2001a). Local and global localization
for mobile robots using visual landmarks. Proceedings of the IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 414–420.
[Se et al., 2001b] Se, S., Lowe, D., and Little, J. (2001b). Vision-based mobile robot
localization and mapping using scale-invariant features. Proceedings of the IEEE In-
ternational Conference on Robotics and Automation (ICRA), pages 2051–2058.
[Se et al., 2002] Se, S., Lowe, D., and Little, J. (2002). Mobile robot localization and
mapping with uncertainty using scale-invariant visual landmarks. The International
Journal of Robotics Research., 21(8):735–758.
[Se et al., 2005] Se, S., Lowe, D., and Little, J. (2005). Vision-based global localization
and mapping for mobile robots. IEEE Transactions on Robotics, 21(3):364–375.
[Zhang et al., 2004] Zhang, P., Milios, E., and Gu, J. (2004). Underwater robot localiza-
tion using artificial visual landmarks. IEEE International Conference on Robotics and
Biomimetics, Shenyang, China, paper no. 67.
Appendix A: Results robustness test
The final cumulative location scores for every test location from the robustness test are
shown below. On the left side the location scores for the test with the unpruned database
are shown, on the right side the location scores for the test with the pruned database.
The dashed line shows the height of the next-best score.