This is a preprint of an article accepted for publication in the Journal of Field Robotics. © Copyright 2015 Wiley Periodicals, Inc.
Air-ground Matching: Appearance-based GPS-denied Urban Localization of
Micro Aerial Vehicles
András L. Majdik, Damiano Verda, Yves Albers-Schoenberg, Davide Scaramuzza
Abstract In this paper, we address the problem of globally
localizing and tracking the pose of a camera-equipped Micro
Aerial Vehicle (MAV) flying in urban streets at low altitudes
without GPS. An image-based global positioning system is
introduced to localize the MAV with respect to the surrounding
buildings. We propose a novel air-ground image matching
algorithm to search the airborne image of the MAV within a
ground-level, geotagged image database. Based on the detected
matching image features, we infer the global position of the
MAV by back-projecting the corresponding image points onto a
cadastral 3D city model. Furthermore, we describe an algorithm
to track the position of the flying vehicle over several frames
and to correct the accumulated drift of the visual odometry
whenever a good match is detected between the airborne and
the ground-level images. The proposed approach is tested on a 2 km trajectory with a small quadrocopter flying in the streets of Zurich. Our vision-based global localization can robustly handle extreme changes in viewpoint, illumination, perceptual aliasing, and over-season variations, thus outperforming conventional visual place-recognition approaches. The dataset is
made publicly available to the research community. To the
best of our knowledge, this is the first work that studies and
demonstrates global localization and position tracking of a
drone in urban streets with a single onboard camera.
SUPPLEMENTARY MATERIAL:
Please note that this paper is accompanied by videos available at: http://youtu.be/CDdUKESUeLc
The dataset used in this work is available at: rpg.ifi.uzh.ch/data/air-ground-data.tar.gz
I. INTRODUCTION
In this paper, we address the problem of localizing and
tracking the pose of a camera-equipped rotary-wing Micro
Aerial Vehicle (MAV) flying in urban streets at low altitudes
(i.e., 10-20 meters from the ground) without GPS. A novel
appearance-based global positioning system to localize and
track the pose of the MAV with respect to the surrounding
buildings is presented.
Our motivation is to create vision-based localization meth-
ods for MAVs flying in urban environments, where the
satellite GPS signal is often shadowed by the presence of the
buildings, or completely unavailable. Accurate localization
The authors are with the Robotics and Perception Group, University of Zurich, Switzerland—http://rpg.ifi.uzh.ch. András Majdik is also affiliated with the Institute for Computer Science and Control, Hungarian Academy of Sciences, Hungary. majdik@sztaki.hu, damiano.verda@ieiit.cnr.it, yves.albers@gmail.com, davide.scaramuzza@ieee.org.
This research was supported by the Scientific Exchange Programme SCIEX-NMS-CH (no. 12.097), the Swiss National Science Foundation through project number 200021-143607 (Swarm of Flying Cameras), and the National Centre of Competence in Research Robotics.
Fig. 1: Illustration of the problem addressed in this work.
The absolute position of the aerial vehicle is computed by
matching airborne MAV images with ground-level Street
View images that have previously been back-projected onto
the cadastral 3D city model.
is indispensable to safely operate small-sized aerial service-
robots to perform everyday tasks, such as goods delivery,
inspection and monitoring, first-response and telepresence in
case of accidents.
Firstly, we address the topological localization problem
of the flying vehicle. The global position of the MAV is
recovered by recognizing visually-similar discrete places in
the topological map. Namely, the air-level image captured
by the MAV is searched in a database of ground-based geo-
tagged pictures. Because of the large difference in viewpoint
between the air-level and ground-level images, we call this
problem air-ground matching.
Secondly, we address the metric localization and position
tracking problem of the vehicle. The metric position of
the vehicle is computed with respect to the surrounding
buildings. We propose the use of textured 3D city models
to solve the appearance-based global positioning problem.
A graphical illustration of the problem addressed in this work
is shown in Fig. 1.
In recent years, numerous papers have addressed the development of autonomous Unmanned Ground Vehicles (UGV), thus leading to striking new technologies, such as self-driving cars. These can map and react in highly-uncertain street environments while using GPS only partially (Churchill and Newman, 2012) or neglecting it completely (Ibañez Guzmán et al., 2012). In the next years, a similar boost in the development of small-sized Unmanned Aerial Vehicles (UAV) is expected. Flying robots will be able to perform a large variety of tasks in everyday life.

Fig. 2: Comparison between airborne MAV (left) and ground-level Street View images (right). Note the significant changes—in terms of viewpoint, illumination, over-season variation, lens distortions, and scene—between the query (left) and the database images (right) that obstruct their visual recognition.
Visual-search techniques used in state-of-the-art place-recognition systems fail at matching air-ground images (Cummins and Newman, 2011; Galvez-Lopez and Tardos, 2012; Morel and Yu, 2009), since, in this case, extreme changes in viewpoint and scale occur between the aerial images and the ground-level images. Furthermore,
appearance-based localization is a challenging problem be-
cause of the large changes of illumination, lens distortion,
over-season variation of the vegetation, and scene changes
between the query and the database images.
To illustrate the challenges of the air-ground image-matching scenario, in Fig. 2 we show a few samples of the airborne images and their associated Google Street View (hereinafter referred to as Street View) images from the dataset used in this work. As observed, due to the different fields of view of the cameras on the ground and aerial vehicles and their different distances to the buildings' facades, the aerial image is often a small subsection of the ground-level image, which additionally consists mainly of highly-repetitive and self-similar structures (e.g., windows) (c.f. Fig. 3). All these peculiarities make the air-ground matching problem extremely difficult to solve for state-of-the-art feature-based image-search techniques.

Fig. 3: Please note that often the aerial MAV image (displayed in mono-color) is just a small subsection of the Street View image (color images) and that the airborne images contain highly repetitive and self-similar structures.
We depart from conventional image-search algorithms by
generating artificial views of the scene in order to overcome
the large viewpoint differences between the Street View and
MAV images, and, thus, successfully solve their matching.
An efficient artificial-view generation algorithm is introduced
by exploiting the air-ground geometry of our system, thus
leading to a significant improvement of the correctly-paired
airborne images to the ground level ones.
Furthermore, to deal with the large number of outliers
(about 80%) that the large viewpoint difference introduces
during the feature-matching process, in the final verification
step of the algorithm, we leverage an alternative solution
to the classical Random Sample Consensus (RANSAC)
approach, which can deal with such a high outlier ratio in
reasonable time.
In this paper, we advance our previous topological local-
ization (Majdik et al., 2013) by computing and tracking the
pose of the MAV using cadastral 3D city models, which
we first introduced in (Majdik et al., 2014). Furthermore,
we present an appearance-based global positioning system
that is able to successfully substitute the satellite GPS for
MAVs flying in urban streets. By means of uncertainty
quantification, we are able to estimate the accuracy of the
visual localization system. We show extended experiments
of the appearance-based global localization system on a 2km
trajectory with a drone flying in the streets of Zurich. Finally,
we show a real application of the system, where the state
of the MAV is updated whenever a new appearance-based
global position measurement becomes available. To the best
of our knowledge, this is the first work that studies and
demonstrates global localization of a drone in urban streets
with vision only.
The contributions of this paper are:
• We solve the problem of air-ground matching between MAV-based and ground-based images in urban environments. Specifically, we propose to generate artificial views of the scene in order to overcome the large viewpoint differences between ground and aerial images and, thus, successfully resolve their matching.
• We present a new appearance-based global positioning system to detect the position of MAVs with respect to the surrounding buildings. The proposed algorithm matches airborne MAV images with geotagged Street View images1 and exploits cadastral 3D city models to compute the absolute position of the flying vehicle.
• We describe an algorithm to track the vehicle position and correct the accumulated drift induced by the on-board state estimator.
• We provide the first ground-truth labeled dataset that contains both aerial images—recorded by a drone together with other measured parameters—and geotagged ground-level images of urban streets. We hope that this dataset can motivate further research in this field and serve as a benchmark.
The remainder of the paper is organized as follows.
Section II presents the related work. Section III describes
the air-ground matching algorithm. Section IV presents the
appearance-based global positioning system. Section V de-
scribes the position tracking algorithm. Finally, Section VI
presents the experimental results.
II. RELATED WORK
Several research works have addressed appearance-based
localization throughout image search and matching in urban
environments. Many of them were developed for ground-
robot Simultaneous Localization and Mapping (SLAM) sys-
tems to address the loop-closing problem (Cummins and
Newman, 2011; Majdik et al., 2011; Galvez-Lopez and Tar-
dos, 2012; Maddern et al., 2012), while other works focused
on position tracking using the Bayesian fashion—such as in
(Vaca-Castano et al., 2012), where the authors presented a
method that also uses Street View data to track the geospatial
position of a camera-equipped car in a city-like environment.
Other algorithms used image-search–based localization for
hand-held mobile devices to detect Point Of Interest (POI),
such as landmark buildings or museums (Baatz et al., 2012;
Fritz et al., 2005; Yeh et al., 2004). Finally, in the recent
years, several works have focused on image localization
with Street View data (Schindler et al., 2007; Zamir and
Shah, 2010). However, all the works mentioned above aim to localize street-level images in a database of pictures also captured at street level. These assumptions are safe in ground-based settings, where there are no large changes between the images in terms of viewpoint. However, as will be discussed later in Section III-E and Fig. 8, traditional algorithms tend to fail in the air-ground setting, where the goal is to match airborne imagery with ground-level one.
1 By geotag, we mean the latitude and longitude data in the geographic coordinate system, enclosed in the metadata of the Street View images.
Most works addressing the air-ground-matching problem
have relied on assumptions different than ours, notably the
altitude at which the aerial images are taken. For instance,
the problem of geo-localizing ground level images in urban
environments with respect to satellite or high-altitude (sev-
eral hundred meters) aerial imagery was studied in (Bansal
et al., 2011; Bansal et al., 2012). In contrast, in this paper we aim specifically at low-altitude imagery, i.e., images captured by safe MAVs flying at 10-20 m above the ground.
A down-looking camera is used in (Conte and Doherty,
2009) in order to cope with long-term GPS outages. The
visual odometry is fused with the inertial sensors measure-
ments, and the on-board video data is registered in a geo-
referenced aerial image. In contrast, in this paper we use
a MAV equipped with side-looking camera, always facing
the buildings along the street. Furthermore, we describe a
method that is able to solve the first localization problem by
using image retrial techniques.
World models, maps of the environment, street-network
layouts have been used to localize vehicles performing planar
motion in urban environments (Montemerlo et al., 2008). Re-
cently, several research works have addressed the localization
of ground vehicles using publicly available maps (Brubaker
et al., 2013; Floros et al., 2013), road networks (Hentschel
and Wagner, 2010) or satellite images (Kuemmerle et al.,
2011). However, the algorithms described in those works are
not suitable for the localization of flying vehicles, because
of the large viewpoint differences. With the advance of the
mapping technologies, more and more detailed, textured 3D
city models are becoming publicly available (Anguelov et al.,
2010), which can be exploited for vision-based localization
of MAVs.
As envisaged by several companies, MAVs will soon be used to transport goods2, medications and blood samples3, or even pizzas from building to building in large urban settings.
Therefore, improving localization at small altitude where
GPS signal is shadowed or completely unreliable is of utmost
importance.
III. AIR-GROUND MATCHING OF IMAGES
In this section, we describe the proposed algorithm to match airborne MAV images with ground-level ones. A pseudo-code description is given in Algorithm 1. Please note that lines 1 to 7 of the algorithm can and should be computed off-line, prior to an actual flight mission. In this phase, previously saved geotagged images I = {I_1, I_2, ..., I_n} are converted into image-feature-based representations F_i (after applying the artificial-view generation method described in the next section) and are saved in a database D_T.
2 Amazon Prime Air
3 Matternet
Algorithm 1: Vision-based global localization of MAVs
Input: A finite set I = {I_1, I_2, ..., I_n} of ground geotagged images
Input: An aerial image I_a taken by a drone in a street-like environment
Output: The location of the drone in the discrete map, i.e., the best match I_b

1   D_T = database of all the image features of I
2   for i = 1 to n do
3       V_i = generate artificial views(I_i)        // details in Section III-A
4       F_i = extract image features(V_i)
5       add F_i to D_T
6   train D_T using FLANN (Muja and Lowe, 2009)
7   c = number of cores
8   // up to this line the algorithm is computed off-line
9   V_a = generate artificial views(I_a)
10  F_a = extract image features(V_a)
11  search approximate nearest-neighbor feature matches for F_a in D_T: M_D = ANN(F_a, D_T)
12  select c putative image matches I^p ⊆ I: I^p = {I^p_1, I^p_2, ..., I^p_c}   // details in Section III-B
13  run in parallel for j = 1 to c do
14      search approximate nearest-neighbor feature matches for F_a in F^p_j: M_j = ANN(F_a, F^p_j)
15      select inlier points: N_j = kVLD(M_j, I_a, I^p_j)
16  I_b = the putative image I^p_j with j = argmax(N_1, N_2, ..., N_c)
17  return I_b
Next, for every aerial image I_a, we perform the artificial-view generation and feature extraction steps (lines 9-10). The extracted features F_a are searched in the database D_T (line 11). We select a finite number of ground-level images using the putative-match selection method (line 12) detailed in Section III-B. Finally, we run in parallel a more elaborate image-similarity test (lines 13-16) to obtain the best matching Street View image I_b to the aerial one I_a. In the next sections, we give further details about the proposed algorithm.
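To make the structure of Algorithm 1 concrete, the following Python sketch outlines the off-line database construction and the on-line query using OpenCV's SIFT and FLANN-based matcher. It is a minimal illustration under several assumptions, not the implementation evaluated in this paper: the helpers generate_artificial_views() and kvld_inlier_count() are hypothetical stand-ins for Sections III-A and III-C, and the candidate ranking uses raw vote counts instead of the orientation-histogram voting of Section III-B.

    # Sketch of Algorithm 1 (generate_artificial_views() and kvld_inlier_count()
    # are assumed helpers, see Sections III-A and III-C).
    import cv2
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    sift = cv2.SIFT_create()

    def extract_features(views):
        """Detect SIFT descriptors on the original image and all artificial views."""
        descs = []
        for v in views:
            _, d = sift.detectAndCompute(v, None)
            if d is not None:
                descs.append(d)
        return np.vstack(descs).astype(np.float32)

    # ---- off-line phase (lines 1-7 of Algorithm 1) ----
    def build_database(ground_images):
        flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
        feats = []
        for img in ground_images:
            f = extract_features(generate_artificial_views(img))
            feats.append(f)
            flann.add([f])
        flann.train()                        # build the KD-tree forest
        return flann, feats

    # ---- on-line phase (lines 9-17 of Algorithm 1) ----
    def localize(aerial_img, flann, feats, ground_images, c=4):
        fa = extract_features(generate_artificial_views(aerial_img))
        matches = flann.match(fa)            # ANN search in the full database (line 11)
        votes = np.bincount([m.imgIdx for m in matches], minlength=len(feats))
        candidates = np.argsort(votes)[::-1][:c]        # c putative matches (line 12)
        def score(j):                                    # lines 14-15, one per core
            return kvld_inlier_count(aerial_img, ground_images[j], fa, feats[j])
        with ThreadPoolExecutor(max_workers=c) as ex:
            scores = list(ex.map(score, candidates))
        return candidates[int(np.argmax(scores))]        # best match I_b (line 16)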
A. Artificial-view generation
Point feature detectors and descriptors—such as SIFT (Lowe, 2004), SURF (Bay et al., 2008), etc.—usually ensure invariance to rotation and scale. However, they tend to fail in case of substantial viewpoint changes (θ > 45°).
Our approach was inspired by a technique initially presented in (Morel and Yu, 2009), where, for complete affine invariance (6 degrees of freedom), it was proposed to simulate all image views obtainable by varying the two camera-axis orientation parameters, namely the latitude and the longitude angles. The longitude angle (ϕ) and the latitude angle (θ) are defined in Fig. 4 on the right. The tilt can thus be defined as tilt = 1/cos(θ). The Affine Scale-Invariant Feature Transform (ASIFT) detector and descriptor (Morel and Yu, 2009) is obtained by sampling various values of the tilt and of the longitude angle ϕ to compute artificial views of the scene. SIFT features are then detected on the original image as well as on the artificially-generated images.

TABLE I: Tilt values for which artificial views were made.
Tilt   √2     2      2√2
θ      45°    60°    69.3°

Fig. 4: Illustration of the sampling parameters for artificial-view generation. Left: observation hemisphere, perspective view. Right: observation hemisphere, zenith view. The samples are marked with dots.
In contrast, in our implementation, we limit the number of considered tilts by exploiting the air-ground geometry of our system. To address our air-ground-matching problem, we sample the tilt values along the vertical direction of the image instead of the horizontal one. Furthermore, instead of the arithmetical sampling of the longitude angle at every tilt level proposed in (Morel and Yu, 2009), we make use of just three artificial simulations, i.e., at 0° and ±40°. We illustrate the proposed parameter-sampling method in Fig. 4 and display the different tilt values in Table I. By adopting this efficient sampling method, we managed to reduce the computational complexity by a factor of six (from 60 to 9 artificial views). We have chosen this particular discretization in order to exploit the geometry of the air-ground-matching problem, and thus obtained a significant increase in the number of airborne images correctly paired to ground-level ones. Furthermore, we limited the number of artificial views in comparison to the original ASIFT technique in order to reduce the computational complexity of the algorithm. Based on our experiments, using a higher number of artificial views does not improve the performance.
In conclusion, the algorithm described in this section
has two main advantages in comparison with the original
ASIFT implementation (Morel and Yu, 2009). Firstly, we
significantly reduce the number of artificial views needed
by exploiting the air-ground geometry of our system, thus,
leading to a significant improvement in the computational
complexity. Secondly, by introducing less error sources into
the matching algorithm, our solution contributes also to
obtaining an increased performance in the global localization
process.
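The following sketch (assuming OpenCV; an illustrative approximation rather than the exact warping code used in our experiments) generates the reduced set of artificial views: for each in-plane rotation of 0° and ±40°, the image is blurred and compressed along its vertical direction by the tilt factors of Table I. In a full implementation one would also store each affine map in order to transfer the detected keypoints back to the original image frame.

    import cv2
    import numpy as np

    TILTS = [np.sqrt(2.0), 2.0, 2.0 * np.sqrt(2.0)]   # 1/cos(45), 1/cos(60), 1/cos(69.3)
    PHIS = [-40.0, 0.0, 40.0]                         # longitude samples (degrees)

    def generate_artificial_views(img):
        views = [img]                                 # features are also detected on the original
        h, w = img.shape[:2]
        for phi in PHIS:
            # rotate the image by the longitude angle phi
            R = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), phi, 1.0)
            rot = cv2.warpAffine(img, R, (w, h))
            for t in TILTS:
                # anti-alias along the direction that will be compressed,
                # then compress the vertical axis by the tilt factor 1/t
                blur = cv2.GaussianBlur(rot, (1, 5), sigmaX=0, sigmaY=0.8 * t)
                view = cv2.resize(blur, (w, int(round(h / t))),
                                  interpolation=cv2.INTER_LINEAR)
                views.append(view)
        return views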
TABLE II: Recall rate at precision 1 (RR-P1) as a function of the number of putative Street View images analyzed in parallel on different cores (NPC: number of parallel cores).

NPC          4      8      16     48     96
RR-P1 (%)    41.9   44.7   45.9   46.4   46.4
B. Putative match selection
One might argue that the artificial-view generation leads
to a significant computational complexity. We overcome this
issue by selecting only a finite number of the most similar
Street View images. Namely, we present a novel algorithm
to select these putative matches based on a computationally-
inexpensive and extremely-fast two-dimensional histogram-
voting scheme.
The selected ground-level candidate images are then subjected to a more detailed analysis that is carried out in parallel on the available cores of the processing unit. The experiments show that very good results are obtained with the proposed algorithm even when only 4 candidate Street View images are selected.
In this step, the algorithm selects a fixed number of putative image matches I^p = {I^p_1, I^p_2, ..., I^p_c}, based on the available hardware. The idea is to select a subset of the Street View images from the total number of all possible matches and to process exclusively these selected images in parallel, in order to establish a correct correspondence with the aerial image. This approach enables a very fast computation of the algorithm. In case there are no multiple cores available, the algorithm can be serialized, but the computational time increases accordingly. The subset of the ground images is selected by searching for the approximate nearest neighbor of all the image features extracted from the aerial image and its artificial views, F_a. The search is performed by using the FLANN library (Muja and Lowe, 2009), which implements multiple randomized KD-tree or K-means tree forests and auto-tuning of the parameters. According to the literature, this method performs the search extremely fast and with good precision, although, for searching in very large databases (100 million images), there are more efficient algorithms (Jégou et al., 2011). Since we perform the search in a limited area, we opted for FLANN.
Further on, we apply an idea similar to (Scaramuzza, 2011), where, in order to eliminate outlier features, just a rotation is estimated between two images. In our approach, we compute the difference in orientation α between the image features of the aerial view F_a and their approximate nearest neighbors found in D_T. Next, by using a histogram-voting scheme, we look for the specific Street View image that contains the most image features with the same angular change. To further improve the speed of the algorithm, the possible values of α are clustered into bins of 5°. Accordingly, a two-dimensional histogram H can be built, in which each bin contains the number of features that vote for a given α in a certain Street View image. Finally, we select the c Street View images that have the maximal values in H.

Fig. 5: Performance analysis in terms of precision and recall when 4, 8, 16, and 48 threads are used in parallel. Please note that by selecting just 3% of the total number of possible matches, more than 40% of the true positive matches are detected by the proposed algorithm.
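The putative-match selection just described can be sketched as follows; the function below is a simplified illustration that assumes the ANN search has already returned, for every aerial feature, the index of the matched database image together with the orientations of the aerial feature and of its nearest neighbor (the variable names are ours, not part of the original implementation).

    import numpy as np

    def select_putative_matches(img_idx, ang_aerial, ang_db, n_images, c=4, bin_deg=5.0):
        """2D histogram voting over (Street View image, orientation difference).

        img_idx    : database-image index returned by the ANN search, one per aerial feature
        ang_aerial : orientation (degrees) of each aerial feature
        ang_db     : orientation (degrees) of its approximate nearest neighbor
        """
        alpha = (np.asarray(ang_aerial) - np.asarray(ang_db)) % 360.0
        bins = (alpha // bin_deg).astype(int)          # cluster alpha into 5-degree bins
        n_bins = int(360.0 / bin_deg)
        H = np.zeros((n_images, n_bins), dtype=int)    # the 2D histogram H
        np.add.at(H, (np.asarray(img_idx), bins), 1)   # one vote per matched feature
        scores = H.max(axis=1)                         # best orientation bin per image
        return np.argsort(scores)[::-1][:c]            # keep the c highest-scoring images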
To evaluate the performance of our algorithm, we run
several tests using the same 2km-long dataset and test param-
eters, and only modifying the number of selected candidate
Street View images, i.e., the number of parallel cores. Fig.
5 shows the obtained results in terms of recall rate4 and precision rate5 for 4, 8, 16, and 48 selected candidate Street View images (parallel cores). The plot shows that, even by
using just 4 cores in parallel, a significant number of true-
positive matches between the MAV and Street View images
is found without having any erroneous pairing, namely at
precision 1. Using 8 putative Street View images processed in
parallel on different cores, the recall at precision 1 increases
by almost 3%. Please note that it is also possible to use two
times 4 cores to obtain the same performance. By further
increasing the number of cores (e.g., in the case of a cloud-
robotics scenario) minor improvements in performance are
obtained in terms of precision and recall (c.f. Table II). If a pool of 96 candidate Street View images is selected, the number of correct matches at precision 1 does not increase any further. This shows the limit of the air-ground matching algorithm.
More importantly, it can be concluded that the presented
approach to select putative matches from the Street View data
has a very good performance and, by just selecting 3% of
the total number of possible matches, can detect more than
40% of the true positive matches at precision 1.
4 Recall rate = number of detected matches over the total number of possible correspondences.
5 Precision rate = number of true positives detected over the total number of matches detected (both true and false).
C. Pairing and acceptance of good matches
Having selected c Street View images I^p = {I^p_1, I^p_2, ..., I^p_c} as described in the previous section, in the final part of the algorithm we perform a more detailed analysis in parallel to compute the final best match for the MAV image. Similarly to line 11 in Algorithm 1, we search for the approximate nearest neighbor of every feature of the aerial image F_a in each selected ground-level image I^p_j. The feature points F^p_j contained in I^p_j are retrieved from the Street View image-feature database D_T and matched against F_a.
In order to pair the airborne MAV images with the Street
View data and select the best match among the putative
images, we make a verification step (line 15 in Algorithm
1). The goal of this step is to select the inliers, correctly
match feature points, and reject the outliers. As emphasized
earlier, the air-ground matching of images is very challenging
for several reasons, and thus, traditional RANSAC-based
approaches tend to fail, or need a very high number of
iterations, as shown in the previous section. Consequently,
in this paper we make use of an alternative solution to
eliminate outlier points and to determine feature point cor-
respondences, which extends the pure photometric matching
with a graph based one.
In this work, we use the Virtual Line Descriptor (kVLD)
(Liu and Marlet, 2012). Between two key-points of the
image, a virtual line is defined and assigned a SIFT-like de-
scriptor, after the points pass a geometrical consistency check
as in (Albarelli et al., 2012). Consistent image matches are
searched in the other image by computing and comparing the
virtual lines. Further on, the algorithm builds and matches a graph consisting of k connected virtual lines. The image points that support a kVLD graph structure are considered inliers, while the other ones are marked as outliers. In the
next section, we show the efficiency and precision of this
method as well as the artificial-view generation and putative-
match selection.
The precision of the air-ground matching algorithm and the uncertainty of the position determination depend on the number of correctly-matched image features. Fig. 6 summarizes the mean number of inliers matched between airborne and ground images as a function of the distance to the closest Street View image. The results show a Gaussian distribution with standard deviation σ = 5 meters. This means that, if the MAV is within 5 meters of a Street View image along the path, our algorithm can detect around 60 correct correspondences.
Fig. 6: Number of inlier feature points matched between the MAV and ground images as a function of the distance to the closest Street View image.

D. Computational complexity

The main goal of this work is to present a proof of concept of the system, rather than a real-time, efficient implementation. The aim of this paper is to present the first appearance-based global localization system for rotary-wing MAVs, in the spirit of the very popular visual-localization algorithms for ground-level vehicles (Cummins and Newman, 2011; Brubaker et al., 2013). For the sake of completeness, we present in Fig. 7 the effective processing time of the air-ground image matching algorithm, using a commercially available laptop with an 8-core, 2.40 GHz architecture.
The air-ground matching algorithm is computed in five major steps: (1) artificial-view generation and feature extraction (Section III-A); (2) approximate nearest-neighbor search within the full Street View database (line 11 in Algorithm 1); (3) putative correspondence selection (Section III-B); (4) approximate nearest-neighbor search among the features extracted from the aerial MAV image with respect to the selected ground-level image (line 14 in Algorithm 1); (5) acceptance of good matches (Section III-C).
For Fig. 7 we used the 2km-long dataset and more than 400 airborne MAV images. All the images were searched within the entire set of Street View images that could be found along the 2km trajectory. Notice that the longest computation time is the approximate nearest-neighbor search in the entire Street View database for the feature descriptors found in the MAV image. However, this step can be completely neglected once an approximate position of the MAV is known, since, in this case, the air-ground matching algorithm can be applied using a distance-based approach instead of a brute-force search.
In a distance-based scenario, the closest Street View images are selected, i.e., those within a certain radius from the MAV (e.g., a 100-meter bound in urban streets). By adopting the distance-based approach, the appearance-based localization
problem can be significantly simplified. We have evaluated
the air-ground matching algorithm using a brute-force search,
because our aim was to solve a more general problem,
namely the first localization problem.
In the position tracking experiment (Section V), we used
the distance-based approach, since, in that case, the MAV
image is compared only with the neighboring Street View
images (usually up to 4 or 8, computed in parallel on
different cores, depending on the road configuration).
Fig. 7: Analysis of the processing time of the air-ground image matching algorithm, broken down into: artificial-view generation and feature extraction; approximate nearest neighbor (ANN) search within the full Google-Street-View database; putative image match selection by histogram voting; ANN search among the features extracted from the aerial MAV image with respect to the selected ground-level image; and acceptance of good matches by kVLD inlier detection. To compute this figure, we used more than 400 airborne MAV images, and all the images were searched within the entire Street View image database that could be found along the 2km trajectory.
Finally, notice that the histogram voting (Fig. 7) takes only
0.01 seconds.
Using the current implementation, on average, an
appearance-based global localization—steps (1), (4), and
(5)—is computed in 3.2 seconds. Therefore, if the MAV
flies roughly with a speed of 2 m/s, its position would be
updated every 6.5 meters. The computational time could be
significantly reduced by outsourcing the image processing
computations to a server in a cloud-robotics scenario.
E. Comparison with state-of-the-art techniques
Here, we briefly describe four state-of-the-art algorithms,
against which we compare and evaluate our approach. These
algorithms can be classified into brute-force or bag-of-words
strategies. All the results shown in this section were obtained
using the 2km-long dataset, c.f., Appendix 1.
1) Brute-force search algorithms: Brute-force approaches
work by comparing each aerial image with every Street
View image in the database. These algorithms have better
precision but at the expense of a very-high computational
complexity. The first algorithm that we used for comparison
is referred to as brute-force feature matching. This algorithm
is similar to a standard object-detection method. It compares
all the airborne images from the MAV to all the ground
level Street View images. A comparison between two images
is done through the following pipeline: (i) SIFT (Lowe,
2004) image features are extracted in both images; (ii) their
descriptors are matched; (iii) outliers are rejected through
verification of their geometric consistency via fundamental-
matrix estimation (e.g., RANSAC 8-point algorithm (Hartley
and Zisserman, 2004)). RANSAC-like algorithms work ro-
bustly as long as the percentage of outliers in the data is
below 50%. The number of iterations N needed to select at least one random sample set free of outliers with a given confidence level p—usually set to 0.99—can be computed as (Fischler and Bolles, 1981):

N = log(1 − p) / log(1 − (1 − γ)^s),    (1)

where γ specifies the expected outlier ratio. Using the 8-point implementation (s = 8) and given an outlier ratio larger than 70%, it becomes evident that the number of iterations needed to robustly reject outliers becomes unmanageable, in the order of 100,000 iterations, and grows exponentially.
From our studies, the outlier ratio after applying the described feature-matching steps on the given air-ground dataset (before RANSAC) is between 80% and 90%; stated differently, only 10%-20% of the found matches (between images of the same scene) correspond to correct match pairs.
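Plugging representative values into Eq. (1) makes the scale of the problem explicit; the short numerical check below (confidence p = 0.99, sample size s = 8, iteration counts rounded in the comments) reproduces the orders of magnitude discussed above.

    import math

    def ransac_iterations(p, gamma, s):
        """Number of RANSAC iterations N from Eq. (1):
        p     - required confidence (0.99 here)
        gamma - expected outlier ratio
        s     - sample size (8 for the 8-point algorithm)"""
        return math.log(1.0 - p) / math.log(1.0 - (1.0 - gamma) ** s)

    for gamma in (0.5, 0.7, 0.8, 0.9):
        print(gamma, round(ransac_iterations(0.99, gamma, 8)))
    # gamma = 0.5 -> ~1.2e3 iterations
    # gamma = 0.7 -> ~7.0e4 iterations (the "order of 100,000" quoted above)
    # gamma = 0.8 -> ~1.8e6 iterations
    # gamma = 0.9 -> ~4.6e8 iterations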
Following the above analysis, in the case of our dataset,
which is illustrated in Fig. 2, we conclude that RANSAC-like
methods fail to robustly reject wrong correspondences. The
confusion matrix depicted in Fig. 8b reports the results of
the brute-force feature matching. This further underlines the
inability of RANSAC to uniquely identify two corresponding
images in our air-ground search scenario. We obtained very similar results using 4-point RANSAC, which leverages the planarity constraint between feature sets belonging to building facades.
The second algorithm applied to our air-ground-matching scenario is the one presented in (Morel and Yu, 2009), here referred to as Affine SIFT and ORSA. In (Morel and Yu, 2009), an image-warping algorithm is described to compute artificially-generated views of a planar scene, able to cope with large viewpoint changes. ORSA (Moisan et al., 2012) is a variant of RANSAC, which introduces an adaptive criterion to avoid hard thresholds for inlier/outlier discrimination. The results were improved by adopting this strategy (shown in Fig. 8c), although the recall rate at precision 1 was below 15% (c.f. Fig. 17).

Fig. 8: These plots show the confusion matrices obtained by applying several algorithms described in the literature (b-c, e-f) and the one proposed in the current paper (d). (a) Ground truth: the data was manually labeled to establish the exact visual overlap between the aerial MAV images and the ground Street View images; (b) Brute-force feature matching; (c) Affine SIFT and ORSA; (d) Our proposed air-ground-matching algorithm; (e) Bag of Words (BoW); (f) FAB-MAP. Notice that our algorithm outperforms all other approaches in the challenging task of matching ground and aerial images. For precision and recall curves, compare to Fig. 17.
2) Bag-of-words search algorithms: The second category
of algorithms used for comparison are the bag-of-words
(BoW) based methods (Sivic and Zisserman, 2003), devised
to improve the speed of image-search algorithms. This tech-
nique represents an image as a numerical vector quantizing
its salient local features. Their technique entails an off-
line stage that performs hierarchical clustering of the image
descriptor space, obtaining a set of clusters arranged in a
tree structure. The leaves of the tree form the so-called
visual vocabulary and each leaf is referred to as a visual
word. The similarity between two images, described by the
BoW vectors is estimated by counting the common visual
words in the images. Different weighting strategies can be
adopted between the words of the visual vocabulary (Majdik
et al., 2011). The results of this approach applied to the
air-ground dataset are shown in Fig. 8e. We tested different
configuration parameters, but the results did not improve (c.f.
Fig. 17).
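For reference, the BoW baseline can be summarized by the following sketch, which is a simplified stand-in rather than the exact vocabulary-tree implementation evaluated here: a flat visual vocabulary is learned by clustering local descriptors, each image is reduced to a normalized word histogram, and image similarity is scored by comparing these histograms; hierarchical vocabularies and the weighting strategies of (Majdik et al., 2011) are omitted.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def train_vocabulary(all_descriptors, n_words=1000):
        """Cluster local descriptors (e.g., SIFT) into a flat visual vocabulary."""
        return MiniBatchKMeans(n_clusters=n_words, n_init=3).fit(all_descriptors)

    def bow_vector(descriptors, vocabulary):
        """Quantize an image's descriptors into a normalized word histogram."""
        words = vocabulary.predict(descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / (np.linalg.norm(hist) + 1e-12)

    def bow_similarity(v1, v2):
        """Cosine similarity between two BoW vectors (higher = more similar)."""
        return float(np.dot(v1, v2))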
Additional experiments were carried out by exploiting
the joint advantages of the Affine SIFT feature extraction
algorithm and the one of the bag-of-words technique, re-
ferred to as ASIFT bag-of-words. In this experiment, SIFT features were extracted also on the generated artificial views for both the aerial and ground-level images. Further on, all the extracted feature vectors were transformed into the BoW representation. Lastly, the BoW vectors extracted from the airborne MAV images were matched with the ones computed from the Street View images. The results of this approach are shown in Fig. 9. Note that the average precision—the area
below the precision-recall curve—was significantly improved
in comparison with the standard BoW approach (c.f. Fig. 17).
Finally, the fourth algorithm used for our comparison is
FAB-MAP (Cummins and Newman, 2011). To cope with
perceptual aliasing, in (Cummins and Newman, 2011) an
algorithm is presented where the co-appearance probability
of certain visual words is modeled in a probabilistic frame-
work. This algorithm was successfully used in traditional street-level ground-vehicle localization scenarios, but failed in our air-ground-matching scenario, as displayed in Fig. 8f.

Fig. 9: This figure shows the confusion matrix obtained by applying the Affine SIFT feature extraction algorithm and the bag-of-words technique to match the airborne MAV images with the Street View images.
As observed, both BoW and FAB-MAP approaches fail
to correctly pair air-ground images. The reason is that the
visual patterns of the air and ground images are classified
with different visual words, leading, thus, to a false visual-
word association. Consequently, the air-level images are
erroneously matched to the Street View database.
To conclude, all these algorithms perform rather unsat-
isfactorily in the air-ground matching scenario, due to the
issues emphasized at the beginning of this paper. This
motivated the development of a novel algorithm presented
throughout this section. The confusion matrix of the pro-
posed algorithm applied to our air-ground matching scenario
is shown in Fig. 8d. This can be compared with the confusion
matrix of the ground truth data (Fig. 8a). As observed, the
proposed algorithm outperforms all previous approaches. In
the experimental section we give further details about the
performance of the described algorithm.
IV. APPEARANCE-BASED GLOBAL POSITIONING SYSTEM
In this section, we extend the topological localization
algorithm described in the previous section in order to
compute the global position of the flying vehicle in a metric
map. To achieve this goal, we back-project each pixel onto
the 3D cadastral model of the city. Please note that the
approach detailed in this section is independent of the 3D
model used, thus the same algorithm can be applied to any
other textured 3D city model.
A. Textured 3D cadastral models
The 3D cadastral model of Zurich used in this work was acquired from the city administration and claims to have an average lateral position error of σ_l = ±10 cm and an average error in height of σ_h = ±50 cm. The city model is referenced in the Swiss Coordinate System CH1903 (DDPS, 2008). Note in Fig. 11a that this model does not contain any textures. By placing virtual cameras in the cadastral model, 2D images and 3D depth maps can be obtained from any arbitrary position within the model, using the Blender6 software environment.
The geo-location information of the Street View dataset
is not exact. The geotags of the Street View images provide
only approximate information about where the images were
recorded by the vehicle. Indeed, according to (Taneja et al.,
2012), where 1,400 Street View images were used to perform
the analysis, the average error of the camera positions is 3.7
meters and the average error of the camera orientation is 1.9
degrees. In the same work, an algorithm was proposed to
improve the precision of the Street View image poses. There,
cadastral 3D city-models were used to generate virtual 2D
images, in combination with image-segmentation techniques,
to detect the outline of the buildings. Finally, the pose was
computed by an iterative optimization, namely by minimiz-
ing the offset between the segmented outline in the Street
View and the virtual images. The resulting corrected Street View image positions have a standard deviation of 0.1184 meters, and the camera orientations have a standard deviation of 0.476 degrees.
In our work, we apply the algorithm from (Taneja et al., 2012) to our dataset to correct the Street View image poses. Then, from the known location of the Street View image, we back-project each pixel onto the 3D cadastral model (Fig. 11b). One sample of the resulting textured 3D model is shown in Fig. 11c.
procedure, we are able to compute the 3D location of the
image features detected on the 2D images. This step is crucial
to compute the scale of the monocular visual odometry
(Section V-A) and to localize the MAV images with respect
to the street level ones, thus, reducing the uncertainty of the
position tracking algorithm. In the next section, we give more
details about the integration of textured 3D models into our
pipeline.
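The per-pixel back-projection used above can be written down for a generic pinhole camera as follows. The sketch assumes known intrinsics K and a camera-to-world pose (R, t) for the corrected Street View camera, and a depth value taken along the optical axis from the rendered depth map; it is a geometric illustration, not the Blender-based rendering pipeline itself.

    import numpy as np

    def backproject_pixel(u, v, depth, K, R, t):
        """Lift pixel (u, v) with metric depth (along the optical axis) to a 3D
        point expressed in the global frame.

        K    : 3x3 camera intrinsic matrix
        R, t : rotation (3x3) and translation (3,) mapping camera to world
        """
        ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # normalized viewing ray
        p_cam = ray_cam * depth                              # 3D point in the camera frame
        return R @ p_cam + t                                 # 3D point in the world frame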
B. Global MAV camera pose estimation
The steps of the algorithm are visualized in Fig. 10. For the georeferenced Street View images, depth maps are computed by back-projecting the images from the known camera positions onto the 3D model (Fig. 10a). The air-ground matching algorithm described in the previous section detects the most similar Street View image in the database for a given MAV image (Fig. 10b). The 2D-2D image feature correspondences are also computed by the air-ground matching algorithm, shown with green lines in Fig. 10c; the magenta lines are the virtual lines used to distinguish the inlier points from the outlier ones (Section III-C). Since the depth of every pixel of the Street View image is known from the 3D city model, 3D-2D point correspondences are computed (Fig. 10d). The absolute MAV camera pose and orientation (Fig. 10e) is estimated given a set of known 3D-2D correspondence points.
6 Blender 3D modeling software environment: http://www.blender.org/.

Fig. 10: (a) Street View image depth map obtained from the 3D cadastral city model; (b) airborne MAV image; (c) matched feature point pairs (green lines) between the Street View and the MAV image; the magenta lines are the virtual lines used to distinguish the inlier points from the outlier ones (Section III-C); (d) 3D-2D point correspondences between the textured 3D city model and the MAV image; (e) global position of the MAV, computed based on the 3D-2D point correspondences.
Several approaches have been proposed in the literature to estimate the external camera parameters based on 3D-2D correspondences. In (Fischler and Bolles, 1981), the perspective-n-point (PnP) problem was introduced and different solutions were described to retrieve the absolute camera pose given n correspondences. The authors in (Kneip et al., 2011) addressed the PnP problem for the minimal case where n equals 3 and introduced a novel parametrization to compute the absolute camera position and orientation. In this work, the Efficient PnP (EPnP) algorithm (Moreno-Noguer et al., 2007) is used to estimate the MAV camera position and orientation with respect to the global reference frame. The advantage of the EPnP algorithm with respect to other state-of-the-art non-iterative PnP techniques is its low computational complexity and its robustness to noise in the 2D point locations.
Given that the output of our air-ground matching algorithm may still contain outliers and that the model-generated 3D coordinates may depart from the real 3D coordinates, we apply the EPnP algorithm together with a RANSAC scheme (Fischler and Bolles, 1981) to discard the outliers. However, the number of inlier points obtained with the EPnP-RANSAC scheme is smaller than the number of inlier points provided by the air-ground matching algorithm, as shown in Fig. 12 for a testbed of more than 1600 samples from the 2 km dataset. This happens because the output of the air-ground matching algorithm may still contain a small number of outlier matching points and, more importantly, the 3D coordinates of the projected Street View image points are inaccurate, since non-planar parts of the facades (e.g., windows and balconies) are not modeled in the 3D cadastral city model. In the future, this error source could be eliminated by using more detailed city models.

Fig. 12: This figure shows the number of detected air-ground matches (green: 2D-2D matching points) for the MAV-Street View image pairs and the resulting number of matches (blue: 3D-2D matching points) after applying the EPnP-RANSAC algorithm. The number of 3D-2D point correspondences is reduced in comparison to the 2D-2D matching points because of the errors in the back-projection of the Street View images onto the cadastral model and the inaccuracies of the 3D model.
We refine the resulting camera pose estimate using the
Levenberg-Marquardt (Hartley and Zisserman, 2004) opti-
mization, which minimizes the reprojection error given by
the sum of the squared distances between the observed image
points and the reprojected 3D points. Finally, using only the
inlier points we compute the MAV camera position.
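This pose-estimation step can be reproduced with off-the-shelf tools as sketched below, using OpenCV's solvePnPRansac with the EPnP solver followed by Levenberg-Marquardt refinement on the inliers. The reprojection-error threshold and the distortion-free pinhole model are illustrative assumptions, not the exact settings of our experiments.

    import cv2
    import numpy as np

    def estimate_mav_pose(pts3d, pts2d, K, reproj_err_px=8.0):
        """Estimate the global MAV camera pose from 3D-2D correspondences.

        pts3d : Nx3 points on the cadastral model (global frame)
        pts2d : Nx2 corresponding points in the MAV image
        K     : 3x3 intrinsic matrix of the MAV camera
        """
        pts3d = np.ascontiguousarray(pts3d, dtype=np.float64).reshape(-1, 1, 3)
        pts2d = np.ascontiguousarray(pts2d, dtype=np.float64).reshape(-1, 1, 2)
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts3d, pts2d, K, None,
            flags=cv2.SOLVEPNP_EPNP,
            reprojectionError=reproj_err_px,
            confidence=0.99)
        if not ok or inliers is None:
            return None
        # Levenberg-Marquardt refinement on the inlier set only
        rvec, tvec = cv2.solvePnPRefineLM(pts3d[inliers[:, 0]], pts2d[inliers[:, 0]],
                                          K, None, rvec, tvec)
        R, _ = cv2.Rodrigues(rvec)
        cam_position_world = (-R.T @ tvec).ravel()     # camera center in the global frame
        return cam_position_world, R, inliers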
Fig. 11a-c show an example of how the Street View images are back-projected onto the 3D city model. Moreover, Fig. 11d shows the estimated camera positions and orientations in the 3D city model for a series of consecutive MAV images. As we do not have an accurate ground truth (we only have the GPS poses of the MAV), we visually evaluate the accuracy of the position estimate by rendering out the estimated MAV camera view and comparing it to the actual MAV image for a given position, as shown in Fig. 11e-f. Fig. 11g-i again show another example of the estimated camera position (g), the synthesized camera view (h), and the actual MAV image (i).
By comparing the actual MAV images to the rendered-out views (Fig. 11e-f and Fig. 11h-i), it can be noted that the orientation of the flying vehicle is correctly computed by the presented approach. Correcting the orientation of the vehicle is very important in order to compensate the drift accumulated by the incremental visual-odometry system used for the position tracking of the vehicle. It can also be noticed that the position of the vehicle along the street is correct. However, in the direction perpendicular to the street the position still has a small error. This is due to the inaccuracy of the used 3D city model: in the cadastral model, the windows and other small elements that are not exactly in the main plane of the facade are not modeled. Similar results were obtained for the remaining MAV-Street View image pairs of the recorded dataset.

Fig. 11: (a) perspective view of the cadastral 3D city model; (b) the ground-level Street View image overlaid on the model; (c) the back-projected texture on the cadastral 3D city model; (d) estimated MAV camera positions matched with one Street View image; (e) the synthesized view from one estimated camera position corresponding to the actual MAV image (f); (g)-(i) show another example from our dataset, where (g) is an aerial view of the estimated camera position, which is marked with the blue camera in front of the textured 3D model, and (h) is the synthesized view from the estimated camera position corresponding to the actual MAV image (i).
The minimal number of correspondences required by the EPnP algorithm is s = 4. However, if a non-minimal set of points is randomly selected, then s > 4 (in our experiments we used a non-minimal set of points with s = 8 matches) and more robust results are obtained (c.f. Fig. 21).
The results are further improved by estimating the uncertainty of the appearance-based global positioning system using a Monte Carlo approach (Section V-C). Fig. 21e,h show the results of the vision-based estimates filtered using the computed covariance. Note that all the erroneous localizations are removed.
The appearance-based global-localization updates will be
used in the next section to correct the accumulated drift in
the trajectory of the MAV.
V. POSITION TRACKING
The goal of this section is to integrate the appearance-
based global localization algorithm detailed in the previous
section into the position tracking algorithm that estimates
the state of the MAV over time. Our aim is to show
an application of the vision-based localization system by
updating the state of the MAV, whenever an appearance-
based global position measurement becomes available.
The vehicle state at time k is composed of the position vector and the orientation of the airborne image with respect to the global reference system. To simplify the proposed algorithm, we neglect roll and pitch, since we assume that the MAV flies in near-hovering conditions. Consequently, we consider the reduced state vector q_k ∈ R^4,

q_k := (p_k, θ_k),    (2)

where p_k ∈ R^3 denotes the position and θ_k ∈ R the yaw angle.
We adopt a Bayesian approach (Thrun et al., 2005) to
track and update the position of the MAV. We compute the
posterior probability density function (PDF) of the state in
two steps. To compute the prediction update of the Bayesian
filter, we use visual odometry. To compute the measurement
update, we integrate the global position, as soon as this is
made available by the algorithm described in the previous
section.
The system model f describes the evolution of the state over time. The measurement model h relates the current measurement z_k ∈ R^4 to the state. Both are expressed in a probabilistic form:

q_{k|k-1} = f(q_{k-1|k-1}, u_{k-1}),    (3)
z_k = h(q_{k|k-1}),    (4)

where u_{k-1} ∈ R^4 denotes the output of the visual-odometry algorithm at time k-1, q_{k|k-1} denotes the prediction estimate of q at time k, and q_{k-1|k-1} denotes the updated estimate of q at time k-1.
A. Visual odometry
Visual Odometry (VO) is the problem of incrementally
estimating the ego-motion of a vehicle using its on-board
camera(s) (Scaramuzza and Fraundorfer, 2011). We use the
VO algorithm in (Wu et al., 2011) to incrementally estimate
the state of the MAV.
B. Uncertainty estimation and propagation of the VO
At time k, the VO takes two consecutive images I_k, I_{k-1} as input and returns an incremental motion estimate with respect to the camera reference system. We define this estimate as δ*_{k,k-1} ∈ R^4,

δ*_{k,k-1} := (Δs*_k, Δθ_k),    (5)

where Δs*_k ∈ R^3 denotes the translational component of the motion and Δθ_k ∈ R the yaw increment. Δs*_k is valid up to a scale factor; thus, the metric translation Δs_k ∈ R^3 of the MAV at time k with respect to the camera reference frame is equal to

Δs_k = λ Δs*_k.    (6)

We define δ_{k,k-1} ∈ R^4 as

δ_{k,k-1} := (Δs_k, Δθ_k),    (7)

where λ ∈ R represents the scale factor. We describe the procedure to estimate λ in Section V-E.
We estimate the covariance matrix Σ_{δ_{k,k-1}} ∈ R^{4×4} using the Monte Carlo technique (Thrun et al., 2001). At every step of the algorithm, the VO provides an incremental estimate δ_{k,k-1}, together with a set of corresponding image points between images I_k and I_{k-1}. We randomly sample five correspondences from this point set multiple times (1000 in our experiments). Each time, we use the selected samples as an input to the 5-point algorithm (Nistér, 2004) to obtain an estimate δ_i. All these estimates form the set D = {δ_i}. Finally, we calculate the uncertainty Σ_{δ_{k,k-1}} of the VO by computing the sample covariance from the data.
The error of the VO is propagated through consecutive camera positions as follows. At time k, the state q_{k|k-1} depends on q_{k-1|k-1} and δ_{k,k-1}:

q_{k|k-1} = f(q_{k-1|k-1}, δ_{k,k-1}).    (8)

We compute its associated covariance Σ_{q_{k|k-1}} ∈ R^{4×4} by the error-propagation law:

Σ_{q_{k|k-1}} = ∇f_{q_{k-1|k-1}} Σ_{q_{k-1|k-1}} ∇f_{q_{k-1|k-1}}^T + ∇f_{δ_{k,k-1}} Σ_{δ_{k,k-1}} ∇f_{δ_{k,k-1}}^T,    (9)

assuming that q_{k-1|k-1} and δ_{k,k-1} are uncorrelated. We compute the Jacobian matrices numerically. The rows of the Jacobian matrices (∇_i f_{q_{k-1|k-1}}), (∇_i f_{δ_{k,k-1}}) ∈ R^{1×4} (i = 1, 2, 3, 4) are computed as

∇_i f_{q_{k-1|k-1}} = [ ∂f_i/∂q^(1)_{k-1|k-1}   ∂f_i/∂q^(2)_{k-1|k-1}   ∂f_i/∂q^(3)_{k-1|k-1}   ∂f_i/∂q^(4)_{k-1|k-1} ],
∇_i f_{δ_{k,k-1}} = [ ∂f_i/∂δ^(1)_{k,k-1}   ∂f_i/∂δ^(2)_{k,k-1}   ∂f_i/∂δ^(3)_{k,k-1}   ∂f_i/∂δ^(4)_{k,k-1} ],    (10)

where q^(i)_{k-1|k-1} and δ^(i)_{k,k-1} denote the i-th component of q_{k-1|k-1} and δ_{k,k-1}, respectively. The function f_i relates the updated state estimate q_{k-1|k-1} and the VO output δ_{k,k-1} to the i-th component of the predicted state q_{k|k-1}.
In conclusion, the state covariance matrix Σ_{q_{k|k-1}} defines an uncertainty space (with a confidence level of 3σ). If the measurement z_k that we compute by means of the appearance-based global positioning system is not included in this uncertainty space, we do not update the state and we rely on the VO estimate.
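The prediction update of Eq. (8) and the covariance propagation of Eq. (9), with the Jacobians of Eq. (10) obtained by finite differences, can be sketched as follows. The motion model f used here (planar composition of the previous pose with the scaled VO increment expressed in the body frame) is our own illustrative parametrization and is stated as an assumption.

    import numpy as np

    def f(q, delta):
        """Predict the next state (x, y, z, yaw) by composing the previous pose
        with the incremental VO motion expressed in the body/camera frame."""
        x, y, z, yaw = q
        dx, dy, dz, dyaw = delta
        c, s = np.cos(yaw), np.sin(yaw)
        return np.array([x + c * dx - s * dy,
                         y + s * dx + c * dy,
                         z + dz,
                         yaw + dyaw])

    def numerical_jacobian(func, x0, eps=1e-6):
        """Central-difference Jacobian of func evaluated at x0 (Eq. (10))."""
        n = x0.size
        J = np.zeros((func(x0).size, n))
        for i in range(n):
            d = np.zeros(n); d[i] = eps
            J[:, i] = (func(x0 + d) - func(x0 - d)) / (2.0 * eps)
        return J

    def propagate(q, Sigma_q, delta, Sigma_delta):
        """Prediction update (Eq. (8)) and covariance propagation (Eq. (9))."""
        Fq = numerical_jacobian(lambda v: f(v, delta), q)
        Fd = numerical_jacobian(lambda v: f(q, v), delta)
        q_pred = f(q, delta)
        Sigma_pred = Fq @ Sigma_q @ Fq.T + Fd @ Sigma_delta @ Fd.T
        return q_pred, Sigma_pred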
C. Uncertainty estimation of the appearance-based global
localization
Our goal is to update the state of the MAV, denoted q_{k|k-1}, whenever an appearance-based global position measurement z_k ∈ R^4 is available. We define z_k as

z_k := (p^S_k, θ^S_k),    (11)

where p^S_k ∈ R^3 denotes the position and θ^S_k ∈ R the yaw in the global reference system at time k.
The appearance-based global positioning system provides the index j ∈ N of the Street View image corresponding to the current MAV image, together with two sets of n ∈ N 2D corresponding image points between the two images. Furthermore, it also provides the 3D coordinates of the corresponding image points in the global reference system. We define the set of 3D coordinates as X^S := {x^S_i} (x^S_i ∈ R^3, i = 1, ..., n) and the set of 2D coordinates as M^D := {m^D_i} (m^D_i ∈ R^2, i = 1, ..., n).
If a MAV image matches a Street View image, it cannot be farther than 15 meters from that Street View camera, according to our experiments (c.f. Fig. 6). We illustrate the uncertainty bound of the MAV in a bird's-eye view in Fig. 13 with the green ellipse, where the blue dots represent Street View camera positions. To reduce the uncertainty associated to z_k, we use the two sets of corresponding image points.

Fig. 13: Blue dots represent Street View cameras. If the current MAV image matches the central Street View one, the MAV must lie in an area of 15 meters around the corresponding Street View camera. We display this area with a green ellipse.
We compute zksuch that the reprojection error of XS
with respect to MDis minimized, that is
zk= argmin
z
(
n
i=1 ||mD
iπ(xS
i, z)||),(12)
where πdenotes the j-th Street View camera projection
model.
The reprojected point coordinates $\pi(x^S_i, z)$ are often inaccurate because of the uncertainty of the Street View camera poses and of the 3D model data. Moreover, the $M^D$ and $X^S$ sets may contain outliers. We therefore use EPnP-RANSAC to compute $z_k$, selecting the solution with the highest consensus (maximum number of inliers, minimum reprojection error).
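As a concrete illustration of this step, the following is a minimal sketch assuming OpenCV's solvePnPRansac with the EPnP flag as the solver; the camera matrix K, the RANSAC thresholds, and the yaw extraction from the rotation matrix are placeholders, not values or conventions taken from the paper.

```python
import numpy as np
import cv2

def global_pose_from_match(X_S, M_D, K):
    """Estimate the MAV pose from 3D points X_S (Nx3, global frame) and
    their 2D observations M_D (Nx2) using EPnP inside a RANSAC loop."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        X_S.astype(np.float64), M_D.astype(np.float64), K, None,
        flags=cv2.SOLVEPNP_EPNP, reprojectionError=4.0, iterationsCount=500)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    p = (-R.T @ tvec).ravel()               # camera position in the global frame
    yaw = np.arctan2(R.T[1, 0], R.T[0, 0])  # heading, assuming a roughly level camera
    return np.array([p[0], p[1], p[2], yaw]), inliers
```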
Similarly to Section V-B, we estimate the covariance matrix $\Sigma_{z_k} \in \mathbb{R}^{4 \times 4}$ using a Monte Carlo technique, as follows. We randomly sample $m$ corresponding pairs between $M^D$ and $X^S$ multiple times (1000 in the experiments). Each time, we use the selected samples as input to the EPnP algorithm to obtain a measurement $z_i$. As we can see in Fig. 6, a match with images gathered by Street View cameras farther than 15 meters away is not plausible. We use this criterion to accept or discard the $z_i$ measurements. All the plausible estimates form the set $Z = \{z_i\}$. We estimate $\Sigma_{z_k}$ by computing the sample covariance from the data.

Fig. 15: The figure shows the standard deviations computed for matching MAV - Street View image pairs along the x, y, z coordinates [m] and the yaw angle [degrees] of the vehicle. The mean standard deviation is 1.16 meters for the x coordinate and 1.56 meters for the y coordinate, while that of the z coordinate is slightly larger, namely 2.20 meters. The mean standard deviation for the yaw is 7.86 degrees. Note that if the uncertainty of the appearance-based global localization is very large, the estimate is discarded and another image is used to localize the vehicle.
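To make the measurement-covariance estimate above concrete, here is a minimal sketch under the same assumptions as the earlier snippets (OpenCV's EPnP solver, a 4-DoF measurement, and the 15-meter plausibility gate); the sample size, helper names, and pose extraction are ours, not the paper's.

```python
import numpy as np
import cv2

def measurement_covariance(X_S, M_D, K, cam_pos, m=6, n_trials=1000):
    """Monte Carlo covariance of z_k: resample m 2D-3D pairs, re-run EPnP,
    keep only poses within 15 m of the matched Street View camera."""
    Z = []
    for _ in range(n_trials):
        idx = np.random.choice(len(X_S), m, replace=False)
        ok, rvec, tvec = cv2.solvePnP(X_S[idx].astype(np.float64),
                                      M_D[idx].astype(np.float64), K, None,
                                      flags=cv2.SOLVEPNP_EPNP)
        if not ok:
            continue
        R, _ = cv2.Rodrigues(rvec)
        p = (-R.T @ tvec).ravel()
        if np.linalg.norm(p - cam_pos) > 15.0:   # implausible, discard
            continue
        yaw = np.arctan2(R.T[1, 0], R.T[0, 0])
        Z.append([p[0], p[1], p[2], yaw])
    return np.cov(np.asarray(Z), rowvar=False)   # 4x4 sample covariance
```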
Fig. 14 shows the estimated uncertainties of the global localization algorithm for a section of the entire 2 km dataset (Section VI-A). Further details are given in Fig. 15, where the Monte-Carlo-based standard deviations are shown along the x, y, z coordinates and the yaw angle of the vehicle. Based on the computed covariances, a simple filtering rule is used to discard those vision-based position estimates that have a very high uncertainty. Conversely, the appearance-based global positions with high confidence are used to update the position tracking system of the MAV. By applying such an approach, the results can be greatly improved (c.f. Fig. 21e,h), although the total number of global position updates is reduced.
D. Fusion
We aim to reduce the uncertainty associated with the state by fusing the prediction estimate with the measurement whenever an appearance-based global position measurement is available. The outputs of this fusion step are the updated estimate $q_{k|k}$ and its covariance $\Sigma_{q_{k|k}} \in \mathbb{R}^{4 \times 4}$. We compute them according to the Kalman filter equations (Kalman et al., 1960):
$$q_{k|k} = q_{k|k-1} + \Sigma_{q_{k|k-1}} \left( \Sigma_{q_{k|k-1}} + \Sigma_{z_k} \right)^{-1} \left( z_k - q_{k|k-1} \right), \tag{13}$$
Fig. 14: The figure shows the top view (X vs. Y coordinates) of an enlarged subpart of the full trajectory. The blue ellipses show the 95-percent confidence intervals of the appearance-based global positioning system, computed using the outlined Monte Carlo approach. The green boxes correspond to the Street View camera positions. The magenta crosses show the positions of the matched 3D feature points on the building facades. Note that most of the confidence intervals border a reasonably small area, meaning that the vision-based positioning approach can accurately localize the MAV in the urban environment.
$$\Sigma_{q_{k|k}} = \Sigma_{q_{k|k-1}} - \Sigma_{q_{k|k-1}} \left( \Sigma_{q_{k|k-1}} + \Sigma_{z_k} \right)^{-1} \Sigma_{q_{k|k-1}}. \tag{14}$$
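A minimal sketch of this fusion step, assuming the identity observation model implied by (13) and (14) and the 4-DoF state used throughout; the yaw wrap-around handling is our addition and may differ from the authors' implementation.

```python
import numpy as np

def fuse(q_pred, Sigma_pred, z, Sigma_z):
    """Kalman update of the predicted state with an appearance-based global
    position measurement, following (13) and (14) with H = I."""
    K = Sigma_pred @ np.linalg.inv(Sigma_pred + Sigma_z)  # Kalman gain
    innovation = z - q_pred
    innovation[3] = np.arctan2(np.sin(innovation[3]), np.cos(innovation[3]))  # wrap yaw
    q_upd = q_pred + K @ innovation
    Sigma_upd = Sigma_pred - K @ Sigma_pred
    return q_upd, Sigma_upd
```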
E. Initialization
In order to initialize our system, we use the global localization algorithm: namely, we use (12) to compute the initial state $q_{0|0}$ and the Monte Carlo procedure described in Section V-C to estimate its covariance $\Sigma_{q_{0|0}}$. In the initialization step, we also estimate the absolute scale factor $\lambda$ of the visual odometry. After the initial position, we need a second position of the MAV that is globally localized by our appearance-based approach. Finally, we compute $\lambda$ by comparing the metric distance traveled between the two global localization estimates with the unscaled motion estimate returned by the VO.
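For illustration, a minimal sketch of the scale initialization described above; the variable names are ours.

```python
import numpy as np

def estimate_scale(p_global_1, p_global_2, p_vo_1, p_vo_2):
    """Absolute VO scale: ratio between the metric distance traveled (from two
    appearance-based global fixes) and the corresponding unscaled VO distance."""
    metric_dist = np.linalg.norm(np.asarray(p_global_2) - np.asarray(p_global_1))
    vo_dist = np.linalg.norm(np.asarray(p_vo_2) - np.asarray(p_vo_1))
    return metric_dist / vo_dist
```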
VI. EXPERIMENTS AND RESULTS
This section presents the results in two parts. Firstly, the
air-ground matching algorithm is evaluated. Secondly, the
results of the appearance-based global positioning system are
presented, together with the position tracking algorithm.
A. Air-ground matching algorithm evaluation
We collected a dataset in downtown Zurich, Switzerland, c.f. Appendix 1. A commercially available Parrot AR.Drone 2 flying vehicle (equipped with a camera in the standard mounting) was manually piloted along a 2 km trajectory, collecting images throughout the environment at different flying altitudes of up to 20 meters, while keeping the MAV camera always facing the buildings. Sample images are shown in Fig. 2, left column. For more insights, the reader can watch the video file accompanying this article (http://rpg.ifi.uzh.ch). The full dataset consists of more than 40,500 images. For all the experiments presented in this work, we sub-sampled the data by selecting one image out of every 100, resulting in a total of 405 MAV test images. In all the experiments we used an image resolution of 640x360 pixels.
Fig. 16: Bird’s-eye view of the test area. The blue dots
mark the locations of the ground Street View images. The
green circles represent those places where the aerial images
taken by the urban MAV were successfully matched with the
terrestrial image data.
All the available Street View data covering the test area were downloaded and saved locally, resulting in 113 discrete possible locations. Since every MAV test image should have a corresponding terrestrial Street View image, the total number of possible correspondences is 405 in all evaluations. We manually labeled the data to establish the ground truth, namely the exact visual overlap between the aerial MAV images and the Street View data. The Street View pictures were recorded in summer 2009, while the MAV dataset was collected in winter 2012; thus, the former is outdated in comparison to the latter. Furthermore, the aerial images are also affected by motion blur due to the fast maneuvers of the MAV. Fig. 16 shows the positions of the Street View images (blue dots) overlaid on an aerial image of the area. The locations of correctly matched MAV images, i.e., those for which the correct most similar Street View image was found, are also shown (green circles).
The different visual-appearance-based algorithms were evaluated in terms of recall rate (the number of detected matches over the total number of possible correspondences) and precision rate (the number of true positives over the total number of matches detected, both true and false). We also show the results using a different visualization, namely confusion maps. Fig. 8 depicts the results obtained by applying the five conventional methods discussed in Section III-E and the algorithm proposed in this work (Fig. 8d). The confusion matrix shows the visual similarity computed between all the Street View images (vertical axes) and all the MAV test images (horizontal axes). To display the confusion maps, we used intensity maps colored as heat maps: dark blue represents no visual similarity, while dark red represents complete similarity. An ideal image-pairing algorithm would produce a confusion matrix coincident with the ground-truth matrix (Fig. 8a); a stronger deviation from the ground-truth map indicates less accurate results.
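To fix the evaluation criteria, here is a small sketch computing precision, recall, and the recall at precision 1 used later in the comparison; sweeping a threshold over similarity scores is a generic assumption, not the exact evaluation code of the paper.

```python
import numpy as np

def precision_recall(scores, is_correct, n_possible, thresholds):
    """scores: similarity of each reported match; is_correct: boolean array;
    n_possible: total number of possible correspondences (405 here)."""
    curve = []
    for t in thresholds:
        kept = scores >= t
        tp = np.sum(is_correct & kept)
        detected = np.sum(kept)
        precision = tp / detected if detected else 1.0
        recall = tp / n_possible
        curve.append((precision, recall))
    return curve

def recall_at_precision_one(curve):
    """Highest recall among operating points with no false positives."""
    return max((r for p, r in curve if p == 1.0), default=0.0)
```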
Fig. 17: Comparison of the results in terms of precision and recall for FAB-MAP, Bag-of-Words, ASIFT Bag-of-Words, brute-force (BF) feature matching, Affine SIFT with ORSA, and the air-ground matching algorithm proposed in this work. Note that at precision 1 the proposed air-ground matching algorithm greatly outperforms the other methods in terms of recall. To see all the correctly matched airborne MAV images with the Street View images, please consult the video attachment of the paper.
1) Parameters used in the experiments: For the Bag-of-Words approach in Fig. 8e and Fig. 17 (we used the implementation of Galvez-Lopez and Tardos (2012), publicly available at http://webdiis.unizar.es/~dorian/), a hierarchical vocabulary tree was trained with a branching factor of $k = 10$ and $L = 5$ depth levels, resulting in $k^L = 100,000$ leaves (visual words), using both MAV images and Street View images recorded in a neighborhood similar to our test area. Term frequency-inverse document frequency (tf-idf) was used as the weighting type and the L1-norm as the scoring type. In the case of the FAB-MAP algorithm (we used the implementation of Cummins and Newman (2011), publicly available at http://www.robots.ox.ac.uk/~mobile/), several parameters were tested to obtain meaningful results; however, all checked parameter configurations failed on our dataset. For the experiments presented in the paper, the FAB-MAP vocabulary with 100k words was used. Moreover, a motion model was assumed (bias forward 0.9) and the geometric consistency check was turned on. The other parameters were set according to the recommendations of the authors. For our proposed air-ground matching algorithm, we used the SIFT feature detector and descriptor, but our approach can easily be adapted to use other features as well.
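As a rough illustration of such a vocabulary tree (branching factor k, depth L), the following recursive hierarchical k-means sketch uses scikit-learn; it is a conceptual stand-in under our own assumptions, not the DBoW2 implementation referenced above.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary_tree(descriptors, k=10, depth=5):
    """Recursively cluster feature descriptors into a k-ary tree; the leaves
    (at most k**depth of them) play the role of visual words."""
    node = {"children": [], "center": descriptors.mean(axis=0)}
    if depth == 0 or len(descriptors) < k:
        return node
    km = KMeans(n_clusters=k, n_init=3).fit(descriptors)
    for c in range(k):
        subset = descriptors[km.labels_ == c]
        child = build_vocabulary_tree(subset, k, depth - 1)
        child["center"] = km.cluster_centers_[c]
        node["children"].append(child)
    return node
```

Images would then be scored by accumulating tf-idf weights of the visual words their descriptors fall into, mirroring the Bag-of-Words baseline described above.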
2) Results and discussion of the experiments: Fig. 17 shows the results in terms of precision and recall. In contrast to object-recognition algorithms, where the average precision is used to evaluate the results, in robotic applications the most important evaluation criterion is usually the recall rate at precision 1. This criterion represents the total number of true-positive detections without any false-positive match. Considering the recall rate at precision 1, our proposed air-ground matching algorithm (shown in blue in Fig. 17) outperforms the second-best approach, namely the ASIFT and
Fig. 18: Analysis of the first false-positive detection. Top-left: urban MAV image; top-right: zoom on the global map where the image was taken; bottom-left: detected match; bottom-right: true positive pairing according to manual labeling. Please note that our algorithm fails for the first time in a situation where the MAV is facing the same building from two different sides (streets), having only windows with the same patterns in the field of view.
ORSA method (shown in red), by a factor of 4. This is because, in our approach, the artificial views are simulated in a more efficient way. Moreover, to reject the outliers, we use a graph-matching method that extends the pure photometric matching with a graph-based one. These results are even more valuable since the ASIFT and ORSA algorithm was applied in a brute-force fashion, which is computationally very expensive. In contrast, in the case of our proposed algorithm, we applied the extremely fast putative-match selection method. Namely, the results were obtained by selecting just 7% of the total number of Street View images. We show all the correctly matched MAV images with Street View images in the video file accompanying this article, which gives further insight into our air-ground matching algorithm. As observed, other traditional methods, such as the Visual Bag-of-Words approach (shown in black in Fig. 17), ASIFT Bag-of-Words (orange), and FAB-MAP (magenta), fail in matching our MAV images with ground-level Street View data. Apparently, these algorithms fail because the visual patterns present in both images are classified into different visual words, thus leading to false visual-word associations.
Fig. 18 shows the first false-positive detection of our air-ground matching algorithm. After a more careful analysis, we found that this is a special case, where the MAV was facing the same building from two different sides (i.e., from different streets), having only windows with the same patterns in the field of view. Repetitive structures represent a barrier for visual-appearance-based localization algorithms; this issue can be resolved by taking motion into account in a Bayesian fashion, as explained in Section V. The limitations of the proposed method are shown in Fig. 19. Please note that these robot positions (top row) are difficult to recognize even for humans.
Fig. 19: Analysis of a case with no detections. Top-left: urban MAV image; top-right: next view of the urban MAV; bottom-left: true positive pairing according to manual labeling; bottom-right: zoom on the global map where the image was taken. Please note that these robot positions (top row) are difficult to recognize even for humans. Moreover, the over-season change of the vegetation makes pairing them extremely difficult for image-feature-based techniques.
B. Appearance-based global positioning system experiments
We collected a second dataset in downtown Zurich using the same platform. The MAV was piloted along a 150 m trajectory, collecting images throughout the environment at different altitudes of up to 6 meters. The images are synchronized with the GPS data based on their timestamps. Every image is considered a state of the MAV. For the visual odometry, we used an average frame rate of 3 images per meter and assumed close-to-hover flight conditions when recording the images. Although we do not have an accurate ground-truth path of the MAV to compare with (since the GPS signal is shadowed between the buildings), we can still visually evaluate the performance of our system (c.f. Fig. 11). Furthermore, we display our result within the cadastral 3D city model, which provides a good basis for evaluating the result (c.f. Fig. 20).
1) Results and discussion of the experiments: In Fig. 20a-d, we display the results using the cadastral 3D model, in order to evaluate the trajectories with respect to the surrounding buildings. The Street View image locations are shown in blue, the Visual Odometry (VO) estimate in black, the GPS in green, and our estimate in red. In order to reduce the drift in the VO estimate, we constrained the orientation to be aligned with the street's dominant direction. However, note in Fig. 20a that the VO estimate (black) accumulates a significant error along the direction of the street. Note also in Fig. 20a that the estimate shown in red, computed with the proposed approach, is the most plausible, since the vehicle was navigated close to the center of the street. Thus, our estimate is the most similar to the actual one. The altitude error of the GPS estimate is even more notable in Fig. 20b-c.
The rendered view of the textured 3D model that the MAV perceives at the end of the trajectory is visually more similar to the real one (Fig. 20e) when it is estimated by the presented algorithm (Fig. 20f) than when it is computed based on the GPS measurement (Fig. 20g).

Fig. 20: Comparison between path estimates shown within the cadastral 3D city model. Top row (a): top view of the estimated trajectory of the MAV; we display the Street View image locations in blue, the Visual Odometry estimate in black, the GPS in green, and our estimate in red. Middle row (b-c): altitude evaluation; in the experiment, the MAV flew close to the middle of the street and never flew above a height of 6 m above the ground; from this point of view, our path estimate (red) is more accurate than the GPS one (green). (d): enlarged view of the path estimates. Bottom row: visual comparison of (e) the actual view, (f) the rendered view of the textured 3D model corresponding to (e) that the MAV perceives according to our estimate, and (g) the rendered view of the textured 3D model corresponding to (e) that the MAV perceives according to the GPS measurement. To conclude, the algorithm presented in this paper outperforms the other techniques in estimating the trajectory of an MAV flying at low altitudes in an urban environment.
Fig. 21: Estimated poses of the MAV along the 2 km trajectory: (a) EPnP-RANSAC with a minimal set of s = 4 points; (b) EPnP-RANSAC with a non-minimal set of s = 8 matches; the black dots show the locations of the geotagged Street View data. Enlarged area 1 for comparison between the minimal point set (c), the non-minimal point set (d), and the result filtered using the uncertainty estimation (e). Enlarged area 2 for comparison between the minimal point set (f), the non-minimal point set (g), and the result filtered using the uncertainty estimation (h). Note that applying the EPnP-RANSAC algorithm with minimal and non-minimal point sets produces a few erroneous localizations (highlighted with yellow rectangles in the figures); however, when the results are filtered based on the uncertainty estimation proposed in this paper, the erroneous positions are completely eliminated, as shown in (e) and (h).
Finally, we show our results in Fig. 21, where a bird's-eye view of the 2 km test environment is presented. A comparison is shown between the results obtained with the minimal set of correspondences for the EPnP algorithm (Fig. 21a) and the non-minimal case (Fig. 21b), respectively. The red points show the MAV camera positions computed with the EPnP-RANSAC algorithm. Enlarged areas are shown in Fig. 21c,f for the minimal case and Fig. 21d,g for the non-minimal case. By closely comparing these figures, it can be concluded that the position estimates computed from a non-minimal set are more accurate than those from the minimal set. This is illustrated by the fact that the non-minimal position estimates tend to be more organized along smoother trajectories, which is in agreement with the real MAV flight path. Stated differently, the position estimates derived from the minimal set tend to "jump around" more than those from the non-minimal set, i.e., they are more widely spread and less spatially consistent. The reason for the more accurate results in the non-minimal case is that the position estimates derived by EPnP are less affected by outliers and degenerate point configurations. However, in both approaches, the minimal and the non-minimal, a few extreme outliers occur that are clearly not along the flying path, as highlighted by the yellow boxes in Fig. 21. A possible cause for these outliers is wrong point correspondences between the Street View images and the MAV images. Another potential explanation is inaccurate 3D point coordinates supplied to EPnP, resulting from cases where the overlay of the Street View images with the cadastral city model is not perfect.
2) Lessons learned: Matching airborne images to ground-level ones is a challenging problem, because extreme changes in viewpoint and scale occur between the aerial MAV images and the ground-level images, in addition to large changes in illumination, lens distortion, over-season variation of the vegetation, and scene changes between the query and the database images. Only a complex visual-search algorithm can deal with such a scenario. We demonstrated that a multi-rotor MAV flying in urban streets, where the satellite GPS signal is often shadowed by the presence of buildings or completely unavailable, can be localized using a textured 3D city model of the environment. Although the 3D city model contains inaccuracies (e.g., non-planar parts of the facades, such as windows and balconies, are not modeled), the MAV can be accurately localized by means of uncertainty quantification. This paper presented a proof-of-concept appearance-based global positioning system that could readily be implemented in real time with cloud computing.
VII. CONCLUSIONS
To conclude, this work addressed the air-ground matching problem between low-altitude MAV-based imagery and ground-level Street View images. Our algorithm outperforms conventional place-recognition methods in challenging settings, where the aerial vehicle flies over urban streets at altitudes of up to 20 meters, often close to buildings. The presented algorithm keeps the computational complexity of the system at an affordable level. A solution was described to globally localize MAVs in urban environments with respect to the surrounding buildings using a single onboard camera, geotagged street-level images, and a cadastral 3D city model. By means of visual inspection and uncertainty quantification, it was shown that the accuracy of the described vision-based approach outperforms that of satellite-based GPS. Therefore, vision-based localization can be either a viable alternative to GPS localization in urban areas, where the GPS signal is shadowed or completely unavailable, or a powerful complement to it that enhances localization in areas where the GPS signal strength is weak, i.e., where the direct line of sight to the satellites may be obstructed. The presented appearance-based global positioning system is a step towards the safe operation (i.e., takeoff, landing, and navigation) of small-sized, vision-equipped, autonomous aerial vehicles in urban environments.
ACKNOWLEDGEMENTS
The authors are grateful to Aparna Taneja and Luca
Ballan from the ETH Computer Vision and Geometry lab
for providing corrected Street View image poses for the data
used in this work.
APPENDIX 1: ZURICH AIR-GROUND MATCHING DATASET
The air-ground matching dataset used in this work is publicly available at: rpg.ifi.uzh.ch/data/air-ground-data.tar.gz
The dataset was collected in downtown Zurich, Switzerland. A commercially available Parrot AR.Drone 2 flying vehicle (equipped with a camera in the standard mounting) was manually piloted along a 2 km trajectory, collecting images throughout the environment at different flying altitudes of up to 20 meters, while keeping the MAV camera always facing the buildings. The ground truth was established by manually labeling the data in order to mark the exact visual overlap between the aerial MAV images and the Street View data.
The dataset consists of the following files:
./images/MAV Images/ – This folder contains the images recorded by the MAV in the city of Zurich, Switzerland.
./images/Street View Images/ – This folder contains the Street View images corresponding to the area recorded by the MAV.
./ground truth.mat – This Matlab matrix file contains the human-made ground truth for the MAV images, i.e., the overlap between each drone image and the Street View images.
./lat long.txt – This file contains the GPS data (geotagging) for every database (Street View) image; the format follows the Google Street View API: 1) latitude; 2) longitude; 3) yaw degree; 4) tilt yaw degree; 5) tilt pitch degree; 6) auxiliary variable.
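For convenience, a minimal sketch of loading the files described above; the column parsing follows the field order listed for the geotag file, while the whitespace delimiter and the MATLAB variable name inside the ground-truth file are assumptions that may need to be adapted to the actual archive contents.

```python
import numpy as np
from scipy.io import loadmat

def load_geotags(path):
    """Parse the Street View geotag file: one row per database image with
    latitude, longitude, yaw, tilt yaw, tilt pitch, auxiliary variable."""
    rows = np.loadtxt(path)            # assumes whitespace-separated columns
    return {
        "latitude": rows[:, 0],
        "longitude": rows[:, 1],
        "yaw_deg": rows[:, 2],
        "tilt_yaw_deg": rows[:, 3],
        "tilt_pitch_deg": rows[:, 4],
    }

def load_ground_truth(path, var_name="ground_truth"):
    """Load the manually labeled MAV-to-Street-View overlap matrix;
    var_name is a guess at the variable stored in the .mat file."""
    return loadmat(path)[var_name]
```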
References
Albarelli, A., Rodolà, E., and Torsello, A. (2012). Im-
posing semi-local geometric constraints for accurate
correspondences selection in structure from motion: A
game-theoretic perspective. International Journal of
Computer Vision, 97(1):36–53.
Anguelov, D., Dulong, C., Filip, D., Frueh, C., Lafon, S.,
Lyon, R., Ogale, A., Vincent, L., and Weaver, J. (2010).
Google street view: Capturing the world at street level.
Computer, 43(6):32–38.
Baatz, G., Köser, K., Chen, D. M., Grzeszczuk, R., and
Pollefeys, M. (2012). Leveraging 3d city models for
rotation invariant place-of-interest recognition. Interna-
tional Journal of Computer Vision, 96(3).
Bansal, M., Daniilidis, K., and Sawhney, H. S. (2012). Ultra-
wide baseline facade matching for geo-localization. In
European Conference on Computer Vision Workshops
and Demonstrations.
Bansal, M., Sawhney, H. S., Cheng, H., and Daniilidis, K.
(2011). Geo-localization of street views with aerial
image databases. In ACM Multimedia.
Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008).
Speeded-up robust features (surf). Comput. Vis. Image
Underst., 110(3):346–359.
Brubaker, M. A., Geiger, A., and Urtasun, R. (2013).
Lost! leveraging the crowd for probabilistic visual self-
localization. In IEEE Conference on Computer Vision
and Pattern Recognition.
Churchill, W. and Newman, P. M. (2012). Practice makes
perfect? managing and leveraging visual experiences for
lifelong navigation. In IEEE International Conference
on Robotics and Automation, pages 4525–4532.
Conte, G. and Doherty, P. (2009). Vision-based unmanned
aerial vehicle navigation using geo-referenced infor-
mation. EURASIP Journal on Advances in Signal
Processing.
Cummins, M. and Newman, P. (2011). Appearance-only
SLAM at large scale with FAB-MAP 2.0. International
Journal of Robotics Research, 30(9):1100–1123.
DDPS (2008). Formulas and constants for the calculation of
the swiss conformal cylindrical projection and for the
transformation between coordinate systems. Technical
report, Federal Department of Defence, Civil Protection
and Sport DDPS.
Fischler, M. A. and Bolles, R. C. (1981). Random sample
consensus: a paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Communications of the ACM, 24(6):381–395.
Floros, G., Zander, B., and Leibe, B. (2013). Openstreetslam:
Global vehicle localization using openstreetmaps. In
IEEE International Conference on Robotics and Au-
tomation, pages 1054–1059.
Fritz, G., Seifert, C., Kumar, M., and Paletta, L. (2005).
Building detection from mobile imagery using infor-
mative sift descriptors. In Scandinavian Conference on
Image Analysis.
Galvez-Lopez, D. and Tardos, J. D. (2012). Bags of binary
words for fast place recognition in image sequences.
IEEE Transactions on Robotics, 28(5):1188–1197.
Hartley, R. I. and Zisserman, A. (2004). Multiple View
Geometry in Computer Vision. Cambridge University
Press, ISBN: 0521540518, second edition.
Hentschel, M. and Wagner, B. (2010). Autonomous robot
navigation based on openstreetmap geodata. In IEEE
International Conference on Intelligent Transportation
Systems, pages 1645–1650.
Ibañez Guzmán, J., Laugier, C., Yoder, J.-D., and Thrun,
S. (2012). Autonomous Driving: Context and State-of-
the-Art. In Handbook of Intelligent Vehicles, volume 2,
pages 1271–1310.
Jégou, H., Douze, M., and Schmid, C. (2011). Product
quantization for nearest neighbor search. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
33(1):117–128.
Kalman, R. E. et al. (1960). A new approach to linear
filtering and prediction problems. Journal of basic
Engineering, 82(1):35–45.
Kneip, L., Scaramuzza, D., and Siegwart, R. (2011). A novel
parametrization of the perspective-three-point problem
for a direct computation of absolute camera position and
orientation. In IEEE Conference on Computer Vision
and Pattern Recognition, Colorado Springs, USA.
Kuemmerle, R., Steder, B., Dornhege, C., Kleiner, A.,
Grisetti, G., and Burgard, W. (2011). Large scale graph-
based SLAM using aerial images as prior information.
Journal of Autonomous Robots, 30(1):25–39.
Liu, Z. and Marlet, R. (2012). Virtual line descriptor and
semi-local graph matching method for reliable feature
correspondence. In British Machine Vision Conference,
pages 16.1–16.11.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. International Journal of Computer
Vision, 60(2):91–110.
Maddern, W. P., Milford, M., and Wyeth, G. (2012). Cat-
slam: probabilistic localisation and mapping using a
continuous appearance-based trajectory. International
Journal of Robotics Research, 31(4):429–451.
Majdik, A., Albers-Schoenberg, Y., and Scaramuzza, D.
(2013). Mav urban localization from google street view
data. In IEEE International Conference on Intelligent
Robots and Systems, pages 3979 – 3986.
Majdik, A., Gálvez-López, D., Lazea, G., and Castellanos,
J. A. (2011). Adaptive appearance based loop-closing
in heterogeneous environments. In IEEE International
Conference on Intelligent Robots and Systems, pages
1256–1263.
Majdik, A., Verda, D., Albers-Schoenberg, Y., and Scara-
muzza, D. (2014). Micro air vehicle localization and
position tracking from textured 3d cadastral models.
In IEEE International Conference on Robotics and
Automation.
Moisan, L., Moulon, P., and Monasse, P. (2012). Automatic
Homographic Registration of a Pair of Images, with A
Contrario Elimination of Outliers. Image Processing On
Line.
Montemerlo, M., Becker, J., Bhat, S., Dahlkamp, H., Dolgov,
D., Ettinger, S., Haehnel, D., Hilden, T., Hoffmann,
G., Huhnke, B., Johnston, D., Klumpp, S., Langer, D.,
Levandowski, A., Levinson, J., Marcil, J., Orenstein,
D., Paefgen, J., Penny, I., Petrovskaya, A., Pflueger,
M., Stanek, G., Stavens, D., Vogt, A., and Thrun,
S. (2008). Junior: The stanford entry in the urban
challenge. Journal of Field Robotics.
Morel, J.-M. and Yu, G. (2009). Asift: A new framework for
fully affine invariant image comparison. SIAM Journal
on Imaging Sciences, 2(2):438–469.
Moreno-Noguer, F., Lepetit, V., and Fua, P. (2007). Accurate
non-iterative o(n) solution to the pnp problem. In IEEE
International Conference on Computer Vision.
Muja, M. and Lowe, D. G. (2009). Fast approximate nearest
neighbors with automatic algorithm configuration. In
International Conference on Computer Vision Theory
and Application VISSAPP, pages 331–340.
Nistér, D. (2004). An efficient solution to the five-point
relative pose problem. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 26(6):756–770.
Scaramuzza, D. (2011). 1-point-ransac structure from mo-
tion for vehicle-mounted cameras by exploiting non-
holonomic constraints. International Journal of Com-
puter Vision, 95(1):74–85.
Scaramuzza, D. and Fraundorfer, F. (2011). Visual odometry
[tutorial]. IEEE Robotics & Automation Magazine,
18(4):80–92.
Schindler, G., Brown, M., and Szeliski, R. (2007). City-scale
location recognition. In IEEE Conference on Computer
Vision and Pattern Recognition.
Sivic, J. and Zisserman, A. (2003). Video google: A text
retrieval approach to object matching in videos. In IEEE
International Conference on Computer Vision, pages
1470–1477.
Taneja, A., Ballan, L., and Pollefeys, M. (2012). Registra-
tion of spherical panoramic images with cadastral 3d
models. In International Conference on 3D Imaging,
Modeling, Processing, Visualization and Transmission
(3DIMPVT), pages 479–486. IEEE.
Thrun, S., Burgard, W., Fox, D., et al. (2005). Probabilistic
robotics, volume 1. MIT press Cambridge.
Thrun, S., Fox, D., Burgard, W., and Dellaert, F. (2001).
Robust monte carlo localization for mobile robots. Ar-
tificial intelligence, 128(1):99–141.
Vaca-Castano, G., Zamir, A. R., and Shah, M. (2012).
City scale geo-spatial trajectory estimation of a moving
camera. In IEEE Conference on Computer Vision and
Pattern Recognition.
Wu, C., Agarwal, S., Curless, B., and Seitz, S. M. (2011).
Multicore bundle adjustment. In IEEE Conference on
Computer Vision and Pattern Recognition, pages 3057–
3064. IEEE.
Yeh, T., Tollmar, K., and Darrell, T. (2004). Searching
the web with mobile images for location recognition.
In IEEE Conference on Computer Vision and Pattern
Recognition, pages 76–81.
Zamir, A. and Shah, M. (2010). Accurate image localiza-
tion based on google maps street view. In European
Conference on Computer Vision.