Google Map Oriented Robust Visual Navigation for MAVs in GPS-denied Environment

Mo Shan, Zhi Gao, Yazhe Tang, Feng Lin, Member, IEEE, and Ben M. Chen, Fellow, IEEE
Abstract—With the increasing need for micro aerial vehicles
(MAVs) to work in GPS-denied environments, vision techniques
have been extensively explored to realize robust flight control.
In this paper, we propose to employ Google map, which could
be outdated due to the lag of updating, as a reference to realize
robust navigation for MAVs equipped with a downward-looking
camera. Specifically, the initial position is estimated via
correlation, and optical flow and homography decomposition are
then used to obtain the predicted position. Precise localization
is achieved by image registration centered at the predicted
position, using Histograms of Oriented Gradients (HOG) features
to describe the on-board image as well as the map. To reduce
the computational time, a particle filter is employed to conduct
a coarse-to-fine search. Extensive experiments on the datasets
from the International Micro Air Vehicles Conference and Flight
Competitions (IMAV 2014 and 2015, where our MAV won the
championship and 1st runner-up respectively) demonstrate the
efficacy and efficiency of our method.
Index Terms—Vision-based navigation, Micro aerial vehicles
(MAVs), GPS-denied environment.
Autonomous MAVs usually rely on the GPS signal which,
combined with inertial measurement unit (IMU) data, provides
high-rate and drift-free state estimation suitable for control
purposes. Due to their small size and high maneuverability,
MAVs have become an increasingly powerful and popular
tool for rescue, surveillance, exploration and transportation.
Inevitably, MAVs are required to work in environments
where the GPS signal is denied or unreliable, for example because
of multipath reflection caused by buildings or terrain,
jamming or malicious injection of erroneous signals,
adverse weather conditions, or hardware failure.
Therefore, given its critical role in automatic flight control,
robust localization and navigation for MAVs without
GPS has been attracting growing attention from researchers in
the control, robotics and vision communities.
Vision-based techniques have been extensively explored to
facilitate the control of autonomous flight, due on one hand
to the characteristics of visual sensors, which are lightweight
and consume little power, yet provide a large amount of
information about the environment at a high frame rate; due
Mo Shan and Zhi Gao contributed equally to this work.
M. Shan, Z. Gao, and F. Lin are with the Temasek Laboratories, National
University of Singapore, 117411, Singapore. E-mail:;;
Y. Tang is with the Temasek Laboratories, National University of
Singapore, Singapore. He is also with the Department of Precision Mechanical
Engineering, Shanghai University, 200444, China. E-mail:
Ben M. Chen is with the Electrical and Computer Engineering Department,
National University of Singapore, Singapore. E-mail:
Manuscript received April XX, 2016; revised XXX XX, 2016.
Fig. 1: Overview of the problem addressed in this paper. (a)
Our quadrotor platform. (b) The location of MAV is decided
by matching on-the-fly image with the Google map.
on the other hand to the great achievements in visual signal
processing, where algorithms for precise geometrical information
(either 2D or 3D) estimation and accurate object/scene
analysis (including detection, recognition, and understanding)
have been proposed. For the specific problem of
navigation, most available vision-based methods can be roughly
divided into three categories, according to the data
used for localization. The first category is
simultaneous localization and mapping (SLAM) [1], [2], [3],
[4], [5], [6]. Although appealing and successful applications
on ground vehicles have been reported, SLAM suffers from
high computation and memory costs, as a progressively more
complex map must be generated and maintained. Moreover,
its robustness decreases sharply for large and structure-rich
scenes. Therefore, researchers are still struggling to propose
practical SLAM algorithms for MAVs, whose computational
power could be significantly inferior to that of ground vehicles
due to their limited payload capability. The second category,
referred to as visual odometry (VO) [7], [8], [9], [10], [11],
[12], [13], incrementally estimates the camera pose by examining
the changes induced by motion on the image sequence.
Compared with SLAM, which tries to obtain a
global and consistent estimate of the path by identifying
loop closures, VO aims at recovering the path
incrementally, pose after pose, thus avoiding the need to keep
an environment map and detect loop closures, which makes it
less computationally expensive. However, VO methods usually
suffer from pose estimation drift since the error of inter-frame
pose estimation accumulates. Consequently, a sophisticated
optimization algorithm, such as bundle adjustment, or an additional
sensor should be incorporated to correct the drift. The third
CONFIDENTIAL. Limited circulation. For review only
IEEE T-RO Submission no.: 16-0191.1
Preprint submitted to IEEE Transactions on Robotics. Received: April 14, 2016 01:24:02 PST
category leverages the available geographic information system
(GIS) data, such as Google map, Google Street View,
and labelled images, as georeference to realize localization
and navigation [14], [15], [16], [17], [18], [19], [20]. At first
glance, such georeference methods are quite straightforward.
Nevertheless, it is very challenging to register the image
captured by the on-board camera to the reference, because
two images of the same place acquired at different times and
with different cameras may show huge appearance differences
due to illumination and colorimetry variations (e.g. sunny or
cloudy days), camera viewpoint changes, scene modifications
(e.g. seasonal changes, building construction) and occlusion
(e.g. by moving objects). Such noisy registration results
necessitate sophisticated filtering, which usually induces a high
computational load, to realize long-term localization.
In this paper, we propose to employ Google map as georeference
to realize robust localization and navigation for an MAV
equipped with a downward-looking camera. Our quadrotor
platform and an overview of our work are shown in Fig.
1. Here, the on-board image and the map are captured at
different scales, orientations and illumination conditions, and with
different cameras. Moreover, the map is usually not up-to-date,
which leads to significant appearance differences between
the on-board image and its counterpart region in the map.
To address these challenges, we carefully design each
stage of the algorithm, taking both robustness and efficiency
into account. Specifically, the initial position is detected via
correlation, and then optical flow and homography decomposition
are used to obtain the predicted position. Precise
localization is achieved by image registration centered at the
predicted position, using HOG features to describe the on-board
image as well as the map. To reduce the computational
time, a particle filter is employed to conduct a coarse-to-fine
search. Our method thus belongs to the third category
mentioned above, while remaining direct and simple.
Extensive experiments on the datasets from IMAV 2014 and
2015, where our MAV won the championship and 1st runner-up
respectively, demonstrate the efficacy and efficiency of
our method. This work is an extended version of the paper
published in IEEE ROBIO'15. Please note that this paper is
accompanied by a video demonstration (see footnote 1).
The rest of the paper is organized as follows: Section II
presents an overview of previous efforts. Section III is devoted
to our technical details and Section IV presents our extensive
experiments. The paper closes with the conclusion
and future work in Section V.
Tremendous efforts have been devoted to the problem of
visual localization and navigation for unmanned systems, and
a plethora of methods have been proposed. While promising
results have been reported, significant problems remain,
especially for MAV platforms whose computational power is
significantly inferior to that of ground vehicles. Here, instead
of reviewing all three categories mentioned before, we
focus on the techniques which are the most relevant to our
1The video link is:
method. Moreover, due to the critical role of registration
between on-board image and Google map, we also review the
techniques of registering images of challenging situations.
There are various methods of estimating an MAV’s real-
world position from a combination of aerial images sensed by
on-board camera of the MAV itself and GIS data, including
the use of digital elevation maps (DEMs), data contained
in the vector layer, and reference images. A DEM provides
a representation of the ground surface topography of the
earth. In [21], the matching between the 3D terrain model
obtained using optical flow from on-board image sequence
and a DEM is proposed to estimate the MAV’s position and
pose. As the depth information of both the 3D terrain model
and the reference DEM is converted into 2D pseudo images with
pixel brightness representing height, this method is
robust against illumination changes. However, for an
MAV flying at a relatively low altitude, it may be difficult
to reliably match a 3D terrain to a DEM for areas which
are predominantly planar. The data in the vector layer can
directly provide salient features, such as road intersections, to
facilitate matching with on-board aerial images. The works of
[19], [20] match high-altitude aerial images to GIS data based
on the shape of roads to realize vision-aided MAV navigation.
In [14], data in the Ordnance Survey (OS) layers is
matched with the aerial image. The main disadvantage of such
a strategy is that matching relies heavily on the relationships
between multiple linear features observed from a relatively
high altitude. For an MAV operating in a mountainous
environment at lower altitude, it may not be possible to detect
multiple features in a single frame. Besides DEM and vector
data, both labelled aerial and ground images have been applied
as reference to provide a backup navigation system for MAVs.
In [15], [16], [18], MAV aerial images are matched to Google
Earth images to realize localization. In [17], on-board aerial
images are matched to ground images for localization, in
which intermediate artificial views are generated to overcome
the large view-point differences. The main disadvantages of
such methods are the requirement of labelled images and lack
of accuracy. Moreover, the computational load induced by such
sophisticated algorithms could be prohibitively high for MAVs.
Due to the necessity of registering the on-board image to the reference
image, we also briefly review techniques for determining
similarity between visual data. Determining similarity has
attracted much interest for its critical role in many vision
tasks, including object detection and recognition, texture/scene
classification, data retrieval, tracking, image alignment, etc.
Based on the general assumption that there exists a common
underlying visual property (i.e., colors, intensities, edges,
gradients, or other filter response) which is shared by the two
images, and can therefore be extracted and compared across
images, compact region descriptors (features), usually used
with interest point detector, have been extensively proposed.
Without attempting to be exhaustive, such works include scale
invariant descriptor (SIFT [22] and its derivatives), differential
descriptors [23], and shape-based descriptors [24]. For a
more comprehensive and detailed discussion of many region
descriptors for image matching, readers can refer to a survey
paper [25]. However, such an assumption could be too restrictive
Fig. 2: The overview of our main idea of Google map oriented visual navigation. The HOG features of the map are computed
offline. During onboard processing, we track the pose by position prediction and image registration.
for images from different modalities which share no obvious
visual property, in which case these descriptors can hardly produce
satisfactory matching results. To match (or align) images
with significant appearance differences, but sharing similar
layout of global or local intensity patterns, co-occurrence and
self-similarity have been considered. Typically, in [26], [27],
statistical co-occurrence of pixel-wise measures is estimated
from individual images and then compared across images
to realize matching. Nevertheless, statistical co-occurrence is
assumed to be global (within the entire image) – a very
strong assumption which is often invalid. Therefore, local
self-similarity descriptors which do not use the appearance
information directly, but capture internal geometric layouts
of local self-similarities have been proposed, which greatly
improve the matching capabilities for complex data [28], [29].
Our problem of matching the on-board image to the reference image
falls somewhere in between. The huge difference between the on-board
and reference images precludes the possibility of using
any appearance-based region descriptor to realize matching.
On the other hand, compared with the image retrieval (or
recognition) situations which self-similarity descriptors are
designed for, our task shares more similarity in terms of data
content, but has a much higher requirement on efficiency (the
processing frequency should reach about 10 Hz). Therefore,
we leverage HOG features [30], which have proved to be
highly successful for pedestrian detection under challenging
illumination and pose conditions, to register the on-board image
to the Google reference image. To expedite the matching process,
a particle filter is used to avoid a sliding window search. To
further improve the efficiency, the search is confined to the area around
the location predicted by optical flow.
This section presents our Google map oriented visual
navigation method; the main idea is shown in Fig. 2.
As the Google map is given, a lookup table of its HOG
representation is built offline, and a correlation-based global
search is then performed for initialization. Position prediction is
performed next to restrict the search. After the on-board image
is described by HOG, the particles are drawn on a coarse grid.
If the match is reliable, the particles vote for the
location of the MAV. Otherwise, a search at a fine scale is
conducted. If that search also fails, the optical flow prediction is
retained as the location estimate.
A. Preprocessing
Prior to any of the aforementioned operations, we first rotate the
on-board images to the upright orientation with respect to
the Google map using the yaw angle obtained from an on-board
IMU, so that the on-board frame and the reference one have the
same orientation. Second, we resize the on-board
image according to the altitude obtained from an
on-board barometer, so that the on-board image and the reference
image are at the same scale (namely, each pixel in the on-board
image represents the same spatial distance as a pixel
in the reference image, regardless of the flying height of the
MAV). This preprocessing eliminates the effects of different
orientations and scales. The
difficulty of our task is therefore significantly reduced, and our later
efforts can focus on the much more important component:
image registration based pose tracking.
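The two preprocessing steps amount to a single rotation-plus-scale transform. The snippet below is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes a pinhole camera pointing straight down, and the parameters `focal_px` (focal length in pixels) and `map_m_per_px` (map resolution in metres per pixel) are hypothetical names that do not appear in the paper.

```python
import numpy as np

def preprocessing_transform(yaw_rad, altitude_m, focal_px, map_m_per_px):
    """Return (scale, A), where A is the 2x2 rotation-plus-scale matrix.

    Assumption: for a nadir pinhole camera one on-board pixel covers
    altitude / focal_length metres on the ground.
    """
    onboard_m_per_px = altitude_m / focal_px
    # Resize factor that brings the on-board image to the map resolution.
    scale = onboard_m_per_px / map_m_per_px
    # Whether +yaw or -yaw aligns the image with the map depends on the
    # IMU and image axis conventions; +yaw is used here for illustration.
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    rotation = np.array([[c, -s], [s, c]])
    return scale, scale * rotation
```

For instance, with an altitude of 80 m, a hypothetical focal length of 400 px and a hypothetical map resolution of 0.5 m per pixel, the on-board image would be shrunk by a factor of 0.4 before matching.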
B. Global localization for initialization
After taking off, the location of the MAV is searched for in the
entire map for initialization. To avoid a sliding window search,
which is quite time consuming, we adopt the high-frame-rate
correlation filter proposed in [31], and exploit the Fast
Fourier Transform to expedite the correlation as in Equation (1),

$$G = F \odot H^{*} \qquad (1)$$

where $F = \mathcal{F}(f)$ is the 2D Fourier transform of the input
on-board image $f$, $H = \mathcal{F}(h)$ is the same transform of the
map $h$, $\odot$ denotes element-wise multiplication and $*$ indicates
the complex conjugate. As no training is available for detection,
we correlate the current frame and the map directly. As a result,
transforming $G$ into the spatial domain gives a confidence map
of the location.
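Correlation in the frequency domain can be illustrated with a few lines of NumPy. This is a sketch rather than the filter of [31]: the function name is hypothetical, the on-board image is zero-padded to the map size, both images are made zero-mean, and the conjugate is placed on the on-board image so that the peak index equals the top left offset of the matched region (placing it on the map instead only mirrors the shift).

```python
import numpy as np

def correlation_confidence(map_img, onboard_img):
    """Circular cross-correlation of the map with the on-board image via FFT."""
    m = map_img - map_img.mean()
    t = onboard_img - onboard_img.mean()
    F_map = np.fft.fft2(m)
    # fft2 with s=... zero-pads the smaller image to the map size.
    F_tpl = np.fft.fft2(t, s=m.shape)
    G = F_map * np.conj(F_tpl)
    # Back in the spatial domain, large values mark likely locations.
    return np.real(np.fft.ifft2(G))

# Usage: cut a patch from a synthetic map and find it again.
rng = np.random.default_rng(0)
demo_map = rng.standard_normal((64, 96))
demo_patch = demo_map[20:36, 30:46].copy()
confidence = correlation_confidence(demo_map, demo_patch)
peak = np.unravel_index(np.argmax(confidence), confidence.shape)
```

On real imagery the raw correlation peak is far less clean than on this synthetic example, which is consistent with the paper restricting this step to initialization.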
Take the on-board image displayed in Fig. 3 as an example, together
with its corresponding region on the map. Its confidence map is
shown in Fig. 4, from which it is evident that the correct location
of the on-board image possesses the highest confidence.
In addition, the processing time of the correlation is merely 0.036
seconds. Despite its efficiency for
initial localization, the correlation filter may not be suitable for
subsequent localization. This is because the take-off position can be set
to a place where distinctive landmarks are visible, which may
not always be the case during the rest of the flight.
Fig. 3: On-board image at the take-off position, and its
corresponding region in the map.
Fig. 4: The confidence map. Red indicates high confidence
while blue indicates low confidence. The black area represents
the highest confidence. Note that the confidence corresponds
to the top left corner of the template, instead of its center. Best
viewed in color.
C. Pose tracking
After initialization, the position of MAV is estimated via
image registration. In this section, optical flow based motion
estimation as well as HOG and particle filter based image
registration are introduced.
1) Position prediction: To narrow down the search, we
make a rough prediction of the position by estimating the
inter-frame motion, and then confine the subsequent matching
search to its vicinity. To obtain the motion, the points to be tracked are
selected based on the good feature criterion [32], and the iterative
Lucas-Kanade method with pyramids [33] is used to
estimate the optical flow field. Similar to [34], the inter-frame
translation can be estimated by assuming that the ground
plane is roughly flat. Specifically, a homography $H$ is applied to
describe the relationship between co-planar feature points in
two images, from which the motion dynamics can be derived
according to Equation (2):
where $R$ and $T$ are the inter-frame rotation and translation
respectively, $N$ is the normal vector of the ground plane, and $h$
is the altitude. $R$, $N$, $h$ are obtained from the on-board sensors
and $T$ can be calculated as Equation (3):
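For orientation, the standard plane-induced homography model that matches these definitions is $H = R + \frac{1}{h} T N^{\top}$, which for a unit normal gives $T = h\,(H - R)\,N$. The sketch below assumes exactly this model; it is an illustration and not necessarily the authors' formulation, since sign conventions differ between references. It also assumes a calibrated (Euclidean) homography, so a homography estimated from pixel coordinates would first have to be converted with the camera intrinsics and fixed in scale. The function name is hypothetical.

```python
import numpy as np

def translation_from_homography(H, R, N, h):
    """Recover T from H = R + (1/h) T N^T, assuming this sign convention."""
    N = np.asarray(N, dtype=float)
    N = N / np.linalg.norm(N)  # the derivation requires a unit normal
    return h * (np.asarray(H, dtype=float) - np.asarray(R, dtype=float)) @ N

# Usage: build H from a known motion and recover the translation.
angle = 0.1
R_demo = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
T_demo = np.array([0.5, -0.2, 0.05])
N_demo = np.array([0.0, 0.0, 1.0])
h_demo = 80.0
H_demo = R_demo + np.outer(T_demo, N_demo) / h_demo
T_rec = translation_from_homography(H_demo, R_demo, N_demo, h_demo)
```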
2) Image descriptor: We use the images given in Fig. 5,
which are typical of those encountered during our
image registration task, as an example to justify our choice of
image descriptor. The first image is the on-board image, while
the second one is its counterpart in the map. Two further
subimages, whose locations are close to the ground truth
location, are extracted from the map as outliers. The best
image descriptor should maximize the relative
difference between the matched and mismatched image pairs.
Fig. 5: Images for similarity estimation comparison. The first
one is the on-board image, and the second one is its ground
truth match. The third and fourth images are subimages located
close to the ground truth.
TABLE I: Similarity estimation using different descriptors.
Descriptor   Image 1   Image 2   Image 3
MI           1         1.253     1.208
CC           1         1.328     1.320
LSS          1         1.055     1.025
HOG          1         1.829     2.428
We compare four commonly used approaches for registering
images of different modalities, namely Mutual Information
(MI) used in [35], the sample Correlation Coefficient (CC), Local
Self Similarity (LSS) descriptors proposed in [28], and HOG
features. The performance of the different image descriptors is
shown in Table I (see footnote 2), where the second, third and fourth
images in Fig. 5 are labelled as image 1, 2 and 3. MI is calculated
based on Equation (4), where $H(X)$, $H(Y)$ are the marginal
entropies and $H(X, Y)$ is the joint entropy. CC is computed
according to Equation (5), where $x_i$, $y_i$ are measurements in
$X$, $Y$. LSS descriptors are extracted from the images at locations five
pixels apart, and the distance is computed by the $L_2$ norm. For
HOG, the value is obtained from the correlation coefficient
of the HOG features extracted from the images. The
comparison shows that HOG separates the outliers from the true
match far better than the other descriptors: its normalized distances
for the two outliers are 1.829 and 2.428, whereas those of MI and CC
stay below 1.33. LSS barely distinguishes the outliers (at most
1.055), giving the worst performance.
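$$I(X;Y) = H(X) + H(Y) - H(X,Y) \qquad (4)$$
$$r_{xy} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}} \qquad (5)$$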
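Both baseline measures are easy to compute from pixel intensities. The following is a minimal sketch, assuming that the entropies are estimated from a joint intensity histogram; the bin count of 32 and the function names are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from a joint histogram."""
    joint, _, _ = np.histogram2d(np.ravel(x), np.ravel(y), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1)  # marginal distribution of X
    py = pxy.sum(axis=0)  # marginal distribution of Y

    def entropy(p):
        p = p[p > 0]  # empty bins contribute nothing
        return -np.sum(p * np.log2(p))

    return entropy(px) + entropy(py) - entropy(pxy.ravel())

def correlation_coefficient(x, y):
    """Sample correlation coefficient of two equally sized images."""
    xc = np.ravel(x).astype(float)
    yc = np.ravel(y).astype(float)
    xc = xc - xc.mean()
    yc = yc - yc.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
```

Both functions return similarities, so they would still need the conversion to a distance described in footnote 2 before being compared as in Table I.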
The success of HOG is not surprising, as it is already widely
used for pedestrian detection due to its robustness. To construct
2 The correlation values are transformed to distance by $d = 1 - \text{correlation}$,
and the values are normalized using the ground truth distance.
HOG, 1D point derivative masks are convolved with the image
to obtain the gradients. Gradient histograms are constructed
in cells and blocks. A cell contains $8 \times 8$ pixels whereas a
block consists of $2 \times 2$ cells. The gradient orientations are
divided into 9 bins and every pixel in the cell votes for two
adjacent bins, weighted by its gradient magnitude. Additionally,
bilinear interpolation is used to compute the contribution of
a pixel to the cells containing it, wherein the importance of
pixels follows a Gaussian distribution with respect to the block
center. The histograms are normalized locally to compensate
for illumination variance: a clipped $L_2$ norm normalization
scheme is applied to the histogram of every block. Because
the blocks overlap, every cell contributes to
multiple blocks, which significantly improves the performance of
HOG. Finally, the histograms are vectorized to form a 1D
feature. Unlike in object detection, no training is available in
navigation. Therefore, we use HOG in a holistic manner as an
image descriptor to encode the gradient information in multi-modal
images. The HOG glyph [36] is visualized in Fig. 6.
It is evident that the gradient patterns remain similar even
though the on-board image undergoes photometric variations
compared with the map. In particular, the structures of roads
and houses are clearly preserved.
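The construction described above can be sketched in plain NumPy. This is a simplified illustration and not the OpenCV implementation used in the experiments: it keeps the 1D derivative masks, the 9 unsigned orientation bins with votes split between two adjacent bins, and the clipped $L_2$ block normalization, but it omits the spatial bilinear interpolation and the Gaussian weighting. The function name and the clipping value of 0.2 are assumptions made for the example.

```python
import numpy as np

def hog_descriptor(img, cell=8, bins=9, clip=0.2, eps=1e-6):
    """Simplified HOG: cell histograms plus overlapping 2x2-cell blocks."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]  # [-1, 0, 1] mask along x
    gy[1:-1, :] = img[2:, :] - img[:-2, :]  # [-1, 0, 1] mask along y
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation

    # Each pixel votes for the two nearest bin centres.
    pos = ang / (180.0 / bins) - 0.5
    lo = np.floor(pos).astype(int)
    frac = pos - lo
    lo_bin = lo % bins
    hi_bin = (lo + 1) % bins

    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            m, f = mag[sl].ravel(), frac[sl].ravel()
            np.add.at(hist[cy, cx], lo_bin[sl].ravel(), m * (1.0 - f))
            np.add.at(hist[cy, cx], hi_bin[sl].ravel(), m * f)

    blocks = []
    for cy in range(n_cy - 1):
        for cx in range(n_cx - 1):
            v = hist[cy:cy + 2, cx:cx + 2].ravel()
            v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)  # L2 normalize
            v = np.minimum(v, clip)                     # clip large values
            v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)  # renormalize
            blocks.append(v)
    return np.concatenate(blocks)
```

Because every block is normalized, multiplying the image by a constant leaves the descriptor essentially unchanged, which is the property that makes HOG attractive when the on-board image and the map differ in brightness.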
During the offline preparation phase, a lookup table is constructed
to store the HOG features extracted at every pixel
in the map. In this way, the HOG features for the map are
simply retrieved from the table when registering images online,
which saves computation time.
Fig. 6: Visualization of HOG features. First row: subimage of
reference map and onboard image. Second row: HOG glyph.
The gradient patterns for houses and roads are quite similar
in HOG glyph.
3) Similarity metrics: Several metrics are compared to
find the optimal one for measuring the similarity of HOG
descriptors. Since HOG descriptors are essentially histograms,
we measure their similarity using Correlation, Chi-Square, In-
tersection, and Bhattacharyya distance. To choose the optimal
metric, we also use images presented in Fig. 5 to calculate the
similarity value.
TABLE II: Similarity estimation using different metrics.
Metric          Image 1   Image 2   Image 3
Correlation     1         1.829     2.428
Chi-Square      1         1.810     2.009
Intersection    1         0.954     0.970
Bhattacharyya   1         1.260     1.248
The comparison of similarity metrics is summarized in
Table II (see footnote 3). All metrics except Intersection produce the
minimum distance for the matched pair.
Moreover, Correlation differentiates outliers better than Chi-Square
and Bhattacharyya, as its distances for the outliers are larger.
Therefore, Correlation is chosen as the similarity metric for
HOG descriptors in our experiment.
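The four metrics can be written down compactly. The sketch below uses the definitions commonly given for histogram comparison (for instance in the documentation of OpenCV's `compareHist`); the paper does not state which implementation was used, so this choice is an assumption, and the function names are hypothetical.

```python
import numpy as np

def hist_correlation(h1, h2):
    """Similarity in [-1, 1]; 1 means identical up to gain and offset."""
    a = h1 - h1.mean()
    b = h2 - h2.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

def hist_chi_square(h1, h2, eps=1e-12):
    """Distance; 0 for identical histograms (eps avoids division by zero)."""
    return np.sum((h1 - h2) ** 2 / (h1 + eps))

def hist_intersection(h1, h2):
    """Similarity; larger means more overlap."""
    return np.sum(np.minimum(h1, h2))

def hist_bhattacharyya(h1, h2):
    """Distance in [0, 1]; 0 for identical normalized histograms."""
    bc = np.sum(np.sqrt(h1 * h2)) / np.sqrt(h1.sum() * h2.sum())
    return np.sqrt(max(0.0, 1.0 - bc))

def correlation_distance(h1, h2):
    """Similarity-to-distance conversion d = 1 - correlation."""
    return 1.0 - hist_correlation(h1, h2)
```

Correlation and Intersection return similarities while the other two return distances, which is why a conversion such as `correlation_distance` is needed before the values can be compared on one scale.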
4) Confined localization: Comparing the HOG feature of
the on-board image with that of all the subimages in the map
is time consuming, so the traditional sliding window
approach is unsuitable as it demands too many computational
resources. Inspired by tracking algorithms, we employ
a particle filter as in [37] to estimate the true position. Furthermore,
in order to reduce the number of particles, we confine our
search to the vicinity of the predicted position, adopting a
coarse-to-fine procedure.
Particle filter: There are $N$ particles, and each particle $p$
has the properties $\{x, y, H_x, H_y, w\}$, where $(x, y)$ specifies
the top left pixel of the particle, $(H_x, H_y)$ is the size of
the subimage covered by the particle and $w$ is the weight.
Here, $(x, y)$ is generated around the predicted position, while
$(H_x, H_y)$ equals the size of the on-board image.
The optimal estimate of the posterior is the mean state of
the particles. Suppose each particle $p_i$ predicts a location $l_i$
with weight $w_i$; then the estimated state is computed as in Equation (6).
$$E(l) = \sum_{i=1}^{N} w_i \, l_i \qquad (6)$$
Based on the predicted state $(x_p, y_p)$ of where the MAV
could be in the next frame, we calculate the likelihood that
the MAV location $(x_c, y_c)$ is actually at this location. After the
particles are drawn, the subimages of the map located at the
particles are compared with the current frame. To estimate the
likelihood, we use a Gaussian distribution to normalize these
distance values based on Equation (7), where $d$ is the distance
between the two images under comparison and $\sigma$ is the standard
deviation. $\hat{w}$ is then normalized by the sum of all
weights to ensure that $w$ is in the range $[0, 1]$.
$$\hat{w} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{d^2}{2\sigma^2}\right) \qquad (7)$$
We do not use a dynamical model to propagate the
particles. Instead, since the MAV moves erratically due to wind
and the camera is not stabilized, we re-initialize the particles in
every frame using the optical flow estimate, as in [38].
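The weighting and voting steps can be illustrated as follows, assuming a Gaussian likelihood of the descriptor distance and a weighted mean of the particle locations. The function names are hypothetical. The weights are computed from shifted log-likelihoods, because with a small $\sigma$ the exponential underflows to zero; the constant factor $1/(\sqrt{2\pi}\sigma)$ and the shift both cancel in the normalization, so the normalized weights are unchanged.

```python
import numpy as np

def particle_weights(distances, sigma):
    """Normalized Gaussian weights from descriptor distances."""
    d = np.asarray(distances, dtype=float)
    log_w = -(d ** 2) / (2.0 * sigma ** 2)
    # Subtracting the maximum keeps exp() away from underflow.
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def estimate_location(locations, weights):
    """Weighted mean of the particle locations."""
    return np.asarray(weights) @ np.asarray(locations, dtype=float)

# Usage: three particles, the middle one matches best.
locs = [[10, 10], [12, 10], [14, 10]]
w = particle_weights([0.30, 0.10, 0.30], sigma=0.1)
center = estimate_location(locs, w)
```

With a small $\sigma$ the best matching particle dominates the estimate, so the weighted mean stays close to the location of the minimum distance.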
3 Correlation and Intersection measure similarity while Chi-Square and
Bhattacharyya measure distance. Since distance is used in the particle filter, we
transform similarity values to distance by $d = 1 - \text{correlation}$. The distance
values are then normalized with respect to the ground truth value.
Coarse to fine search: At the beginning, the particles are
drawn around the take-off position. Subsequently, optical flow
provides the translation between consecutive frames, and the predicted
position is updated by accumulating the translation prior
to image registration. Similar to [39], around the predicted
position, the search is conducted from coarse level to fine level
to reduce the computational burden. For the coarse search,
$N$ particles are drawn randomly in a rectangular area, whose
width and height are both $s_c$, with a large search interval $c$.
The fine search is carried out in a smaller area with size $s_f$ and
search interval $f$. Unlike [39], our method relies mostly
on the coarse search, which is often quite accurate. If the
minimum similarity distance of the coarse search is larger than
a threshold $\tau_d$, the match is considered invalid. Only when
the coarse search fails to produce a valid match do we conduct
the fine search. The fine search is still centered at the predicted
position and the coarse search result is discarded. When the
minimum similarity distances in both the coarse and fine searches
are above the threshold $\tau_d$, indicating that the image registration
result is unreliable, the position predicted by optical flow is
retained as the current location.
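The decision logic of the coarse-to-fine search can be sketched as below. This is an illustration under assumptions rather than the authors' code: `distance_fn` is a hypothetical callback returning the descriptor distance between the on-board image and the map subimage at a given top left pixel, the same number of particles is drawn at both levels, and the default parameter values are those reported in the experiments.

```python
import numpy as np

def draw_particles(center, size, step, n, rng):
    """Draw n particles on a grid of spacing `step` inside a size x size window."""
    half = size // 2
    offsets = np.arange(-half, half + 1, step)
    dx = rng.choice(offsets, size=n)
    dy = rng.choice(offsets, size=n)
    return np.stack([center[0] + dx, center[1] + dy], axis=1)

def search(distance_fn, center, n, size, step, sigma, rng):
    """Return the weighted-mean location and the minimum distance."""
    pts = draw_particles(center, size, step, n, rng)
    d = np.array([distance_fn(x, y) for x, y in pts])
    # Gaussian weights, shifted for numerical stability before normalizing.
    w = np.exp(-(d ** 2 - d.min() ** 2) / (2.0 * sigma ** 2))
    w /= w.sum()
    return w @ pts, d.min()

def localize(distance_fn, predicted, rng, n=50, s_c=40, c=4,
             s_f=20, f=1, sigma=0.01, tau_d=0.75):
    """Coarse search first, fine search on failure, optical flow as fallback."""
    est, d_min = search(distance_fn, predicted, n, s_c, c, sigma, rng)
    if d_min <= tau_d:
        return est, "coarse"
    # The coarse result is discarded; the fine search restarts at the prediction.
    est, d_min = search(distance_fn, predicted, n, s_f, f, sigma, rng)
    if d_min <= tau_d:
        return est, "fine"
    return np.asarray(predicted, dtype=float), "optical_flow"
```

Returning the stage label alongside the estimate makes it easy to count how often the optical flow prediction had to be retained.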
We now perform experiments on the datasets collected
during IMAV 2014 and 2015. In all experiments, each on-board
frame is first preprocessed by rotation and resizing,
as introduced in Section III-A.
A. Experiments on IMAV 2014 dataset
In this section, we present the performance of our method on
the aerial images collected during IMAV 2014. Our quadrotor
platform is shown in Fig. 1(a). It is 35 cm in
height and 86 cm in diagonal width, with a maximum take-off
weight of 3 kg. Its on-board hardware includes an IG-500N
attitude and heading reference system (AHRS) from
SBG Systems, a downward-facing PointGrey Chameleon
camera and an Ascending Technologies Mastermind computer.
The flight test is carried out at Oostdorp (see footnote 4), Netherlands. The
quadrotor flies at about 80 m above the ground and sweeps
over the whole Oostdorp village. The speed is about 2
m/s and the total flight duration is about 3 min.
1) Parameters: The most important parameters of our
method are $N$ and $s_c$. A larger $N$ increases the accuracy
of the weighted center but demands more computational
resources. Likewise, a larger $s_c$ makes the matching more robust to
jitter, while a smaller $s_c$ reduces the time consumed. Hence, we
trade off robustness against efficiency when choosing these
parameter values. Regarding the sensitivity of our method to
these parameters, we find that $N$ should be larger than
40 to have sufficient particles for a valid estimate.
Meanwhile, $s_c$ should be larger than 35 to account for the
inaccuracy arising from optical flow.
In our experiment, the parameters are set as
follows. During preprocessing, the size of the Google map is $850 \times 500$
and the size of the on-board image is $180 \times 180$. We use the
HOG in OpenCV with cell size $32 \times 32$, block size $64 \times 64$ and
block stride $32 \times 32$. For the coarse-to-fine search, we set $N = 50$,
$s_c = 40$, $c = 4$, $s_f = 20$, $f = 1$, $\sigma = 0.01$, $\tau_d = 0.75$.
4 Google Maps location [52.142815, 5.843196]
2) Results: We first evaluate the effect of optical flow
based position prediction; the result is presented in Fig.
7. Clearly, the method without prediction from optical flow
is prone to failure, especially when the motion is large. By
contrast, using optical flow information overcomes this
difficulty by predicting the search region.
Fig. 7: Comparison of our method with and without optical
flow prediction. Best viewed in color.
We then compare two metrics for outlier rejection, namely
minimum distance (MD) and Peak to Sidelobe Ratio
(PSR) defined in [31]. PSR is computed according to Equation
(8), where $d_{min}$ is the minimum distance, and $\mu$, $\sigma$ are the
mean and standard deviation of the distance for all particles
in the search region, excluding a circle with a radius of 5 pixels
around the minimum position.
$$\Theta = \frac{d_{min} - \mu}{\sigma} \qquad (8)$$
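A PSR-style score on particle distances can be computed as in the sketch below, which assumes the form $(d_{min} - \mu)/\sigma$ with the sidelobe statistics taken over particles farther than 5 pixels from the minimum. The function name is hypothetical, and the sketch does not guard against an empty or constant sidelobe.

```python
import numpy as np

def peak_to_sidelobe_ratio(positions, distances, exclude_radius=5.0):
    """PSR of the minimum distance relative to the remaining particles."""
    positions = np.asarray(positions, dtype=float)
    d = np.asarray(distances, dtype=float)
    i_min = np.argmin(d)
    # Particles close to the minimum belong to the peak, not the sidelobe.
    r = np.linalg.norm(positions - positions[i_min], axis=1)
    sidelobe = d[r > exclude_radius]
    return (d[i_min] - sidelobe.mean()) / sidelobe.std()

# Usage: the particle at (2, 0) lies inside the excluded circle.
demo_positions = [[0, 0], [10, 0], [20, 0], [2, 0]]
demo_distances = [0.2, 0.6, 0.8, 0.3]
psr = peak_to_sidelobe_ratio(demo_positions, demo_distances)
```

Since distances rather than correlation responses are used, the score is negative, and a more negative value indicates a more distinct minimum.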
The solid line in Fig. 8 is MD while the dashed line is
PSR. Both MD and PSR are normalized to the range [0,
1]. MD rises to large values and peaks within the yellow
highlighted interval, where the match becomes unreliable due
to significant illumination change (refer to the video). In contrast,
PSR merely keeps oscillating in that interval. MD outperforms PSR
in the sense that it indicates when the match is incorrect.
Therefore, MD is applied in our approach throughout the
localization experiments.
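One way the MD test could gate the registration result is sketched below. It is only an illustration: the function name `choose_position` is hypothetical, and the sketch assumes that the normalized MD is compared against the threshold τ_d = 0.75 from the parameter list, which the text does not state explicitly. When the match is rejected, the position predicted from optical flow is kept.

```python
def choose_position(predicted_xy, matched_xy, min_distance, tau_d=0.75):
    """Pick the position to report for one frame.

    predicted_xy : position predicted from optical flow
    matched_xy   : position found by HOG-based registration
    min_distance : normalized minimum distance (MD) of the best match
    tau_d        : rejection threshold (assumed role of tau_d)
    Returns (position, reliable_flag).
    """
    reliable = min_distance <= tau_d
    return (matched_xy if reliable else predicted_xy), reliable


# A confident match is accepted; a poor one falls back to the prediction.
good = choose_position((10.0, 5.0), (11.0, 5.5), min_distance=0.30)
bad = choose_position((10.0, 5.0), (40.0, 2.0), min_distance=0.90)
```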
Fig. 8: Comparison of outlier rejection metrics. Best viewed
in color.
We then compare the localization results of our method with
those of the VO method, with the ground truth provided by GPS.
As shown in Fig. 9, the red line depicts the GPS ground truth, and
the brown line indicates the localization from VO based on
optical flow. The green dots represent the result of our method,
and the blue crosses are unreliable match locations where
optical flow predictions are retained. As can be observed from
the video, the sequence is challenging for image registration
for three reasons. Firstly, the Google map is not up-to-date,
such that trees and buildings are missing in some regions.
Secondly, the map image only has low resolution, which may
reduce the amount of visible gradient patterns. Thirdly,
the scene undergoes large illumination change. Clearly, our
method outperforms VO. Furthermore,
the position prediction step in our method deals with unreliable
matches effectively as well. When there is obvious
illumination change around the second turn, the HOG based
match produces low similarity, and the predicted position
is closer to the ground truth. Over the whole trajectory,
image registration failures constitute 7% of all the matches.
Such failures mostly concentrate in two regions, where either
there are few gradient patterns in the scene or there is significant
illumination change. Moreover, the oscillation of our results
is sometimes significant, mainly due to the jitter of the MAV. A
gimbal could be used to mitigate the oscillation. In comparison
to the ground truth, the root mean square error (RMSE) of our
method is 6.773 m. This error is small compared with
the 169.188 m RMSE of the VO based on optical flow alone.
In fact, the localization accuracy of our method is comparable
to that of GPS, whose RMSE is 3 m.
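The text does not spell out how the RMSE is computed. A common definition, assumed in the sketch below, is the root of the mean squared Euclidean distance between time-aligned estimated positions and GPS positions, both expressed in metres. The function name `rmse` and the toy coordinates are illustrative only.

```python
import math


def rmse(estimates, ground_truth):
    """Root mean square error between two lists of (x, y) positions in metres.

    The lists must be time-aligned and of equal, non-zero length.
    """
    if not estimates or len(estimates) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(ex - gx) ** 2 + (ey - gy) ** 2
               for (ex, ey), (gx, gy) in zip(estimates, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: one exact estimate and one that is 5 m off.
error = rmse([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Because the errors are squared before averaging, a few large deviations (such as drift at the end of a VO trajectory) dominate the result.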
Fig. 9: Localization results of IMAV 2014. Best viewed in color.
Lastly, we report the computational cost of our method
on an Intel i7 3.40 GHz processor. Our method
is implemented in C++ using the OpenCV 2.4.9 library and runs
at 15.625 Hz on average. This update
rate is sufficient for position localization, since its output
will be fused with the onboard INS at 50 Hz [34]. In practice, the
resulting trajectory is smooth as long as our update frequency
is higher than 10 Hz.
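The rates quoted above translate into per-frame time budgets as in the short sketch below. The function name `vision_timing` is hypothetical; the only inputs are the 15.625 Hz, 50 Hz, and 10 Hz figures from the text.

```python
def vision_timing(vision_hz=15.625, ins_hz=50.0):
    """Return (vision period in ms, INS samples per vision update)."""
    if vision_hz <= 0 or ins_hz <= 0:
        raise ValueError("rates must be positive")
    return 1000.0 / vision_hz, ins_hz / vision_hz


period_ms, ins_per_update = vision_timing()      # reported average rate
budget_ms, _ = vision_timing(vision_hz=10.0)     # 10 Hz smoothness bound
```

At the reported rate each frame takes 64 ms on average, which leaves a margin of 36 ms relative to the 100 ms budget implied by the 10 Hz bound, and roughly three INS samples arrive between consecutive vision updates.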
B. Experiments on IMAV 2015 dataset
The IMAV 2015 dataset was collected at Floriansdorf⁵, Germany.
The structure of the quadrotor platform is simplified
compared with the one used in IMAV 2014, consisting of
only the Pixhawk and an Intel NUC. A gimbal is used to stabilize
the downward-facing PointGrey Chameleon camera since the
wind is quite strong, with a speed of more than 10 m/s. The
operating altitude is 30 m and the maximum speed is about 4
m/s. The total flight time is about 1 minute, during which the
quadrotor flies over a 60 × 30 m² region. Moreover, the same
parameters as in the previous section are used.
1) Map comparison: To visualize the difference between
the Google map and the aerial images, Pix4Dmapper is used to
stitch the on-board images into a panorama. As shown
in Fig. 10, the Google map is quite different from the actual
scene. For instance, the house at the bottom left does not exist
in the Google map, and neither do the trees at the top left. The dramatic
scene changes pose a great challenge to image registration.
Nevertheless, our method is able to overcome this because the
main road structure dominating the scene is preserved. As a
result, the gradient patterns are still similar.
Fig. 10: Comparison of Google map and the stitched actual
scene. Left: Google map. Right: Panorama created from the
on-board images. Best viewed in color.
Fig. 11: Localization results of IMAV 2015. Best viewed in color.
2) Path analysis: A correlation filter is used for location
initialization. Subsequently, the localization is based on the pose
tracking described previously. In Fig. 11, the red line depicts
the GPS ground truth, the brown line indicates the localization
from VO based on optical flow alone, the green dots represent
the output of our method, and the blue crosses are unreliable
match locations where optical flow predictions are retained. On
the one hand, the localization based on optical flow alone deviates
significantly from the GPS ground truth as it is not able to
reduce the drift. Consequently, the end position is far away
from the true landing position. On the other hand, our method
continues to provide reliable localization, since the green dots
follow the red line closely, and the RMSE of our method is
6.584 m. Furthermore, the trajectory in Fig. 11 is smoother
than the one in Fig. 9, because a gimbal is used to stabilize
the camera against the wind.
5 Google Maps location [50.788812, 6.046862]
This paper presents the first study of localization using HOG
features in GPS-denied environments, by registering on-board
aerial images to Google map. The localization experiments
using flight data show that our method could supplement GPS
since its error is comparatively small. Since the datasets used
in this work are limited to urban landscapes, our approach
constitutes an initial benchmark, and we will construct more
datasets of different environments, such as forests. These
datasets will be made publicly available for research purposes.
For future research directions, we will try different cameras
to address more challenging localization tasks. For example,
fisheye cameras will be used for image registration in homogeneous
regions since they possess a large field of view. Moreover,
we will install a thermal camera and design the algorithm for
night time navigation. Subsequently, we will perform more
evaluation on challenging environments in both day and night.
The authors would like to thank the members of NUS UAV
Research Group for their kind support.
[1] R. Mur-Artal, J. Montiel, and J. D. Tardos, “Orb-slam: a versatile
and accurate monocular slam system,” Robotics, IEEE Transactions on,
vol. 31, no. 5, pp. 1147–1163, 2015.
[2] M. J. Milford, G. F. Wyeth, and D. Rasser, “Ratslam: a hippocampal
model for simultaneous localization and mapping,” in Robotics and
Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International
Conference on, vol. 1. IEEE, 2004, pp. 403–408.
[3] J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct
monocular slam,” in Computer Vision–ECCV 2014. Springer, 2014,
pp. 834–849.
[4] G. Klein and D. Murray, “Parallel tracking and mapping for small ar
workspaces,” in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th
IEEE and ACM International Symposium on. IEEE, 2007, pp. 225–234.
[5] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense
tracking and mapping in real-time,” in Computer Vision (ICCV), 2011
IEEE International Conference on. IEEE, 2011, pp. 2320–2327.
[6] K. L. Ho and P. Newman, “Loop closure detection in slam by combining
visual and spatial appearance,” Robotics and Autonomous Systems,
vol. 54, no. 9, pp. 740–749, 2006.
[7] F. Fraundorfer and D. Scaramuzza, “Visual odometry: Part i: The first
30 years and fundamentals,” IEEE Robotics and Automation Magazine,
vol. 18, no. 4, pp. 80–92, 2011.
[8] D. Scaramuzza and F. Fraundorfer, “Visual odometry: Part ii-the first 30
years and fundamentals,” Robotics Automation Magazine, vol. 19, no. 1,
[9] C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast semi-direct
monocular visual odometry,” in Robotics and Automation (ICRA), 2014
IEEE International Conference on. IEEE, 2014, pp. 15–22.
[10] M. Faessler, F. Fontana, C. Forster, E. Mueggler, M. Pizzoli, and
D. Scaramuzza, “Autonomous, vision-based flight and live dense 3d
mapping with a quadrotor micro aerial vehicle,” Journal of Field
Robotics, 2015.
[11] S. Weiss, M. W. Achtelik, S. Lynen, M. C. Achtelik, L. Kneip, M. Chli,
and R. Siegwart, “Monocular vision for long-term micro aerial vehicle
state estimation: A compendium,” Journal of Field Robotics, vol. 30,
no. 5, pp. 803–831, 2013.
[12] B. M. Kitt, J. Rehder, A. D. Chambers, M. Schonbein, H. Lategahn,
and S. Singh, “Monocular visual odometry using a planar road model
to solve scale ambiguity,” 2011.
[13] J. Zhang and S. Singh, “Visual–inertial combined odometry system for
aerial vehicles,” Journal of Field Robotics, vol. 32, no. 8, pp. 1043–1055,
[14] T. Patterson, S. McClean, P. Morrow, and G. Parr, “Utilizing geographic
information system data for unmanned aerial vehicle position estima-
tion,” in 2011 Canadian Conference on Computer and Robot Vision
(CRV). IEEE, 2011, pp. 8–15.
[15] G. Conte and P. Doherty, “An integrated uav navigation system based on
aerial image matching,” in Aerospace Conference, 2008 IEEE. IEEE,
2008, pp. 1–10.
[16] F. Lindsten, J. Callmer, H. Ohlsson, D. Tornqvist, T. Schon, and
F. Gustafsson, “Geo-referencing for uav navigation using environmental
classification,” in Robotics and Automation (ICRA), 2010 IEEE Interna-
tional Conference on. IEEE, 2010, pp. 1420–1425.
[17] A. L. Majdik, Y. Albers-Schoenberg, and D. Scaramuzza, “Mav urban
localization from google street view data,” in Intelligent Robots and
Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE,
2013, pp. 3979–3986.
[18] C. Le Barz, N. Thome, M. Cord, S. Herbin, and M. Sanfourche, “Global
robot ego-localization combining image retrieval and hmm-based fil-
tering,” in 6th Workshop on Planning, Perception and Navigation for
Intelligent Vehicles, 2014, p. 6.
[19] D.-Y. Gu, C.-F. Zhu, J. Guo, S.-X. Li, and H.-X. Chang, “Vision-aided
uav navigation using gis data,” in Vehicular Electronics and Safety
(ICVES), 2010 IEEE International Conference on. IEEE, 2010, pp.
[20] C.-F. Zhu, S.-X. Li, H.-X. Chang, and J.-X. Zhang, “Matching road
networks extracted from aerial images to gis data,” in Information
Processing, 2009. APCIP 2009. Asia-Pacific Conference on, vol. 2.
IEEE, 2009, pp. 63–66.
[21] V. Tchernykh, M. Beck, and K. Janschek, “Optical flow navigation for
an outdoor uav using a wide angle mono camera and dem matching,”
in Mechatronic Systems, vol. 4, no. 1, 2006, pp. 590–595.
[22] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”
International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110,
[23] I. Laptev, “On space-time interest points,” International Journal of
Computer Vision, vol. 64, no. 2-3, pp. 107–123, 2005.
[24] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recog-
nition using shape contexts,” Pattern Analysis and Machine Intelligence,
IEEE Transactions on, vol. 24, no. 4, pp. 509–522, 2002.
[25] K. Mikolajczyk and C. Schmid, “A performance evaluation of local
descriptors,” Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, vol. 27, no. 10, pp. 1615–1630, 2005.
[26] N. Jojic and Y. Caspi, “Capturing image structure with probabilistic
index maps,” in Computer Vision and Pattern Recognition, 2004. CVPR
2004. Proceedings of the 2004 IEEE Computer Society Conference on,
vol. 1. IEEE, 2004, pp. I–212.
[27] C. Stauffer and E. Grimson, “Similarity templates for detection and
recognition,” in Computer Vision and Pattern Recognition, 2001. CVPR
2001. Proceedings of the 2001 IEEE Computer Society Conference on,
vol. 1. IEEE, 2001, pp. I–221.
[28] E. Shechtman and M. Irani, “Matching local self-similarities across
images and videos,” in Computer Vision and Pattern Recognition, 2007.
CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
[29] K. Chatfield, J. Philbin, and A. Zisserman, “Efficient retrieval of
deformable shape classes using local self-similarities,” in Computer
Vision Workshops (ICCV Workshops), 2009 IEEE 12th International
Conference on. IEEE, 2009, pp. 264–271.
[30] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Computer Vision and Pattern Recognition, 2005. CVPR
2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp.
[31] D. S. Bolme, J. R. Beveridge, B. Draper, Y. M. Lui et al., “Visual
object tracking using adaptive correlation filters,” in Computer Vision
and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE,
2010, pp. 2544–2550.
[32] J. Shi and C. Tomasi, “Good features to track,” in IEEE Computer
Society Conference on Computer Vision and Pattern Recognition. IEEE,
1994, pp. 593–600.
[33] J.-Y. Bouguet, “Pyramidal implementation of the lucas kanade feature
tracker,” Intel Corporation, Microprocessor Research Labs, 2000.
[34] S. Zhao, F. Lin, K. Peng, B. M. Chen, and T. H. Lee, “Homography-
based vision-aided inertial navigation of uavs in unknown environ-
ments,” in Proc. 2012 AIAA Guidance, Navigation, and Control Con-
ference, 2012.
[35] W. M. Wells, P. Viola, H. Atsumi, S. Nakajima, and R. Kikinis, “Multi-modal
volume registration by maximization of mutual information,”
Medical Image Analysis, vol. 1, no. 1, pp. 35–51, 1996.
[36] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba, “Hoggles:
Visualizing object detection features,” in Computer Vision (ICCV), 2013
IEEE International Conference on. IEEE, 2013, pp. 1–8.
[37] K. Nummiaro, E. Koller-Meier, and L. Van Gool, “An adaptive color-based
particle filter,” Image and Vision Computing, vol. 21, no. 1, pp.
99–110, 2003.
[38] A. Yao, D. Uebersax, J. Gall, and L. Van Gool, “Tracking people in
broadcast sports,” in Pattern Recognition. Springer, 2010, pp. 151–
[39] K. Zhang, L. Zhang, and M.-H. Yang, “Fast compressive tracking,”
Pattern Analysis and Machine Intelligence, IEEE Transactions on,
vol. 36, no. 10, pp. 2002–2015, 2014.