Content uploaded by Guido de Croon

Author content

All content in this area was uploaded by Guido de Croon on Dec 15, 2016

Content may be subject to copyright.

Efﬁcient Global Indoor Localization for Micro Aerial Vehicles

V. Strobel R. Meertens G.C.H.E. de Croon

Abstract— Indoor localization for autonomous micro aerial

vehicles (MAVs) requires speciﬁc localization techniques, since

the Global Positioning System (GPS) is usually not available.

We present an efﬁcient onboard computer vision approach that

estimates 2D positions of an MAV in real-time. This global

localization system does not suffer from error accumulation

over time and uses a k-Nearest Neighbors (k-NN) algorithm

to predict positions based on textons—small characteristic

image patches that capture the texture of an environment. A

particle ﬁlter aggregates the estimates and resolves positional

ambiguities. To predict the performance of the approach in

a given setting, we developed an evaluation technique that

compares environments and identiﬁes critical areas within

them. We conducted ﬂight tests to demonstrate the applicability

of our approach. The algorithm has a localization accuracy

of approximately 0.6 m on a 5 m×5 m area at a runtime of

32 ms on board of an MAV. Based on random sampling, its

computational effort is scalable to different platforms, trading

off speed and accuracy.

I. INTRODUCTION

Accurate onboard localization is a key challenge for micro

aerial vehicles (MAV). In conﬁned spaces, speciﬁc localiza-

tion algorithms are essential, since the Global Positioning

System (GPS) is usually not available. While light-weight

MAVs could be employed in various indoor tasks, they

cannot fall back on standard localization approaches due

to their limited payload and processing power. To address

this issue, this paper presents an efﬁcient indoor localization

technique.

Our contribution is a machine learning-based indoor

localization system that runs onboard of an MAV paving the

way to an autonomous system. In the presented approach,

computational power is shifted to an ofﬂine training phase

to achieve high-speed during live operation. In contrast to

visual SLAM frameworks, this project considers scenarios in

which the environment is known beforehand or can even be

actively modiﬁed. The approach is based on the occurrence

of textons, which are small characteristic image patches.

With textons as image features and a k-Nearest Neighbors

(k-NN) algorithm, we obtain 2D positions in real-time within

a known indoor environment. A particle ﬁlter was developed

that handles the estimates of the k-NN algorithm and resolves

positional ambiguities. We consider settings in which the

MAV moves at an approximately constant height, such that

the estimation of height is not necessary. In contrast to

existing approaches that use active sensors, the developed

approach only uses a passive monocular downward-looking

camera. While carrying active sensors, such as laser range

ﬁnders, is too demanding for a light-weight MAV, onboard

cameras can typically be attached. Additionally, we devel-

oped a technique for evaluating the suitability of a given

Fig. 1: The ﬁgure illustrates the presented system from

a high-level perspective. A feature vector—the texton

histogram—is extracted from the current camera image. The

vector is forwarded to a machine learning model that uses

ak-Nearest Neighbors algorithm to output k x, y-position

estimates. These estimates are passed to a particle ﬁlter,

which ﬁlters position estimates over time and outputs a

ﬁnal position estimate (red point). The expected loss shows

regions in the map where a lower localization accuracy is

expected. The average expected loss can be used as “ﬁtness

value” of a given map.

environment for the presented algorithm. It identiﬁes critical

areas and assigns a global loss value to an environment. This

allows for comparing different potential maps and identifying

regions with low expected localization accuracy.

The developed global localization system does not suffer

from error accumulation over time. Since it is intended to

further reduce the size of MAVs, lightweight and scalable

position estimation algorithms are needed. Onboard process-

ing helps to reduce errors and delays introduced by wireless

communication, and ensures a high versatility on the way

to an autonomous system. The validity of the approach is

evaluated in ﬂight experiments. An overview of the presented

approach can be seen in Figure 1.

The remainder of this paper is structured as follows.

Section II surveys existing indoor localization approaches. In

Section III, the developed texton-based approach is presented

and its components, the k-NN algorithm and the particle

ﬁlter, are introduced. Section IV describes the setup and

results of the ﬂight experiments. The results are discussed

in Section V and we draw conclusions in Section VI.

arXiv:1610.07233v1 [cs.RO] 23 Oct 2016

II. RELATED WORK

While a wide range of methods for indoor localization

exists, we only consider methods in this section that use

the same technical and conceptual setup—localization with

a monocular camera.

One distinguishes two types of robot localization: local

techniques and global techniques [14]. Local techniques need

an initial reference point and estimate coordinates based on

the change in position over time. Once they lost track, the

position can typically not be recovered. The approaches also

suffer from “drift” since errors are accumulating over time.

Global techniques are more powerful and do not need an

initial reference point. They can recover when temporarily

losing track and address the kidnapped robot problem, in

which a robot is carried to an arbitrary location [13].

A. Optical Flow

Optical ﬂow algorithms estimate the apparent motion

between successive images. The most popular optical ﬂow

methods are gradient based approaches and keypoint-based

methods [4]. Optical ﬂow methods belong to the class of lo-

cal localization techniques and can only estimate the position

relative to an initial reference point. The approaches suffer

from accumulating errors over time and typically do not

provide a means for correcting these errors. Most approaches

are computationally rather complex [4].

B. Fiducial Markers

Fiducial markers have been used for UAV localization and

landing [12, 21]. The markers encode information by the

spatial arrangement of black and white or colored image

patches. Their corners can be used for estimating the camera

pose at a high frequency.

An advantage of ﬁducial markers is their widespread

use, leading to technically mature and open-source libraries.

A drawback of the approach is that motion blur, which

frequently occurs during ﬂight, can hinder the detection of

markers [1]. Furthermore, partial occlusion of the markers

through objects or shadows break the detection. Another

downside is that markers might be considered as visually

unpleasant and may not ﬁt into a product or environmental

design [5].

C. Homography Determination & Keypoint Matching

A standard approach for estimating camera pose is de-

tecting and describing keypoints of the current view and

a reference image [22], using algorithms such as Scale-

invariant feature transform (SIF T) [19], followed by ﬁnd-

ing a homography—a perspective transformation—between

both keypoint sets. A keypoint is a salient image location

described by a feature vector. Depending on the algorithm,

it is invariant to different viewing angles and scaling.

This homography-based approach is employed in frame-

works for visual Simultaneous Localization and Mapping

(SLAM) but the pipeline of feature detection, description,

matching, and pose estimation is computationally com-

plex [15]. While the approach has been employed for global

localization for UAVs, the required processing power is still

too high for small MAVs [7].

D. Convolutional Neural Networks

Convolutional neural networks (CNNs) are a specialized

machine learning method for image processing [18]. The

supervised method has outperformed other approaches in

many computer vision challenges [10]. While their training is

usually time-consuming, predictions with CNNs often takes

only few milliseconds, shifting computational effort from the

test phase to the training phase. CNNs have been used as

robust alternative for keypoint detection and description if

images were perturbed [10] but needed more computation

time than SI FT.

In recent work, present a framework for regressing camera

positions based on CNNs [15]. The approach is rather robust

to different lighting settings, motion blur, and varying camera

intrinsics. The approach predicts positions on a modern

desktop computer in short time.

E. Texton-based Methods

Textons [24] are small characteristic image patches; their

frequency in an image can be used as image feature vector.

A texton histogram is obtained by extracting patches from

an image and comparing them to all textons in a “texton

dictionary”. The frequency of the most similar texton is then

incremented in the histogram.

Texton histograms are ﬂexible image features and their

extraction requires little processing time, which makes them

suitable for MAV on-board algorithms. The approach allows

for adjusting the computational effort by modifying the

amount of extracted image patches, resulting in a trade-off

between accuracy and execution frequency [8].

[7] use textons as image features to distinguish between

three height classes of the MAV during ﬂight. Using a

nearest neighbor classiﬁer, their approach achieves a height

classiﬁcation accuracy of approximately 78 % on a hold-

out test set. This enables a ﬂapping-wing MAV to roughly

hold its height during an experiment. In another work, [9]

introduce the appearance variation cue, which is based on

textons, for estimating the proximity to objects [9]. Using

this method, the MAV achieves a high accuracy for collision

detection and can avoid obstacles in a 5m×5mofﬁce space.

III. METHODS

The pseudo code in Algorithm 1 shows a high-level

overview of the parts of the framework. Details are given

in the following sections.

A. Hardware and Software

We decided to use the quadcopter Parrot Bebop Drone as

a prototype for all our tests. The developed approach makes

use of the bottom camera only, which has a resolution of

640 ×480 pixels with a frequency of 30 frames per second.

Algorithm 1 High-level texton framework

1: t←0

2: X0←INIT PARTICLES

3: while true do

4: t←t+ 1

5: It←REC EIV E IMG FRO M CAMERA

6: Ht←GET TE XTON HISTO GRA M(It)

7: zt←k-NN(Ht)

8: Xt←PARTICLE FI LTER(Xt−1,zt)

9: xt, yt←MAP EST IMATE(Xt)

10: end

B. Dataset Generation

A main idea of the presented method is to shift com-

putational effort to a pre-ﬂight phase. Since the MAV will

be used in a ﬁxed environment, the results of these pre-

calculations can be employed during the actual ﬂight phase.

Supervised machine learning methods need a training set to

ﬁnd a mapping from features to target values. In this ﬁrst

step, the goal is to label images with the physical x, y-

position of the UAV at the time of taking the image.

One possible way to create the data set is to align the

images with high-precision position estimates from a motion

tracking system, which yields high-quality training sets.

Major disadvantages of the approach are that motion tracking

systems are usually expensive and time-consuming to move

to different environments.

As an alternative, we sought a low-budget and more ﬂex-

ible solution. Out of the presented approaches in Section II,

the homography-based approach (Section II-C) promises the

highest ﬂexibility with a good accuracy but also requires

the most processing time. Since fast processing time is not

relevant during the pre-ﬂight phase, the approach is well-

suited for the problem. The required image dataset can be

obtained by using images gathered during manual ﬂight or

by recording images with a hand-held camera. To get a

hyperspatial image of the scene for creating a map, the

images from the dataset have to be stitched together. With

certain software packages the images can be orthrectiﬁed

by estimating the most probable viewing angle based on

the set of all images. However, since a downward-looking

camera is attached to the UAV, most images will be roughly

aligned with the z-axis, given slow ﬂight [3]. For the stitching

process, we used the freeware software Microsoft Image

Composite Editor (ICE) [20]. Keypoints of the current image

and the map image are detected and described using the

SIFT algorithm. This is followed by a matching process,

that identiﬁes corresponding keypoints between both images.

These matches allow for ﬁnding a homography between

both images. For determining the x, y-position of the current

image, its center is projected on the reference image using

the homography matrix.

C. Texton Dictionary Generation

For learning a suitable texton dictionary for an environ-

ment, image patches were clustered. The resulting cluster

centers—the prototypes of the clustering result—are the

textons [25]. The clustering was performed with a Kohonen

network [16]. The ﬁrst 100 images of each dataset were used

to generate the dictionary. From each image, 1000 randomly

selected image patches of size w×h= 6 ×6pixels were

extracted, yielding N= 100 000 image patches in total that

were clustered. For our approach, we also used the color

channels U and V from the camera to obtain color textons.

D. Histogram Extraction

The images from the preliminary dataset are converted

to the ﬁnal training set that consists of texton histograms

and x, y-values. To extract histograms in the full sampling

setting, a small window—or kernel—is convolved across the

width and height of an image and patches are extracted from

all positions. Each patch is compared with all textons in the

dictionary and is labeled with the nearest match based on

Euclidean distance. The histogram is normalized by dividing

the number of cases in each bin by the total number of

extracted patches, to yield the relative frequency of each

texton.

The convolution is a time-consuming step, since all possi-

ble combinations of width and height are considered: (640−

w+ 1) ·(480 −h+ 1) = 301 625 samples are extracted. To

speed up the time requirements of the histogram extraction

step, the kernel can be applied only to randomly sampled

image position instead [8]. This sampling step speeds up the

creation of the histograms and permits a trade-off between

speed and accuracy. The random sampling step introduces

random effects into the approach. Therefore, for generating

the training dataset, no random sampling was used to obtain

high-quality feature vectors.

E. k-Nearest Neighbors (k-NN) algorithm

The k-Nearest Neighbors (k-NN) algorithm is the “ma-

chine learning-core” of the developed algorithm. Taking

a texton histogram as input, the algorithm measures the

Euclidean distance of this histogram to all histograms in

the training dataset and outputs the kmost similar training

histograms and the corresponding x, y-positions.

While the k-NN algorithm is one of the simplest machine

learning algorithms, it offers several advantages [17]: it

is non-parametric, allowing for the modeling of arbitrary

distributions. Its capability to output multiple predictions

enables neat integration with the developed particle ﬁlter.

Additionally, k-NN regression often outperforms more so-

phisticated algorithms [6]. A frequent point of criticism is

its increasing computational complexity with an increasing

size of the training dataset. While the used training datasets

consisted of fewer than 1000 images, resulting in short

prediction times (see also Figure 6), time complexity can

be reduced by storing and searching the training examples

in an efﬁcient manner, for example, with tree structures [2].

F. Filtering

Computer vision-based estimations are often noisy or

ambiguous. Texton histograms obtained during ﬂight will

not perfectly match the ones in the training dataset: blur,

lighting settings, viewing angles, and, other variables change

the shape of the histograms.

A popular ﬁlter choice is the Kalman ﬁlter. However, the

Kalman ﬁlter is not able to represent multimodal proba-

bility distributions. This makes it rather unsuitable for the

presented global localization approach. The “naive” k-NN

regression calculates the mean of the koutputs and forwards

this value to the Kalman Filter. We decided to use a more

sophisticated method to capture multimodal distributions.

Given an adequate measurement model, a general Bayesian

ﬁlter can simultaneously maintain multiple possible locations

and resolve the ambiguity as soon as one location can be

favored. In this case, the predictions of the kneighbors can

be directly fed into the ﬁlter without averaging them ﬁrst.

However, a general Bayesian ﬁlter is computationally in-

tractable. Therefore, a variant based on random sampling was

used: the particle ﬁlter. While its computational complexity

is still high compared to a Kalman ﬁlter, one can modify

the amount of particles to trade off speed and accuracy and

adapt the computational payload to the used processor.

The weighted particles are a discrete approximation of the

probability density function (pdf ) of the state vector (x, y-

position of the MAV). Estimating the ﬁltered position of the

MAV can be described as p(Xt|Zt), where Xtis the state

vector at time tand Zt=z1, ..., ztare all outputs of the

k-NN algorithm up to time t, with each zirepresenting the

k x, y-outputs of the algorithm at time i.

The used particle ﬁlter is initialized with particles at

random x, y-positions. To incorporate the measurement noise

for each of the kestimates from the k-NN algorithm, we de-

veloped a two-dimensional Gaussian Mixture Model (GMM)

as measurement model. The GMM is parameterized by the

variances Σ[j], j ∈ {1, . . . , k}that are dependent on the rank

jof the prediction of the k-NN algorithm (for example,

j= 2 is the second nearest neighbor). The variance matrix

Σ[j]speciﬁes the variances of the deviations in x-direction

and y-direction and the correlation ρbetween the deviations.

The values for Σ[j]were determined by calculating the

variance-covariance matrix for the difference between the

ground truth Tfrom the motion tracking system and the

predictions Pjof the k-NN algorithm: Σ[j]:= Var(T−Pj).

In contrast to the measurement model, the used motion

model is simple. It is solely based on Gaussian process noise

and does not consider velocity estimates, headings, or control

inputs. Its mean and variance are dependent on the expected

velocity of the MAV. We used the forward difference Tt−

Tt−1to estimate the average movement and its variance-

covariance matrix Σprocess between timesteps tand t−1.

The algorithm of the developed particle ﬁlter is presented

in the pseudo code in Algorithm 2. In the pseudo code,

Xis the list of particles, fthe two-dimensional Gaussian

probability density function, z[i]

tthe ith neighbor from the

kNN prediction, x[m]

tthe mth particle at time t, and w[m]

t

its corresponding weight. The “resampling wheel” [23] per-

Algorithm 2 Particle ﬁlter update

1: procedure PARTICLE FILT ER (Xt−1, zt)

2: Initialize particle list

3: Xtemp := ∅

4: for m= 1 to Mdo

5: Add random process noise (motion model)

6: x[m]

t←x[m]

t+N(0,Σprocess)

7: Iterate over k-NN preds (measurement model)

8: w←0

9: for i= 1 to kdo

10: Gaussian Mixture Model

11: w←w+f(z[i]

t;x[m]

t,Σ[i]

measurement)

12: Xtemp := Xtemp ∪(x[m]

t, w)

13: Importance resampling

14: Xt←RESAMPLING WHE EL (Xtemp)

15: return Xt

forms the importance resampling step.

With the GMM, the information of all kneighbors can

be used, yielding a possibly multimodal distribution. While

a multimodal distribution allows for keeping track of several

possible positions, certain subsystems—for example a con-

trol loop—often need one point estimate. Using a weighted

average of the particles would again introduce the problem

that it could fall into a low density region (an unlikely

position). Instead, we used a maximum a posteriori (MAP)

estimate, as described by [11].

The estimation of uncertainty was modeled using the

spread of the particles—as expressed by their variance in

x-direction and y-direction.

G. Map evaluation

The performance of the developed method depends on the

environment: a texture-rich environment without repeating

patterns will be better suited than a texture-poor environment.

Ideally, one would like to know if the algorithm will work

in a given environment. Therefore, we propose an evaluation

scheme that can compare different environments and areas

within an environment. This scheme assigns a global ﬁtness

value or global loss value to a “map”—expressed as dataset

Dconsisting of Ntexton histograms hiand corresponding

x, y-coordinates posi= (xi, yi). The ﬁtness value is intended

to be proportional to the accuracy that can be expected

when using this dataset as training set for the developed

localization algorithm. The scheme allows for inspecting

the dataset and detecting regions within the map that are

responsible for the overall ﬁtness value.

The idea behind the global loss function Lis that his-

tograms hiand hjin closeby areas should be similar and

the similarity should decrease with increasing distance of

the corresponding x, y-coordinates posiand posj. Therefore,

the approach is based on the difference between actual

and ideal texton histogram similarities in a dataset. The

ideal texton similarity distribution is modeled as a two-

dimensional Gaussian distribution around each x, y-position

in the dataset. Using this idea, a histogram is compared

to all others by comparing expected similarities to actual

similarities. This results in a loss value per sample of the

dataset (local loss). Applying the algorithm to each sample

in the dataset yields the global loss of a dataset.

The method uses the cosine similarity (CS ) to compare

histograms:

CS (hi, hj) = hT

ihj

||hi|| ||hj||

The cosine similarity has the convenient property that its

values are bounded between −1and 1. In the present case,

since the elements of the histograms are non-negative, it is

even bounded between 0and 1. Let the function fdescribe

the non-normalized one-dimensional Gaussian probability

density function:

f(x;µ, σ) = e−(x−µ)2

2σ2

Since we assume that the ideal similarity in x-position is

independent of the y-position, the ideal two-dimensional

similarity function de(posi,posj; Σ) can be modeled as the

product of the respective one-dimensional function f:

de(posi,posj; Σ) = f(xi;xj, σx)·f(yi;yj, σy)

This function is also bounded between 0and 1, which

makes the functions deand CS —ideal similarity and actual

similarity—easily comparable. In summary, we propose the

following global loss function (L) for evaluating a given

dataset (D):

L(D) = 1

N2

N

X

i=1

N

X

j=1

CS (hi, hj)−f(xi;xj, σx)·f(yi;yj, σy)

The simple difference—in contrast to least absolute devi-

ations or least square errors—ensures that similarities that

are less similar than the ideal similarity reduce the loss.

Therefore, a high variation in texture is always seen as

“positive”. The variances σxand σyspecify the dimension of

the region, where similar histograms are desired. The lower

their value, the more focused the ideal similarity will be,

requiring a high texture variety for getting a low loss value.

A high value might overestimate the suitability of a dataset.

While the approach is relatively robust to the choice of the

parameter values, we still need to ﬁnd a heuristic for suitable

values.

IV. ANALYSIS

In the experiments, the MAV was guided along ﬂight plans

using the motion tracking systen. If not otherwise stated, we

used the following default values for the parameters in our

framework.

•number of samples in the histogram extraction step: 400

•number of textons in the dictionary: 20

•number of particles of the particle ﬁlter: 50

•number of histograms / images in the training set: 800

200 400 600 800 1000 1200 1400 1600 1800

100

200

300

400

500

600

700

800

900

1000 0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Fig. 2: The ﬁgure shows the loss of a map: the regions that

did not follow the ideal similarity pattern are displayed in

red. For the visualization, the loss values per sample in the

dataset were smoothed with a Gaussian ﬁlter. This assigns a

loss value to each x, y-position of the map.

•number of histograms / images in the test set: 415

•number of neighbors in the k-NN algorithm: 5

Map-dependent texton dictionaries were used and created by

conducting an initial ﬂight over the respective maps.

A. Baseline: Homography-based Approach

To ﬁnd a baseline for our approach and to provide a

homography-based training set, we used the homography-

based approach to estimate x, y-coordinates in the same

environment and based on the same images as the texton-

based framework. The required hyperspatial image (Figure 3)

of the environment was stitched together using 800 images

and the software Microsoft ICE.

Fig. 3: The created map (size: approximately 5×5meters)

that was stitched together using 800 images. The “non-

mapped” area in the center of the image is a result of the set

ﬂight path. An image distortion can be seen at the right-hand

side, where the landing spot appears twice, while in reality,

only one circle was visible.

We estimated the x, y -coordinates of the 415 test images

using the homography-based approach and compared the

predictions to the ground truth. The predictions were not

ﬁltered. The results can be found in the following table.

x-position y-position

Error in cm 31 59

STD in cm 68 77

B. Training Set based on Motion Tracking System

In this experiment, the position estimates were calculated

on board of the MAV using the texton-based approach

with the particle ﬁlter. The Euclidean distances between the

estimates of the motion tracking system and the texton-based

approach were measured in x-direction and y-direction.

The training dataset was composed of 800 texton his-

tograms with corresponding x, y-coordinates that were ob-

tained from the motion tracking system. The images were

recorded in a 5×5meter area at a height of approximately

one meter in a time span of one hour before the experiment

to keep environmental factors roughly the same.

The results can be found in the following table. They are

based on 415 images, which corresponds to a ﬂight time of

approximately 35 seconds.

x-position y-position

Error in cm 61 59

STD in cm 39 39

C. Training Set based on Homography-ﬁnding Method

In this experiment, the training dataset was created by

estimating the x, y-positions of the 800 training images

using the homography-ﬁnding method from the previous

section and the same hyperspatial image. Apart from that,

the settings are the same as in the previous experiment.

x-position y-position

Error in cm 54 97

STD in cm 41 61

D. Triggered Landing

For the triggered landing experiment, the MAV was guided

along random ﬂight paths, which covered a 5×5meter

area; during navigation, the MAV was programmed to land

as soon as its position estimates were in a “landing zone”:

an x, y-position with a speciﬁed radius r. A safety criterion

was introduced such that the landing is only performed if

the standard deviations of the particles in x-direction and

y-direction are below thresholds θxand θy. We set the

parameters to θx=θy= 60 cm. The x, y -coordinate of the

circle was speciﬁed in the ﬂight plan; the radius was set

to r= 60 cm. We performed six triggered landings; after

each landing, the x, y-center of the zone was randomly set

to another position in the map. For the texton framework,

the same training set as in Experiment IV-B was used.

Four out of six landings were correctly performed in the

landing area. The distances of the two outliers were 14 cm

and 18 cm, measured as distance to the circumference of the

landing area.

E. Speed versus Accuracy Trade-Off

Adapting the frequency of the main loop of the developed

approach to make it suitable for different platforms with

varying processing power is one of its core parts.

Figures 4 and 5 show the speed versus accuracy trade-

off as a function of the used particles and of the used

samples in the histogram extraction step, respectively. As a

reference, the frequency using full sampling in the histogram

extraction step was 0.1 Hz. The above stated default values

were used for the ceteris paribus assumption, when varying

the parameters.

While the bottom camera of the Parrot Bebop Drone has

a frequency of 30 Hz, the Paparazzi software currently only

receives the images with a frequency of 12.5Hz. Therefore,

the maximum achievable frequency without further image

processing is 12.5 Hz, which is the baseline for the conducted

experiments.

Figure 6 illustrates the frequency as a function of the

used histograms in the k-NN algorithm. We did not compare

the frequency to the distance between ground truth and the

predictions, since our training dataset did not contain more

than 800 histograms.

After having received the image, the processing time of

the presented algorithm using the default parameter values

is 32 ms, which includes the histogram extraction (16 ms) as

well as the k-NN predictions, the ﬁltering and the output of

the best x, y-coordinate (16 ms).

V. DISCUSSION

The ﬂight tests show initial evidence for the real-world

suitability of the method. It yields slightly less accurate re-

sults than the unﬁltered homography-ﬁnding method. While

we did not test the frequency of the homography-based

approach on board of an MAV, on a desktop computer, it

took 200 ms per image. Therefore, the developed algorithm

runs at a much higher frequency. The training set generation

based on the homography method yielded higher errors in

the ﬂight test. Filtering the estimates of the homography-

method ﬁrst could improve the accuracy. The triggered

landing (Experiment 4.4.4) showed good accuracy: while

most landings were triggered inside the landing zone, two out

of the six landings were outliers. However, their distance to

the landing area were rather small, with an average distance

of 16 cm.

The experiments addressing the “Speed versus accuracy

trade-of” show that with an increasing accuracy of the

approach, the frequency of the algorithm decreases. However,

the errors reach a plateau after which no large improvements

can be expected at the lower end of parameter ranges. By

optimizing the parameters, one can obtain localization errors

below 50 cm with the developed approach.

While we compared the settings of different parameters,

there are no generally optimal parameters for the presented

framework: setting the number of textons, the number of

images patches, or the number of neighbors is dependent

on the environment and the size of the training dataset. The

parameters have to be adapted to the particular environment.

Despite the overall promising results of our localization

algorithm, we noticed drawbacks during the ﬂight tests

and identiﬁed several directions for future research that are

described in what follows. The accuracy could be further

0 200 400 600 800 1000

60 80 100 120

Number of particles

Error x−direction (cm)

0 200 400 600 800 1000

35 40 45 50

Number of particles

SD Error x−direction (cm)

0 200 400 600 800 1000

0 2 4 6 8 10 12

Number of particles

Frequency (Hz)

Fig. 4: Speed versus accuracy trade-off in x-direction as a function of the number of used particles.

0 2000 4000 6000 8000 10000

60 70 80 90 100 110

Number of samples

Error x−direction (cm)

0 2000 4000 6000 8000 10000

40 50 60 70 80

Number of samples

SD Error x−direction (cm)

0 2000 4000 6000 8000 10000

4 6 8 10 12

Number of samples

Frequency (Hz)

Fig. 5: Speed versus accuracy trade-off in x-direction as a function of the number of used samples in the histogram extraction

step.

0 10000 20000 30000 40000 50000

6 7 8 9 10 11 12

Number of histograms

Frequency (Hz)

Fig. 6: Frequency of the main loop as a function of the number of histograms in the training set.

improved by combining the presented global localization

technique with a local technique. To this end, odometry

estimates using optical ﬂow or the inclusion of data from

the inertial measurement unit (IMU) could be suitable.

Our current implementation assumes constant height up

to few centimeters and only small rotations of the MAV.

While a quadroter can move in every direction without

performing yaw movements, other MAVs or the use of the

front camera for obstacle avoidance could require them. The

inclusions of images of arbitrary yaw movements into the

dataset would inﬂate its size to a great extent. This could

lead to a deterioration of the accuracy and increase the time-

complexity of the k-NN algorithm. Instead, a “derotation” of

the camera image based on IMU data could be performed to

align it with the underlying images of the dataset.

VI. CONCLUSION

This paper presented an approach for lightweight indoor

localization of MAVs. We pursued an onboard design to fos-

ter real-world use. The conducted experiments underline the

applicability of the system. Promising results were obtained

for position estimates and accurate landing in the indoor

environment.

An important step in the approach is to shift computational

effort to a pre-ﬂight phase. This provides the advantages

of sophisticated algorithms, without affecting performance

during ﬂight. The approach can trade off speed with accuracy

to use it on a wide range of models. The map evaluation

technique allows for predicting and improving the quality of

the approach.

REFERENCES

[1] E. R. AlBasiouny, A. Sarhan, and T Medhat, “Mean-

shift-fast algorithm to handle motion-blur with track-

ing ﬁducial markers,” in Computer Engineering & Sys-

tems (ICCES), 2015 Tenth International Conference

on, IEEE, 2015, pp. 286–292.

[2] N. Bhatia et al., “Survey of nearest neighbor tech-

niques,” arXiv preprint arXiv:1007.0085, 2010.

[3] M. Bl¨

osch, S. Weiss, D. Scaramuzza, and R. Siegwart,

“Vision based mav navigation in unknown and un-

structured environments,” in 2010 IEEE International

Conference on Robotics and Automation (ICRA),

IEEE, 2010, pp. 21–28.

[4] H. Chao, Y. Gu, and M. Napolitano, “A survey of op-

tical ﬂow techniques for uav navigation applications,”

in Unmanned Aircraft Systems (ICUAS), 2013 Inter-

national Conference on, IEEE, 2013, pp. 710–716.

[5] H.-K. Chu, C.-S. Chang, R.-R. Lee, and N. J. Mitra,

“Halftone qr codes,” ACM Transactions on Graphics

(TOG), vol. 32, no. 6, p. 217, 2013.

[6] (). Cpsc 340: machine learning and data mining,

[Online]. Available: https : / / www . cs . ubc .

ca/ ˜schmidtm / Courses/ 340- F15 / L7. pdf

(visited on 08/10/2016).

[7] G. De Croon, K. De Clercq, Ruijsink, Remes, and C.

De Wagter, “Design, aerodynamics, and vision-based

control of the DelFly,” International Journal of Micro

Air Vehicles, vol. 1, no. 2, pp. 71–97, 2009.

[8] G. De Croon, C De Wagter, B. Remes, and R Rui-

jsink, “Sub-sampling: real-time vision for micro air

vehicles,” Robotics and Autonomous Systems, vol. 60,

no. 2, pp. 167–181, 2012.

[9] G. De Croon, E. De Weerdt, C. De Wagter, B. Remes,

and R. Ruijsink, “The appearance variation cue for

obstacle avoidance,” IEEE Transactions on Robotics,

vol. 28, no. 2, pp. 529–534, 2012.

[10] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller,

and T. Brox, “Discriminative unsupervised feature

learning with convolutional neural networks,” in Ad-

vances in Neural Information Processing Systems,

2014, pp. 766–774.

[11] H. Driessen and Y. Boers, “Map estimation in particle

ﬁlter tracking,” in 2008 IET Seminar on Target Track-

ing and Data Fusion: Algorithms and Applications,

IET, 2008, pp. 41–45.

[12] D. Eberli, D. Scaramuzza, S. Weiss, and R. Siegwart,

“Vision based position control for MAVs using one

single circular landmark,” Journal of Intelligent &

Robotic Systems, vol. 61, no. 1-4, pp. 495–512, 2011.

[13] S. P. Engelson and D. V. McDermott, “Error correction

in mobile robot map learning,” in IEEE International

Conference on Robotics and Automation, 1992 Pro-

ceedings., IEEE, 1992, pp. 2555–2560.

[14] D. Fox, W. Burgard, F. Dellaert, and S. Thrun, “Monte

carlo localization: efﬁcient position estimation for

mobile robots,” AAAI/IAAI, vol. 1999, pp. 343–349,

1999.

[15] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: a

convolutional network for real-time 6-dof camera relo-

calization,” in Proceedings of the IEEE International

Conference on Computer Vision, 2015, pp. 2938–

2946.

[16] T. Kohonen, “The self-organizing map,” Proceedings

of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[17] M. Kordos, M. Blachnik, and D. Strzempa, “Do we

need whatever more than k-NN?,” in International

Conference on Artiﬁcial Intelligence and Soft Com-

puting, Springer, 2010, pp. 414–421.

[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,

“Gradient-based learning applied to document recog-

nition,” Proceedings of the IEEE, vol. 86, no. 11,

pp. 2278–2324, 1998.

[19] D. G. Lowe, “Object recognition from local scale-

invariant features,” in Computer vision, 1999. The pro-

ceedings of the seventh IEEE international conference

on, Ieee, vol. 2, 1999, pp. 1150–1157.

[20] (). Microsoft image composite editor, [Online]. Avail-

able: http : / / research . microsoft . com /

en- us /um/redmond/ projects/ice/ (visited

on 08/02/2016).

[21] Parrot. (2015). Ces 2015 : parrot bebop dance chore-

ography, [Online]. Available: https : / / www .

youtube.com/watch?v=A_3UifFb45Y (visited

on 10/15/2015).

[22] S. Se, D. Lowe, and J. Little, “Global localization us-

ing distinctive visual features,” in IEEE/RSJ Interna-

tional Conference on Intelligent Robots and Systems,

2002., IEEE, vol. 1, 2002, pp. 226–231.

[23] S. Thrun. (). Artiﬁcial intelligence for robotics, [On-

line]. Available: https:// www .udacity.com/

course / artificial - intelligence - for -

robotics -- cs373 (visited on 03/10/2016), in

preparation.

[24] M. Varma and A. Zisserman, “A statistical approach

to texture classiﬁcation from single images,” Interna-

tional Journal of Computer Vision, vol. 62, no. 1-2,

pp. 61–81, 2005.

[25] —, “Texture classiﬁcation: are ﬁlter banks neces-

sary?,” in 2003 IEEE Computer Society Sonference

on Computer Vision and Pattern Recognition, 2003

Proceedings, IEEE, vol. 2, 2003, pp. II–691.