SPM-SLAM: Simultaneous Localization and Mapping with Squared Planar Markers

Rafael Muñoz-Salinas 1,2,∗, Manuel J. Marín-Jiménez 1,2, R. Medina-Carnicer 1,2
Abstract

SLAM is generally addressed using natural landmarks such as keypoints or texture, but this poses some limitations, such as the need for sufficiently textured environments and high computational demands. In some cases, it is preferable to sacrifice the flexibility of such methods for an increase in speed and robustness by using artificial landmarks.

The recent work [1] proposes an off-line method to obtain a map of squared planar markers in large indoor environments. By freely distributing a set of markers printed on a piece of paper, the method estimates the marker poses from a set of images, given that at least two markers are visible in each image. Afterwards, camera localization can be done, at the correct scale. However, an off-line process has several limitations. First, errors cannot be detected until the whole process is finished, e.g., an insufficient number of markers in the scene or markers not properly spotted in the capture stage. Second, the method is not incremental, so, in case the map needs to be expanded, it is necessary to repeat the whole process from scratch. Finally, the method cannot be employed in real-time systems with limited computational resources such as mobile robots or UAVs.

To solve these limitations, this work proposes a real-time solution to the problems of simultaneously localizing the camera and building a map of planar markers. This paper contributes a number of solutions to the problems arising when solving SLAM from squared planar markers, coining the term SPM-SLAM. The experiments carried out show that our method can be more robust, precise and faster than visual SLAM methods based on keypoints or texture.

Keywords: Fiducial Markers, Marker Mapping, SLAM
1. Introduction

Simultaneous localization and mapping (SLAM) is the problem of creating a map of the environment and estimating the camera pose at the same time [2]. Sparse [3] and dense [4] solutions using natural features have attracted most of the research effort,
∗Corresponding author
Email addresses: rmsalinas@uco.es (Rafael Muñoz-Salinas), mjmarin@uco.es (Manuel J. Marín-Jiménez), rmedina@uco.es (R. Medina-Carnicer)
1 Computing and Numerical Analysis Department, Edificio Einstein, Campus de Rabanales, Córdoba University, 14071 Córdoba, Spain. Tlfn: (+34) 957212255
2 Instituto Maimónides de Investigación en Biomedicina (IMIBIC), Avenida Menéndez Pidal s/n, 14004 Córdoba, Spain. Tlfn: (+34) 957213861
reaching a high degree of performance. Nevertheless, they have a number of limitations in some realistic scenarios. First, they require a certain amount of texture, which in some indoor environments is not available (e.g., labs and corridors). Second, bag-of-words (BoW) [5] methods are employed to solve the relocalization problem; however, they have limited performance under viewpoint changes and the repetitive patterns that are common in certain environments. Third, when using a single camera, the generated map is scale-agnostic and consequently cannot be directly employed for navigation tasks. Fourth, it is well known that translational movement is required in order to build the map when using a single camera.
In many cases, using artificial markers is an acceptable solution to robustly estimate the camera pose at a very reduced cost, using very few computational resources, and solving the above-mentioned problems. Approaches based on squared planar markers are very popular [6, 7, 8, 9, 10, 11, 12, 13, 14]. These markers are composed of an external black border and an internal code (most often binary) to uniquely identify them. Their main advantage is that the corners of a single marker can be employed for camera pose estimation. The general approach, however, is to use a single marker, or at most a small set of them whose relative poses are known beforehand (e.g., a set of markers on a printed piece of paper). Of course, the range of applicability of such systems is limited to very small areas. Very few attempts have been made to create large-scale camera localization systems based on squared planar markers that anyone can print and stick in the desired environment.
The main difficulty when dealing with squared planar markers is the ambiguity problem [15, 16, 17]. Although, in theory, it should be possible to accurately obtain the pose from the four corners of a planar square, in practice, due to noise in the localization of the corners, two solutions appear, and in some cases it is impossible to distinguish the correct one.
In a previous work [1], the authors of this paper proposed an off-line method to create a map of squared planar markers able to deal with the ambiguity problem. One only needs to print markers with a regular printer, place them in the desired environment so as to cover the working area, and take pictures of them. Then, the method creates a map with the poses of the markers by analyzing the images. The method can employ an unlimited number of high-resolution images in order to achieve very high accuracy, since no time restrictions are imposed (it is an off-line process). However, off-line processes have several limitations. First, errors cannot be detected until the whole process is finished, e.g., an insufficient number of markers in the scene or markers not properly spotted in the capture stage. Second, the method is not incremental, so, in case the map needs to be expanded, it is necessary to repeat the whole process from scratch. Finally, the method is not suitable for real-time applications, e.g., a mobile robot or a UAV with limited computational
Figure 1: SPM-SLAM: a method to build cost-effective and robust localization systems using squared markers. The printed markers are placed in arbitrary locations of the environment. Then, given a video sequence spotting the markers, the proposed system builds the map of squared planar markers and localizes the camera at the same time. Blue squares represent the map of markers, obtained from the images of the video sequence shown. The right image shows a three-dimensional representation: the markers found are shown in blue, and the selected camera locations employed for optimization in green.
resources that has to create a map of the environment while navigating.

This paper proposes a complete system able to dynamically build a map of the markers and simultaneously localize the camera from a video sequence. The proposed method uses only information from planar markers and is able to deal with the ambiguity problem. In our tests, the proposed method is able to run at more than 150 Hz using a single thread on a laptop computer. Figure 1 shows an example created in our lab, in which the poses of both the markers and the camera are obtained incrementally using the images of a video sequence recorded with a hand-held camera.
Four original contributions are made in this paper in order to build the proposed system. First, this is, to the best of our knowledge, the first real-time SPM-SLAM system able to deal with the ambiguity problem and to operate in large indoor environments. Second, we propose a method for map initialization from a set of ambiguously detected markers seen from at least two different locations. Third, we propose a method to estimate the pose of markers from ambiguous observations of them. Finally, we propose a method for loop closure detection and correction using squared planar markers. It is also worth mentioning that the proposed method is publicly available for evaluation purposes 3.
The rest of the paper is organized as follows. After presenting the related work in Sect. 2, we give in Sect. 3 an overview of the problem along with the main elements required to solve it. Sect. 4 describes the details of each module of the proposed system, and, in Sect. 5, a thorough experimental study is carried out on challenging scenes to validate our approach. Finally, the main conclusions of this work are summarized in Sect. 6.
3 http://www.uco.es/investiga/grupos/ava/node/58
Figure 2: Pose ambiguity problem: the same observed projection can be obtained from two different camera poses γ and γ̇.
2. Related works

Visual SLAM aims at solving the simultaneous localization and mapping problem using visual information exclusively. In [18], Klein and Murray presented their PTAM system, in which two different threads running in parallel perform tracking and mapping. The work showed the possibility of splitting the tasks into two different threads, achieving real-time performance. However, their method did not consider the detection of large loops. The recent work of Mur-Artal et al. [3] presents a keyframe-based SLAM method using ORB keypoints [19]. Their approach operates in real time and is able to detect loop closures and correct the poses accordingly. Engel et al. [4] proposed a semi-dense monocular visual SLAM solution called LSD-SLAM. In their approach, scenes are reconstructed in a semi-dense fashion by fusing spatiotemporally consistent edges. However, in order to solve the relocalization and loop-closure problems, they use keypoint features. Despite the good results achieved by these types of systems, they pose several drawbacks. Tracking typically fails with rapid motion, large changes in viewpoint, or significant appearance changes. In addition, the maps obtained are scale-agnostic, so they cannot be directly used for robot navigation. Finally, relocalization often fails in regions of the environment far from those where the map was created.
Square-based fiducial markers [6, 7, 8, 9, 10, 11] consist of an external black border and an internal code (most often binary) to uniquely identify each marker. The main advantage of such markers is that the camera pose can be estimated using their four corners, under some circumstances. While, in theory, the pose of the camera can be uniquely estimated from the four corners, in practice there is a rotation ambiguity that corresponds to an unknown reflection of the plane about the camera's z-axis [15, 16, 17]. The problem is shown in Figure 2. The marker is represented as the side of a cube, which can be in two different orientations (red and blue), thus producing almost identical projections from two different camera locations γ and γ̇. The methods proposed in [15, 16, 17] find the best solution by a careful analysis of the projections. In most cases, the reprojection error of one solution is much lower than that of the other. Then, no ambiguity problem is observed and the correct solution is the one with the lowest error. However, when imaging small planes, or planes at a distance significantly larger than the camera's focal length, in the presence of noise, the reprojection errors of both solutions are so similar that it is not possible to determine the correct one. In these cases, the observed size of the marker in the image becomes small and thus the error in the estimation of the corners becomes relatively large. Then, the reprojection errors of both solutions become similar and the correct one cannot be distinguished.
Maybe because of this problem, very few attempts have been made to automatically create large maps of markers placed in arbitrary locations, as we propose in this work. Davison et al. propose in [20] a method for monocular SLAM in which a squared marker is employed for initialization. Although they rely exclusively on natural features for tracking, they show that using a single squared marker for initialization is a good choice to obtain the map at the correct scale.

Lim and Lee [21] present an approach to SLAM with planar markers. An Extended Kalman Filter (EKF) is used to track the robot pose while navigating in an environment with some markers in it. As markers are found, they are added to the map considering the current robot pose along with the relative pose between the marker and the robot. Their approach, however, does not consider the ambiguity or the loop closure problems. A similar approach is presented in [22] for an autonomous blimp.
The work of Klopschitz and Schmalstieg [23] shows a system for estimating the 3D locations of fiducial markers in indoor environments. A video sequence of the environment is recorded and the camera positions are estimated using Structure from Motion (SfM) with natural keypoints. Once the camera locations are accurately obtained, the marker locations are obtained by triangulation. This approach does not deal with the dual problem of camera and marker localization jointly. Also, it assumes the correct functioning of an SfM method which, as we have already commented, is not always possible.

Karam et al. [24] propose the creation of a pose graph where nodes represent markers and edges the relative poses between them. The map is created in an on-line process, and edges are updated dynamically. Whenever a pair of markers is seen in a frame, their relative position is updated and, if it is better than the previous one, replaced. For localization, their approach selects, from the set of markers visible at that time, the one whose path to an origin node is shortest. Their approach poses several problems. First, they do not account for the ambiguity problem. Second, they only consider one marker for localization out of all the visible ones; however, using all visible markers at the same time can lead to better localization results. Third, their experimental results do not really prove the validity of their proposal in complex scenes.
The work of Neunert et al. [25] presents a monocular visual-inertial EKF-SLAM system based on artificial landmarks. Again, an EKF is employed to do SLAM, fusing information from the markers and an inertial measurement unit. As in the previous cases, the marker ambiguity and loop closure problems are not considered in their work. Additionally, our method does not need inertial sensors to solve the problem.

Finally, the authors of this work proposed in [1] an off-line method to obtain a map of markers in large indoor environments. In our previous work, we deal with the ambiguity problem and propose a robust solution for marker mapping and localization. The main difference with this work is that [1] assumes that all the video frames are available at once, and solves the problem globally. Nonetheless, the off-line approach poses some limitations. For instance, errors in the capture of the video sequence, such as a missing marker, cannot be detected until it is post-processed. In contrast, this work proposes an online solution, in which the map is built incrementally as video frames are captured.
3. System overview

The problem to be solved in this paper is the following. Consider a video sequence recording an environment in which fiducial markers have been placed. The goal is to simultaneously determine the pose of the camera w.r.t. an arbitrary reference system (the global reference system) and to create a map of the markers and their locations in the global reference system. In our problem, only information from the fiducial markers is used.

This section provides an introduction to the system's main elements. First, we provide a reference guide that will help the reader with the notation employed throughout this work (Sect. 3.1), and then we give a brief explanation of the main mathematical concepts required for this work (Sect. 3.2). Afterwards, we describe the map (Sect. 3.3), which is the data structure employed to keep the operational information, and we continue (Sect. 3.4) with a brief description of the system control loop.
3.1. Clarification on the notation

Below, we provide a summary of the most relevant terms employed along the paper:

• f^t: frame. Image acquired by the camera at time instant t.

• frs: frame reference system. Reference system centered on the camera origin when the frame was acquired. Each frame has its own reference system.

• mrs: marker reference system. Reference system centered in a marker. Each marker has its own reference system.

• grs: global reference system. The common reference system w.r.t. which we desire to obtain all measures.
Figure 3: Main terms employed throughout the paper.
• γ_m: grs ← mrs. Transforms points from the reference system of marker m to the global reference system.

• γ^t: frs ← grs. Transforms points from the global reference system to the reference system of frame t.

• γ^t_m: frs ← mrs. Transforms points from the reference system of marker m to the reference system of frame t.

Please notice that, when using transformations γ, the superscript refers to frames, while the subscript refers to markers. Fig. 3 depicts the terms explained.
3.2. Initial concepts

Let us consider a three-dimensional point p_a ∈ ℝ³ in an arbitrary reference system a. In order to express such a point in another reference system b, it undergoes a rotation followed by a translation. Let us denote by γ ∈ ℝ⁶ the three rotational and translational components r = (r_x, r_y, r_z) and t = (t_x, t_y, t_z):

    γ = (r, t),  r, t ∈ ℝ³.   (1)

Using Rodrigues' rotation formula, the rotation matrix R can be obtained from r as:

    R = I_{3×3} + r̄ sin θ + r̄² (1 − cos θ),   (2)

where I_{3×3} is the identity matrix and r̄ denotes the antisymmetric matrix

         |    0      −r_z/θ    r_y/θ |
    r̄ = |  r_z/θ      0      −r_x/θ |,   (3)
         | −r_y/θ    r_x/θ      0    |

such that θ = ||r||₂.

Then, in combination with t, the 4×4 matrix

    Γ(γ) = | R  t^⊤ |
           | 0   1  |   (4)

can be used to transform a point in homogeneous coordinates from a to b as:

    (p_b^⊤, 1)^⊤ = Γ(γ) (p_a^⊤, 1)^⊤.   (5)

To ease the notation throughout this paper, we define the operator (·) to express the transform of Eq. 5:

    p_b = γ · p_a.   (6)

Likewise, we also use the operator (·) to concatenate transforms:

    γ_{a,b} = Γ(γ_a) Γ(γ_b) = γ_a · γ_b.   (7)

The basic idea is that we will operate along the paper with the 6-dof parameter γ as if it were the equivalent 4×4 matrix Γ(γ).
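The following snippet is a minimal illustration of Eqs. (1)-(7), using only numpy; the function names are ours and do not come from the authors' implementation.

```python
import numpy as np

def gamma_to_matrix(gamma):
    """Build the 4x4 matrix Gamma(gamma) of Eq. (4) from gamma = (rx, ry, rz, tx, ty, tz)."""
    r, t = np.asarray(gamma[:3], float), np.asarray(gamma[3:], float)
    theta = np.linalg.norm(r)                     # theta = ||r||_2
    if theta < 1e-12:
        R = np.eye(3)                             # no rotation
    else:
        k = r / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])        # antisymmetric matrix of Eq. (3)
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)  # Rodrigues, Eq. (2)
    G = np.eye(4)
    G[:3, :3], G[:3, 3] = R, t
    return G

def transform_point(gamma, p):
    """Eq. (6): p_b = gamma . p_a, via homogeneous coordinates (Eq. (5))."""
    return (gamma_to_matrix(gamma) @ np.append(p, 1.0))[:3]

def compose(gamma_a, gamma_b):
    """Eq. (7): concatenation of transforms, returned as a 4x4 matrix."""
    return gamma_to_matrix(gamma_a) @ gamma_to_matrix(gamma_b)
```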
A point p projects into the camera plane at a pixel u ∈ ℝ². Assuming that the camera parameters are known, the projection can be obtained as a function:

    u = Ψ(δ, γ, p),   (8)

where

    δ = (f_x, f_y, c_x, c_y, k_1, …, k_n)

refers to the camera intrinsic parameters, comprised of the focal distances (f_x, f_y), the optical center (c_x, c_y) and the distortion parameters (k_1, …, k_n). The parameter γ represents the camera pose from which the frame was acquired, i.e., the transform that moves a point from an arbitrary reference system to the camera one.
Squared fiducial markers are comprised of an external black border and an inner binary code m that identifies them.
Assuming that all markers have the same size s, their four corners can be expressed w.r.t. one of their corners as:

    c_1 = (0, 0, 0),
    c_2 = (0, s, 0),
    c_3 = (s, s, 0),
    c_4 = (s, 0, 0).   (9)
We shall denote by

    ω^t_m = { u^t_{m,l} | u ∈ ℝ², l = 1 … 4 }   (10)

the pixel locations at which the four corners of marker m (Eq. 9) are observed in frame f^t.
The reprojection error of a marker m in a frame f^t given the transform γ is calculated as:

    e^t_m(γ) = Σ_{l=1}^{4} ( Ψ(δ, γ, c_l) − u^t_{m,l} )²,   (11)

where Ψ is the projection function defined in Eq. 8, and γ is a transform moving the 3D corner locations from the mrs to the frs.
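As an illustration of Eqs. (9)-(11), the sketch below evaluates the reprojection error of one marker with OpenCV's projection routine (the ArUco library used in this work is itself built on OpenCV). The marker side s, the camera matrix K, and all names are assumptions of this example, not the authors' code.

```python
import numpy as np
import cv2

def marker_corners(s):
    """Eq. (9): 3D corners of a marker of side s, in the marker reference system."""
    return np.array([[0, 0, 0], [0, s, 0], [s, s, 0], [s, 0, 0]], dtype=np.float64)

def reprojection_error(rvec, tvec, corners3d, observed_px, K, dist):
    """Eq. (11): squared reprojection error of one marker in one frame; (rvec, tvec)
    is the transform gamma moving points from the mrs (or grs) to the frame."""
    proj, _ = cv2.projectPoints(corners3d, rvec, tvec, K, dist)   # Psi of Eq. (8)
    return float(np.sum((proj.reshape(-1, 2) - observed_px) ** 2))
```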
As already mentioned, when estimating the pose of a marker m from a single image, there is a rotation ambiguity that corresponds to an unknown reflection of the plane about the camera's z-axis. However, it is possible to obtain the two solutions [16, 17]; let us denote them by

    θ^t_m = { γ^t_m, γ̇^t_m },   (12)

and their corresponding reprojection errors by e^t_m(γ^t_m) and e^t_m(γ̇^t_m). Assuming that the solutions are sorted such that e^t_m(γ^t_m) < e^t_m(γ̇^t_m), the ambiguity problem occurs when the ratio e^t_m(γ̇^t_m)/e^t_m(γ^t_m) is near one. In our work, we assume that when this error ratio is above a threshold τ_e, the solution γ^t_m is unambiguous and thus a reliable indication of the marker pose w.r.t. the frame. Otherwise, the pose is considered ambiguous. This paper proposes methods to deal with and overcome the ambiguity problem by using information from either multiple markers or multiple views.
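For readers who want to reproduce the ambiguity test, OpenCV's IPPE-square solver returns both planar-pose solutions together with their reprojection errors, so the ratio test against τ_e can be sketched as follows. Note that IPPE expects the four object points of a square centred at the origin, which differs slightly from the corner convention of Eq. (9); all names here are illustrative, not the authors' code.

```python
import numpy as np
import cv2

def detect_marker_pose(observed_px, s, K, dist, tau_e=3.0):
    """Return the two pose solutions of Eq. (12) and whether the pose is ambiguous."""
    obj = np.array([[-s / 2,  s / 2, 0], [ s / 2,  s / 2, 0],
                    [ s / 2, -s / 2, 0], [-s / 2, -s / 2, 0]], dtype=np.float64)
    n, rvecs, tvecs, err = cv2.solvePnPGeneric(
        obj, observed_px.astype(np.float64), K, dist, flags=cv2.SOLVEPNP_IPPE_SQUARE)
    if n < 2:            # a single returned solution is trivially unambiguous
        return (rvecs[0], tvecs[0]), None, False
    # Solutions come sorted by reprojection error, so err[1] / err[0] >= 1.
    ambiguous = float(err[1]) / max(float(err[0]), 1e-9) <= tau_e
    return (rvecs[0], tvecs[0]), (rvecs[1], tvecs[1]), ambiguous
```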
3.3. System map

In order to store all the elements required to solve the SLAM problem, our system maintains the following map:

    W = { M, F, G, ω, γ^{t−1} }.

W is composed of a set of markers M, a set of keyframes F, a connection graph G, the frame index ω of the current keyframe f^ω, and the pose of the last frame observed, γ^{t−1}.
Figure 4: Pipeline for marker-based SLAM. See text for details.
The set

    M = { m | m ∈ D }

represents the set of markers observed in the scene, where D is the set of valid marker identifiers, also called dictionary [10]. If the pose of marker m has been properly estimated, then γ_m represents the transformation moving the corners of the marker from the marker reference system (mrs) to the global reference system (grs). Nevertheless, the pose of a marker is not necessarily estimated the first time it is spotted. Sometimes, several observations are required in order to produce an estimate of its pose. So, if the pose of an observed marker has not been estimated, we set γ_m = 0, indicating that it is invalid. Along the video sequence, marker poses are subject to continuous optimization in order to globally match all available observations.
The map also registers information about some of the visited frames, named keyframes. In particular, for each keyframe f^i ∈ F, we keep the following information:

• First, the transform moving points from the grs to the frame reference system (frs), denoted γ^i. The initial estimate of the frame pose is obtained using the information from the markers in M. Along the video sequence, the poses are refined as part of an optimization process.

• Second, the set of markers visible in frame f^i:

    Λ_i = { m | m ∈ M }.   (13)
The graph G represents relations between the observed keyframes. Nodes represent keyframes, while edges represent the total number of common markers visible in two keyframes, i.e., |Λ_i ∩ Λ_j| for keyframes f^i and f^j. There is an edge between two nodes only if they share at least one marker. The graph is used both in the optimization process, to accurately estimate marker and keyframe poses, and for loop closure detection and correction.

The current keyframe f^ω is the element of F closest (in 3D space) to the previous frame, and it plays an important role when inserting new keyframes into the map and in the process of loop closure detection. Finally, γ^{t−1} represents the pose of the last visited frame f^{t−1}.
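A minimal sketch of how the map W of this section could be laid out in code is given below; the container and field names are ours, chosen only for readability (the invalid pose γ_m = 0 is encoded here as None).

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class Marker:
    id: int
    pose: Optional[np.ndarray] = None   # gamma_m (mrs -> grs); None encodes the invalid pose

@dataclass
class Keyframe:
    pose: np.ndarray                    # gamma^i (grs -> frs)
    markers: set = field(default_factory=set)   # Lambda_i: ids of the markers visible in f^i

@dataclass
class SlamMap:                          # W = {M, F, G, omega, gamma^{t-1}}
    markers: dict = field(default_factory=dict)     # M: id -> Marker
    keyframes: dict = field(default_factory=dict)   # F: id -> Keyframe
    current_kf: Optional[int] = None    # omega: index of the current keyframe
    last_pose: Optional[np.ndarray] = None          # gamma^{t-1}

    def graph_edge(self, i: int, j: int) -> int:
        """Weight of edge (i, j) in G: |Lambda_i intersect Lambda_j| (0 means no edge)."""
        return len(self.keyframes[i].markers & self.keyframes[j].markers)
```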
3.4. Operational overview

This section aims at providing an outline of the main tasks involved in the proposed system. Figure 4 depicts the steps that are explained below.

In the beginning, the map is empty and the system is in initialization mode, looking for one or several markers to start the tracking process. Initialization can be done from either one or two frames using a novel method we propose (Sect. 4.1), which is one of the contributions of this paper. As a result of the initialization, the initial keyframe(s) are added to the map, as well as the initial marker(s).

For each new frame f^t acquired, markers are detected and tracking is performed. Tracking consists of estimating the current frame pose γ^t by minimizing the reprojection error of the visible marker corners. To do so, the previous frame pose γ^{t−1} is used as the initial estimate. Note that tracking can only be done using the observed markers with a valid pose (γ_m ≠ 0) in the map (Sect. 4.3).
Once γ^t is estimated, the map is updated. If f^t has no new markers, it is added to the map only if the 3D distance to the current keyframe f^ω is above a certain threshold (Sect. 4.4). However, if the frame contains new markers, both the frame and the markers are added to the map. Nonetheless, if the pose of a new marker cannot be unambiguously determined, it is added with γ_m = 0, delaying the estimation of its pose to subsequent frames. This work also contributes by proposing a method to estimate the pose of markers from ambiguous estimations in an arbitrary number of frames (Sect. 4.5).

To avoid unnecessary growth of the map, a keyframe culling process is run whenever a new keyframe is added to the map. The goal is to keep the minimal set of keyframes that allows a proper optimization of the reprojection error of the markers in the map (Sect. 4.6).
After keyframe culling, a local optimization process is run in order to adjust both the keyframe and marker poses by minimizing the reprojection error of the markers (Sect. 4.7). The basic idea is that the new observations just added (keyframes and markers) help to improve the keyframe poses. However, this is a local process, since only those keyframes of the map that observe the markers in the current frame are affected by it.

An important aspect to consider while tracking is the detection of loop closures (Sect. 4.9). A loop closure takes place when a region of the map is revisited. In most cases, the error accumulated along the camera path makes it impossible to obtain an accurate pose estimation. So, the closure must be properly detected and corrected before continuing the tracking process (Sect. 4.7). Then, the current keyframe can be added to the map and the local optimization process run again to improve the accuracy of the map. The method for loop closure detection and correction using markers (Sect. 4.9) is another contribution of this work.

The final step before capturing a new frame is the selection of the current keyframe f^ω. As already indicated, the current keyframe is the one nearest to f^{t−1} and is used for keyframe insertion and loop closure detection.

If no marker is detected in the current frame, the system moves to the relocalization state. In relocalization mode, the system waits until valid markers are observed to estimate the current pose. However, relocalization is more error-prone than tracking, since no initial estimate is available from the previous frame. We propose a method to relocalize even from ambiguously detected markers in Sect. 4.8.

Finally, when there are no more frames to be processed, we perform a global optimization (Sect. 4.10) that jointly optimizes keyframe and marker poses, along with the camera intrinsic parameters.
4. Detailed system description

This section explains in detail the different components involved in the whole process previously outlined.
4.1. Initialization

Initialization is done at the beginning of the video sequence and cannot be completed until a set of conditions is met. At the initialization stage, the global reference system is established and the first markers and keyframes are added to the map.

We propose two initialization methods: one-frame and two-frames initialization. The former is able to initialize the system with only one frame if there is at least one marker unambiguously detected. The latter initializes the system even from ambiguously detected markers only. In order to run both methods simultaneously, we proceed as follows. We store the first frame f^0 and run one-frame initialization. If it succeeds, the system is initialized and we move to the tracking state. If not, we process the next frame f^1 and run one-frame initialization on it. If it succeeds, we move to the tracking state. If not, we try two-frames initialization using f^0 and f^1. If that does not succeed either, we try using f^0 with subsequent frames f^2, f^3, …. If after n attempts neither the one-frame nor the two-frames approach succeeds, the role of f^0 is replaced by f^n.
4.1.1. One-frame initialization

One-frame initialization is done when, among the markers observed in a frame, the pose of at least one marker is detected unambiguously, i.e.,

    ∃ m ∈ Λ^t | e(γ̇^t_m)/e(γ^t_m) > τ_e.   (14)

Then, the frame is considered to be the center of the global reference system and is added to the map along with the observed markers. If two-frames initialization succeeds, the two frames are added to the map along with all the detected markers, and a connection between them is established in the graph (the weight of the edge equals the number of common markers in both frames). Then, the first frame is assumed to be the center of the grs, and γ^t is established as the pose of the second frame, which is also the current keyframe f^ω. Below, we explain the method we propose for two-frames initialization.
4.1.2. Two-frames initialization

The proposed two-frames initialization method estimates the pose of both frames and markers even from markers ambiguously observed in the two frames. To do so, we first need to estimate the relative pose γ^{1,0} between the two frames considered, f^0 and f^1. Let us illustrate the proposed solution with the simplest example: a single marker m ambiguously detected in frames f^0 and f^1. Let γ^{1,0}_m be the relative pose of the frames obtained using marker m. In that case, for frame f^0 we obtain two marker poses,

    θ^0_m = { γ^0_m, γ̇^0_m },

and another two,

    θ^1_m = { γ^1_m, γ̇^1_m },

for frame f^1.
The relative pose between the two frames can be estimated from θ^0_m and θ^1_m. In total, there are four possible solutions γ^{1,0}_m ∈ C^{1,0}_m, obtained by combining the elements of θ^0_m and θ^1_m:

    C^{1,0}_m = { γ^1_m (γ^0_m)^{−1}, γ^1_m (γ̇^0_m)^{−1}, γ̇^1_m (γ^0_m)^{−1}, γ̇^1_m (γ̇^0_m)^{−1} }.   (15)
If there is more than one marker, we shall denote by

    C^{1,0} = { C^{1,0}_m | m ∈ Λ^1 ∪ Λ^0 }   (16)

the set of possible transforms between the two frames. The best solution is the one that best explains the given observations, i.e., the one that minimizes the reprojection error of the marker corners in both frames:

    β(γ) = Σ_{m ∈ Λ^1 ∪ Λ^0} [ min( e^1_m(γ·γ^0_m), e^1_m(γ·γ̇^0_m) ) + min( e^0_m(γ^{−1}·γ^1_m), e^0_m(γ^{−1}·γ̇^1_m) ) ],   (17)

    γ^{1,0} = argmin_{γ ∈ C^{1,0}} β(γ).   (18)

Since it is not yet known which of the two solutions in θ^i_m is the correct one, the min function is employed. In order to make the process robust, initialization is only accepted if the average reprojection error β(γ^{1,0})/|C^{1,0}| is small and the baseline between the frames is larger than τ_bl, which is a parameter of the system.
Once the pose between the frames is determined, the pose of the observed markers w.r.t. the grs must be estimated. For each marker, four possible poses can be obtained given the values of θ^0_m and θ^1_m. The best solution is the one minimizing the reprojection error of the marker corners in both frames:

    γ_m = { γ_f            if γ_f ∈ θ^0_m
          { γ^{0,1}·γ_f    if γ_f ∈ θ^1_m   (19)

    γ_f = argmin_{γ ∈ θ^0_m ∪ θ^1_m} { e^0_m(γ) + e^1_m(γ^{1,0}·γ)    if γ ∈ θ^0_m
                                      { e^0_m(γ^{0,1}·γ) + e^1_m(γ)    if γ ∈ θ^1_m   (20)

where γ^{0,1} = (γ^{1,0})^{−1}.
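The candidate search of Eqs. (15)-(18) can be sketched as follows. Poses are handled as 4×4 matrices; for simplicity, the sum of Eq. (17) is restricted here to the markers detected in both frames, and the error callbacks stand in for Eq. (11). All names are illustrative, not the authors' code.

```python
import numpy as np

def two_frame_relative_pose(theta0, theta1, err0, err1):
    """theta0/theta1: dict marker id -> (gamma, gamma_dot) as 4x4 matrices in f^0/f^1;
    err0/err1(m, pose): reprojection errors e^0_m / e^1_m of Eq. (11)."""
    inv = np.linalg.inv
    common = theta0.keys() & theta1.keys()
    # Eqs. (15)-(16): four candidate relative poses gamma^{1,0} per common marker.
    candidates = [g1 @ inv(g0) for m in common for g0 in theta0[m] for g1 in theta1[m]]
    best, best_cost = None, np.inf
    for g in candidates:
        cost = 0.0
        for m in common:                # Eq. (17): min over the two hypotheses per frame
            cost += min(err1(m, g @ theta0[m][0]), err1(m, g @ theta0[m][1]))
            cost += min(err0(m, inv(g) @ theta1[m][0]), err0(m, inv(g) @ theta1[m][1]))
        if cost < best_cost:            # Eq. (18)
            best, best_cost = g, cost
    return best, best_cost / max(len(candidates), 1)   # pose and average error
```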
4.2. The reference keyframe selection for the current frame

The selection of the current keyframe f^ω is the last step of the process. The goal is to find, amongst the keyframes, the nearest one to the current camera location. It is used when adding new keyframes, to avoid adding frames too close to each other, and also in the process of loop closure detection. It is computed as:

    ω = argmin_{j ∈ F} d(f^j, f^t),   (21)

where d denotes the translational distance between two frame poses:

    d(f^i, f^j) = || (t^i_x, t^i_y, t^i_z) − (t^j_x, t^j_y, t^j_z) ||₂.
4.3. Tracking

Tracking is handled as a non-linear optimization process minimizing the reprojection error of the observed markers in the new frame f^t, using the previous pose γ^{t−1} as the starting point.

Tracking consists of estimating the pose that transforms a point from the grs to the frs of frame f^t, i.e., γ^t, assuming that the poses of the markers are fixed. In essence, the problem reduces to minimizing the squared reprojection error of the marker corners (Eq. 11) so as to obtain the value of γ^t. Nevertheless, please remember that not all the detected markers can be employed for tracking. Only those with a valid pose in the map (i.e., γ_m ≠ 0) are eligible for tracking purposes. In addition, as we will explain later in Sect. 4.9, the markers that cause loop closures are not employed for tracking. Thus, only a subset of the detected markers, denoted by L′(t) (see Eq. 36 in Sect. 4.9.1), is employed for tracking.

As a consequence, the reprojection error of the markers employed for tracking is expressed as:

    E(t) = Σ_{m ∈ L′(t)} e^t_m(γ^t · γ_m).   (22)

The Levenberg-Marquardt (LM) algorithm [26] is used for minimization.
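A sketch of this tracking step is shown below, using SciPy's Levenberg-Marquardt backend as a stand-in for the paper's solver; the marker corners are assumed to be already expressed in the grs (i.e., γ_m has been applied), so minimizing over γ^t reproduces Eq. (22). Names are ours.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def track(gamma_prev, tracked, K, dist):
    """gamma_prev: 6-vector (rvec, tvec) of gamma^{t-1}; tracked: list of
    (corners3d_grs (4x3), observed_px (4x2)) pairs for the markers in L'(t)."""
    def residuals(gamma):
        res = []
        for corners3d, obs in tracked:
            proj, _ = cv2.projectPoints(corners3d, gamma[:3], gamma[3:], K, dist)
            res.append((proj.reshape(-1, 2) - obs).ravel())
        return np.concatenate(res)
    # Levenberg-Marquardt refinement of Eq. (22), seeded with the previous pose.
    return least_squares(residuals, np.asarray(gamma_prev, float), method='lm').x
```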
4.4. Addition of new keyframes and markers to the map

Once γ^t has been estimated, we must decide whether the frame should be added to the map as a new keyframe. The key idea is that keyframes are added only if they add information to the system. Thus, the following rules apply:

1. If the frame contains at least one new marker, the frame and the markers are added to the map. Markers unambiguously detected are added with their estimated location, whereas ambiguously detected markers are added with their pose set to zero, i.e., γ_m = 0. Later, it will be possible to estimate their pose from multiple views (Sect. 4.5).

2. Else,

(a) If there is a marker in M with zero pose (γ_m = 0) whose pose is unambiguously detected in this frame, then the frame is added and the marker pose is set.

(b) Else, the frame is added only if the distance to the current keyframe f^ω is larger than the threshold τ_bl. This ensures a minimum baseline between frames for triangulation and optimization (Sects. 4.5 and 4.7).
4.5. Marker pose estimation

As previously indicated, when a marker m is first spotted in a frame f^t, it is added to the map. If its location w.r.t. the frame, γ^t_m, can be unambiguously determined, then its pose in the grs is obtained as:

    γ_m = (γ^t)^{−1} · γ^t_m.   (23)

Otherwise, its pose is set to zero until it can be estimated.
This work proposes a method to estimate the pose of a marker from a set of ambiguous observations (see Section 2 for a description of the ambiguity problem). Let (γ^1, …, γ^n) be the poses of the keyframes in which marker m has been observed so far. Remember that these are known poses obtained using other markers (not the marker m) and that they express the transform from the grs to the frs (Fig. 3).

Consider now the case of the marker m ambiguously observed in a keyframe f^i. The marker pose w.r.t. the grs, γ_m, is unknown, as is γ^i_m, its pose w.r.t. the frame. Since the marker m has been ambiguously detected, there are two possibilities for the true pose γ^i_m; let us denote them γ^i_m and γ̇^i_m. Since the marker is observed in n keyframes, and two different marker poses are obtained for each keyframe, there is a total of n×2 possible poses for the marker m in all the keyframes in which it is observed, which we shall denote by:

    Θ_m = { γ^1_m, γ̇^1_m, …, γ^n_m, γ̇^n_m }.   (24)
See Figure 5 for a visual explanation of the problem for the limited case of two keyframes.

Let us also define I(α) as the function that, given a pose α ∈ Θ_m, returns the keyframe it belongs to:

    I(α) = i, if α ∈ Θ_m ∧ α ∈ { γ^i_m, γ̇^i_m }.   (25)

Our proposal is to choose the γ*_m ∈ Θ_m that minimizes the reprojection error of the marker in all n keyframes:

    γ*_m = argmin_{α ∈ Θ_m} Σ_{i=1}^{n} e^i_m( γ^i (γ^{I(α)})^{−1} α ),   (26)

where γ^{I(α)} is the pose of the keyframe I(α). Then, the initial marker pose can be obtained from its ambiguous observations as:

    γ_m = (γ^{I(γ*_m)})^{−1} γ*_m.   (27)
In order for Equation 26 to be useful, the distance between the keyframes considered must be large enough. Imagine the degenerate case of two keyframes f^i and f^j with zero or infinitesimal displacement between them. Then, the estimated
Figure 5: Initial pose estimation γ_m of a marker m ambiguously detected in two keyframes f^0 and f^1 with known poses γ^0 and γ^1. Due to ambiguity, there are two possible poses for each keyframe-marker pair. In total, this makes the set of poses {γ^0_m, γ̇^0_m, γ^1_m, γ̇^1_m}, from which the best estimation γ_m is obtained using Eq. 27.
poses of the marker in both keyframes are likely to be equal, and thus it would be impossible to resolve the marker pose ambiguity. To prevent this case, keyframes inserted in the map must keep a minimum distance τ_bl (as already explained in Sect. 4.4). Finally, it is necessary to establish the minimum number of keyframes in which a marker must be observed: our experience indicates that three is a good value.
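The selection of Eqs. (24)-(27) amounts to testing each of the 2n hypotheses as a candidate global marker pose and keeping the one with the smallest total reprojection error. A minimal sketch, with illustrative names and 4×4 matrices for poses:

```python
import numpy as np

def estimate_marker_pose(kf_poses, hypotheses, err):
    """kf_poses: dict i -> gamma^i (grs -> frs, 4x4); hypotheses: list of (i, alpha)
    pairs, two per keyframe (Theta_m, Eq. (24)); err(i, pose): Eq. (11) in keyframe i."""
    inv = np.linalg.inv
    frames = {i for i, _ in hypotheses}           # the n keyframes observing the marker
    best, best_cost = None, np.inf
    for i_a, alpha in hypotheses:                 # i_a = I(alpha), Eq. (25)
        gamma_m = inv(kf_poses[i_a]) @ alpha      # candidate grs pose, Eq. (27)
        cost = sum(err(i, kf_poses[i] @ gamma_m) for i in frames)   # Eq. (26)
        if cost < best_cost:
            best, best_cost = gamma_m, cost
    return best
```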
4.6. Keyframe culling

In order to avoid an unlimited growth of the map, every time a keyframe is added, a culling procedure is run to remove redundant keyframes. This helps to keep the processing time bounded when global and local optimizations are done (Sects. 4.7 and 4.10). The key idea is that for each marker m ∈ M, we select a set of keyframes from F in which m is seen from a wide variety of poses. Therefore, the number of keyframes is reduced without significantly compromising the results of the optimization processes.

Let us denote by

    Ω_m = { i | f^i ∈ F }   (28)

the set of keyframes selected for marker m, where the size of the set, N_c = |Ω_m|, is given. Our goal can be described as selecting from F the N_c most distant keyframes. We proceed as follows: first, we select the two most separated keyframes from F in which the marker appears and add them to Ω_m. Then, we keep adding the keyframe farthest from the elements in Ω_m until the desired number of keyframes is reached. The process is repeated independently for each marker, obtaining:

    Ω = ∪_{m ∈ M} Ω_m.   (29)

Finally, the keyframes of the map not in Ω are removed from the map.
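This per-marker selection is essentially farthest-point sampling of the camera positions; a short sketch under that reading, with names of our own choosing:

```python
import numpy as np

def select_keyframes(positions, ids, n_c):
    """Omega_m (Eq. (28)): up to n_c mutually distant keyframes among `ids`;
    positions maps keyframe id -> 3D camera position."""
    ids = list(ids)
    if len(ids) <= n_c:
        return set(ids)
    d = lambda a, b: np.linalg.norm(positions[a] - positions[b])
    # Seed with the two most separated keyframes...
    omega = set(max(((a, b) for a in ids for b in ids if a < b), key=lambda p: d(*p)))
    # ...then greedily add the keyframe farthest from the current selection.
    while len(omega) < n_c:
        omega.add(max((i for i in ids if i not in omega),
                      key=lambda i: min(d(i, j) for j in omega)))
    return omega
```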
4.7. Local optimization

Whenever a new keyframe is added to the map, the poses of its neighbor keyframes, as well as the poses of the markers observed in them, are updated to account for the new observation.

Let us denote by

    G(f^t) = { i | f^i ∈ F, Λ_i ∩ Λ_t ≠ ∅ }   (30)

the set of keyframes connected to a keyframe f^t according to the graph G, and by

    G_m(f^t) = ∪_{i ∈ G(f^t)} Λ_i   (31)

the markers visible from the keyframes in G(f^t) (see Eq. 13).

The goal of the local optimization is to jointly optimize the poses of the keyframes and markers in G(f^t) and G_m(f^t), respectively. However, since the markers in G_m(f^t) might also be seen from other keyframes not included in G(f^t), those must also be considered. Let us denote the whole set of keyframes in which marker m is visible as

    V(m) = { i | f^i ∈ F, m ∈ Λ_i }.   (32)

Then, the total reprojection error of all markers and keyframes included in the local optimization is expressed as:

    e(γ^1, …, γ^a, γ_1, …, γ_b) = Σ_{m ∈ G_m(f^t)} Σ_{i ∈ V(m)} e^i_m(γ^i · γ_m),   (33)

where γ^i represents the poses of the keyframes in G(f^t), γ_m the poses of the markers in G_m(f^t), and a, b the cardinalities of these sets.

As in the previous case, Eq. 33 is minimized using the LM algorithm [26], but in this case exploiting the sparseness of the data. Since not every marker projects in every keyframe, the Jacobian matrix employed by the LM algorithm is sparse, and using a sparse approach considerably reduces the computing time.

Finally, it must be indicated that, to avoid infinite solutions, the pose of the first keyframe f^0 is never subject to optimization. Otherwise, the whole reference system could be moved arbitrarily, generating infinite solutions for the optimization process.
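The sparsity pattern mentioned above is easy to encode explicitly: each residual block (marker m observed in keyframe i) depends only on the six parameters of γ^i and the six of γ_m. The sketch below builds such a mask for SciPy's least_squares; since SciPy's LM backend ignores sparsity, the mask is intended for its 'trf' solver, which is a substitution on our part, not the authors' sparse LM.

```python
from scipy.sparse import lil_matrix

def jacobian_sparsity(observations, n_kf, n_mk):
    """observations: list of (kf_index, marker_index), one per marker-in-keyframe term of
    Eq. (33); each term contributes 8 residual rows (4 corners x 2 pixel coordinates).
    Parameter layout: [keyframe poses | marker poses], 6 dof each."""
    S = lil_matrix((8 * len(observations), 6 * (n_kf + n_mk)), dtype=int)
    for row, (i, m) in enumerate(observations):
        S[8 * row: 8 * row + 8, 6 * i: 6 * i + 6] = 1                    # w.r.t. gamma^i
        S[8 * row: 8 * row + 8, 6 * (n_kf + m): 6 * (n_kf + m) + 6] = 1  # w.r.t. gamma_m
    return S
# Usage: least_squares(residuals, x0, jac_sparsity=jacobian_sparsity(obs, a, b), method='trf')
```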
4.8. Relocalization

When no valid markers are visible in the current frame, its pose cannot be estimated and the system enters relocalization mode. Later, when a frame with valid markers is available again, the system proceeds to relocalize. Since relocalization consists of estimating the frame pose without prior information, it is more error-prone than the tracking process. When tracking, the initial pose employed for optimization, γ^{t−1}, is close to the real one. Thus, marker pose ambiguity is not a problem when optimizing. However, when relocalizing without knowledge of the previous pose, we are subject to the ambiguity problem.

Imagine the simplest case of being in relocalization mode and detecting a single marker. If the marker is unambiguously detected, there is no problem. However, it would be very risky to relocalize from this single marker if its pose was ambiguously determined.

In this work we employ the robust method for relocalization from a set of ambiguously detected markers proposed in [1]. Let us consider the simplest case of two markers i and j already in the map with known poses that are ambiguously detected in the frame f^t. The set of marker poses obtained in frame f^t would be

    Θ^t = { γ^t_i, γ̇^t_i, γ^t_j, γ̇^t_j }.

Let us imagine that γ^t_i is the best estimation. Then, we could use it to estimate the pose of the frame f^t as:

    E(γ^t_i) = γ^t_i · γ_i^{−1},

considering that γ_i is already known. Then, the reprojection error of both markers,

    Σ_{k ∈ {i,j}} e^t_k( E(γ^t_i) · γ_k ),

should be minimal.
In general, the relocalization problem from markers reduces to finding the transform from Θ^t minimizing the reprojection error of the valid markers observed in f^t:

    argmin_{γ ∈ Θ^t} Σ_{k ∈ Λ^t | γ_k ≠ 0} e^t_k( E(γ) · γ_k ).   (34)

The only case in which Eq. 34 cannot be applied is when f^t contains a single valid marker which is ambiguously detected. In all other cases the proposed method can operate, e.g., one marker unambiguously detected, two markers ambiguously detected, one marker unambiguously detected and two markers ambiguously detected, etc.
The pose estimated in Eq. 34 is an initial solution that is further refined using Eq. 22, as in the tracking process.
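A compact sketch of the search of Eq. (34): every hypothesis in Θ^t yields one frame-pose candidate E(γ) = γ·γ_k^{−1}, and the candidate minimizing the total reprojection error of the valid markers is kept. Names are illustrative.

```python
import numpy as np

def relocalise(theta_t, marker_grs, err):
    """theta_t: list of (marker_id, pose_mrs_to_frs) hypotheses detected in f^t;
    marker_grs: dict marker id -> gamma_k (valid map poses); err(k, pose): Eq. (11)."""
    inv = np.linalg.inv
    best, best_cost = None, np.inf
    for k0, g in theta_t:
        E = g @ inv(marker_grs[k0])          # frame-pose hypothesis E(gamma), grs -> frs
        cost = sum(err(k, E @ marker_grs[k]) for k in marker_grs)       # Eq. (34)
        if cost < best_cost:
            best, best_cost = E, cost
    return best
```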
4.9. Loop closure detection and correction

When a region of the map is revisited after a long trajectory, there is an inevitable drift in the camera pose estimation. Loop closure consists of detecting this situation and correcting the drift by distributing the error along the loop.

It must be noticed that loop closure in our problem cannot be handled as in keypoint-based SLAM. When using keypoints, loop closure can be detected after tracking using several approaches such as [27, 28, 29, 30]. Once detected, the keypoints in the current frame are matched against the keypoints in the starting loop keyframe.

We would like to stress that, while revisiting a place does not affect the tracking performance of traditional SLAM techniques, it can be a cause of failure in SPM-SLAM if it is not managed properly. Imagine a scenario with ten markers {0, …, 9} placed around a room. In the initial frame, marker 0 is spotted, and then the camera moves around the room discovering the sequence of markers 1 to 9. Eventually, the camera approaches the initial location. Suppose that in frame f^{t−1}, only marker 9 is visible, so γ^{t−1} is computed using it. Then, in the next frame, both markers 0 and 9 are visible at the same time. Because of the drift, the estimated pose of marker 9 may be far from the real one. So, if the pose of f^t is computed using only marker 9, it differs from the pose computed from marker 0. Moreover, the frame pose obtained using both markers simultaneously would be incorrect too.

So, in our problem, the loop closure first affects tracking, making it impossible to obtain a reliable frame pose estimation. That is the reason why, when tracking is done, not all known markers can be used (Sect. 4.3). In addition, it must be indicated that the local optimization method (Sect. 4.7) cannot be applied to solve the drift in many cases. If the drift is too large, the LM algorithm, which is a local optimization method, is unable to obtain good solutions. Therefore, it is required to first apply a deeper correction, and then to employ the local optimization.
4.9.1. Detection

A loop closure is detected when, in the current frame f^t, there is a marker that is already in M with a valid pose, but that is not visible in any of the neighbors (in the graph G) of the current keyframe f^ω. Let us denote by

    L(t) = { i ∈ Λ^t | i ∉ G_m(f^ω), γ_i ≠ 0 }   (35)

the set of markers in f^t that cause the loop closure detection, where G_m(f^ω) (previously defined in Eq. 31) is the set of markers visible from the keyframes connected to f^ω.
Figure 6: First test video sequence and results. The images (a-e) show the sequence in which the markers printed on a piece of paper are
reconstructed. Blue squares represent the camera locations while the multicolored squares are the markers. The video sequence starts at the
upper left corner of the piece of paper and moves along the border. (f) shows a photo of the printed piece of paper. (Best viewed in color)
When a loop closure is detected, two poses for f^t can be estimated: the one obtained using the markers in L(t), and another one using the rest of the detected markers:

    L′(t) = Λ^t \ L(t).   (36)

The former is the pose we should have if there had been no drift, and we shall denote it γ̇^t. The latter is the pose obtained by tracking, γ^t (Eq. 22), and is the nearest to the previous frame f^{t−1}. The goal of loop closure is to adjust the poses of all keyframes and markers so that both poses (γ^t and γ̇^t) become equal.

To compute γ̇^t using the markers in L(t), the method explained in Sect. 4.8 is employed. If no reliable pose can be obtained with this method, it means that there is only one marker in L(t) and that its pose is ambiguously detected in f^t. Then, we assume that there are two possible solutions for γ̇^t, namely γ̇^t_1 and γ̇^t_2, and we apply the correction method using both, selecting as the optimal solution the one providing less error. Therefore, our method can perform loop closure even with a single marker ambiguously detected.

The loop closure correction method works in two stages. First, a spanning tree containing the keyframes that form the loop is created and their poses corrected. Second, the corrected keyframe poses are employed to correct the marker locations, and then an optimization aimed at reducing the reprojection error is done.
4.9.2. Correction

Let us assume, for the sake of clarity, that L(t) has only one marker m. We select from F the keyframe in which m was first seen, f^c. Then, a minimum spanning tree T with its root at f^c is created using Kruskal's algorithm [31]. If the size of L(t) is greater than one, we select as f^c the oldest keyframe in which a marker from L(t) is seen.

The spanning tree T is used to create the pose graph P = {V, E}. The nodes

    V = { i | i ∈ F }

are the keyframes included in T, and the edges

    E = { (i, j) | i, j ∈ F }

represent connections between them. We also add f^t as a node in the graph, as well as the edges (t, c) and (t, ω) to close the loop.

Given the initial keyframe poses stored in the map, the relative transformations γ̂_{j,i} between connected keyframes of the graph are obtained as:

    γ̂_{j,i} = γ^j · (γ^i)^{−1},   (37)

and stored for later use. The only exception is the edge connecting the new frame to the closing keyframe, γ̂_{c,t}, which is computed using the expected pose γ̇^t instead of γ^t:

    γ̂_{c,t} = γ^c · (γ̇^t)^{−1}.

The value γ̂_{c,t} represents the transformation that one should obtain if there had been no drift; however, the observed transformation is in fact γ^c · (γ^t)^{−1}, so there is a difference between them:

    Δ_{c,t} = γ̂_{c,t} · γ^t · (γ^c)^{−1}.   (38)
The basic idea is to spread this difference among all graph connections by making small adjustments in the keyframe poses. The error along all the edges of the graph is expressed as:

    Δ(P) = Σ_{(i,j) ∈ E} || γ̂_{j,i} · γ^i · (γ^j)^{−1} ||²₂.   (39)

The sparse LM algorithm is used to minimize Eq. 39. It is important to stress that the relative transformations γ̂_{j,i} are computed once before the optimization and remain fixed during it, whereas the values γ^i are changed during the optimization. Another important aspect is that γ^c is not subject to optimization (it remains fixed), for two reasons. First, it could generate infinite solutions, since all keyframes could undergo a global transformation that does not affect the error function. Second, the keyframe f^c may have connections with other keyframes not in the pose graph P. By freezing the pose γ^c, we ensure that the loop of keyframes optimized remains attached to the rest of the map keyframes.

In contrast to [32], we do not need to introduce the scale factor into the optimization process, since we know the actual size of each marker and the scale can be reliably estimated.
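A sketch of this pose-graph relaxation is given below: the residual of an edge is the 6-dof vector of the discrepancy transform of Eq. (39), extracted with OpenCV's Rodrigues map, and SciPy's dense LM stands in for the sparse LM of the paper. The keyframe f^c is kept in the `fixed` set, as required above; all names are ours.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def edge_residual(G_hat_ji, G_i, G_j):
    """6-dof vector of the discrepancy transform of one edge (zero when satisfied)."""
    D = G_hat_ji @ G_i @ np.linalg.inv(G_j)
    rvec, _ = cv2.Rodrigues(D[:3, :3])
    return np.concatenate([rvec.ravel(), D[:3, 3]])

def relax(poses, edges, fixed):
    """poses: dict id -> 4x4; edges: list of (i, j, G_hat_ji); `fixed` ids stay put."""
    free = [i for i in poses if i not in fixed]
    def unpack(x):
        P = dict(poses)
        for n, i in enumerate(free):
            R, _ = cv2.Rodrigues(x[6 * n: 6 * n + 3])
            G = np.eye(4)
            G[:3, :3], G[:3, 3] = R, x[6 * n + 3: 6 * n + 6]
            P[i] = G
        return P
    def residuals(x):
        P = unpack(x)
        return np.concatenate([edge_residual(gh, P[i], P[j]) for i, j, gh in edges])
    x0 = np.concatenate([np.concatenate([cv2.Rodrigues(poses[i][:3, :3])[0].ravel(),
                                         poses[i][:3, 3]]) for i in free])
    return unpack(least_squares(residuals, x0, method='lm').x)   # minimizes Eq. (39)
```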
Once the poses of the frames have been corrected, the marker poses must be corrected accordingly. Let us denote by Δγ^i the transform applied to frame f^i to move it to the corrected pose. Imagine the simplest case of a marker m observed in only one frame f^i. Then, the corrected pose of the marker should be

    γ′_m = Δγ^i · γ_m.

If the marker is observed from more than one keyframe, we can average the corrections to obtain an initial estimate of the pose:

    γ′_m = ( 1/|V(m)| Σ_{i ∈ V(m)} Δγ^i ) γ_m.   (40)

At this point, the corrected marker and keyframe locations can be used as a good starting solution for the local optimization already explained in Sect. 4.7.
4.10. Global optimization

As the final step of our processing, when there are no more frames to be processed, we perform a global optimization. This process is similar to the local one (Sect. 4.7), with two exceptions. First, all keyframes and marker poses are included in the optimization. Second, the camera intrinsic parameters are also included. Thus, the final global optimization consists of minimizing the function:

    e(γ^1, …, γ^a, γ_1, …, γ_b, δ) = Σ_{m ∈ M} Σ_{i ∈ F} e^i_m(γ^i · γ_m),   (41)

where a = |F| and b = |M|. Again, the equation is solved using the sparse LM algorithm. The inclusion of the camera parameters in the optimization allows more precise results to be obtained.
5. Experiments and results

This section explains the experiments carried out to validate our proposal using five different tests. All the tests were run on an Intel i7 computer with 8 GB of RAM, running our code on a single thread. The ArUco library [6, 10] was employed for marker detection on the recorded video sequences and also for calibrating the cameras employed.

Two different measures have been employed to evaluate the quality of the proposed approach: the accuracy in the estimation of the marker poses, and the accuracy in the estimation of the frame poses. The first one is evaluated with the Absolute Corner Error (ACE) [1], computed as the mean error between the estimated three-dimensional marker corner locations and the ground-truth ones. In order to do this, it is necessary to transform the estimated corners to the ground-truth reference system, which can be done using Horn's method [33]. The accuracy of the estimated frame poses
Figure 7: Calibration object test. (a) Image of the calibration object on which markers have been attached. (b-e) Images showing the SLAM process. The object is on a rotating platform. (f) Three-dimensional reconstruction of the calibration object and the positions of the stereo camera employed to evaluate the precision. Green and pink pyramids represent the positions of the stereo cameras at the locations employed for testing. (Best viewed in color)
is obtained using the Absolute Trajectory Error (ATE) measure, which calculates the mean error between the translation components of the estimated frame poses and the ground-truth ones.
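Both measures require first aligning the estimate to the ground truth. The sketch below uses the SVD-based Kabsch/Umeyama closed form, which solves the same absolute-orientation problem as Horn's quaternion method [33] (the substitution is ours); applied to marker corners it yields ACE, and applied to camera positions, ATE.

```python
import numpy as np

def align_and_error(est, gt):
    """est, gt: (N, 3) corresponding 3D points (marker corners for ACE, camera
    translations for ATE). Returns the mean distance after rigid alignment."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    U, _, Vt = np.linalg.svd((gt - mu_g).T @ (est - mu_e))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = U @ D @ Vt                        # optimal rotation (Kabsch/Umeyama)
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```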
The proposed method has been compared with three other approaches. First, the off-line fiducial mapper method proposed in [1]. As already mentioned, the main difference is that the method proposed in this paper is a continuous mapping process, while [1] is an off-line process. Additionally, we have compared our system with two of the most popular SLAM methods: LSD-SLAM [4] and ORB-SLAM [3] 4. The first method is based on semi-dense stereo matching, while the second one relies on ORB keypoints; both are capable of managing loop closures.
4 We employed the latest version of the software, ORB-SLAM2.
Regarding the parameters of the proposed method, we have set the value τ_e = 3 (Eq. 14), which according to our experience is a good choice. This parameter is relevant in the relocalization, keyframe insertion, and loop closure detection stages. The minimum possible value for this parameter is 1, but this allows ambiguous poses to be considered. On the other hand, large values involve a more conservative mode, in which only markers observed very reliably (very near) are employed. The other parameter to be considered is τ_bl (Sect. 4.1.2), which is the minimum distance between frames in the map. We have observed that a value of 7 mm performs well in most of our experiments.

Before continuing, we would like to indicate that the proposed system is publicly available for free usage 5.
5.1. Printed marker map

The first test is aimed at analyzing the precision of the proposed method in estimating the three-dimensional positions of markers, and the effectiveness of the loop closure. For that purpose, we have printed on an A4 sheet of paper a set of markers, as shown in Fig. 6(f). In this case, the exact location of the markers is known. To test our system, we recorded a video sequence of size 1920 × 1080 pixels, moving around the border of the sheet so as to force a loop closure. The sequence of markers reconstructed is shown in Figs. 6(a-e). When analyzing the error, we obtained an ACE of 3.2×10⁻² millimeters. The same video sequence was processed with the off-line mapper [1], obtaining a similar error of 2.9×10⁻² millimeters.
5.2. Calibration object

In this second test, we employ the proposed system to create an arbitrary calibration object. The calibration object can be used either to calibrate a camera or to estimate its position. In this case, we used a cardboard box with markers attached to its sides. The positions of the markers are unknown and were estimated with the proposed method. To that end, the box was placed on a rotating platform and a video sequence was recorded with a static camera looking at the box. Figures 7(a-e) show the video sequence of the reconstruction.

In order to evaluate the accuracy of the calibration object, we used an indirect method. Instead of calculating the error of the marker positions, we calculated the error that the object induces in camera pose estimation. To do so, we employed a stereo camera consisting of two PtGrey FLEA3 units placed in parallel. In order to calibrate the cameras and estimate the displacement between them, the OpenCV toolbox [34] was employed. A chessboard pattern was captured 20 times at varying distances, and both the intrinsic and extrinsic parameters of the two cameras were estimated. According to the OpenCV tool, the average distance between them was 9.82 cm.
5 http://www.uco.es/investiga/grupos/ava/node/58
Figure 8: Full reconstruction of the laboratory. Images (a) to (g) show the status of the reconstruction at frames 4540, 6900, 8900, 9000, 12000, 12500 and 12900. Loop closure is done after frames (c) and (f). Red ellipses show the region of the map with the largest error before the loop closure. It can be observed in frames (d) and (g) that the correction of our system is successful. Image (h) shows the ground-truth 3D reconstruction obtained with a Leica laser scanner. The average error obtained in this sequence is 0.021 meters.
Our calibration object was then used to estimate the relative position of the cameras, in order to analyze the difference with respect to the OpenCV estimation. We employed the intrinsic parameters obtained by OpenCV, and only the extrinsic parameters were calculated. A total of 30 positions were used for evaluation, shown in Figure 7(f). The difference between our estimation and the OpenCV one was 1.23 ± 0.7 mm. The same experiment was done using the off-line mapper [1], obtaining a similar difference of 1.33 ± 0.9 mm.
5.3. Full lab sequence 500
For this test, we have placed a total of 90 markers along our laboratory, which comprises two rooms connected by a door. Each room has approximate dimensions of 7 × 7 meters. The goal of this test is to analyze the capability of the proposed method to reconstruct the three-dimensional position of markers placed in a large environment. The laboratory was scanned using a Leica 3D laser scanner (see Fig. 8(h)), which provided a 3D point cloud with very high accuracy. Then, we manually selected in the point cloud the points that belong to the corners of the markers. These are the ground-truth corner locations employed to compute the ACE in this experiment.
We recorded a video sequence of 12,480 frames of size 1920 × 1080 pixels, moving along the two rooms of the laboratory. The reconstruction sequence can be observed in Figs. 8(a-g), where the blue squares represent the markers M placed on the walls, and the green squares the keyframes F stored in the map. The sequence starts by recording the left room, moving around it until it is completely observed. In Fig. 8(c), one can observe that a loop is about to be closed. We have marked the region of the closure with a red ellipse. Then, Fig. 8(d) shows the result after the loop closure correction. Once the left room is completely observed, we cross the door and place the camera in the center of the right room (Fig. 8(e)). Then, we move the camera until all markers in the room are spotted. Again, we have marked in Fig. 8(f) the region affected by a large drift that is corrected using the loop-closure method explained earlier. The final reconstruction obtained is shown in Fig. 8(g).
This sequence is especially interesting since we have ground-truth information, and thus we can evaluate the different parameters of our proposal.
The main parameters we are interested in are the number of frames per marker kept in the map, Nc = |Ωm| (Eq. 28), and the image size. Both parameters influence the computing speed and the reconstruction error. Table 1 shows the different evaluations conducted. We have analyzed the video sequence for all combinations of the parameter Nc = {5, 10, 15} with the image sizes {640 × 480, 800 × 600, 1920 × 1080}. The column FPS1 indicates the frames per second of our method when the time required to detect markers is not considered; in other words, it only considers the amount of time required by our proposal.
Table 1: Metrics computed in the full lab sequence. We tested our system using several values of the parameter Nc = |Ωm| and several input image sizes. FPS1 indicates the frames per second of our proposal discounting the time required for detecting the markers. FPS2 represents the total frames per second, including marker detection.
Nc   Image Size    FPS1   FPS2   ACE (m)
5    640 × 480     576    236    0.034
5    800 × 600     518    189    0.023
5    1920 × 1080   412    71     0.023
10   640 × 480     253    155    0.021
10   800 × 600     251    136    0.023
10   1920 × 1080   221    61     0.023
15   640 × 480     148    108    0.021
15   800 × 600     150    101    0.021
15   1920 × 1080   140    53     0.022
The column FPS2 shows the total frames per second, including the time required to detect the markers using the ArUco library.
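Our implementation uses the standalone ArUco C++ library; an equivalent detection front end can be sketched with OpenCV's contrib aruco module (the dictionary and file name chosen below are assumptions):

    import cv2

    # Requires opencv-contrib-python (>= 4.7 for the ArucoDetector API).
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_ARUCO_ORIGINAL)
    detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

    frame = cv2.imread("frame.png")           # hypothetical input image
    corners, ids, rejected = detector.detectMarkers(frame)
    # 'corners' holds the four sub-pixel image corners of each detected
    # marker; these are the measurements the SLAM back end consumes.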
As can be observed, apart from the first case, the rest of the tests obtained a very similar ACE. The lowest resolutions seem to obtain slightly better precision in most cases. We believe that the margin of error in the ground-truth estimation is in the range of millimeters, so the error differences observed in these cases cannot be considered relevant. Another possibility is that ArUco corner estimation performs better in low-resolution images.
In any case, we consider that the second row is probably the best trade-off between speed and precision, since the full system can run at nearly 200 FPS using a single thread.
5.4. Comparison with other SLAM approaches
In the following experiments, we compare the proposed method with the keypoint-based ORB-SLAM2 [3], the direct method LSD-SLAM [4], and the previously proposed off-line method [1]. Eight video sequences were recorded in one of the rooms of our lab, in which an Optitrack motion capture system is installed. It comprises six cameras and is able to track the position of an object at 100 Hz. The videos were recorded using a PtGrey FLEA3 camera, which captures images of 1920 × 1080 pixels at a frame rate of 60 Hz.
Three of the video sequences (Seq1, Seq2, and Seq3) point towards the wall, and the camera starts and ends at the same location, in order to facilitate the detection of the loop closure.
Figure 9: a) Panoramic view of the lab walls showing the markers on the walls. b) Panoramic view of the markers placed on the ceiling. c) Reconstruction obtained by our method for Seq1; keyframes are shown in blue and markers as rectangles. d) Results obtained by the off-line method [1] in Seq1. e) Results of the ORB-SLAM2 method for Seq1; the method fails at some point of the trajectory and is not able to properly recover the structure of the scene even when loop closure is detected. f) Results of LSD-SLAM for Seq1; the method completely fails to reconstruct the trajectory. g,h,i,j) Reconstruction results of the methods for Seq4, with markers on the ceiling. In that case, ORB-SLAM2 obtains better results than in the previous environment.
In these sequences, the markers placed on the walls are used for tracking (Fig. 9a). Sequences Seq4, Seq5, and Seq6 show another interesting use case in which the markers are placed on the ceiling (Fig. 9b). As before, each sequence starts and ends at the same location to facilitate the detection of the loop closure. Finally, in sequences Seq7 and Seq8, the camera is moved around pointing at the walls, and each sequence finishes looking at the same region of the room where it started, so that the closure of the loop can be detected. Nonetheless, although pointing at the same spot, the initial and final locations were separated by approximately two meters. In the first frames of the sequences, translational movements are performed so as to allow the ORB-SLAM2 and LSD-SLAM methods to initialize.
For the ORB-SLAM2 and LSD-SLAM methods, the image resolution was reduced to 640 × 480, as is usual for these methods. Since these SLAM methods run several threads simultaneously, their results may vary from one execution to another.
Table 2: Results obtained by the different tested methods for the eight sequences registered with a motion capture system. The Absolute Trajectory Error (ATE) is measured in centimeters; trel is the relative translation error and rrel the relative rotation error (deg/25 cm).
        SPM-SLAM             OffLine [1]          LSD-SLAM [4]           ORB-SLAM2 [3]
Seq.    ATE    trel  rrel    ATE    trel  rrel    ATE     trel  rrel    ATE     trel  rrel
Seq 1   5.68   1.42  0.20    7.19   1.50  0.21    133.78  3.63  2.09    44.4    1.10  0.40
Seq 2   4.71   1.60  0.83    4.36   1.66  0.88    72.88   1.02  2.19    38.1    1.10  0.88
Seq 3   5.90   1.52  1.21    31.05  2.29  1.45    226.16  1.90  2.32    76.2    1.06  1.36
Seq 4   1.64   0.87  1.22    5.28   0.92  1.23    235.65  1.12  2.16    1.89    0.79  1.20
Seq 5   1.78   1.24  1.42    2.08   1.36  1.60    236.42  1.18  2.08    61.97   1.30  1.17
Seq 6   1.64   1.34  1.61    1.52   1.47  1.67    25.03   1.01  1.87    110.00  1.40  2.27
Seq 7   4.97   1.35  1.05    4.82   1.39  1.17    50.88   0.94  1.57    123.91  0.89  1.20
Seq 8   6.54   1.15  1.23    87.63  4.15  2.34    94.48   3.46  2.30    12.43   0.81  1.25
LSD-SLAM avoids that problem through a parameter that forces each frame to be processed before the next one is employed. To overcome the problem in ORB-SLAM2, we feed the images to the system with enough delay so that all optimization processes can finish before the next image is given to the system. This is achieved by making the camera thread sleep for 400 ms between image captures.
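The pacing can be reproduced with a trivial feeder loop; the slam object and its track method below are placeholders for whatever wrapper exposes the SLAM system:

    import time
    import cv2

    def feed_paced(slam, video_path, pause_s=0.4):
        """Hand frames to a SLAM wrapper with a fixed pause so background
        optimization threads finish before the next frame arrives."""
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            slam.track(frame)     # placeholder for the SLAM wrapper call
            time.sleep(pause_s)   # 400 ms between frames, as described above
        cap.release()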
The quantitative results obtained by the different methods are shown in Table 2. For each method we have computed the absolute trajectory error (ATE), the average relative translation error trel, and the relative rotation error rrel, as proposed in [35], at a distance of 25 cm. As can be observed, the proposed method obtains good results in all sequences. The worst results are obtained by the LSD-SLAM method, which fails in all sequences; we believe the reason is the rolling-shutter camera employed. The ORB-SLAM2 method systematically fails in all sequences but Seq4. The previously proposed off-line method [1] fails in two of the sequences (Seq3 and Seq8). Finally, it is also worth noticing that the tracking results for the sequences in which the markers are on the ceiling achieve higher accuracy.
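For completeness, an ATE of this kind can be computed as follows, assuming it is taken as the RMSE of camera positions after rigid alignment (align_rigid is the Kabsch helper sketched in Sect. 5.1):

    import numpy as np

    def absolute_trajectory_error(est_xyz, gt_xyz):
        """RMSE of camera positions after rigidly aligning the estimated
        trajectory to the ground truth (both (N, 3), time-synchronized)."""
        R, t = align_rigid(est_xyz, gt_xyz)   # Kabsch fit from the Sect. 5.1 sketch
        aligned = est_xyz @ R.T + t
        return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean()))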
Figures 9(c-f) show the results of the different methods for Seq1. The video sequences, the results obtained, and the code and scripts to reproduce the experiments of this section are publicly available (http://www.uco.es/investiga/grupos/ava/node/58).
5.5. Rotational movement
This final test aims at analyzing the capability of our method to work under mostly rotational movements. This is indeed a problematic case for monocular visual SLAM methods, since some degree of translation is required to reliably compute the essential or homography matrices.
To test the system, we recorded a video sequence in the first room of the lab using a camera with a resolution of 1280 × 960 pixels. The camera was set in the center of the room, and a person spun around while pointing the camera at the walls. The reconstruction obtained is shown in Figure 1. In order to reconstruct the scene, we set the minimum baseline between frames to τbl = 0.03 m; otherwise, very few frames are added to the map. The reconstruction ACE in this test is 3.1 cm. When the sequence was processed with the ORB-SLAM2 and LSD-SLAM methods, they were unable to initialize. The same sequence was also processed with the method proposed in [1], which obtained an ACE of 2.9 cm. It must be noticed, however, that this is an off-line method and that the number of frames employed for optimization is not restricted as in our case. This is the reason why it outperforms the method proposed here in most cases.
6. Conclusions and future work
This work has proposed a system for simultaneous mapping and localization from a set of squared planar markers (SPM-SLAM). Using as input a video sequence showing markers placed at arbitrary locations of the environment, the proposed system incrementally builds a precise map of the markers at the correct scale and determines the camera pose at the same time. The proposed system allows the creation of cost-effective, large-scale tracking systems able to operate in real time with minimal computational requirements.
This paper has contributed a number of techniques to deal with the inherent ambiguity that occurs when using planar markers. First, we have proposed a method to initialize the map from a set of images containing only ambiguous markers. Second, we have proposed a method to estimate the pose of markers from ambiguous observations, given that the frame poses are known. Finally, we have proposed a method to detect and solve the loop closure problem from squared planar markers.
The experiments conducted have shown the effectiveness of the proposed system in several situations. In the most complex scenario, our laboratory, two rooms of 49 square meters each have been properly mapped. The speed of our system reaches 150 Hz using a single thread, so it is especially attractive as an effective localization system for many robotic tasks requiring low computational resources. Additionally, the proposed system removes the restriction of monocular SLAM systems that fail under purely rotational movements.
The proposed system is publicly available for free usage (http://www.uco.es/investiga/grupos/ava/node/58). As future work, our plan is to combine natural features with fiducial markers. The basic idea is to use the markers whenever they are seen, while avoiding the need for an excessive number of them by also relying on keypoints. Another possible line of future work is to develop a model that analytically describes the type of errors present in squared planar marker tracking, in order to assist in placing the markers.
Acknowledgment
This project has been funded under projects TIN2016-75279-P and IFI16/00033 (ISCIII) of the Spanish Ministry of Economy, Industry, and Competitiveness, and by FEDER funds.
References
[1] R. Muñoz-Salinas, M. J. Marín-Jimenez, E. Yeguas-Bolivar, R. Medina-Carnicer, Mapping and localization from planar markers, Pattern Recognition 73 (2018) 158–171.
[2] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, J. J. Leonard, Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age, IEEE Transactions on Robotics 32 (6) (2016) 1309–1332.
[3] R. Mur-Artal, J. M. M. Montiel, J. D. Tardós, ORB-SLAM: A versatile and accurate monocular SLAM system, IEEE Transactions on Robotics 31 (5) (2015) 1147–1163.
[4] J. Engel, T. Schöps, D. Cremers, LSD-SLAM: Large-scale direct monocular SLAM, in: European Conference on Computer Vision (ECCV), 2014.
[5] D. Gálvez-López, J. D. Tardós, Bags of binary words for fast place recognition in image sequences, IEEE Transactions on Robotics 28 (5) (2012) 1188–1197.
[6] S. Garrido-Jurado, R. Muñoz-Salinas, F. Madrid-Cuevas, M. Marín-Jiménez, Automatic generation and detection of highly reliable fiducial markers under occlusion, Pattern Recognition 47 (6) (2014) 2280–2292.
[7] M. Fiala, Designing highly reliable fiducial markers, IEEE Trans. Pattern Anal. Mach. Intell. 32 (7) (2010) 1317–1324.
[8] H. Kato, M. Billinghurst, Marker tracking and HMD calibration for a video-based augmented reality conferencing system, in: Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, IEEE Computer Society, 1999, pp. 85–94.
[9] D. Schmalstieg, A. Fuhrmann, G. Hesina, Z. Szalavári, L. M. Encarnação, M. Gervautz, W. Purgathofer, The Studierstube augmented reality project, Presence: Teleoper. Virtual Environ. 11 (1) (2002) 33–54.
[10] S. Garrido-Jurado, R. Muñoz-Salinas, F. Madrid-Cuevas, R. Medina-Carnicer, Generation of fiducial marker dictionaries using mixed integer linear programming, Pattern Recognition 51 (2016) 481–491.
[11] M. Fiala, Comparing ARTag and ARToolKit Plus fiducial marker systems, in: IEEE International Workshop on Haptic Audio Visual Environments and their Applications, 2005, pp. 147–152.
[12] E. Olson, AprilTag: A robust and flexible visual fiducial system, in: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2011, pp. 3400–3407.
[13] J. Wang, E. Olson, AprilTag 2: Efficient and robust fiducial detection, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 4193–4198. doi:10.1109/IROS.2016.7759617.
[14] J. DeGol, T. Bretl, D. Hoiem, ChromaTag: A colored marker and fast detection algorithm, in: ICCV, 2017.
[15] D. Oberkampf, D. F. DeMenthon, L. S. Davis, Iterative pose estimation using coplanar feature points, Computer Vision and Image Understanding 63 (3) (1996) 495–511.
[16] G. Schweighofer, A. Pinz, Robust pose estimation from a planar target, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (12) (2006) 2024–2030.
[17] T. Collins, A. Bartoli, Infinitesimal plane-based pose estimation, International Journal of Computer Vision 109 (3) (2014) 252–286.
[18] G. Klein, D. Murray, Parallel tracking and mapping for small AR workspaces, in: 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), 2007, pp. 225–234.
[19] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, ORB: An efficient alternative to SIFT or SURF, in: 2011 International Conference on Computer Vision, 2011, pp. 2564–2571.
[20] A. J. Davison, I. D. Reid, N. D. Molton, O. Stasse, MonoSLAM: Real-time single camera SLAM, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6) (2007) 1052–1067.
[21] H. Lim, Y. S. Lee, Real-time single camera SLAM using fiducial markers, in: ICCAS-SICE, 2009, pp. 177–182.
[22] T. Yamada, T. Yairi, S. H. Bener, K. Machida, A study on SLAM for indoor blimp with visual markers, in: ICCAS-SICE, IEEE, 2009, pp. 647–652.
[23] M. Klopschitz, D. Schmalstieg, Automatic reconstruction of wide-area fiducial marker models, in: ISMAR, IEEE Computer Society, 2007, pp. 1–4.
[24] K. Shaya, A. Mavrinac, J. L. A. Herrera, X. Chen, A self-localization system with global error reduction and online map-building capabilities, in: Intelligent Robotics and Applications, Springer, 2012, pp. 13–22.
[25] M. Neunert, M. Blösch, J. Buchli, An open source, fiducial based, visual-inertial state estimation system, arXiv preprint arXiv:1507.02081.
[26] K. Madsen, H. B. Nielsen, O. Tingleff, Methods for non-linear least squares problems, 2nd Edition, 2004.
[27] J. Sivic, A. Zisserman, Efficient visual search of videos cast as text retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (4) (2009) 591–606.
[28] Y. Feng, Y. Wu, L. Fan, Real-time SLAM relocalization with online learning of binary feature indexing, Machine Vision and Applications 28 (8) (2017) 953–963.
[29] Y. Feng, L. Fan, Y. Wu, Fast localization in large-scale environments using supervised indexing of binary features, IEEE Transactions on Image Processing 25 (1) (2016) 343–358.
[30] L. Liu, H. Li, Y. Dai, Efficient global 2D-3D matching for camera localization in a large-scale 3D map, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2391–2400.
[31] J. B. Kruskal, On the shortest spanning subtree of a graph and the traveling salesman problem, Proceedings of the American Mathematical Society 7 (1) (1956) 48–50.
[32] H. Strasdat, J. Montiel, A. J. Davison, Scale drift-aware large scale monocular SLAM, in: Robotics: Science and Systems VI, 2010.
[33] B. K. P. Horn, Closed-form solution of absolute orientation using unit quaternions, J. Opt. Soc. Am. A 4 (4) (1987) 629–642.
[34] G. Bradski, A. Kaehler, Learning OpenCV: Computer Vision in C++ with the OpenCV Library, 2nd Edition, O'Reilly Media, Inc., 2013.
[35] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: Conference on Computer Vision and Pattern Recognition (CVPR), 2012.