Tracking Fiducial Markers with Discriminative Correlation Filters
PREPRINT SUBMITTED TO EXPERT SYSTEMS WITH APPLICATIONS
Francisco J. Romero-Ramirez^a, Rafael Muñoz-Salinas^{*,a,b}, and Rafael Medina-Carnicer^{a,b}
^a Departamento de Informática y Análisis Numérico, Edificio Einstein, Campus de Rabanales, Universidad de Córdoba, 14071, Córdoba, Spain. Tlfn: (+34) 957 212 289
^b Instituto Maimónides de Investigación en Biomedicina (IMIBIC), Avenida Menéndez Pidal s/n, 14004, Córdoba, Spain. Tlfn: (+34) 957 213 861
June 30, 2020
ABSTRACT

In the last few years, squared fiducial markers have become a popular and efficient tool to solve monocular localization and tracking problems at a very low cost. Nevertheless, marker detection is affected by noise and blur: small camera movements may cause image blurriness that prevents marker detection.

The contribution of this paper is two-fold. First, it proposes a novel approach for estimating the location of markers in images using a set of Discriminative Correlation Filters (DCF). The proposed method outperforms state-of-the-art methods for marker detection and standard DCFs in terms of speed, precision, and sensitivity. Our method is robust to blur and scales very well with image resolution, obtaining more than 200 fps on HD images using a single CPU thread.

As a second contribution, this paper proposes a method for camera localization with marker maps, employing a predictive approach to detect visible markers with high precision, speed, and robustness to blurriness. The method has been compared to state-of-the-art SLAM methods, obtaining better accuracy, sensitivity, and speed. The proposed approach is publicly available as part of the ArUco library.
Keywords: Discriminative Correlation Filter · Squared Fiducial Markers · Marker Mapping · SLAM
*Corresponding author.
Email addresses: fj.romero@uco.es (Francisco J. Romero-Ramirez), rmsalinas@uco.es (Rafael Muñoz-Salinas), rmedina@uco.es (Rafael Medina-Carnicer)
Figure 1: Map of markers generated by the works [9, 10]. (a) Image of a tracking room where a set of markers is randomly placed on the walls. (b) Marker map generated with [9]. Blue squares represent the poses of the markers and green ones the poses of the cameras. (c) Laser reconstruction of the room to help understand its three-dimensional configuration.
1 Introduction

Squared fiducial markers have become a popular and efficient method to solve monocular localization and tracking problems at a very low cost in indoor environments. In medical applications, they are used for tracking surgical equipment [1, 2, 3]. In augmented reality (AR) problems, they are employed to estimate the camera pose so as to properly render the scene [4, 5]. In autonomous navigation or drone landing, they provide visual references for navigation and landing [6, 7, 8].

The recent works on squared markers [9, 10] make it possible to estimate the camera pose in the environment (with the correct scale) by just analyzing images where some markers are visible. Given a set of these markers printed on a regular piece of paper and placed randomly in the environment (Fig. 1a), it is possible to estimate their three-dimensional location from a set of images or a video sequence showing them (Fig. 1b). This method allows building motion tracking systems of very low cost, requiring only a camera.

Nevertheless, one of the limitations of these techniques is that the detection of markers is sensitive to blurring. Figure 2 shows the appearance of the markers under different blurring levels obtained by moving the camera at different speeds. Even at low camera speeds, manually recorded videos exhibit blurriness that prevents detection (see Figure 2b). This effect happens either because the camera is not placed on a gimbal (e.g., in low-cost AR applications or drone landing), or because the marker moves fast (surgical equipment tracking). The high sensitivity to blurring limits the spread of this technology in applications of low cost and low computing power.
Figure 2: Tracking of a fiducial marker along a video sequence with the proposed method. From left to right, the marker is observed with increasing blurring levels. The proposed method is capable of tracking the marker in figures a-d but not in figure e. The estimated marker location is drawn as a red rectangle.
The contribution of this paper is two-fold. First, this work proposes a novel approach for estimating the location of markers in images that is both fast and robust to blur, which consists in employing a set of Discriminative Correlation Filters (DCF). In order to speed up computation, our method employs a pyramid of images and selects at each frame the one where tracking can be done at maximum speed. Figure 2 shows the tracking capabilities of the proposed method.

As a second contribution, we propose a novel approach for monocular camera pose estimation using marker maps. The proposed method, given a marker map, employs the previous trackers and a predictive approach to detect visible markers with high precision, speed, and robustness to blurriness.

The experiments conducted show that the proposed marker tracking method is faster and more robust to blur than the state-of-the-art marker detection algorithms, and more precise than the best DCFs. In addition, our proposal is compared with three state-of-the-art SLAM methods: ORBSlam2 [11], LDSO [12], and UcoSLAM [13]. Our method outperforms them in terms of speed and precision. The proposed method is publicly available as part of the ArUco library².

The remainder of this paper is structured as follows. Sect. 2 provides an overview of the related works, while Sect. 2.2 explains the basis of DCF. Our contributions are explained in Sects. 3 and 4, while Sect. 5 presents the experiments conducted and Sect. 6 draws some conclusions.
2 Related works

2.1 Fiducial marker systems

Fiducial marker systems are commonly used in camera pose estimation and tracking processes due to their high accuracy, robustness, and speed. Since the appearance of ARToolKit [14], a large number of systems based on square markers have emerged [10, 15, 16, 17]. In general, the detection process involves thresholding the scene, from which a set of squared regions (candidate markers) is extracted from the background. Later, the interior of the regions is analyzed, discarding those that cannot be identified. Finally, using the four corners of at least one marker, it is possible to estimate the position of the camera.
²https://www.uco.es/investiga/grupos/ava/node/26
Several works have analyzed the performance of marker detection systems [18], showing that speed, robustness, and accuracy are essential factors to be taken into account, where the ArUco and AprilTag marker systems stand out [19, 20, 21].
2.2 Discriminative Correlation Filters

Since the appearance of the work of Bolme et al. [22] with the Minimum Output Sum of Squared Error (MOSSE) filter, discriminative correlation filters (DCF) have increased in popularity, becoming one of the main methods of visual tracking due to their efficiency and robustness.

Many other researchers have worked on improving several aspects of the initial MOSSE proposal. Henriques et al. [23] replace the use of grayscale filters with HOG features, Danelljan et al. [24] introduce learning multi-channel filters with Colornames, and Li et al. [25] and Lukežič et al. [26] integrate both HOG and Colornames. Other works employing convolutional features of CNNs [27, 28, 29] have shown high performance.
DCFs usually have limited information about the contour, leading to false positives in some scenarios such as rapid movement, occlusion, or background noise. Mueller et al. [30] use context information in filter training to improve the performance of state-of-the-art algorithms without incurring high computational costs. On the other hand, to reduce boundary effects, Danelljan et al. [31] reformulate the learning function by considering larger image regions, penalizing filter values outside the bounding box.
Another limitation of DCFs is the assumption that the target has a fixed size and that it is completely aligned to a rectangular region. However, the shape of the tracked objects and their rotation make the filter learn the background, leading to tracking errors. Danelljan et al. [32] presented a method to estimate the scale by training a classifier on a pyramidal scale. Also, Lukežič et al. [26] introduce the channel and spatial reliability concepts. The spatial reliability map adjusts the filter to the object, allowing the size of the search region to adapt and improving the tracking of non-rectangular objects. The channel reliability reflects the discriminative power of each filter channel.
Mathematical Basis of Discriminative Correlation Filters
Correlation filter based tracking applies a continuous adaptive process to find the filter that, when applied on the desired target, produces the maximum response. In its simplest form, the filter is a small image patch centered around the object to be tracked. However, in order to increase the robustness to appearance changes, a set of modified images of the target (created using affine transformations) is employed to build the filter. Once the initial filter is created, it is applied on the next image at the same location. Then, the position with the maximum response within the region is considered the new target location. Finally, the filter is updated to adapt to changes in appearance, and the process is repeated in the subsequent frames [22].
Let us denote by $\mathcal{X} = \{x_1, \ldots, x_n\}$ the set of gray-scale patches of the target observed under different appearance conditions. It will be used as a training set to create the initial filter $h$. Also, let us denote by $\mathcal{G} = \{g_1, \ldots, g_n\}$ the desired response of the filter when applied on the patches, i.e., $h(x_i) = g_i$. Although $g_i$ can have any shape, it is generally generated as a 2D Gaussian ($\sigma = 2$) centered at the center of the patch. Thus, in practice, $g_i = g_j \;\forall i, j \in \{1, \ldots, n\}$.
Computing the correlation in the Fourier domain has been demonstrated to be the best way to speed up computation and obtain a certain degree of robustness to misalignment. Correlation in the frequency domain turns into an element-wise multiplication expressed as:
$$G = X \odot H^* \tag{1}$$

where $G$, $X$ and $H$ denote the Fourier transforms of $g \in \mathcal{G}$, $x \in \mathcal{X}$ and $h$, respectively, $\odot$ is the element-wise multiplication and $(\cdot)^*$ the complex conjugate. In consequence, the estimation of the optimal correlation filter $H$ in the Fourier domain is computed as:

$$\min_{H} \sum_{i=1}^{n} \| X_i \odot H^* - G_i \|^2 + \lambda_1 \| H \|^2 \tag{2}$$

where $\lambda_1$ is a regularisation term. Since Eq. 2 is convex, it has a single global minimum that can be expressed as:
$$H_1 = \frac{\sum_{i=1}^{n} X_i \odot G_i^*}{\lambda_1 + \sum_{i=1}^{n} X_i \odot X_i^*} \tag{3}$$

which expresses how to obtain the correlation filter in the first frame. The regularisation term $\lambda_1$ prevents divisions by zero.
In frame $t$ ($t > 1$), the filter is applied at the previous target location, and the location with maximum response is expected to be the current target location. We shall define $z_t$ as the image patch centred at the maximum response location in $t$, and $Z_t$ as its Fourier transform. Then, the filter is updated using a running average so that

$$H_t = \frac{\eta A_t + (1 - \eta) A_{t-1}}{\eta B_t + (1 - \eta) B_{t-1}} \tag{4}$$

$$A_t = G_t \odot Z_t^* \tag{5}$$

$$B_t = Z_t \odot Z_t^* \tag{6}$$

where the parameter $\eta \in [0, 1]$ is the learning rate.
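To make the filter construction and update concrete, the following NumPy sketch implements Eqs. 3-6 under the definitions above. It is illustrative only: the usual MOSSE preprocessing (log transform, cosine windowing, affine augmentation) is omitted, and all function names are ours, not the ArUco implementation's.

```python
import numpy as np

def desired_response(shape, sigma=2.0):
    """2D Gaussian g centred on the patch (the desired filter output)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - w / 2) ** 2 + (ys - h / 2) ** 2) / (2 * sigma ** 2))

def init_filter(patches, lam1=1e-4):
    """Eq. 3: closed-form filter from the n training patches of the target."""
    G = np.fft.fft2(desired_response(patches[0].shape))
    A = np.zeros(patches[0].shape, dtype=complex)  # numerator accumulator
    B = np.zeros(patches[0].shape, dtype=complex)  # denominator accumulator
    for x in patches:
        X = np.fft.fft2(x)
        A += X * np.conj(G)
        B += X * np.conj(X)
    return A / (lam1 + B), A, B, G

def track_and_update(H, A_prev, B_prev, G, patch, eta=0.2):
    """Apply the filter (Eq. 1) and update it with a running average (Eqs. 4-6)."""
    Z = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(Z * np.conj(H)))  # spatial correlation output
    A = eta * (G * np.conj(Z)) + (1 - eta) * A_prev   # Eq. 5 with running average
    B = eta * (Z * np.conj(Z)) + (1 - eta) * B_prev   # Eq. 6 with running average
    return response, A / B, A, B                      # Eq. 4
```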
An important aspect to consider is how to detect when tracking has failed. A method to do so is analyzing the Peak to Sidelobe Ratio (PSR), which is the ratio between the filter value at the point with maximum response and the average response in the rest of the pixels. It has been observed that values below 7.0 indicate tracking failure [22].
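A minimal sketch of the PSR computation, in the Bolme-style form (peak compared against the sidelobe mean and standard deviation, with a small window around the peak excluded; the 11x11 exclusion window is our assumption):

```python
import numpy as np

def peak_to_sidelobe_ratio(response, exclude=11):
    """PSR of a correlation response map: peak versus sidelobe statistics."""
    peak = response.max()
    py, px = np.unravel_index(response.argmax(), response.shape)
    half = exclude // 2
    mask = np.ones(response.shape, dtype=bool)
    mask[max(0, py - half):py + half + 1,
         max(0, px - half):px + half + 1] = False   # exclude the peak window
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)
```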
In general, the area surrounding the tracked object may contain distracting information that leads tracking to an erroneous local minimum. An effective approach to alleviate this problem is to include contextual information in the filter [30]. Instead of considering only the target appearance to build the filter, patches surrounding the target are also employed as negative examples. Following this approach, Eq. 2 is updated so that the minimization function takes into account a set of patches surrounding the target.
If $\mathcal{Y} = \{y_1, \ldots, y_m\}$ is the set of contextual patches (blue patches in Fig. 3a), then Eq. 2 becomes:

$$\min_{H} \sum_{i=1}^{n} \| X_i \odot H^* - G_i \|^2 + \lambda_1 \| H \|^2 + \lambda_2 \sum_{j=1}^{m} \| Y_j \odot H^* \|^2 \tag{7}$$
where $\lambda_2$ modulates the relative importance of the context and $Y_j$ is the Fourier transform of $y_j$. Using this approach, the update of the filter in frame $t$ ($t > 1$) is expressed as:

$$H_t = \frac{\eta A_t + (1 - \eta) A_{t-1}}{(\eta B_t + (1 - \eta) B_{t-1}) + \lambda_2 (\eta D_t + (1 - \eta) D_{t-1})} \tag{8}$$
Figure 3: Tracking process with correlation filters. (a) Training process in frame t: the filter is updated using the central patch of the marker in addition to the 4 patches around it. (b) Tracking process in frame t+1: using the filter updated in t, the maximum response value indicates the new position of the marker.
where

$$D_t = \sum_{i=1}^{m} Y_{i,t} \odot Y_{i,t}^* \tag{9}$$

and $A_t$, $B_t$ are obtained from Eqs. 5 and 6.
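The context-aware update can be sketched as follows, reusing the NumPy conventions of the previous snippets; `contexts` holds the m surrounding patches y_j, and the names are again illustrative rather than the library's API:

```python
import numpy as np

def context_aware_update(A_prev, B_prev, D_prev, patch, contexts, G,
                         eta=0.2, lam2=20.0):
    """One frame of the context-aware filter update (Eqs. 5, 6, 8 and 9)."""
    Z = np.fft.fft2(patch)
    A = eta * (G * np.conj(Z)) + (1 - eta) * A_prev                  # Eq. 5
    B = eta * (Z * np.conj(Z)) + (1 - eta) * B_prev                  # Eq. 6
    Dt = sum(np.fft.fft2(y) * np.conj(np.fft.fft2(y))                # Eq. 9
             for y in contexts)
    D = eta * Dt + (1 - eta) * D_prev
    H = A / (B + lam2 * D)                                           # Eq. 8
    return H, A, B, D
```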
3 Tracking of a Squared Marker

This section introduces our first contribution, a DCF-based tracker that allows the continuous tracking of a squared fiducial marker throughout a video sequence. Since estimating the exact location of the marker corners is required to estimate its three-dimensional pose, our method must be able to track them. Therefore, our approach employs a total of five filters: one filter for tracking the marker's general appearance, and four additional filters for tracking the corners. In order to speed up computation while adapting to scale changes, a multi-resolution pyramid tracking approach is proposed. Filters of fixed size are employed (constraining the computation time), but, at each iteration, the scale where the marker dimension best fits the filter size is employed.

Our process can be summarized in the following steps. In the first frame, we find the pyramid level where the filters are created with the desired size. In subsequent frames, the filters are first applied in the neighboring regions of the previous location at the same pyramid level to find the optimal location of the marker and its corners. Then, to adapt to scale changes of the marker, we must find the scale that produces the highest response of the filter. Finally, the filters are updated.

Below, we provide a formal description of the proposed method.
3.1 Tracker definition and initialization

The initial step to track a marker $m$ along a video sequence is to find it in the image. A marker is a squared matrix in which each element represents a bit (see Fig. 4). The marker is comprised of a black region, which helps to detect it, and an inner region containing the bits that uniquely identify the marker. Let us define the sequence of bits of a marker as

$$b(m) = (b_1, \ldots, b_n) \;|\; b_i \in \{0, 1\}, \tag{10}$$
Figure 4: Identification of the tracked marker. Computing the homography on the detected polygon allows sampling the central value of its identification bits, which are then analyzed in the four possible orientations.
which is created row by row starting at the top-left bit of the matrix. The detection of the marker in the image can be efficiently done using the method proposed in [10]. The method extracts contours in the image, obtains their polygonal approximation, and discards those that are not quadrilateral. Each remaining polygon $p$ is analyzed to check if it belongs to a valid marker. Its four corners are employed to compute the homography matrix that determines the central pixel of each bit in the image, and the pixel intensities are thresholded using Otsu's algorithm [33], obtaining its bit sequence $b(p)$ in its four main rotations (0°, 90°, 180° and 270°). If the Hamming distance between $b(m)$ and $b(p)$ is zero in any of the possible rotations, then we have a perfect match and the marker is considered detected (see Fig. 4).
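As an illustration of this identification step, the sketch below warps a candidate quadrilateral to a canonical square with OpenCV, binarizes it with Otsu's method, and samples the centre of each bit cell. The canonical resolution (8 px per cell), the simplified border handling (the whole marker is divided into n x n cells), and the function name are our assumptions, not the ArUco implementation:

```python
import numpy as np
import cv2

def read_marker_bits(gray, corners, n=6):
    """Warp the candidate quadrilateral to a canonical square, binarize it with
    Otsu's method, and sample the centre of each bit cell (row by row)."""
    size = n * 8  # assumed canonical resolution: 8 px per bit cell
    dst = np.array([[0, 0], [size - 1, 0],
                    [size - 1, size - 1], [0, size - 1]], dtype=np.float32)
    # corners must be ordered consistently with dst (clockwise from top-left)
    H = cv2.getPerspectiveTransform(np.asarray(corners, dtype=np.float32), dst)
    canonical = cv2.warpPerspective(gray, H, (size, size))
    _, binary = cv2.threshold(canonical, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    cell = size // n
    return [int(binary[r * cell + cell // 2, c * cell + cell // 2] > 0)
            for r in range(n) for c in range(n)]  # compare against b(m) in 4 rotations
```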
Let us define $c = \{c_k \,|\, c_k \in \mathbb{R}^2, k \in \{1, \ldots, 4\}\}$ as the pixel coordinates of the four corners of marker $m$ in image $I$, $C(m) \in \mathbb{R}^2$ as the location of the marker center, and $A(m)$ as the observed marker area.
Our aim is to use patches of side length $\tau_s$ to create the DCFs for the marker and corners. To do so, the patches are obtained from a down-sampled version of the image $I$ where the marker area $A(m)$ is most similar to $\tau_s^2$. If we denote by $\mathcal{I} = (I^0, I^1, \ldots, I^n)$ the pyramid of images ($I^0 = I$), where the image $I^j, j > 0$, is the original image $I$ down-sampled by the factor $\beta^j \,|\, \beta \in [0, 1]$, then we can define:

$$L(m) = \begin{cases} 0 & \text{if } \frac{\tau_s^2}{A(m)} \geq 1 \\ \left\lfloor \log_\beta \frac{\tau_s^2}{A(m)} \right\rfloor & \text{otherwise} \end{cases} \tag{11}$$

as the pyramid level where the area $A(m)$ of the marker is most similar to the desired patch area $\tau_s^2$. In other words, the image $I^{L(m)}$ is where the initial patches of area $\tau_s^2$ will be extracted. Please notice that $\lfloor \cdot \rfloor$ denotes the floor function.
We shall define $P(p, \tau_s)$ as the function that returns a patch of size $\tau_s^2$ centred at $p \in \mathbb{R}^2$ in the image $I^{L(m)}$. Consequently, the patches used to generate the DCFs for the marker and its corners are $P(C(m), \tau_s)$, $P(c_1, \tau_s)$, $P(c_2, \tau_s)$, $P(c_3, \tau_s)$ and $P(c_4, \tau_s)$, respectively.
Let us then define the tracker for marker $m$ at time $t$ as:

$$\mathcal{T}^m_t = \{ T^m_{0,t}, \ldots, T^m_{4,t}, l^m_t \} \tag{12}$$

where $T^m_{i,t}$ represents the Fourier transforms of the DCF for the marker center ($T^m_{0,t}$) and its four corners ($T^m_{i,t}, i \in \{1, \ldots, 4\}$) (see Eq. 7), while $l^m_t$ represents the pyramid level employed for correlation at time $t$. In the first frame, $l^m_1 = L(m)$.
3.2 Tracking and Update

In subsequent frames ($t > 1$), the filters are applied at the previous location, and the location of maximum filter response is obtained:

$$E^m_{i,t}(l^m_t) = \operatorname*{argmax}_{p \in \mathbb{R}^2} PSR(T^m_{i,t}, p, l^m_t), \tag{13}$$
where $PSR$ indicates the response of the filter $T^m_{i,t}$ centred at pixel $p$ in the image $I^{l^m_t}$. If the maximum $PSR$ for the marker tracker $T^m_{0,t}$ is below the established threshold value, the marker is considered lost.
A very important aspect to consider is the need for an accurate estimation of the marker corners. The corner locations estimated by Eq. 13 do not have the required accuracy for pose estimation. First, because tracking is normally done on a reduced version of the original image. Second, even if the tracker is run at the lowest pyramid level $I^0$, the result is not accurate enough. The corner locations must be refined with sub-pixel accuracy. Thus, in order to obtain a precise corner estimation, we employ an iterative corner upsampling process that produces a precise corner location $S(T^m_{i,t})$ in the original image $I^0$. To do so, first, a corner search with sub-pixel accuracy is performed in the vicinity of the estimated corner locations $E^m_{i,t}(l^m_t)$. For that purpose, the refinement method implemented in the OpenCV library [34] is employed. Then, the corner location is upsampled to the previous pyramid level $l^m_t - 1$, and the search is repeated. The process stops when the image $I^0$ is reached.
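A sketch of this iterative upsampling using OpenCV's sub-pixel corner refinement (cv2.cornerSubPix); the window size and termination criteria are our assumptions:

```python
import numpy as np
import cv2

def refine_corner(pyramid, corner, level, beta=0.7, win=5):
    """Iterative corner upsampling sketch: refine with sub-pixel accuracy at the
    tracking level, then propagate to finer levels until I0 is reached.
    pyramid: list of grayscale images, pyramid[0] = I0 (the finest level)."""
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.01)
    pt = np.array([[corner]], dtype=np.float32)  # shape (1, 1, 2)
    for l in range(level, -1, -1):
        pt = cv2.cornerSubPix(pyramid[l], pt, (win, win), (-1, -1), criteria)
        if l > 0:
            # coordinates at level l-1 are those at level l divided by beta
            pt = (pt / beta).astype(np.float32)
    return pt[0, 0]
```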
Adapting to scale is another crucial element for successful tracking. In the first frame, correlation is done at the pyramid level $l^m_1$ where the DCFs were initialized. However, due to scale changes of the marker (when approaching or moving away from it), the initial pyramid level $l^m_1$ may not be the one for which the filters obtain their maximum response. Thus, it is necessary to find the best pyramid level for the next frame. To do so, the response of the filter $T^m_{0,t}$ at the contiguous pyramid scales is analyzed, and the one maximizing the marker filter response is selected:

$$l^m_{t+1} = \operatorname*{argmax}_{l \in \{l^m_t + 1,\; l^m_t,\; l^m_t - 1\}} PSR(T^m_{0,t}, E^m_{0,t}(l), l) \tag{14}$$

Once the best pyramid level is found, all the filters are updated using the patches extracted from that level.
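A minimal sketch of this level selection (Eq. 14); `psr_at` is an assumed callback that runs the marker filter at a given level and returns its PSR:

```python
def best_pyramid_level(psr_at, level, n_levels):
    """Eq. 14 sketch: probe the contiguous pyramid levels and keep the one
    with the strongest marker-filter PSR."""
    candidates = [l for l in (level + 1, level, level - 1) if 0 <= l < n_levels]
    return max(candidates, key=psr_at)
```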
3.3 Confidence measure

The proposed method can track the marker $m$ under large appearance changes caused by blur (see Fig. 2). However, in some cases, the blurriness level is so high that the estimated location of the corners is not reliable enough for three-dimensional pose estimation.
We propose a confidence measure $w_m \in [0, 1]$ that indicates how reliable the estimation provided by our tracker is. As will become evident in the next section, this measure will allow favoring some markers over others when doing localization from multiple markers. Values near 1 indicate high confidence in the detection, while values near zero indicate low confidence.
The measure is composed of two terms. The first is the normalized Hamming distance between the marker bit sequence $b(m)$ and the bit sequence $b(p)$ observed for the polygon $p$ formed by the four marker corners estimated by our tracker: $\frac{H(b(m), b(p))}{|b(m)|}$. This value is then modulated by the response of the corner trackers, $\frac{\sum_{i=1}^{4} \text{PSR}_i}{4}$, where

$$\text{PSR}_i = \begin{cases} 1 & \text{if } PSR(T^m_{i,t}, E^m_{i,t}(l), l) > \chi \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

indicates whether the tracking of a corner was successful or not.
Thus, the confidence measure is expressed as:

$$w_m = 1 - \left( \frac{H(b(m), b(p))}{|b(m)|} \cdot \frac{\sum_{i=1}^{4} \text{PSR}_i}{4} \right). \tag{16}$$

We have found after several experiments that the combination of both terms provides better results than either of them separately.
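A sketch of Eqs. 15-16 as reconstructed above (bit sequences as 0/1 lists; the default for χ is taken from Table 1):

```python
def marker_confidence(bits_m, bits_p, corner_psrs, chi=5.7):
    """Eqs. 15-16 sketch: normalized Hamming distance between the expected and
    observed bit sequences, modulated by the corner-tracker success rate."""
    hamming = sum(a != b for a, b in zip(bits_m, bits_p)) / len(bits_m)
    success = sum(1.0 for psr in corner_psrs if psr > chi) / 4.0  # Eq. 15
    return 1.0 - hamming * success                                # Eq. 16
```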
4 Robust Marker-Map based Pose Estimation

This section explains our second contribution, an extension of the previous methodology aimed at camera pose estimation with marker maps. A marker map is a set of markers placed at known locations of the environment that are employed for camera localization in indoor environments. The observation of a single marker can be enough to obtain the pose of the camera in the map. However, the more markers are visible, the better the accuracy that can be obtained (see Fig. 6).
Our goal is to estimate the camera pose $\theta_t \in \mathbb{R}^6$ (position and angle) in the map given: (i) a set of markers $\mathcal{M}$ at known map locations, (ii) an image $I_t$ showing some of them, and (iii) the previous camera location $\theta_{t-1}$.

We shall define the set of markers in our map by

$$\mathcal{M} = \{ m = \{ q^m_1, \ldots, q^m_4 \} \}, \tag{17}$$

where $q^m_i \in \mathbb{R}^3$ represents the three-dimensional coordinates of the marker corners in the environment. The map can be obtained from images of the environment using any of the methods described in [9, 13, 35].
Given an image showing some of the markers, it is possible to estimate the camera pose by analyzing the set of 2D-3D correspondences. Since the 3D locations of the corners are known in advance ($\mathcal{M}$), their 2D image projections can be employed to find the pose between the camera and the global reference system by minimizing the reprojection error of the observed markers, as will be explained later in Sect. 4.2.

The rest of this section explains the proposed method to estimate the camera pose $\theta_t$ given an input image $I_t$, which is summarized in Alg. 1.
4.1 Method overview

Our method employs a set of trackers $\mathcal{T}_t = \{\mathcal{T}^m_t\},\; m \in \mathcal{M},\; t \geq 1$, to estimate the positions of the markers in the image, where $\mathcal{T}^m_t$ is the type of tracker defined in the previous section (Eq. 12). We are proposing a tracking method, and thus it requires initialization. The initial position $\theta_1$ and $\mathcal{T}_1$ are obtained from the markers detected with a marker detector [19, 36, 17].
In subsequent frames $I_t$, the trackers $\mathcal{T}_t$ are applied in order to find the new marker locations. Tracking of a marker may fail for several reasons: it falls outside the image view, occlusion, high blur, etc. Thus, we remove from $\mathcal{T}_t$ the trackers $\mathcal{T}^m_t$ with a low response (PSR) of the central tracker $T^m_{0,t}$. The corners of the remaining markers are employed to obtain an initial estimation of the camera pose $\hat{\theta}_t$ (Sect. 4.2).
As the camera moves along the environment, some markers will fall out of the camera view while others will appear. Since we know both the pose of the camera $\hat{\theta}_t$ and the three-dimensional locations of the markers $\mathcal{M}$, we can estimate which markers should be visible in the current image and where (Sect. 4.3). For each expected visible marker (not in $\mathcal{T}_t$), a quick detection is done on the expected image region where it should be visible. If correctly detected, a new tracker $\mathcal{T}^m_t$ is added to $\mathcal{T}_t$. After all the new markers have been added, we calculate the final camera pose $\theta_t$ using all the visible markers.
Tracking may fail either because of very fast movement causing a lot of blur (Fig. 2e), or because no markers are visible in the image. Thus, as a final step, we analyze whether a tracking confidence measure $w_{\mathcal{T}_t}$ (explained in Sect. 4.4) is high enough. If not, the tracking should stop until a reliable pose can be obtained using a regular marker detector [19, 36, 17] to restart tracking.
4.2 Camera pose estimation

The estimation of the camera pose given a set of markers detected in the image consists in minimizing the reprojection error of their corners, considering their confidence $w_m$ (Eq. 16):

$$\theta_t = \operatorname*{argmin}_{\theta} \sum_{m \in \mathcal{M}} w_m\, H(e^m_t(\theta)), \tag{18}$$

where $e^m_t(\theta)$ represents the reprojection error of the corners of marker $m$ and $H$ is the Huber function, employed to minimize the impact of possible outliers:

$$H(a) = \begin{cases} \frac{1}{2} a^2 & \text{for } |a| \leq \alpha \\ \alpha \left( |a| - \frac{1}{2} \alpha \right) & \text{otherwise} \end{cases} \tag{19}$$
The reprojection error of a marker $m$ is defined as:

$$e^m_t(\theta) = \sum_{i=1}^{4} \| \psi(q^m_i, \theta) - S(T^m_{i,t}) \|^2, \tag{20}$$

where the function $\psi(q, \theta) \in \mathbb{R}^2$ projects the three-dimensional point $q$ into the image given the camera pose $\theta$, and $S(T^m_{i,t})$ is the precise corner location in the original image $I^0$ (Sect. 3.2).
Algorithm 1: Tracking algorithm overview for image I_t
Data: M, θ_{t−1}, T_{t−1}, I_t
Result: θ_t, T_t, w_{T_t}
begin
    T_t ← ApplyFilters(T_{t−1}) (Sect. 3);
    for T^m_t ∈ T_t do
        if PSR(T^m_t) < χ then
            remove T^m_t from T_t
        end
    end
    Estimate pose θ̂_t using T_t (Sect. 4.2);
    Look for new visible markers (Sect. 4.3);
    for each new marker m do
        add T^m_t to T_t
    end
    Obtain the final pose θ_t using the updated T_t;
    Calculate tracking confidence w_{T_t} (Sect. 4.4);
end
Equation 18 is a non-linear function that can be efficiently minimized using the Levenberg–Marquardt (LM) algorithm [37].
4.3 Look for visible markers

As the video sequence progresses and the camera moves, markers will appear and disappear from the scene. The initialization of these markers is essential to achieve continuous tracking and accurate pose estimation.

Given that the three-dimensional locations of the marker corners in $\mathcal{M}$ are known, and an initial camera pose $\hat{\theta}_t$ for the image $I_t$ is available, we can calculate which markers should be visible in the image $I_t$ and where their corners should project.

For each expected marker, we apply a detection process in the region where it should be visible. First, Otsu's thresholding algorithm [33] is applied, and contours are extracted using the Suzuki and Abe algorithm [38]. Using the Douglas and Peucker algorithm [39], the largest squared polygon $p$ is selected. Then, the bits $b(p)$ of the polygon are extracted and, if they match the predicted marker bits $b(m)$ (using the Hamming distance), the marker is considered found, and a tracker is initialized and added to $\mathcal{T}_t$ to be employed for the next image.
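A sketch of the visibility prediction: project a map marker with the estimated pose and derive the region where it should be searched. The margin factor, the visibility test, and the function name are our assumptions:

```python
import numpy as np
import cv2

def predict_marker_roi(marker_3d, rvec, tvec, K, dist, img_shape, margin=1.4):
    """Project a map marker's corners with the estimated pose (Sect. 4.3 sketch)
    and return the search region in the image, or None if it is not visible."""
    proj, _ = cv2.projectPoints(np.asarray(marker_3d, dtype=np.float64),
                                rvec, tvec, K, dist)
    proj = proj.reshape(-1, 2)
    h, w = img_shape[:2]
    inside = ((proj[:, 0] > 0) & (proj[:, 0] < w) &
              (proj[:, 1] > 0) & (proj[:, 1] < h))
    if not inside.any():
        return None  # the marker does not project into the image
    cx, cy = proj.mean(axis=0)
    half = 0.5 * margin * max(proj[:, 0].ptp(), proj[:, 1].ptp())
    x0, y0 = max(0, int(cx - half)), max(0, int(cy - half))
    x1, y1 = min(w, int(cx + half)), min(h, int(cy + half))
    return x0, y0, x1, y1
```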
4.4 Calculate Tracking Confidence

As previously mentioned, tracking may fail due to the absence of markers, blur, bad lighting conditions, marker occlusion, or any other reason. Therefore, it is important to provide a confidence value indicating how reliable the estimated pose $\theta_t$ is. It allows determining whether tracking has failed; in that case, the system can stop tracking and use a slower but more conservative method for detecting the markers [19, 36, 17].

Table 1: Nomenclature and values of the main parameters used by the proposed method.

Parameter   Default value   Description
λ1          10^-4           Filter regularisation parameter (Sect. 2.2)
λ2          20              Context-aware parameter (Sect. 2.2)
η           0.2             Learning rate (Sect. 2.2)
τs          32              Filter size (Sect. 3.1)
β           0.7             Pyramid scale factor (Sect. 3.1)
χ           5.7             Peak to Sidelobe Ratio threshold (Alg. 1)
α           2.5             Huber function cut-off value (Sect. 4.2)
τc          0.1             Tracking Confidence Threshold (Sect. 4.4)
In this paper, we propose a confidence measure based on the following principle. If a single marker is spotted very near the camera, occupying a large region of the image, the estimation of the pose is reliable. However, if the same marker is detected far from the camera, occupying only a very small region of the image, the estimation is very unreliable. In the end, the reliability of the estimated pose depends mainly on the total area covered by the points employed for computing Eq. 18. If the points are far apart, occupying a large region of the image, the estimated pose is reliable, and vice versa.

So, let us define the confidence measure $w_{\mathcal{T}_t}$ as the relative area of the convex hull formed by the marker corners employed in Eq. 18. This value is one if the points cover the whole image, and tends to zero as they become more concentrated in a region. If the confidence $w_{\mathcal{T}_t}$ is below a threshold $\tau_c$, we consider that tracking has failed.
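A sketch of this measure using OpenCV's convex hull (the function name is illustrative):

```python
import numpy as np
import cv2

def tracking_confidence(corners_2d, img_shape):
    """w_Tt sketch (Sect. 4.4): relative area of the convex hull of all
    marker corners used in Eq. 18."""
    pts = np.asarray(corners_2d, dtype=np.float32).reshape(-1, 1, 2)
    hull = cv2.convexHull(pts)
    h, w = img_shape[:2]
    return float(cv2.contourArea(hull)) / float(w * h)
```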
5 Experiments and results

This section explains the experiments carried out to validate our proposal. The goal of the experiments is to evaluate the robustness, speed, and accuracy of the proposed method for marker tracking. Experiments have been divided into two categories. First, the individual marker tracking algorithm (Sect. 3) is tested, comparing it with state-of-the-art marker detection methods (Sect. 5.1) and correlation filter trackers (Sect. 5.2). Afterward, our method for camera pose estimation using marker maps (Sect. 4) is compared with state-of-the-art SLAM methods on challenging video sequences (Sect. 5.3).
All experiments have been performed using an Intel® Core™ i7-7500U CPU @ 2.70GHz × 4, with 8 GB of RAM, and the Ubuntu 18.04 operating system. Although some of the processing could be parallelized, only one thread has been used.

Several parameters that control the behavior of the proposed algorithms have been introduced along the paper. The values used for these parameters have been experimentally selected and are shown in Table 1.

Finally, we must indicate that the code has been integrated as part of the public ArUco library. We will refer to the proposed method as TR-ArUco. The code and the videos recorded to conduct the experiments are publicly available³.
5.1 Comparison with Fiducial Squared Marker Detectors

This section compares, in terms of speed and detection rate, the proposed method TR-ArUco against the main state-of-the-art marker detection and tracking algorithms: ArUco [19] and AprilTag [17]. Nowadays, both methods are widely used due to their high performance in terms of speed in detecting fiducial markers.

Both the AprilTag and ArUco detectors have configurable parameters establishing a balance between speed and detection range. For the AprilTag detector, this parameter is the decimation factor, and for the ArUco detector, it is the minMarkerSize. To make a fair comparison, parameter values that maximize the number of detections were chosen. Thus, for AprilTag a decimation factor of 2 has been employed, and minMarkerSize is set to 0 for ArUco. Additionally, two versions of the ArUco method have been used: the ArUco_NORMAL detection method, which employs an adaptive image threshold, and the ArUco_FAST detection method, which uses a global threshold.
A set of video sequences has been recorded showing a squared marker (of size 6×6 cm) printed on a piece of paper. Along the sequences, the marker remains static, while the camera moves at different speeds and distances from the marker. Throughout the sequences, the marker is seen with different sizes, lighting conditions, and degrees of deformation produced by blurring. In total, 10 video sequences, containing a total of 3326 video frames of resolution 1920×1080, have been recorded using a mobile phone. The location of the corners cannot be estimated in all the images of the sequences (see Fig. 2 (c-e)). However, we know the marker is visible in all the images. As a consequence, we can analyze the True Positive Ratio (TPR) of detections. Besides, we have analyzed the processing speed of the methods. Table 2 shows the results of the experiment for several resolutions of the video sequences (namely 1080p, 720p, and 480p).
As can be seen, our method obtains the highest TPR in all the tests performed. Also, although the proposed method is not the fastest one for all resolutions, it becomes the fastest one as the resolution increases. The speed of our method TR-ArUco is less affected by the image resolution because it employs filters of fixed size. The proposed pyramid method is the key to obtaining high frame rates as the resolution increases. In the end, our method obtains the highest TPR and is an order of magnitude faster than AprilTag, the second one with the highest TPR.
Table 3 shows a summary of the average time consumed by the different steps of the TR-ArUco method. Notice that steps 1.1-1.2 are only performed when the marker is not being tracked, i.e., in the first frame of the video sequence, or when a marker that was being tracked is lost. Steps 2.1-2.4 are performed on all frames. As can be seen, the computation times for the resolutions used are similar, with an average computation time of 4.19 ms.
³https://www.uco.es/investiga/grupos/ava/node/69
Table 2: For each method and image resolution: number of frames per second (FPS) and true positive rate (TPR).

Method          480p              720p              1080p
                FPS      TPR      FPS      TPR      FPS      TPR
TR-ArUco        348.986  0.719    297.750  0.901    234.796  0.946
AprilTag        132.567  0.479    49.581   0.644    26.978   0.629
ArUco_NORMAL    760.168  0.333    227.289  0.416    102.737  0.460
ArUco_FAST      806.543  0.358    570.364  0.454    198.292  0.532
Table 3: Mean computing times (milliseconds) of the different steps of the proposed method for different resolutions.

Step                          480p    720p    1080p
Step 1.1: ArUco detect        0.204   0.704   0.947
Step 1.2: Creating filters    0.028   0.029   0.031
Time Step 1 (ms)              0.232   0.733   0.978
Step 2.1: Convert to grey     0.267   0.492   1.365
Step 2.2: Optimal scale       1.191   1.382   1.645
Step 2.3: Track corners       0.984   1.006   1.038
Step 2.4: Track marker        1.050   1.111   1.142
Time Step 2 (ms)              3.492   3.991   5.190
Among the different phases, the selection of the optimal scale is the most time-consuming one. It is also affected by the visible area of the marker, i.e., the larger the marker appears in the image, the more computation is required.
5.2 Comparison with Discriminative Correlation Filters

This section compares the proposed method with state-of-the-art Discriminative Correlation Filter trackers, namely KCF [40], CSRT [26], MIL [41], TLD [42], MEDIANFLOW [43], MOSSE [22] and BOOSTING [44]. The implementations provided in the public OpenCV⁴ library have been employed.
The key aspect when detecting a squared fiducial marker is correctly detecting the position of its four corners in the image. Consequently, this experiment aims at evaluating the capability of the above-indicated DCFs to track the four corners of a marker.
The video sequences employed in the previous experiment have been used for this one. The ground truth has been obtained using the ArUco library, obtaining the location of the marker corners in those frames where the marker is detectable. For each one of the selected DCF trackers, we have applied the following methodology. A total of four independent trackers have been employed to track the marker corners. The trackers are initialized in the first frame to the center of each corner, and then the trackers are applied to the subsequent frames. The size of the filters is half the size of the marker (see Fig. 5). Whenever the tracking error becomes higher than a number of pixels ε, the trackers are re-initialized so as to avoid the trackers becoming completely lost for the rest of the sequence. For our tracker, we proceed in a similar way, re-initializing the tracker if the error in the estimation of the corners becomes greater than ε.
The results obtained for different values of ε are shown in Table 4, where each row represents a method. The columns show the total number of re-initializations required in the sequences evaluated (init), the average frames per second employed by each method (fps), and the average tracking error (err), expressed in pixels. In this set of experiments, only images of resolution 1080p have been employed.
⁴https://opencv.org/
Figure 5: Naive approach employed to track a marker, consisting in using four independent DCFs: one for each corner.
Table 4: Results obtained by different state-of-the-art DCF trackers. We evaluate the total number of tracking re-initializations (init), the computation time (fps), and the average tracking error (err).

              ε < 5                    ε < 10                   ε < 15
Method        init  fps      err       init  fps      err       init  fps      err
TR-ARUCO      24    287.42   0.82      18    266.14   0.98      13    294.24   1.13
CSRT          72    7.12     1.63      60    8.15     1.76      57    8.35     1.85
BOOSTING      98    13.16    1.83      91    12.68    1.92      84    12.85    2.30
MEDIANFLOW    116   19.55    1.92      77    25.26    2.92      61    21.26    4.06
MIL           245   5.99     2.68      163   6.00     3.97      112   5.90     4.65
MOSSE         270   1105.06  2.31      214   1035.57  3.38      166   1020.56  4.20
KCF           338   91.29    2.74      236   79.82    4.97      186   75.23    6.73
TLD           734   4.90     4.08      539   2.37     7.45      409   2.39     10.48
As can be observed, the proposed method TR-ArUco outperforms the rest of the methods in the three parameters evaluated. Our method obtains a stable frame rate, which is an order of magnitude faster than the rest of the methods (except for MOSSE). The same can be said about the number of re-initializations, which is much lower than in the rest of the algorithms. Finally, the corner tracking error of our method is the lowest of all. The main conclusion is that the proposed method outperforms the naive approach (i.e., using individual DCFs) for the given problem.
Figure 2 shows some of the images evaluated in this experiment, overlaying in red the estimations obtained. Figure 2(e) shows a case in which our method fails and requires re-initialization. As can be seen, our method requires re-initialization only in very extreme cases.
5.3 Comparison with SLAM methods

This section compares the TR-ArUco method for camera pose estimation using marker maps (Sect. 4) with state-of-the-art SLAM methods. The following SLAM algorithms have been tested:

• ORBSlam2 [11]: a SLAM method based on keypoints.
• LDSO [12]: a SLAM method based on photoconsistency.
• ArUco_MM [45]: a SLAM method based on fiducial squared markers.
• UcoSLAM [13]: a SLAM method using both keypoints and fiducial squared markers.
For evaluation purposes, we have employed two different datasets: the publicly available SPM dataset [9], and a new dataset created for this paper (the DCF dataset⁵).

Both datasets have been recorded in our laboratory, where a set of fiducial squared markers has been placed at random locations. The SPM dataset consists of eight video sequences recorded with a PtGrey FLEA3 camera capturing 1920×1080 images at 60 Hz. The videos show up to fifty different fiducial markers of 16.5 cm, distributed on the walls and ceiling of the room. The DCF dataset has nine video sequences recorded with an ELP camera capturing at a 30 Hz frame rate with a resolution of 1920×1080 pixels. In this case, a total of 102 markers of a smaller size (7.9 cm) have been distributed on the walls and ceiling of the room. The videos of the DCF dataset have been recorded moving the camera fast and with brusque movements with the aim of achieving different degrees of blurring. In both cases, the ground truth camera poses are obtained using an Optitrack motion capture system equipped with six cameras (see Fig. 6).
While ORBSlam2 and LDSO make no explicit use of the markers, the ArUco_MM and UcoSLAM methods use the markers for tracking. However, our method, TR-ArUco, requires the locations of the markers to be known in advance (i.e., the marker map). The map has been created with the UcoSLAM method using a long video sequence that covers all markers in the room.
For the SLAM methods, the following methodology has been employed to analyze the video sequences. Each sequence has been first processed to obtain the map, and then, using the generated map, it is processed again to estimate the camera poses at each frame. In this way, the SLAM methods are evaluated after correcting possible loops in the sequence, which yields better accuracy. In consequence, a fair comparison can be made with our method, which has a known map of the environment built in a previous phase.
Table 5 shows the results obtained. For each video sequence (row) and method (column), three measures have been obtained. First, the computing time (FPS). Second, the Absolute Trajectory Error (ATE), which is the translational RMSE after Sim(3) alignment [46] of the estimated poses with the ground truth. And third, the percentage of the video sequence frames for which the method provides a pose estimation (%Trck). It must be indicated that SLAM systems do not provide estimations in all the frames of a sequence: in some cases, they get lost due to fast movement or lack of texture.
Two conclusions can be drawn from Table 5. First, the proposed method outperforms the others in terms of speed and percentage of tracked frames. Second, the LDSO method performs poorly in most of the sequences tested.
However, comparing the results of two SLAM methods is not a trivial task. Imagine a method that only estimates the pose of the camera in the first ten frames, while a second method estimates poses in the whole sequence. Because of the reduced drift in the first frames, the total ATE of the first method will be smaller than the ATE of the second method (which evaluates the whole sequence). This is why %Trck is also an important aspect to consider.
The work [13] proposes an evaluation methodology to compare two SLAM methods A and B combining both the ATE and the %Trck. It defines a measure $S_p(A, B) \in [-1, 1]$ that employs a confidence level $p \in [0, 1]$. When $S_p(A, B)$ is close to 1, it indicates that method A is better than B, while values close to $-1$ indicate that method B is better than A.
⁵https://mega.nz/folder/LiRCDYYb#aAOjirkUt54-0CGr3C6-1g
Table 5: Results obtained for each method on the SPM [9] and DCF datasets. For each sequence, the frames per second (FPS), absolute trajectory error (ATE), and percentage of tracked frames (%Trck) are reported. A dash denotes sequences where a method produced no estimation.

Dataset Sequence | TR-ArUco          | ArUco_MM          | LDSO              | ORB_SLAM2         | UcoSLAM
                 | FPS  ATE   %Trck  | FPS  ATE   %Trck  | FPS  ATE   %Trck  | FPS  ATE   %Trck  | FPS  ATE   %Trck
SPM  video1      | 58.5 0.068 99.8   | 48.8 0.062 99.6   | 2.97 0.769 46.2   | 12.6 2.360 99.3   | 9.96 0.378 65.1
SPM  video2      | 62.0 0.111 99.8   | 53.4 0.103 98.5   | 1.47 1.250 99.8   | 10.8 0.575 97.4   | 22.1 0.054 99.8
SPM  video3      | 49.1 0.061 99.8   | 42.8 0.058 98.2   | 1.65 2.320 99.8   | 12.6 0.054 99.8   | 24.7 0.098 99.8
SPM  video4      | 44.6 0.015 99.8   | 45.5 0.013 99.2   | 0    0.000 0.03   | 12.2 0.020 99.8   | 24.9 0.011 99.8
SPM  video5      | 38.8 0.023 99.8   | 41.4 0.019 98.6   | 0    0.000 0.04   | 11.8 1.410 94.7   | 23.1 0.026 98.0
SPM  video6      | 34.2 0.145 99.8   | 45.1 0.018 98.6   | 0    0.000 0.04   | 11.4 0.527 96.8   | 6.46 0.670 52.2
SPM  video7      | 36.0 0.950 99.8   | 31.3 1.020 99.2   | 0    0.000 0.05   | 9.62 1.280 99.4   | 17.3 1.860 100.0
SPM  video8      | 36.5 0.077 99.9   | 41.8 0.077 99.6   | 0    -     0      | 3.05 0.437 55.0   | 17.9 0.049 99.8
DCF  video1      | 44.2 0.116 97.4   | 27.5 0.108 73.6   | 0    -     0      | 1.11 0.499 30.4   | 5.46 0.095 57.6
DCF  video2      | 43.6 0.095 96.5   | 31.1 0.114 80.2   | 0    -     0      | 0    -     0      | 9.18 0.109 73.1
DCF  video3      | 51.7 0.085 99.9   | 42.6 0.082 91.9   | 0    -     0      | 0    -     0      | 12.3 0.105 87.6
DCF  video4      | 46.5 0.072 99.9   | 44.2 0.076 93.2   | 0    -     0      | 0    -     0      | 12.7 0.074 88.8
DCF  video5      | 29.0 0.163 81.9   | 19.4 0.106 60.3   | 0    -     0      | 1.56 0.293 38.0   | 4.07 0.081 53.7
DCF  video6      | 38.7 0.093 99.9   | 31.6 0.101 82.3   | 0    -     0      | 0    -     0      | 7.82 0.092 75.8
DCF  video7      | 42.8 0.116 94.7   | 27.6 0.114 72.3   | 0    -     0      | 0.04 0.040 6.62   | 5.61 0.102 65.0
DCF  video8      | 52.1 0.067 99.9   | 52.6 0.071 98.5   | 0    -     0      | 7.39 0.303 84.4   | 14.8 0.065 96.6
DCF  video9      | 41.5 0.074 99.9   | 45.2 0.082 96.0   | 0    -     0      | 0    -     0      | 11.0 0.067 88.5
Table 6: Measure S_p(A, B) according to different confidence levels p of the analyzed methods (method A in rows, method B in columns). The final ranking shows TR-ArUco as the best, while LDSO provides the worst scores.

method A \ B | TR-ArUco              | ArUco_MM              | UcoSLAM               | ORB_SLAM2             | LDSO
             | p=0.01  p=0.1  p=0.25 | p=0.01  p=0.1  p=0.25 | p=0.01  p=0.1  p=0.25 | p=0.01  p=0.1  p=0.25 | p=0.01  p=0.1  p=0.25
TR-ArUco     | -       -      -      | 0.5     0.21   0.15   | 0.56    0.5    0.29   | 0.75    0.62   0.58   | 0.25    0.25   0.25
ArUco_MM     | -0.5    -0.21  -0.15  | -       -      -      | 0.26    0.21   0.18   | 0.62    0.58   0.58   | 0.12    0.25   0.25
UcoSLAM      | -0.56   -0.5   -0.29  | -0.26   -0.21  -0.18  | -       -      -      | 0.38    0.29   0.25   | 0.25    0.25   0.25
ORB_SLAM2    | -0.75   -0.62  -0.58  | -0.62   -0.58  -0.58  | -0.38   -0.29  -0.25  | -       -      -      | 0.062   0.12   0.12
LDSO         | -0.25   -0.25  -0.25  | -0.12   -0.25  -0.25  | -0.25   -0.25  -0.25  | -0.062  -0.12  -0.12  | -       -      -
Times winner | 4       4      4      | 3       3      3      | 2       2      2      | 1       1      1      | 0       0      0
Table 6 shows the values of $S_p(A, B)$ for each pair of methods, using the 17 sequences of the SPM and DCF datasets, for different confidence values. As can be seen, the proposed method TR-ArUco obtains better scores than the rest of the methods for the different confidence levels. The last row of the table indicates how many times a method obtains better results than the other methods. In our case, the value 4 means that the proposed method wins against the other four tested methods.

The main conclusion that can be obtained from this experiment is that the proposed method outperforms the state-of-the-art SLAM methods in terms of speed, accuracy, and sensitivity, for this particular problem.
Figure 6: Map of markers deployed in the laboratory for experimentation, together with some scenes of the environment corresponding to the first video.

6 Conclusions

This paper has proposed methods for tracking squared fiducial markers under challenging conditions. Our first contribution is a method for tracking a squared marker using a set of Discriminative Correlation Filters, which combines a proper scale selection with a corner upsampling strategy. The proposed method outperforms state-of-the-art methods for marker detection and standard DCFs in terms of speed, precision, and sensitivity. In addition, our method scales very well with image resolution, obtaining more than 200 fps on HD images using a single CPU thread.
Our second contribution is a method for low-cost camera pose estimation using fiducial marker maps. The proposed method is able to estimate the pose of a camera by tracking the positions of the already visible markers and predicting the locations of the markers appearing in the scene. Our method has been compared to state-of-the-art SLAM methods, obtaining better accuracy, sensitivity, and speed.
The proposed methods are publicly available for other researchers as part of the ArUco library⁶, and the datasets employed in this paper are available to ease the reproduction of the experiments.

⁶https://www.uco.es/investiga/grupos/ava/node/26
Acknowledgments

This project has been funded under projects TIN2016-75279-P and IFI16/00033 (ISCIII) of the Spanish Ministry of Economy, Industry and Competitiveness, and FEDER.
References
[1]
Hirenkumar Nakawala, Giancarlo Ferrigno, and Elena De Momi. Development of an intelligent surgical training
390
system for thoracentesis. Artificial Intelligence in Medicine, 84:50 – 63, 2018.391
[2]
P. Matthies, B. Frisch, J. Vogel, T. Lasser, M. Friebe, and N. Navab. Inside-Out Tracking for Flexible Hand-held
392
Nuclear Tomographic Imaging. In IEEE Nuclear Science Symposium and Medical Imaging Conference, San
393
Diego, USA, November 2015.394
6https://www.uco.es/investiga/grupos/ava/node/26
18
PREPRINT SUBMITTED TO EXPERT SYSTEMS WITH APPLICATIONS - JUNE 30, 2020
[3]
Praveen Kumar Kanithi, Jyotirmoy Chatterjee, and Debdoot Sheet. Immersive augmented reality system for
395
assisting needle positioning during ultrasound guided intervention. In Proceedings of the Tenth Indian Conference
396
on Computer Vision, Graphics and Image Processing, ICVGIP ’16, pages 65:1–65:8, New York, NY, USA, 2016.
397
ACM. 398
[4]
E. Marchand, H. Uchiyama, and F. Spindler. Pose estimation for augmented reality: A hands-on survey. IEEE
399
Transactions on Visualization and Computer Graphics, 22(12):2633–2651, 2016. 400
[5]
H. Duan and Q. Zhang. Visual measurement in simulation environment for vision-based uav autonomous aerial
401
refueling. IEEE Transactions on Instrumentation and Measurement, 64(9):2468–2480, 2015. 402
[6]
A. Marut, K. Wojtowicz, and K. Falkowski. Aruco markers pose estimation in uav landing aid system. In 2019
403
IEEE 5th International Workshop on Metrology for AeroSpace (MetroAeroSpace), pages 261–266, 2019. 404
[7]
M. F. Sani and G. Karimian. Automatic navigation and landing of an indoor ar. drone quadrotor using aruco
405
marker and inertial sensors. In 2017 International Conference on Computer and Drone Applications (IConDA),
406
pages 102–107, 2017. 407
[8]
R. Polvara, S. Sharma, J. Wan, A. Manning, and R. Sutton. Towards autonomous landing on a moving vessel
408
through fiducial markers. In 2017 European Conference on Mobile Robots (ECMR), pages 1–6, 2017. 409
[9]
R. Muñoz-Salinas, M. J. Marín-Jiménez, and R. Medina-Carnicer. Spm-slam: Simultaneous localization and
410
mapping with squared planar markers. Pattern Recognition, 86:156 – 171, 2019. 411
[10]
S. Garrido-Jurado, R. Muñoz Salinas, F. J. Madrid-Cuevas, and M. J. Marín-Jiménez. Automatic generation and
412
detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014. 413
[11]
R. Mur-Artal and J. D. Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras.
414
IEEE Transactions on Robotics, 33(5):1255–1262, 2017. 415
[12]
X. Gao, R. Wang, N. Demmel, and D. Cremers. Ldso: Direct sparse odometry with loop closure. In 2018
416
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2198–2204, 2018. 417
[13]
Rafael Muñoz-Salinas and R. Medina-Carnicer. Ucoslam: Simultaneous localization and mapping by fusion of
418
keypoints and squared planar markers. Pattern Recognition, 101:107193, 2020. 419
[14]
H. Kato and M. Billinghurst. Marker tracking and hmd calibration for a video-based augmented reality conferenc-
420
ing system. In Augmented Reality, 1999. (IWAR ’99) Proceedings. 2nd IEEE and ACM International Workshop
421
on, pages 85–94, 1999. 422
[15] Quentin Bonnard, Séverin Lemaignan, Guillaume Zufferey, Andrea Mazzei, Sébastien Cuendet, Nan Li, Ayberk 423
Özgür, and Pierre Dillenbourg. Chilitags 2: Robust fiducial markers for augmented reality and robotics., 2013. 424
[16]
D. Wagner and D. Schmalstieg. ARToolKitPlus for pose tracking on mobile devices. In Computer Vision Winter
425
Workshop, pages 139–146, 2007. 426
[17]
E. Olson. Apriltag: A robust and flexible visual fiducial system. In Robotics and Automation (ICRA), 2011 IEEE
427
International Conference on, pages 3400–3407, May 2011. 428
[18]
A. Sagitov, K. Shabalina, R. Lavrenov, and E. Magid. Comparing fiducial marker systems in the presence of
429
occlusion. In 2017 International Conference on Mechanical, System and Control Engineering (ICMSC), pages
430
377–382, 2017. 431
19
PREPRINT SUBMITTED TO EXPERT SYSTEMS WITH APPLICATIONS - JUNE 30, 2020
[19]
Francisco J. Romero-Ramirez, Rafael Muñoz-Salinas, and Rafael Medina-Carnicer. Speeded up detection of
432
squared fiducial markers. Image and Vision Computing, 76:38–47, 06 2018.433
[20]
F. J. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer. Fractal markers: A new approach for
434
long-range marker pose estimation under occlusion. IEEE Access, 7:169908–169919, 2019.435
[21]
M. Krogius, A. Haggenmiller, and E. Olson. Flexible layouts for fiducial tags. In 2019 IEEE/RSJ International
436
Conference on Intelligent Robots and Systems (IROS), pages 1898–1903, 2019.437
[22]
D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters.
438
In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2544–2550,
439
2010.440
[23]
João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation
441
filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, Mar 2015.442
[24]
M. Danelljan, F. S. Khan, M. Felsberg, and J. v. d. Weijer. Adaptive color attributes for real-time visual tracking.
443
In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1090–1097, 2014.444
[25]
Yang Li and Jianke Zhu. A scale adaptive kernel correlation filter tracker with feature integration. pages 254–265,
445
03 2015.446
[26]
Alan Lukežiˇ
c, Tomáš Vojíˇ
r, Luka ˇ
Cehovin Zajc, Jiˇ
rí Matas, and Matej Kristan. Discriminative correlation filter
447
tracker with channel and spatial reliability. International Journal of Computer Vision, 126(7):671–688, Jan 2018.
448
[27] C. Ma, J. Huang, X. Yang, and M. Yang. Hierarchical convolutional features for visual tracking. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 3074–3082, 2015.
[28] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pages 621–629, 2015.
[29] Martin Danelljan, Andreas Robinson, Fahad Khan, and Michael Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Computer Vision – ECCV 2016, pages 472–488. Springer International Publishing, 2016.
[30] M. Mueller, N. Smith, and B. Ghanem. Context-aware correlation filter tracking. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1387–1395, 2017.
[31] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4310–4318, 2015.
[32] Martin Danelljan, Gustav Häger, Fahad Khan, and Michael Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference (BMVC), pages 65.1–65.11, 2014.
[33] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
[34] Gary Bradski and Adrian Kaehler. Learning OpenCV: Computer Vision in C++ with the OpenCV Library. O'Reilly Media, Inc., 2nd edition, 2013.
[35] Rafael Muñoz-Salinas, Manuel J. Marín-Jimenez, Enrique Yeguas-Bolivar, and R. Medina-Carnicer. Mapping and localization from planar markers. Pattern Recognition, 73:158–171, January 2018.
[36] M. Fiala. Designing highly reliable fiducial markers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1317–1324, 2010.
[37] K. Madsen, H. B. Nielsen, and O. Tingleff. Methods for non-linear least squares problems (2nd ed.), 2004.
[38] S. Suzuki and K. Abe. Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing, 30(1):32–46, 1985.
[39] D. H. Douglas and T. K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, 1973.
[40] João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Computer Vision – ECCV 2012, pages 702–715. Springer Berlin Heidelberg, 2012.
[41] B. Babenko, Ming-Hsuan Yang, and Serge Belongie. Visual tracking with online multiple instance learning. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 983–990, June 2009.
[42] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.
[43] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In 2010 20th International Conference on Pattern Recognition, pages 2756–2759, 2010.
[44] Helmut Grabner, Michael Grabner, and Horst Bischof. Real-time tracking via on-line boosting. In British Machine Vision Conference (BMVC), volume 1, pages 47–56, 2006.
[45] Rafael Muñoz-Salinas, Manuel J. Marín-Jimenez, Enrique Yeguas-Bolivar, and R. Medina-Carnicer. Mapping and localization from planar markers. Pattern Recognition, 73:158–171, 2018.
[46] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), September 2014.