Speeded Up Detection of Squared Fiducial Markers
Francisco J. Romero-Ramirez1, Rafael Muñoz-Salinas1,2,∗, Rafael Medina-Carnicer1,2
Abstract
Squared planar markers have become a popular method for pose estimation in applications such as autonomous robots, unmanned vehicles or virtual trainers. The markers allow estimating the position of a monocular camera with minimal cost, high robustness, and speed. One only needs to create markers with a regular printer, place them in the desired environment so as to cover the working area, and then register their location from a set of images.
Nevertheless, marker detection is a time-consuming process, especially as the image dimensions grow. Modern cameras are able to acquire high-resolution images, but fiducial marker systems have not been adapted in terms of computing speed.
This paper proposes a multi-scale strategy for speeding up marker detection in video sequences by wisely selecting the most appropriate scale for detection, identification and corner estimation. The experiments conducted show that the proposed approach outperforms the state-of-the-art methods without sacrificing accuracy or robustness. Our method is up to 40 times faster than the state-of-the-art method, achieving over 1000 fps in 4K images without any parallelization.
Keywords: Fiducial Markers, Marker Mapping, SLAM.
1. Introduction
Pose estimation is a common task for many applications such as autonomous robots [1, 2, 3], unmanned vehicles [4, 5, 6, 7, 8] and virtual assistants [9, 10, 11, 12], among others.
Cameras are cheap sensors that can be effectively used for this task. In the ideal case, natural features such as keypoints or texture [13, 14, 15, 16] are employed to create a map of the environment. Although some of the traditional problems of these methods have been solved in the last few years, other problems remain. For instance, they are subject to filter stability issues or significant computational requirements.
In any case, artificial landmarks are a popular approach for camera pose estimation. Square fiducial markers, composed of an external squared black border and an internal identification code, are especially attractive because the camera pose can be estimated from the four corners of a single marker [17, 18, 19, 20]. The recent work of [21] is
∗Corresponding author
Email addresses: fj.romero@uco.es (Francisco J. Romero-Ramirez), in1musar@uco.es (Rafael Muñoz-Salinas), rmedina@uco.es (Rafael Medina-Carnicer)
1Departamento de Informática y Análisis Numérico, Edificio Einstein, Campus de Rabanales, Universidad de Córdoba, 14071, Córdoba, Spain, Tlfn: (+34) 957 212 289
2Instituto Maimónides de Investigación en Biomedicina (IMIBIC), Avenida Menéndez Pidal s/n, 14004, Córdoba, Spain, Tlfn: (+34) 957 213 861
a step forward in the use of this type of marker in large-scale problems. One only needs to print the set of markers with a regular printer, place them in the area in which the camera must move, and take a set of pictures of the markers. The pictures are then analyzed and the three-dimensional marker locations are automatically obtained. Afterward, a single image spotting a marker is enough to estimate the camera pose.
Despite the recent advances, marker detection can be a time-consuming process. Considering that the systems requiring localization have in many cases limited resources, such as mobile phones or aerial vehicles, the computational effort of localization should be kept to a minimum. The computing time employed in marker detection is a function of the image size: the larger the image, the slower the process. On the other hand, high-resolution images are preferable since markers can be detected with high accuracy even if far from the camera. The continuous reduction in the cost of cameras, along with the increase in their resolution, makes it necessary to develop methods able to reliably detect markers in high-resolution images.
The main contribution of this paper is a novel method for detecting square fiducial markers in video sequences. The proposed method relies on the idea that markers can be detected in smaller versions of the image, and employs a multi-scale approach to speed up computation while maintaining precision and accuracy. In addition, the system is able to dynamically adapt its parameters in order to achieve maximum performance in the analyzed video sequence. Our approach has been extensively tested and compared with the state-of-the-art methods for marker detection. The results show that our method is more than an order of magnitude faster than state-of-the-art approaches without compromising robustness or accuracy, and without requiring any type of parallelism.
The remainder of this paper is structured as follows. Section 2 reviews the works most related to ours. Section 3 details our proposal for speeding up the detection of markers. Finally, Section 4 gives an exhaustive analysis of the proposed method and Section 5 draws some conclusions.
2. Related works
Fiducial marker systems are commonly used for camera localization and tracking when robustness, precision, and speed are required. In the simplest case, points are used as fiducial markers, such as LEDs, retroreflective spheres or planar dots [22, 23]. However, their main drawback is the need for a method to solve the assignment problem, i.e., assigning a unique and consistent identifier to each element over time. In order to ease the problem, a common solution consists in adding an identifying code into each marker. Examples of this are planar circular markers [24, 25] and 2D barcodes [26, 27]; some authors have even proposed markers designed using evolutionary algorithms [28].
Amongst all proposed approaches, those based on squared planar markers have gained popularity. These markers consist of an external black border and an internal code (most often binary) that uniquely identifies each marker (see Fig 1). Their main advantage is that the pose of the camera can be estimated from a single marker.
ARToolKit [29] is one of the pioneering proposals. It employed markers with a custom pattern that is identified by template matching. This identification method, however, is prone to error and not very robust to illumination changes. In addition, the method's sensitivity degrades as the number of markers increases. As a consequence, other authors improved that work by using binary BCH codes [30] (which allow more robust error detection) and named it ARToolKit+ [31]. The project was halted and followed by the Studierstube Tracker project [32], which is proprietary. Similar to the ARToolKit+ project is the discontinued project ARTag [33].
BinARyID [34] is one of the first systems that proposed a method for generating customizable marker codes. Instead of using a predefined set of codes, it proposed a method for generating the desired number of codes for each particular application. However, it does not consider the possibility of error detection and correction. AprilTags [18], in contrast, proposed methods for error detection and correction, but its approach was not suitable for a large number of markers.
ArUco [17] is probably the most popular system for marker detection nowadays. It adapts to non-uniform illumination and is very robust, being able to perform error detection and correction of the binary codes employed. In addition, the authors proposed a method to obtain optimal binary codes (in terms of inter-marker distance) using Mixed Integer Linear Programming [35]. Chilitags [36] is a variation of ArUco that employs a simpler method for decoding the marker binary codes. As we show in the experimental section, the method performs poorly in high-resolution images.
The recent work [21] is a step towards the applicability of such methods to large areas, proposing a method for estimating the three-dimensional location of a set of markers freely placed in the environment (Fig 1). Given a set of images taken with a regular camera (such as a mobile phone), the method automatically estimates their location. This is an important step that allows extending the robust localization of fiducial markers to very large areas.
Although all fiducial marker systems aim at maximum speed in their design, few specific solutions have been proposed to speed up the detection process. The work of Johnston et al. [37] is an interesting example in which the authors propose a method to speed up computation by parallelizing the image segmentation process. Nevertheless, both speed and computing power are crucial aspects, especially if the localization system needs to be embedded in devices with limited resources.
Our work can be seen as an improvement of the ArUco system, which, according to our experience, is one of the most reliable fiducial marker systems nowadays (see Sec 4 for further details). We propose a novel method for marker detection and identification that speeds up the computing time in video sequences by wisely exploiting temporal information and applying a multi-scale approach. In contrast to previous works, no parallelization is required in our method, thus making it especially attractive for mobile devices with limited computational resources.
3. Speeded up marker detection
This section provides a detailed explanation of the method proposed for speeding up the detection of squared planar markers. First, Sect. 3.1 provides an overview of the pipeline employed in the previous work, ArUco [17], for marker detection and identification, highlighting the parts of the process amenable to acceleration. Then,
Figure 1: Detection and identification pipeline of ArUco. (a) Original image. (b) Image thresholded using an adaptive method. (c) Contours extracted. (d) Filtered contours that approximate to four-corner polygons. (e) Canonical image computed for one of the squared contours detected. (f) Binarization after applying Otsu's method.
Sect. 3.2 explains the proposed method to speed up the process.
3.1. Marker detection and identification in ArUco
The main steps for marker detection and identification proposed in ArUco [17] are depicted in Figure 1. Given the input image I (Figure 1a), the following steps are taken:
• Image segmentation (Figure 1b). Since the designed markers have an external black border surrounded by a white space, the borders can be found by segmentation. In their approach, a local adaptive method is employed: the mean intensity value m of each pixel is computed using a window of size w_t. The pixel is set to zero if its intensity is greater than m − c, where c is a constant value. This method is robust and obtains good results for a wide range of values of its parameters w_t and c.
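This local mean thresholding corresponds closely to OpenCV's mean adaptive threshold. A minimal sketch in Python, assuming a grayscale input and illustrative values for w_t and c (not the values used by ArUco):

    import cv2

    gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # illustrative input

    # Each pixel is compared against the mean m of its w_t x w_t window minus a
    # constant c; THRESH_BINARY_INV sets pixels brighter than (m - c) to zero,
    # which matches the rule described above.
    w_t, c = 15, 7  # illustrative parameters; w_t must be odd
    thresholded = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                        cv2.THRESH_BINARY_INV, w_t, c)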
• Contour extraction and filtering (Figures 1(c,d)). The contour following algorithm of Suzuki and Abe [38] is employed to obtain the set of contours from the thresholded image. Since most of the extracted contours correspond to irrelevant background elements, a filtering step is required. First, contours that are too small are discarded. Second, the remaining contours are approximated to their most similar polygon using the Douglas and Peucker algorithm [39]. Those that do not approximate well to a four-corner convex polygon are discarded from further processing.
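Both algorithms are exposed by OpenCV, so this filtering stage can be sketched as follows (the minimum perimeter and the approximation tolerance are illustrative assumptions; `thresholded` is the binary image from the previous sketch):

    import cv2

    contours, _ = cv2.findContours(thresholded, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_NONE)  # Suzuki-Abe [38]
    candidates = []
    for cnt in contours:
        perimeter = cv2.arcLength(cnt, True)
        if perimeter < 80:   # illustrative: discard contours that are too small
            continue
        # Douglas-Peucker polygonal approximation [39]; the tolerance is taken
        # here as a fraction of the perimeter (an assumption, not ArUco's value).
        poly = cv2.approxPolyDP(cnt, 0.05 * perimeter, True)
        if len(poly) == 4 and cv2.isContourConvex(poly):
            candidates.append(poly.reshape(4, 2))  # keep convex quadrilaterals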
• Marker code extraction (Figures 1(e,f)). The next step consists in analyzing the inner region of the remaining contours to determine which of them are valid markers. To do so, perspective projection is first removed by computing the homography matrix, and the resulting canonical image (Fig. 1e) is thresholded using Otsu's method [40]. The binarized image (Fig. 1f) is divided into a regular grid and each element is assigned a binary value according to the majority of the pixels in the cell. For each marker candidate, it is necessary to determine whether it belongs to the set of valid markers or whether it is a background element. Four possible identifiers are obtained for each candidate, corresponding to the four possible rotations of the canonical image. If any of the identifiers belongs to the set of valid markers, then the candidate is accepted.
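A sketch of this step, assuming an illustrative 6 × 6 bit grid and a 48-pixel canonical size (the actual grid depends on the marker dictionary):

    import cv2
    import numpy as np

    def extract_bits(gray, corners, grid=6, size=48):
        # Remove perspective: map the four corners to a canonical square.
        dst = np.array([[0, 0], [size - 1, 0], [size - 1, size - 1],
                        [0, size - 1]], dtype=np.float32)
        H = cv2.getPerspectiveTransform(corners.astype(np.float32), dst)
        canonical = cv2.warpPerspective(gray, H, (size, size))
        # Binarize the canonical image with Otsu's method [40].
        _, binarized = cv2.threshold(canonical, 0, 255,
                                     cv2.THRESH_BINARY | cv2.THRESH_OTSU)
        # Each grid cell takes the value of the majority of its pixels.
        cell = size // grid
        bits = np.zeros((grid, grid), dtype=np.uint8)
        for r in range(grid):
            for col in range(grid):
                patch = binarized[r * cell:(r + 1) * cell,
                                  col * cell:(col + 1) * cell]
                bits[r, col] = 1 if patch.mean() > 127 else 0
        return bits

    # The four rotations np.rot90(bits, k), k = 0..3, are then matched against
    # the dictionary of valid marker codes.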
• Subpixel corner refinement. The last step consists in estimating the location of the corners with subpixel accuracy. To do so, the method employs a linear regression of the marker's contour pixels. In other words, it estimates the lines of the marker sides employing all the contour pixels and computes their intersections. This method, however, is not reliable for uncalibrated cameras with short focal length lenses (such as fisheye cameras) since they usually exhibit high distortion.
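The line-intersection idea can be sketched as follows, assuming the contour pixels have already been split into the four ordered sides of the marker:

    import numpy as np

    def corners_from_sides(sides):
        """sides: four (N, 2) arrays of contour pixels, one per marker side,
        ordered so that consecutive sides share a corner."""
        lines = []
        for pts in sides:
            # Total least-squares line fit: the side direction is the dominant
            # singular vector of the centered points.
            mean = pts.mean(axis=0)
            _, _, vt = np.linalg.svd(pts - mean)
            d = vt[0]
            n = np.array([-d[1], d[0]])               # line normal
            lines.append((n[0], n[1], -n.dot(mean)))  # a*x + b*y + c = 0
        corners = []
        for i in range(4):
            a1, b1, c1 = lines[i]
            a2, b2, c2 = lines[(i + 1) % 4]
            # Corner = intersection of two adjacent side lines.
            corners.append(np.linalg.solve([[a1, b1], [a2, b2]], [-c1, -c2]))
        return np.array(corners)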
When analyzing the computing times of this pipeline, it can be observed that the Image segmentation and the Marker code extraction steps consume most of the computing time. The time employed in the image segmentation step is proportional to the image size, which also influences the length of the contours extracted and thus the computing time employed in the Contour extraction and filtering step. The extraction of the canonical image (in the Marker code extraction step) involves two operations. First, computing the homography matrix, which is cheap. But then, the inner region of each contour must be warped to create the canonical image. This step requires accessing the image pixels of the contour region and performing an interpolation in order to obtain the canonical image. The main problem is that the time required to obtain the canonical image depends on the size of the observed contour. The larger a contour in the original image, the more time is required to obtain the canonical image. Moreover, since most of the contours obtained do not belong to markers, the system may spend a large amount of time computing canonical images that will later be rejected.
A simpler approach to solving that problem would be to directly sample a few sets of pixels from the inner region of the marker. This is the method employed in ChiliTags. However, as will be shown in the experimental section, it is prone to many false negatives.
Figure 2: Process pipeline. Main steps for fast detection and identification of squared planar markers. (a) Original input image. (b) Resized image for marker search. (c) Thresholded image. (d) Rectangles found (pink). (e) Markers detected with their corresponding identification. The image pyramid is used to speed up homography computation. (f) The corners obtained in (e) are upsampled to find their location in the original image with subpixel precision.
3.2. Proposed method
The key ideas of our proposal to speed up the computation are explained below. First, while the adaptive thresholding method employed in ArUco is robust to many illumination conditions without altering its parameters, it is a time-consuming process that requires a convolution. By taking advantage of temporal information, the adaptive thresholding method is replaced by a global thresholding approach.
Second, instead of using the original input image, a smaller version is employed. This is based on the fact that, in most cases, the useful markers for camera pose estimation must have a minimum size. Imagine an image of dimensions 1920 × 1080 pixels, in which a marker is detected as a small square with a side length of 10 pixels. Indeed, the estimation of the camera pose is not reliable at such a small resolution. Thus, one might want to set a minimum length for the markers employed for camera pose estimation. For instance, let us say that we only use markers with a minimum side length of τ̇_i = 100 pixels, i.e., with a total area of 10,000 pixels. Another situation in which we can set a limit on the length of markers is when processing video sequences. It is clear that the length of a marker must be similar to its length in the previous frame.
Now, let us also think about the size of the canonical images employed (Figure 1e). The smaller the image, the faster the detection process but the poorer the image quality. Our experience, however, indicates that very reliable detection of the binary code can be obtained from very small canonical images, such as 32 × 32 pixels. In other words, all the rectangles detected in the image, no matter their side length, are reduced to canonical images of side length τ_c = 32 pixels for the purpose of identification.
Our idea, then, is to employ a reduced version of the input image, using the scale factor τ_c/τ̇_i, so as to speed up the segmentation step. In the reduced image, the smallest allowed markers, with a side length of 100 pixels in the original image, appear as rectangles with a side length of 32 pixels. As a consequence, there will be no loss of quality when they are converted into the canonical image.
This idea has one drawback: the locations of the corners extracted in the low-resolution image are not as good estimates as the ones that can be obtained in the original image. Thus, the pose estimated with them will have a higher error. To solve that problem, a corner upsampling step is included, in which the precision of the corners is refined up to subpixel accuracy in the original input image by employing an image pyramid.
Finally, it must be considered that the generation of the canonical image is a very time-consuming operation (even if the process is done in the reduced image) that is proportional to the contour length. We propose a method to perform the extraction of the canonical images in almost constant time (independently of the contour length) by wisely employing the image pyramid.
Below, there is a detailed explanation of the main steps of the proposed method, using Figure 2 to ease the explanation.
1. Image Resize: Given the input image I (Fig 2a), the first step consists in obtaining a resized version I^r (Fig 2b) that will be employed for segmentation. As previously pointed out, the size of the reduced image is calculated as:
$I^r_w = \frac{\tau_c}{\dot{\tau}_i} I_w; \qquad I^r_h = \frac{\tau_c}{\dot{\tau}_i} I_h, \qquad (1)$
where the subscripts w and h denote width and height, respectively.
Figure 3: Pyramidal Warping. Scene showing three markers at different resolutions. The left column shows the canonical images warped from the pyramid of images. Larger markers are warped from smaller images. For each marker, the image of the pyramid that minimizes the warping time while preserving the resolution is selected.
In order to decouple the desired minimum marker size from the input image dimensions, we define τ̇_i as:
$\dot{\tau}_i = \tau_c + \max(I_w, I_h)\,\tau_i, \qquad \tau_i \in [0,1], \qquad (2)$
where the normalized parameter τ_i indicates the minimum marker size as a value in the range [0, 1]. When τ_i = 0, the reduced image has the same size as the original image. As τ_i tends to one, the image I^r becomes smaller, and consequently, the computational time required for the following steps is reduced. The impact of this parameter on the final speedup is measured in the experimental section.
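A direct transcription of Eqs. (1) and (2) (function and variable names are ours):

    import cv2

    def resize_for_detection(image, tau_i, tau_c=32):
        """Compute the reduced image I^r from I following Eqs. (1) and (2)."""
        h, w = image.shape[:2]
        tau_i_dot = tau_c + max(w, h) * tau_i   # Eq. (2), tau_i in [0, 1]
        scale = tau_c / tau_i_dot               # tau_i = 0 gives scale = 1
        reduced = cv2.resize(image, (int(w * scale), int(h * scale)))  # Eq. (1)
        return reduced, scale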
2. Image Segmentation: As already indicated, a global threshold method is employed using the following policy. If no markers were detected in the previous frame, a random threshold search is performed. The random process is repeated up to three times using the range of threshold values [10, 240]. For each tested threshold value, the whole pipeline explained below is performed. If after this number of attempts no marker is found, it is assumed that no markers are visible in the frame. If at least one marker is detected, a histogram is created using the pixel values of all detected markers. Then, Otsu's algorithm [40] is employed to select the optimal threshold for the next frame. The calculated threshold is applied to I^r in order to obtain I^t (Fig 2c). As we show experimentally, the proposed method can adapt to smooth and abrupt illumination changes.
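The thresholding policy can be sketched as below; the retry count and the threshold range follow the text, while `run_pipeline` and the marker pixel accessor are hypothetical placeholders for the rest of the detection pipeline:

    import random
    import cv2
    import numpy as np

    def segment_and_detect(reduced, threshold, run_pipeline):
        """Global thresholding with random search when tracking is lost."""
        tries = [threshold] if threshold is not None else \
                [random.randint(10, 240) for _ in range(3)]  # up to 3 attempts
        for t in tries:
            _, binary = cv2.threshold(reduced, t, 255, cv2.THRESH_BINARY_INV)
            markers = run_pipeline(binary)  # hypothetical: rest of the pipeline
            if markers:
                # Otsu [40] over the pixels of the detected markers gives the
                # threshold for the next frame.
                pixels = np.concatenate(
                    [m.pixels for m in markers]).reshape(-1, 1)  # hypothetical
                t_next, _ = cv2.threshold(pixels, 0, 255,
                                          cv2.THRESH_BINARY | cv2.THRESH_OTSU)
                return markers, t_next
        return [], None  # assume no markers are visible in this frame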
3. Contour Extraction and Filtering: First, contours are extracted from the image I^t using the Suzuki and Abe algorithm [38]; then, small contours are removed. Since the extracted contours will rarely be squared (due to perspective projection), their perimeter is employed for rejection purposes: those with a perimeter smaller than P(τ_c) = 4 τ_c pixels are rejected. For the remaining contours, a polygonal approximation is performed using the Douglas and Peucker algorithm [39], and those that do not approximate to a convex polygon of four corners are also rejected. Finally, the remaining contours are the marker candidates (Fig 2d).
4. Image Pyramid Creation: An image pyramid
I = (I^0, . . . , I^n),
consisting of a set of resized versions of I, is created. I^0 denotes the original image and each subsequent image I^i is created by subsampling I^{i−1} by a factor of two. The number n of images in the pyramid is such that the smallest image dimensions are close to τ_c × τ_c, i.e.,
$n = \underset{v \,\mid\, I^v \in \mathcal{I}}{\operatorname{argmin}} \left| I^v_w I^v_h - \tau_c^2 \right|. \qquad (3)$
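A sketch of the pyramid construction; the loop stops once the next level would fall below the canonical area τ_c², approximating the argmin of Eq. (3):

    import cv2

    def build_pyramid(image, tau_c=32):
        """I = (I^0, ..., I^n); each level subsamples the previous one by two."""
        pyramid = [image]
        while True:
            h, w = pyramid[-1].shape[:2]
            if (w // 2) * (h // 2) < tau_c * tau_c:
                break  # next level would be smaller than tau_c x tau_c
            pyramid.append(cv2.pyrDown(pyramid[-1]))
        return pyramid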
5. Marker Code Extraction: In this step the canonical images of the remaining contours must be extracted and then binarized. Our method uses the pyramid of images I previously computed to ensure that the process is performed in constant time, independently of the input image and contour sizes. The key principle is selecting, for each contour, the image from the pyramid in which the contour length is most similar to the canonical image length P(τ_c). In this manner, warping is faster.
Let us consider a detected contour ϑ ∈ I^r, and denote by P(ϑ)_j its perimeter in the image I^j ∈ I. Then, the best image I^h ∈ I for homography computation is selected as:
$h = \underset{j \in \{0,1,\ldots,n\}}{\operatorname{argmin}} \left| P(\vartheta)_j - P(\tau_c) \right|. \qquad (4)$
The pyramidal warping method employed can be better understood in Fig. 3, which shows a scene with three markers at different distances. The left images represent the canonical images obtained while the right images show the pyramid of images. In our method, the canonical image of the smallest marker is extracted from the largest image in the pyramid (top row of Fig 3).
Figure 4: Test sequences. (a) The set of 16 markers employed for evaluation. There are four markers from each method tested: ArUco, AprilTags, ArToolKit+ and ChiliTags. (b-e) Images from the video sequences used for testing. The markers are seen as small as in (b), and as big as in (e), where the marker represents 40% of the total image area.
As the length of the marker increases, smaller images of the pyramid are employed to obtain the canonical view. This guarantees that the canonical image is obtained in almost constant time using the minimum possible computation.
Finally, for each canonical image, Otsu's method [40] for binarization is employed, and the inner code is analyzed to determine whether it is a valid marker or not. This is a very cheap operation.
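The level selection of Eq. (4) reduces to a small search; a sketch, where the contour perimeter measured in I^r is rescaled to each pyramid level:

    def best_pyramid_level(perimeter_r, scale_r, n, tau_c=32):
        """Pick h per Eq. (4): the level whose contour perimeter is closest
        to P(tau_c). perimeter_r is measured in I^r; scale_r is the Eq. (1)
        scale factor between I^r and the original image I^0."""
        target = 4 * tau_c                 # P(tau_c)
        best_j, best_diff = 0, float("inf")
        for j in range(n + 1):
            p_j = (perimeter_r / scale_r) / (2 ** j)  # perimeter at level j
            diff = abs(p_j - target)
            if diff < best_diff:
                best_j, best_diff = j, diff
        return best_j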
6. Corner Upsampling: So far, markers have been detected in the image I^r. However, it is required to precisely localize their corners in the original image I. As previously indicated, the precision of the estimated camera pose is directly influenced by the precision of the corner localization. Since the difference in size between the images I and I^r can be very large, a direct upsampling can lead to errors. Instead, we proceed in incremental steps, looking for the corners in larger versions of the image I^r until the image I is reached.
For the corner upsampling task, the image I^i ∈ I of the pyramid with the size most similar to I^r is selected in the first place, i.e.,
$I^i = \underset{I^v \in \mathcal{I}}{\operatorname{argmin}} \left| I^v_w I^v_h - I^r_w I^r_h \right|. \qquad (5)$
Then, the position of each contour corner in the image I^i is computed by simply upsampling the corner locations. This is, however, an approximate estimate that does not precisely indicate the corner position in the image I^i. Thus, a corner refinement process is performed in the vicinity of each corner so as to find its best location in the selected image I^i. For that purpose, the method implemented in the OpenCV library [41] has been employed. Once the search is done in I^i for all corners, the operation is repeated for the image I^{i−1}, until I^0 is reached. In contrast to the ArUco approach, this one is not affected by lens distortions.
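A sketch of the incremental upsampling loop, using cv2.cornerSubPix as the OpenCV refinement routine mentioned above (the search window size and termination criteria are illustrative):

    import cv2
    import numpy as np

    def upsample_corners(corners, pyramid, start_level):
        """Refine corners from pyramid level `start_level` (the level closest
        in size to I^r, Eq. (5)) down to the original image I^0."""
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 12, 0.005)
        pts = corners.astype(np.float32)
        for level in range(start_level, -1, -1):
            if level != start_level:
                pts *= 2.0  # approximate location in the next (larger) level
            # Search the best corner location in a small neighborhood.
            pts = cv2.cornerSubPix(pyramid[level], pts.reshape(-1, 1, 2),
                                   (4, 4), (-1, -1), criteria).reshape(-1, 2)
        return pts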
7. Estimation of τ_i: The parameter τ_i has a direct influence on the computation time. The higher it is, the faster the computation. A naive approach consists in setting a fixed value for this parameter. However, when processing video sequences, the parameter can be automatically adjusted at the end of each frame. In the first image of the sequence, the parameter τ_i is set to zero. Thus, markers of any size are detected. Then, for the next frame, τ_i is set to a value slightly smaller than the size of the smallest marker detected in the previous frame. In this way, markers can be detected even if the camera moves away from them. Therefore, the parameter τ_i can be dynamically updated as:
$\tau_i = (1 - \tau_s)\, P(\vartheta_s)/4, \qquad (6)$
where ϑ_s is the marker with the smallest perimeter found in the image, and τ_s is a factor in the range (0, 1] that accounts for the camera motion speed. For instance, when τ_s = 0.1, it means that in the next frame, τ_i is such that markers 10% smaller than the smallest marker in the current image will be sought. If no markers are detected in a frame, τ_i is set to zero so that in the next frame markers of any size can be detected.
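Eq. (6) in code form; following the paper, the value is derived from the side length of the smallest detected marker, and its normalization into the [0, 1] range of Eq. (2) is left implicit:

    def update_tau_i(perimeters, tau_s=0.1):
        """Dynamic update of tau_i at the end of each frame (Eq. 6).
        perimeters: perimeters of the markers detected in the current frame;
        tau_s in (0, 1] accounts for the camera motion speed."""
        if not perimeters:
            return 0.0  # nothing detected: search markers of any size next frame
        return (1.0 - tau_s) * min(perimeters) / 4.0  # (1-tau_s) * P(theta_s)/4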
Figure 5: SpeedUp of ArUco3 compared to ArUco, ArToolKit+, ChiliTags and AprilTags for resolutions: 4K (3840 × 2160), 1080p (1920 × 1080), 720p (1280 × 720), 600p (800 × 600) and 480p (640 × 480). The horizontal axis represents the percentage of area occupied by the markers in each frame, and the vertical axis indicates how many times ArUco3 is faster.
As can be observed, the proposed pipeline includes a number of differences with respect to the original ArUco pipeline that allow a significant increase in the processing speed, as we show next.
4. Experiments and results
This section shows the results obtained to validate the methodology proposed for the detection of fiducial markers.
First, in Sect 4.1, the computing times of our proposal are compared to the best alternatives found in the literature: AprilTags [18], ChiliTags [36], ArToolKit+ [31], as well as ArUco [17], which is included in the OpenCV library (https://opencv.org/). Then, Sect. 4.2 analyses and compares the sensitivity of the proposed method with the above-mentioned methods. The main goal is to demonstrate that our approach is able to reliably detect the markers with a very high true positive ratio, under a wide range of marker resolutions, while keeping the false positive rate at zero. Afterward, Sect. 4.3 studies the impact of the different system parameters on the speed and sensitivity, while Sect. 4.4 evaluates the precision in the estimation of the corners. Finally, Sect. 4.5 shows the performance of the proposed method in a realistic video sequence with occlusions, illumination, and scale changes.
To carry out the first three experiments, several videos have been recorded in our laboratory. Figure 4(b-e) shows some images of the video sequences employed. For these tests, a panel with a total of 16 markers was printed (Figure 4a), four from each one of the fiducial marker systems employed. The sequences were recorded at different distances at a frame rate of 30 fps using an Honor 5 mobile phone at 4K resolution. The videos employed are publicly available for evaluation purposes (https://mega.nz/#F!DnA1wIAQ!6f6owb81G0E7Sw3EfddUXQ).
In the videos, there are frames in which the markers appear as small as can be observed in Figure 4b, where the area of each marker occupies only 0.5% of the image, and frames in which the marker is observed as big as in Figure 4e, where the marker occupies 40% of the total image area. In total, the video sequences recorded sum up to 10,666 frames. The video frames have been processed at different resolutions so that the impact of the image resolution on the computing time can be analyzed. In particular, the following standard image resolutions have been employed: 4K (3840 × 2160), 1080p (1920 × 1080), 720p (1280 × 720), 600p (800 × 600) and 480p (640 × 480).
All tests were performed using an Intel Core i7-4700HQ 8-core processor with 8 GB RAM and Ubuntu 16.04 as the operating system. However, only one execution thread was employed in the tests performed.
It must be indicated that the code generated as part of this work has been publicly released as version 3 of the popular ArUco library (http://www.uco.es/grupos/ava/node/25). So, in the experiments section, the method proposed in this paper will be referred to as ArUco3.
4.1. Speedup
This section compares the computing times of the proposed method with the most commonly used alternatives: AprilTags, ArToolKit+, ChiliTags, and ArUco. To do so, we compute the speedup of our approach as the ratio between the computing time of an alternative (t_1) and the computing time of ArUco3 (t_2) in processing the same image:
$\mathrm{SpeedUp} = t_1 / t_2. \qquad (7)$
In our method, the value τ_c = 32 was employed in all the sequences, while τ_i and the segmentation threshold were automatically computed as explained in Steps 2 and 7 of the proposed method (Sect. 3.2).
Table 1: Mean computing times (milliseconds) of the different steps of the proposed method for different resolutions.

Step                                       480p   600p   720p   1080p  2160p
Step 1: Image Resize                       0.037  0.050  0.057  0.068  0.101
Step 2: Image Segmentation                 0.044  0.048  0.059  0.084  0.351
Step 3: Contour Extraction and Filtering   0.219  0.250  0.301  0.403  1.109
Step 4: Image Pyramid Creation             0.037  0.076  0.096  0.186  0.476
Step 5: Marker Code Extraction             0.510  0.519  0.542  0.547  0.583
Step 6: Corner Upsampling                  0.058  0.065  0.079  0.096  0.134
Total time (ms)                            0.903  1.009  1.133  1.384  2.755
Fig. 5 shows the speedup of our approach for different image resolutions. The horizontal axis represents the relative area occupied by the marker in the image, while the vertical axis represents the speedup. A total of 30 speed measurements were performed for each image, taking the median computing time for our evaluation. In the tests, the speedup is evaluated as a function of the observed marker area in order to better understand the behavior of our approach.
The tests conducted clearly show that the proposed method (ArUco3) is faster than the rest of the methods and that the speedup increases with the image resolution and with the observed marker area. Compared to the ArUco implementation in the OpenCV library, the proposed method is significantly faster, achieving a minimum speedup of 17 at 4K resolution, and up to 40 in the best case.
In order to properly analyze the computing times of the different steps of the proposed method (Sect. 3.2), Table 1 shows a summary for different image resolutions. Likewise, Fig. 6 shows the percentage of the total time required by each step. Please notice that Step 7 (Eq. 6) has been omitted because its computing time is negligible.
As can be seen, the two most time-consuming operations are Steps 3 and 5. In particular, Step 5 requires special attention, since it proves the validity of the multi-scale method proposed for marker warping. It can be observed in the table that the amount of time employed by Step 5 is roughly constant across all resolutions. In other words, the computing time does not increase significantly with the image resolution. Also notice how the time of Step 3 increases at 2160p. This is because this step involves operations that depend on the image dimensions, which grow quadratically. An interesting line of future work is to develop methods that reduce the time for contour extraction and filtering in high-resolution images.
In any case, considering the average total computing time, the proposed method achieves on average more than 360 fps at 4K resolution and more than 1000 fps at the lowest resolution, without any parallelism.
Figure 6: Main steps of ArUco3: times. Percentage of the total computation time required by each of the steps for resolutions: 4K, 1080p, 720p, 600p and 480p.
4.2. Sensitivity analysis
Correct detection of markers is a critical aspect that must be analyzed to verify that the proposed algorithm is able to discard irrelevant information present in the scene, extracting exclusively marker information. Fig. 7 shows the True Positive Rate (TPR) of the proposed method as a function of the area occupied by the marker in the image for different image resolutions.
As can be observed, below a certain marker area, the detection is not reliable. This is because the observed marker area is very small, making it difficult to distinguish the different bits of the inner binary code. Once the observed area of the marker reaches a certain limit, the proposed method achieves perfect detection at all resolutions. It must be remarked that the False Positive Rate is zero in all cases tested. Since it is a binary problem, the True Negative Rate is one (TNR = 1 − FPR).
For a comparative performance evaluation of ArUco3 and the other methods, the TPR has been analyzed individually and the results are shown in Fig. 7. As can be observed, ArUco behaves exactly like ArUco3. AprilTags, however, shows very poor behavior at all resolutions, especially as the marker or the image size increases. As already commented in Sect. 2, AprilTags does not rely on warping the marker image but instead subsamples a few pixels of the image in order to obtain the binary code. This may be one of the reasons for its poor performance. ArToolKit+ behaves reasonably well across all the image resolutions and marker areas, while Chilitags shows a somewhat unreliable behavior at all resolutions but 480p.
In conclusion, the proposed approach behaves similarly to the previous version of ArUco.
Figure 7: True Positive Ratio. Mean true positive ratio (TPR) for ArUco3, Chilitags, ArUco, ArToolKit+ and AprilTags for resolutions 4K, 1080p, 720p, 600p and 480p, as a function of the observed area of the set of markers.
4.3. Analysis of parameters
The computing time and robustness of the proposed method depend mainly on two parameters, namely τ_i, which indicates the minimum size of the markers detected, and τ_c, the size of the canonical image.
The parameter τ_i has an influence on the computing time, since it determines the size of the resized image I^r (Eq. 1). We have analyzed the speed as a function of this parameter and the results are shown in Fig. 8. The horizontal axis represents the value of τ_i, and the vertical axis the average speed (measured as frames per second) in the sequences analyzed, independently of the observed marker area. A different line has been depicted for each image resolution. In this case, we have fixed the parameter τ_c = 32.
It can be observed that the curves follow a similar pattern in the five cases analyzed. In general, the maximum increase in speed is obtained in the range of values τ_i = (0, 0.2). Beyond that point, the improvement becomes marginal. To better understand the impact of this parameter, Table 2 shows the reduction of the input image size I for different values of τ_i. For instance, when τ_i = 0.02, the resized image I^r is 48% smaller than the original input image I (see Eq. 1). Beyond τ_i = 0.2, the resized image is so small that it does not have a big impact on the speedup, because there are other steps with a fixed computing time, such as Step 5 (Marker Code Extraction).
Table 2: Image size reduction for different values of τ_i.

τ_i             0.01  0.015  0.02  0.1  0.2
Size reduction  0%    31%    48%   82%  90%
In any case, it must be noticed that the proposed method is able to achieve 1000 fps at 4K resolution when detecting markers larger than 10% (τ_i = 0.1) of the image area, and the same limit of 1000 fps is achieved at 1080p resolution for τ_i = 0.05.
With regard to the parameter τ_c, it indirectly influences the speed since it determines the size of the resized images (Eq 1). The smaller it is, the smaller the resized image I^r. Nevertheless, this parameter also has an influence on the correct detection of the markers. The parameter indicates the size of the canonical images used to identify the binary code of markers. If the canonical image is very small, pixels are mixed up, and identification is not robust. Consequently, the goal is to determine the minimum value of τ_c that achieves the best TPR. Fig. 9 shows the TPR obtained for different configurations of the parameter τ_c. As can be seen, for low values of the parameter τ_c (between 8 and 32) the system shows problems in the detection of markers, whereas beyond τ_c = 32 there is no further improvement in the TPR. Thus, we conclude that the value τ_c = 32 is the best choice.
Figure 8: Parameter τ_i. Speed of the method as a function of the parameter τ_i for the different resolutions tested.
Figure 9: Parameter τ_c. True positive rate obtained by different configurations of the parameter τ_c.
Figure 10: Vertex jitter measured for the different marker systems.
4.4. Precision of corner detection
An important aspect to consider in the detection of the markers is vertex jitter, which refers to the noise in the estimation of the corners' location. These errors are problematic because they propagate to the estimation of the camera pose. In our method, a corner upsampling step (Step 6 in Sect. 3.2) is proposed to refine the corner estimates from the reduced image I^r to the original image I. This section analyzes the proposed method, comparing the results with the other marker systems.
In order to perform the experiments, the camera has been placed at a fixed position recording the set of markers already presented in Fig. 4a. Since the camera is not moving, the average location estimated for each corner can be considered to be the correct one (i.e., a Gaussian error distribution is assumed). Then, the standard deviation is used as an error measure for the localization of the corners. The process has been repeated a total of six times at varying distances and the results obtained are shown in Fig. 10 as box plots. Table 3 indicates the average error of each method.
Table 3: Vertex jitter analysis: standard deviations of the different methods in estimating the marker corners.

Method               ArUco  ArUco3  Chilitags  AprilTags  ArToolKit+
Average error (pix)  0.140  0.161   0.174      0.225      0.432
As can be observed, the ArUco system obtains the best results, followed by our proposal ArUco3. However, it can be seen that the difference between both methods is of only 0.02 pixels, which is too small to be considered relevant. Chilitags shows a behavior similar to that of ArUco and ArUco3, but AprilTags and ArToolKit+ exhibit worse performance.
4.5. Video sequence analysis
This section aims at showing the behavior of the proposed system in a realistic scenario. For that purpose, four markers have been placed in an environment with irregular lighting and a video sequence has been recorded using a 4K mobile phone camera. Figure 11(a-e) shows the frames 1, 665, 1300, 1700 and 2100 of the video sequence. At the start of the sequence, the camera is around five meters away from the markers. The camera approaches the markers and then moves away again. As can be seen, around frame 650 (Figure 11b), the user temporarily occludes the markers.
Figure 11f shows the values of the parameter τ_i automatically calculated along the sequence, and Figure 11g the processing speed. As can be observed, the system is able to automatically adapt the value of τ_i according to the observed marker area, thus adapting the computing speed of the system. The maximum speed is obtained around frame 1300, when the camera is closest to the markers.
It can also be observed that around frame 650, when the user occludes the markers with his hand, the system is unable to detect any marker. Thus, the system searches the full-resolution image (τ_i = 0) and the speed decreases. However, when the markers are observed again, the system recovers its speed.
Finally, Figure 11h shows the threshold values employed for segmentation in each frame. As can be seen, the system adapts to the illumination changes. Along the sequence, the system does not produce any false negatives or false positives.
5. Conclusions and future work
This paper has proposed a novel approach for detecting fiducial markers aimed at maximizing speed while preserving accuracy and robustness.
Figure 11: Video Sequence in a realistic scenario. (a-e) Frames of the video sequence. The camera approaches the markers and then moves away. The user occludes the camera temporarily. (f) Evolution of the parameter τ_i automatically computed. (g) Speed of the proposed method in each frame of the sequence. (h) Thresholds automatically computed for each frame. The system adapts to illumination changes.
The proposed method is especially designed to take advantage of the increasing camera resolutions available nowadays. Instead of detecting markers in the original image, a smaller version of the image is employed, in which the detection can be done at a higher speed. By wisely employing a multi-scale image representation, the proposed method is able to find the position of the marker corners with subpixel accuracy in the original image. The size of the processed image, as well as the threshold employed for segmentation, are dynamically adapted in each frame considering the information of the previous one. As a consequence, the system speed dynamically adapts in order to achieve the maximum performance.
As shown experimentally, the proposed method outperforms the state-of-the-art systems in terms of computing speed, without compromising sensitivity or precision. Our method is between 17 and 40 times faster than the ArUco approach implemented in the OpenCV library. When compared to other approaches such as Chilitags, AprilTags, and ArToolKit+, our method achieves even higher speedups.
As possible future work, we consider investigating the use of the proposed method with fisheye cameras, analyzing its performance in the presence of high distortion and comparing it with the results obtained on rectified images, as well as characterizing the performance when multiple fiducial markers with significantly different scales are present in the same image.
Our system, which is publicly available as open-source code (http://www.uco.es/grupos/ava/node/25), is a cost-effective tool for fast and precise self-localization in applications such as robotics, unmanned vehicles or augmented reality.
Acknowledgments
This project has been funded under projects TIN2016-75279-P and IFI16/00033 (ISCIII) of the Spanish Ministry of Economy, Industry and Competitiveness, and FEDER.
References
[1] R. Sim, J. J. Little, Autonomous vision-based robotic exploration and mapping using hybrid maps and particle filters, Image and Vision Computing 27 (1) (2009) 167-177, Canadian Robotic Vision 2005 and 2006.
[2] A. Pichler, S. C. Akkaladevi, M. Ikeda, M. Hofmann, M. Plasch, C. Wögerer, G. Fritz, Towards shared autonomy for robotic tasks in manufacturing, Procedia Manufacturing 11 (Supplement C) (2017) 72-82, 27th International Conference on Flexible Automation and Intelligent Manufacturing, FAIM2017, 27-30 June 2017, Modena, Italy.
[3] R. Valencia-Garcia, R. Martinez-Béjar, A. Gasparetto, An intelligent framework for simulating robot-assisted surgical operations, Expert Systems with Applications 28 (3) (2005) 425-433.
[4] A. Broggi, E. Dickmanns, Applications of computer vision to intelligent vehicles, Image and Vision Computing 18 (5) (2000) 365-366.
[5] T. Patterson, S. McClean, P. Morrow, G. Parr, C. Luo, Timely autonomous identification of UAV safe landing zones, Image and Vision Computing 32 (9) (2014) 568-578.
[6] D. González, J. Pérez, V. Milanés, Parametric-based path generation for automated vehicles at roundabouts, Expert Systems with Applications 71 (2017) 332-341.
[7] J. L. Sanchez-Lopez, J. Pestana, P. de la Puente, P. Campoy, A reliable open-source system architecture for the fast designing and prototyping of autonomous multi-UAV systems: Simulation and experimentation, Journal of Intelligent & Robotic Systems (2015) 1-19.
[8] M. Olivares-Mendez, S. Kannan, H. Voos, Vision based fuzzy control autonomous landing with UAVs: From V-REP to real experiments, in: Control and Automation (MED), 2015 23rd Mediterranean Conference on, 2015, pp. 14-21.
[9] S. Pflugi, R. Vasireddy, T. Lerch, T. M. Ecker, M. Tannast, N. Boemke, K. Siebenrock, G. Zheng, Augmented marker tracking for peri-acetabular osteotomy surgery, in: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2017, pp. 937-941.
[10] J. P. Lima, R. Roberto, F. Simões, M. Almeida, L. Figueiredo, J. M. Teixeira, V. Teichrieb, Markerless tracking system for augmented reality in the automotive industry, Expert Systems with Applications 82 (2017) 100-114.
[11] P. Chen, Z. Peng, D. Li, L. Yang, An improved augmented reality system based on AndAR, Journal of Visual Communication and Image Representation 37 (2016) 63-69, weakly supervised learning and its applications.
[12] S. Khattak, B. Cowan, I. Chepurna, A. Hogue, A real-time reconstructed 3D environment augmented with virtual objects rendered with correct occlusion, in: Games Media Entertainment (GEM), 2014 IEEE, 2014, pp. 1-8.
[13] J. Engel, T. Schöps, D. Cremers, LSD-SLAM: Large-scale direct monocular SLAM, 2014.
[14] R. Mur-Artal, J. M. M. Montiel, J. D. Tardós, ORB-SLAM: A versatile and accurate monocular SLAM system, IEEE Transactions on Robotics 31 (5) (2015) 1147-1163.
[15] Cooperative pose estimation of a fleet of robots based on interactive points alignment, Expert Systems with Applications 45 (2016) 150-160.
[16] S.-h. Zhong, Y. Liu, Q.-c. Chen, Visual orientation inhomogeneity based scale-invariant feature transform, Expert Systems with Applications 42 (13) (2015) 5658-5667.
[17] S. Garrido-Jurado, R. Muñoz-Salinas, F. J. Madrid-Cuevas, M. J. Marín-Jiménez, Automatic generation and detection of highly reliable fiducial markers under occlusion, Pattern Recognition 47 (6) (2014) 2280-2292.
[18] E. Olson, AprilTag: A robust and flexible visual fiducial system, in: Robotics and Automation (ICRA), 2011 IEEE International Conference on, 2011, pp. 3400-3407.
[19] F. Ababsa, M. Mallem, Robust camera pose estimation using 2D fiducials tracking for real-time augmented reality systems, in: Proceedings of the 2004 ACM SIGGRAPH International Conference on Virtual Reality Continuum and Its Applications in Industry, VRCAI '04, 2004, pp. 431-435.
[20] V. Mondéjar-Guerra, S. Garrido-Jurado, R. Muñoz-Salinas, M.-J. Marín-Jiménez, R. Medina-Carnicer, Robust identification of fiducial markers in challenging conditions, Expert Systems with Applications 93 (1) (2018) 336-345.
[21] R. Muñoz-Salinas, M. J. Marín-Jiménez, E. Yeguas-Bolivar, R. Medina-Carnicer, Mapping and localization from planar markers, Pattern Recognition 73 (January 2018) 158-171.
[22] K. Dorfmüller, H. Wirth, Real-time hand and head tracking for virtual environments using infrared beacons, in: Proceedings CAPTECH'98, Springer, 1998, pp. 113-127.
[23] M. Ribo, A. Pinz, A. L. Fuhrmann, A new optical tracking system for virtual and augmented reality applications, in: Proceedings of the IEEE Instrumentation and Measurement Technical Conference, 2001, pp. 1932-1936.
[24] V. A. Knyaz, R. V. Sibiryakov, The development of new coded targets for automated point identification and non-contact surface measurements, in: 3D Surface Measurements, International Archives of Photogrammetry and Remote Sensing, Vol. XXXII, part 5, 1998, pp. 80-85.
[25] L. Naimark, E. Foxlin, Circular data matrix fiducial system and robust image processing for a wearable vision-inertial self-tracker, in: Proceedings of the 1st International Symposium on Mixed and Augmented Reality, ISMAR '02, IEEE Computer Society, Washington, DC, USA, 2002, pp. 27-36.
[26] J. Rekimoto, Y. Ayatsuka, CyberCode: designing augmented reality environments with visual tags, in: Proceedings of DARE 2000 on Designing augmented reality environments, DARE '00, ACM, New York, NY, USA, 2000, pp. 1-10.
[27] M. Rohs, B. Gfeller, Using camera-equipped mobile phones for interacting with real-world objects, in: Advances in Pervasive Computing, 2004, pp. 265-271.
[28] M. Kaltenbrunner, R. Bencina, reacTIVision: a computer-vision framework for table-based tangible interaction, in: Proceedings of the 1st international conference on Tangible and embedded interaction, TEI '07, ACM, New York, NY, USA, 2007, pp. 69-74.
[29] H. Kato, M. Billinghurst, Marker tracking and HMD calibration for a video-based augmented reality conferencing system, in: Augmented Reality, 1999 (IWAR '99), Proceedings, 2nd IEEE and ACM International Workshop on, 1999, pp. 85-94.
[30] S. Lin, D. J. Costello, Error Control Coding, Second Edition, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2004.
[31] D. Wagner, D. Schmalstieg, ARToolKitPlus for pose tracking on mobile devices, in: Computer Vision Winter Workshop, 2007, pp. 139-146.
[32] D. Schmalstieg, A. Fuhrmann, G. Hesina, Z. Szalavári, L. M. Encarnação, M. Gervautz, W. Purgathofer, The Studierstube augmented reality project, Presence: Teleoperators and Virtual Environments 11 (1) (2002) 33-54.
[33] M. Fiala, Designing highly reliable fiducial markers, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (7) (2010) 1317-1324.
[34] D. Flohr, J. Fischer, A Lightweight ID-Based Extension for Marker Tracking Systems, in: Eurographics Symposium on Virtual Environments (EGVE) Short Paper Proceedings, 2007, pp. 59-64.
[35] S. Garrido-Jurado, R. Muñoz-Salinas, F. Madrid-Cuevas, R. Medina-Carnicer, Generation of fiducial marker dictionaries using mixed integer linear programming, Pattern Recognition 51 (2016) 481-491.
[36] Q. Bonnard, S. Lemaignan, G. Zufferey, A. Mazzei, S. Cuendet, N. Li, A. Özgür, P. Dillenbourg, Chilitags 2: Robust fiducial markers for augmented reality and robotics (2013). URL http://chili.epfl.ch/software
[37] D. Johnston, M. Fleury, A. Downton, A. Clark, Real-time positioning for augmented reality on a custom parallel machine, Image and Vision Computing 23 (3) (2005) 271-286.
[38] S. Suzuki, K. Abe, Topological structural analysis of digitized binary images by border following, Computer Vision, Graphics, and Image Processing 30 (1) (1985) 32-46.
[39] D. H. Douglas, T. K. Peucker, Algorithms for the reduction of the number of points required to represent a digitized line or its caricature, Cartographica: The International Journal for Geographic Information and Geovisualization 10 (2) (1973) 112-122.
[40] N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics 9 (1) (1979) 62-66.
[41] G. Bradski, A. Kaehler, Learning OpenCV: Computer Vision in C++ with the OpenCV Library, 2nd Edition, O'Reilly Media, Inc., 2013.