Automatic generation and detection of highly reliable fiducial markers
under occlusion
S. Garrido-Jurado, R. Mu˜noz-Salinas, F.J Madrid-Cuevas, M.J. Mar´ın-Jim´enez
Department of Computing and Numerical Analysis.
University of C´ordoba.
14071 C´ordoba (Spain)
This paper presents a fiducial marker system specially designed for camera pose estimation in applications such as augmented reality and robot localization. Three main
contributions are presented. First, we propose an algo-
rithm for generating configurable marker dictionaries (in
size and number of bits) following a criterion to maximize
the inter-marker distance and the number of bit transi-
tions. In the process, we derive the maximum theoretical
inter-marker distance that dictionaries of square binary
markers can have. Second, a method for automatically
detecting the markers and correcting possible errors is pro-
posed. Third, a solution to the occlusion problem in aug-
mented reality applications is shown. To that aim, multi-
ple markers are combined with an occlusion mask calcu-
lated by color segmentation. The experiments conducted
show that our proposal obtains dictionaries with higher
inter-marker distances and lower false negative rates than
state-of-the-art systems, and provides an effective solution
to the occlusion problem.
1 Introduction
Camera pose estimation (Fig. 1(a,b)) is a common prob-
lem in many applications requiring a precise localization
in the environment, such as augmented and virtual reality applications, robotics, etc. [1, 2, 3, 4]. Obtaining the camera pose from images requires finding the correspondences between known points in the environment and their
camera projections. While some approaches seek natural
features such as key points or textures [5, 6, 7, 8, 9], fidu-
cial markers are still an attractive approach because they
are easy to detect and allow achieving high speed and precision.
Amongst the several fiducial marker systems proposed
in the literature, those based on square markers have
gained popularity, specially in the augmented reality com-
munity [10, 11, 12]. The reason is that they allow extracting the camera pose from their four corners, provided that the camera is properly calibrated. In most of the approaches, markers encode a unique identification in a binary code that may include error detection and correction bits. In general, each author has proposed their own
predefined set of markers (dictionary). The problems of
setting a predefined dictionary are twofold. First, in some
cases, the number of markers required by the application
might be higher than the dictionary size. Second, if the
number of markers required is smaller, then it is preferable
to use a smaller dictionary whose inter-marker distance is
as high as possible, so as to reduce the inter-marker con-
fusion rate.
Another common problem in augmented reality applications is related to occlusion, which occurs when a real object appears in front of the virtual scene. In this case, the virtual objects are rendered on the real object, which should instead remain visible (see Fig. 1(c,d)). This is a limitation of the augmented experience, since the user cannot interact with the scene freely.
This paper presents a fiducial marker system based on
square markers offering solutions to the above mentioned
problems. First, we propose a general method for gener-
ating configurable dictionaries (both in size and number
of bits). Our algorithm creates dictionaries following a
criterion to maximize the inter-marker distance and the
number of bit transitions. In the process, we derive the
maximum theoretical inter-marker distance that a dictio-
nary of square binary markers can have. Then, a method
for automatically detecting markers in images and correct-
ing possible errors, based on our generated dictionaries, is
presented. Third, we propose a solution to the occlusion
problem based on combining multiple markers and an oc-
clusion mask calculated using color information. While
using multiple markers provides robustness against occlu-
sion, color information is used to determine the occluded
pixels avoiding rendering on them.
The rest of the paper is structured as follows. Section 2
presents the most relevant works related to ours. Section
3 explains the proposed method to generate marker dictio-
naries. Section 4 shows the process proposed for marker
detection and error correction. Section 5 presents our so-
lution to the occlusion problem. Finally, Section 6 shows
the experimentation carried out, and Section 7 draws some conclusions.
Finally, it must be indicated that our work has been
implemented in the ArUco library which is freely avail-
able [13].
Figure 1: Example of augmented reality scene. (a) Input image containing a set of fiducial markers. (b) Markers
automatically detected and used for camera pose estimation. (c) Augmented scene without considering user’s occlusion.
(d) Augmented scene considering occlusion.
2 Related work
A fiducial marker system is composed of a set of valid markers and an algorithm that performs their detection, and possibly correction, in images. Several fiducial marker systems have been proposed in the literature, as shown in
Figure 2.
The simplest proposals consist in using points as fiducial
markers, such as LEDs, retroreflective spheres or planar
dots [14, 15], which can be segmented using basic techniques under controlled conditions. Their identification is
usually obtained from the relative position of the markers
and often involves a complex process.
Other approaches use planar circular markers where the
identification is encoded in circular sectors or concentric
rings [16, 17]. However, circular markers usually provide just one correspondence point (the center), making it necessary to detect several of them for pose estimation.
Other types of fiducial markers are based on blob de-
tection. CyberCode [18] and VisualCode [19] are derived from 2D barcode technology such as MaxiCode or QR codes, but can also accurately provide several correspondence points. Other popular fiducial markers are the ReacTIVision amoeba markers [20], which are also based on blob detection and whose design was optimized using genetic algorithms. Some
authors have proposed the use of trained classifiers to im-
prove detection in cases of bad illumination and blurring
caused by fast camera movement [21].
An alternative to the previous approaches are the square-based fiducial marker systems. Their main advantage is that the presence of four prominent points can
be employed to obtain the pose, while the inner region is
used for identification (either using a binary code or an ar-
bitrary pattern such as an image). In the arbitrary pattern
category, one of the most popular systems is ARToolKit
[10], an open source project which has been extensively
used in the last decade, especially in the academic com-
munity. ARToolKit markers are composed of a wide black border and an inner image which is stored in a database of valid patterns. Despite its popularity, it has some
drawbacks. First, it uses a template matching approach to
identify markers, obtaining high false positive and inter-
marker confusion rates [22]. Second, the system uses a
fixed global threshold to detect squares, making it very
sensitive to varying lighting conditions.
Figure 2: Examples of fiducial markers proposed in previous works.

Most of the square-based fiducial systems use binary codes. Matrix [23] is one of the first and simplest proposals. It uses a binary code with redundant bits for error detection. The ARTag [11] system is based on the
same principles but improves the robustness to lighting
and partial occlusion by using an edge-based square detec-
tion method, instead of a fixed threshold. Additionally, it
uses a binary coding scheme that includes checksum bits
for error detection and correction. It also recommends
using its dictionary markers in a specific order so as to
maximize the inter-marker distances. Its main drawback is that the proposed marker dictionary is fixed to 36 bits and the maximum number of erroneous bits that can be corrected is two, independently of the inter-marker distances of the subset of markers used.
ARToolKit Plus [24] improves some of the features of
ARToolKit. First, it includes a method to automatically
update the global threshold value depending on pixel val-
ues from previously detected markers. Second, it employs
binary codes including error detection and correction, thus
achieving higher robustness than its predecessor. The last known version of ARToolKitPlus employs a binary BCH [25] code for 36-bit markers, which presents a minimum Hamming distance of two. As a consequence, ARToolKitPlus BCH markers can detect a maximum error of one bit and cannot perform error correction. The ARToolKitPlus project was halted and followed by the Studierstube Tracker [12] project, which is not publicly available.
BinARyID [26] proposes a method to generate binary-coded markers focused on avoiding rotation ambiguities; however, it only achieves a Hamming distance of one between markers and does not provide any error correction process. There are also some closed-source systems which
employ square markers such as the SCR, HOM and IGD
[27] marker systems used by the ARVIKA project [28].
This paper proposes a square-based fiducial marker sys-
tem with binary codes. However, instead of using a pre-
defined set of markers, we propose a method for generat-
ing configurable marker dictionaries (with arbitrary size
and number of markers), containing only the number of
markers required. Our algorithm produces markers us-
ing a criterion to maximize the inter-marker distance and
the number of bit transitions. Additionally, a method for
detecting and correcting errors, based on the dictionary
obtained, is proposed. This method allows correcting a greater number of erroneous bits than current state-of-the-art systems.
Our last contribution is related to the occlusion prob-
lem in augmented reality applications. When designing
an augmented reality application, interactivity is a key
aspect to consider, so one may expect users to occlude the markers. ARTag handles the problem in two ways. First, the marker detection method allows small breaks in the square sides. Second, several markers are employed simultaneously, so that the occlusion of some of them does not affect the global pose estimation. Despite being robust to occlusion, ARTag still has a main drawback: it cannot precisely detect the occluded areas. As a consequence, if an object moves between the camera and the augmented scene (e.g. the user's hands), the virtual objects will be rendered on the hands, hiding them (see Fig. 1(c,d)).
Proposals to detect the occluded regions usually fall
into three main categories: depth-based, model-based, and
color-based approaches. Depth-based approaches try to
calculate the depth of the image pixels to detect occlu-
sions. However, these approaches require depth-based sen-
sors, such as stereo, time of flight or structured light cam-
eras [29, 30, 31]. When a single camera is used, some
authors have adopted model-based approaches [32, 33].
The idea is to provide geometric models of the objects
which can occlude the scene, and detect their pose. This
solution is not practical in many applications where the
occluding objects are not known in advance, and imposes
very strong performance limitations. Finally, color-based approaches [34] can be employed. The idea is to create a color model of the scene (background), which is then compared to the foreground objects.
In this work, we propose the use of multiple markers to
handle occlusion (as in ARTag). However, we also propose
the use of a color map for precisely detecting the visible
pixels, so that the virtual scene is only rendered on them.
In order to improve segmentation, we employ blue and
green markers, instead of classical black-and-white ones.
As we experimentally show, our proposal is an effective method for improving current augmented reality applications, such as those in the gaming and film industries, although it is not limited to them.
Figure 3: Examples of markers of different sizes, n, generated with the proposed method. From left to right: n = 5, n = 6 and n = 8.
3 Automatic dictionary generation
The most relevant aspects to consider when designing a
marker dictionary are the false positive and negative rates,
the inter-marker confusion rate, and the number of valid
markers [11]. The first two are often tackled in the lit-
erature using error detection and correction bits, which,
on the other hand, reduces the number of valid markers.
The third one depends only on the distance between the markers employed. If they are too close, a few erroneous bits can lead to another valid marker of the dictionary, and the error might not even be detected.
Another desirable property of markers is having a high
number of bit transitions, so that they are less likely to be
confused with environment objects. For instance, binary codes consisting only of zeros or ones will be printed as completely black or white markers, respectively, which would be easily confused with other objects in the environment.
While previous works impose fixed dictionaries, we pro-
pose an automatic method for generating them with the
desired number of markers and with the desired number
of bits. Our problem is then to select m markers, from the space of all markers with n × n bits, 𝔻, so that they are as far as possible from each other and have as many bit transitions as possible. In general, the problem is to find the dictionary D that maximizes the desired criterion τ̂(D):

D = argmax_{D ⊂ 𝔻} {τ̂(D)}.    (1)

Since a complete evaluation of the search space is not feasible even for a small n, a stochastic algorithm that finds suboptimal solutions is proposed.
3.1 Algorithm overview
Our algorithm starts from an empty dictionary D that is incrementally populated with new markers. Our markers are encoded as an (n + 2) × (n + 2) grid (Fig. 3), where the external cells are set to black, creating an easily detectable external border. The remaining n × n cells are employed for coding. Thus, we might define a marker,

m = (w_0, w_1, ..., w_{n−1}),    (2)

as a tuple composed of n binary words w of length n such that

w = (b_0, ..., b_{n−1} | b_i ∈ {0, 1}).    (3)

Let us also denote by W the set of all possible words of n bits, whose cardinal is |W| = 2^n.
At each iteration of the algorithm, a marker is selected based on a stochastic process that assigns more probability to markers with a higher number of bit transitions and whose words have not yet been added to D. If the distance between the generated marker and those in D is greater than a minimum value τ, then it is added. Otherwise, the marker is rejected and a new marker is randomly selected. The process stops when the required number of markers is achieved.
Because of the probabilistic nature of the algorithm, the acceptance of new markers could be improbable, or even impossible, in some cases. To guarantee the convergence of the algorithm, the distance threshold is initially set to τ_0, the maximum possible inter-marker distance that the dictionary can have. Along the process, the value of τ is reduced after a number of unproductive iterations, ψ. The final value τ̂(D) represents the minimum distance between any two markers in D, and it will be used as the basis for error detection and correction (explained in Sect. 4). The proposed algorithm is summarized in Alg. 1.
Algorithm 1 Dictionary generation process
D ← ∅          # Reset dictionary
τ ← τ_0        # Initialize target distance, see Sect. 3.4
ϱ ← 0          # Reset unproductive iteration counter
while D has not the desired size do
    Generate a new marker m    # Sect. 3.2
    if distance of m to elements in D is ≥ τ then
        D ← D ∪ {m}            # Add to dictionary
    else
        ϱ ← ϱ + 1              # It was unproductive
        # maximum unproductive iterations reached?
        if ϱ = ψ then
            τ ← τ − 1          # Decrease target distance
            ϱ ← 0
        end if
    end if
end while
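As an illustration, the loop of Alg. 1 can be sketched in Python. This is a minimal sketch under simplifying assumptions: candidate markers are drawn uniformly at random rather than with the transition-weighted distribution of Sect. 3.2, and all names are ours, not part of any library.

```python
import random

def rotate90(m, n):
    """Rotate an n x n marker (flat row-major bit tuple) 90 deg clockwise."""
    return tuple(m[(n - 1 - c) * n + r] for r in range(n) for c in range(n))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def acceptable(m, D, n, tau):
    """True if the self-distance (Eq. 9) and the rotation-invariant
    distance to every marker already in D (Eq. 8) are both >= tau."""
    rots = [m]
    for _ in range(3):
        rots.append(rotate90(rots[-1], n))
    if min(hamming(m, r) for r in rots[1:]) < tau:
        return False
    return all(min(hamming(r, d) for r in rots) >= tau for d in D)

def generate_dictionary(size, n, tau0, psi, seed=0):
    """Alg. 1: accept a marker when it is at distance >= tau from the
    dictionary; relax tau after psi consecutive unproductive iterations."""
    rng = random.Random(seed)
    D, tau, unproductive = [], tau0, 0
    while len(D) < size:
        m = tuple(rng.randint(0, 1) for _ in range(n * n))
        if acceptable(m, D, n, tau):
            D.append(m)
            unproductive = 0
        else:
            unproductive += 1
            if unproductive == psi:
                tau, unproductive = tau - 1, 0
    return D, tau  # tau is the last target distance used

# 5 x 5 markers, starting from tau_0 = S*_5 = 16 (Sect. 3.4)
D, tau = generate_dictionary(size=10, n=5, tau0=16, psi=200)
```

Note that the returned τ is only a lower bound on τ̂(D) (the true minimum distance of the finished dictionary), which should be recomputed before being used for error correction.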
3.2 Marker generation
As previously pointed out, markers are selected using a random process led by a probability distribution that assigns a higher probability to those markers with a high number of transitions and whose words are not yet present in D. The proposed process for generating a marker consists in selecting n words from W with replacement. To do so, each word w_i ∈ W has a probability of being selected at each iteration that is defined as:

P(w_i) = T(w_i) O(w_i, D) / Σ_{w_j ∈ W} T(w_j) O(w_j, D).    (4)
Eq. 4 defines the probability of selecting a word as the combination of two functions. The first one, T(w_i) ∈ [0, 1], is related to the number of bit transitions of the word. It is defined as

T(w_i) = 1 − (1 / (n − 1)) Σ_{j=0}^{n−2} δ(w_i^{j+1}, w_i^j),    (5)

where w_i^j is the j-th bit of the word w_i, and δ is 1 if both elements are equal and 0 otherwise. So, T(w_i) tends to 1 as the number of transitions between consecutive bits increases, and to 0 as it decreases. For instance, the words 010110 and 000011 present values of T = 4/5 and T = 1/5, respectively, which are proportional to their number of bit transitions.
On the other hand, the function O(w_i, D) accounts for the number of times the word w_i appears amongst the markers in D. The idea is to reduce the probability of choosing words that have already been selected many times. It is defined in the interval [0, 1] as

O(w_i, D) = 1 − ( Σ_{m_j ∈ D} Σ_{w_j ∈ m_j} δ(w_j, w_i) ) / (n |D|)   if |D| ≠ 0, and 1 otherwise.    (6)

The double sum counts the appearances of w_i amongst the markers in D, while the denominator counts the total number of words in D. Thus, O(w_i, D) is 1 if w_i is not in D, and tends to 0 as it appears a higher number of times. Finally, in the first iteration (|D| = 0), the function is defined as 1 so that all words have the same probability of being selected.
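The two word-scoring functions and the resulting selection probability (Eqs. 4-6) translate directly into code. A small sketch (the names are ours), representing a marker as a list of its word strings:

```python
def T(w):
    """Eq. 5: transition score in [0, 1]; 1 - (equal consecutive pairs)/(n - 1)."""
    return 1 - sum(w[j] == w[j + 1] for j in range(len(w) - 1)) / (len(w) - 1)

def O(w, D):
    """Eq. 6: 1 minus the relative frequency of word w in the dictionary D,
    where each marker in D is a list of n word strings."""
    if not D:
        return 1.0
    count = sum(word == w for marker in D for word in marker)
    return 1 - count / (len(D[0]) * len(D))

def selection_probabilities(W, D):
    """Eq. 4: normalized product T(w) * O(w, D) over all candidate words W."""
    scores = [T(w) * O(w, D) for w in W]
    total = sum(scores)
    return [s / total for s in scores]

# Examples from the text: T('010110') = 4/5 and T('000011') = 1/5
```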
3.3 Distance calculation
As previously indicated, a marker is added to the dictionary only if its distance to the markers already in the dictionary is at least τ. The concept of distance between markers must be defined considering that they are printed as binary grids of n × n bits that can be observed under rotation. Then, let us define the distance between two markers as

D(m_i, m_j) = min_{k ∈ {0,1,2,3}} { H(m_i, R^k(m_j)) }.    (7)

The function H is the Hamming distance between two markers, defined as the sum of the Hamming distances between each pair of marker words. The function R^k is an operator that rotates the marker grid k × 90 degrees in the clockwise direction. The function D is then the rotation-invariant Hamming distance between the markers.
Let us also define the distance of a marker to a dictionary,

D(m_i, D) = min_{m_j ∈ D} { D(m_i, m_j) },    (8)

as the distance of the marker to its nearest marker in the dictionary.
Finally, it is not only important to distinguish markers from each other, but also to correctly identify the marker orientation; otherwise, pose estimation would fail. So, a valid marker must also guarantee that the minimum distance to its own rotations is at least τ. Thus, we define the marker self-distance as

S(m_i) = min_{k ∈ {1,2,3}} { H(m_i, R^k(m_i)) }.    (9)

In summary, we only add a marker to the dictionary if both S(m_i) and D(m_i, D) are greater than or equal to τ. Otherwise, the marker is rejected and a new one is generated.
After a number of unproductive iterations, ψ, the value of τ is decreased by one so as to allow new markers to be accepted.

Figure 4: Examples of quartets for a 2 × 2 and a 3 × 3 marker. Each arrow indicates the destination of a bit after a 90-degree clockwise rotation.
In the end, the markers of the generated dictionary have a minimum distance, both between them and to themselves, of τ̂, which is the last τ employed. This value can be calculated for any marker dictionary (manually or automatically generated) as:

τ̂(D) = min { min_{m_i ∈ D} {S(m_i)},  min_{m_i ≠ m_j ∈ D} {D(m_i, m_j)} }.    (10)
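Under these definitions, Eqs. 7-10 can be computed directly. A short sketch (our naming), with markers represented as flat row-major bit tuples:

```python
def hamming(a, b):
    """Hamming distance between two markers given as flat bit tuples."""
    return sum(x != y for x, y in zip(a, b))

def rotate90(m, n):
    """Rotate an n x n marker (flat, row-major) 90 degrees clockwise."""
    return tuple(m[(n - 1 - c) * n + r] for r in range(n) for c in range(n))

def marker_distance(mi, mj, n):
    """Eq. 7: rotation-invariant Hamming distance between two markers."""
    d, r = hamming(mi, mj), mj
    for _ in range(3):
        r = rotate90(r, n)
        d = min(d, hamming(mi, r))
    return d

def self_distance(m, n):
    """Eq. 9: minimum Hamming distance of a marker to its own rotations."""
    r, dists = m, []
    for _ in range(3):
        r = rotate90(r, n)
        dists.append(hamming(m, r))
    return min(dists)

def tau_hat(D, n):
    """Eq. 10: minimum over all self-distances and pairwise distances in D."""
    dists = [self_distance(m, n) for m in D]
    dists += [marker_distance(a, b, n)
              for i, a in enumerate(D) for b in D[i + 1:]]
    return min(dists)
```

For example, the 2 × 2 marker 1100 has self-distance 2, and a marker is at distance 0 from any rotation of itself.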
3.4 Maximum inter-marker distance: τ_0

The proposed algorithm requires an initial value for the parameter τ_0. If one analyzes the first iteration (when the dictionary is empty), it is clear that the only distance to consider is the self-distance (Eq. 9), since the distance to other markers is not applicable. So, the maximum self-distance for markers of size n × n (let us denote it by S*_n) is the maximum distance that a dictionary of this type of markers can have. This section explains how to determine S*_n, which is equivalent to finding the marker of size n × n with the highest self-distance.
If we analyze the path of the bits when applying 90-degree rotations to a marker, it is clear that any bit (x, y) changes its position to another three locations until it returns to its original position (see Figure 4). It can be understood that the Hamming distance contributed by a marker bit to Eq. 9 is only influenced by these other three bits. So, let us define a quartet as the set composed of these four positions: {(x, y), (n − y − 1, x), (n − x − 1, n − y − 1), (y, n − x − 1)}. In general, a marker of size n × n has a total of C quartets that can be calculated as:

C = ⌊n² / 4⌋,    (11)

where ⌊·⌋ represents the floor function. If n is odd, the central bit of the marker constitutes a quartet by itself which does not provide extra distance to S.
If a quartet is expressed as a bit string, a 90-degree rotation can be obtained as a circular bit shift operation. For instance, the quartet 1100 becomes 0110, 0011 and 1001 in successive rotations. In fact, for the purpose of calculating S*_n, these four quartets are equivalent, and we will refer to them as a quartet group Q_i. It can be seen from Eq. 9 that the contribution of any quartet is given by the distance of its successive rotations to the original quartet. For instance, quartet 1100 contributes to Eq. 9 with distances (2, 4, 2) as it rotates:

H(1100, 0110) = 2;  H(1100, 0011) = 4;  H(1100, 1001) = 2.

But also, if we start from quartet 0110 and rotate it successively, we obtain the quartets 0011, 1001 and 1100, which again provide the distances (2, 4, 2):

H(0110, 0011) = 2;  H(0110, 1001) = 4;  H(0110, 1100) = 2.

In fact, there are only 6 quartet groups (shown in Table 1), thus reducing the problem considerably.

Group | Quartets             | Hamming distances (90 deg, 180 deg, 270 deg)
Q1    | 0000                 | 0 0 0
Q2    | 1000 0100 0010 0001  | 2 2 2
Q3    | 1100 0110 0011 1001  | 2 4 2
Q4    | 0101 1010            | 4 0 4
Q5    | 1110 0111 1011 1101  | 2 2 2
Q6    | 1111                 | 0 0 0

Table 1: Quartet groups and quartet Hamming distances for each rotation.
As previously indicated, calculating S*_n is the problem of obtaining the marker with the highest self-distance, and we have turned this problem into assigning quartet groups to the C quartets of a marker. It can be seen that this is in fact a multi-objective optimization, where each quartet group Q_i is a possible solution and the objectives to maximize are the distances for each rotation. If the Pareto front is obtained, it can be observed that the groups Q3 and Q4 dominate the rest of the solutions. Thus, the problem is simplified, again, to assigning Q3 and Q4 to the C quartets of a marker.
From a brief analysis, it can be deduced that S*_n is obtained by assigning the groups {Q3, Q3, Q4} (in this order) repeatedly until completing the C quartets. For instance, the simplest marker is a 2 × 2 marker (C = 1), for which S*_n = 2, obtained by assigning Q3. For a 3 × 3 marker (C = 2), S*_n = 4, which is obtained by assigning Q3 twice. For a 4 × 4 marker (C = 4), S*_n = 10, obtained by assigning the groups {Q3, Q3, Q4, Q3}. This last case is shown in detail in Table 2.

Therefore, for a generic marker with C quartets, the value S*_n follows the rule:

S*_n = 2 ⌊4C / 3⌋.    (12)

Then, we employ the value τ_0 = S*_n as the starting point for our algorithm.
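Writing C as ⌊n²/4⌋, these two quantities are easy to check numerically against the worked examples (a small sketch, with our naming):

```python
def num_quartets(n):
    """Eq. 11: C = floor(n^2 / 4); an odd n leaves the central bit out."""
    return (n * n) // 4

def max_self_distance(n):
    """Eq. 12: S*_n = 2 * floor(4C / 3), from the repeated {Q3, Q3, Q4}
    assignment: each full triple contributes 8 to every rotation, a
    leftover Q3 contributes min +2, and two leftover Q3s contribute +4."""
    return 2 * ((4 * num_quartets(n)) // 3)

# Matches the examples in the text: n = 2 -> 2, n = 3 -> 4, n = 4 -> 10.
```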
4 Marker detection and error correction

This section explains the steps employed to automatically detect the markers in an image (Fig. 5(a)). The process
Quartet | Group | Hamming distances (90 deg, 180 deg, 270 deg)
1       | Q3    | 2 4 2
2       | Q3    | 2 4 2
3       | Q4    | 4 0 4
4       | Q3    | 2 4 2
Total distances | | 10 12 10

S*_n = min(10, 12, 10) = 10

Table 2: Quartet assignment for a 4 × 4 marker (C = 4) to obtain S*_n. It can be observed that the sequence {Q3, Q3, Q4} is repeated until filling all the quartets in the marker.
Figure 5: Image processing steps for automatic marker detection. (a) Original image. (b) Result of applying local thresholding. (c) Contour detection. (d) Polygonal approximation
and removal of irrelevant contours. (e) Example of marker
after perspective transformation. (f) Bit assignment for
each cell.
is comprised of several steps aimed at detecting rectangles and extracting the binary code from them. For that purpose, we take as input a gray-scale image. While the image analysis is not a novel contribution, the marker code identification and error correction is a new approach specifically designed for our generated dictionaries. The steps employed by our system are described below.
Image segmentation: Firstly, the most prominent
contours in the gray-scale image are extracted. Our
initial approach employed the Canny edge detector [35]; however, it is too slow for our real-time purposes. In this work, we have opted for a local adaptive thresholding approach, which has proven to be very robust to different lighting conditions (see Fig. 5(b)).
Contour extraction and filtering: Afterward, a con-
tour extraction is performed on the thresholded image
using the Suzuki and Abe [36] algorithm. It produces
the set of image contours, most of which are irrelevant
for our purposes (see Figure 5(c)). Then, a polygo-
nal approximation is performed using the Douglas-
Peucker [37] algorithm. Since markers are enclosed
in rectangular contours, these that are not approx-
imated to 4-vertex polygons are discarded. Finally,
we simplify near contours leaving only the external
ones. Figure 5(d) shows the resulting polygons from
this process.
Marker Code extraction: The next step consists in an-
alyzing the inner region of these contours to extract
its internal code. First, perspective projection is re-
moved by computing the homography matrix (Fig.
5(e)). The resulting image is thresholded using Otsu's method [38], which provides the optimal image threshold value given that the image distribution is bimodal (which holds true in this case). Then, the binarized image is divided into a regular grid, and each element is assigned the value 0 or 1 depending on the value of the majority of the pixels within it (see Fig. 5(e,f)).
A first rejection test consists in detecting the presence
of the black border. If all the bits of the border are
zero, then the inner grid is analyzed using the method
described below.
Marker identification and error correction: At this point, it is necessary to determine which of the marker candidates actually belong to the dictionary and which are just part of the environment.
Once the code of a marker candidate is extracted,
four different identifiers are obtained (one for each
possible rotation). If any of them is found in D, we
consider the candidate as a valid marker. To speed
up this process, the dictionary elements are sorted
as a balanced binary tree. To that aim, markers are
represented by the integer value obtained by concate-
nating all its bits. It can be deduced then, that this
process has a logarithmic complexity O(4 log2(|D|)),
where the factor 4 indicates that it is necessary one
search for each rotation of the marker candidate.
If no match is found, the correction method can be applied. Considering that the minimum distance between any two markers in D is τ̂, an error of at most ⌊(τ̂ − 1)/2⌋ bits can be detected and corrected. Therefore, our marker correction method consists in calculating the distance of the erroneous marker candidate to all the markers in D (using Eq. 8). If the distance is equal to or smaller than ⌊(τ̂ − 1)/2⌋, we consider that the nearest marker is the correct one. This process, though, presents a linear complexity of O(4|D|), since each rotation of the candidate has to be compared to the entire dictionary. Nonetheless, it is a highly parallelizable process that can be efficiently implemented on current computers.
Please note that, compared to the dictionaries of ARToolKitPlus (which cannot correct errors) and ARTag (only capable of recovering from errors of two bits), our approach can correct errors of ⌊(τ̂ − 1)/2⌋ bits. For instance, for a dictionary generated in the experimental section with 6 × 6 bits and 30 markers, we obtained τ̂ = 12, so our approach can correct errors of up to 5 bits in this dictionary. Additionally, we can generate markers with more bits, which leads to a larger τ̂, thus increasing the correction capabilities. Actually, our detection and correction method is a general framework that can be used with any dictionary (including the ARToolKitPlus and ARTag dictionaries). In fact, if our method is employed with the ARTag dictionary of 30 markers, for instance, we could recover from errors of 5 bits, instead of the 2 bits they can recover from.
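The identification-plus-correction logic described above can be sketched as follows (a simplified illustration with our own naming: a hash map stands in for the balanced binary tree, and the correction step mirrors the O(4|D|) linear scan):

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def rotate90(m, n):
    """Rotate an n x n marker (flat, row-major bit tuple) 90 deg clockwise."""
    return tuple(m[(n - 1 - c) * n + r] for r in range(n) for c in range(n))

def identify(candidate, D, n, tau_hat):
    """Return (marker_index, rotation) or None.  Exact lookup over the four
    rotations first; on failure, nearest-marker correction, accepted only
    if the distance is at most floor((tau_hat - 1) / 2) bits."""
    if not D:
        return None
    index = {m: i for i, m in enumerate(D)}   # stands in for the binary tree
    rots = [candidate]
    for _ in range(3):
        rots.append(rotate90(rots[-1], n))
    for k, r in enumerate(rots):              # O(4 log |D|) with a tree
        if r in index:
            return index[r], k
    # Error correction: compare every rotation to the whole dictionary (Eq. 8).
    dist, i, k = min((hamming(r, m), i, k)
                     for i, m in enumerate(D) for k, r in enumerate(rots))
    if dist <= (tau_hat - 1) // 2:
        return i, k
    return None
```

For example, with a toy dictionary of 2 × 2 markers and τ̂ = 3, a candidate with one flipped bit is still recovered, while with τ̂ = 2 no bits can be corrected and the candidate is rejected.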
Corner refinement and pose estimation: Once a marker has been detected, it is possible to estimate its pose with respect to the camera by iteratively minimizing the reprojection error of the corners (using, for instance, the Levenberg-Marquardt algorithm [39, 40]). While many approaches have been proposed for corner detection [41, 42, 43], we have opted for performing a linear regression of the marker side pixels to calculate their intersections. This approach was also employed in ARTag [11], ARToolKit [10] and ARToolKitPlus [24].
5 Occlusion detection
Detecting a single marker might fail for different reasons, such as poor lighting conditions, fast camera movement, occlusions, etc. A common approach to improve the robustness of a marker system is the use of marker boards. A marker board is a pattern composed of multiple markers whose corner locations are referred to a common reference system. Boards present two main advantages. First, since there is more than one marker, it is less likely to lose all of them at the same time. Second, the more markers detected, the more corner points are available for computing the camera pose; thus, the pose obtained is less influenced by noise. Figure 1(a) shows the robustness of a marker board against partial occlusion.
Based on the marker board idea, a method to overcome
the occlusion problem in augmented reality applications
(i.e., virtual objects rendered on real objects as shown in
Fig. 1(c,d)) is proposed. Our approach consists in defining
a color map of the board that is employed to compute an
occlusion mask by color segmentation.
Although the proposed method is general enough to work with any combination of colors, in our tests we have opted to replace black-and-white markers with others of higher chromatic contrast so as to improve color segmentation. In our case, blue and green have been selected. Additionally, we have opted for using only the hue component of the HSV color model, since we have observed that it provides the highest robustness to lighting changes.
Let us define the color map Mas a nc×mcgrid, where
each cell crepresents the color distribution of the pixels of
a board region. If the board pose is properly estimated,
it is possible to compute the homography Hmthat maps
Figure 6: Occlusion mask example. (a) Hue component of
Fig. 1(a) with the detected markers. (b) Occlusion mask:
white pixels represent visible regions of the board.
the board image pixels p into the map space: pm = Hm p.
Then, the corresponding cell pc is obtained by discretizing
the result to its nearest value, pc = [pm]. Let us denote by
Ic the set of image board pixels that map onto cell c.
If the grid size of M is relatively small compared to
the size of the board in the images, Ic will contain pixels
of the two main board colors. It is assumed then
that the distribution of the colors in each cell can be
modeled by a mixture of two Gaussians [44], using the
Expectation-Maximization algorithm [45] to obtain its pa-
rameters. Therefore, the pdf of the color u in a cell c can
be approximated by the expression

P(u, c) = Σ_{k=1,2} π^c_k N^c_k(u; μ^c_k, Σ^c_k), (14)

where N^c_k(u; μ^c_k, Σ^c_k) is the k-th Gaussian distribution
and π^c_k is the mixing coefficient, with π^c_1 + π^c_2 = 1.
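As a minimal sketch of this per-cell model, the following toy EM fits a two-component 1D Gaussian mixture to hue samples (the sample values and counts are made up; a real cell would use the pixels of Ic):

```python
import numpy as np

def fit_gmm2(samples, iters=50):
    """Fit a 2-component 1D Gaussian mixture with plain EM."""
    x = np.asarray(samples, dtype=float)
    # Crude initialization: component means at the 25th/75th percentiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.full(2, x.var() + 1e-6)
    pi = np.full(2, 0.5)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        like = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        resp = like / like.sum(axis=1, keepdims=True)
        # M-step: update mixing coefficients, means and variances.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

# Toy cell: hue samples around the two board colors (hypothetical values).
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.33, 0.01, 200),   # "green" hues
                          rng.normal(0.60, 0.01, 200)])  # "blue" hues
pi, mu, var = fit_gmm2(samples)   # means converge near 0.33 and 0.60
```

The cell pdf P(u, c) is then the π-weighted sum of the two fitted component densities, matching the mixture expression above.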
In an initial step, the map must be created from a view
of the board without occlusion. In subsequent frames,
color segmentation is done by analyzing whether the probability
of a pixel is below a certain threshold θc. However, to avoid
the hard partitioning imposed by the discretization, the
probability of each pixel is computed as the weighted av-
erage of the probabilities obtained by the neighbor cells in
the map:
P(p) = ( Σ_{c ∈ H(pc)} w(pm, c) P(pu, c) ) / ( Σ_{c ∈ H(pc)} w(pm, c) ), (15)

where pu is the color of the pixel, H(pc) ⊂ M are the
nearest neighbor cells of pc, and

w(pm, c) = (2 − |pm − c|_1)^2 (16)

is a weighting factor based on the L1-norm between the
mapped value pm and the center of the cell c. The value
2 represents the maximum possible L1 distance between
neighbors. As a consequence, the proposed weighting
value is very fast to compute and provides good results
in practice.
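Eqs. (15) and (16) can be exercised directly; in this sketch the neighbor-cell densities are stand-in callables rather than fitted mixtures:

```python
import numpy as np

def weight(pm, center):
    """Eq. (16): w(pm, c) = (2 - |pm - c|_1)^2, L1 distance to the cell center."""
    return (2.0 - np.abs(np.asarray(pm) - np.asarray(center)).sum()) ** 2

def pixel_probability(pm, pu, cell_pdfs):
    """Eq. (15): weighted average of neighbor-cell probabilities.

    pm        -- pixel position mapped into (continuous) map coordinates
    pu        -- pixel color (hue)
    cell_pdfs -- dict {(i, j): pdf} for the neighbor cells of the discretized pm
    """
    num = den = 0.0
    for center, pdf in cell_pdfs.items():
        w = weight(pm, center)
        num += w * pdf(pu)
        den += w
    return num / den

# Hypothetical 2x2 neighborhood with constant per-cell densities.
neigh = {(0, 0): lambda u: 0.8, (0, 1): lambda u: 0.4,
         (1, 0): lambda u: 0.4, (1, 1): lambda u: 0.2}
p = pixel_probability((0.25, 0.25), 0.33, neigh)
print(round(p, 3))  # → 0.589
```

For a pixel mapped to (0.25, 0.25), the nearest cell (0, 0) dominates the weighted average, as intended.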
Considering that the dimension of the observed board
in the image is much bigger than the number of cells in
the color map, neighbor pixels in the image are likely to
have similar probabilities. Thus, we can speed up com-
putation by downsampling the image pixels employed for
calculating the mask and assigning the same value to their
neighbors.
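The downsampling speed-up can be sketched as: evaluate the probability map on a strided grid only and replicate each decision over its block (the stride and threshold here are illustrative, not the library's values):

```python
import numpy as np

def occlusion_mask(prob_image, step, threshold):
    """Visibility mask computed from every `step`-th pixel only; the same
    decision is assigned to the whole step x step block."""
    coarse = prob_image[::step, ::step] >= threshold
    # Replicate each coarse decision over its block, then crop to image size.
    full = np.kron(coarse, np.ones((step, step), dtype=bool)).astype(bool)
    return full[:prob_image.shape[0], :prob_image.shape[1]]

# Synthetic probability map with a simple gradient.
probs = np.fromfunction(lambda y, x: (x + y) / 20.0, (8, 12))
mask = occlusion_mask(probs, step=4, threshold=0.3)   # shape (8, 12)
```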
Figure 6 shows the results of the detection and segmen-
tation obtained by our method using as input the hue
channel and a downsampling factor of 4. As can be seen,
the occluding hand is properly detected by color segmentation.
Finally, it must be considered that the lighting conditions
might change, thus making it necessary to update the map.
This process can be done with each new frame, or less
frequently to avoid increasing the computing time exces-
sively. In order to update the color map, the probability
distributions of the map cells are recalculated using only
the visible pixels of the board. The process only applies
to cells with a minimum number of visible pixels, γc, i.e.,
only if |Ic| > γc.
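The update step then reduces to re-fitting only the well-observed cells; a sketch in which refit stands in for the EM fit of a cell's mixture and the pixel sets are hypothetical:

```python
def update_color_map(cell_pixels, cell_models, refit, gamma_c=50):
    """Re-estimate cell distributions from visible board pixels only.

    cell_pixels -- dict {cell: list of visible hue samples I_c}
    cell_models -- dict {cell: current model}, updated in place
    refit       -- callable fitting a new model from samples
    """
    for cell, samples in cell_pixels.items():
        if len(samples) > gamma_c:        # only cells with |I_c| > gamma_c
            cell_models[cell] = refit(samples)
    return cell_models

# Toy usage: a "model" here is just the mean hue.
pixels = {(0, 0): [0.3] * 60, (0, 1): [0.6] * 10}   # second cell mostly occluded
models = {(0, 0): 0.0, (0, 1): 0.5}
update_color_map(pixels, models, refit=lambda s: sum(s) / len(s))
print(models)  # only cell (0, 0) was refitted
```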
6 Experiments and results
This section explains the experimentation carried out to
test our proposal. First, the processing times required
for marker detection and correction are analyzed. Then,
the proposed method is compared with the state-of-the-
art systems in terms of inter-marker distances, number of
bit transitions, robustness against noise and vertex jitter.
Finally, an analysis of the proposed occlusion method is presented.
As already indicated, this work is available under the
BSD license in the ArUco library [13].
6.1 Processing Time
Processing time is a crucial feature in many real-time
fiducial applications (such as augmented reality). The marker
detection process of Sect. 4 can be divided into two main
steps: finding marker candidates and analyzing them to
determine if they actually belong to the dictionary.
The detection performance of our method has been
tested for a dictionary size of |D|= 24. The processing
time for candidate detection, marker identification and er-
ror correction was measured for several video sequences.
The tests were performed using a single core of a system
equipped with an Intel Core 2 Quad 2.40 Ghz processor,
2048 MB of RAM and Ubuntu 12.04 as the operating sys-
tem, with a load average of 0.1. Table 3 summarizes the
average results obtained for a total of 6000 images with a
resolution of 640 × 480 pixels. The sequences include
indoor recordings with several markers and marker boards
arranged in the environment.
In addition, we have evaluated the computing time
required for generating dictionaries with the proposed
method for 6 × 6 markers. The value of τ was reduced
after ψ = 5000 unproductive iterations. The computing
times for dictionaries of sizes 10, 100 and 1000 elements are
approximately 8, 20 and 90 minutes, respectively. Since
Candidates detection     8.17 ms/image
Marker identification    0.17 ms/candidate
Error correction         0.71 ms/candidate
Total time (|D| = 24)    11.08 ms/image

Table 3: Average processing times for the different steps
of our method.
Figure 7: Inter-marker distances of ARTag dictionaries
and ours (Eq. 10) for an increasing number of markers.
ArUco values correspond to the mean of 30 runs of our al-
gorithm (with and without considering reflection). Higher
distances reduce the possibility of inter-marker confusion
in case of error.
this is an off-line process done only once, we consider that
the computing times obtained are appropriate for real
applications. It must be considered, though, that gen-
erating the first elements of the dictionary is more time
consuming because of the high inter-marker distances imposed.
As τ decreases, the computation speed increases.
Finally, the times required for creating the color map and
the occlusion mask in the sequences reported in Sect. 6.6
are 170 and 4 ms, respectively. In these sequences, the
board has an average dimension of 320 × 240 pixels.
6.2 Analysis of Dictionary distances
The inter-marker confusion rate is related to the distances
between the markers in the dictionary, τ̂(D) (Eq. 10). The
higher the distance between markers, the more difficult it is
to confuse them in case of error. The marker dictionary
proposed by Fiala in the ARTag [11] system improves the
distances of other systems such as ARToolKitPlus [24] or
BinARyID [26]. His work recommends using its dictionary
(of 6 × 6 markers) in a specific order so as to maximize the
inter-marker distance.
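The distance between two markers is the minimum Hamming distance over the four rotations of one of them, and the dictionary distance is the minimum over all pairs. A simplified sketch with made-up 4 × 4 markers follows (the paper's Eq. 10 additionally accounts for each marker's distance to its own rotations):

```python
import numpy as np
from itertools import combinations

def marker_distance(m1, m2):
    """Min Hamming distance between bit matrix m1 and the 4 rotations of m2."""
    return min(int(np.sum(m1 != np.rot90(m2, k))) for k in range(4))

def dictionary_distance(markers):
    """Minimum pairwise distance over the whole dictionary."""
    return min(marker_distance(a, b) for a, b in combinations(markers, 2))

# Two hypothetical 4x4 inner bit matrices (not from an actual dictionary).
m_a = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]])
m_b = np.array([[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0]])
print(dictionary_distance([m_a, m_b]))
```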
We have compared the dictionaries generated with our
method to those obtained by incrementally adding the first
1000 recommended markers of ARTag. For our algorithm,
the initial distance employed is τ0 = 24 (Eq. 13), which
has been decremented by one after ψ = 5000 unproductive
iterations. Since ARTag considers the possibility of
marker reflection (i.e. markers seen in a mirror), we have
Figure 8: Standard deviations of inter-marker distances
obtained by our method in Figure 7 (with and without
considering reflection).
also tested our method including the reflection condition.
However, we consider this an uncommon case in fiducial
marker applications.
Figure 7 shows the values of τ̂(D) for the dictionaries as
their size increases. The results shown for our method
represent the average values of 30 runs of our algorithm.
As can be seen, our system outperforms the ARTag dic-
tionaries in the majority of the cases and obtains the same
results in the worst ones. Even when considering reflec-
tion, our method still outperforms the ARTag results in
most cases. The ARToolKitPlus system has not been com-
pared since it does not include a recommended marker
order as ARTag does. However, the minimum distance in AR-
ToolKitPlus considering all the BCH markers is 2, which
is a low value in comparison to our method or ARTag.
Figure 8 shows standard deviations for 30 runs of the
tests shown in Figure 7. It can be observed that there are
two patterns in the deviation results: (i) peaks which cor-
respond to the slopes in Figure 7, and (ii) intervals with-
out deviation where the inter-marker distance remains the
same in all runs. As can be observed, the higher deviations
occur at the transitions of τ̂(D) in Figure 7. It must be
noted, though, that in most of the cases the maximum de-
viation is 0.5. Only in the generation of the first markers does
the deviation rise up to 1.4 and 0.7 (with and without
considering reflection, respectively).
6.3 Evaluation of the bit transitions
Our marker generation process encourages markers with
a high number of bit transitions, thus, reducing the pos-
sibility of confusion with environment elements. Figure
9 shows the number of bit transitions of the dictionaries
generated in the previous section with our method and
with ARTag. The number of transitions is obtained as
the sum of the transitions for each word in the marker.
As in the previous case, our results represent the average
values obtained for 30 different marker dictionaries gen-
erated with our algorithm. It must be indicated that the
maximum standard deviation obtained in all cases was 1.7.
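Counting transitions as the sum, over each word (row) of the marker, of adjacent-bit changes can be sketched as:

```python
def bit_transitions(marker_rows):
    """Total number of 0-1 / 1-0 transitions, summed over each word (row)."""
    return sum(sum(row[i] != row[i + 1] for i in range(len(row) - 1))
               for row in marker_rows)

# Hypothetical 6x6 inner codes: alternating bits maximize transitions.
checker = [[1, 0, 1, 0, 1, 0]] * 6
solid = [[1, 1, 1, 0, 0, 0]] * 6
print(bit_transitions(checker), bit_transitions(solid))  # 30 6
```

A checkerboard-like code (many transitions) is much less likely to be mimicked by an environment element than a half-black, half-white one.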
Figure 9: Number of bit transitions of ARTag dictionaries
and ours for an increasing number of markers. A higher
number of transitions reduces the possibility of confusion
with environment elements.
Figure 10: False negative rates for different levels of addi-
tive Gaussian noise.
It can be observed that our approach generates markers
with more transitions than ARTag. Also, the number of
transitions does not decrease drastically as the number of
markers selected grows. The mean number of bit transitions for all
the BCH markers in ARToolKitPlus is 15.0, which is also
below that of our method.
6.4 Error detection
The false positive and false negative rates are related to
the coding scheme and the number of redundant bits em-
ployed for error detection and correction. In our approach,
however, false positives are not detected by checking re-
dundant bits but by analyzing the distance to the dictionary
markers. A comparison between the correction capabili-
ties of ARToolKitPlus, ARTag and our method has been
performed by comparing the false negative rates from a
set of 100 test images for each system. The images showed
markers of each system from different distances and view-
points. The images were taken in the same positions for
Figure 11: Image examples from video sequences used to test the proposed fiducial marker system. First row shows
cases of correct marker detection. Second row shows cases where false positives have not been detected.
each of the tested systems. Different levels of additive
Gaussian noise have been applied to the images to mea-
sure the robustness of the methods. Figure 10 shows the
false negative rates obtained as a function of the noise level.
As can be observed, the proposed method is more robust
against high amounts of noise than the rest. ARToolKit-
Plus false negative rate increases sharply for high levels of
noise. ARTag presents a higher sensitivity for low levels of
noise; however, it is nearly as robust as our method for high
levels. Figure 11 shows some examples of the sequences
used to test the proposed system. It must be indicated,
though, that no false positives have been detected by any
method in the video sequences tested during our experiments.
6.5 Vertex jitter
An important issue in many augmented reality applica-
tions is the vertex jitter, which refers to the noise in the
localization of the marker corner. Errors in the location
of corners are propagated to the estimation of the camera
extrinsic parameters, leading to an unpleasant user
experience. This section analyzes the vertex jitter obtained by:
i) the result of the polygonal approximation (see Sect. 4),
ii) our method implemented in the ArUco library, iii) the
ARToolKitPlus library and iv) the ARTag library. The
first method is the most basic approach (i.e., no corner
refinement) and is used as a baseline to assess the impact
of the other methods. Then, since the techniques used by
ARToolKitPlus, ARTag and our method are based on the
same principle (linear regression of marker side pixels), it
is expected that they obtain similar results.
For the experiments, the camera has been placed at
a fixed position with respect to a set of markers and several
frames have been acquired. Then, the camera has been
moved farther away from the marker, thus obtaining sev-
eral viewpoints at different distances. The standard de-
viation of the corner locations estimated by each method
has been measured in all the frames. The experiment
has been repeated both for black-and-white markers and
green-and-blue markers.

Figure 12: Vertex jitter measures for different marker systems.

Please note that the hue channel
employed for detecting the latter presents less contrast
than the black-and-white markers (see Fig. 6(a)). Thus,
evaluating the different corner refinement systems is espe-
cially relevant in that case.
Figure 12 shows the results obtained as a box plot [46]
for both black-and-white and green-and-blue
markers. The lower and upper ends of the whiskers rep-
resent the minimum and maximum distribution values,
respectively. The bottom and top of the boxes represent the
lower and upper quartiles, while the middle band repre-
sents the median.
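Measured this way, jitter is just the per-corner standard deviation of the estimated positions across frames; a sketch on synthetic tracks (one static corner observed with 0.1 px noise):

```python
import numpy as np

def vertex_jitter(corner_tracks):
    """Jitter per corner: spread of the (x, y) estimates across frames.

    corner_tracks -- array of shape (frames, corners, 2)
    Returns one scalar per corner: the norm of the per-axis std deviations.
    """
    tracks = np.asarray(corner_tracks, dtype=float)
    return np.linalg.norm(tracks.std(axis=0), axis=-1)

# Synthetic data: one static corner plus isotropic 0.1 px noise (fixed seed).
rng = np.random.default_rng(1)
tracks = np.array([[100.0, 50.0]]) + rng.normal(0, 0.1, size=(500, 1, 2))
print(vertex_jitter(tracks))  # roughly 0.1 * sqrt(2) ≈ 0.14
```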
It can be observed that the jitter level is lower in black-
and-white markers than in green-and-blue ones. Nonethe-
less, it is small enough to provide a satisfactory user’s
experience. As expected, not performing any refinement
produces higher deviations. It can also be noted that our
method obtains results similar to those obtained by the
ARToolKitPlus and ARTag libraries. We consider that the
differences between the three methods can be
attributed to implementation details.
6.6 Analysis of Occlusion
Along with the marker system described, a method to
overcome the occlusion problem in augmented reality ap-
plications has been proposed. First, we employ marker
boards so as to increase the probability of seeing complete
markers in the presence of occlusion. Then, we propose
using a color map to calculate an occlusion mask of the
board pixels. We have designed two sets of experiments to
validate our proposal. First, we have analyzed how
different occlusion levels affect the estimation of the
camera pose. While ARTag introduces the idea of mul-
tiple markers, no analysis of occlusion is made in their
work. Second, a qualitative evaluation of the occlusion
mask generated has been performed under different light-
ing conditions. It must be noted that the estimation
of the occlusion mask is not present in any of the previ-
ous works (ARTag, ARToolKit or ARToolKitPlus), thus a
comparison with them is not feasible.
For our tests, the parameters
θc = 10^-4, γc = 50, nc = mc = 5
have been employed, providing good results
in a wide range of sequences.
6.6.1 Occlusion tolerance
In these experiments we aim at analyzing the tolerance to
occlusion of our system. To do so, a video sequence is
recorded showing a board composed of 24 markers with-
out occlusion so that all markers are correctly detected.
Assuming Gaussian noise, the ground truth camera pose
is taken as the average pose over all the frames. Then, we
have artificially simulated several degrees of occlusion by
randomly removing a percentage of the detected markers
in each frame and computing the pose with the remaining
ones. Thus, the deviation from the ground truth at each
frame is the error introduced by occlusion. This process
has been repeated for three distances from the board to
analyze the impact of distance on occlusion handling.
The 3D rotation error is computed using the inner prod-
uct of unit quaternions [47],

φ(q1, q2) = 1 − |q1 · q2|,

which gives values in the range [0, 1]. The translation error
has been obtained using the Euclidean distance.
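The rotation metric follows directly from the formula; the axis-angle helper below is only for constructing test quaternions:

```python
import numpy as np

def rotation_error(q1, q2):
    """phi(q1, q2) = 1 - |q1 . q2| for unit quaternions; 0 means equal rotations."""
    return 1.0 - abs(float(np.dot(q1, q2)))

def quat_from_angle_z(theta):
    """Unit quaternion (w, x, y, z) for a rotation of theta about the Z axis."""
    return np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])

q_identity = quat_from_angle_z(0.0)
q_small = quat_from_angle_z(np.deg2rad(5))
q_flip = quat_from_angle_z(np.pi)
print(rotation_error(q_identity, q_small))      # small value near 0
print(rotation_error(q_identity, -q_identity))  # 0: q and -q encode the same rotation
```

Taking the absolute value of the dot product is what makes the metric insensitive to the q/−q sign ambiguity of quaternions.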
Figures 13-14 show the results obtained for different
camera distances to the marker board. It can be observed
that, both in rotation and translation, the errors caused
by occlusion are insignificant until the occlusion de-
gree rises above 85%. It can also be noted that the error
increases as the camera moves farther from the board.
6.6.2 Qualitative evaluation of the occlusion mask
Figure 15 shows some captures from a user session using
the green-and-blue marker board. The augmented objects
Figure 13: Rotation error for different degrees of marker
board occlusion and for three camera distances.
Figure 14: Translation error for different degrees of marker
board occlusion and for three camera distances.
consist of a piece of virtual floor and a virtual character
performing actions around it. It can be observed that the
user's hand and other real objects are not occluded by vir-
tual objects since they have different tonalities than the
board and thus can be recognized by our method.
Nonetheless, as any color-based method, it is sensitive
to lighting conditions, i.e., too bright or too dark regions
make it impossible to detect the markers or to obtain
a precise occlusion mask. Fig. 16 shows an example of a
scene where a lamp has been placed beside the board.
It can be seen that there is a bright spot saturating the
lower right region of the board, where markers cannot be
detected. Additionally, because of the light saturation, the
chromatic information in that region (hue channel) is not
reliable, thus producing segmentation errors on the board.
Figure 15: Examples of users' interaction applying the occlusion mask. Note that hands and other real objects are
not occluded by the virtual character and the virtual floor texture.

Figure 16: Example of occlusion mask errors due to light saturation. (a) Original input image. (b) Markers detected.
(c) Occlusion mask. (d) Augmented scene.

7 Conclusions

This paper has proposed a fiducial marker system especially
appropriate for camera localization in applications such
as augmented reality or robotics. Instead of
employing a predefined set of markers, a general method
to generate configurable dictionaries in size and number
of bits has been proposed. The algorithm relies on a prob-
abilistic search maximizing two criteria: the inter-marker
distances and the number of bit transitions. Also, the
theoretical maximum inter-marker distance that a dictio-
nary with square markers can have has been derived. The
paper has also proposed an automatic method to detect
the markers and correct possible errors. Instead of using
redundant bits for error detection and correction, our ap-
proach is based on a search on the generated dictionary.
Finally, a method to overcome the occlusion problem in
augmented reality applications has been presented: a color
map is employed to calculate the occlusion mask.
The experiments conducted have shown that the dictio-
naries generated with our method outperform state-of-
the-art systems in terms of inter-marker distance, number
of bit transitions and false negative rate. Finally, this work
has been set publicly available in the ArUco library [13].
Acknowledgments. We are grateful for the finan-
cial support provided by the Science and Technology Min-
istry of Spain and FEDER (projects TIN2012-32952 and

References
[1] R. T. Azuma, A survey of augmented reality, Pres-
ence 6 (1997) 355–385.
[2] H. Kato, M. Billinghurst, Marker tracking and HMD
calibration for a Video-Based augmented reality con-
ferencing system, Augmented Reality, International
Workshop on 0 (1999) 85–94.
[3] V. Lepetit, P. Fua, Monocular model-based 3d track-
ing of rigid objects: A survey, in: Foundations and
Trends in Computer Graphics and Vision, 2005, pp.
[4] B. Williams, M. Cummins, J. Neira, P. Newman,
I. Reid, J. Tardós, A comparison of loop closing tech-
niques in monocular SLAM, Robotics and Autonomous
Systems.
[5] W. Daniel, R. Gerhard, M. Alessandro, T. Drum-
mond, S. Dieter, Real-time detection and tracking for
augmented reality on mobile phones, IEEE Transac-
tions on Visualization and Computer Graphics 16 (3)
(2010) 355–368.
[6] G. Klein, D. Murray, Parallel tracking and map-
ping for small ar workspaces, in: Proceedings of the
2007 6th IEEE and ACM International Symposium
on Mixed and Augmented Reality, ISMAR ’07, IEEE
Computer Society, Washington, DC, USA, 2007, pp.
[7] K. Mikolajczyk, C. Schmid, Indexing based on scale
invariant interest points., in: ICCV, 2001, pp. 525–
[8] D. G. Lowe, Object recognition from local scale-
invariant features, in: Proceedings of the Interna-
tional Conference on Computer Vision-Volume 2 -
Volume 2, ICCV ’99, IEEE Computer Society, Wash-
ington, DC, USA, 1999, pp. 1150–.
[9] P. Bhattacharya, M. Gavrilova, A survey of landmark
recognition using the bag-of-words framework, in: In-
telligent Computer Graphics, Vol. 441 of Studies in
Computational Intelligence, Springer Berlin Heidel-
berg, 2013, pp. 243–263.
[10] H. Kato, M. Billinghurst, Marker tracking and hmd
calibration for a video-based augmented reality con-
ferencing system, in: Proceedings of the 2nd IEEE
and ACM International Workshop on Augmented Re-
ality, IWAR ’99, IEEE Computer Society, Washing-
ton, DC, USA, 1999, pp. 85–.
[11] M. Fiala, Designing highly reliable fiducial markers,
IEEE Trans. Pattern Anal. Mach. Intell. 32 (7) (2010)
[12] D. Schmalstieg, A. Fuhrmann, G. Hesina,
Z. Szalavári, L. M. Encarnação, M. Gervautz,
W. Purgathofer, The studierstube augmented reality
project, Presence: Teleoper. Virtual Environ. 11 (1)
(2002) 33–54.
[13] R. Munoz-Salinas, S. Garrido-Jurado, ArUco library,
[Online; accessed 01-December-2013] (2013).
[14] K. Dorfmüller, H. Wirth, Real-time hand and head
tracking for virtual environments using infrared
beacons, in: in Proceedings CAPTECH98. 1998,
Springer, 1998, pp. 113–127.
[15] M. Ribo, A. Pinz, A. L. Fuhrmann, A new optical
tracking system for virtual and augmented reality ap-
plications, in: In Proceedings of the IEEE Instrumen-
tation and Measurement Technical Conference, 2001,
pp. 1932–1936.
[16] V. A. Knyaz, R. V. Sibiryakov, The development of
new coded targets for automated point identification
and non-contact surface measurements, in: 3D Sur-
face Measurements, International Archives of Pho-
togrammetry and Remote Sensing, Vol. XXXII, part
5, 1998, pp. 80–85.
[17] L. Naimark, E. Foxlin, Circular data matrix fiducial
system and robust image processing for a wearable
vision-inertial self-tracker, in: Proceedings of the 1st
International Symposium on Mixed and Augmented
Reality, ISMAR ’02, IEEE Computer Society, Wash-
ington, DC, USA, 2002, pp. 27–.
[18] J. Rekimoto, Y. Ayatsuka, Cybercode: designing
augmented reality environments with visual tags, in:
Proceedings of DARE 2000 on Designing augmented
reality environments, DARE ’00, ACM, New York,
NY, USA, 2000, pp. 1–10.
[19] M. Rohs, B. Gfeller, Using camera-equipped mobile
phones for interacting with real-world objects, in:
Advances in Pervasive Computing, 2004, pp. 265–271.
[20] M. Kaltenbrunner, R. Bencina, reactivision: a
computer-vision framework for table-based tangible
interaction, in: Proceedings of the 1st international
conference on Tangible and embedded interaction,
TEI ’07, ACM, New York, NY, USA, 2007, pp. 69–74.
[21] D. Claus, A. Fitzgibbon, Reliable automatic calibra-
tion of a marker-based position tracking system, in:
Workshop on the Applications of Computer Vision,
2005, pp. 300–305.
[22] M. Fiala, Comparing artag and artoolkit plus fiducial
marker systems, in: IEEE International Workshop on
Haptic Audio Visual Environments and their Appli-
cations, 2005, pp. 147–152.
[23] J. Rekimoto, Matrix: A realtime object identifica-
tion and registration method for augmented reality,
in: Third Asian Pacific Computer and Human Inter-
action, July 15-17, 1998, Kangawa, Japan, Proceed-
ings, IEEE Computer Society, 1998, pp. 63–69.
[24] D. Wagner, D. Schmalstieg, Artoolkitplus for pose
tracking on mobile devices, in: Computer Vision
Winter Workshop, 2007, pp. 139–146.
[25] S. Lin, D. Costello, Error Control Coding: Funda-
mentals and Applications, Prentice Hall, 1983.
[26] D. Flohr, J. Fischer, A Lightweight ID-Based Exten-
sion for Marker Tracking Systems, in: Eurographics
Symposium on Virtual Environments (EGVE) Short
Paper Proceedings, 2007, pp. 59–64.
[27] X. Zhang, S. Fronz, N. Navab, Visual marker de-
tection and decoding in ar systems: A comparative
study, in: Proceedings of the 1st International Sym-
posium on Mixed and Augmented Reality, ISMAR
’02, IEEE Computer Society, Washington, DC, USA,
2002, pp. 97–.
[28] W. Friedrich, D. Jahn, L. Schmidt, Arvika - aug-
mented reality for development, production and ser-
vice, in: DLR Projektträger des BMBF für Infor-
mationstechnik (Ed.), International Status Confer-
ence - Lead Projects Human-Computer Interaction
(Saarbrücken 2001), DLR, Berlin, 2001, pp. 79–89.
[29] S. Zollmann, G. Reitmayr, Dense depth maps from
sparse models and image coherence for augmented
reality, in: 18th ACM symposium on Virtual reality
software and technology, 2012, pp. 53–60.
[30] M.-O. Berger, Resolving occlusion in augmented re-
ality: a contour based approach without 3d recon-
struction, in: In Proceedings of CVPR (IEEE Confer-
ence on Computer Vision and Pattern Recognition),
Puerto Rico, 1997, pp. 91–96.
[31] J. Schmidt, H. Niemann, S. Vogt, Dense disparity
maps in real-time with an application to augmented
reality, in: Proceedings of the Sixth IEEE Work-
shop on Applications of Computer Vision, WACV
’02, IEEE Computer Society, Washington, DC, USA,
2002, pp. 225–.
[32] A. Fuhrmann, G. Hesina, F. Faure, M. Gervautz,
Occlusion in collaborative augmented environments,
Tech. Rep. TR-186-2-98-29, Institute of Computer
Graphics and Algorithms, Vienna University of Tech-
nology, Favoritenstrasse 9-11/186, A-1040 Vienna,
Austria (Dec. 1998).
[33] V. Lepetit, M.-O. Berger, A semi-automatic
method for resolving occlusion in augmented reality,
in: In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, 2000, pp. 225–
[34] R. Radke, Image change detection algorithms: a sys-
tematic survey, Image Processing, IEEE Transactions
on 14 (3) (2005) 294–307.
[35] J. Canny, A computational approach to edge detec-
tion, IEEE Trans. Pattern Anal. Mach. Intell. 8 (6)
(1986) 679–698.
[36] S. Suzuki, K. Abe, Topological structural analysis of
digitized binary images by border following, Com-
puter Vision, Graphics, and Image Processing 30 (1)
(1985) 32–46.
[37] D. H. Douglas, T. K. Peucker, Algorithms for the re-
duction of the number of points required to represent
a digitized line or its caricature, Cartographica: The
International Journal for Geographic Information and
Geovisualization 10 (2) (1973) 112–122.
[38] N. Otsu, A threshold selection method from gray-level
histograms, IEEE Transactions on Systems, Man and
Cybernetics 9 (1) (1979) 62–66.
[39] D. W. Marquardt, An algorithm for Least-Squares
estimation of nonlinear parameters, SIAM Journal on
Applied Mathematics 11 (2) (1963) 431–441.
[40] R. Hartley, A. Zisserman, Multiple View Geometry in
Computer Vision, 2nd Edition, Cambridge University
Press, New York, NY, USA, 2003.
[41] C. Harris, M. Stephens, A combined corner and edge
detector, in: In Proc. of Fourth Alvey Vision Confer-
ence, 1988, pp. 147–151.
[42] W. Förstner, E. Gülch, A Fast Operator for Detection
and Precise Location of Distinct Points, Corners and
Centres of Circular Features (1987).
[43] S. M. Smith, J. M. Brady, Susan - a new approach to
low level image processing, International Journal of
Computer Vision 23 (1995) 45–78.
[44] A. Sanjeev, R. Kannan, Learning mixtures of arbi-
trary gaussians, in: Proceedings of the thirty-third
annual ACM symposium on Theory of computing,
STOC ’01, ACM, New York, NY, USA, 2001, pp.
[45] A. P. Dempster, N. M. Laird, D. B. Rubin, Max-
imum likelihood from incomplete data via the EM
algorithm, Journal of the Royal Statistical Society.
Series B (Methodological) 39 (1) (1977) 1–38.
[46] D. F. Williamson, R. A. Parker, J. S. Kendrick, The
box plot: a simple visual method to interpret data.,
Ann Intern Med 110 (11).
[47] D. Q. Huynh, Metrics for 3d rotations: Comparison
and analysis, J. Math. Imaging Vis. 35 (2) (2009)
... In detail, all the algorithms related to the contact classification and to the Jaco2 control run on Matlab 2020 and exchange information with Robot Operating System (ROS). The latter is also used by the localization component which uses the RGB-D sensor for identifying the world frame Σ w using markers and ArUco libraries [141] and for tracking the human skeleton through a proper algorithm 5 . In detail, the following relevant points are considered for the human: torso, left and right hand, elbow, shoulder. ...
Full-text available
Removing the barriers between humans and robots, enabling their collaboration, represents nowadays one of the most promising directions to achieve flexible and efficient production processes. On the one hand, humans are better suited for environment interpretation and decision-making processes, are characterized by greater manipulation skills and are more flexible in the sense of simplicity to be re-tasked; on the other hand, robots are faster, stronger, more precise and better suited to repetitive and/or heavy tasks than humans. The combination of these features can thus potentially lead to increase the production efficiency, quality, and flexibility. Moreover, the physical abilities as well as the robustness to faults of the robotic component are significantly enhanced when multiple cooperative robots are introduced into industrial setups instead of individual robots, enabling the execution of tasks that would otherwise not be possible. Despite the potential performance increase given by human multi-robot collaboration, this topic is still unexplored in the current state of the art and its realizations in real systems are far from trivial. The main issue arises from the need to integrate actions to achieve the desired interaction with human operators, whose behavior can be extremely variable, and actions to coordinate the multi-robot system while handling any related constraints that it poses. Among the latter, for example, there are closed kinematic chains that characterize transport applications and, in general, cooperative manipulation of objects. This thesis work aims to investigate human multi-robot collaboration from multiple perspectives: from safety issues in coexistence scenarios, to strategies for physical interaction up to methodologies for learning from demonstration. 
Particular attention is devoted to the design of distributed control architectures, which do not rely on central control units for coordinating the different robots, but allow autonomous decision making based on local information. The immediate advantages in terms of system scalability and robustness resulting from the distributed control paradigm come at the expense of increased design complexity with respect to the centralized counterpart.
... These traditional methods have been developed further with goals to reduce false negatives, to detect smaller and denser tags, to achieve more robustness against occlusions [Garrido-Jurado et al. 2014;Kallwies et al. 2020;Krogius et al. 2019;Romero-Ramirez et al. 2018;Wang and Olson 2016]. However, several limitations still remain. ...
Fiducial markers have been broadly used to identify objects or embed messages that can be detected by a camera. Primarily, existing detection methods assume that markers are printed on ideally planar surfaces. Markers often fail to be recognized due to various imaging artifacts of optical/perspective distortion and motion blur. To overcome these limitations, we propose a novel deformable fiducial marker system that consists of three main parts: First, a fiducial marker generator creates a set of free-form color patterns to encode significantly large-scale information in unique visual codes. Second, a differentiable image simulator creates a training dataset of photorealistic scene images with the deformed markers, being rendered during optimization in a differentiable manner. The rendered images include realistic shading with specular reflection, optical distortion, defocus and motion blur, color alteration, imaging noise, and shape deformation of markers. Lastly, a trained marker detector seeks the regions of interest and recognizes multiple marker patterns simultaneously via inverse deformation transformation. The deformable marker creator and detector networks are jointly optimized via the differentiable photorealistic renderer in an end-to-end manner, allowing us to robustly recognize a wide range of deformable markers with high accuracy. Our deformable marker system is capable of decoding 36-bit messages successfully at ~29 fps with severe shape deformation. Results validate that our system significantly outperforms the traditional and data-driven marker methods. Our learning-based marker system opens up new interesting applications of fiducial markers, including cost-effective motion capture of the human body, active 3D scanning using our fiducial markers' array as structured light patterns, and robust augmented reality rendering of virtual objects on dynamic surfaces.
... dedicated tracking hats have been designed. These hats consist of fiducial markers [17] printed on a circular white cardboard surface, which is mounted to the main PCB. Figure 2b shows 10 brushbots with 10 different fiducial markers to uniquely identify and track them. ...
This paper describes the methodology and outcomes of a series of educational events conducted in 2021 which leveraged robot swarms to educate high-school and university students about epidemiological models and how they can inform societal and governmental policies. With a specific focus on the COVID-19 pandemic, the events consisted of 4 online and 3 in-person workshops where students had the chance to interact with a swarm of 20 custom-built brushbots, small-scale vibration-driven robots optimized for portability and robustness. Through the analysis of data collected in a post-event survey, this paper shows how the events positively impacted the students' views on using the scientific method to guide real-world decision making, as well as their interest in robotics.
... However, detection algorithms can fail for several reasons, such as poor lighting conditions, rapid camera movements, and occlusions. A common approach to improve the robustness of a marker detection system is the use of marker boards, i.e., a pattern composed of multiple markers [Garrido-Jurado et al., 2014]. ...
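As a back-of-the-envelope illustration of why boards help (not from the cited work, and assuming per-marker detection failures are independent, which correlated failures such as a hand covering half the board violate): a board is lost only when every one of its markers is missed.

```python
def board_miss_probability(p_single_miss: float, n_markers: int) -> float:
    """Probability that every marker of an n-marker board is missed in a
    frame, under the simplifying assumption of independent failures."""
    return p_single_miss ** n_markers

# A marker missed 10% of the time on its own, on a 2x3 board of 6 markers:
print(board_miss_probability(0.1, 6))  # ~1e-6, versus 0.1 for a single marker
```

This is also why board layouts spread markers spatially: the more decorrelated the individual failures, the closer the real system gets to this idealized bound.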
This thesis deals with extracting features and low-level primitives from perceptual image information in order to understand scenes. Motivated by the needs and problems of vision-based navigation for Unmanned Aerial Vehicles (UAVs), we propose novel methods focusing on image understanding problems. This work explores three main pieces of information in an image: intensity, color, and texture. In the first chapter of the manuscript, we work with the intensity information through image contours. We combine this information with human perception concepts, such as the Helmholtz principle and the Gestalt laws, to propose an unsupervised framework for object detection and identification. We validate this methodology in the last stage of drone navigation, just before landing. In the following chapters, we explore the color and texture information contained in images. First, we present an analysis of color and texture as global distributions of an image. This approach leads us to study optimal transport theory and its properties as a true metric for comparing color and texture distributions. We review and compare the most popular similarity measures between distributions to show the importance of a metric with the correct properties, such as non-negativity and symmetry. We validate these concepts in two image retrieval systems based on the similarity of color distributions and texture energy distributions. Finally, we build an image representation that exploits the relationship between color and texture information. The representation results from the image's spectral decomposition, which we obtain by convolution with a family of Gabor filters. We present in detail the improvements to the Gabor filter and the properties of complex color spaces. We validate our methodology with a series of segmentation and boundary-detection algorithms based on the computed perceptual feature space.
Due to the growing focus on minimally invasive surgery, there is increasing interest in intraoperative software support. For example, augmented reality can be used to provide additional information. Accurate registration is required for effective support. In this work, we present a manual registration method that aims to mimic the natural manipulation of 3D objects using tracked surgical instruments. This method is compared to a point-based registration method in a simulated laparoscopic environment. Both registration methods serve as an initial alignment step prior to surface-based registration refinement. For the evaluation, we conducted a user study with 12 participants. The registration methods were compared in terms of registration accuracy, registration duration, and subjective usability feedback. No significant differences between the manual and the point-based registration methods were found with respect to these criteria; thus, the manual registration did not outperform the reference method. However, we found that our method offers qualitative advantages, which may make it more suitable for some application scenarios. Furthermore, we identified possible approaches for improvement, which should be investigated in the future to strengthen the advantages of our registration method.
Object tracking in computer vision can be done with either a marker-less or a marker-based approach. Computer vision systems have been using fiducial markers for pose estimation in different applications such as augmented reality [5] and robot navigation [4]. With the advancements in Augmented Reality (AR), new tools such as ArUco markers [6] have been introduced to the literature. ArUco markers are used to tackle the localization problem in AR, allowing camera pose estimation to be carried out from a binary matrix. Using a binary matrix not only simplifies the process but also increases efficiency. As part of our initiative to create a cost-efficient, 24/7 accessible, Virtual Reality (VR) based chemistry lab for underprivileged students, we wanted to create an alternative way of interacting with the virtual scene. In this study, we used ArUco markers to create a low-cost keyboard using only a piece of paper and an off-the-shelf webcam. We believe this kind of keyboard is more beneficial to the user, since the keys are visible in a corner of the screen before typing, instead of relying on a cramped on-screen VR keyboard or a physical keyboard the user cannot see while wearing a VR headset. As potential extensions of the base system, we have also designed and evaluated a stereo-camera-based and an IMU-sensor-based system with various sensor fusion techniques. In summary, the stereo camera reduces occlusion-related problems, and the IMU sensor detects vibrations, which in turn simplifies key-press detection. It has been observed that use of either of these additional sensors improves the overall system performance.
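A minimal sketch of the marker-keyboard idea (the layout table, the helper name, and the occlusion-as-press trigger are illustrative assumptions, not the paper's actual design): each key is a printed ArUco marker, and a marker that disappears between frames, e.g. because a fingertip covers it, is treated as a press.

```python
# Hypothetical marker-ID-to-key layout (not the paper's actual mapping).
LAYOUT = {0: "A", 1: "B", 2: "C", 3: "Enter"}

def pressed_keys(prev_ids, curr_ids):
    """Keys whose markers were detected in the previous frame but are
    occluded now; covering a printed marker hides it from the detector,
    which this sketch interprets as a key press."""
    return sorted(LAYOUT[i] for i in set(prev_ids) - set(curr_ids) if i in LAYOUT)

# Marker 1 ("B") vanished between two consecutive frames:
print(pressed_keys({0, 1, 2, 3}, {0, 2, 3}))  # ['B']
```

A real system would also debounce over several frames so that momentary detection dropouts are not reported as presses.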
How similar a virtual product is to a real product is one of the most important issues when using virtual simulation to develop real apparel designs. The first step to achieve high similarity is finding optimal simulation parameters for the desired fabrics. However, it is notoriously difficult to find an optimal parameter set that reproduces the physical properties of a specific fabric as closely as possible. This is because the relationship between changes in simulation parameters and drape shapes is highly non-linear, unintuitive, and hard to predict even for experts. Therefore, users have to repeat trial and error based on personal experience until they find satisfactory results, which is time-consuming due to the simulation time required for each trial. To handle this problem, we propose a neural network model that learns the relationship between the parameter space and the drape space, so that simulation results (drapes) can be instantly inferred from a given set of simulation parameters, and we present a user interface that allows users to quickly and interactively explore the extensive drape space through simulation parameters. To validate our method, we provided our UI to experts in the fashion design industry and conducted user studies with them for qualitative evaluation.
Recent years have seen an exponential increase in the use of mobile devices. Since many mobile devices are equipped with a camera and connected to the internet, localization in an urban environment using landmark images is gaining popularity. The idea is simple: a tourist standing at a landmark takes images of it with a mobile camera, and these are transmitted to a server where they are matched against a database of landmark images for that locality. If a match is found, relevant information such as background on the landmark, nearby transit facilities, or other important landmarks in the area is sent back. This type of application has tremendous potential as a mobile city guide or navigation aid. In this paper, we investigate the use of local invariant shape features and global features such as colour and texture for the recognition task, as evident from the literature, and present various retrieval techniques. A variety of descriptors for landmark recognition and scene classification are discussed. Insights into vocabulary building and weighting schemes for representing landmark images are provided that can help boost recognition rates.
The central task of close-range photogrammetry is solving the correspondence problem, i.e., determining the image coordinates of a given space point in two or more images. The paper presents the results of developing and testing new coded targets for automated identification and coordinate determination of marked points. The developed coded targets are invariant to location and rotation, allow reliable detection and localisation in textured images, and support precise calculation of centre coordinates. The coded targets have been used for automated 3D coordinate measurement by the photogrammetric station of the State Research Institute of Aviation Systems. The methodology for automated identification and the 3D measurement technique are given. Estimates of system performance and measurement accuracy, as well as approaches to achieving high accuracy, are also presented.
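The rotation invariance mentioned above can be illustrated with a square binary code (a simplified stand-in: the cited photogrammetric targets are circular, but the principle of decoding to a canonical form, as also used by square-marker systems like ArUco, is the same):

```python
def rotations(bits):
    """Yield the four 90-degree rotations of a square binary matrix."""
    m = [list(row) for row in bits]
    for _ in range(4):
        yield tuple(tuple(row) for row in m)
        m = [list(row) for row in zip(*m[::-1])]  # rotate 90 degrees clockwise

def canonical_id(bits):
    """Rotation-invariant identifier: the lexicographically smallest of the
    four rotations, so a target decodes to the same code however the
    camera is rolled about its optical axis."""
    return min(rotations(bits))

marker = ((1, 0), (0, 0))
rolled = ((0, 1), (0, 0))  # the same marker seen rotated by 90 degrees
print(canonical_id(marker) == canonical_id(rolled))  # True
```

A dictionary of such targets must contain only codes whose canonical forms are distinct, otherwise two physically different targets would decode to the same identity.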
This paper describes a new approach to low-level image processing; in particular, edge and corner detection and structure-preserving noise reduction. Non-linear filtering is used to define which parts of the image are closely related to each individual pixel; each pixel has associated with it a local image region which is of similar brightness to that pixel. The new feature detectors are based on the minimization of this local image region, and the noise reduction method uses this region as the smoothing neighbourhood. The resulting methods are accurate, noise-resistant and fast. Details of the new feature detectors and of the new noise reduction method are described, along with test results.
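The "local image region of similar brightness" can be sketched in a few lines of toy code, assuming a hard brightness threshold over a square neighbourhood (the actual detector uses a circular mask and a smooth similarity function): the smaller this region, the more likely the centre pixel sits on an edge or corner.

```python
def usan_area(img, x, y, t=10, r=1):
    """Count neighbours whose brightness is within t of the centre pixel
    (the local region of similar brightness). Small areas indicate edges;
    even smaller areas indicate corners."""
    c = img[y][x]
    area = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < len(img) and 0 <= nx < len(img[0]):
                if abs(img[ny][nx] - c) <= t:
                    area += 1
    return area

# Flat region versus the corner of a bright square on a dark background:
flat = [[100] * 5 for _ in range(5)]
corner = [[0] * 5 for _ in range(5)]
for yy in range(2, 5):
    for xx in range(2, 5):
        corner[yy][xx] = 100
print(usan_area(flat, 2, 2))    # 9: the whole neighbourhood is similar
print(usan_area(corner, 2, 2))  # 4: corner pixel, small similar region
```

Thresholding this area below a geometric limit (roughly half the mask for edges, less for corners) yields the feature response; using the region as a smoothing neighbourhood gives the structure-preserving denoiser.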