Indoor Localization for Quadrotors using Invisible Projected Tags
Jinjie Li1, Liang Han2∗, Zhang Ren1
Abstract— Augmented reality (AR) technology has been in-
troduced into the robotics field to narrow the visual gap
between indoor and outdoor environments. However, without
signals from satellite navigation systems, flight experiments in
these indoor AR scenarios need other accurate localization
approaches. This work proposes a real-time centimeter-level
indoor localization method based on psycho-visually invisible
projected tags (IPT), requiring a projector as the sender
and quadrotors with high-speed cameras as the receiver. The
method includes a modulation process for the sender, as well
as demodulation and pose estimation steps for the receiver,
where screen-camera communication technology is applied to
hide fiducial tags by exploiting properties of human vision. Experiments have demonstrated that IPT can achieve accuracy within ten centimeters and a speed of about ten FPS. Compared with other localization methods for AR robotics platforms, IPT is affordable, requiring only a projector and high-speed cameras as hardware, and convenient, as it omits the coordinate alignment step. To the authors’ best knowledge, this is the first
time screen-camera communication is utilized for AR robot
localization.
I. INTRODUCTION
The deployment of small and intelligent aerial robots has
significantly influenced civilian tasks such as transportation,
environment preservation, and agriculture [1]. These com-
plex tasks present scientific and technical challenges in perception, decision making, and agile flight control. For
one of the most widely used aerial robots, the quadrotor,
many researchers carry out flight tests indoors to reduce the
experimental risk. However, two problems exist: first, indoor
environments are just simple abstractions of actual mission
scenarios, unable to provide enough visual information;
second, quadrotors require other methods to localize without
global navigation satellite system signals.
For the first problem, researchers have combined aug-
mented reality (AR) technology with mobile robot research
in recent years [2]. Reina et al. [3] proposed the concept
“virtual sensors” and presented an AR-based sensing frame-
work to control a swarm of e-puck robots. Omidshafiei et al.
[4] designed a novel robotics platform for hardware-in-the-
loop experiments, called measurable augmented reality for
prototyping cyber-physical systems (MAR-CPS). The MAR-
CPS supplied visual information by projecting a mission
scenario and robot states onto the ground through several
projectors. However, the above AR robotics platforms all
need additional devices to solve the second problem, indoor
1J. Li and Z. Ren are with the School of Automation Science
and Electrical Engineering, Beihang University, Beijing, 100191, China
lijinjie@buaa.edu.cn;renzhang@buaa.edu.cn
2L. Han is with the Sino-French Engineer School, Beihang University,
Beijing, 100191, China. liang han@buaa.edu.cn
Fig. 1. A quadrotor flying over a projected AR scenario, using IPT to collect position data. Requiring no additional equipment, the quadrotor obtains its pose solely from the projected images through a downward-facing camera.
localization. Specifically, common solutions such as motion capture systems or Ultra-Wide Band (UWB) devices are used [5], [6]. However, the former is expensive, while the latter cannot measure orientation and is easily affected by metal surfaces [7]. Furthermore, separating the localization system from the display system adds an extra step of aligning the two coordinate systems [4], which increases both hardware cost and operational complexity. Therefore, AR robotics platforms require a low-cost and easy-to-use indoor localization method for wide adoption.
In this work, we propose a localization method based on invisible projected tags (IPT), with the idea of merging the visualization and localization processes. The main difficulty of IPT is embedding position information into the projected images without influencing the visual effect. Thus, screen-camera communication is utilized to hide fiducial tags in high-frequency flashing frames. Finally, experiments (Fig. 1) demonstrate that the IPT method provides centimeter-level positions while requiring only projectors and cameras as hardware.
The main contributions of this article are:
• IPT, a real-time centimeter-level localization method based on invisible projected fiducial tags,
• an IPT-based indoor AR flight platform for quadrotors,
• experiments to verify feasibility and test performance.
II. RELATED WORK
Indoor localization methods for quadrotors can be divided
into non-cooperative localization technology and coopera-
tive localization technology. For non-cooperative localization
technology such as simultaneous localization and mapping
(SLAM), the requirement for heavy sensors or complex algorithms makes it unsuitable for quick-start indoor flights.
For cooperative localization technology, the extensively used
methods for indoor flight include motion capture system,
base station system, and 2D fiducial markers [8]–[11].
The motion capture system can measure position and
orientation at the submillimeter level with high reliability,
but the high price limits it to well-funded research
institutions or the film industry. For the base station sys-
tem, the popular Ultra-Wide Band (UWB) technology can
achieve positioning accuracy within ten centimeters at a low
cost [12], but the lack of orientation measurement and the
sensitivity to metal surfaces limit its use [7]. The fiducial
marker system achieves centimeter-level localization accu-
racy robustly at the cost of just a camera and several printed
papers, making it the choice of many robotic researchers
[13], [14]. There has been sufficient research on the design,
detection, and decoding processes of visual fiducial markers,
such as ARTag [15], ArUco 3 [16], and AprilTag 3 [17]. For the tag-based pose estimation problem, also known as the Solve Pose from n Points (SolvePnP) problem, many algorithms have
been proposed and integrated into the OpenCV library, such
as EPnP [18], IPPE [19], and SQPnP [20]. The motivation
is to design an inexpensive localization method for indoor
AR robotics systems, and hence the tag-based approach is
chosen in this work.
In order to hide the fiducial tags in the projected video, another technique, screen-camera communication, is used to convey invisible information by exploiting properties of human eyes. In this field, the primary metrics are invisibility, throughput, and reliability [21]. For marker-based localization, the information to transmit is relatively limited, and hence we focus on algorithms with excellent invisibility and reliability, including HiLight [22], TextureCode [23], and ChromaCode [21]. HiLight hides information in the time
ChromaCode [21]. HiLight hides information in the time
domain through rapid brightness changes. TextureCode goes further by hiding data in both the temporal and spatial domains, based
on the principle that human eyes have different sensitivities
to image areas with various texture densities. ChromaCode
achieves better invisibility by converting the image to the
CIELAB color space for lightness change, matching the
human eye’s sensitivity to different colors. The advantages
of these algorithms inform the design of the IPT method.
III. SYSTEM OVERVIEW
This work designs an indoor AR flight platform to test the performance of the IPT method, and its workflow is illustrated in Fig. 2. The platform is divided into a Sender part and a Receiver part. On the sender side, first, the Modulation algorithm embeds a visual fiducial tag map into a dynamic video; second, a projector projects the modulated video onto the ground, providing a visually complex AR environment for both quadrotors and the audience. On the receiver side, a high-speed camera is used to capture the image information. The Demodulation algorithm is then executed to obtain the tag corners’ three-dimensional (3D) world coordinates and their two-dimensional (2D) projections from the captured video. Finally, these coordinates are fed, together with the camera parameters, into the Pose Estimation step to compute the pose. Through Wi-Fi (UDP), the pose data are transmitted to the sender to change
Fig. 2. The overall workflow of IPT. A video is first embedded with a bundle of invisible AprilTags. The sender side then projects the modulated video onto the ground. Next, the receiver demodulates the tags from the frames of a high-speed camera and obtains its location. Finally, the location is sent to a computer to generate a dynamic AR scenario.
the video content for AR effects. Viewers can watch the
low-frequency video content without being aware of this
projector-camera communication cycle.
The process of IPT, including Modulation, Demodulation, and Pose Estimation, is introduced in the following sections.
IV. MODULATION
Modulation refers to the process that hides fiducial markers in a video in the form of high-frequency brightness changes. This process utilizes the flicker-fusion property of the Human Vision System (HVS). For example, if a screen flickers above the typical frequency of 40∼50 Hz, human eyes cannot capture the fluctuation but perceive the average brightness instead [24]. Most laser projectors on the market have a vertical refresh rate of more than 48 Hz, which provides the hardware basis for the algorithm.
A. Encoding Process
Tags are hidden in the temporal dimension by converting
them into time-variant high-frequency signals. For a video
at 30 FPS, the first step is to double each frame to get
a new video at 60 FPS (or other frequency supported
by the projector). Then the tags are embedded by adding
or subtracting a light intensity ∆Lon the pixel of the
original frame at the corresponding position. The hidden
pattern includes only 0 and 1, encoded similarly but in
different phases, as shown in Fig. 3. This encoding scheme
is called a Manchester-like coding scheme [23], ensuring
that the projector screen continuously blinks at high speed.
The changing light intensity, ∆L, is chosen according to
the projector and the environment brightness. Under the
guarantee of demodulating successfully, ∆Lis required to
be as small as possible to raise invisibility.
The flicker-fusion property of the HVS is also called the low-pass filtering property [25], as it filters out temporal high-frequency signals. This property underlines the possibility of hiding patterns in high-speed flickering frames.
B. Color Space Selection
The next step is to determine the color space. Human eyes have different levels of sensitivity to brightness changes of different colors. In order to improve invisibility, the modulation process is required to be perceptually uniform.
Fig. 3. The encoding process of binary pixels. Bit 0 and bit 1 are encoded as 60-FPS flashing in different phases. Due to the flicker-fusion property, human eyes can hardly perceive this high-speed lightness change.
Fig. 4. A 9×9 binary tag map. The tag map is of the same size as the original video and designed for the localization task. After encoding, the tag map coincides with the displayed image, so the coordinate systems of localization and visualization are naturally identical.
Perceptual uniformity means that pixels of different colors produce similar visual impressions for the same brightness change. Previous research has shown that brightness adjustment in the CIELAB color space is considerably more uniform than in the HSL, YUV, and RGB spaces [21]. CIELAB space expresses color as three channels: L∗, a∗, and b∗, where L∗ is perceptual lightness, and a∗ and b∗ encode the color components perceived by the HVS. Brightness changes in the L∗ channel can be evenly distributed, making humans perceive a uniform lightness change. Therefore, we first convert the image from RGB to CIELAB space, then modify the lightness through the L∗ channel, and finally convert the color space back. This conversion is also needed for the demodulation process.
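To make this lightness adjustment concrete, the following is a minimal C++/OpenCV sketch of the per-frame embedding step. It is our own illustration rather than the authors’ implementation; the function name embedTagPlane and the ±1 mask representation are assumptions.

#include <opencv2/opencv.hpp>
#include <vector>

// Embed one bit plane of the tag map into a single frame.
// bgr    : 8-bit 3-channel frame (OpenCV stores images in BGR order).
// mask   : CV_32F matrix, same size as bgr, with +1 on white tag pixels,
//          -1 on black ones, and 0 elsewhere.
// deltaL : signed lightness change; its sign is flipped between the two
//          duplicated frames to obtain the Manchester-like pattern.
cv::Mat embedTagPlane(const cv::Mat& bgr, const cv::Mat& mask, double deltaL)
{
    cv::Mat lab;
    cv::cvtColor(bgr, lab, cv::COLOR_BGR2Lab);        // to CIELAB

    std::vector<cv::Mat> ch;
    cv::split(lab, ch);                               // ch[0] is the L* channel

    cv::Mat L;
    ch[0].convertTo(L, CV_32F);
    L = L + deltaL * mask;                            // add/subtract deltaL on tag pixels
    L = cv::max(L, 0.0);                              // clip to the valid 8-bit range
    L = cv::min(L, 255.0);
    L.convertTo(ch[0], CV_8U);

    cv::merge(ch, lab);
    cv::Mat out;
    cv::cvtColor(lab, out, cv::COLOR_Lab2BGR);        // back to BGR for projection
    return out;
}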
C. Tag Map Design
Fig. 4 shows how the position information is embedded in an image. The tag map uses the ENU coordinate system, whose origin is at the center of the map. The image is a black-and-white map covered with fiducial markers, and AprilTag is chosen here. Note that the tag density is a balance among the projector’s specifications (resolution and size of the display area), the flight conditions, and the camera’s properties (resolution and angle of view). If the tags are too large or too far apart, only a few tags are captured by the camera, and the demodulation process may fail when the shadow of the quadrotor occludes the tags. If the tags are too small or too close together, the error of corner detection increases, reducing localization accuracy. Finally, the AprilTag map has 9×9 tags evenly distributed on a picture with a resolution of 1920×2160. The tags are placed from left to right and from bottom to top.
Algorithm 1 Modulation
Input: 1) f_i, the frequency of the input video; 2) V_i, a video stream at frequency f_i; 3) M, an AprilTag map; 4) ∆L, the lightness change; 5) w×h, the resolution of the output video; 6) f_o, the frequency of the output video
Output: V_o, a modulated video
1: T_mask = M[white ← 1, black ← −1, others ← 0];
2: while extract a frame F from V_i successfully do
3:   F ← resize(F, [w×h]);
4:   F ← convertColor(F, RGB2CIELAB);
5:   N ← f_o / f_i;
6:   F = {F_i}, (i = 1, 2, ..., N) ⇐ duplicate(F) for N times;
7:   for F_i ∈ F do
8:     F_i[L] ← F_i[L] + ∆L · T_mask, where [L] indicates the lightness channel;
9:     F_i[L > 255] ← 255 and F_i[L < 0] ← 0;
10:    ∆L ← −1 · ∆L;
11:    F_i ← convertColor(F_i, CIELAB2RGB);
12:  end for
13:  V_o ⇐ F.
14: end while
Flight experiments have demonstrated that this configuration
balances detection accuracy and robustness.
The modulation algorithm is summarized in Algorithm 1.
After this process, the projector provides a suitable environ-
ment for the receivers to extract their locations.
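For completeness, the surrounding loop of Algorithm 1 could be sketched as follows, reusing the embedTagPlane() helper sketched in Section IV-B. This is a hedged illustration under our own assumptions (grayscale tag-map thresholds, a frame duplication factor N = f_o/f_i), not the authors’ code.

#include <opencv2/opencv.hpp>

cv::Mat embedTagPlane(const cv::Mat&, const cv::Mat&, double);  // sketched in Section IV-B

// Step 1 of Algorithm 1: build the +/-1 mask from a grayscale AprilTag map image.
cv::Mat buildTagMask(const cv::Mat& tagMapGray)
{
    cv::Mat mask(tagMapGray.size(), CV_32F, cv::Scalar(0));
    mask.setTo(1.0f,  tagMapGray > 200);   // white tag pixels  -> +1
    mask.setTo(-1.0f, tagMapGray < 50);    // black tag pixels  -> -1
    return mask;                           // everything else stays 0
}

// Steps 2-14: duplicate every input frame N times with alternating deltaL.
void modulateVideo(cv::VideoCapture& in, cv::VideoWriter& out,
                   const cv::Mat& mask, double deltaL, int N)
{
    cv::Mat frame;
    while (in.read(frame)) {
        cv::resize(frame, frame, mask.size());          // match the tag-map resolution
        for (int i = 0; i < N; ++i) {
            out.write(embedTagPlane(frame, mask, deltaL));
            deltaL = -deltaL;                            // Manchester-like sign flip
        }
    }
}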
V. DEMODULATION
The demodulation process first detects the tags’ IDs and their corner coordinates in the 2D image plane. Then the tags’ 3D world corner coordinates are calculated from the IDs, the configuration of the tag map, and the ratio from pixels to actual size. Note that this ratio must be measured before flights once the projector is fixed.
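As a rough illustration (our own calculation based on the display reported in Section VII, where a 1920×2160-pixel screen measures 2.17 m × 2.47 m), the ratio would be about 2.17/1920 ≈ 1.13 mm per pixel horizontally and 2.47/2160 ≈ 1.14 mm per pixel vertically; the actual value should still be measured on site, since it depends on the projection geometry.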
Official tools are available to detect many tag families. However, when the markers come from the subtraction of two successive frames, the quadrotor’s movement causes many long and narrow lines at the edges of image patterns. These lines break the continuity of the tags’ outer black borders, causing detection failure. Therefore, we first propose a sample-based image alignment method, then a preprocessing step to prepare the images, and finally use the open-source tool AprilTag 3 for detection. The intermediate steps are shown in Fig. 5.
A. Image Alignment
The frequency of the high-speed camera (120 FPS) is
higher than the flicker frequency of the modulated video (60
FPS). Thus, after converting the color space from RGB to
CIELAB, tags can be obtained by subtracting two successive
frames in the lightness channel. However, the movement of
quadrotors often causes sharp edges at the boundary of image
texture or texts, as shown in Fig. 5b. These edges often cut
off the tags. For AprilTag 3, an important step is to find
(a) Original image (b) Subtraction of two successive frames (c) Image alignment (d) Preprocessing (e) Tag detection
Fig. 5. Intermediate steps of extracting fiducial tags from a raw image (a). First, the tags appear in (b) after the subtraction of two successive frames. Then, disturbances such as text and lines are eliminated by image alignment, as shown in (c). Next, (d) displays the preprocessing result. Each tag forms a continuous boundary, ensuring the success of tag detection in (e). Although a shadow affects tag detection in the lower-left corner, the other tags are still detected successfully. This shadow tolerance validates the configuration of the tag map.
the contour between white and black components [17], so this discontinuity of the black borders greatly hinders detection. Therefore, the two frames need to be shifted by a few pixels vertically and horizontally to eliminate as many edges as possible. The influence on localization accuracy can be ignored because the shifting distance is no more than 5 pixels.
In this work, a sample-based method is proposed to determine the shifting distance, achieving a trade-off between computation speed and alignment accuracy. For two images with a w×h resolution, N columns are sampled from each of the two frames and denoted as T_i^{pre} (i = 1, ..., N) and T_i^{now} (i = 1, ..., N). The columns are equally spaced along the X-axis. The number of vertically shifted pixels, m_y^*, is the value that minimizes the total difference between all T_i^{pre} and T_i^{now}:

m_y^* = \arg\min_{|m_y| \le b}
\begin{cases}
\sum_{i=1}^{N} \left\| T_i^{pre}(0 : h - m_y) - T_i^{now}(m_y : h) \right\|_1, & m_y \ge 0 \\
\sum_{i=1}^{N} \left\| T_i^{pre}(-m_y : h) - T_i^{now}(0 : h + m_y) \right\|_1, & m_y < 0
\end{cases}
\quad (1)

where b is the maximum shifting distance set by the user, and (0 : h − m_y) denotes the elements from 0 to h − m_y − 1. Balancing computational speed and effectiveness, N = 3 is chosen. The horizontal shifting value is calculated by a similar method.
After computing the shifting values, the two shifted frames are subtracted to obtain a new image. Unfortunately, shifting also produces straight lines at the image borders, which must be eliminated. The final aligned image is shown in Fig. 5c. Compared with the unshifted Fig. 5b, the edges in Fig. 5c are reduced considerably, demonstrating the effectiveness of the algorithm.
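A hedged C++/OpenCV sketch of this column-sampling search (our own illustration of Eq. (1); only the vertical shift is shown, and function and variable names are assumptions) is:

#include <opencv2/opencv.hpp>
#include <limits>

// Estimate the vertical shift m_y between the lightness channels of two frames
// by minimizing the L1 difference over N equally spaced sampled columns (Eq. (1)).
int estimateVerticalShift(const cv::Mat& pre, const cv::Mat& now, int b, int N = 3)
{
    const int h = pre.rows, w = pre.cols;
    int bestShift = 0;
    double bestCost = std::numeric_limits<double>::max();

    for (int m = -b; m <= b; ++m) {
        double cost = 0.0;
        for (int i = 0; i < N; ++i) {
            const int x = (i + 1) * w / (N + 1);          // equally spaced column index
            cv::Mat colPre = pre.col(x), colNow = now.col(x);
            cv::Mat a = (m >= 0) ? colPre.rowRange(0, h - m) : colPre.rowRange(-m, h);
            cv::Mat c = (m >= 0) ? colNow.rowRange(m, h)     : colNow.rowRange(0, h + m);
            cost += cv::norm(a, c, cv::NORM_L1);          // L1 difference of the overlap
        }
        if (cost < bestCost) { bestCost = cost; bestShift = m; }
    }
    return bestShift;  // shift one frame by this many rows before subtracting
}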
B. Preprocessing
Preprocessing includes normalization, filtering, binariza-
tion, and morphology operations. Among these steps, the
most critical parameter is an appropriate threshold for bina-
rization. Since the edges caused by motion always appear in
pairs of black and white, and the tags have approximately
equal areas of black and white pixels, the median of the
image is chosen as the threshold. After thresholding, the morphological ’OPEN’ operation is first applied to further remove noise, and the ’CLOSE’ operation is then used to connect the borders of the tags, as shown in Fig. 5d.
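A sketch of this preprocessing chain in C++/OpenCV might look like the following; the kernel sizes and the mean-filter window are our assumptions, not the authors’ exact parameters.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// Turn the aligned, single-channel difference image into a clean binary image:
// normalize, mean-filter, threshold at the image median, then OPEN to remove
// noise and CLOSE to reconnect the tag borders.
cv::Mat preprocess(const cv::Mat& diff)
{
    cv::Mat img;
    cv::normalize(diff, img, 0, 255, cv::NORM_MINMAX, CV_8U);
    cv::blur(img, img, cv::Size(3, 3));                        // mean filter

    // Median of all pixels as the binarization threshold.
    std::vector<uchar> px(img.begin<uchar>(), img.end<uchar>());
    std::nth_element(px.begin(), px.begin() + px.size() / 2, px.end());
    const double medianVal = px[px.size() / 2];
    cv::threshold(img, img, medianVal, 255, cv::THRESH_BINARY);

    const cv::Mat k = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
    cv::morphologyEx(img, img, cv::MORPH_OPEN, k);             // remove speckle noise
    cv::morphologyEx(img, img, cv::MORPH_CLOSE, k);            // close broken borders
    return img;
}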
C. Tag Detection
The article [26] has extensively compared four widely used tag detection algorithms (AprilTag 3, AprilTags C++, ArUco, and ArUco OpenCV) in terms of localization accuracy and processing time. According to that comparison, AprilTag 3 is the best choice in most cases. Therefore, AprilTag 3 is selected as the tag detection tool, and tag36h11 is chosen as the tag family. The detection result is shown in Fig. 5e.
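For reference, feeding the binary image to AprilTag 3 through its C interface might look like the sketch below (our illustration following the library’s published API; error handling and detector tuning are omitted).

#include <opencv2/opencv.hpp>
extern "C" {
#include "apriltag.h"
#include "tag36h11.h"
}

// Detect tag36h11 markers in a single-channel 8-bit image.
void detectTags(const cv::Mat& gray)
{
    apriltag_detector_t* td = apriltag_detector_create();
    apriltag_family_t* tf = tag36h11_create();
    apriltag_detector_add_family(td, tf);

    // Wrap the OpenCV buffer without copying (assumes a continuous image).
    image_u8_t im = { gray.cols, gray.rows, gray.cols, gray.data };

    zarray_t* detections = apriltag_detector_detect(td, &im);
    for (int i = 0; i < zarray_size(detections); ++i) {
        apriltag_detection_t* det;
        zarray_get(detections, i, &det);
        // det->id, det->hamming, and det->p[4][2] (corner pixel coordinates)
        // are what the demodulation step passes on to pose estimation.
    }
    apriltag_detections_destroy(detections);
    tag36h11_destroy(tf);
    apriltag_detector_destroy(td);
}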
The demodulation algorithm is summarized in Algorithm
2. The 2D coordinates and the 3D coordinates of tag corners
are used in the next section.
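The lookupMap step in Algorithm 2 then amounts to converting each detected tag ID into the 3D world coordinates of its four corners. Below is a minimal C++ sketch under our own assumed layout parameters (IDs increasing left to right and bottom to top, uniform spacing, ENU origin at the map center); the actual spacing and tag size depend on the projected map.

#include <array>
#include <opencv2/core.hpp>

// Illustrative tag-map configuration (placeholder values, not the authors' setup).
struct TagMapConfig {
    int cols = 9, rows = 9;   // 9 x 9 tag grid
    double spacingPx = 0;     // distance between tag centers, in pixels
    double tagSidePx = 0;     // tag side length, in pixels
    double r = 0;             // meters per pixel, measured once the projector is fixed
};

// World coordinates (ENU, meters, Z = 0) of the four corners of tag `id`.
std::array<cv::Point3d, 4> tagCornersWorld(int id, const TagMapConfig& c)
{
    const int col = id % c.cols;              // left to right
    const int row = id / c.cols;              // bottom to top
    const double cx = (col - (c.cols - 1) / 2.0) * c.spacingPx * c.r;
    const double cy = (row - (c.rows - 1) / 2.0) * c.spacingPx * c.r;
    const double s  = 0.5 * c.tagSidePx * c.r;
    // Corner order must match the order returned by the tag detector.
    return { cv::Point3d(cx - s, cy - s, 0.0), cv::Point3d(cx + s, cy - s, 0.0),
             cv::Point3d(cx + s, cy + s, 0.0), cv::Point3d(cx - s, cy + s, 0.0) };
}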
VI. POSE ESTIMATION
The pose estimation process based on fiducial markers is called a Solve PnP problem. It means solving for the position and orientation of a camera given the 3D coordinates of n points in space and their corresponding 2D coordinates in the image plane. The core of this problem is to solve (2):

z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\left[ R \mid \vec{T} \right]
\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \quad (2)
Algorithm 2 Demodulation
Input: 1) F^now, an RGB frame from the camera; 2) F^pre_L, the lightness channel of the previous frame; 3) C_map, the configuration of the AprilTag map; 4) r, the ratio from pixels to real size.
Output: R, a result list that contains the id, corner points, hamming distance, and world coordinates of each tag.
1: while receive a new F^now do
2:   F^now ← convertColor(F^now, RGB2CIELAB);
3:   F^now_L ← F^now[L], where [L] indicates the lightness channel;
4:   F^now'_L ← align(F^now_L, F^pre_L);
5:   F_b ← normalization(F^now'_L − F^pre_L);
6:   F^pre_L ← F^now_L, for the next loop;
7:   F_b ← meanFilter(F_b);
8:   F_b ← threshold(F_b, median(F_b));
9:   F_b ← morphologyEx(F_b, OPENING);
10:  F_c ← morphologyEx(F_b, CLOSING);
11:  {id, p_i, ham} ← detectTags(F_c);
12:  {p_w} ← lookupMap({id}, C_map, r);
13:  Return R ← {id, p_i, ham, p_w}.
14: end while
that is

z_c \cdot p_i = M_1 \cdot M_2 \cdot p_w, \quad (3)

where z_c is the scale factor, [u, v]^T and p_i are the 2D image coordinates, f_x and f_y are the scaled focal lengths, [u_0, v_0]^T is the principal point, M_1 is the intrinsic camera matrix, R and \vec{T} are the rotation matrix and translation vector from the world frame to the camera frame, M_2 is the 3D pose matrix, and [x_w, y_w, z_w]^T and p_w refer to the 3D world coordinates.
In (3), the coordinates p_i and p_w are calculated through the demodulation algorithm in the last section, and the camera matrix M_1 can be obtained by camera calibration. Considering performance and stability, we use the IPPE algorithm [19] to compute the pose matrix M_2. Note that the results from the solvePnP() function in OpenCV must be inversely transformed using (4):

\left[ {}^{w}_{c}R \mid {}^{w}\vec{T} \right] = \left[ R^T \mid -R^T \cdot \vec{T} \right], \quad (4)

where {}^{w}_{c}R and {}^{w}\vec{T} are the rotation and translation from the camera frame to the world frame, respectively.
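As a sketch of this step, the following C++/OpenCV fragment (our illustration built on the public OpenCV API; the matched points come from the demodulation output) computes the pose and applies the inverse transform of (4).

#include <opencv2/opencv.hpp>
#include <vector>

// Estimate the camera pose in the world frame from matched 3D/2D tag corners.
// objectPts: corners in world coordinates (p_w); imagePts: pixel corners (p_i).
void estimatePose(const std::vector<cv::Point3d>& objectPts,
                  const std::vector<cv::Point2d>& imagePts,
                  const cv::Mat& K, const cv::Mat& distCoeffs)
{
    cv::Mat rvec, tvec;
    // IPPE requires coplanar points, which holds for tags projected on the ground.
    cv::solvePnP(objectPts, imagePts, K, distCoeffs, rvec, tvec,
                 false, cv::SOLVEPNP_IPPE);

    cv::Mat R;
    cv::Rodrigues(rvec, R);            // world -> camera rotation

    // Invert the transform as in (4): pose of the camera expressed in the world frame.
    cv::Mat R_wc = R.t();              // camera -> world rotation
    cv::Mat t_wc = -R.t() * tvec;      // camera position in world coordinates
    // R_wc and t_wc give the quadrotor camera's orientation and position.
}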
VII. EXPERIMENTS
A. Hardware System
We build an experimental system to test the performance of IPT. The system is illustrated in Fig. 6, including hardware for both the sender and receiver parts. The sender part uses two NEC CR3400HL laser projectors to form a 1920×2160 screen with a physical size of 2.17 m × 2.47 m. An NVIDIA 1080 Ti graphics card and the NVIDIA Surround function are utilized to output a 1920×2160, 60-Hz video stream. The receiver part is a 270-mm-wheelbase quadrotor with a high-speed camera.
Fig. 6. The IPT AR flight system, with UWB and OptiTrack added for comparison. The UWB system includes one tag on the quadrotor and four base anchors. OptiTrack uses a dozen cameras to provide submillimeter-level positions.
To eliminate the rolling-shutter effect, in which two successive projected frames mix within one captured image, a 120-FPS color global-shutter camera is selected. The onboard computer is a Raspberry Pi 4B with a Broadcom BCM2711 quad-core CPU running at 1.5 GHz.
The system also involves a Nooploop UWB system (25
fps) and an OptiTrack system (120 fps) for performance
comparison. We carry out several indoor flights on this
system to test the performance of IPT.
B. Experimental Setup
The experiment consists of two stages. The first stage aims
to test accuracy. For convenience, all pictures shot by the
camera are stored and then post-processed by a personal
computer. We choose a 15-second video as the test unit,
including 1800 images in total. The second stage aims to
test processing speed, and hence the pictures are processed
onboard for real-time computation.
To simulate a path-planning mission, we choose the path-searching process on Amap as the video content and record it for 30 seconds in advance. This screen recording is looped as the sender video. In the experiment, the quadrotor is flown by a pilot along a rectangular path at a height of about 0.8 m.
C. Results
1) Visual Effect: We first invite three colleagues to evaluate the visual effect of IPT. The invisibility of the markers is closely related to the changing light intensity ∆L. When ∆L = 3, they hardly notice the markers. When ∆L = 4, they observe that something square is slightly blinking but cannot distinguish what it is. When ∆L = 5, the tag map is obvious. We choose ∆L = 4 for the following tests as a balance between invisibility and demodulation performance.
2) Localization Accuracy: During the experiment, 625 data points are sampled successfully by the IPT method. To show the raw performance, the data are not filtered. After aligning the data from OptiTrack, IPT, and UWB in time and space, we plot the comparison in Fig. 7. OptiTrack is set as the ground truth since it offers submillimeter-level positions and 0.01-degree-level orientations. For quantitative analysis, the Mean Absolute Error (MAE) and Standard
[Fig. 7 plots: (a) Position comparison (X, Y, Z in m versus t in s for OptiTrack, IPT, and UWB); (b) Orientation comparison (yaw, pitch, roll in deg versus t in s for OptiTrack and IPT); (c) Positioning error of IPT and velocity varying over time (2D distance error in m and 2D velocity in m/s versus t in s).]
Fig. 7. The comparison of OptiTrack, IPT, and UWB data for a 15-second indoor flight. In (a), IPT achieves better accuracy than UWB in the Z direction but oscillates more in the X and Y directions. In (b), UWB cannot measure orientation, while IPT obtains an acceptable yaw angle and oscillating curves in pitch and roll. Finally, (c) illustrates the relationship between the horizontal positioning error and the velocity.
Deviation (STD) of UWB and IPT are calculated in Table I.
From the table and Fig. 7a, IPT attains better position data in the Z direction than UWB but is much more oscillatory in the X and Y directions. For orientation, UWB cannot measure this state, while IPT obtains an acceptable yaw angle but volatile pitch and roll angles. Note that the oscillations in the horizontal positions mainly come from the last step, Pose Estimation, because the severe oscillations in pitch and roll introduce fluctuations in the inverse coordinate transformation. The relation between the horizontal positional error and the velocity is shown in Fig. 7c, in which each blue point indicates a successful localization. Observing the time interval of 4-7 s,
TABLE I
POSITIONAL ERROR AND ROTATIONAL ERROR (METRICS: MAE / STD)

        X Error [m]       Y Error [m]        Z Error [m]
UWB     0.0305 / 0.0354   0.0232 / 0.0251    0.0943 / 0.1128
IPT     0.0690 / 0.0886   0.0535 / 0.0674    0.0230 / 0.0263

        Yaw Error [deg]   Pitch Error [deg]  Roll Error [deg]
UWB     —                 —                  —
IPT     0.352 / 0.460     2.659 / 3.657      3.342 / 4.305
TABLE II
THE COMPARISON OF THREE LOCALIZATION METHODS

                     OptiTrack      UWB           IPT
Price¹               CNY 190,000    CNY 35,000    CNY 30,000
Position Accuracy    sub-mm level   cm level      cm level
Orientation Info     YES            NO            YES
Anti-Interference    YES            NO            YES
No Calibration       NO             NO            YES
the algorithm collects many samples even at a relatively high
velocity, which verifies the image alignment algorithm.
3) Computational Speed: A C++ implementation of IPT achieves real-time processing at 10 FPS on average for a 640×360 resolution. This rate underlines its feasibility for quadrotor flight control.
The comparison of the OptiTrack, UWB, and IPT methods for an AR robotics platform is listed in Table II. IPT achieves centimeter-level position and orientation measurements with advantages in price and convenience, making it a competitive choice for localization in indoor AR environments.
VIII. CONCLUSION
We proposed IPT, an indoor localization method based on
human-invisible projected fiducial tags. This method utilized
the visual characteristics of human eyes to hide markers. To
test the performance, we designed an AR quadrotor flight
platform, of which the structure consists of the sender part
and receiver part. Indoor flight experiments were conducted
to evaluate accuracy and speed of IPT. The result presented
centimeter-level accuracy and a frequency of ten FPS. The
economical and quick-start features made IPT an appropriate
solution for AR robotics platforms. In future work, we plan
to integrate IPT with IMU to obtain more accurate and
high-frequency pose information. We will also explore the
effect of IPT on different videos and the relationship between
movement and performance.
ACKNOWLEDGMENTS
We sincerely thank anonymous reviewers for their critical
reading and helpful suggestions. This work is supported by
the Science and Technology Innovation 2030-Key Project
of “New Generation Artificial Intelligence” under Grant
2018AAA0102305, the Zhejiang Provincial Natural Science
Foundation of China under Grant No.LGG22F030025, and
the National Natural Science Foundation of China under
Grants 61803014, 61922008, and 61873011.
¹This price covers the devices for both localization and visualization but not personal computers or robots.
REFERENCES
[1] D. Floreano and R. J. Wood, “Science, technology and the future of
small autonomous drones,” Nature, vol. 521, no. 7553, pp. 460–466,
May 2015.
[2] Z. Makhataeva and H. A. Varol, “Augmented reality for robotics: A
review,” Robotics, vol. 9, no. 2, June 2020.
[3] A. Reina, M. Salvaro, G. Francesca, L. Garattoni, C. Pinciroli,
M. Dorigo, and M. Birattari, “Augmented reality for robots: Virtual
sensing technology applied to a swarm of e-pucks,” in Proceedings
of NASA/ESA Conference on Adaptive Hardware and Systems, June
2015.
[4] S. Omidshafiei, A.-A. Agha-Mohammadi, Y. F. Chen, N. K. Ure, S.-
Y. Liu, B. T. Lopez, R. Surati, J. P. How, and J. Vian, “Measurable
augmented reality for prototyping cyberphysical systems: A robotics
platform to aid the hardware prototyping and performance testing of
algorithms,” IEEE Control Systems Magazine, vol. 36, no. 6, pp. 65–
87, Dec. 2016.
[5] P. Foehn, A. Romero, and D. Scaramuzza, “Time-optimal planning
for quadrotor waypoint flight,” Science Robotics, vol. 6, no. 56, July
2021.
[6] W. Shule, C. M. Almansa, J. P. Queralta, Z. Zou, and T. Westerlund,
“UWB-based localization for multi-UAV systems and collaborative
heterogeneous multi-robot systems,” Procedia Computer Science, vol.
175, pp. 357–364, Jan. 2020.
[7] Z. Wang, S. Li, Z. Zhang, F. Lv, and Y. Hou, “Research on UWB
positioning accuracy in warehouse environment,” Procedia Computer
Science, vol. 131, pp. 946–951, Jan. 2018.
[8] J. A. Preiss, W. Honig, G. S. Sukhatme, and N. Ayanian, “Crazyswarm:
A large nano-quadcopter swarm,” in Proceedings of IEEE Interna-
tional Conference on Robotics and Automation, May 2017, pp. 3299–
3304.
[9] K. Yu, K. Wen, Y. Li, S. Zhang, and K. Zhang, “A Novel NLOS
Mitigation Algorithm for UWB Localization in Harsh Indoor Environ-
ments,” IEEE Transactions on Vehicular Technology, vol. 68, no. 1,
pp. 686–699, Jan. 2019.
[10] M. Beul, D. Droeschel, M. Nieuwenhuisen, J. Quenzel, S. Houben,
and S. Behnke, “Fast autonomous flight in warehouses for inventory
applications,” IEEE Robotics and Automation Letters, vol. 3, no. 4,
pp. 3121–3128, Oct. 2018.
[11] B. Xing, Q. Zhu, F. Pan, and X. Feng, “Marker-based multi-sensor
fusion indoor localization system for micro air vehicles,” Sensors,
vol. 18, no. 6, June 2018.
[12] F. Zafari, A. Gkelias, and K. K. Leung, “A survey of indoor lo-
calization systems and technologies,” IEEE Communications Surveys
Tutorials, vol. 21, no. 3, pp. 2568–2599, Apr. 2019.
[13] P. Serra, R. Cunha, T. Hamel, D. Cabecinhas, and C. Silvestre, “Landing of a quadrotor on a moving target using dynamic image-based visual servo control,” IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1524–1535, Dec. 2016.
[14] Y. Wang, Y. Ji, D. Liu, Y. Tamura, H. Tsuchiya, A. Yamashita, and H. Asama, “ACMarker: Acoustic camera-based fiducial marker system in underwater environment,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5018–5025, Oct. 2020.
[15] S. Garrido-Jurado, R. Muñoz-Salinas, F. Madrid-Cuevas, and M. Marín-Jiménez, “Automatic generation and detection of highly reliable fiducial markers under occlusion,” Pattern Recognition, vol. 47, no. 6, pp. 2280–2292, June 2014.
[16] F. J. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer, “Speeded up detection of squared fiducial markers,” Image and Vision Computing, vol. 76, pp. 38–47, Aug. 2018.
[17] M. Krogius, A. Haggenmiller, and E. Olson, “Flexible layouts for fiducial tags,” in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, Nov. 2019, pp. 1898–1903.
[18] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, Feb. 2009.
[19] T. Collins and A. Bartoli, “Infinitesimal plane-based pose estimation,” International Journal of Computer Vision, vol. 109, no. 3, pp. 252–286, Sept. 2014.
[20] G. Terzakis and M. Lourakis, “A consistently fast and globally optimal solution to the perspective-n-point problem,” in Proceedings of European Conference on Computer Vision, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., Nov. 2020, pp. 478–494.
[21] K. Zhang, Y. Zhao, C. Wu, C. Yang, K. Huang, C. Peng, Y. Liu,
and Z. Yang, “ChromaCode: A fully imperceptible screen-camera
communication system,” IEEE Transactions on Mobile Computing,
vol. 20, no. 3, pp. 861–876, Mar. 2021.
[22] T. Li, C. An, X. Xiao, A. T. Campbell, and X. Zhou, “Real-time screen-
camera communication behind any scene,” in Proceedings of the 13th
Annual International Conference on Mobile Systems, Applications, and
Services, May 2015, pp. 197–211.
[23] V. Nguyen, Y. Tang, A. Ashok, M. Gruteser, K. Dana, W. Hu, E. Wen-
growski, and N. Mandayam, “High-rate flicker-free screen-camera
communication with spatially adaptive embedding,” in Proceedings
of the 35th Annual IEEE International Conference on Computer
Communications, ser. INFOCOM, Apr. 2016, pp. 1–9.
[24] E. Simonson and J. Brozek, “Flicker fusion frequency: Background
and applications,” Physiological Reviews, vol. 32, no. 3, pp. 349–378,
July 1952.
[25] A. Wang, Z. Li, C. Peng, G. Shen, G. Fang, and B. Zeng, “In-
Frame++: Achieve simultaneous screen-human viewing and hidden
screen-camera communication,” in Proceedings of the 13th Annual
International Conference on Mobile Systems, Applications, and Ser-
vices, 2015, pp. 181–195.
[26] J. Kallwies, B. Forkel, and H.-J. Wuensche, “Determining and improv-
ing the localization accuracy of AprilTag detection,” in Proceedings
of IEEE International Conference on Robotics and Automation, May
2020, pp. 8288–8294.