Autonomous Search of Semantic Objects in Unknown Environments
Zhentian Qian1, Jie Fu2, and Jing Xiao1
Abstract—This paper addresses the problem of enabling
a robot to search for a semantic object in an unknown
and GPS-denied environment. For the robot in the unknown
environment to detect and find the target object, it must
perform simultaneous localization and mapping (SLAM) at
both geometric and semantic levels using its onboard sensors
while planning and executing its motion based on the ever-
updated SLAM results. In other words, the robot must be
able to conduct simultaneous localization, semantic mapping,
motion planning, and execution in real-time in the presence of
sensing and motion uncertainty. This is an open problem as it
combines semantic SLAM based on perception and real-time
motion planning and execution under uncertainty. Moreover,
the goals of robot motion change on the fly depending on
whether and how the robot can detect the target object. We
propose a novel approach to tackle the problem, leveraging
semantic SLAM, Bayesian Networks, Markov Decision Process,
and real-time dynamic planning. The results demonstrate the
effectiveness and efficiency of our approach.
Index Terms—Reactive and Sensor-Based Planning, Semantic SLAM, Semantic Scene Understanding, Planning under Uncertainty.
I. INTRODUCTION
This paper is motivated by the problem of searching an
unknown environment for some target object, which is a
fundamental problem in many application scenarios from
search and rescue to reconnaissance to elderly care.
A. Related Work
There is a significant amount of literature on simultaneous
localization and mapping (SLAM) for robot mapping and
navigation in an unknown environment based on perception,
such as visual and odometry sensing. SLAM methods model
and reduce sensing uncertainties in mapping the unknown
environment and localizing the robot in it at the same time.
Semantic SLAM and active SLAM are particularly relevant.
Semantic SLAM methods are focused on representing, mapping, and localizing 3D objects, using different representations of objects such as meshes [1], quadrics [2], [3], cuboids [4], and OctoMap [5].
Active SLAM aims to choose the optimal trajectory for
a robot to improve map and localization accuracy and
maximize the information gain. The localization accuracy
is typically measured by metrics such as A-opt (sum of the
covariance matrix eigenvalues) [6], [7], D-opt (product of
covariance matrix eigenvalues) [8], E-opt (largest covariance
1Zhentian Qian* and Jing Xiao are with the Robotics Engineer-
ing Department, Worcester Polytechnic Institute, Worcester, MA, USA
zqian@wpi.edu, jxiao2@wpi.edu
2Jie Fu is with the Department of Electrical and Computer Engineering,
University of Florida, Gainesville, FL, USA fujie@ufl.edu
matrix eigenvalue) [9]. Information gain is measured in met-
rics such as joint entropy [10] and expected map information
[11].
However, neither semantic nor active SLAM considers
performing tasks other than mapping an unknown environ-
ment. The planning aspect is not addressed for semantic
SLAM and is downplayed in active SLAM with simple
methods such as A* [8].
Robot path and motion planning is one of the most studied
areas in robotics. The basic objective is to find an optimal and
collision-free path for a robot to navigate to some goals in
an environment. Many traditional path-planning approaches
assume a more or less known environment, i.e., the robot
already has a map and models of objects [12]. On the
other hand, real-time, sensing-based planning in an unknown
environment still largely remains a challenge [13].
Earlier work includes grid-based planning approaches such
as D* [14] and D* Lite [15], sampling-based approaches
such as ERRT [16] and DRRT [17], and adaptive approaches
such as [18]. These approaches consider the environment
dynamic and partially known, but assume the goal position is
known, disregard the uncertainties in sensing, the robot pose,
and dynamics, and do not consider semantic information.
Recently, various techniques based on partially observable
Markov decision processes (POMDPs) have been developed
[19]–[21] to incorporate sensing and robot motion uncer-
tainties into planning in partially observable environments.
However, POMDP suffers from the curse of dimensionality
and is computationally expensive, particularly when the
state space is large. For the POMDP to scale, high-level abstraction must be made of the state space, for example, treating objects [20] or rooms [19] as state variables. The downside is that highly abstracted models can lose touch with reality. To bypass this problem, some researchers turn to deep learning to learn semantic priors and make predictions about unobserved regions [22], [23]. These methods tend to suffer from poor generalization.
Next-best view planning is another highly related topic,
designed for efficient visual exploration of unknown space.
Unlike active SLAM, approaches for next-best view planning
typically do not consider robot localization uncertainty. A
next-best view planner starts by sampling a set of views in
the environment, evaluates the estimated information gain
for each view, and selects the view with the maximum
information gain as the next view [24]. Different methods
differ in the sampling methods (uniform sampler, frontier-
based coverage sampler [25]), information gain (path costs
are incorporated in [25], [26]), and the selection of the next
view (receding horizon scheme in [27], Fixed Start Open
Traveling Salesman Problem (FSOTSP) solver in [25]).
However, existing planning methods in unknown envi-
ronments usually do not consider real-time results from
SLAM with embedded and changing uncertainties, such as
the robot’s pose, the metric map, and the semantic map
(generated by semantic SLAM). Only the metric map was
used by next-best view planning approaches [24]–[26].
B. Approach and Contributions
The problem we focus on in this paper, i.e., searching for a target object in an unknown and GPS-denied environment, requires real-time planning and execution of the robot's motion to perform the following two necessary tasks concurrently, which are intertwined and mutually facilitating:
- simultaneous localization and mapping at both semantic and geometric levels, to facilitate the search for the target object;
- real-time planning and execution of search motion, either toward the target object based on semantic SLAM results, or to expand and improve the semantic SLAM results in order to find and identify the target object.
This paper addresses these novel challenges by leveraging the probabilistic representation of (constantly improving) semantic SLAM outcomes, a Bayesian network [28] representation of semantic knowledge relating the target object to surrounding objects, and a Markov decision process (MDP) formulation. It introduces a novel, adaptive planner that synergizes:
1) active semantic SLAM with continually improved SLAM results,
2) real-time motion planning under sensing and motion uncertainty in the partially observed world represented by the ever-updated semantic SLAM results, and
3) real-time determination of (intermediate) goals of motion on the fly, based on ever-improving semantic knowledge.
II. SYSTEM OVERVIEW
Fig. 1 provides an overview of our system. The robot starts by scanning the environment with its onboard RGB-D sensor. The color and depth images from the RGB-D sensor are fed into the geometric and semantic SLAM modules. The SLAM modules then update the robot's location, the (observed) free and occupied space, and the objects detected in the environment; the updated localization and map information are fused into a single map $E_t$, where the subscript $t$ stands for the time instance.

Next, the robot determines a goal of motion and plans a path based on $E_t$, to either explore new regions or inspect the candidate target object. It first checks whether it has reached the current goal. If not, the robot executes the planned path. Otherwise, either the task is complete, or the robot forms a new goal.
In the goal-forming process, based on $E_t$ and additional semantic information about the target object $o_T$, the robot decides whether it should search for $o_T$ in the explored space or further observe a detected object that is likely $o_T$. It then updates the optimal policies for reaching the newly formed goal. During the entire operation, the robot continuously checks whether it has found the target object instance $o_T$ with high confidence. If not, the robot repeats the above process until it reaches the time limit or has explored all regions.
Fig. 1: System Overview
Fig. 2: Grid map at time t. The grey, white, and black areas
represent unknown, free, and occupied regions.
III. MAPPING AND LOCALIZATION
In this section, we describe how geometric and semantic SLAM are achieved and how the information from the different levels is fused into a single map $E_t$.
A. Geometric SLAM
Geometric SLAM and semantic SLAM modules run in
parallel in our system. We employ the RTAB-MAP [29]
algorithm for geometric SLAM. It generates a grid map $G_t \in \{0, 1, -1\}^{W \times H}$, where $W$ and $H$ are the width and height of the grid map. The values 0, 1, and -1 in the grid map represent free, occupied, and unknown space, respectively, as shown in Fig. 2. The geometric SLAM module also estimates the robot pose $(\mu_{p,t}, \Sigma_{p,t})$, where $\mu_{p,t}$ and $\Sigma_{p,t}$ are the mean and covariance of the robot pose at time instance $t$.
We use off-the-shelf tools [30] to segment the grid map into geometric rooms: a room is defined as any space enclosed within a number of walls to which entry is possible only by a door or other dividing structure that connects it either to a hallway or to another room. Every grid cell on the grid map $G_t$ is assigned a corresponding room ID: $R_t \in \mathbb{N}^{W \times H}$. An example is provided in Fig. 3.
B. Semantic SLAM
We adapt the system introduced in [31] for semantic SLAM. At time instance $t-1$, the position estimate $m_{i,t-1} \in \mathbb{R}^2$ for the semantic object $o_i$ is:

$$bel(m_{t-1}) \sim \mathcal{N}(\mu_{t-1}, \Sigma_{t-1}), \quad (1)$$
Fig. 3: Segmented geometric rooms at time t. The two
segmented rooms are encoded in different colors.
where $bel(\cdot)$ stands for the belief over a variable. Note that for simplicity, the subscript $i$ is dropped in (1), as in (2)-(4). At time instance $t$, the robot pose $x_t \in \mathbb{R}^2$ estimated by geometric SLAM is $bel(x_t) \sim \mathcal{N}(\mu_{p,t}, \Sigma_{p,t})$.
If the semantic object $o_i$ is detected in the color image $I_t$, a range-bearing measurement $z_t$ is generated based on the depth information of $o_i$ from the depth image. The range-bearing measurement noise $\delta_t$ is:

$$\delta_t \sim \mathcal{N}(0, \Sigma_\delta). \quad (2)$$
The covariance of the range-bearing measurement $\Sigma_\delta$ is assumed to be independent of time. The posterior belief $bel(m_t)$ at time $t$ can then be updated using Bayes' theorem:

$$bel(m_t) = p(m \mid z_{1:t}) = \frac{p(z_t \mid m, z_{1:t-1}) \cdot p(m \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})} = \eta \int p(z_t \mid m, x_t) \cdot bel(x_t) \cdot bel(m_{t-1}) \, dx_t, \quad (3)$$
where $\eta$ is a normalizing term.

Substituting the probability density functions of $p(z_t \mid m, x_t)$, $bel(x_t)$, and $bel(m_{t-1})$ into (3), the final result after simplification suggests that the updated posterior belief $bel(m_t)$ can be approximated by a multivariate Gaussian distribution $bel(m_t) \sim \mathcal{N}(\mu_t, \Sigma_t)$, where

$$\Sigma_t = \left( K_1^T \Sigma_\delta^{-1} K_1 + \Sigma_{t-1}^{-1} - K_1^T \Sigma_\delta^{-1} K_2 \Psi K_2^T \Sigma_\delta^{-1} K_1 \right)^{-1},$$
$$\mu_t = \mu_{t-1} + \Sigma_t K_1^T \left( \Sigma_\delta^{-1} - \Sigma_\delta^{-1} K_2 \Psi K_2^T \Sigma_\delta^{-1} \right) \Delta z_t.$$

$\Delta z_t$ is the error between the expected and actual range-bearing measurements. The complete derivation is omitted here.
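For concreteness, the sketch below transcribes the two update formulas above directly into code; the Jacobians $K_1$ and $K_2$ and the pose-marginalization term $\Psi$ are assumed to be given (their derivation follows [31]), and the function and variable names are ours, not the paper's.

```python
import numpy as np

def update_landmark(mu_prev, Sigma_prev, Sigma_delta, K1, K2, Psi, dz):
    """One Bayesian update of a landmark position belief (sketch).

    mu_prev, Sigma_prev : prior mean/covariance of the landmark position
    Sigma_delta         : range-bearing measurement noise covariance
    K1, K2              : measurement Jacobians w.r.t. landmark and robot pose
    Psi                 : term marginalizing out the robot pose uncertainty
    dz                  : innovation between expected and actual measurement
    """
    Sd_inv = np.linalg.inv(Sigma_delta)
    # Information-form covariance update
    info = (K1.T @ Sd_inv @ K1 + np.linalg.inv(Sigma_prev)
            - K1.T @ Sd_inv @ K2 @ Psi @ K2.T @ Sd_inv @ K1)
    Sigma = np.linalg.inv(info)
    # Mean update driven by the innovation dz
    mu = mu_prev + Sigma @ K1.T @ (Sd_inv - Sd_inv @ K2 @ Psi @ K2.T @ Sd_inv) @ dz
    return mu, Sigma
```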
The object class probability distribution $p_t(\cdot)$ is updated using Bayes' theorem:

$$p_t(c) = p(c \mid L_{1:t}) = \frac{p(L_t \mid c, L_{1:t-1}) \cdot p(c \mid L_{1:t-1})}{p(L_t \mid L_{1:t-1})} = \eta \, p(L_t \mid c) \cdot p_{t-1}(c) = \frac{p(L_t \mid c) \cdot p_{t-1}(c)}{\sum_{c' \in C} p(L_t \mid c') \, p_{t-1}(c')}, \quad (4)$$

where $\eta = 1 / p(L_t \mid L_{1:t-1})$ is a normalization constant, and $L_t \in \mathbb{R}^{|C|}$ is the confidence level distribution of an object over the possible classes, returned by an object detector such as YOLOv3 [32] at time $t$. $c \in C$ is one of the possible object classes. $p(L_t \mid c)$ is the object detector uncertainty model, representing the probability of the detector output $L_t$ when the object class is $c$. We use the Dirichlet distribution $Dir(\alpha_c)$ to model this uncertainty, with a different parameter $\alpha_c \in \mathbb{R}^{|C|}$ for each object class $c$.
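A rough sketch of the update in Eq. (4), assuming the Dirichlet detector-uncertainty model described above (all names are hypothetical):

```python
import numpy as np
from scipy.stats import dirichlet

def update_class_belief(p_prev, L_t, alphas):
    """Bayesian update of a per-object class distribution, Eq. (4) (sketch).

    p_prev : prior class probabilities, shape (|C|,)
    L_t    : detector confidence distribution at time t, shape (|C|,);
             must lie strictly inside the probability simplex
    alphas : Dirichlet parameters alpha_c, one row per hypothesized class c
    """
    # p(L_t | c) under the Dirichlet detector-uncertainty model for each c
    likelihoods = np.array([dirichlet.pdf(L_t, a) for a in alphas])
    posterior = likelihoods * p_prev
    return posterior / posterior.sum()   # normalization (the eta term)
```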
Finally, based on the locations of map objects, the corresponding geometric room IDs are assigned to the objects. Formally, a map object $o_i$ is represented as a 4-tuple $o_i = \langle \mu_i, \Sigma_i, p_i, r_i \rangle$, with $\mu_i$ and $\Sigma_i$ the mean and covariance of the object $o_i$'s pose, $p_i$ the object class distribution, and $r_i$ the room ID of $o_i$. The object map is the set of observed map objects $O_t = \{o_1, o_2, \ldots, o_n\}$. The fused map $E_t = \langle G_t, O_t, R_t \rangle$ collects the grid map $G_t$, the object map $O_t$, and the room information $R_t$.
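One possible way to organize the fused map in code (a sketch; the field names are ours, not from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MapObject:            # o_i = <mu_i, Sigma_i, p_i, r_i>
    mu: np.ndarray          # mean position, shape (2,)
    Sigma: np.ndarray       # position covariance, shape (2, 2)
    p: np.ndarray           # class probability distribution, shape (|C|,)
    room_id: int            # ID of the geometric room containing the object

@dataclass
class FusedMap:             # E_t = <G_t, O_t, R_t>
    G: np.ndarray           # grid map, values in {0, 1, -1}, shape (W, H)
    objects: list           # object map O_t: list[MapObject]
    R: np.ndarray           # room ID per grid cell, shape (W, H)
```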
IV. INFORMATION FOR GOAL FORMING
As the robot's mission is to find a target object in an unknown environment, its goal of motion is determined on the fly depending on the information provided by the fused map $E_t$ and the robot's location. The mission is accomplished if the target object is visible and identified as such. Otherwise, there are several types of intermediate goals for the robot's motion:
- if the target object is not included in the current map $E_t$, the robot chooses to explore more. This intermediate goal requires frontier detection;
- if an object in the map is likely the target object (with a low probability), the robot chooses to observe more of the object from within its visibility region;
- if an object in the map is related to the target object, based on the semantic information that they are likely in the same geometric room, the robot chooses to move to that room in the hope of seeing the target object once it is there.
A. Frontier Detection
The frontier region is the set of cells between free and unknown space in the grid map $G_t$. Formally, a grid cell $(i, j)$ belongs to the frontier region if and only if $G_t[i, j] = 0$ and $\exists k, l \in \{-1, 0, 1\}: G_t[i+k, j+l] = -1$. We use the Canny edge detector [33] to detect the grid cells between free and unknown space. The detected cells are grouped into edges using 8-connectivity, i.e., each cell with coordinates $(i \pm 1, j \pm 1)$ is connected to the cell at $(i, j)$. Similar to map objects, a frontier edge $e_j$ is also assigned a room ID $r_j$ based on its position. The frontier region is defined as $F_t = \{\langle e_1, r_1 \rangle, \langle e_2, r_2 \rangle, \ldots, \langle e_m, r_m \rangle\}$, where $m$ is the number of frontier edges. Edges with area $|e_j|$ smaller than 15 cells are deemed noise and excluded from $F_t$. The frontier region at time $t$ is drawn in green in Fig. 4.
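A direct, unoptimized transcription of this definition might look as follows (the paper itself uses a Canny detector plus 8-connectivity grouping for efficiency):

```python
import numpy as np

def frontier_cells(G):
    """Return the set of frontier cells in grid map G (sketch).

    A free cell (value 0) is a frontier cell if any of its 8 neighbors
    is unknown (value -1); occupied cells have value 1.
    """
    W, H = G.shape
    frontier = set()
    for i in range(W):
        for j in range(H):
            if G[i, j] != 0:
                continue
            for k in (-1, 0, 1):
                for l in (-1, 0, 1):
                    ni, nj = i + k, j + l
                    if ((k, l) != (0, 0) and 0 <= ni < W and 0 <= nj < H
                            and G[ni, nj] == -1):
                        frontier.add((i, j))
    return frontier
```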
B. Visibility Region Computation
At time $t$, the visibility region $V_t$ for an object $o_i$ in the grid map $G_t$ with obstacles is the region of all cells on $G_t$ from which the object $o_i$ is visible. That is, if a line connecting the position of $o_i$ and a cell $q$ does not intersect any obstacle cell and is within the sensing range, then $q \in V_t$. We apply a uniform ray-casting algorithm to compute the visibility region. Rays originating from the object's position are cast in many directions. Regions illuminated by a ray (reached by it) constitute the visibility region $V_t$. The visibility region for one object is drawn in blue in Fig. 5.
Fig. 4: The frontier region computed at time t, marked in green.

Fig. 5: The visibility region computed for one object instance at time t, marked in blue.
For efficient planning, we only compute the visibility region for the object most likely to be in the target object category $c_T$. We refer to this object as the object of interest $o_I$, with $I = \arg\max_i p_i(c_T)$.
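A minimal ray-casting sketch of the visibility-region computation, assuming a cell-indexed grid and a sensing range measured in cells (the parameter names are ours):

```python
import math

def visibility_region(G, obj_pos, sensing_range, n_rays=360):
    """Approximate the visibility region of an object by uniform ray
    casting (sketch). G: grid map with 1 = occupied; obj_pos: (i, j)
    cell of the object. Returns the set of cells reached by some ray
    before it hits an obstacle or leaves the sensing range.
    """
    W, H = G.shape
    visible = set()
    for a in range(n_rays):
        theta = 2 * math.pi * a / n_rays
        di, dj = math.cos(theta), math.sin(theta)
        for r in range(1, int(sensing_range) + 1):   # march along the ray
            i = int(round(obj_pos[0] + r * di))
            j = int(round(obj_pos[1] + r * dj))
            if not (0 <= i < W and 0 <= j < H) or G[i, j] == 1:
                break                                # ray blocked
            visible.add((i, j))
    return visible
```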
C. Semantic Prior Knowledge
We leverage prior semantic knowledge to facilitate efficient exploration. The key idea is that objects in the target category may have a closer affinity to some categories of objects than to others. The co-occurrence relationship between objects of two categories is estimated based on Lidstone's law of succession [34]:

$$p(c_i \mid c_j) = \frac{N(c_i, c_j) + \alpha}{N(c_j) + \alpha |C|}, \quad (5)$$

where $p(c_i \mid c_j)$ is the conditional probability of an object of class $c_i$ being in a geometric room given that an object of class $c_j$ is already observed in the same room. $N(c_i, c_j)$ is the number of times objects of classes $c_i$ and $c_j$ are observed in the same room. $N(c_j)$ is the number of times objects of category $c_j$ are observed in a room. $\alpha \in [0, \infty)$ is a smoothing factor, and finally $|C|$ is the number of classes.
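As a worked example of Eq. (5) with hypothetical counts:

```python
def lidstone_cooccurrence(N_ij, N_j, num_classes, alpha=1.0):
    """Smoothed co-occurrence probability p(c_i | c_j), Eq. (5) (sketch)."""
    return (N_ij + alpha) / (N_j + alpha * num_classes)

# Hypothetical counts: 'towel' seen with 'sink' in 8 of 10 rooms, 20 classes.
p = lidstone_cooccurrence(N_ij=8, N_j=10, num_classes=20, alpha=1.0)
print(f"p(towel | sink) = {p:.3f}")   # (8 + 1) / (10 + 20) = 0.300
```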
The probabilistic co-occurrence relationships of multiple pairs of objects are captured using Eq. (5) and further assembled into multiple Bayesian networks. We construct a set of Bayesian networks $B = \{B_1, B_2, \ldots\}$, one for each semantic space $S = \{S_1, S_2, \ldots\}$. Each semantic space corresponds to one room category, such as kitchen, bedroom, bathroom, etc. An example of a Bayesian network is illustrated in Fig. 6, demonstrating common object classes found in a kitchen and their conditional dependencies.

Fig. 6: A Bayesian network for the kitchen semantic space, with nodes Kitchen, Stove, Fridge, Microwave, Counter, Kettle, and Cup.

For each geometric room $r_i$ in the environment, we collect the set of object classes $E_i = \{c_1, c_2, \ldots\}$ that are observed in the room $r_i$. Recall that we keep a class probability distribution for each map object; thus we cannot draw a deterministic conclusion regarding the presence of a certain object class in the room $r_i$. However, to keep the problem tractable, we assume the presence of object class $c_k$ if, for any object $o_j$ in the room, the probability of the object $o_j$ being in class $c_k$ exceeds some threshold $\lambda$: $c_k \in E_i \iff \exists j, \; p_j(c_k) > \lambda$.
Given the evidence set $E_i$, we only consider the Bayesian networks in $B$ that contain the target object class $c_T$ and share nodes with $E_i$; name this new set $B_i$. By doing so, we narrow down the possible semantic space categories for the geometric room $r_i$ to a subset $S_i$, which corresponds to $B_i$. For each Bayesian network $B_j \in B_i$, we can compute the probability of finding the target object instance $o_T$ in the room $r_i$ based on evidence $E_i$, denoted as $P(c_T \mid E_i, r_i, B_j)$. We can then infer the probability of finding the target object instance $o_T$ in the room $r_i$ by feeding $E_i$ into the Bayesian network set $B_i$:

$$P(c_T \mid E_i = \{c_1, c_2, \ldots\}, r_i) = \max_{B_j \in B_i} P(c_T \mid E_i, r_i, B_j).$$

This probability is computed for all geometric rooms.
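A minimal sketch of this inference step for one small network, using inference by enumeration; the structure mirrors Fig. 6, but the probabilities and the encoding are made up for illustration:

```python
import itertools

def prob_target_given_evidence(bn, target, evidence):
    """P(target present | evidence) by enumeration over a small Bayesian
    network (sketch). bn maps each node to (parents, cpt), where cpt maps
    a tuple of parent truth values to P(node = True | parents)."""
    nodes = list(bn)

    def joint(assign):
        p = 1.0
        for n in nodes:
            parents, cpt = bn[n]
            pt = cpt[tuple(assign[q] for q in parents)]
            p *= pt if assign[n] else 1.0 - pt
        return p

    hidden = [n for n in nodes if n not in evidence and n != target]
    num = den = 0.0
    for values in itertools.product([False, True], repeat=len(hidden) + 1):
        assign = dict(evidence)
        assign[target] = values[0]
        assign.update(zip(hidden, values[1:]))
        p = joint(assign)
        den += p                     # P(evidence): sum over all assignments
        if assign[target]:
            num += p                 # joint mass where the target is present
    return num / den

# Hypothetical kitchen network (structure from Fig. 6; numbers made up)
bn = {
    "Kitchen": ((), {(): 0.30}),
    "Stove":   (("Kitchen",), {(True,): 0.7, (False,): 0.05}),
    "Counter": (("Kitchen",), {(True,): 0.8, (False,): 0.20}),
    "Cup":     (("Counter",), {(True,): 0.6, (False,): 0.10}),
}
print(prob_target_given_evidence(bn, "Cup", {"Stove": True}))
```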
V. PLANNING APPROACH
We now describe how the intermediate goal is determined
for the robot on the fly and how the optimal policy for
reaching the intermediate goal is computed for the robot.
A. Robot Model
The robot is a stochastic dynamic system and can be represented by a Markov decision process (MDP) $M_t = \langle S, A, P, R, F \rangle$ with the following components:
- $S$ is the discrete state space, representing the mean of the Gaussian distribution of the robot position. The mean of the robot's position is discretized and clipped to the closest grid cell in the grid map $G_t$ to avoid an infinite state space.
- $A$ is a set of actions. We consider eight actions that allow the robot to move horizontally, vertically, and diagonally to reach its eight neighboring grid cells. A low-level controller maps the actions into robot commands.
- $P: S \times A \times S \to [0, 1]$ is the transition probability function, where $P(\cdot \mid s, a)$ represents the probability distribution over next states given an action $a$ taken at the current state $s$. For example, for the move-up action, the robot has a high probability of moving up one cell, but it also has a small probability of moving to the upper-left or upper-right cell (see the sketch after this list).
- $R: S \times A \times S \to \mathbb{R}$ is the reward function, where $R(s, a, s')$ is the reward for executing action $a \in A$ at state $s \in S$ and reaching next state $s' \in S$.
- $F \subseteq S$ is the set of (intermediate) goal states, which are determined on the fly, as described in Section V-C.
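A sketch of one plausible transition model consistent with this description; the exact noise probabilities (0.8/0.1/0.1 here) are our assumption, as the paper does not specify them:

```python
# Eight actions in angular order, so that (a - 1) % 8 and (a + 1) % 8 are
# the two directions adjacent to action a.
MOVES = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def transition(G, s, a):
    """P(. | s, a): distribution over next cells for action index a (sketch)."""
    def step(direction):
        i, j = s[0] + direction[0], s[1] + direction[1]
        in_bounds = 0 <= i < G.shape[0] and 0 <= j < G.shape[1]
        return (i, j) if in_bounds and G[i, j] != 1 else s  # blocked: stay put
    dist = {}
    for k, p in [(a, 0.8), ((a - 1) % 8, 0.1), ((a + 1) % 8, 0.1)]:
        nxt = step(MOVES[k])
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
```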
B. Reward Shaping
To compute policies that can drive the robot to the frontier region $F_t$ or the visibility region $V_t$, for exploration or re-observation respectively, we define two reward functions accordingly.
1) Reward function for reaching $F_t$: The reward function $R(s, a, s')$ is designed as:

$$R(:, :, s') = P(x \in e_j \mid s') \cdot P(c_T \mid E, r_i) \cdot |e_j|, \quad (6)$$

where $P(x \in e_j \mid s')$ is the probability of the robot being at frontier edge $e_j$ if its mean position is $s'$. $P(c_T \mid E, r_i)$ is the probability of finding the target object class $c_T$ in the geometric room $r_i$ in which edge $e_j$ lies, given the current evidence $E$. $|e_j|$ is the size of the frontier edge, representing the possible information gain from exploring $e_j$. $P(x \in e_j \mid s')$ can be calculated by first discretizing the robot's Gaussian position distribution (with mean at $s'$) based on $G_t$ and then summing up the probability of the robot being at each cell that belongs to $e_j$. $P(c_T \mid E, r_i)$ is calculated using the Bayesian network, as discussed in Section IV-C.
2) Reward function for reaching $V_t$: The reward function $R(s, a, s')$ is designed as:

$$R(:, :, s') = P(x \in V_t \mid s'), \quad (7)$$

which is the probability of the robot being in the visibility region $V_t$ if its mean position is $s'$.
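A sketch of how rewards (6) and (7) can be evaluated, approximating the discretized Gaussian mass by summing the density at cell centers (the function names are ours):

```python
from scipy.stats import multivariate_normal

def prob_in_region(mean_cell, Sigma, cells):
    """Approximate P(x in region | s') by summing the robot's Gaussian
    position density at each cell center (cell area taken as one grid
    unit; a finer approximation would integrate over each cell)."""
    mvn = multivariate_normal(mean=mean_cell, cov=Sigma)
    return float(sum(mvn.pdf(c) for c in cells))

def frontier_reward(mean_cell, Sigma, edge_cells, p_target_room):
    """Reward (6): P(x in e_j | s') * P(c_T | E, r_i) * |e_j|."""
    return prob_in_region(mean_cell, Sigma, edge_cells) * p_target_room * len(edge_cells)

def visibility_reward(mean_cell, Sigma, visibility_cells):
    """Reward (7): P(x in V_t | s')."""
    return prob_in_region(mean_cell, Sigma, visibility_cells)
```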
C. Goal Determination
We use an optimistic approach in determining the intermediate goal. If the probability of the object of interest $o_I$ being in the target object category $c_T$ exceeds a threshold $\tau$, i.e., $p_I(c_T) > \tau$, then the intermediate goal is to re-observe the object of interest $o_I$, and the reward function is as defined in (7). Otherwise, the intermediate goal is to explore the frontier region, and the reward function is as defined in (6).
D. Planner
The MDP $M_t$ and the selected reward function $R$ are fed into a planner based on the Real-Time Dynamic Programming (RTDP) algorithm [35] to compute an optimal policy $\pi^*$ that maximizes the expected sum of rewards, i.e., the value function $V$. The value function $V$ starting at state $s \in S$ and following policy $\pi$ is defined as:

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \right],$$

where $\pi: S \to A$ is a deterministic policy over $M_t$ mapping a state to an action, and $\gamma \in [0, 1)$ is a discounting factor. The optimal policy $\pi^*$ is computed as follows: for all $s \in S$,

$$\pi^*(s) = \arg\max_{\pi \in \Pi} V^\pi(s).$$

Unlike a more traditional approach such as value iteration [36], the RTDP algorithm allows us to compute a semi-optimal policy in a short time. As the robot carries out the semi-optimal policy, the policy is continuously improved by the RTDP algorithm, with the current robot mean position as the initial state $s_0$, and converges to the optimal policy.
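A compact sketch of one RTDP trial under this formulation (greedy action selection with Bellman backups along sampled trajectories); this is a generic RTDP skeleton, not the authors' implementation:

```python
import random

def rtdp_trial(actions, transition, reward, V, s0, goals, gamma=0.95,
               max_steps=200):
    """One RTDP trial (sketch): greedy rollout from s0 with Bellman backups.

    V : dict of value estimates, updated in place.
    transition(s, a) -> {s_next: prob}; reward(s, a, s_next) -> float.
    """
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        def q(a):  # one-step lookahead value of action a at state s
            return sum(p * (reward(s, a, s2) + gamma * V.get(s2, 0.0))
                       for s2, p in transition(s, a).items())
        best_a = max(actions, key=q)
        V[s] = q(best_a)                  # Bellman backup at the visited state
        nxt = transition(s, best_a)       # sample the successor state
        s = random.choices(list(nxt), weights=list(nxt.values()))[0]
    return V
```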
E. Adaptation
The fused map $E_t$, frontier region $F_t$, and visibility region $V_t$ are updated at every time instance $t$ based on the ever-improving semantic SLAM results. Consequently, once the robot reaches an intermediate goal state, the MDP model $M_t$ needs to be updated based on the new fused map $E_t$. We call this process the adaptation of the MDP model. Next, the corresponding policy $\pi$ also needs to be updated.

Specifically, the following components are adapted: (a) the discrete state space $S$, to match the changing grid map $G_t$; (b) the transition probability function $P$; (c) the reward function $R$, based on Eqs. (6) and (7); and (d) the set of intermediate goal states $F$, as $F_t$ and $V_t$ change. The RTDP planner takes the updated MDP model $M_t$ and generates a new policy.
VI. EXPERIMENTS
We have performed experiments on the Matterport3D (MP3D) [37] dataset using the Habitat [38] simulator. The MP3D dataset contains 3D scenes of common indoor environments, and the Habitat simulator allows the robot to navigate the virtual 3D scenes. Given the towel as the target object category, the robot's objective is to find any instance of the target with a confidence level greater than 0.99. The action space of the robot is continuous, constituting its angular and linear velocity. An episode is successful only if the agent stops its motion when it has identified the target within a specific time budget (1,000 seconds). Five episodes are run for each method, with the robot placed at a random position at the beginning of each episode. Two snapshots of the MP3D scene are given in Fig. 7. This particular scene is 6.4 m x 8.4 m in size and has one level, ten rooms, and 187 objects. The robot start positions and the target object instances are visualized in Fig. 8, represented as blue boxes and yellow diamonds, respectively.
The accompanying video shows the robot’s operations to
find target objects with different starting locations.
A. Semantic SLAM results
We present the semantic SLAM results obtained during
one episode. Our evaluation focuses on the accuracy and
uncertainty of the collected results.
Fig. 7: Two snapshots of the MP3D scene.
Fig. 8: Target objects (yellow diamonds) and robot start
positions (blue boxes).
1) Accuracy: We calculate the mean and the median of the error between the predicted and ground-truth object positions:

$$\text{Mean} = \frac{\sum_{i=1}^{n} \|\hat{p}_i - p_i\|}{n}, \quad \text{Median} = \text{median}(\|\hat{p}_i - p_i\|),$$

where $n$ is the current number of objects, $\hat{p}_i$ is the estimated object position, and $p_i$ is the ground-truth object position.
Their variations with respect to time are plotted in Fig. 9. For reference, the number of identified objects at each time instance is also plotted in Fig. 10. The error increases for the first few seconds as new objects are identified. Nonetheless, as more observations come in, the error decreases and converges. The calculated mean error appears larger than the median value because some objects receive only one observation; as a result, their positions are never updated, contributing a large error. For this reason, the median error, which is kept at a reasonable level, is considered the more sensible metric in our case.
In the same spirit, we also calculate the cross entropy between the predicted and ground-truth object classes:

$$-\frac{1}{n} \sum_{i=1}^{n} \sum_{c \in C} p_i^{gt}(c) \log p_i(c),$$

where $p_i(\cdot)$ is the predicted object class distribution and $p_i^{gt}(\cdot)$ is the ground-truth object class distribution, taking the value one at the corresponding object class and zero elsewhere. The results are plotted in Fig. 11. The cross entropy gradually decreases with time, showing that the predicted object classes converge to the correct results.
2) Uncertainty: Though we do not claim our method to be an active SLAM method, we observe a decrease in semantic map uncertainty as the robot progresses.
Fig. 9: Mean and median of the error between the predicted objects' position and the ground-truth objects' position.
Fig. 10: Number of identified objects with respect to time.
The average A-opt (sum of the covariance matrix eigenvalues), D-opt (product of the covariance matrix eigenvalues), and E-opt (largest covariance matrix eigenvalue) of the map object position covariances are calculated. Their evolution over time is plotted in Fig. 12. The spikes in the graph indicate the identification of new objects, and hence increased position uncertainty. However, as time goes by and more observations come in, all three metrics are kept at a low level. This shows that the robot can estimate the objects' positions fairly confidently.
Fig. 13 gives a more intuitive representation: we plot the Gaussian functions whose means and covariances are set to the estimated object position means and covariances. Therefore, each "bell" in the plot represents one object. Comparing the results obtained at time instants $t = 8$ s and $t = 70$ s, we can see that at $t = 70$ s each bell's peak increases and its base shrinks, indicating a more certain estimate of the object's position.
The entropy of the predicted object classes is also calculated, $-\frac{1}{n} \sum_{i=1}^{n} \sum_{c \in C} p_i(c) \log p_i(c)$, and visualized in Fig. 14. The result suggests that as time progresses, the robot becomes more and more certain about the object classes it predicts.
B. Planning results

We evaluate the performance of our planning method against other navigation strategies: FE-SS, a frontier exploration method [39] with rewards defined by (6) and equipped with our custom semantic SLAM; and Ours-NS, an ablation of our method without semantic prior knowledge, using a uniform reward.
Fig. 11: Cross entropy between the predicted objects' classes and the ground-truth objects' classes.
Fig. 12: The evolution of the object position covariance with respect to time.
To evaluate all methods, we report the following metrics:
- Success: percentage of successful episodes.
- Average path length: average length of the path taken by the agent in an episode.
- Success weighted by path length (SPL) [40]: $\frac{1}{N} \sum_{i=1}^{N} S_i \frac{l_i}{\max(p_i, l_i)}$, where $l_i$ is the length of the shortest path between the goal and the visibility region of the target instance for an episode, $p_i$ is the length of the path taken by the agent in an episode, and $S_i$ is a binary indicator of success in episode $i$.
- Average planning time: average time spent on planning (excluding the time in action).
The testing results are summarized in Table I. Our method outperforms the other methods in success rate, SPL, and path length by a large margin. Although our method does not guarantee dominant performance in planning time, it still exhibits an advantage, with a shorter average planning time across all episodes.
The ablation study is conducted by denying our method access to the semantic prior knowledge. A significant performance drop is observed on all metrics after that. This proves the efficacy of using semantic prior knowledge to guide our search for the target object.
Fig. 13: Object position covariance at time instants t = 8 s (a) and t = 70 s (b).
Fig. 14: The evolution of the predicted object class entropy with respect to time.
TABLE I: Comparison study

Method   | Success | Path length (m) | SPL  | Planning time (s)
Ours     | 0.96    | 4.984           | 0.46 | 80.338
FE-SS    | 0.80    | 6.796           | 0.39 | 105.514
Ours-NS  | 0.68    | 8.726           | 0.41 | 97.716
VII. CONCLUSIONS
We presented a novel approach to tackle the open problem of enabling a robot to search for a semantic object in an unknown and GPS-denied environment. Our approach combines semantic SLAM, Bayesian networks, Markov decision processes, and real-time dynamic planning. The testing results on the Matterport3D dataset demonstrate both the effectiveness and efficiency of our approach. Moreover, while our approach is unique in incorporating semantic object information to search for semantic targets, to evaluate its motion planning performance we compared it to a non-semantic baseline planning method and conducted an ablation study. The results show that our approach achieves a higher success rate, a shorter path length, and a shorter planning time. In the next step, we will consider extending our approach to more complex, compound semantic tasks and to tasks that require the robot to interact with objects.
REFERENCES

[1] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel, "Real-time monocular object SLAM," Rob. Auton. Syst., vol. 75, pp. 435–449, 2016.
[2] L. Nicholson, M. Milford, and N. Sünderhauf, "QuadricSLAM: Dual quadrics from object detections as landmarks in object-oriented SLAM," IEEE Robot. Autom. Lett. (RA-L), vol. 4, no. 1, pp. 1–8, 2018.
[3] Z. Qian, K. Patath, J. Fu, and J. Xiao, "Semantic SLAM with autonomous object-level data association," in IEEE Int. Conf. Robot. Autom. (ICRA), 2021, pp. 11203–11209.
[4] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D object SLAM," IEEE Trans. Robot. (T-RO), vol. 35, no. 4, pp. 925–938, 2019.
[5] L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, "Semantic SLAM based on object detection and improved OctoMap," IEEE Access, vol. 6, pp. 75545–75559, 2018.
[6] C. Leung, S. Huang, and G. Dissanayake, "Active SLAM using model predictive control and attractor based exploration," in IEEE Int. Conf. Intell. Robots Syst. (IROS), 2006, pp. 5026–5031.
[7] T. Kollar and N. Roy, "Trajectory optimization using reinforcement learning for map exploration," Int. J. Robot. Res. (IJRR), vol. 27, no. 2, pp. 175–196, 2008.
[8] A. Kim and R. M. Eustice, "Perception-driven navigation: Active visual SLAM for robotic area coverage," in ICRA, 2013, pp. 3196–3203.
[9] S. Ehrenfeld, "On the efficiency of experimental designs," Ann. Math. Stat., vol. 26, no. 2, pp. 247–255, 1955.
[10] C. Stachniss, G. Grisetti, and W. Burgard, "Information gain-based exploration using Rao-Blackwellized particle filters," in Robotics: Science and Systems, vol. 2, 2005, pp. 65–72.
[11] J.-L. Blanco, J.-A. Fernandez-Madrigal, and J. González, "A novel measure of uncertainty for mobile robot SLAM with Rao-Blackwellized particle filters," IJRR, vol. 27, no. 1, pp. 73–89, 2008.
[12] S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.
[13] R. Alterovitz, S. Koenig, and M. Likhachev, "Robot planning in the real world: Research challenges and opportunities," AI Magazine, vol. 37, no. 2, pp. 76–84, 2016.
[14] A. Stentz, "Optimal and efficient path planning for partially known environments," in Intelligent Unmanned Ground Vehicles. Springer, 1997, pp. 203–220.
[15] S. Koenig and M. Likhachev, "Fast replanning for navigation in unknown terrain," T-RO, vol. 21, no. 3, pp. 354–363, 2005.
[16] J. Bruce and M. M. Veloso, "Real-time randomized path planning for robot navigation," in Robot Soccer World Cup. Springer, 2002, pp. 288–295.
[17] D. Ferguson, N. Kalra, and A. Stentz, "Replanning with RRTs," in ICRA, 2006, pp. 1243–1248.
[18] J. Vannoy and J. Xiao, "Real-time adaptive motion planning (RAMP) of mobile manipulators in dynamic environments with unforeseen changes," T-RO, vol. 24, no. 5, pp. 1199–1212, 2008.
[19] Z. Wang and G. Tian, "Hybrid offline and online task planning for service robot using object-level semantic map and probabilistic inference," Information Sciences, vol. 593, pp. 78–98, 2022.
[20] T. S. Veiga, M. Silva, R. Ventura, and P. U. Lima, "A hierarchical approach to active semantic mapping using probabilistic logic and information reward POMDPs," in Int. Conf. Automated Planning Scheduling, vol. 29, 2019, pp. 773–781.
[21] L. Burks, I. Loefgren, and N. R. Ahmed, "Optimal continuous state POMDP planning with semantic observations: A variational approach," T-RO, vol. 35, no. 6, pp. 1488–1507, 2019.
[22] D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. R. Salakhutdinov, "Object goal navigation using goal-oriented semantic exploration," Adv. Neural Inf. Process. Syst., vol. 33, pp. 4247–4258, 2020.
[23] G. Georgakis, B. Bucher, K. Schmeckpeper, S. Singh, and K. Daniilidis, "Learning to map for active semantic goal navigation," arXiv preprint arXiv:2106.15648, 2021.
[24] R. Zeng, Y. Wen, W. Zhao, and Y.-J. Liu, "View planning in robot active vision: A survey of systems, algorithms, and applications," Comput. Vis. Media, vol. 6, no. 3, pp. 225–245, 2020.
[25] Z. Meng, H. Sun, H. Qin, Z. Chen, C. Zhou, and M. H. Ang, "Intelligent robotic system for autonomous exploration and active SLAM in unknown environments," in Int. Symp. Syst. Integr. (SII), 2017, pp. 651–656.
[26] M. Selin, M. Tiger, D. Duberg, F. Heintz, and P. Jensfelt, "Efficient autonomous exploration planning of large-scale 3-D environments," RA-L, vol. 4, no. 2, pp. 1699–1706, 2019.
[27] A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart, "Receding horizon 'next-best-view' planner for 3D exploration," in ICRA, 2016, pp. 1462–1468.
[28] S. J. Russell, Artificial Intelligence: A Modern Approach. Pearson Education, Inc., 2010.
[29] M. Labbé and F. Michaud, "RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation," J. Field Robot., vol. 36, no. 2, pp. 416–446, 2019.
[30] R. Bormann, F. Jordan, J. Hampp, and M. Hägele, "Indoor coverage path planning: Survey, implementation, analysis," in ICRA, 2018, pp. 1718–1725.
[31] Z. Qian, K. Patath, J. Fu, and J. Xiao, "Semantic SLAM with autonomous object-level data association," in ICRA, 2021, pp. 11203–11209.
[32] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[33] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., no. 6, pp. 679–698, 1986.
[34] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to Information Retrieval. Cambridge University Press, 2008, vol. 39.
[35] T. Smith and R. Simmons, "Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic," in AAAI, 2006, pp. 1227–1232.
[36] R. Bellman, "A Markovian decision process," J. Math. Mech., pp. 679–684, 1957.
[37] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, "Matterport3D: Learning from RGB-D data in indoor environments," arXiv preprint arXiv:1709.06158, 2017.
[38] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik et al., "Habitat: A platform for embodied AI research," in IEEE Int. Conf. Comput. Vis., 2019, pp. 9339–9347.
[39] B. Yamauchi, "A frontier-based approach for autonomous exploration," in IEEE Int. Symp. Comput. Intell. Robot. Autom., 1997, pp. 146–151.
[40] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva et al., "On evaluation of embodied navigation agents," arXiv preprint arXiv:1807.06757, 2018.