Autonomous Search of Semantic Objects in Unknown Environments
Zhentian Qian1, Jie Fu2, and Jing Xiao1
Abstract— This paper addresses the problem of enabling
a robot to search for a semantic object in an unknown
and GPS-denied environment. For the robot in the unknown
environment to detect and find the target object, it must
perform simultaneous localization and mapping (SLAM) at
both geometric and semantic levels using its onboard sensors
while planning and executing its motion based on the ever-
updated SLAM results. In other words, the robot must be
able to conduct simultaneous localization, semantic mapping,
motion planning, and execution in real-time in the presence of
sensing and motion uncertainty. This is an open problem as it
combines semantic SLAM based on perception and real-time
motion planning and execution under uncertainty. Moreover,
the goals of robot motion change on the fly depending on
whether and how the robot can detect the target object. We
propose a novel approach to tackle the problem, leveraging
semantic SLAM, Bayesian Networks, Markov Decision Process,
and real-time dynamic planning. The results demonstrate the
effectiveness and efficiency of our approach.
Index Terms— Reactive and Sensor-Based Planning, Seman-
tic SLAM, Semantic Scene Understanding, Planning under
Uncertainty.
I. INTRODUCTION
This paper is motivated by the problem of searching an
unknown environment for some target object, which is a
fundamental problem in many application scenarios from
search and rescue to reconnaissance to elderly care.
A. Related Work
There is a significant amount of literature on simultaneous
localization and mapping (SLAM) for robot mapping and
navigation in an unknown environment based on perception,
such as visual and odometry sensing. SLAM methods model
and reduce sensing uncertainties in mapping the unknown
environment and localizing the robot in it at the same time.
Semantic SLAM and active SLAM are particularly relevant.
Semantic SLAM methods are focused on representing,
mapping, and localizing 3D objects and use different representations of objects such as meshes [1], quadrics [2], [3],
cuboids [4], and OctoMap [5].
Active SLAM aims to choose the optimal trajectory for
a robot to improve map and localization accuracy and
maximize the information gain. The localization accuracy
is typically measured by metrics such as A-opt (sum of the
covariance matrix eigenvalues) [6], [7], D-opt (product of
covariance matrix eigenvalues) [8], E-opt (largest covariance
1Zhentian Qian* and Jing Xiao are with the Robotics Engineer-
ing Department, Worcester Polytechnic Institute, Worcester, MA, USA
zqian@wpi.edu, jxiao2@wpi.edu
2Jie Fu is with the Department of Electrical and Computer Engineering,
University of Florida, Gainesville, FL, USA fujie@ufl.edu
matrix eigenvalue) [9]. Information gain is measured in met-
rics such as joint entropy [10] and expected map information
[11].
However, neither semantic nor active SLAM considers
performing tasks other than mapping an unknown environ-
ment. The planning aspect is not addressed for semantic
SLAM and is downplayed in active SLAM with simple
methods such as A* [8].
Robot path and motion planning is one of the most studied
areas in robotics. The basic objective is to find an optimal and
collision-free path for a robot to navigate to some goals in
an environment. Many traditional path-planning approaches
assume a more or less known environment, i.e., the robot
already has a map and models of objects [12]. On the
other hand, real-time, sensing-based planning in an unknown
environment still largely remains a challenge [13].
Earlier work includes grid-based planning approaches such
as D* [14] and D* Lite [15], sampling-based approaches
such as ERRT [16] and DRRT [17], and adaptive approaches
such as [18]. These approaches consider the environment
dynamic and partially known, but assume the goal position is
known, disregard the uncertainties in sensing, the robot pose,
and dynamics, and do not consider semantic information.
Recently, various techniques based on partially observable
Markov decision processes (POMDPs) have been developed
[19]–[21] to incorporate sensing and robot motion uncer-
tainties into planning in partially observable environments.
However, POMDP suffers from the curse of dimensionality
and is computationally expensive, particularly when the
state space is large. For the POMDP to scale, high-level
abstraction must be made for the state space, for example,
treating objects [20] or rooms [19] as state variables. The
downside is that highly-abstracted models can lose touch
with reality. To bypass this problem, some researchers turn to
deep learning to learn semantic priors and make predictions
on the unobserved region [22], [23]. These methods tend to
suffer from poor generalization.
Next-best view planning is another highly related topic,
designed for efficient visual exploration of unknown space.
Unlike active SLAM, approaches for next-best view planning
typically do not consider robot localization uncertainty. A
next-best view planner starts by sampling a set of views in
the environment, evaluates the estimated information gain
for each view, and selects the view with the maximum
information gain as the next view [24]. Different methods
differ in the sampling methods (uniform sampler, frontier-
based coverage sampler [25]), information gain (path costs
are incorporated in [25], [26]), and the selection of the next
view (receding horizon scheme in [27], Fixed Start Open
Traveling Salesman Problem (FSOTSP) solver in [25]).
However, existing planning methods in unknown envi-
ronments usually do not consider real-time results from
SLAM with embedded and changing uncertainties, such as
the robot’s pose, the metric map, and the semantic map
(generated by semantic SLAM). Only the metric map was
used by next-best view planning approaches [24]–[26].
B. Approach and Contributions
The problem we focus on in this paper, i.e., searching for a
target object in an unknown and GPS-denied environment,
requires real-time planning and execution of the robot’s
motion to perform the following two necessary tasks con-
currently, which are intertwined and mutually facilitating:
– Simultaneous localization and mapping at both semantic
and geometric levels to facilitate the search of the target
object.
– Real-time planning and execution of search motion towards
the target object based on semantic SLAM results or to
expand and improve semantic SLAM results to find and
identify the target object.
This paper addresses such novel challenges by leveraging
the probabilistic representation of semantic SLAM outcomes
(constantly improved), Bayesian network [28] representation
of semantic knowledge relating the target object to surrounding objects, and a Markov decision process (MDP) formulation.
It introduces a novel, adaptive planner that synergizes:
1) active semantic SLAM with improved SLAM results,
2) real-time motion planning under sensing and motion
uncertainty in the partially observed world represented by
ever-updated semantic SLAM results,
3) real-time determination of (intermediate) goals of motion
on the fly, based on ever-improved semantic knowledge.
II. SYSTEM OVERVIEW
Fig. 1 provides an overview of our system. The robot starts
by scanning the environment with its onboard RGB-D sensor.
The color and depth images from the RGB-D sensor are fed
into geometric and semantic SLAM modules. The SLAM
modules then update the robot’s location, the (observed) free
and occupied space, and objects detected in the environment;
the updated localization and map information are fused in a
single map Et, where the subscript t stands for the time instance.
Next, the robot determines a goal of motion and plans a
path based on Et, to either explore new regions or inspect the
candidate target object. It first checks whether it has reached
the current goal. If not, the robot executes the planned path.
Otherwise, either the task is complete, or the robot forms a
new goal.
In the goal-forming process, based on Et and additional
semantic information about the target object oT, the robot
decides whether it should search for oT in the explored space or
further observe a detected object that is likely oT. It then
updates the optimal policies for reaching the newly formed
goal. During the entire operation, the robot continuously
checks whether it has found the target object instance oT with high
confidence. If not, the robot repeats the above processes
until it has reached the time limit or explored all regions.
Fig. 1: System Overview

Fig. 2: Grid map at time t. The grey, white, and black areas
represent unknown, free, and occupied regions.
III. MAPPING AND LOCALIZATION
In this section, we describe how geometric and semantic
SLAM are achieved and how the information from different levels
is fused into a single map Et.
A. Geometric SLAM
Geometric SLAM and semantic SLAM modules run in
parallel in our system. We employ the RTAB-MAP [29]
algorithm for geometric SLAM. It generates a grid map
Gt ∈ {0, 1, −1}^(W×H), where W and H are the width and
height of the grid map. The values 0, 1, and −1 in the grid map
represent free, occupied, and unknown space, respectively, as shown
in Fig. 2. The geometric SLAM module also estimates the
robot pose (µp,t, Σp,t), where µp,t and Σp,t are the mean
and covariance of the robot pose at time instance t.
We use off-the-shelf tools [30] to segment the grid map
into different geometric rooms: a room is defined as any
space enclosed within a number of walls to which entry
is possible only by a door or other dividing structure that
connects it either to a hallway or to another room. Every
cell of the grid map Gt is assigned a corresponding
room ID: Rt ∈ N^(W×H). An example is provided in Fig. 3.

Fig. 3: Segmented geometric rooms at time t. The two
segmented rooms are encoded in different colors.
B. Semantic SLAM
We adapt the system introduced in [31] for semantic
SLAM. At time instance t−1, the position estimate
mi,t−1 ∈ R² for the semantic object oi is:

bel(mt−1) ∼ N(µt−1, Σt−1),    (1)

where bel(·) stands for the belief over a variable. Note that
for simplicity, the subscript i is dropped in (1), as in (2)–(4).
At time instance t, the robot pose xt ∈ R² estimated by
geometric SLAM is bel(xt) ∼ N(µp,t, Σp,t).
If the semantic object oi is detected in the color image
It, a range-bearing measurement zt will be generated based
on the depth information of oi from the depth image. The
range-bearing measurement noise δt is:

δt ∼ N(0, Σδ).    (2)

The covariance of the range-bearing measurement Σδ is
assumed to be independent of time. Then the posterior belief
bel(mt) at time t can be updated using Bayes' theorem:

bel(mt) = p(m | z1:t) = p(zt | m, z1:t−1) · p(m | z1:t−1) / p(zt | z1:t−1)
        = η ∫ p(zt | m, xt) · bel(xt) · bel(mt−1) dxt,    (3)

where η is a normalizing term.
Substituting the probability density functions of
p(zt | m, xt), bel(xt), and bel(mt−1) into (3), the final
result after simplification suggests that the updated posterior
belief bel(mt) can be approximated by a multivariate
Gaussian distribution bel(mt) ∼ N(µt, Σt), where

Σt = (K1ᵀ Σδ⁻¹ K1 + Σt−1⁻¹ − K1ᵀ Σδ⁻¹ K2 Ψ K2ᵀ Σδ⁻¹ K1)⁻¹,
µt = µt−1 + Σt K1ᵀ (Σδ⁻¹ − Σδ⁻¹ K2 Ψ K2ᵀ Σδ⁻¹) ∆zt.

∆zt is the error between the expected and actual range-bearing
measurements. The complete derivation is omitted here.
The object class probability distribution pt(·) is updated
using Bayes' theorem:

pt(c) = p(c | L1:t) = p(Lt | c, L1:t−1) · p(c | L1:t−1) / p(Lt | L1:t−1)
      = η · p(Lt | c) · pt−1(c) = p(Lt | c) · pt−1(c) / Σ_{c′∈C} p(Lt | c′) · pt−1(c′),    (4)

where η = 1/p(Lt | L1:t−1) is a normalization constant,
Lt ∈ R^|C| is the confidence level distribution of an object
over the possible classes, returned by an object detector, such as
YOLOv3 [32], at time t, and c ∈ C is one of the possible object
classes. p(Lt | c) is the object detector uncertainty model,
representing the probability that the object detector outputs Lt
when the object class is c. We use the Dirichlet distribution
Dir(αc) to model this uncertainty, with a different parameter
αc ∈ R^|C| for each object class c.
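To make the update in Eq. (4) concrete, the following minimal sketch performs one Bayesian class update with a Dirichlet detector model. The class list, the αc parameters, and the detector output are illustrative assumptions, not the values used in our system.

```python
import numpy as np
from scipy.stats import dirichlet

# Illustrative class set and Dirichlet parameters alpha_c (assumptions).
CLASSES = ["chair", "table", "towel"]
ALPHA = {
    "chair": np.array([8.0, 1.5, 0.5]),
    "table": np.array([1.5, 8.0, 0.5]),
    "towel": np.array([0.5, 0.5, 9.0]),
}

def update_class_belief(prior, L):
    """One step of Eq. (4): p_t(c) is proportional to p(L_t|c) * p_{t-1}(c),
    with p(L_t|c) the Dirichlet density Dir(L_t; alpha_c)."""
    likelihood = np.array([dirichlet.pdf(L, ALPHA[c]) for c in CLASSES])
    posterior = likelihood * prior
    return posterior / posterior.sum()   # division = normalization by eta

prior = np.full(len(CLASSES), 1.0 / len(CLASSES))   # uniform p_0(c)
L_t = np.array([0.10, 0.15, 0.75])                  # detector confidences
print(update_class_belief(prior, L_t))              # belief shifts to "towel"
```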
Finally, based on the locations of map objects, the corresponding geometric room IDs are assigned to the objects.
Formally, the map object oi is represented as a 4-tuple
oi = ⟨µi, Σi, pi, ri⟩, with µi and Σi the mean and covariance
of the object oi pose, pi the object class distribution, and ri
the room ID of oi. The object map is the set of observed
map objects Ot = {o1, o2, . . . , on}. The fused map Et =
⟨Gt, Ot, Rt⟩ collects the grid map Gt, the object map Ot, and
the room information Rt.
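As a minimal sketch, the fused map can be held in a pair of containers mirroring the definitions above; the field types follow the paper's notation, while the concrete shapes and names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class MapObject:
    """o_i = <mu_i, Sigma_i, p_i, r_i> from Sec. III-B."""
    mu: np.ndarray        # mean of the object position, R^2
    sigma: np.ndarray     # 2x2 position covariance
    p: Dict[str, float]   # class probability distribution p_i(c)
    room_id: int          # geometric room ID r_i

@dataclass
class FusedMap:
    """E_t = <G_t, O_t, R_t>."""
    grid: np.ndarray          # G_t in {0, 1, -1}^(W x H)
    objects: List[MapObject]  # O_t = {o_1, ..., o_n}
    rooms: np.ndarray         # R_t: room ID for every grid cell
```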
IV. INFORMATION FOR GOAL FORMING
As the robot’s mission is to find a target object in an
unknown environment, its goal of motion will be determined
on the fly depending on the information provided by the
fused map Et and the robot's location. The mission is
accomplished if the target object is visible and identified
as such. Otherwise, there are several types of intermediate
goals for the robot motion:
– if the target object is not included in the current map Et,
the robot chooses to explore more. This intermediate goal
requires frontier detection;
– if an object in the map is likely the target object, but with
a probability that is not yet high enough, the robot chooses to
observe more of the object from within its visibility region;
– if an object in the map is related to the target object based
on the semantic information that they are likely in the same
geometric room, the robot chooses to move to that room in
the hope of being able to see the target object once it is there.
A. Frontier Detection
The frontier region is the set of cells between free and
unknown space in the grid map Gt. Formally, a grid cell
(i, j) belongs to the frontier region if and only if Gt[i, j] = 0
and ∃k ∈ {−1, 0, 1}, ∃l ∈ {−1, 0, 1} : Gt[i+k, j+l] = −1.
We use the Canny edge detector [33] to detect the grid
cells between free and unknown space. The detected cells are
grouped into edges using 8-connectivity, i.e., each cell with
coordinates (i±1, j±1) is connected to the cell at (i, j).
Similar to map objects, a frontier edge ej is also assigned
a room ID rj based on its position. The frontier region is
defined as Ft = {⟨e1, r1⟩, ⟨e2, r2⟩, . . . , ⟨em, rm⟩}, where m
is the number of frontier edges. Edges with area |ej| smaller
than 15 cells are deemed noise and excluded from Ft. The
frontier region at time t is drawn in green in Fig. 4.
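A minimal sketch of this step is given below, using the grid-map convention of Sec. III-A. It substitutes an equivalent neighborhood test (a free cell with at least one unknown 8-neighbor) for the Canny step, and keeps the 8-connectivity grouping and the 15-cell noise filter; the exact pipeline in our implementation differs.

```python
import cv2
import numpy as np

def detect_frontier_edges(grid, min_area=15):
    """Frontier cells: free cells (0) with an unknown (-1) 8-neighbor."""
    free = (grid == 0)
    unknown = (grid == -1).astype(np.uint8)
    near_unknown = cv2.dilate(unknown, np.ones((3, 3), np.uint8)) > 0
    frontier = (free & near_unknown).astype(np.uint8)
    # Group frontier cells into edges with 8-connectivity.
    n, labels = cv2.connectedComponents(frontier, connectivity=8)
    edges = []
    for k in range(1, n):                  # label 0 is the background
        cells = np.argwhere(labels == k)
        if len(cells) >= min_area:         # drop edges smaller than 15 cells
            edges.append(cells)
    return edges
```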
B. Visibility Region Computation
At time t, the visibility region Vt for an object oi in the
grid map Gt with obstacles is the set of all cells on the
grid map Gt from which the object oi is visible. That is, if a line
connecting the position of oi and a cell q does not intersect any
obstacle cell and is within the sensing range, then q ∈ Vt.
We apply a uniform ray-casting algorithm to compute the
visibility region. Rays originating from the object's position
are cast in many directions. Regions illuminated by a ray
(i.e., reached by it) are considered part of the visibility region Vt. The
visibility region for one object is drawn in blue in Fig. 5.

Fig. 4: The frontier region computed at time t, marked in
green.

Fig. 5: The visibility region computed for one object instance
at time t, marked in blue.
For efficient planning, we only compute the visibility
region for the object most likely to be in the target object category
cT. We refer to this object as the object of interest oI, where
I = arg max_i pi(cT).
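The following is a minimal ray-casting sketch of the visibility-region computation; the number of rays and the sensing range (in cells) are assumed tuning parameters, and rays here are blocked only by occupied cells.

```python
import numpy as np

def visibility_region(grid, src, n_rays=360, max_range=80):
    """Mark every cell reached by some ray cast from src as visible."""
    H, W = grid.shape
    visible = np.zeros((H, W), dtype=bool)
    for theta in np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False):
        di, dj = np.sin(theta), np.cos(theta)
        for r in range(1, max_range):        # march along the ray
            i = int(round(src[0] + r * di))
            j = int(round(src[1] + r * dj))
            if not (0 <= i < H and 0 <= j < W) or grid[i, j] == 1:
                break                        # map edge or obstacle blocks it
            visible[i, j] = True             # cell illuminated by the ray
    return visible
```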
C. Semantic Prior Knowledge
We leverage prior semantic knowledge to facilitate effi-
cient exploration. The key idea is that objects in the target
category may have a closer affinity to some categories of
objects than others. The co-occurrence relationship between
objects of two categories is estimated based on Lidstone’s
law of succession [34]:
p(ci | cj) = (N(ci, cj) + α) / (N(cj) + α|C|),    (5)

where p(ci | cj) is the conditional probability of an object of
class ci being in a geometric room given that an object of class cj is
already observed in the same room, N(ci, cj) is the number
of times objects of classes ci and cj are observed in the same
room, N(cj) is the number of times objects of class cj
are observed in a room, α ∈ [0, ∞) is a smoothing factor,
and |C| is the number of classes.
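A minimal sketch of Eq. (5) over a toy set of room observations follows; the rooms, classes, and the choice of α are illustrative assumptions.

```python
def cooccurrence(rooms, ci, cj, alpha, num_classes):
    """Lidstone-smoothed estimate of p(ci | cj) from Eq. (5)."""
    n_joint = sum(1 for r in rooms if ci in r and cj in r)  # N(ci, cj)
    n_cond = sum(1 for r in rooms if cj in r)               # N(cj)
    return (n_joint + alpha) / (n_cond + alpha * num_classes)

# Toy data: each set lists the object classes observed in one room.
rooms = [{"stove", "fridge", "cup"}, {"bed", "lamp"}, {"fridge", "cup"}]
print(cooccurrence(rooms, "cup", "fridge", alpha=1.0, num_classes=5))
```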
The probabilistic co-occurrence relationships of multiple
pairs of objects are captured using Eq. (5) and further
assembled into multiple Bayesian networks. We construct
a set of Bayesian networks B = {B1, B2, . . .}, with one
for each semantic space S = {S1, S2, . . .}. Each semantic
space corresponds to one room category, such as kitchen,
bedroom, or bathroom. An example of a Bayesian network
is illustrated in Fig. 6, demonstrating common object classes
found in a kitchen and their conditional dependencies.

Fig. 6: An example Bayesian network with nodes Kitchen,
Stove, Fridge, Microwave, Counter, Kettle, and Cup.
For each geometric room ri in the environment, we collect the set
of object classes Ei = {c1, c2, . . .} that are observed in the
room ri. Recall that we keep a class probability distribution
for each map object, so we cannot draw a deterministic
conclusion regarding the presence of a certain object class
in the room ri. However, to keep the problem tractable, we
assume the presence of object class ck if, for some object oj in
the room, the probability of the object oj being in class ck
exceeds some threshold λ: ck ∈ Ei ⟺ ∃j, pj(ck) > λ.
Given the evidence set Ei, we only consider the Bayesian
networks in B that contain the target object class cT and
share nodes with Ei; name this new set Bi. By doing so,
we narrow down the possible semantic space categories for
the geometric room ri to a subset Si, which corresponds to
Bi. For each Bayesian network Bj ∈ Bi, we can compute
the probability of finding the target object instance oT in the
room ri based on the evidence Ei, denoted as P(cT | Ei, ri, Bj).
We can then infer the probability of finding the target
object instance oT in the same room ri by feeding Ei into
the Bayesian network set Bi:

P(cT | Ei = {c1, c2, . . .}, ri) = max_{Bj∈Bi} P(cT | Ei, ri, Bj).

This probability is computed for all geometric rooms.
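To illustrate the room-level inference, the sketch below evaluates P(cT | Ei, ri, Bj) on one small hand-built network by brute-force enumeration; the structure and all probability values are illustrative assumptions, not learned parameters, and every node is a binary present/absent variable.

```python
import itertools

# PARENTS[n] lists the parents of node n; CPT[n] maps a tuple of parent
# values to P(n = present | parents). All values below are assumptions.
PARENTS = {"kitchen": (), "stove": ("kitchen",), "fridge": ("kitchen",),
           "counter": ("kitchen",), "cup": ("counter",)}
CPT = {"kitchen": {(): 0.2},
       "stove":   {(0,): 0.01, (1,): 0.7},
       "fridge":  {(0,): 0.02, (1,): 0.8},
       "counter": {(0,): 0.05, (1,): 0.9},
       "cup":     {(0,): 0.10, (1,): 0.6}}
NODES = list(PARENTS)

def joint(assign):
    """P(full assignment) as the product of the CPT factors."""
    p = 1.0
    for n in NODES:
        pa = tuple(assign[q] for q in PARENTS[n])
        p_true = CPT[n][pa]
        p *= p_true if assign[n] else 1.0 - p_true
    return p

def query(target, evidence):
    """P(target present | evidence classes present) by enumeration."""
    num = den = 0.0
    hidden = [n for n in NODES if n not in evidence]
    for values in itertools.product((0, 1), repeat=len(hidden)):
        assign = dict(zip(hidden, values), **{e: 1 for e in evidence})
        p = joint(assign)
        den += p
        if assign[target]:
            num += p
    return num / den

# Evidence E_i = {fridge, counter} observed in room r_i; target class "cup".
print(query("cup", {"fridge", "counter"}))
```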
V. PLANNING APPROACH
We now describe how the intermediate goal is determined
for the robot on the fly and how the optimal policy for
reaching the intermediate goal is computed.
A. Robot Model
The robot is a stochastic dynamic system and can be
represented by a Markov decision process (MDP) Mt =
⟨S, A, P, R, F⟩ with the following components:
— S is the discrete state space, representing the mean of the
Gaussian distribution of the robot position. The mean of the
robot's position is discretized and clipped to the closest grid
cell in the grid map Gt to avoid an infinite state space.
— A is a set of actions. We consider eight actions that allow
the robot to move horizontally, vertically, and diagonally to
reach its eight neighboring grid cells. A low-level controller
is used to map the actions into robot commands.
— P : S × A × S → [0, 1] is the transition probability function, where P(· | s, a) represents the probability distribution
over next states given an action a taken at the current state
s. For example, for the move-up action, the robot has a high
probability of moving up one cell, but it also has a small
probability of moving to the upper-left or upper-right cell.
— R : S × A × S → R is the reward function, where
R(s, a, s′) is the reward for executing action a ∈ A at state
s ∈ S and reaching next state s′ ∈ S.
— F ⊂ S is the set of (intermediate) goal states, which are
determined on the fly, as described in Section V-C.
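A minimal sketch of this MDP is shown below. The 0.8/0.1/0.1 split of the transition noise between the commanded heading and its two neighbors is an assumed example, not the calibrated model, and the reward is stored as a function of s′ to match Eqs. (6) and (7) in the next subsection.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

State = Tuple[int, int]
Action = int

# Eight compass moves in clockwise order, so that (a +/- 1) % 8 are the
# two headings adjacent to action a.
MOVES: List[State] = [(-1, 0), (-1, 1), (0, 1), (1, 1),
                      (1, 0), (1, -1), (0, -1), (-1, -1)]

@dataclass
class SearchMDP:
    free_cells: Set[State]                  # S: free cells of the grid map G_t
    reward: Callable[[State], float]        # R(:, :, s'), Eq. (6) or (7)
    goals: Set[State] = field(default_factory=set)  # F: intermediate goals

    def transitions(self, s: State, a: Action) -> Dict[State, float]:
        """P(s' | s, a): the commanded heading with probability 0.8 and
        the two adjacent headings with probability 0.1 each (assumed)."""
        out: Dict[State, float] = {}
        for da, p in ((0, 0.8), (-1, 0.1), (1, 0.1)):
            di, dj = MOVES[(a + da) % 8]
            nxt = (s[0] + di, s[1] + dj)
            if nxt not in self.free_cells:
                nxt = s                     # blocked moves keep the robot put
            out[nxt] = out.get(nxt, 0.0) + p
        return out
```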
B. Reward Shaping
To compute policies that can drive the robot to the frontier
region Ft or the visibility region Vt, for exploration or re-observation respectively, we define two reward functions accordingly.
1) Reward function for reaching Ft: The reward function
R(s, a, s′) is designed as:

R(:, :, s′) = P(x ∈ ej | s′) · P(cT | E, ri) · |ej|,    (6)

where P(x ∈ ej | s′) is the probability of the robot being at
frontier edge ej if its mean position is s′, P(cT | E, ri) is
the probability of finding the target object class cT in the geometric
room ri where edge ej lies given the current evidence E, and
|ej| is the size of the frontier edge, representing the possible
information gain from exploring ej. P(x ∈ ej | s′) can be
calculated by first discretizing the robot's Gaussian position
distribution (with mean at s′) based on Gt and then summing
up the probability of the robot being at each cell that belongs to
ej. P(cT | E, ri) is calculated using the Bayesian networks,
as discussed in Section IV-C.
2) Reward function for reaching Vt: The reward function
R(s, a, s′) is designed as:

R(:, :, s′) = P(x ∈ Vt | s′),    (7)

which is the probability of the robot being in the visibility region
Vt if its mean position is s′.
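The sketch below evaluates both reward functions for a candidate next state s′, approximating the discretized Gaussian mass of a cell by its density times unit cell area; the covariance and the cell lists are assumed inputs.

```python
import numpy as np
from scipy.stats import multivariate_normal

def frontier_reward(s_next, edge_cells, p_target, sigma_p):
    """Eq. (6): P(x in e_j | s') * P(c_T | E, r_i) * |e_j|."""
    rv = multivariate_normal(mean=s_next, cov=sigma_p)
    p_on_edge = sum(rv.pdf(c) for c in edge_cells)  # density * unit cell area
    return p_on_edge * p_target * len(edge_cells)

def visibility_reward(s_next, visibility_cells, sigma_p):
    """Eq. (7): P(x in V_t | s')."""
    rv = multivariate_normal(mean=s_next, cov=sigma_p)
    return sum(rv.pdf(c) for c in visibility_cells)
```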
C. Goal Determination
We use an optimistic approach in determining the intermediate goal. If the probability of the object of interest oI being
in the target object category cT exceeds a threshold τ, i.e.,
pI(cT) > τ, then the intermediate goal is to re-observe the
object of interest oI, and the reward function is as defined in
(7). Otherwise, the intermediate goal is to explore the frontier
region, and the reward function is as defined in (6).
D. Planner
The MDP Mt and the selected reward function R are
fed into a planner based on the Real-Time Dynamic Programming (RTDP) algorithm [35] to compute an optimal
policy π∗ that maximizes the expected sum of rewards, i.e.,
the value function V. The value function V for starting state s ∈ S
under policy π is defined as follows:

V^π(s) = Eπ[ Σ_{t=0}^{∞} γ^t R(st, at, st+1) ],

where π : S → A is a deterministic policy over Mt mapping
a state to an action, and γ ∈ [0, 1) is a discounting factor.
The optimal policy π∗ is computed as follows: for all s ∈ S,

π∗(s) = arg max_{π∈Π} V^π(s).

The RTDP algorithm allows us to compute a semi-optimal
policy in a short time, unlike a more traditional approach such
as value iteration [36]. As the robot carries out the semi-optimal
policy, the policy is continuously improved by
the RTDP algorithm, with the current robot mean position as
the initial state s0, and converges to the optimal policy.
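A minimal RTDP sketch over the SearchMDP above follows. Values are zero-initialized rather than seeded with an admissible heuristic, and the trial count, horizon, and discount are assumed tuning values, so this is a simplification of the planner described in [35].

```python
import random

def rtdp(mdp, s0, gamma=0.95, trials=200, horizon=100):
    """Greedy trials from s0 with Bellman backups on visited states."""
    V = {}                                   # value table, default 0.0

    def q(s, a):
        return sum(p * (mdp.reward(s2) + gamma * V.get(s2, 0.0))
                   for s2, p in mdp.transitions(s, a).items())

    for _ in range(trials):
        s = s0
        for _ in range(horizon):
            if s in mdp.goals:
                break
            qs = [q(s, a) for a in range(8)]
            V[s] = max(qs)                   # Bellman backup at s
            a = max(range(8), key=lambda i: qs[i])
            # Sample the successor from P(. | s, a) to continue the trial.
            nxts, ps = zip(*mdp.transitions(s, a).items())
            s = random.choices(nxts, weights=ps)[0]
    return lambda s: max(range(8), key=lambda a: q(s, a))   # greedy policy
```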
E. Adaptation
The fused map Et, frontier region Ft, and visibility region
Vt are updated at every time instance t based on the ever-improving semantic SLAM results. Consequently, once the
robot reaches an intermediate goal state, the MDP model
Mt needs to be updated based on the new fused map Et. We
call this process the adaptation of the MDP model. Next,
the corresponding policy π also needs to be updated.
Specifically, the following components are adapted: (a) the
discrete state space S, to match the changing grid map Gt; (b)
the transition probability function P; (c) the reward function
R, based on Eqs. (6) and (7); and (d) the set of intermediate
goal states F, as Ft and Vt change. The RTDP planner takes
the updated MDP model Mt to generate a new policy.
VI. EXPERIMENTS
We have performed experiments on the Matterport3D
(MP3D) [37] dataset using the Habitat [38] simulator. The MP3D
dataset contains 3D scenes of common indoor environments,
and the Habitat simulator allows the robot to navigate the
virtual 3D scenes. Given towel as the target object
category, the robot's objective is to find any instance of the
target with a confidence level greater than 1 − ε = 0.99.
The action space of the robot is continuous, consisting of its
angular and linear velocities. An episode is successful only if
the agent stops its motion when it has identified the target
within a specific time budget (1K seconds). Five episodes
are run for each method, with the robot placed at a random
position at the beginning of each episode. Two shots of the
MP3D scene are given in Fig. 7. This particular scene is
6.4m × 8.4m in size and has one level, ten rooms, and
187 objects. The robot start positions and the target object
instances are visualized in Fig. 8, represented as blue boxes
and yellow diamonds, respectively.
The accompanying video shows the robot’s operations to
find target objects with different starting locations.
A. Semantic SLAM results
We present the semantic SLAM results obtained during
one episode. Our evaluation focuses on the accuracy and
uncertainty of the collected results.
Fig. 7: Two snapshots of the MP3D scene.
Fig. 8: Target objects (yellow diamonds) and robot start
positions (blue boxes).
1) Accuracy: We calculate the mean and the median of
the error between the predicted objects' positions and the
ground-truth objects' positions:

Mean = (1/n) Σ_{i=1}^{n} ‖p̂i − pi‖,    Median = median(‖p̂i − pi‖),

where n is the current number of objects, p̂i is the estimated
object position, and pi is the ground-truth object position.
Their variations with respect to time are plotted in Fig.
9. For reference, the number of identified objects at each
time instance is also plotted in Fig. 10. We can see that
the error increases for the first few seconds as new objects
are identified. Nonetheless, as more observations come in,
the error decreases and converges. We can also see that the
calculated average error appears larger than the median value.
This is because some objects only receive one observation.
As a result, their positions are not updated, contributing to
a large error. For this reason, the median error, which stays
at a reasonable level, is considered the more sensible metric
in our case.
In the same spirit, we also calculate the cross entropy
between the predicted objects' classes and the ground-truth
objects' classes:

−(1/n) Σ_{i=1}^{n} Σ_{c∈C} p_i^gt(c) log pi(c),

where pi(·) is the predicted object class distribution and p_i^gt(·) is the ground-truth object class distribution, taking the value one at the
corresponding object class and zero elsewhere. The results
are plotted in Fig. 11. We can see that the cross entropy
gradually decreases with time, showing that the predicted
objects' classes converge to the correct results.
2) Uncertainty: Though we do not claim our method
to be an active SLAM method, we observe a decrease
in semantic map uncertainty as the robot progresses.

Fig. 9: Mean and median of the error between the predicted
objects' positions and the ground-truth objects' positions.

Fig. 10: Number of identified objects with respect to time.

The
average A-opt (sum of the covariance matrix eigenvalues),
D-opt (product of covariance matrix eigenvalues), and E-
opt (largest covariance matrix eigenvalue) of the map object
position covariance are calculated. Their evolution over time
is plotted in Fig. 12. The spikes in the graph indicate the
identification of new objects, hence the increased position
uncertainty. However, as time goes by and more observations
come in, we can see that all three metrics are kept at a low
level. This shows that the robot can estimate the objects’
position fairly confidently.
Fig. 13 gives a more intuitive representation: we plot the
Gaussian functions whose means and covariances are set to
the estimated object position means and covariances.
Therefore, each "bell" in the plot represents one object.
Comparing the results obtained at time instants t = 8s
and t = 70s, we can see that at t = 70s the bells' peaks
increase and their bases shrink, indicating more certain
estimates of the objects' positions.
The entropy of the predicted object classes is also calculated,

−(1/n) Σ_{i=1}^{n} Σ_{c∈C} pi(c) log pi(c),

and visualized in Fig. 14. The result suggests that as time
progresses, the robot becomes increasingly certain about the
object classes it predicts.
B. Planning results
We evaluate the performance of our planning method
against other navigation strategies: FE-SS, a frontier exploration
method [39] with rewards defined by (6) and equipped
with our custom semantic SLAM; and Ours-NS, an ablation
of our method without semantic prior knowledge that uses
a uniform reward.

Fig. 11: Cross entropy between the predicted objects' classes
and the ground-truth objects' classes.

Fig. 12: The evolution of the object position covariance with
respect to time.

To evaluate all methods, we report the
following metrics:
following metrics:
Success: percentage of successful episodes
Average path length: average length of the path taken by
the agent in an episode.
Success weighted by path length (SPL) [40]:
1
NPN
i=1 Sili
max(pi,li), where li=length of the shortest
path between goal and the visibility region of target instance
for an episode, pi=length of the path taken by the agent in
an episode, Si=binary indicator of success in episode i.
Average planning time: average time spent on planning
(excluding the time in action).
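For reference, the SPL metric can be computed as in the short sketch below; the three toy episodes are illustrative.

```python
def spl(successes, shortest, taken):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    terms = [s * l / max(p, l) for s, l, p in zip(successes, shortest, taken)]
    return sum(terms) / len(terms)

print(spl([1, 0, 1], [3.0, 4.0, 5.0], [4.5, 6.0, 5.0]))  # toy episodes
```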
The testing results are summarized in Table I. Our method
outperforms the other methods in success rate, SPL, and path
length by a large margin. Although our method is not
guaranteed to dominate in planning time, it still exhibits
an advantage, with the shortest average planning time over
all episodes.
The ablation study is conducted by denying our method access to
the semantic prior knowledge. A significant
performance drop is observed on all metrics after that,
demonstrating the efficacy of using semantic prior knowledge to
guide the search for the target object.
Fig. 13: Object position covariance at time instant t = 8s (a)
and time instant t = 70s (b).
Fig. 14: The evolution of the predicted object class entropy
with respect to time.
TABLE I: Comparison study
Method Success Path length (m) SPL Planning time (s)
Ours 0.96 4.984 0.46 80.338
FE-SS 0.80 6.796 0.39 105.514
Ours-NS 0.68 8.726 0.41 97.716
VII. CONCLUSIONS
We presented a novel approach to tackle the open problem
of enabling a robot to search for a semantic object in an
unknown and GPS-denied environment. Our approach com-
bines semantic SLAM, Bayesian Networks, Markov Decision
Process, and real-time dynamic planning. The testing results
on the Matterport3D dataset demonstrate both the effectiveness and efficiency of our approach. Moreover, while our
approach is unique in incorporating semantic object information to search for semantic targets, we evaluated its motion
planning performance by comparing it to a non-semantic
baseline planning method and by conducting an ablation study.
The results show that our approach has a higher success rate,
shorter path length, and less planning time. In the next step,
we will consider extending our approach to more complex,
compound semantic tasks and tasks that require the robot to
interact with objects.
REFERENCES
[1] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel, “Real-time
monocular object slam,” Rob. Auton. Syst., vol. 75, pp. 435–449, 2016.
[2] L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual
quadrics from object detections as landmarks in object-oriented slam,”
IEEE Robot. Autom. Lett. (RA-L), vol. 4, no. 1, pp. 1–8, 2018.
[3] Z. Qian, K. Patath, J. Fu, and J. Xiao, “Semantic slam with au-
tonomous object-level data association,” in IEEE Int. Conf. Robot.
Autom. (ICRA), 2021, pp. 11 203–11 209.
[4] S. Yang and S. Scherer, “Cubeslam: Monocular 3-d object slam,” IEEE
Trans. Robot. (T-RO), vol. 35, no. 4, pp. 925–938, 2019.
[5] L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, “Semantic
slam based on object detection and improved octomap,” IEEE Access,
vol. 6, pp. 75 545–75 559, 2018.
[6] C. Leung, S. Huang, and G. Dissanayake, “Active slam using model
predictive control and attractor based exploration,” in IEEE Int. Conf.
Intell. Robots Syst. (IROS). IEEE, 2006, pp. 5026–5031.
[7] T. Kollar and N. Roy, “Trajectory optimization using reinforcement
learning for map exploration,” Int. J. Robot. Res. (IJRR), vol. 27, no. 2,
pp. 175–196, 2008.
[8] A. Kim and R. M. Eustice, “Perception-driven navigation: Active
visual slam for robotic area coverage,” in ICRA. IEEE, 2013, pp.
3196–3203.
[9] S. Ehrenfeld, “On the efficiency of experimental designs,” Ann. Math.
Stat., vol. 26, no. 2, pp. 247–255, 1955.
[10] C. Stachniss, G. Grisetti, and W. Burgard, “Information gain-based ex-
ploration using rao-blackwellized particle filters.” in Robotics: Science
and systems, vol. 2, 2005, pp. 65–72.
[11] J.-L. Blanco, J.-A. Fernandez-Madrigal, and J. González, “A novel
measure of uncertainty for mobile robot slam with rao-blackwellized
particle filters,” IJRR, vol. 27, no. 1, pp. 73–89, 2008.
[12] S. M. LaValle, Planning algorithms. Cambridge university press,
2006.
[13] R. Alterovitz, S. Koenig, and M. Likhachev, “Robot planning in
the real world: Research challenges and opportunities,” Ai Magazine,
vol. 37, no. 2, pp. 76–84, 2016.
[14] A. Stentz, “Optimal and efficient path planning for partially known
environments,” in Intelligent unmanned ground vehicles. Springer,
1997, pp. 203–220.
[15] S. Koenig and M. Likhachev, “Fast replanning for navigation in
unknown terrain,” T-RO, vol. 21, no. 3, pp. 354–363, 2005.
[16] J. Bruce and M. M. Veloso, “Real-time randomized path planning for
robot navigation,” in Robot soccer world cup. Springer, 2002, pp.
288–295.
[17] D. Ferguson, N. Kalra, and A. Stentz, “Replanning with rrts,” in ICRA.
IEEE, 2006, pp. 1243–1248.
[18] J. Vannoy and J. Xiao, “Real-time adaptive motion planning (ramp)
of mobile manipulators in dynamic environments with unforeseen
changes,” T-RO, vol. 24, no. 5, pp. 1199–1212, 2008.
[19] Z. Wang and G. Tian, “Hybrid offline and online task planning
for service robot using object-level semantic map and probabilistic
inference,” Information Sciences, vol. 593, pp. 78–98, 2022.
[20] T. S. Veiga, M. Silva, R. Ventura, and P. U. Lima, “A hierarchi-
cal approach to active semantic mapping using probabilistic logic
and information reward pomdps,” in Int. Conf. Automated Planning
Scheduling, vol. 29, 2019, pp. 773–781.
[21] L. Burks, I. Loefgren, and N. R. Ahmed, “Optimal continuous state
pomdp planning with semantic observations: A variational approach,”
T-RO, vol. 35, no. 6, pp. 1488–1507, 2019.
[22] D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. R. Salakhutdinov,
“Object goal navigation using goal-oriented semantic exploration,”
Adv. Neural. Inf. Process. Syst., vol. 33, pp. 4247–4258, 2020.
[23] G. Georgakis, B. Bucher, K. Schmeckpeper, S. Singh, and K. Dani-
ilidis, “Learning to map for active semantic goal navigation,” arXiv
preprint arXiv:2106.15648, 2021.
[24] R. Zeng, Y. Wen, W. Zhao, and Y.-J. Liu, “View planning in robot
active vision: A survey of systems, algorithms, and applications,”
Comput. Vis. Media., vol. 6, no. 3, pp. 225–245, 2020.
[25] Z. Meng, H. Sun, H. Qin, Z. Chen, C. Zhou, and M. H. Ang,
“Intelligent robotic system for autonomous exploration and active slam
in unknown environments,” in Int. Symp. Syst. Integr. (SII). IEEE,
2017, pp. 651–656.
[26] M. Selin, M. Tiger, D. Duberg, F. Heintz, and P. Jensfelt, “Efficient
autonomous exploration planning of large-scale 3-d environments,”
RA-L, vol. 4, no. 2, pp. 1699–1706, 2019.
[27] A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart,
“Receding horizon ‘next-best-view’ planner for 3d exploration,” in
ICRA. IEEE, 2016, pp. 1462–1468.
[28] S. J. Russell, Artificial intelligence a modern approach. Pearson
Education, Inc., 2010.
[29] M. Labbé and F. Michaud, “Rtab-map as an open-source lidar and
visual simultaneous localization and mapping library for large-scale
and long-term online operation,” J. Field Robot., vol. 36, no. 2, pp.
416–446, 2019.
[30] R. Bormann, F. Jordan, J. Hampp, and M. Hägele, “Indoor coverage
path planning: Survey, implementation, analysis,” in ICRA. IEEE,
2018, pp. 1718–1725.
[31] Z. Qian, K. Patath, J. Fu, and J. Xiao, “Semantic slam with au-
tonomous object-level data association,” in ICRA. IEEE, 2021, pp.
11 203–11 209.
[32] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,”
arXiv preprint arXiv:1804.02767, 2018.
[33] J. Canny, “A computational approach to edge detection,” IEEE Trans.
Pattern Anal. Mach. Intell., no. 6, pp. 679–698, 1986.
[34] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to informa-
tion retrieval. Cambridge University Press Cambridge, 2008, vol. 39.
[35] T. Smith and R. Simmons, “Focused real-time dynamic programming
for mdps: Squeezing more out of a heuristic,” in AAAI, 2006, pp.
1227–1232.
[36] R. Bellman, “A markovian decision process,” J. math. mech., pp. 679–
684, 1957.
[37] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva,
S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d
data in indoor environments,” arXiv preprint arXiv:1709.06158, 2017.
[38] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain,
J. Straub, J. Liu, V. Koltun, J. Malik et al., “Habitat: A platform for
embodied ai research,” in IEEE Int. Conf. Comput. Vis., 2019, pp.
9339–9347.
[39] B. Yamauchi, “A frontier-based approach for autonomous exploration,”
in IEEE Int. Symp. Comput. Intell. Robot. Autom., 1997, pp. 146–151.
[40] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta,
V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva et al.,
“On evaluation of embodied navigation agents,” arXiv preprint
arXiv:1807.06757, 2018.