Exploration for Object Mapping Guided by
Environmental Semantics using UAVs
Reem Ashour 1,*, Tarek Taha 2, Jorge Manuel Miranda Dias 1, Lakmal Seneviratne 1 and Nawaf Almoosa 3
1 Khalifa University Center for Autonomous Robotic Systems (KUCARS), Khalifa University of Science and Technology (KU), P.O. Box 127788, Abu Dhabi, UAE; firstname.lastname@example.org (J.D.)
2 Algorythma's Autonomous Aerial Lab, P.O. Box 112230, Abu Dhabi, UAE; email@example.com
3 EBTIC, Khalifa University of Science and Technology (KU), P.O. Box 127788, Abu Dhabi, UAE
Received: 9 February 2020; Accepted: 3 March 2020; Published: 10 March 2020
Abstract: This paper presents a strategy to autonomously explore unknown indoor environments,
focusing on 3D mapping of the environment and performing grid level semantic labeling to
identify all available objects. Unlike conventional exploration techniques that utilize geometric
heuristics and information gain theory on an occupancy grid map, the work presented in this
paper considers semantic information, such as the class of objects, in order to gear the exploration
towards environmental segmentation and object labeling. The proposed approach utilizes deep
learning to map 2D semantically segmented images into 3D semantic point clouds that encapsulate
both occupancy and semantic annotations. A next-best-view exploration algorithm is employed
to iteratively explore and label all the objects in the environment using a novel utility function
that balances exploration and semantic object labeling. The proposed strategy was evaluated in a
realistically simulated indoor environment, and the results were benchmarked against other exploration strategies.

Keywords: semantic exploration and mapping; utility function; cost function; semantic mapping; semantic exploration; next-best-view
1. Introduction
The growth in aerial robotics has led to their ubiquitous presence in various fields, such as urban search and rescue (USAR) [ ], infrastructure inspection [ ], surveillance [ ], etc. Some recent research
has focused on USAR activities performed by unmanned aerial vehicles (UAVs) to assist rescue
teams by providing vital information on time-sensitive situations without endangering human lives.
The introduction of UAVs in USAR environments is beneficial when rapid
responses are required to assist ﬁrst responders with locating victims, survivors, and danger sources in
the environment. UAVs’ speed, agility, and ability to navigate in hazardous environments that contain
rubble and obstacles make them an ideal platform for deployment in USAR environments. UAVs with
autonomous exploration, mapping, and navigation capabilities can provide rescuers with valuable
data, such as 2D and 3D maps, victims’ locations, and localized danger sources, in order to improve
their awareness of the situation and respond appropriately.
Maps can provide a variety of information about the environment; they can reflect the structure of the environment, its connections, and semantic knowledge. Therefore, providing first responders in USAR with a rich, informative map in the right representation is an active topic in the USAR robotics field.
Robot pose and sensor data are used to build a globally consistent map in which path planning and
Remote Sens. 2020,12, 891; doi:10.3390/rs12050891 www.mdpi.com/journal/remotesensing
exploration become feasible. In the literature, maps are classified by the way they represent information into three types [ ]: topological [ ], metric [ ], and semantic maps [ ]. The majority of the
existing mapping approaches aim to construct a globally consistent metric map of the environment in which robots operate [ ]. Metric maps do not encapsulate characteristics that are compatible with human conception. Therefore, semantic maps are preferable for high-level objectives, such as detecting and identifying the locations of different objects, path planning, and increasing situational awareness.
Autonomous exploration could be performed to minimize the effort required by the ﬁrst
responders in a USAR environment to search for victims and build awareness maps. Many approaches
have been heavily used in the literature to perform autonomous exploration, such as frontier-based exploration [ ] and information-theory-based exploration [ ]. The concept of the “next-best-view” [ ] was used extensively to guide the next best exploration action, whether using the frontier-based or the information-theory-based concepts. In order to identify the best next exploration
action, a cost function (also called utility function) is used to evaluate various views to identify the one
that maximizes the objectives of the exploration.
Both frontier-based exploration and information-theory-based exploration approaches of
unknown indoor environments use metric (i.e., 2D/3D occupancy) maps without considering the high-level abstraction of the contextual information provided by onboard sensors, such as the class of objects in occupied voxels. Additionally, neither approach filters or classifies which information is of merit to the application-specific environment, which would reduce the time needed to find objects of interest or to label all the objects in the environment.
In this work, an efficient strategy for autonomous, semantically-aware exploration for object labeling and 3D mapping of unknown indoor environments is proposed. The proposed approach utilizes semantic information encapsulated in the proposed 3D, semantically-aware map for object localization and labeling using a UAV with an onboard RGBD camera. New utility functions are proposed to direct the robot towards the objects in the environment. The results show that the proposed strategy is capable of exploring unknown environments and labeling objects effectively. Our contributions can be summarized as follows:
1. A 3D, semantically-aware, colored annotated mapping method that uses a deep learning model to semantically segment the objects in 2D images and project the annotated objects into a 3D, semantically-aware occupancy map. This includes a novel data structure that extends volumetric occupancy grid representations to include more semantic-specific information.
2. The development of a multi-objective utility function that encapsulates the quantified information generated from the semantic map and the volumetric map to allow the robot to explore the unknown space and to direct the robot to visit the objects and label them effectively.
3. An overall exploration strategy that utilizes rapidly-exploring random trees (RRTs) for viewpoint sampling, and the next-best-view approach with the proposed semantic utility functions, to iteratively explore the unknown environment and label all the objects in it.
The remainder of the paper is structured as follows. Section 2 provides an overview of the related work. Section 3 details the methodology presented in this work, wherein Section 3.1 illustrates the proposed strategy and describes its main components, Section 3.2 presents the semantic mapping strategy employed in this work, and Section 3.3 presents the proposed exploration approach and its components. Section 4 shows our experiment setup and the preliminary results of our simulation experiments, and finally, Section 5 summarizes our contributions and provides recommendations for future work.
2. Related Work
Robots deployed in search and rescue missions should possess basic yet very important skills,
such as object detection, 3D mapping, and indoor exploration. The signiﬁcant impact of the unmanned
aerial vehicles (UAVs) in various robotics fields has made them a promising solution for urban search and rescue tasks. For victim detection, a UAV can utilize its onboard sensors and onboard intelligence to accurately detect and locate victims in disaster scenes in order to increase the likelihood of finding victims [ ]. For mapping and 3D reconstruction, the UAV should be capable of
reconstructing the scene in 3D and locating the contextual, rich information that can be used by the rescue team. For exploration, the UAV should be able to select exploration positions that facilitate both the victim detection and 3D reconstruction tasks. In this section, the related work on object detection, semantic mapping, and exploration is provided.
2.1. Object Detection and Classiﬁcation
Victim detection is the most critical task in search and rescue missions, wherein victims should be identified as fast as possible so that the first responders can intervene and rescue them. The object detection task is concerned with discovering instances of semantic objects that belong to a particular class, such as human, table, or chair. In the literature, a variety of mechanisms are used to detect objects, either using a single sensor or by fusing multiple sensors [ ].
Visual approaches to object detection are the most reliable means of finding objects in scenes. Visual object detection has been performed extensively using machine learning methods and, more recently, deep learning techniques. Both depend on the visual information provided by an individual image frame. In machine learning approaches, a feature extraction method, such as the scale-invariant feature transform (SIFT) [ ] or histogram of oriented gradients (HOG) features [ ], is used to define the features. Then, a machine learning model, such as a support vector machine (SVM) [ ], is employed for object classification. In deep learning, neural networks are capable of detecting objects without a separate feature extraction process. There are different types of deep-learning-based object detection techniques, depending on the source of information captured by the onboard sensors, such as 2D images and 3D point clouds.
Utilizing the various information captured by the onboard sensors, such as 2D images and 3D point clouds, different deep learning models have been designed to detect objects. Commonly, convolutional neural networks (CNNs) are used to detect and identify objects in 2D images. The most commonly used deep learning models for object detection in 2D images are the single-shot MultiBox detector (SSD) [ ] and you only look once (YOLO) [ ]. These methods create a bounding box that surrounds the classified objects in the 2D image. Object detection and classification can also be performed using deep learning models that apply semantic segmentation to 2D images, such as [ ]. Alternatively, using depth images, object detection and classification can be performed with deep learning networks that segment a point cloud and classify the segments into different classes, such as PointNet [ ], PointNet++ [ ], and Frustum PointNets [ ].
2.2. Semantic Mapping
A semantic map is an augmented representation of the robot’s environment that encapsulates
characteristics compatible with human conception. Exploration and path planning algorithms beneﬁt
from the additional information that the semantic maps provide about the surrounding environment.
Most existing mapping strategies intend to build a globally consistent metric map of the environment in which robots operate [ ]. In USAR, the robot should be capable of guiding the first responders to the victim's location safely. Therefore, robots should acquire more advanced skills than metric mapping; the robot should understand the surrounding environment. A solution can be found in semantic mapping, which is a qualitative representation of the robot's surrounding environment that augments exploration and task planning. In the literature, since contemporary robots use metric maps to explore the environment, most of the work on semantic maps uses metric maps as a basis and builds semantic attributes on top of them; see [33,34].
Two of the most commonly used autonomous exploration approaches are frontier-based approaches and information-theory-based approaches. In frontier-based approaches, the robot navigates to the boundaries between free space and unexplored space by evaluating a cost function to maximize map coverage [ ]. Utility functions, such as path cost and expected information gain, are usually applied to evaluate the next candidate position (i.e., frontier) to visit. In information-theory-based approaches, however, the robot navigates to a candidate position (i.e., viewpoint sample) to maximize information gain using a probabilistic map representation [ ]. Therefore, a good information gain prediction algorithm plays a vital role in both methods. During exploration, different types of maps can be created to either reconstruct the 3D scene or present the scene with varying types of information depending on their semantics, utilizing the data obtained from the onboard sensors [ ]. Advanced variant extensions of the frontier-based algorithm were developed in [35–37].
The commonly used exploration techniques of unknown indoor environments use metric
(i.e., 2D/3D occupancy) maps without considering the high-level abstraction of the information provided by
onboard sensors, such as the class of objects in occupied voxels. These methods may work locally on
simple environments but they are unlikely to function well in cluttered and dynamic environments.
These drawbacks motivate the need for a new efﬁcient exploration strategy that evaluates the contextual
information from the environment. Fortunately, current technological advancements in hardware and
software algorithms (such as deep learning) have provided a new research direction by encapsulating semantics in occupancy maps, which allows efficient object labeling and localization.
The approach applied in [ ] used the contextual information encapsulated in semantic maps to enhance the exploration process by maximizing the use of available resources. T. Dang et al. [ ] proposed an autonomous exploration model that explores
an unknown environment and simultaneously searches for objects of interest. The proposed method
utilizes a deep learning model that classiﬁes the objects from 2D images and projects them into a 3D
semantic occupied map. Each object of interest occupies a voxel in the 3D map with a speciﬁc color.
The exploration process uses the 3D semantic occupied map to search for objects of interest and to
explore the unknown environment. T. Dang et al. [ ] employed a visual saliency model to incrementally build an annotated map based on visual importance. The exploration planner simultaneously
optimizes the exploration of unknown spaces and directs the robot towards the salient objects.
In contrast to these methods, we ﬁrstly propose creating a new, 3D, semantically-aware occupancy
map structure that carries the occupancy information and the semantic annotations to be utilized in the
autonomous exploration task. A deep learning model is utilized to semantically segment the objects
from 2D images and project them into a 3D map where each object is labeled with a unique color,
as illustrated in Section 3.2. Secondly, we propose a new multi-objective utility function that uses the
semantic information to direct the robot towards the objects in the environment and label them as
shown in Section 3.3. Our work differs from the studies in the literature: instead of using only an occupancy map for autonomous exploration, we utilize the contextual information from the 3D semantic map to explore the environment and label all the objects. In addition, the semantic map utilizes a deep learning model to semantically segment the objects in 2D, which provides an accurate projection of the objects in the 3D semantic map. Additionally, our proposed utility function encapsulates the contextual information by engaging either the confidence value of an object belonging to a particular class or the number of visits for objects of interest, thereby facilitating the labeling purpose.
3. Materials and Methods
3.1. Proposed Approach
In this work, a bounded exploration area V is initially unknown and has to be explored. Our exploration strategy attempts to create a 3D map M and find a collision-free path σ from an initial position ξ0 that leads to the following: (a) identify the free parts of the whole exploration area, and (b) maximize the information gathered to label objects in the environment using two sources of information: (1) the depth information captured by the robot's sensor and (2) the semantic information obtained from a deep learning model.
Figure 1 shows the proposed system architecture for semantically-aware exploration for object labeling and 3D mapping. The proposed system architecture consists of two main modules: (1) mapping and information quantification, and (2) autonomous exploration. The proposed approach aims to provide the first responders in USAR scenarios with a rich environmental representation.
Figure 1. Proposed semantically-aware exploration for object labeling and 3D mapping: system architecture.
In the ﬁrst module, the 3D semantic-occupied map is generated by using a deep learning model
to segment the objects in 2D images into different labels. To perform this step, semantic segmentation is used to perform object classification and segmentation at the pixel level and assign each pixel to a
particular class. The point clouds corresponding to the image pixels are annotated using the output
from the deep learning model. The annotated point clouds are used to create a 3D semantic occupancy
map. The 3D semantic-occupied map is divided into small cubical voxels with an edge length equal to the map resolution. Each voxel stores three types of information, occupancy, color, and confidence values, which correspond to the voxel's vacancy, class, and certainty, respectively.
Initially, all the voxels are assigned to be unknown. After the initial data gathering from the onboard sensors, the 3D semantic occupancy map is updated to contain free voxels V_free ∈ V and occupied voxels V_occ ∈ V, along with updated information in each voxel about the number of visits, confidence values, and class type. Map updates are subject to the limited field of view (FOV) and effective sensing distance constraints.
In the second module, the autonomous exploration module uses the contextual information
obtained from the mapping stage. In this work, we adopted the NBV algorithm to determine the next
best view that fulﬁlls the ultimate goal to label all objects in the environment eventually. The exploration
module involves three stages: (i) viewpoint sampling by employing a receding horizon, (ii) viewpoint evaluation, and (iii) navigation and termination.
Two new multi-objective utility functions are proposed to evaluate each candidate viewpoint
sample in terms of two criteria: (i) information gain obtained from exploring new space that has not
been mapped yet; (ii) the information gain obtained from semantically labeled objects with either
lower conﬁdence or visits values for the object of interest to direct the robot to label the objects in the
unknown environment accurately.
The candidate that maximizes the gain is selected to be the next best view, and the robot moves to
that position. The whole process is repeated in a receding-horizon fashion. Paths are only planned in the free space V_free. Within the algorithm's implementation, the robot configuration is given as ξ = (x, y, z, ψ), where roll and pitch are considered near-zero. In this work, the following assumptions were made: (i) the robot's position was assumed to be fully known, since the simulation
experiment was performed on the Gazebo simulator, and the position was obtained directly through
the simulated models of the UAV (robot localization is out of the scope of this research); (ii) the
environment is unknown; and (iii) the evaluation of utility functions for the exploration process is
performed on a voxel scale rather than object scale.
3.2. Mapping and Information Quantiﬁcation
The procedure for the mapping and information quantiﬁcation is provided in Figure 1. It involves
three stages: scene understanding, where semantic segments are generated for objects found in a
2D image frame; annotated point cloud generation, where the point cloud captured from the depth
camera is annotated with the corresponding class from the deep learning model output; and the 3D, semantically-aware mapping, which registers the annotated point cloud into a 3D occupancy grid map. The stages are explained in detail in the following sections.
3.2.1. 2D-Image-Based Semantic Segmentation and Semantic Fusion
The pyramid scene parsing network (PSPNet) [ ], a deep neural network for semantic segmentation, is employed to provide semantic segments for the different objects in 2D images. The network architecture is shown in Figure 2. PSPNet operates in five steps as follows:
(i) create a feature map using a ResNet backbone, (ii) apply pyramid pooling on the created feature map to produce feature maps at varying resolutions, (iii) perform convolution on each pooled feature map, (iv) create a final feature map by upsampling the pooled feature maps and concatenating them, and (v) generate a class score map by employing final convolutions.
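The pyramid pooling step (ii)–(iv) can be illustrated with a minimal NumPy sketch. This is not the actual PSPNet implementation: the bin sizes follow the commonly cited (1, 2, 3, 6) configuration, the toy feature map is an assumption, and constant broadcasting stands in for learned convolution and bilinear upsampling.

```python
import numpy as np

def pyramid_pooling(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Sketch of PSPNet-style pyramid pooling: average-pool the feature
    map into coarse grids, broadcast each pooled value back over its
    region, and concatenate with the original along the channel axis."""
    c, h, w = feature_map.shape
    pooled_levels = [feature_map]
    for b in bin_sizes:
        level = np.zeros((c, h, w))
        ys = np.linspace(0, h, b + 1).astype(int)
        xs = np.linspace(0, w, b + 1).astype(int)
        for i in range(b):
            for j in range(b):
                # average over one pooling bin, broadcast back to its region
                region = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                level[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = \
                    region.mean(axis=(1, 2), keepdims=True)
        pooled_levels.append(level)
    return np.concatenate(pooled_levels, axis=0)

fm = np.random.rand(4, 12, 12)   # toy feature map: 4 channels, 12x12
out = pyramid_pooling(fm)
print(out.shape)                 # (20, 12, 12): original + 4 pooled levels
```

The coarsest level (bin size 1) captures global context, while the finer levels preserve progressively more spatial detail; concatenation lets the final classifier see all scales at once.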
Figure 2. Pyramid scene parsing network (PSPNet). Extracted from [ ].
The PSPNet model was trained [ ] on two datasets, SUN RGB-D [ ] and ADE20K [ ]. The ADE20K dataset includes a large vocabulary of 150 classes of both indoor and outdoor objects, whereas the SUN RGB-D dataset contains 38 semantic classes for indoor scenes. The fact that the ADE20K dataset contains a larger number of object classes made it the preferable choice for segmenting an unknown environment with a diversity of objects. Examples of 2D images with semantic segments are shown in Figure 3a,b, respectively. The RGB images are obtained using the simulated sensor in the Gazebo simulator, which mimics the characteristics of an actual camera. In this work, only the ADE20K dataset is used; its semantic labels are shown in Figure 4.
(a) RGB image/color textured. (b) Semantic segmentation for a 2D image.
Figure 3. RGB image and semantic segment image.
Figure 4. Semantic labels of ADE20K data set in BGR format.
Semantic information encapsulated in the point cloud is fused using the “max fusion” approach [ ]. In the max fusion approach, the semantic color that corresponds to the highest confidence value generated from the CNN model is included in the generated point cloud. In addition, the same confidence value and label color are saved in the corresponding voxel of the OctoMap.
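The max fusion rule itself reduces to a per-pixel argmax. The sketch below is illustrative only: the class palette and score tensor are hypothetical, not the paper's actual classes or network output.

```python
import numpy as np

# Hypothetical class palette: class id -> RGB label colour
PALETTE = {0: (255, 0, 0), 1: (0, 255, 0), 2: (0, 0, 255)}

def max_fusion(class_scores):
    """Per-pixel max fusion: keep the label colour and confidence of the
    class with the highest CNN score.  class_scores: (n_classes, h, w)."""
    best_class = class_scores.argmax(axis=0)   # (h, w) winning class ids
    confidence = class_scores.max(axis=0)      # (h, w) winning scores
    h, w = best_class.shape
    colors = np.zeros((h, w, 3), dtype=np.uint8)
    for cls, rgb in PALETTE.items():
        colors[best_class == cls] = rgb
    return colors, confidence

scores = np.array([[[0.1, 0.7]], [[0.8, 0.2]], [[0.1, 0.1]]])  # 3 classes, 1x2 image
colors, conf = max_fusion(scores)
print(colors[0, 0], conf[0, 0])   # pixel 0: class 1 (green) wins with 0.8
```

The returned colour and confidence pair is exactly what is propagated into the point cloud and, later, into the corresponding map voxel.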
3.2.2. Annotated Point Cloud Generation
A point cloud is a collection of points in an unordered structure, where each point contains a coordinate in a specific reference frame. To generate a point cloud, the depth image is first registered in the same reference frame as the color image, which is usually the camera frame. After that, the pixels are transformed from the camera frame to the world frame, using each pixel's image position, its depth, and the intrinsic camera parameters, to form the point cloud. An example of the point cloud generation process is illustrated in Figure 5. The point clouds are color-textured; therefore, the points have the same colors as the objects presented in the image frame, as shown in Figure 5c. An illustration of the point cloud data structure when storing semantic information using max fusion is shown in Figure 6.
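The back-projection described above follows the standard pinhole model; a minimal sketch (the intrinsics and toy depth image below are illustrative assumptions, not the paper's actual sensor parameters):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a registered depth image into a 3D point cloud using
    the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop invalid (zero-depth) pixels

depth = np.full((4, 4), 2.0)          # toy 4x4 depth image, 2 m everywhere
pts = depth_to_points(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(pts.shape)                      # (16, 3)
```

In the actual pipeline each point would also carry the RGB texture and, after max fusion, the semantic colour and confidence of its source pixel.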
(a) Color image. (b) Registered depth image.
(c) Generated point cloud.
Figure 5. Point cloud generation.
Figure 6. Max fusion.
3.2.3. 3D Semantic Map Generation
To improve the semantic exploration strategies, the object classes detected and localized per depth and color sensor input are used to obtain a 3D semantic-occupied annotated map representation of the environment.
The proposed 3D occupied semantic map structure is based on an occupancy grid map called OctoMap [ ], which uses an octree structure. An octree is a hierarchical data structure that divides the space into cubes of the same size called voxels. Each division is then recursively subdivided into eight sub-volumes until a given minimum voxel size is reached, as shown in Figure 7. The resolution determines the minimum voxel size. The map M consists of cubical voxels m_i, indexed by i. The voxels in traditional occupancy maps can only hold volumetric
information. However, in addition to the volumetric information, the proposed semantic occupancy
map adds semantic information to the voxels. As a result, each voxel m_i in the proposed map holds both volumetric and semantic information. The semantic information includes the semantic color, the confidence value, and the number of visits. The confidence value is the highest confidence value generated by the CNN.
Figure 7. Right: an octree example where the free voxels are shaded white and occupied voxels are shaded black. Left: the volumetric model that shows the corresponding tree representation on the right.
Each voxel m_i holds the following information:
1. Position (x, y, z);
2. Probability of being occupied, Po;
3. Semantic color (r, g, b);
4. Confidence value c(x) of the class type (cs);
5. Number of visits.
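As a rough sketch, the per-voxel record could be modelled as follows. The field names and the max-fusion-style update method are illustrative assumptions, not the paper's actual data structure, which lives inside the octree nodes of the extended OctoMap.

```python
from dataclasses import dataclass

@dataclass
class SemanticVoxel:
    """Sketch of the per-voxel record in the semantic occupancy map:
    position, occupancy probability, label colour, confidence, visits."""
    x: float
    y: float
    z: float
    p_occupied: float = 0.5          # unknown until observed
    color: tuple = (255, 255, 255)   # white = unknown class
    confidence: float = 0.0
    visits: int = 0

    def observe(self, p_occupied, color, confidence):
        """Update with a new observation (max-fusion style: keep the most
        confident class seen so far) and count the visit."""
        self.p_occupied = p_occupied
        if confidence > self.confidence:
            self.color, self.confidence = color, confidence
        self.visits += 1

v = SemanticVoxel(1.0, 2.0, 0.5)
v.observe(0.9, (0, 255, 0), 0.8)
print(v.color, v.visits)   # (0, 255, 0) 1
```

The visits counter and confidence value are exactly the two semantic quantities that the proposed utility functions later consume.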
Initially, all voxels are assigned to be unknown, where Po = 0.5, colored in white, and given a confidence value of 0. At every point cloud view, the input data to the mapping and exploration module are a 2D image and a 3D point cloud that consists of positional (x, y, z) and semantic (r, g, b, confidence) information. The color indicates a certain class type, not the texture color.
The point clouds are used to build the voxels in the semantic-occupied map. This information is fed
back into the map, and the corresponding position and semantic color are assigned to the voxels
located within the camera FOV. The occupancy value Po, which represents the probability of being occupied, is assigned to the observed voxels using volumetric mapping. Simultaneously, the class type is assigned to each voxel using its semantic color indication. The visualized constructed map
is colored according to the object class.
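The occupancy part of this update can be sketched with the standard log-odds scheme that octree-based occupancy maps such as OctoMap use; the dictionary-based grid and the sensor-model value of 0.7 below are illustrative assumptions.

```python
import math

def logodds(p):
    return math.log(p / (1.0 - p))

def update_cell(grid, key, p_hit=0.7):
    """Standard occupancy-grid log-odds update: add the sensor model's
    log-odds to the cell's running total, then convert back to probability."""
    grid[key] = grid.get(key, 0.0) + logodds(p_hit)   # 0.0 = unknown (p = 0.5)
    return 1.0 / (1.0 + math.exp(-grid[key]))

grid = {}
for _ in range(3):                    # three consistent "hit" observations
    p = update_cell(grid, (4, 2, 1))
print(round(p, 3))                    # 0.927: repeated hits push p towards 1
```

Working in log-odds makes repeated observations additive and keeps the update numerically stable, which is why it is the conventional choice for occupancy grids.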
3.3. Semantically-Aware Exploration
The proposed exploration strategy provides the robot with the ability to explore unknown
environments, while simultaneously optimizing information gathering and directing the robot to label
objects in the environment. To enable this functionality, two new multi-objective utility functions
are proposed to account for the semantic information (conﬁdence or number of visits) that each
voxel encapsulates. The proposed system used information from the semantic-occupied map to
evaluate the next best exploration action. A technique used to explore an unknown environment
is the next-best view (NBV) approach. The main steps in the exploration task are: (A) viewpoint
sampling, (B) viewpoint evaluation, (C) navigating toward the selected viewpoint, and (D) termination.
The exploration process is summarized in Figure 8. At the beginning of the exploration process, a robot
uses the onboard sensors to observe the scene and produce a set of viewpoint candidates (also known
as states or poses). The accepted viewpoint candidates are then evaluated using a utility function (also
known as reward, cost, or heuristic function). The viewpoint which maximizes the utility function
is selected as the next goal and is called “next-best view”. The exploration process is repeated until
a termination condition is met, indicating the end of the exploration. Algorithm 1 is used as a path planning procedure adapted from [ ].
Figure 8. General components of the NBV method.
Algorithm 1 Proposed planner—iterative step.
1: ξ0 ← current robot configuration
2: Initialize T with ξ0
3: gbest ← 0
4: nbest ← n0(ξ0)
5: NT ← number of initial nodes in T
6: while NT < Nmax or gbest = 0 do
7:   Incrementally build T by adding nnew(ξnew)
8:   NT ← NT + 1
9:   if InformationGain(nnew) > gbest then
10:    gbest ← InformationGain(nnew)
11:    nbest ← nnew
12:  end if
13:  if NT > NTOL then
14:    Terminate exploration
15:  end if
16:  σRH, nRH, ξRH ← ExtractBestSegment(nbest)
17: end while
18: return σRH
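A minimal Python sketch of this iterative step follows. It is heavily simplified and not the paper's actual implementation: configurations are plain floats, `gain` and `sample_neighbor` are user-supplied stand-ins for the information gain and RRT expansion, and a dict replaces the geometric tree.

```python
import random

def nbv_iteration(gain, sample_neighbor, xi0, n_max=50, n_tol=200):
    """Sketch of the receding-horizon NBV step in Algorithm 1: grow a random
    tree from the current configuration, track the best-gain node, and return
    the first segment on the path towards it."""
    tree = {xi0: None}                    # node -> parent
    g_best, n_best = 0.0, xi0
    while len(tree) < n_max or g_best == 0.0:
        parent = random.choice(list(tree))
        n_new = sample_neighbor(parent)   # expand the tree from a random node
        tree[n_new] = parent
        if gain(n_new) > g_best:
            g_best, n_best = gain(n_new), n_new
        if len(tree) > n_tol:             # safeguard against endless growth
            break
    # walk back to the edge leaving xi0: only the first segment is executed
    node = n_best
    while tree[node] != xi0 and tree[node] is not None:
        node = tree[node]
    return node, g_best

# Toy 1D usage: gain grows with distance from the origin.
random.seed(0)
step, g = nbv_iteration(gain=abs,
                        sample_neighbor=lambda p: p + random.uniform(-1.0, 1.0),
                        xi0=0.0)
```

Executing only the first edge and replanning, as in the original receding-horizon scheme, lets the tree adapt as every new sensor reading updates the map.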
3.3.1. Viewpoint Sampling
The viewpoint samples are the possible positions the robot may visit as a next step. Taking all the possible views into consideration is computationally intractable and expensive [ ]. Hence, a subset of the possible views is selected for the evaluation step based on the information they can provide. Different sampling strategies are used, such as (i) regular grid sampling [ ], (ii) frontier sampling [ ], and (iii) incremental sampling (i.e., rapidly-exploring random trees (RRTs)) [ ].
The incremental approaches differ from regular grid sampling and frontier sampling by selecting a series of points randomly in a tree-like manner instead of multiple single views for evaluation. These trees can be built by rapidly-exploring random trees (RRTs) [ ] or their variants [ ]. The tree is expanded throughout the exploration space, and each branch forms a group
of random branches. RRT is extensively used due to its ability to work in higher dimension spaces,
and it ensures probabilistic completeness. Each vertex in the search space has a Voronoi region that
is calculated by the RRT, where the randomization guarantees a biased search towards vertices with
more signiﬁcant Voronoi regions.
Each point in the branch is evaluated using a utility function, and the branch which maximizes
the utility function is chosen. Although the evaluation is performed for the whole branch, and the best branch is selected for execution, Bircher et al. [ ] execute only the first edge of the selected branch. The rest of the branch is utilized for new tree initialization, maintaining the original path while allowing new paths to grow as the map is updated. The rapidly-exploring random trees and the viewpoint sampling approach provided in [ ] are adapted in this research, as shown in Figure 9.
In each iteration, a random geometric tree of finite depth with a maximum edge length is sampled in the known free space of robot configurations and incrementally built from an initial configuration. All branches of this tree are evaluated using a utility function. When the best branch is extracted, the first viewpoint, which corresponds to the first vertex of the best path, is selected, and the robot moves to that point.
Figure 9. Overview of the functional concept of the proposed semantically-aware exploration planner.
At every iteration, a random tree of ﬁnite depth is sampled in the known free space of robot
conﬁgurations. All branches of this tree are evaluated in terms of a combined information gain
related to exploration of unknown areas and semantic-information-based class conﬁdence value as
represented in the occupancy map. The best branch is identiﬁed, and the step to its ﬁrst viewpoint is
conducted by the robot. Subsequently, the whole procedure is repeated in a receding horizon fashion.
Viewpoints or edges are added to the tree and accepted for evaluation only if they lie inside the exploration area and the direct line of sight from the parent does not cross occupied voxels; in other terms, the path is collision free.
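This acceptance test can be sketched as a simple line-of-sight check. Flooring positions into voxel keys and sampling the segment at half-resolution steps are illustrative simplifications of exact voxel ray traversal.

```python
import numpy as np

def line_of_sight(p0, p1, occupied, resolution=0.25):
    """Accept an edge only if the straight line between its endpoints does
    not cross an occupied voxel.  Voxels are indexed by flooring
    position/resolution; the segment is sampled at half-resolution steps."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    n_steps = max(2, int(np.linalg.norm(p1 - p0) / (resolution / 2)))
    for t in np.linspace(0.0, 1.0, n_steps + 1):
        point = p0 + t * (p1 - p0)
        key = tuple((point // resolution).astype(int))
        if key in occupied:
            return False
    return True

occupied = {(4, 0, 0)}   # one occupied voxel covering x in [1.0, 1.25)
print(line_of_sight((0, 0, 0), (2, 0, 0), occupied))   # False: ray crosses it
print(line_of_sight((0, 0, 0), (0, 2, 0), occupied))   # True: clear path
```

Sampling at half the voxel size guarantees that no voxel along the segment is skipped, at the cost of testing some voxels twice.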
3.3.2. Viewpoint Evaluation
The utility function is used to select the best candidate viewpoint for the robot to visit. Many
factors impact the viewpoint evaluation, such as the cost of moving to a new point in terms of distance,
time, and power consumed, as well as the expected information gain (IG) the viewpoint can provide.
The proposed utility function utilizes both volumetric and semantic information that a single voxel
holds. The gain is calculated for each branch as the accumulated gain for the single nodes in the branch.
At each viewpoint
sample, the sensor generates a set of rays
where the rays end if it incident
a physical surface or reaches the limit of the map. For a single view
within a set of views
, the 3D
points from the sensor are annotated and projected into the map. The projection is carried through the
ray casting, resulting in a set of rays
cast for every view. Each ray
traverses the map, the volumetric
and semantic information are accumulated within a set of visited voxels
. The predicted information
Remote Sens. 2020,12, 891 12 of 23
for a single view v is the cumulative volumetric and semantic information collected along the rays cast from v, such that:

I(v) = Σ_{x ∈ X_v} [ α I_vol(x) + β I_sem(x) ]    (1)

where X_v is the set of voxels visited by the rays cast from v, I_vol(x) is the volumetric information, I_sem(x) is the semantic information that a single voxel x holds, and α and β are the weights for alternating between the volumetric and semantic gains. The formulation of I_vol(x) and I_sem(x) defines the behavior of the system.
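The per-view gain of Equation (1) can be sketched as below, assuming voxel-indexed lookup callables for the volumetric and semantic terms (illustrative names, not the paper's API):

```python
def view_gain(voxels, i_vol, i_sem, alpha=0.3, beta=0.7):
    """Equation (1): predicted information of a single view as the weighted
    sum of volumetric and semantic terms over the voxels its rays visit.
    `i_vol` and `i_sem` map a voxel to its I_vol(x) / I_sem(x) values; the
    default weights match the setting reported in the paper (0.3 / 0.7)."""
    return sum(alpha * i_vol(x) + beta * i_sem(x) for x in voxels)
```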
For a single voxel, the uncertainty that encodes the occupancy probability is defined as the voxel entropy, as in Equation (2):

H(x) = −[ P_o ln(P_o) + P̄_o ln(P̄_o) ]    (2)

where P_o denotes the probability of voxel x being occupied and P̄_o = 1 − P_o is its complement. A voxel about which no information is available has P_o = 0.5 and therefore the highest entropy of 1 shannon. Numerous studies in the literature have considered the volumetric information to formulate utility functions for different exploration purposes, such as volumetric gain [ ], pure entropy [ ], average entropy [ ], occlusion aware [ ], and unobserved voxels [ ].
The volumetric gain and pure entropy utility functions tend to reduce the map entropy by directing the robot towards views that contain a more significant number of unknown voxels. In contrast, the information gain obtained via average entropy favors views that traverse fewer voxels with higher entropy. In addition, the occlusion-aware utility function tends to favor views with more unknown voxels near the candidate view.
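For concreteness, the voxel entropy of Equation (2), which the entropy-based baselines accumulate per view, can be computed as:

```python
import math

def voxel_entropy(p_occ):
    """Equation (2): Shannon entropy of a voxel's occupancy probability,
    H(x) = -(P_o ln P_o + (1 - P_o) ln(1 - P_o)). An unobserved voxel with
    P_o = 0.5 attains the maximum, ln 2 nats (i.e. 1 shannon)."""
    if p_occ in (0.0, 1.0):
        return 0.0                 # a fully certain voxel carries no entropy
    q = 1.0 - p_occ
    return -(p_occ * math.log(p_occ) + q * math.log(q))
```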
Utility functions in the literature account for the volumetric or visibility likelihood of the voxels
and neglect the semantic information. We propose two multi-objective functions that consider both
volumetric and semantic information encapsulated in each voxel. In this research, the volumetric information is adapted from [ ], where it is defined as the accumulated number of unknown visible voxels in the candidate viewpoint. Let I_vis(x) be the indicator that voxel x is visible. A voxel is visible if the line of sight from the robot to the voxel does not contain an occupied voxel. Hence, I_vis(x) is defined as:

I_vis(x) = { 1 if x ∈ visible(m_unknown); 0 otherwise }    (3)
However, we propose two different multi-objective utility functions, the semantic visible voxel (SVV) and the semantic visited object of interest visible voxel (SVOI), each with a different semantic information formulation:
•Semantic Visible Voxel (SVV)
The proposed SVV multi-objective function tends to direct the robot towards views that contain known occupied voxels with a small confidence value, in order to label all the objects in the scene. The confidence value is obtained from the CNN, and it is the maximum confidence value for the corresponding class out of the 150 object classes. Let I_c(x) be the indicator of whether the semantic confidence value assigned to a voxel is less than a predefined threshold:

I_c(x) = { 1 if c(x) < c_threshold; 0 otherwise }    (4)

where c(x) is the semantic confidence value fused by the deep learning model and c_threshold is the predefined confidence threshold.
We propose the semantic information I_svv(x) for a single voxel as indicating a voxel that is visible, semantically labeled, and whose confidence value is less than a predefined threshold. I_svv(x) is defined as:

I_svv(x) = I_c(x) · I_vis(x)    (5)

Finally, the gain obtained from the proposed utility function is calculated by substituting I_svv(x) for I_sem(x) in Equation (1), where α = 0.3 and β = 0.7. The total gain for a single view combines the volumetric and semantic information for all the voxels in the view. Figure 10 illustrates the SVV function, which is a combination (accumulation) of both the volumetric information in Figure 10a and the semantic information in Figure 10b.
(a) Volumetric gain illustration. (b) Semantic voxel information I_svv(x).
Figure 10. SVV utility function illustration: occupied (black), free (green), unknown (gray), rays (red lines), sensor range (dashed blue), and visible unknown voxels (pink cubes).
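The SVV semantic term of Equation (5) multiplies the confidence indicator of Equation (4) by the visibility indicator; a minimal sketch with illustrative lookup callables (not the paper's API):

```python
def i_svv(voxel, visible, confidence, c_threshold=0.7):
    """Equations (4)-(5): a voxel contributes SVV gain only when it is
    visible from the view AND its fused class confidence is still below the
    threshold, i.e. the object it belongs to is not yet sufficiently labeled.
    `visible` and `confidence` stand in for map and CNN-fusion queries."""
    i_c = 1 if confidence(voxel) < c_threshold else 0
    i_vis = 1 if visible(voxel) else 0
    return i_c * i_vis
```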
•Semantic Visited Object Of Interest Visible Voxel (SVOI)
The proposed SVOI multi-objective function tends to direct the robot toward views that contain occupied voxels classified as an object of interest but not yet sufficiently visited. In this function, the number of visits to each voxel is recorded and used in the semantic information term. Let I_s(x) be the indicator of whether the voxel belongs to the set of objects of interest S_obj:

I_s(x) = { 1 if x ∈ S_obj; 0 otherwise }    (6)

Each voxel in the semantic map holds a value x_numOfVisits that represents the number of times it has been visited. In addition, let I_obj(x) be the indicator of the visiting threshold:

I_obj(x) = { 1 if x_numOfVisits < x_threshold; 0 otherwise }    (7)

where x_threshold is a predefined threshold for the sufficient number of visits.
We propose the semantic information I_svoi(x) for a single voxel as indicating a voxel that is visible, semantically labeled, belongs to an object of interest, and whose visit count is less than a predefined threshold. It is defined as:

I_svoi(x) = I_s(x) · I_obj(x) · I_vis(x)    (8)
Finally, the gain obtained from the proposed utility function is calculated by substituting I_svoi(x) for I_sem(x) in Equation (1), where α = 0.3 and β = 0.7. The total gain for a single view combines the volumetric and semantic information for all the voxels in the view.
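The SVOI semantic term of Equation (8) can be sketched analogously; argument names are illustrative, and the visit threshold follows the value reported in the experiments (three visits):

```python
def i_svoi(voxel, visible, label, visits, interest_set, visit_threshold=3):
    """Equation (8): a voxel contributes SVOI gain when it is visible, its
    semantic label belongs to the set of objects of interest, and it has not
    yet been visited a sufficient number of times. `visible`, `label`, and
    `visits` stand in for map queries."""
    i_s = 1 if label(voxel) in interest_set else 0
    i_obj = 1 if visits(voxel) < visit_threshold else 0
    i_vis = 1 if visible(voxel) else 0
    return i_s * i_obj * i_vis
```

Restricting `interest_set` to a single class (e.g. person) yields the object-search variant evaluated later in the paper.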
The exploration task finishes when a termination condition is satisfied. Most termination conditions depend on the scenario, such as victim detection [ ], the number of iterations surpassing a predefined limit [ ], or the information gain for all sampled viewpoints falling below a particular threshold [ ]. Additionally, for frontier-based methods, the exploration terminates when no more frontiers are found.
The aim of the exploration here is to label all the objects in the environment, so the proposed termination condition is satisfied when all the voxels are identified with a confidence value greater than a predefined threshold.
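The proposed termination condition, together with the iteration budget used in the experiments, can be expressed as a simple predicate; this is a sketch under the assumption that per-voxel fused confidences are available as an iterable:

```python
def exploration_done(confidences, c_threshold=0.7, iteration=0, max_iter=120):
    """Stop when every labeled voxel's fused confidence exceeds the
    threshold (the proposed condition), or when the iteration budget is
    exhausted (as in the reported experiments)."""
    return iteration >= max_iter or all(c > c_threshold for c in confidences)
```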
4. Experimental Setup and Results
4.1. Experimental Setup
The main goal of the proposed approach is to explore the environment and autonomously label all the detected objects. For each scenario, the robot begins at a starting location, and the entire process runs until a termination condition is satisfied. In the proposed scenario, takeoff and rotation steps are performed initially to create an initial map that allows the planner to generate sample viewpoints.
Viewpoint sampling is performed using an incremental approach, specifically a rapidly-exploring random tree (RRT), where a series of points is selected randomly in a tree-like manner instead of multiple single points for evaluation. The viewpoints in each branch are then evaluated using a different utility function for each test separately. After that, the branch that maximizes the gain is selected and its first edge is executed. The robot then navigates to the selected point, and the process terminates when the number of exploration iterations exceeds a predefined limit.
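One receding-horizon iteration of the sample-evaluate-execute loop described above can be sketched structurally as follows; `sample_free` and `evaluate_gain` are stand-ins for the planner's admissible RRT sampler and per-viewpoint utility evaluation, so this is an illustration of the control flow only:

```python
import random

def nbv_step(robot_pose, sample_free, evaluate_gain, tree_size=20):
    """Grow a small random tree of viewpoints, score each branch by its
    accumulated gain, and return the first viewpoint of the best branch
    (the edge the robot actually executes)."""
    tree = [(robot_pose, None)]                # (viewpoint, parent_index)
    for _ in range(tree_size):
        parent = random.randrange(len(tree))
        vp = sample_free(tree[parent][0])      # admissible viewpoint near parent
        if vp is not None:
            tree.append((vp, parent))
    if len(tree) == 1:
        return robot_pose                      # no admissible samples this round

    def branch_gain(i):                        # accumulate gain back to the root
        g = 0.0
        while i is not None:
            g += evaluate_gain(tree[i][0])
            i = tree[i][1]
        return g

    best = max(range(1, len(tree)), key=branch_gain)
    while tree[best][1] != 0:
        best = tree[best][1]                   # walk back to the tree's first edge
    return tree[best][0]
```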
Simulation experiments were performed on an ASUS laptop (Intel Core i7 @ 2.8 GHz × 8, 16 GB RAM). The NBV framework was implemented on Ubuntu 16.04 using the Robot Operating System (ROS Kinetic) to handle message sharing and ease the transition to hardware. Gazebo was used to perform the actual simulation, with programming done in both C++ and Python. The simulations were performed using a UAV equipped with one RGB-D camera; the specifications of the camera are shown in Table 1. A collision box of size 0.25 m × 0.25 m × 0.25 m was constructed around the UAV to check for collisions with the environment.
Table 1. Camera parameters.
Horizontal FOV 60◦
Vertical FOV 45◦
Resolution 0.1 m
Range 5 m
(Roll, Pitch, Yaw) (0, 0, 0)
The Gazebo environment shown in Figure 11 was used as the unknown environment that the robot should explore. The simulation environment has dimensions of 9.2 and consists of multi-connected rooms with a corridor, with several objects placed inside the rooms. The environment contains 11 objects: walls, floors, three people, a sofa, two chairs, a bookshelf, a vase, and a table. The constructed maps are based on a 3D occupancy grid using the OctoMap library [ ], with a resolution of 0.15 m per voxel.
Figure 11. Simulation environment.
Each utility function was tested separately under controlled environments, and the following
assumptions were made:
•The position of the robot was considered entirely known (ground truth from Gazebo).
•The explored environment is unknown with known boundaries only.
4.1.3. Baselines for Comparison
For comparison baselines, we use the following utility functions in our evaluation: volumetric gain [ ], pure entropy [ ], average entropy [ ], occlusion aware [ ], and unobserved voxels [ ]. The different utility functions are shown in Table 2.
Table 2. Utility functions.

Utility Function     I_vol(x)   I_sem(x)
Volumetric gain      ✓          0
Pure Entropy         ✓          0
Average Entropy      ✓          0
Occlusion Aware      ✓          0
Unobserved Voxels    ✓          0
Proposed SVV         ✓          Equation (5)
Proposed SVOI        ✓          Equation (8)
4.1.4. Evaluation Metrics
The evaluation metrics used in this work are the volumetric coverage, the number of detected objects, and the number of sufficiently labeled objects. Sufficiently labeled means that either the voxels have been visited repeatedly until their confidence values rise above a certain threshold (set to 0.7), or the number of visits exceeds a predefined threshold (set to three visits); this results in the voxels being labeled correctly. Table 3 summarizes the evaluation metrics.
Table 3. Evaluation metrics.

Volumetric coverage            Percentage of the number of known voxels compared with the total number of voxels the environment can cover. After each iteration, the coverage is calculated as (Free + Occupied)/(Free + Occupied + Unknown).
Detected objects               Count of the number of detected objects in the environment.
Sufficiently labeled objects   Count of the number of objects that are sufficiently labeled, using the semantic color table.
Total voxel type               Count of the voxels in each category: Occupied (low confidence) and Occupied (high confidence).
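The coverage metric in Table 3 is a straightforward ratio of known to total voxels; for clarity:

```python
def volumetric_coverage(n_free, n_occupied, n_unknown):
    """Coverage metric from Table 3: percentage of voxels whose state is
    known (free or occupied) out of all voxels the environment can cover."""
    known = n_free + n_occupied
    return 100.0 * known / (known + n_unknown)
```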
4.2. Experiment Results and Analysis
4.2.1. Mapping Evaluation
The results for the seven different 3D semantically-aware occupancy maps constructed using the different utility functions are presented in Figure 12. The presented maps were obtained after 120 iterations. As Figure 12 shows, the volumetric gain, SVV, and SVOI utility functions were capable of constructing most of the map, covering over 85% of it. However, pure entropy, average entropy, and occlusion aware failed to construct the map within the predefined number of iterations.
4.2.2. Area Coverage over the Number of Iterations
The two proposed utility functions are compared with commonly used state-of-the-art utility functions. In both of the proposed utility functions, α was set to 0.3 and β was set to 0.7 to steer the utility towards semantic labeling rather than volumetric exploration. The reported results are for seven different simulation experiments, divided according to the viewpoint evaluation approaches. The exploration terminates after 120 iterations, and the semantic fusion method used in this test is max fusion. Table 4 summarizes the results obtained. Figure 13 shows the volumetric coverage plotted against the number of iterations for all the tests. The SVV utility function almost reached the same volumetric coverage as the SVOI and volumetric utility functions; the SVOI and volumetric utility functions reached higher volumetric coverage in fewer iterations.
(a) Volumetric exploration (b) Pure Entropy
(c) Average Entropy (d) Occlusion aware
(e) SVV (f) SVOI
(g) SVOI (person)
Figure 12. Generated semantic maps using different utility functions after 120 iterations. The colors correspond to semantic classes, the green line segments represent the path travelled, and the red arrows represent the viewpoints along the path.
Table 4. Evaluation results.

Num   Utility                    Volumetric Coverage (%)   Detected Objects   Sufficiently Labeled Objects
1     Volumetric Gain            91.396                    11                 8
2     Pure Entropy               54.3802                   8                  5
3     Average Entropy            49.7883                   8                  5
4     Occlusion Aware            44.8293                   8                  5
5     Unobserved Voxel           75.8496                   10                 6
6     SVV                        88.5718                   11                 7
7     SVOI                       93.1325                   11                 8
8     SVOI (person detection)    92.0332                   11                 9
Figure 13. Volumetric coverage reached by different utilities using the same number of iterations.
4.2.3. Object Detection and Labeling
The color map shown previously in Figure 4 is used to show the detected objects. The volumetric gain, SVV, and SVOI utility functions were able to detect all the objects in the environment and to sufficiently label more than 75% of them. Figure 14 shows four different object detections at the voxel level. The proposed SVOI function was capable of detecting a larger number of person voxels in fewer iterations, as shown in Figure 14a. In addition, the SVOI function showed good performance when detecting chair voxels compared to the other utility functions, as shown in Figure 14b.
The number of voxels for each object is recorded at each iteration until the exploration process terminates (120 iterations). Based on the observations from Figure 14, the recorded number of voxels for each object varies. The variation is due to the semantic fusion method, max fusion, which selects the semantic label with the highest confidence for each voxel; max fusion may also switch the label of a voxel if the confidence values are marginal at the end of an iteration.
4.3. Object Search Application
The SVOI utility is an adaptive function in which the object of interest can be selected. In this test, rather than choosing the entire dataset as the set of interest objects, only the person class was selected as the object of interest. Hence, in a search and rescue context, this test aims to search for victims, locate them, and project their locations onto the 3D semantic map. The results are compared to the state-of-the-art utility functions in terms of volumetric coverage, the constructed semantic map, and the ability to detect persons in fewer iterations. Any object from the dataset can be chosen as an object of interest.
Figure 15 shows that the proposed SVOI utility function outperforms the benchmark utility functions by covering 85% of the environment in fewer iterations. Figure 12g shows the reconstructed map using the proposed function with the person as the object of interest. The results show that the proposed utility function was able to sufficiently label one additional object (person), as shown in Table 4. Finally, Figure 16 shows that the proposed utility function performed better at detecting a person (e.g., a victim) compared with the other utilities in this study.
(a) Person detection (b) Chair detection
(c) Sofa detection (d) Table detection
Figure 14. Number of voxels labelled for each object using different utility functions.
Figure 15. Volumetric coverage (proposed utility SVOI where the object of interest is the person).
Figure 16. Person detection at the voxel level.
5. Conclusions

In this paper, a strategy for autonomous, semantically-aware exploration for object labeling and
3D semantic mapping was presented. This strategy facilitates exploring and mapping unknown
environments efﬁciently, while directing the robot to label the objects present in the environment.
In this work, we proposed a novel 3D, semantically-aware occupancy map data structure that carries
the occupancy information as well as the semantic speciﬁc annotations to be utilized in the autonomous
exploration task. The mapping method employs a deep learning model to segment objects semantically,
generate annotated 3D point cloud, and create a 3D, occupied, semantically-aware map. Two new
multi-objective utility functions were introduced to quantify the contextual information from the 3D
semantic-occupied map to explore unknown environments.
Experimental results in the simulation demonstrated that the semantically-aware exploration
and mapping was successfully able to explore the environment and label the objects by utilizing the
contextual information from the 3D semantic map. The proposed utility functions showed a reliable
performance in exploration and object labeling. A comparison between the proposed utility functions and the state-of-the-art utility functions is provided: the proposed utilities outperform the benchmark utilities in volumetric coverage and labeling accuracy.
The semantic fusion method used in the proposed strategy suffers from inaccuracies (frequent label shifts) when the confidence in the semantic labeling is on the border between multiple classes. This will be addressed in future work by improving the accuracy of the semantic labeling and adopting a variation of Bayesian probabilistic semantic fusion. Additionally, the utility functions introduced require careful tuning of the scaling variables α and β; in future work, we will explore the possibility of introducing an adaptive scaling strategy that accommodates various objectives and environments.
Author Contributions: conceptualization, R.A., T.T., and N.A.; methodology, R.A. and T.T.; software, R.A. and T.T.; validation, R.A.; formal analysis, R.A.; investigation, R.A.; resources, R.A.; data curation, R.A.; writing—original draft preparation, R.A.; writing—review and editing, R.A.; visualization, R.A.; supervision, T.T., J.D., L.S., and N.A.; project administration, T.T., J.D., L.S., and N.A.; funding acquisition. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the ICT Fund through the "A Semi Autonomous Mobile Robotic C4ISR System for Urban Search and Rescue" project.
Acknowledgments: This publication acknowledges the support of the Khalifa University of Science and Technology under Award Number RC1-2018-KUCARS.
Conﬂicts of Interest: The authors declare no conﬂict of interest.
Naidoo, Y.; Stopforth, R.; Bright, G. Development of an UAV for search & rescue applications. In Proceedings
of the IEEE Africon’11, Livingstone, Zambia, 13–15 September 2011; pp. 1–6.
Erdelj, M.; Natalizio, E.; Chowdhury, K.R.; Akyildiz, I.F. Help from the sky: Leveraging UAVs for disaster
management. IEEE Pervasive Comput. 2017,16, 24–32.
Waharte, S.; Trigoni, N. Supporting search and rescue operations with UAVs. In Proceedings of the
2010 International Conference on Emerging Security Technologies, Canterbury, UK, 6–7 September 2010;
Hallermann, N.; Morgenthal, G. Visual inspection strategies for large bridges using Unmanned Aerial
Vehicles (UAV). In Proceedings of the of 7th IABMAS, International Conference on Bridge Maintenance,
Safety and Management, Shanghai, China, 2014; pp. 661–667.
Wada, A.; Yamashita, T.; Maruyama, M.; Arai, T.; Adachi, H.; Tsuji, H. A surveillance system using small
unmanned aerial vehicle (UAV) related technologies. NEC Tech. J. 2015,8, 68–72.
Lang, D.; Paulus, D. Semantic Maps for Robotics. In Proceedings of the Workshop” Workshop on AI
Robotics” at ICRA, Chicago, IL, USA, 14–18 September 2014.
Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, present,
and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot.
Kostavelis, I.; Gasteratos, A. Semantic mapping for mobile robotics tasks: A survey. Robot. Auton. Syst.
Wurm, K.M.; Hornung, A.; Bennewitz, M.; Stachniss, C.; Burgard, W. OctoMap: A probabilistic, ﬂexible,
and compact 3D map representation for robotic systems. In Proceedings of the ICRA 2010 workshop on
Best Practice in 3D Perception and Modeling for Mobile Manipulation, Anchorage, AK, USA, 3–7 May 2010;
Lai, K.; Bo, L.; Fox, D. Unsupervised feature learning for 3d scene labeling. In Proceedings of the 2014 IEEE
International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014;
11. Pillai, S.; Leonard, J. Monocular slam supported object recognition. arXiv 2015, arXiv:1506.01732.
Salas-Moreno, R.F.; Newcombe, R.A.; Strasdat, H.; Kelly, P.H.; Davison, A.J. Slam++: Simultaneous
localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1352–1359.
Vineet, V.; Miksik, O.; Lidegaard, M.; Nießner, M.; Golodetz, S.; Prisacariu, V.A.; Kähler, O.; Murray, D.W.;
Izadi, S.; Pérez, P.; et al. Incremental dense semantic stereo fusion for large-scale semantic scene
reconstruction. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation
(ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 75–82.
Kostavelis, I.; Charalampous, K.; Gasteratos, A.; Tsotsos, J.K. Robot navigation via spatial and temporal
coherent semantic maps. Eng. Appl. Artif. Intell. 2016,48, 173–187.
Yamauchi, B. A frontier-based approach for autonomous exploration. In Proceedings of the 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA'97) 'Towards New Computational Principles for Robotics and Automation', Monterey, CA, USA, 10–11 July 1997.
Elfes, A. Using occupancy grids for mobile robot perception and navigation. Computer 1989, 22, 46–57.
Connolly, C. The determination of next best views. In Proceedings of the 1985 IEEE International Conference
on Robotics and Automation, St. Louis, MO, USA, 25–28 March 1985; Volume 2, pp. 432–435.
Półka, M.; Ptak, S.; Kuziora, Ł. The use of UAV’s for search and rescue operations. Procedia Eng.
Tang, J.; Zhu, K.; Guo, H.; Liao, C.; Zhang, S. Simulation optimization of search and rescue in disaster relief
based on distributed auction mechanism. Algorithms 2017,10, 125.
Goian, A.; Ashour, R.; Ahmad, U.; Taha, T.; Almoosa, N.; Seneviratne, L. Victim Localization in USAR
Scenario Exploiting Multi-Layer Mapping Structure. Remote Sens. 2019,11, 2704.
21. Lindeberg, T. Scale invariant feature transform. Scholarpedia 2012,7, 10491.
Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA,
20–25 June 2005; Volume 1, pp. 886–893.
Wang, L. Support Vector Machines: Theory and Applications; Springer Science & Business Media: Berlin,
Germany, 2005; Volume 177.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox
detector. In Proceedings of the European conference on Computer Vision, Amsterdam, The Netherlands,
8–16 October 2016; Springer: Berlin, Germany, 2016; pp. 21–37.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
26. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
Redmon, J.; Farhadi, A. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classiﬁcation and
segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric
space. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017;
Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT,
USA, 18–22 June 2018; pp. 918–927.
Rosinol, A.; Abate, M.; Chang, Y.; Carlone, L. Kimera: An Open-Source Library for Real-Time
Metric-Semantic Localization and Mapping. arXiv 2019, arXiv:1910.02490.
Dang, T.; Papachristos, C.; Alexis, K. Visual Saliency-Aware Receding Horizon Autonomous Exploration with
Application to Aerial Robotics. In Proceedings of the 2018 IEEE International Conference on Robotics and
Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 2526–2533. doi:10.1109/ICRA.2018.8460992.
Dang, T.; Papachristos, C.; Alexis, K. Autonomous exploration and simultaneous object search using aerial
robots. In Proceedings of the 2018 IEEE Aerospace Conference, Big Sky, MT, USA, 3–10 March 2018.
Heng, L.; Gotovos, A.; Krause, A.; Pollefeys, M. Efﬁcient visual exploration and coverage with a micro aerial
vehicle in unknown environments. In Proceedings of the 2015 IEEE International Conference on Robotics
and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1071–1078.
Fraundorfer, F.; Heng, L.; Honegger, D.; Lee, G.H.; Meier, L.; Tanskanen, P.; Pollefeys, M. Vision-based
autonomous mapping and exploration using a quadrotor MAV. In Proceedings of the 2012 IEEE/RSJ
International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, 7–12 October 2012;
Cieslewski, T.; Kaufmann, E.; Scaramuzza, D. Rapid Exploration with Multi-Rotors: A Frontier Selection
Method for High Speed Flight. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017.
Song, S.; Lichtenberg, S.P.; Xiao, J. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017; pp. 633–641.
Xuan, Z.; David, F. Real-Time Voxel Based 3D Semantic Mapping with a Hand Held RGB-D Camera. 2018.
Available online: https://github.com/ﬂoatlazer/semantic_slam (accessed on 3 March 2020).
Hornung, A.; Wurm, K.M.; Bennewitz, M.; Stachniss, C.; Burgard, W. OctoMap: An efﬁcient probabilistic 3D
mapping framework based on octrees. Auton. Robots 2013,34, 189–206.
Bircher, A.; Kamel, M.; Alexis, K.; Oleynikova, H.; Siegwart, R. Receding Horizon “Next-Best-View” Planner
for 3D Exploration. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation
(ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1462–1468.
Burgard, W.; Moors, M.; Stachniss, C.; Schneider, F.E. Coordinated multi-robot exploration. IEEE Trans. Robot.
Stachniss, C.; Grisetti, G.; Burgard, W. Information Gain-based Exploration Using Rao-Blackwellized Particle
Filters. Robot. Sci. Syst. 2005, 2, 65–72.
Paul, G.; Webb, S.; Liu, D.; Dissanayake, G. Autonomous robot manipulator-based exploration and mapping
system for bridge maintenance. Robot. Auton. Syst. 2011,59, 543–554.
Al khawaldah, M.; Nuchter, A. Enhanced frontier-based exploration for indoor environment with multiple
robots. Adv. Robot. 2015,29. doi:10.1080/01691864.2015.1015443.
Karaman, S.; Frazzoli, E. Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res.
Lavalle, S.M. Rapidly-Exploring Random Trees: A New Tool for Path Planning; Technical Report; Iowa State
University: Ames, IA, USA, 1998.
Delmerico, J.A.; Isler, S.; Sabzevari, R.; Scaramuzza, D. A comparison of volumetric information gain metrics
for active 3D object reconstruction. Auton. Robot. 2018,42, 197–208.
Kriegel, S.; Rink, C.; Bodenmüller, T.; Suppa, M. Efﬁcient next-best-scan planning for autonomous 3D surface
reconstruction of unknown objects. J. Real-Time Image Process. 2015,10, 611–631.
Batista, N.C.; Pereira, G.A.S. A Probabilistic Approach for Fusing People Detectors. J. Control Autom.
Electr. Syst. 2015,26, 616–629. doi:10.1007/s40313-015-0202-6.
Isler, S.; Sabzevari, R.; Delmerico, J.; Scaramuzza, D. An information gain formulation for active volumetric
3D reconstruction. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation
(ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 3477–3484. doi:10.1109/ICRA.2016.7487527.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).