A Stacked LSTM based Approach for Reducing
Semantic Pose Estimation Error
Rana Azzam1, Yusra Alkendi1, Tarek Taha2, Shoudong Huang3, and Yahya Zweiri1,4
Abstract—Achieving high estimation accuracy is significant for
semantic simultaneous localization and mapping (SLAM) tasks.
Yet, the estimation process is vulnerable to several sources of
error including limitations of the instruments used to perceive
the environment, shortcomings of the employed algorithm, envi-
ronmental conditions, or other unpredictable noise. In this paper,
a novel stacked long short term memory (LSTM) based error
reduction approach is developed to enhance the accuracy of
semantic SLAM in the presence of such error sources. Training and
testing datasets were constructed through simulated and real-
time experiments. The effectiveness of the proposed approach was
demonstrated by its ability to capture and reduce semantic SLAM
estimation errors in training and testing datasets. Quantitative
performance measurement was carried out using the absolute
trajectory error (ATE) metric. The proposed approach was
compared to vanilla and bidirectional LSTM networks, shallow
and deep neural networks, and to support vector machines. The
proposed approach outperforms all other structures and was
able to significantly improve the accuracy of semantic SLAM.
To further verify the applicability of the proposed approach, it
was tested on real-time sequences from the TUM RGB-D dataset,
where it was able to improve the estimated trajectories.
Index Terms—LSTM, Deep Learning, Semantic SLAM, Local-
ization Error, Measurement Uncertainty, Sensor Noise
I. INTRODUCTION
SIMULTANEOUS localization and mapping (SLAM) is
one of the most prevalent research problems in the
robotics community. It is defined as the problem of estimating
the trajectory of a robotic vehicle and incrementally construct-
ing a map of its surroundings, provided with measurements
perceived from the environment [1]. SLAM serves as a key
enabler of a wide range of applications in mobile robotics,
such as search and rescue [2], [3], [4], autonomous navi-
gation [5], and augmented reality [6]. Semantic SLAM [1]
relies on visual measurements obtained by a vision sensor. It
exploits understanding of the surrounding structure to build
highly expressive maps that are easy for human operators
to understand. It has attracted a tremendous amount
of attention, especially after the deep learning breakthrough,
which led to advancements in object detection and tracking
1Rana Azzam, Yusra Alkendi, and Yahya Zweiri are with KU Cen-
ter for Autonomous Robotic Systems (KUCARS), Khalifa University of
Science and Technology, Abu Dhabi, UAE, e-mail: {rana.azzam,
yusra.alkendi, yahya.zweiri}@ku.ac.ae
2Tarek Taha is with the Robotics Lab, Dubai Future Foundation, UAE, email:
tarek.taha@dubaifuture.gov.ae
3Shoudong Huang is with the University of Technology Sydney, Sydney,
Australia, email: Shoudong.Huang@uts.edu.au
4Yahya Zweiri is with the Faculty of Science, Engineering and Com-
puting, Kingston University London, London SW15 3DW, UK,
email: y.zweiri@kingston.ac.uk
techniques [7]. The accuracy of the localization is a critical
success factor in robotic tasks, particularly those involving
interaction with humans. Examples of such tasks are search
and rescue, autonomous driving, and elder care. Being a
relatively young field, semantic SLAM has yet to achieve
robustness in the presence of noisy measurements, like those
occurring due to inaccurate object pose estimation with respect
to the vision sensor.
The uncertainty of SLAM estimates might arise due to
measurement errors that differ based on the adopted approach
to SLAM. In the case of object-based semantic SLAM, er-
rors mostly occur when post-processing the sensory data to
determine the poses of the observed features relative to the
sensor in the environment. This process starts with detecting
the landmark in the environment and determining its bounding
box, then computing its centroid. The centroid of the landmark
is then utilized to compute the relative pose between the
feature and the vision sensor. Furthermore, occlusions have a
significant impact on the accuracy of the estimated object pose
[8]. Occlusions happen when part of the object is observed in
an image, while the rest is either hidden by other objects in the
scene or is out of the field of view of the vision sensor. Due
to the advancements in deep learning based object detectors,
occluded objects can still be detected and correctly labeled in
an image. Hence, if they are not properly accounted for, the
estimated pose of an occluded object can be far from the true
one and may consequently cause severe accuracy degradation.
In addition, the limitations of the sensors used to perceive the
environment introduce another primary source of uncertainty.
The approach proposed in this paper aims at reducing the
joint effects of several sources of errors on the accuracy of
semantic SLAM estimates. These errors might arise from
limitations of the software and hardware components used
to perform semantic SLAM, from external environmental
conditions, or from unpredictable noise. Formulating a noise
model that accounts for all such errors is challenging, espe-
cially because some errors occur unexpectedly during data
collection and/or processing. Hence, a stacked LSTM based
neural network is proposed in this paper to learn and capture
the error patterns associated with the trajectory estimates of
semantic SLAM. By comparing the trajectory estimates to the
corresponding ground truth, the network is trained to reduce
the error and hence enhance the accuracy of semantic SLAM.
The proposed approach is general; it can be used for
any SLAM system since it operates on trajectory estimates
rather than raw measurements. It targets trajectories of ground
vehicles which are usually expressed using three degrees
of freedom in the 2-dimensional (2D) space; the vehicle’s
position (x, y) and orientation (ϑ). The approach is applicable
to any 2D SLAM problem. However, the employed neural
network was trained on data obtained using semantic SLAM,
and hence, is intended to improve the accuracy of semantic
SLAM trajectory estimates.
The proposed approach can be used in applications that
require accurate localization of the robotic vehicle. For ex-
ample, a more accurate estimate of the trajectory estimated by
semantic SLAM will result in a meaningful and more accurate
map of the environment. Another practical use-case scenario
of the approach presented in this paper is in search and rescue
applications. If the robots that are employed as first responders
after a particular catastrophe have the ability to accurately
pinpoint their location, it will expedite the process of rescuing
victims, if any, or locating areas that need immediate help.
The contributions of our paper are listed below.
• We developed a novel stacked-LSTM based approach to identify and reduce pose estimation error in object-based semantic SLAM. The approach alleviates the combined effect of predictable and unpredictable noise on the accuracy of trajectory estimates.
• We developed an automated search approach to select the architecture and hyper-parameters of the proposed stacked-LSTM neural network.
• We extensively tested the proposed approach on simulated and real-time experiments, where its superiority compared to shallow neural networks (SNN), deep neural networks (DNN), support vector machines (SVM), and semantic SLAM was proven.
The rest of the paper is organized as follows. Section II
presents recent related research work from the literature. The
proposed approach is introduced in Section III, followed by
experimental validation in Section IV. Finally, the conclusions
of this work are drawn in Section V.
II. RELATED WORK
A. Deep Neural Networks
Neural networks are trained to exhibit a particular behavior,
suited for the problem at hand, when fed with data. During
training, the internal parameters of the network, referred to
as weights, are adjusted to minimize the discrepancy between
the network’s prediction and the desired output [9]. An SNN
is a network with an input layer, one hidden layer, and an
output layer. Networks with two or more hidden layers are
referred to as deep neural networks (DNNs). DNNs are much
more efficient than SNNs with regards to the required number
of computational units, especially when modelling a complex
problem. This is attributed to the non-linear nature of the
activation functions occurring at several layers in the DNN
[10].
Furthermore, recurrent neural networks (RNN) are artifi-
cial neural networks that are capable of inferring knowledge
from context. This is attributed to the use of loops
which allows information to be fed back to the network
after being processed. However, such networks might suffer
from vanishing gradients, which motivated the need for Long
Short Term Memory (LSTM) cells [11]. LSTM cells enable
RNNs to retain information that is essential and discard it
otherwise. This functionality cannot be realized when
using conventional neural networks. DNNs and LSTMs have
been exhibiting state-of-the-art performance in a multitude of
applications, including computer vision [12], [13], [14]
and robotics [15], [16], [17].
B. SLAM and the Intervention of Deep Learning
A rich body of literature has addressed the SLAM problem,
and a wide range of algorithms exhibiting varying levels of
performance in terms of reliability, accuracy, and efficiency
have been proposed [18], [19], [20]. The utilization of deep
learning approaches has been witnessed in a substantial share
of these approaches in the past few years [1], and their
capability to outperform the classical approaches has been
demonstrated [21], [22], [23], [24], [17]. In addition, deep
learning based object detection techniques [25], [26], [27]
promoted the advancement of object-based semantic SLAM,
which relies on observations of landmarks that can be seman-
tically labelled in the environment, such as the approaches
presented in [28], [29]. Obtaining a reliable observation of a
landmark in the environment and accurately pinpointing its
position with respect to the sensor remains a challenge. On
a different note, much less research effort was made in the
area of employing deep learning approaches to improve the
accuracy of state estimation as discussed in the next section.
C. Enhancing SLAM Estimation Accuracy
The accuracy of state estimation in SLAM applications is
vulnerable to the effects of several error sources. Such errors
occur in one or more stages in the SLAM pipeline, such as
data collection, data processing, and optimization. Most of
the existing work in the literature assumes that noise models
always follow fixed distributions that can be mathematically
formulated [30]. Nonetheless, this is not always the case in
practical applications and might lead to severe degradation in
estimation accuracy.
When visual measurements and dead reckoning are used
together to estimate the state of a system, estimation uncer-
tainty may result from visual sensor noise [31], [32], land-
mark detection and localization accuracy [33], odometry drift
[34], or failure to arrive at a globally optimal estimate due
to measurement noise. The effect of unpredictable nonuni-
form noise as well as external environmental conditions is
also inevitable [35]. To enhance the accuracy of localiza-
tion, the solutions found in the literature can be classified
into: (1) controlling the environment under investigation [36],
(2) sensor data fusion [37], [38], (3) improving measure-
ment covariance estimation [39], [30], [35], [40], or (4) cor-
recting measurement errors, which can be further classified
into classical [41], [42], [43], [44] and learning approaches
[16], [45], [34], [46], [17], [47].
The work presented in [36] studies the placement of passive
tags, used as landmarks, in the environment, to always keep
the localization accuracy within a particular range. In another
vein, the robustness of indoor localization was supported
by accumulating sensory data, which compensated for the
limitations of the employed sonar sensors as presented in [37].
Fig. 1: Proposed deep learning approach
Another example of measurement fusion can be found in [38]
where measurements recorded by multiple IMUs along with
other exteroceptive sensors were integrated to improve local-
ization accuracy. Instead of assuming a fixed measurement
noise model, the work proposed in [39] predicts the noise
model based on raw measurements by means of a DNN.
The DNN was able to accurately predict the covariance of
measurements obtained by light and vision sensors. However,
the approach assumes that noise models follow a zero-mean
Gaussian distribution which does not hold all the time, putting
in doubt the generality of the approach. Similarly, the work
presented in [30] relaxes the assumption of a fixed measure-
ment noise model for dead reckoning and QR code detections
by employing a tailored extended H∞ filter. The approach is
general and, although more computationally expensive than the
extended Kalman filter, achieves higher accuracy. Similarly, the
approach proposed in [35] improves the accuracy of SLAM
estimates by employing an adaptive Gaussian particle filter
whose job is to compensate for bias in measurements. More
particularly, the approach targets unmodeled, unpredictable
noise patterns that are experienced in marine environments. A
recent noise model learning approach was proposed in [40]
where a deep neural network was trained to estimate the
covariance of inertial measurements, which are then used in
an extended Kalman filter to perform localization. Evaluation
results demonstrated the applicability of the approach, yet in
its current version, it works for inertial sensors only.
Several previous studies have addressed correction of mea-
surements using classical or learning approaches. In [41],
visual sensor limitations were accounted for by superimposing
camera oscillations to improve the accuracy of visual SLAM.
The work presented in [42] utilizes probabilistic fuzzy logic
to reduce measurement uncertainties occurring due to stochas-
tic and non-stochastic disturbances. This approach handles
dead-reckoning and range measurements and was proven to
outperform ordinary fuzzy logic in terms of improving the
accuracy of positioning and mapping estimates. The work
presented in [43] was able to cope with occasional failure of
inertial sensors during localization, by means of a discrete-
time H∞ filter. Localization was supported by a reference
wireless sensor network. In [44], a novel ultrasonic sensor
with self-configuration abilities was developed to cope with
collisions of ultrasonic waves and hence enhance localization
accuracy. The developed algorithm is also capable of handling
topological changes in the environment in a real-time manner.
In [16], a deep learning based approach is employed to
improve the altitude estimation of a flying robotic vehicle.
Moreover, in [45], the accuracy of the odometry of a wheeled
cart, calculated using its dynamic equations, was improved us-
ing an SNN. The network was trained to compute an estimate
of the vehicle’s traveled distance and orientation. However,
since the network is composed of a single hidden layer only,
it might not capture all patterns of estimation errors, and
hence cannot generalize well. Correction of odometry mea-
surements was also addressed in [34], [46] where Gaussian
processes were trained based on the discrepancies between
the odometry model and ground truth. The scalability and
accuracy of the model proposed in [46] were achieved through
deep kernel learning. Furthermore, the approach proposed in
[17] improves the accuracy of stereo visual localization. The
authors proposed a loss function based on the Lie group
SE(3) geodesic distance and used it to train a DNN to more
accurately estimate the relative transformation between subse-
quent images. The advantages of the proposed approach over
classical visual odometry was demonstrated through several
experiments. However, the DNN operates on images, which
makes the approach computationally expensive and requires a
large DNN structure to achieve the sought performance. Ad-
ditionally, an end-to-end learning approach to visual odometry
was presented in [15] where sequential learning was employed
to improve the pose certainty. The DNN operates on raw
images, from which it infers the uncertainty of the poses.
Operating on large amounts of data requires the hardware
to have high computational capabilities. The work proposed
in [47] utilizes deep learning to improve depth estimation
which is then used to perform dense monocular SLAM. In
a classical factor graph approach, the authors propose to use
multiple objective functions, or factors, addressing several
types of errors to further improve estimation accuracy. These
factors are the photometric, reprojection, and sparse geometric
factors. Combining these factors has resulted in robust motion
estimation when tested on several real-time sequences.
Fig. 2: Semantic SLAM
The proposed stacked-LSTM based approach has the fol-
lowing advantages over the aforementioned methods:
• It alleviates the effects of all the possible disturbances experienced while performing SLAM, including measurement errors, sensor failure, data processing faults, or any other unpredictable noise.
• It operates on a trajectory rather than raw sensor measurements like images. Hence, it can be used to reduce pose estimation errors irrespective of the employed sensors.
• It is efficient since the input to the neural network is a short segment of the trajectory, which takes much less time than images to be processed.
• It does not require any particular arrangement of the environment. More specifically, it does not depend on the number, geometry, or placement of landmarks in the environment.
• The stacked nature of the LSTM and dense layers along with the nonlinear activation functions facilitates identifying complex error patterns that could be challenging to model mathematically.
III. PROPOSED APPROACH
The deep learning approach proposed in this paper is
depicted in Fig. 1. In general, a ground vehicle’s trajectory,
estimated using semantic SLAM, is passed to a neural network
which will identify and reduce possible pose estimation errors.
The semantic SLAM algorithm will be described in Section
III-A along with the error sources that contribute to reducing
the accuracy of pose estimation. Section III-B details the deep
learning based pose estimation error reduction approach.
A. Semantic SLAM
The adopted semantic SLAM is designed for ground vehi-
cles and is performed based on measurements from the vehi-
cle’s wheel encoders and an RGB-D camera that is mounted
on top. A factor graph is used to model the problem as shown
in Fig. 2, where robot poses and map features (landmarks)
are represented as nodes. The graph contains two types of
edges: solid edges between every two consecutive pose nodes,
to represent a spatial constraint (denoted as e_{mot}), and dashed
edges between a pose node and a landmark node, to represent
an observation of the landmark at that pose (denoted as e_{meas}).
Each edge models a nonlinear quadratic constraint that can
be mathematically formulated as shown below:
be mathematically formulated as shown below:
emot = (xtg(ut, xt1))TR1
t(xtg(ut, xt1)) (1)
where gis the motion model, utis the control command and
xtis the robot pose at time t, and Rtis the covariance matrix
of the motion noise, which is assumed to be Gaussian.
emeas = (zj
th(xt, mj))TQ1
t(zj
th(xt, mj)) (2)
where zj
tare measurements, his the measurement function,
mjis a landmark, and Qtis the covariance matrix of the
measurement noise, which is assumed to be Gaussian.
The goal of graph SLAM is then to find a configuration
of the nodes that minimizes the error introduced by the
constraints. In our approach, this is done by means of the
incremental smoothing and mapping algorithm, iSAM2 [48].
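As an illustration of this formulation, the following minimal sketch shows how a 2D pose-landmark factor graph with motion constraints (1) and observation constraints (2) could be assembled and optimized incrementally with the GTSAM Python bindings and iSAM2. The noise values, poses, and the use of a bearing-range factor as a stand-in for the relative-position observation are illustrative assumptions, not the exact setup of this work.

    import numpy as np
    import gtsam

    # Keys for robot poses (x) and landmarks (l)
    X = lambda i: gtsam.symbol('x', i)
    L = lambda j: gtsam.symbol('l', j)

    graph = gtsam.NonlinearFactorGraph()
    initial = gtsam.Values()

    # Assumed Gaussian noise models standing in for R_t and Q_t
    motion_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05, 0.05, 0.02]))
    meas_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05, 0.10]))
    prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([1e-3, 1e-3, 1e-3]))

    # Prior on the first pose anchors the trajectory
    graph.add(gtsam.PriorFactorPose2(X(0), gtsam.Pose2(0, 0, 0), prior_noise))
    initial.insert(X(0), gtsam.Pose2(0, 0, 0))

    # Motion constraint (e_mot) between consecutive poses, from odometry
    odom = gtsam.Pose2(1.0, 0.0, 0.0)   # hypothetical odometry increment
    graph.add(gtsam.BetweenFactorPose2(X(0), X(1), odom, motion_noise))
    initial.insert(X(1), gtsam.Pose2(1.0, 0.0, 0.0))

    # Measurement constraint (e_meas) between a pose and a landmark
    graph.add(gtsam.BearingRangeFactor2D(X(1), L(0),
                                         gtsam.Rot2.fromDegrees(30.0), 2.0,
                                         meas_noise))
    initial.insert(L(0), gtsam.Point2(2.7, 1.0))

    # Incremental smoothing and mapping (iSAM2)
    isam = gtsam.ISAM2()
    isam.update(graph, initial)
    estimate = isam.calculateEstimate()
    print(estimate.atPose2(X(1)))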
1) Landmark Pose Estimation and Data Association: To
perform semantic SLAM, the vehicle’s relative position to the
observed landmarks must be computed. This is done using
input RGB-D frames. RGB images are passed to the object
detector, you only look once (YOLO) [25]. For each detected
object in an image, YOLO predicts a label and a bounding
box. The relative position between the camera and the detected
object is then computed as the distance between the camera
and the centroid of the object, as proposed in [29]. Briefly, the
input depth image that corresponds to the input RGB image is
converted to a point cloud. The point cloud is segmented in order
to extract the cluster of points that belongs to the detected
landmark using kd-tree search. The geometric centroid of the
cluster is then computed.
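A minimal sketch of this centroid computation is given below, assuming the depth pixels inside the predicted bounding box have already been back-projected into an N x 3 array of camera-frame points. The kd-tree based region growing is only a simplified stand-in for the clustering step described above, and the radius and cluster-size thresholds are illustrative.

    import numpy as np
    from scipy.spatial import cKDTree

    def landmark_centroid(points, radius=0.05, min_cluster=50):
        # points: [N, 3] camera-frame points inside the detection's bounding box.
        # Euclidean region growing over a kd-tree, then the geometric centroid
        # of the largest cluster is returned as the landmark position estimate.
        tree = cKDTree(points)
        unvisited = set(range(len(points)))
        clusters = []
        while unvisited:
            seed = unvisited.pop()
            cluster, frontier = {seed}, [seed]
            while frontier:
                idx = frontier.pop()
                for nb in tree.query_ball_point(points[idx], r=radius):
                    if nb in unvisited:
                        unvisited.remove(nb)
                        cluster.add(nb)
                        frontier.append(nb)
            if len(cluster) >= min_cluster:
                clusters.append(np.array(sorted(cluster)))
        if not clusters:
            return None
        largest = max(clusters, key=len)        # assumed to belong to the object
        return points[largest].mean(axis=0)     # geometric centroid (x, y, z)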
Detected objects are associated to landmarks in the map
based on their label, which is predicted by YOLO. Since
multiple objects of the same category might exist in the envi-
ronment, the observation is associated to the closest landmark
within a particular distance threshold. If no landmarks exist
within that threshold, a new instance is inserted in the map.
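A sketch of this label-and-distance based association rule is shown below; the data structures, threshold value, and function name are illustrative rather than taken from the actual implementation.

    import numpy as np

    def associate(obs_label, obs_xy, landmarks, dist_threshold=1.0):
        # Associate a detection with the closest existing landmark of the
        # same label within dist_threshold, or insert a new landmark.
        # landmarks: list of dicts with keys 'label' and 'xy' (map frame).
        best, best_d = None, dist_threshold
        for i, lm in enumerate(landmarks):
            if lm['label'] != obs_label:
                continue
            d = np.linalg.norm(np.asarray(lm['xy']) - np.asarray(obs_xy))
            if d < best_d:
                best, best_d = i, d
        if best is None:                        # no landmark close enough
            landmarks.append({'label': obs_label, 'xy': np.asarray(obs_xy)})
            best = len(landmarks) - 1           # index of the new instance
        return best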
2) Measurement Uncertainty: There are several factors that
contribute to reducing the estimation accuracy when perform-
ing semantic slam. Starting from the inputs, the sensors used to
perceive the environment suffer from limitations that decrease
the accuracy of the obtained measurements. For example,
uncertainty of RGB-D measurements might be caused by (1)
axial noise [49], which increases when the distance to the de-
tected object increases, (2) lateral noise [49], which increases
near the image corners, (3) multi-path interference [50], (4)
flying pixels [50], and (5) the scene’s characteristics, such as
color variations, temperature [50], and illumination conditions.
In addition, odometry drift is affected by the accuracy of wheel
encoders [34], wheel materials, and floor flatness and materials.
Fig. 3 describes the model that was used to simulate odometry
noise and add it to the simulated odometry, which is otherwise
considered to be perfect.
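Since the exact parameters of the noise model are given only in Fig. 3, the snippet below merely illustrates the general idea of corrupting an ideal simulated odometry increment with zero-mean Gaussian noise before integrating it into the 2D pose; the standard deviations are placeholders.

    import numpy as np

    rng = np.random.default_rng()

    def add_odometry_noise(d_trans, d_rot, sigma_trans=0.02, sigma_rot=0.01):
        # Corrupt an ideal odometry increment (translation in meters,
        # rotation in radians) with zero-mean Gaussian noise.
        return (d_trans + rng.normal(0.0, sigma_trans),
                d_rot + rng.normal(0.0, sigma_rot))

    def integrate(x, y, theta, d_trans, d_rot):
        # Apply a (possibly noisy) increment to a 2D pose (x, y, theta).
        x += d_trans * np.cos(theta)
        y += d_trans * np.sin(theta)
        theta += d_rot
        return x, y, theta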
Furthermore, object detection might result in incorrect labels
or bounding box predictions. If such false detections are
not treated as outliers, the accuracy of pose estimation and
data association is severely affected. Given an accurate object
label and bounding box, the pose estimation module is still
error-prone. Depending on the structure of the environment
under investigation, object classes, camera position, illumina-
tion conditions, and most importantly, object occlusions, the
accuracy of segmentation and clustering can be significantly
reduced.
Fig. 3: Odometry noise model
Modeling such errors can be extremely challenging; there-
fore, guaranteeing a maximum likelihood estimate using the
employed incremental smoothing and mapping technique is
difficult.
The input to the neural network is the trajectory estimated
by the semantic SLAM system. The estimation is based on
observations that may sometimes be inaccurate due to several
sources of error affecting data acquisition and/or processing.
For example, the aforementioned types of RGB-D sensor
noise may degrade the quality of the acquired RGB-D frames,
which will consequently affect the accuracy of the information
obtained from such images. Also, object detection, labeling,
and segmentation are subject to errors that may negatively
impact the corresponding measurement constraints. In sim-
ulated experiments, noise was simulated and added to the
measurements to mimic the real noise.
The motion measurements and observations are passed to
the optimization algorithm, in the SLAM back-end, along with
an estimate of the measurement noise model. The optimization
algorithm then estimates the trajectory to find a configuration
of the poses that minimizes the overall error along the trajec-
tory. The resulting estimate of the robot trajectory still suffers
from estimation errors since it was based on observations
inferred from inaccurate measurements that propagate along
the SLAM pipeline.
The neural network is trained to identify the error patterns in
the final estimate of the robot trajectory by comparing it to the
corresponding ground truth. For every pose, the neural network
exploits a segment of the trajectory that precedes that pose to
determine the pose estimation error and reduce it accordingly.
Fig. 4: Example of dividing a trajectory into sequences of length 4
B. Stacked LSTM based Noise Reduction Approach
To find the best-suited neural network architecture, a sys-
tematic search was done in a pool of neural networks of
varying types, depths, and activation functions. Three types of
LSTM networks were explored; simple vanilla LSTM, stacked
LSTM, and bidirectional LSTM. Moreover, a set of shallow
and deep fully connected feedforward neural networks were in-
vestigated. A hybrid of LSTM and fully connected layers was
also considered. The performance of all the tested networks
was evaluated using a dataset containing data generated from
simulated trajectories. The training and validation results were
compared using the absolute trajectory error, or ATE in short.
More specifically, the Euclidean distance between the ground
truth pose and the pose corrected by the network is computed
for all poses along the trajectory and the mean error is used
to compare the performance of the different networks.
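The metric used throughout this section can be written compactly as below; the function assumes two equally long, already aligned trajectories and uses only the planar position components.

    import numpy as np

    def mean_ate(estimated, ground_truth):
        # Mean Euclidean distance between corresponding (x, y) positions
        # of the estimated/corrected and ground truth trajectories.
        est = np.asarray(estimated)[:, :2]
        gt = np.asarray(ground_truth)[:, :2]
        return np.linalg.norm(est - gt, axis=1).mean()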
Each of the tested neural networks takes in a segment of
the trajectory, consisting of the current 2D pose, along with
a number of previous poses. The length of this segment will
hereinafter be referred to as sequence length. An illustrative
example of how a trajectory is divided into sequences of length
4 is depicted in Fig. 4. Each pose in this segment is denoted
as X = [x_m, y_m, ϑ_m]^T, and hence the input is a 2D array of
poses. It is worth mentioning that the network is not expected
to predict a new pose following the input segment. Rather, it
learns to correct the last pose based on the previous poses in
the segment. Therefore, the output of the neural network is a
3-tuple that represents the pose, and is obtained by means of
a dense layer of size three, activated using sigmoid for all the
tested neural networks.
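The slicing of a trajectory into overlapping input segments, with the ground-truth pose of the last segment pose as the target (as detailed later in Section IV-B), could be implemented along the following lines; the array shapes and default sequence length of 30 follow the description in the text, while the function name is illustrative.

    import numpy as np

    def make_sequences(est_traj, gt_traj, seq_len=30):
        # est_traj, gt_traj: [N, 3] arrays of poses [x, y, theta]
        est = np.asarray(est_traj)
        gt = np.asarray(gt_traj)
        X, T = [], []
        for end in range(seq_len, len(est) + 1):
            X.append(est[end - seq_len:end])   # input segment
            T.append(gt[end - 1])              # target: ground truth of last pose
        return np.stack(X), np.stack(T)        # [M, seq_len, 3], [M, 3]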
First, a search was conducted to determine the most suited
type of LSTM networks; vanilla, bidirectional, or stacked.
Several LSTM networks with varying number of units and
sequence lengths were trained and evaluated. The number of
units in each LSTM layer was set to 2^m for m ∈ [5, 9]. The
sequence length was varied between 10 and 90, in increments
of 10. In addition, stacked LSTM was tested with 2 and 3
LSTM layers. The mean ATE obtained on the training and val-
idation datasets by all the tested LSTM network architectures
are shown in Fig. 5.
Fig. 5: Mean ATE obtained on the training and validation datasets by different LSTM network architectures (panels: vanilla LSTM, bidirectional LSTM, and stacked LSTM with 2 and 3 layers; LSTM layer sizes 32, 64, 128, 256, and 512; x-axis: sequence length; y-axis: mean absolute trajectory error (m))
Several architectures performed well and
were able to improve the accuracy of the estimated trajectory.
However, the architecture with two stacked LSTM layers, each
with 256 units, with a sequence length of 30 exhibited the
highest performance among all the considered LSTM networks
in terms of reducing ATE.
In an attempt to further improve the results, one, two, and
three dense layers were added after the LSTM layers and
various sizes and activation functions were tested, including
sigmoid, swish [51], tanh, ReLU, and linear. Adding a dense
layer of size 256 with a sigmoid activation function resulted in
smoother trajectories with lower mean ATE. The mean ATE
achieved by the other hybrid architectures did not improve.
To aid generalization and overcome overfitting, dropout layers
were added to the architecture.
Fig. 6 depicts the architecture of the adopted stacked LSTM
neural network. The network accepts trajectory segments of
length 30. Each of the poses in a segment consists of a three-
tuple, X = [x_m, y_m, ϑ_m]^T, representing the position and
orientation of the vehicle at a time instance. The segment
is then passed to two stacked LSTM layers, separated by a
dropout layer, with dropout probability of 0.1. Each LSTM
layer consists of 256 units. Then, after another dropout layer,
a fully connected layer with 256 neurons, activated by sigmoid
is added. Finally, a dense layer, activated by sigmoid is used
to predict the improved pose.
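A minimal Keras sketch that is consistent with this description (two stacked 256-unit LSTM layers separated by a dropout of 0.1, a further dropout layer, a 256-neuron sigmoid dense layer, and a 3-neuron sigmoid output over segments of length 30) is shown below; the exact placement of the dropout layers is an approximation of Fig. 6.

    from tensorflow import keras
    from tensorflow.keras import layers

    SEQ_LEN = 30   # length of the input trajectory segment

    model = keras.Sequential([
        keras.Input(shape=(SEQ_LEN, 3)),           # poses [x, y, theta]
        layers.LSTM(256, return_sequences=True),   # first stacked LSTM layer
        layers.Dropout(0.1),
        layers.LSTM(256),                          # second stacked LSTM layer
        layers.Dropout(0.1),
        layers.Dense(256, activation='sigmoid'),   # fully connected layer
        layers.Dense(3, activation='sigmoid'),     # corrected pose [x, y, theta]
    ])
    model.summary()   # roughly 850k trainable parameters

With these layer sizes, the parameter count reported by model.summary() is on the order of 850k, which matches the figure stated below.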
The computational complexity of the proposed model per
time step is O(W), where W is the size of the weight-
space. This is attributed to the fact that the time complexity
to update a single weight is O(1). The size of the weight-
space is a function of the input size, hidden units, and output
size [52], which were detailed earlier. The total number of
trainable parameters in the proposed model is around 850k
parameters. The architecture of the proposed model and its
hyper-parameters, such as the batch size and the number of
training epochs, are fixed. Hence, for N training samples, the
complexity becomes O(N), since the processing cost per training
sample is O(1).
The proposed neural network will be compared to shallow
and deep fully connected neural networks (SNN and DNN
respectively) and to support vector machines (SVM). Hence,
a pool of SNNs, DNNs, and SVMs was investigated to search
for the best structure from each paradigm for our problem. The
parameters that were varied for SNN and DNN are the number
of neurons per layer and the activation functions. Different
depths of DNNs were also attempted. As for SVMs, a set
of variables, like the kernel and its corresponding parameters,
were changed and the SVM that resulted in the lowest mean
ATE across the training and validation datasets was selected.
The search for the most suited SNN, DNN, and SVM was
done to ensure the fairness of our comparisons.
IV. EXPERIMENTAL VALIDATION
In this section, the proposed approach is validated through
a set of simulated and real-time sequences from publicly
available datasets. The performance of the stacked LSTM
neural network is then compared against other regression
techniques, including SVM, SNN, and DNN where it proved
to outperform them.
As mentioned earlier, the proposed approach can be applied
to trajectories computed using any 2D SLAM. However, the
datasets used here were obtained using semantic SLAM.
The rest of this section is organized as follows: The
experimental set-up used to record the training dataset is
presented in Section IV-A. In Section IV-B, the structure of
the training datasets is described, followed by details about the
training process. After that, the performance of the proposed
approach is analyzed then compared against that of SNN,
DNN, and SVM in Section IV-C. Finally, in Section IV-D,
the proposed approach is tested on a set of simulated and
real-time experiments, including three SLAM sequences from
the TUM RGB-D dataset [53].
A. Experimental Set-up
A simulated Pioneer 3AT robot with an RGB-D camera
mounted in a front-forward position was used to navigate in
several simulated environments and collect the sensory data
required to perform semantic SLAM. A sample of the simulated
environments is shown in Figure 7 with the simulated robotic
vehicle at its starting point. This environment is 13 × 10 m²
and is populated with 37 object instances of two different
categories: 19 TV monitors and 18 bottles. Other simulated
environments were also used where the object categories
include potted plant, table, and person (who is assumed to
be static while recording the experiment).
Fig. 6: Proposed Neural Network Architecture
Fig. 7: Simulated environment setup with ground vehicle at its starting
point
Several simulated trajectories were recorded in these sim-
ulated environments then passed to the semantic SLAM al-
gorithm to generate training data. Since the odometry mea-
surements obtained from simulations are perfect, the odometry
noise model described in Figure 3, was used to simulate noise
and add it to the recorded measurements. Another source of
error was observed when passing the RGB frames to the
object detector, YOLO [25], then performing segmentation to
determine the centroid of the observed object. The centroids
of the detected objects were seen to deviate from their true
positions. This was mainly due to object occlusions since
YOLO was able to detect an object even if part of it is
occluded. Consequently, only the visible part of the object
was used to compute the centroid of that instance, causing
an error in the measurement. The error varies depending on
the size of the object, and hence the standard deviation of the
noise associated with each object observation was set based
on the object's dimensions. Object measurements from some
recorded experiments in the previously described simulated
environment are shown in Figure 8.
Fig. 8: Distribution of object measurements from some recorded simulated experiments in the environment shown in Fig. 7
Real-time experiments were taken from the TUM RGB-D
dataset [53], where a Pioneer 3AT robot with a Kinect RGB-D
sensor mounted in a front-forward position was joysticked
through a large hall. Multiple instances of chairs and tables appeared
in the environment and were used as observations to perform
semantic SLAM. Recordings of experiments are provided with
the corresponding ground truth which was used during the
training process. Table I summarizes details about the three
trajectories taken from the dataset.
The specifications of the computer used to conduct the
semantic SLAM experiments are listed in Table II.
The semantic SLAM algorithm was implemented using the
Robot Operating System (ROS) [54] Kinetic on Ubuntu 16.04.
TABLE I: Summary of TUM RGB-D sequences used for evaluating
the proposed approach
Name                     Trajectory Length   Trajectory Dimensions
Freiburg pioneer slam    40.38 m             5.50 m × 5.94 m
Freiburg pioneer slam2   21.735 m            4.98 m × 5.34 m
Freiburg pioneer slam3   18.135 m            5.29 m × 5.25 m
TABLE II: Hardware specifications
Computer type: ASUS STRIX laptop
Processor: Intel Core i7-6700HQ @ 2.60 GHz × 8
System type: 64-bit operating system
Operating System: Linux – Kubuntu 16.04 distribution
The communication between the simulated/real hardware and
ROS was performed through RosAria and OpenNI for the
ground vehicle and the RGB-D sensor, respectively. The
system software was implemented in C++, where gtsam [55]
and its iSAM2 [48] implementation was used to perform
incremental smoothing and mapping, YOLO [25] was used
for object detection, OpenCV [56] and depth_image_proc were
used for processing RGB and depth images, respectively, and
point cloud library (PCL) [57] was used to process point
clouds.
B. Dataset Preparation
A total of 18 different simulated trajectories were generated
and used to construct the dataset. Every estimated trajectory
was divided into smaller overlapping segments of length 30 as
described in Fig. 4. The ground truth corresponding to the last
pose in each segment is set as the target for that input segment
and is referred to as T = [x, y, ϑ]^T. In simulated experiments,
the recorded odometry measurements, before adding simulated
noise, are set as the target. In real-time experiments, ground
truth data was obtained from the dataset online.
The output of the network is an improved estimate of the
robot’s 2D poses along the trajectory, where each pose is
denoted as Y = [x_e, y_e, ϑ_e]^T. It is worth mentioning that in
the current version of the system, reduction of pose estimation
error is done offline.
For the neural network to perform well, the datasets need to
be re-scaled to a common range. To that end, all the collected
data were normalized to the range [0.05, 0.95]. The reason
why this particular range is selected is to avoid the problem
of vanishing gradients that occurs when the neurons saturate,
i.e. reach the minimum or maximum value of the activation
function (0 and 1 respectively for sigmoid), and hence the
derivative of the function at that point drops to values close
to zero. Using the same normalization parameters, predictions
were re-scaled to the original data range.
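The scaling to [0.05, 0.95] and the inverse mapping applied to the predictions can be expressed as a simple per-dimension min-max transform, sketched below; the class name is illustrative.

    import numpy as np

    class RangeScaler:
        # Min-max scaling of pose data to [lo, hi] and back.
        def __init__(self, lo=0.05, hi=0.95):
            self.lo, self.hi = lo, hi

        def fit(self, data):                  # data: [N, 3] poses
            self.dmin = data.min(axis=0)
            self.dmax = data.max(axis=0)
            return self

        def transform(self, data):
            unit = (data - self.dmin) / (self.dmax - self.dmin)
            return self.lo + unit * (self.hi - self.lo)

        def inverse_transform(self, scaled):
            unit = (scaled - self.lo) / (self.hi - self.lo)
            return self.dmin + unit * (self.dmax - self.dmin)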
Backpropagation [58] was used to train the network in a
supervised manner. The Adam optimizer was employed, with
a learning rate of 0.0001, to minimize the mean absolute
error, over 1000 training epochs. The batch size was set
to 100. Building, training, and testing the different neural
networks was done using Keras [59] with a Tensorflow back-
end (version 2.0).
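With the model and data prepared as above, the reported training configuration (Adam with a learning rate of 0.0001, mean absolute error loss, 1000 epochs, batch size 100) corresponds roughly to the following Keras calls; X_train, T_train, X_val, and T_val denote the normalized segments and targets from the earlier sketches.

    from tensorflow import keras

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss='mean_absolute_error')

    history = model.fit(X_train, T_train,
                        validation_data=(X_val, T_val),
                        epochs=1000, batch_size=100)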
C. Performance Evaluation
In this section, the performance of the proposed stacked
LSTM will be compared to DNN, SNN, and SVM using a
set of simulated trajectories. The datasets were randomly split
into two parts; 80% for training, and 20% for validation to aid
the model’s regularization.
To ensure fairness when comparing the proposed approach
to DNN, SNN, and SVM, a similar search strategy was
adopted to find the most suited architecture using the same
training dataset. Fully connected SNNs with varying activation
functions, including sigmoid, swish, ReLU, linear and tanh,
and layer sizes set to 2^m for m ∈ [5, 9], were examined. Along
the same lines, fully connected DNNs with depths varying
from two to six layers, layer sizes set to 2^m for m ∈ [5, 9], and
activation functions including sigmoid, swish, ReLU, linear
and tanh, were investigated. The SNN and DNN that achieved
the lowest mean ATE were selected to be compared to the
proposed approach. The SNN that performed the best in terms
of reducing pose estimation error had 512 neurons in its single
hidden layer and was activated using ReLU. The DNN, on the
other hand, had 6 hidden layers, each of size 256 neurons, and
activated using sigmoid. The third regression technique that
will be compared against the proposed approach is SVM. More
particularly, varying structures of the epsilon support vector
regression model [60] were explored. Several kernels including
linear, sigmoid, and polynomial with varying regularization
and epsilon parameters were tested. The best performing SVM
out of the tested pool used a degree-5 polynomial kernel.
The predictions of these three models were compared to those
of the proposed stacked-LSTM approach as will be presented
next.
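For reference, an epsilon support vector regression baseline with a degree-5 polynomial kernel could be configured in scikit-learn roughly as follows; flattening the input segment and wrapping one regressor per output dimension are assumptions made for this sketch, and the C and epsilon values are placeholders.

    import numpy as np
    from sklearn.multioutput import MultiOutputRegressor
    from sklearn.svm import SVR

    def fit_svr_baseline(X, T):
        # X: [M, seq_len, 3] trajectory segments, T: [M, 3] target poses
        flat = X.reshape(len(X), -1)          # SVR expects 2D inputs
        svr = SVR(kernel='poly', degree=5, C=1.0, epsilon=0.01)
        return MultiOutputRegressor(svr).fit(flat, T)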
Fig. 9 depicts nine different trajectories, along which the
performance of the proposed approach is evaluated and com-
pared to the other alternatives. It is evident that the proposed
stacked LSTM was capable of identifying error patterns in the
input trajectories and significantly improving them along all
the depicted trajectories. DNN predictions have also shown
substantial improvements to the trajectories, yet there are seg-
ments of the trajectories where DNN predictions still suffered
from errors. SNN predictions demonstrated improvements at
times, but especially after turns in the trajectory, predictions
exhibited high fluctuations and hence large ATE. SVM pre-
dictions were mostly less accurate than the input trajectories
generated by semantic SLAM and hence, no improvement to
the ATE was observed.
Table III lists the mean ATE achieved by the stacked LSTM
network, SNN, DNN, and SVM along each trajectory. SVM
predictions have not shown any improvement to the mean ATE
along any of the nine trajectories. Stacked LSTM, SNN, and
DNN, on the other hand, were able to reduce the mean ATE
along all the trajectories. However, the proposed approach has
clearly outperformed all the other alternatives, by achieving
the lowest mean ATE along all the trajectories.
D. Performance Analysis on Publicly Available Datasets
To further verify the applicability of the proposed approach,
the training dataset was extended to include more simulated
Fig. 9: Training and validation results on simulated trajectories (nine panels, Trajectories 1-9; legend: Semantic SLAM, Stacked LSTM, DNN, SNN, SVM, Ground Truth, Start Position; axes X(m) vs. Y(m))
TABLE III: Comparison of mean ATE (m) achieved along the
trajectories in Fig. 9
Semantic SLAM Stacked LSTM SNN DNN SVM
Trajectory 1 0.20 0.025 0.19 0.048 0.68
Trajectory 2 0.20 0.021 0.16 0.035 0.59
Trajectory 3 0.30 0.025 0.16 0.034 0.68
Trajectory 4 0.26 0.024 0.15 0.040 0.58
Trajectory 5 0.17 0.019 0.095 0.034 0.57
Trajectory 6 0.32 0.024 0.26 0.037 0.75
Trajectory 7 0.71 0.022 0.22 0.042 0.86
Trajectory 8 0.24 0.024 0.15 0.047 0.42
Trajectory 9 0.32 0.025 0.17 0.042 0.48
and real-time experiments. The latter were taken from pub-
licly available datasets that are used as a benchmark by the
robotics community, particularly, Freiburg2 Pioneer SLAM,
Freiburg2 Pioneer SLAM2, and Freiburg2 Pioneer SLAM3
from the TUM RGB-D dataset [53]. These public datasets
resemble a practical use-case scenario where a ground vehicle
performs a maneuver in an indoor environment, populated
with objects. The vehicle is equipped with wheel encoders
and an RGB-D sensor mounted in a front-forward position,
as described in Section IV-A. Vision measurements and the
concurrent odometry are passed to the semantic SLAM system,
which estimates the vehicle’s trajectory in the environment.
The process of acquiring and processing data is vulnerable to
several sources of error that hinder the accuracy of the esti-
mated trajectory. To reduce such inaccuracies, the trajectory
is passed to the stacked-LSTM neural network, which in turn
identifies and reduces pose estimation errors. The trajectory
is divided into small overlapping segments as depicted in
Fig. 4 and described in Section III-B. Each pose along the
trajectory is corrected based on the vehicle's preceding poses.
In the current version, the correction is done offline. The same
process is applicable to any 2D SLAM estimate.
Fig. 10: Simulated experiment - sample 1 (ground truth vs. semantic SLAM estimate, ground truth vs. stacked LSTM output, and per-pose ATE)
Fig. 11: Simulated experiment - sample 2 (ground truth vs. semantic SLAM estimate, ground truth vs. stacked LSTM output, and per-pose ATE)
The training dataset, including all the simulated and real-
time trajectories, was divided into three parts, 70% for training,
15% for validation, and 15% for testing. The training set
will be used to optimize the weights of the network during
the training process. The performance of the neural network
on the validation set will be used to further update the
network’s weights after every training epochs. Finally, an
unbiased evaluation of the network will be obtained using the
testing set. Table IV lists the mean ATE that the proposed
approach achieved on the training, validation and testing sets.
It is evident that the stacked LSTM was able to identify and
significantly reduce the error patterns along the trajectories
in the dataset. The mean ATE on the testing dataset, which
was not exposed to the network during training, dropped
from 65 cm to 2 cm. This proves the validity of the proposed
approach on simulated and real-time experiments. Examples
of trajectories from simulated and real-time experiments are
depicted in Figures 10-14. The leftmost plot in each figure
shows the ground truth trajectory and the trajectory estimated
by semantic SLAM. The middle plot shows the output of the
proposed approach compared to the ground truth trajectory.
The rightmost plot depicts the ATE along the trajectory for
both the input to the network and its output. Fig. 15 depicts
the regression plot of the variable ϑ for the sequence depicted
in Fig. 14. The Pearson’s regression coefficient, R, is equal
to 0.99996 as shown in the plot. The regression plot for ϑ in
other sequences is very similar.
The proposed stacked-LSTM based approach can generalize
well and is robust to input perturbations. However, the network
may be fine-tuned to learn new noise models, which were
never exposed to the network during training, and hence be
able to recognize a wider variety of pose estimation errors. A
portion of the data obtained from the new environment can
be used to fine-tune the network, which will then be able to
reduce pose estimation error along trajectories recorded under
the same conditions. If the training dataset consists of a wide
range of error patterns, the neural network will have more
potential to perform error reduction along previously unseen
trajectories.
A possible use-case scenario of the proposed approach is in
search and rescue applications. A robot’s mission could be to
find victims in a collapsed structure, then notify the rescue
teams of the victim’s location. While performing semantic
SLAM, the robot can navigate in the environment and once
a victim is found, the robot’s trajectory is communicated to
the rescue team. This trajectory is first refined, using the
proposed stacked-LSTM based noise reduction approach, to
pinpoint the robot’s position accurately. This will help the
TABLE IV: Mean ATE (m) achieved by the proposed approach on
the training, validation, and testing datasets
Training set Validation set Testing set
Input mean ATE (m) 0.65 0.66 0.66
Output mean ATE (m) 0.021 0.026 0.025
TABLE V: Input and output mean ATE (m)
Input mean ATE (m) Output mean ATE (m)
Simulated Trajectory 1 (Fig. 10) 1.59 0.051
Simulated Trajectory 2 (Fig. 11) 0.53 0.041
Freiburg2 Pioneer SLAM Trajectory (Fig. 12) 1.11 0.057
Freiburg2 Pioneer SLAM2 Trajectory (Fig. 13) 0.69 0.048
Freiburg2 Pioneer SLAM3 Trajectory (Fig. 14) 0.52 0.042
human responders to arrive at the location in a shorter time.
Another use-case scenario of the proposed approach can be
seen in applications that require the robot to map its surround-
ing environment. After the robot gathers the required visual
information from the environment, its trajectory is estimated
using semantic SLAM then passed to the proposed stacked-
LSTM based neural network for possible error identification
and reduction. Reprojecting visual observations along the
corrected trajectory will result in a more accurate and reliable
map of the environment compared to that obtained directly
from the solver.
V. CONCLUSION
The error in estimating vehicle and landmark poses signifi-
cantly hinders the success of semantic SLAM and its usability
in high-accuracy critical applications. Several predictable and
unpredictable sources of uncertainties contribute to forming
such error, including landmark local pose estimation, object
detection, incorrect data association, visual sensor noise, and
odometry drift. The work proposed in this paper employs a
novel, general, and efficient deep learning approach to enhance
the robustness of semantic SLAM, by reducing the combined
effect of such errors on the trajectory estimation. A stacked
LSTM based neural network was developed after conducting
an extensive search among different neural network types
and hyperparameters. The architecture was adopted based on
the network’s ability to capture various error patterns and
significantly decrease the estimation error. Simulated and real-
time experiments, including three sequences from the TUM
RGB-D dataset, were used to measure the performance of
the proposed approach. The results have proven the ability
of the proposed approach to successfully identify and reduce
pose estimation errors resulting from multiple factors in the
semantic SLAM pipeline. The performance of the neural network
was quantified using the mean absolute trajectory error. It was
compared to that of SNN, DNN, and SVM on several test sets,
and the maximum estimation error reduction was evidently
achieved by the proposed approach.
The capability of the proposed approach to generalize well
to new semantic SLAM trajectory estimates relies heavily on
the training dataset. The dataset should include samples col-
lected from different environments, under varying conditions,
and using various sets of sensors. If the network is exposed to
a wide range of error patterns during training, it will have
a higher potential to perform error reduction in previously
unseen trajectories.
The work presented in this paper can be extended to 3D
SLAM which applies to a wider range of robotic vehicles
like aerial vehicles. In such case, the neural network will be
expected to predict more variables and account for various
noise models. It can also be integrated into the online SLAM
algorithm and hence error reduction will be performed right
after computing a new SLAM estimate, making it more
efficient.
ACKNOWLEDGEMENTS
This publication is based upon research work supported
by the Khalifa University of Science and Technology under
Award No. RC1-2018-KUCARS.
REFERENCES
[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira,
I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous
localization and mapping: Toward the robust-perception age,” IEEE
Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[2] X. Chen, H. Zhang, H. Lu, J. Xiao, Q. Qiu, and Y. Li, “Robust SLAM
system based on monocular vision and LiDAR for robotic urban search
and rescue,” SSRR 2017 - 15th IEEE International Symposium on Safety,
Security and Rescue Robotics, Conference, pp. 41–47, 2017.
[3] A. Denker and M. C. İşeri, “Design and implementation of a semi-
autonomous mobile search and rescue robot: SALVOR,” IDAP 2017
- International Artificial Intelligence and Data Processing Symposium,
2017.
[4] J. Casper and R. R. Murphy, “Human-robot interactions during the
robot-assisted urban search and rescue response at the World Trade
Center,IEEE Transactions on Systems, Man, and Cybernetics, Part
B: Cybernetics, vol. 33, no. 3, pp. 367–385, 2003.
[5] A. Pfrunder, P. V. Borges, A. R. Romero, G. Catt, and A. Elfes,
“Real-time autonomous ground vehicle navigation in heterogeneous
environments using a 3D LiDAR,IEEE International Conference on
Intelligent Robots and Systems, vol. 2017-September, pp. 2601–2608,
2017.
[6] D. Ramadasan, M. Chevaldonne, and T. Chateau, “Real-time slam for
static multi-objects learning and tracking applied to augmented reality
applications,” in 2015 IEEE Virtual Reality (VR), March 2015, pp. 267–
268.
[7] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object Detection in
20 Years: A Survey,” pp. 1–39, 2019. [Online]. Available: http:
//arxiv.org/abs/1905.05055
[8] S. Soetens, A. Sarris, and K. Vansteenhuyse, “Pose Estimation Errors,
the Ultimate Diagnosis,” European Space Agency, (Special Publication)
ESA SP, vol. 1, no. 515, pp. 181–184, 2002.
[9] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[10] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-
wise training of deep networks,” Advances in Neural Information
Processing Systems, no. 1, pp. 153–160, 2007.
[11] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural
Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,Communications of the ACM,
vol. 60, no. 6, pp. 84–90, 2017.
[13] C. Farabet, C. Couprie, L. Najman, and Y. Lecun, “Learning Hierarchical
Features for Scene Labeling,” Pattern Analysis and Machine Intelligence,
IEEE Transactions on, vol. 35, no. 8, pp. 1915–1929, 2013.
[14] A. Garcia-Perez, F. Gheriss, D. Bedford, A. Garcia-Perez, F. Gheriss, and
D. Bedford, “Going deeper with convolutions,Designing and Tracking
Knowledge Management Metrics, pp. 163–182, 2019.
[15] S. Wang, R. Clark, H. Wen, and N. Trigoni, “End-to-end, sequence-to-
sequence probabilistic visual odometry through deep neural networks,”
International Journal of Robotics Research, vol. 37, no. 4-5, pp. 513–
542, 2018.
[16] M. K. Al-Sharman, Y. Zweiri, M. A. K. Jaradat, R. Al-Husari, D. Gan,
and L. D. Seneviratne, “Deep-learning-based neural network training for
state estimation enhancement: Application to attitude estimation,” IEEE
Transactions on Instrumentation and Measurement, vol. 69, no. 1, pp.
24–34, 2019.
[Fig. 12: Real-time experiment - Freiburg2 Pioneer SLAM sequence. Panels: ground truth vs. semantic SLAM trajectory and ground truth vs. stacked LSTM trajectory (X(m), Y(m)), and per-pose ATE(m) of the stacked LSTM and semantic SLAM outputs.]
[Fig. 13: Real-time experiment - Freiburg2 Pioneer SLAM2 sequence. Panels: ground truth vs. semantic SLAM trajectory and ground truth vs. stacked LSTM trajectory (X(m), Y(m)), and per-pose ATE(m) of the stacked LSTM and semantic SLAM outputs.]
[Fig. 14: Real-time experiment - Freiburg2 Pioneer SLAM3 sequence. Panels: ground truth vs. semantic SLAM trajectory and ground truth vs. stacked LSTM trajectory (X(m), Y(m)), and per-pose ATE(m) of the stacked LSTM and semantic SLAM outputs.]
[Fig. 15: Regression plot for the orientation variable ϑ for the Freiburg2 Pioneer SLAM3 sequence: network output (rad) vs. target (rad), with fitted line and Y = T reference; R = 0.99996.]