An Evaluation of the RGB-D SLAM System
Felix Endres¹, Jürgen Hess¹, Nikolas Engelhard¹, Jürgen Sturm², Daniel Cremers², Wolfram Burgard¹
Abstract: We present an approach to simultaneous localization and mapping (SLAM) for RGB-D cameras like the Microsoft Kinect. Our system concurrently estimates the trajectory of a hand-held Kinect and generates a dense 3D model of the environment. We present the key features of our approach and evaluate its performance thoroughly on a recently published dataset, including a large set of sequences of different scenes with varying camera speeds and illumination conditions. In particular, we evaluate the accuracy, robustness, and processing time for three different feature descriptors (SIFT, SURF, and ORB). The experiments demonstrate that our system can robustly deal with difficult data in common indoor scenarios while being fast enough for online operation. Our system is fully available as open-source.
I. INTRODUCTION
Many relevant applications in robotics and computer vi-
sion require the ability to quickly acquire 3D models of the
environment and to estimate the camera pose with respect to
this model. A robot, for example, needs to know its location
in the world to navigate between places. This problem is a
classical and challenging chicken-and-egg problem because
localizing the camera in the world requires the 3D model of
the world, and building the 3D model in turn requires the
pose of the camera. Therefore, both the camera trajectory
and the 3D model need to be estimated at the same time.
With the introduction of the Microsoft Kinect camera, a
new sensor has appeared on the market that provides both
color images and dense depth maps at full video frame rate.
This allows us to create a novel approach to SLAM that
combines the scale information of 3D depth sensing with the
strengths of visual features to create dense 3D environment
representations.
Our approach consists of four processing steps as illustrated in Figure 2. First, we extract visual features from the incoming color images. Then we match these features against features from previous images. By evaluating the depth images at the locations of these feature points, we obtain a set of point-wise 3D correspondences between any two frames. Based on these correspondences, we estimate the relative transformation between the frames using RANSAC.
¹ Felix Endres, Jürgen Hess, Nikolas Engelhard, and Wolfram Burgard are with the Department of Computer Science, University of Freiburg, Germany. {endres,hess,engelhar,burgard}@informatik.uni-freiburg.de
² Jürgen Sturm and Daniel Cremers are with the Computer Science Department, Technical University of Munich, Germany. {sturmju,cremers}@in.tum.de
This work has partly been supported by the European Commission under grant agreement number FP7-248258-First-MM.
Fig. 1. Our approach registers sequences of RGB-D images (a) to recover the trajectory of the camera (b) and to create globally consistent volumetric 3D models (c). (a) Input data: sequence of RGB-D images. (b) Ground truth and estimated camera trajectory projected to 2D (axes x [m], y [m]). (c) Output: voxel grid (here displayed at 1 cm resolution).
As the pairwise pose estimates between frames are not necessarily globally consistent, we optimize the resulting pose graph in the fourth step using the g2o solver, which is a general open-source framework for optimizing graph-based nonlinear error functions [11]. The output of our algorithm at this stage is a globally consistent 3D model of the perceived environment, represented as a colored point cloud. Finally, we use the OctoMap library [33] to generate a volumetric representation of the environment.
The contribution of this paper is twofold. First we describe
our RGB-D SLAM system and its components. Our second
contribution is a thorough analysis of the performance of
our system on a recently published benchmark dataset [29].
The dataset contains both the RGB-D images of the Kinect and time-synchronized ground truth poses obtained from a high-accuracy motion-capture system. As the benchmark
comes with a tool that evaluates the accuracy of an estimated
trajectory, the measured performance of our algorithm can be
objectively compared to those of other systems. Therefore,
we hope to establish with our evaluation a baseline for future
RGB-D SLAM systems. All code required to reproduce, verify, and improve on our results is also fully available
online. To summarize our results, we found that our sys-
tem provides the camera pose with an average RMSE of
9.7 cm and 3.95° in a typical office environment, and can
very robustly handle even the high speed sequences of the
benchmark with average velocities of up to 50 deg/s and
0.43 m/s.
The remainder of this paper is organized as follows.
We briefly review related approaches in Sec. II, before we
introduce our approach in Sec. III. We evaluate our system
in Sec. IV using the RGB-D benchmark sequences.
II. RELATED WORK
The general problem of SLAM has a long history in
robotics [30], [22], [10], [6], [15], [9], [23]. Especially
methods designed to learn three-dimensional maps of the
environment employ laser scanners or Time-of-Flight (ToF)
cameras to provide dense point clouds of an environment.
Many modern variants apply the iterative closest point (ICP) algorithm [2], [26], [27] for aligning pairs of local point clouds to establish constraints between observations. These constraints are then used to find the maximum-likelihood map.
Visual SLAM systems [5], [16], [28], in the computer vision literature often referred to as structure-and-motion estimation [14], [21], typically extract sparse keypoints from the camera images. Visual feature points have the advantage of being more informative, which simplifies data association. Relevant feature descriptors include SIFT [20], SURF [1], and the recently introduced ORB features [25], as
well as parallelized versions thereof like SIFTGPU [32]. In
the monocular setting, the absolute scale of the map cannot
be determined, so that additional normalization steps are
required during optimization. Stereo SLAM systems [17],
[24] do not suffer from this limitation as the depth can be
calculated from the disparity between the two images. In
general, however, the disparity can only be estimated for
distinctive points in the image, i.e., surfaces with little or no texture cannot be matched easily.

Fig. 2. Schematic overview of our approach. An RGB-D sensor provides RGB and depth images; the frontend performs pairwise feature matching (SURF, SIFT, ...) and pairwise 6D transformation estimation (RANSAC), while the backend performs global pose graph optimization (g2o) and voxelization (OctoMap) of the registered 3D point clouds into a 3D occupancy grid map. We extract visual features that we associate to 3D points. Subsequently, we mutually register pairs of image frames and build a pose graph that is optimized using g2o. Finally, we generate a textured voxel occupancy map using the OctoMapping approach.
Novel RGB-D sensors that are based on structured light
like the Microsoft Kinect directly provide dense depth maps
and color images. Note that in general SLAM approaches
that operate on RGB-D images are structurally different from
stereo systems as the input is dense RGB-D instead of two
color images. Fioraio and Konolige [7] recently presented a system that uses bundle adjustment to align the dense point clouds of the Kinect directly, however without further exploiting the RGB images. Most similar to our work is the
approach of Henry et al. [12]. Their approach uses sparse
keypoint matches between consecutive color images as an
initialization to ICP. In their experiments, they found, however, that the computationally expensive ICP step was often not necessary. Therefore, they improved the algorithm so that ICP was only used if few (or no) keypoint matches could be established. While Henry et al. use sparse bundle adjust-
ment [19] for the optimization of the 2.5D reprojection errors
in RGB-D image space, we optimize the 3D pose graph using the g2o framework [18]. Finally, the post-processing
of the two approaches is different: Henry et al. post-process
the resulting point cloud into a surfel representation, while
we create a volumetric voxel representation [33] that can
directly be used for robot localization, path planning and
navigation [13].
Finally, in contrast to all of the above approaches, we
evaluate our system on a publicly available benchmark [29].
Therefore, our results can directly be compared to other
approaches that are evaluated on the same datasets. We have
fully released our code (including the evaluation routines) as open-source (http://ros.org/wiki/rgbdslam) to ensure that our results are reproducible and scientifically verifiable.
III. APPROACH
This section gives a detailed description of our approach.
A schematic overview of our system is given in Figure 2.
The trajectory estimation is divided into a frontend and
a backend. Whereas the frontend extracts spatial relations between individual observations, the backend optimizes the poses of these observations in a so-called pose graph with respect to a non-linear error function.
In the frontend, we use the visual image of the RGB-D
sensor to detect keypoints and extract descriptors. These are
matched to previously extracted descriptors and the relative
transformation between the sensor poses is computed using
RANSAC. Together with the depth information this allows
us to register dense point clouds in a common coordinate
system.
To compute globally optimal poses for the sensor posi-
tions w.r.t. the estimated relative transformations, we use a
graph-based optimization routine. After reconstruction of the
trajectory, we compute an occupancy voxel grid map.
A. SLAM Frontend
The frontend is responsible for establishing spatial rela-
tions from the sensor data. Our system computes pairwise
relations between camera images by matching of visual
features. We rely on OpenCV [3] for detection, description
and matching of various feature types, namely SURF, SIFT
and ORB. Since SIFT features are computationally much
more demanding than SURF and ORB, we also make use of
a GPU based implementation of SIFT [32].
ORB is a new keypoint detector and feature descriptor
combination recently introduced by Rublee et al. [25]. It
is based on the FAST detector [25] and the BRIEF [4]
descriptor. ORB computes an unambiguous orientation from
the FAST corners and uses it for descriptor extraction, thus
making the combination robust to viewpoint changes. Being
significantly faster to compute than SIFT and SURF, we
added it as an alternative in our system and present an
evaluation in Section IV-B.1.
For the SURF keypoint detector, we apply the self-
adjusting variant which increases or decreases the threshold
on the Hessian to keep the number of keypoints roughly
constant. While this slows down keypoint detection in case
of fluctuating scene properties, variations in the number of
keypoints with a fixed threshold can lead to inaccurate or
even failed motion estimations. Too many features slow the
system down in the matching step and may lead to many
false positives.
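The following minimal sketch illustrates such a self-adjusting detector, assuming OpenCV's contrib module (cv2.xfeatures2d) provides SURF; the target keypoint count, tolerance, and update factor are illustrative values, not the settings used in our system. Reusing the adapted threshold for the next frame keeps the number of detections roughly stable without restarting the search from scratch.

```python
import cv2

def detect_surf_adaptive(gray, target=1500, tol=300,
                         hessian_threshold=400.0, max_iters=5):
    """Self-adjusting SURF detection (sketch): raise or lower the Hessian
    threshold until the keypoint count is roughly constant across frames."""
    keypoints = []
    for _ in range(max_iters):
        surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
        keypoints = surf.detect(gray, None)
        if len(keypoints) > target + tol:        # too many keypoints
            hessian_threshold *= 1.5
        elif len(keypoints) < target - tol:      # too few keypoints
            hessian_threshold /= 1.5
        else:
            break
    return keypoints, hessian_threshold
```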
After the detection of the keypoints, we project the feature locations from the image to 3D using the depth measurement at the center of the keypoint. The transformation of the camera pose between two frames can then be computed in closed form from these correspondences [31].
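As an illustration, the pinhole back-projection and an SVD-based closed-form rigid transformation estimate in the spirit of [31] can be sketched as follows. The camera intrinsics (fx, fy, cx, cy) and function names are illustrative, and the sketch ignores lens distortion; with at least three non-collinear correspondences, R and t define the relative camera pose between the two frames.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z into the camera frame
    using the pinhole model (intrinsics fx, fy, cx, cy)."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def rigid_transform_3d(P, Q):
    """Closed-form least-squares rigid transform (R, t) with Q ~ R P + t,
    computed via SVD (Umeyama-style, without scale).
    P, Q: (N, 3) arrays of corresponding 3D points, N >= 3."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                           # enforce det(R) = +1
    t = q_mean - R @ p_mean
    return R, t
```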
However, no visual feature provides perfect reliability with
respect to repeatability and false positives. Further, the depth
data often is inconsistent with the color image, mainly due to missing synchronization of the shutters of the infrared and color cameras, but also due to interpolation at depth jumps.
Since visually salient points often lie at object borders, the
3D feature positions are prone to be at a wrong depth, making
the robust estimation of transformations highly non-trivial.
A well-known approach to cope with noisy data and
outliers is the Random Sample Consensus (RANSAC) al-
gorithm [8]: After matching the feature descriptors of two
frames, we randomly select three matched feature pairs,
which is the minimal number from which a rigid transforma-
tion in SE(3) can be computed. Thanks to the full 3D position
we can efficiently avoid outliers by refusing sample sets for
which the pairwise Euclidean distances do not match. If the
samples pass this test, they are used to compute an estimate
of the rigid transformation. We apply the transformation to all
matched features and count the number of inliers. In our case,
consider a feature point an inlier when its mutual distance
after the transformation is smaller than 3 cm. Subsequently,
we use the inliers to compute a refined transformation. These
steps are iterated and the transformation with most inliers is
kept.
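A compact sketch of this RANSAC loop, reusing the rigid_transform_3d helper sketched above, could look as follows; the iteration count is an illustrative choice, while the 3 cm inlier threshold follows the value stated in the text.

```python
import numpy as np

def ransac_transform(P, Q, iters=200, inlier_thresh=0.03):
    """RANSAC over matched 3D feature points P[i] <-> Q[i] ((N, 3) arrays),
    using the rigid_transform_3d helper sketched above."""
    best_R, best_t = None, None
    best_inliers = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = np.random.choice(len(P), 3, replace=False)
        p, q = P[idx], Q[idx]
        # A rigid transform preserves Euclidean distances, so reject
        # sample sets whose pairwise distances are inconsistent.
        dp = np.linalg.norm(p[[0, 0, 1]] - p[[1, 2, 2]], axis=1)
        dq = np.linalg.norm(q[[0, 0, 1]] - q[[1, 2, 2]], axis=1)
        if np.any(np.abs(dp - dq) > inlier_thresh):
            continue
        R, t = rigid_transform_3d(p, q)
        inliers = np.linalg.norm(P @ R.T + t - Q, axis=1) < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            # Refine the estimate on all inliers and keep the best one.
            best_R, best_t = rigid_transform_3d(P[inliers], Q[inliers])
            best_inliers = inliers
    return best_R, best_t, best_inliers
```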
While the described procedure is very fast (exact timing depends on the number of features and the outlier percentage), after a few seconds the number of past frames is too high to compare a new frame against all previous frames.
Therefore, we select a subset of twenty frames, consisting
of the 3 most recent frames and uniformly sampled earlier
frames, and compute the pairwise transformations in parallel
using threads.
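One possible way to pick such a comparison set is sketched below; the split into the 3 most recent and uniformly sampled earlier frames follows the description above, while the function name and interface are ours.

```python
import numpy as np

def candidate_frames(n_frames, n_recent=3, n_total=20):
    """Select candidate frames to match the newest frame against:
    the most recent frames plus earlier frames sampled uniformly."""
    recent = list(range(max(0, n_frames - n_recent), n_frames))
    n_earlier = min(n_total - n_recent, max(0, n_frames - n_recent))
    earlier = np.linspace(0, max(0, n_frames - n_recent - 1),
                          num=n_earlier, dtype=int).tolist()
    return sorted(set(earlier + recent))
```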
If a frame could be matched to any predecessor, it is added as a node to the pose graph of the SLAM backend, with the pairwise spatial relations connecting it to the existing pose graph. The process applied when no relation to previous nodes can be found depends on the application. If it is tolerable that the map is fragmented, a sensible action would be to keep the node even though it is disconnected, starting a new map fragment that may be connected later on through a loop closure. For evaluation purposes, we do not allow fragmentation in our experiments. If a frame cannot be matched, it is connected to the prior node in the pose graph under the assumption of a constant motion model with high uncertainty. While this usually leads to higher error values than evaluation on the largest fragment, it facilitates the comparison to other approaches by reducing the comparison to fully connected trajectories.
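The fallback can be pictured as follows; `graph` is a hypothetical container with add_edge() and last_motion() methods (not the g2o API), and the information-matrix values are illustrative.

```python
import numpy as np

def insert_frame(graph, new_id, prev_id, T_match):
    """Add a new frame to the pose graph (sketch). If matching failed
    (T_match is None), fall back to a constant-motion guess with a
    low-confidence information matrix."""
    if T_match is not None:
        graph.add_edge(prev_id, new_id, T_match, info=np.eye(6))
    else:
        T_guess = graph.last_motion(prev_id)   # constant motion model
        graph.add_edge(prev_id, new_id, T_guess, info=np.eye(6) * 1e-3)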
B. SLAM Backend
The pairwise transformations between sensor poses, as
computed by the frontend, form the edges of a pose graph.
Due to estimation errors, the edges form no globally con-
sistent trajectory. To create a globally consistent trajectory
we optimize the pose graph using the g2o framework [18]. The g2o framework is an easily extensible graph optimizer that can be applied to a wide range of problems including several variants of SLAM and bundle adjustment. It performs a minimization of a non-linear error function that can be represented as a graph, as for example the one created by the SLAM frontend described above.
TABLE I
We evaluated our system on a large set of sequences from the RGB-D SLAM dataset [29]. On average, our system achieves an accuracy of 9.7 cm and 3.95° and requires approximately 0.35 s of processing time per image.

Sequence Name | Length | Duration | Avg. Angular Velocity | Avg. Transl. Velocity | Frames | Total Runtime | g2o Runtime | Transl. RMSE | Rot. RMSE
FR1 360   |  5.82 m | 28.69 s | 41.60 deg/s | 0.21 m/s |  745 | 145 s |  0.66 s | 0.103 m | 3.41°
FR1 desk2 | 10.16 m | 24.86 s | 29.31 deg/s | 0.43 m/s |  614 | 176 s |  0.68 s | 0.102 m | 3.81°
FR1 desk  |  9.26 m | 23.40 s | 23.33 deg/s | 0.41 m/s |  575 | 199 s |  1.31 s | 0.049 m | 2.43°
FR1 floor | 12.57 m | 49.87 s | 15.07 deg/s | 0.26 m/s | 1214 | 488 s |  3.93 s | 0.055 m | 2.35°
FR1 plant | 14.80 m | 41.53 s | 27.89 deg/s | 0.37 m/s | 1112 | 424 s |  1.28 s | 0.142 m | 6.34°
FR1 room  | 15.99 m | 48.90 s | 29.88 deg/s | 0.33 m/s | 1332 | 423 s |  1.56 s | 0.219 m | 9.04°
FR1 rpy   |  1.66 m | 27.67 s | 50.15 deg/s | 0.06 m/s |  687 | 243 s | 10.26 s | 0.042 m | 2.50°
FR1 teddy | 15.71 m | 50.82 s | 21.32 deg/s | 0.32 m/s | 1395 | 556 s |  1.72 s | 0.138 m | 4.75°
FR1 xyz   |  7.11 m | 30.09 s |  8.92 deg/s | 0.24 m/s |  788 | 365 s | 40.09 s | 0.021 m | 0.90°
Fig. 3. (a) Result of frame-to-frame tracking (no loop closures): using only stepwise tracking leads to undesired results due to the accumulated drift. (b) Result after graph optimization (loop closures): after graph optimization with 5 iterations of g2o, an excellent alignment of most point clouds is achieved; individual clouds are still slightly misaligned. (c) Volumetric 3D map after post-processing: through integration of the free-space information these artefacts are greatly diminished after computation of occupancy probabilities.
Generally, the error function has the form

    F(x) = \sum_{\langle i,j \rangle \in C} e(x_i, x_j, z_{ij})^\top \Omega_{ij} \, e(x_i, x_j, z_{ij})    (1)

    x^* = \operatorname*{argmin}_{x} F(x).    (2)
Here, x = (x_1^\top, \ldots, x_n^\top)^\top is a vector of pose representations x_i. z_{ij} and \Omega_{ij} represent the mean and the information matrix, respectively, of a constraint relating the poses x_i and x_j, i.e., the pairwise transformation computed by the frontend. e(x_i, x_j, z_{ij}) is a vector error function that measures how well the poses x_i and x_j satisfy the constraint z_{ij}. It is 0 when x_i and x_j perfectly match the constraint, i.e., when the difference of the poses exactly matches the estimated transformation. For more details on the error function, we refer the interested reader to [18].
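To make the error term concrete, the sketch below evaluates the residual of a single edge and the total error F(x) for poses given as 4x4 homogeneous matrices. It only illustrates Eq. (1): g2o internally works on a minimal parameterization with analytic Jacobians, and the 4-dimensional residual and matching 4x4 information matrices used here are a simplification of ours.

```python
import numpy as np

def edge_error(Ti, Tj, Zij):
    """Residual of one constraint: the discrepancy between the measured
    relative transform Zij and the one implied by the current pose
    estimates Ti, Tj (all 4x4 homogeneous matrices). It is zero when the
    poses match the constraint exactly."""
    E = np.linalg.inv(Zij) @ (np.linalg.inv(Ti) @ Tj)
    t_err = E[:3, 3]
    # rotation error as the angle of the residual rotation
    cos_angle = np.clip((np.trace(E[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return np.hstack([t_err, np.arccos(cos_angle)])   # 4-vector residual

def total_error(poses, constraints):
    """F(x) = sum over constraints <i, j> of e^T Omega e, cf. Eq. (1)."""
    F = 0.0
    for i, j, Zij, Omega in constraints:
        e = edge_error(poses[i], poses[j], Zij)
        F += e @ Omega @ e
    return F
```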
Global optimization is especially beneficial in the case of a loop closure, i.e., when revisiting known parts of the map, since the loop-closing edges in the graph allow the accumulated error to be diminished. Unfortunately, large errors in the motion estimation step may impede the accuracy of large parts of the graph. This is primarily a problem in areas with highly ambiguous features. We therefore use a threshold to prune edges with high error values after the initial convergence and continue the optimization.
C. Map Representation
The system described so far computes a globally consistent
trajectory. Using this trajectory, the original data can be used
to construct a representation of the environment. Projecting
all point measurements into a global 3D coordinate system
leads to a straightforward point-based representation. Such
a model, however, is highly redundant and requires vast
computational and memory resources.
To overcome these limitations, we use 3D occupancy grid
maps to represent the environment. In our implementation,
we use the octree-based mapping framework OctoMap [33].
Voxels are managed in an efficient tree structure that leads to
a compact memory representation and inherently allows for
map queries at multiple resolutions. The use of probabilistic
occupancy estimation furthermore provides a means of cop-
ing with noisy measurements and errors in pose estimation.
In contrast to a point-based representation, it represents free space and unmapped areas explicitly, which is essential for robot navigation and exploration tasks.
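As an illustration of probabilistic occupancy estimation, the toy voxel map below accumulates log-odds evidence per cell. It mimics the idea behind OctoMap [33] but stores voxels in a flat hash map rather than an octree, and the update increments are illustrative values, not OctoMap's defaults.

```python
import numpy as np

class VoxelOccupancyGrid:
    """Toy occupancy map with per-voxel log-odds updates (sketch)."""

    def __init__(self, resolution=0.01, l_hit=0.85, l_miss=-0.4):
        self.res = resolution
        self.l_hit, self.l_miss = l_hit, l_miss
        self.cells = {}                      # voxel index -> log-odds value

    def _key(self, point):
        return tuple(np.floor(np.asarray(point) / self.res).astype(int))

    def integrate_hit(self, point):          # endpoint of a depth measurement
        k = self._key(point)
        self.cells[k] = self.cells.get(k, 0.0) + self.l_hit

    def integrate_miss(self, point):         # free space along the sensor ray
        k = self._key(point)
        self.cells[k] = self.cells.get(k, 0.0) + self.l_miss

    def occupancy(self, point):
        """Convert accumulated log-odds back to an occupancy probability."""
        l = self.cells.get(self._key(point), 0.0)
        return 1.0 - 1.0 / (1.0 + np.exp(l))
```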
IV. EVALUATION
To evaluate our system, we use the RGB-D bench-
mark [29] which provides a dataset of Kinect sequences
with synchronized ground truth. Furthermore, the benchmark
provides an evaluation tool that computes the root mean
square error (RMSE) given an estimated trajectory.
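For the translational component, such an RMSE computation essentially amounts to the sketch below, reusing the rigid_transform_3d helper sketched in Sec. III-A to align the estimated trajectory to the ground truth first; for comparable numbers, the benchmark's own evaluation tool should be used, and this sketch assumes the two trajectories are already time-associated.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Translational RMSE of the absolute trajectory error after rigidly
    aligning the estimated trajectory to the ground truth (sketch).
    est_xyz, gt_xyz: (N, 3) arrays of time-associated positions."""
    R, t = rigid_transform_3d(est_xyz, gt_xyz)   # helper from Sec. III-A sketch
    aligned = est_xyz @ R.T + t
    return np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1)))
```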
TABLE II
Evaluation of the accuracy with respect to the keypoint detector and feature extractor.

Detector/Descriptor | Success | Transl. RMSE (Avg. ± Std. Dev.) | Rot. RMSE (Avg. ± Std. Dev.)
SIFTGPU | 9/9 | 0.097 m ± 0.063 m | 3.95° ± 2.47°
SURF    | 9/9 | 0.098 m ± 0.078 m | 3.39° ± 1.55°
ORB     | 7/9 | 0.215 m ± 0.189 m | 7.75° ± 5.55°
For our evaluation we choose the Freiburg1 (FR1) dataset consisting of nine sequences recorded in a typical indoor environment. Two of these, the sequences "FR1 xyz" and "FR1 rpy", have very simple motions. The results on these sequences show the capabilities of our approach in the best case. However, results like this can usually be achieved when the sensor can be moved carefully in an indoor environment, e.g., during manual map recording prior to employing a robot. The other datasets are more challenging as they cover larger areas of the office space and unrestricted camera motions.
In this section, we first present our results on the accuracy
of our SLAM system and evaluate how the accuracy depends
on the chosen feature detector and sensor frame rate. Second,
we investigate the influence of various parameters on the
runtime of our system.
We evaluated our system on all nine sequences from the
FR1 set, see Table I. As can be seen from this table, the
average camera velocities range from 9 deg/s to 50 deg/s and
from 0.06 m/s to 0.43 m/s.
A. Accuracy of the Trajectory Estimation
In our first rounds of experiments, we evaluated the
accuracy of our system on all sequences using SIFTGPU
feature extraction and approximate matching using FLANN.
On the simple “xyz” and “rpy” sequences, we obtain the
best values of 2.1 cm and 4.1 cm RMSE, respectively. We achieved the worst result of 20.1 cm RMSE on the
“room” sequence of 16 m length consisting of a long round
through an office space. In general, this evaluation shows that
our approach performs well in most of the sequences. High angular and translational velocities pose no obvious difficulty, even though consecutive frames then exhibit less overlap.
Further, we investigated the influence of different choices
for the keypoint detectors and feature descriptors. Table II shows the root mean square of the translational and rotational error, averaged over the sequences. The accuracy achieved is similar for SIFT
and SURF. In contrast, ORB features turn out to be less
accurate and also less reliable: Using ORB, the trajectory
estimation failed for two of the nine sequences. This could
not be resolved by adapting the parameters of the feature
detector to find more keypoints.
B. Runtime Evaluation
In this section we discuss the computational requirements
of the presented system. All experiments were carried out on
a quad-core CPU with 8 GB of memory. By parallelizing our
software we achieved a speed-up by a factor between 2 and
2.5. Graph optimization is started after all frames have been processed and runs on a single core.
TABLE III
Runtime analysis of feature extraction with respect to the chosen interest point detector and feature descriptor. We found that ORB is faster than SURF and SIFT by one order of magnitude.

Type    | Keypoint Count (Avg. ± Std. Dev.) | Avg. Runtime (Detection, Extraction)
SURF    | 1733 ± 153 | 0.34 s, 0.34 s
ORB     | 1117 ± 558 | 0.018 s, 0.0086 s
SIFTGPU | 1918 ± 599 | 0.19 s
TABLE IV
Runtime analysis of pairwise frame registration. Using an approximate nearest-neighbor method during matching reduces the runtime by a factor of two.

Matcher     | Runtime (Avg. ± Std. Dev.)
FLANN       | 0.203 s ± 0.078 s
Brute force | 0.386 s ± 0.120 s
The runtime results
presented in Table I were generated using SIFTGPU and
FLANN matching. On average, our system required 0.35 s
per frame. The overall runtime performance strongly depends
on the configurations presented in the previous sections.
Therefore, we evaluated the runtime performance in more
detail in the remainder of this section.
1) Feature Detection and Descriptor Extraction: Our ap-
proach has to detect features and extract their descriptors in
each incoming image frame. Table III shows a comparison
of the computation time required for the different feature
types as described in Section III-A. We found that ORB is
faster than SURF and SIFTGPU by one order of magnitude.
However, the system produced high errors in two of the nine
sequences. SIFTGPU is still about 3.5 times as fast as the
non-parallelized SURF implementation.
2) Feature Matching and Motion Estimation: Feature
matching and motion estimation needs to be computed at
least once per frame. If the current frame is only matched
against one predecessor, the resulting camera trajectory will
quickly accumulate errors over time. Feature matching against many frames is costly to compute. However, especially since we do not assume the availability of any odometry information, the system can only remain accurate over longer trajectories if loop closures are possible. Further, more information
about pairwise relative transformations makes the trajectory
estimation more robust to errors in pose estimation. However,
a densely connected pose graph also requires more time to
optimize. We therefore found matching the current features
to those of 20 previous frames a good compromise. Table IV
shows the average runtime for matching and motion estima-
tion. We found that FLANN reduces the time required for
frame-to-frame registration by a factor of two.
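For reference, approximate descriptor matching with OpenCV's FLANN-based matcher can be sketched as follows; the kd-tree index parameters, search checks, and the ratio test are common illustrative choices, not necessarily the exact configuration of our system.

```python
import cv2

# FLANN kd-tree index for float descriptors such as SIFT/SURF (sketch).
FLANN_INDEX_KDTREE = 1
flann = cv2.FlannBasedMatcher(dict(algorithm=FLANN_INDEX_KDTREE, trees=4),
                              dict(checks=32))

def match_descriptors(desc_new, desc_prev, ratio=0.8):
    """Return putative matches between two descriptor sets, filtered with
    Lowe's ratio test (one common filtering choice)."""
    knn = flann.knnMatch(desc_new, desc_prev, k=2)
    return [m[0] for m in knn
            if len(m) == 2 and m[0].distance < ratio * m[1].distance]
```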
3) Pose Graph Optimization: The optimization of small
pose graphs is fast enough to be done in real time, i.e., for
every frame. With longer sequences of densely connected
poses, the optimization time increases. However, if the
motion estimates are reliable, the global optimization is not
required in every step. In all sequences, the ratio of graph
optimization versus the total runtime was below 6 %.
V. CONCLUSION AND OUTLOOK
We presented a novel approach to visual SLAM from
RGB-D sensors. Our approach extracts visual keypoints from
the color images and uses the depth images to localize them
in 3D. We use RANSAC to robustly estimate the trans-
formations between RGB-D frames and optimize the pose
graph using non-linear optimization. Finally, we generate a
volumetric 3D map of the environment that can be used
for robot localization, navigation and path planning. We
evaluated our approach quantitatively on a publicly available
RGB-D dataset. On this dataset our system achieves on
average an accuracy of 9.7 cm and 3.95°. With an average
frame processing time of 0.35 s our approach is suitable
for online operation. To allow other researchers to use our
software and reproduce the results (and improve on them),
we have released all source code required to run and re-
evaluate our RGB-D SLAM system as open-source.
During the evaluation of our system, we discovered that
sometimes erroneous edges are created that lead to a degra-
dation of the mapping result. It would be interesting to
see how such problems can be efficiently detected and
possibly repaired autonomously. In the near future, we want
to optimize the keypoint matching scheme, for example by
adding a feature dictionary, pruning never matched features,
and integrating keypoints as landmarks directly during non-
linear optimization.
REFERENCES
[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust
features (SURF). Comput. Vis. Image Underst., 110:346–359, 2008.
[2] P. J. Besl and H. D. McKay. A method for registration of 3-D shapes.
IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI), 14(2):239–256, 1992.
[3] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with
the OpenCV Library. O’Reilly Media, 2008.
[4] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary Robust
Independent Elementary Features. In Proc. of the Europ. Conf. on
Computer Vision (ECCV), 2010.
[5] A.J. Davison. Real-time simultaneous localisation and mapping with
a single camera. In Proc. of the IEEE Intl. Conf. on Computer Vision
(ICCV), 2003.
[6] F. Dellaert. Square Root SAM. In Proc. of Robotics: Science and
Systems (RSS), pages 177–184, 2005.
[7] N. Fioraio and K. Konolige. Realtime visual and point cloud slam.
In Proc. of the RGB-D Workshop on Advanced Reasoning with Depth
Cameras at Robotics: Science and Systems Conf. (RSS), 2011.
[8] M.A. Fischler and R.C. Bolles. Random sample consensus: a paradigm
for model fitting with applications to image analysis and automated
cartography. Commun. ACM, 24(6):381–395, 1981.
[9] U. Frese, P. Larsson, and T. Duckett. A multilevel relaxation algorithm
for simultaneous localisation and mapping. IEEE Trans. on Robotics,
21(2):1–12, 2005.
[10] G. Grisetti, S. Grzonka, C. Stachniss, P. Pfaff, and W. Burgard.
Efficient estimation of accurate maximum likelihood maps in 3d. In
Proc. of the Intl. Conf. on Intelligent Robots and Systems (IROS),
2007.
[11] G. Grisetti, R. Kümmerle, C. Stachniss, U. Frese, and C. Hertzberg.
Hierarchical optimization on manifolds for online 2D and 3D mapping.
In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA),
2010.
[12] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping:
Using depth cameras for dense 3D modeling of indoor environments.
In Proc. of the Intl. Symp. on Experimental Robotics (ISER), 2010.
[13] A. Hornung, K.M. Wurm, and M. Bennewitz. Humanoid robot
localization in complex indoor environments. In Proc. of the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS),
2010.
[14] H. Jin, P. Favaro, and S. Soatto. Real-time 3-d motion and structure
of point-features: A front-end for vision-based control and interaction.
In Proc. of the IEEE Intl. Conf. on Computer Vision and Pattern
Recognition (CVPR), 2000.
[15] M. Kaess, A. Ranganathan, and F. Dellaert. iSAM: Incremental
smoothing and mapping. IEEE Trans. on Robotics, 24(6):1365–1378,
2008.
[16] G. Klein and D. Murray. Parallel tracking and mapping for small
AR workspaces. In Proc. IEEE and ACM Intl. Symp. on Mixed and
Augmented Reality (ISMAR), 2007.
[17] K. Konolige, M. Agrawal, R. Bolles, C. Cowan, M. Fischler, and
B. Gerkey. Outdoor mapping and navigation using stereo vision. In
Experimental Robotics, volume 39 of Springer Tracts in Advanced
Robotics, pages 179–190. Springer, 2008.
[18] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard.
g2o: A general framework for graph optimization. In Proc. of the
IEEE Intl. Conf. on Robotics and Automation (ICRA), 2011.
[19] M.I.A. Lourakis and A.A. Argyros. SBA: a software package for
generic sparse bundle adjustment. ACM Transactions on Mathematical
Software, 2009.
[20] D.G. Lowe. Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60(2):91–110, 2004.
[21] D. Nister. Preemptive RANSAC for live structure and motion estima-
tion. In Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV),
2003.
[22] A. Nüchter, K. Lingemann, J. Hertzberg, and H. Surmann. 6D
SLAM with approximate data association. In Proc. of the 12th
Intl. Conference on Advanced Robotics (ICAR), pages 242–249, 2005.
[23] E. Olson, J. Leonard, and S. Teller. Fast iterative optimization of pose
graphs with poor initial estimates. In Proc. of the IEEE Intl. Conf. on
Robotics and Automation (ICRA), pages 2262–2269, 2006.
[24] L. M. Paz, P. Pinies, J. D. Tardos, and J. Neira. Large-scale 6-
DOF SLAM with stereo-in-hand. IEEE Transactions on Robotics,
24(5):946–957, 2008.
[25] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: an efficient
alternative to SIFT or SURF. In Proc. of the IEEE Intl. Conf. on
Computer Vision (ICCV), volume 13, 2011.
[26] S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm.
In Proc. of the Intl. Conf. on 3-D Digital Imaging and Modeling, 2001.
[27] A. Segal, D. Haehnel, and S. Thrun. Generalized-ICP. In Proceedings
of Robotics: Science and Systems, 2009.
[28] H. Strasdat, J. M. M. Montiel, and A. Davison. Scale drift-aware
large scale monocular slam. In Proceedings of Robotics: Science and
Systems, 2010.
[29] J. Sturm, S. Magnenat, N. Engelhard, F. Pomerleau, F. Colas, W. Bur-
gard, D. Cremers, and R. Siegwart. Towards a benchmark for RGB-D
SLAM evaluation. In Proc. of the RGB-D Workshop on Advanced
Reasoning with Depth Cameras at Robotics: Science and Systems
Conf. (RSS), 2011.
[30] S. Thrun. Robotic mapping: A survey. In Exploring Artificial
Intelligence in the New Millennium. Morgan Kaufmann, 2003.
[31] Shinji Umeyama. Least-squares estimation of transformation param-
eters between two point patterns. IEEE Transactions on Pattern
Analysis and Machine Intelligence, (13), 1991.
[32] Changchang Wu. SiftGPU: A GPU implementation of scale invariant feature transform (SIFT). http://cs.unc.edu/~ccwu/siftgpu, 2007.
[33] K.M. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Bur-
gard. OctoMap: A probabilistic, flexible, and compact 3D map
representation for robotic systems. In Proc. of the ICRA 2010
Workshop on Best Practice in 3D Perception and Modeling for Mobile
Manipulation, 2010.