An Evaluation of the RGB-D SLAM System
Felix Endres¹, Jürgen Hess¹, Nikolas Engelhard¹, Jürgen Sturm², Daniel Cremers², Wolfram Burgard¹
Abstract: We present an approach to simultaneous localization and mapping (SLAM) for RGB-D cameras like the Microsoft Kinect. Our system concurrently estimates the trajectory of a hand-held Kinect and generates a dense 3D model of the environment. We present the key features of our approach and evaluate its performance thoroughly on a recently published dataset, including a large set of sequences of different scenes with varying camera speeds and illumination conditions. In particular, we evaluate the accuracy, robustness, and processing time for three different feature descriptors (SIFT, SURF, and ORB). The experiments demonstrate that our system can robustly deal with difficult data in common indoor scenarios while being fast enough for online operation. Our system is fully available as open-source.
I. INTRODUCTION
Many relevant applications in robotics and computer vi-
sion require the ability to quickly acquire 3D models of the
environment and to estimate the camera pose with respect to
this model. A robot, for example, needs to know its location
in the world to navigate between places. This problem is a
classical and challenging chicken-and-egg problem because
localizing the camera in the world requires the 3D model of
the world, and building the 3D model in turn requires the
pose of the camera. Therefore, both the camera trajectory
and the 3D model need to be estimated at the same time.
With the introduction of the Microsoft Kinect camera, a
new sensor has appeared on the market that provides both
color images and dense depth maps at full video frame rate.
This allows us to create a novel approach to SLAM that
combines the scale information of 3D depth sensing with the
strengths of visual features to create dense 3D environment
representations.
Our approach consists of four processing steps as illus-
trated in Figure 2. First, we extract visual features from
the incoming color images. Then we match these features
against features from previous images. By evaluating the
depth images at the locations of these feature points, we
obtain a set of point-wise 3D correspondences between any
two frames. Based on these correspondences, we estimate the
relative transformation between the frames using RANSAC.
As the pairwise pose estimates between frames are not
¹ Felix Endres, Jürgen Hess, Nikolas Engelhard and Wolfram Burgard are with the Department of Computer Science, University of Freiburg, Germany. {endres,hess,engelhar,burgard}@informatik.uni-freiburg.de
² Jürgen Sturm and Daniel Cremers are with the Computer Science Department, Technical University of Munich, Germany. {sturmju,cremers}@in.tum.de
This work has partly been supported by the European Commission under grant agreement number FP7-248258-First-MM.
Fig. 1. Our approach registers sequences of RGB-D images (a, input data) to recover the trajectory of the camera (b, ground truth and estimated trajectory projected to 2D, axes in meters) and to create globally consistent volumetric 3D models (c, output voxel grid, here displayed at 1 cm resolution).
necessarily globally consistent, we optimize the resulting pose graph in the fourth step using the g2o solver, which is a general open-source framework for optimizing graph-based nonlinear error functions [11]. The output of our algorithm at this stage is a globally consistent 3D model of the perceived environment, represented as a colored point cloud. Finally, we use the OctoMap library [33] to generate a volumetric representation of the environment.
The contribution of this paper is twofold. First we describe
our RGB-D SLAM system and its components. Our second
contribution is a thorough analysis of the performance of
our system on a recently published benchmark dataset [29].
The dataset contains both the RGB-D images of the Kinect
with time-synchronized ground truth poses obtained from
a high-accuracy motion-capture system. As the benchmark
comes with a tool that evaluates the accuracy of an estimated
trajectory, the measured performance of our algorithm can be
objectively compared to those of other systems. Therefore,
we hope to establish with our evaluation a baseline for future
RGB-D SLAM systems. All code required to reproduce, verify, and improve on our results is also fully available online. To summarize our results, we found that our system provides the camera pose with an average RMSE of 9.7 cm and 3.95° in a typical office environment, and can robustly handle even the high-speed sequences of the benchmark with average velocities of up to 50 deg/s and 0.43 m/s.
The remainder of this paper is organized as follows.
We briefly review related approaches in Sec. II, before we
introduce our approach in Sec. III. We evaluate our system
in Sec. IV using the RGB-D benchmark sequences.
II. RELATED WORK
The general problem of SLAM has a long history in
robotics [30], [22], [10], [6], [15], [9], [23]. In particular, methods designed to learn three-dimensional maps of the environment employ laser scanners or Time-of-Flight (ToF) cameras to obtain dense point clouds. Many modern variants apply the iterative closest point (ICP) algorithm [2], [26], [27] to align pairs of local point clouds and thereby establish constraints between observations. These constraints are then used to find the maximum likelihood map.
Visual SLAM systems [5], [16], [28], in the computer vision literature often referred to as structure and motion estimation [14], [21], typically extract sparse keypoints from the camera images. Visual feature points have the advantage of being more informative, which simplifies data association. Relevant feature descriptors include SIFT [20],
SURF [1], and the recently introduced ORB features [25], as
well as parallelized versions thereof like SIFTGPU [32]. In
the monocular setting, the absolute scale of the map cannot
be determined, so that additional normalization steps are
required during optimization. Stereo SLAM systems [17],
[24] do not suffer from this limitation as the depth can be
calculated from the disparity between the two images. In
general, however, the disparity can only be estimated for
[Fig. 2 block diagram] RGB-D sensor → RGB images → pairwise feature matching (SURF, SIFT, ...) → pairwise 6D transformation estimation (RANSAC) → global pose graph optimization (g2o); together with the depth images this yields registered 3D point clouds, which are voxelized (OctoMap) into a 3D occupancy grid map. Feature matching and transformation estimation form the frontend, pose graph optimization the backend.
Fig. 2. Schematic overview of our approach. We extract visual features that we associate to 3D points. Subsequently, we mutually register pairs of image frames and build a pose graph that is optimized using g2o. Finally, we generate a textured voxel occupancy map using the OctoMapping approach.
distinctive points in the image, i.e., surfaces with little or no
texture cannot be matched easily.
Novel RGB-D sensors that are based on structured light
like the Microsoft Kinect directly provide dense depth maps
and color images. Note that in general SLAM approaches
that operate on RGB-D images are structurally different from
stereo systems as the input is dense RGB-D instead of two
color images. Fioraio and Konolige [7] recently presented a system that uses bundle adjustment to align the dense point clouds of the Kinect directly, however without further exploiting the RGB images. Most similar to our work is the
approach of Henry et al. [12]. Their approach uses sparse
keypoint matches between consecutive color images as an
initialization to ICP. In their experiments, they found how-
ever that often the computationally expensive ICP step was
not necessary. Therefore, they improved the algorithm so that
ICP was only used if few (or none) keypoint matches could
be established. While Henry et al. use sparse bundle adjust-
ment [19] for the optimization of the 2.5D reprojection errors
in RGB-D image space, we optimize the 3D pose graph
using the g2o [18] framework. Finally, the post-processing
of the two approaches is different: Henry et al. post-process
the resulting point cloud into a surfel representation, while
we create a volumetric voxel representation [33] that can
directly be used for robot localization, path planning and
navigation [13].
Finally, in contrast to all of the above approaches, we
evaluate our system on a publicly available benchmark [29].
Therefore, our results can directly be compared to other
approaches that are evaluated on the same datasets. We have
fully released our code (including the evaluation routines) as open-source¹ to ensure that our results are reproducible and scientifically verifiable.
III. APPROACH
This section gives a detailed description of our approach.
A schematic overview of our system is given in Figure 2.
The trajectory estimation is divided into a frontend and
a backend. Whereas the frontend extracts spatial relations between individual observations, the backend optimizes the poses of these observations in a so-called pose graph with respect to a non-linear error function.
In the frontend, we use the visual image of the RGB-D
sensor to detect keypoints and extract descriptors. These are
matched to previously extracted descriptors and the relative
transformation between the sensor poses is computed using
RANSAC. Together with the depth information this allows
us to register dense point clouds in a common coordinate
system.
To compute globally optimal poses for the sensor posi-
tions w.r.t. the estimated relative transformations, we use a
graph-based optimization routine. After reconstruction of the
trajectory, we compute an occupancy voxel grid map.
A. SLAM Frontend
The frontend is responsible for establishing spatial rela-
tions from the sensor data. Our system computes pairwise
relations between camera images by matching of visual
features. We rely on OpenCV [3] for detection, description
and matching of various feature types, namely SURF, SIFT
and ORB. Since SIFT features are computationally much
more demanding than SURF and ORB, we also make use of
a GPU based implementation of SIFT [32].
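To make this first step concrete, the following sketch shows how keypoint detection and descriptor extraction can be performed with OpenCV. It is an illustration under stated assumptions, not the released implementation; the function name and parameter values are hypothetical, and SURF is omitted because it is only available in OpenCV's non-free contrib build.
```python
# A minimal sketch of detecting keypoints and computing descriptors with
# OpenCV on a grayscale image. Feature choices and parameters are illustrative.
import cv2

def detect_and_describe(gray_image, feature="ORB", n_features=1000):
    """Detect keypoints and compute descriptors for one RGB-D color frame."""
    if feature == "ORB":
        detector = cv2.ORB_create(nfeatures=n_features)
    elif feature == "SIFT":
        detector = cv2.SIFT_create(nfeatures=n_features)
    else:
        raise ValueError("unsupported feature type: %s" % feature)
    keypoints, descriptors = detector.detectAndCompute(gray_image, None)
    return keypoints, descriptors
```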
ORB is a new keypoint detector and feature descriptor
combination recently introduced by Rublee et al. [25]. It
is based on the FAST detector [25] and the BRIEF [4]
descriptor. ORB computes an unambiguous orientation from
the FAST corners and uses it for descriptor extraction, thus
making the combination robust to viewpoint changes. Since ORB is significantly faster to compute than SIFT and SURF, we added it as an alternative in our system and present an evaluation in Section IV-B.1.
For the SURF keypoint detector, we apply the self-
adjusting variant which increases or decreases the threshold
on the Hessian to keep the number of keypoints roughly
constant. While this slows down keypoint detection in case
of fluctuating scene properties, variations in the number of
keypoints with a fixed threshold can lead to inaccurate or
even failed motion estimations. Too many features slow the
system down in the matching step and may lead to many
false positives.
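The self-adjusting behavior can be pictured as a simple controller like the sketch below; the target keypoint count, tolerance band, and adaptation factor are assumptions for illustration, not the values used in the system.
```python
# A minimal sketch (not the authors' implementation) of a self-adjusting
# detector threshold: raise it when too many keypoints were found, lower it
# when too few, and keep it unchanged inside a tolerance band.
def adjust_hessian_threshold(threshold, n_keypoints, target=600,
                             band=0.25, factor=1.2,
                             t_min=10.0, t_max=10000.0):
    """Return an updated threshold steering the keypoint count toward target."""
    if n_keypoints > target * (1.0 + band):
        threshold *= factor
    elif n_keypoints < target * (1.0 - band):
        threshold /= factor
    return min(max(threshold, t_min), t_max)
```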
After the detection of the keypoints, we project the feature
locations from the image to 3D using the depth measurement
at the center of the keypoint. The transformation of the camera pose between two frames can then be computed in closed form from these correspondences [31].
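As an illustration of this projection step, the sketch below back-projects a keypoint using a pinhole camera model; the intrinsic parameters are placeholder values, not calibration data from the paper.
```python
# A minimal sketch of back-projecting a keypoint (u, v) with metric depth z
# into a 3D point in the camera frame, assuming a pinhole model. The
# intrinsics below are illustrative placeholders.
import numpy as np

def keypoint_to_3d(u, v, z, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Back-project pixel coordinates and depth (in meters) to camera coordinates."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```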
However, no visual feature provides perfect reliability with
respect to repeatability and false positives. Further, the depth data is often inconsistent with the color image, mainly due to a missing synchronization of the shutters of the infrared and color cameras, but also due to interpolation at depth jumps. Since visually salient points often lie at object borders, the 3D feature positions are prone to be at a wrong depth, making the robust estimation of transformations highly non-trivial.
¹ http://ros.org/wiki/rgbdslam
A well-known approach to cope with noisy data and outliers is the Random Sample Consensus (RANSAC) algorithm [8]: After matching the feature descriptors of two frames, we randomly select three matched feature pairs, which is the minimal number from which a rigid transformation in SE(3) can be computed. Thanks to the full 3D positions, we can efficiently reject outliers by discarding sample sets whose pairwise Euclidean distances in the two frames do not match. If the samples pass this test, they are used to compute an estimate of the rigid transformation. We apply this transformation to all matched features and count the number of inliers. In our case, we consider a feature point an inlier when the distance between the transformed point and its match is smaller than 3 cm. Subsequently, we use the inliers to compute a refined transformation. These steps are iterated, and the transformation with the most inliers is kept.
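The sketch below illustrates this RANSAC loop on 3D-3D correspondences under the assumptions stated above (three sampled pairs, a pairwise distance consistency check, a 3 cm inlier threshold). It is not the released implementation; the closed-form alignment is the standard SVD-based solution in the spirit of [31], and the iteration count is an illustrative choice.
```python
# A minimal sketch of RANSAC-based rigid registration of matched 3D points.
import numpy as np

def rigid_transform_3d(src, dst):
    """Closed-form rigid alignment (SVD-based) mapping src points onto dst."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

def ransac_transform(p_src, p_dst, iterations=200, inlier_thresh=0.03):
    """Estimate a rigid transform from 3D-3D matches, robust to outliers."""
    best_inliers = np.zeros(len(p_src), dtype=bool)
    rng = np.random.default_rng(0)
    for _ in range(iterations):
        idx = rng.choice(len(p_src), size=3, replace=False)
        s, d = p_src[idx], p_dst[idx]
        # Reject samples whose pairwise Euclidean distances disagree between
        # the two frames (tolerance reused from the inlier threshold).
        ds = np.linalg.norm(s[[0, 0, 1]] - s[[1, 2, 2]], axis=1)
        dd = np.linalg.norm(d[[0, 0, 1]] - d[[1, 2, 2]], axis=1)
        if np.any(np.abs(ds - dd) > inlier_thresh):
            continue
        R, t = rigid_transform_3d(s, d)
        errors = np.linalg.norm(p_src @ R.T + t - p_dst, axis=1)
        inliers = errors < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        return None                        # registration failed
    # Refine the transformation on all inliers of the best hypothesis.
    R, t = rigid_transform_3d(p_src[best_inliers], p_dst[best_inliers])
    return R, t, best_inliers
```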
While the described procedure is very fast (the exact timing depends on the number of features and the outlier percentage), after a few seconds the number of past frames is too high to compare a new frame against all previous frames. Therefore, we select a subset of twenty frames, consisting of the 3 most recent frames and uniformly sampled earlier frames, and compute the pairwise transformations in parallel using threads.
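One possible way to pick this subset is sketched below; the exact sampling scheme of the released system may differ, so the helper and its parameters should be read as illustrative.
```python
# A minimal sketch of selecting candidate frames for matching: the three most
# recent frames plus earlier frames sampled roughly uniformly, twenty in total.
def select_candidate_frames(n_frames, n_recent=3, n_total=20):
    recent = list(range(max(0, n_frames - n_recent), n_frames))
    older = list(range(0, max(0, n_frames - n_recent)))
    n_sampled = n_total - len(recent)
    if older and n_sampled > 0:
        step = max(1, len(older) // n_sampled)
        sampled = older[::step][:n_sampled]
    else:
        sampled = []
    return sampled + recent
```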
If a frame could be matched to any predecessor, it is added as a node to the pose graph of the SLAM backend, with the pairwise spatial relations connecting it to the existing pose graph. The process applied when no relation to previous nodes can be found depends on the application. If it is tolerable that the map is fragmented, a sensible action would be to keep the node even though it is disconnected, starting a new map fragment that may be connected later on through a loop closure. For evaluation purposes, we do not allow fragmentation in our experiments. If a frame cannot be matched, it is connected to the prior node in the pose graph under the assumption of a constant motion model with high uncertainty. While this usually leads to higher error values than evaluating only the biggest fragment, it facilitates the comparison to other approaches by restricting the comparison to fully connected trajectories.
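The fallback can be pictured as adding a weakly weighted edge, as in the following sketch; the 6-DoF edge parameterization and the information scaling are assumptions for illustration, not the system's actual values.
```python
# A minimal sketch of the fallback edge added when a frame cannot be matched:
# the relative transform is predicted from the previous motion (constant
# motion assumption) and given a deliberately weak information matrix so that
# later loop closures can correct it. Values are illustrative.
import numpy as np

def constant_motion_edge(previous_relative_motion, weak_info_scale=1e-3):
    z_ij = previous_relative_motion          # predicted relative transform
    omega_ij = weak_info_scale * np.eye(6)   # high uncertainty on all 6 DoF
    return z_ij, omega_ij
```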
B. SLAM Backend
The pairwise transformations between sensor poses, as
computed by the frontend, form the edges of a pose graph.
Due to estimation errors, the edges do not form a globally consistent trajectory. To create a globally consistent trajectory we optimize the pose graph using the g2o framework [18]. The g2o framework is an easily extensible graph optimizer
that can be applied to a wide range of problems including
several variants of SLAM and bundle adjustment. It performs
a minimization of a non-linear error function that can be
represented as a graph, as for example the one created by
TABLE I
WE EVALUATED OUR SYSTEM ON A LARGE SET OF SEQUENCES FROM THE RGB-D SLAM DATASET [29]. ON AVERAGE, OUR SYSTEM ACHIEVES AN ACCURACY OF 9.7 CM AND 3.95° AND REQUIRES APPROXIMATELY 0.35 S OF PROCESSING TIME PER IMAGE.
Sequence Name | Length | Duration | Avg. Angular Velocity | Avg. Transl. Velocity | Frames | Total Runtime | g2o Runtime | Transl. RMSE | Rot. RMSE
FR1 360 | 5.82 m | 28.69 s | 41.60 deg/s | 0.21 m/s | 745 | 145 s | 0.66 s | 0.103 m | 3.41°
FR1 desk2 | 10.16 m | 24.86 s | 29.31 deg/s | 0.43 m/s | 614 | 176 s | 0.68 s | 0.102 m | 3.81°
FR1 desk | 9.26 m | 23.40 s | 23.33 deg/s | 0.41 m/s | 575 | 199 s | 1.31 s | 0.049 m | 2.43°
FR1 floor | 12.57 m | 49.87 s | 15.07 deg/s | 0.26 m/s | 1214 | 488 s | 3.93 s | 0.055 m | 2.35°
FR1 plant | 14.80 m | 41.53 s | 27.89 deg/s | 0.37 m/s | 1112 | 424 s | 1.28 s | 0.142 m | 6.34°
FR1 room | 15.99 m | 48.90 s | 29.88 deg/s | 0.33 m/s | 1332 | 423 s | 1.56 s | 0.219 m | 9.04°
FR1 rpy | 1.66 m | 27.67 s | 50.15 deg/s | 0.06 m/s | 687 | 243 s | 10.26 s | 0.042 m | 2.50°
FR1 teddy | 15.71 m | 50.82 s | 21.32 deg/s | 0.32 m/s | 1395 | 556 s | 1.72 s | 0.138 m | 4.75°
FR1 xyz | 7.11 m | 30.09 s | 8.92 deg/s | 0.24 m/s | 788 | 365 s | 40.09 s | 0.021 m | 0.90°
(a) Result of frame-to-frame tracking (no loop closures) (b) Result after graph optimization (loop closures) (c) Volumetric 3D map after post-processing
Fig. 3. (a) Using only stepwise tracking leads to undesired results due to the accumulated drift. (b) After graph optimization with 5 iterations of g2o, an excellent alignment of most point clouds is achieved. Individual clouds are still slightly misaligned. (c) Through integration of the free-space information these artefacts are greatly diminished after computation of occupancy probabilities.
the SLAM frontend described above. Generally, the error function has the form

F(x) = \sum_{\langle i,j \rangle \in C} e(x_i, x_j, z_{ij})^\top \Omega_{ij}\, e(x_i, x_j, z_{ij})   (1)

x^* = \operatorname*{argmin}_x F(x).   (2)

Here, x = (x_1^\top, \ldots, x_n^\top)^\top is a vector of pose representations x_i. z_{ij} and \Omega_{ij} represent respectively the mean and the information matrix of a constraint relating the poses x_i and x_j, i.e., the pairwise transformation computed by the frontend. e(x_i, x_j, z_{ij}) is a vector error function that measures how well the poses x_i and x_j satisfy the constraint z_{ij}. It is 0 when x_i and x_j perfectly match the constraint, i.e., the difference of the poses exactly matches the estimated transformation. For more details on the error function, we refer the interested reader to [18].
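To make Eq. (1) concrete, the toy sketch below evaluates the error function for 2D poses (x, y, theta); the real system optimizes 6-DoF poses with g2o, and all poses, measurements, and information matrices in the example are made up for illustration.
```python
# A toy numerical sketch of the error function in Eq. (1) for 2D poses.
import numpy as np

def relative_pose(xi, xj):
    """Pose of xj expressed in the frame of xi (planar SE(2))."""
    dx, dy = xj[:2] - xi[:2]
    c, s = np.cos(xi[2]), np.sin(xi[2])
    return np.array([c * dx + s * dy, -s * dx + c * dy, xj[2] - xi[2]])

def graph_error(poses, constraints):
    """F(x): sum over edges of e^T Omega e, with e the measurement residual."""
    total = 0.0
    for (i, j), (z_ij, omega_ij) in constraints.items():
        e = relative_pose(poses[i], poses[j]) - z_ij
        total += e @ omega_ij @ e
    return total

poses = {0: np.zeros(3), 1: np.array([1.05, 0.02, 0.01])}
constraints = {(0, 1): (np.array([1.0, 0.0, 0.0]), np.eye(3))}
print(graph_error(poses, constraints))   # small but non-zero residual
```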
Global optimization is especially beneficial in case of a loop closure, i.e., when revisiting known parts of the map, since the loop closing edges in the graph allow the accumulated error to be diminished. Unfortunately, large errors in the motion estimation step may impede the accuracy of large parts of the graph. This is primarily a problem in areas of highly ambiguous features. We therefore use a threshold to prune edges with high error values after the initial convergence and continue the optimization.
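A sketch of this pruning step is given below, assuming the per-edge chi-square errors have already been computed by the optimizer; the threshold value is an illustrative assumption.
```python
# A minimal sketch of pruning pose-graph edges whose error term e^T Omega e
# remains high after the initial optimization round; the threshold is illustrative.
def prune_high_error_edges(edges, edge_errors, max_error=5.0):
    """Keep only edges whose chi-square error is at most max_error."""
    return [edge for edge, err in zip(edges, edge_errors) if err <= max_error]
```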
C. Map Representation
The system described so far computes a globally consistent
trajectory. Using this trajectory, the original data can be used
to construct a representation of the environment. Projecting
all point measurements into a global 3D coordinate system
leads to a straightforward point-based representation. Such
a model, however, is highly redundant and requires vast
computational and memory resources.
To overcome these limitations, we use 3D occupancy grid
maps to represent the environment. In our implementation,
we use the octree-based mapping framework OctoMap [33].
Voxels are managed in an efficient tree structure that leads to
a compact memory representation and inherently allows for
map queries at multiple resolutions. The use of probabilistic
occupancy estimation furthermore provides a means of cop-
ing with noisy measurements and errors in pose estimation.
In contrast to a point-based representation, it represents free
space and unmapped areas explicitly which is essential for
robot navigation and exploration tasks.
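As an illustration of probabilistic occupancy estimation (not OctoMap itself, which additionally updates free space along each beam and stores voxels in an octree), consider the following sketch; the resolution and log-odds increment are placeholder values.
```python
# A minimal sketch of log-odds occupancy accumulation on a flat voxel grid.
# This only illustrates the endpoint update; OctoMap also integrates
# free-space measurements and uses an octree. Parameter values are illustrative.
from collections import defaultdict
import numpy as np

def integrate_endpoints(log_odds, points_world, resolution=0.01, l_hit=0.85):
    """Raise the occupancy log-odds of each voxel containing a measured point."""
    for p in points_world:
        voxel = tuple(np.floor(np.asarray(p) / resolution).astype(int))
        log_odds[voxel] += l_hit
    return log_odds

grid = defaultdict(float)   # unknown voxels default to log-odds 0 (p = 0.5)
integrate_endpoints(grid, [[0.512, 0.031, 1.204], [0.518, 0.030, 1.199]])
```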
IV. EVALUATION
To evaluate our system, we use the RGB-D bench-
mark [29] which provides a dataset of Kinect sequences
with synchronized ground truth. Furthermore, the benchmark
provides an evaluation tool that computes the root mean square error (RMSE) given an estimated trajectory.
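In the spirit of that evaluation tool (but not reproducing it), the translational RMSE over time-aligned poses can be computed as in the sketch below; the benchmark's own tool additionally associates timestamps and aligns the two trajectories.
```python
# A minimal sketch of the translational root mean square error between
# time-aligned estimated and ground-truth camera positions.
import numpy as np

def translational_rmse(estimated_xyz, ground_truth_xyz):
    errors = np.linalg.norm(np.asarray(estimated_xyz) - np.asarray(ground_truth_xyz), axis=1)
    return np.sqrt(np.mean(errors ** 2))
```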
TABLE II
EVALUATION OF THE ACCURACY WITH RESPECT TO THE KEYPOINT DETECTOR AND FEATURE EXTRACTOR
Descriptor | Success | Transl. RMSE (Avg. ± Std. Dev.) | Rot. RMSE (Avg. ± Std. Dev.)
SIFTGPU | 9/9 | 0.097 m ± 0.063 m | 3.95° ± 2.47°
SURF | 9/9 | 0.098 m ± 0.078 m | 3.39° ± 1.55°
ORB | 7/9 | 0.215 m ± 0.189 m | 7.75° ± 5.55°
For our evaluation we choose the Freiburg1 (FR1) dataset consisting of nine sequences recorded in a typical indoor environment. Two of these, the sequences "FR1 xyz" and "FR1 rpy", have very simple motions. The results on these sequences show the capabilities of our approach in the best case. However, results like this can usually be achieved when the sensor is moved carefully in an indoor environment, e.g., during manual map recording prior to employing a robot. The other datasets are more challenging as they cover larger areas of the office space and unrestricted camera motions.
In this section, we first present our results on the accuracy
of our SLAM system and evaluate how the accuracy depends
on the chosen feature detector and sensor frame rate. Second,
we investigate the influence of various parameters on the
runtime of our system.
We evaluated our system on all nine sequences from the
FR1 set, see Table I. As can be seen from this table, the
average camera velocities range from 9 deg/s to 42 deg/s and
from 0.06 m/s to 0.43 m/s.
A. Accuracy of the Trajectory Estimation
In our first round of experiments, we evaluated the accuracy of our system on all sequences using SIFTGPU feature extraction and approximate matching using FLANN. On the simple "xyz" and "rpy" sequences, we obtain the best RMSE values of 2.1 cm and 4.1 cm, respectively. We achieved the worst result of 20.1 cm RMSE on the "room" sequence, which is 16 m long and consists of a long round through an office space. In general, this evaluation shows that our approach performs well in most of the sequences. High angular and translational velocities pose no obvious difficulty, even though consecutive frames exhibit less overlap.
Further, we investigated the influence of different choices
for the keypoint detectors and feature descriptors. Table II shows the root mean square of the translational and rotational errors, averaged over the sequences. The accuracy achieved is similar for SIFT
and SURF. In contrast, ORB features turn out to be less
accurate and also less reliable: Using ORB, the trajectory
estimation failed for two of the nine sequences. This could
not be resolved by adapting the parameters of the feature
detector to find more keypoints.
B. Runtime Evaluation
In this section we discuss the computational requirements
of the presented system. All experiments were carried out on
a quad-core CPU with 8 GB of memory. By parallelizing our
software we achieved a speed-up by a factor between 2 and
2.5. Graph optimization is started after all frames have been
TABLE III
RUNTIME ANALYSIS OF FEATURE EXTRACTION WITH RESPECT TO THE CHOSEN INTEREST POINT DETECTOR AND FEATURE DESCRIPTOR. WE FOUND THAT ORB IS FASTER THAN SURF AND SIFT BY ONE ORDER OF MAGNITUDE.
Type | Count (Avg. ± Std. Dev.) | Runtime Averages (Detection, Extraction)
SURF | 1733 ± 153 | 0.34 s, 0.34 s
ORB | 1117 ± 558 | 0.018 s, 0.0086 s
SIFTGPU | 1918 ± 599 | 0.19 s
TABLE IV
RUNTIME ANALYSIS OF PAIRWISE FRAME REGISTRATION. USING AN APPROXIMATE NEAREST NEIGHBOR METHOD DURING MATCHING REDUCES THE RUNTIME BY A FACTOR OF TWO.
Matcher | Runtime (Avg. ± Std. Dev.)
FLANN | 0.203 s ± 0.078 s
Brute Force | 0.386 s ± 0.120 s
processed and runs on a single core. The runtime results
presented in Table I were generated using SIFTGPU and
FLANN matching. On average, our system required 0.35 s
per frame. The overall runtime performance strongly depends
on the configurations presented in the previous sections.
Therefore, we evaluated the runtime performance in more
detail in the remainder of this section.
1) Feature Detection and Descriptor Extraction: Our ap-
proach has to detect features and extract their descriptors in
each incoming image frame. Table III shows a comparison
of the computation time required for the different feature
types as described in Section III-A. We found that ORB is
faster than SURF and SIFTGPU by one order of magnitude.
However, the system produced high errors in two of the nine
sequences. SIFTGPU is still about 3.5 times as fast as the
non-parallelized SURF implementation.
2) Feature Matching and Motion Estimation: Feature
matching and motion estimation needs to be computed at
least once per frame. If the current frame is only matched
against one predecessor, the resulting camera trajectory will
quickly accumulate errors over time. Feature matching against many frames is costly to compute. However, especially since we do not assume the availability of any odometry information, only the possibility of loop closures allows the system to be accurate over longer trajectories. Further, more information
about pairwise relative transformations makes the trajectory
estimation more robust to errors in pose estimation. However,
a densely connected pose graph also requires more time to
optimize. We therefore found matching the current features
to those of 20 previous frames a good compromise. Table IV
shows the average runtime for matching and motion estima-
tion. We found that FLANN reduces the time required for
frame-to-frame registration by a factor of two.
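The two matchers compared in Table IV can be used via OpenCV roughly as sketched below; the index parameters and the ratio test are illustrative assumptions rather than the system's exact configuration.
```python
# A minimal sketch contrasting FLANN-based and brute-force descriptor matching
# in OpenCV for float descriptors (e.g. SIFT). Parameters are illustrative.
import cv2

def match_descriptors(query_desc, train_desc, use_flann=True, ratio=0.8):
    if use_flann:
        index_params = dict(algorithm=1, trees=4)      # FLANN_INDEX_KDTREE
        matcher = cv2.FlannBasedMatcher(index_params, dict(checks=32))
    else:
        matcher = cv2.BFMatcher(cv2.NORM_L2)
    good_matches = []
    # Keep only unambiguous matches via a nearest/second-nearest ratio test.
    for pair in matcher.knnMatch(query_desc, train_desc, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good_matches.append(pair[0])
    return good_matches
```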
3) Pose Graph Optimization: The optimization of small
pose graphs is fast enough to be done in real time, i.e., for
every frame. With longer sequences of densely connected
poses, the optimization time increases. However, if the
motion estimates are reliable, the global optimization is not
required in every step. In all sequences, the ratio of graph
optimization versus the total runtime was below 6 %.
V. CONCLUSION AND OUTLOOK
We presented a novel approach to visual SLAM from
RGB-D sensors. Our approach extracts visual keypoints from
the color images and uses the depth images to localize them
in 3D. We use RANSAC to robustly estimate the trans-
formations between RGB-D frames and optimize the pose
graph using non-linear optimization. Finally, we generate a
volumetric 3D map of the environment that can be used
for robot localization, navigation and path planning. We
evaluated our approach quantitatively on a publicly available
RGB-D dataset. On this dataset our system achieves on
average an accuracy of 9.7 cm and 3.95°. With an average
frame processing time of 0.35 s our approach is suitable
for online operation. To allow other researchers to use our
software and reproduce the results (and improve on them),
we have released all source code required to run and re-
evaluate our RGB-D SLAM system as open-source.
During the evaluation of our system, we discovered that
sometimes erroneous edges are created that lead to a degra-
dation of the mapping result. It would be interesting to
see how such problems can be efficiently detected and
possibly repaired autonomously. In the near future, we want
to optimize the keypoint matching scheme, for example by
adding a feature dictionary, pruning never matched features,
and integrating keypoints as landmarks directly during non-
linear optimization.
REFERENCES
[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust
features (SURF). Comput. Vis. Image Underst., 110:346–359, 2008.
[2] P. J. Besl and H. D. McKay. A method for registration of 3-D shapes.
IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI), 14(2):239–256, 1992.
[3] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with
the OpenCV Library. O’Reilly Media, 2008.
[4] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary Robust
Independent Elementary Features. In Proc. of the Europ. Conf. on
Computer Vision (ECCV), 2010.
[5] A.J. Davison. Real-time simultaneous localisation and mapping with
a single camera. In Proc. of the IEEE Intl. Conf. on Computer Vision
(ICCV), 2003.
[6] F. Dellaert. Square Root SAM. In Proc. of Robotics: Science and
Systems (RSS), pages 177–184, 2005.
[7] N. Fioraio and K. Konolige. Realtime visual and point cloud slam.
In Proc. of the RGB-D Workshop on Advanced Reasoning with Depth
Cameras at Robotics: Science and Systems Conf. (RSS), 2011.
[8] M.A. Fischler and R.C. Bolles. Random sample consensus: a paradigm
for model fitting with applications to image analysis and automated
cartography. Commun. ACM, 24(6):381–395, 1981.
[9] U. Frese, P. Larsson, and T. Duckett. A multilevel relaxation algorithm
for simultaneous localisation and mapping. IEEE Trans. on Robotics,
21(2):1–12, 2005.
[10] G. Grisetti, S. Grzonka, C. Stachniss, P. Pfaff, and W. Burgard.
Efficient estimation of accurate maximum likelihood maps in 3d. In
Proc. of the Intl. Conf. on Intelligent Robots and Systems (IROS),
2007.
[11] G. Grisetti, R. Kümmerle, C. Stachniss, U. Frese, and C. Hertzberg.
Hierarchical optimization on manifolds for online 2D and 3D mapping.
In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA),
2010.
[12] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping:
Using depth cameras for dense 3D modeling of indoor environments.
In Proc. of the Intl. Symp. on Experimental Robotics (ISER), 2010.
[13] A. Hornung, K.M. Wurm, and M. Bennewitz. Humanoid robot
localization in complex indoor environments. In Proc. of the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS),
2010.
[14] H. Jin, P. Favaro, and S. Soatto. Real-time 3-d motion and structure
of point-features: A front-end for vision-based control and interaction.
In Proc. of the IEEE Intl. Conf. on Computer Vision and Pattern
Recognition (CVPR), 2000.
[15] M. Kaess, A. Ranganathan, and F. Dellaert. iSAM: Incremental
smoothing and mapping. IEEE Trans. on Robotics, 24(6):1365–1378,
2008.
[16] G. Klein and D. Murray. Parallel tracking and mapping for small
AR workspaces. In Proc. IEEE and ACM Intl. Symp. on Mixed and
Augmented Reality (ISMAR), 2007.
[17] K. Konolige, M. Agrawal, R. Bolles, C. Cowan, M. Fischler, and
B. Gerkey. Outdoor mapping and navigation using stereo vision. In
Experimental Robotics, volume 39 of Springer Tracts in Advanced
Robotics, pages 179–190. Springer, 2008.
[18] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard.
g2o: A general framework for graph optimization. In Proc. of the
IEEE Intl. Conf. on Robotics and Automation (ICRA), 2011.
[19] M.I.A. Lourakis and A.A. Argyros. SBA: a software package for
generic sparse bundle adjustment. ACM Transactions on Mathematical
Software, 2009.
[20] D.G. Lowe. Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60(2):91–110, 2004.
[21] D. Nister. Preemptive RANSAC for live structure and motion estima-
tion. In Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV),
2003.
[22] A. Nüchter, K. Lingemann, J. Hertzberg, and H. Surmann. 6D
SLAM with approximate data association. In Proc. of the 12th
Intl. Conference on Advanced Robotics (ICAR), pages 242–249, 2005.
[23] E. Olson, J. Leonard, and S. Teller. Fast iterative optimization of pose
graphs with poor initial estimates. In Proc. of the IEEE Intl. Conf. on
Robotics and Automation (ICRA), pages 2262–2269, 2006.
[24] L. M. Paz, P. Pinies, J. D. Tardos, and J. Neira. Large-scale 6-
DOF SLAM with stereo-in-hand. IEEE Transactions on Robotics,
24(5):946–957, 2008.
[25] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: an efficient
alternative to SIFT or SURF. In Proc. of the IEEE Intl. Conf. on
Computer Vision (ICCV), volume 13, 2011.
[26] S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm.
In Proc. of the Intl. Conf. on 3-D Digital Imaging and Modeling, 2001.
[27] A. Segal, D. Haehnel, and S. Thrun. Generalized-ICP. In Proceedings
of Robotics: Science and Systems, 2009.
[28] H. Strasdat, J. M. M. Montiel, and A. Davison. Scale drift-aware
large scale monocular slam. In Proceedings of Robotics: Science and
Systems, 2010.
[29] J. Sturm, S. Magnenat, N. Engelhard, F. Pomerleau, F. Colas, W. Bur-
gard, D. Cremers, and R. Siegwart. Towards a benchmark for RGB-D
SLAM evaluation. In Proc. of the RGB-D Workshop on Advanced
Reasoning with Depth Cameras at Robotics: Science and Systems
Conf. (RSS), 2011.
[30] S. Thrun. Robotic mapping: A survey. In Exploring Artificial
Intelligence in the New Millennium. Morgan Kaufmann, 2003.
[31] Shinji Umeyama. Least-squares estimation of transformation param-
eters between two point patterns. IEEE Transactions on Pattern
Analysis and Machine Intelligence, (13), 1991.
[32] Changchang Wu. SiftGPU: A GPU implementation of scale invariant feature transform (SIFT). http://cs.unc.edu/~ccwu/siftgpu, 2007.
[33] K.M. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Bur-
gard. OctoMap: A probabilistic, flexible, and compact 3D map
representation for robotic systems. In Proc. of the ICRA 2010
Workshop on Best Practice in 3D Perception and Modeling for Mobile
Manipulation, 2010.