
Feature-based visual simultaneous localization and mapping: a survey


Abstract and Figures

Visual simultaneous localization and mapping (SLAM) has attracted considerable attention in recent years. In this paper, a comprehensive survey of the state-of-the-art feature-based visual SLAM approaches is presented. The reviewed approaches are classified based on the visual features observed in the environment. Visual features can be seen at different levels: low-level features like points and edges, middle-level features like planes and blobs, and high-level features like semantically labeled objects. One of the most critical research gaps regarding visual SLAM approaches concluded from this study is the lack of generality. Some approaches exhibit a very high level of maturity in terms of accuracy and efficiency. Yet, they are tailored to very specific environments, such as feature-rich and static environments. When operating in different environments, such approaches experience severe degradation in performance. In addition, due to software and hardware limitations, guaranteeing a robust visual SLAM approach is extremely challenging. Although semantics have been heavily exploited in visual SLAM, scene understanding that incorporates relationships between features has not yet been fully explored. A detailed discussion of such research challenges is provided throughout the paper.
SN Applied Sciences (2020) 2:224
Review Paper
Feature-based visual simultaneous localization and mapping: a survey
RanaAzzam1 · TarekTaha2· ShoudongHuang3· YahyaZweiri4
Received: 30 October 2019 / Accepted: 8 January 2020 / Published online: 16 January 2020
© Springer Nature Switzerland AG 2020
Keywords: Robotics · SLAM · Localization · Sensors · Factor graphs · Semantics
1 Introduction
Following several decades of exhaustive research and intensive investigation, Simultaneous Localization and Mapping (SLAM) continues to command a significant share of the research conducted in the robotics community. SLAM is the problem of concurrently estimating the position of a robotic vehicle navigating in a previously unexplored environment while progressively constructing a map of it. The estimation is based on measurements collected by sensors mounted on the vehicle, including vision, proximity, light, position, and inertial sensors, to name a few. SLAM systems employ these measurements in a multitude of ways to localize the robot and map its surroundings. However, the building blocks of any SLAM system include a set of common components, such as map/trajectory initialization, data association, and loop closure. Different estimation techniques can then be used to estimate the robot's trajectory and generate a map of the environment.
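These common components can be made concrete in a toy setting. The sketch below is illustrative only; the one-dimensional state, function names, and noise values are hypothetical and not taken from any surveyed system. It builds a tiny 1D pose graph from odometry constraints and a single loop-closure constraint, fixes the first pose to initialize the trajectory, and estimates all poses by linear least squares:

```python
import numpy as np

# Toy 1D pose-graph SLAM: poses x0..x3, odometry between consecutive
# poses, and one loop-closure constraint tying x3 back to x0.
# (Illustrative sketch only; real systems use nonlinear 2D/3D states.)

def solve_pose_graph(odometry, loop_closures, n_poses):
    """Linear least squares over 1D poses; x0 is fixed at 0 (gauge)."""
    rows, rhs = [], []
    # Prior fixing the first pose (trajectory initialization).
    r = np.zeros(n_poses); r[0] = 1.0
    rows.append(r); rhs.append(0.0)
    # Relative constraints: x_j - x_i = z (odometry and loop closures).
    for i, j, z in odometry + loop_closures:
        r = np.zeros(n_poses); r[i] = -1.0; r[j] = 1.0
        rows.append(r); rhs.append(z)
    A, b = np.array(rows), np.array(rhs)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# The robot moves +1 three times; a noisy loop closure reports
# x3 - x0 = 3.06 although the odometry sums to 3.04.
odometry = [(0, 1, 1.02), (1, 2, 0.97), (2, 3, 1.05)]
loop_closures = [(0, 3, 3.06)]
estimate = solve_pose_graph(odometry, loop_closures, 4)
print(np.round(estimate, 3))  # ≈ [0, 1.025, 2, 3.055]
```

Least squares spreads the 0.02 loop discrepancy evenly across the four constraints; real visual SLAM back-ends solve the nonlinear 2D/3D analogue iteratively, but the structure of initialization, associated measurements, loop closure, and estimation is the same.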
The implementation details of every SLAM approach depend on the employed sensor(s), and hence on the data collected from the environment. In this paper, we thoroughly review the most recent visual SLAM systems, with a focus on feature-based approaches, where conventional vision sensors such as monocular, depth, or stereo cameras are employed to observe the environment. From here on, visual SLAM systems are referred to as monocular SLAM, RGB-D SLAM, or stereo SLAM if they employ a monocular camera, an RGB-D camera, or a stereo camera, respectively.
* Rana Azzam; Tarek Taha; Shoudong Huang; Yahya Zweiri | 1 Khalifa University of Science and Technology, Abu Dhabi, UAE. 2 Algorythma's Autonomous Aerial Lab, Abu Dhabi, UAE. 3 University of Technology Sydney, Sydney, Australia. 4 Faculty of Science, Engineering and Computing, Kingston University London, Kingston, UK.
... Similarly, Agunbiade and Zuva [12] identify SLAM problems that remain unresolved and propose a novel SLAM technique based on the information they gathered. Last but not least, Azzam et al. [13] present a thorough review of feature-based SLAM algorithms categorised by the type of features detected: low-level, mid-level, high-level, or hybrid. This review will begin by briefly describing the general mathematical concept that underpins algorithms attempting to solve the SLAM problem in Section 1. ...
... In contrast to direct methods, feature-based methods only map and track prominent locations in an image containing specific features [7]. According to Azzam et al. [13], there are four types of features that can be observed and tracked in this method. Low-level features are abundant and easily observable in highly textured environments, such as points, corners, or lines [29,30]; middle-level features are planes or blobs, used where low-level features are difficult to detect [31]; high-level features provide semantic information that helps build meaningful maps a system can understand [32][33][34]; while hybrid features combine two or more of the previous levels to yield a more accurate and efficient solution to the SLAM problem [35][36][37]. ...
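As a concrete illustration of the low-level end of this hierarchy, the following numpy sketch computes the classic Harris corner response on a synthetic image and locates a point feature at a corner. The data, window size, and parameter values are hypothetical; this is not code from any surveyed system:

```python
import numpy as np

def harris_response(img, k=0.05, win=2):
    """Harris corner response: strong positive peaks at corners."""
    Iy, Ix = np.gradient(img.astype(float))   # row- and column-gradients
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    H, W = img.shape
    R = np.zeros((H, W))
    for y in range(win, H - win):
        for x in range(win, W - win):
            # Sum the structure tensor over a small window.
            sxx = Ixx[y-win:y+win+1, x-win:x+win+1].sum()
            syy = Iyy[y-win:y+win+1, x-win:x+win+1].sum()
            sxy = Ixy[y-win:y+win+1, x-win:x+win+1].sum()
            det = sxx * syy - sxy * sxy
            R[y, x] = det - k * (sxx + syy) ** 2
    return R

# Synthetic image: a bright square whose corners are ideal point features.
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
R = harris_response(img)
peak = np.unravel_index(np.argmax(R), R.shape)
print(peak)  # a corner of the square
```

Along edges the structure tensor is rank-deficient, so the response is non-positive; only corner neighborhoods, where gradients of both orientations coexist, produce strong positive peaks. This is exactly why points and corners are the "abundant and easily observable" features in textured scenes.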
... Solving the SLAM problem offline implies processing a potentially large number of measurements collected by a robot over a significant length of time [74,75]. Though it achieves consistent results, this method is computationally expensive, making it unusable in real-time applications, large-scale environments, and systems operating continuously [13,20]. ...
This paper aims to review current approaches for real-time artificial-intelligence-based visual Simultaneous Localization and Mapping (vSLAM) operating in dynamic environments. Through a review of several relatively recent published papers on vSLAM in dynamic environments, an attempt is made to explain the concept of Simultaneous Localization and Mapping (SLAM) and its purpose; identify the general framework for real-time AI-based vSLAM approaches in dynamic environments; and highlight the potential solutions that have been developed and their significant results. All related information regarding this topic was obtained with the intention of answering three main questions. Firstly, how do robots localize and map within an unknown environment? Secondly, how can state-of-the-art vSLAM be modified with an AI algorithm to function within a dynamic environment? And lastly, what level of success have these approaches achieved in developing methods for real-time AI-based vSLAM in dynamic environments? The paper intends to provide readers with a clearer general understanding of SLAM and acts as a road map for steps toward developing viable approaches for real-time vSLAM in dynamic environments based on artificial intelligence.
... The particle filtering process updates the states in a window around each particle during a measurement timeframe T_F [138]. To address the shortcomings of the EKF, such as linearization error and noise due to Gaussian distribution assumptions, particle filters have been used while UAV navigation is mapped using the estimates provided by the sensors [149]. The inter-UAV measurements are based on the Euclidean distance between the particle descriptors and the current landmark image descriptors [150]. ...
... In visual SLAM, the features are extracted from images. Visual SLAM localizes and builds a map using these images, continuously outputting UAV positions and landmark maps [4,149]. Visual UAV odometry and navigation trajectory information is used as input for the sampling steps in visual SLAM. ...
... Visual SLAM is based on the sparse structure of the error term e_ij, the residual of landmark l_j observed from q_i, the i-th position [148]. The derivatives with respect to the remaining variables are 0. The corresponding error term in the sparsity matrix containing the visual SLAM imagery data has the following form [25,64,125,141,149]: ...
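The excerpt above is truncated before the equation it introduces. For orientation only (the exact expression in the cited work may differ), the standard reprojection residual that such sparsity arguments rest on, in the excerpt's notation of pose $q_i$ and landmark $l_j$, is:

```latex
e_{ij} = z_{ij} - h(q_i, l_j),
\qquad
\frac{\partial e_{ij}}{\partial q_k} = 0 \;\; (k \neq i),
\qquad
\frac{\partial e_{ij}}{\partial l_m} = 0 \;\; (m \neq j)
```

where $z_{ij}$ is the measured image observation and $h(\cdot)$ the camera projection model. Each error term touches only one pose and one landmark, which is precisely the sparsity the excerpt describes.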
This article presents a survey of simultaneous localization and mapping (SLAM) and data fusion techniques for object detection and environmental scene perception in unmanned aerial vehicles (UAVs). We critically evaluate some current SLAM implementations in robotics and autonomous vehicles and their applicability and scalability to UAVs. SLAM is envisioned as a potential technique for object detection and scene perception to enable UAV navigation through continuous state estimation. In this article, we bridge the gap between SLAM and data fusion in UAVs while also comprehensively surveying related object detection techniques such as visual odometry and aerial photogrammetry. We begin with an introduction to applications where UAV localization is necessary, followed by an analysis of multimodal sensor data fusion to fuse the information gathered from different sensors mounted on UAVs. We then discuss SLAM techniques such as Kalman filters and extended Kalman filters to address scene perception, mapping, and localization in UAVs. The findings are summarized to correlate prevalent and futuristic SLAM and data fusion for UAV navigation, and some avenues for further research are discussed.
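To make the Kalman filtering discussed above concrete, here is a minimal scalar predict/update cycle. The numbers and function names are illustrative only; EKF-based SLAM applies the same cycle to a linearized, multivariate state:

```python
# Scalar Kalman filter: one predict/update cycle for a 1D state
# (e.g., UAV position along a corridor). Hypothetical numbers;
# EKF-SLAM applies this cycle to a linearized multivariate state.

def kf_predict(x, p, u, q):
    """Motion update: move by u, inflate variance by process noise q."""
    return x + u, p + q

def kf_update(x, p, z, r):
    """Measurement update: fuse observation z with noise variance r."""
    k = p / (p + r)          # Kalman gain
    return x + k * (z - x), (1.0 - k) * p

x, p = 0.0, 1.0                          # initial state and variance
x, p = kf_predict(x, p, u=1.0, q=0.5)    # predicted: x = 1.0, p = 1.5
x, p = kf_update(x, p, z=1.2, r=0.5)     # fuse measurement z = 1.2
print(x, p)  # ≈ 1.15 0.375
```

The update pulls the estimate toward the measurement in proportion to the gain and always shrinks the variance; the linearization error mentioned in the excerpts arises because the EKF applies these linear-Gaussian equations to nonlinear motion and observation models.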
... Starting from the fusion of visual and visual-inertial methods, Servieres et al. [19] reviewed and compared important methods and summarized excellent algorithms emerging in SLAM. Azzam et al. [20] conducted a comprehensive study of feature-based methods. They classified the reviewed methods according to the visual features observed in the environment. ...
... For the VSLAM system, visual odometry, as the front-end of SLAM, is an indispensable part [65]. Ref. [20] points out that VSLAM can be divided into the direct method and the indirect method according to the image information used by the front-end visual odometry. The indirect method needs to select a certain number of representative points, called key points, from the collected images, and detect and match them in subsequent images to obtain the camera pose. ...
Visual SLAM (VSLAM) has been developing rapidly due to its advantages of low-cost sensors, easy fusion with other sensors, and richer environmental information. Traditional vision-based SLAM research has made many achievements, but it may fail to achieve the desired results in challenging environments. Deep learning has promoted the development of computer vision, and the combination of deep learning and SLAM has attracted more and more attention. Semantic information, as high-level environmental information, can enable robots to better understand their surroundings. This paper introduces the development of VSLAM technology from two aspects: traditional VSLAM, and semantic VSLAM combined with deep learning. For traditional VSLAM, we summarize the advantages and disadvantages of indirect and direct methods in detail and present some classical open-source VSLAM algorithms. In addition, we focus on the development of semantic VSLAM based on deep learning. Starting with the typical neural networks CNN and RNN, we summarize in detail how neural networks improve the VSLAM system. Later, we focus on how object detection and semantic segmentation help introduce semantic information into VSLAM. We believe the development of the future intelligent era cannot proceed without the help of semantic technology. Introducing deep learning into the VSLAM system to provide semantic information can help robots better perceive the surrounding environment and provide people with higher-level assistance.
... Feature-based vSLAM methods mainly detect feature points in adjacent images and match them by comparing feature descriptors, and then solve the camera pose according to the matching relationship [49]. Among early vSLAM, feature-based methods, especially extended Kalman filter (EKF) SLAM, dominated for a long time [50,51]. ...
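As a sketch of the descriptor-matching step this excerpt describes, the following numpy snippet matches synthetic ORB-style binary descriptors between two frames by nearest-neighbor search under Hamming distance. The data, descriptor length, and permutation are hypothetical, chosen only to make the behavior checkable:

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming_match(desc_a, desc_b):
    """For each binary descriptor in desc_a, return the index of the
    closest descriptor in desc_b under Hamming distance."""
    # Element-wise disagreement count over all pairs: (len_a, len_b).
    d = (desc_a[:, None, :] != desc_b[None, :, :]).sum(axis=2)
    return d.argmin(axis=1)

# Synthetic 32-bit binary descriptors for "frame A".
frame_a = rng.integers(0, 2, size=(5, 32))
# "Frame B": the same features, shuffled, with a couple of flipped
# bits to mimic viewpoint/illumination change between frames.
perm = np.array([3, 0, 4, 1, 2])
frame_b = frame_a[perm].copy()
frame_b[0, :2] ^= 1  # corrupt 2 of 32 bits

matches = hamming_match(frame_a, frame_b)
print(matches)  # recovers the inverse of perm: frame_a[i] ↔ frame_b[matches[i]]
```

Because random 32-bit descriptors differ in roughly half their bits, a few corrupted bits do not change the nearest neighbor; the matched pairs are what a feature-based front-end would hand to the pose solver.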
Autonomous navigation and positioning are key to the successful performance of unmanned underwater vehicles (UUVs) in environmental monitoring, oceanographic mapping, and critical marine infrastructure inspections at sea. Cameras have been at the center of attention as an underwater sensor due to their low cost and the rich content information available in high-visibility ocean waters, especially in the fields of underwater target recognition, navigation, and positioning. This paper is not only a literature overview of the vision-based navigation and positioning of autonomous UUVs but also a critical evaluation of the methodologies that have been developed and that directly affect such UUVs. The visual navigation and positioning algorithms are divided into two categories: geometry-based methods and deep learning-based methods. The two types of state-of-the-art methods are compared experimentally and quantitatively using a public underwater dataset, and their potentials and shortcomings are analyzed, providing a panoramic theoretical reference and technical scheme comparison for UUV visual navigation and positioning research in highly dynamic, three-dimensional ocean environments.
... Simultaneous Localization and Mapping (SLAM) is of great importance in 3D computer vision with many applications in autonomous driving [1, 2], indoor robotics [3,4], building surveying and mapping [5,6], etc. ...
Simultaneous Localization and Mapping (SLAM) plays an important role in outdoor and indoor applications ranging from autonomous driving to indoor robotics. Outdoor SLAM has been widely used with the assistance of LiDAR or GPS. For indoor applications, the LiDAR technique does not satisfy the accuracy requirement, and GPS signals are lost. An accurate and efficient scene-sensing technique is therefore required for indoor SLAM. As the most promising 3D sensing technique, fringe projection profilometry (FPP) offers clear opportunities for indoor SLAM, but methods to date have not fully leveraged the accuracy and speed of sensing that such systems offer. In this paper, we propose a novel FPP-based indoor SLAM method based on the coordinate transformation relationship of FPP, where 2D-to-3D descriptor assistance is used for mapping and localization. The correspondences generated by matching descriptors are used for fast and accurate mapping, and the transform estimation between the 2D and 3D descriptors is used to localize the sensor. The experimental results provided demonstrate that the proposed indoor SLAM can achieve localization and mapping accuracy of around one millimeter.
In indoor low-texture environments, point feature-based visual SLAM systems have poor robustness and low trajectory accuracy. Therefore, we propose a visual-inertial SLAM algorithm based on point-line feature fusion. Firstly, in order to improve the quality of the extracted line segments, a line segment extraction algorithm with an adaptive threshold is proposed. By constructing the adjacency matrix of the line segments and judging their directions, it can decide whether to merge or eliminate other line segments. At the same time, geometrically constrained line feature matching is considered to improve the efficiency of processing line features. Compared with the traditional algorithm, the processing efficiency of our proposed method is greatly improved. Then, point, line, and inertial data are effectively fused in a sliding window to achieve high-accuracy pose estimation. Finally, experiments on the EuRoC dataset show that the proposed PLI-VINS performs better than traditional visual-inertial SLAM systems using point features and point-line features.
High-precision indoor localization is growing extremely quickly, especially for multi-floor scenarios. Existing indoor positioning schemes mainly rely on wireless, visual, or lidar means, each limited to a single sensor. With the massive deployment of WiFi access points and low-cost cameras, it is possible to combine the above three methods to achieve more accurate, complete, and reliable location results. However, existing hybrid visual and wireless approaches exploit rapidly advancing SLAM in a straightforward manner, without exploring the interactions between the modalities. In this paper, a high-precision multi-floor indoor positioning method based on vision, wireless signal characteristics, and lidar is proposed. In the joint scheme, we first use the positioning data output by lidar SLAM as the theoretical reference position for visual images; then use the WiFi signal to estimate the rough area with likelihood probability; and finally use the visual image to fine-tune the floor estimation and location results. Numerical results show that the proposed joint localization scheme can achieve 0.62 m of 3D localization accuracy on average, a 1.24-m MSE for two-dimensional tracking trajectories, and a floor-estimation accuracy of 89.22%. Meanwhile, the localization process takes less than 0.25 s, which is of great importance for practical implementation.
Simultaneous Localization and Mapping (SLAM) is a process that uses multiple sensors to position an unmanned mobile vehicle without previous knowledge of the environment while constructing a map of that environment for further applications. Over the past three decades, SLAM has been intensively researched and widely applied in mobile robot control and unmanned vehicle navigation. SLAM technology has demonstrated great potential for autonomously navigating mobile robots while simultaneously reconstructing the three-dimensional (3D) information of the surrounding environment. Driven by advances in sensor technology and 3D reconstruction algorithms, many attempts have been made to propose novel systems and algorithms combining different sensors to solve the SLAM problem. Notably, SLAM has been extended to various aspects of agriculture, including autonomous navigation, 3D mapping, field monitoring, and intelligent spraying. This paper focuses on the recent developments and applications of SLAM, particularly in complex and unstructured agricultural environments. A detailed summary of the developments of SLAM is given for the three main fundamental types: light detection and ranging (LiDAR) SLAM, visual SLAM, and sensor fusion SLAM, and we also discuss the applications and prospects of SLAM technology in agricultural mapping, agricultural navigation, and precision agriculture. Particular attention has been paid to the SLAM sensors, systems, and algorithms applied in agricultural tasks. Additionally, the challenges and future trends of SLAM are reported.
Visual Loop Detection (VLD) is a core component of any Visual Simultaneous Localization and Mapping (SLAM) system, and its goal is to determine whether the robot has returned to a previously visited region by comparing images obtained at different time steps. This paper presents a new approach to visual Graph-SLAM for underwater robots that goes one step beyond current techniques. The proposal, which centers on designing a robust VLD algorithm aimed at reducing the number of false loops that enter the pose-graph optimizer, operates in three steps. In the first step, an easily trainable neural network performs a fast selection of image pairs that are likely to close loops. The second step carefully confirms or rejects these candidate loops by means of a robust image matcher. During the third step, all the loops accepted in the second step are subject to a geometric consistency verification process, rejecting those that do not fit. The accepted loops are then used to feed a Graph-SLAM algorithm. The advantages of this approach are twofold: first, robustness against wrong loop detections; second, computational efficiency, since each step operates only on the loops accepted in the previous one. This makes online usage of this VLD algorithm possible. Results of experiments with semi-synthetic data and real data obtained with an autonomous robot in several marine resorts of the Balearic Islands support the validity and suitability of the approach for further field campaigns.
In various dynamic scenes, there are movable objects, such as pedestrians, which may challenge simultaneous localization and mapping (SLAM) algorithms. Consequently, the localization accuracy may be degraded, and a moving object may negatively impact the constructed maps. Maps that contain semantic information about dynamic objects impart humans or robots with the ability to semantically understand the environment, and they are critical for various intelligent systems and location-based services. In this study, we developed a computationally efficient SLAM solution that is able to accomplish three tasks in real time: (1) complete localization without accuracy loss due to the existence of dynamic objects and generate a static map that does not contain moving objects, (2) extract semantic information about dynamic objects through a computationally efficient approach, and (3) eventually generate semantic maps, which overlay semantic objects on static maps. The proposed semantic SLAM solution was evaluated through four different experiments on two data sets, respectively verifying the tracking accuracy, computational efficiency, and the quality of the generated static maps and semantic maps. The results show that the proposed SLAM solution is computationally efficient, reducing the time consumption for building maps by two thirds; moreover, the relative localization accuracy is improved, with a translational error of only 0.028 m, and is not degraded by dynamic objects. Finally, the proposed solution generates static maps of a dynamic scene without moving objects and semantic maps with high-precision semantic information of specific objects.
Simultaneous localization and mapping (SLAM) is a fundamental problem for various applications. In indoor environments, planes are predominant features that are less affected by measurement noise. In this paper, we propose a novel point-plane SLAM system using RGB-D cameras. First, we extract feature points from RGB images and planes from depth images. Plane correspondences in the global map can then be found using their contours. Considering the limited size of real planes, we exploit constraints on plane edges. In general, a plane edge is an intersecting line of two perpendicular planes. Therefore, instead of line-based constraints, we calculate and generate supposed perpendicular planes from edge lines, resulting in more plane observations and constraints to reduce estimation errors. To exploit the orthogonal structure of indoor environments, we also add structural (parallel or perpendicular) constraints between planes. Finally, we construct a factor graph using all of these features. The cost functions are minimized to estimate the camera poses and the global map. We test our proposed system on public RGB-D benchmarks, demonstrating robust and accurate pose estimation results compared with other state-of-the-art SLAM systems.
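Two of the geometric ingredients such a point-plane system relies on can be sketched in a few lines: extracting a plane normal from depth points by least squares, and measuring how far two planes are from perpendicular. The data below are synthetic and the functions illustrative only, not the paper's implementation:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through 3D points: (unit normal, centroid)."""
    c = points.mean(axis=0)
    # The normal is the right singular vector of the smallest
    # singular value of the centered point cloud.
    _, _, vt = np.linalg.svd(points - c)
    return vt[-1], c

def perpendicularity_residual(n1, n2):
    """0 when the two plane normals are exactly perpendicular."""
    return abs(float(np.dot(n1, n2)))

# Synthetic floor (z = 0) and wall (x = 2) patches with mild noise,
# mimicking planes segmented from a depth image.
rng = np.random.default_rng(1)
floor = np.column_stack([rng.uniform(0, 4, 100), rng.uniform(0, 4, 100),
                         rng.normal(0, 0.01, 100)])
wall = np.column_stack([2 + rng.normal(0, 0.01, 100),
                        rng.uniform(0, 4, 100), rng.uniform(0, 3, 100)])
n_floor, _ = fit_plane(floor)
n_wall, _ = fit_plane(wall)
print(perpendicularity_residual(n_floor, n_wall))  # close to 0
```

A residual like this, evaluated between observed plane pairs, is one way a structural (perpendicularity) constraint can enter a factor graph as an extra error term alongside point reprojection errors.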
We present DQV-SLAM (Dual Quaternion Visual SLAM). This novel feature-based stereo visual SLAM framework uses a stochastic filter based on the unscented transform and a progressive Bayes update, avoiding linearization of the nonlinear spatial transformation group. 6-DoF poses are represented by dual quaternions whose rotational and translational components are stochastically modeled by Bingham and Gaussian distributions. Maps represented by point clouds of ORB features are incrementally built, and landmarks are updated with an unscented transform-based method. To obtain reliable measurements during the update, an optical flow-based approach is proposed to remove false feature associations. Drift is corrected by pose graph optimization once loop closure is detected. The KITTI and EuRoC datasets for stereo setups are used for evaluation. The performance of the proposed system is comparable to state-of-the-art optimization-based SLAM systems and better than existing filtering-based approaches.
I. INTRODUCTION Simultaneous Localization and Mapping (SLAM) entails parallel ego-motion tracking and building a map of unknown surroundings. Among different possible sensors, stereo cameras provide cost-efficient and informative perception. However, camera-only visual SLAM can be challenging, e.g., due to changing lighting conditions, blurring under fast motion, and dynamic scenarios with moving objects, which can lead to reduced accuracy. The state-of-the-art visual SLAM frameworks typically rely on optimization, for either feature-based [1], [2] or direct methods [3], [4]. Optimization-based approaches are shown in [5] to provide better accuracy than filtering-based methods for the same computational cost, as they can process a large set of state variables more efficiently. However, stochastic filters are able to provide a probability distribution over the states without additional computational effort, which enables efficient obstacle avoidance and motion planning. For example, the safety margins for navigation can be adapted according to the current state uncertainty [6]. Moreover, SLAM systems employing a stochastic filter show comparably robust and accurate results with sparser extracted features [7]. This makes them a functional option besides the optimization-based approaches, in particular for mobile robots in practical applications.
The method of simultaneous localization and mapping (SLAM) using a light detection and ranging (LiDAR) sensor is commonly adopted for robot navigation. However, consumer robots are price sensitive and often have to use low-cost sensors. Due to the poor performance of a low-cost LiDAR, error accumulates rapidly during SLAM, and it may cause large errors when building a larger map. To cope with this problem, this paper proposes a new graph optimization-based SLAM framework combining a low-cost LiDAR sensor and a vision sensor. In the SLAM framework, a new cost function considering both scan and image data is proposed, and the Bag of Words (BoW) model with visual features is applied for loop closure detection. A 2.5D map presenting both obstacles and vision features is also proposed, as well as a fast relocation method using the map. Experiments were conducted on a service robot equipped with a 360° low-cost LiDAR and a front-view RGB-D camera in a real indoor scene. The results show that the proposed method performs better than using LiDAR or camera only, while the relocation speed with our 2.5D map is much faster than with a traditional grid map.
Event cameras are bio-inspired sensors that work radically differently from traditional cameras. Instead of capturing images at a fixed rate, they measure per-pixel brightness changes asynchronously. This results in a stream of events, which encode the time, location, and sign of the brightness changes. Event cameras possess outstanding properties compared to traditional cameras: very high dynamic range (140 dB vs. 60 dB), high temporal resolution (on the order of microseconds), low power consumption, and no motion blur. Hence, event cameras have large potential for robotics and computer vision in scenarios that are challenging for traditional cameras, such as high speed and high dynamic range. However, novel methods are required to process the unconventional output of these sensors in order to unlock their potential. This paper provides a comprehensive overview of the emerging field of event-based vision, with a focus on the applications and the algorithms developed to unlock the outstanding properties of event cameras. We present event cameras from their working principle, the actual sensors that are available, and the tasks that they have been used for, from low-level vision (feature detection and tracking, optic flow, etc.) to high-level vision (reconstruction, segmentation, recognition). We also discuss the techniques developed to process events, including learning-based techniques, as well as specialized processors for these novel sensors, such as spiking neural networks. Additionally, we highlight the challenges that remain to be tackled and the opportunities that lie ahead in the search for a more efficient, bio-inspired way for machines to perceive and interact with the world.
Simultaneous Localization and Mapping is the process of creating a map of the environment while simultaneously navigating in it. Most SLAM approaches use natural features (e.g. keypoints) that are unstable over time, repetitive in many cases, or insufficient in number for robust tracking (e.g. in indoor buildings). Other researchers, on the other hand, have proposed the use of artificial landmarks, such as squared fiducial markers, placed in the environment to help tracking and relocalization. This paper proposes a novel SLAM approach that fuses natural and artificial landmarks in order to achieve long-term robust tracking in many scenarios. Our method has been compared to the state-of-the-art methods ORB-SLAM2 [1], LDSO [2] and SPM-SLAM [3] on the public datasets Kitti [4], Euroc-MAV [5], TUM [6] and SPM [3], obtaining better precision, robustness, and speed. Our tests also show that the combination of markers and keypoints achieves better accuracy than either of them independently.
Visual localization has been well studied in recent decades and applied in many fields as a fundamental capability in robotics. However, the success of the state of the art usually builds on the assumption that the environment is static. In dynamic scenarios where moving objects are present, the performance of existing visual localization systems degrades significantly due to the disturbance of the dynamic factors. To address this problem, we propose a novel sparse motion removal (SMR) model that detects the dynamic and static regions of an input frame based on a Bayesian framework. The similarity between consecutive frames and the difference between the current frame and the reference frame are both considered to reduce detection uncertainty. After the detection process is finished, the dynamic regions are eliminated while the static ones are fed into a feature-based visual simultaneous localization and mapping (SLAM) system for further visual localization. To verify the proposed method, both qualitative and quantitative experiments were performed, and the results demonstrate that the proposed model can significantly improve the accuracy and robustness of visual localization in dynamic environments.
In this letter, we present a deep learning-based network, GCNv2, for the generation of keypoints and descriptors. GCNv2 is built on our previous method, GCN, a network trained for 3D projective geometry. GCNv2 is designed with a binary descriptor vector, like the ORB feature, so that it can easily replace ORB in systems such as ORB-SLAM2. GCNv2 significantly improves computational efficiency over GCN, which was only able to run on desktop hardware. We show how a modified version of ORB-SLAM2 using GCNv2 features runs on a Jetson TX2, an embedded low-power platform. Experimental results show that GCNv2 retains comparable accuracy to GCN and that it is robust enough to use for control of a flying drone. Source code is available at: .
In this paper, we present a monocular Simultaneous Localization and Mapping (SLAM) algorithm using high-level object and plane landmarks. The built map is denser, more compact, and semantically meaningful compared to feature-point-based SLAM. We first propose a high-order graphical model to jointly infer the 3D objects and layout planes from single images, considering occlusions and semantic constraints. The extracted objects and planes are further optimized with camera poses in a unified SLAM framework. Objects and planes can provide more semantic constraints, such as Manhattan and object-supporting relationships, compared to points. Experiments on various public and collected datasets, including ICL NUIM and TUM Mono, show that our algorithm can improve camera localization accuracy compared to state-of-the-art SLAM, especially when there is no loop closure, and also generate dense maps robustly in many structured environments.