Article

Towards Accurate Loop Closure Detection in Semantic SLAM With 3D Semantic Covisibility Graphs


Abstract

Loop closure is necessary for correcting errors accumulated in simultaneous localization and mapping (SLAM) in unknown environments. However, conventional loop closure methods based on low-level geometric or image features may cause high ambiguity by not distinguishing similar scenarios. Thus, incorrect loop closures can occur. Though semantic 2D image information is considered in some literature to detect loop closures, there is little work that compares 3D scenes as an integral part of a semantic SLAM system. This letter introduces an approach, called SmSLAM+LCD, integrated into a semantic SLAM system to combine high-level 3D semantic information and low-level feature information to conduct accurate loop closure detection and effective drift reduction. The effectiveness of our approach is demonstrated in testing results.
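To make the idea concrete, below is a minimal illustrative sketch (not the authors' published algorithm; all names and thresholds are hypothetical) of how a loop candidate found with low-level features could be double-checked against high-level 3D semantic information: two places confirm a loop only if their nearby semantic objects agree in both class and relative 3D layout.

```python
# Hypothetical sketch: verify a feature-based loop candidate by comparing
# the 3D semantic objects observed near the query and candidate keyframes.
import itertools
import numpy as np

def semantic_consistency(query_objs, cand_objs, dist_tol=0.3):
    """Each object is a (label, centroid) pair. Returns True when the two
    local 3D semantic graphs agree on labels and pairwise distances."""
    if sorted(l for l, _ in query_objs) != sorted(l for l, _ in cand_objs):
        return False  # different object inventories: likely a false loop
    # Label-based association (assumes distinct labels, for brevity).
    cand_by_label = {l: np.asarray(c) for l, c in cand_objs}
    matched = [(np.asarray(c), cand_by_label[l]) for l, c in query_objs]
    # Pairwise object distances must agree in both submaps (graph edges).
    for (a1, b1), (a2, b2) in itertools.combinations(matched, 2):
        if abs(np.linalg.norm(a1 - a2) - np.linalg.norm(b1 - b2)) > dist_tol:
            return False
    return True
```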


... Visual Place Recognition (VPR) is a crucial task in autonomous driving and robotics applications. In particular, in visual simultaneous localization and mapping (VSLAM), loop closure plays a vital role in correcting accumulated errors [1]. To achieve this, a robot must have the ability to determine whether its current location has been previously visited by comparing the incoming sensor data with a database during navigation. ...
... In Semantic SLAM, research has been conducted on the use of semantic information to perform loop-closure detection based on a more comprehensive understanding of the scene. Studies by Hu et al. [18], Li et al. [19], and Qian et al. [1] use the relative positions of semantic objects to distinguish similar scenes for LCD. However, these methods are limited by not considering the characteristics of each object, making them less effective at distinguishing different places. ...
Preprint
Full-text available
In vision-based robot localization and SLAM, Visual Place Recognition (VPR) is essential. This paper addresses the problem of VPR, which involves accurately recognizing the location corresponding to a given query image. A popular approach to vision-based place recognition relies on low-level visual features. Despite significant progress in recent years, place recognition based on low-level visual features is challenging when there are changes in scene appearance. To address this, end-to-end training approaches have been proposed to overcome the limitations of hand-crafted features. However, these approaches still fail under drastic changes and require large amounts of labeled data to train models, presenting a significant limitation. Methods that leverage high-level semantic information, such as objects or categories, have been proposed to handle variations in appearance. In this paper, we introduce a novel VPR approach that remains robust to scene changes and does not require additional training. Our method constructs semantic image descriptors by extracting pixel-level embeddings using a zero-shot, language-driven semantic segmentation model. We validate our approach in challenging place recognition scenarios using a real-world public dataset. The experiments demonstrate that our method outperforms non-learned image representation techniques and off-the-shelf convolutional neural network (CNN) descriptors. Our code is available at https://github.com/woo-soojin/context-based-vlpr.
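As a rough illustration of descriptor construction from pixel-level embeddings (a simplified, assumption-laden sketch, not the paper's actual pipeline), one can pool per-pixel semantic features into a single unit-norm vector and rank database images by cosine similarity:

```python
# Illustrative sketch only: pool per-pixel semantic embeddings (H, W, D)
# produced by any pixel-level segmentation model into one descriptor.
import numpy as np

def image_descriptor(pixel_embeddings):
    desc = pixel_embeddings.reshape(-1, pixel_embeddings.shape[-1]).mean(axis=0)
    return desc / (np.linalg.norm(desc) + 1e-12)  # unit-normalize

def cosine_similarity(desc_a, desc_b):
    return float(desc_a @ desc_b)  # both descriptors are unit vectors
```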
... The semantic objects are discriminative landmarks in loop detection. The similarity of a potential correspondence can be calculated based on object appearance [3], [12] and orientation [13]. Based on Kimera [8], Hydra [14] embeds a hierarchical descriptor with room, place, and appearance layers. ...
... Since the object-based loop detection approaches [4], [12] are not open sourced, we cannot directly compare with them. However, as the scenes in Figure 4 show, our experimental environment is dense and involves ambiguous nearby objects. ...
Preprint
Loop detection plays a key role in visual Simultaneous Localization and Mapping (SLAM) by correcting the accumulated pose drift. In indoor scenarios, the richly distributed semantic landmarks are viewpoint-invariant and hold strong descriptive power in loop detection. Current semantic-aided loop detection embeds the topology between semantic instances to search for a loop. However, current semantic-aided loop detection methods face challenges in dealing with ambiguous semantic instances and drastic viewpoint differences, which are not fully addressed in the literature. This paper introduces a novel loop detection method based on an incrementally created scene graph, targeting visual SLAM in indoor scenes. It jointly considers the macro-view topology, micro-view topology, and occupancy of semantic instances to find correct correspondences. Experiments using handheld RGB-D sequences show our method is able to accurately detect loops under drastically changed viewpoints. It maintains a high precision when observing objects with similar topology and appearance. Our method also demonstrates that it is robust in changed indoor scenes.
... Similarly, Liu et al. [55] utilize spatial priors to construct RWDs. To reduce false matches between nodes, some approaches [56] explicitly construct edge descriptors to verify geometric consistency among matched nodes. Along the same lines, Julia et al. [57] employ node triplets to validate the correctness of matched nodes. ...
Preprint
This paper addresses the challenges of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against a remote agent, or against a prior map. The hand-crafted descriptors in classical semantic-aided registration, and the reliance on ground-truth annotations in learning-based scene graph registration, impede their application in practical real-world environments. To address these challenges, we design a scene graph network to encode multiple modalities of semantic nodes: open-set semantic feature, local topology with spatial awareness, and shape feature. These modalities are fused to create compact semantic node features. The matching layers then search for correspondences in a coarse-to-fine manner. In the back-end, we employ a robust pose estimator to decide the transformation according to the correspondences. We manage to maintain a sparse and hierarchical scene representation. Our approach demands fewer GPU resources and less communication bandwidth in multi-agent tasks. Moreover, we design a new data generation approach using vision foundation models and a semantic mapping module to reconstruct semantic scene graphs. It differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method in a two-agent SLAM benchmark. It significantly outperforms the hand-crafted baseline in terms of registration success rate. Compared to visual loop closure networks, our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth for each query frame. Code available at: http://github.com/HKUST-Aerial-Robotics/SG-Reg
... By extracting relevant semantic features, vSLAM localizes accurately and robustly on a map even in challenging scenes, i.e., image blurring, different viewpoints, and illumination changes. Qian et al. developed the SmSLAM+LCD (Semantic SLAM + Loop Closure Detection) approach by integrating high-level 3D semantic information with low-level feature correspondences in a unified semantic vSLAM framework [26], enabling accurate loop closure detection and effective drift reduction. In indoor environments, ref. [27] employed a semantic association method using parking lines to construct maps for self-driving vehicles and indoor parking. ...
Article
Full-text available
Image matching-based visual simultaneous localization and mapping (vSLAM) extracts low-level pixel features to reconstruct camera trajectories and maps through the epipolar geometry method. However, it fails to achieve correct trajectories and mapping when there are low-quality feature correspondences in several challenging environments. Although the RANSAC-based framework can enable better results, it is computationally inefficient and unstable in the presence of a large number of outliers. In our previous work, a Faster R-CNN learning-based semantic filter was proposed to explore the semantic information of inliers and remove low-quality correspondences, helping vSLAM localize accurately. However, that semantic filter learning method generalizes with low precision to low-level and dense texture-rich scenes, leaving the semantic filter-based vSLAM unstable and with poor geometry estimation. In this paper, a GGCM-E-based semantic filter using YOLOv8 is proposed to address these problems. Firstly, semantic patches of images are collected from the KITTI dataset, the TUM dataset provided by the Technical University of Munich, and real outdoor scenes. Secondly, the semantic patches are classified by our proposed GGCM-E descriptors to obtain the YOLOv8 neural network training dataset. Finally, several semantic filters for filtering low-level and dense texture-rich scenes are generated and combined into the ORB-SLAM3 system. Extensive experiments show that the semantic filter can detect and classify the semantic levels of different scenes effectively, filtering low-level semantic scenes to improve the quality of correspondences, thus achieving accurate and robust trajectory reconstruction and mapping. On the challenging autonomous driving benchmark and in real environments, the vSLAM system equipped with the GGCM-E-based semantic filter demonstrates its superiority in reducing 3D position error, with the absolute trajectory error reduced by up to approximately 17.44%, showing its promise and good generalization.
... In addition, semantic information was used to estimate the camera pose in work [21]. In another work, an object co-visibility graph was constructed from the semantic information for checking loop candidates based on the underlying geometric features during the loop closure detection phase [22]. In previous works, semantics offer high-level feature information that improves localization accuracy, masks dynamic feature points, and assists in bundle adjustment (BA) and loop closure detection in SLAM. ...
Article
Full-text available
With the development of deep learning, a higher level of perception of the environment such as the semantic level can be achieved in the simultaneous localization and mapping (SLAM) domain. However, previous works did not achieve a natural-language level of perception. Therefore, LP-SLAM (Language-Perceptive RGB-D SLAM) is proposed that leverages large language models (LLMs). The texts in the scene can be detected by scene text recognition (STR) and mapped as landmarks with a task-driven selection. A text error correction chain (TECC) is designed with a similarity classification method, a two-stage memory strategy, and a text clustering method. The proposed architecture is designed to deal with the mis-detection and mis-recognition cases of STR and to provide accurate text information to the framework. The proposed framework takes input images and generates a 3D map with sparse point cloud and task-related texts. Finally, a natural user interface (NUI) is designed based on the constructed map and LLM, which gives position instructions based on users’ natural queries. The experimental results validated the proposed TECC design and the overall framework. We publish the virtual dataset with ground truth, as well as the source code for further research. https://github.com/GroupOfLPSLAM/LP_SLAM.
... Yang et al. [97] associated feature points with objects across different viewpoints using optical flow tracking of common feature points, while triangulating 3D objects to improve the accuracy of pose calculation for outdoor SLAM. SmSLAM+LCD [98] added 3D object detection to loop closure detection, obtaining 3D models of objects via an improved YOLOv3 and comparing the edges and vertices of the 3D semantic information to better differentiate similar candidate frames during loop closure detection. Many experiments show that building VSLAM with 3D object constraints on 3D landmarks can yield higher odometry accuracy and can produce more detailed semantic maps of the environment. ...
Article
Full-text available
Visual simultaneous localization and mapping (SLAM) is crucial in robotics and autonomous driving. However, traditional visual SLAM faces challenges in dynamic environments. To address this issue, researchers have proposed semantic SLAM, which combines object detection, semantic segmentation, instance segmentation, and visual SLAM. Despite the growing body of literature on semantic SLAM, there is currently a lack of comprehensive research on the integration of object detection and visual SLAM. Therefore, this study gathers information from multiple databases and reviews the relevant literature using specific keywords. It focuses on visual SLAM based on object detection, covering several aspects. First, it discusses the current research status and challenges in this field, highlighting methods for incorporating semantic information from object detection networks into odometry, loop closure detection, and map construction. It also compares the characteristics and performance of various visual SLAM object detection algorithms. Lastly, it provides an outlook on future research directions and emerging trends in visual SLAM. Research has shown that visual SLAM based on object detection offers significant improvements over traditional SLAM in dynamic point removal, data association, point cloud segmentation, and other technologies. It can improve the robustness and accuracy of the entire SLAM system and can run in real time. With the continuous optimization of algorithms and the improvement of hardware, object-aware visual SLAM has great potential for development.
... By incorporating objects as map elements during the observation process, the robot's scene understanding [7,8] is enhanced, thereby facilitating the execution of more intricate tasks [9]. On the other hand, the observation of long-term consistent objects has a positive effect on the long-term operation and relocalization of the SLAM system [10,11]. Early work [12] used object CAD models and point cloud information to construct object maps. ...
Article
Full-text available
Object SLAM uses additional semantic information to detect and map objects in the scene, in order to improve the system’s perception and map representation capabilities. Previous methods often use quadrics and cuboids to represent objects, especially in monocular systems. However, their simplistic shapes are insufficient for effectively representing various types of objects, leading to a limitation in the accuracy of object maps and consequently impacting downstream task performance. In this paper, we propose a novel approach for representing objects in monocular SLAM using superquadrics (SQ) with shape parameters. Our method utilizes object appearance and geometry information comprehensively, enabling accurate estimation of object poses and adaptation to various object shapes. Additionally, we propose a lightweight data association strategy to accurately associate semantic observations across multiple views with object landmarks. We implement a monocular semantic SLAM system with real-time performance and conduct comprehensive experiments on public datasets. The results show that our method is able to build accurate object maps and outperforms state-of-the-art methods on object representation.
... These compact representations contain fundamental information about objects, such as category, size, and pose. They can serve as semantic landmarks for localization and navigation [4], and have shown advantages in relocalization [14], [15] and long-term operation of SLAM systems. However, these geometric primitives do not capture the shape and texture information of objects, which poses a challenge for monocular-based methods. ...
Preprint
Accurate perception of objects in the environment is important for improving the scene understanding capability of SLAM systems. In robotic and augmented reality applications, object maps with semantic and metric information show attractive advantages. In this paper, we present RO-MAP, a novel multi-object mapping pipeline that does not rely on 3D priors. Given only monocular input, we use neural radiance fields to represent objects and couple them with a lightweight object SLAM based on multi-view geometry, to simultaneously localize objects and implicitly learn their dense geometry. We create separate implicit models for each detected object and train them dynamically and in parallel as new observations are added. Experiments on synthetic and real-world datasets demonstrate that our method can generate semantic object maps with shape reconstruction, and be competitive with offline methods while achieving real-time performance (25 Hz). The code and dataset will be available at: https://github.com/XiaoHan-Git/RO-MAP
... Moreover, some methods extend this work to edit distance matching [22] and semantic histogram-based matching [23]. Furthermore, the methods in [24], [25] utilize 3D graph co-visibility to match objects in query and global graphs. However, when there are multiple objects of the same category in the graph or the graphs have similar topological structures, both random walk-based and 3D co-visibility-based methods may incorrectly identify objects and degrade performance. ...
Preprint
Visual simultaneous localization and mapping (SLAM) systems face challenges in detecting loop closure under the circumstance of large viewpoint changes. In this paper, we present an object-based loop closure detection method based on the spatial layout and semantic consistency of the 3D scene graph. Firstly, we propose an object-level data association approach based on the semantic information from semantic labels, intersection over union (IoU), object color, and object embedding. Subsequently, multi-view bundle adjustment with the associated objects is utilized to jointly optimize the poses of objects and cameras. We represent the refined objects as a 3D spatial graph with semantics and topology. Then, we propose a graph matching approach to select corresponding objects based on the structural layout and semantic property similarity of vertices' neighbors. Finally, we jointly optimize camera trajectories and object poses in an object-level pose graph optimization, which results in a globally consistent map. Experimental results demonstrate that our proposed data association approach can construct more accurate 3D semantic maps, and our loop closure method is more robust than point-based and object-based methods in circumstances with large viewpoint changes.
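A minimal sketch of the neighbor-aware graph matching idea described above (greatly simplified; not the paper's exact similarity function, and the data structures are hypothetical): a map object is accepted as the correspondence of a query object only when the labels match and their neighborhoods look alike.

```python
# Hypothetical sketch: each graph maps node_id -> (label, frozenset of
# neighbor labels); vertices match when labels agree and neighborhoods
# overlap (Jaccard similarity).
def vertex_score(node_a, node_b):
    label_a, nbrs_a = node_a
    label_b, nbrs_b = node_b
    if label_a != label_b:
        return 0.0
    union = nbrs_a | nbrs_b
    return len(nbrs_a & nbrs_b) / len(union) if union else 1.0

def match_graphs(query_graph, map_graph, min_score=0.5):
    """Greedy best-score correspondence search between the two graphs."""
    matches = []
    for qid, qnode in query_graph.items():
        mid, mnode = max(map_graph.items(),
                         key=lambda kv: vertex_score(qnode, kv[1]))
        if vertex_score(qnode, mnode) >= min_score:
            matches.append((qid, mid))
    return matches
```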
... In addition, semantic information was used to estimate the camera pose in work [14]. In work [15], an object co-visibility graph was constructed from the semantic information for checking loop candidates based on the underlying geometric features during the loop closure detection phase. In the above works, semantics offer high-level feature information that improves localization accuracy, masks dynamic feature points, and assists in bundle adjustment (BA) and loop closure detection in SLAM. ...
Preprint
Simultaneous localization and mapping (SLAM) is a critical technology that enables autonomous robots to be aware of their surrounding environment. With the development of deep learning, SLAM systems can achieve a higher level of perception of the environment, including the semantic and text levels. However, current works are limited in their ability to achieve a natural-language level of perception of the world. To address this limitation, we propose LP-SLAM, the first language-perceptive SLAM system that leverages large language models (LLMs). LP-SLAM has two major features: (a) it can detect text in the scene and determine whether it represents a landmark to be stored during the tracking and mapping phase, and (b) it can understand natural language input from humans and provide guidance based on the generated map. We illustrate three usages of the LLM in the system, including text clustering, landmark judgment, and natural language navigation. Our proposed system represents an advancement in the field of LLM-based SLAM and opens up new possibilities for autonomous robots to interact with their environment in a more natural and intuitive way.
... Objects are introduced into the observation process as map elements, which improves the robot's understanding of the scene and helps it perform more complex tasks [6]. On the other hand, the observation of long-term consistent objects has a positive effect on the long-term operation and relocalization of the SLAM system [7]. In recent work, objects are represented by cubes [8], [9] or ellipsoids [10]-[12], and object maps are constructed from only 2D bounding boxes without prior 3D models. ...
Preprint
Object SLAM uses additional semantic information to detect and map objects in the scene, in order to improve the system's perception and map representation capabilities. Quadrics and cubes are often used to represent objects, but their single shape limits the accuracy of the object map and thus affects the application of downstream tasks. In this paper, we introduce superquadrics (SQ) with shape parameters into SLAM for representing objects, and propose a separate parameter estimation method that can accurately estimate object pose and adapt to different shapes. Furthermore, we present a lightweight data association strategy for correctly associating semantic observations in multiple views with object landmarks. We implement a monocular semantic SLAM system with real-time performance and conduct comprehensive experiments on public datasets. The results show that our method is able to build accurate object maps and has advantages in object representation. Code will be released upon acceptance.
Preprint
Visual loop closure detection traditionally relies on place recognition methods to retrieve candidate loops that are validated using computationally expensive RANSAC-based geometric verification. As false positive loop closures significantly degrade downstream pose graph estimates, verifying a large number of candidates in online simultaneous localization and mapping scenarios is constrained by limited time and compute resources. While most deep loop closure detection approaches only operate on pairs of keyframes, we relax this constraint by considering neighborhoods of multiple keyframes when detecting loops. In this work, we introduce LoopGNN, a graph neural network architecture that estimates loop closure consensus by leveraging cliques of visually similar keyframes retrieved through place recognition. By propagating deep feature encodings among nodes of the clique, our method yields high-precision estimates while maintaining high recall. Extensive experimental evaluations on the TartanDrive 2.0 and NCLT datasets demonstrate that LoopGNN outperforms traditional baselines. Additionally, an ablation study across various keypoint extractors demonstrates that our method is robust, regardless of the type of deep feature encodings used, and exhibits higher computational efficiency compared to classical geometric verification baselines. We release our code, supplementary material, and keyframe data at https://loopgnn.cs.uni-freiburg.de.
Article
To enhance the reliability and stability of simultaneous localization and mapping (SLAM) in dynamic environments, we propose a novel SLAM system integrating an advanced real-time object detection algorithm, the real-time detection transformer (RT-DETR). Our approach combines RT-DETR’s object detection capabilities with an optical flow-based dynamic thresholding method, effectively filtering out feature points associated with dynamic objects and thereby improving SLAM performance in such environments. We have optimized RT-DETR by substituting its original network backbone with lightweight modules, which reduces the number of parameters by 45% while only incurring a 5% reduction in accuracy. This optimization significantly lowers computational costs, making it feasible for deployment on mobile devices. Experiments conducted on the TUM and BONN dynamic datasets demonstrate that our system reduces the root mean square error (RMSE) of absolute trajectory and relative pose error (RPE) by approximately 28.82% compared to Oriented FAST and Rotated BRIEF SLAM3 (ORB-SLAM3). Furthermore, experiments conducted on both a high-performance device and an embedded device demonstrate that, compared to Crowd-SLAM, which employs You Only Look Once (YOLO) for dynamic object removal, our approach achieves an 8.52% improvement in absolute trajectory error (ATE), while the average frames per second (FPS) only decreases by 3.07%.
Article
For a robot in an unknown environment to find a target semantic object, it must perform simultaneous localization and mapping (SLAM) at both geometric and semantic levels using its onboard sensors while planning and executing its motion based on the ever-updated SLAM results. In other words, the robot must simultaneously conduct localization, semantic mapping, motion planning, and execution online in the presence of sensing and motion uncertainty. This is an open problem as it intertwines semantic SLAM and adaptive online motion planning and execution under uncertainty based on perception. Moreover, the goals of the robot's motion change on the fly depending on whether and how the robot can detect the target object. We propose a novel approach to tackle the problem, leveraging semantic SLAM, Bayesian Networks, and online probabilistic motion planning. The results demonstrate our approach's effectiveness and efficiency.
Article
Simultaneous localization and mapping (SLAM) in robotics is a fundamental problem. The use of visual odometry (VO) enhances scene recognition in the task of ego-localization within an unknown environment. Semantically meaningful information permits data association and dense mapping to be conducted based on entities representing landmarks rather than manually designed, low-level geometric clues and has inspired various feature descriptors for semantically ensembled SLAM applications. This article illuminates the insights into the measure for semantics and the semantically constrained pose optimization. The concept of semantic extractor and the matched framework are initially presented. As the latest advances in computer vision and the learning-based deep feature acquisition are closely related, the semantic extractor is especially described in a deep learning paradigm. The methodologies pertinent to our explorations for object association and semantics-fused constraining that is amenable for use in a least-squares framework are summarized in a systematic way. By a collection of problem formulations and principle analyses, our review exhibits a fairly unique perspective in semantic SLAM. We further discuss the challenges of semantic uncertainty and explicitly introduce the term ‘semantic reasoning’. Some technology outlooks regarding semantic reasoning are simultaneously given. We argue that for intelligent tasks of robots such as object grasping, dynamic obstacle avoidance, and object-target navigation, semantic reasoning might guide the complex scene understanding under the framework of semantic SLAM directly to a solution.
Article
Visual place recognition (VPR) is an essential tool in robotics perception and navigation. Though much progress has been made recently, the performance of VPR is far from satisfactory in challenging scenarios such as large appearance variations, reverse viewpoints, and heterogeneous data. This work aims to fully leverage semantic and spatial information to achieve more robust and accurate VPR in these challenging scenarios. To this end, we propose a novel bird's eye view (BEV) graph matching based pipeline, which represents a scene as a unified BEV graph that can better integrate appearance, semantics, and spatial structure of the scene. Following a coarse-to-fine hierarchical paradigm, we first search the top N candidates based on global descriptors. Then, we construct BEV graphs, and formulate the similarity measurement of a query-candidate pair as a quadratic assignment problem, for which an iterative solver taking geometric consistency into account is designed. Further, we propose a Shannon entropy based adaptive fusion strategy to fuse the similarity scores from the coarse and fine matching stages. Extensive evaluation across multiple datasets demonstrates the superiority of our method in various challenging scenarios. Code is available at https://github.com/Haochen-Niu/BEVGM.
Article
Loop closure, as one of the crucial components in SLAM, plays an essential role in correcting accumulated errors. Traditional appearance-based methods, such as bag-of-words models, are often limited by local 2D features and the volume of training data, making them less versatile and robust in real-world scenarios, leading to missed detections or false positive detections in loop closure. To address these issues, we first propose a semantic loop detection method based on quadric-level object map topology, which represents scenes through the topological graph of quadric-level objects and achieves accurate loop closure at a wide field of view by comparing differences in the topological graphs. Next, to solve the data association problem between frame and map in loop closure, we propose an object data association method based on multi-level verification, which can associate 2D semantic features of the current frame with 3D object landmarks of the map. Finally, we integrate these two methods into a complete object-aware SLAM system. Qualitative experiments and ablation studies demonstrate the effectiveness and robustness of the proposed object-level data association algorithm. Quantitative experiments show that our semantic loop closure method outperforms existing state-of-the-art methods in terms of precision, recall, and localization accuracy metrics.
Article
Conventional point-feature-based visual SLAM (Simultaneous Localization and Mapping) struggles to find reliable point features for estimating camera pose in structured, low-texture environments. In contrast, line features are competent in such environments due to their advantage in expressing structural features. The line feature detection algorithms used in current point-line fusion SLAM systems, such as the LSD (Line Segment Detection) algorithm, suffer from massive short line features and long disconnected lines, which dramatically decrease the accuracy of pose estimation. Therefore, we propose a novel line feature detection method, named EM-LSD (Elimination, Merging-Line Segment Detection), to obtain high-quality line features through a strategy of short-line rejection and approximate line segment merging. In addition, we tightly couple IMU (inertial measurement unit) and visual measurements at the back end. Further, we optimize the state by minimizing a cost function that contains the information of the IMU and point-line features. Finally, we perform experimental validation on the EuRoC and TUM VI datasets, and the experimental results show that our proposed EM-LSD algorithm can significantly improve the quality of extracted line features, and our proposed visual-inertial odometry algorithm can obtain higher localization accuracy than the state-of-the-art SLAM algorithm of the same type, PL-VINS.
Article
For active visual odometry (VO), the visual information detected by the positioning camera matters. By actively controlling the gaze of the camera, the VO tends to retrieve effective factors, such as textured objects with rich feature points. However, the active rotation of the camera can introduce more uncertainties, which may cause additional gross errors. Therefore, it is a considerable problem for the active VO to avoid the adverse effects of the active rotation while benefiting from it. To address the issue, this article proposes an improved strategy based on a robust adaptive unscented Kalman filter (RAUKF) and the relative posture of the active camera for the active VO. The pose output by the VO is transformed to the vehicle pose by means of the pose of the pan-tilt unit, and the transformed pose is treated as the measurement of the vehicle motion. Subsequently, the measurement is fed to the RAUKF to generate a refined estimation of the vehicle pose, which is then inversely transformed to obtain a more precise camera pose. Finally, the feature point cloud of the VO can be corrected according to the refined camera pose. The proposed method effectively improves the positioning accuracy of the active VO, as demonstrated through numerical and real-vehicle tests. The relative translation error and the relative rotation error of the proposed method are 1.6% and 0.0037 deg/m on average, which are reduced by 96.22% and 94.79% compared with the raw outputs of the active VO.
Article
Visual simultaneous localization and mapping (SLAM) systems face challenges in detecting loop closure under the circumstance of large viewpoint changes. In this paper, we present an object-level SLAM system based on the spatial layout and semantic consistency of the 3D scene graph. Firstly, we propose an object-level data association approach based on the semantic information from semantic labels, intersection over union (IoU), object color, and object embedding. Subsequently, multi-view bundle adjustment with the associated objects is utilized to jointly optimize the poses of objects and cameras. We represent the refined objects as a 3D spatial graph with semantics and topology. Then, we propose a graph matching approach to find correspondence objects based on the spatial layout and semantic property similarity of vertices’ neighbors. Finally, we jointly optimize camera trajectories and object poses in an object-level pose graph optimization, which results in a globally consistent map. Experimental results demonstrate that our proposed data association approach can construct more accurate 3D semantic maps, and our loop closure method is more robust than point-based and object-based methods in circumstances with large viewpoint changes.
Article
Full-text available
Because Lidar can directly obtain ranging information and is more robust than visual sensors to environmental changes such as illumination, laser simultaneous localization and mapping (SLAM) technology has been widely developed in recent years. Traditional laser SLAM has produced many research achievements, but it uses only geometric features, has a limited understanding of the scene, and struggles with complex tasks. In addition, current SLAM application scenarios have transitioned from traditional static scenes to complex dynamic scenes, where traditional methods mostly fail to achieve good performance due to interference from dynamic elements. Therefore, 3D laser SLAM technology enhanced by semantic information has attracted more and more attention from researchers. Point cloud semantic labels are integrated with pure geometric features. On the one hand, potential moving objects are filtered out with semantic information to address the static-environment assumption. On the other hand, semantic information is used to assist the laser odometer in obtaining high-precision positioning and mapping. This article summarizes the research progress of semantically enhanced 3D laser SLAM technology, puts forward a general framework for it, reviews the outstanding research achievements and applications in this field module by module, and finally summarizes and anticipates the development directions of the field.
Article
Accurate perception of objects in the environment is important for improving the scene understanding capability of SLAM systems. In robotic and augmented reality applications, object maps with semantic and metric information show attractive advantages. In this paper, we present RO-MAP, a novel multi-object mapping pipeline that does not rely on 3D priors. Given only monocular input, we use neural radiance fields to represent objects and couple them with a lightweight object SLAM based on multi-view geometry, to simultaneously localize objects and implicitly learn their dense geometry. We create separate implicit models for each detected object and train them dynamically and in parallel as new observations are added. Experiments on synthetic and real-world datasets demonstrate that our method can generate semantic object maps with shape reconstruction, and be competitive with offline methods while achieving real-time performance (25 Hz). The code and dataset will be available at: https://github.com/XiaoHan-Git/RO-MAP
Article
Loop closure detection is of great significance in the field of intelligent driving systems, as it reduces the cumulative error of the estimated position of the system and assists in generating a consistent global map. Existing methods differ in frame representation and the corresponding frame-matching strategy. Traditionally, local feature points and descriptors have been studied extensively, while recently global descriptors and semantic information extracted by deep learning methods are considered superior in terms of promoting a high-level understanding of the surroundings of robots. However, one of the most challenging problems of using semantic information for loop detection is how to deal with inconsistent visual contents from different viewpoints in the same place. In this paper, a semantic loop closure detection method using panoramas is proposed to address this issue. We design a pipeline for efficiently extracting and matching semantic information between frames to identify loops. Most importantly, we propose a novel polar coordinate-based panorama representation to address the inconsistent visual appearance problem caused by viewpoint differences. Experiment results show that our proposed method can significantly increase the accuracy of loop closure detection in challenging scenarios where traditional methods may fail.
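The rotation-invariance problem a polar representation addresses can be sketched as follows (an assumption-laden toy version, not the paper's actual descriptor): bin semantic detections by bearing angle and compare two places under every circular shift, so that a pure viewpoint rotation cannot break the match.

```python
# Toy polar semantic signature: (n_bins, n_classes) histogram over bearing.
import numpy as np

def polar_signature(objects, n_bins=36, n_classes=10):
    """objects: iterable of (class_id, bearing_rad) detections."""
    sig = np.zeros((n_bins, n_classes))
    for cls, theta in objects:
        b = int((theta % (2 * np.pi)) / (2 * np.pi) * n_bins) % n_bins
        sig[b, cls] += 1.0
    return sig

def rotation_invariant_distance(sig_a, sig_b):
    # Try every angular shift of sig_b and keep the best agreement.
    return min(float(np.abs(sig_a - np.roll(sig_b, s, axis=0)).sum())
               for s in range(sig_b.shape[0]))
```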
Article
Object-level Simultaneous Localization and Mapping (SLAM) is critical for mobile robot localization and navigation. Wrong observations due to monocular camera noise and object detection errors hinder accurate object perception. Most existing work adopts simple hand-crafted rules to prevent the construction of object outliers; these strategies are difficult to generalize to different challenging scenarios. Eliminating object outliers remains a challenge for object SLAM. In this paper, we propose a spatio-temporal consistency model for removing object outliers. Our approach takes only a low-cost monocular camera as the image sensor of the system. We use a graph model to construct spatial consistency as a means to constrain the semantic spatial relationships among multiple objects. Only the objects that satisfy the spatial consistency constraints are constructed. In addition, outliers are detected based on the regularity of object measurements along the time axis. We eliminate objects whose observations in consecutive frames do not satisfy the temporal consistency constraint. Finally, we couple normal objects to SLAM for pose optimization to improve camera localization accuracy. Experiments on public datasets and a real scenario demonstrate the performance of the proposed approach.
Article
Full-text available
Loop closure detection is a key module for visual simultaneous localization and mapping (SLAM). Most previous methods for this module have not made full use of the information provided by images, i.e., they have only used the visual appearance or have only considered the spatial relationships of landmarks; the visual, spatial and semantic information have not been fully integrated. In this paper, a robust loop closure detection approach integrating visual–spatial–semantic information is proposed by employing topological graphs and convolutional neural network (CNN) features. Firstly, to reduce mismatches under different viewpoints, semantic topological graphs are introduced to encode the spatial relationships of landmarks, and random walk descriptors are employed to characterize the topological graphs for graph matching. Secondly, dynamic landmarks are eliminated by using semantic information, and distinctive landmarks are selected for loop closure detection, thus alleviating the impact of dynamic scenes. Finally, to ease the effect of appearance changes, the appearance-invariant descriptor of the landmark region is extracted by a pre-trained CNN without the specially designed manual features. The proposed approach weakens the influence of viewpoint changes and dynamic scenes, and extensive experiments conducted on open datasets and a mobile robot demonstrated that the proposed method has more satisfactory performance compared to state-of-the-art methods.
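The random walk descriptors mentioned above can be sketched roughly as follows (a simplified reading of the idea, with hypothetical data structures): each landmark is described by the set of semantic label sequences seen along short random walks starting from it, and two landmarks match when their walk sets overlap.

```python
# Sketch: `graph` maps node -> list of neighbors; `labels` maps node -> label.
import random

def random_walk_descriptor(graph, labels, start, n_walks=32, walk_len=4, seed=0):
    rng = random.Random(seed)
    walks = set()
    for _ in range(n_walks):
        node, seq = start, [labels[start]]
        for _ in range(walk_len):
            if not graph[node]:
                break  # dead end: stop this walk early
            node = rng.choice(graph[node])
            seq.append(labels[node])
        walks.add(tuple(seq))
    return walks

def descriptor_similarity(walks_a, walks_b):
    return len(walks_a & walks_b) / max(1, len(walks_a | walks_b))
```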
Article
Full-text available
Due to the development of computer vision, machine learning, and deep learning technologies, the research community focuses not only on traditional SLAM problems, such as geometric mapping and localization, but also on semantic SLAM. In this paper we propose a semantic SLAM system which builds semantic maps with object-level entities and is integrated into the RGB-D SLAM framework. The system seamlessly combines an object detection module, realized by a deep-learning method, with a localization module based on RGB-D SLAM. In the proposed system, the object detection module performs object detection and recognition, and the localization module is utilized to obtain the exact location of the camera. The two modules are integrated to obtain semantic maps of the environment. Furthermore, to improve the computational efficiency of the framework, an improved Octomap based on the Fast Line Rasterization Algorithm is constructed. Meanwhile, for the sake of accuracy and robustness of the semantic map, a Conditional Random Field (CRF) is employed for optimization. Finally, we evaluate our semantic SLAM on three different tasks, i.e., localization, object detection, and mapping. Specifically, the accuracy of localization and the mapping speed are evaluated on the TUM dataset. Compared with ORB-SLAM2 and the original RGB-D SLAM, our system achieves 72.9% and 91.2% improvements, respectively, in dynamic-environment localization evaluated by root-mean-square error (RMSE). With the improved Octomap, the proposed semantic SLAM is 66.5% faster than the original RGB-D SLAM. We also demonstrate the efficiency of object detection through quantitative evaluation in an automated inventory management task on a real-world dataset recorded over a realistic office.
Article
Full-text available
Visual self-localization in unknown environments is a crucial capability for an autonomous robot. Real-life scenarios often present critical challenges for autonomous vision-based localization, such as robustness to viewpoint and appearance changes. To address these issues, this paper proposes a novel strategy that models the visual scene by preserving its geometric and semantic structure and, at the same time, improves appearance invariance through a robust visual representation. Our method relies on high-level visual landmarks consisting of appearance-invariant descriptors that are extracted by a pre-trained Convolutional Neural Network (CNN) on the basis of image patches. In addition, during the exploration, the landmarks are organized by building an incremental covisibility graph that, at query time, is exploited to retrieve candidate matching locations, improving robustness in terms of viewpoint invariance. In this respect, through the covisibility graph, the algorithm finds location similarities more effectively by exploiting the structure of the scene which, in turn, allows the construction of virtual locations, i.e., artificially augmented views from a real location that are useful to enhance the loop closure ability of the robot. The proposed approach has been deeply analysed and tested in different challenging scenarios taken from public datasets. The approach has also been compared with a state-of-the-art visual navigation algorithm.
Conference Paper
Full-text available
With the growing demand for deployment of robots in real scenarios, robustness in the perception capabilities for navigation lies at the forefront of research interest, as this forms the backbone of robotic autonomy. Existing place recognition approaches traditionally follow the feature-based bag-of-words paradigm in order to cut down on the richness of information in images. As structural information is typically ignored, such methods suffer from perceptual aliasing and reduced recall, due to the ambiguity of observations. In a bid to boost the robustness of appearance-based place recognition, we consider the world as a continuous constellation of visual words, while keeping track of their covisibility in a graph structure. Locations are queried based on their appearance, and modelled by their corresponding cluster of landmarks from the global covisibility graph, which retains important relational information about landmarks. Complexity is reduced by comparing locations by their graphs of visual words in a simplified manner. Test results show increased recall performance and robustness to noisy observations, compared to state-of-the-art methods.
Article
Full-text available
Finding the relationship between two coordinate systems using pairs of measurements of the coordinates of a number of points in both systems is a classic photogrammetric task. It finds applications in stereophotogrammetry and in robotics. I present here a closed-form solution to the least-squares problem for three or more points. Currently various empirical, graphical, and numerical iterative methods are in use. Derivation of the solution is simplified by use of unit quaternions to represent rotation. I emphasize a symmetry property that a solution to this problem ought to possess. The best translational offset is the difference between the centroid of the coordinates in one system and the rotated and scaled centroid of the coordinates in the other system. The best scale is equal to the ratio of the root-mean-square deviations of the coordinates in the two systems from their respective centroids. These exact results are to be preferred to approximate methods based on measurements of a few selected points. The unit quaternion representing the best rotation is the eigenvector associated with the most positive eigenvalue of a symmetric 4 × 4 matrix. The elements of this matrix are combinations of sums of products of corresponding coordinates of the points.
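Since this closed-form solution is routinely used to align trajectories and point sets in SLAM evaluation, a compact worked transcription of the construction the abstract describes may help: the cross-covariance sums are assembled into a symmetric 4x4 matrix whose top eigenvector is the unit quaternion of the best rotation. This is a straightforward sketch; handling of degenerate configurations is omitted.

```python
# Horn's closed-form absolute orientation with unit quaternions (sketch).
import numpy as np

def horn_absolute_orientation(p, q):
    """Find s, R, t minimizing sum ||q_i - (s R p_i + t)||^2.
    p, q: (N, 3) arrays of corresponding points, N >= 3."""
    p_bar, q_bar = p.mean(axis=0), q.mean(axis=0)
    P, Q = p - p_bar, q - q_bar
    S = P.T @ Q  # cross-covariance: S[a, b] = sum_i P_ia * Q_ib
    Sxx, Sxy, Sxz = S[0]; Syx, Syy, Syz = S[1]; Szx, Szy, Szz = S[2]
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
    # Best rotation = eigenvector of N for the most positive eigenvalue.
    w, V = np.linalg.eigh(N)
    qw, qx, qy, qz = V[:, np.argmax(w)]
    R = np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)]])
    # Best scale = ratio of RMS deviations from the respective centroids.
    s = np.sqrt((Q**2).sum() / (P**2).sum())
    t = q_bar - s * R @ p_bar  # best translation between the centroids
    return s, R, t
```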
Article
This article presents ORB-SLAM3, the first system able to perform visual, visual-inertial and multimap SLAM with monocular, stereo and RGB-D cameras, using pin-hole and fisheye lens models. The first main novelty is a tightly integrated visual-inertial SLAM system that fully relies on maximum a posteriori (MAP) estimation, even during IMU initialization, resulting in real-time robust operation in small and large, indoor and outdoor environments, being two to ten times more accurate than previous approaches. The second main novelty is a multiple map system relying on a new place recognition method with improved recall that lets ORB-SLAM3 survive long periods of poor visual information: when it gets lost, it starts a new map that will be seamlessly merged with previous maps when revisiting them. Compared with visual odometry systems that only use information from the last few seconds, ORB-SLAM3 is the first system able to reuse in all the algorithm stages all previous information from high-parallax co-visible keyframes, even if they are widely separated in time or come from previous mapping sessions, boosting accuracy. Our experiments show that, in all sensor configurations, ORB-SLAM3 is as robust as the best systems available in the literature and significantly more accurate. Notably, our stereo-inertial SLAM achieves an average accuracy of 3.5 cm in the EuRoC drone dataset and 9 mm under quick hand-held motions in the room of the TUM-VI dataset, representative of AR/VR scenarios. For the benefit of the community we make public the source code.
Article
A novel semantic loop closure detection method (SLCD) is proposed in this paper for visual simultaneous localization and mapping (V-SLAM) systems. SLCD aims to relieve the instance-level semantic inconsistency issue arising from dynamic industrial scenes (e.g., autonomous driving in big cities). As the first step in this direction, SLCD fully exploits both low- and high-level video frame information, in a coarse-to-fine way. In SLCD, we adopt a convolutional neural network (CNN)-based object detection to acquire object information from the consecutive frames. Meanwhile, we perform a bag of visual words (BoVW)-based similarity calculation to narrow the frames to coarse loop closure candidates. For these candidates, we perform object matching to find their semantic inconsistency cases and remove the involved semantic inconsistencies according to their cases. Then we recalculate the similarity scores for these candidates. Finally, loop closures are determined by the similarity scores and a geometrical verification. Favorable performance of the proposed method is demonstrated by comparing it to other state-of-the-art methods using data from several public datasets and our new Dynamic Scenes dataset.
Article
In this paper, we present a method for single image three-dimensional (3-D) cuboid object detection and multiview object simultaneous localization and mapping in both static and dynamic environments, and demonstrate that the two parts can improve each other. First, for single image object detection, we generate high-quality cuboid proposals from two-dimensional (2-D) bounding boxes and vanishing points sampling. The proposals are further scored and selected based on the alignment with image edges. Second, multiview bundle adjustment with new object measurements is proposed to jointly optimize poses of cameras, objects, and points. Objects can provide long-range geometric and scale constraints to improve camera pose estimation and reduce monocular drift. Instead of treating dynamic regions as outliers, we utilize object representation and motion model constraints to improve the camera pose estimation. The 3-D detection experiments on SUN RGBD and KITTI show better accuracy and robustness over existing approaches. On the public TUM, KITTI odometry and our own collected datasets, our SLAM method achieves the state-of-the-art monocular camera pose estimation and at the same time, improves the 3-D object detection accuracy.
Article
Graph alignment refers to the problem of finding a bijective mapping across vertices of two graphs such that, if two nodes are connected in the first graph, their images are connected in the second graph. Most standard graph alignment methods consider an optimization that maximizes the number of matches between the two graphs, ignoring the effect of mismatches. We propose a generalized graph alignment formulation that considers both matches and mismatches in a standard quadratic assignment problem (QAP) formulation. This modification can have a major impact in aligning graphs with different sizes and heterogenous edge densities. Moreover, we propose two methods for solving the generalized graph alignment problem based on spectral decomposition of matrices. We compare the performance of proposed methods with some existing graph alignment algorithms including Natalie2, GHOST, IsoRank, NetAlign, Klau's approach as well as a semidefinite programming-based method over various synthetic and real graph models. Our proposed method based on simultaneous alignment of multiple eigenvectors leads to consistently good performance in different graph models. In particular, in the alignment of regular graph structures which is one of the most difficult graph alignment cases, our proposed method significantly outperforms other methods.
Article
In this paper, we use 2D object detections from multiple views to simultaneously estimate a 3D quadric surface for each object and localize the camera position. We derive a SLAM formulation that uses dual quadrics as 3D landmark representations, exploiting their ability to compactly represent the size, position and orientation of an object, and show how 2D object detections can directly constrain the quadric parameters via a novel geometric error formulation. We develop a sensor model for object detectors that addresses the challenge of partially visible objects, and demonstrate how to jointly estimate the camera pose and constrained dual quadric parameters in factor graph based SLAM with a general perspective camera.
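The core geometric relation in this formulation is that a dual quadric Q* (a 4x4 symmetric matrix) projects to a dual conic C* = P Q* P^T under a 3x4 camera matrix P, and the conic's axis-aligned tangent lines give a predicted 2D bounding box to compare against the detector output. A small sketch of that prediction step (sign and visibility handling simplified; assumes the object projects to a real ellipse in front of the camera):

```python
# Sketch: predict the 2D box of a dual quadric Q* (4x4) in camera P (3x4).
import numpy as np

def quadric_to_bbox(Q_star, P):
    C = P @ Q_star @ P.T  # dual conic: tangent lines l satisfy l^T C l = 0
    # Tangent vertical lines x = k solve C00 - 2k*C02 + k^2*C22 = 0,
    # and tangent horizontal lines y = k solve C11 - 2k*C12 + k^2*C22 = 0.
    disc_x = np.sqrt(C[0, 2]**2 - C[0, 0] * C[2, 2])
    disc_y = np.sqrt(C[1, 2]**2 - C[1, 1] * C[2, 2])
    x = (C[0, 2] + np.array([-1.0, 1.0]) * disc_x) / C[2, 2]
    y = (C[1, 2] + np.array([-1.0, 1.0]) * disc_y) / C[2, 2]
    return min(x), min(y), max(x), max(y)  # (x_min, y_min, x_max, y_max)
```

The geometric error used in such factor graphs can then be taken as the difference between this predicted box and the detected 2D bounding box.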
Article
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/
Conference Paper
UnrealCV is a project to help computer vision researchers build virtual worlds using Unreal Engine 4 (UE4). It extends UE4 with a plugin by providing (1) A set of UnrealCV commands to interact with the virtual world. (2) Communication between UE4 and an external program, such as Caffe. UnrealCV can be used in two ways. The first one is using a compiled game binary with UnrealCV embedded. This is as simple as running a game, no knowledge of Unreal Engine is required. The second is installing UnrealCV plugin to Unreal Engine 4 (UE4) and use the editor of UE4 to build a new virtual world. UnrealCV is an open-source software under the MIT license. Since the initial release in September 2016, it has gathered an active community of users, including students and researchers.
Article
We present ORB-SLAM2, a complete SLAM system for monocular, stereo and RGB-D cameras, including map reuse, loop closing and relocalization capabilities. The system works in real time on standard CPUs in a wide variety of environments, from small hand-held indoor sequences, to drones flying in industrial environments and cars driving around a city. Our back-end based on bundle adjustment with monocular and stereo observations allows for accurate trajectory estimation with metric scale. Our system includes a lightweight localization mode that leverages visual odometry tracks for unmapped regions and matches to map points that allow for zero-drift localization. The evaluation on 29 popular public sequences shows that our method achieves state-of-the-art accuracy, being in most cases the most accurate SLAM solution. We publish the source code, not only for the benefit of the SLAM community, but with the aim of being an out-of-the-box SLAM solution for researchers in other fields.
Article
The problem of image-based localization has a long history in both robotics and computer vision and shares many similarities with the image-based retrieval problem. Existing techniques use either local features or (semi-)global image signatures in the context of topological mapping or loop closure detection. The difficulty of the location recognition problem is often driven by large appearance and viewpoint variation between the query view and the reference dataset, and by the presence of non-discriminative features due to vegetation, sky, and road. In this work we show that semantic segmentation labeling of man-made structures can inform traditional bag-of-visual-words models to obtain proper feature weighting and improve the overall location recognition accuracy. We also demonstrate the additional capability of identifying individual buildings and estimating their extent in images, providing the essential building block for semantic localization. Towards this end we introduce a new challenging outdoor urban dataset exhibiting large variations in appearance and viewpoint.
Article
We present a real-time object-based SLAM system that leverages the largest object database to date. Our approach comprises two main components: 1) a monocular SLAM algorithm that exploits object rigidity constraints to improve the map and find its real scale, and 2) a novel object recognition algorithm based on bags of binary words, which provides live detections with a database of 500 3D objects. The two components work together and benefit each other: the SLAM algorithm accumulates information from the observations of the objects, anchors object features to special map landmarks and sets constraints on the optimization. At the same time, objects partially or fully located within the map are used as a prior to guide the recognition algorithm, achieving higher recall. We evaluate our proposal in five real environments, showing improvements in the accuracy of the map and efficiency with respect to other state-of-the-art techniques.
Conference Paper
We present the major advantages of a new 'object oriented' 3D SLAM paradigm, which takes full advantage in the loop of prior knowledge that many scenes consist of repeated, domain-specific objects and structures. As a hand-held depth camera browses a cluttered scene, real-time 3D object recognition and tracking provides 6DoF camera-object constraints which feed into an explicit graph of objects, continually refined by efficient pose-graph optimisation. This offers the descriptive and predictive power of SLAM systems which perform dense surface reconstruction, but with a huge representation compression. The object graph enables predictions for accurate ICP-based camera to model tracking at each live frame, and efficient active search for new objects in currently undescribed image regions. We demonstrate real-time incremental SLAM in large, cluttered environments, including loop closure, relocalisation and the detection of moved objects, and of course the generation of an object level scene description with the potential to enable interaction.
Article
We propose a novel method for visual place recognition using bags of words obtained from FAST (features from accelerated segment test) + BRIEF features. For the first time, we build a vocabulary tree that discretizes a binary descriptor space and use the tree to speed up correspondences for geometrical verification. We present competitive results with no false positives on very different datasets, using exactly the same vocabulary and settings. The whole technique, including feature extraction, requires 22 ms per frame in a sequence with 26,300 images, which is one order of magnitude faster than previous approaches.
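A vocabulary tree over binary descriptors can be sketched in a few lines (a simplified stand-in for the paper's implementation, with a hypothetical node structure): descend from the root, at each level following the child whose cluster center has the smallest Hamming distance, until a leaf visual word is reached.

```python
# Sketch of quantizing a binary descriptor against a vocabulary tree.
def hamming(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def quantize(node, descriptor):
    """node: {'children': [...], 'center': bytes, 'word_id': int | None}."""
    while node["children"]:
        node = min(node["children"],
                   key=lambda c: hamming(c["center"], descriptor))
    return node["word_id"]  # leaf = visual word used in the BoW histogram
```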
Article
The authors present a qualitative and quantitative comparison of various similarity measures that form the kernel of common area-based stereo-matching systems. The authors compare classical difference and correlation measures as well as nonparametric measures based on the rank and census transforms for a number of outdoor images. For robotic applications, important considerations include robustness to image defects such as intensity variation and noise, the number of false matches, and computational complexity. In the absence of ground truth data, the authors compare the matching techniques based on the percentage of matches that pass the left-right consistency test. The authors also evaluate the discriminatory power of several match validity measures that are reported in the literature for eliminating false matches and for estimating match confidence. For guidance applications, it is essential to have an estimate of confidence in the three-dimensional points generated by stereo vision. Finally, a new validity measure, the rank constraint, is introduced that is capable of resolving ambiguous matches for rank transform–based matching.
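For reference, the census transform compared above can be written compactly. This sketch encodes, for each pixel, which window neighbors are darker than it, so that matching costs become Hamming distances that are robust to monotonic intensity changes (border handling via np.roll is simplified here):

```python
# Census transform sketch: each pixel becomes a bit string over its window.
import numpy as np

def census_transform(img, win=3):
    """img: 2D grayscale array. Returns one integer census code per pixel."""
    r = win // 2
    codes = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue  # skip the center pixel itself
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            codes = (codes << np.uint64(1)) | (shifted < img).astype(np.uint64)
    return codes  # compare two pixels with popcount(code_a ^ code_b)
```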
Article
A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. The authors describe the application of RANSAC to the Location Determination Problem (LDP): given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing and analysis conditions. Implementation details and computational examples are also presented
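A minimal RANSAC loop in the spirit of this abstract, instantiated for 2D line fitting (parameter values are illustrative): repeatedly fit a model to a random minimal sample and keep the hypothesis with the most inliers.

```python
# RANSAC sketch for fitting a 2D line a*x + b*y + c = 0 to noisy points.
import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=1.0, seed=0):
    """points: (N, 2) array. Returns the (a, b, c) of the best line found."""
    rng = np.random.default_rng(seed)
    best_line, best_inliers = None, 0
    for _ in range(n_iters):
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        d = p2 - p1
        n = np.array([-d[1], d[0]])     # normal of the candidate line
        norm = np.linalg.norm(n)
        if norm == 0:
            continue  # degenerate sample: coincident points
        n = n / norm
        c = -n @ p1
        dists = np.abs(points @ n + c)  # point-to-line distances
        inliers = int((dists < inlier_tol).sum())
        if inliers > best_inliers:
            best_line, best_inliers = (n[0], n[1], c), inliers
    return best_line
```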
Article
Abstract: "The push-relabel method has been shown to be efficient for solving maximum flow and minimum cost flow problems in practice, and periodic global updates of dual variables have played an important role in the best implementations. Nevertheless, global updates had not been known to yield any theoretical improvement in running time. In this work, we study techniques for implementing push-relabel algorithms to solve bipartite matching and assignment problems. We show that global updates yield a theoretical improvement in the bipartite matching and assignment contexts, and we develop a suite of efficient cost-scaling push-relabel implementations to solve assignment problems. For bipartite matching, we show that a push-relabel algorithm using global updates runs in [formula] time (matching the best bound known) and performs worse by a factor of [square root of n] without the updates. We present a similar result for the assignment problem, for which an algorithm that assumes integer costs in the range [ -C, ..., C] runs in time O([square root of nm] log(nC)) (matching the best cost-scaling bound known). We develop cost-scaling push-relabel implementations that take advantage of the assignment problem's special structure, and compare our codes against the best codes from the literature. The results show that the push-relabel method is very promising for practical use." Cover title. "August 1995." Thesis (Ph. D.)--Stanford University, 1995. Includes bibliographical references.