Preprint

SG-Reg: Generalizable and Efficient Scene Graph Registration

Abstract

This paper addresses the challenge of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against that of a remote agent or against a prior map. Hand-crafted descriptors in classical semantic-aided registration, and the reliance on ground-truth annotations in learning-based scene graph registration, impede their application in practical real-world environments. To address these challenges, we design a scene graph network that encodes multiple modalities of each semantic node: an open-set semantic feature, local topology with spatial awareness, and a shape feature. These modalities are fused into compact semantic node features. Matching layers then search for correspondences in a coarse-to-fine manner, and in the back-end a robust pose estimator computes the transformation from the correspondences. Throughout, we maintain a sparse and hierarchical scene representation, so our approach demands fewer GPU resources and less communication bandwidth in multi-agent tasks. Moreover, we design a new data generation approach that uses vision foundation models and a semantic mapping module to reconstruct semantic scene graphs; it differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method on a two-agent SLAM benchmark, where it significantly outperforms the hand-crafted baseline in registration success rate. Compared with visual loop closure networks, our method achieves slightly higher registration recall while requiring only 52 KB of communication bandwidth per query frame. Code is available at http://github.com/HKUST-Aerial-Robotics/SG-Reg.
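
As a rough illustration of the pipeline described above, the following minimal NumPy sketch (our own illustration, not the released implementation; the fusion scheme, feature dimensions, and matching rule are assumptions) fuses per-node modality features, finds coarse node correspondences by mutual nearest neighbours, and recovers a rigid transform from matched node centroids with a Kabsch/SVD solver.

```python
# Minimal sketch of coarse node matching + rigid alignment between two
# semantic scene graphs. Feature dimensions and the matching rule are
# illustrative assumptions, not the paper's exact design.
import numpy as np

def fuse_node_features(semantic, topology, shape):
    """Concatenate per-node modality features and L2-normalize them."""
    fused = np.concatenate([semantic, topology, shape], axis=1)
    return fused / (np.linalg.norm(fused, axis=1, keepdims=True) + 1e-8)

def mutual_nearest_matches(feat_a, feat_b):
    """Coarse correspondences: mutual nearest neighbours in cosine similarity."""
    sim = feat_a @ feat_b.T
    nn_ab = sim.argmax(axis=1)            # best match in B for each node in A
    nn_ba = sim.argmax(axis=0)            # best match in A for each node in B
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

def estimate_rigid_transform(src, dst):
    """Kabsch/SVD rigid alignment of matched node centroids (src -> dst)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # correct an improper (reflected) solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Usage with random stand-in data: 30 nodes, three modalities, 3-D centroids.
rng = np.random.default_rng(0)
sem, topo, shp = (rng.normal(size=(30, 64)),
                  rng.normal(size=(30, 32)),
                  rng.normal(size=(30, 32)))
feat = fuse_node_features(sem, topo, shp)
matches = mutual_nearest_matches(feat, feat)        # self-matching sanity check
centroids = rng.normal(size=(30, 3))
R, t = estimate_rigid_transform(centroids, centroids + 1.0)  # expect R ~ I, t ~ [1,1,1]
```

In the actual system, the plain SVD fit would be replaced by a robust back-end estimator that rejects outlier correspondences, and the coarse node matches would be refined by finer, point-level matching, as the abstract describes.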
