Article · Publisher preview available

YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint

Authors: Wenxin Wu, Liang Guo, Hongli Gao, Zhichao You, Yuekai Liu, Zhiqiang Chen

Abstract

Simultaneous localization and mapping (SLAM), as one of the core prerequisite technologies for intelligent mobile robots, has attracted much attention in recent years. However, traditional SLAM systems rely on the static-environment assumption, which makes them unstable in dynamic environments and limits their real-world practical applications. To deal with this problem, this paper presents a dynamic-environment-robust visual SLAM system named YOLO-SLAM. In YOLO-SLAM, a lightweight object detection network named Darknet19-YOLOv3 is designed, which adopts a low-latency backbone to accelerate detection and generate essential semantic information for the SLAM system. Then, a new geometric constraint method is proposed to filter dynamic features in the detected areas, where dynamic features are distinguished by applying Random Sample Consensus (RANSAC) to depth differences. YOLO-SLAM combines the object detection approach and the geometric constraint method in a tightly coupled manner, which effectively reduces the impact of dynamic objects. Experiments are conducted on the challenging dynamic sequences of the TUM and Bonn datasets to evaluate the performance of YOLO-SLAM. The results demonstrate that the RMSE of the absolute trajectory error is reduced by 98.13% compared with ORB-SLAM2 and by 51.28% compared with DS-SLAM, indicating that YOLO-SLAM effectively improves stability and accuracy in highly dynamic environments.
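For concreteness, here is a minimal sketch of the kind of depth-difference RANSAC filter the abstract describes, applied to the features inside a single detection box. The function name, the consensus-depth hypothesis, and the thresholds are illustrative assumptions on our part, not the authors' implementation.

```python
import numpy as np

def split_static_dynamic(keypoints, depths, n_iters=100, depth_tol=0.05):
    """Sketch of a depth-difference RANSAC filter inside one detected box.

    keypoints : (N, 2) array of pixel coordinates inside the box
    depths    : (N,) array of depth values (metres) at those pixels
    depth_tol : inlier tolerance on depth difference (assumed value)

    RANSAC repeatedly hypothesizes a consensus depth from a small sample;
    features far from the largest consensus set are treated as belonging
    to the other (dynamic) surface inside the box.
    """
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(depths), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(depths), size=3, replace=False)
        depth_hyp = np.median(depths[sample])          # hypothesized surface depth
        inliers = np.abs(depths - depth_hyp) < depth_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Whether the consensus set is the background or the moving object must be
    # decided with extra cues (e.g. depth relative to the box surroundings);
    # here we simply return the two groups.
    return keypoints[best_inliers], keypoints[~best_inliers]
```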
ORIGINAL ARTICLE
YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint
Wenxin Wu¹ · Liang Guo¹ · Hongli Gao¹ · Zhichao You¹ · Yuekai Liu¹ · Zhiqiang Chen¹
Received: 25 February 2021 / Accepted: 15 November 2021 / Published online: 8 January 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021
Keywords: Visual SLAM · Dynamic environment · Object detection · Geometric constraint
1 Introduction
In recent years, mobile robots and automatic driving technology have made significant progress. Simultaneous localization and mapping (SLAM), as a prerequisite technology for many robotic applications, is attracting widespread interest in this field [1–3]. SLAM plays an important role in robot position estimation and map building: it helps robots position themselves in an unknown environment without any prior information while simultaneously creating a map of the surroundings [4, 5]. The position information lets robots know where they are while moving, even in places they have never visited before. The map allows robots to retain essential environmental information, which can be used for relocalization when they return to the same place.
SLAM can be subdivided into laser-based SLAM and vision-based SLAM according to the sensors used [6]. Visual SLAM, whose main sensor is a camera, commonly a monocular, stereo, or RGB-D camera, has been extensively explored [7], because images can store richer scene information than laser sensors. After data mining, the obtained information can be widely used in object detection, semantic segmentation, or disease diagnosis [8, 9]. Visual SLAM has developed for over thirty years and has become quite mature in some specific scenarios. Some advanced visual SLAM systems have achieved decent performance, such as ORB-SLAM2 [10], LSD-SLAM [11], and RGBD-SLAM-V2 [12].
However, most traditional SLAM systems are fragile when confronted with extreme environments, such as dynamic or rough environments. The dynamic environment [13] refers to a scene where moving objects ...
Corresponding author: Liang Guo (guoliang@swjtu.edu.cn)
¹ School of Mechanical Engineering, Southwest Jiaotong University, Chengdu 610031, China
Neural Computing and Applications (2022) 34:6011–6026
https://doi.org/10.1007/s00521-021-06764-3
... In this section, we perform experiments using the public TUM RGB-D dataset and real-world scenes, followed by a description of time performance. In our experiments, we compare our approach with ORB-SLAM3 [4], DynaSLAM [22], DS-SLAM [25], and YOLO-SLAM [35]. Because our method is based on ORB-SLAM3, we compare it with the original ORB-SLAM3 to verify the improvement of performance. ...
Article
Full-text available
Simultaneous localization and mapping (SLAM) is a key technique for mobile robotics. Moving objects can vastly impair the performance of a visual SLAM system. To deal with the problem, a new semantic visual SLAM system for indoor environments is proposed. Our system adds a semantic segmentation network and geometric model to detect and remove dynamic feature points on moving objects. Moreover, a 3D point cloud map with semantic information is created using semantic labels and depth images. We evaluate our method on the TUM RGB-D dataset and real-world environments. The evaluation metrics used are absolute trajectory error and relative position error. Experimental results show our method improves the accuracy in dynamic scenes compared to ORB-SLAM3 and other advanced methods.
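Since this abstract, like most of the systems discussed on this page, reports accuracy as the RMSE of absolute trajectory error (ATE), a minimal sketch of how that metric is typically computed may be useful; it assumes time-synchronized, pre-aligned trajectories and is not tied to any particular paper's evaluation code.

```python
import numpy as np

def ate_rmse(estimated_xyz, groundtruth_xyz):
    """RMSE of the absolute trajectory error between two time-synchronized,
    pre-aligned camera trajectories, each of shape (N, 3)."""
    per_pose_error = np.linalg.norm(estimated_xyz - groundtruth_xyz, axis=1)
    return np.sqrt(np.mean(per_pose_error ** 2))

# Example: a perfect trajectory gives 0; a constant 1 cm offset gives 0.01.
est = np.zeros((100, 3))
gt = est + np.array([0.01, 0.0, 0.0])
print(ate_rmse(est, gt))  # 0.01
```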
... Their approach fits a plane using the triangulated 3D points for each a priori planar cluster and uses a CNN-based panoptic segmentation framework titled Detectron2 [19]. Likewise, YOLO-SLAM [20] is another CNN-based approach that couples geometric constraints with the Darknet19-YOLOv3 object detector to generate semantic information. However, these solutions may suffer from performance degradation in recognizing small or non-regular objects and planes. ...
Preprint
Full-text available
Fiducial markers can encode rich information about the environment and can aid Visual SLAM (VSLAM) approaches in reconstructing maps with practical semantic information. Current marker-based VSLAM approaches mainly utilize markers for improving feature detections in low-feature environments and/or for incorporating loop closure constraints, generating only low-level geometric maps of the environment prone to inaccuracies in complex environments. To bridge this gap, this paper presents a VSLAM approach utilizing a monocular camera along with fiducial markers to generate hierarchical representations of the environment while improving the camera pose estimate. The proposed approach detects semantic entities from the surroundings, including walls, corridors, and rooms encoded within markers, and appropriately adds topological constraints among them. Experimental results on a real-world dataset collected with a robot demonstrate that the proposed approach outperforms a traditional marker-based VSLAM baseline in terms of accuracy, given the addition of new constraints while creating enhanced map representations. Furthermore, it shows satisfactory results when comparing the reconstructed map quality to the one reconstructed using a LiDAR SLAM approach.
... Deep-learning-based methods use deep neural networks to identify objects that are thought to move with high probability, such as Mask R-CNN [24], SegNet [25] and you only look once (YOLO) [26,27] and then track and maintain these objects or cover them with a mask [28,29]. To reduce the impact of the neural network on the real-time performance of V-SLAM, some researchers run the neural network in an independent thread [23,30] or just use the neural network in some specifically selected frames and spread the detection results to other frames by feature points matching [31][32][33]. More recent methods based on deep learning tend to go further by not just removing the dynamic foreground, but also inpainting or reconstructing the static background that is occluded by moving targets [34][35][36]. ...
Article
Full-text available
The visual simultaneous localization and mapping (SLAM) method under dynamic environments is a hot and challenging issue in the robotic field. The oriented FAST and rotated BRIEF (ORB) SLAM algorithm is one of the most effective methods. However, the traditional ORB-SLAM algorithm cannot perform well in dynamic environments because feature points belonging to dynamic map points at different timestamps are incorrectly matched. To deal with this problem, an improved visual SLAM method built on ORB-SLAM3 is proposed in this paper. In the proposed method, an improved screening strategy for new map points and a strategy for eliminating repeated existing map points are presented and combined to identify obvious dynamic map points. Then, a concept of map-point reliability is introduced into the ORB-SLAM3 framework. Based on the proposed reliability calculation of the map points, a multi-period check strategy is used to identify the unobvious dynamic map points, which further addresses the dynamic problem in visual SLAM for those unobvious dynamic objects. Finally, various experiments are conducted on the challenging dynamic sequences of the TUM RGB-D dataset to evaluate the performance of our visual SLAM method. The experimental results demonstrate that our SLAM method can run at an average time of 17.51 ms per frame. Compared with ORB-SLAM3, the average RMSE of the absolute trajectory error (ATE) of the proposed method on nine dynamic sequences of the TUM RGB-D dataset is reduced by 63.31%. Compared with real-time dynamic SLAM methods, the proposed method obtains state-of-the-art performance. The results prove that the proposed method is a real-time visual SLAM method that is effective in dynamic environments.
... However, DynaSLAM II performs poorly in low-texture environments. Wu et al. [15] proposed YOLO-SLAM, built on ORB-SLAM2, which uses YOLOv3 to detect dynamic objects. Although YOLO-SLAM shows significant performance on the TUM dataset, achieving a 98% reduction in absolute trajectory error, its accuracy was not evaluated in dynamic outdoor environments. Liu et al. [16] combined YOLOv3 with ORB-SLAM2. ...
Article
Full-text available
Recent developments in robotics have heightened the need for visual SLAM. Dynamic objects are a major problem in visual SLAM, as they reduce localization accuracy through incorrect epipolar geometry. This study set out to find a new method to address the low accuracy of visual SLAM in outdoor dynamic environments. We propose an adaptive feature point selection system for outdoor dynamic environments. Initially, we utilize YOLOv5s with an attention mechanism to obtain a priori dynamic objects in the scene. Then, feature points are selected using an adaptive feature point selector based on the number of a priori dynamic objects and the percentage of the frame they occupy. Finally, dynamic regions are determined using a geometric method based on Lucas-Kanade optical flow and the RANSAC algorithm. We evaluate the accuracy of our system using the KITTI dataset, comparing it to various dynamic feature point selection strategies and DynaSLAM. Experiments show that our proposed system reduces both absolute trajectory error and relative trajectory error, by up to 39% and 30%, respectively, compared to other systems.
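The optical-flow-plus-RANSAC step described here is a standard construction; below is a minimal OpenCV sketch of that generic idea, with illustrative thresholds rather than the tuned values of the proposed system.

```python
import cv2
import numpy as np

def flag_dynamic_points(prev_gray, cur_gray, prev_pts, epi_thresh=1.0):
    """Track points with Lucas-Kanade optical flow, fit a fundamental matrix
    with RANSAC, and flag the RANSAC outliers as candidate dynamic points.

    prev_pts must be float32 of shape (N, 1, 2), as OpenCV expects.
    """
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, prev_pts, None)
    ok = status.ravel() == 1
    p0 = prev_pts[ok].reshape(-1, 2)
    p1 = cur_pts[ok].reshape(-1, 2)
    # Inliers satisfy the epipolar constraint induced by the camera motion;
    # outliers violate it and are likely on independently moving objects.
    _, mask = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, epi_thresh, 0.99)
    if mask is None:  # too few points or degenerate geometry
        return np.empty((0, 2)), p1
    dynamic = mask.ravel() == 0
    return p1[dynamic], p1[~dynamic]
```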
... Shiqiang Yang et al. [20] used a combined geometric and semantic approach for detecting dynamic object features, which first uses a semantic segmentation network to segment dynamic objects and then rejects the segmented dynamic objects with a geometric-constraint-based method. Wenxin Wu et al. [21] used a low-latency backbone to accelerate the generation of the semantic information necessary for the SLAM system, while proposing a new geometric constraint method to filter dynamic points in image frames, where dynamic features are distinguished by applying random sample consensus (RANSAC) to depth differences. Wanfang Xie et al. [22] proposed a motion detection and segmentation method to improve localization accuracy, using a mask repair method to ensure the integrity of segmented objects and the Lucas-Kanade optical flow (LK optical flow) [23] method for rejecting dynamic points when removing dynamic objects. ...
Article
Full-text available
When building a map of a dynamic environment, simultaneous localization and mapping systems have problems such as poor robustness and inaccurate pose estimation. This paper proposes a new mapping method based on the ORB-SLAM2 algorithm combined with the YOLOv5 network. First, the YOLOv5 network in the tracking thread is used to detect dynamic objects in each frame and to obtain keyframes annotated with dynamic information. Second, the dynamic objects in each image frame are detected using the YOLOv5 network, and the detected dynamic points are rejected. Finally, the global map is constructed using the keyframes after eliminating the highly dynamic objects. Test results on the TUM dataset show that, when the map is constructed in a dynamic environment, the absolute trajectory error of our algorithm is reduced by 97.8% and the relative positional error by 59.7% compared with the ORB-SLAM2 algorithm. The average time consumed to track each image frame is improved by 94.7% compared to DynaSLAM. In terms of real-time performance, this paper's algorithm is significantly better than the comparable dynamic SLAM map-building algorithm DynaSLAM.
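The rejection step itself, dropping feature points that fall inside detected dynamic-object boxes, reduces to a simple containment test; a generic sketch (not the authors' code) follows.

```python
def reject_points_in_boxes(keypoints, boxes):
    """Keep only keypoints lying outside every detected dynamic-object box.

    keypoints : iterable of (x, y) pixel coordinates
    boxes     : iterable of (x1, y1, x2, y2) detection boxes
    """
    def inside(pt, box):
        x, y = pt
        x1, y1, x2, y2 = box
        return x1 <= x <= x2 and y1 <= y <= y2

    return [pt for pt in keypoints if not any(inside(pt, b) for b in boxes)]

# Example: a point on a detected person is dropped, one on the wall is kept.
print(reject_points_in_boxes([(50, 60), (300, 40)], [(0, 0, 100, 100)]))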
Article
Simultaneous Localization and Mapping (SLAM) is one of the fundamental capabilities for intelligent mobile robots to perform state estimation in unknown environments. However, most visual SLAM systems rely on the static scene assumption and consequently have severely reduced accuracy and robustness in dynamic scenes. Moreover, the metric maps constructed by many systems lack semantic information, so the robots cannot understand their surroundings at a human cognitive level. In this paper, we propose SG-SLAM, which is a real-time RGB-D semantic visual SLAM system based on the ORB-SLAM2 framework. First, SG-SLAM adds two new parallel threads: an object detecting thread to obtain 2D semantic information and a semantic mapping thread. Then, a fast dynamic feature rejection algorithm combining semantic and geometric information is added to the tracking thread. Finally, they are published to the ROS system for visualization after generating 3D point clouds and 3D semantic objects in the semantic mapping thread. We performed an experimental evaluation on the TUM dataset, the Bonn dataset, and the OpenLORIS-Scene dataset. The results show that SG-SLAM is not only one of the most real-time, accurate, and robust systems in dynamic scenes, but also allows the creation of intuitive semantic metric maps.
Article
Full-text available
Scene rigidity is a strong assumption in typical visual Simultaneous Localization and Mapping (vSLAM) algorithms. Such a strong assumption limits the usage of most vSLAM systems in dynamic real-world environments, which are the target of several relevant applications such as augmented reality, semantic mapping, unmanned autonomous vehicles, and service robotics. Many solutions have been proposed that use different kinds of semantic segmentation methods (e.g., Mask R-CNN, SegNet) to detect dynamic objects and remove outliers. However, as far as we know, such methods wait for the semantic results in the tracking thread of their architecture, and the processing time depends on the segmentation method used. In this paper, we present RDS-SLAM, a real-time visual dynamic SLAM algorithm that is built on ORB-SLAM3 and adds a semantic thread and a semantic-based optimization thread for robust tracking and mapping in dynamic environments in real time. These novel threads run in parallel with the others, so the tracking thread no longer needs to wait for the semantic information. Besides, we propose an algorithm to obtain the latest semantic information possible, thereby making it possible to use segmentation methods of different speeds in a uniform way. We update and propagate semantic information using the moving probability, which is saved in the map and used to remove outliers from tracking using a data association algorithm. Finally, we evaluate the tracking accuracy and real-time performance using the public TUM RGB-D datasets and a Kinect camera in dynamic indoor scenarios. Source code and demo: https://github.com/yubaoliu/RDS-SLAM.git.
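The moving-probability bookkeeping described above is essentially a per-map-point Bayes update; the sketch below shows one such update under an assumed detection model (the probabilities are placeholders, not RDS-SLAM's values).

```python
def update_moving_probability(prior, labelled_dynamic, p_det=0.9, p_false=0.2):
    """One Bayes update of a map point's probability of being dynamic.

    prior            : current P(point is dynamic)
    labelled_dynamic : whether the latest semantic result labels it dynamic
    p_det            : P(labelled dynamic | point is dynamic)  (assumed)
    p_false          : P(labelled dynamic | point is static)   (assumed)
    """
    if labelled_dynamic:
        num, alt = p_det * prior, p_false * (1 - prior)
    else:
        num, alt = (1 - p_det) * prior, (1 - p_false) * (1 - prior)
    return num / (num + alt)

# Repeated "dynamic" labels push the probability towards 1:
p = 0.5
for _ in range(3):
    p = update_moving_probability(p, True)
print(round(p, 3))  # ≈ 0.989
```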
Article
Full-text available
Deep learning techniques have been widely applied for intelligent fault diagnosis. However, these techniques require large amounts of labeled data from a particular machine, which is demanding for real-world applications. Alternatively, models can be developed based on artificial damages and be applied for industrial data with real damages. In that case, a major challenge arises since the distributions of those artificial and real damages are greatly different, which results in severe performance degradation of conventional deep models. In the present work, a model named deep coupled joint distribution adaptation network (DCJDAN) is proposed to address the large domain discrepancy between artificial and real damages. By utilizing two untied deep convolutional networks, the proposed method allows the source-stream and the target-stream networks to focus on learning domain-representative features, providing flexibility for explicitly modeling the domain discrepancy. To ensure a more effective knowledge transferring, a regulation term is adopted to force the untied coupled networks to stay similar since the source domain and the target domain are related. The joint distribution adaptation module is further adopted to minimize the domain discrepancy, which considers both the marginal and conditional distribution differences and provides more accurate distribution matching. The effectiveness of the proposed method is evaluated based on three bearing datasets with artificial and real damages. As reported, the proposed method achieves an average accuracy of 98.17% for all tasks, which outperforms several state-of-the-art deep domain adaptation models and improves the diagnosis performance compared to the conventional deep learning models.
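Distribution-adaptation modules of this kind commonly quantify domain discrepancy with a kernel maximum mean discrepancy (MMD) term; the sketch below shows a compact RBF-kernel MMD as one plausible building block of such an objective (our illustration, not the DCJDAN code).

```python
import torch

def rbf_mmd(source_feats, target_feats, sigma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel between
    source-domain and target-domain feature batches of shape (N, D), (M, D)."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (k(source_feats, source_feats).mean()
            + k(target_feats, target_feats).mean()
            - 2 * k(source_feats, target_feats).mean())

# Identical batches give zero discrepancy; shifted batches do not.
x = torch.randn(64, 128)
print(float(rbf_mmd(x, x)), float(rbf_mmd(x, x + 2.0)))
```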
Article
Full-text available
As a new fire detection technology, image fire detection has recently played a crucial role in reducing fire losses by alerting users early. Image fire detection is based on an algorithmic analysis of images. However, common detection algorithms, whether based on manually extracted or automatically machine-extracted image features, suffer from lower accuracy, delayed detection, and a large amount of computation. Therefore, novel image fire detection algorithms based on the advanced object-detection CNN models Faster-RCNN, R-FCN, SSD, and YOLO v3 are proposed in this paper. A comparison of the proposed and current algorithms reveals that the accuracy of fire detection algorithms based on object-detection CNNs is higher than that of other algorithms. In particular, the average precision of the algorithm based on YOLO v3 reaches 83.7%, which is higher than that of the other proposed algorithms. Besides, YOLO v3 also has stronger detection robustness, and its detection speed reaches 28 FPS, thereby satisfying the requirements of real-time detection.
Conference Paper
Full-text available
Mapping and localization are essential capabilities of robotic systems. Although the majority of mapping systems focus on static environments, deployment in real-world situations requires them to handle dynamic objects. In this paper, we propose an approach for an RGB-D sensor that is able to consistently map scenes containing multiple dynamic elements. For localization and mapping, we employ efficient direct tracking on the truncated signed distance function (TSDF) and leverage color information encoded in the TSDF to estimate the pose of the sensor. The TSDF is efficiently represented using voxel hashing, with most computations parallelized on a GPU. For detecting dynamics, we exploit the residuals obtained after an initial registration, together with the explicit modeling of free space in the model. We evaluate our approach on existing datasets and provide a new dataset showing highly dynamic scenes. These experiments show that our approach often surpasses other state-of-the-art dense SLAM methods. We make available our dataset with the ground truth for both the trajectory of the RGB-D sensor, obtained by a motion capture system, and the model of the static environment, obtained using a high-precision terrestrial laser scanner. Finally, we release our approach as open source code.
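Direct tracking on a TSDF builds on the standard weighted running-average fusion of truncated signed-distance observations into voxels; a generic per-voxel sketch (not the paper's GPU implementation) follows.

```python
def fuse_tsdf(tsdf, weight, sdf_obs, w_obs=1.0, max_weight=100.0):
    """Standard Curless-Levoy style weighted average that fuses one new
    truncated signed-distance observation into a voxel.

    tsdf, weight : current voxel value and accumulated weight
    sdf_obs      : new truncated signed-distance observation
    """
    fused = (tsdf * weight + sdf_obs * w_obs) / (weight + w_obs)
    return fused, min(weight + w_obs, max_weight)

# Example: an empty voxel (weight 0) takes the first observation directly.
print(fuse_tsdf(0.0, 0.0, 0.3))  # (0.3, 1.0)
```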
Article
Full-text available
Traditional visual simultaneous localization and mapping (SLAM) systems are mostly based on small-area static environments. In recent years, some studies have focused on combining semantic information with visual SLAM. However, most of them struggle to obtain good performance in large-scale dynamic environments, and the accuracy and speed of such systems still need to be strengthened. In this paper, we develop a more efficient semantic SLAM system for a two-wheeled mobile robot, using semantic segmentation to recognize people, chairs, and other objects in every keyframe. With a preliminary understanding of the environment, the system fuses RGB-D camera and encoder information to localize and to create a dense colored octree map without dynamic objects. Besides, to handle the incomplete identification of movable objects, we use image processing algorithms to enhance the semantic segmentation results. In the proposed method, enhanced semantic segmentation in keyframes dramatically increases the efficiency of the system. Moreover, fusing the different sensors greatly raises localization accuracy. We conducted experiments on various datasets and in real environments, comparing with DRE-SLAM and DS-SLAM, to evaluate the performance of the proposed approach. The results suggest we significantly improve the processing efficiency, robustness, and quality of the map.
Article
Full-text available
Real-time surface defect inspection plays an important role in the quality control of automated production. For surface defect inspection, flowing data is a common data form in automated pipeline production. Although flowing data provides rich information for surface defect inspection, it also brings dynamic distribution challenges, such as the domain shift phenomenon and imbalanced training data. However, many existing industrial inspection solutions still use static strategies. To handle the dynamic distribution effects in the data-flow domain, this paper proposes a new deep ensemble learning method with domain fluctuation adaptation. Specifically, a new distribution discrepancy identifier based on estimation of the dataset distribution and data characteristics is proposed. It utilizes the advantages of both deep convolutional neural networks and shallow feature-based learning methods to achieve higher robustness and fine-grained detection in streaming-data scenes. To validate the proposed method, an inspection bench test system forming part of a real industrial surface-mount-technology production line is designed and fabricated. The proposed inference model is successfully applied to an embedded terminal with a hybrid heterogeneous computing architecture. Finally, the method is validated on data collected from the manufacturer. The results suggest that the proposed method possesses a competitive mAP rate with good adaptation and robustness in industrial streaming-data scenes.
Article
Full-text available
Presently, although many impressive SLAM systems have achieved exceptional accuracy in real environments, most of them are verified only in static environments. However, for mobile robots and autonomous driving, dynamic objects in the scene can cause tracking failure or large deviations during pose estimation. In this paper, a general visual SLAM system for dynamic scenes with multiple sensors, called DMS-SLAM, is proposed. First, the combination of GMS and a sliding window is used to initialize the system, which eliminates the influence of dynamic objects and constructs a static initial 3D map. Then, the corresponding 3D points of the current frame in the local map are obtained by reprojection. These points are combined with the constant-speed model or the reference-frame model to estimate the pose of the current frame and update the 3D map points in the local map. Finally, the keyframes selected by the tracking module are combined with the GMS feature matching algorithm to add static 3D map points to the local map. DMS-SLAM implements pose tracking, loop-closure detection and relocalization based on the static 3D map points of the local map, and supports monocular, stereo and RGB-D sensors in dynamic scenes. Exhaustive evaluation on the public TUM and KITTI datasets demonstrates that DMS-SLAM outperforms state-of-the-art visual SLAM systems in accuracy and speed in dynamic scenes.
Article
Full-text available
Simultaneous localization and mapping (SLAM) methods based on an RGB-D camera have been studied and used in robot navigation and perception. So far, most such SLAM methods have been applied to a static environment. However, these methods are incapable of avoiding the drift errors caused by moving objects such as pedestrians, which limits their practical performance in real-world applications. In this paper, a new RGB-D SLAM with moving object detection for dynamic indoor scenes is proposed. The proposed detection method for moving objects is based on mathematical models and geometric constraints, and it can be incorporated into the SLAM process as a data filtering process. In order to verify the proposed method, we conducted sufficient experiments on the public TUM RGB-D dataset and a sequence image dataset from our Kinect V1 camera; both were acquired in common dynamic indoor scenes. The detailed experimental results of our improved RGB-D SLAM were summarized and demonstrate its effectiveness in dynamic indoor scenes.
Article
Visual localization has been well studied in recent decades and applied in many fields as a fundamental capability in robotics. However, the success of the state of the arts usually builds on the assumption that the environment is static. In dynamic scenarios where moving objects are present, the performance of the existing visual localization systems degrades a lot due to the disturbance of the dynamic factors. To address this problem, we propose a novel sparse motion removal (SMR) model that detects the dynamic and static regions for an input frame based on a Bayesian framework. The similarity between the consecutive frames and the difference between the current frame and the reference frame are both considered to reduce the detection uncertainty. After the detection process is finished, the dynamic regions are eliminated while the static ones are fed into a feature-based visual simultaneous localization and mapping (SLAM) system for further visual localization. To verify the proposed method, both qualitative and quantitative experiments are performed and the experimental results have demonstrated that the proposed model can significantly improve the accuracy and robustness for visual localization in dynamic environments.
Article
Localization and mapping in a dynamic scene is a crucial problem for indoor visual Simultaneous Localization and Mapping (SLAM) systems. Most existing visual odometry (VO) or SLAM systems are based on the assumption that the environment is static, and their performance may degenerate when operated in a severely dynamic environment. This assumption limits the application of RGB-D SLAM in dynamic environments. In this paper, we propose a workflow to accurately segment objects, which are marked as potentially dynamic areas based on semantic information. A novel approach for motion detection and removal from the moving camera is introduced. We integrate the semantics-based motion detection and segmentation approach with an RGB-D SLAM system. To evaluate the effectiveness of the proposed approach, we conduct experiments on the challenging dynamic sequences of the TUM RGB-D datasets. Experimental results suggest that our approach improves localization accuracy and outperforms state-of-the-art dynamic-removal-based SLAM systems in both severely dynamic and slightly dynamic scenes.