Conference Paper

A Deep Learning Framework for Robust Semantic SLAM

Authors:
  • Dubai Future Labs

... It takes monocular color images as input and simultaneously outputs the pose trajectory, depth map, and three-dimensional point cloud. In [23], a novel deep learning-based approach was proposed to minimize the semantic SLAM estimation error by identifying different noise patterns and reducing them. ...
Article
Full-text available
Simultaneous localization and mapping (SLAM) is considered a key technique in augmented reality (AR), robotics and unmanned driving. In the field of SLAM, solutions based on monocular sensors have gradually become important due to their ability to recognize more environmental information with simple structures and low costs. Feature-based ORB-SLAM is popular in many applications, but it has many limitations in complex indoor scenes. Firstly, camera pose estimation based on monocular images is greatly affected by the environment; secondly, monocular images lack scale information and cannot be used to obtain image depth information; thirdly, monocular-based SLAM builds a fused map of feature points that lacks semantic information, which is incomprehensible to machines. To solve the aforementioned issues, this paper proposes an SDF-SLAM model based on deep learning, which can perform camera pose estimation in a wider indoor environment and can also perform depth estimation and semantic segmentation on monocular images to obtain an understandable three-dimensional map. SDF-SLAM is tested and verified using a CPU platform and two sets of indoor scenes. The results show that the average accuracy of the predicted point cloud coordinates reaches 90%, and the average accuracy of the semantic labels reaches 67%. Moreover, compared with state-of-the-art SLAM frameworks such as ORB-SLAM, LSD-SLAM, and CNN-SLAM, the absolute error of the camera trajectory on indoor data with more feature points is reduced from 0.436 m, 0.495 m, and 0.243 m, respectively, to 0.037 m. On indoor data with fewer feature points, the errors decrease from 1.826 m, 1.206 m, and 0.264 m, respectively, to 0.124 m.
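The three-dimensional semantic map described above can be understood as a back-projection of the predicted depth map and per-pixel semantic labels through the pinhole camera model. The sketch below illustrates that step only; it is not the authors' implementation, and the intrinsics (fx, fy, cx, cy) and array names are assumptions for illustration.

```python
import numpy as np

def backproject_semantic_cloud(depth, labels, fx, fy, cx, cy):
    """Back-project an HxW depth map and per-pixel semantic labels into an
    (N, 4) array of [X, Y, Z, label] points via the pinhole model.
    Assumes depth is in metres and invalid pixels have depth <= 0."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid coordinates
    z = depth
    x = (u - cx) * z / fx                           # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy                           # Y = (v - cy) * Z / fy
    valid = z > 0
    return np.stack([x[valid], y[valid], z[valid],
                     labels[valid].astype(float)], axis=1)

# Hypothetical usage with network outputs and illustrative intrinsics:
# cloud = backproject_semantic_cloud(pred_depth, pred_labels, 525.0, 525.0, 319.5, 239.5)
```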
... Hence, it aids the optimisation process and induces better model generalisation. A stacked auto-encoder (SAE) architecture was used as an alternative to the Boltzmann machine in a pre-training approach [29]. ...
Article
Full-text available
Ti-6Al-2Sn-4Zr-6Mo is one of the most important titanium alloys, characterised by its high strength, fatigue, and toughness properties, making it a popular material for aerospace and biomedical applications. However, no studies have been reported on processing this alloy using laser powder bed fusion. In this paper, a deep learning neural network (DLNN) was introduced to rationalise and predict the densification and hardness due to laser powder bed fusion of Ti-6Al-2Sn-4Zr-6Mo alloy. The process optimisation results showed that near-full densification is achieved in Ti-6Al-2Sn-4Zr-6Mo alloy samples fabricated using an energy density of 77–113 J/mm³. Furthermore, the hardness of the builds was found to increase with increasing laser energy density. Porosity and hardness measurements were found to be sensitive to the island size, especially at high energy density. Hot isostatic pressing (HIP) was able to eliminate the porosity, increase the hardness, and achieve the desirable α and β phases. The developed model was validated and used to produce process maps. The trained deep learning neural network model showed the highest accuracy, with mean percentage errors of 3% and 0.2% for porosity and hardness, respectively. The results showed that deep learning neural networks can be an efficient tool for predicting material properties from small data sets.
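For context, energy density in laser powder bed fusion is commonly quoted as the volumetric energy density E = P / (v · h · t) in J/mm³, where P is laser power (W), v scan speed (mm/s), h hatch spacing (mm) and t layer thickness (mm). The abstract does not state which formula the authors used, so the sketch below only illustrates this standard arithmetic with invented parameter values.

```python
def volumetric_energy_density(power_w, speed_mm_s, hatch_mm, layer_mm):
    """Volumetric energy density E = P / (v * h * t) in J/mm^3."""
    return power_w / (speed_mm_s * hatch_mm * layer_mm)

# Illustrative values only, chosen to fall inside the 77-113 J/mm^3 window quoted above:
print(volumetric_energy_density(275, 1100, 0.10, 0.03))  # ≈ 83.3 J/mm^3
```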
Article
Full-text available
In the evolving landscape of modern robotics, Visual SLAM (V-SLAM) has emerged over the past two decades as a powerful tool, empowering robots with the ability to navigate and map their surroundings. While these methods are traditionally confined to static environments, there has been a growing interest in developing V-SLAM to handle dynamic and realistic scenes. This survey offers a comprehensive overview of the current state-of-the-art V-SLAM methods, including their strengths and weaknesses. The paper also identifies the limitations of existing techniques and proposes potential research directions for future advancements. In addition, it provides an overview of commonly used datasets to evaluate the performance of V-SLAM methods. This survey offers valuable insights into areas that need additional research to benefit V-SLAM development, including challenges related to limited scalability for systems with multiple agents, sensitivity to lighting changes, high computational cost, and performance issues in noisy environments.
Chapter
Artificial intelligence and additive manufacturing are primary drivers of Industry 4.0, which is reshaping the manufacturing industry. Based on the progressive layer-by-layer principle, additive manufacturing allows for the manufacturing of mechanical parts with a high degree of complexity. In this chapter, a deep learning neural network (DLNN) is introduced to rationalize the effect of cellular structure design factors as well as process variables on physical and mechanical properties utilizing laser powder bed fusion. The models developed were validated and utilized to create process maps. For both design and process optimization, the trained deep learning neural network model showed the highest accuracy. Deep learning neural networks were found to be an effective technique for predicting material properties from limited data sets, as per the findings.
Article
Full-text available
The application of deep learning in robotics leads to very specific problems and research questions that are typically not addressed by the computer vision and machine learning communities. In this paper, we discuss a number of robotics-specific learning, reasoning, and embodiment challenges for deep learning. We explain the need for better evaluation metrics, highlight the importance and unique challenges of deep robotic learning in simulation, and explore the spectrum between purely data-driven and model-driven approaches. We hope this paper provides a motivating overview of important research directions to overcome the current limitations and help fulfill the promising potential of deep learning in robotics.
Article
Full-text available
In this paper, a study of the odometric system for the autonomous cart Verdino, an electric vehicle based on a golf cart, is presented. A mathematical model of the odometric system is derived from the cart's equations of motion and is used to compute the vehicle's position and orientation. The inputs of the system are the odometry encoders, and the model uses the wheel diameters and the distance between the wheels as parameters. With this model, a least-squares minimization is performed to obtain the best nominal parameters. The model is then updated with a real-time wheel diameter measurement, improving the accuracy of the results. A neural network model is also used to learn the odometric model from data. Tests are performed using this neural network in several configurations, and the results are compared to the mathematical model, showing that the neural network can outperform the first proposed model.
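The mathematical odometry model described above is, in essence, the standard differential-drive dead-reckoning update. The sketch below shows how encoder increments, wheel diameters and the track width map to a pose increment; the variable names and units are assumptions for illustration, not the Verdino code.

```python
import math

def odometry_update(x, y, theta, ticks_l, ticks_r,
                    ticks_per_rev, diam_l, diam_r, track_width):
    """Differential-drive dead reckoning.
    ticks_*: encoder increments since the last update;
    diam_*: wheel diameters [m]; track_width: distance between wheels [m]."""
    dist_l = math.pi * diam_l * ticks_l / ticks_per_rev   # left wheel arc length
    dist_r = math.pi * diam_r * ticks_r / ticks_per_rev   # right wheel arc length
    d = (dist_l + dist_r) / 2.0                           # forward displacement
    dtheta = (dist_r - dist_l) / track_width              # heading change
    x += d * math.cos(theta + dtheta / 2.0)               # midpoint integration
    y += d * math.sin(theta + dtheta / 2.0)
    theta += dtheta
    return x, y, theta
```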
Article
Full-text available
Traditional approaches to stereo visual SLAM rely on point features to estimate the camera trajectory and build a map of the environment. In low-textured environments, though, it is often difficult to find a sufficient number of reliable point features and, as a consequence, the performance of such algorithms degrades. This paper proposes PL-SLAM, a stereo visual SLAM system that combines both points and line segments to work robustly in a wider variety of scenarios, particularly in those where point features are scarce or not well-distributed in the image. PL-SLAM leverages both points and segments at all the instances of the process: visual odometry, keyframe selection, bundle adjustment, etc. We also contribute a loop-closure procedure through a novel bag-of-words approach that exploits the combined descriptive power of the two kinds of features. Additionally, the resulting map is richer and more diverse in 3D elements, which can be exploited to infer valuable, high-level scene structures like planes, empty spaces, ground plane, etc. (not addressed in this work). Our proposal has been tested with several popular datasets (such as KITTI and EuRoC), and is compared to state-of-the-art methods like ORB-SLAM, revealing superior performance in most of the experiments, while still running in real time. An open source version of the PL-SLAM C++ code will be released for the benefit of the community.
Article
Full-text available
Online learning methods are popular for visual tracking because of their robust performance on most video sequences. However, the drifting problem caused by noisy updates is still a challenge for most highly adaptive online classifiers. In visual tracking, target appearance variation, such as deformation and long-term occlusion, easily causes noisy updates. To overcome this problem, a new real-time occlusion-aware visual tracking algorithm is introduced. Firstly, we learn a novel two-stage classifier with circulant structure with kernels, named Integrated Circulant Structure Kernels (ICSK). The first stage is applied for translation estimation and the second is used for scale estimation. The circulant structure allows our algorithm to realize fast learning and detection. Then, the ICSK is used to detect the target without occlusion and to build a classifier pool to save these classifiers with noisy updates. When the target is in heavy occlusion or after long-term occlusion, we re-detect it using an optimal classifier selected from the classifier pool according to an entropy minimization criterion. Extensive experimental results on the full benchmark of 56 challenging videos demonstrate that our real-time algorithm achieves better performance than state-of-the-art methods in terms of quantitative and qualitative results.
Article
Full-text available
Simultaneous Localization and Mapping (SLAM) consists in the concurrent construction of a representation of the environment (the map), and the estimation of the state of the robot moving within it. The SLAM community has made astonishing progress over the last 30 years, enabling large-scale real-world applications, and witnessing a steady transition of this technology to industry. We survey the current state of SLAM. We start by presenting what is now the de-facto standard formulation for SLAM. We then review related work, covering a broad set of topics including robustness and scalability in long-term mapping, metric and semantic representations for mapping, theoretical performance guarantees, active SLAM and exploration, and other new frontiers. The paper serves as a tutorial for the non-expert reader. It is also a position paper: by looking at the published research with a critical eye, we delineate open challenges and new research issues that still deserve careful scientific investigation. The paper also contains the authors' take on two questions that often animate discussions during robotics conferences: do robots need SLAM? Is SLAM solved?
Article
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of bounding box priors over different aspect ratios and scales per feature map location. At prediction time, the network generates confidences that each prior corresponds to objects of interest and produces adjustments to the prior to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals, such as R-CNN and MultiBox, because it completely discards the proposal generation step and encapsulates all the computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the ILSVRC DET and PASCAL VOC datasets confirm that SSD has comparable performance with methods that utilize an additional object proposal step and yet is 100-1000x faster. Compared to other single stage methods, SSD has similar or better performance, while providing a unified framework for both training and inference.
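To make the notion of bounding-box priors concrete, the sketch below generates a grid of default boxes over a single feature map at a few aspect ratios and one scale. The sizes, ratios and function name are illustrative assumptions, not the exact SSD configuration.

```python
import itertools
import math

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate (cx, cy, w, h) prior boxes, normalised to [0, 1],
    for one fmap_size x fmap_size feature map at the given scale."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx = (j + 0.5) / fmap_size          # box centre at the cell centre
        cy = (i + 0.5) / fmap_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

priors = default_boxes(fmap_size=8, scale=0.2)
print(len(priors))  # 8 * 8 * 3 = 192 priors for this single feature map
```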
Article
Full-text available
Visual ego-motion estimation, or briefly visual odometry (VO), is one of the key building blocks of modern SLAM systems. In the last decade, impressive results have been demonstrated in the context of visual navigation, reaching very high localization performance. However, all ego-motion estimation systems require careful parameter tuning procedures for the specific environment they have to work in. Furthermore, even in ideal scenarios, most state-of-the-art approaches fail to handle image anomalies and imperfections, which results in less robust estimates. VO systems that rely on geometrical approaches extract sparse or dense features and match them to perform frame-to-frame (F2F) motion estimation. However, images contain much more information that can be used to further improve the F2F estimation. To learn new feature representation, a very successful approach is to use deep convolutional neural networks. Inspired by recent advances in deep networks and by previous work on learning methods applied to VO, we explore the use of convolutional neural networks to learn both the best visual features and the best estimator for the task of visual ego-motion estimation. With experiments on publicly available datasets, we show that our approach is robust with respect to blur, luminance, and contrast anomalies and outperforms most state-of-the-art approaches even in nominal conditions.
Article
Full-text available
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Article
Full-text available
We present a robust and real-time monocular six-degree-of-freedom relocalization system. Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need for additional engineering or graph optimisation. The algorithm can operate indoors and outdoors in real time, taking 5 ms per frame to compute. It obtains approximately 2 m and 3 degree accuracy for large scale outdoor scenes and 0.5 m and 5 degree accuracy indoors. This is achieved using an efficient 23 layer deep convnet, demonstrating that convnets can be used to solve complicated out-of-image-plane regression problems. This was made possible by leveraging transfer learning from large scale classification data. We show the convnet localizes from high level features and is robust to difficult lighting, motion blur and different camera intrinsics where point-based SIFT registration fails. Furthermore, we show how the pose feature that is produced generalizes to other scenes, allowing us to regress pose with only a few dozen training examples.
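A hedged sketch of the kind of pose-regression objective such a network uses: translation error plus a β-weighted quaternion error. The tensor names, β value and batch reduction are assumptions for illustration; the paper's exact loss may differ in detail.

```python
import torch

def pose_loss(t_pred, q_pred, t_true, q_true, beta=500.0):
    """||t - t_hat||_2 + beta * ||q - q_hat / ||q_hat|| ||_2 over a batch,
    where beta balances metres against (unitless) quaternion error."""
    q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)   # normalise predicted quaternion
    return ((t_pred - t_true).norm(dim=-1)
            + beta * (q_pred - q_true).norm(dim=-1)).mean()
```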
Conference Paper
Full-text available
In this paper, we present a novel benchmark for the evaluation of RGB-D SLAM systems. We recorded a large set of image sequences from a Microsoft Kinect with highly accurate and time-synchronized ground truth camera poses from a motion capture system. The sequences contain both the color and depth images in full sensor resolution (640 × 480) at video frame rate (30 Hz). The ground-truth trajectory was obtained from a motion-capture system with eight high-speed tracking cameras (100 Hz). The dataset consists of 39 sequences that were recorded in an office environment and an industrial hall. The dataset covers a large variety of scenes and camera motions. We provide sequences for debugging with slow motions as well as longer trajectories with and without loop closures. Most sequences were recorded from a handheld Kinect with unconstrained 6-DOF motions but we also provide sequences from a Kinect mounted on a Pioneer 3 robot that was manually navigated through a cluttered indoor environment. To stimulate the comparison of different approaches, we provide automatic evaluation tools both for the evaluation of drift of visual odometry systems and the global pose error of SLAM systems. The benchmark website [1] contains all data, detailed descriptions of the scenes, specifications of the data formats, sample code, and evaluation tools.
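The global pose error mentioned above is typically reported as the absolute trajectory error (ATE). The sketch below gives a minimal version of the computation, assuming the estimated and ground-truth positions are already time-associated: a least-squares rigid alignment followed by the RMSE of the residuals. It mirrors the standard evaluation recipe rather than the benchmark's own tool.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rigid alignment (rotation + translation) of est onto gt,
    both (N, 3) arrays of associated positions; standard SVD (Kabsch/Umeyama) solution."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_g - R @ mu_e
    return est @ R.T + t

def ate_rmse(est_xyz, gt_xyz):
    """Root-mean-square absolute trajectory error after rigid alignment."""
    aligned = align_rigid(est_xyz, gt_xyz)
    return float(np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1))))
```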
Article
Full-text available
This study proposes a mathematical uncertainty model for the spatial measurement of visual features using Kinect™ sensors. This model can provide qualitative and quantitative analysis for the utilization of Kinect™ sensors as 3D perception sensors. In order to achieve this objective, we derived the propagation relationship of the uncertainties between the disparity image space and the real Cartesian space with the mapping function between the two spaces. Using this propagation relationship, we obtained the mathematical model for the covariance matrix of the measurement error, which represents the uncertainty for spatial position of visual features from Kinect™ sensors. In order to derive the quantitative model of spatial uncertainty for visual features, we estimated the covariance matrix in the disparity image space using collected visual feature data. Further, we computed the spatial uncertainty information by applying the covariance matrix in the disparity image space and the calibrated sensor parameters to the proposed mathematical model. This spatial uncertainty model was verified by comparing the uncertainty ellipsoids for spatial covariance matrices and the distribution of scattered matching visual features. We expect that this spatial uncertainty model and its analyses will be useful in various Kinect™ sensor applications.
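As a worked illustration of the propagation relationship described above (not the paper's exact derivation), the covariance in (u, v, d) disparity-image space can be pushed into Cartesian space through the first-order approximation Σ_xyz ≈ J Σ_uvd Jᵀ, where the back-projection Z = f·b/d, X = (u − cx)·Z/f, Y = (v − cy)·Z/f is the usual disparity model. The focal length f, baseline b and principal point (cx, cy) in the sketch are assumed calibration values.

```python
import numpy as np

def cartesian_covariance(u, v, d, cov_uvd, f, b, cx, cy):
    """First-order propagation Sigma_xyz ≈ J * Sigma_uvd * J^T for the mapping
    Z = f*b/d, X = (u-cx)*Z/f, Y = (v-cy)*Z/f, with J = d[X,Y,Z]/d[u,v,d]."""
    Z = f * b / d
    J = np.array([
        [Z / f, 0.0,   -(u - cx) * b / d**2],   # dX/du, dX/dv, dX/dd
        [0.0,   Z / f, -(v - cy) * b / d**2],   # dY/du, dY/dv, dY/dd
        [0.0,   0.0,   -f * b / d**2],          # dZ/du, dZ/dv, dZ/dd
    ])
    return J @ cov_uvd @ J.T
```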
Article
Full-text available
We show how to use “complementary priors” to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
Article
This paper studies visual odometry (VO) from the perspective of deep learning. After tremendous efforts in the robotics and computer vision communities over the past few decades, state-of-the-art VO algorithms have demonstrated incredible performance. However, since the VO problem is typically formulated as a pure geometric problem, one of the key features still missing from current VO systems is the capability to automatically gain knowledge and improve performance through learning. In this paper, we investigate whether deep neural networks can be effective and beneficial to the VO problem. An end-to-end, sequence-to-sequence probabilistic visual odometry (ESP-VO) framework is proposed for the monocular VO based on deep recurrent convolutional neural networks. It is trained and deployed in an end-to-end manner, that is, directly inferring poses and uncertainties from a sequence of raw images (video) without adopting any modules from the conventional VO pipeline. It can not only automatically learn effective feature representation encapsulating geometric information through convolutional neural networks, but also implicitly model sequential dynamics and relation for VO using deep recurrent neural networks. Uncertainty is also derived along with the VO estimation without introducing much extra computation. Extensive experiments on several datasets representing driving, flying and walking scenarios show competitive performance of the proposed ESP-VO to the state-of-the-art methods, demonstrating a promising potential of the deep learning technique for VO and verifying that it can be a viable complement to current VO systems.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
We propose a direct (feature-less) monocular SLAM algorithm which, in contrast to current state-of-the-art direct methods, allows building large-scale, consistent maps of the environment. Along with highly accurate pose estimation based on direct image alignment, the 3D environment is reconstructed in real time as a pose graph of keyframes with associated semi-dense depth maps. These are obtained by filtering over a large number of pixelwise small-baseline stereo comparisons. The explicitly scale-drift-aware formulation allows the approach to operate on challenging sequences including large variations in scene scale. Major enablers are two key novelties: (1) a novel direct tracking method which operates on sim(3), thereby explicitly detecting scale-drift, and (2) an elegant probabilistic solution to include the effect of noisy depth values into tracking. The resulting direct monocular SLAM system runs in real time on a CPU.
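In direct methods of this kind, tracking minimises a photometric error over the semi-dense depth map rather than a feature reprojection error. The sketch below gives the general form of such an objective under assumed notation; the variance-based weighting used by the actual system is simplified away.

```latex
% Photometric alignment of the current image I against the keyframe I_ref:
% \omega(p, d_p, \xi) warps pixel p, with inverse depth d_p, under the motion \xi \in \mathrm{sim}(3);
% \Omega_{\mathrm{ref}} is the set of pixels carrying semi-dense depth, \|\cdot\|_{\delta} a robust (Huber) norm.
E(\xi) \;=\; \sum_{p \in \Omega_{\mathrm{ref}}}
  \left\| \, I_{\mathrm{ref}}(p) \;-\; I\!\big(\omega(p,\, d_p,\, \xi)\big) \right\|_{\delta}
```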
Article
We present a novel method to fuse the power of deep networks with the computational efficiency of geometric and probabilistic localization algorithms. In contrast to other methods that completely replace a classical visual estimator with a deep network, we propose an approach that uses a convolutional neural network to learn difficult-to-model corrections to the estimator from ground-truth training data. To this end, we derive a novel loss function for learning SE(3) corrections based on a matrix Lie groups approach, with a natural formulation for balancing translation and rotation errors. We use this loss to train a Deep Pose Correction network (DPC-Net) that predicts corrections for a particular estimator, sensor and environment. Using the KITTI odometry dataset, we demonstrate significant improvements to the accuracy of a computationally-efficient sparse stereo visual odometry pipeline, that render it as accurate as a modern computationally-intensive dense estimator. Further, we show how DPC-Net can be used to mitigate the effect of poorly calibrated lens distortion parameters.
Article
We present ORB-SLAM2, a complete SLAM system for monocular, stereo and RGB-D cameras, including map reuse, loop closing and relocalization capabilities. The system works in real time on standard CPUs in a wide variety of environments, from small hand-held indoor sequences to drones flying in industrial environments and cars driving around a city. Our back-end, based on bundle adjustment with monocular and stereo observations, allows for accurate trajectory estimation with metric scale. Our system includes a lightweight localization mode that leverages visual odometry tracks for unmapped regions and matches to map points that allow for zero-drift localization. The evaluation on 29 popular public sequences shows that our method achieves state-of-the-art accuracy, being in most cases the most accurate SLAM solution. We publish the source code, not only for the benefit of the SLAM community, but with the aim of being an out-of-the-box SLAM solution for researchers in other fields.
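The bundle adjustment back-end mentioned above jointly refines keyframe poses and map points by minimising a robust reprojection error. A simplified sketch of the standard objective, with notation assumed rather than taken from the paper, is:

```latex
% Keyframe poses T_k \in \mathrm{SE}(3) and map points X_j minimise the reprojection error over
% all observations (k, j): \pi is the (monocular or stereo) camera projection, x_{kj} the measured
% keypoint, \Sigma_{kj} its scale-dependent covariance, and \rho a robust (Huber) cost.
\min_{\{T_k\},\, \{X_j\}} \;\; \sum_{(k,j)}
  \rho\!\left( \left\| x_{kj} - \pi\!\left( T_k X_j \right) \right\|^{2}_{\Sigma_{kj}} \right)
```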
Article
We present YOLO, a unified pipeline for object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is also extremely fast; YOLO processes images in real time at 45 frames per second, hundreds to thousands of times faster than existing detection systems. Our system uses global image context to detect and localize objects, making it less prone to background errors than top detection systems like R-CNN. By itself, YOLO detects objects at unprecedented speeds with moderate accuracy. When combined with state-of-the-art detectors, YOLO boosts performance by 2-3 mAP points.
Article
In this paper we address three different computer vision tasks using a single basic architecture: depth prediction, surface normal estimation, and semantic labeling. We use a multiscale convolutional network that is able to adapt easily to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.
Article
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
In recent years, deep neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.
Conference Paper
This paper reports on methods for incorporating camera calibration uncertainty into a two-view sparse bundle adjustment (SBA) framework. The co-registration of two images is useful in mobile robotics for determining motion over time. These camera measurements can constrain a robot's relative poses so that the trajectory and map can be estimated in a technique known as simultaneous localization and mapping (SLAM). Here, we comment on the importance of propagating uncertainty in both feature extraction and camera calibration in visual pose-graph SLAM. We derive an improved pose covariance estimate that leverages the Unscented Transform, and compare its performance to previous methods in both simulated and experimental trials. The two experiments reported here involve data from a camera mounted on a KUKA robotic arm (where a precise ground-truth trajectory is available) and a Hovering Autonomous Underwater Vehicle (HAUV) for large-scale autonomous ship hull inspection.
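A minimal sketch of the Unscented Transform step underlying the pose covariance estimate described above: sigma points drawn from the input covariance are pushed through a nonlinear function, and the output mean and covariance are recovered from weighted sums. The scaling parameters and function interface are assumptions for illustration, not those of the paper.

```python
import numpy as np

def unscented_transform(mean, cov, f, alpha=1e-3, beta=2.0, kappa=0.0):
    """Propagate (mean, cov) through a nonlinear function f via the UT.
    f must map an n-vector to a 1-D output vector."""
    n = mean.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * cov)            # matrix square root
    sigma = np.vstack([mean, mean + S.T, mean - S.T])  # 2n+1 sigma points
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))     # mean weights
    wc = wm.copy()                                     # covariance weights
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    Y = np.array([f(s) for s in sigma])                # push points through f
    mean_y = wm @ Y
    diff = Y - mean_y
    cov_y = (wc[:, None] * diff).T @ diff
    return mean_y, cov_y
```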
Article
High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
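A hedged, minimal sketch of the autoencoder idea described above: an encoder compresses the input to a small central code and a decoder reconstructs it, trained with a reconstruction loss. The layer sizes, and the use of a modern optimiser in place of the paper's layer-wise pre-training followed by fine-tuning, are assumptions made for brevity.

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    """784-d input (e.g. a flattened 28x28 image) compressed to a 30-d code."""
    def __init__(self, in_dim=784, code_dim=30):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim), nn.Sigmoid())
    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                        # dummy batch standing in for real data
loss = nn.functional.mse_loss(model(x), x)     # reconstruction loss
loss.backward()
opt.step()
```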
Deep object pose estimation for semantic robotic grasping of household objects
  • J Tremblay
  • T To
  • B Sundaralingam
  • Y Xiang
  • D Fox
  • S Birchfield
Searching for activation functions
  • P Ramachandran
  • B Zoph
  • Q V Le
Machine Learning Algorithms from Scratch: With Python
  • J Brownlee