Chapter

Online Segmentation of LiDAR Sequences: Dataset and Algorithm


Abstract

Roof-mounted spinning LiDAR sensors are widely used by autonomous vehicles. However, most semantic datasets and algorithms used for LiDAR sequence segmentation operate on 360° frames, causing an acquisition latency incompatible with real-time applications. To address this issue, we first introduce HelixNet, a 10-billion-point dataset with fine-grained labels, timestamps, and sensor rotation information necessary to accurately assess the real-time readiness of segmentation algorithms. Second, we propose Helix4D, a compact and efficient spatio-temporal transformer architecture specifically designed for rotating LiDAR sequences. Helix4D operates on acquisition slices corresponding to a fraction of a full sensor rotation, significantly reducing the total latency. Helix4D reaches accuracy on par with the best segmentation algorithms on HelixNet and SemanticKITTI with a reduction of over 5× in latency and 50× in model size. The code and data are available at: https://romainloiseau.fr/helixnet
Keywords: LiDAR · Transformer · Autonomous driving · Real-time · Online segmentation
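The abstract describes Helix4D as operating on acquisition slices that cover only a fraction of a full sensor rotation. As a rough illustration of that idea (not the authors' released implementation, which is linked above), the sketch below splits one sweep into fixed angular sectors by azimuth so each sector can be processed as soon as the sensor has swept it; the function name and the choice of nine slices are illustrative assumptions.

```python
import numpy as np

def split_rotation_into_slices(points, timestamps, n_slices=9):
    """Group the points of one sensor rotation into angular slices.

    A minimal sketch: points are binned by the azimuth of their (x, y)
    coordinates, so each slice spans 360 / n_slices degrees and can be
    segmented as soon as it is acquired, instead of waiting for the full
    360-degree frame.
    """
    azimuth = np.degrees(np.arctan2(points[:, 1], points[:, 0])) % 360.0
    slice_id = np.floor(azimuth / (360.0 / n_slices)).astype(int)
    return [(points[slice_id == s], timestamps[slice_id == s])
            for s in range(n_slices)]

# Toy usage: 1000 random points of one sweep with uniform timestamps.
pts = np.random.randn(1000, 3)
ts = np.linspace(0.0, 0.1, 1000)
for i, (p, t) in enumerate(split_rotation_into_slices(pts, ts)):
    print(f"slice {i}: {len(p)} points")
```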


... To circumvent the prohibitive memory and compute footprint of 3D convolutions on dense voxel grids, SparseConvNet [104] and MinkowskiNet [58] operate on sparse representations, yielding voxel-based models capable of processing large scenes while maintaining high resolution. More recently, hybrid point cloud representations [207,209] have also helped reduce memory cost, while capturing both fine-grained local details and contextual information. ...
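For readers unfamiliar with sparse voxel representations, the following minimal numpy sketch shows the core idea exploited by SparseConvNet- and MinkowskiNet-style models: only occupied voxels are stored (integer coordinates plus pooled features), so memory scales with the number of non-empty cells rather than with a dense grid. The function and voxel size are illustrative and do not reproduce either library's API.

```python
import numpy as np

def sparse_voxelize(points, features, voxel_size=0.1):
    """Keep only occupied voxels: integer coordinates of non-empty cells
    and the average of the features of the points falling inside them."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse, minlength=len(uniq)).astype(float)
    pooled = np.stack([np.bincount(inverse, weights=features[:, c],
                                   minlength=len(uniq)) / counts
                       for c in range(features.shape[1])], axis=1)
    return uniq, pooled

pts = np.random.rand(10000, 3) * 50.0   # points spread over a 50 m cube
feats = np.random.rand(10000, 4)        # e.g. intensity + RGB
voxels, voxel_feats = sparse_voxelize(pts, feats)
print(voxels.shape, voxel_feats.shape)  # far fewer cells than a dense grid
```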
... Additionally, we leverage a self-attention mechanism to capture the relationships between superpoints at multiple scales, leading to state-of-the-art performance. As the expressivity of deep learning models increases rapidly, so do their complexity and resource requirements [96]. In particular, vision transformers have demonstrated remarkable results for 3D point cloud semantic segmentation [359,244,109,175,209], but their high computational requirements make them challenging to train effectively. Additionally, these models rely on regular grids or point samplings, which do not adapt to the varying complexity of 3D data: the same computational effort is allocated everywhere, regardless of the local geometry or radiometry of the point cloud. ...
... Over the last few years, deep learning approaches for 3D point clouds have garnered considerable interest [110]. Autonomous driving, in particular, has been the focus of numerous studies, resulting in multiple proposed approaches for object detection [351,7], as well as semantic [366,209,355], instance [361,363], and panoptic segmentation [15,89,365,225]. However, these methods consider sequences of sparse LiDAR acquisition, and focus on a small set of classes (pedestrians, cars). ...
Thesis
Full-text available
Over the past decade, deep learning has advanced the analysis of text, image, audio, and video. More recently, transformers and self-supervised learning have triggered a global competition to train gigantic models on Internet-scale datasets, with massive computational resources. This thesis deals with large-scale 3D point cloud analysis and adopts a different approach focused on efficiency. We introduce methods which improve several aspects of the state-of-the-art: faster training, fewer parameters, smaller compute or memory footprint, and better utilization of realistically-available data. In doing so, we strive to devise solutions towards a more frugal and accessible Artificial Intelligence (AI). We first introduce a 3D semantic segmentation model that combines the efficiency of superpoint-based methods with the expressivity of transformers. We build a hierarchical data representation which allows us to drastically accelerate the parsing of large 3D point clouds. Our network proves to match or even surpass state-of-the-art approaches on a range of sensors and acquisition environments, while boasting orders of magnitude fewer parameters, with faster training and inference. We then build on this framework to tackle panoptic segmentation of large-scale 3D point clouds. Existing instance and panoptic segmentation methods do not scale well to large scenes with numerous objects because the computation of their loss function implies a costly matching step between true and predicted instances. Instead, we frame this task as a scalable graph clustering problem, which a small network is trained to address from local objectives only, without computing the actual object instances at train time. Our lightweight model can process ten-million-point scenes at once on a single GPU in a few seconds, opening the door to 3D panoptic segmentation at unprecedented scales. Finally, we propose to exploit the complementarity of image and point cloud modalities to enhance 3D scene understanding. We place ourselves in a realistic acquisition setting where multiple arbitrarily-located images observe the same scene, with potential occlusions. Unlike previous 2D-3D fusion approaches, we learn to select information from various views of the same object based on their respective observation conditions: camera-to-object distance, occlusion rate, optical distortion, etc. Our efficient implementation achieves state-of-the-art results both in indoor and outdoor settings, with minimal requirements: raw point clouds, arbitrarily-positioned images, and their camera poses. Overall, this thesis upholds the principle that for settings with limited data availability, exploiting the structure of the problem unlocks both efficient and performant architectures.
... As the expressivity of deep learning models increases rapidly, so do their complexity and resource requirements [13]. In particular, vision transformers have demonstrated remarkable results for 3D point cloud semantic segmentation [56,37,16,23,32], but their high computational requirements make them challenging to train effectively. Additionally, these models rely on regular grids or point samplings, which do not adapt to the varying complexity of 3D data: the same computational effort is allocated everywhere, regardless of the local geometry or radiometry of the point cloud. ...
... 3D Vision Transformers. Following their adoption for image processing [9,30], Transformer architectures [51] designed explicitly for 3D analysis have shown promising results in terms of performance [56,16] and speed [37,32]. In particular, the Stratified Transformer of Lai et al. uses a specific sampling scheme [23] to model long-range interactions. ...
Conference Paper
Full-text available
We introduce a novel superpoint-based transformer architecture for efficient semantic segmentation of large-scale 3D scenes. Our method incorporates a fast algorithm to partition point clouds into a hierarchical superpoint structure, which makes our preprocessing 7 times faster than existing superpoint-based approaches. Additionally, we leverage a self-attention mechanism to capture the relationships between superpoints at multiple scales, leading to state-of-the-art performance on three challenging benchmark datasets: S3DIS (76.0% mIoU 6-fold validation), KITTI-360 (63.5% on Val), and DALES (79.6%). With only 212k parameters, our approach is up to 200 times more compact than other state-of-the-art models while maintaining similar performance. Furthermore, our model can be trained on a single GPU in 3 hours for a fold of the S3DIS dataset, which is 7× to 70× fewer GPU-hours than the best-performing methods. Our code and models are accessible at github.com/drprojects/superpoint_transformer.
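A minimal sketch of the superpoint idea described above, assuming a precomputed partition (the paper's fast geometric partition algorithm is not reproduced here): point features are mean-pooled into superpoints, a standard multi-head attention layer exchanges context between the superpoints, and the result is broadcast back to the points.

```python
import torch
from torch import nn

class SuperpointAttention(nn.Module):
    """Pool point features into superpoints, run self-attention between the
    superpoints, and scatter the contextualised descriptors back to points."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, point_feats, superpoint_ids):
        n_sp = int(superpoint_ids.max()) + 1
        # Mean-pool point features into one descriptor per superpoint.
        sp = torch.zeros(n_sp, point_feats.shape[1])
        sp.index_add_(0, superpoint_ids, point_feats)
        counts = torch.bincount(superpoint_ids, minlength=n_sp).clamp(min=1)
        sp = sp / counts.unsqueeze(1)
        # Self-attention between superpoints (a single "batch" of superpoints).
        ctx, _ = self.attn(sp[None], sp[None], sp[None])
        # Broadcast each superpoint's contextualised descriptor to its points.
        return ctx[0][superpoint_ids]

feats = torch.randn(5000, 64)
ids = torch.randint(0, 100, (5000,))             # hypothetical partition into 100 superpoints
print(SuperpointAttention()(feats, ids).shape)   # torch.Size([5000, 64])
```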
... As the expressivity of deep learning models rapidly increases, so do their complexity and resource requirements [15]. In particular, vision transformers have demonstrated remarkable results for 3D point cloud semantic segmentation [62,42,18,26,37], but their high computational requirements make them challenging to train effectively. Additionally, these models rely on regular grids or point samplings, which do not adapt to the varying complexity of 3D data: the same computational effort is allocated everywhere, regardless of the local geometry or radiometry of the point cloud. ...
... 3D Vision Transformers. Following their adoption for image processing [10,35], Transformer architectures [57] designed explicitly for 3D analysis have shown promising results in terms of performance [62,18] and speed [42,37]. In particular, the Stratified Transformer of Lai et al. uses a specific sampling scheme [26] enabling it to model long-range interactions. ...
Preprint
Full-text available
We introduce a novel superpoint-based transformer architecture for efficient semantic segmentation of large-scale 3D scenes. Our method incorporates a fast algorithm to partition point clouds into a hierarchical superpoint structure, which makes our preprocessing 7 times faster than existing superpoint-based approaches. Additionally, we leverage a self-attention mechanism to capture the relationships between superpoints at multiple scales, leading to state-of-the-art performance on three challenging benchmark datasets: S3DIS (76.0% mIoU 6-fold validation), KITTI-360 (63.5% on Val), and DALES (79.6%). With only 212k parameters, our approach is up to 200 times more compact than other state-of-the-art models while maintaining similar performance. Furthermore, our model can be trained on a single GPU in 3 hours for a fold of the S3DIS dataset, which is 7x to 70x fewer GPU-hours than the best-performing methods. Our code and models are accessible at github.com/drprojects/superpoint_transformer.
... LSS methods classify points into K classes, for which dense supervision is available during training. Prior works focus on developing strong encoder-decoder-based architectures for sparse 3D data (Qi et al., 2016; Thomas et al., 2019; Yan et al., 2018; Choy et al., 2019; Tang et al., 2020; Zhu et al., 2021; Loiseau et al., 2022; Ye et al., 2021), the fusion of information obtained from different 3D representations (Alonso et al., 2020), or neural architecture search (Tang et al., 2020). On the other hand, LPS methods (Sirohi et al., 2021; Gasperini et al., 2021) must additionally segment instances of thing classes. ...
Article
Full-text available
Addressing Lidar Panoptic Segmentation (LPS) is crucial for safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make an unrealistic assumption that the semantic class vocabulary is fixed in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered to be unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class-specific instance segmentation methods and obtains state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to a Region Proposal Network (Ren et al., NeurIPS 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
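The final step described above, computing a cut in the weighted hierarchical tree of point segments, can be illustrated with a small recursive sketch; the scoring rule and data structures below are simplified assumptions rather than the paper's exact formulation.

```python
def cut_hierarchy(node, scores, children):
    """Return the segments selected under `node`: keep the node itself when
    its score beats the average score of the best cut of its children,
    otherwise recurse into the children."""
    kids = children.get(node, [])
    if not kids:
        return [node], scores[node]
    segments, child_scores = [], []
    for k in kids:
        segs, s = cut_hierarchy(k, scores, children)
        segments += segs
        child_scores.append(s)
    best_children = sum(child_scores) / len(child_scores)
    if scores[node] >= best_children:
        return [node], scores[node]
    return segments, best_children

# Toy hierarchy: node 0 is the root, nodes 3-6 are leaves.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
scores = {0: 0.2, 1: 0.9, 2: 0.4, 3: 0.6, 4: 0.5, 5: 0.7, 6: 0.8}
print(cut_hierarchy(0, scores, children)[0])   # [1, 5, 6]
```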
... Despite advancements in LiDAR scanning technology and the rise of large-scale 3D point cloud datasets, capturing a comprehensive view of exterior architectural structures remains a challenge [21,41]. Specifically, outdoor roadway-level datasets [18,20,22,28,48,49] often abstract (large-sized) buildings as single entities, omitting finer details of their exteriors due to the sensory gap [13]. Although a recent surge of façade-level datasets [24,42,46] has aimed to address this problem by identifying the architectural parts on the buildings' vertical and planar wall surfaces, ground-level and rooftop-level exterior structures are disregarded. ...
Preprint
Full-text available
Precise segmentation of architectural structures provides detailed information about various building components, enhancing our understanding of and interaction with our built environment. Nevertheless, existing outdoor 3D point cloud datasets provide only limited detailed annotations of architectural exteriors due to privacy concerns and the expensive costs of data acquisition and annotation. To overcome this shortfall, this paper introduces a semantically enriched, photo-realistic 3D architectural model dataset and benchmark for semantic segmentation. It features real-world buildings of four different purposes as well as an open architectural landscape in Hong Kong. Each point cloud is annotated into one of 14 semantic classes.
... Despite advancements in LiDAR scanning technology and the rise of large-scale 3D point cloud datasets, capturing a comprehensive view of exterior architectural structures remains a challenge [21,41]. Specifically, outdoor roadway-level datasets [18,20,22,28,48,49] often abstract (large-sized) buildings as single entities, omitting finer details of their exteriors due to the sensory gap [13]. Although a recent surge of façade-level datasets [24,42,46] has aimed to address this problem by identifying the architectural parts on the buildings' vertical and planar wall surfaces, ground-level and rooftop-level exterior structures are disregarded. ...
Conference Paper
Full-text available
Precise segmentation of architectural structures provides detailed information about various building components, enhancing our understanding of and interaction with our built environment. Nevertheless, existing outdoor 3D point cloud datasets provide only limited detailed annotations of architectural exteriors due to privacy concerns and the expensive costs of data acquisition and annotation. To overcome this shortfall, this paper introduces a semantically enriched, photo-realistic 3D architectural model dataset and benchmark for semantic segmentation. It features real-world buildings of four different purposes as well as an open architectural landscape in Hong Kong. Each point cloud is annotated into one of 14 semantic classes.
... Over the last few years, deep learning approaches for 3D point clouds have garnered considerable interest [16]. Autonomous driving, in particular, has been the focus of numerous studies, resulting in multiple proposed approaches for object detection [2,76], as well as semantic [41,77,82], instance [79,80] and panoptic segmentation [5,15,43,81]. However, these methods consider sequences of sparse LiDAR acquisition, and focus on a small set of classes (pedestrians, cars). ...
Conference Paper
Full-text available
We introduce a highly efficient method for panoptic segmentation of large 3D point clouds by redefining this task as a scalable graph clustering problem. This approach can be trained using only local auxiliary tasks, thereby eliminating the resource-intensive instance-matching step during training. Moreover, our formulation can easily be adapted to the superpoint paradigm, further increasing its efficiency. This allows our model to process scenes with millions of points and thousands of objects in a single inference. Our method, called SuperCluster, achieves a new state-of-the-art panoptic segmentation performance for two indoor scanning datasets: 50.1 PQ (+7.8) for S3DIS Area 5, and 58.7 PQ (+25.2) for ScanNetV2. We also set the first state-of-the-art for two large-scale mobile mapping benchmarks: KITTI-360 and DALES. With only 209k parameters, our model is over 30 times smaller than the best-competing method and trains up to 15 times faster. Our code and pretrained models are available at https://github.com/drprojects/superpoint_transformer.
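To make the graph-clustering view of panoptic segmentation concrete, here is a minimal sketch (with hypothetical inputs) that recovers instances as connected components of an affinity-thresholded graph over superpoints; SuperCluster's actual formulation and optimization are more involved.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def cluster_instances(edges, affinity, n_nodes, threshold=0.5):
    """Keep the edges whose predicted affinity exceeds a threshold and return
    the connected components of the remaining graph as object instances."""
    keep = affinity > threshold
    rows, cols = edges[keep, 0], edges[keep, 1]
    adj = coo_matrix((np.ones(keep.sum()), (rows, cols)),
                     shape=(n_nodes, n_nodes))
    _, labels = connected_components(adj, directed=False)
    return labels

# Toy graph over 6 superpoints; two edges fall below the affinity threshold.
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
aff = np.array([0.9, 0.8, 0.1, 0.95, 0.2])
print(cluster_instances(edges, aff, 6))   # [0 0 0 1 1 2]
```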
... You can find the recommendation results in Figures 21 and 22. The remote sensing resources depicted in Figures 21 and 22 are sourced from the literature [55][56][57][58][59][60][61][62][63][64][65][66]. ...
Article
Full-text available
Technological progress has led to significant advancements in Earth observation and satellite systems. However, some services associated with remote sensing face issues related to timeliness and relevance, which affect the application of remote sensing resources in various fields and disciplines. The challenge now is to help end-users make precise decisions and recommend relevant resources that meet the demands of their specific domains from the vast array of remote sensing resources available. In this study, we propose a remote sensing resource service recommendation model that incorporates a time-aware dual LSTM neural network with similarity graph learning. We further use stream-push technology to enhance the model. We first construct interaction history behavior sequences based on users' resource search history. Then, we establish a category similarity relationship graph structure based on the cosine similarity matrix between remote sensing resource categories. Next, we use LSTM to represent historical sequences and Graph Convolutional Networks (GCN) to represent graph structures. We construct similarity relationship sequences by combining historical sequences to explore exact similarity relationships using LSTM. We embed user IDs to model users' unique characteristics. By combining these three modeling approaches, we achieve precise recommendations for remote sensing services. Finally, we conduct experiments to evaluate our methods using three datasets, and the experimental results show that our method outperforms the state-of-the-art algorithms.
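A rough PyTorch sketch of the architecture sketched above (an LSTM over the interaction history, a graph convolution over the category-similarity graph, and a user-ID embedding); all dimensions, the single-layer GCN, and the dot-product scoring are illustrative assumptions rather than the paper's configuration.

```python
import torch
from torch import nn

class TimeAwareRecommender(nn.Module):
    def __init__(self, n_users, n_categories, dim=32):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories, dim)
        self.user_emb = nn.Embedding(n_users, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.gcn_weight = nn.Linear(dim, dim)

    def forward(self, user_ids, history, sim_adj):
        # One graph-convolution step over the category-similarity graph.
        deg = sim_adj.sum(-1, keepdim=True).clamp(min=1e-6)
        cat = torch.relu(self.gcn_weight((sim_adj / deg) @ self.cat_emb.weight))
        # LSTM over the embedded interaction history; keep the last hidden state.
        _, (h, _) = self.lstm(self.cat_emb(history))
        user_state = h[-1] + self.user_emb(user_ids)
        # Score every category for every user.
        return user_state @ cat.T

model = TimeAwareRecommender(n_users=100, n_categories=50)
scores = model(torch.tensor([3, 7]),              # two users
               torch.randint(0, 50, (2, 10)),     # their last 10 interactions
               torch.rand(50, 50))                # cosine-similarity graph
print(scores.shape)                               # torch.Size([2, 50])
```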
... Finally, the KITTI-360 dataset [12] contains semantic segmentation and bounding box annotations in both image and LiDAR modalities (Velodyne HDL-64E and a SICK LMS 200 laser). HelixNet, a recent point cloud semantic segmentation dataset acquired with a Velodyne HDL-64E LiDAR and focusing mainly on French locations, was released by [13] (see Table I). It was created with an affordable but efficient mapping system, which is modular and usable with any vehicle. ...
Preprint
Full-text available
Autonomous driving (AD) perception today relies heavily on deep learning based architectures requiring large scale annotated datasets with their associated costs for curation and annotation. The 3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization. We propose a new dataset, Navya 3D Segmentation (Navya3DSeg), with a diverse label space corresponding to a large scale production grade operational domain, including rural, urban, industrial sites and universities from 13 countries. It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds. We also propose a novel method for sequential dataset split generation based on iterative multi-label stratification, which we demonstrate achieves a +1.2% mIoU improvement over the original split proposed by the SemanticKITTI dataset. A complete benchmark for the semantic segmentation task was performed with state-of-the-art methods. Finally, we demonstrate an active learning (AL) based dataset distillation framework. We introduce a novel heuristic-free sampling method called distance sampling in the context of AL. A detailed presentation on the dataset is available at https://www.youtube.com/watch?v=5m6ALIs-s20 .
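The split-generation idea, iterative multi-label stratification over sequences, can be approximated with a simple greedy sketch: each sequence is assigned to the split whose class distribution is furthest below its target. This heuristic is an illustrative stand-in for the actual algorithm.

```python
import numpy as np

def greedy_stratified_split(class_histograms, val_fraction=0.2):
    """Greedy approximation of a stratified train/val split of sequences,
    given one per-class point-count histogram per sequence (rows)."""
    total = class_histograms.sum(axis=0)
    target = {"train": total * (1 - val_fraction), "val": total * val_fraction}
    current = {s: np.zeros_like(total, dtype=float) for s in target}
    assignment = {}
    # Assign the sequences with the largest single-class counts first.
    order = np.argsort(-class_histograms.max(axis=1))
    for seq in order:
        deficit = {s: np.maximum(target[s] - current[s], 0).sum() for s in target}
        split = max(deficit, key=deficit.get)
        assignment[int(seq)] = split
        current[split] += class_histograms[seq]
    return assignment

hist = np.random.randint(0, 1000, size=(23, 19))   # 23 sequences, 19 classes
print(greedy_stratified_split(hist))
```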
Preprint
Full-text available
Facade semantic segmentation is a long-standing challenge in photogrammetry and computer vision. Although the last decades have witnessed an influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA, we introduce Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with real-world challenging classes and a uniform comparison of methods. Realizing the LoFG, we present the largest semantic 3D facade segmentation dataset to date, providing 601 million annotated points at five and 15 classes of LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.
Article
LiDAR is an essential sensor for autonomous driving, collecting precise geometric information about a scene. As the performance of various LiDAR perception tasks has improved, generalization to new environments and sensors has emerged to test these optimized models in real-world conditions. Unfortunately, the various annotation strategies of data providers complicate the computation of cross-domain performance. This paper provides a novel dataset, ParisLuco3D, specifically designed for cross-domain evaluation to make it easier to evaluate performance using various source datasets. Alongside the dataset, online benchmarks for LiDAR semantic segmentation, LiDAR object detection, and LiDAR tracking are provided to ensure a fair comparison across methods. The ParisLuco3D dataset, evaluation scripts, and links to the benchmarks can be found at the following website: https://npm3d.fr/parisluco3d
Article
Domain adaptive LiDAR point cloud segmentation aims to learn a target segmentation model from labeled source point clouds and unlabelled target point clouds, which has recently attracted increasing attention due to the various challenges in point cloud annotation. However, its performance is still very constrained, as most existing studies do not adequately capture the data-specific characteristics of LiDAR point clouds. Inspired by the observation that the domain discrepancy of LiDAR point clouds is highly correlated with point density, we design a density-aware self-training (DAST) technique that introduces point density into the self-training framework for domain adaptive point cloud segmentation. DAST consists of two novel and complementary designs. The first is density-aware pseudo labelling, which introduces point density for accurate pseudo labelling of target data and effective self-supervised network retraining. The second is density-aware consistency regularization, which encourages learning density-invariant representations by enforcing target predictions to be consistent across points of different densities. Extensive experiments over multiple large-scale public datasets show that DAST achieves superior domain adaptation performance as compared with the state-of-the-art.
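A minimal sketch of what density-aware pseudo-labelling could look like: per-point density is estimated from k-nearest-neighbour distances and the confidence threshold is relaxed for sparse points, which a fixed threshold would otherwise filter out disproportionately. The thresholding rule and parameters are illustrative assumptions, not the paper's method.

```python
import numpy as np
from scipy.spatial import cKDTree

def density_aware_pseudo_labels(points, probs, k=16,
                                base_threshold=0.9, relax=0.1):
    """Return pseudo-labels, with -1 marking points left unlabelled."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)           # first neighbour is the point itself
    density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-6)
    sparse = density < np.median(density)
    threshold = np.where(sparse, base_threshold - relax, base_threshold)
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[conf < threshold] = -1
    return labels

pts = np.random.rand(2000, 3) * 40.0
probs = np.random.dirichlet(np.ones(10), size=2000)  # stand-in softmax outputs
print((density_aware_pseudo_labels(pts, probs) >= 0).sum(), "points pseudo-labelled")
```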
Article
Domain adaptive LiDAR point cloud segmentation aims to learn an effective target segmentation model from labelled source data and unlabelled target data, which has attracted increasing attention in recent years due to the difficulty in point-cloud annotation. It remains a very open research challenge as point clouds of different domains often have clear distribution discrepancies with variations in LiDAR sensor configurations, environmental conditions, occlusions, etc. We design a simple yet effective spatial consistency training framework that can learn superior domain-invariant feature representations from unlabelled target point clouds. The framework exploits three types of spatial consistency, namely, geometric-transform consistency, sparsity consistency, and mixing consistency which capture the semantic invariance of point clouds with respect to viewpoint changes, sparsity changes, and local context changes, respectively. With a concise mean teacher learning strategy, our experiments show that the proposed spatial consistency training outperforms the state-of-the-art significantly and consistently across multiple public benchmarks. Code is available at https://github.com/xiaoaoran/sct-uda .
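The geometric-transform consistency described above can be sketched as a mean-teacher loop in a few lines of PyTorch; the tiny point-wise MLP, the rotation-only augmentation, and the MSE consistency loss are placeholders for the paper's actual backbone and objectives.

```python
import copy
import math
import random
import torch
from torch import nn

def ema_update(teacher, student, momentum=0.99):
    # Mean teacher: exponential moving average of the student weights.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def consistency_step(student, teacher, points):
    """Semantic predictions should not change when the unlabelled target
    cloud is rotated around the vertical axis."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    rot = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                        [math.sin(theta),  math.cos(theta), 0.0],
                        [0.0,              0.0,             1.0]])
    with torch.no_grad():
        teacher_probs = teacher(points).softmax(-1)
    student_probs = student(points @ rot.T).softmax(-1)
    return nn.functional.mse_loss(student_probs, teacher_probs)

student = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 10))
teacher = copy.deepcopy(student)
loss = consistency_step(student, teacher, torch.randn(4096, 3))
loss.backward()
ema_update(teacher, student)
print(float(loss))
```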
Article
Autonomous driving (AD) perception today relies heavily on deep learning based architectures requiring large scale annotated datasets with their associated costs for curation and annotation. The 3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization. We propose a new dataset, Navya 3D Segmentation (Navya3DSeg), with a diverse label space corresponding to a large scale production grade operational domain, including rural, urban, industrial sites and universities from 13 countries. It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds. We also propose a novel method for sequential dataset split generation based on iterative multi-label stratification, which we demonstrate achieves a +1.2% mIoU improvement over the original split proposed by the SemanticKITTI dataset. A complete benchmark for the semantic segmentation task was performed with state-of-the-art methods. Finally, we demonstrate an Active Learning (AL) based dataset distillation framework. We introduce a novel heuristic-free sampling method called ego-pose distance based sampling in the context of AL. A detailed presentation on the dataset is available here: https://www.youtube.com/watch?v=5m6ALIs-s20 .
Chapter
Full-text available
Deepfakes are the result of digital manipulation to forge realistic yet fake imagery. With the astonishing advances in deep generative models, fake images or videos are nowadays obtained using variational autoencoders (VAEs) or Generative Adversarial Networks (GANs). These technologies are becoming more accessible and accurate, resulting in fake videos that are very difficult to detect. Traditionally, Convolutional Neural Networks (CNNs) have been used to perform video deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we focus on video deepfake detection on faces, given that most methods are becoming extremely accurate in the generation of realistic human faces. Specifically, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining comparable results with some very recent methods that use Vision Transformers. Differently from the state-of-the-art approaches, we use neither distillation nor ensemble methods. Furthermore, we present a straightforward inference procedure based on a simple voting scheme for handling multiple faces in the same video shot. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state-of-the-art on the DeepFake Detection Challenge (DFDC). The code for reproducing our results is publicly available here: https://tinyurl.com/cnn-vit-dfd.
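The voting scheme for multiple faces in a shot reduces to a few lines; the aggregation below (average fake probability compared against a threshold) is one plausible reading of the description, not necessarily the exact rule used.

```python
import numpy as np

def video_is_fake(face_fake_probs, threshold=0.5):
    """Each detected face casts a vote (its predicted fake probability);
    the video is flagged when the average vote exceeds the threshold."""
    return float(np.mean(face_fake_probs)) > threshold

# Fake probabilities predicted for 7 faces detected across one video shot.
print(video_is_fake([0.2, 0.9, 0.8, 0.7, 0.6, 0.9, 0.4]))   # True
```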
Article
Full-text available
The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on Transformer, which achieves huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that the PCT achieves the state-of-the-art performance on shape classification, part segmentation, semantic segmentation, and normal estimation tasks.
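The farthest point sampling step used to build PCT's input embedding is a standard algorithm and can be written out directly:

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already-selected set,
    yielding well-spread centres for local neighbourhood embedding."""
    selected = [0]                      # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(dist.argmax())
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)

pts = np.random.rand(4096, 3)
centres = farthest_point_sampling(pts, 512)
print(centres.shape)                    # (512,)
```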
Chapter
Full-text available
Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely. Given the limited hardware resources, existing 3D perception models are not able to recognize small instances (e.g., pedestrians, cyclists) very well due to the low-resolution voxelization and aggressive downsampling. To this end, we propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch. With negligible overhead, this point-based branch is able to preserve the fine details even from large outdoor scenes. To explore the spectrum of efficient 3D models, we first define a flexible architecture design space based on SPVConv, and we then present 3D Neural Architecture Search (3D-NAS) to search the optimal network architecture over this diverse design space efficiently and effectively. Experimental results validate that the resulting SPVNAS model is fast and accurate: it outperforms the state-of-the-art MinkowskiNet by 3.3%, ranking 1st on the competitive SemanticKITTI leaderboard. It also achieves 8-23× computation reduction and 3× measured speedup over MinkowskiNet and KPConv with higher accuracy. Finally, we transfer our method to 3D object detection, and it achieves consistent improvements over the one-stage detection baseline on KITTI.
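A minimal sketch of the point-voxel idea behind SPVConv: a coarse voxel branch (reduced here to average pooling rather than sparse convolutions) provides context, a per-point MLP keeps fine detail, and the two are fused point-wise after scattering the voxel features back onto the points. Shapes and the toy MLP are illustrative assumptions.

```python
import torch
from torch import nn

def fuse_point_and_voxel(points, point_feats, voxel_size=0.2):
    """Sum coarse voxelised features (context) with a high-resolution
    point-wise branch (detail)."""
    coords = torch.floor(points / voxel_size).long()
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    voxel_feats = torch.zeros(len(uniq), point_feats.shape[1])
    voxel_feats.index_add_(0, inverse, point_feats)
    counts = torch.bincount(inverse, minlength=len(uniq)).clamp(min=1)
    voxel_feats = voxel_feats / counts.unsqueeze(1)       # average pooling per voxel
    point_branch = nn.Sequential(nn.Linear(point_feats.shape[1], 32), nn.ReLU(),
                                 nn.Linear(32, point_feats.shape[1]))
    # Devoxelise (gather back to points) and fuse with the point branch.
    return voxel_feats[inverse] + point_branch(point_feats)

pts = torch.rand(8192, 3) * 30.0
feats = torch.randn(8192, 16)
print(fuse_point_and_voxel(pts, feats).shape)   # torch.Size([8192, 16])
```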
Conference Paper
Full-text available
Semantic segmentation of large-scale outdoor point clouds is essential for urban scene understanding in various applications, especially autonomous driving and urban high-definition (HD) mapping. With rapid developments of mobile laser scanning (MLS) systems, massive point clouds are available for scene understanding, but publicly accessible large-scale labeled datasets, which are essential for developing learning-based methods, are still limited. This paper introduces Toronto-3D, a large-scale urban outdoor point cloud dataset acquired by a MLS system in Toronto, Canada for semantic segmentation. This dataset covers approximately 1 km of point clouds and consists of about 78.3 million points with 8 labeled object classes. Baseline experiments for semantic segmentation were conducted and the results confirmed the capability of this dataset to train deep learning models effectively. Toronto-3D is released to encourage new research, and the labels will be improved and updated with feedback from the research community.
Conference Paper
Full-text available
Semantic scene understanding is important for various applications. In particular, self-driving cars need a fine-grained understanding of the surfaces and objects in their vicinity. Light detection and ranging (LiDAR) provides precise geometric information about the environment and is thus a part of the sensor suites of almost all self-driving cars. Despite the relevance of semantic scene understanding for this application, there is a lack of a large dataset for this task which is based on an automotive LiDAR. In this paper, we introduce a large dataset to propel research on laser-based semantic segmentation. We annotated all sequences of the KITTI Vision Odometry Benchmark and provide dense point-wise annotations for the complete 360° field-of-view of the employed automotive LiDAR. We propose three benchmark tasks based on this dataset: (i) semantic segmentation of point clouds using a single scan, (ii) semantic segmentation using multiple past scans, and (iii) semantic scene completion, which requires to anticipate the semantic scene in the future. We provide baseline experiments and show that there is a need for more sophisticated models to efficiently tackle these tasks. Our dataset opens the door for the development of more advanced methods, but also provides plentiful data to investigate new research directions.
Article
Full-text available
Lidar imaging systems are one of the hottest topics in the optronics industry. The need to sense the surroundings of every autonomous vehicle has pushed forward a race dedicated to deciding the final solution to be implemented. However, the diversity of state-of-the-art approaches to the solution brings a large uncertainty on the decision of the dominant final solution. Furthermore, the performance data of each approach often arise from different manufacturers and developers, which usually have some interest in the dispute. Within this paper, we intend to overcome the situation by providing an introductory, neutral overview of the technology linked to lidar imaging systems for autonomous vehicles, and its current state of development. We start with the main single-point measurement principles utilized, which then are combined with different imaging strategies, also described in the paper. An overview of the features of the light sources and photodetectors specific to lidar imaging systems most frequently used in practice is also presented. Finally, a brief section on pending issues for lidar development in autonomous vehicles has been included, in order to present some of the problems which still need to be solved before implementation may be considered as final. The reader is provided with a detailed bibliography containing both relevant books and state-of-the-art papers for further progress in the subject.
Conference Paper
Full-text available
Scene understanding of full-scale 3D models of an urban area remains a challenging task. While advanced computer vision techniques offer cost-effective approaches to analyse 3D urban elements, a precise and densely labelled dataset is quintessential. This paper presents the first-ever labelled dataset for a highly dense Aerial Laser Scanning (ALS) point cloud at city scale. This work introduces a novel benchmark dataset that includes a manually annotated point cloud of over 260 million laser scanning points into approximately 100,000 assets from the 2015 Dublin LiDAR point cloud (Laefer et al.). Objects are labelled into 13 classes using hierarchical levels of detail, from large (i.e. building, vegetation and ground) to refined (i.e. window, door and tree) elements. To validate the performance of our dataset, two different applications are showcased. Firstly, the labelled point cloud is employed for training Convolutional Neural Networks (CNNs) to classify urban elements. The dataset is tested on well-known state-of-the-art CNNs (i.e. PointNet, PointNet++ and So-Net). Secondly, the complete ALS dataset is applied as detailed ground truth for city-scale image-based 3D reconstruction.
Conference Paper
Full-text available
Scene parsing aims to assign a class (semantic) label to each pixel in an image. It is a comprehensive analysis of an image. Given the rise of autonomous driving, pixel-accurate environmental perception is expected to be a key enabling technical piece. However, providing a large-scale dataset for the design and evaluation of scene parsing algorithms, in particular for outdoor scenes, has been difficult. The per-pixel labelling process is prohibitively expensive, limiting the scale of existing ones. In this paper, we present a large-scale open dataset, ApolloScape, that consists of RGB videos and corresponding dense 3D point clouds. Compared with existing datasets, our dataset has the following unique properties. The first is its scale: our initial release contains over 140K images, each with its per-pixel semantic mask, and up to 1M is scheduled. The second is its complexity: captured in various traffic conditions, the number of moving objects averages from tens to over one hundred. The third is the 3D attribute: each image is tagged with high-accuracy pose information at cm accuracy and the static background point cloud has mm relative accuracy. We are able to label this many images with an interactive and efficient labelling pipeline that utilizes the high-quality 3D point cloud. Moreover, our dataset also contains different lane markings based on lane colors and styles. We expect our new dataset can deeply benefit various autonomous driving related applications, including but not limited to 2D/3D scene understanding, localization, transfer learning, and driving simulation.
Article
Full-text available
This paper introduces a new Urban Point Cloud Dataset for Automatic Segmentation and Classification acquired by Mobile Laser Scanning (MLS). We describe how the dataset is obtained, from acquisition to post-processing and labeling. This dataset can be used to train classification algorithms; moreover, since great attention has been paid to the split between the different objects, it can also be used to learn segmentation. The dataset consists of around 2 km of MLS point clouds acquired in two cities. The number of points and the range of classes make us consider that it can be used to train deep learning methods. We also show some results of automatic segmentation and classification. The dataset is available at: http://caor-mines-paristech.fr/fr/paris-lille-3d-dataset/
Conference Paper
Full-text available
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
Article
Full-text available
The objective of the TerraMobilita/iQmulus 3D urban analysis benchmark is to evaluate the current state of the art in urban scene analysis from mobile laser scanning (MLS) at large scale. A very detailed semantic tree for urban scenes is proposed. We call analysis the capacity of a method to separate the points of the scene into these categories (classification), and to separate the different objects of the same type for object classes (detection). A very large ground truth is produced manually in two steps using advanced editing tools developed especially for this benchmark. Based on this ground truth, the benchmark aims at evaluating the classification, detection and segmentation quality of the submitted results.
Article
Full-text available
We present a novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, we recorded 6 hours of traffic scenarios at 10–100 Hz using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The scenarios are diverse, capturing real-world traffic situations, and range from freeways over rural areas to inner-city scenes with many static and dynamic objects. Our data is calibrated, synchronized and timestamped, and we provide the rectified and raw image sequences. Our dataset also contains object labels in the form of 3D tracklets, and we provide online benchmarks for stereo, optical flow, object detection and other tasks. This paper describes our recording platform, the data format and the utilities that we provide.
Article
Full-text available
In this article, we present a hybrid laser-image 3D mobile mapping system that makes it possible to acquire spatial data infrastructures meeting the needs of a wide range of applications, from immersive multimedia navigation to web-based 3D metrology. We detail the design of the system, its sensors, its architecture and its calibration, as well as a web service offering 3D capture through a SaaS (Software as a Service) tool, allowing anyone to enrich their own databases according to their needs. We also address data anonymization, namely the detection and blurring of license plates, which is an unavoidable step before distributing such data on the Internet through consumer applications.
Article
Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a ‘soft’ convolutional inductive bias. We initialize the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT (Touvron et al., 2020, arXiv:2012.12877) on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analyzing how it has escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly at https://github.com/facebookresearch/convit .
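A simplified GPSA layer can be sketched as follows; the positional term is reduced to a learned bias over token pairs and the gate initialisation is only indicative, so this illustrates the gating mechanism rather than reproducing the ConViT implementation.

```python
import torch
from torch import nn

class GatedPositionalSelfAttention(nn.Module):
    """Each head mixes a content attention map with a purely positional one
    through a learned per-head gate."""

    def __init__(self, dim, n_tokens, heads=4):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.pos_bias = nn.Parameter(torch.randn(heads, n_tokens, n_tokens))
        self.gate = nn.Parameter(torch.ones(heads))   # initially favours position
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                             # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.heads, self.dk).transpose(1, 2)
        k = k.view(B, N, self.heads, self.dk).transpose(1, 2)
        v = v.view(B, N, self.heads, self.dk).transpose(1, 2)
        content = (q @ k.transpose(-2, -1) / self.dk ** 0.5).softmax(-1)
        positional = self.pos_bias.softmax(-1)[None]  # (1, heads, N, N)
        g = torch.sigmoid(self.gate).view(1, -1, 1, 1)
        attn = (1 - g) * content + g * positional
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

x = torch.randn(2, 49, 64)                            # e.g. 7x7 patch tokens
print(GatedPositionalSelfAttention(64, 49)(x).shape)  # torch.Size([2, 49, 64])
```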
Article
For the last few decades, several major subfields of artificial intelligence including computer vision, graphics, and robotics have progressed largely independently from each other. Recently, however, the community has realized that progress towards robust intelligent systems such as self-driving cars requires a concerted effort across the different fields. This motivated us to develop KITTI-360, successor of the popular KITTI dataset. KITTI-360 is a suburban driving dataset which comprises richer input modalities, comprehensive semantic instance annotations and accurate localization to facilitate research at the intersection of vision, graphics and robotics. For efficient annotation, we created a tool to label 3D scenes with bounding primitives and developed a model that transfers this information into the 2D image domain, resulting in over 150k images and 1B 3D points with coherent semantic instance annotations across 2D and 3D. Moreover, we established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics on the same dataset, e.g., semantic scene understanding, novel view synthesis and semantic SLAM. KITTI-360 will enable progress at the intersection of these research areas and thus contribute towards solving one of today's grand challenges: the development of fully autonomous self-driving systems.
Chapter
In this paper, we introduce SalsaNext for the uncertainty-aware semantic segmentation of a full 3D LiDAR point cloud in real-time. SalsaNext is the next version of SalsaNet [1] which has an encoder-decoder architecture where the encoder unit has a set of ResNet blocks and the decoder part combines upsampled features from the residual blocks. In contrast to SalsaNet, we introduce a new context module, replace the ResNet encoder blocks with a new residual dilated convolution stack with gradually increasing receptive fields and add the pixel-shuffle layer in the decoder. Additionally, we switch from stride convolution to average pooling and also apply central dropout treatment. To directly optimize the Jaccard index, we further combine the weighted cross entropy loss with Lovász-Softmax loss [4]. We finally inject a Bayesian treatment to compute the epistemic and aleatoric uncertainties for each point in the cloud. We provide a thorough quantitative evaluation on the Semantic-KITTI dataset [3], which demonstrates that the proposed SalsaNext outperforms other published semantic segmentation networks and achieves 3.6% more accuracy over the previous state-of-the-art method. We also release our source code (https://github.com/TiagoCortinhal/SalsaNext).
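The residual dilated convolution stack with gradually increasing receptive fields can be illustrated with a small PyTorch block operating on range images; the channel counts, dilation rates and fusion layer are assumptions and do not reproduce the exact SalsaNext block.

```python
import torch
from torch import nn

class ResidualDilatedBlock(nn.Module):
    """Parallel 3x3 convolutions with increasing dilation grow the receptive
    field without downsampling; their outputs are concatenated, projected
    back to the input width, and added residually."""

    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                          nn.BatchNorm2d(channels), nn.LeakyReLU())
            for d in dilations])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(out)

# Range-image features: batch of 2, 32 channels, 64 beams x 512 azimuth bins.
x = torch.randn(2, 32, 64, 512)
print(ResidualDilatedBlock(32)(x).shape)   # torch.Size([2, 32, 64, 512])
```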
Chapter
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.
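The set-based loss hinges on a bipartite matching between predictions and ground-truth objects; a minimal sketch with scipy's Hungarian solver is shown below, using simplified cost terms (classification score plus L1 box distance, no generalised IoU) relative to DETR.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions_to_targets(pred_boxes, pred_probs, gt_boxes, gt_labels,
                                 l1_weight=1.0):
    """Build a (queries x targets) cost matrix and solve the assignment."""
    cls_cost = -pred_probs[:, gt_labels]                       # higher prob -> lower cost
    box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
    cost = cls_cost + l1_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx))

preds = np.random.rand(100, 4)                      # 100 object queries
probs = np.random.dirichlet(np.ones(92), size=100)  # 91 classes + "no object"
gts = np.random.rand(5, 4)                          # 5 ground-truth boxes
labels = np.random.randint(0, 91, size=5)
print(match_predictions_to_targets(preds, probs, gts, labels))
```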
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
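The core operation of the Transformer, scaled dot-product attention, can be written out directly from the paper's formula softmax(QK^T / sqrt(d_k)) V:

```python
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head attention: weight the values by the softmax of the scaled
    query-key similarities."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return scores.softmax(dim=-1) @ v

# Batch of 2 sequences, 10 tokens, 64-dimensional queries/keys/values.
q, k, v = (torch.randn(2, 10, 64) for _ in range(3))
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([2, 10, 64])
```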
Article
Convolutional networks are the de facto standard for analysing spatio-temporal data such as images, videos, 3D shapes, etc. Whilst some of this data is naturally dense (for instance, photos), many other data sources are inherently sparse. Examples include pen-strokes forming on a piece of paper, or (colored) 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard "dense" implementations of convolutional networks are very inefficient when applied to such sparse data. We introduce a sparse convolutional operation tailored to processing sparse data that differs from prior work on sparse convolutional networks in that it operates strictly on submanifolds, rather than "dilating" the observation with every layer in the network. Our empirical analysis of the resulting submanifold sparse convolutional networks shows that they perform on par with state-of-the-art methods whilst requiring substantially less computation.
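The defining property of a submanifold sparse convolution, computing outputs only at sites that are already active in the input so sparsity does not dilate with depth, can be shown with a naive dictionary-based sketch (real implementations use hash tables and optimised GPU kernels):

```python
import numpy as np

def submanifold_conv3d(active_coords, features, weights):
    """Naive submanifold convolution with a 3x3x3 kernel.
    `weights` has shape (3, 3, 3, C_in, C_out); outputs exist only at the
    active input sites and only gather from active neighbours."""
    table = {(int(x), int(y), int(z)): i
             for i, (x, y, z) in enumerate(active_coords)}
    out = np.zeros((len(active_coords), weights.shape[-1]))
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1)
               for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
    for i, (x, y, z) in enumerate(active_coords):
        for dx, dy, dz in offsets:
            j = table.get((int(x) + dx, int(y) + dy, int(z) + dz))
            if j is not None:
                out[i] += features[j] @ weights[dx + 1, dy + 1, dz + 1]
    return out

coords = np.unique(np.random.randint(0, 20, size=(500, 3)), axis=0)
feats = np.random.randn(len(coords), 8)
w = np.random.randn(3, 3, 3, 8, 16) * 0.1
print(submanifold_conv3d(coords, feats, w).shape)   # (n_active_sites, 16)
```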
Conference Paper
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
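A toy U-Net with one contracting level, one expanding level, and a single skip connection illustrates the architecture described above; real U-Nets stack several such levels with many more channels.

```python
import torch
from torch import nn

class TinyUNet(nn.Module):
    """Contracting path for context, expanding path for localisation, and a
    skip connection concatenating encoder features into the decoder."""

    def __init__(self, in_ch=1, out_ch=2, width=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(width, 2 * width, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(2 * width, width, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(width, out_ch, 1)

    def forward(self, x):
        e = self.enc(x)                        # full-resolution features
        b = self.mid(self.down(e))             # context at half resolution
        d = self.dec(torch.cat([self.up(b), e], dim=1))   # skip connection
        return self.head(d)

x = torch.randn(1, 1, 128, 128)
print(TinyUNet()(x).shape)                     # torch.Size([1, 2, 128, 128])
```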