Article

Significance of Image Features in Camera-LiDAR Based Object Detection


Abstract

In autonomous cars, accurate and reliable detection of objects in the proximity of the vehicle is necessary in order to perform further safety-critical actions that depend upon it. Many detectors have been developed in recent years, but there is still demand for more reliable and more robust detectors. Some detectors rely on a single sensor, while others are based on the fusion of data from multiple sources. The main aim of this paper is to show how image features can contribute to the performance of detectors that rely on point cloud data only. In addition, it is shown how lidar reflectance data can be substituted by low-level image features without degrading detector performance. This may be important when the same pretrained model is to be used on data generated by a lidar with a different reflectance encoding scheme and retraining is not possible due to a lack of training data. Three different approaches are proposed to fuse image features with point cloud data. The extended networks are compared with the original network and tested on a well-known dataset as well as on our own data. Different augmentation techniques have been proposed and tested on the KITTI dataset as well as on data acquired by a different lidar sensor. The networks augmented with image features achieved a recall increase of a few percent for occluded objects.
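The abstract leaves the fusion mechanism unspecified; the following minimal sketch (Python, all names hypothetical, with calibration inputs T_cam_lidar and K assumed given) illustrates the kind of substitution it describes: projecting each lidar point into the camera image and replacing the reflectance channel with the sampled grayscale intensity.

```python
import numpy as np

def replace_reflectance_with_gray(points, T_cam_lidar, K, gray_image):
    """Replace the reflectance channel of an (N, 4) point cloud
    [x, y, z, reflectance] with the grayscale intensity sampled at
    each point's image projection. Points outside the image get 0."""
    xyz = points[:, :3]
    # Transform lidar points into the camera frame (homogeneous coords).
    xyz_h = np.hstack([xyz, np.ones((xyz.shape[0], 1))])
    cam = (T_cam_lidar @ xyz_h.T).T[:, :3]
    # Keep only points in front of the camera.
    in_front = cam[:, 2] > 0.1
    # Pinhole projection with the intrinsic matrix K.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    h, w = gray_image.shape  # grayscale image assumed
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = points.copy()
    out[:, 3] = 0.0
    out[valid, 3] = gray_image[v[valid], u[valid]] / 255.0
    return out
```

The resulting cloud keeps the detector's expected (N, 4) input shape, which is what makes the reflectance-for-intensity swap possible without retraining.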


... The authors of [18] analyze, at the theoretical level, the limitations imposed by boundary objects and the sensitivity of calibration accuracy to the distribution of boundaries in the scene; at the implementation level, they propose a method for detecting boundaries in lidar point clouds based on cutting voxels from the cloud and fitting planes. Researchers [19] developed three different approaches for combining image features with point cloud data, in which lidar reflectance data can be replaced with low-level image features without degrading detector performance. To solve the problem of combining data from different types of lidar and camera sensors, the authors of [20] developed a multimodal system for detecting and tracking 3D objects (EZFusion). ...
... To determine the matrix of the internal parameters of the camera, it suffices to use n images and equations (19). ...
Preprint
Full-text available
Determining the distance from one object to another is one of the important tasks solved in robotics systems. Conventional algorithms rely on an iterative process of predicting distance estimates, which results in an increased computational burden. Algorithms used in robotic systems should require minimal computation time and be robust to noise. To address these problems, the paper proposes a combined Kalman filtering algorithm with a Goldschmidt divider and a median filter. Software simulation showed an increase in the prediction accuracy of the developed algorithm in comparison with the traditional filtering algorithm, as well as an increase in the algorithm's speed. The results obtained can be effectively applied in various computer vision systems.
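As a rough illustration of the pipeline this abstract describes — a median prefilter followed by a scalar Kalman update whose gain division is computed by Goldschmidt iteration — the following sketch may help; it is not the authors' implementation, and all tuning constants are invented.

```python
import numpy as np

def goldschmidt_div(a, b, iters=4):
    """Approximate a / b by Goldschmidt iteration: scale numerator and
    denominator by the same factor until the denominator converges to 1."""
    assert b > 0
    # Normalize b into (0.5, 1] so the iteration converges quickly.
    shift = np.ceil(np.log2(b))
    a, b = a / 2**shift, b / 2**shift
    for _ in range(iters):
        f = 2.0 - b          # correction factor
        a, b = a * f, b * f  # denominator -> 1, numerator -> quotient
    return a

def kalman_step(x, p, z, q=1e-3, r=1e-1):
    """One scalar predict/update step; the gain division is done with
    goldschmidt_div instead of a hardware divider."""
    p = p + q                      # predict covariance
    k = goldschmidt_div(p, p + r)  # Kalman gain K = P / (P + R)
    x = x + k * (z - x)            # update state with the innovation
    p = (1.0 - k) * p              # update covariance
    return x, p

def filter_range(measurements, window=3):
    """Median prefilter to suppress impulsive noise, then Kalman smoothing."""
    z_med = [np.median(measurements[max(0, i - window + 1):i + 1])
             for i in range(len(measurements))]
    x, p, out = z_med[0], 1.0, []
    for z in z_med:
        x, p = kalman_step(x, p, z)
        out.append(x)
    return out
```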
... Recent advances in deep learning have contributed significantly to the performance improvement of many computer vision tasks [1]-[5]. Some of the notable achievements happened in the areas of image classification [6], [7], object detection [8]-[10], and image segmentation [11], [12]. This paper focuses on the ultimate goal of putting deep learning-based object detection methods into practice and proposes an automatic dataset creation method to save time on manual labeling. ...
Article
Full-text available
Numerous high-accuracy neural networks for object detection have been proposed in recent years. Creating a specific training dataset seems to be the main obstacle in putting them into practice. This paper aims to overcome this obstacle and proposes an automatic dataset creation method. This method first extracts objects from the source images and then combines them as synthetic images. These synthetic images are annotated automatically using the data flow in the process and can be used directly for training. To the best of our knowledge, this is the first automatic dataset creation method for object detection tasks. In addition, the adaptive object extraction method and created natural synthetic images make the proposed method maintain strong adaptation and generalization ability. To validate the feasibility, a dataset that includes 44 categories of objects is created for the object detection task in a vending supermarket. Under the strictest metric $AP_{75}$, both the trained EfficientDet and YOLOv4 achieve higher than 95% in accuracy on the common difficulty testing set and higher than 90% in accuracy on the high difficulty testing set.
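A hedged sketch of the copy-paste composition step the abstract describes — object crops pasted onto a background, with bounding-box annotations emitted from the paste coordinates — might look as follows (no blending or occlusion handling; crops are assumed to fit inside the background):

```python
import numpy as np

def compose_synthetic(background, crops, rng=np.random.default_rng(0)):
    """Paste extracted object crops (each a dict with 'image' HxWx3 and
    'label') at random positions on a background; the annotations fall
    out of the paste coordinates, so no manual labeling is needed."""
    canvas = background.copy()
    annotations = []
    bh, bw = canvas.shape[:2]
    for crop in crops:
        obj = crop["image"]
        oh, ow = obj.shape[:2]
        # Crops are assumed smaller than the background canvas.
        x0 = int(rng.integers(0, bw - ow + 1))
        y0 = int(rng.integers(0, bh - oh + 1))
        canvas[y0:y0 + oh, x0:x0 + ow] = obj
        annotations.append({"label": crop["label"],
                            "bbox": [x0, y0, x0 + ow, y0 + oh]})
    return canvas, annotations
```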
Conference Paper
Full-text available
We present Self-Ensembling Single-Stage object Detector (SE-SSD) for accurate and efficient 3D object detection in outdoor point clouds. Our key focus is on exploiting both soft and hard targets with our formulated constraints to jointly optimize the model, without introducing extra computation in the inference. Specifically, SE-SSD contains a pair of teacher and student SSDs, in which we design an effective IoU-based matching strategy to filter soft targets from the teacher and formulate a consistency loss to align student predictions with them. Also, to maximize the distilled knowledge for ensembling the teacher, we design a new augmentation scheme to produce shape-aware augmented samples to train the student, aiming to encourage it to infer complete object shapes. Lastly, to better exploit hard targets, we design an ODIoU loss to supervise the student with constraints on the predicted box centers and orientations. Our SE-SSD attains top performance compared with all prior published works. Also, it attains top precisions for car detection in the KITTI benchmark (ranked 1st and 2nd on the BEV and 3D leaderboards, respectively) with an ultra-high inference speed. The code is available at https://github.com/Vegeta2020/SE-SSD.
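The teacher-student matching and consistency terms could be sketched roughly as below; this is a simplified, axis-aligned 2D stand-in for SE-SSD's actual IoU-based matching and consistency loss, not the released code.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of axis-aligned boxes [x1, y1, x2, y2] (a BEV simplification)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def consistency_loss(student_boxes, teacher_boxes, iou_thresh=0.5):
    """Match each student box to its best teacher (soft-target) box;
    pairs above the IoU threshold contribute a smooth-L1 penalty that
    pulls the student's predictions toward the teacher's."""
    if not teacher_boxes:
        return 0.0
    loss, n = 0.0, 0
    for s in student_boxes:
        ious = [iou_2d(s, t) for t in teacher_boxes]
        j = int(np.argmax(ious))
        if ious[j] >= iou_thresh:
            d = np.abs(np.asarray(s) - np.asarray(teacher_boxes[j]))
            loss += np.where(d < 1.0, 0.5 * d**2, d - 0.5).sum()
            n += 1
    return loss / max(n, 1)
```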
Article
Full-text available
Multimodality fusion based on deep neural networks (DNNs) is a significant method for intelligent vehicles. The special characteristics of DNNs raise the issues of AI safety and safety testing. In this paper, we first propose a multimodality fusion framework called the Integrated Multimodality Fusion Deep Neural Network (IMF-DNN), which can flexibly accomplish both object detection and an end-to-end driving policy for the prediction of steering angle and speed. Then, we propose a DNN safety test strategy, which systematically analyzes a DNN's robustness and generalization ability under a large number of diverse driving environment conditions. The test in this paper is based on our IMF-DNN model, and the strategy can be widely used for other DNNs. Finally, experimental analysis is performed on KITTI for object detection and on the DBNet dataset for end-to-end tasks. The results show the superior accuracy of the proposed IMF-DNN model and the test strategy's potential to improve the robustness and generalization of autonomous-vehicle deep learning models. Code is available at https://github.com/ennisnie/IMF-DNN
Article
Full-text available
The stabilization and validation of measured object positions is an important step for high-level perception functions and for the correct processing of sensory data. The goal of this process is to detect and handle inconsistencies between different sensor measurements produced by the perception system. The aggregation of detections from different sensors consists of combining the sensor data in a common reference frame for each identified object, leading to the creation of a super-sensor. The result of the data aggregation may contain errors such as false detections, misplaced object cuboids, or an incorrect number of objects in the scene. The stabilization and validation process focuses on mitigating these problems. The current paper proposes four contributions to the stabilization and validation task for autonomous vehicles, using the following sensors: trifocal camera, fisheye camera, long-range RADAR (radio detection and ranging), and 4-layer and 16-layer LIDARs (light detection and ranging). We propose two original data association methods used in the sensor fusion and tracking processes. The first data association algorithm is created for tracking LIDAR objects and combines multiple appearance and motion features in order to exploit the available information on road objects. The second novel data association algorithm is designed for trifocal camera objects and aims to find measurement correspondences to sensor-fused objects so that the super-sensor data are enriched with semantic class information. The implemented trifocal object association solution uses a novel polar association scheme combined with a decision tree to find the best hypothesis–measurement correlations. Another contribution, for stabilizing the position and handling the unpredictable behavior of road objects provided by multiple types of complementary sensors, is a fusion approach based on the Unscented Kalman Filter and a single-layer perceptron. The last novel contribution relates to the validation of the 3D object position, which is solved using a fuzzy logic technique combined with a semantic segmentation image. The proposed algorithms run in real time, with a cumulative running time of 90 ms, and have been evaluated against ground-truth data extracted from a high-precision GPS (global positioning system) with 2 cm accuracy, obtaining an average error of 0.8 m.
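As a schematic of the kind of multi-feature data association the paper proposes (not the authors' algorithm), one could combine a motion cost and an appearance cost in a single matrix and solve it with the Hungarian method; the field names and weights here are invented.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, w_pos=1.0, w_size=0.5, gate=4.0):
    """Associate tracked objects with new lidar detections by combining
    a motion feature (predicted-position distance) and an appearance
    feature (cuboid size difference) in one cost matrix."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            pos = np.linalg.norm(trk["pred_pos"] - det["pos"])
            size = np.linalg.norm(trk["size"] - det["size"])
            cost[i, j] = w_pos * pos + w_size * size
    rows, cols = linear_sum_assignment(cost)  # Hungarian assignment
    # Gate out implausible pairings.
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < gate]
```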
Article
Full-text available
Road traffic crashes are a considerable concern in motorized countries because of their impact on society and the economy. The number of accidents has decreased since 2000. This paper gives an overview of road safety on Hungary's road network and reviews the efforts that have been made. Statistical prediction of road traffic crashes is described based on the findings of the scientific literature. Using the data provided by the Hungarian Central Statistical Office on the number of crashes and related factors for the period from 2002 to 2017, a new relationship is established between crashes and a number of related factors in an attempt to improve the models' prediction power and to investigate the effect of adding new predictors on the strength of the models.
Conference Paper
Full-text available
Adverse environmental conditions create a bottleneck for autonomous driving. This challenge can be addressed by fusing data from multiple sensor modalities. Here, thermal and lidar data are fused for precise perception of the environment.
Article
Full-text available
LiDAR-based or RGB-D-based object detection is used in numerous applications, ranging from autonomous driving to robot vision. Voxel-based 3D convolutional networks have been used for some time to enhance the retention of information when processing point cloud LiDAR data. However, problems remain, including a slow inference speed and low orientation estimation performance. We therefore investigate an improved sparse convolution method for such networks, which significantly increases the speed of both training and inference. We also introduce a new form of angle loss regression to improve the orientation estimation performance and a new data augmentation approach that can enhance the convergence speed and performance. The proposed network produces state-of-the-art results on the KITTI 3D object detection benchmarks while maintaining a fast inference speed.
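The new angle loss can be illustrated with a short sketch: regressing on the sine of the angle difference removes the discontinuity of raw angle regression (the resulting pi ambiguity is typically resolved by a separate direction classifier). A minimal numpy version, smooth-L1 form assumed:

```python
import numpy as np

def angle_loss(theta_pred, theta_gt):
    """Sine-of-difference angle regression: penalizing sin(dtheta)
    makes orientations that differ by pi equivalent, removing the jump
    a raw angle difference would have at the +/- pi boundary."""
    d = np.abs(np.sin(theta_pred - theta_gt))
    return np.where(d < 1.0, 0.5 * d**2, d - 0.5).mean()
```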
Article
Full-text available
This paper presents a novel method for fully automatic and convenient extrinsic calibration of a 3D LiDAR and a panoramic camera with a normally printed chessboard. The proposed method is based on the 3D corner estimation of the chessboard from the sparse point cloud generated by one frame scan of the LiDAR. To estimate the corners, we formulate a full-scale model of the chessboard and fit it to the segmented 3D points of the chessboard. The model is fitted by optimizing the cost function under constraints of correlation between the reflectance intensity of laser and the color of the chessboard's patterns. Powell's method is introduced for resolving the discontinuity problem in optimization. The corners of the fitted model are considered as the 3D corners of the chessboard. Once the corners of the chessboard in the 3D point cloud are estimated, the extrinsic calibration of the two sensors is converted to a 3D-2D matching problem. The corresponding 3D-2D points are used to calculate the absolute pose of the two sensors with Unified Perspective-n-Point (UPnP). Further, the calculated parameters are regarded as initial values and are refined using the Levenberg-Marquardt method. The performance of the proposed corner detection method from the 3D point cloud is evaluated using simulations. The results of experiments, conducted on a Velodyne HDL-32e LiDAR and a Ladybug3 camera under the proposed re-projection error metric, qualitatively and quantitatively demonstrate the accuracy and stability of the final extrinsic calibration parameters.
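A condensed sketch of the final two stages (a PnP pose from the 3D-2D corner matches, then Levenberg-Marquardt refinement) using OpenCV; note that OpenCV's default solver stands in here for the UPnP solver used in the paper, and all names are illustrative.

```python
import numpy as np
import cv2

def extrinsic_from_corners(corners_3d, corners_2d, K, dist):
    """Estimate the LiDAR->camera transform from matched chessboard
    corners: a PnP solution gives the initial pose, which is then
    refined by Levenberg-Marquardt on the reprojection error."""
    obj = np.asarray(corners_3d, dtype=np.float64)  # Nx3, lidar frame
    img = np.asarray(corners_2d, dtype=np.float64)  # Nx2, pixels
    ok, rvec, tvec = cv2.solvePnP(obj, img, K, dist)
    assert ok, "PnP failed"
    # LM refinement (OpenCV >= 4.1) minimizes the reprojection error.
    rvec, tvec = cv2.solvePnPRefineLM(obj, img, K, dist, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T
```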
Article
Full-text available
Point clouds are an important type of geometric data structure. Due to their irregular format, most researchers transform such data into regular 3D voxel grids or collections of images. This, however, renders the data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par with or even better than the state of the art. Theoretically, we provide analysis towards understanding what the network has learned and why it is robust to input perturbation and corruption.
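The permutation invariance the abstract emphasizes is easy to demonstrate: a shared per-point MLP followed by max pooling gives the same global feature for any ordering of the input. A toy numpy version with random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 64)), rng.normal(size=(64, 128))

def pointnet_global_feature(points):
    """Shared per-point MLP followed by max pooling. Because the same
    weights are applied to every point and max() ignores order, the
    global feature is invariant to point permutations."""
    h = np.maximum(points @ W1, 0)  # per-point MLP layer 1 (ReLU)
    h = np.maximum(h @ W2, 0)       # per-point MLP layer 2 (ReLU)
    return h.max(axis=0)            # symmetric aggregation over points

pts = rng.normal(size=(1024, 3))
shuffled = pts[rng.permutation(len(pts))]
assert np.allclose(pointnet_global_feature(pts),
                   pointnet_global_feature(shuffled))
```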
Article
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of bounding box priors over different aspect ratios and scales per feature map location. At prediction time, the network generates confidences that each prior corresponds to an object of interest and produces adjustments to the prior to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals, such as R-CNN and MultiBox, because it completely discards the proposal generation step and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the ILSVRC DET and PASCAL VOC datasets confirm that SSD has performance comparable to methods that utilize an additional object proposal step, yet is 100-1000x faster. Compared to other single-stage methods, SSD has similar or better performance, while providing a unified framework for both training and inference.
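A minimal sketch of SSD-style default-box generation for one feature map (the sizes and scales are illustrative, not the paper's exact configuration):

```python
import numpy as np

def prior_boxes(fm_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate SSD-style default boxes (cx, cy, w, h in [0, 1]) for
    one feature map: one box per aspect ratio at every cell center."""
    boxes = []
    for i in range(fm_size):
        for j in range(fm_size):
            cx, cy = (j + 0.5) / fm_size, (i + 0.5) / fm_size
            for ar in aspect_ratios:
                boxes.append([cx, cy,
                              scale * np.sqrt(ar),
                              scale / np.sqrt(ar)])
    return np.asarray(boxes)

# Coarser feature maps get larger scales, so each resolution
# specializes in objects of a matching size.
priors = np.vstack([prior_boxes(38, 0.1), prior_boxes(19, 0.3),
                    prior_boxes(10, 0.5)])
```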
Article
Full-text available
We present a novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, we recorded 6 hours of traffic scenarios at 10–100 Hz using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The scenarios are diverse, capturing real-world traffic situations, and range from freeways over rural areas to inner-city scenes with many static and dynamic objects. Our data is calibrated, synchronized and timestamped, and we provide the rectified and raw image sequences. Our dataset also contains object labels in the form of 3D tracklets, and we provide online benchmarks for stereo, optical flow, object detection and other tasks. This paper describes our recording platform, the data format and the utilities that we provide.
Conference Paper
Full-text available
We present a real-time algorithm which enables an autonomous car to comfortably follow other cars at various speeds while keeping a safe distance. We focus on highway scenarios. A velocity and distance regulation approach is presented that depends on the position as well as the velocity of the followed car. Radar sensors provide reliable information on straight lanes, but fail in curves due to their restricted field of view. On the other hand, Lidar sensors are able to cover the regions of interest in almost all situations, but do not provide precise speed information. We combine the advantages of both sensors with a sensor fusion approach in order to provide permanent and precise spatial and dynamical data. Our results in highway experiments with real traffic will be described in detail.
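The velocity-and-distance regulation idea can be sketched as a constant time-gap car-following law; the gains and limits below are invented, not the authors' controller.

```python
def follow_accel(gap, ego_v, lead_v, time_gap=1.8, min_gap=5.0,
                 k_gap=0.25, k_vel=0.6, a_max=2.0):
    """Car-following command: accelerate toward a speed-dependent
    desired gap (constant time-gap policy) while matching the lead
    vehicle's velocity, with the command clamped to comfort limits."""
    desired_gap = min_gap + time_gap * ego_v
    accel = k_gap * (gap - desired_gap) + k_vel * (lead_v - ego_v)
    return max(-a_max, min(a_max, accel))
```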
Article
Full-text available
We propose a flexible technique to easily calibrate a camera. It only requires the camera to observe a planar pattern shown at a few (at least two) different orientations. Either the camera or the planar pattern can be freely moved. The motion need not be known. Radial lens distortion is modeled. The proposed procedure consists of a closed-form solution, followed by a nonlinear refinement based on the maximum likelihood criterion. Both computer simulation and real data have been used to test the proposed technique and very good results have been obtained. Compared with classical techniques which use expensive equipment such as two or three orthogonal planes, the proposed technique is easy to use and flexible. It advances 3D computer vision one more step from laboratory environments to real world use.
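The closed-form step can be summarized in standard notation (this is the well-known derivation of the method, not a quotation from the paper):

```latex
% Each view's homography H = [h_1\; h_2\; h_3] constrains the image of
% the absolute conic B = A^{-\top} A^{-1}, where A is the intrinsic matrix:
\[
  h_1^{\top} B\, h_2 = 0, \qquad
  h_1^{\top} B\, h_1 = h_2^{\top} B\, h_2 .
\]
% Writing b = (B_{11}, B_{12}, B_{22}, B_{13}, B_{23}, B_{33})^{\top},
% each view yields two linear equations in b, and n views stack into
\[
  V\, b = 0, \qquad V \in \mathbb{R}^{2n \times 6},
\]
% solved (up to scale) as the right singular vector of V with the
% smallest singular value; A is recovered from B, and all parameters
% are then refined by maximum likelihood (Levenberg-Marquardt).
```

This is also the context of the garbled citation snippet above: n images suffice to determine the intrinsic parameter matrix from the stacked linear system.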
Article
Advances in LiDAR sensors provide rich 3D data that supports 3D scene understanding. However, due to occlusion and signal miss, LiDAR point clouds are in practice 2.5D as they cover only partial underlying shapes, which poses a fundamental challenge to 3D perception. To tackle the challenge, we present a novel LiDAR-based 3D object detection model, dubbed Behind the Curtain Detector (BtcDet), which learns the object shape priors and estimates the complete object shapes that are partially occluded (curtained) in point clouds. BtcDet first identifies the regions that are affected by occlusion and signal miss. In these regions, our model predicts the probability of occupancy that indicates if a region contains object shapes and integrates this probability map with detection features and generates high-quality 3D proposals. Finally, the occupancy estimation is integrated into the proposal refinement module to generate accurate bounding boxes. Extensive experiments on the KITTI Dataset and the Waymo Open Dataset demonstrate the effectiveness of BtcDet. Particularly for the 3D detection of both cars and cyclists on the KITTI benchmark, BtcDet surpasses all of the published state-of-the-art methods by remarkable margins. Code is released.
Article
We demonstrate a working functional prototype of a cooperative perception system that maintains a real-time digital twin of the traffic environment, providing a more accurate and more reliable model than any of the participant subsystems—in this case, smart vehicles and infrastructure stations—would manage individually. The importance of such technology is that it can facilitate a spectrum of new derivative services, including cloud-assisted and cloud-controlled ADAS functions, dynamic map generation with analytics for traffic control and road infrastructure monitoring, a digital framework for operating vehicle testing grounds, logistics facilities, etc. In this paper, we constrain our discussion on the viability of the core concept and implement a system that provides a single service: the live visualization of our digital twin in a 3D simulation, which instantly and reliably matches the state of the real-world environment and showcases the advantages of real-time fusion of sensory data from various traffic participants. We envision this prototype system as part of a larger network of local information processing and integration nodes, i.e., the logically centralized digital twin is maintained in a physically distributed edge cloud.
Article
Nowadays, more than a decade after the first steps towards autonomous driving, we keep heading towards fully autonomous vehicles on our roads, with LiDAR sensors being a key instrument for the success of this technology. Such advances trigger the emergence of new players in the automotive industry, and together with car manufacturers, this sector represents a multibillion-dollar market in which everyone wants a share. To explain recent advances and the technologies behind LiDAR, this article presents a survey of LiDAR sensors for the automotive industry. We describe the measurement principles and imaging techniques currently in use and review the commercial systems and development solutions available on the market today. Furthermore, we highlight current and future challenges, providing insights into how both research and industry can step towards better LiDAR solutions.
Chapter
In this paper, we aim at addressing two critical issues in the 3D detection task, including the exploitation of multiple sensors (namely LiDAR point cloud and camera image), as well as the inconsistency between the localization and classification confidence. To this end, we propose a novel fusion module to enhance the point features with semantic image features in a point-wise manner without any image annotations. Besides, a consistency enforcing loss is employed to explicitly encourage the consistency of both the localization and classification confidence. We design an end-to-end learnable framework named EPNet to integrate these two components. Extensive experiments on the KITTI and SUN-RGBD datasets demonstrate the superiority of EPNet over the state-of-the-art methods. Codes and models are available at: https://github.com/happinesslz/EPNet.
Article
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to a RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
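A rough sketch of the voxel grouping and the VFE input construction the abstract describes (point-to-centroid offsets appended to each point); the parameters are illustrative.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_pts=35):
    """Group an (N, 4) cloud [x, y, z, r] into voxels and build the
    VFE input: each point is augmented with its offset from the voxel
    centroid, as in VoxelNet's voxel feature encoding layer."""
    idx = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(int)
    voxels = {}
    for p, key in zip(points, map(tuple, idx)):
        voxels.setdefault(key, []).append(p)
    features = {}
    for key, pts in voxels.items():
        pts = np.asarray(pts)[:max_pts]            # cap points per voxel
        centroid = pts[:, :3].mean(axis=0)
        offsets = pts[:, :3] - centroid            # point-to-centroid offsets
        features[key] = np.hstack([pts, offsets])  # (n, 7) VFE input
    return features
```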
Article
We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at: www.cvlibs.net/datasets/kitti.