The detection and localization of objects in point clouds provided by LiDAR sensors has many applications, especially in autonomous driving. As methods become more efficient in terms of performance and execution speed, they tend to operate in an end-to-end fashion, which makes it difficult, when errors occur, to evaluate which areas caused the failure. In this paper we propose a method that delivers an intermediate top-view heatmap that, firstly, helps the estimation stage focus on the right areas and, secondly, makes it possible to determine which parts of the space are holding the system's attention. Vehicles are represented on this map as ellipses. Heavily occluded objects may not produce a complete ellipse, but their partial mark can still indicate that a vehicle is hidden, allowing planning algorithms to decide on a reduction of velocity in case something suddenly appears. Furthermore, thanks to this alternative representation of objects, detections can also be extracted directly from the intermediate map, providing redundancy within a single system. The approach presented in this paper is distinguished by its relative computational simplicity compared to state-of-the-art methods. In addition to its implementation simplicity, the proposed architecture reaches a high level of performance on the KITTI detection benchmark.
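As an illustration of the ellipse-based target representation described above, the following sketch (not the authors' code) shows one way a vehicle could be stamped as a soft elliptical blob on a top-view heatmap; the grid size, resolution and exponential falloff are assumptions made for the example.

import numpy as np

def draw_vehicle_ellipse(heatmap, cx, cy, length, width, yaw, res=0.1):
    """Stamp an elliptical target for one vehicle onto a BEV heatmap.

    heatmap : 2D array indexed as [row, col] over the ground plane
    cx, cy  : vehicle centre in metres
    length, width : box dimensions in metres (ellipse semi-axes = half of these)
    yaw     : heading angle in radians
    res     : metres per cell (illustrative assumption)
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_m, y_m = xs * res, ys * res              # cell positions in metres
    dx, dy = x_m - cx, y_m - cy
    # rotate coordinates into the vehicle frame
    u = np.cos(yaw) * dx + np.sin(yaw) * dy
    v = -np.sin(yaw) * dx + np.cos(yaw) * dy
    # normalised elliptical distance: 1.0 on the ellipse boundary
    d = (u / (length / 2)) ** 2 + (v / (width / 2)) ** 2
    # soft target: 1 at the centre, decaying outside the ellipse
    heatmap[:] = np.maximum(heatmap, np.exp(-d))
    return heatmap

bev = np.zeros((200, 200), dtype=np.float32)   # 20 m x 20 m at 0.1 m/cell
draw_vehicle_ellipse(bev, cx=10.0, cy=8.0, length=4.5, width=1.8, yaw=0.3)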
... One class detects only the vehicle target. Examples include a vehicle detection network based on heatmap estimation and a convolutional neural network with a feature representation. The other class consists of 3D point cloud semantic segmentation networks, in which vehicles are only one category of the segmentation results. ...
This paper presents a novel dynamic vehicle tracking framework, achieving accurate pose estimation and tracking in urban environments. For vehicle tracking with laser scanners, pose estimation extracts geometric information of the target from a point cloud clustering unit, which plays an essential role in tracking tasks. However, the point cloud acquired from laser scanners only provides distance measurements to the object surface facing the sensor, leading to nonnegligible pose estimation errors. To address this issue, we take the motion information of targets as feedback to assist vehicle detection and pose estimation. In addition, the heading normalization vehicle model and a robust target size estimation method are introduced to deduce the pose of a vehicle with 2D matched filtering. Furthermore, considering the mobility of vehicles, we utilize the interacting multiple model (IMM) to capture multiple motion patterns. Compared to existing methods in the literature, our method can be applied to spatially sparse or incomplete point cloud observations. Experimental results demonstrate that our vehicle tracking framework achieves promising performance, and its real-time capability is also validated in real traffic scenarios.
LiDAR-based or RGB-D-based object detection is used in numerous applications, ranging from autonomous driving to robot vision. Voxel-based 3D convolutional networks have been used for some time to enhance the retention of information when processing point cloud LiDAR data. However, problems remain, including a slow inference speed and low orientation estimation performance. We therefore investigate an improved sparse convolution method for such networks, which significantly increases the speed of both training and inference. We also introduce a new form of angle loss regression to improve the orientation estimation performance and a new data augmentation approach that can enhance the convergence speed and performance. The proposed network produces state-of-the-art results on the KITTI 3D object detection benchmarks while maintaining a fast inference speed.
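The angle-loss idea mentioned in this abstract is often formulated as a sine-error regression; the following PyTorch sketch is a simplified, assumed version of that formulation (the paper additionally uses a direction classifier, which is omitted here).

import torch
import torch.nn.functional as F

def angle_loss(pred_yaw, target_yaw):
    # Regressing sin(pred - target) removes the ambiguity between a box and
    # the same box rotated by pi, which would otherwise yield a large loss
    # for an essentially correct orientation.
    return F.smooth_l1_loss(torch.sin(pred_yaw - target_yaw),
                            torch.zeros_like(pred_yaw))

pred = torch.tensor([0.1, 3.1])
gt   = torch.tensor([0.0, 0.0])
print(angle_loss(pred, gt))   # the near-pi error contributes little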
Object detection and classification in 3D is a key task in Automated Driving (AD). LiDAR sensors are employed to provide the 3D point cloud reconstruction of the surrounding environment, while the task of 3D object bounding box detection in real time remains a strong algorithmic challenge. In this paper, we build on the success of the one-shot regression meta-architecture in the 2D perspective image space and extend it to generate oriented 3D object bounding boxes from LiDAR point clouds. Our main contribution is in extending the loss function of YOLO v2 to include the yaw angle, the 3D box center in Cartesian coordinates and the height of the box as a direct regression problem. This formulation enables real-time performance, which is essential for automated driving. Our results show promising figures on the KITTI benchmark, achieving real-time performance (40 fps) on a Titan X GPU.
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well-known DeepLab-LargeFOV and DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
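The index-based upsampling described here maps directly onto standard max-pool/unpool operators; the following minimal PyTorch sketch (not the authors' implementation, with illustrative channel counts) shows the mechanism.

import torch
import torch.nn as nn

pool   = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
conv   = nn.Conv2d(8, 8, kernel_size=3, padding=1)  # densifies the sparse map

x = torch.randn(1, 8, 32, 32)
pooled, indices = pool(x)            # encoder: keep the argmax locations
upsampled = unpool(pooled, indices)  # decoder: place values back, zeros elsewhere
dense = conv(upsampled)              # trainable filters produce dense features
print(upsampled.shape, dense.shape)  # both torch.Size([1, 8, 32, 32])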
We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles.
While object recognition on 2D images is getting more and more mature, 3D understanding is eagerly in demand yet largely underexplored. In this paper, we study the 3D object detection problem from RGB-D data captured by depth sensors in both indoor and outdoor environments. Different from previous deep learning methods that work on 2D RGB-D images or 3D voxels, which often obscure natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. Although recent works such as PointNet perform well for segmentation in small-scale point clouds, one key challenge is how to efficiently detect objects in large-scale scenes. Leveraging the wisdom of dimension reduction and mature 2D object detectors, we develop a Frustum PointNet framework that addresses the challenge. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins with high efficiency (running at 5 fps).
In this paper, we present a novel approach, called Deep MANTA (Deep Many-Tasks), for many-task vehicle analysis from a given image. A robust convolutional network is introduced for simultaneous vehicle detection, part localization, visibility characterization and 3D dimension estimation. Its architecture is based on a new coarse-to-fine object proposal that boosts the vehicle detection. Moreover, the Deep MANTA network is able to localize vehicle parts even if these parts are not visible. In the inference, the network's outputs are used by a real time robust pose estimation algorithm for fine orientation estimation and 3D vehicle localization. We show in experiments that our method outperforms monocular state-of-the-art approaches on vehicle detection, orientation and 3D location tasks on the very challenging KITTI benchmark.
We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively little variance compared to alternatives and can often be predicted for many object types. These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable us to recover a stable and accurate 3D object pose. We evaluate our method on the challenging KITTI object detection benchmark both on the official metric of 3D orientation estimation and also on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance level segmentation and flat ground priors and sub-category detection. Our discrete-continuous loss also produces state of the art results for 3D viewpoint estimation on the Pascal 3D+ dataset.
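A hybrid discrete-continuous orientation loss of the kind described above can be sketched as a bin classification plus an in-bin residual regression; the bin count, bin centres and loss weighting below are illustrative assumptions, not the paper's exact settings.

import torch
import torch.nn.functional as F

NUM_BINS = 2
BIN_CENTERS = torch.tensor([0.0, torch.pi])   # assumed bin layout

def multibin_loss(bin_logits, residuals, gt_theta):
    """bin_logits: (N, NUM_BINS); residuals: (N, NUM_BINS, 2) as (cos, sin)."""
    # signed angular distance of the ground truth to each bin centre, wrapped to (-pi, pi]
    diff = (gt_theta.unsqueeze(1) - BIN_CENTERS + torch.pi) % (2 * torch.pi) - torch.pi
    gt_bin = diff.abs().argmin(dim=1)                     # discrete part
    cls_loss = F.cross_entropy(bin_logits, gt_bin)
    idx = torch.arange(gt_theta.shape[0])
    gt_res = diff[idx, gt_bin]                            # continuous part (in-bin residual)
    res = residuals[idx, gt_bin]
    reg_loss = F.l1_loss(res[:, 0], torch.cos(gt_res)) + \
               F.l1_loss(res[:, 1], torch.sin(gt_res))
    return cls_loss + reg_loss                            # assumed equal weighting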
This paper aims at high-accuracy 3D object detection in autonomous driving scenario. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point cloud and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 14.9% higher AP than the state-of-the-art on the hard data among the LIDAR-based methods.
This paper proposes a computationally efficient approach to detecting objects natively in 3D point clouds using convolutional neural networks (CNNs). In particular, this is achieved by leveraging a feature-centric voting scheme to implement novel convolutional layers which explicitly exploit the sparsity encountered in the input. To this end we examine the trade-off between accuracy and speed for different architectures and additionally propose to use an L1 penalty on the filter activations to further encourage sparsity in the intermediate representations. To the best of our knowledge, this is the first work to propose sparse convolutional layers and L1 regularisation for efficient large-scale processing of 3D data. We demonstrate the efficacy of our approach on the KITTI object detection benchmark and show that Vote3Deep models with as few as three layers outperform the previous state of the art in both laser and laser-vision based approaches across the board by margins of up to 40% while remaining highly competitive in terms of processing time.
Deep convolutional neural networks achieve better than human level visual recognition accuracy, at the cost of high computational complexity. We propose to factorize the convolutional layers to improve their efficiency. In traditional convolutional layers, the 3D convolution can be considered as performing in-channel spatial convolution and linear channel projection simultaneously, leading to highly redundant computation. By unravelling them, the proposed layer only involves single in-channel convolution and linear channel projection. When stacking such layers together, we achieve similar accuracy with significantly less computation. Additionally, we propose a topological connection framework between the input channels and output channels that further improves the layer's efficiency. Our experiments demonstrate that the proposed method remarkably outperforms the standard convolutional layer with regard to accuracy/complexity ratio. Our model achieves the accuracy of GoogLeNet while consuming 3.4 times less computation.
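The factorization described here, a per-channel spatial convolution followed by a 1x1 linear channel projection, can be sketched in PyTorch as follows; channel counts and input size are illustrative.

import torch
import torch.nn as nn

in_ch, out_ch = 64, 128
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
factorized = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # in-channel spatial conv
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # linear channel projection
)

x = torch.randn(1, in_ch, 56, 56)
print(standard(x).shape, factorized(x).shape)   # same output shape
n_std = sum(p.numel() for p in standard.parameters())
n_fac = sum(p.numel() for p in factorized.parameters())
print(n_std, n_fac)   # the factorized form uses far fewer parameters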
We present a novel and practical deep fully convolutional neural network
architecture for semantic pixel-wise segmentation termed SegNet. This core
trainable segmentation engine consists of an encoder network, a corresponding
decoder network followed by a pixel-wise classification layer. The architecture
of the encoder network is topologically identical to the 13 convolutional
layers in the VGG16 network. The role of the decoder network is to map the low
resolution encoder feature maps to full input resolution feature maps for
pixel-wise classification. The novelty of SegNet lies in the manner in which
the decoder upsamples its lower resolution input feature map(s). Specifically,
the decoder uses pooling indices computed in the max-pooling step of the
corresponding encoder to perform non-linear upsampling. This eliminates the
need for learning to upsample. The upsampled maps are sparse and are then
convolved with trainable filters to produce dense feature maps. We compare our
proposed architecture with the fully convolutional network (FCN) architecture
and its variants. This comparison reveals the memory versus accuracy trade-off
involved in achieving good segmentation performance. The design of SegNet was
primarily motivated by road scene understanding applications. Hence, it is
efficient both in terms of memory and computational time during inference. It
is also significantly smaller in the number of trainable parameters than
competing architectures and can be trained end-to-end using stochastic gradient
descent. We also benchmark the performance of SegNet on Pascal VOC12 salient
object segmentation and the recent SUN RGB-D indoor scene understanding
challenge. We show that SegNet provides competitive performance although it is
significantly smaller than other architectures. We also provide a Caffe
implementation of SegNet and a webdemo at
There is large consent that successful training of deep networks requires
many thousand annotated training samples. In this paper, we present a network
and training strategy that relies on the strong use of data augmentation to use
the available annotated samples more efficiently. The architecture consists of
a contracting path to capture context and a symmetric expanding path that
enables precise localization. We show that such a network can be trained
end-to-end from very few images and outperforms the prior best method (a
sliding-window convolutional network) on the ISBI challenge for segmentation of
neuronal structures in electron microscopic stacks. Using the same network
trained on transmitted light microscopy images (phase contrast and DIC) we won
the ISBI cell tracking challenge 2015 in these categories by a large margin.
Moreover, the network is fast. Segmentation of a 512x512 image takes less than
a second on a recent GPU. The full implementation (based on Caffe) and the
trained networks are available at
This work presents a new efficient method for fitting ellipses to scattered data. Previous algorithms either fitted general conics or were computationally expensive. By minimizing the algebraic distance subject to the constraint 4ac − b² = 1, the new method incorporates the ellipticity constraint into the normalization factor. The proposed method combines several advantages: (i) it is ellipse-specific, so that even bad data will always return an ellipse; (ii) it can be solved naturally by a generalized eigensystem; and (iii) it is extremely robust, efficient and easy to implement.
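The constrained fit reduces to a generalized eigenvalue problem, which the following NumPy sketch illustrates; numerical details such as data normalization and conditioning are simplified, and the selection rule assumes the generic case of a single positive eigenvalue.

import numpy as np

def fit_ellipse(x, y):
    # design matrix for the conic a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0
    D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    S = D.T @ D                      # scatter matrix
    C = np.zeros((6, 6))             # constraint matrix encoding 4ac - b^2
    C[0, 2] = C[2, 0] = 2.0
    C[1, 1] = -1.0
    # generalized eigenproblem S a = lambda C a, solved via eig(S^-1 C)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S, C))
    # the ellipse-specific solution corresponds to the single positive eigenvalue
    k = np.argmax(eigvals.real)
    return eigvecs[:, k].real        # conic coefficients (a, b, c, d, e, f), up to scale

# noisy points sampled on an ellipse
t = np.linspace(0, 2 * np.pi, 100)
x = 3.0 * np.cos(t) + 0.05 * np.random.randn(100)
y = 1.5 * np.sin(t) + 0.05 * np.random.randn(100)
coeffs = fit_ellipse(x, y)
print(coeffs / coeffs[0])            # sanity check: 4ac - b^2 should be > 0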
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
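The reshaped cross entropy can be written compactly; the sketch below is a minimal binary-classification version with the commonly cited default values of alpha and gamma, not a reproduction of the RetinaNet code.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of the same shape, targets in {0, 1}."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()      # down-weight easy examples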
We present PointFusion, a generic 3D object detection method that leverages both image and 3D point cloud information. Unlike existing methods that either use multi-stage pipelines or hold sensor and dataset-specific assumptions, PointFusion is conceptually simple and application-agnostic. The image data and the raw point cloud data are independently processed by a CNN and a PointNet architecture, respectively. The resulting outputs are then combined by a novel fusion network, which predicts multiple 3D box hypotheses and their confidences, using the input 3D points as spatial anchors. We evaluate PointFusion on two distinctive datasets: the KITTI dataset that features driving scenes captured with a lidar-camera setup, and the SUN-RGBD dataset that captures indoor environments with RGB-D cameras. Our model is the first one that is able to perform better or on-par with the state-of-the-art on these diverse datasets without any dataset-specific model tuning.
Convolutional networks are the de-facto standard for analyzing spatio-temporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard "dense" implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and use them to develop spatially-sparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SSCNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition.
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
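The voxel grouping step that precedes the VFE layers can be sketched as a simple bucketing of points into an equally spaced grid; the voxel size and point-cloud range below are illustrative, KITTI-style assumptions.

import numpy as np
from collections import defaultdict

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), pc_range=(0, -40, -3, 70.4, 40, 1)):
    """points: (N, 3) array of x, y, z. Returns {voxel index: list of points}."""
    x0, y0, z0, x1, y1, z1 = pc_range
    mask = ((points[:, 0] >= x0) & (points[:, 0] < x1) &
            (points[:, 1] >= y0) & (points[:, 1] < y1) &
            (points[:, 2] >= z0) & (points[:, 2] < z1))
    pts = points[mask]
    # integer voxel coordinates for every surviving point
    idx = np.floor((pts - np.array([x0, y0, z0])) / np.array(voxel_size)).astype(int)
    voxels = defaultdict(list)
    for p, i in zip(pts, idx):
        voxels[tuple(i)].append(p)
    return voxels

pts = np.random.uniform([0, -40, -3], [70.4, 40, 1], size=(10000, 3))
vox = voxelize(pts)
print(len(vox), "non-empty voxels")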
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
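One top-down step of such a pyramid, a coarse map upsampled and merged with a laterally connected finer map, can be sketched in PyTorch as follows; channel counts and spatial sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

lateral = nn.Conv2d(512, 256, kernel_size=1)    # project the finer backbone map
smooth  = nn.Conv2d(256, 256, kernel_size=3, padding=1)

c4 = torch.randn(1, 512, 28, 28)    # finer backbone feature map
p5 = torch.randn(1, 256, 14, 14)    # coarser pyramid level already built
p4 = smooth(lateral(c4) + F.interpolate(p5, scale_factor=2, mode="nearest"))
print(p4.shape)                     # torch.Size([1, 256, 28, 28])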
2D fully convolutional networks have recently been successfully applied to object detection from images. In this paper, we extend the fully convolutional network based detection techniques to 3D and apply them to point cloud data. The proposed approach is verified on the task of vehicle detection from lidar point clouds for autonomous driving. Experiments on the KITTI dataset show a significant performance improvement over previous point-cloud-based detection approaches.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
We present YOLO, a unified pipeline for object detection. Prior work on
object detection repurposes classifiers to perform detection. Instead, we frame
object detection as a regression problem to spatially separated bounding boxes
and associated class probabilities. A single neural network predicts bounding
boxes and class probabilities directly from full images in one evaluation.
Since the whole detection pipeline is a single network, it can be optimized
end-to-end directly on detection performance. Our unified architecture is also
extremely fast; YOLO processes images in real-time at 45 frames per second,
hundreds to thousands of times faster than existing detection systems. Our
system uses global image context to detect and localize objects, making it less
prone to background errors than top detection systems like R-CNN. By itself,
YOLO detects objects at unprecedented speeds with moderate accuracy. When
combined with state-of-the-art detectors, YOLO boosts performance by 2-3%.
State-of-the-art object detection networks depend on region proposal
algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN
have reduced the running time of these detection networks, exposing region
proposal computation as a bottleneck. In this work, we introduce a Region
Proposal Network (RPN) that shares full-image convolutional features with the
detection network, thus enabling nearly cost-free region proposals. An RPN is a
fully-convolutional network that simultaneously predicts object bounds and
objectness scores at each position. RPNs are trained end-to-end to generate
high-quality region proposals, which are used by Fast R-CNN for detection. With
a simple alternating optimization, RPN and Fast R-CNN can be trained to share
convolutional features. For the very deep VGG-16 model, our detection system
has a frame rate of 5fps (including all steps) on a GPU, while achieving
state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and
2012 (70.4% mAP) using 300 proposals per image. The code will be released.