Radar Voxel Fusion for 3D Object Detection
Felix Nobis 1,* , Ehsan Shafiei 1, Phillip Karle 1, Johannes Betz 2and Markus Lienkamp 1
Citation: Nobis, F.; Shafiei, E.; Karle, P.; Betz, J.; Lienkamp, M. Radar Voxel Fusion for 3D Object Detection. Appl. Sci. 2021, 11, 5598.
Academic Editor: Chris G. Tzanis
Received: 29 April 2021
Accepted: 11 June 2021
Published: 17 June 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
1 Institute of Automotive Technology, Technical University of Munich, 85748 Garching, Germany; (E.S.); (P.K.); (M.L.)
2 mLab: Real-Time and Embedded Systems Lab, University of Pennsylvania, Philadelphia, PA 19104, USA
Abstract: Automotive traffic scenes are complex due to the variety of possible scenarios, objects, and
weather conditions that need to be handled. In contrast to more constrained environments, such
as automated underground trains, automotive perception systems cannot be tailored to a narrow
field of specific tasks but must handle an ever-changing environment with unforeseen events. As
currently no single sensor is able to reliably perceive all relevant activity in the surroundings, sensor
data fusion is applied to perceive as much information as possible. Data fusion of different sensors
and sensor modalities on a low abstraction level enables the compensation of sensor weaknesses and
misdetections among the sensors before the information-rich sensor data are compressed and thereby
information is lost after a sensor-individual object detection. This paper develops a low-level sensor
fusion network for 3D object detection, which fuses lidar, camera, and radar data. The fusion network
is trained and evaluated on the nuScenes data set. On the test set, fusion of radar data increases
the resulting AP (Average Precision) detection score by about 5.1% in comparison to the baseline
lidar network. The radar sensor fusion proves especially beneficial in inclement conditions such
as rain and night scenes. Fusing additional camera data contributes positively only in conjunction
with the radar fusion, which shows that interdependencies of the sensors are important for the
detection result. Additionally, the paper proposes a novel loss to handle the discontinuity of a simple
yaw representation for object detection. Our updated loss increases the detection and orientation
estimation performance for all sensor input configurations. The code for this research has been made
available on GitHub.
Keywords: perception; deep learning; sensor fusion; radar point cloud; object detection; sensor; camera; radar; lidar
1. Introduction
In the current state of the art, researchers focus on 3D object detection in the field of
perception. Three-dimensional object detection is most reliably performed with lidar sensor
data [], as its higher resolution (compared to radar sensors) and direct depth measurement
(compared to camera sensors) provide the most relevant features for
object detection algorithms. However, for redundancy and safety reasons in autonomous
driving applications, additional sensor modalities are required because lidar sensors cannot
detect all relevant objects at all times. Cameras are well-understood, cheap and reliable
sensors for applications such as traffic-sign recognition. Despite their high resolution, their
capabilities for 3D perception are limited as only 2D information is provided by the sensor.
Furthermore, the sensor data quality deteriorates strongly in bad weather conditions such
as snow or heavy rain. Radar sensors are least affected by inclement weather, e.g., fog,
and are therefore a vital asset to make autonomous driving more reliable. However, due
to their low resolution and clutter noise for static vehicles, current radar sensors cannot
perform general object detection without the addition of further modalities. This work
therefore combines the advantages of camera, lidar, and radar sensor modalities to produce
an improved detection result.
Several strategies exist to fuse the information of different sensors. These systems can
be categorized as early fusion if all input data are first combined and then processed, or
late fusion if all data is first processed independently and the output of the data-specific
algorithms are fused after the processing. Partly independent and joint processing is called
middle or feature fusion.
Late fusion schemes based on a Bayes filter, e.g., the Unscented Kalman Filter (UKF) [],
in combination with a matching algorithm for object tracking, are the current state of the art,
due to their simplicity and their effectiveness during operation in constrained environments
and good weather.
Early and feature fusion networks possess the advantage of using all available sensor
information at once and are therefore able to learn from interdependencies of the sensor
data and compensate imperfect sensor data for a robust detection result similar to gradient
boosting [5].
This paper presents an approach to fuse the sensors in an early fusion scheme. Similar
to Wang et al. [], we color the lidar point cloud with camera RGB information. These
colored lidar points are then fused with the radar points and their radar cross-section (RCS)
and velocity features. The network processes the points jointly in a voxel structure and
outputs the predicted bounding boxes. The paper evaluates several parameterizations and
presents the RadarVoxelFusionNet (RVF-Net), which proved most reliable in our studies.
The contribution of the paper is threefold:
- The paper develops an early fusion network for radar, lidar, and camera data for 3D object detection. The network outperforms the lidar baseline and a Kalman Filter late fusion approach.
- The paper provides a novel loss function to replace the simple discontinuous yaw parameterization during network training.
- The code for this research has been released to the public to make it adaptable to further use cases.
Section 2 discusses related work for object detection and sensor fusion networks. The
proposed model is described in Section 3. The results are shown in Section 4 and discussed
in Section 5. Section 6 presents our conclusions from the work.
2. Related Work
Firstly, this section gives a short overview of the state of the art of lidar object detection for
autonomous driving. Secondly, a more detailed review of fusion methods for object detection
is given. We refer to [7] for a more detailed overview of radar object detection methods.
2.1. 3D Lidar Object Detection
The seminal work of Qi et al. [] introduces a method to directly process sparse,
irregular point cloud data with neural networks for semantic segmentation tasks. Their
continued work [] uses a similar backbone to perform 3D object detection from point
cloud frustums. Their so-called PointNet backbone has been adapted in numerous works
to advance lidar object detection.
VoxelNet [] processes lidar points in a voxel grid structure. The network aggregates
a feature for each voxel from the associated points. These voxel grid cells are processed in
a convolutional fashion to generate object detection results with an anchor-based region
proposal network (RPN) [11].
The network that achieves the highest object detection score [] on the KITTI 3D
benchmark [] uses both voxel-based and PointNet-based processing to create its
detection results. The processing of the voxel data is performed with submanifold sparse
convolutions as introduced in []. The advantage of this sparse implementation of
convolutions lies in the fact that it does not process empty parts of the grid that contain
no information. This is especially advantageous for point cloud processing, as most of the
3D space does not contain any sensor returns. The network that achieves the highest object
detection score on the nuScenes data set [] is a lidar-only approach as well []. Similarly,
it uses a sparse VoxelNet backbone with a second stage for bounding box refinement.
2.2. 2D Sensor Fusion for Object Detection
This section reviews 2D fusion methods. The focus is on methods that fuse radar data
as part of the input data.
Chadwick [] is the first to use a neural network to fuse low-level radar and camera data
for 2D object detection. The network fuses the data on a feature level after projecting radar
data to the 2D image plane. The object detection scores of the fusion are higher than the
data to the 2D image plane. The object detection scores of the fusion are higher than the
ones of a camera-only network, especially for distant objects.
CRF-Net [] develops a similar fusion approach. As an automotive radar does not
measure any height information, the network assumes an extended height of the radar
returns to account for the uncertainty in the radar return's origin. The approach shows a
slight increase in object detection performance both on a private and the public nuScenes
data set []. The paper shows further potential for the fusion scheme once less noisy radar
data are available.
YOdar [] uses a similar projection fusion method. The approach creates two detection
probabilities from separate radar and image processing pipelines and generates the
final detection output by gradient boosting.
2.3. 3D Sensor Fusion for Object Detection
This section reviews 3D fusion methods. The focus is on methods that fuse radar data
as part of the input data.
2.3.1. Camera Radar Fusion
For 3D object detection, the authors of [] propose GRIF-Net to fuse radar and camera
data. After individual processing, the feature fusion is performed by a gated region of
data. After individual processing, the feature fusion is performed by a gated region of
interest fusion (GRIF). In contrast to concatenation or addition as the fusion operation, the
weight for each sensor in the fusion is learned in the GRIF module. The camera and radar
fusion method outperforms the radar baseline by a great margin on the nuScenes data set.
The CenterFusion architecture [] first detects objects in the 3D space via image-based
object detection. Radar points inside a frustum around these detections are fused
by concatenation to the image features. The radar features are extended to pillars similar
to [] in the 2D case. The object detection head operates on these joint features to refine
the detection accuracy. The mean Average Precision (mAP) score of the detection output
increases by 4% for the camera radar fusion compared to their baseline on the nuScenes
validation data set.
While the methods above operate with point cloud-based input data, Lim [] fuses
azimuth range images and camera images. The camera data are projected to a bird’s-eye
view (BEV) with an Inverse Projection Mapping (IPM). The individually processed branches
are concatenated to generate the object detection results. The fusion approach achieves a
higher detection score than the individual modalities. The IPM limits the detection range
to close objects and an assumed flat road surface.
Kim [] similarly fuses radar azimuth-range images with camera images. The data
are fused after initial individual processing, and the detection output is generated by adopting
the detection head of []. The fusion approach outperforms both their image and radar
baselines on their private data set. Their RPN uses a distance threshold in contrast to
standard Intersection over Union (IoU) matching for anchor association. The paper argues
that the IoU metric prefers to associate distant bounding boxes over closer bounding boxes
under certain conditions. Using a distance threshold instead increases the resulting AP by
4–5 points over the IoU threshold matching.
The overall detection accuracy of camera radar fusion networks is significantly lower
than that of lidar-based detection methods.
2.3.2. Lidar Camera Fusion
MV3D [] projects lidar data both to a BEV perspective and the camera perspective.
The lidar representations are fused with the camera input after some initial processing in a
feature fusion scheme.
The authors of [] use a BEV projection of the lidar data and camera data as their input.
The detection results are calculated with an anchor grid and an RPN as a detection head.
PointPainting [] first calculates a semantic segmentation mask for an input image.
The detected classes are then projected onto the lidar point cloud via a color-coding for the
different classes. The work expands several lidar 3D object detection networks and shows
that enriching the lidar data with class information augments the detection score.
2.3.3. Lidar Radar Fusion
RadarNet [] fuses radar and lidar point clouds for object detection. The point clouds
are transformed into a grid representation and then concatenated. After this feature fusion,
the data are processed jointly to propose object detections. An additional late fusion of
radar features is performed to predict a velocity estimate separately from the object detection task.
2.3.4. Lidar Radar Camera Fusion
Wang [] projects RGB values of camera images directly onto the lidar point cloud.
This early fusion camera-lidar point cloud is used to create object detection outputs in a
pointnet architecture. Parallel to the object detection, the radar point cloud is processed
to predict velocity estimates of the input point cloud. The velocity estimates are then
associated with the final detection output. The paper experimented with concatenating
different amounts of past data sweeps for the radar network. Concatenating six consecutive
time steps of the radar data for a single processing shows the best results in their study.
The addition of the radar data increases their baseline detection score slightly on the public
nuScenes data set.
3. Methodology
In the following, we list the main conclusions from the state of the art for our work:
Input representation: The input representation of the sensor data dictates which subse-
quent processing techniques can be applied. Pointnet-based methods are beneficial
when dealing with sparse unordered point cloud data. For more dense—but still
sparse—point clouds, such as the fusion of several lidar or radar sensors, sparse
voxel grid structures achieve more favorable results in the object detection literature.
Therefore, we adopt a voxel-based input structure for our data. As many of the
voxels remain empty in the 3D grid, we apply sparse convolutional operations []
for greater efficiency.
Distance Threshold: Anchor-based detection heads predominantly use an IoU-based
matching algorithm to identify positive anchors. However, Kim [] has shown that
this choice might lead to association of distant anchors for certain bounding box
configurations. We argue that both IoU- and distance-based matching thresholds
should be considered to facilitate the learning process. The distance-based threshold
alone might not be a good metric when considering rotated bounding boxes with a
small overlapping area. Our network therefore considers both thresholds to match
the anchor boxes.
Fusion Level: The data from different sensors and modalities can be fused at different
abstraction levels. Recently, a rising number of papers perform early or feature
fusion to exploit all available data for object detection simultaneously.
Nonetheless, the state of the art in object detection is still achieved by considering only
lidar data. Due to its resolution and precision advantage from a hardware perspec-
tive, software processing methods cannot compensate for the missing information
in the input data of the additional sensors. Still, there are use cases where the lidar
sensor alone is not sufficient. Inclement weather, such as fog, decreases the lidar
Appl. Sci. 2021,11, 5598 5 of 16
and camera data quality [] significantly. The radar data, however, are only slightly
affected by the change in environmental conditions. Furthermore, interference effects
of different lidar modules might decrease the detection performance under certain
conditions [27,28]. A drawback of early fusion algorithms is that temporally synchronized
data recording for all sensors needs to be available. However, none of the
publicly available data sets provide such data for all three sensor modalities. The
authors of [] discuss the publicly available data quality for radar sensors in more
detail. Despite the lack of synchronized data, this study uses an early fusion scheme,
as in similar works, spatio-temporal synchronization errors are treated as noise and
compensated during the learning process of the fusion network. In contrast to recent
papers, where some initial processing is applied before fusing the data, we present
a direct early fusion to enable the network to learn optimal combined features for
the input data. The early fusion can make use of the complementary sensor informa-
tion provided by radar, camera and lidar sensors—before any data compression by
sensor-individual processing is performed.
3.1. Input Data
The input data to the network consist of the lidar data with their three spatial coordinates
(x, y, z) and intensity value i. Similar to [], colorization from projected camera images
is added to the lidar data with (r, g, b) features. Additionally, the radar data contribute
their spatial coordinates, intensity value (radar cross-section, RCS), and the radial velocity
with its Cartesian components (v_x, v_y). Local offsets for the points in the voxels
complete the input space. The raw data are fused and processed jointly by the network itself. Due to
the early fusion of the input data, any lidar network can easily be adapted to our fusion
approach by adjusting the input dimensions.
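For illustration, the early fusion of colored lidar points and radar points can be sketched as follows. The function names, the pinhole intrinsic `K`, and the exact 10-dimensional feature layout are simplifying assumptions for this sketch; the released code defines the actual input pipeline.

```python
import numpy as np

def color_lidar(lidar, image, K):
    """Project lidar points (x, y, z, intensity) into the image plane and
    attach sampled RGB. `lidar` is assumed to be in the camera frame
    already; K is a 3x3 pinhole intrinsic matrix (illustrative names)."""
    in_front = lidar[:, 2] > 0.1          # keep points in front of the camera
    pts = lidar[in_front]
    uvw = pts[:, :3] @ K.T                # pinhole projection
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
    h, w = image.shape[:2]
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    rgb = image[uv[ok, 1], uv[ok, 0]] / 255.0
    return np.hstack([pts[ok], rgb])      # (M, 7): x, y, z, i, r, g, b

def fuse_point_clouds(colored_lidar, radar):
    """Stack both modalities into one shared feature space
    [x, y, z, i, r, g, b, rcs, vx, vy]; absent features are zero-padded."""
    lid = np.hstack([colored_lidar, np.zeros((len(colored_lidar), 3))])
    rad = np.hstack([radar[:, :3],               # x, y, z
                     np.zeros((len(radar), 4)),  # no intensity / RGB for radar
                     radar[:, 3:6]])             # rcs, vx, vy
    return np.vstack([lid, rad])
```

The zero-padding makes the two modalities dimensionally compatible so that a single voxel grid can hold both point types.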
3.2. Network Architecture
This paper proposes the RadarVoxelFusionNet (RVF-Net), whose architecture is based
on VoxelNet [] due to its empirically proven performance and straightforward network
architecture. While other architectures in the state of the art provide higher detection
scores, applying a non-overengineered network from the literature is preferable
for investigating the effect of a new data fusion method. Recently, A. Ng [] proposed a
shift from model-centric to data-centric approaches for machine learning development.
An overview of the network architecture is shown in Figure 1. The input point
cloud is partitioned into a 3D voxel grid. Non-empty voxel cells are used as the input
data to the network. The data are split into the features of the input points and the
corresponding coordinates. The input features are processed by voxel feature encoding
(VFE) layers composed of fully connected and max-pooling operations for the points inside
each voxel. The pooling is used to aggregate one single feature per voxel. In the global
feature generation, the voxel features are processed by sparse 3D submanifold convolutions
to efficiently handle the sparse voxel grid input. The height dimension is merged with the feature
dimension to create a sparse feature tensor in the form of a 2D grid. The sparse tensor is
converted to a dense 2D grid and processed with standard 2D convolutions to generate
features in a BEV representation. These features are the basis for the detection output heads.
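A toy sketch of the voxel partitioning and VFE aggregation described above; the function names, the dictionary-based sparse storage, and the single linear map `W` are illustrative simplifications (the actual network uses learned fully connected layers and batched tensors):

```python
import numpy as np

def voxelize(points, voxel_size, max_pts=40):
    """Group points into voxels via integer grid coordinates; a simplified
    stand-in for the sparse voxel input partitioning."""
    coords = np.floor(points[:, :3] / voxel_size).astype(int)
    voxels = {}
    for c, p in zip(map(tuple, coords), points):
        cell = voxels.setdefault(c, [])
        if len(cell) < max_pts:          # cap the points per voxel
            cell.append(p)
    return voxels

def vfe(voxels, W):
    """Toy voxel feature encoding: shared linear map + ReLU on each point,
    then max pooling to aggregate one feature vector per voxel."""
    return {c: np.maximum(np.stack(pts) @ W, 0.0).max(axis=0)
            for c, pts in voxels.items()}
```

Only the non-empty voxel cells appear as dictionary keys, mirroring the sparse tensor that the subsequent submanifold convolutions operate on.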
The detection head consists of three parts: the classification head, which outputs a
class score for each anchor box; the regression head, with seven regression values for the
bounding box position (x, y, z), dimensions (l, w, h), and the yaw angle θ; and the direction
head, which outputs a complementary classification value for the yaw angle estimation.
For more details on the network architecture, we refer to the work of [] and our
open source implementation. The next section focuses on our proposed yaw loss, which is
conceptually different from the original VoxelNet implementation.
[Figure: pipeline from the input point cloud through sparse voxel feature generation, 3D sparse convolutions, 2D convolutions, and the detection heads.]
Figure 1. Network architecture of the proposed RVF-Net.
3.3. Yaw Loss Parameterization
While the original VoxelNet paper uses a simple yaw regression, we use a more
complex parameterization to facilitate the learning process. Zhou [] argues that a simple
yaw representation is disadvantageous, as the optimizer needs to regress a smooth function
over a discontinuity, e.g., from −π to +π. Furthermore, the loss value for small
positive angle differences is much lower than that of greater positive angle differences,
while the absolute angle difference from the anchor orientation might be the same. Figure 2
visualizes this problem for an exemplary simple yaw regression.
Figure 2. Vehicle bounding boxes are visualized in a BEV. The heading of the vehicles is visualized
with a line from the middle of the bounding box to the front. The relative angular deviations from
the orange and blue ground truth boxes to the anchor box are equal. However, the resulting loss
value of the orange bounding box is significantly higher than that of the blue one.
To account for this problem, the network estimates the yaw angle with a combination
of a classification and a regression head. The classifier is inherently designed to deal with a
discontinuous domain, enabling the regression of a continuous target. The regression head
regresses the actual angle difference in the interval [−π, π) with a smooth sine function,
which is continuous even at the limits of the interval. The regression output of the yaw
angle of a bounding box is

θ_d = θ_GT − θ_A, (1)

where θ_GT is the ground truth box yaw angle and θ_A is the associated anchor box
yaw angle.
The classification head determines whether the angle difference between the
predicted bounding box and the associated anchor lies inside or outside of the interval
[−π/2, π/2). The classification value of the yaw is modeled as

c_dir = 1, if −π/2 ≤ ((θ_d + π) mod 2π) − π < π/2,
c_dir = 0, otherwise. (2)
As seen above, the directional bin classification head splits the angle space into two
equally spaced regions. The network uses two anchor angle parameterizations, at 0 and
π/2. A vehicle driving towards the sensor vehicle matches with the anchor at 0 rad. A
vehicle driving ahead of the sensor vehicle in the same direction would match with the same
anchor. The angle classification head intuitively distinguishes between these cases. Therefore,
there is no need to compute additional anchors at π and 3π/2.
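The target encoding of Eqs. (1) and (2) can be sketched as follows. Regressing the sine of the wrapped angle difference is our reading of the parameterization and should be treated as an assumption, as are the function and variable names:

```python
import numpy as np

def encode_yaw(theta_gt, theta_anchor):
    """Encode the yaw target as (sine regression target, direction bin)."""
    theta_d = theta_gt - theta_anchor                       # Eq. (1)
    wrapped = (theta_d + np.pi) % (2 * np.pi) - np.pi       # map into [-pi, pi)
    c_dir = 1 if -np.pi / 2 <= wrapped < np.pi / 2 else 0   # Eq. (2)
    return np.sin(wrapped), c_dir
```

Note that opposite headings (0 and π relative to the anchor) produce nearly identical sine targets but opposite bins, which is exactly the ambiguity the classification head resolves.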
Due to the subdivision of the angular space by the classification head, the yaw regres-
sion needs to regress smaller angle differences, which leads to a fast learning progress. A
simple yaw regression would instead need to learn a rotation of 180 degrees to match the
ground truth bounding box. It has been shown that high regression values and discontinuities
negatively impact the network performance []. The regression and classification
losses used to estimate the yaw angle are visualized in Figure 3.
The SECOND architecture [] introduces a sine loss as well. Their subdivision of the
positive and negative half-space, however, comes with the drawback that both bounding
box configurations shown in Figure 3 would result in the same regression and classification
loss values. Our loss is able to distinguish these bounding box configurations.
Figure 3. Visualization of our yaw loss. (a) Bin classification. (b) Sine regression. The bounding boxes
in (b) are not distinguishable by the sine loss. The bin classification distinguishes these bounding
boxes, as visualized by the bold dotted line, which splits the angular space in two parts.
As the training does not learn the angle parameter directly, the regression difference is
added to the anchor angle under consideration of the classification interval output to get
the final value of the yaw angle during inference.
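The inference-time decoding described here might look as follows; this is an assumed inverse of the encoding (hypothetical function name), not a quote of the implementation:

```python
import numpy as np

def decode_yaw(sin_pred, dir_bin, theta_anchor):
    """Recover the yaw angle from the sine regression output, the
    direction bin, and the anchor angle."""
    residual = np.arcsin(np.clip(sin_pred, -1.0, 1.0))  # in [-pi/2, pi/2]
    if dir_bin == 0:
        residual = np.pi - residual                     # flip to the other half-space
    theta = theta_anchor + residual
    return (theta + np.pi) % (2 * np.pi) - np.pi        # wrap into [-pi, pi)
```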
3.4. Data Augmentation
Data augmentation techniques [] manipulate the input features of a machine learning
method to create a greater variance in the data set. Popular augmentation methods
translate or rotate the input data to generate new input data from the existing data set.
More complex data augmentation techniques include the use of Generative Adversarial
Networks [] to generate artificial data frames in the style of the existing data. Complex
data augmentation schemes are beneficial for small data sets. The used nuScenes data set
comprises about 34,000 labeled frames. Due to the relatively large data set, we limit the use
of data augmentation to rotation, translation, and scaling of the input point cloud.
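A minimal sketch of the global rotation, translation, and scaling augmentations; the parameter ranges are illustrative and not taken from the paper:

```python
import numpy as np

def augment(points, max_rot=np.pi / 8, max_trans=0.5, scale=(0.95, 1.05), rng=None):
    """Global augmentation of the input point cloud. Note: a full
    implementation would also rotate the radar velocity features."""
    rng = rng if rng is not None else np.random.default_rng()
    theta = rng.uniform(-max_rot, max_rot)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points.copy()
    out[:, :3] = out[:, :3] @ R.T                         # rotate around the z axis
    out[:, :3] += rng.uniform(-max_trans, max_trans, 3)   # global translation
    out[:, :3] *= rng.uniform(*scale)                     # global scaling
    return out
```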
3.5. Inclement Weather
We expect the fusion of different sensor modalities to be most beneficial in inclement
weather, which deteriorates the quality of the output of lidar and camera sensors. We
analyze the nuScenes data set for frames captured in such environmental conditions. At the
same time, we make sure that enough input data, in conjunction with data augmentation,
are available for the selected environmental conditions to realize good generalization for
the trained networks. We filter the official nuScenes training and validation sets for samples
recorded in rain or night conditions. Further subsampling for challenging conditions such
as fog is not possible for the currently available data sets. The number of samples for
each split is shown in Table 1. We expect the lidar quality to deteriorate in the rain scenes,
whereas the camera quality should deteriorate in both rain and night scenes. The radar
detection quality should be unaffected by the environmental conditions.
Table 1. Training and validation splits for different environment conditions. The table only considers
samples in which at least one car is present in the field of view of the front camera.
Data Set Split Training Samples Validation Samples
nuScenes 19,659 4278
Rain 2289 415
Night 4460 788
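Such condition splits can be derived from the free-text description that nuScenes scene records carry; the keyword matching below is an assumption about how the filtering can be realized, not a quote of the actual filter:

```python
def split_by_condition(scenes):
    """Assign each scene to a rain / night / other split based on its
    free-text description. A scene tagged with both rain and night lands
    in the rain split in this simple sketch."""
    splits = {"rain": [], "night": [], "other": []}
    for scene in scenes:
        desc = scene["description"].lower()
        if "rain" in desc:
            splits["rain"].append(scene["name"])
        elif "night" in desc:
            splits["night"].append(scene["name"])
        else:
            splits["other"].append(scene["name"])
    return splits
```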
3.6. Distance Threshold
Similar to [], we argue that an IoU-based threshold is not the optimal choice for 3D
object detection. We use both an IoU-based and a distance-based threshold to distinguish
between the positive, negative, and ignore bounding box anchors. For our proposed
network, the positive IoU-threshold is empirically set to 35% and the negative threshold is
set to 30%. The distance threshold is set to 0.5 m.
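A sketch of the combined matching criteria with the thresholds above. The axis-aligned BEV IoU stands in for the rotated IoU of the real implementation, and combining the two positive criteria with a logical OR is our assumption:

```python
import math

def bev_iou(a, b):
    """Axis-aligned BEV IoU between boxes given as (cx, cy, w, l);
    a simplified stand-in for rotated-box IoU."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def match_anchor(anchor, gt, iou_pos=0.35, iou_neg=0.30, dist_thr=0.5):
    """Label an anchor positive / negative / ignore using both the IoU
    and the center-distance criteria."""
    iou = bev_iou(anchor, gt)
    dist = math.hypot(anchor[0] - gt[0], anchor[1] - gt[1])
    if iou >= iou_pos or dist <= dist_thr:
        return "positive"
    if iou < iou_neg:
        return "negative"
    return "ignore"
```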
3.7. Simulated Depth Camera
To simplify the software fusion scheme and to lower the cost of the sensor setup, the lidar
and camera sensors could be replaced by a depth or stereo camera setup. Even though the
detection performance of stereo vision does not match that of lidar, recent developments
show promising progress in this field []. The relative accuracy of stereo methods is higher
for close-range objects, where high accuracy is of greater importance for the planning of the
driving task. The nuScenes data set was chosen for evaluation since it is the only feasible
public data set that contains labeled point cloud radar data. However, stereo camera data
are not included in the nuScenes data set.
In comparison to lidar data, stereo camera data are denser and contain the color
of objects. To simulate a stereo camera, we use the IP-Basic algorithm []
to approximate a denser depth image from the sparser lidar point cloud. The IP-Basic
algorithm estimates additional depth measurements from lidar pixels, so that additional
camera data can be used for the detection. The depth of these estimated pixels is less
accurate than that of the lidar sensor, which is in compliance with the fact that stereo
camera depth estimation is also more error-prone than that of lidar [36,37].
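A greatly simplified, dilation-style sketch of such depth completion; IP-Basic itself inverts the depth and uses shaped kernels, which this illustration omits, so treat it as a toy stand-in:

```python
import numpy as np

def densify_depth(depth, kernel=3, iters=2):
    """Morphological-dilation style hole filling on a sparse depth image:
    empty pixels (value 0) take the maximum depth found in their local
    neighborhood; repeated passes grow the filled region."""
    d = depth.copy()
    pad = kernel // 2
    for _ in range(iters):
        padded = np.pad(d, pad, mode="constant")
        windows = np.lib.stride_tricks.sliding_window_view(padded, (kernel, kernel))
        dilated = windows.max(axis=(2, 3))   # neighborhood maximum per pixel
        d = np.where(d > 0, d, dilated)      # only fill previously empty pixels
    return d
```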
Our detection pipeline looks for objects in the surroundings of up to 50 m from the
ego vehicle, so that the stereo camera simulation by the lidar is justified, as production
stereo cameras can provide reasonable accuracy in this sensor range []. An alternative
approach would be to learn the depth of the monocular camera images directly. An
additional study [] showed that the state of the art algorithms in this field [] are not
robust enough to create an accurate depth estimation for the whole scene for a subsequent
fusion. Although the visual impression of monocular depth images seems promising, the
disparity measurement of stereo cameras results in a better depth estimation.
3.8. Sensor Fusion
By simulating depth information for the camera, we can investigate the influence
of four different sensors for the overall detection score: radar, camera, simulated depth
camera, and lidar. In addition to the different sensors, consecutive time steps of radar
and lidar sensors are concatenated to increase the data density. While the nuScenes data
set allows concatenating up to 10 lidar sweeps on the official score board, we limit our
network to using the past 3 radar and lidar sweeps. While using more sweeps may be
beneficial for the overall detection score through the higher data density for static objects,
more sweeps add significant inaccuracies for the position estimate of moving vehicles,
which are of greater interest for a practical use case.
As discussed in our main conclusions from the state of the art in Section 3, we fuse
the different sensor modalities in an early fusion scheme. In particular, we fuse lidar and
camera data by projecting the lidar data into the image space, where the lidar points serve
as a mask to associate the color of the camera image with the 3D points.
To implement the simulated depth camera, we first apply the IP-Basic algorithm to
the lidar input point cloud to approximate the depth of the neighborhood area of the lidar
points to generate a more dense point cloud. The second step is the same as in the lidar and
camera fusion, where the newly created point cloud serves as a mask to create the dense
depth color image.
The radar, lidar, and simulated depth camera data all originate from a continuous
3D space. The data are then fused together in a discrete voxel representation before they
are processed with the network presented in Section 3.2. The first layers of the network
compress the input data to discrete voxel features. The maximum number of points per
voxel is limited to 40 for computational efficiency. As the radar data are much sparser than the lidar data, radar points are kept preferentially in the otherwise random downsampling process to ensure that they contribute to the fusion result and that their density is not reduced further.
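The radar-prioritized voxel sampling can be sketched as follows; this is a simplified illustration, not the released implementation, and the dictionary-based grouping is chosen for clarity rather than speed:

```python
import numpy as np

def voxelize(points, is_radar, voxel_size=(0.2, 0.2, 0.4), max_points=40, seed=0):
    """Group points into voxels, capping each voxel at `max_points`.

    Radar points are kept preferentially: when a voxel overflows, only the
    lidar points are randomly subsampled, so the sparse radar data are never
    discarded in favor of the much denser lidar returns.
    points: (N, D) array whose first three columns are x, y, z.
    is_radar: (N,) boolean mask flagging radar points.
    Returns a dict mapping voxel index tuples to (<=max_points, D) arrays.
    """
    rng = np.random.default_rng(seed)
    idx = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(int)
    voxels = {}
    for key in map(tuple, np.unique(idx, axis=0)):
        mask = np.all(idx == key, axis=1)
        radar_pts = points[mask & is_radar]
        lidar_pts = points[mask & ~is_radar]
        budget = max_points - len(radar_pts)           # radar points go in first
        if len(lidar_pts) > budget:
            keep = rng.choice(len(lidar_pts), size=max(budget, 0), replace=False)
            lidar_pts = lidar_pts[keep]
        voxels[key] = np.vstack([radar_pts, lidar_pts])
    return voxels
```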
After the initial fusion step, the data are processed in the RadarVoxelFusionNet in
the same fashion, independent of which data type was used. This modularity is used to
compare the detection result of different sensor configurations.
3.9. Training
The network is trained with an input voxel size of 0.2 m for the dimensions parallel to the ground. The voxel size in the height direction is 0.4 m.
Similar to the nuScenes split, we limit the sensor detection and evaluation range to 50 m in front of the vehicle and further to 20 m on either side to cover the principal area of interest for driving. The sensor fusion is performed for the front camera, the front radar, and the lidar sensor of the nuScenes data set.
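A sketch of the resulting region-of-interest crop and the implied BEV grid size, assuming x points forward and y to the left in the ego frame:

```python
import numpy as np

def crop_to_roi(points):
    """Keep only points inside the evaluation range: 0-50 m ahead (x)
    and +/- 20 m to the side (y), matching the setup described above."""
    x, y = points[:, 0], points[:, 1]
    mask = (x >= 0.0) & (x <= 50.0) & (np.abs(y) <= 20.0)
    return points[mask]

# BEV grid implied by the 0.2 m voxel size: 250 x 200 cells.
nx, ny = round(50.0 / 0.2), round(40.0 / 0.2)
```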
The classification outputs are learned via a binary cross entropy loss. The regression
values are learned via a smooth L1 loss [42]. The training is performed on the official
nuScenes split. We further filter for samples that include at least one vehicle in the sensor
area to save training resources for samples where no object of interest is present. Training
and evaluation are performed for the nuScenes car class. Each network is trained on an
NVIDIA Titan Xp graphics card for 50 epochs or until overfitting can be deduced from the
validation loss curves.
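The two loss terms can be written compactly; the following is a generic NumPy sketch of binary cross entropy and smooth L1 (the weighting between the terms is a training hyperparameter not stated here):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber) loss used for the box regression targets."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()

def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross entropy used for the anchor classification output."""
    prob = np.clip(prob, eps, 1 - eps)
    return -(label * np.log(prob) + (1 - label) * np.log(1 - prob)).mean()
```

Smooth L1 behaves quadratically for small residuals and linearly for large ones, which keeps the regression robust against outlier boxes.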
4. Results
The model performance is evaluated with the average precision (AP) metric as defined
by the nuScenes object detection challenge [1]. Our baseline is a VoxelNet-style network
with lidar data as the input source. All networks are trained with our novel yaw loss and
training strategies, as described in Section 3.
4.1. Sensor Fusion
Table 2 shows the results of the proposed model with different input sensor data. The
networks have been trained several times to rule out that the different AP scores are caused
by random effects. The lidar baseline outperforms the radar baseline by a great margin.
This is expected as the data density and accuracy of the lidar input data are higher than
that of the radar data.
The fusion of camera RGB and lidar data does not result in an increased detection
accuracy for the proposed network. We assume that this is due to the increased complexity
that the additional image data brings into the optimization process. At the same time, the
additional color feature does not distinguish vehicles from the background, as the same
colors are also widely found in the environment.
The early fusion of radar and lidar data increases the network performance against the
baseline. The fusion of all three modalities increases the detection performance by a greater
margin for most of the evaluated data sets. Only for night scenes, where the camera data
deteriorates most, does the fusion of lidar and radar outperform the RVF-Net. Example
detection results in the BEV perspective from the lidar, RGB input, and the RVF-Net input
are compared in Figure 4.
Table 2. AP scores for different environment (data) and network configurations on the respective validation data set.
Network Input nuScenes Rain and Night Rain Night
Lidar 52.18% 50.09% 43.94% 63.56%
Radar 17.43% 16.00% 16.42% 22.46%
Lidar, RGB 49.96% 46.59% 42.72% 61.66%
Lidar, Radar 54.18% 53.10% 47.51% 68.01%
Lidar, RGB, Radar (RVF-Net) 54.86% 53.12% 48.32% 67.39%
Simulated Depth Cam 48.02% 46.07% 39.07% 57.33%
Simulated Depth Cam, Radar 52.06% 48.31% 41.65% 61.04%
Figure 4. BEV of the detection results: lidar and RGB fusion in the top row; RVF-Net fusion in the bottom row. Detected bounding boxes in orange. Ground truth bounding boxes in black. Lidar point cloud in blue. Radar point cloud and measured velocity in green. The street is shown in gray. (a) Only RVF-Net is able to detect the vehicle from the point cloud. (b) RVF-Net detects both bounding boxes. (c) RVF-Net detects both bounding boxes; however, the detected box on the right has a high translation and rotation error towards the ground truth bounding box.
The simulated depth camera approach does not increase the detection performance.
The approach adds additional input data by depth-completing the lidar points. However,
the information content of these data cannot compensate for the increased complexity introduced by their addition.
The absolute AP scores between the different columns of Table 2 cannot be compared
since the underlying data varies between the columns. The data source has the greatest
influence on the performance of machine learning models. All models score significantly higher on the night scenes split than on the other splits. This is most likely due to
the lower complexity of the night scenes present in the data set.
The relative performance gain of different input data within each column provides a valid comparison of the fusion methods, since the networks are trained and evaluated on the same
data. The radar data fusion of the RVF-Net outperforms the lidar baseline by 5.1% on the
nuScenes split, while it outperforms the baseline on the rain split by 10.0% and on the night
split by 6.0%. The increased performance of the radar fusion is especially notable for the
rain split where lidar and camera data quality is limited. The fusion of lidar and radar is
also especially beneficial for night scenes, even though the lidar data quality should not be
affected by these conditions.
4.2. Ablation Studies
This section evaluates additional training configurations of our proposed RVF network
to measure the influence of the proposed training strategies. Table 3 shows an overview of
the results.
To study the effect of the introduced yaw loss, we measure the Average Orientation
Error (AOE) as introduced by nuScenes. The novel loss reduces the orientation error by
about 40% from an AOE of 0.5716 with the old loss to an AOE of 0.3468 for the RVF-Net.
At the same time, our novel yaw loss increases the AP score of RVF-Net by 4.1%. Even though the orientation of the predicted bounding boxes does not directly impact
the AP calculation, the simpler regression for the novel loss also implicitly increases the
performance for the additional regression targets.
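The paper's yaw loss is defined in Section 3.3 and is not reproduced here; the following sketch illustrates the underlying idea with the widely used sine-based regression plus direction classification scheme (as in SECOND [31]), which likewise removes the jump at ±π:

```python
import numpy as np

def yaw_loss(pred_yaw, target_yaw, pred_dir_logit):
    """Sketch of a discontinuity-free orientation loss.

    The sine term is continuous across the +/- pi wrap-around but is blind
    to 180-degree flips; the binary direction classifier recovers that
    ambiguity. Formulation illustrative, not the paper's exact loss.
    """
    # sin(a - b) vanishes whenever the angles differ by a multiple of pi,
    # so the regression target has no jump at the wrap-around.
    reg = np.abs(np.sin(pred_yaw - target_yaw)).mean()
    # Direction classifier: is the target yaw in [0, pi) modulo 2*pi?
    dir_label = (np.mod(target_yaw, 2 * np.pi) < np.pi).astype(float)
    prob = 1.0 / (1.0 + np.exp(-np.asarray(pred_dir_logit, dtype=float)))
    cls = -(dir_label * np.log(prob) + (1 - dir_label) * np.log(1 - prob)).mean()
    return reg + cls
```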
Data augmentation has a significant positive impact on the AP score.
Contrary to the literature results, the combined IoU and distance threshold decreases
the network performance in comparison to a simple IoU threshold configuration. It is up
to further studies to find the reason for this empirical finding.
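For reference, the two anchor matching variants compared here can be sketched as follows; all thresholds are illustrative rather than the paper's values, and the exact form of the combined criterion is an assumption:

```python
import numpy as np

def assign_anchors(best_iou, center_dist, pos_iou=0.6, neg_iou=0.45, max_dist=None):
    """Positive/negative anchor labeling sketch.

    best_iou:    (A,) best IoU of each anchor with any ground-truth box.
    center_dist: (A,) BEV center distance to that box in meters.
    With max_dist=None this is the plain IoU-threshold scheme. Passing a
    distance implements one form of a combined criterion: anchors close to
    a ground-truth center also count as positive despite low IoU.
    """
    positive = best_iou >= pos_iou
    if max_dist is not None:
        positive |= center_dist <= max_dist
    negative = (best_iou < neg_iou) & ~positive
    return positive, negative
```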
We have performed additional experiments with 10 lidar sweeps as the input data.
While the sweep accumulation for static objects is not problematic since we compensate
for ego-motion, the point clouds of moving objects are heavily blurred when considering
10 sweeps of data, as the motion of other vehicles cannot be compensated. Nonetheless,
the detection performance increases slightly for the RVF-Net sensor input.
For a speed comparison, we also attempted a training run with standard (non-sparse) convolutions.
However, this configuration could not be trained on our machine since the non-sparse
network is too large and triggers an out-of-memory (OOM) error.
Table 3. AP scores for different training configurations on the validation data set.
Network nuScenes
RVF-Net 54.86%
RVF-Net, simple yaw loss 52.69%
RVF-Net, without augmentation 50.68%
RVF-Net, IoU threshold only 55.93%
RVF-Net, 10 sweeps 55.25%
RVF-Net, standard convolutions OOM error
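The OOM result is plausible from a rough estimate: a dense voxel feature tensor over the detection area is large, and a network keeps many such tensors (per layer, per batch element, plus gradients), while sparse convolutions store only the occupied voxels. The height extent, channel count, and occupancy below are illustrative assumptions:

```python
# 50 m x 40 m at 0.2 m resolution, assumed 4 m height extent at 0.4 m voxels,
# 64 float32 channels -- assumptions for illustration, not values from the paper.
nx, ny, nz, channels = 250, 200, 10, 64
dense_bytes = nx * ny * nz * channels * 4          # float32
occupied = 30_000                                  # assumed occupied voxels
sparse_bytes = occupied * channels * 4
print(dense_bytes // 10**6, "MB dense vs", sparse_bytes // 10**6, "MB sparse")
```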
4.3. Inference Time
The inference time of the network for different input data configurations is shown in
Table 4. The GPU processing time per sample is averaged over all samples of the validation
split. In comparison to the lidar baseline, the RVF-Net fusion increases the processing
time only slightly. The different configurations are suitable for a real-time application with input data rates of up to 20 Hz. The processing time increases for the simulated depth
camera input data configuration as the number of points is drastically increased by the
depth completion.
Table 4. Inference times of different sensor input configurations on the NVIDIA Titan Xp GPU.
Network Input Inference Time
Lidar 0.042 s
Radar 0.020 s
Lidar, RGB 0.045 s
Lidar, Radar 0.044 s
RVF-Net 0.044 s
Simulated Depth Cam, Radar 0.061 s
RVF-Net, 10 sweeps 0.063 s
4.4. Early Fusion vs. Late Fusion
The effectiveness of the neural network early fusion approach is further evaluated
against a late fusion scheme for the respective sensors. The detections of the lidar, RGB, and radar input configurations are fused with a UKF and a Euclidean-distance-based matching algorithm
to generate the final detection output. This late fusion output is compared against the early
fusion RVF-Net and lidar detection results, which are individually tracked with the UKF
to enable comparability. The late fusion tracks objects over consecutive time steps and
requires temporal coherence for the processed samples, which is only given for the samples
within a scene but not over the whole data set. Table 5 shows the resulting AP score for
10 randomly sampled scenes to which the late fusion is applied. The sampling is done to
lower the computational and implementation effort, and no manual scene selection in favor of or against the fusion method was performed. The evaluation shows that the late fusion
detection leads to a worse result than the early fusion. Notably, the tracked lidar detection
outperforms the late fusion approach as well. As the radar-only detection accuracy is
relatively poor and its measurement noise does not comply with the zero-mean assumption
of the Kalman filter, fusing these data with the lidar data leads to worse results. In contrast
to the early fusion where the radar features increased the detection score, the late fusion
scheme processes the two input sources independently and the detection results cannot
profit from the complementary features of the different sensors. In this paper, the UKF
tracking serves as a fusion method to obtain detection metrics for the late fusion approach.
It is important to note that for an application in autonomous driving, object detections need to be tracked independently of the data source, for example with a Kalman filter, to
create a continuous detection output. The evaluation of further tracking metrics will be
performed in a future paper.
Table 5. AP scores of the tracked sensor inputs. The early fusion RVF-Net outperforms the late fusion by a great margin.
Network nuScenes
Tracked Lidar 40.01%
Tracked Late Fusion 33.29%
Tracked Early Fusion (RVF-Net) 47.09%
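The Euclidean-distance matching step of the late fusion baseline can be sketched as follows; the UKF update itself is omitted, and the 2 m gate and greedy strategy are illustrative assumptions:

```python
import numpy as np

def associate_detections(dets_a, dets_b, max_dist=2.0):
    """Greedy Euclidean-distance association of two BEV detection lists.

    dets_a, dets_b: (N, 2) arrays of detection centers (x, y).
    Returns matched index pairs plus the unmatched indices of both lists;
    unmatched detections would spawn new tracks or count as misses.
    """
    dets_a, dets_b = np.asarray(dets_a), np.asarray(dets_b)
    pairs, used_b = [], set()
    for i, a in enumerate(dets_a):
        dists = np.linalg.norm(dets_b - a, axis=1)
        for j in np.argsort(dists):
            if int(j) not in used_b and dists[j] <= max_dist:
                pairs.append((i, int(j)))
                used_b.add(int(j))
                break
    unmatched_a = [i for i in range(len(dets_a)) if i not in {p[0] for p in pairs}]
    unmatched_b = [j for j in range(len(dets_b)) if j not in used_b]
    return pairs, unmatched_a, unmatched_b
```

Because this scheme treats the single-sensor outputs as independent detections, the poor radar-only results directly enter the association, unlike in the early fusion, where radar features only support the joint prediction.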
5. Discussion
The RVF-Net early fusion approach proves its effectiveness by outperforming the lidar
baseline by 5.1%. Additional measures have been taken to increase the overall detection
score. Data augmentation especially increased the AP score for all networks. The novel loss,
introduced in Section 3.3, improves both the AP score and notably the orientation error
of the networks. Empirically, the additional classification loss mitigates the discontinuity
problem in the yaw regression, even though classifications are discontinuous decisions on
their own.
Furthermore, the paper shows that the early fusion approach is especially beneficial
in inclement weather conditions. The radar features, while not being dense enough for an
accurate object detection on their own, contribute positively to the detection result when
processed with an additional sensor input. It is interesting to note that the addition of RGB data increases the performance when lidar and radar are also fused (RVF-Net), while it does not increase the performance of the lidar and RGB fusion. We assume that the
early fusion performs most reliably when more different input data and interdependencies
are present. In addition to increasing robustness and enabling autonomous driving in
inclement weather scenarios, we assume that early fusion schemes can be advantageous
for special use cases such as mining applications, where dust oftentimes limits lidar and
camera detection ranges.
When comparing our network to the official detection scores on the nuScenes data
set, we have to take into account that our approach is evaluated on the validation split
and not on the official test split. The hyperparameters of the network, however, were not
optimized on the validation split, so that it serves as a valid test set. We assume that the
complexity of the data in the frontal field of view does not differ significantly from the full
360 degree view. We therefore assume that the detection AP of our approach is comparable to the scores provided by other authors on the validation split. To benchmark our network
on the test split, a 360 degree coverage of the input data would be needed. Though there
are no conceptual obstacles in the way, we decided against the additional implementation
overhead due to the general shortcomings of the radar data provided in the nuScenes data
set [7] and no expected new insights from the additional sensor coverage. The validation
split suffices to evaluate the applicability of the proposed early fusion network.
On the validation split, our approach outperforms several single sensor or fusion
object detection algorithms. For example, it surpasses the CenterFusion approach [19], which achieves 48.4% AP for the car class on the nuScenes validation split. In the literature, only Wang et al. [6] fuse all three sensor modalities. Our fusion approach surpasses their score of 45% AP on
the validation split and 48% AP on the test split.
On the other hand, further object detection methods, such as the leading lidar-only
method CenterPoint [14], outperform even our best network in the ablation studies by a great margin. The two-stage network uses center points to match detection candidates and
performs an additional bounding box refinement to achieve an AP score of 87% on the
test split.
When analyzing the errors in our predictions, we see that the regressed parameters of
the predicted bounding boxes are not as precise as the ones of state-of-the-art networks.
The validation loss curves for our network are shown in Figure 5. The classification loss
overfits before the regression loss converges. Further studies are needed to better balance the losses. One approach could be to first train only the regression and direction losses and train the classification loss in a second stage. Additionally, further experiments will be performed to fine-tune the anchor matching thresholds to the data set
to get a better detection result. The tuning of this outer optimization loop requires access
to extensive GPU power to find optimal hyperparameters. For future work, we expect the
hyperparameters to influence the absolute detection accuracy greatly as simple strategies
such as data augmentation could already improve the overall performance. The focus of
this work lies in the evaluation of different data fusion inputs relative to a potent baseline
network. For this evaluation, we presented extensive evidence to motivate our fusion scheme and network parameterization.
The simulated depth camera did not provide a better detection result than the lidar-only detection. This and the late fusion approach show that a simple assumption along the lines of "more sensor data, better detection results" does not hold true. The complexity introduced by the additional data decreased the overall detection result. The decision for an early fusion system therefore depends on the sensors used and the quality of their data. For all investigated sub data sets, we found that early fusion of radar and lidar
data is beneficial for the overall detection result. Interestingly, the usage of 10 lidar sweeps
increased the detection performance of the fusion network over the proposed baseline. This
result occurred despite the fact that the accumulated lidar data leads to blurry contours for
moving objects in the input data. This is especially disadvantageous for objects moving
at a high absolute speed. For practical applications, we therefore use only three sweeps
in our network, as the positions of moving objects are of special interest for autonomous
driving. The established metrics for object detection do not account for the importance of
surrounding objects. We assume that the network trained with 10 sweeps performs worse
in practice, despite its higher AP score. Further research needs to be performed to establish
a detection metric tailored for autonomous driving applications.
Figure 5. Loss values of the RVF-Net. The classification loss starts to overfit around epoch 30, while the regression and direction losses continue to converge.
The sensors used in the data set do not record the data synchronously. This creates
an additional ambiguity in the input data between the position information inferred from
the lidar and from the radar data. The network training should compensate for this effect
partially; however, we expect the precision of the fusion to increase when synchronized
sensors are available.
This paper focuses on an approach for object detection. Tracking/prediction is applied
as a late fusion scheme or as a subsequent processing step to the early fusion. In contrast,
LiRaNet [44] performs a combined detection and prediction of objects from the sensor data.
We argue that condensed scene information, such as object and lane positions, traffic rules, etc., is more suitable for the prediction task in practice. A decoupled detection, tracking,
and prediction pipeline increases the interpretability of all modules to facilitate validation
for real-world application in autonomous driving.
6. Conclusions and Outlook
In this paper, we have developed an early fusion network for lidar, camera, and radar
data for 3D object detection. This early fusion network outperforms both the lidar baseline
and the late fusion of lidar, camera, and radar data on a public autonomous driving data set.
In addition, we integrated a novel loss for the yaw angle regression to mitigate the effect
of the discontinuity of a simple yaw regression target. We provide a discussion about the
advantages and disadvantages of the proposed network architecture. Future steps include the extension of the fusion scheme to a full 360 degree view and the optimization of hyperparameters to balance the losses for further convergence of the regression losses.
We have made the code for the proposed network architectures and the interface
to the nuScenes data set available to the public. The repository can be found online at (accessed on 16 June 2021).
Author Contributions:
F.N., as the first author, initiated the idea of this paper and contributed
essentially to its concept and content. Conceptualization, F.N.; methodology, F.N. and E.S.; software,
E.S., F.N. and P.K.; data curation, E.S. and F.N.; writing—original draft preparation, F.N.; writing—
review and editing, E.S., P.K., J.B. and M.L.; visualization, F.N. and E.S.; project administration, J.B.
and M.L. All authors have read and agreed to the published version of the manuscript.
Funding: We express gratitude to Continental Engineering Services for funding the underlying research project.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. arXiv 2019, arXiv:1903.11027.
2. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets Robotics: The KITTI Dataset. Int. J. Robot. Res. (IJRR) 2013. [CrossRef]
3. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. arXiv 2019, arXiv:1912.13192.
4. Julier, S.J.; Uhlmann, J.K. New extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, and Target Recognition VI; SPIE Proceedings; Kadar, I., Ed.; SPIE: Bellingham, WA, USA, 1997; p. 182. [CrossRef]
5. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29. [CrossRef]
6. Wang, L.; Chen, T.; Anklam, C.; Goldluecke, B. High Dimensional Frustum PointNet for 3D Object Detection from Camera, LiDAR, and Radar. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1621–1628. [CrossRef]
7. Nobis, F.; Fent, F.; Betz, J.; Lienkamp, M. Kernel Point Convolution LSTM Networks for Radar Point Cloud Segmentation. Appl. Sci. 2021, 11, 2599. [CrossRef]
8. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
9. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
10. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 91–99.
12. Graham, B.; van der Maaten, L. Submanifold Sparse Convolutional Networks. arXiv 2017, arXiv:1706.01307.
13. Graham, B.; Engelcke, M.; van der Maaten, L. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. arXiv 2017, arXiv:1711.10275.
14. Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D Object Detection and Tracking. arXiv 2020, arXiv:2006.11275.
15. Chadwick, S.; Maddern, W.; Newman, P. Distant Vehicle Detection Using Radar and Vision. arXiv 2019, arXiv:1901.10951.
16. Nobis, F.; Geisslinger, M.; Weber, M.; Betz, J.; Lienkamp, M. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection. In Proceedings of the 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Bonn, Germany, 15–17 October 2019; pp. 1–7. [CrossRef]
17. Kowol, K.; Rottmann, M.; Bracke, S.; Gottschalk, H. YOdar: Uncertainty-Based Sensor Fusion for Vehicle Detection with Camera and Radar Sensors. arXiv 2020, arXiv:2010.03320.
18. Kim, J.; Kim, Y.; Kum, D. Low-level Sensor Fusion Network for 3D Vehicle Detection using Radar Range-Azimuth Heatmap and Monocular Image. In Proceedings of the Asian Conference on Computer Vision (ACCV) 2020, Kyoto, Japan, 30 November–4 December 2020.
19. Nabati, R.; Qi, H. CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection. arXiv 2020, arXiv:2011.04841.
20. Lim, T.Y.; Ansari, A.; Major, B.; Daniel, F.; Hamilton, M.; Gowaikar, R.; Subramanian, S. Radar and Camera Early Fusion for Vehicle Detection in Advanced Driver Assistance Systems. In Proceedings of the Machine Learning for Autonomous Driving Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019.
21. Kim, Y.; Choi, J.W.; Kum, D. GRIF Net: Gated Region of Interest Fusion Network for Robust 3D Object Detection from Radar Point Cloud and Monocular Image. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021.
22. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018.
23. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-View 3D Object Detection Network for Autonomous Driving. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
24. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
25. Yang, B.; Guo, R.; Liang, M.; Casas, S.; Urtasun, R. RadarNet: Exploiting Radar for Robust Perception of Dynamic Objects. arXiv 2020, arXiv:2007.14366.
26. Daniel, L.; Phippen, D.; Hoare, E.; Stove, A.; Cherniakov, M.; Gashinova, M. Low-THz Radar, Lidar and Optical Imaging through Artificially Generated Fog. In Proceedings of the International Conference on Radar Systems (Radar 2017), Belfast, Ireland, 23–26 October 2017; The Institution of Engineering and Technology: Stevenage, UK, 2017. [CrossRef]
27. Hebel, M.; Hammer, M.; Arens, M.; Diehm, A.L. Mitigation of crosstalk effects in multi-LiDAR configurations. In Proceedings of the Electro-Optical Remote Sensing XII, Berlin, Germany, 12–13 September 2018; Kamerman, G., Steinvall, O., Eds.; SPIE: Bellingham, WA, USA, 2018; p. 3. [CrossRef]
28. Kim, G.; Eom, J.; Park, Y. Investigation on the occurrence of mutual interference between pulsed terrestrial LIDAR scanners. In Proceedings of the 2015 IEEE Intelligent Vehicles Symposium (IV), Seoul, Korea, 28 June–1 July 2015; pp. 437–442. [CrossRef]
29. Ng, A. A Chat with Andrew on MLOps: From Model-Centric to Data-Centric AI. Available online: https://www.youtube.com/watch?v=06-AZXmwHjo (accessed on 16 June 2021).
30. Zhou, Y.; Barnes, C.; Lu, J.; Yang, J.; Li, H. On the Continuity of Rotation Representations in Neural Networks. arXiv 2018, arXiv:1812.07035.
31. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [CrossRef]
32. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6. [CrossRef]
33. Sandfort, V.; Yan, K.; Pickhardt, P.J.; Summers, R.M. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Sci. Rep. 2019, 9, 16884. [CrossRef]
34. You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving. arXiv 2019, arXiv:1906.06310.
35. Ku, J.; Harakeh, A.; Waslander, S.L. In Defense of Classical Image Processing: Fast Depth Completion on the CPU. In Proceedings of the 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, 8–10 May 2018.
36. Chen, Y.; Cai, W.L.; Zou, X.J.; Xu, D.F.; Liu, T.H. A Research of Stereo Vision Positioning under Vibration. Appl. Mech. Mater. 44–47, 1315–1319. [CrossRef]
37. Fan, R.; Wang, L.; Bocus, M.J.; Pitas, I. Computer Stereo Vision for Autonomous Driving. arXiv 2020, arXiv:2012.03194.
38. Texas Instruments. Stereo Vision-Facing the Challenges and Seeing the Opportunities for ADAS (Rev. A). Available online: https:// (accessed on 16 June 2021).
39. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. arXiv 2018, arXiv:1812.07179.
40. Nobis, F.; Brunhuber, F.; Janssen, S.; Betz, J.; Lienkamp, M. Exploring the Capabilities and Limits of 3D Monocular Object Detection—A Study on Simulation and Real World Data. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–8. [CrossRef]
41. Zhao, C.; Sun, Q.; Zhang, C.; Tang, Y.; Qian, F. Monocular depth estimation based on deep learning: An overview. Sci. China Technol. Sci. 2020, 63, 1612–1627. [CrossRef]
42. Girshick, R. Fast R-CNN. In Proceedings of the ICCV 2015, Santiago, Chile, 7–13 December 2015.
43. Scheiner, N.; Schumann, O.; Kraus, F.; Appenrodt, N.; Dickmann, J.; Sick, B. Off-the-shelf sensor vs. experimental radar—How much resolution is necessary in automotive radar classification? arXiv 2020, arXiv:2006.05485.
44. Shah, M.; Huang, Z.; Laddha, A.; Langford, M.; Barber, B.; Zhang, S.; Vallespi-Gonzalez, C.; Urtasun, R. LiRaNet: End-to-End Trajectory Prediction using Spatio-Temporal Radar Fusion. arXiv 2020, arXiv:2010.00731.
... In contrast, point-based object detection networks do not suffer from this problem, as they extract pointwise features and maintain the exact position of points. However, grid-based methods often provide a very good detection performance [5], [7], [13], [14] and outperformed point-based models in this regard in our experiments. ...
... Clearly, neighborhood context can still be exchanged e.g. by applying convolutional layers on the grid, but the context of points within the cells is already aggregated in an early stage. [6], [13], [14], [23] project the point cloud directly to a grid. The handcrafted, point-wise features (such as radar cross section RCS or radial velocity v r ) are directly aggregated, for example using maxor mean-pooling, if multiple points fall into the same grid cell. ...
... [7] further extend the grid rendering by applying a self-attention mechanism to the cell-wise learnt features. [26] [6], [13], [14], [23] point-wise learned features [8], [27] [5], [7], [25] learned features of points and neighborhood [9], [28], [29] this work ...
This paper presents novel hybrid architectures that combine grid- and point-based processing to improve the detection performance and orientation estimation of radar-based object detection networks. Purely grid-based detection models operate on a bird's-eye-view (BEV) projection of the input point cloud. These approaches suffer from a loss of detailed information through the discrete grid resolution. This applies in particular to radar object detection, where relatively coarse grid resolutions are commonly used to account for the sparsity of radar point clouds. In contrast, point-based models are not affected by this problem as they continuously process point clouds. However, they generally exhibit worse detection performances than grid-based methods. We show that a point-based model can extract neighborhood features, leveraging the exact relative positions of points, before grid rendering. This has significant benefits for a following convolutional detection backbone. In experiments on the public nuScenes dataset our hybrid architecture achieves improvements in terms of detection performance and orientation estimates over networks from previous literature.
... The output on the dataset shows that radar and camera fusion perform better than LiDAR and camera fusion. Nobis et al. [84] proposed a LiDAR, camera, and radar fusion model, RadarVoxelFusionNet (RVF-Net), for 3D object detection. The LiDAR data points are projected into the image space and fused with camera images to simulate the depth camera and generate 3D points. ...
... This operation causes information loss. However, works like [84] directly fused camera, LiDAR, and radar inputs without initial processing. ...
Full-text available
p>Autonomous driving requires accurate, robust, and fast decision-making perception systems to understand the driving environment. Object detection is critical in allowing the perception system to understand the environment. The perception systems, especially 2D object detection and classification, have succeeded because of the emergence of deep learning (DL) in computer vision (CV) applications. However, 2D object detection lacks depth information, which is crucial to understanding the driving environment. Therefore, 3D object detection is fundamental for the perception system of autonomous driving and robotics applications to estimate the objects’ location and understand the driving environment. The CV community has been giving much attention recently to 3D object detection because of the growth of DL models and the need to know accurate locations of objects. However, 3D object detection is still challenging because of scale changes, the lack of 3D sensor information, and occlusions. Researchers have been using multiple sensors to solve these problems and further enhance the performance of the perception system. This survey presents the multisensor (camera, radar, and LiDAR) fusion-based 3D object detection methods. The fully autonomous vehicles need to be equipped with multiple sensors for robust and reliable driving. Camera, LiDAR, and radar sensors and their corresponding advantages and disadvantages are also presented. Then, relevant datasets are summarized, and state-of-the-art multisensor fusion-based methods are reviewed. Finally, challenges, open issues, and possible research directions are presented.</p
... In the field of target detection [35][36][37], recall and precision are mainly used as the performance measures of the algorithm. Precision (P) and recall (R) are, respectively, defined as follows: ...
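The definitions elided by the snippet are the standard detection metrics; a minimal sketch with illustrative counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 8 correct detections, 2 false alarms, 4 missed objects:
p, r = precision_recall(tp=8, fp=2, fn=4)   # p == 0.8, r == 2/3
```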
Detecting 3D objects in a crowd remains a challenging problem, since cars and pedestrians often gather together and occlude each other in the real world. Pointpillar is a leader in 3D object detection; its detection process is simple, and its detection speed is fast. Due to the use of maxpooling in the Voxel Feature Encode (VFE) stage to extract global features, the fine-grained features disappear, resulting in insufficient feature expression ability in the feature pyramid network (FPN) stage, so the detection of small targets is not accurate enough. This paper proposes to improve the detection performance of networks in complex environments by integrating attention mechanisms into Pointpillar. In the VFE stage of the model, a mixed-attention module (HA) was added to retain the spatial structure information of the point cloud to the greatest extent from three perspectives: local space, global space, and points. The Convolutional Block Attention Module (CBAM) was embedded in the FPN to mine the deep information of pseudo-images. The experiments based on the KITTI dataset demonstrated that our method performs better than other state-of-the-art single-stage algorithms. In crowd scenes, the mean average precision (mAP) under the bird's-eye view (BEV) detection benchmark increased from 59.20% for Pointpillar and 66.19% for TANet to 69.91% for ours, the mAP under the 3D detection benchmark increased from 62% for TANet to 65.11% for ours, and the detection speed only dropped from 13.1 fps for Pointpillar to 12.8 fps for ours.
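The max-pooling step the abstract refers to can be shown in a few lines; the feature values are arbitrary, and this is a sketch of the pooling operation only, not of the proposed attention modules:

```python
import numpy as np

def pillar_maxpool(point_features):
    """Collapse the per-point features of one pillar into a single
    vector by element-wise max, as in the PointPillars VFE stage;
    everything except the per-channel maximum is discarded, which is
    the fine-grained information loss the abstract describes."""
    return point_features.max(axis=0)

# Three points in one pillar, two feature channels each:
feats = np.array([[0.1, 2.0],
                  [0.7, 1.0],
                  [0.3, 3.0]])
pooled = pillar_maxpool(feats)   # array([0.7, 3.0])
```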
... • Geological data from different sources [15]. • LIDAR data with RADAR data [16,17]. • LIDAR point clouds with images [18,19]. ...
Point clouds are very common tools used in the work of documenting historic heritage buildings. These clouds usually comprise millions of unrelated points and are not presented in an efficient data structure, making them complicated to use. Furthermore, point clouds do not contain topological or semantic information on the elements they represent. Added to these difficulties is the fact that a variety of different kinds of sensors and measurement methods are used in study and documentation work: photogrammetry, LIDAR, etc. Each point cloud must be fused and integrated so that decisions can be taken based on the total information supplied by all the sensors used. A system must be devised to represent the discrete set of points in order to organise, structure and fuse the point clouds. In this work we propose the concept of multispectral voxels to fuse the point clouds, thus integrating multisensor information in an efficient data structure, and applied it to the real case of a building element in an archaeological context. The use of multispectral voxels for the fusion of point clouds integrates all the multisensor information in their structure. This allows the use of very powerful algorithms such as automatic learning and machine learning to interpret the elements studied.
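A minimal sketch of the multispectral-voxel idea described above, assuming a simple hash of quantized coordinates; the sensor names and values are placeholders, not the cited work's data model:

```python
from collections import defaultdict

import numpy as np

def fuse_to_voxels(clouds, voxel_size):
    """Hash points from several sensors into one shared voxel grid.
    `clouds` maps a sensor name to an (N, 4) array of x, y, z, value
    rows; each voxel stores the mean value per sensor, so a single
    cell carries a channel from every sensor that observed it."""
    cells = defaultdict(lambda: defaultdict(list))
    for sensor, pts in clouds.items():
        for x, y, z, val in pts:
            key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
            cells[key][sensor].append(val)
    return {key: {s: float(np.mean(v)) for s, v in per_sensor.items()}
            for key, per_sensor in cells.items()}

grid = fuse_to_voxels(
    {"lidar":     np.array([[0.2, 0.3, 0.1, 0.9], [0.4, 0.1, 0.3, 0.7]]),
     "photogram": np.array([[0.1, 0.2, 0.2, 0.5]])},
    voxel_size=0.5)
# All three points fall into cell (0, 0, 0), which now holds one
# channel per sensor: roughly {"lidar": 0.8, "photogram": 0.5}.
```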
... These point cloud based networks can be further differentiated into grid-based and point-based architectures. Grid-based approaches first render the point cloud into a 2D bird eye view (BEV) or 3D voxel grid using hand-crafted operations [11], [28], [29], [30], [31] or learned feature-encoders [32], [12], [31] and subsequently apply convolutional backbones to the grid. ...
This paper presents a method to learn the Cartesian velocity of objects using an object detection network on automotive radar data. The proposed method is self-supervised in terms of generating its own training signal for the velocities. Labels are only required for single-frame, oriented bounding boxes (OBBs). Labels for the Cartesian velocities or contiguous sequences, which are expensive to obtain, are not required. The general idea is to pre-train an object detection network without velocities using single-frame OBB labels, and then exploit the network's OBB predictions on unlabelled data for velocity training. In detail, the network's OBB predictions of the unlabelled frames are updated to the timestamp of a labelled frame using the predicted velocities and the distances between the updated OBBs of the unlabelled frame and the OBB predictions of the labelled frame are used to generate a self-supervised training signal for the velocities. The detection network architecture is extended by a module to account for the temporal relation of multiple scans and a module to represent the radars' radial velocity measurements explicitly. A two-step approach of first training only OBB detection, followed by training OBB detection and velocities is used. Further, a pre-training with pseudo-labels generated from radar radial velocity measurements bootstraps the self-supervised method of this paper. Experiments on the publicly available nuScenes dataset show that the proposed method almost reaches the velocity estimation performance of a fully supervised training, but does not require expensive velocity labels. Furthermore, we outperform a baseline method which uses only radial velocity measurements as labels.
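The core self-supervision step described above, advancing a predicted box by its predicted velocity and comparing it against the labelled frame, can be sketched as follows; this is a simplified centre-distance signal, not the paper's exact loss:

```python
def advance_box(center_xy, velocity_xy, dt):
    """Move a predicted box centre forward by velocity * dt so it can
    be compared against the labelled frame at a later timestamp."""
    return (center_xy[0] + velocity_xy[0] * dt,
            center_xy[1] + velocity_xy[1] * dt)

def velocity_loss(pred_center, pred_vel, dt, labelled_center):
    """Euclidean distance between the advanced box centre and the
    labelled box centre: the self-supervised signal for the velocity."""
    mx, my = advance_box(pred_center, pred_vel, dt)
    return ((mx - labelled_center[0]) ** 2
            + (my - labelled_center[1]) ** 2) ** 0.5

# A box at (10, 2) moving 5 m/s in x, advanced by 0.5 s, lands exactly
# on a label at (12.5, 2): zero loss, so the velocity is consistent.
loss = velocity_loss((10.0, 2.0), (5.0, 0.0), 0.5, (12.5, 2.0))   # 0.0
```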
... where x represents the object's horizontal position, and z represents the object's forward distance, both relative to the center of the camera. The discontinuity problem can be partly resolved by representing the angle through its cosine [43,44,45,46,47] or sine [48,49] alone, owing to their periodic properties. However, the cosine or sine on its own cannot be unambiguously converted back into the angle without using at least 3 bins. ...
In recent years, there has been an influx of 3D autonomous vehicle object detection algorithms. However, little attention has been paid to orientation prediction. Existing research work has proposed various prediction methods, but a holistic, conclusive review has not been conducted. Through our experiments, we categorize and empirically compare the accuracy of various existing orientation representations using the KITTI 3D object detection dataset, and propose a new form of orientation representation: Tricosine. Among these, the 2D Cartesian-based representation, or Single Bin, achieves the highest accuracy, with additional channeled inputs (positional encoding and depth map) not boosting prediction performance. Our code is published on Github:
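The two-component sine/cosine representation discussed in the snippet above, and its unambiguous inversion via atan2, can be illustrated as:

```python
import math

def encode_yaw(theta):
    """Map the yaw angle to (sin, cos): both components vary
    continuously, avoiding the jump at the +/- pi wrap-around of the
    raw angle."""
    return math.sin(theta), math.cos(theta)

def decode_yaw(s, c):
    """atan2 uses both components, so the angle is recovered without
    ambiguity; sin or cos alone would leave two candidate angles."""
    return math.atan2(s, c)

theta = 3.0                      # close to the +pi discontinuity
s, c = encode_yaw(theta)
recovered = decode_yaw(s, c)     # 3.0 (up to floating-point error)
```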
Detection of the objects surrounding a vehicle is the most crucial step in autonomous driving. Failure to identify those objects correctly and in a timely manner can cause irreparable damage, impacting our safety and society. Several studies have been introduced to identify these objects in two-dimensional (2D) and three-dimensional (3D) vector space. The 2D object detection methods have achieved remarkable success; however, in the last few years, detecting objects in 3D has received wider adoption. 3D object recognition has several advantages over 2D detection methods, as more accurate information about the environment is obtained for better detection. For example, depth is not considered in 2D detection, which reduces the detection accuracy. Despite considerable efforts in 3D object detection, it has not yet reached the stage of maturity. Therefore, in this paper, we aim at providing a comprehensive overview of the state-of-the-art 3D object detection methods, with a focus on 1) identifying advantages and limitations, 2) revealing a novel categorization of the literature, 3) outlining the various training procedures, 4) highlighting the research gaps in the existing methods, and 5) building a road map for future directions.
Autonomous driving requires a detailed understanding of complex driving scenes. The redundancy and complementarity of the vehicle's sensors provide an accurate and robust comprehension of the environment, thereby increasing the level of performance and safety. This thesis focuses on the automotive RADAR, which is a low-cost active sensor measuring properties of surrounding objects, including their relative speed, and has the key advantage of not being impacted by adverse weather conditions. With the rapid progress of deep learning and the availability of public driving datasets, the perception ability of vision-based driving systems (e.g., detection of objects or trajectory prediction) has considerably improved. The RADAR sensor is seldom used for scene understanding due to its poor angular resolution, the size, noise, and complexity of RADAR raw data, as well as the lack of available datasets. This thesis proposes an extensive study of RADAR scene understanding, from the construction of an annotated dataset to the conception of adapted deep learning architectures. First, this thesis details approaches to tackle the current lack of data. A simple simulation as well as generative methods for creating annotated data will be presented. It will also describe the CARRADA dataset, composed of synchronised camera and RADAR data with a semi-automatic method generating annotations on the RADAR representations. This thesis will then present a proposed set of deep learning architectures with their associated loss functions for RADAR semantic segmentation. The proposed architecture with the best results outperforms alternative models, derived either from the semantic segmentation of natural images or from RADAR scene understanding, while requiring significantly fewer parameters. It will also introduce a method to open up research into the fusion of LiDAR and RADAR sensors for scene understanding. Finally, this thesis presents a collaborative contribution, the RADIal dataset with synchronised High-Definition (HD) RADAR, LiDAR and camera. A deep learning architecture is also proposed to estimate the RADAR signal processing pipeline while performing multitask learning for object detection and free driving space segmentation simultaneously.
State-of-the-art 3D object detection for autonomous driving is achieved by processing lidar sensor data with deep-learning methods. However, the detection quality of the state of the art is still far from enabling safe driving in all conditions. Additional sensor modalities need to be used to increase the confidence and robustness of the overall detection result. Researchers have recently explored radar data as an additional input source for universal 3D object detection. This paper proposes artificial neural network architectures to segment sparse radar point cloud data. Segmentation is an intermediate step towards radar object detection as a complementary concept to lidar object detection. Conceptually, we adapt Kernel Point Convolution (KPConv) layers for radar data. Additionally, we introduce a long short-term memory (LSTM) variant based on KPConv layers to make use of the information content in the time dimension of radar data. This is motivated by classical radar processing, where tracking of features over time is imperative to generate confident object proposals. We benchmark several variants of the network on the public nuScenes data set against a state-of-the-art PointNet-based approach. The performance of the networks is limited by the quality of the publicly available data. The radar data and radar-label quality is of great importance to the training and evaluation of machine learning models. Therefore, the advantages and disadvantages of the available data set, regarding its radar data, are discussed in detail. The need for a radar-focused data set for object detection is expressed. We assume that higher segmentation scores should be achievable with better-quality data for all models compared, and differences between the models should manifest more clearly. To facilitate research with additional radar data, the modular code for this research will be made available to the public.
Conference Paper
Robust and accurate object detection on roads with various objects is essential for automated driving. The radar has been employed in commercial advanced driver assistance systems (ADAS) for a decade due to its low-cost and high-reliability advantages. However, the radar has been used only in limited driving conditions, such as highways, to detect a few forward vehicles, because of its low resolution and poor classification performance. We propose a learning-based detection network using the radar range-azimuth heatmap and a monocular image in order to fully exploit the radar in complex road environments. We show that radar-image fusion can overcome the inherent weakness of the radar by leveraging camera information. Our proposed network has a two-stage architecture that combines radar and image feature representations, rather than fusing each sensor's prediction results, to improve detection performance over a single sensor. To demonstrate the effectiveness of the proposed method, we collected radar, camera, and LiDAR data in various driving environments in terms of vehicle speed, lighting conditions, and traffic volume. Experimental results show that the proposed fusion method outperforms the radar-only and the image-only methods.
Conference Paper
Robust and accurate scene representation is essential for advanced driver assistance systems (ADAS) such as automated driving. The radar and camera are two widely used sensors for commercial vehicles due to their low cost, high reliability, and low maintenance. Despite their strengths, radar and camera have very limited performance when used individually. In this paper, we propose a low-level sensor fusion 3D object detector that combines two Regions of Interest (RoIs) from radar and camera feature maps by a Gated RoI Fusion (GRIF) to perform robust vehicle detection. To take advantage of the sensors and utilize the sparse radar point cloud, we design a GRIF that employs an explicit gating mechanism to adaptively select the appropriate data when one of the sensors is abnormal. Our experimental evaluations on nuScenes show that our fusion method GRIF not only yields a significant performance improvement over single-sensor radar and image methods but also achieves performance comparable to the LiDAR detection method. We also observe that the proposed GRIF achieves higher recall than mean or concatenation fusion operations when points are sparse.
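A toy sketch of the gating idea behind such a fusion, with a single linear layer standing in for the learned gating network; the weights are placeholders and this is not the GRIF architecture itself:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_roi_fusion(radar_roi, cam_roi, w, b):
    """A scalar gate, scored from the concatenated RoI features,
    weights the two modalities before summing; a low gate value
    suppresses the radar branch, a high one suppresses the camera."""
    gate = sigmoid(np.concatenate([radar_roi, cam_roi]) @ w + b)
    return gate * radar_roi + (1.0 - gate) * cam_roi

radar = np.array([1.0, 0.0])
cam = np.array([0.0, 1.0])
# With zero weights the gate is sigmoid(0) = 0.5: an even blend.
fused = gated_roi_fusion(radar, cam, w=np.zeros(4), b=0.0)   # array([0.5, 0.5])
```

A trained gate would instead learn to shift weight away from whichever modality is degraded, which is how the explicit gating handles an abnormal sensor.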
We tackle the problem of exploiting Radar for perception in the context of self-driving as Radar provides complementary information to other sensors such as LiDAR or cameras in the form of Doppler velocity. The main challenges of using Radar are the noise and measurement ambiguities which have been a struggle for existing simple input or output fusion methods. To better address this, we propose a new solution that exploits both LiDAR and Radar sensors for perception. Our approach, dubbed RadarNet, features a voxel-based early fusion and an attention-based late fusion, which learn from data to exploit both geometric and dynamic information of Radar data. RadarNet achieves state-of-the-art results on two large-scale real-world datasets in the tasks of object detection and velocity estimation. We further show that exploiting Radar improves the perception capabilities of detecting faraway objects and understanding the motion of dynamic objects.