Radar and Camera Fusion for Multi-Task Sensing
in Autonomous Driving
1st Kun Shi
State Key Laboratory of Industrial
Control Technology, Zhejiang University
Hangzhou, China
kuns@zju.edu.cn
2nd Shibo He
State Key Laboratory of Industrial
Control Technology, Zhejiang University
Hangzhou, China
s18he@zju.edu.cn
3rd Jiming Chen
State Key Laboratory of Industrial
Control Technology, Zhejiang University
Hangzhou, China
cjm@zju.edu.cn
Abstract—Multi-modal fusion is imperative to the implementation of reliable autonomous driving. In this paper, we make full use of both mmWave radar and camera data to reconstruct reliable depth and full-velocity information. To overcome the sparsity of radar points, we leverage full-velocity based ego-motion compensation to achieve more accurate multi-sweep accumulation. In addition, we incorporate an adaptive two-stage attention module within the fusion network to exploit the synergy of camera and radar information. Furthermore, we conduct extensive experiments on the prevailing nuScenes dataset. The results show that our proposed fusion system consistently outperforms state-of-the-art methods.
Index Terms—multi-modal fusion, mmWave radar, object detection, autonomous driving
I. INTRODUCTION
Recent years have witnessed the fast proliferation of multi-
sensor equipment among intelligent vehicles [1]–[3]. As two
mainstream automotive sensors, camera and LiDAR are both
susceptible to inclement weather and occlusions, which can dramatically degrade the perception range and accuracy. In addition, the velocity of an object cannot be obtained by LiDAR or camera without exploiting temporal information. By comparison, millimeter-wave (mmWave) radar is robust to heavy rain, dust, fog, snow, and dim illumination thanks to its all-weather capability. Moreover, radar can readily measure the instantaneous radial velocity, which enables real-time full-velocity estimation.
Owing to the complementary properties between the radar and camera modalities, multi-modal fusion has the potential to significantly boost perception performance. An ideal
radar-camera fusion architecture is expected to integrate the strengths of each sensor while suppressing their respective deficiencies. Specifically, cameras provide intuitive images with rich texture and semantic information, but their performance degenerates under occlusion, for long-range objects, and in extreme weather. By contrast, radars are well suited to velocity estimation, detection under occlusion, and long-range detection, especially in harsh conditions. Nonetheless, radar point clouds are prone to sparsity, which is detrimental to the identification of diverse objects.
A large body of literature has been dedicated to radar-camera fusion. The latest deep learning-based radar-camera fusion methods can be broadly divided into three types: projection-based fusion, ROI-based fusion, and aggregate view fusion.
Projection-based fusion. As the most prevalent approach to bridging the disparity between heterogeneous modalities, this type can be further classified into radar-to-camera projection and camera-to-radar mapping. The former transforms each radar point from the radar coordinate frame to the perpendicular camera plane (a.k.a. the front-view coordinate frame), as shown in Fig. 1, and then generates an augmented radar image. The major drawbacks of this method are that it struggles to differentiate adjacent or occluded objects in the front view and that it exacerbates the sparsity of the radar data. By contrast, the latter projects image data onto the bird's eye view (BEV) of the radar. This method alleviates the radar sparsity issue, but it must estimate the distance of each object from image data alone, which may cause ambiguity.
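To make the radar-to-camera projection concrete, the following sketch (Python with NumPy; the calibration matrices, coordinate conventions, and the helper name are illustrative assumptions, not taken from the paper or from nuScenes) shows how radar points can be mapped into the image plane with a homogeneous extrinsic transform and the camera intrinsics.

```python
import numpy as np

def project_radar_to_image(points_radar, T_cam_from_radar, K):
    """Project radar points (N, 3) from the radar frame into image pixels.

    points_radar     : (N, 3) xyz coordinates in the radar frame
    T_cam_from_radar : (4, 4) homogeneous extrinsic transform, radar -> camera
    K                : (3, 3) camera intrinsic matrix
    Returns (N, 2) pixel coordinates and (N,) depths along the camera z-axis.
    """
    n = points_radar.shape[0]
    pts_h = np.hstack([points_radar, np.ones((n, 1))])      # homogeneous (N, 4)
    pts_cam = (T_cam_from_radar @ pts_h.T)[:3, :]            # camera frame (3, N)
    depth = pts_cam[2, :]                                    # z = depth
    uv_h = K @ pts_cam                                       # perspective projection
    uv = (uv_h[:2, :] / np.clip(uv_h[2, :], 1e-6, None)).T   # normalize to pixels
    return uv, depth
```

Each radar return maps to a single pixel in the front view, which illustrates why the augmented radar image inherits the sparsity of the point cloud.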
ROI-based fusion. This method leverages one modality to generate 2D/3D regions of interest (ROIs) that may contain valid objects, and can also be categorized into two
types. The first type utilizes radar data to generate ROIs, and
then performs subsequent operations in the camera front-view.
In contrast, the second type uses camera data to obtain the
bounding boxes of potential objects, and then makes use of
radar points inside the bounding boxes to refine the results
of the image detector. Both types are conducive to enhancing efficiency, which is pivotal for real-time applications. Nevertheless, they may suffer from missed detections that cannot be recovered in later stages.
Aggregate view fusion. The basic idea of this approach
derives from a pioneering framework dubbed AVOD, where
a series of anchors are predefined in the front-view map
and the BEV map, respectively. The radar BEV map is
often obtained by representing radar point clouds in a voxel-
grid format. Then, object proposal-wise features are extracted
within the region proposals, and are aggregated by specific
fusion operations to achieve the final results. Although this method can produce high-recall region proposals, the grid representation of radar data is inefficient and impractical when the radar point clouds are sparse.
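To illustrate the voxel-grid BEV representation mentioned above, the short sketch below (Python/NumPy; the grid extent, 0.5 m cell size, and function name are arbitrary assumptions) rasterizes radar points into an occupancy grid. With only a handful of returns per object, almost all cells stay empty, which is exactly the inefficiency noted here.

```python
import numpy as np

def radar_points_to_bev_grid(points_xy,
                             x_range=(0.0, 100.0),
                             y_range=(-50.0, 50.0),
                             cell_size=0.5):
    """Rasterize radar points (N, 2) in the BEV plane into an occupancy grid.

    Returns an (H, W) array where each cell counts the radar returns inside it.
    """
    h = int((y_range[1] - y_range[0]) / cell_size)
    w = int((x_range[1] - x_range[0]) / cell_size)
    grid = np.zeros((h, w), dtype=np.float32)

    x, y = points_xy[:, 0], points_xy[:, 1]
    inside = (x >= x_range[0]) & (x < x_range[1]) & \
             (y >= y_range[0]) & (y < y_range[1])

    # Metric coordinates -> integer cell indices
    col = ((x[inside] - x_range[0]) / cell_size).astype(int)
    row = ((y[inside] - y_range[0]) / cell_size).astype(int)
    np.add.at(grid, (row, col), 1.0)   # accumulate duplicate hits safely
    return grid
```

For the 200 x 200 grid assumed above, even a frame with a few hundred radar returns occupies well under one percent of the 40,000 cells, so most of the grid carries no information.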
Since most commercial automotive radars collect sparse
point clouds, and the authoritative public dataset nuScenes
[1] provides very sparse radar data collected from Conti-
nental ARS 408-21 radars, the performance of projection-
based and aggregate view fusion is suboptimal. Therefore, our
work is based on the concept of ROI-based fusion. Among
the existing radar-camera fusion literature with open source
code, CenterFusion [4] yields SOTA results in the nuScenes
object detection challenge, but its performance is unreliable
in adverse conditions. In CenterFusion, the authors used a parameter δ to remedy the inaccuracy in depth estimation, since predicting depth from image features alone can introduce ambiguity, particularly in the presence of obstructions and adverse weather. Ideally, enlarging the ROI with a larger δ increases the chance of covering the corresponding radar points inside the ROI. However, the value of δ needs to be carefully determined, because a larger ROI may introduce false alarms from nearby objects. To this end, we adopt a radar-camera pair association scheme to realize more robust depth estimation. In addition, we offer solutions for the sparsity, missing-height, and clutter (spurious measurements) issues of radar data, as well as reliable full-velocity estimation for more accurate 3D object detection.
Moreover, we integrate a two-stage attention module to explore the complementary information between camera and radar while accounting for environmental uncertainty. To summarize, this paper makes the following three main contributions:
• We make the first attempt to simultaneously address three principal obstacles in the employment of radar data (sparsity, missing height, and clutter).
• We propose a novel multi-task oriented radar-camera fusion framework that outperforms the state-of-the-art method in terms of 3D object detection, depth estimation, and full-velocity estimation.
• We integrate a two-stage attention module that combines intra-sensor self-attention and inter-sensor cross-attention to achieve complementary interactions between the radar and camera sensors.
II. PROPOSED METHODS
A. Radar Data Preprocessing
In general, there exist three major challenges in the preprocessing of automotive radar points. First, mmWave radar data are usually represented as 2D points containing no height information of the object in the BEV map (i.e., the driving ground plane), since the collected measurements along the vertical direction tend to be inaccurate or even non-existent due to the limited elevation angular resolution. The missing height information complicates object detection. Second, radar data inevitably contain clutter (reflected signals from objects not of interest), especially in heavy traffic scenarios. Third, the sparsity of radar points makes it difficult to fully exploit the radar data.
With these insights, we explore appropriate approaches to
tackle the aforementioned challenges. An intuitive illustration
for these strategies is provided in Fig. 2.

Fig. 1. An illustration of radar and camera coordinates in the nuScenes dataset.

For the sparsity issue, a common method relies on multi-frame aggregation, which can be described by the following formula:

$$
\begin{aligned}
x_t &= x_{t-1} + \Delta d_{x,t} + v_{x,t-1}\,\Delta t,\\
y_t &= y_{t-1} + \Delta d_{y,t} + v_{y,t-1}\,\Delta t,
\end{aligned}
\tag{1}
$$

where $(\Delta d_{x,t}, \Delta d_{y,t})$ is the ego-motion displacement between consecutive sweeps and $(v_{x,t-1}, v_{y,t-1})$ is the estimated velocity used for object-motion compensation.
Although aggregating adjacent radar sweeps is straightforward, without reliable ego-motion and object-motion compensation this type of method requires manually tuning the number of sweeps to trade off the benefit of aggregation against the adverse impact on accuracy. As such, we exploit
full-velocity rather than radial velocity for object motion
compensation to achieve rigorous aggregation. Inspired by [5],
we leverage optical flow from a camera-radar pair to calculate
the point-wise full-velocity. To handle clutter interference, we resort to frustum-based ROI filtering. Specifically, we first leverage a mature image-only detector, CenterNet, to obtain the 3D bounding box of each object, and a radar-camera pixel depth association scheme [6] to estimate the depth information. Then, we combine the 3D box and the depth information to form a frustum. By treating the frustum as an ROI space, we discard the noisy radar points outside the frustum. For the missing-height problem, we expand each radar point into a fixed-size vertical pillar to enrich the information of the radar measurements.
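The preprocessing steps above can be summarized in the following sketch (Python/NumPy). It applies the per-sweep compensation of Eq. (1), reduces the frustum test to a 2D-box plus depth-interval check, and expands each point into a fixed-size pillar; all function names, the pillar height, and the data layout are assumptions for illustration rather than the exact implementation.

```python
import numpy as np

def compensate_sweep(points_xy, vel_xy, ego_disp_xy, dt):
    """Warp one past radar sweep into the current frame, following Eq. (1):
    p_t = p_{t-1} + ego-motion displacement + full velocity * time gap."""
    return points_xy + ego_disp_xy + vel_xy * dt

def accumulate_sweeps(sweeps, ego_disps, dts):
    """Aggregate several past sweeps into the current frame.

    sweeps    : list of (points_xy (N, 2), vel_xy (N, 2)) tuples
    ego_disps : list of (2,) cumulative ego displacements, one per sweep
    dts       : list of cumulative time gaps, one per sweep
    """
    warped = [compensate_sweep(p, v, np.asarray(d), t)
              for (p, v), d, t in zip(sweeps, ego_disps, dts)]
    return np.vstack(warped)

def frustum_filter(points_xyz, box_uv, depth_est, depth_tol, project_fn):
    """Keep radar points whose image projection lies inside the 2D box and
    whose depth is within depth_est +/- depth_tol (a simplified frustum ROI)."""
    uv, depth = project_fn(points_xyz)
    u0, v0, u1, v1 = box_uv
    keep = ((uv[:, 0] >= u0) & (uv[:, 0] <= u1) &
            (uv[:, 1] >= v0) & (uv[:, 1] <= v1) &
            (np.abs(depth - depth_est) <= depth_tol))
    return points_xyz[keep]

def expand_to_pillars(points_xy, pillar_height=3.0, samples=8):
    """Expand each height-less BEV point into a fixed-size vertical pillar."""
    zs = np.linspace(0.0, pillar_height, samples)
    pillars = [np.column_stack([np.full(samples, x),
                                np.full(samples, y),
                                zs])
               for x, y in points_xy]
    return np.vstack(pillars)            # (N * samples, 3)
```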
B. Fusion Model Architecture
The overall architecture of our proposed radar-camera fu-
sion model is illustrated in Fig. 2. For the camera data
processing branch, we employ the CenterNet-based network to generate image features and to predict the heatmap, the 2D and 3D object bounding boxes, the orientation, and the object center offset with the first regression head. For the radar point clouds, we
first accumulate multiple radar sweeps with full-velocity based
compensation. Then, we transform radar points into 3D pillars
and conduct a frustum-based radar-camera association. After
the respective feature extraction, we fuse the two features with
an adaptive two-stage attention module to exploit the synergy
of camera and radar information. This fusion module combines intra-sensor self-attention and inter-sensor cross-attention to achieve complementary interactions between the heterogeneous sensors, whose weights are automatically adjusted according to the environmental uncertainty [7]. Finally, the integrated features are sent to the second regression head and a box decoder to re-predict the 3D bounding box, velocity, and attributes (e.g., stationary or moving) required for calculating the nuScenes metrics [1].

Fig. 2. The overall architecture of the proposed radar-camera fusion network.
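For concreteness, the sketch below shows one possible realization of such a two-stage attention block in PyTorch: per-modality self-attention, followed by cross-attention between the radar and image token sequences, with a small gating network that re-weights the two streams. The dimensions, the gating design, and all names are assumptions for illustration rather than the exact module used in our network.

```python
import torch
import torch.nn as nn

class TwoStageAttentionFusion(nn.Module):
    """Stage 1: intra-sensor self-attention for each modality.
    Stage 2: inter-sensor cross-attention plus a learned gate that
    re-weights the two streams (e.g., trusting radar more in bad weather)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rad_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_from_rad = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rad_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, img_tokens, rad_tokens):
        # img_tokens: (B, Ni, C) flattened image features; rad_tokens: (B, Nr, C)
        img, _ = self.img_self(img_tokens, img_tokens, img_tokens)
        rad, _ = self.rad_self(rad_tokens, rad_tokens, rad_tokens)

        # Stage 2: each modality queries the other one
        img_cross, _ = self.img_from_rad(img, rad, rad)
        rad_cross, _ = self.rad_from_img(rad, img, img)

        # Global gate over pooled summaries of both streams
        summary = torch.cat([img_cross.mean(dim=1), rad_cross.mean(dim=1)], dim=-1)
        w = self.gate(summary)                               # (B, 2)

        # Pool the radar stream so it can be broadcast onto the image tokens
        fused = (w[:, 0, None, None] * img_cross +
                 w[:, 1, None, None] * rad_cross.mean(dim=1, keepdim=True))
        return fused                                          # (B, Ni, C)
```

As a quick shape check, feeding image tokens of shape (2, 100, 256) and radar tokens of shape (2, 32, 256) returns fused features of shape (2, 100, 256), aligned with the image feature tokens.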
III. EXPERIMENTS AND RESULTS
We conduct training and testing on the nuScenes dataset, and the parameter settings are consistent with CenterFusion. It is worth noting that we do not use the parameter δ as in CenterFusion, since extensive experiments show that this parameter brings no performance gain.
Fig. 3 compares the 3D object detection results in both the camera view and the radar BEV map. It is noticeable that the predictions from our method fit the object boxes better in all cases, with fewer missed detections and false alarms. Additionally, Table I presents the quantitative results on the validation split of the nuScenes dataset. Therein, mATE, mASE, mAOE, mAVE, and mAAE stand for the mean average translation, scale, orientation, velocity, and attribute errors defined in [1]. Our method improves upon CenterNet and CenterFusion in all metrics, which validates the effectiveness and efficiency of the proposed radar-camera fusion framework. Moreover, the velocity estimated by our scheme shows a considerable improvement over both CenterFusion and CenterNet.
TABLE I
QUANTITATIVE RESULTS FOR 3D OBJECT DETECTION

Method         mAP     mATE    mASE    mAOE    mAVE    mAAE
CenterNet      0.306   0.716   0.264   0.609   1.426   0.658
CenterFusion   0.332   0.649   0.263   0.535   0.540   0.142
Our Method     0.385   0.608   0.243   0.503   0.236   0.093
IV. CONCLUSION
In this paper, we proposed a novel camera-radar fusion
network to improve multi-task sensing in autonomous driving.
We explored appropriate solutions to address the challenges of radar preprocessing and introduced a two-stage attention module into the fusion network. The detection results on the nuScenes dataset indicate that our approaches lead to considerable progress in 3D object detection. In the future, we will explore Transformer-based methods for radar and camera fusion.
Fig. 3. Visualization comparison between CenterFusion (top) and ours
(bottom). Predictions in blue, ground truth in red, and radar points in green.
REFERENCES
[1] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krish-
nan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset
for autonomous driving,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[2] S. He, K. Shi, C. Liu, B. Guo, J. Chen, and Z. Shi, “Collaborative sensing
in internet of things: A comprehensive survey,” IEEE Communications
Surveys & Tutorials, vol. 24, no. 3, pp. 1435–1474, 2022.
[3] K. Shi, Z. Shi, C. Yang, S. He, J. Chen, and A. Chen, “Road-map aided
GM-PHD filter for multi-vehicle tracking with automotive radar,” IEEE
Transactions on Industrial Informatics, vol. 18, no. 1, pp. 97–108, 2021.
[4] R. Nabati and H. Qi, “CenterFusion: Center-based radar and camera fusion for 3D object detection,” in Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision, 2021, pp. 1527–1536.
[5] Y. Long, D. Morris, X. Liu, M. Castro, P. Chakravarty, and P. Narayanan,
“Full-velocity radar returns by radar-camera fusion,” in Proceedings of
the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16198–16207.
[6] Y. Long, D. Morris, X. Liu, M. Castro, P. Chakravarty, and P. Narayanan,
“Radar-camera pixel depth association for depth completion,” in Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2021, pp. 12507–12516.
[7] C. X. Lu, M. R. U. Saputra, P. Zhao, Y. Almalioglu, P. P. De Gusmao,
C. Chen, K. Sun, N. Trigoni, and A. Markham, “milliEgo: single-chip
mmwave radar aided egomotion estimation via deep sensor fusion,” in
Proceedings of the 18th Conference on Embedded Networked Sensor
Systems, 2020, pp. 109–122.