Achelous: A Fast Unified Water-surface Panoptic Perception Framework
based on Fusion of Monocular Camera and 4D mmWave Radar
Runwei Guan1, Shanliang Yao1, Xiaohui Zhu2, Ka Lok Man2, Eng Gee Lim2, Jeremy Smith1, Yong Yue2,
Yutao Yue3
Abstract – Current perception models for different tasks usually exist as separate modules on Unmanned Surface Vehicles (USVs). Running them in parallel on edge devices is extremely slow, causing asynchrony between perception results and the USV position and leading to erroneous decisions in autonomous navigation. Compared with Unmanned Ground Vehicles (UGVs), robust perception for USVs has developed relatively slowly. Moreover, most current multi-task perception models are huge in parameters, slow in inference and not scalable. Motivated by this, we propose Achelous, a low-cost and fast unified panoptic perception framework for water-surface perception based on the fusion of a monocular camera and a 4D mmWave radar. Achelous can simultaneously perform five tasks: detection and segmentation of visual targets, drivable-area segmentation, waterline segmentation and radar point cloud segmentation. Besides, models in the Achelous family, with around 5 million parameters or fewer, achieve about 18 FPS on an NVIDIA Jetson AGX Xavier, 11 FPS faster than HybridNets, and exceed YOLOX-Tiny and Segformer-B0 on our collected dataset by about 5 mAP50-95 and 0.7 mIoU, especially under adverse weather, dark environments and camera failure. To our knowledge, Achelous is the first comprehensive panoptic perception framework combining vision-level and point-cloud-level tasks for water-surface perception. To promote the development of the intelligent transportation community, we release our code at https://github.com/GuanRunwei/Achelous.
Index Terms – Unified panoptic perception, fusion of vision and radar, USV-based water-surface perception
I. INTRODUCTION
With the rapid development of deep learning, Autonomous Driving (AD) has become increasingly intelligent. Perception, an essential module of AD, comprises many environment perception tasks, including object detection, drivable-area segmentation, lane line segmentation, etc. Correspondingly, multi-sensor fusion has been applied to improve the precision and robustness of perception systems [1].
The authors acknowledge XJTLU-JITRI Academy of Industrial Tech-
nology for giving valuable support to the joint project. This work is
supported by the Key Program Special Fund of XJTLU (KSF-A-19), the Research Development Fund of XJTLU (RDF-19-02-23) and the Suzhou Science and Technology Fund (SYG202122). This work received financial support
from Jiangsu Industrial Technology Research Institute (JITRI) and Wuxi
National Hi-Tech District (WND).
Runwei Guan and Shanliang Yao contribute equally.
1Runwei Guan, Shanliang Yao and Jeremy Smith are with Fac-
ulty of Science and Engineering, University of Liverpool, L69 3BX
Liverpool, United Kingdom. {runwei.guan, shanliang.yao,
J.S.Smith}@liverpool.ac.uk
2Xiaohui Zhu, Ka Lok Man, Eng Gee Lim and Yong Yue
are with School of Advanced Technology, Xi’an Jiaotong-Liverpool
University, 215123 Suzhou, China. {xiaohui.zhu, Ka.Man,
enggee.lim, yong.yue}@xjtlu.edu.cn
3Yutao Yue, corresponding author, is with Institute of Deep Perception
Technology, JITRI, 214000 Wuxi, China. yueyutao@idpt.org
Fig. 1. Comparison of (a) standalone models and (b) multi-task framework
with heterogeneous sensor data. Our Achelous belongs to the latter.
We have witnessed a remarkable development of unmanned ground vehicles (UGVs) with the help of deep learning. However, water-surface perception has developed considerably more slowly than road perception. Unmanned surface vehicles (USVs) play a significant role in water-related tasks such as water rescue [2], water quality monitoring [3], garbage collection [4] and geological exploration [5]. Based on extensive surveys, we firstly find that most water-surface perception models can only execute a single task [6][7][8], which is far from sufficient for USVs to complete autonomous navigation; in addition, some studies show that multiple tasks can improve each other [9][10]. Secondly, the industry prefers to parallelize multiple single-task models, which may lead to asynchronous perception results; besides, the speed of the perception system then depends on the slowest model (Fig. 1). Thirdly, many models have never considered real-time inference on edge devices. They can only run on high-performance GPU devices [9][11] at remote servers, which relies excessively on network communication. However, network communication weakens dramatically during ocean voyages; once the network is interrupted, it may be catastrophic for USVs performing dangerous tasks. Last but not least, vision-only models [12][13] are unreliable when confronted with dark environments, dense fog or lens failure. Currently, 4D mmWave radar is considered a promising complementary perception sensor for cameras in adverse situations, but how to efficiently extract irregular radar features is a challenge. Based on the above,
1) We propose Achelous (Fig. 2), a low-cost and fast unified water-surface panoptic perception framework based on the fusion of a monocular camera and 4D radar. Achelous unifies five perception tasks in one end-to-end framework, including object detection, object semantic segmentation, waterline segmentation, drivable-area segmentation and radar point cloud semantic segmentation. Achelous obtains about 18 FPS on an NVIDIA Jetson AGX Xavier and achieves competitive performance on perception tasks compared with single-task models and other multi-task models.
2) We propose a simple but effective convolution operator called Radar Convolution (RadarConv). RadarConv is friendly to the irregularity of radar point clouds and, compared with normal convolution, can meticulously and effectively extract point cloud features in 2D image planes.
3) To promote the development of water-surface panoptic perception based on multiple sensors, our Achelous family is scalable and open-source.

Fig. 2. The architecture of Achelous. Blue points in the semantic segmentation of radar point clouds denote clutter while red points denote targets.

TABLE I
COMPARISON OF PANOPTIC PERCEPTION MODELS (FRAMEWORKS)

Name            | Type (1) | Sensor(s)         | Tasks | Edge Test | Scalable
YOLOP [11]      | M        | camera            | 3     | ✗         | ✗
HybridNets [9]  | M        | camera            | 3     | ✗         | ✗
Achelous (ours) | F        | camera + 4D radar | 5     | ✓         | ✓
1. Model or Framework.
II. ACHELOUS
A. Overview
As Fig. 2 presents, our Achelous is based on a USV
mounted with a monocular camera and 4D radar. The monoc-
ular camera captures RGB images while 4D radar obtains
3D point clouds directly. Each radar point cloud contains
several physical features of targets. Among these physical
features, we select the range, velocity and reflected power of
the target, which cannot be perceived by the camera. To make
radar point clouds assist vision-based object detection, we transform point cloud coordinates from the 3D radar coordinate system to the 2D camera plane. We call the resulting 2D radar pseudo-images RVP maps, whose three channels represent the range, velocity and power of radar targets.
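As an illustration, the projection described above can be sketched as follows, assuming calibrated radar-to-camera extrinsics and camera intrinsics; all variable names here are illustrative and not taken from the released code.

```python
import numpy as np

def build_rvp_map(points_xyz, rng, vel, power, T_radar2cam, K, img_hw):
    """Project 3D radar points onto the image plane and rasterize a
    3-channel RVP (range, velocity, power) pseudo-image.

    points_xyz: (N, 3) radar points; T_radar2cam: 4x4 homogeneous transform;
    K: 3x3 camera intrinsic matrix; img_hw: (height, width) of the image.
    """
    H, W = img_hw
    # Homogeneous radar coordinates -> camera coordinates
    pts_h = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])
    cam = (T_radar2cam @ pts_h.T).T[:, :3]                      # (N, 3)
    in_front = cam[:, 2] > 0                                    # keep points in front of the camera
    cam, rng, vel, power = cam[in_front], rng[in_front], vel[in_front], power[in_front]
    # Perspective projection with the intrinsic matrix
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    rvp = np.zeros((3, H, W), dtype=np.float32)
    rvp[0, v[valid], u[valid]] = rng[valid]                     # range channel
    rvp[1, v[valid], u[valid]] = vel[valid]                     # velocity channel
    rvp[2, v[valid], u[valid]] = power[valid]                   # reflected power channel
    return rvp
```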
The main body of our Achelous contains four parts: a ViT-based image encoder, a radar feature encoder, prediction heads and a point cloud semantic segmentation model. As Table I presents, our Achelous supports more sensors and tasks than the other two panoptic perception models, YOLOP and HybridNets. Besides, Achelous is specifically tested on edge devices, and its modules are optional and scalable. Achelous has three channel sizes: S0, S1 and S2. To accelerate inference, Achelous reduces branches and network fragmentation as much as possible and carefully weighs the use of activation functions and group convolutions. Besides, Achelous keeps the input and output channels of each network unit equal to reduce memory access cost.
B. ViT-based Image Encoder
We have witnessed excellent performances of vision-transformer-based (ViT-based) models over the past years. ViT-based models can model global contextual features based on the self-attention mechanism [14]. In addition, ViTs overall exceed CNNs in predictive performance [15][16][17]. ViTs are more robust than CNNs against adversarial attacks [18][19][20], object occlusions [21] and data corruptions [22]. Multi-head self-attention can ensemble prediction features in a way CNNs cannot [23], and its information capacity is much larger than that of CNNs with the same number of parameters. Although ViTs are blamed for slow inference, recent studies [24][25][26][27] indicate that ViTs with ingenious designs can still run as fast as CNNs. Based on the above advantages, our Achelous leverages ViT-based models as image encoders. Following the common backbone paradigm, our image encoder has five stages producing multi-scale feature maps, where the last four stages contain 2, 2, 6 and 4 layers, respectively. Our Achelous preliminarily includes four lightweight ViT-based backbones: EdgeNeXt [25], EdgeViT [24], MobileViT [26] and EfficientFormer [27]. Following the backbone, spatial pyramid pooling (SPP) [28] is used to enlarge the receptive fields of image feature maps at multiple scales.
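For reference, a common form of the SPP block (parallel max poolings with different kernel sizes, concatenated with the input and fused by a 1×1 convolution) can be sketched as below; the exact kernel sizes and fusion layer used in Achelous are assumptions.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: concatenates max-pooled features of several
    receptive-field sizes with the input feature map, then fuses them."""
    def __init__(self, channels, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )
        # 1x1 convolution fuses the concatenated branches back to `channels`
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```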
C. Dual-FPN and Segmentation Heads
The Feature Pyramid Network (FPN) is a key module for fusing multi-scale features. Achelous has a dual FPN that fuses the features extracted by the ViT-based encoder, where the first three feature maps are weight-shared and the last feature map is weight-independent. Between the weight-shared and weight-independent feature maps, we use two shuffle attention [29] modules to reweight features for the two different segmentation tasks. Inspired by GhostNet [30] and CSPDarknet [31], we design two lightweight FPNs, Ghost-Dual-FPN (GDF) and CSP-Dual-FPN (CDF), where GDF dramatically removes feature redundancy in the feature fusion stage while CDF speeds up feature fusion operations. Moreover, two segmentation heads follow the fused feature maps: one for waterline segmentation and one for semantic segmentation of targets and drivable area.
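As background, the Ghost module idea from GhostNet [30] that Ghost-Dual-FPN builds on can be sketched as follows; the split ratio, normalization and activation choices here are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost module: a primary 1x1 convolution produces half of the output
    channels, and a cheap depth-wise convolution generates the rest."""
    def __init__(self, in_ch, out_ch, dw_kernel=3):
        super().__init__()
        assert out_ch % 2 == 0, "sketch assumes an even number of output channels"
        primary_ch = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, 1, bias=False),
            nn.BatchNorm2d(primary_ch), nn.SiLU(),
        )
        self.cheap = nn.Sequential(
            # depth-wise convolution generating the remaining ("ghost") channels
            nn.Conv2d(primary_ch, out_ch - primary_ch, dw_kernel,
                      padding=dw_kernel // 2, groups=primary_ch, bias=False),
            nn.BatchNorm2d(out_ch - primary_ch), nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```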
D. Point Cloud Semantic Segmentation Model
Since cameras may fail in adverse weather, radar takes over perception for Achelous in such cases. As radar cannot adopt the vision-based detection pattern, semantic segmentation of radar point clouds matters. Based on the ideas of permutation invariance and local feature learning from PointNet [32] and PointNet++ [33], many point cloud processing models exist. Here, to reduce the computation burden of the framework and keep it fast, we adopt PointNet and PointNet++ as components for point cloud semantic segmentation. In addition, we reduce the number of channels to one third of the original, because radar point clouds are dramatically sparser than lidar point clouds and too many latent channels are redundant.
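A minimal PointNet-style segmentation branch with channel widths cut to roughly one third of the common 64/128/1024 configuration might look like the sketch below; the exact widths, input features and class count are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class TinyPointNetSeg(nn.Module):
    """PointNet-style per-point segmentation with ~1/3 of the usual channel widths."""
    def __init__(self, in_ch=5, num_classes=8):
        super().__init__()
        self.local = nn.Sequential(           # shared per-point MLP (1x1 convolutions)
            nn.Conv1d(in_ch, 21, 1), nn.BatchNorm1d(21), nn.ReLU(),
            nn.Conv1d(21, 42, 1), nn.BatchNorm1d(42), nn.ReLU(),
            nn.Conv1d(42, 341, 1), nn.BatchNorm1d(341), nn.ReLU(),
        )
        self.head = nn.Sequential(            # per-point classifier on local + global features
            nn.Conv1d(341 + 341, 85, 1), nn.BatchNorm1d(85), nn.ReLU(),
            nn.Conv1d(85, num_classes, 1),
        )

    def forward(self, pts):                   # pts: (B, in_ch, N)
        local = self.local(pts)               # (B, 341, N)
        global_feat = local.max(dim=2, keepdim=True).values      # symmetric (max) pooling
        global_feat = global_feat.expand(-1, -1, pts.shape[2])   # broadcast to every point
        return self.head(torch.cat([local, global_feat], dim=1)) # (B, num_classes, N)
```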
E. Radar Convolution, RCBlock, RCNet and Fusion with
Image Features
We notice that radar point clouds are sparse and irregular, which means conventional convolution performs many invalid operations and treats feature maps as regular grids. To make convolution friendly to feature extraction of radar point clouds and alleviate feature loss, we propose radar convolution (Fig. 3), a simple but highly efficient convolution operator. We first adopt 3×3 average pooling to enlarge the receptive field. Compared with max pooling, average pooling keeps more feature information while aggregating local features. Besides, the pooling operation is much faster than convolution. To model the irregularity of radar point clouds, we introduce deformable convolution [34], which extracts features with learnable offsets. Based on radar convolution, we construct RCBlock and RCNet as shown in Fig. 3. RCBlock contains two 1×1 convolutions to weigh each spatial feature. The number of channels in RCNet is one quarter of that in the ViT-based image encoder, since radar point clouds are sparse and do not need so many latent features with non-linear operations.

Fig. 3. Radar Convolution, RCBlock and RCNet.
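The following is a minimal sketch of Radar Convolution and RCBlock as described above, using 3×3 average pooling followed by a deformable convolution whose offsets are predicted by a small convolution; the ordering of the two 1×1 convolutions, normalization and activations are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RadarConv(nn.Module):
    """3x3 average pooling to aggregate sparse local features, then a
    deformable 3x3 convolution to follow the irregular point distribution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        # (x, y) offsets for each of the 3x3 kernel positions
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.pool(x)
        return self.deform(x, self.offset(x))

class RCBlock(nn.Module):
    """RadarConv wrapped by two 1x1 convolutions that reweigh spatial features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pw1 = nn.Sequential(nn.Conv2d(in_ch, in_ch, 1), nn.BatchNorm2d(in_ch), nn.SiLU())
        self.radar_conv = RadarConv(in_ch, out_ch)
        self.pw2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.SiLU())

    def forward(self, x):
        return self.pw2(self.radar_conv(self.pw1(x)))
```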
RCNet is an auxiliary network for detection, whose radar feature maps are concatenated with image feature maps in the dual-FPN to help Achelous localize targets faster and improve recall under adverse situations. Although many works fuse radar and image features in both the backbone and the neck [7][35], we find that too many fusion branches cause a dramatic drop in inference speed. Moreover, image feature maps in the dual-FPN already contain abundant detailed low-level features for segmentation, and upsampling operations and SPP equip feature maps with multi-scale features at different stages. Therefore, FPN-stage fusion is sufficient for robust object detection.
F. Nano Decouple Detection Head
We decouple feature maps in the detection head to predict
bounding boxes, categories and confidence, respectively.
In addition, we adopt depth-wise separable convolutions to greatly reduce parameters. Furthermore, the anchor-free design accelerates inference, and the SimOTA [36] algorithm improves the matching of positive samples.
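A minimal sketch of such a decoupled, anchor-free head built from depth-wise separable convolutions is given below; channel widths and branch depths are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

def dw_separable(in_ch, out_ch):
    """Depth-wise separable 3x3 convolution block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.SiLU(),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.SiLU(),
    )

class DecoupledHead(nn.Module):
    """Anchor-free decoupled head: separate classification, regression and
    objectness (confidence) predictions on top of a shared stem."""
    def __init__(self, in_ch, num_classes, width=64):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, width, 1)
        self.cls_branch = dw_separable(width, width)
        self.reg_branch = dw_separable(width, width)
        self.cls_pred = nn.Conv2d(width, num_classes, 1)   # per-class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)             # box offsets (anchor-free)
        self.obj_pred = nn.Conv2d(width, 1, 1)             # objectness / confidence

    def forward(self, x):
        x = self.stem(x)
        c = self.cls_branch(x)
        r = self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)
```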
III. EXPERIMENTS
A. Experimental Settings
Device. We mount a SONY IMX-317 RGB camera and
an Oculii EAGLE Imaging Radar on our USV. Sensors
are temporally synchronized via timestamps and spatially
synchronized via a calibration board.
Data. We capture 50,000 images and frames of radar data [41]. Each image is 1920×1080 pixels. There are seven classes for detection: pier, buoy, sailor, ship, boat, vessel and kayak. Besides these classes, drivable area is an additional class for semantic segmentation, while clutter is an additional class for point cloud semantic segmentation. Waterline is a single class for the waterline segmentation task. We annotate both object detection and semantic segmentation in VOC format. We annotate point cloud categories based on the ground-truth bounding boxes and clustering by velocity. We divide the data into training, validation and test sets at a ratio of 7:2:1.

Fig. 4. Statistics of our collected dataset.

TABLE II
PERFORMANCES OF ACHELOUS, OTHER MULTI-TASK MODELS AND SINGLE-TASK MODELS ON OUR TEST SET

Methods                 | Sensors | TN | Params (M) | FLOPs (G) | mAP50-95 | mAP50 | AR50-95 | mIoU-t | mIoU-d | WS mIoU | PC-SS mIoU | FPS-e | FPS-g
Achelous-EN-CDF-PN-S0   | C+R     | 5  | 3.59       | 5.38      | 37.2     | 66.3  | 43.1    | 68.1   | 98.8   | 69.4    | 57.1       | 17.5  | 59.8
Achelous-EN-GDF-PN-S0   | C+R     | 5  | 3.55       | 2.76      | 37.5     | 66.9  | 44.6    | 69.1   | 99.0   | 69.3    | 57.8       | 17.8  | 61.3
Achelous-EN-CDF-PN2-S0  | C+R     | 5  | 3.69       | 5.42      | 37.3     | 66.3  | 43.0    | 68.4   | 99.0   | 68.9    | 60.2       | 15.2  | 56.5
Achelous-EN-GDF-PN2-S0  | C+R     | 5  | 3.64       | 2.84      | 37.7     | 68.1  | 45.0    | 67.2   | 99.2   | 67.3    | 59.6       | 14.8  | 57.7
Achelous-EF-GDF-PN-S0   | C+R     | 5  | 5.48       | 3.41      | 37.4     | 66.5  | 43.4    | 68.7   | 99.6   | 66.6    | 59.4       | 17.3  | 50.6
Achelous-EV-GDF-PN-S0   | C+R     | 5  | 3.79       | 2.89      | 38.8     | 67.3  | 42.3    | 69.8   | 99.6   | 70.6    | 58.0       | 16.4  | 54.9
Achelous-MV-GDF-PN-S0   | C+R     | 5  | 3.49       | 3.04      | 41.5     | 71.3  | 45.6    | 70.6   | 99.5   | 68.8    | 58.9       | 16.0  | 53.7
Achelous-EN-GDF-PN-S1   | C+R     | 5  | 5.18       | 3.66      | 41.3     | 70.8  | 45.5    | 67.4   | 99.4   | 69.3    | 58.8       | 16.6  | 59.7
Achelous-EF-GDF-PN-S1   | C+R     | 5  | 8.07       | 4.52      | 40.0     | 70.2  | 43.8    | 68.2   | 99.3   | 68.7    | 58.2       | 16.6  | 46.8
Achelous-EV-GDF-PN-S1   | C+R     | 5  | 4.14       | 3.16      | 41.0     | 70.7  | 45.9    | 70.1   | 99.4   | 67.9    | 59.2       | 16.7  | 56.6
Achelous-MV-GDF-PN-S1   | C+R     | 5  | 4.67       | 4.29      | 43.1     | 75.8  | 47.2    | 73.2   | 99.5   | 69.2    | 59.1       | 15.8  | 55.8
Achelous-EN-GDF-PN-S2   | C+R     | 5  | 6.90       | 4.59      | 40.8     | 70.9  | 44.4    | 69.6   | 99.3   | 71.1    | 59.0       | 16.1  | 58.1
Achelous-EF-GDF-PN-S2   | C+R     | 5  | 14.64      | 7.13      | 40.5     | 70.8  | 44.5    | 70.3   | 99.1   | 71.7    | 58.4       | 13.5  | 39.3
Achelous-EV-GDF-PN-S2   | C+R     | 5  | 8.28       | 5.19      | 40.3     | 69.7  | 43.8    | 74.1   | 99.5   | 67.9    | 58.3       | 14.7  | 47.1
Achelous-MV-GDF-PN-S2   | C+R     | 5  | 7.18       | 6.02      | 45.0     | 79.4  | 48.8    | 73.8   | 99.6   | 70.8    | 58.5       | 15.6  | 52.7
YOLOP [11]              | C       | 3  | 7.90       | 18.6      | 37.9     | 68.9  | 43.5    | -      | 99.0   | 74.9    | -          | 1.28  | 8.15
HybridNets [9]          | C       | 3  | 12.83      | 15.6      | 39.1     | 69.8  | 44.2    | -      | 98.8   | 71.5    | -          | 6.04  | 17.1
YOLOv7-Tiny [37]        | C       | 1  | 6.03       | 33.3      | 37.3     | 65.9  | 43.7    | -      | -      | -       | -          | 36.7  | 118.6
YOLOX-Tiny [36]         | C       | 1  | 5.04       | 3.79      | 39.4     | 68.0  | 43.0    | -      | -      | -       | -          | 33.6  | 102.0
YOLOv4-Tiny [38]        | C       | 1  | 5.89       | 4.04      | 13.1     | 36.3  | 20.2    | -      | -      | -       | -          | 114.6 | 352.2
Segformer-B0 [39]       | C       | 1  | 3.71       | 5.29      | -        | -     | -       | 72.5   | 99.2   | 72.1    | -          | 41.6  | 124.7
PSPNet (MobileNet) [40] | C       | 1  | 2.38       | 2.30      | -        | -     | -       | 69.4   | 99.0   | 69.7    | -          | 61.2  | 246.1
PointNet [32]           | R       | 1  | 3.53       | 1.19      | -        | -     | -       | -      | -      | -       | 59.0       | 97.0  | 507.4
PointNet++ [33]         | R       | 1  | 1.88       | 2.63      | -        | -     | -       | -      | -      | -       | 60.7       | 72.8  | 384.2
Notes. TN: task number. OD: object detection (mAP50-95, mAP50, AR50-95). SS: semantic segmentation (mIoU-t: mIoU of targets; mIoU-d: mIoU of drivable area). WS: waterline segmentation. PC-SS: point cloud semantic segmentation. EN: EdgeNeXt; EF: EfficientFormer; EV: EdgeViT; MV: MobileViT; CDF: CSP-Dual-FPN; GDF: Ghost-Dual-FPN; PN: PointNet; PN2: PointNet++. FPS-e: FPS on Jetson AGX Xavier; FPS-g: FPS on RTX A4000. C: camera; R: radar.
Training and Evaluation. We resize images and RVP maps to 320×320 pixels. We train our Achelous for 100 epochs with a batch size of 32 and an initial learning rate of 0.03. We adopt Stochastic Gradient Descent (SGD) with a momentum of 0.937 as the optimizer and a cosine learning rate scheduler. We use mixed precision and Exponential Moving Average (EMA) during training. We use the homoscedastic-uncertainty-based [10] multi-task training strategy. We adopt focal loss for detection, dice loss for segmentation and NLL loss for point cloud segmentation. We train Achelous and the other models from scratch on two RTX A4000 GPUs in data-parallel mode. We test the FPS of all models on an NVIDIA Jetson AGX Xavier (TABLE III) and an RTX A4000. We use mAP50-95, mAP50 and AR50 as metrics to evaluate object detection, while mIoU measures semantic segmentation of both images and radar point clouds.

TABLE III
CONFIGURATION OF NVIDIA JETSON AGX XAVIER

Modules    | Specifications
Memory     | 8 GB
CUDA Cores | 384 NVIDIA CUDA cores + 48 Tensor cores
CPU        | 6-core ARM v8.2 64-bit
DLA        | 4.1 TFLOPS (FP16) + 8.2 TOPS (INT8)
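The homoscedastic-uncertainty-based multi-task weighting [10] used in the training strategy above can be sketched with learnable log-variances as follows (a simplified form; the released implementation may differ).

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Weighs task losses by learned homoscedastic uncertainty (Kendall et al.):
    total = sum_i exp(-s_i) * L_i + s_i, where s_i approximates log(sigma_i^2)."""
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))   # one s_i per task

    def forward(self, losses):               # losses: sequence of scalar task losses
        total = 0.0
        for i, loss in enumerate(losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Usage sketch: weighting = UncertaintyWeighting(5); loss = weighting([l_det, l_seg, l_wl, l_da, l_pc])
```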
B. Comparison of Achelous with Other Models
We compare our Achelous with other panoptic perception
models and single-task models. We evaluate performances
of object detection, target semantic segmentation, drivable
area segmentation, waterline segmentation and point cloud
semantic segmentation. We also test FPS on an edge device
(NVIDIA Jetson AGX Xavier) and a high-performance GPU
(RTX A4000). We observe that Achelous converges normally
during multi-task training (Fig. 7). As Table II presents, our Achelous achieves state-of-the-art performance on object detection and on semantic segmentation of objects and drivable area compared with other panoptic perception models and single-task models. We observe that Achelous with the MobileViT backbone achieves the best performance on object detection and semantic segmentation across all three sizes (S0, S1 and S2), considerably outperforming the other models. However, for waterline segmentation, YOLOP outperforms Achelous by about 3% mIoU. For point cloud semantic segmentation, Achelous with PointNet++ outperforms Achelous with PointNet. Furthermore, Achelous is much faster than YOLOP and HybridNets. The FPS of Achelous is between 13 and 18 on an NVIDIA Jetson AGX Xavier, which satisfies real-time inference for high-speed autonomous navigation of USVs. We also visualize the prediction results of Achelous and YOLOP in Fig. 5 and Fig. 6. We can see that in most circumstances our Achelous detects and segments targets better than YOLOP, whether in dark environments, adverse weather or under light interference.

Fig. 5. Panoptic perception results of Achelous-EN-CDF-PN-S0: object detection; segmentation of targets, drivable area and waterline; and point cloud semantic segmentation. (a) Dark environment. (b) Occluded ships on a dense foggy day. (c) The lens blocked by water droplets. (d) Dense targets.

Fig. 6. Comparison of Achelous-MV-GDF-PN-S0 (4.4 million fewer parameters than YOLOP) with YOLOP under various situations.

Fig. 7. Training and validation loss of Achelous-EN-GDF-PN2-S0.
C. Speed of Parallel Standalone Models and Achelous
As TABLE IV presents, we test the latency of standalone models running in parallel and of our Achelous-EN-GDF-PN-S0 on both a Jetson AGX Xavier and an RTX A4000. The latency of Achelous is lower than that of the parallel standalone models when conducting several panoptic perception tasks, whether on the Jetson or the RTX A4000. This proves that multi-task models are necessary for panoptic perception to improve efficiency.

TABLE IV
INFERENCE SPEED OF STANDALONE MODELS AND ACHELOUS

Methods                                                                      | Tasks              | Latency-g (ms) | Latency-e (ms)
YOLOX-Tiny [36] + Segformer-B0 [39] + PSPNet [40] + PointNet [32] (parallel) | OD, SS, WS, PC-SS  | 21.2           | 79.5
Achelous (EN-GDF-PN-S0)                                                      | OD, SS, WS, PC-SS  | 16.3 (-4.9)    | 56.2 (-23.3)
Latency-g: latency on an RTX A4000 GPU. Latency-e: latency on an NVIDIA Jetson AGX Xavier.

TABLE V
ABLATION EXPERIMENT OF RCNET AND FUSION METHODS

Methods              | mAP50-95    | mAP50
Achelous-MobileNetV2 | 37.2        | 66.0
Achelous-RCNet       | 37.5 (+0.3) | 66.6 (+0.6)

Fusion Methods       | mAP50-95    | Latency-e (ms)
Backbone + Dual-FPN  | 37.4 ±0.3   | 29.5
Dual-FPN             | 37.5 ±0.2   | 16.3 (-13.2)
D. Ablation Experiments
We first conduct an ablation experiment on RCNet (TABLE V), where we replace RCNet in Achelous-EN-GDF-PN-S0 with MobileNetV2, a structurally similar network consisting of normal convolutions. We notice that mAP50-95 drops by 0.2 while mAP50 drops by about 0.5. This proves that RCNet, built on Radar Convolution, captures and models features of radar point clouds better than the normal convolution operator. Furthermore, we compare the results of two different fusion methods. We find that fusing image and radar features in both the backbone and FPN stages does not notably improve detection performance, while its inference latency is 13.2 ms higher than that of FPN-level fusion.

Fig. 8. Visualization of heatmaps of Achelous-MV-GDF-PN-S2 (7.2M parameters) and YOLOX-M (25.3M parameters) in different situations.
E. Visualization and Analysis of Feature Maps
To validate whether our Achelous pays attention to the correct regions of interest, we adopt Grad-CAM [42] to visualize the heatmaps of the last layer of the FPN, which is connected to the detection head. We compare our Achelous-MV-GDF-PN-S2 with YOLOX-M. We choose three challenging scenarios: a dark environment under a bridge, lens failure caused by droplets and a foggy day. Firstly, we observe that vision-only YOLOX-M performs terribly in the dark environment, where one sailor is missed and another is not focused on precisely. Excitingly, Achelous with vision and radar features captures the distant sailor successfully. Secondly, when confronted with droplets on the lens, vision-only YOLOX-M completely ignores the targets in the region disturbed by droplets, but Achelous notices the ignored sailor. Thirdly, when three distant ships are obscured by thick fog, our Achelous based on vision-radar fusion is aware of the three small and distant ships while YOLOX-M is not, which validates that radar, with its long-range detection capability, matters in adverse weather. Overall, Achelous, based on feature-level fusion of camera and 4D radar and with fewer parameters, is much more reliable than vision-only models in various challenging situations.
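For reference, Grad-CAM on a chosen layer can be reproduced with forward and backward hooks as in the sketch below; the model, target layer and score function are placeholders, not names from the released code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, score_fn):
    """Compute a Grad-CAM heatmap for the activations of `target_layer`.
    `score_fn` maps the model output to a scalar (e.g. a detection score)."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        model.zero_grad()
        score = score_fn(model(image))           # scalar score to explain
        score.backward()
        act, grad = feats[0], grads[0]           # (1, C, H, W) each
        weights = grad.mean(dim=(2, 3), keepdim=True)              # channel importance
        cam = F.relu((weights * act).sum(dim=1, keepdim=True))     # weighted activation map
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    finally:
        h1.remove(); h2.remove()
    return cam
```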
IV. CONCLUSION
We propose a powerful and scalable riverway panoptic
perception framework called Achelous based on camera and
4D mmWave radar, which can simultaneously perform five
different vision-level and point-cloud-level perception tasks. Achelous is a high-efficiency framework, inferring in
real-time on an NVIDIA Jetson AGX Xavier. We also pro-
pose radar convolution, which can exquisitely extract sparse
and irregular features of radar point clouds. Achelous also
outperforms other panoptic perception models and single-
task models on most perception tasks, especially in adverse
situations. We hope Achelous can promote the development
of water-surface panoptic perception, providing a low-cost
and high-efficiency scheme for researchers.
REFERENCES
[1] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen,
Hongbo Fu, and Chiew-Lan Tai, “Transfusion: Robust lidar-camera
fusion for 3d object detection with transformers,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2022, pp. 1090–1099.
[2] Tingting Yang, Zhi Jiang, Ruijin Sun, Nan Cheng, and Hailong Feng,
“Maritime search and rescue based on group mobile computing for
unmanned aerial vehicles and unmanned surface vehicles,” IEEE
transactions on industrial informatics, vol. 16, no. 12, pp. 7700–7708,
2020.
[3] Xiaohui Zhu, Yong Yue, Prudence WH Wong, Yixin Zhang, and Hao
Ding, “Designing an optimized water quality monitoring network with
reserved monitoring locations,” Water, vol. 11, no. 4, pp. 713, 2019.
[4] Dario Madeo, Alessandro Pozzebon, Chiara Mocenni, and Duccio
Bertoni, “A low-cost unmanned surface vehicle for pervasive water
quality monitoring,” IEEE Transactions on Instrumentation and
Measurement, vol. 69, no. 4, pp. 1433–1444, 2020.
[5] Zhibin Xue, Jincun Liu, Zhengxing Wu, Sheng Du, Shihan Kong, and
Junzhi Yu, “Development and path planning of a novel unmanned
surface vehicle system and its application to exploitation of qarhan
salt lake,” Science China Information Sciences, vol. 62, no. 8, pp.
1–3, 2019.
[6] Mohammad-Hashem Haghbayan, Fahimeh Farahnakian, Jonne Poiko-
nen, Markus Laurinen, Paavo Nevalainen, Juha Plosila, and Jukka
Heikkonen, “An efficient multi-sensor fusion approach for object
detection in maritime environments,” in 2018 21st International
Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018,
pp. 2163–2170.
[7] Yuwei Cheng, Hu Xu, and Yimin Liu, “Robust small object detection
on the water surface through fusion of camera and millimeter wave
radar,” in Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2021, pp. 15263–15272.
[8] Keunhwan Kim, Jonghwi Kim, and Jinwhan Kim, “Robust data
association for multi-object detection in maritime environments using
camera and radar measurements,” IEEE Robotics and Automation
Letters, vol. 6, no. 3, pp. 5865–5872, 2021.
[9] Dat Vu, Bao Ngo, and Hung Phan, “Hybridnets: End-to-end perception
network,” arXiv preprint arXiv:2203.09035, 2022.
[10] Alex Kendall, Yarin Gal, and Roberto Cipolla, “Multi-task learning
using uncertainty to weigh losses for scene geometry and semantics,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2018, pp. 7482–7491.
[11] Dong Wu, Man-Wen Liao, Wei-Tian Zhang, Xing-Gang Wang, Xiang
Bai, Wen-Qing Cheng, and Wen-Yu Liu, “Yolop: You only look once
for panoptic driving perception,” Machine Intelligence Research, pp. 550–562, Nov 2022.
[12] Lili Zhang, Yi Zhang, Zhen Zhang, Jie Shen, and Huibin Wang, “Real-
time water surface object detection based on improved faster r-cnn,”
Sensors, vol. 19, no. 16, pp. 3523, 2019.
[13] Tao Liu, Bo Pang, Lei Zhang, Wei Yang, and Xiaoqiang Sun, “Sea
surface object detection algorithm based on yolo v4 fused with reverse
depthwise separable convolution (rdsc) for usv,” Journal of Marine
Science and Engineering, vol. 9, no. 7, pp. 753, 2021.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weis-
senborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani,
Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image
is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
[15] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang,
Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision
transformer using shifted windows,” in Proceedings of the IEEE/CVF
international conference on computer vision, 2021, pp. 10012–10022.
[16] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan
Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al., “Swin
transformer v2: Scaling up capacity and resolution,” in Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition,
2022, pp. 12009–12019.
[17] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei, “Beit: Bert
pre-training of image transformers,” in International Conference on
Learning Representations, 2021.
[18] Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui
Hsieh, “On the adversarial robustness of visual transformers,” arXiv
preprint arXiv:2103.15670, vol. 1, no. 2, 2021.
[19] Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li,
Thomas Unterthiner, and Andreas Veit, “Understanding robustness of
transformers for image classification,” Mar 2021.
[20] Sayak Paul and Pin-Yu Chen, “Vision transformers are robust learn-
ers,” Proceedings of the AAAI Conference on Artificial Intelligence,
p. 2071–2081, Jul 2022.
Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H. Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang,
“Intriguing properties of vision transformers,” Dec 2021.
[22] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis,
Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic, “Revis-
iting the calibration of modern neural networks,” Advances in Neural
Information Processing Systems, vol. 34, pp. 15682–15694, 2021.
[23] Namuk Park and Songkuk Kim, “Blurs behave like ensembles: Spatial
smoothings to improve accuracy, uncertainty, and robustness,” in
International Conference on Machine Learning. PMLR, 2022, pp.
17390–17419.
[24] Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz
Dudziak, Hongsheng Li, Georgios Tzimiropoulos, and Brais Martinez,
“Edgevits: Competing light-weight cnns on mobile devices with vision
transformers,” in Computer Vision–ECCV 2022: 17th European
Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part
XI. Springer, 2022, pp. 294–311.
[25] Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman
Khan, Syed Waqas Zamir, Rao Muhammad Anwer, and Fahad Shah-
baz Khan, “Edgenext: efficiently amalgamated cnn-transformer archi-
tecture for mobile vision applications,” in Computer Vision–ECCV
2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings,
Part VII. Springer, 2023, pp. 3–20.
[26] Sachin Mehta and Mohammad Rastegari, “Mobilevit: light-weight,
general-purpose, and mobile-friendly vision transformer,” arXiv
preprint arXiv:2110.02178, 2021.
[27] Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis,
Sergey Tulyakov, Yanzhi Wang, and Jian Ren, “Efficientformer: Vision
transformers at mobilenet speed,” Advances in Neural Information
Processing Systems, vol. 35, pp. 12934–12949, 2022.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Spatial
pyramid pooling in deep convolutional networks for visual recogni-
tion,” IEEE transactions on pattern analysis and machine intelligence,
vol. 37, no. 9, pp. 1904–1916, 2015.
[29] Qing-Long Zhang and Yu-Bin Yang, “Sa-net: Shuffle attention for
deep convolutional neural networks,” in ICASSP 2021 - 2021 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), May 2021.
[30] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and
Chang Xu, “Ghostnet: More features from cheap operations,” in 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Aug 2020.
Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao,
“Yolov4: Optimal speed and accuracy of object detection,” Apr 2020.
[32] R. Qi Charles, Hao Su, Mo Kaichun, and Leonidas J. Guibas,
“Pointnet: Deep learning on point sets for 3d classification and
segmentation,” in 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Nov 2017.
Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas, “Pointnet++:
Deep hierarchical feature learning on point sets in a metric space,”
Jun 2017.
[34] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai, “Deformable
convnets v2: More deformable, better results,” in Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition,
2019, pp. 9308–9316.
[35] Yunyun Song, Zhengyu Xie, Xinwei Wang, and Yingquan Zou, “Ms-
yolo: Object detection based on yolov5 optimized fusion millimeter-
wave radar and machine vision,” IEEE Sensors Journal, vol. 22, no.
15, pp. 15435–15447, 2022.
[36] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun, “Yolox:
Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430,
2021.
[37] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao,
“Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-
time object detectors,” arXiv preprint arXiv:2207.02696, 2022.
[38] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao,
“Yolov4: Optimal speed and accuracy of object detection,” arXiv
preprint arXiv:2004.10934, 2020.
[39] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M
Alvarez, and Ping Luo, “Segformer: Simple and efficient design
for semantic segmentation with transformers,” Advances in Neural
Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and
Jiaya Jia, “Pyramid scene parsing network,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017,
pp. 2881–2890.
[41] Shanliang Yao, Runwei Guan, Zhaodong Wu, Yi Ni, Zixian Zhang,
Zile Huang, Xiaohui Zhu, Yutao Yue, Yong Yue, Hyungjoon Seo, and
Ka Lok Man, “Waterscenes: A multi-task 4d radar-camera fusion
dataset and benchmark for autonomous driving on water surfaces,”
2023.
[42] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakr-
ishna Vedantam, Devi Parikh, and Dhruv Batra, “Grad-cam: Visual
explanations from deep networks via gradient-based localization,” in
Proceedings of the IEEE international conference on computer vision,
2017, pp. 618–626.