Direct-CP: Directed Collaborative Perception for Connected and
Autonomous Vehicles via Proactive Attention
Yihang Tao1, Senkang Hu1, Zhengru Fang1, and Yuguang Fang1*
Abstract— Collaborative perception (CP) leverages visual
data from connected and autonomous vehicles (CAVs) to enhance
an ego vehicle’s field of view (FoV). Despite recent progress, cur-
rent CP methods expand the ego vehicle’s 360-degree perceptual
range almost equally, which faces two key challenges. Firstly,
in areas with uneven traffic distribution, focusing on directions
with little traffic offers limited benefits. Secondly, under limited
communication budgets, allocating excessive bandwidth to less
critical directions lowers the perception accuracy in more vital
areas. To address these issues, we propose Direct-CP, a proactive
and direction-aware CP system aiming at improving CP in
specific directions. Our key idea is to enable an ego vehicle
to proactively signal its interested directions and readjust its
attention to enhance local directional CP performance. To
achieve this, we first propose an RSU-aided direction masking
mechanism that assists an ego vehicle in identifying vital
directions. Additionally, we design a direction-aware selective
attention module to wisely aggregate pertinent features based on
the ego vehicle's directional priorities, communication budget, and
the positional data of CAVs. Moreover, we introduce a direction-
weighted detection loss (DWLoss) to capture the divergence
between directional CP outcomes and the ground truth, facil-
itating effective model training. Extensive experiments on the
V2X-Sim 2.0 dataset demonstrate that our approach achieves
19.8% higher local perception accuracy in interested directions
and 2.5% higher overall perception accuracy than the state-of-
the-art methods in collaborative 3D object detection tasks.
I. INTRODUCTION
Collaborative perception (CP) [1]–[3] has emerged as
a promising approach to expand the perceptual range of
individual vehicles by integrating visual data from multiple
connected and autonomous vehicles (CAVs). To effectively
monitor road traffic, each CAV is equipped with an array
of LiDARs or cameras that capture environmental data from
various angles. This information is subsequently synthesized
into a bird’s eye view (BEV) map, offering a comprehensive
representation of a vehicle’s surroundings [4]. Nonetheless,
relying solely on a single BEV-aided perception system is
often insufficient for overcoming blind spots caused by road
obstacles or other CAVs. To address this shortcoming, CP
has been adapted to allow multiple CAVs to share their
local BEV features, thereby enhancing the accuracy and
comprehensiveness of BEV predictions.
*Corresponding author. This work was supported in part by the Hong
Kong Innovation and Technology Commission under InnoHK Project
CIMDA, by the Hong Kong SAR Government under the Global STEM
Professorship, and by the Hong Kong Jockey Club under JC STEM Lab of
Smart City.
1Yihang Tao, Senkang Hu, Zhengru Fang and Yuguang Fang are
with Department of Computer Science, City University of Hong Kong,
Kowloon, Hong Kong. (Email: {yihangtao2-c, senkang.forest,
zhefang4-c}@my.cityu.edu.hk, my.Fang@cityu.edu.hk)
Fig. 1. Overview of the directed CP framework. With a limited communication
budget, the ego CAV may wish to enhance its CP performance more
in certain directions where traffic is complex, while keeping basic
perception in other directions with minimal traffic.
Currently, most existing studies [5], [6] focus on optimiz-
ing 360-degree omnidirectional CP performance, aiming to
extend an ego CAV’s scope in every direction almost equally.
However, this overlooks the uneven traffic density across
different directions and the varying interest of an ego CAV
in specific directions. For instance, as illustrated in Fig. 1,
when an ego CAV is making a right turn at an intersection,
it may encounter minimal traffic to its rear and left front,
whereas the traffic is significantly more complex to its right
front. In such scenarios, the ego CAV would benefit from
a targeted perception enhancement towards its right front
while maintaining basic (e.g., single-vehicle) perception for
other directions. Existing methods aim to uniformly enhance
perception across all directions, lacking the flexibility for an
ego CAV to proactively adjust its view-level priority.
In addition, the communication overhead is a critical factor
that must be carefully considered when designing a CP
system [2], [7]–[16]. With constraints such as a limited
communication budget and a maximum allowable delay,
engaging all collaborators and fully utilizing their captured
views for enhancing perception across 360 degrees can
significantly burden both communication and computational
resources. This is particularly severe when the number of col-
laborators and the frame rate (measured in frames per second,
FPS) are high. Indeed, reallocating communication resources
from less critical directions to enhance perception in more
TABLE I
COMPARISON OF RELATED WORKS.

Method            Message   Fusion                          Perception Gain
Who2com [17]      Full      Average                         Omnidirectional
V2VNet [8]        Full      Average                         Omnidirectional
PACP [1]          Full      Priority-based average          Omnidirectional
When2com [5]      Full      Agent-level attention           Omnidirectional
V2X-ViT [18]      Full      Self-attention                  Omnidirectional
Where2comm [10]   Sparse    Confidence-aware attention      Omnidirectional
Direct-CP (Ours)  Sparse    Proactive selective attention   Directed
important areas is not only strategically advantageous but
also enhances CP efficiency in terms of both communication
and computation.
Motivated by the above observation, we propose Direct-
CP, which enables an ego CAV to proactively specify its
interested directions and intelligently optimize perception
performance toward these directions under the constraints
of a limited communication budget. To achieve this, we plan
to deploy several roadside units (RSUs) to monitor the traffic
distribution around the ego CAV. These RSUs provide critical
data that assists the ego vehicle in determining its interested
directions. Additionally, we have developed a direction-aware
attention module that takes the ego CAV's preferred directions,
its communication budget, and the positional information of
other CAVs as input, and generates sparse query maps that
intelligently select the most relevant information from nearby
CAVs for aggregation, thereby enhancing CP perfor-
mance in the selected directions. Moreover, we define a
direction-weighted detection loss (DWLoss) to measure the
directional perception discrepancy between prediction and
ground truth. To the best of our knowledge, this is the first
work designed to optimize CP based on local directional
priorities. Our contributions can be summarized as follows.
•We propose a flexible CP framework named Direct-CP,
which enhances perception performance towards spe-
cific directions under a limited communication budget,
tailored to the proactive interests of an ego CAV.
•We design a direction-aware selective attention module
that incorporates an RSU-aided direction masking mech-
anism and adaptively selects relevant feature data from
multiple vehicles to boost local directional perception.
Additionally, we design a direction-weighted detection
loss (DWLoss) to measure the directed perception dis-
crepancy between the outputs and the ground truth.
•We conduct extensive experiments on collaborative 3D
detection tasks and the results demonstrate that our
method realizes the proactive directed CP enhancement,
achieving 2.5% higher overall perception accuracy and
19.8% higher local perception accuracy in the interested
directions than the state-of-the-art method.
II. RELATED WORKS
A. Collaborative perception
Collaborative perception has gained significant attention
for its ability to enhance the sensing capabilities of individual
vehicles beyond the constraints of isolated sensors. This
approach favors intermediate-stage fusion strategies [19]–
[23], which facilitate the exchange of intermediate feature
representations among CAVs to improve collaboration. How-
ever, as the dimensionality of features and the number of col-
laborators grow, efficient bandwidth management becomes
essential. Who2com [17] introduces a sophisticated multi-
stage handshake mechanism that trains neural networks to
compress critical information for each stage, optimizing
vehicle connectivity through a matching score. V2VNet [8]
leverages a graph neural network to effectively aggregate in-
formation from nearby CAVs, significantly enhancing collec-
tive perception by establishing a robust information network.
However, these approaches overlook the varying importance
of individual CAVs in optimizing CP. To address this, PACP
[1] implements a BEV-match mechanism to prioritize col-
laborative CAVs before message fusion. Nevertheless, PACP
focuses solely on prioritizing each agent without considering
different priority levels of various views from a single
agent. Besides, PACP aims to optimize omnidirectional CP
performance, lacking the flexibility to enable an ego CAV to
dynamically and proactively adjust its directional CP, which
is the focus of this paper.
B. Attention-based LiDAR perception
Recent advancements in LiDAR-based CP have integrated
attention mechanisms to boost performance and reduce com-
munication overhead. When2com [5] employs scaled general
attention to assess correlations among different agents, re-
ducing transmission redundancy. V2X-ViT [18] introduces
the heterogeneous multi-agent attention for fusing messages
across diverse agents. However, these methods require the
initial transmission of full feature maps, which consumes
substantial bandwidth. More recently, Where2comm [10]
advances the field by utilizing sparse feature maps with
location-specific and confidence-aware attention, optimizing
data exchange and processing efficiency by focusing on the
most relevant features. Despite its progress, Where2comm
lacks the flexibility for an ego vehicle to adjust its perceptual
focus based on immediate environmental demands and may
not be optimal under limited communication conditions. As
outlined in Table I, our proposed Direct-CP contrasts by
providing a flexible and directed perception enhancement
tailored to an ego vehicle’s proactive needs under limited
communication constraints. This targeted approach improves
data relevance and efficiency, aligning closely with real-time
needs in dynamic vehicular settings.
III. METHODOLOGY
The overall architecture of our method is depicted in Fig.
2. Several RSUs have been deployed along the roadway to
capture the traffic distribution surrounding the ego vehicle.
Owing to their elevated positions, RSUs are capable of
monitoring a more extensive traffic view than individual
vehicles can achieve. Periodically, the ego CAV transmits
its location and speed data to the nearby RSU and receives
a computed direction attention score (DAS) from the RSU
for reference.
Fig. 2. Method overview. The ego CAV integrates the DAS from a nearby RSU with its interest weights to create the direction mask. Subsequently, the initial
query map, pose data from other CAVs, the direction mask, and the communication budget are fed into QC-Net to produce sparse query maps. QC-Net consists
of two main components: (i) a Direction Control Module that outputs query confidence maps (QCMs) prioritizing directions, and (ii) a Query Clipping Layer
that ranks QCMs and generates binary query maps selecting only the top $Q_{max} \times H \times W$ significant queries, complying with the communication budget.
Utilizing the DAS in conjunction with its own interests, the ego vehicle
identifies which directions are temporarily non-essential and consequently masks them during
the collaborative perception process. Subsequently, guided
by its prioritized directions, communication budget, and the
pose data of other CAVs, the ego CAV refines the selection
of optimal feature map queries to nearby CAVs, aiming to
maximize the directed perception performance within the
constraints on communication budgets. The detailed methods
are elaborated in the following subsections.
A. RSU-aided direction masking
In this paper, we resort to RSUs to aid ego CAVs in
judging the important directions. We consider splitting the
360-degree space surrounding an ego CAV into Ndir local
directions. Based on the location and speed of an ego CAV,
the corresponding RSU first projects it into the 2D view
it captured, and then calculates DAS in Ndir predefined
directions. Here, for simplicity, we take the detected number
of vehicles as the indicator to calculate DAS. Thus, the
returned DASs from the RSU are represented as $\{S_r^i\}_{i=1}^{N_{dir}} = \{N_{vec}^i\}_{i=1}^{N_{dir}}$,
where $N_{vec}^i$ denotes the detected number of
vehicles in the i-th direction surrounding the ego CAV. After
getting the DAS $\{S_r^i\}_{i=1}^{N_{dir}}$ from the RSU, the ego CAV calculates
the final direction mask combining its own interest weights
$\{I_e^i\}_{i=1}^{N_{dir}}$. The interest weights can be flexibly set according
to the ego vehicle's proactive willingness. When the interest
weights are uniformly assigned, the ego vehicle determines
the direction importance entirely according to the RSU. The
final direction mask $\{M_i\}_{i=1}^{N_{dir}}$ is calculated as follows:

$$M_i = \max\left( H\!\left( \frac{S_r^i I_e^i}{\sum_{j=1}^{N_{dir}} S_r^j I_e^j} - \sigma_1 \right),\; H\!\left( S_r^i I_e^i - \sigma_2 \right) \right), \tag{1}$$

where the Heaviside step function $H(\cdot)$ equals 1 when its argument
is positive and 0 otherwise. $\sigma_1$ is the threshold that judges
whether the i-th direction is important among all the directions.
However, in some cases there could be complex traffic even
in a less important direction, so we set an additional con-
stant threshold $\sigma_2$ to recognize this case. Our proposed RSU-
aided direction masking mechanism provides two important
advantages: 1) only the ego location, speed, and the DAS
from the RSU are transmitted during the interaction, requiring
minimal bandwidth and preserving real-time communication; 2)
the interest weight matrix fully ensures the ego vehicle's
proactivity in judging the important directions, and the ego
vehicle can easily adjust its interest weights to favor certain
directions regardless of the RSU's suggestions.
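To make the masking rule of Eq. (1) concrete, the following is a minimal sketch, assuming vehicle counts from the RSU serve as the DAS; the function name and the example thresholds ($\sigma_1 = 0.1$, $\sigma_2 = 5$) are illustrative choices rather than values from the paper.

```python
import numpy as np

def direction_mask(das, interest, sigma1=0.1, sigma2=5.0):
    """Direction mask of Eq. (1): combine the RSU direction attention scores (DAS)
    with the ego vehicle's interest weights. A direction is kept if either its
    normalized weighted score exceeds sigma1 or its raw weighted score exceeds sigma2."""
    das = np.asarray(das, dtype=float)            # S_r^i: e.g., detected vehicle counts per direction
    interest = np.asarray(interest, dtype=float)  # I_e^i: ego's proactive interest weights
    weighted = das * interest
    relative = weighted / (weighted.sum() + 1e-8)   # normalized importance over the N_dir directions
    heaviside = lambda x: (x > 0).astype(float)     # H(.) = 1 for a positive argument, else 0
    return np.maximum(heaviside(relative - sigma1), heaviside(weighted - sigma2))

# Example: four directions (left front, right front, right back, left back) with
# hypothetical RSU vehicle counts [8, 12, 1, 2] and interest weights [0.9, 0.9, 0.1, 0.1].
mask = direction_mask([8, 12, 1, 2], [0.9, 0.9, 0.1, 0.1])
print(mask)  # [1. 1. 0. 0.] -> only the two front directions survive the mask
```

In this example only the two front directions are retained, which mirrors the front-heavy interest weights used later in the experiments.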
B. Direction-aware selective attention
Consider $N$ CAVs in total in the scenario. Assume that
the direction priority, the observation set, and the perception
supervision of the i-th CAV are represented as $M_i$, $X_i$,
and $Y_i$, respectively. The objective of our considered directed
collaborative perception system is to maximize the
perception performance toward the interested directions of all
agents as a function of the communication budget $B$ and the
number of CAVs $N$, written as:

$$\xi_\Phi(B, N) = \arg\max_{\theta,\, T} \sum_{i=1}^{N} g\!\left(\Phi_\theta\!\left(X_i, \{T_{i,k}\}_{k=1}^{N}, M_i\right), Y_i\right), \quad \text{s.t.} \;\; \sum_{k=1}^{N} \left|\{T_{i,k}\}_{k=1}^{N}\right| \leq B, \tag{2}$$

where $g(\cdot,\cdot)$ is the perception performance metric, $\Phi_\theta$ is the
perception model with trainable parameters $\theta$, and $\{T_{i,k}\}_{k=1}^{N}$ are
the messages transmitted from the k-th agent (each with $M$
features) to the i-th agent. Note that the case $N = 1$
indicates single-vehicle perception.
Upon receiving a 3D point cloud, the i-th CAV first
converts the data into a BEV map. The BEV encoder
$\Phi_{bev}$ processes this map to extract features, generating the
feature map $\Phi_{bev}(X_i) = F_i \in \mathbb{R}^{H \times W \times D}$, where $H$, $W$,
and $D$ represent the height, width, and channel dimensions,
respectively. All agents project their perceptual data into a
unified global coordinate system, facilitating seamless cross-
agent collaboration without the need for complex coordi-
nate transformations. The resulting feature maps are then fused
with one another following direction-aware selective attention
(DSA). The core component of DSA is the query-control
net (QC-Net), which takes the initial query map $Q_0 \in \mathbb{R}^{H \times W \times (N-1)}$,
the embedding of the nearby cooperative CAVs' pose matrices
$\mathrm{PE}(\{P_i\}_{i=2}^{N}) \in \mathbb{R}^{H \times W \times (N-1)}$, the embedding of the ego CAV's
direction mask $\mathrm{DE}(\{M_k\}_{k=1}^{N_{dir}}) \in \mathbb{R}^{H \times W \times (N-1)}$, and the
communication budget as input, and generates proactive bi-
nary query maps $\{Q_k\}_{k=2}^{N} \in \mathbb{R}^{H \times W \times (N-1)}$ (a value of 1 means
activating data transmission at the corresponding location of the
BEV feature map). The communication budget $Q_{max} \in [0, 1]$
is defined as the ratio of the maximum number of
activated queries to the size of the query map, which satisfies:

$$Q_{max} \geq \frac{\sum_{k=2}^{N} \sum_{i=1}^{H} \sum_{j=1}^{W} Q_k^{i,j}}{H \times W \times (N-1)}, \tag{3}$$

where $Q_{max} = 1$ means allowing CAVs to transmit their full
feature maps to the ego vehicle. The QC-Net consists of a
three-layer MLP. The direction control module first generates
a query confidence map (QCM) for each CAV, $\{C_k\}_{k=2}^{N} \in \mathbb{R}^{H \times W \times (N-1)}$,
where $C_k^{i,j} \in [0, 1]$ represents the priority of the
(i, j)-th element of the k-th QCM for enhancing CP in the ego
vehicle's interested directions. Denoting the direction control
module by $\Phi_{dcl}(\cdot)$, the QCMs are calculated by:

$$\{C_k\}_{k=2}^{N} = \Phi_{dcl}\!\left(Q_0,\, \mathrm{PE}(\{P_i\}_{i=2}^{N}),\, \mathrm{DE}(\{M_k\}_{k=1}^{N_{dir}})\right). \tag{4}$$
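As a rough illustration of Eq. (4), the sketch below assumes the direction control module is a per-location three-layer MLP with a sigmoid output applied to the stacked initial-query, pose-embedding, and direction-mask channels; the exact wiring of QC-Net is not specified at this level of detail, so the class, hidden size, and shapes are hypothetical.

```python
import torch
import torch.nn as nn

class DirectionControlModule(nn.Module):
    """Sketch of Phi_dcl in Eq. (4): a small MLP mapping the initial query map,
    the pose embeddings PE({P_i}) and the direction-mask embedding DE({M_k})
    to query confidence maps (QCMs) with values in [0, 1]."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(                 # three-layer MLP, as described in the text
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # confidences C_k^{i,j} in [0, 1]
        )

    def forward(self, q0, pose_emb, dir_emb):
        # q0, pose_emb, dir_emb: (H, W, N-1) maps, stacked along a feature axis.
        x = torch.stack([q0, pose_emb, dir_emb], dim=-1)   # (H, W, N-1, 3)
        return self.mlp(x).squeeze(-1)                     # QCMs {C_k}: (H, W, N-1)

# Example: 3 collaborators on a 100x252 BEV grid.
dcl = DirectionControlModule()
qcm = dcl(torch.rand(100, 252, 3), torch.rand(100, 252, 3), torch.rand(100, 252, 3))
print(qcm.shape, float(qcm.min()) >= 0.0, float(qcm.max()) <= 1.0)
```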
Given communication constraints, we introduce a query
clipping layer to control the transmitted data during the
collaboration. In this layer, we rank $C_k^{i,j}$ for each QCM,
retaining only the top $Q_{max} \times H \times W$ values and setting
the others to zero, ensuring adherence to the predefined commu-
nication budget. The QC-Net finally produces sparse query
maps $\{Q_k\}_{k=2}^{N}$ as follows:

$$Q_k^{i,j} = \begin{cases} 1, & \text{if } C_k^{i,j} \in \mathrm{TOP}_{Q_{max} \times H \times W}\!\left(\{C_k\}_{k=2}^{N}\right), \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$

where $\mathrm{TOP}_k(\cdot)$ represents the top $k$ elements of a set.
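The clipping of Eqs. (3) and (5) amounts to a per-QCM top-k selection. The helper below is a sketch under the assumption that the QCMs are stored as a single H×W×(N−1) tensor and that the top $Q_{max} \cdot H \cdot W$ confidences are kept independently for each collaborator, as described in the text; the function name is illustrative.

```python
import torch

def query_clipping(qcm, q_max):
    """Query clipping layer of Eq. (5): for each query confidence map C_k,
    keep only the top (q_max * H * W) confidences and zero out the rest,
    yielding binary query maps Q_k that satisfy the budget of Eq. (3)."""
    H, W, M = qcm.shape                      # M = N - 1 collaborating CAVs
    k = max(1, int(q_max * H * W))           # number of activated queries per QCM
    flat = qcm.reshape(H * W, M)
    query = torch.zeros_like(flat)
    topk_idx = flat.topk(k, dim=0).indices   # rank confidences within each QCM
    query.scatter_(0, topk_idx, 1.0)         # activate only the top-k locations
    return query.reshape(H, W, M)

# Example: QCMs for 3 collaborators on a 100x252 BEV grid with a 0.2 budget.
qcm = torch.rand(100, 252, 3)
q = query_clipping(qcm, q_max=0.2)
print(q.sum().item() / q.numel())            # <= 0.2, matching Eq. (3)
```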
Collaborative CAVs receive these query maps and compute
direction-aware sparse feature maps as $H_i = Q_i \odot F_i \in \mathbb{R}^{H \times W \times D}$,
where $\odot$ denotes the Hadamard product of two
matrices. Subsequently, each ego vehicle fuses features from
multiple agents at each spatial location:

$$W_{i,j}^{DSA} = \mathrm{MAttn}(F_i, H_{i,j}, H_{i,j}) \odot C_j, \tag{6}$$

where $W_{i,j}^{DSA} \in \mathbb{R}^{H \times W}$ is the DSA weight assigned to the j-th
agent by the i-th agent, and $\mathrm{MAttn}(\cdot)$ represents multi-head
attention at each spatial location. The fused feature map for
the ego vehicle, $F_i^{out} \in \mathbb{R}^{H \times W \times D}$, is expressed as:

$$F_i^{out} = \mathrm{FFN}\!\left(\sum_{j=1}^{N} W_{i,j}^{DSA} \odot H_{i,j}\right), \tag{7}$$

where $\mathrm{FFN}(\cdot)$ is the feed-forward network.
Fig. 3. AP of different CP methods under various communication budgets.
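A simplified PyTorch sketch of the fusion in Eqs. (6)-(7) follows; it replaces the multi-head operator MAttn with a single-head scaled dot-product score and normalizes the weights across collaborators with a softmax, both simplifying assumptions of this sketch rather than details specified in the paper.

```python
import torch
import torch.nn as nn

class DSAFusion(nn.Module):
    """Simplified direction-aware selective attention (Eqs. (6)-(7)):
    a per-location attention weight between the ego feature F_i and each
    collaborator's sparse feature H_{i,j} is modulated by the QCM C_j,
    and the weighted features are summed and passed through an FFN."""
    def __init__(self, dim=64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.scale = dim ** -0.5

    def forward(self, ego_feat, coop_feats, qcms):
        # ego_feat: (H, W, D); coop_feats: (M, H, W, D) sparse maps H_{i,j};
        # qcms: (M, H, W) query confidence maps C_j, with M = N - 1.
        q = self.q_proj(ego_feat)                                   # queries from F_i
        k = self.k_proj(coop_feats)                                 # keys from H_{i,j}
        scores = (q.unsqueeze(0) * k).sum(-1) * self.scale          # per-location attention logits
        w_dsa = torch.softmax(scores, dim=0) * qcms                 # Eq. (6): weights modulated by C_j
        fused = (w_dsa.unsqueeze(-1) * coop_feats).sum(dim=0)       # sum_j W_{i,j} (*) H_{i,j}
        return self.ffn(fused)                                      # Eq. (7)

# Example: 2 collaborators on a 100x252 BEV grid with D = 64 channels.
fuse = DSAFusion(dim=64)
out = fuse(torch.rand(100, 252, 64), torch.rand(2, 100, 252, 64), torch.rand(2, 100, 252))
print(out.shape)  # torch.Size([100, 252, 64])
```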
C. Direction-weighted detection loss
Given the final fused feature map $F_i^{out}$, the detection
decoder $\Phi_{dec}(\cdot)$ generates class and regression outputs
following [10]. Each output location of $\Phi_{dec}(F_i^{out}) \in \mathbb{R}^{H \times W \times 7}$
corresponds to a rotated box described by a
7-tuple $(c, x, y, h, w, \cos\alpha, \sin\alpha)$, representing class confi-
dence, position, size, and angle. To evaluate the discrep-
ancy between the collaborative 3D detection results and the
ground truth, the commonly used detection loss $\mathcal{L}_{det}$ [24]
combines focal loss, object offset loss, and object size loss.
However, this loss does not fully capture the importance of
specific directions in our directed CP scenario. Therefore, we
introduce a novel loss function, the direction-weighted detection
loss (DWLoss), to quantify the divergence in designated
directions. DWLoss is calculated by dividing the 3D detec-
tion results into $N_{dir}$ subsets and computing the detection
loss $\{\mathcal{L}_{det}^{i}\}_{i=1}^{N_{dir}}$ for each subset with direction-dependent weights,
represented as follows:

$$\mathcal{L}_{DW} = \frac{\sum_{i=1}^{N_{dir}} \mathcal{L}_{det}^{i} \cdot (M_i + \sigma)}{\sum_{i=1}^{N_{dir}} M_i + \sigma N_{dir}}, \tag{8}$$

where $\sigma$ is a constant weight-control factor. Eq. (8) ensures
that lower weights are assigned to non-critical directions via the
weight factor $\sigma$, aiming to jointly optimize the CP performance in the
interested directions and the remaining directions. The choice
of $\sigma$ is crucial: setting it too high may obscure the importance
of the interested directions, while setting it too low (an extreme
case is 0) can neglect the accuracy in non-critical directions
during training, potentially degrading perception even below
single-vehicle perception. Ablation studies in Section IV will
offer helpful guidance for determining an effective $\sigma$.
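Assuming the per-direction losses $\mathcal{L}_{det}^{i}$ have already been computed with the base detection loss of [24] (outside the scope of this sketch), Eq. (8) reduces to a short weighted average; the helper and the example numbers below are purely illustrative.

```python
import torch

def dw_loss(per_direction_losses, direction_mask, sigma=1.0):
    """Direction-weighted detection loss of Eq. (8): each directional subset's
    detection loss L_det^i is weighted by (M_i + sigma) and the sum is
    normalized by sum(M_i) + sigma * N_dir."""
    losses = torch.stack(per_direction_losses)                   # {L_det^i}, one scalar per direction
    mask = torch.as_tensor(direction_mask, dtype=losses.dtype)   # {M_i} from Eq. (1)
    n_dir = mask.numel()
    return (losses * (mask + sigma)).sum() / (mask.sum() + sigma * n_dir)

# Example: 4 directions, mask [1, 1, 0, 0] (front directions prioritized), sigma = 1.0.
subset_losses = [torch.tensor(0.8), torch.tensor(0.6), torch.tensor(0.4), torch.tensor(0.5)]
print(dw_loss(subset_losses, [1, 1, 0, 0], sigma=1.0))
# -> (0.8*2 + 0.6*2 + 0.4*1 + 0.5*1) / (2 + 4) = 3.7 / 6 ~= 0.6167
```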
IV. EXPERIMENTS
A. Experimental setup
Dataset and baselines. Our experimental evaluations are
conducted on the V2X-Sim 2.0 Dataset [25], an extensive
simulated dataset generated using the CARLA simulator [26].
Fig. 4. Visualization of Direct-CP and baselines on the V2X-Sim 2.0 dataset. The green boxes are ground truth and the red boxes are predictions.
Where2comm achieves higher global CP accuracy than the lower-bound but degrades in some local directions. Our proposed Direct-CP proactively guides
the attention to boost perception in the ego's interested directions (denoted with the number 1, and the right arrow indicates the ego's moving direction).
This dataset comprises 10,000 frames of 3D LiDAR
point clouds along with 501,000 annotated 3D bounding
boxes. We configure the perception range to be 64m×64m,
and the 3D points are discretized into a BEV map of di-
mensions (252,100,64). We establish baseline comparisons,
including When2com [5], V2VNet [8], and Where2comm
[10]. To make the perception gain clearer, we set the single-
vehicle perception method as the lower-bound baseline.
Implementation details. We implement our Direct-CP
using PyTorch. The direction control module features a fully
connected layer with dimensions of 100×252 for both input
and output, complemented by a sigmoid layer to match the
BEV feature dimensions, incorporating pose matrices and
direction masks at a spatial resolution of (100,252). Our
detection module utilizes the LiDAR-based 3D object detec-
tion framework PointPillar [27]. We set the training batch
size at 6 and the maximum epochs at 60. The 360-degree
space is divided into 4 directions: [0, 90◦], [90◦, 180◦],
[180◦, 270◦], and [270◦, 360◦], corresponding to left front,
right front, right back, and left back, with interest weights of
[0.9, 0.9, 0.1, 0.1], respectively. The default DWLoss weight
factor σ is 1.0 and the default communication budget (defined
in Eq. 3) is 0.2. The setup for our experiments includes 2
Intel(R) Xeon(R) Silver 4410Y CPUs (2.0GHz), 4 NVIDIA
RTX A5000 GPUs, and 512GB DDR4 RAM.
Evaluation metrics. For 3D detection tasks, the inter-
section over union (IoU) is a common evaluation metric,
calculated as the area of intersection divided by the area
of union. However, IoU assesses omnidirectional perception
performance. To specifically evaluate our proposed directed
perception performance, we additionally introduce a metric
named partial-direction intersection over union (PD-IoU).
This involves dividing the BEV map into Ndir subsets based
on predefined directions, with PD-IoU separately measuring
IoU within these individual subsets.
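One way to realize PD-IoU is to bin boxes into the Ndir angular sectors around the ego vehicle and then evaluate the usual IoU/AP on each subset; the sketch below assumes boxes are assigned to sectors by the angle of their BEV centers, which is an assumption since the exact assignment rule is not spelled out here.

```python
import numpy as np

def split_by_direction(boxes, ego_xy, n_dir=4):
    """Partition boxes into N_dir angular sectors around the ego vehicle, as used
    by PD-IoU: the per-sector metric (left abstract here) is then computed on each subset.
    boxes: (K, 2+) array whose first two columns are BEV box centers (x, y)."""
    dx = boxes[:, 0] - ego_xy[0]
    dy = boxes[:, 1] - ego_xy[1]
    angles = np.degrees(np.arctan2(dy, dx)) % 360.0      # angle of each box center in [0, 360)
    sector = (angles // (360.0 / n_dir)).astype(int)     # 0: [0, 90), 1: [90, 180), ...
    return [boxes[sector == i] for i in range(n_dir)]

# Example: bin predicted box centers around an ego at the origin into 4 sectors;
# per-sector IoU/AP would then be evaluated against the ground truth of each subset.
preds = np.array([[10.0, 5.0], [-8.0, 12.0], [-3.0, -9.0], [7.0, -2.0]])
subsets = split_by_direction(preds, ego_xy=(0.0, 0.0), n_dir=4)
print([len(s) for s in subsets])  # [1, 1, 1, 1]
```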
B. Quantitative results
Evaluation of Direct-CP. We evaluate Direct-CP against
baselines in the overall CP performance (AP@IoU=0.5/0.7)
and in specific directions (AP@PD-IoU=0.5/0.7, interested
directions are denoted with *). As shown in Table II, Direct-
CP uses direction-aware selective attention to reallocate com-
munication resources, slightly outperforming the state-of-
the-art Where2comm in terms of overall AP@IoU. For PD-
IoU, Where2comm optimizes CP omnidirectionally, showing
similar AP@PD-IoU across all directions, while Direct-
CP focuses on preferred directions, achieving 18.2% higher
AP@PD-IoU=0.5 and 19.8% higher AP@PD-IoU=0.7 than
Where2comm in these directions. These results demonstrate
that Direct-CP enables an ego vehicle to flexibly adjust view
focus and improve CP performance in the desired directions.
Communication efficiency. Moreover, we investigate how
varying communication budgets affect CP performance,
as shown in Fig. 3, with budgets ranging from 0.01 to
0.25. Notably, below a budget of 0.1, both Direct-CP and
Where2comm experience a significant drop in AP@IoU=0.7
and AP@PD-IoU=0.7 for interested directions [0, 180◦].
Despite this, Direct-CP slightly outperforms Where2comm
overall and significantly improves perception in interested
directions. At a further reduced budget of 0.01, both meth-
ods perform equally, suffering major perception degradation
likely due to ultra-sparse feature maps impeding model
convergence. Overall, these results highlight Direct-CP’s
efficiency under constrained communication resources.
Ablation studies. To investigate the influence of the
weight factor σon the performance of Direct-CP, we conduct
an ablation study, varying σ from 0 to 2.0. When σ is
below 1.0, we observe a reduction in collaborative detection
accuracy, particularly in less critical directions.
Fig. 5. Visualization of ego CAV’s attention weights on neighboring CAVs. Where2comm attends to features of other CAVs more equally to optimize
omnidirectional perception. In contrast, our Direct-CP is direction-aware and queries features that are more relevant to the ego’s interested directions.
TABLE II
QUANTITATIVE RESULTS OF COLLABORATIVE 3D DETECTION (COMMUNICATION BUDGET = 0.2, * INDICATES PATRONIZED DIRECTIONS).

Method       |              AP@PD-IoU=0.5                                  | AP@IoU=0.5    |              AP@PD-IoU=0.7                                  | AP@IoU=0.7
             | [0, 90◦]*      [90◦, 180◦]*   [180◦, 270◦]  [270◦, 360◦]    |               | [0, 90◦]*      [90◦, 180◦]*   [180◦, 270◦]  [270◦, 360◦]    |
Lower-bound  | 40.97          53.89          28.83         37.57           | 55.01         | 31.14          38.97          20.13         28.24           | 41.91
When2com     | 34.97          33.56          19.36         49.96           | 53.56         | 24.81          24.59          8.75          40.42           | 38.70
V2VNet       | 57.49          53.62          28.01         60.36           | 67.35         | 42.61          41.05          19.54         36.59           | 48.22
Where2comm   | 51.29          59.38          48.83         56.27           | 79.59         | 45.86          44.52          37.89         48.83           | 64.96
Direct-CP    | 65.84 (↑28.4%) 65.48 (↑10.3%) 37.21         60.55           | 81.17 (↑2.0%) | 55.76 (↑21.6%) 53.20 (↑19.5%) 28.62         49.98           | 66.57 (↑2.5%)
TABLE III
ABLATION STUDIES ON THE EFFECT OF DWLOSS WEIGHT FACTOR σ (COMMUNICATION BUDGET = 0.2).

Direct-CP |              AP@PD-IoU=0.5                              | AP@IoU=0.5 |              AP@PD-IoU=0.7                              | AP@IoU=0.7
          | [0, 90◦]*  [90◦, 180◦]*  [180◦, 270◦]  [270◦, 360◦]     |            | [0, 90◦]*  [90◦, 180◦]*  [180◦, 270◦]  [270◦, 360◦]     |
σ = 0     | 38.28      51.37         30.71         27.12            | 61.63      | 14.84      29.16         14.75         3.38             | 31.41
σ = 0.5   | 59.83      59.12         36.66         58.30            | 76.19      | 41.03      44.85         25.72         46.24            | 58.18
σ = 1.0   | 65.84      65.48         37.21         60.55            | 81.17      | 55.76      53.20         28.62         49.98            | 66.57
σ = 1.5   | 52.86      62.98         41.49         55.78            | 73.94      | 44.81      51.90         34.93         47.84            | 62.12
σ = 2.0   | 49.24      62.46         32.23         55.14            | 73.21      | 36.97      48.21         25.51         42.90            | 57.18
Notably, AP@PD-IoU=0.7 for the sector [270◦, 360◦] declines to
0.03, markedly deteriorating below the lower-bound thresh-
old. Conversely, when σ exceeds 1.5, there is a discernible
decrease in detection accuracy for both the areas of interest
and the overall system. Based on these observations, a good
range for σ is between 1.0 and 1.5, which balances directed
perception performance with satisfactory overall accuracy.
C. Qualitative results
Visualization of collaborative 3D detection results.
As shown in Fig. 4, we display Direct-CP’s collaborative
detection results alongside baselines on the V2X-Sim 2.0
dataset. While Where2comm substantially improves global
perception over the lower-bound, it underperforms in certain
local directions, occasionally not exceeding single-vehicle
outcomes, likely due to limited communication budgets and
scattered focus. Conversely, our Direct-CP effectively redi-
rects attention from less critical to key areas, significantly
boosting local directional perception.
Visualization of ego CAV’s attention weights. As de-
picted in Fig. 5, we further compare the attention weights
of ego CAV assigned to neighboring CAVs’ feature maps
$W_{i,j}^{DSA}$ (defined in Eq. (6)) in the two methods. With limited
communication budgets, both methods query sparse features.
For Where2comm, the attention weights are more uniformly
assigned to other CAVs to enhance 360-degree CP per-
formance. In contrast, our proposed Direct-CP attends to
features that are more crucial to the ego vehicle’s interested
directions, informed by other CAVs’ pose information and
the ego's directional mask, shifting greater attention from CAVs
2 and 4 to CAVs 1 and 3 to improve directed CP performance.
V. CONCLUSION
In this paper, we have introduced Direct-CP, a novel CP
system for ego vehicles to enhance perception in patronized
directions. We have developed an RSU-aided direction masking
mechanism by integrating the RSU's traffic detection with the ego
vehicle's interest weights to identify key directions. We have also
designed a proactive direction-aware attention mechanism to
intelligently collect sparse feature maps from multiple vehi-
cles under limited communication budgets, thus improving
local directional perception. Additionally, we have created a
direction-weighted detection loss to align perception outputs
with ground truth more accurately. Extensive experiments
have been conducted and the results have demonstrated that
Direct-CP achieves directed performance gains under con-
strained communication resources and outperforms baselines
in terms of flexibility and efficiency.
REFERENCES
[1] Z. Fang, S. Hu, H. An, Y. Zhang, J. Wang, H. Cao, X. Chen, and
Y. Fang, “PACP: Priority-aware collaborative perception for connected
and autonomous vehicles,” IEEE Transactions on Mobile Computing,
(DOI: 10.1109/TMC.2024.3449371), Aug. 2024.
[2] S. Hu, Z. Fang, X. Chen, Y. Fang, and S. Kwong, “Towards
full-scene domain generalization in multi-agent collaborative bird’s
eye view segmentation for connected and autonomous driving,” 2024.
[Online]. Available: https://arxiv.org/abs/2311.16754
[3] Y. Zhang, H. An, Z. Fang, G. Xu, Y. Zhou, X. Chen, and Y. Fang,
“SmartCooper: Vehicular collaborative perception with adaptive fusion
and judger mechanism,” in IEEE International Conference on Robotics
and Automation (ICRA), Yokohama, Japan, May 2024.
[4] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, and S. Han,
“Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye
view representation,” in IEEE International Conference on Robotics
and Automation (ICRA), 2023, pp. 2774–2781.
[5] Y.-C. Liu, J. Tian, N. Glaser, and Z. Kira, “When2com: Multi-agent
perception via communication graph grouping,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), June 2020.
[6] Y. Li, S. Ren, P. Wu, S. Chen, C. Feng, and W. Zhang, “Learning
distilled collaboration graph for multi-agent perception,” in Advances
in Neural Information Processing Systems (NeurIPS), A. Beygelzimer,
Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021.
[7] Y.-C. Liu, J. Tian, C.-Y. Ma, N. Glaser, C.-W. Kuo, and Z. Kira,
“Who2com: Collaborative perception via learnable handshake com-
munication,” in IEEE International Conference on Robotics and Au-
tomation (ICRA), 2020, pp. 6876–6883.
[8] T.-H. Wang, S. Manivasagam, M. Liang, B. Yang, W. Zeng, and R. Ur-
tasun, “V2vnet: Vehicle-to-vehicle communication for joint perception
and prediction,” in European Conference on Computer Vision (ECCV).
Berlin, Heidelberg: Springer-Verlag, 2020, pp. 605–621.
[9] S. Hu, Z. Fang, Y. Deng, X. Chen, and Y. Fang, “Collaborative Per-
ception for Connected and Autonomous Driving: Challenges, Possible
Solutions and Opportunities,” Jan. 2024, arXiv:2401.01544 [cs, eess].
[10] Y. Hu, S. Fang, Z. Lei, Y. Zhong, and S. Chen, “Where2comm:
Communication-efficient collaborative perception via spatial confi-
dence maps,” in Advances in Neural Information Processing Systems
(NeurIPS), A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds.,
2022.
[11] S. Hu, Z. Fang, Z. Fang, Y. Deng, X. Chen, and Y. Fang, “AgentsCo-
Driver: Large Language Model Empowered Collaborative Driving with
Lifelong Learning,” Apr. 2024, arXiv:2404.06345 [cs].
[12] S. Hu, Z. Fang, Z. Fang, Y. Deng, X. Chen, Y. Fang, and S. Kwong,
“Agentscomerge: Large language model empowered collaborative
decision making for ramp merging,” 2024. [Online]. Available:
https://arxiv.org/abs/2408.03624
[13] Z. Fang, J. Wang, Y. Ren, Z. Han, H. V. Poor, and L. Hanzo,
“Age of information in energy harvesting aided massive multiple
access networks,” IEEE Journal on Selected Areas in Communications,
vol. 40, no. 5, pp. 1441–1456, May 2022.
[14] S. Hu, Z. Fang, H. An, G. Xu, Y. Zhou, X. Chen, and Y. Fang,
“Adaptive Communications in Collaborative Perception with Domain
Alignment for Autonomous Driving,” in IEEE Global Communica-
tions Conference (GLOBECOM). Cape Town, South Africa: IEEE,
Dec. 2024.
[15] Z. Fang, S. Hu, L. Yang, Y. Deng, X. Chen, and Y. Fang, “Pib:
Prioritized information bottleneck framework for collaborative edge
video analytics,” 2024. [Online]. Available: https://arxiv.org/abs/2408.17047
[16] Z. Fang, S. Hu, J. Wang, Y. Deng, X. Chen, and Y. Fang, “Prioritized
information bottleneck theoretic framework with distributed online
learning for edge video analytics,” 2024. [Online]. Available:
https://arxiv.org/abs/2409.00146
[17] Y.-C. Liu, J. Tian, C.-Y. Ma, N. Glaser, C.-W. Kuo, and
Z. Kira, “Who2com: Collaborative perception via learnable handshake
communication,” 2020. [Online]. Available: https://arxiv.org/abs/2003.09575
[18] R. Xu, H. Xiang, Z. Tu, X. Xia, M.-H. Yang, and J. Ma,
“V2x-vit: Vehicle-to-everything cooperative perception with vision
transformer,” in European Conference on Computer Vision (ECCV).
Berlin, Heidelberg: Springer-Verlag, 2022, pp. 107–124. [Online].
Available: https://doi.org/10.1007/978-3-031-19842-7_7
[19] Y. Zhou, J. Xiao, Y. Zhou, and G. Loianno, “Multi-robot collaborative
perception with graph neural networks,” IEEE Robotics and Automa-
tion Letters, vol. 7, no. 2, pp. 2289–2296, 2022.
[20] Y. Lu, Q. Li, B. Liu, M. Dianati, C. Feng, S. Chen, and Y. Wang,
“Robust collaborative 3d object detection in presence of pose errors,”
in IEEE International Conference on Robotics and Automation (ICRA),
2023, pp. 4812–4818.
[21] S. Su, Y. Li, S. He, S. Han, C. Feng, C. Ding, and F. Miao,
“Uncertainty quantification of collaborative detection for self-driving,”
in IEEE International Conference on Robotics and Automation (ICRA),
2023, pp. 5588–5594.
[22] R. Xu, W. Chen, H. Xiang, X. Xia, L. Liu, and J. Ma, “Model-agnostic
multi-agent perception framework,” in IEEE International Conference
on Robotics and Automation (ICRA), 2023, pp. 1471–1478.
[23] R. Xu, J. Li, X. Dong, H. Yu, and J. Ma, “Bridging the domain gap for
multi-agent perception,” in IEEE International Conference on Robotics
and Automation (ICRA), 2023, pp. 6035–6042.
[24] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019.
[Online]. Available: https://arxiv.org/abs/1904.07850
[25] Y. Li, D. Ma, Z. An, Z. Wang, Y. Zhong, S. Chen, and
C. Feng, “V2x-sim: Multi-agent collaborative perception dataset
and benchmark for autonomous driving,” 2022. [Online]. Available:
https://arxiv.org/abs/2202.08449
[26] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun,
“CARLA: An open urban driving simulator,” in Proceedings of the 1st
Annual Conference on Robot Learning, ser. Proceedings of Machine
Learning Research, S. Levine, V. Vanhoucke, and K. Goldberg, Eds.,
vol. 78. PMLR, 13–15 Nov 2017, pp. 1–16.
[27] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom,
“Pointpillars: Fast encoders for object detection from point clouds,”
in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), June 2019.