PACP: Priority-Aware Collaborative Perception
for Connected and Autonomous Vehicles
Zhengru Fang, Senkang Hu, Haonan An, Yuang Zhang, Jingjing Wang, Hangcheng Cao,
Xianhao Chen, Member, IEEE and Yuguang Fang, Fellow, IEEE
Abstract—Surrounding perception is quintessential for safe driving of connected and autonomous vehicles (CAVs), where the
Bird's Eye View (BEV) has been employed to accurately capture spatial relationships among vehicles. However, severe inherent limitations of
BEV, such as blind spots, have been identified. Collaborative perception has emerged as an effective solution for overcoming these
limitations through data fusion from multiple views of surrounding vehicles. While most existing collaborative perception strategies
adopt a fully connected graph predicated on fairness in transmissions, they often neglect the varying importance of individual vehicles
due to channel variations and perception redundancy. To address these challenges, we propose a novel Priority-Aware Collaborative
Perception (PACP) framework to employ a BEV-match mechanism to determine the priority levels based on the correlation between
nearby CAVs and the ego vehicle for perception. By leveraging submodular optimization, we find near-optimal transmission rates, link
connectivity, and compression metrics. Moreover, we deploy a deep learning-based adaptive autoencoder to modulate the image
reconstruction quality under dynamic channel conditions. Finally, we conduct extensive studies and demonstrate that our scheme
significantly outperforms the state-of-the-art schemes by 8.27% and 13.60%, respectively, in terms of utility and precision of the
Intersection over Union.
Index Terms—Connected and autonomous vehicle (CAV), collaborative perception, priority-aware collaborative perception (PACP),
data fusion, submodular optimization, adaptive compression.
1 INTRODUCTION
1.1 Background
Recent advances in positioning and perception have emerged as pivotal components in numerous cutting-edge applications, most notably in autonomous driving [1]–
[5]. These systems heavily rely on precise positioning and
acute perception capabilities to safely and adeptly navigate
complex road environments. Many solutions exist for po-
sitioning and perception in CAVs, including inertial nav-
igation systems, high-precision GPS, cameras, and LiDAR
[6]. However, using any of them in isolation may not be enough to achieve the desired level of perception quality for safe
Z. Fang, S. Hu, H. An, H. Cao and Y. Fang are with the Department of
Computer Science, City University of Hong Kong, Hong Kong. E-mail:
{zhefang4-c, senkang.forest, haonanan2-c}@my.cityu.edu.hk, {hangccao,
my.fang}@cityu.edu.hk.
Y. Zhang is with the Department of Civil and Environmental Engineering,
University of Washington, Seattle, WA, USA. E-mail: yuangz19@uw.edu.
J. Wang is with the School of Cyber Science and Technology, Beihang
University, China, and also with Hangzhou Innovation Institute, Beihang
University, Hangzhou 310051, China. Email: drwangjj@buaa.edu.cn.
X. Chen is with the Department of Electrical and Electronic Engineering,
the University of Hong Kong, Hong Kong. E-mail: xchen@eee.hku.hk
(Corresponding author).
This work was supported in part by the Hong Kong SAR Government under
the Global STEM Professorship and Research Talent Hub, the Hong Kong
Jockey Club under the Hong Kong JC STEM Lab of Smart City (Ref.: 2023-
0108), and the Hong Kong Innovation and Technology Commission under
InnoHK Project CIMDA. The work of Jingjing Wang was partly supported by
the National Natural Science Foundation of China under Grant No. 62222101,
Beijing Natural Science Foundation under Grant No. L232043 and No.
L222039, and the Fundamental Research Funds for the Central Universities.
The work of X. Chen was supported in part by HKU-SCF FinTech Academy
R&D Funding.
driving. In contrast, the Bird’s Eye View (BEV) stands out as
a more holistic approach. By integrating data from multiple
sensors and cameras placed around a vehicle, BEV offers
a comprehensive, potentially 360-degree view of a vehicle’s
surroundings, offering a more contextually rich understand-
ing of its environment [7]. However, most BEV-aided per-
ception designs have predominantly concentrated on single-
vehicle systems. Such an approach may not be enough
in high-density traffic scenarios, where unobservable blind
spots caused by road obstacles or other vehicles remain a
significant design challenge. Therefore, collaborative per-
ception has become a promising candidate for autonomous
driving. To mitigate the limitations of the single-vehicle
systems, we can leverage multiple surrounding CAVs to
obtain a more accurate BEV prediction via multi-sensor
fusion [8].
To further reduce the risks of blind spots in BEV
prediction, collaborative perception is adapted to enable
multiple vehicles to share more complementary surround-
ing information with each other through vehicle-to-vehicle
(V2V) communications [9]. This framework intrinsically
surmounts several inherent constraints tied to single-agent
perception, including occlusion and long-range detection
limitations. Similar designs have been observed in a variety
of practical scenarios, including communication-assisted au-
tonomous driving within the realm of vehicle-to-everything
[10], multiple aerial vehicles for accurate perception [11],
and multiple underwater vehicles deployed in search and
tracking operations [12]–[14].
In this emerging field of autonomous driving, the cur-
rent predominant challenge is how to make a trade-off
between perception accuracy and communication resource
allocation. Given the voluminous perception outputs (such
as point clouds and consecutive RGB image sequences), the
data transmission for CAVs demands substantial communi-
cation bandwidth. Such requirements often run into capac-
ity bottleneck. As per the KITTI dataset [15], a single frame
from 3-D Velodyne laser scanners encompasses approxi-
mately 100,000 data points, where the smallest recorded
scenario has 114 frames, aggregating to an excess of 10
million data points. Thus, broadcasting such extensive data
via V2V communications amongst a vast array of CAVs
becomes a daunting task. Therefore, it is untenable to solely
emphasize perception efficiency enhancement without considering the overhead on V2V communications. Hence,
some existing studies propose various communication-
efficient collaboration frameworks, such as the edge-aided
perception framework [16] and the transmission scheme for
compressed deep feature map [17]. It is observed that all
these approaches can be viewed as fairness-based schemes,
i.e., each vehicle within a certain range should have a fair
chance to convey its perception results to the ego vehicle.
Among all transmission strategies used for collaborative
perception, the fairness-based scheme is the most popular
one owing to its low computational complexity.
Despite the low computational complexity, several major
design challenges still exist with the state-of-the-art fairness-
based schemes, entailing the adaptation of these perception
approaches in real-world scenarios:
1) Fairness-based schemes may lead to overlapping
data transmissions from multiple vehicles, causing
unnecessary duplications and bandwidth wastage.
2) Fairness-based schemes cannot inherently prioritize
perception results from closer vehicles, which may
be more critical than those from further vehicles.
3) Without differentiating the priority of data from dif-
ferent vehicles, fairness-based schemes may block
communication channels for more crucial informa-
tion.
To address the above challenges, we conceive a priority-
aware perception framework with the BEV-match mech-
anism, which acquires the correlation between nearby
CAVs and the ego CAV. Compared with the fairness-based
schemes, the proposed approach pays attention to the bal-
ance between the correlation and sufficient extra informa-
tion. Therefore, this method not only optimizes perception
efficiency by preventing repetitive data, but also enhances
the robustness of perception under the limited bandwidth
of V2V communications.
In the context of collaborative perception, a significant concern revolves around achieving efficient perception over time-varying links. Earlier research on cooperative
perception has considered lossy channels [18] and link
latency [19]. However, most existing research leans heav-
ily on the assumption of an ideal communication channel
without packet loss, overlooking the effects of fluctuating
network capacity [9]. Moreover, establishing fusion links
among nearby CAVs is a pivotal aspect that needs attention.
Existing strategies are often based on basic proximity con-
structs, neglecting the dynamic characteristics of the wire-
less channel shared among CAVs. In contrast, graph-based
optimization techniques provide a more adaptive approach,
accounting for real-world factors like signal strength and
bandwidth availability to improve network throughput.
A significant challenge in adopting these graph-based
techniques is to manage the transmission load of V2V
communications for point cloud and camera data. With the
vast amount of data produced by CAVs, it is crucial to
compress this data. Spatial redundancy is usually addressed
by converting raw high-definition data into 2D matrix form.
To tackle temporal redundancy, video-based compression
techniques are applied. However, traditional compression
techniques, such as JPEG [20] and MPEG [21], are not always
ideal for time-varying channels. This highlights the rele-
vance of modulated autoencoders and data-driven learning-
based compression methods that outperform traditional
methods. These techniques excel at encoding important
features while discarding less relevant ones. Moreover, the
adoption of fine-tuning strategies improves the quality of
reconstructed data, whereas classical techniques often face
feasibility issues.
1.2 State-of-the-Art
In this subsection, we review the related literature of collab-
orative perception for autonomous driving with an empha-
sis on their perceptual algorithms, network optimization,
and redundant information reduction.
V2V collaborative perception: V2V collaborative per-
ception combines the sensed data from different CAVs
through fusion networks, thereby expanding the perception
range of each CAV and mitigating troubling design prob-
lems like blind spots. For instance, Chen et al. [22] proposed
the early fusion scheme, which fuses raw data from different
CAVs, Wang et al. [17] employed intermediate fusion, fusing
intermediate features from various CAVs, and Rawashdeh
et al. [23] utilized late fusion, combining detection outputs
from different CAVs to accomplish collaborative percep-
tion tasks. Although these methods show promising re-
sults under ideal conditions, in real-world environments,
where the channel conditions are highly variable, directly
applying the same fusion methods often results in unsatis-
factory outcomes. Transmitting raw-level sensing data has
distinct advantages: it supports multiple downstream tasks,
enhances data fusion accuracy, and ensures future-proof
flexibility for evolving autonomous driving technologies.
Therefore, we can combine the benefits of early fusion and
intermediate fusion by utilizing adaptive compression for
raw data transmission.
Network optimization: High throughput can ensure
more efficient data transmissions among CAVs, thereby
potentially improving the IoU of cooperative perception
systems. Lyu et al. [24] proposed a fully distributed graph-
based throughput optimization framework by leveraging
submodular optimization. Nguyen et al. [25] designed a
cooperative technique, aiming to enhance data transmission
reliability and improve throughput by successively selecting
relay vehicles from the rear to follow the preceding vehicles.
Ma et al. [26] developed an efficient scheme for the through-
put optimization problem in the context of highly dynamic
user requests. However, the intricate relationship between
throughput maximization and IoU has not been thoroughly
investigated in the literature. This gap in the research moti-
TABLE 1: CONTRASTING OUR CONTRIBUTION TO THE LITERATURE
(Features compared across [32], [3], [2], [18], [11], [16], [17], [22], [23], [24], [25], [26], [33], and the proposed work: BEV evaluation, multi-agent selection, data compression, lossy communications, priority mechanism, coverage optimization, and throughput optimization.)
vates us to conduct more comprehensive studies on the role
of throughput optimization in V2V cooperative perception.
Vehicular data compression: For V2V collaborative per-
ception, participating vehicles compress their data before
transmitting it to the ego vehicle to reduce transmission
latency. However, existing collaborative frameworks often
employ very simple compressors, such as the naive encoder
consisting of only one convolutional layer used in V2VNet
[17]. Such compressors cannot meet the requirement of
transmission latency under 100 ms in practical collaborative
tasks [27]. Additionally, current views suggest that compres-
sors composed of neural networks outperform the compres-
sors based on traditional algorithms [28]. However, these
studies are typically focused on general data compression
tasks and lack research on adaptive compressors suitable
for practical scenarios in V2V collaborative perception.
Priority-Aware Perception Schemes: Priority-aware per-
ception schemes have been advanced significantly, yet they
have also encountered limitations in dynamic environments.
Liao et al.’s model uses an attention mechanism for tra-
jectory prediction, effectively assigning dynamic weights
but overlooking communication costs and information re-
dundancy [29]. Similarly, Wen et al.’s queue-based traffic
scheduling enhances throughput but struggles with real-
time data synchronization, which is critical for systems
like autonomous vehicles [30]. Additionally, studies on
edge-assisted visual simultaneous localization and mapping
(SLAM) introduce a task scheduler that improves mapping
precision but fails to account for the crucial role of channel
quality in perception accuracy and throughput [31].
Our framework mitigates these issues by introducing an
adaptive mechanism that adjusts priorities based on real-
time channel quality and analytics. This approach not only
curtails unnecessary data processing and communication
but also enhances raw-level sensing data fusion and sys-
tem responsiveness, thereby addressing the core challenges
identified in previous studies.
1.3 Our Contributions
To address the weakness of prior works and tackle the afore-
mentioned design challenges, we design our Priority-Aware
Collaborative Perception (PACP) framework for CAVs and
evaluate its performance on a CAV simulation platform
CARLA [34] with OPV2V dataset [35]. Experimental results
verify PACP’s superior performance, with notable improve-
ments in utility value and the average precision of Intersec-
tion over Union (AP@IoU) compared with existing methods.
To summarize, in this paper, we have made the following
major contributions:
• We introduce the first-ever implementation of a priority-aware collaborative perception framework that incorporates a novel BEV-match mechanism for autonomous driving. This mechanism uniquely balances communication overhead with enhanced perception accuracy, directly addressing the inefficiencies found in prior works.
• Our two-stage optimization framework is the first to apply submodular theory in this context, allowing the joint optimization of transmission rates, link connectivity, and compression ratio. This innovation is particularly adept at overcoming the challenges of data-intensive transmissions under dynamic and constrained channel capacities.
• We have integrated a deep learning-based adaptive autoencoder into PACP, supported by a new fine-tuning mechanism at roadside units (RSUs). The experimental evaluation reveals that this approach surpasses the state-of-the-art methods, especially in utility value and AP@IoU.
Our new contributions are better illustrated in Table 1.
2 FAIRNESS OR PRIORITY?
In this section, we show some practical difficulties with
fairness-based collaborative perception for CAVs and also
briefly demonstrate the superiority of adopting a priority-
aware perception framework.
2.1 Background of Two Schemes
Fairness-based scheme: This scheme aims to achieve fair-
ness in resource allocation among different CAVs. The Jain’s
fairness index [33] is used to measure fairness:
\[
J = \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n \cdot \sum_{i=1}^{n} x_i^2}, \tag{1}
\]
where $n$ is the total number of nodes and $x_i$ is the resource allocated to the $i$-th node. A perfect fairness index of 1 indicates equal resource allocation. Two common fairness schemes are subchannel-fairness (equal spectrum resources) and throughput-fairness (equal transmission rates).
Priority-aware scheme: Unlike fairness-based schemes,
this scheme assigns different priority levels to CAVs based
on the importance and quality of their data. The ego vehicle
gives higher priority to CAVs with better channel conditions
and more crucial perception data. Existing works have investigated several popular priority factors, such as link latency [36] and routing conditions [37]. This approach
mitigates the negative impact of "poisonous" CAVs with
Fig. 1: The bandwidth and throughput allocation by different schemes within the V2V network. (a) Subchannel-fairness scheme: each CAV receives 66.67 MHz of bandwidth, yielding throughputs of 2.02, 9.63, and 23.85 Mbps for CAVs 1-3. (b) Throughput-fairness scheme: bandwidth allocations of 192.0, 4.0, and 4.0 MHz yield throughputs of 2.87, 5.75, and 9.80 Mbps. (c) Priority-aware perception scheme.
Fig. 2: Camera data and different types of AP@IoU. (a) Camera perception by the ego CAV (Cameras 0-3). (b) AP@IoU under different schemes: priority-aware 0.685, subchannel-fairness 0.603, throughput-fairness 0.499.
poor channel conditions or less relevant data, thus enhanc-
ing the overall system efficiency. Compared to the fairness-
based scheme, the advantages of priority-aware perception
can be summarized as follows:
• Transmission Efficiency: By prioritizing CAVs with better data quality and channel conditions, the priority-aware scheme optimally allocates spectrum resources, ensuring efficient transmission.
• Improved Prediction: This scheme reduces the influence of poisonous data, leading to more accurate BEV predictions by focusing on high-quality perception inputs.
• Dynamic Adaptability: The priority-aware scheme dynamically adjusts to changing channel conditions and task requirements, maintaining robust performance in diverse environments.
2.2 An Illustrative Motivating Example
To understand the limitations of fairness-based schemes,
we compare their resource allocation and BEV prediction
against an ad hoc priority-aware scheme. Fig. 1 illustrates
the bandwidth and throughput allocation of three schemes
in a V2V network, with a total bandwidth of 200 MHz.
In the subchannel-fairness scheme (Fig. 1(a)), each CAV
receives an equal amount of spectrum resources, leading to
inefficient utilization, especially with poor channel condi-
tions. The throughput-fairness scheme (Fig. 1(b)) equalizes
transmission rates by allocating more resources to weaker
channels, but it can still result in suboptimal BEV predic-
tions. For instance, if a CAV has extremely poor channel
quality (e.g., CAV 1), even the allocated bandwidth may be
insufficient, reducing perception performance.
In contrast, the priority-aware perception scheme (Fig.
1(c)) dynamically adjusts the priority weights of each CAV
based on channel resources, perception accuracy, and cover-
age. For example, CAV 1, with the worst channel condition,
is assigned the lowest priority and its data is discarded.
This approach ensures that critical and high-quality data
from CAVs with the best channel conditions and/or prox-
imity to the ego vehicle are transmitted with minimal loss
and latency, significantly improving BEV predictions. Fig.
2(b) demonstrates the superior performance of the priority-
aware scheme, with the dynamic AP@IoU metric rising to
0.685. Thus, this ad hoc priority-aware scheme outperforms
fairness-based methods by selectively collaborating with the
most valuable CAVs, leading to enhanced BEV predictions.
However, how to design an effective priority-aware scheme
is critical, which motivates this research.
3 SYSTEM MODEL
In this section, we present a V2X-aided collaborative percep-
tion system with CAVs, including the system’s structure,
channel modeling, and the constraints of computational
capacity and energy. The key notations are listed in Table 2.
3.1 System Overview
We consider a V2X-aided collaborative perception system
with multiple CAVs and RSUs, which is shown in Fig. 3. In
our scenario, CAVs can be divided into two types. The first
type is the nearby CAVs (indexes 1-3), which monitor the
surrounding traffic with cameras and share their perception
results with other CAVs. The second type is the ego CAV
(index 0), which fuses the camera data from the nearby
CAVs with its own perception results. As shown in Fig. 3,
the ego CAV 0's view is blocked by the parked car, so its own cameras cannot observe the incoming pedestrian in the blind spot. Through the collaborative perception scheme, the ego CAV merges compressed data from CAVs 2 and 3, i.e., $s_{20} = s_{30} = 1$, which capture the existence of the pedestrian.
all CAVs together since the bandwidth and subchannel
resources are limited. Therefore, the ego CAV can determine
the importance of each nearby CAV by the priority-aware
mechanism, depending on CAVs’ positions and channel
states. For example, the ego CAV is disconnected from CAV 1 because it fails to provide enough environmental perception information, i.e., $s_{10} = 0$. For the sake of maximizing
Fig. 3: Overview of the V2X-aided collaborative perception system.
TABLE 2: Summary of Key Notations

Notation | Definition
$N$ | The total number of vehicles in the network
$K$ | The number of orthogonal sub-channels for the whole V2V network
$W$ | The total bandwidth of the V2V network
$C_{ij}$ | The channel capacity between the $i$-th transmitter and the $j$-th receiver
$\mathbf{D}$ | The matrix of transmission rates
$r_{ij}$ | The adaptive compression ratio
$E^t_{ij}$ | The energy bound for data transmission
$E^c_j$ | The energy consumption bound for computation
$E^T_j$ | The energy consumption threshold
$P_t$ | Transmission power of each CAV
$F_j$ | CPU capacity of CAV $j$
$\beta$ | Model complexity parameter depending on the architecture of the neural networks
$A_j$ | Local data generation rate per second at CAV $j$
$\mathcal{P}_{ij}$ | The priority weight between two CAVs
$\mathcal{I}(\mathcal{S}_i)$ | The total area covered by the union of perceptual regions of vehicles
$U_{\text{sum}}$ | The weighted utility function for the network
$U_r$ | The sub-utility function for perception quality
$U_p$ | The sub-utility function for the perceptual regions
$(\omega_1, \omega_2)$ | Weights for perception quality and region
$(r_{\min}, r_{\max})$ | Compression ratio range
AP@IoU, the ego CAV obtains the near-optimal solution in terms of the transmission rate $d_{ij}$ and the compression ratio $r_{ij}$ at each time slot. Additionally, we deploy several
RSUs to achieve a fine-tuning compression strategy, which
is detailed in Sec. 5.4.
3.2 Channel Modeling and System Constraints
Consider the V2V network architecture $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = (v_1, v_2, \ldots, v_N)$ denotes the CAVs and $\mathcal{E}$ is the set of links between them. As per 3GPP specifications for 5G [38], V2V networks adopt Cellular Vehicle-to-Everything (C-V2X) with Orthogonal Frequency Division Multiplexing (OFDM). The total bandwidth $W$ is split into $K$ orthogonal sub-channels. Each sub-channel capacity is
\[
C_{ij} = \frac{W}{K} \log_2\!\left( 1 + \frac{P_t h_{ij}}{N_0 \frac{W}{K}} \right),
\]
where $P_t$ represents the transmit power, $h_{ij}$ denotes the channel gain from the $i$-th transmitter to the $j$-th receiver, and $N_0$ is the noise power spectral density.
Additionally, let $s_{ij} = 1$ indicate the presence of a directional link from CAV $i$ to ego CAV $j$. Such a link is denoted as $(i, j) \in \mathcal{S}$. The set $\mathcal{S}$ represents the collection of all established links in the network. When $s_{ij} = 1$, CAV $v_i$ is capable of sharing data with the ego CAV $v_j$. Conversely, if $s_{ij} = 0$, the link $(i, j)$ is in disconnected mode. However, the number of directed links potentially increases at a rate of $N^2$ with the number of CAVs, possibly exhausting the limited communication spectrum resources. Therefore, the upper bound on the number of connections is given by:
\[
\sum_{i=1, i \neq j}^{N} \sum_{j=1}^{N} s_{ij} \leq K. \tag{2}
\]
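For concreteness, the sketch below evaluates the per-subchannel capacity and the connection budget of Eq. (2); the channel gain and noise density are illustrative placeholders, not values from the paper:

import numpy as np

W = 200e6          # total bandwidth (Hz)
K = 4              # number of orthogonal sub-channels
P_t = 8e-3         # transmit power (W)
N0 = 1e-17         # noise power spectral density (W/Hz), placeholder

def subchannel_capacity(h_ij):
    """C_ij = (W/K) * log2(1 + P_t * h_ij / (N0 * W/K))."""
    return (W / K) * np.log2(1.0 + P_t * h_ij / (N0 * W / K))

S = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])          # s_ij link indicators
assert S.sum() <= K, "Eq. (2): total links must not exceed K subchannels"
print(subchannel_capacity(1e-9) / 1e6, "Mbps")  # placeholder gain -> ~1.1 Mbps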
Let $\mathbf{D} = [d_{ij}]_{N \times N}$ be the matrix of transmission rates, where each element is non-negative for $\forall (i, j) \in \mathcal{E}$. Each element $d_{ij}$ represents the data rate without compression from vehicle $v_i$ to vehicle $v_j$, which is then processed by $v_j$. It is noteworthy that $d_{ij}$ satisfies:
\[
r_{ij} d_{ij} \leq \min(C_{ij}, \, r_{ij} A_i), \tag{3}
\]
where $r_{ij} \in (0, 1]$ denotes the adaptive compression ratio, obtained by the compression algorithm outlined in Sec. 5.4. $r_{ij} d_{ij}$ represents the actual transmission rate after compression. $A_i$ signifies the amount of local perception data at $v_i$ per second, i.e., the perception data generation rate at the location of $v_i$. This constraint implies that the actual transmission rate must be limited either by the achievable data rate or by the locally compressed data present at vehicle $v_i$. Furthermore, an inadequate compression ratio diminishes the accuracy of perception data, while an excessively high compression ratio results in suboptimal throughput. Consequently, the constraint on the compression ratio is defined as:
\[
\mathbf{1} \, r_{j,\min} \preceq \mathbf{R}_j \preceq \mathbf{1} \, r_{j,\max}, \tag{4}
\]
where $\mathbf{R}_j = [r_{1j}, r_{2j}, \ldots, r_{Nj}]$ and $\mathbf{R} = [\mathbf{R}_1, \mathbf{R}_2, \ldots, \mathbf{R}_N]$. Given the surrounding data obtained through collaborative perception, perception data from closer vehicles are more important for perceptual detection and have a higher level of accuracy. Therefore, we assume that the adaptive compression ratio for the link $(i, j)$ yields:
\[
r_{ij} \geq e^{-L_{ij}} \eta, \tag{5}
\]
where $L_{ij}$ denotes the normalized distance between $v_i$ and $v_j$, and $\eta \in (0, 1]$. It is noted that we use an exponential relationship in terms of the normalized distance because the compression ratio for a remote area should decrease rapidly, reducing communication overhead. Moreover, the link establishment and data transmission rate should satisfy the following bounds on energy consumption:
\[
E^t_{ij} = \tau^t_j P_t s_{ij}, \tag{6}
\]
where $\tau^t_j$ denotes the allocated time span, and $P_t$ signifies the transmission power. We define $F_j$ as the computational capability of vehicle $v_j$. The data processed by $v_j$, which includes its local data $A_j$ and the data received from neighboring nodes, should satisfy the following constraint:
\[
A_j + \sum_{i=1, i \neq j}^{N} r_{ij} s_{ij} d_{ij} \leq F_j / \beta, \tag{7}
\]
where $F_j / \beta$ represents the aggregate size of data processed per second. Additionally, $\beta$ is a tunable parameter depending on the architecture of the neural networks employed in these contexts, like the self-supervised autoencoder. The energy consumption for computation by $v_j$ can be determined as follows:
\[
E^c_j = \left( A_j + \sum_{i=1, i \neq j}^{N} r_{ij} s_{ij} d_{ij} \right) \epsilon_j \tau^c_j, \tag{8}
\]
where $\epsilon_j$ denotes the energy cost per unit of input data processed by $v_j$'s processing unit, and $\tau^c_j$ represents the duration allocated for data processing. By imposing constraints on the overall energy consumption, a suitable trade-off between computing and communication can be made, facilitating optimal operation and extending the operational longevity of CAVs. Intuitively, the cumulative energy consumption in our CAV system must satisfy the following constraint:
\[
\sum_{i=1, i \neq j}^{N} \left( E^t_{ij} + E^c_{ij} \right) \leq E^T_j, \quad (j = 1, 2, \cdots, N), \tag{9}
\]
where $E^T_j$ symbolizes the energy consumption threshold for the $j$-th CAV group, including its nearby CAVs.
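To make these constraints operational, the following NumPy sketch checks whether a candidate operating point satisfies (3)-(5), (7), and (9) for one ego receiver $j$; all numerical device parameters are illustrative assumptions rather than values from the paper:

import numpy as np

def feasible_for_receiver(j, r, d, s, C, A, L, eta, r_min, r_max,
                          F_j, beta, P_t, tau_t, tau_c, eps_j, E_T):
    """Sketch of a joint feasibility check of constraints (3)-(5), (7), (9)
    for ego receiver j. Diagonal entries of r, d, s are assumed zero."""
    ok  = np.all(r[:, j] * d[:, j] <= np.minimum(C[:, j], r[:, j] * A))       # (3)
    ok &= np.all((s[:, j] == 0) |
                 ((r_min <= r[:, j]) & (r[:, j] <= r_max)))                   # (4)
    ok &= np.all((s[:, j] == 0) | (r[:, j] >= eta * np.exp(-L[:, j])))        # (5)
    load = A[j] + np.sum(r[:, j] * s[:, j] * d[:, j])  # data processed per second
    ok &= load <= F_j / beta                                                  # (7)
    energy = np.sum(tau_t * P_t * s[:, j]) + load * eps_j * tau_c             # (6)+(8)
    return bool(ok and energy <= E_T)                                         # (9)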
4 PRIORITY-AWARE COLLABORATIVE PERCEPTION ARCHITECTURE
While BEV offers a top-down view aiding CAVs in learning
relative positioning, not all data is of equal importance or
relevance [10]. Some CAV data can be unreliable due to
perception qualities, necessitating differential prioritization
in data fusion. This section proposes a priority-aware per-
ception scheme by BEV match mechanism.
4.1 Selection of Priority Weights
There exist many ways to define priority weights, such as
distance [39], channel state [40], and information redun-
dancy [41]. However, as CoBEVT is the backbone for RGB
data fusion [7], our priority weight definition is based on
the Intersection over Union (IoU) of BEV features between
adjacent CAVs. IoU is an effective metric in computer vi-
sion, especially in object detection, used to quantify the
overlap between two areas. In CoBEVT, local attention aids pixel-to-pixel correspondence during object detection fusion. The IoU of
the BEV features reveals the consistency in environmental
perception between CAVs. Factors like channel interfer-
ence and network congestion may cause inconsistencies.
As consistency is crucial in fusion data, inconsistencies can
lead to data misrepresentations. Hence, we design a BEV-
match mechanism by relying on IoU analysis in the next
subsection, giving preference to CAVs with closely aligned
perceptions to the ego CAV.
4.2 Procedure of Obtaining Priority Weights
The procedure of calculating priority weights can be divided
into three steps as follows:
(1) Camera perception: As shown in Fig. 4(a), each nearby CAV captures raw camera data $\Gamma$ using its four cameras. This perception data $\Gamma$ is then transmitted to the ego CAV by communication units through the wireless channel. Let $\Gamma'_t = \mathcal{F}(\Gamma)$ denote the perception data received by the ego CAV, where $\mathcal{F}$ represents the function of data transmission over the network. In reality, different data processing strategies, such as compression and transmission latency, result in different impacts on $\Gamma'_t$.
(2) Encoder & Decoder: Fig. 4(b) illustrates that, upon reception, the ego CAV uses a SinBEVT-based neural network to process the single vehicle's RGB data [7]. This data is transformed through an encoding-decoding process to extract BEV features. The BEV feature transformation is denoted by $\text{BEV} = \mathcal{G}(\Gamma'_t)$, where $\mathcal{G}$ captures SinBEVT's processing essence. In Fig. 4(c), the BEV feature depicts the traffic scenario from a single vehicle's perspective, with the green dashed box marking the range of perceived moving vehicles, and the red box indicating vehicles surrounding the ego CAV. In fact, SinBEVT and CoBEVT achieve real-time performance of over 70 fps with five CAVs [7].
(3) Priority weight calculation: As shown in Fig. 4(d), we take the ego CAV and CAV 1 as an example. Along with their corresponding BEV perceptions $\text{BEV}_0$ and $\text{BEV}_1$, let the coordinates and orientation angles of the ego CAV and CAV 1 be represented as $(x_0, y_0, \theta_0)$ and $(x_1, y_1, \theta_1)$, respectively. First, we derive the translational displacement between the ego CAV and CAV 1: $\Delta x = x_0 - x_1$, $\Delta y = y_0 - y_1$, and $\Delta\theta = \theta_0 - \theta_1$. Accordingly, the translation matrix $T_{10}$ and the rotation matrix $\Theta_{10}$¹ are articulated as:
\[
T_{10} = \begin{bmatrix} 1 & 0 & \Delta x \\ 0 & 1 & \Delta y \\ 0 & 0 & 1 \end{bmatrix}, \quad
\Theta_{10} = \begin{bmatrix} \cos(\Delta\theta) & -\sin(\Delta\theta) & 0 \\ \sin(\Delta\theta) & \cos(\Delta\theta) & 0 \\ 0 & 0 & 1 \end{bmatrix}, \tag{10}
\]
where $\hat{T}_{10} = T_{10} \times \Theta_{10}$ remaps coordinates from $\text{BEV}_1$ to $\text{BEV}_0$. Let $\hat{\Gamma}$ be a point of the BEV feature of CAV 1. The transformation of a point $\hat{\Gamma}$ from the BEV perception $\text{BEV}_1$ to $\text{BEV}_0$ is denoted by $\hat{\Gamma}' = \hat{T}_{10} \times \hat{\Gamma}$, where $\hat{\Gamma}'$ represents the transformed coordinate in $\text{BEV}_0$ and the point $\hat{\Gamma}$ is expressed in homogeneous coordinates, i.e., $\hat{\Gamma} = [x, y, 1]^T$.
Inspired by IoU, the metric for perceptual quality depends on the intersection of the ground truth and other predicted results. As we are mainly concerned with the impact of an unstable channel on perceptual quality, we can assume that the ego vehicle (CAV 0 in Fig. 5) can obtain highly accurate locations of nearby objects, which serve as the ground truth. For each nearby CAV within communication range, CAV 0 calculates priority weights using only overlapping perceptual objects, i.e., we only calculate priority weights based on the BEV features (boxes) of objects $\pi_0$, $\pi_1$, $\pi_2$, and $\pi_3$. Specifically, the priority weight $\mathcal{P}_{10}$ is formulated as follows:
\[
\mathcal{P}_{10} = \frac{\left\| \left[ \text{BEV}_0 \cap \text{BEV}'_1 \right]_{\pi} \right\|_2}{\left| \left[ \text{BEV}_0 \right]_{\pi} \right|}, \tag{11}
\]

1. We assume that the rotation matrix provided is based on the counter-clockwise rotation convention.
Fig. 4: Procedure for priority weight calculation. Fig. 4(a): CAVs observe surroundings with 4 cameras; CAVs 1-2 relay RGB data to CAV 0. Figs. 4(b)-(c): BEV feature generation in CAV 0. Fig. 4(d): BEV-match mechanism determines priority weights.
Fig. 5: An example of determining priority weight $\mathcal{P}_{10}$: overlapping perceptual objects $\pi_0$-$\pi_3$ within the communication ranges of CAV 0 and CAV 1.
where $\text{BEV}_0$ denotes the ego CAV's BEV features and $\text{BEV}'_1 = \hat{T}_{10} \times \text{BEV}_1$ signifies the transformed BEV, the operation $[\cdot]_\pi$ zeroes out pixels outside the overlapping perceptual objects of the fused BEV's non-zero region, $\|\cdot\|_2$ measures the square root of the summed squared pixel values, and $|\cdot|$ sums all pixel values in an image; the numerator in Eq. (11) quantifies the perception agreement between data from CAVs 1 and 0, with the denominator providing normalization. The term $\mathcal{P}$ in fact captures two characteristics between the view of the ego CAV and the transformed view of the assisting CAV for data fusion. If the two views are very close, the intersection will be large, the numerator in Eq. (11) will be high, and thus $\mathcal{P}_{10}$ will be high, which implies that the view from CAV 1 is accurate enough to enhance the perception quality and can be assigned high priority. If the view from CAV 1 is too dissimilar, the intersection will be close to empty, implying that the numerator in Eq. (11) will be close to 0, meaning that the view from CAV 1 may be misleading and should be assigned lower priority.
Moreover, our system incorporates a gate mechanism
that effectively filters out data from CAVs whose priority
weights fall below a certain threshold. This gating pro-
cess prevents low-quality data from influencing the overall
perception quality, thereby maintaining the integrity and
accuracy of the collaborative perception system despite the
inherent instabilities in vehicular network conditions.
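To make the BEV-match procedure concrete, a minimal NumPy sketch follows. It assumes BEV features are rasterized 2D arrays on a shared grid, models the feature intersection in Eq. (11) with an elementwise minimum, and folds in the gate threshold; it illustrates Eqs. (10)-(11) rather than reproducing the authors' CoBEVT-based implementation:

import numpy as np

def relative_transform(pose_ego, pose_cav):
    """T_hat = T x Theta from Eq. (10); poses are (x, y, theta)."""
    dx, dy, dth = np.array(pose_ego) - np.array(pose_cav)
    T  = np.array([[1, 0, dx], [0, 1, dy], [0, 0, 1]], dtype=float)
    Th = np.array([[np.cos(dth), -np.sin(dth), 0],
                   [np.sin(dth),  np.cos(dth), 0],
                   [0, 0, 1]], dtype=float)
    return T @ Th

def priority_weight(bev_ego, bev_cav_warped, mask_pi, gate=0.1):
    """Eq. (11): weight from the overlap of masked BEV features.

    mask_pi zeroes out pixels outside overlapping perceptual objects;
    the gate value is a hypothetical threshold discarding weak contributors.
    Intersection is approximated by an elementwise minimum (an assumption)."""
    inter = np.minimum(bev_ego, bev_cav_warped) * mask_pi
    p = np.sqrt(np.sum(inter ** 2)) / np.sum(bev_ego * mask_pi)
    return p if p >= gate else 0.0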
4.3 Utility Maximization in Collaborative Perception
The performance of collaborative perception relies on the
quality of the V2V communication network. Hence, our aim
is to maximize the amount of perception data transmitted
under the constraints of communication resources, such as
power and bandwidth. However, given the varying contri-
butions of nearby CAVs, it is unwise to directly optimize the
total throughput of the V2V networks. Instead, we employ
the aforementioned priority weight to adjust the resources
allocated to each CAV. Moreover, one group of CAVs might
have overlapping perceptual regions. To handle this, we
leverage the union-style perception region set $G_i$, defined as $G_i = \bigcup_{j \in J_i} \mathcal{G}_j$, where $\mathcal{G}_j$ represents the perception region of the $j$-th CAV and $J_i = \{ j \mid s_{ji} = 1 \}$. Here, $\mathcal{I}(\mathcal{S}_i)$ represents the total area covered by the union of the perceptual regions of vehicles in set $J_i$, which is mathematically expressed as:
\[
\mathcal{I}(\mathcal{S}_i) = \mathcal{A}\!\left( \bigcup_{j \in J_i} \mathcal{G}_j \right), \tag{12}
\]
where $\mathcal{A}(X)$ denotes the area of a region $X$. The use of the union of regions inherently introduces a submodular property into the coverage function. Submodularity² in this context means that adding an additional CAV to a group of vehicles with significantly overlapping regions provides diminishing marginal gains in covered area. This property naturally discourages the system from including excessive overlaps in the set of actively transmitting vehicles. Specifically, we exploit this submodular property by formulating the sub-utility function for perception quality as $U_r = \sum_{i} \sum_{j=1, j \neq i} \mathcal{P}_{ji} s_{ji} d_{ji}$, which reflects the accuracy and robustness of surrounding perception. Concurrently, the sub-utility function for the perceptual regions of CAVs is defined as $U_p = \sum_i \mathcal{I}(\mathcal{S}_i)$, which represents the amount of information collected by data fusion. To maximize both utilities at the same time, the weighted utility function is formulated as:
\[
U_{\text{sum}}(\mathcal{S}, \mathbf{D}) = \omega_1 U_r + \omega_2 U_p = \omega_1 \underbrace{\sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \mathcal{P}_{ji} s_{ji} d_{ji}}_{\text{Perception quality}} + \omega_2 \underbrace{\sum_{i=1}^{N} \mathcal{I}(\mathcal{S}_i)}_{\text{Perceptual region}}, \tag{13}
\]
where $\omega_1$ and $\omega_2$ denote the weights of perception quality and perceptual region, respectively. The optimization variable $\mathbf{D} = [d_{ij}]_{N \times N}$ is the matrix of data transmission rates. By combining the constraints and the objective function in Eq. (13), we formulate the utility maximization problem as:
\[
\mathbf{P}: \; \max_{\mathbf{R}, \mathcal{S}, \mathbf{D}} \; U_{\text{sum}}(\mathcal{S}, \mathbf{D}) \quad \text{s.t.} \;\; (2), (3), (4), (5), (7), (9), \tag{14}
\]

2. The definition of submodularity can be found in Definition 3.
where constraint (2) is the upper bound on the number of subchannels; (3) is the constraint on the transmission rate; (4) and (5) are the constraints on the compression ratio; and (7) and (9) are the upper bounds on the computing capacity and energy consumption, respectively. In Sec. 5, we prove that the utility maximization problem can be decomposed into two problems, i.e., a nonlinear programming problem and a submodular optimization problem that can be solved by a greedy algorithm with an approximation guarantee.
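For reference, a compact sketch of evaluating $U_{\text{sum}}$ in Eq. (13) is given below; perceptual regions are modeled as axis-aligned boxes whose union area is computed on a rasterized grid, an illustrative stand-in for the area operator $\mathcal{A}(\cdot)$, and the weights are placeholders:

import numpy as np

def union_area(regions, res=0.5, extent=200.0):
    """Rasterized area of the union of axis-aligned boxes (x0, y0, x1, y1)."""
    n = int(extent / res)
    grid = np.zeros((n, n), dtype=bool)
    for (x0, y0, x1, y1) in regions:
        grid[int(x0 / res):int(x1 / res), int(y0 / res):int(y1 / res)] = True
    return grid.sum() * res ** 2

def utility(P, S, D, regions, w1=1.0, w2=1.0):
    """U_sum = w1 * sum P_ji s_ji d_ji + w2 * sum I(S_i), cf. Eq. (13).

    P, S, D: N x N arrays of priority weights, link indicators, data rates;
    regions[j]: perception box of CAV j; w1, w2 are placeholder weights."""
    U_r = np.sum(P * S * D)
    U_p = sum(union_area([regions[j] for j in np.nonzero(S[:, i])[0]])
              for i in range(S.shape[1]))
    return w1 * U_r + w2 * U_p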
5 TWO-STAGE OPTIMIZATION FRAMEWORK FOR UTILITY MAXIMIZATION
In this section, we first show that the original problem $\mathbf{P}$ is NP-hard, making it difficult to find its optimal solution within latency-sensitive systems. To find a suboptimal solution, we conceive a two-stage optimization framework, which decomposes $\mathbf{P}$ into two distinct problems: a nonlinear programming (NLP) problem $\mathbf{P}_1$ and a submodular optimization problem $\mathbf{P}_2$. We ascertain the optimal solution for $\mathbf{P}_1$ in Sec. 5.1, and deduce a $(1 - e^{-1})$-approximation of the optimal value for $\mathbf{P}_2$ using an iterative algorithm in Sec. 5.3.
5.1 Nonlinear Programming Problem Analysis
Proposition 1: For fixed $\mathbf{D}^{(n)}$, problem $\mathbf{P}$ is reducible from the Weighted Maximum Coverage Problem (WMCP). When optimizing over matrices $\mathbf{D}$ and $\mathbf{R}$, $\mathbf{P}$ surpasses WMCP's complexity, establishing its NP-hardness.
Proof: Please refer to Appendix A.
According to Proposition 1, the problem $\mathbf{P}$ is NP-hard. Besides, it can be observed that as the number of cooperative vehicles $N$ increases, the state space of problem $\mathbf{P}$ exhibits double exponential growth. Specifically, with the state space represented as $2^{(N^2)} \times D_n \times R_n$, the dimensionality increases rapidly with respect to $N$. For example, let $D_n$ and $R_n$ denote the numbers of discrete levels for the continuous variables $\mathbf{D}$ and $\mathbf{R}$, and consider a scenario with 10 CAVs (refer to Sec. 6.2). As the upper bound of $\mathbf{D}$ is 40 Mbps, we assume a step size of 1 Mbps with a total of $D_n = 40$ levels. As for the compression ratio $\mathbf{R}$, there are $R_n = 20$ levels with an increment of 0.05 per level. Therefore, the state space of the problem $\mathbf{P}$ is approximately $1.014 \times 10^{33}$.
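This figure can be verified with a one-line check (assuming the $2^{(N^2)}$ link-configuration count with $N = 10$):

# 2^(10^2) link configurations x 40 rate levels x 20 compression levels
print(2 ** 100 * 40 * 20)            # = 1014120480182583521197362564300800
print(f"{2 ** 100 * 40 * 20:.3e}")   # ~1.014e+33, matching the text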
It is highly complicated to find the optimal solution even when the V2V network scale is no more than 10 vehicles. Since the control and decisions of CAVs are latency-sensitive, we have to conceive a real-time optimization solver to address the issue of finding an optimal solution for the NP-hard problem $\mathbf{P}$. Therefore, we decompose $\mathbf{P}$ into two subproblems by fixing one of the optimization variables. Denote the link establishment in the $n$-th round as $\mathcal{S}^{(n-1)}$. We then focus on adjusting the matrices for the compression ratio $\mathbf{R}$ and the data rate $\mathbf{D}$. Problem $\mathbf{P}_1$ is expressed as:
\[
\begin{aligned}
\mathbf{P}_1: \max_{\mathbf{R}, \mathbf{D}} \; & U_{\text{sum}}\!\left( \mathcal{S}^{(n-1)}, \mathbf{D} \right) \\
\text{s.t.} \; & (3), (4), (5), \\
(15\text{a}): \; & A_j + \sum_{i=1, i \neq j}^{N} r_{ij} s^{(n-1)}_{ij} d_{ij} \leq F_j / \beta, \\
(15\text{b}): \; & \sum_{i=1, i \neq j}^{N} \left( E^t_j s^{(n-1)}_{ij} + E^c_{ij} s^{(n-1)}_{ij} \right) \leq E^T_j.
\end{aligned} \tag{15}
\]
For $j = 1, 2, \ldots, N$, the sub-problem $\mathbf{P}_1$ is an NLP problem due to the nonlinear constraints given by (3), (15a), and (15b)³. While global optimization techniques like branch and bound or genetic algorithms can be used for non-convex problems, they are computationally demanding. Our approach is to linearize the problem. We define $\mathbf{U} = \mathbf{R} \odot \mathbf{D} = [u_{ij}]_{N \times N}$, where $\odot$ is the Hadamard product and $u_{ij} = r_{ij} d_{ij}$. This linearizes the product term in the constraints. Thus, $\mathbf{P}_1$ is equivalently reformulated as⁴:
\[
\begin{aligned}
\mathbf{P}_{1\text{-}1}: \max_{\mathbf{U}, \mathbf{D}} \; & \sum_{j=1}^{N} \sum_{i=1, i \neq j}^{N} s^{(n-1)}_{ij} \mathcal{P}_{ij} d_{ij} \\
\text{s.t.} \; (16\text{a}): \; & u_{ij} \leq \min\!\left( C_{ij}, \, u_{ij} d^{-1}_{ij} A_i \right), \\
(16\text{b}): \; & \max\!\left( r_{j,\min}, \, \eta e^{-L_{ij}} \right) \leq u_{ij} d^{-1}_{ij} \leq r_{j,\max}, \\
(16\text{c}): \; & \sum_{i=1, i \neq j}^{N} s^{(n-1)}_{ij} u_{ij} \leq \min\!\left( \gamma^{(n-1)}_j, \varphi_j \right),
\end{aligned} \tag{16}
\]
where we define $\gamma^{(n-1)}_j = \frac{E^T_j - \tau^t_j P_t \sum_{i=1, i \neq j}^{N} s^{(n-1)}_{ij}}{\epsilon_j \tau^c_j} - A_j$ and $\varphi_j = \frac{F_j}{\beta} - A_j$. Moreover, constraint (16c) is derived from both (15a) and (15b). Even though we add a bilinear equality constraint with $u_{ij}$, constraint (16b) remains nonlinear, so we cannot obtain the optimal result directly. Given that $\mathbf{P}_{1\text{-}1}$ attempts to optimize $\sum_{j=1}^{N} \sum_{i=1, i \neq j}^{N} s^{(n-1)}_{ij} d_{ij}$, we have:
\[
\frac{u_{ij}}{r_{j,\max}} \leq d_{ij} \leq \min\!\left( A_i, \, u_{ij} \left[ \max\!\left( r_{j,\min}, \frac{\eta}{e^{L_{ij}}} \right) \right]^{-1} \right), \tag{17}
\]

3. This is due to the product of the decision variables $r_{ij}$ and $d_{ij}$.
4. For simplicity, we omit $\mathcal{I}(\mathcal{S}_i)$, $\omega_1$, and $\omega_2$ in the formulation of the first sub-problem, since those terms are independent of $\mathbf{P}_{1\text{-}1}$.
which offers an upper bound for the optimal value of $\mathbf{P}_{1\text{-}1}$. To optimize $d_{ij}$, it is beneficial to focus on maximizing this limit. Therefore, we derive a relaxed problem as follows:
\[
\begin{aligned}
\mathbf{P}_{1\text{-}2}: \max_{\mathbf{U}} \; & \sum_{j=1}^{N} \sum_{i=1, i \neq j}^{N} \frac{s^{(n-1)}_{ij} \mathcal{P}_{ij} u_{ij}}{\max\!\left( r_{j,\min}, \frac{\eta}{e^{L_{ij}}} \right)} \\
\text{s.t.} \; & (16\text{a}) \text{ and } (16\text{c}), \\
& u_{ij} = 0 \;\; \text{if} \;\; s^{(n-1)}_{ij} = 0.
\end{aligned} \tag{18}
\]
$\mathbf{P}_{1\text{-}2}$ is a standard linear program solvable using techniques like the simplex or interior-point methods. If the optimal outcome of $\mathbf{P}_{1\text{-}2}$ is $u^{(n)}_{ij}$, the optimal solutions for the transmission rate and adaptive compression ratio are $d^{(n)}_{ij} = \min\!\left( A_i, \, u_{ij} \left[ \max\!\left( r_{j,\min}, \frac{\eta}{e^{L_{ij}}} \right) \right]^{-1} \right)$ and $r^{(n)}_{ij} = u^{(n)}_{ij} / d^{(n)}_{ij}$, respectively. Considering that the $d_{ij}$ values are at the edges of the feasible region, the optimal solution for $\mathbf{P}_{1\text{-}2}$ matches that of $\mathbf{P}_{1\text{-}1}$. It is noted that the condition $u_{ij} = 0$ if $s^{(n-1)}_{ij} = 0$ guarantees that $u_{ij}$ is explicitly set to zero whenever $s^{(n-1)}_{ij} = 0$.
5.2 Preliminaries for Submodular Optimization
Prior to delving into the specific details of the other subproblem $\mathbf{P}_2$, we briefly review the definition and primary characteristics of submodularity as presented in [42].

Definition 1: (Set Function Derivative) Given a set function $f: 2^V \to \mathbb{R}$, for a subset $S$ of $V$ and an element $e$ in $V$, the discrete derivative of $f$ at $S$ with respect to $e$ is denoted by $\Delta_f(e \mid S)$ and defined as $\Delta_f(e \mid S) = f(S \cup \{e\}) - f(S)$. If the context makes the function $f$ evident, we omit the subscript, expressing it simply as $\Delta(e \mid S)$.

Definition 2: (Monotonicity) Given a function $f: 2^V \to \mathbb{R}$, $f$ is deemed monotone if, for all $A, B \subseteq V$ with $A \subseteq B$, the condition $f(A) \leq f(B)$ holds.

It should be underscored that the function $f$ exhibits monotonicity if and only if every discrete derivative is non-negative. Specifically, for each $A \subseteq V$ and any $e \in V$, the relation $\Delta(e \mid A) \geq 0$ is satisfied.

Definition 3: (Submodularity) Let $E$ denote a finite ground set. A set function $f: 2^E \to \mathbb{R}$ is said to be normalized, non-decreasing, and submodular if it satisfies the following properties:
1) $f(\emptyset) = 0$;
2) $f$ is monotone as per Definition 2;
3) For any $A, B \subseteq E$, $f(A) + f(B) \geq f(A \cup B) + f(A \cap B)$;
4) For any $A \subseteq B \subseteq E$ and an element $e \in E \setminus B$, $\Delta_f(e \mid A) \geq \Delta_f(e \mid B)$.
In the next subsection, we prove that the objective function possesses submodular properties. As more CAVs share their perception results, the ego CAV tends to add significant new information for view fusion. However, as the number of CAVs increases, each additional CAV provides less new information, i.e., diminishing marginal utility. This concept is crucial in CAV scenarios, ensuring that resources are not wasted on redundant sensors.
Algorithm 1: Greedy Algorithm for Submodular Function Maximization
Input: Adaptive compression ratio matrix $\mathbf{R}^{(n)}$, data transmission rate $\mathbf{D}^{(n)}$.
Output: The optimal link establishment matrix $\mathcal{S}$.
1: Initialization: $\mathcal{S} \leftarrow \emptyset$, $i \leftarrow 1$;
2: while $i \leq N$ do
3:   $s^*_{ij} = \arg\max_{s_{ij} \in \mathcal{S}^{n \times n} \setminus \mathcal{S}} U_{\text{sum}}(\mathcal{S} \cup \{s_{ij}\}, \mathbf{D}^{(n)})$;
4:   $\mathcal{S} \leftarrow \mathcal{S} \cup \{s^*_{ij}\}$;
5:   $i \leftarrow i + 1$;
6: end while
7: return $\mathcal{S}$.
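A direct Python rendering of Algorithm 1 is sketched below; the utility callback U_sum and the candidate link set are assumptions standing in for Eq. (13) and the feasibility checks in (19):

def greedy_link_selection(candidates, K, U_sum):
    """Greedy (1 - 1/e)-approximate maximizer for the submodular U_sum.

    candidates: iterable of (i, j) directed links; K: subchannel budget
    from Eq. (2); U_sum: callable mapping a set of links to its utility."""
    selected = set()
    remaining = set(candidates)
    while remaining and len(selected) < K:
        # Pick the link with the largest marginal gain (discrete derivative).
        best = max(remaining,
                   key=lambda e: U_sum(selected | {e}) - U_sum(selected))
        if U_sum(selected | {best}) - U_sum(selected) <= 0:
            break  # monotone case: no positive marginal gain remains
        selected.add(best)
        remaining.remove(best)
    return selected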
5.3 Submodular Analysis and Solutions
In this subsection, we first prove that the objective function of problem $\mathbf{P}_2$ is a submodular function. The objective function $U_{\text{sum}}(\mathcal{S}, \mathbf{D})$ represents the aggregated utility of the system, incorporating both perception quality and coverage. However, achieving an optimal solution for this objective function is challenging due to its NP-hard nature. As more CAVs share their perception results, the ego CAV gains significant new information for view fusion. However, as the number of CAVs increases, each additional CAV provides less new information, i.e., diminishing marginal utility. This concept is crucial in CAV scenarios to ensure resources are not wasted on redundant sensors. To efficiently solve the submodular function maximization problem, we adopt a greedy algorithm. Submodular functions exhibit the diminishing returns property, allowing a greedy algorithm to find a near-optimal solution. Specifically, we show that a greedy algorithm achieving at least $1 - e^{-1}$ of the optimal value can solve $\mathbf{P}_2$ efficiently.
Proposition 2: Given that $\mathbf{P}_2$ characterizes the link establishment problem and the data rate $\mathbf{D} = \mathbf{D}^{(n)}$ is a constant matrix, the objective function $U_{\text{sum}}(\mathcal{S}, \mathbf{D}^{(n)})$ defined in Eq. (13) exhibits submodularity if it satisfies all the properties outlined in Definition 3 within Sec. 5.2.
Proof: Please refer to Appendix C.
With the adaptive compression ratio matrix $\mathbf{R}^{(n)} = [r^{(n)}_{ij}]_{N \times N}$ and data transmission rate $\mathbf{D}^{(n)} = [d^{(n)}_{ij}]_{N \times N}$, we then formulate the link establishment problem over $\mathcal{S}$ for $\mathbf{P}_2$ as:
\[
\begin{aligned}
\mathbf{P}_2: \max_{\mathcal{S}} \; & \sum_{i=1}^{N} \left( \omega_1 \sum_{j=1, j \neq i}^{N} \mathcal{P}_{ji} s_{ji} d^{(n)}_{ij} + \omega_2 \mathcal{I}(\mathcal{S}_i) \right) \\
\text{s.t.} \; (19\text{a}): \; & \sum_{i=1, i \neq j}^{N} \chi^{(n)}_{ij} s_{ij} \leq E^T_j - \tau^c_j \epsilon_j A_j, \\
(19\text{b}): \; & \sum_{i=1, i \neq j}^{N} s_{ij} u^{(n)}_{ij} \leq \varphi_j \;\; \text{and} \;\; (2),
\end{aligned} \tag{19}
\]
where $\chi^{(n)}_{ij} = u^{(n)}_{ij} \epsilon_j \tau^c_j + \tau^t_j P_t$ can be obtained from the inequality constraint (9). Since the objective function $U_{\text{sum}}(\mathcal{S}, \mathbf{D})$ is a submodular function according to Proposition 2, $\mathbf{P}_2$ is a submodular function maximization problem, which can be solved by a greedy algorithm for near-optimal results.
Fig. 6: Overall architecture of adaptive compression: a nearby CAV encodes raw camera perception data and transmits it over the wireless channel; the ego CAV decodes the reconstructed perception data; an RSU performs roadside fine-tuning with historical data and a pre-trained model, while CSI feedback determines the compression ratio under the current channel gain.
Proposition 3: Given a submodular, non-decreasing set function $U_{\text{sum}}(\mathcal{S}, \mathbf{D}^{(n)})$ with $U_{\text{sum}}(\emptyset, \mathbf{D}^{(n)}) = 0$, the greedy algorithm obtains a set $\mathcal{S}^G$ satisfying:
\[
U_{\text{sum}}\!\left( \mathcal{S}^G, \mathbf{D}^{(n)} \right) \geq \left( 1 - e^{-1} \right) \max_{\mathcal{S}} U_{\text{sum}}\!\left( \mathcal{S}, \mathbf{D}^{(n)} \right). \tag{20}
\]
Proof: Please refer to Sec. II of [43].
According to Proposition 3, Algorithm 1 obtains a $(1 - e^{-1})$-approximation of the optimal value of $\mathbf{P}_2$. During the $n$-th round, we update the link establishment $\mathcal{S}^{(n)}$. If link $(i, j)$ reduces the throughput, it is removed: $\mathcal{S}^{(n)} \leftarrow \mathcal{S}^{(n)} \setminus \{s_{ij}\}$. Otherwise, it is added: $\mathcal{S}^{(n)} \leftarrow \mathcal{S}^{(n)} \cup \{s_{ij}\}$. For $\mathbf{P}_2$, we iteratively adjust links with Algorithm 1 until we find a solution meeting all constraints or hit the iteration limit.

This greedy approach not only ensures maximization of the objective function $U_{\text{sum}}(\mathcal{S}, \mathbf{D}^{(n)})$ but also significantly reduces computational complexity, making it highly advantageous for real-time applications. According to the cardinality constraint $K$ in (2), the time complexity of Algorithm 1 is only $\mathcal{O}(K)$. Therefore, such a greedy algorithm circumvents the complexities typically associated with optimization problems, avoiding exhaustive searches or iterative procedures that may not guarantee convergence to the optimal solution.
5.4 Deep Learning-Based Adaptive Compression
The compression modules commonly employed in V2V
collaborative networks often rely on predefined fixed com-
pression ratios, such as JPEG and JPEG2000 [7]. However,
these fixed ratios are inadequate to accommodate the de-
mands of the dynamic channel conditions discussed in Sec.
5. Moreover, it has been shown that the deep learning-based
method generally offers better rate–distortion (R-D) per-
formance compared to the standard compression methods
[44]. In this section, we propose an adaptive compression
method that comprises an adaptive R-D mechanism to refine
the compression ratio, R, to align with the requirements
of dynamic channel conditions. Then, we introduce a fine-
tuning strategy to reduce temporal redundancy in V2V
transmissions by exploiting the holistic RGB frames. We first
present the main procedure of our DL-based compression
scheme as follows.
The Deep Learning-Based Compression (DBC) scheme addresses these challenges by using trainable parameters learned from training datasets, thus offering adaptability to V2V dynamics. Under the DBC architecture, both the encoder and decoder utilize convolutional layers. The encoder transforms the input image $\mathbf{x}$ into a latent representation $\mathbf{z} = f(\mathbf{x}; \theta)$, with transformation parameters $\theta$ learned from training. The decoder uses a distinct parameter set $\xi$ for reconstruction: $\tilde{\mathbf{z}} = H(\mathbf{z}; \xi)$, and $\hat{\mathbf{x}}$ is the image reconstructed by the decoder. The training objective is to minimize:
\[
\arg\min_{\theta, \xi} \; R(\tilde{\mathbf{z}}; \theta) + \beta D(\mathbf{x}, \hat{\mathbf{x}}; \theta, \xi), \tag{21}
\]
where $R(\tilde{\mathbf{z}}; \theta) = \mathbb{E}\left[ -\log_2 p_{\tilde{\mathbf{z}}}(\tilde{\mathbf{z}}) \right]$ represents the rate function and $D(\mathbf{x}, \hat{\mathbf{x}}; \theta, \xi) = \mathbb{E}\left[ \| \mathbf{x} - \hat{\mathbf{x}} \|^2 \right]$ denotes the distortion function, where the constant $\beta$ manages the R-D trade-off. Moreover, we formulate the traditional fixed R-D problem as a multi-R-D problem for adaptability:
\[
\arg\min_{\theta, \xi} \; R(\hat{\mathbf{z}}; \theta) + \beta D(\mathbf{x}, \hat{\mathbf{x}}; \theta, \xi). \tag{22}
\]
It is noted that $\beta$ affects reliable decoding and image quality, emphasizing the need to adjust $\beta$ adaptively. In this context, we introduce a DBC mechanism to dynamically modify $\beta$ under dynamic channel conditions (as shown in Fig. 6, given the channel state information (CSI) feedback, we can obtain $\beta$ and the compression ratio according to Sec. 5.1). Therefore, the revised problem can be formulated as:
\[
\arg\min_{\theta, \xi} \; R(\hat{\mathbf{z}}; \theta, r) + G(r) D(\mathbf{x}, \hat{\mathbf{x}}; \theta, \xi, r), \tag{23}
\]
where the function $G$ is the R-D mapping function, which utilizes a lookup table to convert the compression ratio into the trade-off parameter $\beta$; hence, there is no explicit expression for $G$. Moreover, the adaptive compression network is based on a pre-trained model [28], using historical camera data as the training dataset to obtain the network parameters $\theta, \xi$ by solving the multi-R-D problem in (23). The detailed architecture and processes are described in Appendix D. We also evaluate the R-D performance of our proposed PACP framework using the Multi-scale Structural Similarity (MS-SSIM) and Peak Signal-to-Noise Ratio (PSNR) metrics in Appendix D.5.
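The objective in Eq. (23) maps naturally onto a training loss. The PyTorch sketch below is a much-simplified stand-in for the authors' autoencoder: the lookup table for $G$, the differentiable rate proxy, and the network shape are all illustrative assumptions:

import torch
import torch.nn as nn

# Hypothetical lookup table standing in for the R-D mapping function G(r).
G_TABLE = {0.3: 0.05, 0.5: 0.15, 0.7: 0.4, 0.95: 1.0}

class TinyAutoencoder(nn.Module):
    """Simplified convolutional encoder/decoder (placeholder for the DBC model)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 16, 5, stride=4, padding=2)   # z = f(x; theta)
        self.dec = nn.ConvTranspose2d(16, 3, 8, stride=4, padding=2)

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def rd_loss(x, model, r):
    """R(z; theta, r) + G(r) * D(x, x_hat), cf. Eq. (23)."""
    z, x_hat = model(x)
    rate = torch.log2(1.0 + z.abs()).mean()   # crude differentiable rate proxy
    dist = ((x - x_hat) ** 2).mean()          # MSE distortion
    return rate + G_TABLE[r] * dist

model = TinyAutoencoder()
loss = rd_loss(torch.rand(1, 3, 64, 64), model, r=0.7)
loss.backward()  # gradients flow to both encoder and decoder parameters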
As for computational cost, the encoder and decoder of the MAE require 0.155 MFLOPs/pixel and 0.241 MFLOPs/pixel⁵, respectively [28]. Considering Tesla FSD as the computational unit, when three CAVs share their perception results at a rate of 10 fps, the average latencies of the encoding and decoding processes are 40.26 ms and 20.89 ms, respectively.
6 SIMULATION RESULTS AN D DISCUSSIONS
In this section, we evaluate our schemes under various
communication settings, which consist of bandwidth, trans-
mission power, the number of CAVs, and the distribution of
vehicles. Subsequently, we compare raw data reconstruction performance with and without the application of a fine-tuned compression strategy. In the
final part of this section, we present the results of BEV along
with the associated IoU.
5. As a metric of computational cost, MFLOPs/pixel denotes the number of millions of floating-point operations performed per pixel of camera perception data.
Algorithm 2: Priority-Aware Collaborative Perception
Input: Multi-CAV data $\Gamma$, number of vehicles $N$, channel parameter constraints $K, C_{ij}$, device parameters $r_{j,\min}, r_{j,\max}, \eta, \tau^t_j, \tau^c_j, E^T_j$, etc.
Output: Priority weight $\mathcal{P}$, near-optimal compression ratio $\mathbf{R}$, link establishment $\mathcal{S}$, data rate $\mathbf{D}$, modulated autoencoder, and BEV prediction.
1: Initialize the priority weight $\mathcal{P}^{(0)}$ by equally allocating bandwidth and transmitting initial frames;
2: Initialize the link establishment decision $\mathcal{S}^{(0)}$;
3: for $j = 0$ to $N - 1$ do
4:   Sort the column of the link establishment decision matrix in descending order, get the indices of the largest $K$ capacities, and set the associated $s_{ij} = 1$;
5: end for
6: while convergence not achieved do
7:   Solve the linear programming problem $\mathbf{P}_{1\text{-}2}$;
8:   Solve the submodular problem $\mathbf{P}_2$ using Algorithm 1;
9:   Update the priority weight $\mathcal{P}$ based on the current bandwidth allocation and perception results;
10:  if changes in the priority weight $\mathcal{P}$ exceed the threshold $\epsilon_{\mathcal{P}}$ then
11:    Re-initialize the link establishment decision $\mathcal{S}^{(0)}$, then re-calculate the initial priority weight $\mathcal{P}$;
12:    Continue the optimization iteration for problem $\mathbf{P}$;
13:  end if
14: end while
15: Obtain the near-optimal solution for $\mathbf{R}, \mathcal{S}, \mathbf{D}$;
16: Determine the trade-off parameter $\beta$ based on the channel state;
17: Train encoders and decoders to compress and reconstruct raw camera data according to Eq. (23);
18: Predict the BEV feature using the reconstructed camera data.
6.1 Dataset and Baselines
Dataset: To validate our approach, we employ the CARLA
and OpenCOOD simulation platforms, exploiting the
OPV2V dataset [35]. This dataset encompasses 73 varied
scenes, a multitude of CAVs, 11,464 frames, and in excess of
232,000 annotated 3D vehicle bounding boxes. All of these
have been amassed using the CARLA simulator [34].
Baseline 1: the Fairness Transmission Scheme (FTS),
which is built on the principles laid out in [33]. The cor-
nerstone of this approach is the allocation of subchannels in
a manner consistent with the Jain index defined in Eq. (1).
Baseline 2: The core of this baseline is the Distributed
Multicast Data Dissemination Algorithm (DMDDA), as
outlined in [24], which seeks to optimize throughput in a
decentralized fashion.
Baseline 3: the No Fusion scheme, which implies the
usage of a single ego vehicle for gathering information about
its surroundings. It operates without integrating data from
the cameras of proximate CAVs.
To ensure a fair comparison, we maintain uniformity in
the transmission model and simulation parameters, aligning
them to those discussed in Sec. 3.2.
TABLE 3: Simulation Parameters

Parameter | Value
Number of vehicles ($N$) | 10
Local data per vehicle ($A_j$) | 40 Mbits
Number of subchannels ($K$) | 4
Computation complexity ($\beta$) | 100 cycles/bit
Transmission power ($P_t$) | 8 mW
CPU capacity ($F_j$) | 1 GHz - 3 GHz
Bandwidth ($W$) | 200 MHz
Power threshold ($E^T_j$) | 1 kW
Weight for perception quality ($\omega_1$) | $1 \times 10^{2}$
Weight for perceptual region ($\omega_2$) | $1 \times 10^{3}$
Parameter ($\eta$) | 1
Compression ratio range ($r_{\min}, r_{\max}$) | (0.3, 0.95)
Fig. 7: AP@IoU and utility value under different noise levels. (a) IoU vs. noise level. (b) Utility vs. noise level (0-8 dB): PACP improves the mean utility value by +78.79%, +167.00%, +309.79%, +535.84%, and +893.93% as the noise level increases.
6.2 Simulation Settings
Our simulations are based on the 3GPP standard [38].
Specifically, the