R-ACP: Real-Time Adaptive Collaborative Perception
Leveraging Robust Task-Oriented Communications
Zhengru Fang, Jingjing Wang, Yanan Ma, Yihang Tao,
Yiqin Deng, Xianhao Chen, and Yuguang Fang, Fellow, IEEE
Abstract—Collaborative perception enhances sensing in multi-
robot and vehicular networks by fusing information from mul-
tiple agents, improving perception accuracy and sensing range.
However, mobility and non-rigid sensor mounts introduce extrin-
sic calibration errors, necessitating online calibration, further
complicated by limited overlap in sensing regions. Moreover,
maintaining fresh information is crucial for timely and accurate
sensing. To address calibration errors and ensure timely and
accurate perception, we propose a robust task-oriented commu-
nication strategy to optimize online self-calibration and efficient
feature sharing for Real-time Adaptive Collaborative Perception
(R-ACP). Specifically, we first formulate an Age of Perceived
Targets (AoPT) minimization problem to capture data timeliness
of multi-view streaming. Then, in the calibration phase, we
introduce a channel-aware self-calibration technique based on re-
identification (Re-ID), which adaptively compresses key features
according to channel capacities, effectively addressing calibration
issues via spatial and temporal cross-camera correlations. In the
streaming phase, we tackle the trade-off between bandwidth and
inference accuracy by leveraging an Information Bottleneck (IB)-
based encoding method to adjust video compression rates based
on task relevance, thereby reducing communication overhead and
latency. Finally, we design a priority-aware network to filter cor-
rupted features to mitigate performance degradation from packet
corruption. Extensive studies demonstrate that our framework
outperforms five baselines, improving multiple object detection
accuracy (MODA) by 25.49% and reducing communication costs
by 51.36% under severely poor channel conditions. Code will be
made publicly available: github.com/fangzr/R-ACP.
Index Terms—Multi-camera networks, task-oriented commu-
nications, camera calibration, age of perceived targets (AoPT),
information bottleneck (IB).
Z. Fang, Y. Ma, Y. Tao, Y. Deng and Y. Fang are with the Department
of Computer Science, City University of Hong Kong, Hong Kong. E-
mail: {zhefang4-c, yananma8-c, yihang.tommy}@my.cityu.edu.hk, {yiqideng,
my.fang}@cityu.edu.hk.
J. Wang is with the School of Cyber Science and Technology, Beihang
University, China. Email: drwangjj@buaa.edu.cn.
X. Chen is with the Department of Electrical and Electronic Engineering,
the University of Hong Kong, Hong Kong. E-mail: xchen@eee.hku.hk.
This work was supported in part by the Hong Kong SAR Government under
the Global STEM Professorship and Research Talent Hub, the Hong Kong
Jockey Club under the Hong Kong JC STEM Lab of Smart City (Ref.: 2023-
0108), and the Hong Kong Innovation and Technology Commission under
InnoHK Project CIMDA. This work of J. Wang was partly supported by the
National Natural Science Foundation of China under Grant No. 62222101 and
No. U24A20213, partly supported by the Beijing Natural Science Foundation
under Grant No. L232043 and No. L222039, partly supported by the Natural
Science Foundation of Zhejiang Province under Grant No. LMS25F010007.
The work of Y. Deng was supported in part by the National Natural Science
Foundation of China under Grant No. 62301300. The work of X. Chen was
supported in part by the Research Grants Council of Hong Kong under Grant
27213824 and CRS HKU702/24, in part by HKU-SCF FinTech Academy
R&D Funding, and in part by HKU IDS Research Seed Fund under Grant
IDS-RSF2023-0012. (Corresponding author: Yiqin Deng)
(a) Unpredictable accidents can alter a UGV's camera extrinsic parameters.
(b) Incorrect extrinsic parameters cause errors in collaborative perception.
Fig. 1: Effect of unpredictable accidents involving UGVs on
camera extrinsic parameters and perception error rates.
I. INTRODUCTION
A. Background
COLLABORATIVE perception systems are increasingly
prevalent in fields such as IoT systems [1], [2], connected
and autonomous driving [3]–[5], unmanned aerial vehicles
[6]–[8], and sports analysis [9], [10]. They offer significant
advantages over single-agent systems by mitigating blind
spots, reducing occlusions, and providing comprehensive cov-
erage through multiple perspectives [11], which is especially
valuable in cluttered or crowded environments. However, these
benefits also introduce considerable challenges. The increased
number of cameras demands higher network bandwidth and
finer-grained synchronization. Moreover, synchronized data
transmission with high inference accuracy necessitates precise
calibration and efficient communication [12]. Therefore, bal-
ancing network resource management for real-time collabora-
tive perception becomes essential.
Camera calibration is the process of determining the in-
trinsic and extrinsic parameters of vision sensors to en-
sure accurate perception across different viewpoints in multi-
camera networks [13]. Intrinsic calibration deals with internal
characteristics like focal length and lens distortion, while
extrinsic calibration defines the physical position and orien-
tation (rotation R and translation t) of the camera relative
to a reference frame. This enables effective alignment across
cameras. Fig. 1 illustrates a scenario where multiple unmanned
ground vehicles (UGVs) equipped with vision sensors col-
laborate to track moving objects, reducing the impact of
obstacles and enhancing perception accuracy. In smart cities,
UGVs can effectively prevent elderly falls, provide real-time
abnormal behavior alerts (e.g., crime or terrorist attacks), and
assist in search and rescue missions in hazardous areas. As
shown in Fig. 1(a), accidents can disrupt extrinsic parameters
[R|t], resulting in an “unknown view”. In Fig. 1(b), these
errors impact collaborative perception, leading to inaccurate
Bird's Eye View (BEV) mapping when the parameters [R|t]
are incorrect. Thus, efficient extrinsic calibration is vital for
perception accuracy.
Traditional calibration methods—pattern-based [14],
manual measurements [15], and feature-based approaches
[16]—rely on predefined calibration objects and work well in
controlled settings. However, they are impractical for dynamic,
large-scale deployments due to the need for specific targets,
time-consuming processes, and inconsistent natural features.
Multi-camera calibration also faces bandwidth limitations and
network variability since collaboration requires data exchanges
among sensors. Kalman filter-based methods provide online
calibration without predefined objects [17] but rely on linear
assumptions about system dynamics. These assumptions
often fail in multi-UGV systems due to non-linear motions,
especially in rotations and velocity changes, making them less
effective in rapidly changing scenarios. Transmitting large
volumes of image data for collaborative calibration consumes
significant bandwidth, and network delays or packet loss
disrupt synchronization, complicating cross-camera feature
matching [18]–[20], particularly when adding or adjusting
cameras. To address these challenges, we propose an adaptive
calibration approach leveraging spatial and temporal cross-
camera correlations during deployment. By incorporating
re-identification (Re-ID) technology [9], our method achieves
higher key-point matching accuracy than traditional edge-
detection techniques. The Re-ID-based calibration effectively
handles non-linearities in UGV motion and varying camera
perspectives by using both global appearance and fine-
grained local features, enhancing robustness under changing
conditions. Additionally, adaptive feature quantization based
on channel capacity reduces communication overhead,
maintaining high-precision calibration without extra sensors
or specific calibration objects, making it suitable for dynamic
and large-scale deployments.
After meeting the accuracy requirements of vision sensing
through self-calibration, we need to address how to guarantee
the timely data transmission in multi-camera networks, which
is essential for monitoring dynamic environments. For exam-
ple, timely data in life critical signal monitoring can be life-
saving [21], and promptly detecting abnormal behaviors in
public safety can prevent crimes or ensure traffic safety [22].
Therefore, perception tasks like object detection rely on both
data accuracy and its timeliness, as stale information can result
in poor decision-making when immediate responses are re-
quired. The Age of Information (AoI) measures data timeliness
by tracking the time since the latest packet was received [23],
[24]. Traditional AoI assumes homogeneous data sources of
equal importance and consistent quality, simplifying timeliness
evaluation. However, this assumption does not hold for multi-camera
networks, where cameras have varying fields of view (FOVs)
and data quality due to different positions and environmental
factors. While He et al. [25] considered AoI in multi-camera
perception, they did not adequately model camera coverage
or account for overlapping fields of view, where variations
in sensing accuracy affect multi-view fusion performance. To
fill this gap, we propose a novel age-aware metric for multi-
camera networks that reflects both data timeliness and source
quality. This enhanced metric guides optimizations for sources
with higher priorities, ultimately improving overall perception
performance.
Limited network bandwidth and high redundancy in video
streaming increase transmission overhead [18]. Traditional
systems transmit vast amounts of raw data without con-
sidering task relevance, causing latency that degrades real-
time perception. Additionally, channel limitations and multi-
user interference can corrupt transmitted features, but existing
task-oriented communication seldom addresses transmission
robustness, often assuming ideal channels. Task-oriented com-
munication offers efficiency by focusing on task-relevant data
and ignoring redundancy [26], prioritizing compact repre-
sentations for tasks like object detection. However, without
robustness considerations, these methods remain vulnerable
to channel impairments. Traditional solutions like Automatic
Repeat reQuest (ARQ) protocols enhance reliability through
retransmissions but introduce significant overhead and latency,
unsuitable for real-time applications [27]. Therefore, there
is a need to develop robust task-oriented communication
methods that withstand channel impairments without extra
latency. The Information Bottleneck (IB) method [20] aligns
with this approach by encoding the most relevant features
while enhancing resilience to data corruption. By leveraging
robust task-oriented communication, we can optimize network
resources and improve multi-camera network performance, en-
suring efficient and effective real-time collaborative perception
even under bandwidth constraints and poor channel conditions.
B. State-of-the-Art
1) Multi-Camera Networks: Multi-camera networks en-
hance perception by providing comprehensive coverage and
reducing occlusions through multiple views from distributed
cameras. Yang et al. [28] introduced the edge-empowered co-
operative multi-camera sensing system for traffic surveillance,
leveraging edge computing and hierarchical re-identification to
minimize bandwidth usage while maintaining vehicle tracking
accuracy. Liu et al. [29] proposed a Siamese network-based
tracking algorithm that enhances robustness against occlusion
and background clutter in intelligent transportation systems.
For pedestrian detection, Qiu et al. [30] improved detec-
tion accuracy by using multi-view information fusion and
data augmentation to address occlusion challenges. Guo et
al. [31] explored wireless streaming optimization for 360-
degree virtual reality video, focusing on joint beamform-
ing and subcarrier allocation to reduce transmission power.
However, existing research has not fully leveraged spatial
and temporal correlations among multiple camera views to
optimize coverage. Fang et al. [20] developed a collaborative
perception framework to leverage correlations among frames
and perspectives through a prioritization mechanism, but the
priorities cannot be adjusted for dynamically perceived targets.
Additionally, the effect of data timeliness, particularly AoI, on
collaborative perception accuracy remains underexplored.
2) Visual Sensor Calibration: Calibration is essential for
ensuring accuracy in multi-camera networks. Traditional meth-
ods include pattern-based, manual measurement, and feature-
based techniques. Pattern-based methods, like using checker-
board patterns [14], are impractical in dynamic environments
due to the unavailability of specific targets. Manual mea-
surement based methods, requiring physical measurements of
camera positions [15], are time-consuming and unsuitable for
rapidly changing settings, such as connected and autonomous
driving. Feature-based methods match natural features across
overlapping views [16], but inconsistent features and limited
overlap reduce their reliability. In mobile robotic systems, fre-
quent recalibration is often needed due to unpredictable con-
ditions. However, real-time calibration transmission consumes
significant bandwidth, which can degrade perception accuracy.
Collaborative perception, requiring multi-view data, further
increases communication resource demands, complicating pre-
cise calibration, especially when environmental changes neces-
sitate camera adjustments. Motion-based techniques, like Su et
al. [32], estimate transformations through sensor motions, but
accuracy may be limited without collaborative vehicle assis-
tance. Yin et al. [33] introduced a targetless method, combin-
ing motion and feature-based approaches, improving accuracy
but often requiring feature alignment, which increases data
transmission. Thus, how to optimize communication protocols
is crucial for managing calibration overhead and maintaining
perception accuracy. Integrating both calibration techniques
and communication strategies is essential for achieving real-
time, multi-camera calibration in dynamic deployments.
3) Task-Oriented Communications: Recent advancements
in task-oriented communication have shifted the focus from
bit-level to semantic-level data transmission. Wang et al. de-
signed a semantic transmission framework for sharing sensing
data from the physical world to Metaverse [34]. The proposed
method can achieve the sensing performance without data re-
covery. Meng et al. [35], [36] proposed a cross-system design
framework for modeling robotic arms in Metaverse, integrating
Constraint Proximal Policy Optimization (C-PPO) to reduce
packet transmission rates while optimizing scheduling and pre-
diction. Kang et al. [37] explored semantic communication in
UAV image-sensing, designing an energy-efficient framework
with a personalized semantic encoder and optimal resource
allocation to address efficiency and personalization in 6G net-
works. Wei et al. [38] introduced a federated semantic learning
(FedSem) framework for collaborative training of semantic-
channel encoders, leveraging the information bottleneck theory
to enhance rate-distortion performance in semantic knowledge
graph construction. Shao et al. [26] proposed a task-oriented
framework for edge video analytics, focusing on minimizing
data transmission by extracting compact task-relevant features
and utilizing temporal entropy modeling for reduced bitrate.
However, for multi-camera networks, considering correlations
between cameras and task-oriented priorities allows for further
data compression, optimizing transmission efficiency based on
varying perceptual and transmission needs.
C. Our Contributions
Multi-camera networks enhance real-time collaborative per-
ception by leveraging multiple views. Our contributions are
summarized as follows.
We propose a novel robust task-oriented communication
strategy for Real-time Adaptive Collaborative Perception
(R-ACP), which optimizes calibration and feature trans-
mission across calibration and streaming phases. It en-
hances perception accuracy while managing communica-
tion overhead. We also formulate the Age of Perceived
Targets (AoPT) minimization problem to ensure both data
quality and timeliness.
We introduce a channel-aware self-calibration technique
based on Re-ID, which adaptively compresses key-point
features based on channel capacity and leverages spatial
and temporal cross-camera correlations, improving cali-
bration accuracy by up to 89.39%.
To balance bandwidth and inference accuracy, we develop
an Information Bottleneck (IB)-based encoding method
to dynamically adjust video compression rates according
to task relevance, reducing communication overhead and
latency while maintaining perception accuracy.
To address severe packet errors or loss without retrans-
mission in real-time scenarios, we design a priority-aware
multi-view fusion network that discards corrupted data by
dynamically adjusting the importance of each view, ensur-
ing robust performance even under challenging network
conditions.
Extensive evaluations demonstrate that our R-ACP frame-
work outperforms conventional methods, achieving sig-
nificant improvements in multiple object detection accu-
racy (MODA) by 25.49% and reducing communication
costs by 51.36% under constrained network conditions.
The remainder of this paper is organized as follows. Sec. II
and Sec. III introduce the communication and calibration mod-
els, analyze data timeliness, and formulate the optimization
problem. Sec. IV details our methodology, focusing on Re-
ID-based camera calibration, task-oriented compression using
the IB principle, and adaptive & robust streaming scheduling.
Finally, Sec. V evaluates our framework through simulations,
demonstrating improved MODA and reduced communication
costs under constrained network conditions.
II. SYSTEM MODEL AND PRELIMINARY
A. Scenario Description
As illustrated in Fig. 2, our system consists of multiple
vision-based UGVs equipped with edge cameras, denoted as
$\mathcal{K} = \{1, 2, \ldots, K\}$, that collaboratively track mobile targets,
such as pedestrians, within their FOVs. The UGVs are respon-
sible for transmitting encoded features to the edge server through
a wireless channel; the edge server then generates a pedestrian
occupancy map and conducts pedestrian re-identification (Re-ID)
tasks.
Fig. 2: The system consists of several UGVs equipped with
cameras, collaboratively tracking pedestrians.
Edge
servers have more powerful computing and storage capabil-
ity for DNN-based downstream tasks [39]. However, UGVs
encounter several challenges in dynamic environments. First,
unpredictable factors such as terrain changes and obstacles
cause sudden variations in camera extrinsic parameters, lead-
ing to degraded perception accuracy over time. Traditional
methods like Kalman filtering struggle to handle these rapid,
non-linear variations due to their reliance on accurate initial
states and linearity assumptions. To address this, we introduce
a Re-ID-based collaborative perception mechanism, where
nearby UGVs share perceptual information, allowing real-time
calibration of extrinsic parameters without the need for precise
initial settings or additional sensors. Another challenge is to
ensure the timeliness of the high-quality data being collected,
especially in high-mobility scenarios. To address this, we
introduce the Age of Perceived Targets (AoPT) metric and
formulate a new optimization problem. By adjusting the frame
rate and applying Information Bottleneck (IB)-based encoding,
we reduce spatiotemporal redundancy and improve the percep-
tual data timeliness. The proposed approach is also extensible
to other robotic platforms such as UAVs and autonomous
robots. Furthermore, unpredictable channel impairments can
lead to packet errors or loss during transmission. To mitigate
this, we design a priority-aware network, which selectively
fuses data from multiple UGVs based on channel conditions,
filtering out erroneous information to ensure robust perception
performance.
B. Communication Model
To manage communications between UGVs and an edge
server, we adopt a Frequency Division Multiple Access
(FDMA) scheme. The transmission capacity $C_k$ for each UGV
$k$ is determined by the Shannon capacity formula, which
depends on the signal-to-noise ratio (SNR) at the receiver:
$$C_k = B_k \log_2\left(1 + \mathrm{SNR}_k\right), \quad (1)$$
where $B_k$ is the bandwidth allocated to the link between UGV
$k$ and the edge server, and the SNR is given by:
$$\mathrm{SNR}_k = \frac{P_t G_k}{N_0 B_k}, \quad (2)$$
where $P_t$ is the transmission power, $G_k$ is the channel gain
for UGV $k$, and $N_0$ is the noise power spectral density.

Fig. 3: The flow of the self-calibration method using multi-view
feature sharing.

The transmission delay $d^T_k$ for each camera-server connection is
then determined by the amount of data to be transmitted $D$ and
the capacity $C_k$:
$$d^T_k = \frac{D}{C_k} = D\left[B_k \log_2\left(1 + \frac{P_t G_k}{N_0 B_k}\right)\right]^{-1}. \quad (3)$$
Thus, the total delay for each UGV $k$, which includes the
inference delay $d^I_k$ at the edge server, is given by:
$$d^{\mathrm{total}}_k = d^T_k + d^I_k. \quad (4)$$
C. Camera Calibration and Multi-view Fusion
Our multi-UGV collaborative perception system operates
in three phases: Idle (Phase 0), Calibration (Phase 1), and
Streaming (Phase 2). During Phase 0, UGVs perform object
detection without transmitting data. Phase 1 occurs when new
UGVs are deployed or when existing UGVs require recali-
bration to improve tracking accuracy. Phase 2 begins when
targets are detected, prompting real-time data transmission to
the fusion node 𝑠.
1) Calibration (Phase 1): Calibration involves estimating
intrinsic and extrinsic parameters for each UGV’s camera
[16]. Intrinsic parameters are defined by the intrinsic matrix
$\mathbf{K}$, which includes the focal lengths $f_x, f_y$ and the principal point
$(c_x, c_y)$:
$$\mathbf{K} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}. \quad (5)$$
Extrinsic parameters represent the camera's orientation and
position, encapsulated in the rotation matrix $\mathbf{R}$ and translation
vector $\mathbf{t}$, forming the extrinsic matrix $[\mathbf{R}|\mathbf{t}]$:
$$[\mathbf{R}|\mathbf{t}] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}. \quad (6)$$
The full perspective transformation matrix is then:
$$\mathbf{P} = \mathbf{K}[\mathbf{R}|\mathbf{t}]. \quad (7)$$
Given a 3D point $\mathbf{P}_{\mathrm{world}} = [x, y, z, 1]^T$, its 2D image projection
$\mathbf{p}_{\mathrm{img}} = [u, v, 1]^T$ is calculated by $\mathbf{p}_{\mathrm{img}} = \mathbf{P}\,\mathbf{P}_{\mathrm{world}}$. As shown in
Fig. 3, calibration depends on sharing detected feature points
between the reference camera and the re-calibration UGV
through a wireless channel. When the re-calibration UGV
requires external calibration, it broadcasts a request to nearby
UGVs, requesting them to share their extracted features along
with the corresponding image coordinates $\mathbf{p}^{\mathrm{ref}}_{\mathrm{img}}$ and world
coordinates $\mathbf{P}^{\mathrm{ref}}_{\mathrm{world}}$. The re-calibration UGV then exploits a
Key-point Relationship Extraction (KRE) network $\mathcal{S}$ to select
the UGV with the most matched points as the reference UGV.
The reference UGV transmits multi-frame feature and position
information through a lossy channel. Finally, the re-calibration
UGV uses this data in the Calibration Solver $\mathcal{C}$ to solve
the linear equations and determine its extrinsic parameters
$[\mathbf{R}_{\mathrm{rec}}|\mathbf{t}_{\mathrm{rec}}]$.¹
2) Streaming (Phase 2): For tasks like pedestrian detec-
tion, objects are often assumed to lie on the ground plane
$z = 0$. This assumption simplifies the projection to a 2D-to-2D
transformation between views. For a ground-plane point
$\mathbf{P}_{\mathrm{ground}} = [x, y, 0, 1]^T$, the image projection becomes:
$$\mathbf{p}_{\mathrm{img}} = \mathbf{P}_0 \mathbf{P}_{\mathrm{ground}}, \quad (8)$$
where $\mathbf{P}_0$ is the simplified $3 \times 3$ perspective matrix obtained by
eliminating the third column from the extrinsic matrix.
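As a worked illustration of Eqs. (5)-(8), the snippet below builds a perspective matrix from placeholder intrinsics and extrinsics and projects a ground-plane world point, both with the full matrix and with the ground-plane shortcut. All numeric values are assumptions chosen only for the example.

```python
import numpy as np

# Intrinsic matrix K (Eq. (5)); focal lengths and principal point are placeholders.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Extrinsic matrix [R|t] (Eq. (6)): identity rotation and a 1.5 m translation
# along the optical axis (placeholder extrinsics).
Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [1.5]])])

P = K @ Rt                              # full perspective matrix (Eq. (7))

def project(P, point_h):
    """Project a homogeneous point to pixel coordinates."""
    p = P @ point_h
    return p[:2] / p[2]                 # normalize by the third coordinate

# The world point has z = 0, so the full projection and the ground-plane
# shortcut (Eq. (8)) give the same pixel coordinates.
print(project(P, np.array([1.0, 0.5, 0.0, 1.0])))
P0 = P[:, [0, 1, 3]]                    # drop the third column for z = 0 points
print(project(P0, np.array([1.0, 0.5, 1.0])))
```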
To evaluate data timeliness in multi-UGV collaborative
perception, we calculate the proportion of time the system
spends in Phases 0 (Idle) and 2 (Streaming). We assume target
occurrences are independent with a constant rate 𝜆, forming
a Poisson process. Pedestrian dwell times are modeled as a
log-normal distribution $S \sim \mathrm{LogNormal}(\mu_S, \sigma^2_S)$ [40]. We
adopt the log-normal distribution because pedestrian dwell
times are inherently non-negative. It also captures the right-
skewed nature of dwell times: most are short, but only some
pedestrians stay longer, reflecting real-world behavior. The
Exponential distribution’s memoryless property is unsuitable
since pedestrian leaving probability depends on the time
already spent. Since cameras must capture targets before they
leave, we consider infinite servers, so each target is served
immediately without queuing. Therefore, we model these
phases as an $M/G/\infty$ queue with periodic Calibration (Phase
1), where $M$ denotes Poisson arrivals, $G$ is a general service
time distribution, and the number of servers is infinite.
Let $L(t)$ represent the number of active targets in the system
(i.e., the number of targets currently being captured by cam-
eras, corresponding to Phase 2). The steady-state distribution
of $L(t)$ is Poisson with mean $\rho = \lambda \mathbb{E}[S] = \lambda \exp\left(\mu_S + \frac{\sigma^2_S}{2}\right)$.
Thus, the probability that there are $n$ targets being captured
(Phase 2) is $P(L = n) = \frac{\rho^n e^{-\rho}}{n!}$, where $\rho$ represents the
expected number of active targets. Then, the probability that
there are no targets being captured (Phase 0, Idle) is
$P(L = 0) = e^{-\rho}$. Phase 1 (Calibration) is deterministic, and we
assume it occurs with a fixed probability $p_1$. Let $T_{\mathrm{total}}$ be the
total cycle time, including time spent in Phase 1. The average
time spent in Phase 1 is $T_1 = p_1 T_{\mathrm{total}}$. The remaining time is
split between Phase 0 and Phase 2. If we let $\pi^{(2)}_0$ and $\pi^{(2)}_2$
denote the relative time spent in Phases 0 and 2 within the
non-calibration portion of the cycle (i.e., after Phase 1), we
have $\pi^{(2)}_0 = e^{-\rho}$ and $\pi^{(2)}_2 = 1 - e^{-\rho}$. Thus, the steady-state
probabilities for the three phases can be given by:
$$\pi_1 = p_1, \quad
\pi_2 = (1 - p_1)\,\pi^{(2)}_2 = (1 - p_1)(1 - e^{-\rho}), \quad
\pi_0 = (1 - p_1)\,\pi^{(2)}_0 = (1 - p_1)\,e^{-\rho}. \quad (9)$$
Therefore, the average communication cost $C$ can then be
calculated as:
$$C = \pi_0 C_0 + \pi_1 C_1 + \pi_2 C_2, \quad (10)$$
where $C_0$, $C_1$, and $C_2$ represent the communication costs
for the Idle, Calibration, and Streaming phases, respectively.
Additionally, the values of the communication costs are deter-
mined by different features in the associated phases. In Sec. V,
the features can be transmitted successfully only when $C$ does
not exceed the communication bottleneck.

¹Intrinsic parameters are relatively simple to calculate, while extrinsic
parameters, which define camera position and orientation, require more
intricate computations. Hence, we focus on calibrating extrinsic parameters
in our multi-camera network.
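The phase probabilities and the average cost in Eqs. (9)-(10) can be evaluated directly. Below is a small sketch; the arrival rate, dwell-time parameters, and per-phase costs are assumed values for illustration only.

```python
import math

def phase_probabilities(lam, mu_s, sigma_s, p1):
    """Steady-state phase probabilities of the M/G/inf model (Eq. (9))."""
    rho = lam * math.exp(mu_s + sigma_s ** 2 / 2)   # mean number of active targets
    pi_idle = (1 - p1) * math.exp(-rho)             # Phase 0
    pi_stream = (1 - p1) * (1 - math.exp(-rho))     # Phase 2
    return pi_idle, p1, pi_stream

def average_cost(pi, costs):
    """Average communication cost over the three phases (Eq. (10))."""
    pi0, pi1, pi2 = pi
    c0, c1, c2 = costs
    return pi0 * c0 + pi1 * c1 + pi2 * c2

# Assumed arrival rate, log-normal dwell-time parameters, and per-phase costs.
pi = phase_probabilities(lam=0.05, mu_s=3.0, sigma_s=0.6, p1=0.1)
print(pi, average_cost(pi, costs=(0.0, 30e3, 100e3)))
```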
D. Data Timeliness Analysis
In this section, we first derive the classical average Age
of Information (AoI), which is not sufficient for evaluating
the data timeliness of a multi-source system with multiple
perceived targets. Therefore, we propose a new metric, namely
the Age of Perceived Targets (AoPT), in a multi-camera
collaborative perception system.
1) Age of Information (AoI) for a single UGV: Let $\Delta_k$ be
the average sampling interval of the $k$th UGV's camera, and
$d^{\mathrm{total}}_k = d^T_k + d^I_k$ be the average total delay, where $d^T_k$ is the
average transmission delay and $d^I_k$ is the average inference
delay. For a multi-source system, the average AoI for the $k$th
UGV is then given in Proposition 1.
Proposition 1: The AoI for UGV $k$ under deterministic sam-
pling and transmission delays is given by:
$$\Delta_{\mathrm{AoI},k} = \frac{\Delta_k}{2} + d^T_k + d^I_k, \quad (11)$$
where $d^T_k = \frac{D}{C_k} = D\left[B_k \log_2\left(1 + \frac{P_t G_k}{N_0 B_k}\right)\right]^{-1}$, $B_k$ represents the
bandwidth allocated to the link between UGV $k$ and the edge
server, $P_t$ is the transmission power, $G_k$ is the channel gain
for UGV $k$, $N_0$ is the noise power spectral density, and $D$ is
the data packet size.
Proof: The AoI at time $t$ for UGV $k$, denoted as $\Delta_{\mathrm{AoI},k}(t)$,
increases linearly between updates and resets to the total
delay $d^{\mathrm{total}}_k$ upon each update. Given the average sampling
interval $\Delta_k$, the AoI is $\Delta_{\mathrm{AoI},k} = \frac{1}{\Delta_k}\int_{t_{n-1}}^{t_n} \Delta_{\mathrm{AoI},k}(t)\,dt$, where
$t_{n-1}$ is the time of the $(n-1)$th update, and $t_n$ is the time of
the $n$th update. Substituting $\Delta_{\mathrm{AoI},k}(t) = t - t_{n-1} + d^{\mathrm{total}}_k$, we get:
$$\Delta_{\mathrm{AoI},k} = \frac{1}{\Delta_k}\int_{t_{n-1}}^{t_n}\left(t - t_{n-1} + d^{\mathrm{total}}_k\right)dt
= \frac{1}{\Delta_k}\left(\frac{\Delta_k^2}{2} + d^{\mathrm{total}}_k \Delta_k\right). \quad (12)$$
According to Eq. (4), we have $\Delta_{\mathrm{AoI},k} = \frac{\Delta_k}{2} + d^T_k + d^I_k$,
where $\Delta_k$, $d^T_k$, and $d^I_k$ are time-averaged values representing
the average sampling interval, average transmission delay, and
average inference delay, respectively.
TABLE I: Age metrics: mathematical definition and key feature captured
Metric | Definition | Key Feature
AoI  | $\Delta_{\mathrm{AoI},k} = \frac{\Delta_k}{2} + d^T_k + d^I_k$ | Vanilla freshness
AoII | $\Delta_{\mathrm{AoII},k} = \left(\frac{\Delta_k}{2} + d^{\mathrm{total}}_k\right)\Pr\{\hat{X}_k \neq X_k\}$ | Freshness weighted by correctness penalty
AoPT | $\Delta^{\mathrm{st}}_{\mathrm{AoPT},k} = \mathbb{1}_{\{g_k \geq \varepsilon_g\}}\, g_k\left(\frac{\Delta_k}{2} + d^{\mathrm{total}}_k\right)$ | Freshness weighted by target relevance
While the AoI effectively quantifies data timeliness in
traditional sensor networks, it exhibits significant limitations
when applied to multi-view collaborative perception systems:
1) AoI assumes uniform sensor contributions and iden-
tical fields of view (FoVs). In multi-UGV systems, each
camera covers different areas and contributes unevenly
to global perception. Updates from less critical views are
treated equally, leading to inefficiencies.
2) AoI neglects perception quality, such as target visibil-
ity or occlusions. A timely but low-quality update may
reset AoI while providing little meaningful information.
Moreover, although the Age of Incorrect Information
(AoII) [41], [42] extends AoI by penalizing incorrect updates,
it still presents limitations in collaborative perception:
1) AoII emphasizes correctness over perceptual value. It
targets estimation errors but cannot distinguish semanti-
cally uninformative frames from valuable observations.
2) AoII lacks task-driven prioritization. It does not ac-
count for the number or relevance of perceived targets,
which are crucial in multi-view perception.
To overcome these deficiencies, we propose the Age of
Perceived Targets (AoPT), which integrates both freshness
and perceptual relevance by weighting updates according to
detected target counts. The differences between AoI, AoII, and
AoPT are summarized in Table I.
2) Definition of AoPT: The Age of Perceived Targets
(AoPT) quantifies the freshness and relevance of perception
data from each UGV. Specifically, the AoPT of the $k$th UGV is
defined as $\Delta^{\mathrm{st}}_{\mathrm{AoPT},k} = \mathbb{1}_{\{g_k \geq \varepsilon_g\}}\, g_k \left(\frac{\Delta_k}{2} + d^{\mathrm{total}}_k\right)$, where $X^{(k)}_t$
denotes the perception data frame of the $k$th UGV at time $t$,
$g_k(\cdot)$ is the object recognition network outputting the number
of objects in a frame², $\Delta_k$ represents the sampling interval,
$d^I_k$ is the inference delay, and $\varepsilon_g$ is a threshold filtering out
low-quality data. This equation accounts for both the data
freshness and its informational value based on target count.
For simplicity, we abbreviate $g_k = g_k\big(X^{(k)}_t\big)$. As illustrated in
Fig. 4(a), the AoI function increases linearly between updates
and resets upon receiving new data. Fig. 4(b) shows how $g_k$
varies over time. Frames with $g_k < \varepsilon_g$ are discarded, while
those with $g_k \geq \varepsilon_g$ are retained, contributing to the AoPT
based on target count and motion dynamics, as depicted in
Fig. 4(c). In a multi-UGV collaborative perception system,
it is crucial to consider the worst-case AoPT to ensure no
UGV significantly lags behind. Therefore, the AoPT during
the streaming phase is formulated by taking the supremum
over all UGVs:
$$\Delta^{\mathrm{st}}_{\mathrm{AoPT}} = \sup_{k \in \mathcal{K}} \mathbb{1}_{\{g_k \geq \varepsilon_g\}}\, g_k \left(\frac{\Delta_k}{2} + d^{\mathrm{total}}_k\right), \quad (13)$$
where $d^{\mathrm{total}}_k$ includes all delays such as inference and transmis-
sion. This expression reflects the system's aim to prioritize the
freshest and most informative data by optimizing the worst-
case AoPT scenario.
Fig. 4: Illustrations of different age-based functions. (a) AoI
function. (b) Non-linear contribution. (c) AoPT function.
²To minimize computational costs, we calculate $g_k(\cdot)$ only at regular time
intervals since the count of targets remains constant within a small time slot $\tau$.
Moreover, we assume that $\tau$ is comparatively longer than both $\Delta_k$ and $\Delta_T$.
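A minimal sketch of the streaming-phase AoPT in Eq. (13) is given below; the per-UGV tuples and the threshold are hypothetical inputs used only for illustration.

```python
def aopt_streaming(ugvs, eps_g):
    """Worst-case streaming-phase AoPT over all UGVs (Eq. (13)).

    Each UGV is described by (g_k, delta_k, d_total_k): detected target
    count, sampling interval, and total delay.
    """
    ages = [
        g_k * (delta_k / 2 + d_total_k)
        for (g_k, delta_k, d_total_k) in ugvs
        if g_k >= eps_g                      # indicator 1{g_k >= eps_g}
    ]
    return max(ages) if ages else 0.0

# Three UGVs; the second is filtered out by the quality threshold.
print(aopt_streaming([(12, 0.5, 0.08), (1, 0.5, 0.05), (7, 1.0, 0.12)], eps_g=3))
```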
3) Impact of Calibration on AoPT: Calibration introduces
delays due to data suspension. With probability 𝑝1, the system
enters the Calibration phase (Phase 1), which increases the
AoPT by a fixed duration 𝑇1. To derive the AoPT during
the Calibration phase, $\Delta^{\mathrm{ca}}_{\mathrm{AoPT}}$, we first compute the average
AoI $\Delta^{\mathrm{Ca}}_k$ during this phase. Within one calibration phase at
times $t_{n-1}$ and $t_n$, the data update interval is the sum of the
calibration time $\Delta_T$ and the total delay $d^{\mathrm{total}}_k$. Therefore, we
have $t_n - t_{n-1} = \Delta_T + d^{\mathrm{total}}_k$. The average AoI during the
Calibration phase for device $k$ is calculated as:
$$\Delta^{\mathrm{Ca}}_k = \frac{1}{t_n - t_{n-1}}\int_{t_{n-1}}^{t_n}\left(t - t_{n-1} + d^{\mathrm{total}}_k\right)dt
= \frac{1}{\Delta_T + d^{\mathrm{total}}_k}\left[\frac{\left(\Delta_T + d^{\mathrm{total}}_k\right)^2}{2} + d^{\mathrm{total}}_k\left(\Delta_T + d^{\mathrm{total}}_k\right)\right]
= \frac{\Delta_T + 3 d^{\mathrm{total}}_k}{2}. \quad (14)$$
Therefore, the AoPT during the Calibration phase is then
expressed as:
$$\Delta^{\mathrm{ca}}_{\mathrm{AoPT}} = \mathbb{1}_{\{g_k \geq \varepsilon_g\}}\, g_k \cdot \Delta^{\mathrm{Ca}}_k = \frac{1}{2}\, g_{k^*}\left(3 d^{\mathrm{total}}_{k^*} + \Delta_T\right), \quad (15)$$
where $k^* = \arg\max_{k \in \mathcal{K}} \mathbb{1}_{\{g_k \geq \varepsilon_g\}}\, g_k\big(X^{(k)}_t\big)\left(\frac{\Delta_k}{2} + d^{\mathrm{total}}_k\right)$ re-
places the original indicator function $\mathbb{1}_{\{g_k \geq \varepsilon_g\}}$ in the following
sections.
Definition 1: Combining Eq. (13) and Eq. (15), the AoPT over the
entire cycle (including all three phases) is formulated as:
$$\Delta^{\mathrm{cy}}_{\mathrm{AoPT}} = p_1 \Delta^{\mathrm{ca}}_{\mathrm{AoPT}} + (1 - p_1)\Delta^{\mathrm{st}}_{\mathrm{AoPT}}
= \frac{g_{k^*}}{2}\left[p_1 \Delta_T + (1 - p_1)\Delta_{k^*} + (p_1 + 2)\, d^{\mathrm{total}}_{k^*}\right], \quad (16)$$
where $p_1$ is the probability that the system enters the Calibra-
tion phase.
According to Definition 1, the AoPT over the entire cycle is
a weighted sum of the calibration and non-calibration phases.
III. PROBLEM FORMULATION
In this section, the real-time multi-UGV system operates in
three main phases: Idle (Phase 0), Calibration (Phase 1), and
Streaming (Phase 2). The goal of optimizing these networks
is to minimize the AoPT over the entire cycle, ensuring that
the system provides fresh data for target perception. The
objective is to reduce the AoPT across all UGVs, which directly
affects the real-time accuracy of multi-target detection. The
optimization problem can be expressed as follows:
$$\mathcal{P}_1: \min_{\{\boldsymbol{B},\, \boldsymbol{\Delta},\, \Delta_T,\, \boldsymbol{D},\, \boldsymbol{\Theta}\}} \Delta^{\mathrm{cy}}_{\mathrm{AoPT}}$$
$$\text{s.t.} \quad \text{(17a)}\ \boldsymbol{\gamma}_{\mathrm{Ca}}(\boldsymbol{\Theta}_{\mathrm{Ca}}) \succeq \gamma_{\mathrm{Ca},0}\,\mathbf{1},\quad
\text{(17b)}\ \boldsymbol{\gamma}_{\mathrm{St}}(\boldsymbol{\Theta}_{\mathrm{St}}) \succeq \gamma_{\mathrm{St},0}\,\mathbf{1},$$
$$\text{(17c)}\ \boldsymbol{B}_{\min} \preceq \boldsymbol{B} \preceq \boldsymbol{B}_{\max},\quad
\text{(17d)}\ \boldsymbol{\Delta}_{\min} \preceq \boldsymbol{\Delta} \preceq \boldsymbol{\Delta}_{\max},$$
$$\text{(17e)}\ \boldsymbol{\Delta}_{T,\min} \preceq \boldsymbol{\Delta}_T \preceq \boldsymbol{\Delta}_{T,\max},\quad
\text{(17f)}\ \boldsymbol{D}_{\min} \preceq \boldsymbol{D} \preceq \boldsymbol{D}_{\max}, \quad (17)$$
where $\boldsymbol{B}$, $\boldsymbol{\Delta}$, $\boldsymbol{\Delta}_T$, and $\boldsymbol{D}$ are $K$-dimensional vectors correspond-
ing to the bandwidth, sampling intervals, calibration intervals,
and data packet sizes, respectively, for each UGV $k \in \mathcal{K}$, and
$\mathbf{1}$ is an all-ones vector. Let $\boldsymbol{\Theta} = [\boldsymbol{\Theta}_{\mathrm{Ca}}, \boldsymbol{\Theta}_{\mathrm{St}}]$ be the set of
model parameters, where $\boldsymbol{\Theta}_{\mathrm{Ca}}$ represents the model parameters
for feature extraction in the calibration phase, and $\boldsymbol{\Theta}_{\mathrm{St}}$ de-
notes the parameters for task-specific feature extraction in the
streaming phase. Ineqs. (17a) and (17b) are the constraints
on calibration and streaming task accuracy, respectively. Ineq.
(17c) is the bandwidth constraint, while Ineqs. (17d), (17e),
and (17f) are the constraints on the sampling intervals,
calibration intervals, and data packet sizes, respectively. Accord-
ing to Proposition 2, the original optimization problem (17)
can be decomposed into two subproblems in the following
subsections.
Proposition 2: The original problem $\mathcal{P}_1$ of minimizing the AoPT
can be decomposed into two subproblems $\mathcal{P}_2$ and $\mathcal{P}_3$, corre-
sponding to the calibration phase and the streaming phase,
respectively, which can be solved independently for the two
phases.
Proof: According to Eq. (16), the AoPT over the entire cycle
that the original optimization problem $\mathcal{P}_1$ aims to minimize is
given by $\Delta^{\mathrm{cy}}_{\mathrm{AoPT}} = p_1 \Delta^{\mathrm{ca}}_{\mathrm{AoPT}} + (1 - p_1)\Delta^{\mathrm{st}}_{\mathrm{AoPT}}$, where
$$\Delta^{\mathrm{ca}}_{\mathrm{AoPT}} = \tfrac{1}{2}\, g_{k^*}\left(\Delta_T + 3 d^{\mathrm{total}}_{k^*}\right), \quad (18a)$$
$$\Delta^{\mathrm{st}}_{\mathrm{AoPT}} = g_{k^*}\left(\tfrac{\Delta_{k^*}}{2} + d^{\mathrm{total}}_{k^*}\right). \quad (18b)$$
It is noted that $k^* = \arg\max_{k \in \mathcal{K}} \mathbb{1}_{\{g_k \geq \varepsilon_g\}}\, g_k \left(\tfrac{\Delta_k}{2} + d^{\mathrm{total}}_k\right)$,
which denotes the "bottleneck" UGV that dominates the AoPT.
By substituting Eqs. (18a) and (18b) into $\Delta^{\mathrm{cy}}_{\mathrm{AoPT}}$, we obtain:
$$\Delta^{\mathrm{cy}}_{\mathrm{AoPT}} = p_1 \cdot \tfrac{1}{2}\, g_{k^*}\left(\Delta_T + 3 d^{\mathrm{total}}_{k^*}\right) + (1 - p_1)\, g_{k^*}\left(\tfrac{\Delta_{k^*}}{2} + d^{\mathrm{total}}_{k^*}\right)$$
$$= g_{k^*}\left[\tfrac{p_1}{2}\Delta_T + \left(\tfrac{3 p_1}{2} + 1 - p_1\right) d^{\mathrm{total}}_{k^*} + \tfrac{1 - p_1}{2}\Delta_{k^*}\right]$$
$$= g_{k^*}\Big[\underbrace{\tfrac{p_1}{2}\Delta_T}_{\text{Calibration rate}} + \underbrace{\left(\tfrac{p_1}{2} + 1\right) d^{\mathrm{total}}_{k^*} + \tfrac{1 - p_1}{2}\Delta_{k^*}}_{\text{Transmission performance}}\Big]. \quad (19)$$
We observe that the variable $\Delta_T$ in the calibration phase is
independent of the variables $\Delta_{k^*}$ and $d^{\mathrm{total}}_{k^*}$ in the streaming
phase. Furthermore, the resource allocations in each phase are
independent. Therefore, the original problem $\mathcal{P}_1$ can be decom-
posed into two independent subproblems $\mathcal{P}_2$ and $\mathcal{P}_3$ regarding
calibration rate and transmission performance, respectively.
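For concreteness, the full-cycle AoPT in Eq. (19) can be evaluated as below; the inputs are hypothetical values for the bottleneck UGV, not parameters from the paper.

```python
def aopt_cycle(p1, delta_T, delta_k, d_total_k, g_k):
    """Full-cycle AoPT for the bottleneck UGV k* (Eq. (19))."""
    calibration_rate = (p1 / 2) * delta_T
    transmission_perf = (p1 / 2 + 1) * d_total_k + (1 - p1) / 2 * delta_k
    return g_k * (calibration_rate + transmission_perf)

# Hypothetical values: 10% calibration probability, 2 s calibration interval,
# 0.5 s sampling interval, 80 ms total delay, 12 detected targets.
print(aopt_cycle(p1=0.1, delta_T=2.0, delta_k=0.5, d_total_k=0.08, g_k=12))
```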
1) For the calibration phase, we integrate the calibration rate
in Eq. (19) and the accuracy constraint in Ineq. (17a) using
Lagrangian multipliers. Therefore, $\mathcal{P}_2$ is given by:
$$\mathcal{P}_2: \min_{\{B_k,\, \Delta_T,\, D_k,\, \boldsymbol{\Theta}_{\mathrm{Ca}}\}}
\underbrace{p_1 \cdot \Delta_T}_{\text{Calibration rate}}
- \lambda_{\mathrm{Ca}}\underbrace{\left(\gamma_{\mathrm{Ca},k}(\boldsymbol{\Theta}_{\mathrm{Ca}}) - \gamma_{\mathrm{Ca},0}\right)}_{\text{Calibration accuracy}},$$
$$\text{s.t.} \quad B_{\min} \leq B_k \leq B_{\max}, \quad D_{\min} \leq D_k \leq D_{\max},$$
$$\max\left\{\Delta_{T,\min},\, \tfrac{D_k}{C_k}\right\} \leq \Delta_T \leq \Delta_{T,\max}, \quad (20)$$
where $\lambda_{\mathrm{Ca}}$ represents the weight of calibration accuracy (La-
grange weight), and $\gamma_{\mathrm{Ca}}(\boldsymbol{\Theta}_{\mathrm{Ca}})$ denotes the calibration accuracy
function. The final constraint in (20) ensures that the calibra-
tion interval $\Delta_T$ is long enough to accommodate both data
transmission and feature extraction, while remaining within an
upper limit to preserve freshness. Specifically, $D_k / C_k$ repre-
sents the minimal time needed to transmit a calibration packet
of size $D_k$ over a link of capacity $C_k = B_k \log_2(1 + \mathrm{SNR}_k)$,
and $\Delta_{T,\min}$ accounts for the minimum required processing
time; thus, the lower bound in (20) reflects the maximum of
these two factors. The upper bound $\Delta_{T,\max}$ prevents excessive
delays that would degrade calibration quality and perception
freshness. $\mathcal{P}_2$ is solved for each $k \in \mathcal{K}$; afterwards, the worst-
case index $k^*$ is identified and substituted back into Eq. (19).
To solve $\mathcal{P}_2$, we need to design an efficient feature matching
algorithm, as detailed in Sec. IV-A.
2) For the streaming phase, we integrate the transmission
performance in Eq. (19) and the accuracy constraint in
Ineq. (17b) using Lagrangian multipliers. Therefore, the sub-
problem $\mathcal{P}_3$ is given by:
$$\mathcal{P}_3: \min_{\{B_{k^*},\, D_{k^*},\, \Delta_{k^*},\, \boldsymbol{\Theta}_{\mathrm{St}}\}}
\underbrace{\left(\tfrac{p_1}{2} + 1\right) d^{\mathrm{total}}_{k^*} + \tfrac{1 - p_1}{2}\Delta_{k^*}}_{\text{Transmission performance}}
- \lambda_{k^*}\underbrace{\left(\gamma_{\mathrm{St},k^*}(\boldsymbol{\Theta}_{\mathrm{St}}) - \gamma_{\mathrm{MOD},0}\right)}_{\text{Inference performance}},$$
$$\text{s.t.} \quad B_{\min} \leq B_{k^*} \leq B_{\max}, \quad D_{\min} \leq D_{k^*} \leq D_{\max},$$
$$\max\left\{\Delta_{\min},\, D_{k^*} \cdot C^{-1}_{k^*}\right\} \leq \Delta_{k^*} \leq \Delta_{\max}, \quad (21)$$
where $\lambda_{k^*}$ is the weight balancing transmission and infer-
ence performance³. Since $\mathcal{P}_3$ aims to balance the trade-off
between the transmission budget and inference performance,
we address this subproblem by utilizing an IB-based
theoretical framework, which offers the task-specific feature
encoder/decoder as discussed in Section IV-B.
Fig. 5: Framework of R-ACP. Its four modules are Collaborative
Self-calibration (Re-ID based feature extraction, key-point
matching, calibration solver), Task-Oriented Compression (fea-
ture quantization, IB-based encoding), Adaptive & Robust
Scheduling (temporal correlation estimation, channel-aware
priority, filtering erroneous data), and Multiview Data Fusion
(spatiotemporal feature fusion, occupancy map construction),
triggered by unpredictable accidents or cumulative error.
IV. METHODOLOGY
This section elaborates R-ACP’s design: 1) Collaborative
Self-calibration: R-ACP uses Re-ID to share perception data
for real-time extrinsic calibration. 2) Task-Oriented Com-
pression: After calibration, visual features are compressed
for pedestrian tracking and Re-ID. 3) Adaptive & Robust
Scheduling: Features are further compressed by temporal cor-
relation. Considering the varied packet loss rate, we calculate
the channel-aware priorities and filter out erroneous data. 4)
Multiview Data Fusion: It generates the occupancy map. The
framework of R-ACP is shown in Fig. 5.
A. Collaborative Self-calibration
This section addresses subproblem P2from Sec. III, focus-
ing on minimizing calibration transmission rate while maxi-
mizing calibration accuracy through Re-ID based method.
1) Comparative Analysis of Various Matching Techniques:
As shown in Fig. 3, we need to choose a suitable algorithm
to obtain the relevant key-points for the calibration solver.
Facial recognition extracts unique features to match indi-
viduals across camera views [43]. SIFT extracts distinctive,
scale- and rotation-invariant key-points suitable for viewpoint
matching [44]. Both are common in camera calibration. We
compare them with our Re-ID-based approach, evaluating
calibration accuracy using the extrinsic error $e_{\mathrm{extrinsic}}$ between
the recalibrated extrinsic matrix $[\mathbf{R}_{\mathrm{rec}}|\mathbf{t}_{\mathrm{rec}}]$ and the ground
truth $[\mathbf{R}_{\mathrm{gt}}|\mathbf{t}_{\mathrm{gt}}]$, defined as
$e_{\mathrm{extrinsic}} = \frac{\|[\mathbf{R}_{\mathrm{rec}}|\mathbf{t}_{\mathrm{rec}}] - [\mathbf{R}_{\mathrm{gt}}|\mathbf{t}_{\mathrm{gt}}]\|_F}{\|[\mathbf{R}_{\mathrm{gt}}|\mathbf{t}_{\mathrm{gt}}]\|_F} \times 100\%$.
³$\lambda_{\mathrm{Ca}}$ and $\lambda_{k^*}$ are manually adapted according to the prevailing channel
conditions. Under good channel quality (high $C_{k^*}$), a smaller $\lambda$ is used to
prioritize faster updates, whereas under poor channel quality (low $C_{k^*}$), a
larger $\lambda$ is selected to ensure sufficient accuracy despite limited transmission
resources.
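The extrinsic error metric above reduces to a ratio of Frobenius norms. A short numpy sketch is given below; the matrices are illustrative placeholders, not calibration results from the paper.

```python
import numpy as np

def extrinsic_error(Rt_rec, Rt_gt):
    """Relative Frobenius-norm error between recalibrated and ground-truth [R|t]."""
    return np.linalg.norm(Rt_rec - Rt_gt) / np.linalg.norm(Rt_gt) * 100.0

# Ground truth vs. a slightly perturbed recalibration result (illustrative values).
Rt_gt = np.hstack([np.eye(3), np.array([[0.2], [0.0], [1.5]])])
Rt_rec = Rt_gt + 0.005 * np.random.randn(3, 4)
print(f"extrinsic error: {extrinsic_error(Rt_rec, Rt_gt):.2f}%")
```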
Fig. 6: Calibration accuracy vs. communication bottleneck
for different errors. Fig. 6(a) shows the rotation error, while
Fig. 6(b) demonstrates the translation error.
Fig. 7: Comparison of rotation and translation errors using Re-ID
and SIFT methods for different matching-point thresholds $N$.
(a) Calibration errors using Re-ID. (b) Calibration errors using SIFT.
Experiments show facial recognition performs poorly, with
high data transmission (856MB for two matches) and a
24.3% extrinsic error. SIFT yields more matches but has
higher errors (42.5%). Re-ID, leveraging pedestrian attributes,
reduces extrinsic error to under 0.6%. We evaluate Re-ID
and SIFT under varying communication constraints. In Fig. 6,
errors decrease as communication constraints ease. Re-ID
consistently outperforms SIFT, with rotation error 13.6 times
lower and translation error 8.3 times lower. Fig. 7 shows a
matching threshold of 5 key-points yields optimal calibration,
while incorrect matches reduce performance.
2) Feature Extraction and Similarity Matching: To improve
the calibration accuracy in $\mathcal{P}_2$, Re-ID treats pedestrians as key-
points. Given bounding boxes $\mathcal{B}_k = \{b_1, b_2, \ldots, b_N\}$ from a
pretrained model (YOLOv5 [45]), the Re-ID network $\mathcal{F}(\cdot\,;\boldsymbol{\Theta})$
extracts feature vectors $\mathbf{f}_i = \mathcal{F}(b_i;\boldsymbol{\Theta}) \in \mathbb{R}^d$. Similarity be-
tween features across views $i$ and $j$ is computed via the Euclidean
distance $d_{\mathrm{sim}}(\mathbf{f}_i, \mathbf{f}_j) = \|\mathbf{f}_i - \mathbf{f}_j\|_2$. To optimize calibration,
distances are ranked, and the top $N$ matches are selected to
minimize the matching error:
$$\mathcal{M}_{ij} = \underset{M \subseteq \mathcal{B}_i \times \mathcal{B}_j}{\arg\min} \sum_{(b_i, b_j) \in M} d_{\mathrm{sim}}\big(\mathcal{F}(b_i), \mathcal{F}(b_j)\big), \quad (22)$$
where $\mathcal{M}_{ij}$ represents the key-point matching, focusing on the
most similar features to minimize the extrinsic error $e_{\mathrm{extrinsic}}$.
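The matching in Eq. (22) can be approximated greedily by ranking pairwise Euclidean distances and keeping the top-$N$ one-to-one matches. The sketch below illustrates this idea with random feature vectors; the 128-D dimensionality and the greedy strategy are assumptions for illustration, not the exact solver used in the paper.

```python
import numpy as np

def match_keypoints(feats_i, feats_j, top_n=5):
    """Greedy top-N matching of Re-ID features by Euclidean distance (cf. Eq. (22)).

    feats_i, feats_j: (N_i, d) and (N_j, d) arrays of Re-ID feature vectors.
    Returns a list of (index_in_i, index_in_j, distance) tuples.
    """
    # Pairwise Euclidean distances d_sim(f_i, f_j) = ||f_i - f_j||_2.
    dists = np.linalg.norm(feats_i[:, None, :] - feats_j[None, :, :], axis=-1)
    matches, used_i, used_j = [], set(), set()
    for flat in np.argsort(dists, axis=None):        # smallest distances first
        i, j = np.unravel_index(flat, dists.shape)
        if i in used_i or j in used_j:
            continue                                  # keep matches one-to-one
        matches.append((int(i), int(j), float(dists[i, j])))
        used_i.add(i); used_j.add(j)
        if len(matches) == top_n:
            break
    return matches

# Example with random 128-D features for two views (dimensions are assumptions).
rng = np.random.default_rng(0)
print(match_keypoints(rng.normal(size=(8, 128)), rng.normal(size=(6, 128))))
```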
3) Quantization and Communication Cost: To reduce the cal-
ibration rate in $\mathcal{P}_2$, feature vectors $\mathbf{f}_i$ are quantized based
on channel quality. The communication cost of calibration is
$C_1 = \sum_{i=1}^{N} q_i \cdot d$, where $q_i$ is the quantization level and $d$ is
the feature dimensionality. Quantization adapts to the SNR to meet
the available capacity $C_{\mathrm{ava}}$ [46]:
$$q_i = \underset{q_i}{\arg\min}\left\{q_i \cdot d \;\middle|\; C_1 \leq C_{\mathrm{ava}}\right\}. \quad (23)$$
A higher SNR allows finer quantization; a lower SNR uses coarser
quantization to reduce overhead. Combining Re-ID and adap-
tive quantization, we optimize calibration accuracy and rates,
efficiently addressing $\mathcal{P}_2$ in multi-UGV systems.
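One simple reading of Eq. (23), consistent with the prose ("higher SNR allows finer quantization"), is to pick the finest bit width whose payload fits the available capacity. The sketch below assumes illustrative bit-width levels and a hypothetical budget.

```python
def select_quantization(num_features, dim, capacity_bits, levels=(16, 8, 4, 2)):
    """Pick the finest per-element bit width whose calibration payload
    C_1 = N * q * d fits the available capacity (cf. Eq. (23))."""
    for q in levels:                      # from finest to coarsest
        if num_features * q * dim <= capacity_bits:
            return q
    return levels[-1]                     # fall back to the coarsest level

# Example: 6 matched features of dimension 128 under a 64 kbit budget.
print(select_quantization(num_features=6, dim=128, capacity_bits=64_000))
```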
B. Task-Oriented Compression
The Information Bottleneck (IB) method offers a principled
framework to balance the compression of the input camera data
$X$ and the task-relevant information for inference about $Y$ [?].
The IB objective is formulated as:
$$\max_{\Theta}\; I(Z;Y) - \lambda\, I(X;Z), \quad (24)$$
where $I(Z;Y)$ is the mutual information between the com-
pressed representation $Z$ and the target variable $Y$, capturing
inference performance, and $I(X;Z)$ is the mutual information
between $X$ and $Z$, representing the amount of information
retained for transmission. The parameter $\lambda$ controls the trade-
off between compression and inference accuracy.
Proposition 3: The optimization problem $\mathcal{P}_3$ is equivalent to
the Information Bottleneck (IB) problem defined in Eq. (24);
specifically, minimizing the transmission delay and sampling in-
terval corresponds to minimizing $I(X;Z)$, while maximizing
inference performance corresponds to maximizing $I(Z;Y)$.
Proof: First, the total delay $d^{\mathrm{total}}_{k^*} = d^T_{k^*} + d^I_{k^*}$ includes the
transmission delay $d^T_{k^*} = \frac{D_{k^*}}{C_{\mathrm{ava}}}$, where $D_{k^*}$ is the data packet
size and $C_{\mathrm{ava}}$ is the available channel capacity. Since $D_{k^*}$ is
proportional to the entropy $H(Z)$ of the transmitted feature $Z$,
and for deterministic encoders $H(Z|X) = 0$, we have
$D_{k^*} \propto H(Z) = I(X;Z)$. Therefore, we have $d^{\mathrm{total}}_{k^*} \propto I(X;Z)$. Given
$C_{\mathrm{ava}}$, the sampling interval $\Delta_{k^*}$ satisfies $\Delta_{k^*} \geq \frac{D_{k^*}}{C_{\mathrm{ava}}} = \frac{I(X;Z)}{C_{\mathrm{ava}}}$,
which implies $I(X;Z) \leq \Delta_{k^*} C_{\mathrm{ava}}$, showing that $\Delta_{k^*} \propto I(X;Z)$
if $C_{\mathrm{ava}}$ is fixed. Thus, minimizing $d^{\mathrm{total}}_{k^*}$ and $\Delta_{k^*}$ under com-
munication constraints corresponds to minimizing $I(X;Z)$.
On the other hand, inference performance depends on how
much information $Z$ retains about $Y$, quantified by $I(Z;Y)$.
Maximizing inference performance corresponds to maximizing
$I(Z;Y)$. Therefore, the objective of $\mathcal{P}_3$ can be reformulated
as $\min\,[\alpha \cdot I(X;Z) - \beta \cdot I(Z;Y)]$, where $\alpha$ and $\beta$ are positive
constants derived from the weights and coefficients in $\mathcal{P}_3$,
which aligns with the IB problem defined in Eq. (24).
1) Variational Approximation for IB: Due to the com-
putational complexity of directly estimating mutual infor-
mation, we employ a variational approximation method to
derive a lower bound for $I(Z;Y)$ and an upper bound for
$I(X;Z)$. The variational approach is based on approximating
the conditional distribution $p(Y|Z)$ with a simpler distribution
$q(Y|Z)$, parameterized by $\Theta_d$, and approximating $p(Z|X)$ with
$q(Z|X)$, parameterized by $\Theta_{con}$. Using the standard definition
of mutual information, we start with:
$$I(Z;Y) = \mathbb{E}_{p(Y,Z)}\left[\log \frac{p(Y|Z)}{p(Y)}\right],$$
and introduce the KL-divergence between $p(Y|Z)$ and $q(Y|Z)$:
$$D_{KL}\left[p(Y|Z)\,\|\,q(Y|Z)\right] = \mathbb{E}_{p(Y|Z)}\left[\log \frac{p(Y|Z)}{q(Y|Z)}\right] \geq 0. \quad (25)$$
This leads to the inequality $\mathbb{E}_{p(Y|Z)}[\log p(Y|Z)] \geq \mathbb{E}_{p(Y|Z)}[\log q(Y|Z)]$.
Therefore, the lower bound on the mutual information is:
$$I(Z;Y) \geq \mathbb{E}_{p(Y,Z)}[\log q(Y|Z)] + H(Y), \quad (26)$$
where $H(Y)$ is the entropy of $Y$. For the second part, $I(X;Z)$,
we derive an upper bound using variational approximations
due to the complexity of directly minimizing the term. Since the
entropy $H(Z|X) \geq 0$, we can establish the following inequality:
$\lambda \sum_{k=1}^{K} I(X^{(k)};Z^{(k)}) \leq \lambda \sum_{k=1}^{K} H(Z^{(k)})$, where $H(Z^{(k)})$ is the
entropy of the compressed feature $Z^{(k)}$. To refine this upper
bound, we incorporate latent variables $V^{(k)}$ as side information
to encode the quantized features. Thus, we obtain:
$$\lambda \sum_{k=1}^{K} H(Z^{(k)}) \leq \lambda \sum_{k=1}^{K} H(Z^{(k)}, V^{(k)}), \quad (27)$$
where the joint entropy $H(Z^{(k)}, V^{(k)})$ represents the commu-
nication cost. Moreover, we apply the non-negativity property
of the KL-divergence to establish a tighter upper bound. The joint
entropy $H(Z^{(k)}, V^{(k)})$ can be bounded using variational dis-
tributions $q(Z^{(k)}|V^{(k)};\Theta^{(k)}_{con})$ and $q(V^{(k)};\Theta^{(k)}_{l})$. Specifically,
we have:
$$H(Z^{(k)}, V^{(k)}) \leq \mathbb{E}_{p(Z^{(k)},V^{(k)})}\left[-\log\left(q(Z^{(k)}|V^{(k)};\Theta^{(k)}_{con})\, q(V^{(k)};\Theta^{(k)}_{l})\right)\right], \quad (28)$$
where $\Theta^{(k)}_{con}$ and $\Theta^{(k)}_{l}$ are the learnable parameters of the vari-
ational distributions $q(Z^{(k)}|V^{(k)})$ and $q(V^{(k)})$, respectively.
These parameters are optimized to approximate the true distri-
butions, minimizing the communication cost while preserving
essential feature relations for inference. Substituting the result
from Eq. (28) into Eq. (27), we derive the final upper bound
for $I(X^{(k)};Z^{(k)})$ as follows:
$$I\left(X^{(k)};Z^{(k)}\right) \leq \mathbb{E}_{p(Z^{(k)},V^{(k)})}\left[-\log\left(q\left(Z^{(k)}|V^{(k)};\Theta^{(k)}_{con}\right) q\left(V^{(k)};\Theta^{(k)}_{l}\right)\right)\right]. \quad (29)$$
This upper bound simplifies the minimization process of
$I(X;Z)$, providing a feasible method for reducing the transmis-
sion cost during network training.
2) Loss Function Design: We design the loss function to
optimize the IB objective. The loss function $\mathcal{L}_1$ is constructed
to minimize the upper bound of $I(X;Z)$ and maximize the
lower bound of $I(Z;Y)$:
$$\mathcal{L}_1 = \sum_{k=1}^{K}\underbrace{\mathbb{E}\left[-\log q\left(Y^{(k)}|Z^{(k)};\Theta^{(k)}_{d}\right)\right]}_{\text{The upper bound of } I(Z^{(k)};Y^{(k)})}
+ \lambda \sum_{k=1}^{K}\underbrace{\mathbb{E}\left[-\log q\left(Z^{(k)}|X^{(k)};\Theta^{(k)}_{con}\right)\right]}_{\text{The upper bound of } I(X^{(k)};Z^{(k)})}. \quad (30)$$
This loss function optimizes both compression and inference
accuracy by minimizing $I(X;Z)$ while maximizing $I(Z;Y)$,
achieving an optimal balance between transmission cost and
task performance.
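For intuition, a toy PyTorch surrogate of the two terms in Eq. (30) is sketched below. The classification-style head, the categorical feature model, and all tensor shapes are assumptions for illustration and do not reproduce the paper's actual encoder/decoder.

```python
import torch
import torch.nn.functional as F

def ib_loss(logits, labels, z_logits, z_sample, lam=0.1):
    """Variational-IB-style surrogate (cf. Eq. (30)): a cross-entropy term stands
    in for the bound involving I(Z;Y), and the code length of the quantized
    feature under the learned model stands in for the bound on I(X;Z)."""
    # Inference term: -E[log q(Y|Z)] for a classification-style head.
    task_term = F.cross_entropy(logits, labels)
    # Rate term: negative log-likelihood of the quantized feature indices.
    rate_term = -torch.distributions.Categorical(
        logits=z_logits).log_prob(z_sample).mean()
    return task_term + lam * rate_term

# Toy tensors only; shapes are assumptions.
logits = torch.randn(4, 10)                 # q(Y|Z) outputs for 4 samples
labels = torch.randint(0, 10, (4,))
z_logits = torch.randn(4, 16, 32)           # per-element code distributions
z_sample = torch.randint(0, 32, (4, 16))    # quantized feature indices
print(ib_loss(logits, labels, z_logits, z_sample))
```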
C. Adaptive and Robust Streaming Scheduling
In this section, we develop an adaptive streaming scheduling
framework to efficiently manage dynamic packet loss rates in
multi-view environments. We first reduce temporal redundancy
by estimating correlations across multiple frames and then im-
plement a prioritization strategy for robust feature transmission
under varying packet loss conditions.
1) Correlation Estimation by Multiple Frames: To optimize
the transmission efficiency, we leverage the temporal depen-
dencies between consecutive frames. The feature representa-
tion at time $t$, denoted by $\hat{z}^{(k)}_t$, is estimated using the preceding
frames as side information. This estimation is modeled by
the variational distribution $q(\hat{z}^{(k)}_t|\hat{z}^{(k)}_{t-1}, \ldots, \hat{z}^{(k)}_{t-\tau};\Theta^{(k)}_p)$, where
$\Theta^{(k)}_p$ represents the parameters of the network in the $k$th UGV.
We assume that this conditional distribution follows
a Gaussian distribution $q(\hat{z}^{(k)}_t|\hat{z}^{(k)}_{t-1}, \ldots, \hat{z}^{(k)}_{t-\tau};\Theta^{(k)}_p) =
\mathcal{N}(\mu^{(k)}_t, \sigma^{(k)}_t) * \mathcal{U}$, where $\mu^{(k)}_t$ and $\sigma^{(k)}_t$ are the predicted
mean and variance, respectively, and $\mathcal{U}$ models the
quantization noise added during transmission. By exploiting
temporal correlations across frames, this model reduces the
entropy of $\hat{z}^{(k)}_t$, thereby minimizing the required transmission
bitrate. To further reduce communication overhead, we align
the predicted distribution $q(\hat{z}^{(k)}_t|\hat{z}^{(k)}_{t-1}, \ldots, \hat{z}^{(k)}_{t-\tau};\Theta^{(k)}_p)$ with
the true conditional distribution $p(\hat{z}^{(k)}_t|\hat{z}^{(k)}_{t-1}, \ldots, \hat{z}^{(k)}_{t-\tau})$ by
minimizing the cross-entropy loss between them. The loss
function is given by:
$$\mathcal{L}_2 = \sum_{k=1}^{K}\sum_{t=1}^{N} p\big(\hat{z}^{(k)}_t|\hat{z}^{(k)}_{t-1}, \ldots, \hat{z}^{(k)}_{t-\tau}\big)\log \frac{p\big(\hat{z}^{(k)}_t|\hat{z}^{(k)}_{t-1}, \ldots, \hat{z}^{(k)}_{t-\tau}\big)}{q\big(\hat{z}^{(k)}_t|\hat{z}^{(k)}_{t-1}, \ldots, \hat{z}^{(k)}_{t-\tau};\Theta^{(k)}_p\big)},$$
which quantifies the KL divergence between the true dis-
tribution and the variational approximation. By minimizing
this divergence, we exploit the temporal redundancy, thereby
reducing the amount of data that needs to be transmitted.
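A minimal PyTorch sketch of such a temporal entropy model is shown below; it predicts a Gaussian over the current feature from the previous $\tau$ features and reports the resulting code length in bits. The network sizes are assumptions, and the quantization-noise convolution $\mathcal{U}$ is omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class TemporalEntropyModel(nn.Module):
    """Predict a Gaussian over the current feature from the previous tau features,
    mirroring q(z_t | z_{t-1}, ..., z_{t-tau}) in the correlation-estimation step."""
    def __init__(self, dim, tau):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * tau, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * dim))   # mean and log-scale

    def bits(self, z_t, z_past):
        mu, log_scale = self.net(z_past.flatten(1)).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_scale.exp())
        # Cross-entropy (in bits) of the actual feature under the prediction.
        return -dist.log_prob(z_t).sum(dim=-1) / math.log(2.0)

# Toy usage: 64-D features, two past frames (dimensions are assumptions).
model = TemporalEntropyModel(dim=64, tau=2)
z_t, z_past = torch.randn(4, 64), torch.randn(4, 2, 64)
print(model.bits(z_t, z_past).mean())
```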
2) Robust Multi-View Fusion under Dynamic Packet Loss:
In multi-UGV sensing systems, fluctuating communication ca-
pacity leads to unpredictable packet loss, dropping critical data
and harming inference accuracy. We propose a robust multi-
view fusion method that adapts to dynamic packet loss by
prioritizing important features and assigning potential losses
to lower-priority data. Firstly, we compute the priority of each
feature map using average pooling. For a feature map $\hat{z}^{(k)}_t$ from
camera $k$ at time $t$, the priority $p^{(k)}_t$ is calculated as:
$$p^{(k)}_t = \frac{1}{CHW}\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} \hat{z}^{(k)}_{t,c,h,w}, \quad (31)$$
where $\hat{z}^{(k)}_{t,c,h,w}$ represents the element value of $\hat{z}^{(k)}_t$ at time $t$,
channel $c$, height $h$, and width $w$. Besides, $C$, $H$, and $W$ are the
dimensions of the feature map, and $\hat{z}^{(k)}_t$ has dimensions
$T \times C \times H \times W$. Higher average values $p^{(k)}_t$ indicate more important
features. Based on the dynamic packet loss rate, we allocate
potential losses to the features with the lowest priorities. We sort
the feature maps $\hat{z}^{(k)}_t$ by their priorities $p^{(k)}_t$ and assign a
mask $m^{(k)}_t \in \{0, 1\}$, where $m^{(k)}_t = 1$ if the feature is likely
to be received (higher priority) and $m^{(k)}_t = 0$ otherwise. The
masked features are then defined as $\tilde{z}^{(k)}_t = m^{(k)}_t \hat{z}^{(k)}_t$, where
$\tilde{z}^{(k)}_t$ represents the masked feature map corresponding to $\hat{z}^{(k)}_t$.
The fusion function $\mathcal{G}$ incorporates both the masked features
$\tilde{z}^{(k)}_t$ and the masks $m^{(k)}_t$, allowing the network to adjust for
missing data due to packet loss. The fusion is formulated as:
$$\hat{y}_t = \mathcal{G}\left(\left\{\tilde{z}^{(k)}_t, m^{(k)}_t\right\}_{k=1}^{K};\Theta_r\right), \quad (32)$$
where $\Theta_r$ represents the fusion parameters, and $\hat{y}_t$ is the
fused output (occupancy map) at time $t$. Since our goal is
to minimize the inference error while accounting for dynamic
packet loss rates, the total loss function integrates inference
accuracy and a penalty for losing important features:
$$\mathcal{L}_3 = \mathbb{E}\left[\sum_{t=1}^{N} d(y_t, \hat{y}_t)\right] + \alpha_d \sum_{t=1}^{N}\sum_{k=1}^{K}\left(1 - m^{(k)}_t\right), \quad (33)$$
where $y_t$ is the ground truth at time $t$, $d(\cdot)$ measures the
inference error, and the second term penalizes the loss of
important features. The weight $\alpha_d$ balances robustness and
accuracy. Therefore, the total loss is given by:
$$\mathcal{L} = \mathcal{L}_1 + \alpha_2 \mathcal{L}_2 + \alpha_3 \mathcal{L}_3, \quad (34)$$
where $\mathcal{L}_1$ and $\mathcal{L}_2$ are the loss terms defined above for
bitrate minimization and inference accuracy, and $\alpha_2$ and $\alpha_3$
are weights for balancing these components.
Fig. 8: Our algorithm is deployed on a UGV-edge server
platform. The UGV node captures RGB images and performs
local encoding, transmitting features via Wi-Fi to the edge
server node for aggregation and further processing.
Fig. 9: Error vs. calibration interval $\Delta^{\mathrm{Ca}}_k$ for different data
sizes. (a) Errors for a 10 kB key-point feature budget. (b) Errors
for a 30 kB key-point feature budget.
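To illustrate the priority-aware masking of Eqs. (31)-(32), the sketch below scores each view by average pooling and keeps only the highest-priority views under an assumed survival ratio; the fusion network $\mathcal{G}$ itself is not reproduced, and all tensor shapes are placeholders.

```python
import torch

def priority_mask(features, keep_ratio):
    """Priority-aware masking for robust fusion (cf. Eqs. (31)-(32)).

    features: tensor of shape (K, C, H, W), one feature map per UGV view.
    keep_ratio: fraction of views assumed to survive the current packet-loss rate.
    Returns the masked features and the binary mask m_t^(k).
    """
    priorities = features.mean(dim=(1, 2, 3))           # average pooling, Eq. (31)
    k = features.shape[0]
    n_keep = max(1, int(round(keep_ratio * k)))
    keep_idx = torch.topk(priorities, n_keep).indices    # highest-priority views
    mask = torch.zeros(k, dtype=features.dtype)
    mask[keep_idx] = 1.0
    masked = features * mask.view(k, 1, 1, 1)            # zero out low-priority views
    return masked, mask

# Toy example: 4 views, 70% of the views survive packet loss (values assumed).
feats = torch.randn(4, 8, 16, 16)
masked, mask = priority_mask(feats, keep_ratio=0.7)
print(mask)
```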
V. PERFORMANCE EVALUATION
A. Simulation Setup
We set up simulations to evaluate our R-ACP framework,
aimed at predicting pedestrian occupancy in urban settings
using multiple cameras. These simulations replicate a city
TABLE II: Latency of the Re-ID calibration pipeline on the UGV
platform with an Nvidia Jetson unit.
Operation            | Description                       | Latency
Pedestrian Detection | YOLOv5s inference                 | 10 ± 2 ms
Feature Extraction   | OSNet-x1.0                        | 60 ± 15 ms
Feature Matching     | Euclidean distance matching       | 3 ± 1 ms
Total Latency        | Detection + Extraction + Matching | 73 ± 18 ms
Fig. 10: Comparison of MODA performance with respect to
rotation and translation errors. (a) Rotation Error vs. MODA.
(b) Translation Error vs. MODA.
environment with variables like signal frequency and device
density affecting the outcomes. We simulate a communication
system operating at a 2.4 GHz frequency with a path loss expo-
nent of 3.5 and an 8 dB shadowing deviation to model wireless
conditions. To assess congestion levels, devices emitting 0.1
Watts interfere at densities of 10 to 100 per 100 square
meters. The bandwidth is set to 2 MHz. We use the Wildtrack
dataset from EPFL, featuring high-resolution images from
seven cameras capturing pedestrian movements in a public
area [47]. Each camera provides 400 frames at 2 frames per
second, totaling over 40,000 bounding boxes for more than 300
pedestrians. In our simulations, the positions of these cameras
correspond to the positions of UGVs. Moreover, we also
evaluate the R-ACP framework on our hardware platform. As
shown in Fig. 8, the platform consists of a UGV node equipped
with an RGB camera and a Jetson Orin NX 8G module for
local feature encoding. The extracted features are transmitted
over Wi-Fi to an edge server node with a Jetson Orin NX
Super 16G for feature aggregation and inference. This setup
enables efficient multi-agent collaborative perception under
communication constraints.
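As a rough illustration of this setup, the sketch below instantiates a log-distance path loss model with the stated exponent and shadowing and converts the resulting SINR into a Shannon capacity over the 2 MHz band. The 1 m free-space reference, the 23 dBm UGV transmit power, the 0 dB noise figure, and the interferer placement are assumptions added for illustration and are not taken from the paper.

```python
# Hedged sketch of the simulated wireless conditions: 2.4 GHz, path loss
# exponent 3.5, 8 dB log-normal shadowing, 0.1 W interferers, 2 MHz bandwidth.
import numpy as np

RNG = np.random.default_rng(0)
FREQ_HZ = 2.4e9
PL_EXPONENT = 3.5
SHADOWING_DB = 8.0
BANDWIDTH_HZ = 2e6
NOISE_DBM = -174 + 10 * np.log10(BANDWIDTH_HZ)  # thermal noise floor (assumed NF = 0 dB)

def path_loss_db(d_m: float) -> float:
    """Log-distance path loss with log-normal shadowing (1 m free-space reference)."""
    fspl_1m = 20 * np.log10(FREQ_HZ) + 20 * np.log10(4 * np.pi / 3e8)
    return fspl_1m + 10 * PL_EXPONENT * np.log10(max(d_m, 1.0)) + RNG.normal(0, SHADOWING_DB)

def rx_power_dbm(tx_dbm: float, d_m: float) -> float:
    return tx_dbm - path_loss_db(d_m)

def link_capacity_bps(tx_dbm: float, d_m: float, interferer_dists_m: list) -> float:
    """Shannon capacity under interference from 0.1 W (20 dBm) devices."""
    sig_mw = 10 ** (rx_power_dbm(tx_dbm, d_m) / 10)
    interf_mw = sum(10 ** (rx_power_dbm(20.0, di) / 10) for di in interferer_dists_m)
    noise_mw = 10 ** (NOISE_DBM / 10)
    sinr = sig_mw / (interf_mw + noise_mw)
    return BANDWIDTH_HZ * np.log2(1 + sinr)

# Example: a UGV 40 m from the edge server with 30 interferers placed 10-100 m away.
interferers = list(RNG.uniform(10, 100, size=30))
print(f"~{link_capacity_bps(23.0, 40.0, interferers) / 1e6:.2f} Mb/s")
```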
The primary metric of inference performance is MODA,
which assesses the system’s ability to accurately detect pedes-
trians based on missed and false detections. We also examine
the rate-performance tradeoff to understand how communi-
cation overhead affects calibration and multi-view perception.
For comparative analysis, we consider five baselines, including
video coding and image coding, as follows.
PIB [20]: A collaborative perception framework that
enhances detection accuracy and reduces communication
costs by prioritizing and transmitting only useful features.
JPEG [48]: A widely used image compression standard
employing lossy compression algorithms to reduce image
data size, commonly used to decrease communication
load in networked camera systems.
H.264 [49]: Known as Advanced Video Coding (AVC) or
MPEG-4 Part 10, the standard that significantly enhances
(a) Cameras 0-3. (b) Cameras 4-6.
Fig. 11: Detected targets over time for different cameras.
Fig. 12: Cumulative distribution function (CDF) of detected targets per camera (the legend reports each camera's standard deviation).
video compression efficiency, allowing high-quality video
transmission at lower bit rates.
H.265 [50]: Also known as High Efficiency Video Coding
(HEVC), the standard that offers up to 50% better data
compression than its predecessor H.264 (MPEG-4 Part
10), while maintaining the same video quality.
AV1 [51]: AOMedia Video 1 (AV1), an open, royalty-
free video coding format developed by the Alliance for
Open Media (AOMedia), and designed to succeed VP9
with improved compression efficiency.
In Fig. 9(a), the effect of varying the calibration interval
$\Delta^{\text{Ca}}_{k}$ on rotation and translation errors under a 10 KB key-
point feature constraint is shown. As the calibration interval
increases, both errors rise. Similarly, Fig. 9(b) shows error
trends for a 30 KB key-point feature size, where increasing
$\Delta^{\text{Ca}}_{k}$ also results in higher errors. Comparing the two figures,
a larger communication bottleneck (30KB) yields more gran-
ular features and significantly improves calibration accuracy
compared to the smaller 10KB bottleneck. As shown in Table
II, our numerical results show that the average per-frame
latency is approximately 73 ±18 ms when processing frames
with detected pedestrians. Specifically, pedestrian detection
consumes 10 ±2ms, total feature extraction takes 60 ±15
ms, and feature matching adds 3±1ms overhead. These
results demonstrate that our Re-ID module is capable of
operating at approximately 14 frames per second (FPS) on
the Jetson Orin, even without aggressive optimization (e.g.,
TensorRT or quantization). Therefore, the proposed method
remains feasible for real-time deployment in dynamic edge
environments.
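For reference, the snippet below sketches the matching stage of the Re-ID pipeline in Table II and the FPS arithmetic quoted above; the greedy nearest-neighbour rule and the 0.8 distance threshold are assumptions, and the YOLOv5s detector and OSNet extractor are represented only by their mean latencies.

```python
# Illustrative sketch of the Euclidean-distance matching stage and the
# per-frame latency budget from Table II (detector/extractor not shown).
import numpy as np

def match_by_euclidean(feats_a: np.ndarray, feats_b: np.ndarray, thresh: float = 0.8):
    """Greedy cross-camera matching: pair each detection in A with its nearest
    embedding in B if the distance falls below an assumed threshold."""
    matches = []
    dists = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    for i in range(dists.shape[0]):
        j = int(np.argmin(dists[i]))
        if dists[i, j] < thresh:
            matches.append((i, j))
    return matches

# Latency budget (mean values from Table II): detection + extraction + matching.
detect_ms, extract_ms, match_ms = 10.0, 60.0, 3.0
per_frame_ms = detect_ms + extract_ms + match_ms          # ~73 ms
print(f"~{1000.0 / per_frame_ms:.1f} FPS")                # ~13.7 FPS on the Jetson Orin NX
```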
In Fig. 10(a), the relationship between rotation error and
MODA is presented, showing that as the rotation error grows,
R-ACP's advantage in multi-UGV collaborative perception
accuracy widens: it achieves up to 5.48% better MODA than
the PIB baseline at the highest rotation error. Similarly, Fig. 10(b)
(a) Camera 2. (b) Camera 5. (c) Camera 6. (d) Camera 7.
Fig. 13: CDF of detected targets under different compression methods (R-ACP, JPEG, AV1, H.264, H.265) for cameras 2, 5, 6, and 7.
shows that R-ACP sustains higher MODA scores under larger
translation errors, outperforming PIB by up to 5.08%. Both
figures indicate that R-ACP maintains superior calibration and
perception accuracy under larger errors, enhancing collaborative
multi-UGV perception.
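For clarity, the MODA figures reported here can be read as the standard CLEAR-style score, which we assume below: one minus the ratio of missed and false detections to the number of ground-truth targets, accumulated over frames.

```python
# Assumed CLEAR-style MODA: 1 - (misses + false positives) / ground truths,
# accumulated over all frames.
def moda(misses_per_frame, false_pos_per_frame, gt_per_frame) -> float:
    misses = sum(misses_per_frame)
    false_pos = sum(false_pos_per_frame)
    gt = sum(gt_per_frame)
    return 1.0 - (misses + false_pos) / max(gt, 1)

# Toy example over three frames.
print(f"MODA = {moda([2, 1, 0], [1, 0, 1], [20, 18, 22]) * 100:.2f}%")  # ~91.67%
```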
In Fig. 11, the detected target counts are plotted over time
for different cameras. In Fig. 11(a), cameras 0 to 3 detect a
moderate number of targets, with cameras 2 and 3 identifying
significantly more targets than the others. Fig. 11(b) shows that
cameras 4, 5, and 6 detect a higher number of targets overall,
with camera 5 detecting the most, followed by camera 6. These
variations highlight that UGVs whose cameras detect more
targets, such as UGVs 3, 5, and 6, have a higher priority when
calculating AoPT. Fig. 12
shows the cumulative distribution function (CDF) of detected
targets across 500 time slots for each camera employing H.265
compression. The results indicate considerable variations in
the number of detected targets across UGVs, with UGVs 5
and 6 capturing the most targets within their FOVs, further
confirming their higher priority in terms of data timeliness
when computing AoPT. Under a constrained channel condition
of 30KB/s, as shown in Fig. 13, R-ACP consistently yields
higher detection rates than traditional codecs. This suggests
that R-ACP is better suited for preserving occupancy-critical
visual information in bandwidth-limited edge deployments.
Fig. 14 visualizes the results of collaborative perception
by multiple UGVs within a 12 m × 36 m area, represented as
a 480×1440 grid with a cell resolution of 2.5 cm. In this ex-
perimental setup, seven wireless edge cameras work together
to perceive the area, and contour lines are used to repre-
sent the perception range of each camera. The denser the
contour lines, the closer the target is to the camera, which
correlates with higher perception accuracy. The comparison
between individual FOVs in Figs. 14(a), 14(b), and 14(c)
illustrates the variability in pedestrian detection when only
one or a few UGVs contribute to perception. In these fig-
ures, certain areas show missing detections due to limited
coverage, with only one or two cameras detecting some
targets. In contrast, Fig. 14(d) demonstrates the advantage
TABLE III: Impact of Varying FOVs and Communication
Costs on AoPT and Collaborative Perception Accuracy.
FOV Num. | Comm. Cost | AoPT (Targets × s) | MODA (%)
FOV 1 | 15.36 KB | 6.53 ± 0.64 | 63.15
FOV 1 | 18.69 KB | 8.13 ± 1.02 | 64.90
FOV 2 | 14.81 KB | 7.14 ± 0.95 | 52.34
FOV 2 | 18.69 KB | 8.01 ± 1.04 | 56.49
FOV 3 | 15.47 KB | 6.91 ± 0.86 | 66.46
FOV 3 | 18.73 KB | 8.31 ± 1.26 | 67.09
FOVs 1-3 | 17.07 KB | 9.27 ± 1.15 | 84.14
FOVs 1-3 | 26.62 KB | 10.13 ± 1.40 | 85.86
of using all UGVs collaboratively, covering a larger FOV
and significantly reducing missed detections. Pedestrians are
detected more accurately and with better resolution when
multiple cameras provide complementary perspectives, leading
to improved overall system performance. Additionally, the
perception accuracy depends on the proximity of the cameras
to the targets, as indicated by the density of the contour lines.
The closer a target is to the cameras, the higher the resolution
and accuracy of its detection, emphasizing the importance of
strategic UGV positioning and multi-UGV collaboration for
real-time monitoring tasks.
Table III illustrates the impact of different FOVs and
communication costs on AoPT and collaborative perception
accuracy. As communication costs increase, the amount of
feature data transmitted between UGVs rises, leading to more
accurate target detection and higher numbers of detected
targets. Consequently, both AoPT and MODA values increase,
highlighting that the network should allocate more
resources to UGVs associated with these FOVs to enhance
the timeliness and precision of perception data. The table also
demonstrates that different FOVs capture varying numbers of
targets, which suggests that optimizing resource distribution
should prioritize nodes with higher AoPT values to maximize
system performance. Additionally, when data from multiple
FOVs is combined, the system achieves its highest MODA
scores, but this also leads to increased communication costs
and higher AoPT values. This indicates that while multi-
FOV collaboration improves overall perception accuracy, it
also necessitates a careful balance of resource allocations to
manage the increased communication demands efficiently.
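As a toy illustration of this allocation principle, the sketch below splits a fixed channel budget across FOVs in proportion to their AoPT values; this proportional rule is only a stand-in for the paper's resource optimization, and the numbers are taken loosely from Table III.

```python
# Toy proportional allocation: give FOVs with higher AoPT a larger share of a
# fixed channel budget, reflecting the stated prioritization heuristic.
def allocate_bandwidth(aopt_per_fov: dict, total_kbps: float) -> dict:
    total_aopt = sum(aopt_per_fov.values())
    return {fov: total_kbps * a / total_aopt for fov, a in aopt_per_fov.items()}

# Example using AoPT values close to the individual-FOV rows of Table III.
print(allocate_bandwidth({"FOV1": 8.13, "FOV2": 8.01, "FOV3": 8.31}, total_kbps=26.62))
```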
Fig. 15 illustrates the dynamic variations in detected targets
and AoPT across different FOVs over time. In Fig. 15(a), we
observe the fluctuation of the detected target counts in FOVs 1,
2, and 3 as targets move within and out of the cameras’ fields
of view. Fig. 15(b) shows the corresponding AoPT values,
where higher detected target counts result in increased AoPT,
indicating the system’s need for more channel resources to
maintain data timeliness. Since the targets are continuously
moving, both the number of detected targets and AoPT are
dynamic over time. As the communication constraint and
camera sampling interval remain fixed, the larger the number
of detected targets, the greater the AoPT becomes, indicating
that the system will require more resources to ensure timely
transmission. This illustrates the relationship between the
(a) Perception result from FOV 1. (b) Perception result from FOV 2. (c) Perception result from FOV 3. (d) Perception result from all UGV cameras.
Fig. 14: Comparison of perception results from different FOVs using all UGV cameras. Figs. 14(a) to 14(c) show the results for individual FOVs, while Fig. 14(d) shows the result from using all UGV cameras collaboratively.
(a) Detected targets across FOVs. (b) AoPT (targets × seconds) across FOVs.
Fig. 15: Comparison of detected targets and AoPT for different FOVs over time. Fig. 15(a) shows the detected targets for FOVs 1-3, while Fig. 15(b) illustrates the AoPT values.
(a) AoPT vs. Comm. Bottleneck. (b) AoPT vs. Sampling Interval.
Fig. 16: AoPT vs. different parameters. Fig. 16(a) shows the
relationship between AoPT and capacity. Fig. 16(b) demon-
strates the relationship between AoPT and sampling interval.
dynamic nature of target movement and the system’s resource
allocation strategy for maintaining efficient data timeliness.
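The trend can be summarized with a simplified proxy (not the paper's exact AoPT definition): with the sampling interval and channel budget fixed, each update ages by the sampling interval plus its transmission time, so a FOV's AoPT scales with the number of targets it currently perceives. The frame size and capacity values below are illustrative assumptions.

```python
# Simplified AoPT-style proxy: more perceived targets or a tighter channel
# bottleneck yields a larger (targets x seconds) value, as in Figs. 15-16.
def aopt_proxy(n_targets: int, sample_interval_s: float,
               frame_bits: float, capacity_bps: float) -> float:
    staleness_s = sample_interval_s + frame_bits / capacity_bps  # age per update
    return n_targets * staleness_s                               # targets x seconds

print(aopt_proxy(n_targets=25, sample_interval_s=0.5, frame_bits=8 * 30e3, capacity_bps=960e3))
print(aopt_proxy(n_targets=10, sample_interval_s=0.5, frame_bits=8 * 30e3, capacity_bps=960e3))
```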
Fig. 16 shows the relationship between AoPT and different
system parameters, i.e., communication bottleneck and sam-
pling interval. In Fig. 16(a), as the communication bottleneck
increases, AoPT decreases due to reduced transmission la-
tency, improving data timeliness across all methods. R-ACP
achieves up to 51.96% lower AoPT compared to H.265, H.264,
and AV1. With larger communication capacity, R-ACP en-
hances data freshness. Fig. 16(b) shows AoPT increasing with
longer sampling intervals due to less frequent data updates.
(a) Packet loss rate 0.1. (b) Packet loss rate 0.2.
(c) Packet loss rate 0.3. (d) Packet loss rate 0.4.
Fig. 17: Comparison of MODA vs. Communication Bottleneck
across different packet loss rates.
R-ACP consistently maintains lower AoPT across different
sampling intervals, reducing it by up to 22.08% compared to
baselines, demonstrating its effectiveness in minimizing AoPT
even with lower sampling frequencies.
Fig. 17 shows how packet loss and communication bot-
tlenecks affect MODA. As the bottleneck increases, MODA
improves across all packet loss rates, indicating higher trans-
mission capacity enhances perception. For lower packet loss
rates (0.1 and 0.2), the improvement is gradual, while at a
packet loss rate of 0.3, R-ACP significantly outperforms other
baselines like PIB, with at least a 23.08% improvement. Even
under severe packet loss (0.4), R-ACP maintains a notable
advantage over H.265, H.264, AV1, and JPEG, achieving up
to 25.49% improvement in MODA, demonstrating robustness
against high packet loss scenarios. These results highlight R-
ACP’s efficiency in maintaining data integrity and accuracy
across various conditions.
VI. CONCLUSION
In this paper, we have proposed a real-time adaptive collab-
orative perception (R-ACP) framework by leveraging a robust
task-oriented communication strategy to enhance real-time
multi-view collaborative perception under constrained network
conditions. The contributions of R-ACP are twofold. First,
we have introduced a channel-aware self-calibration technique
utilizing Re-ID-based feature extraction and adaptive key-
point compression, which significantly improves extrinsic cal-
ibration accuracy by up to 89.39%, even with limited FOV
overlap. Second, we have leveraged an Information Bottleneck
(IB)-based encoding method to optimize feature transmission
and sharing, ensuring data timeliness while reducing com-
munication overhead. By intelligently compressing data and
employing a priority-based scheduling mechanism for severe
packet loss, R-ACP can reduce AoPT and retain perception ac-
curacy under various channel conditions. Extensive simulation
results show that R-ACP significantly outperforms traditional
methods like PIB, H.265, H.264, and AV1, improving multiple
object detection accuracy (MODA) by 25.49% and decreasing
communication costs by 51.36%, particularly in high packet
loss scenarios (up to 40% packet loss rate).
REFERENCES
[1] J. Wang, H. Du, Y. Liu, G. Sun, D. Niyato, S. Mao, D. I. Kim,
and X. Shen, “Generative AI based secure wireless sensing for ISAC
networks,” 2024, arXiv preprint arXiv:2408.11398. [Online]. Available:
https://arxiv.org/abs/2408.11398
[2] M. Tang, S. Cai, and V. K. N. Lau, “Radix-partition-based over-the-air
aggregation and low-complexity state estimation for iot systems over
wireless fading channels,” IEEE Transactions on Signal Processing,
vol. 70, pp. 1464–1477, Mar. 2022.
[3] S. Hu, Z. Fang, Z. Fang, Y. Deng, X. Chen, Y. Fang, and S. T. W.
Kwong, “AgentsCoMerge: Large language model empowered collabo-
rative decision making for ramp merging, IEEE Transactions on Mobile
Computing, DOI: 10.1109/TMC.2025.3564163, pp. 1–15, Apr. 2025.
[4] X. Chen, Y. Deng, H. Ding, G. Qu, H. Zhang, P. Li, and Y. Fang, “Ve-
hicle as a Service (VaaS): Leverage vehicles to build service networks
and capabilities for smart cities,” IEEE Commun. Surv. Tutor., Feb. 2024,
42(3): 2048–2081.
[5] S. Hu, Y. Tao, Z. Fang, G. Xu, Y. Deng, S. Kwong, and Y. Fang, “CP-
Guard+: A new paradigm for malicious agent detection and defense
in collaborative perception, 2025, arXiv preprint arXiv:2502.07807.
[Online]. Available: https://arxiv.org/abs/2502.07807
[6] J. Wang, J. Wang, Z. Tong, Z. Jiao, M. Zhang, and C. Jiang, “ACBFT:
Adaptive chained byzantine fault-tolerant consensus protocol for UAV
Ad Hoc networks,” IEEE Transactions on Vehicular Technology, (DOI:
10.1109/TVT.2025.3548281), Mar. 2025.
[7] Z. Fang, Z. Liu, J. Wang, S. Hu, Y. Guo, Y. Deng, and
Y. Fang, “Task-oriented communications for visual navigation with
edge-aerial collaboration in low altitude economy, arXiv preprint
arXiv:2504.18317, 2025. [Online]. Available: https://arxiv.org/abs/2504.
18317
[8] X. Hou, J. Wang, Z. Zhang, J. Wang, L. Liu, and Y. Ren, “Split
federated learning for UAV-enabled integrated sensing, computation,
and communication,” arXiv preprint arXiv:2504.01443, 2025. [Online].
Available: https://arxiv.org/abs/2504.01443
[9] H. Yang, J. Cai, M. Zhu, C. Liu, and Y. Wang, “Traffic-informed multi-
camera sensing (TIMS) system based on vehicle re-identification,” IEEE
Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp.
17189–17200, Mar. 2022.
[10] Y. He, X. Wei, X. Hong, W. Shi, and Y. Gong, “Multi-target multi-
camera tracking by tracklet-to-target assignment,” IEEE Transactions
on Image Processing, vol. 29, pp. 5191–5205, Mar. 2020.
[11] N. Q. Hieu, D. Thai Hoang, D. N. Nguyen, and M. Abu Alsheikh,
“Reconstructing human pose from inertial measurements: A generative
model-based compressive sensing approach, IEEE Journal on Selected
Areas in Communications, vol. 42, no. 10, pp. 2674–2687, Jun. 2024.
[12] S. Wang, J. Lu, B. Guo, and Z. Dong, “RT-VeD: Real-time VoI
detection on edge nodes with an adaptive model selection framework,
in Proceedings of the 28th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, New York, NY, 2022, pp. 4050–4058.
[13] D. Yang, X. Fan, W. Dong, C. Huang, and J. Li, “Robust BEV 3D object
detection for vehicles with tire blow-out, Sensors, vol. 24, no. 14, p.
4446, Jul. 2024.
[14] A. Tabb, H. Medeiros, M. J. Feldmann, and T. T. Santos, “Cal-
ibration of asynchronous camera networks: Calico,” arXiv preprint
arXiv:1903.06811, 2019.
[15] M. Özuysal, “Manual and auto calibration of stereo camera systems,”
Master’s thesis, Middle East Technical University, 2004.
[16] C. Yuan, X. Liu, X. Hong, and F. Zhang, “Pixel-level extrinsic self
calibration of high resolution LiDAR and camera in targetless envi-
ronments,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp.
7517–7524, Jul. 2021.
[17] M. Khodarahmi and V. Maihami, “A review on Kalman filter models,
Archives of Computational Methods in Engineering, vol. 30, no. 1, pp.
727–747, Oct. 2023.
[18] Z. Fang, S. Hu, H. An, Y. Zhang, J. Wang, H. Cao, X. Chen, and
Y. Fang, “PACP: Priority-aware collaborative perception for connected
and autonomous vehicles,” IEEE Transactions on Mobile Computing,
(DOI: 10.1109/TMC.2024.3449371), Aug. 2024.
[19] Z. Fang, S. Hu, L. Yang, Y. Deng, X. Chen, and Y. Fang, “PIB: Pri-
oritized information bottleneck framework for collaborative edge video
analytics,” in IEEE Global Communications Conference (GLOBECOM),
Cape Town, South Africa, Dec. 2024, pp. 1–6.
[20] Z. Fang, S. Hu, J. Wang, Y. Deng, X. Chen, and Y. Fang, “Priori-
tized information bottleneck theoretic framework with distributed online
learning for edge video analytics,” IEEE Transactions on Networking,
DOI: 10.1109/TON.2025.3526148, Jan. 2025.
[21] Y. Yang, W. Wang, Z. Yin, R. Xu, X. Zhou, N. Kumar, M. Alazab,
and T. R. Gadekallu, “Mixed game-based AoI optimization for com-
bating COVID-19 with AI bots, IEEE Journal on Selected Areas in
Communications, vol. 40, no. 11, pp. 3122–3138, Oct. 2022.
[22] M. R. Abedi, N. Mokari, M. R. Javan, H. Saeedi, E. A. Jorswieck,
and H. Yanikomeroglu, “Safety-aware age-of-information (S-AoI) for
collision risk minimization in cell-free mMIMO platooning networks,”
IEEE Transactions on Network and Service Management, Mar. 2024.
[23] Z. Fang, J. Wang, Y. Ren, Z. Han, H. V. Poor, and L. Hanzo, “Age
of information in energy harvesting aided massive multiple access
networks,” IEEE Journal on Selected Areas in Communications, vol. 40,
no. 5, pp. 1441–1456, May 2022.
[24] B. Wu, J. Huang, and Q. Duan, “Real-time intelligent healthcare enabled
by federated digital twins with aoi optimization,” IEEE Network, DOI:
10.1109/MNET.2025.3565977, 2025.
[25] Q. He, G. Dan, and V. Fodor, “Minimizing age of correlated information
for wireless camera networks,” in IEEE Conference on Computer
Communications Workshops (INFOCOM WKSHPS). Honolulu, HI:
IEEE, Apr. 2018, pp. 547–552.
[26] J. Shao, X. Zhang, and J. Zhang, “Task-oriented communication for
edge video analytics,” IEEE Transactions on Wireless Communications
(DOI: 10.1109/TWC.2023.3314888), 2023.
[27] H. Feng, J. Wang, Z. Fang, J. Chen, and D.-T. Do, “Evaluating
AoI-centric HARQ protocols for UAV networks,” IEEE Transactions on
Communications, vol. 72, no. 1, pp. 288–301, Sep. 2024.
[28] “Cooperative multi-camera vehicle tracking and traffic surveillance with
edge artificial intelligence and representation learning,” Transportation
Research Part C: Emerging Technologies, vol. 148, p. 103982, Mar.
2023.
[29] S. Liu, S. Huang, X. Xu, J. Lloret, and K. Muhammad, “Efficient visual
tracking based on fuzzy inference for intelligent transportation systems,”
IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 12,
pp. 15795–15806, Jan. 2023.
[30] R. Qiu, M. Xu, Y. Yan, J. S. Smith, and X. Yang, “3D random
occlusion and multi-layer projection for deep multi-camera pedestrian
localization,” in European Conference on Computer Vision. Tel Aviv,
Israel: Springer, 2022, pp. 695–710.
[31] C. Guo, L. Zhao, Y. Cui, Z. Liu, and D. W. K. Ng, “Power-efficient
wireless streaming of multi-quality tiled 360 VR video in MIMO-
OFDMA systems,” IEEE Transactions on Wireless Communications,
vol. 20, no. 8, pp. 5408–5422, Mar. 2021.
[32] J. Su, M. Hirano, and Y. Yamakawa, “Online camera orientation cali-
bration aided by a high-speed ground-view camera, IEEE Robotics and
Automation Letters, vol. 8, no. 10, pp. 6275–6282, Aug. 2023.
[33] J. Yin, F. Yan, Y. Liu, and Y. Zhuang, “Automatic and targetless Li-
DAR–camera extrinsic calibration using edge alignment, IEEE Sensors
Journal, vol. 23, no. 17, pp. 19871–19880, Jul. 2023.
[34] J. Wang, H. Du, Z. Tian, D. Niyato, J. Kang, and X. Shen, “Semantic-
aware sensing information transmission for metaverse: A contest the-
oretic approach,” IEEE Transactions on Wireless Communications,
vol. 22, no. 8, pp. 5214–5228, Aug. 2023.
[35] Z. Meng, K. Chen, Y. Diao, C. She, G. Zhao, M. A. Imran, and
B. Vucetic, “Task-oriented cross-system design for timely and accu-
rate modeling in the metaverse, IEEE Journal on Selected Areas in
Communications, vol. 42, no. 3, pp. 752–766, Dec. 2024.
[36] Z. Meng, C. She, G. Zhao, M. A. Imran, M. Dohler, Y. Li, and
B. Vucetic, “Task-oriented metaverse design in the 6G era,” IEEE
Wireless Communications, vol. 31, no. 3, pp. 212–218, Feb. 2024.
[37] J. Kang, H. Du, Z. Li, Z. Xiong, S. Ma, D. Niyato, and Y. Li,
“Personalized saliency in task-oriented semantic communications: Image
transmission and performance analysis,” IEEE Journal on Selected Areas
in Communications, vol. 41, no. 1, pp. 186–201, Nov. 2023.
[38] H. Wei, W. Ni, W. Xu, F. Wang, D. Niyato, and P. Zhang, “Federated
semantic learning driven by information bottleneck for task-oriented
communications,” IEEE Communications Letters, vol. 27, no. 10, pp.
2652–2656, Aug. 2023.
[39] X. Hou, Z. Ren, J. Wang, W. Cheng, Y. Ren, K.-C. Chen, and H. Zhang,
“Reliable computation offloading for edge-computing-enabled software-
defined IoV,” IEEE Internet of Things Journal, vol. 7, no. 8, pp. 7097–
7111, 2020.
[40] D. Peng, L. Chao, and L. Zhili, “Walking time modeling on transfer
pedestrians in subway passages,” Journal of Transportation Systems
Engineering and Information Technology, vol. 9, no. 4, pp. 103–109,
Nov. 2009.
[41] A. Maatouk, S. Kriouile, M. Assaad, and A. Ephremides, “The age of
incorrect information: A new performance metric for status updates,”
IEEE/ACM Trans. Netw., vol. 28, no. 5, p. 2215–2228, Oct. 2020.
[42] J. Chen, J. Wang, C. Jiang, and J. Wang, “Age of incorrect information
in semantic communications for NOMA aided XR applications,” IEEE
Journal of Selected Topics in Signal Processing, vol. 17, no. 5, pp.
1093–1105, Sep. 2023.
[43] H. Du, H. Shi, D. Zeng, X.-P. Zhang, and T. Mei, “The elements of
end-to-end deep face recognition: A survey of recent advances, ACM
Computing Surveys (CSUR), vol. 54, no. 10s, pp. 1–42, Sep. 2022.
[44] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,
International journal of computer vision, vol. 60, pp. 91–110, Jun. 2004.
[45] Y. Wang, Y. Wang, I. W.-H. Ho, W. Sheng, and L. Chen, “Pavement
marking incorporated with binary code for accurate localization of
autonomous vehicles,” IEEE Transactions on Intelligent Transportation
Systems, vol. 23, no. 11, pp. 22290–22300, May 2022.
[46] X. Hou, J. Wang, C. Jiang, Z. Meng, J. Chen, and Y. Ren, “Efficient
federated learning for metaverse via dynamic user selection, gradient
quantization and resource allocation,” IEEE Journal on Selected Areas
in Communications, vol. 42, no. 4, pp. 850–866, Apr. 2024.
[47] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagaut-
dinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret, “Wildtrack: A
multi-camera hd dataset for dense unscripted pedestrian detection,” in
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Salt Lake City, UT, Jun. 2018, pp. 5030–5039.
[48] G. K. Wallace, “The JPEG still picture compression standard, IEEE
Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv,
Feb. 1992.
[49] ITU-T Recommendation H.264 and ISO/IEC 14496-10, Advanced
Video Coding for Generic Audiovisual Services, International
Telecommunication Union Std., 2003. [Online]. Available:
https://www.itu.int/rec/T-REC-H.264
[50] F. Bossen, B. Bross, K. Suhring, and D. Flynn, “HEVC complexity and
implementation analysis,” IEEE Transactions on circuits and Systems
for Video Technology, vol. 22, no. 12, pp. 1685–1696, Oct. 2012.
[51] J. Han, B. Li, D. Mukherjee, C.-H. Chiang, A. Grange, C. Chen, H. Su,
S. Parker, S. Deng, U. Joshi et al., “A technical overview of AV1,”
Proceedings of the IEEE, vol. 109, no. 9, pp. 1435–1462, Sep. 2021.
Zhengru Fang (S’20) received his B.S. degree
(Hons.) in electronics and information engineer-
ing from the Huazhong University of Science and
Technology (HUST), Wuhan, China, in 2019 and
received his M.S. degree (Hons.) from Tsinghua
University, Beijing, China, in 2022. Currently, he
is pursuing his PhD degree in the Department
of Computer Science at City University of Hong
Kong. His research interests include collaborative
perception, V2X, age of information, and mobile
edge computing. He received the Outstanding Thesis
Award from Tsinghua University in 2022, and the Excellent Master Thesis
Award from the Chinese Institute of Electronics in 2023. His research work
has been published in IEEE/CVF CVPR, IEEE ToN, IEEE JSAC, IEEE TMC,
IEEE ICRA, and ACM MM, etc.
Jingjing Wang (S’14-M’19-SM’21) received his
B.S. degree in Electronic Information Engineering
from Dalian University of Technology, Liaoning,
China in 2014 and the Ph.D. degree in Informa-
tion and Communication Engineering from Tsinghua
University, Beijing, China in 2019, both with the
highest honors. From 2017 to 2018, he visited the
Next Generation Wireless Group chaired by Prof.
Lajos Hanzo, University of Southampton, UK. Dr.
Wang is currently an associate professor at School of
Cyber Science and Technology, Beihang University.
His research interests include AI enhanced next-generation wireless networks,
swarm intelligence and confrontation. He has published over 100 IEEE
Journal/Conference papers. Dr. Wang was a recipient of the Best Journal Paper
Award of IEEE ComSoc Technical Committee on Green Communications &
Computing in 2018, the Best Paper Award of IEEE ICC and IWCMC in 2019.
Yanan Ma (Graduate Student Member, IEEE)
received the B.Eng. degree in Electronic Informa-
tion Engineering (English Intensive) and the M.Eng.
degree in Information and Communication Engi-
neering from the Dalian University of Technology,
Dalian, China, in 2020 and 2023. She is currently
pursuing the Ph.D. degree in the Department of
Computer Science at the City University of Hong
Kong. Her research interests are focused on edge
intelligence, wireless communication and network-
ing.
Yihang Tao received the B.S. degree from the
School of Information Science and Engineering,
Southeast University, Nanjing, China, in 2021 and
received his M.S. degree from the School of
Electronic Information and Electrical Engineering,
Shanghai Jiao Tong University, Shanghai, China, in
2024. Currently, he is pursuing his PhD degree in the
Department of Computer Science at City University
of Hong Kong. His current research interests include
collaborative perception, autonomous driving, and
AI security.
Yiqin Deng received her MS degree in software
engineering and her PhD degree in computer sci-
ence and technology from Central South University,
Changsha, China, in 2017 and 2022, respectively.
She is currently a Postdoctoral Researcher with the
Department of Computer Science at City University
of Hong Kong. Previously, she was a Postdoctoral
Research Fellow with the School of Control Science
and Engineering, Shandong University, Jinan, China.
She was a visiting researcher at the University of
Florida, Gainesville, Florida, USA, from 2019 to
2021. Her research interests include edge/fog computing, computing power
networks, Internet of Vehicles, and resource management.
Xianhao Chen (Member, IEEE) received the B.Eng.
degree in electronic information from Southwest
Jiaotong University in 2017, and the Ph.D. degree
in electrical and computer engineering from the
University of Florida in 2022. He is currently an
assistant professor at the Department of Electrical
and Electronic Engineering, the University of Hong
Kong, where he directs the Wireless Information
& Intelligence (WILL) Lab. He serves as a TPC
member of several international conferences and an
Associate Editor of ACM Computing Surveys. He
received the Early Career Award from the Research Grants Council (RGC)
of Hong Kong in 2024, the ECE Graduate Excellence Award for research
from the University of Florida in 2022, and the ICCC Best Paper Award in
2023. His research interests include wireless networking, edge intelligence,
and machine learning.
Yuguang Fang (S’92, M’97, SM’99, F’08) received
the MS degree from Qufu Normal University, China
in 1987, a PhD degree from Case Western Reserve
University, Cleveland, Ohio, USA, in 1994, and a
PhD degree from Boston University, Boston, Mas-
sachusetts, USA in 1997. He joined the Department
of Electrical and Computer Engineering at Univer-
sity of Florida in 2000 as an assistant professor,
then was promoted to associate professor in 2003,
full professor in 2005, and distinguished professor in
2019, respectively. Since August 2022, he has been a
Global STEM Scholar and Chair Professor with the Department of Computer
Science at City University of Hong Kong.
Prof. Fang received many awards including the US NSF CAREER Award
(2001), US ONR Young Investigator Award (2002), 2018 IEEE Vehicu-
lar Technology Outstanding Service Award, IEEE Communications Society
AHSN Technical Achievement Award (2019), CISTC Technical Recognition
Award (2015), and WTC Recognition Award (2014), and 2010-2011 UF
Doctoral Dissertation Advisor/Mentoring Award. He held multiple profes-
sorships including the Changjiang Scholar Chair Professorship (2008-2011),
Tsinghua University Guest Chair Professorship (2009-2012), University of
Florida Foundation Preeminence Term Professorship (2019-2022), and Uni-
versity of Florida Research Foundation Professorship (2017-2020, 2006-
2009). He served as the Editor-in-Chief of IEEE Transactions on Vehicular
Technology (2013-2017) and IEEE Wireless Communications (2009-2012)
and serves/served on several editorial boards of journals including Proceedings
of the IEEE (2018-present), ACM Computing Surveys (2017-present), ACM
Transactions on Cyber-Physical Systems (2020-present), IEEE Transactions
on Mobile Computing (2003-2008, 2011-2016, 2019-present), IEEE Transac-
tions on Communications (2000-2011), and IEEE Transactions on Wireless
Communications (2002-2009). He served as the Technical Program Co-
Chair of IEEE INFOCOM’2014. He is a Member-at-Large of the Board of
Governors of IEEE Communications Society (2022-2024) and the Director of
Magazines of IEEE Communications Society (2018-2019). He is a fellow of
ACM and AAAS.