PIB: Prioritized Information Bottleneck
Framework for Collaborative Edge Video Analytics
Zhengru Fang, Senkang Hu, Liyan Yang, Yiqin Deng, Xianhao Chen, Yuguang Fang
City University of Hong Kong, Hong Kong, China
Shandong University, Jinan, China, The University of Hong Kong, Hong Kong, China
Email: {zhefang4-c, senkang.forest, liyanyang3-c}@my.cityu.edu.hk,
yiqin.deng@email.sdu.edu.cn, xchen@eee.hku.hk, my.fang@cityu.edu.hk
Abstract—Collaborative edge sensing systems, particularly in
collaborative perception systems in autonomous driving, can sig-
nificantly enhance tracking accuracy and reduce blind spots with
multi-view sensing capabilities. However, their limited channel
capacity and the redundancy in sensory data pose significant chal-
lenges, affecting the performance of collaborative inference tasks.
To tackle these issues, we introduce a Prioritized Information
Bottleneck (PIB) framework for collaborative edge video analytics.
We first propose a priority-based inference mechanism that
jointly considers the signal-to-noise ratio (SNR) and the camera’s
coverage area of the region of interest (RoI). To enable efficient
inference, PIB reduces video redundancy in both spatial and
temporal domains and transmits only the essential information
for the downstream inference tasks. This eliminates the need
to reconstruct videos on the edge server while maintaining low
latency. Specifically, it derives compact, task-relevant features by
employing the deterministic information bottleneck (IB) method,
which strikes a balance between feature informativeness and
communication costs. Given the computational challenges caused
by IB-based objectives with high-dimensional data, we resort to
variational approximations for feasible optimization. Compared to
TOCOM-TEM, JPEG, and HEVC, PIB achieves an improvement
of up to 15.1% in mean object detection accuracy (MODA)
and reduces communication costs by 66.7% when edge cameras
experience poor channel conditions.
Index Terms—Collaborative edge inference, information bottle-
neck, network compression, variational approximations.
I. INTRODUCTION
Video analytics is rapidly transforming various sectors such
as urban planning, retail analysis, and autonomous navigation
by converting visual data streams into useful insights [1]. When
cameras are deployed for monitoring, they tend to produce vast
amounts of video data constantly. There is often a requirement
for quick analysis of these real-time streams [2], [3]. Further-
more, numerous developing applications such as remote patient
care [4], autonomous driving [5], and virtual reality heavily
depend on efficient video analytics with minimal delay.
The increasing number of deployed smart devices requires a
computational paradigm shift towards edge processing. This
approach involves handling data closer to its source, leading
to several benefits compared to traditional cloud-based models.
Specifically, this method significantly reduces latency, which
is crucial in time-sensitive tasks such as detecting strokes or
heart attacks. For example, as per the research conducted by Corneo
et al., utilizing remote cloud services for data processing may
result in a 30% increase in latency when compared to local data
handling [6]. Moreover, the importance of privacy, particularly
in regions with strict data protection laws such as the General
Data Protection Regulation (GDPR), makes edge computing
even more attractive [7]. According to the Ponemon Institute,
60% of companies express apprehension toward cloud security
and decide to manage their own data on-site in order to mitigate
potential risks [8].
However, the integration of edge devices into video
analytics also introduces significant challenges [9]. The
computational demands of deep neural network (DNN) models,
such as GoogLeNet [10], which requires about 1.5 billion
operations per image classification, place a substantial burden
on the limited processing capacities of edge devices [11].
Additionally, the outputs from high-resolution cameras increase
the communication load. For example, a 4K video stream
requires up to 18 Gbps of bandwidth to transmit raw video
data, potentially overwhelming the capacity of existing wireless
networks [12].
The current communication strategies for integrating edge
devices into video analytics ecosystems are not effective
enough. One major issue is how to handle the computational
complexity and transmission of redundant data generated from
the overlapping fields of view (FOVs) from multiple cameras.
In scenarios with dense camera deployments, up to 60% of data
can be redundant due to overlapping FOV, which unnecessarily
overburdens the network [13]. In addition, these strategies often
lack adaptability in transmitting tailored data features based on
Region of Interest (RoI) and signal-to-noise ratio (SNR), re-
sulting in poor video fusion or alignment. These limitations can
negatively impact collaborative perception, sometimes making
it less effective than single-camera setups.
In this paper, we aim to develop novel multi-camera video
analytics by prioritizing wireless video transmissions. Our
proposed Prioritized Information Bottleneck (PIB) framework
attempts to effectively leverage SNR and RoI to selectively
transmit data features, significantly reducing computational
load and data transmissions. Our method can decrease data
transmissions by up to 66.7%, while simultaneously enhanc-
ing the mean object detection accuracy (MODA) compared
to current state-of-the-art techniques. This approach not only
compresses data but also intelligently selects data for processing
to ensure that only relevant information is transmitted, thus
mitigating noise-induced inaccuracies in collaborative sensing
scenarios. This innovation sets a new benchmark for efficient
and accurate video analytics at the edge.
[Fig. 1: System model. Edge cameras with overlapping FoVs cover a region of interest and transmit bitstreams, each subject to its own delay, to an edge server for pedestrian occupancy prediction.]
II. SYSTEM MODEL AND PROBLEM FORMULATION
As illustrated in Figure 1, our system comprises a set of edge cameras, denoted as $\mathcal{K} = \{1, 2, \ldots, K\}$. These cameras are deployed across various scenes $\mathcal{S} = \{s_1, s_2, \ldots, s_S\}$, each with a specific Field of View (FoV), $\text{FoV}_k$, covering a subset of the total monitored area. The union of FoVs from all cameras covering a scene $s$ ensures $\bigcup_{k \in F(s)} \text{FoV}_k \supseteq s$ for comprehensive surveillance. For example, in a high-density pedestrian environment, our goal is to facilitate collaborative perception for pedestrian occupancy prediction under the constraints of limited channel capacity due to poor channel conditions.
A. Communication Model
Given the dense deployment and high device density, we
adopt Frequency Division Multiple Access (FDMA) to man-
age the communication among cameras, defining the channel
capacity Ckfor each camera kusing the SNR-based Shannon
capacity:
Ck=Bklog2(1 + SNRk),(1)
where Bkis the bandwidth allocated to camera kand SNRkis
its signal-to-noise ratio. The transmission delay dk, critical for
real-time applications, is calculated as:
dk=D
Ck
,(2)
where Dis the fixed data amount to be transmitted. This delay
inversely correlates with Ck, emphasizing the need for efficient
bandwidth allocation and SNR optimization.
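To make the rate-delay relation concrete, the following minimal Python sketch evaluates Eqs. (1) and (2). The 5 dB SNR and the 30 KB payload are illustrative values chosen here, not figures from the paper.

```python
import math

def channel_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon capacity C_k = B_k * log2(1 + SNR_k), Eq. (1)."""
    snr_linear = 10 ** (snr_db / 10.0)  # convert dB to a linear ratio
    return bandwidth_hz * math.log2(1.0 + snr_linear)

def transmission_delay(data_bits: float, capacity_bps: float) -> float:
    """Transmission delay d_k = D / C_k, Eq. (2)."""
    return data_bits / capacity_bps

# Example: the 2 MHz bandwidth of Sec. IV and an assumed 5 dB SNR link
C = channel_capacity(2e6, 5.0)            # about 4.1 Mbps
d = transmission_delay(30e3 * 8, C)       # assumed 30 KB feature payload
print(f"capacity = {C / 1e6:.2f} Mbps, delay = {d * 1e3:.1f} ms")
```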
B. Priority Weight Formulation
Dynamic priority weighting is crucial in optimizing network
resource allocation. We employ a dual-layer Multilayer Percep-
tron (MLP) to compute priority weights based on normalized
delay and coverage:
pk=MLP(dnorm,k,Coveragenorm,k ; ΘM),(3)
where pkdenotes the computed priority score for camera k, and
ΘMrepresents the trainable parameters of MLP. This MLP’s
architecture, featuring two layers, allows for modeling the
interactions between delay and coverage effectively. Besides,
we have dnorm,k =dk
dmax and Coveragenorm,k =COkC OL
COUC OL.
COkrepresents the coverage area provided by camera kwithin
the Region of Interest (RoI), with COUand COLdenoting the
upper and lower bounds of desired coverage, respectively.
To transform the raw priority scores into a usable format within the system, we apply a softmax function, which normalizes these scores into a set of weights that sum to one:
$$w_k = \frac{e^{p_k}}{\sum_{j=1}^{K} e^{p_j}}, \tag{4}$$
where $w_k$ signifies the priority weight for camera $k$. This method ensures that cameras which are more critical, either due to higher coverage or due to lower delays, are given higher priority, thereby enhancing the decision-making capabilities and responsiveness of the edge analytics system.
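A minimal PyTorch sketch of Eqs. (3) and (4) follows. The hidden width of 16 and the ReLU activation are assumptions; the paper only specifies a two-layer MLP over the normalized delay and coverage.

```python
import torch
import torch.nn as nn

class PriorityWeightModule(nn.Module):
    """Maps (d_norm_k, coverage_norm_k) to a priority score p_k (Eq. (3));
    a softmax over cameras then yields the weights w_k (Eq. (4))."""
    def __init__(self, hidden_dim: int = 16):  # hidden width is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar priority score per camera
        )

    def forward(self, d_norm: torch.Tensor, cov_norm: torch.Tensor) -> torch.Tensor:
        x = torch.stack([d_norm, cov_norm], dim=-1)  # shape (K, 2)
        p = self.mlp(x).squeeze(-1)                  # scores p_k, shape (K,)
        return torch.softmax(p, dim=-1)              # weights w_k sum to one

# Example with K = 3 cameras and illustrative normalized inputs
module = PriorityWeightModule()
w = module(torch.tensor([0.2, 0.9, 0.5]), torch.tensor([0.8, 0.3, 0.6]))
print(w, w.sum())  # three weights summing to 1
```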
C. Video Feature Generation
In this paper, MVDet serves as the backbone network for
multi-camera perception fusion [14]. Firstly, edge cameras
capture data and extract feature maps, which are subsequently
encoded and sent via wireless channels to an edge server
through base stations. Upon reception, the server decodes the
data and uses a transformation matrix to perform coordinate
transformations on the multi-view features, integrating them
into a unified feature. The process concludes with the appli-
cation of large kernel convolutions on the ground plane feature
map, culminating in the final occupancy decision. This deci-
sion includes predicting the events of interest, e.g., pedestrian
locations, enhancing the system’s perception accuracy in multi-
camera setups.
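The coordinate transformation can be sketched as a homography-based warp of each camera's feature map onto the ground plane. This is a minimal stand-in that assumes a known 3x3 matrix H per camera mapping ground-plane coordinates to image-plane pixels; MVDet's actual perspective transformation and fusion details are in [14].

```python
import torch
import torch.nn.functional as F

def project_to_ground(feat: torch.Tensor, H: torch.Tensor,
                      out_h: int, out_w: int) -> torch.Tensor:
    """Warp a camera-view feature map (1, C, h, w) onto an (out_h, out_w)
    ground-plane grid, given a homography H (ground pixel -> image pixel)."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(out_h, dtype=torch.float32),
                            torch.arange(out_w, dtype=torch.float32),
                            indexing="ij")
    ones = torch.ones_like(xs)
    ground = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)  # homogeneous
    img = (H @ ground.T).T                          # project to the image plane
    img = img[:, :2] / img[:, 2:3].clamp(min=1e-6)  # dehomogenize
    img[:, 0] = img[:, 0] / (w - 1) * 2 - 1         # normalize x to [-1, 1]
    img[:, 1] = img[:, 1] / (h - 1) * 2 - 1         # normalize y to [-1, 1]
    grid = img.reshape(1, out_h, out_w, 2)
    return F.grid_sample(feat, grid, align_corners=True)

# Example: project a 128-channel feature map onto a 120x360 ground grid
feat = torch.randn(1, 128, 90, 160)
H = torch.eye(3)  # placeholder; a calibrated homography is used in practice
ground_feat = project_to_ground(feat, H, 120, 360)
```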
D. Prioritized Information Bottleneck Formulation
In the context of information theory, the Information Bottleneck (IB) method seeks an optimal trade-off between the compression of an input variable $X$ and the preservation of relevant information for an output variable $Y$ [15]. We formalize the input data from camera $k$ as $X^{(k)}$, the extracted features as $Z^{(k)}$, and the target prediction as $Y^{(k)}$, corresponding to the pedestrian occupancy in the dataset $\mathcal{D}$. The aim is to encode $X^{(k)}$ into a meaningful and concise representation $Z^{(k)}$, a hidden representation that captures the essence of the multi-view content for prediction tasks. The classical IB problem can be formulated as a constrained optimization task:
$$\max_{\Theta} \sum_{k=1}^{K} I\big(Z^{(k)}; Y^{(k)}\big) \quad \text{s.t.} \quad I\big(X^{(k)}; Z^{(k)}\big) \le I_c, \quad (k = 1, 2, \cdots, K), \tag{5}$$
where $I(Z^{(k)}; Y^{(k)})$ denotes the mutual information between the two random variables $Z^{(k)}$ and $Y^{(k)}$, and $\Theta$ represents the set of all learnable parameters in the PIB framework, including $\Theta_M$ and the variational approximation parameters introduced in the following section. The mutual information is essentially a measure of the amount of information obtained about one random variable through the other random variable. $I_c$ is the maximum permissible mutual information that $Z^{(k)}$ can contain about $X^{(k)}$.

[Fig. 2: The procedure of video encoding. Raw video data $X_t^{(k)}$ passes through feature map generation, the multi-frame correlation model $q(z_t^{(k)} | z_{t-1}^{(k)}, z_{t-2}^{(k)}, \ldots, z_{t-\tau}^{(k)})$, the priority weight module producing $w^{(k)}$, and entropy coding to yield the transmitted bitstream.]

The objective is to ensure that $Z^{(k)}$ captures the most relevant information about $X^{(k)}$ for predicting $Y^{(k)}$ while remaining as concise as possible. Introducing a Lagrange multiplier $\lambda$, the problem is equivalently expressed as:
$$\max_{\Theta} \mathcal{R}_{IB} = \sum_{k=1}^{K} \Big[ I\big(Z^{(k)}; Y^{(k)}\big) - \lambda \cdot I\big(X^{(k)}; Z^{(k)}\big) \Big], \tag{6}$$
where $\mathcal{R}_{IB}$ represents the IB functional, balancing the compression of $X^{(k)}$ against the necessity of accurately predicting $Y^{(k)}$. Then, we extend the IB framework to a multi-camera setting by introducing priority weights to the mutual information terms, adapting the optimization for an edge analytics network:
$$\min_{\Theta} \sum_{k=1}^{K} \Big[ -I_w\big(Z^{(k)}; Y^{(k)}\big) + \lambda\, I_w\big(X^{(k)}; Z^{(k)}\big) \Big], \tag{7}$$
where the weighted mutual information terms are defined as $I_w\big(Z^{(k)}; Y^{(k)}\big) = w_k \cdot I\big(Z^{(k)}; Y^{(k)}\big)$ and $I_w\big(X^{(k)}; Z^{(k)}\big) = e^{w_0 - w_k} \cdot I\big(X^{(k)}; Z^{(k)}\big)$. The non-negative value $w_0$ represents the maximum of the weight parameters $w_k$.
The first term, with linear weights, $I_w\big(Z^{(k)}; Y^{(k)}\big)$, indicates the weighted mutual information between the compressed representation $Z^{(k)}$ from camera $k$ and the target $Y^{(k)}$. Linear weighting by $w_k$ ensures each camera's influence is proportional to its priority, with higher $w_k$ values increasing the emphasis on $I\big(Z^{(k)}; Y^{(k)}\big)$ in the objective function, emphasizing cameras that provide high-quality data for precise predictions. The second term, with negative exponential weights, $I_w\big(X^{(k)}; Z^{(k)}\big)$, measures the mutual information between the original $X^{(k)}$ and its compressed $Z^{(k)}$, scaled by $e^{w_0 - w_k}$. This ensures that the weight on $I\big(X^{(k)}; Z^{(k)}\big)$ decays exponentially as $w_k$ rises. Cameras with lower $w_k$ undergo more aggressive data compression, optimizing bandwidth and storage without significantly impacting overall system performance. This weighting approach, chosen for this proof of concept, will be further explored with more general methods in future work.
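The following sketch shows how the two weighting schemes enter the minimization in Eq. (7), assuming per-camera mutual-information estimates are already available (in practice they come from the variational bounds of Sec. III). The helper name and the toy values are illustrative, not from the paper.

```python
import torch

def weighted_ib_objective(mi_zy: torch.Tensor, mi_xz: torch.Tensor,
                          w: torch.Tensor, lam: float) -> torch.Tensor:
    """Eq. (7): sum_k [ -w_k I(Z_k;Y_k) + lam e^(w0 - w_k) I(X_k;Z_k) ]."""
    w0 = w.max().detach()                     # w0: maximum weight parameter
    relevance = w * mi_zy                     # linear weighting, I_w(Z;Y)
    compression = torch.exp(w0 - w) * mi_xz   # exponential weighting, I_w(X;Z)
    return (-relevance + lam * compression).sum()

# Toy example: three cameras with assumed MI estimates (in nats)
w = torch.tensor([0.5, 0.3, 0.2])
loss = weighted_ib_objective(torch.tensor([2.0, 1.5, 1.0]),
                             torch.tensor([8.0, 6.0, 5.0]), w, lam=0.1)
```

Note how a camera with a small $w_k$ receives a large $e^{w_0 - w_k}$ factor, so its $I(X;Z)$ term dominates and the optimizer compresses it more aggressively, matching the discussion above.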
III. METHODOLOGY
A. Architecture Summary
In this subsection, we outline the workflow of our PIB framework, designed for collaborative edge video analytics. As depicted in Fig. 2, the process starts with each edge camera (denoted by $k$) capturing raw video data $X^{(k)}_t$ and extracting feature maps. These cameras utilize priority weights $w_k$ to optimize the balance between communication costs and perception accuracy, adapting to varying channel conditions. The extracted features are then compressed using entropy coding and sent as a bitstream to the edge server for further processing. At the server (see Fig. 3), the video features are reconstructed using shared parameters such as the weights $w_k$ and the variational model parameters $q(Z^{(k)}_t | Z^{(k)}_{t-1}, \ldots, Z^{(k)}_{t-\tau})$. The server integrates these multi-view features to estimate pedestrian occupancy $Y_t$. This approach leverages historical frame correlations through a multi-frame correlation model to enhance prediction accuracy.

[Fig. 3: The procedure of video decoding. The received bitstream is entropy-decoded with the multi-frame correlation model and the priority weight module, and the reconstructed features $Z_t^{(k)}$ are fused across views to predict pedestrian occupancy $\hat{Y}_t^{(k)}$.]
B. Information Bottleneck Analysis

The objective function of the information bottleneck in Eq. (7) can be divided into two parts. The first part is $\sum_{k=1}^{K} w_k \cdot I\big(Z^{(k)}; Y^{(k)}\big)$, which denotes the quality of the reconstruction obtained by decoding at the edge server. The second part is $\lambda \sum_{k=1}^{K} e^{w_0 - w_k} \cdot I\big(X^{(k)}; Z^{(k)}\big)$, which denotes the compression efficiency of the feature extraction. In practice, the decoder distribution $p\big(Y^{(k)} | Z^{(k)}\big)$ can be any valid conditional distribution, but it is usually not smooth enough for straightforward calculation. Because of this complexity, it is highly challenging to directly evaluate and optimize the two mutual information components in Eq. (7). Accordingly, we adopt the variational approach [16]. This approach restricts the decoder to a simpler family of distributions $Q$ and searches for a distribution $q\big(Y^{(k)} | Z^{(k)}\big)$ within this family that is closest to the optimal decoder distribution, using the KL divergence to measure closeness.
As a proof of concept, we first focus on deriving a lower bound for the mutual information $I\big(Z^{(k)}; Y^{(k)}\big)$ based on alternative probability distributions. We start with the standard definition of mutual information¹:
$$I(Z; Y) = \mathbb{E}_{p(Y,Z)}\left[\log \frac{p(Y|Z)}{p(Y)}\right]. \tag{8}$$

¹For simplicity, we omit the superscript $(k)$ when deriving the lower bound of $I(Z;Y)$.
We then introduce the Kullback-Leibler (KL) divergence, which is always non-negative and measures how well the distribution $q(Y|Z)$ approximates the true distribution $p(Y|Z)$:
$$D_{KL}\big[p(Y|Z)\,\|\,q(Y|Z)\big] = \mathbb{E}_{p(Y|Z)}\left[\log \frac{p(Y|Z)}{q(Y|Z)}\right] \ge 0, \tag{9}$$
where the variational method uses a weighted exponential family variational distribution $q(Y|Z)$, parameterized by neural network parameters $\Phi$, designed to approximate the true conditional distribution $p(Y|Z)$ while providing a computationally tractable lower bound for the mutual information. Eq. (9) leads to the inequality:
$$\mathbb{E}_{p(Y|Z)}\big[\log p(Y|Z)\big] \ge \mathbb{E}_{p(Y|Z)}\big[\log q(Y|Z)\big], \tag{10}$$
where $p(Y|Z)$ can be replaced by $p(Y, Z)$. The relationship between the joint and conditional probabilities facilitates the simplification of the expression for mutual information:
$$\mathbb{E}_{p(Y,Z)}\left[\log \frac{p(Y|Z)}{p(Y)}\right] = \mathbb{E}_{p(Z)}\,\mathbb{E}_{p(Y|Z)}\left[\log \frac{p(Y|Z)}{p(Y)}\right]. \tag{11}$$
Building on the inequality established by the KL divergence (9), we can express a lower bound for the mutual information:
$$I(Z; Y) \ge \mathbb{E}_{p(Y,Z)}\big[\log q(Y|Z)\big] + H(Y), \tag{12}$$
where $H(Y)$ is the entropy of $Y$, a constant that reflects the inherent uncertainty in $Y$ independent of $Z$. This formulation provides a computationally feasible lower bound for the mutual information, crucial for applications in video analytics and other areas where direct computation of mutual information is hard or even infeasible.
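In training, maximizing the bound in Ineq. (12) reduces to minimizing the decoder's negative log-likelihood, since $H(Y)$ is a constant. A minimal PyTorch sketch, assuming a per-cell Bernoulli model $q(Y|Z)$ for the binary occupancy map (a parameterization the paper does not spell out):

```python
import torch
import torch.nn.functional as F

def mi_lower_bound_term(decoder_logits: torch.Tensor,
                        occupancy: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of E_{p(Y,Z)}[log q(Y|Z)] from Ineq. (12).
    With a per-cell Bernoulli q(Y|Z), this is simply the negative binary
    cross-entropy of the decoder output."""
    return -F.binary_cross_entropy_with_logits(decoder_logits, occupancy,
                                               reduction="mean")

# Illustrative shapes: batch of 4 ground-plane maps of assumed size 120x360
logits = torch.randn(4, 1, 120, 360)                    # decoder outputs
target = torch.randint(0, 2, (4, 1, 120, 360)).float()  # occupancy labels
bound_term = mi_lower_bound_term(logits, target)        # maximize this
```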
As for the second part, we proceed by establishing an upper bound, given the complexity of directly minimizing the term $\lambda \sum_{k=1}^{K} e^{w_0 - w_k} \cdot I\big(X^{(k)}; Z^{(k)}\big)$. Recognizing that $H\big(Z^{(k)} | X^{(k)}\big) \ge 0$ from the properties of entropy, we can derive the following inequality:
$$\lambda \sum_{k=1}^{K} I_w\big(X^{(k)}; Z^{(k)}\big) \le \lambda \sum_{k=1}^{K} H\big(Z^{(k)}\big)\, e^{w_0 - w_k} \le \lambda \sum_{k=1}^{K} H\big(Z^{(k)}, V^{(k)}\big)\, e^{w_0 - w_k}, \tag{13}$$
where we use the latent variables $V^{(k)}$ as side information to encode the quantized feature, and we have used $H\big(Z^{(k)}, V^{(k)}\big) \ge H\big(Z^{(k)}\big)$. We begin by recognizing that the joint entropy $H\big(Z^{(k)}, V^{(k)}\big)$ represents the communication cost. Then, we establish an upper bound by using the non-negativity property of the KL divergence:
$$H\big(Z^{(k)}, V^{(k)}\big) \le \mathbb{E}_{p(Z^{(k)}, V^{(k)})}\Big[-\log\Big( q\big(Z^{(k)} | V^{(k)}; \Theta^{(k)}_{con}\big) \times q\big(V^{(k)}; \Theta^{(k)}_{l}\big)\Big)\Big], \tag{14}$$
where $\Theta^{(k)}_{con}$ and $\Theta^{(k)}_{l}$ are the learnable parameters of the variational distributions $q\big(Z^{(k)} | V^{(k)}; \Theta^{(k)}_{con}\big)$ and $q\big(V^{(k)}; \Theta^{(k)}_{l}\big)$, respectively, which approximate the true distributions to minimize the communication cost while capturing the essential feature relations for inference. By substituting Eq. (14) into Eq. (13), we obtain the upper bound for the second term in Eq. (7), given by
$$I_w\big(X^{(k)}; Z^{(k)}\big) \le \mathbb{E}_{p(Z^{(k)}, V^{(k)})}\Big[-\log\Big( q\big(Z^{(k)} | V^{(k)}; \Theta^{(k)}_{con}\big) \times q\big(V^{(k)}; \Theta^{(k)}_{l}\big)\Big)\Big]\, e^{w_0 - w_k}. \tag{15}$$
It should be noted that deriving the lower bound in Ineq. (12) and the upper bound in Ineq. (15) enables us to establish an upper limit on the objective function of the minimization problem in (7). This makes it easier to minimize via the corresponding loss functions during network training, as discussed in Sec. III-D.
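The bound in Eq. (14) is the cross-entropy an entropy coder pays when it is driven by the variational model, so it doubles as a differentiable estimate of the communication cost. A minimal sketch, with random stand-ins for the learned log-likelihoods:

```python
import torch

def rate_upper_bound(log_q_z_given_v: torch.Tensor,
                     log_q_v: torch.Tensor) -> torch.Tensor:
    """Upper bound on H(Z, V) from Eq. (14): E[-log q(Z|V) q(V)], i.e. the
    expected code length (in nats) under the variational entropy model."""
    return (-(log_q_z_given_v + log_q_v)).sum(dim=-1).mean()

# The log-likelihoods would come from q(Z|V; Theta_con) and q(V; Theta_l);
# random negatives stand in here purely for illustration.
log_qzv = -torch.rand(8, 128)  # batch of 8, 128-dimensional features
log_qv = -torch.rand(8, 128)
nats = rate_upper_bound(log_qzv, log_qv)
bits = nats / torch.log(torch.tensor(2.0))  # convert nats to bits
```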
C. Multi-Frame Correlation Model
Inspired by the previous work [9], we utilize a multi-
frame correlation model that leverages variational approxima-
tion to capture the temporal dynamics in video sequences.
This approach utilizes the temporal redundancy across con-
tiguous frames to model the conditional probability distri-
bution effectively. Our model approximates the next feature
in the sequence by considering the variational distribution
q(Z(k)
t|Z(k)
t1, ..., Z(k)
tτ; Θ(k)
τ), which can be modeled as a Gaus-
sian distribution aimed at mimicking the true conditional dis-
tribution of the subsequent frame given the previous frames:
qZ(k)
t|Z(k)
t1, ..., Z(k)
tτ; Θ(k)
τ=NµΘ(k)
τ, σ2Θ(k)
τ,
where $\mu$ and $\sigma^2$ are parametric functions of the preceding frames, encapsulating the temporal dependencies. These functions are modeled using a deep neural network with parameters $\Theta^{(k)}_{\tau}$ that are learned from data. By optimizing the variational parameters, our model aims to closely match the true distribution, thus encoding the features more efficiently.
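A minimal PyTorch sketch of the multi-frame correlation model follows. Predicting the Gaussian parameters with an MLP over a flattened window of the $\tau$ past features is an assumption; the paper only states that $\mu$ and $\sigma^2$ are outputs of a DNN with parameters $\Theta^{(k)}_{\tau}$.

```python
import torch
import torch.nn as nn

class MultiFrameCorrelation(nn.Module):
    """Variational model q(Z_t | Z_{t-1}, ..., Z_{t-tau}; Theta_tau):
    a network predicts a Gaussian over the current feature."""
    def __init__(self, feat_dim: int, tau: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * tau, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * feat_dim),  # outputs [mu, log variance]
        )

    def forward(self, past: torch.Tensor) -> torch.distributions.Normal:
        stats = self.net(past.flatten(1))     # past: (B, tau, feat_dim)
        mu, log_var = stats.chunk(2, dim=-1)
        return torch.distributions.Normal(mu, torch.exp(0.5 * log_var))

# The negative log-likelihood of the actual Z_t under this Gaussian is the
# code length an entropy coder would spend using the temporal prior.
model = MultiFrameCorrelation(feat_dim=128, tau=3)
q = model(torch.randn(4, 3, 128))   # batch of 4, tau = 3 past frames
nll = -q.log_prob(torch.randn(4, 128)).sum(-1).mean()
```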
D. Network Loss Functions Derivation
In this subsection, we design our network loss functions
to optimize the information flow in a multi-camera setting
according to the IB principle in Sec. II-D.
The first loss function $\mathcal{L}_1$ aims to minimize the upper bound of the mutual information, following the inequalities derived in (12) and (15). $\mathcal{L}_1$ ensures efficient encoding while preserving essential information for accurate prediction:
$$\mathcal{L}_1 = \sum_{k=1}^{K} \underbrace{\mathbb{E}\big[-w_k \log q\big(Y^{(k)} | Z^{(k)}\big)\big]}_{\text{upper bound of } I_w(Z^{(k)};\, Y^{(k)}) \text{ term}} + \lambda \cdot \min\Big( R_{\max},\ \underbrace{\mathbb{E}\big[-\log q\big(Z^{(k)} | V^{(k)}; \Theta^{(k)}_{con}\big)\, q\big(V^{(k)}; \Theta^{(k)}_{l}\big)\big]}_{\text{upper bound of } I_w(X^{(k)};\, Z^{(k)})} \, e^{(w_0 - w_k)} \Big).$$
The first term of $\mathcal{L}_1$ excludes $H(Y)$ from Ineq. (12) because it is a constant. The second term addresses the upper bound of the communication cost required to transmit features from cameras to the edge server. $R_{\max}$ is used to clip the over-relaxation of the upper bound, bounding the excessive communication cost that would otherwise degrade the training of the decoder $p\big(Y^{(k)} | Z^{(k)}\big)$. The Multi-Frame Correlation Model of Sec. III-C leverages temporal dynamics, which is critical for sequential data processing in video analytics. The second loss function, $\mathcal{L}_2$, is derived to minimize the KL divergence between the true distribution of frame sequences and the modeled variational distribution:
$$\mathcal{L}_2 = \sum_{k=1}^{K} D_{KL}\Big[ p\big(Z^{(k)}_t | Z^{(k)}_{<t}\big) \,\Big\|\, q\big(Z^{(k)}_t | Z^{(k)}_{<t}\big) \Big], \tag{16}$$
where $Z^{(k)}_{<t} = \big(Z^{(k)}_{t-1}, \ldots, Z^{(k)}_{t-\tau}\big)$. Given the variability in channel quality and the occurrence of delays, we introduce the third loss function, $\mathcal{L}_3$, designed to minimize the impact of unreliable data sources while maximizing inference accuracy:
$$\mathcal{L}_3 = \sum_{k=1}^{K} \Big[ \mathbb{1}_{\{d_{\text{norm},k} \le \epsilon\}} \big(w_k - W_{target}\big)^2 + \mathbb{1}_{\{d_{\text{norm},k} > \epsilon\}}\, w_k^2 \Big], \tag{17}$$
where $\epsilon$ denotes a permissible delay that does not lead to errors in multi-view fusion, and $W_{target}$ represents the target weight for a camera without excessive delay. These loss functions collectively aim to optimize the trade-off between data transmission costs and perceptual accuracy, crucial for enhancing the performance of edge analytics in multi-camera systems.
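A minimal sketch of $\mathcal{L}_3$ under the reconstruction in Eq. (17); the threshold $\epsilon = 0.5$ and $W_{target} = 0.3$ are illustrative, and summing $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$ without extra coefficients is an assumed combination, since the paper does not state how the three losses are mixed.

```python
import torch

def loss_l3(w: torch.Tensor, d_norm: torch.Tensor,
            eps: float = 0.5, w_target: float = 0.3) -> torch.Tensor:
    """Eq. (17): cameras within the permissible delay eps are pulled toward
    W_target; excessively delayed cameras have their weights pushed to zero."""
    on_time = (d_norm <= eps).float()  # indicator 1{d_norm_k <= eps}
    return (on_time * (w - w_target) ** 2 + (1 - on_time) * w ** 2).sum()

# Assumed overall training objective (mixing weights not given in the paper):
# loss = loss_l1 + loss_l2 + loss_l3(w, d_norm)
w = torch.tensor([0.5, 0.3, 0.2])
d_norm = torch.tensor([0.2, 0.9, 0.4])  # camera 2 exceeds the delay budget
print(loss_l3(w, d_norm))
```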
IV. PERFORMANCE EVALUATION
A. Simulation Setup
We set up simulations to evaluate our PIB framework, aimed
at predicting pedestrian occupancy in urban settings using mul-
tiple cameras. These simulations replicate a city environment,
with variables like signal frequency and device density affecting
the outcomes.
Our simulations use a 2.4 GHz operating frequency, a path
loss exponent of 3.5, and a shadowing deviation of 8 dB.
Devices emit an interference power of 0.1 Watts, with densities
ranging from 10 to 100 devices per 100 square meters, allowing
us to test different levels of congestion. The bandwidth is set at
2 MHz, with cameras located about 500 meters from the edge
server. We employ the Wildtrack dataset from EPFL, which
features high-resolution images from seven cameras located in
a public area, capturing unscripted pedestrian movements [17].
This dataset provides 400 frames per camera at 2 frames per
second, documenting over 40,000 bounding boxes that highlight
individual movements across more than 300 pedestrians.
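For context, here is a sketch of how the stated wireless parameters could be turned into a per-camera SNR. The log-distance model with a 1 m free-space reference, the thermal-noise floor for 2 MHz, the omission of interference, and the 0.1 W transmit power are all assumptions made for illustration only.

```python
import math
import random

def snr_db(tx_power_w: float = 0.1,      # assumed transmit power
           distance_m: float = 500.0,    # camera-to-server distance (Sec. IV)
           freq_hz: float = 2.4e9,       # operating frequency (Sec. IV)
           path_loss_exp: float = 3.5,   # path loss exponent (Sec. IV)
           shadow_sigma_db: float = 8.0  # shadowing deviation (Sec. IV)
           ) -> float:
    """Log-distance path loss with log-normal shadowing; interference from
    nearby devices is ignored in this simplified sketch."""
    c = 3e8
    fspl_1m_db = 20 * math.log10(4 * math.pi * freq_hz / c)  # loss at 1 m
    path_loss_db = fspl_1m_db + 10 * path_loss_exp * math.log10(distance_m)
    shadowing_db = random.gauss(0.0, shadow_sigma_db)
    noise_dbm = -174 + 10 * math.log10(2e6)  # thermal noise over 2 MHz
    tx_dbm = 10 * math.log10(tx_power_w * 1e3)
    return tx_dbm - path_loss_db - shadowing_db - noise_dbm

print(f"sample SNR: {snr_db():.1f} dB")  # low values reflect poor channels
```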
The primary measure we use is the multi-object detection
accuracy (MODA), which assesses the system’s ability to accu-
rately detect pedestrians based on missed and false detections.
We also look at the rate-performance tradeoff to understand
how communication overhead affects system performance.
For comparative analysis, we consider three baselines:
- TOCOM-TEM [9]: A task-oriented communication framework using a temporal entropy model for edge video analytics. It leverages the deterministic Information Bottleneck principle to extract and transmit compact, task-relevant features, integrating spatial-temporal data on the server for enhanced inference accuracy.

- JPEG [18]: A widely adopted image compression standard that reduces the data size of digital images via lossy compression algorithms, commonly used for reducing the communication load in networked camera systems.

- High Efficiency Video Coding (HEVC) [19]: Also known as H.265 and MPEG-H Part 2, this standard provides up to 50% better data compression than its predecessor AVC (H.264 or MPEG-4 Part 10) at the same video quality, which is critical for efficient data transmission in high-density camera networks.

Our code will be available at github.com/fangzr/PIB-Prioritized-Information-Bottleneck-Framework.
In the simulation study, we examine the effectiveness of
multiple camera systems in forecasting pedestrian presence.
Unlike a single-camera configuration, this method minimizes
obstructions commonly found in crowded locations by integrat-
ing perspectives from various angles. Nevertheless, this benefit
[Fig. 4: Communication Cost vs MODA for PIB (ours), TOCOM-TEM, JPEG, and HEVC.]
[Fig. 5: Delayed cameras vs MODA, with PIB improvements of +5.6% and +15.1% annotated.]
is accompanied by heightened communication overhead. In Fig.
4, we observe the relationship between communication costs
and MODA, a metric for multi-camera perception. The PIB
algorithm exhibits a higher MODA across varying commu-
nication costs when compared to TOCOM-TEM, JPEG, and
HEVC. This superior performance can be attributed to PIB’s
strategic fusion of multi-view features, which is informed by
both channel quality and the selection of ROI with appropriate
priorities. By prioritizing information, PIB effectively mitigates
the detrimental effects of delayed information that could poten-
tially degrade the perception accuracy in multi-camera systems.
Fig. 5 depicts the performance of different compression techniques in a multi-view scenario as a function of the number of delayed cameras. Our proposed PIB method and TOCOM-TEM, both utilizing multi-frame correlation models, successfully reduce redundancy across multiple frames, achieving superior MODA at equivalent compression rates. PIB, in particular, utilizes a prioritized IB framework, a technique that enables an adaptive balance between compression rate and collaborative sensing accuracy, optimizing MODA across various channel conditions. It is worth noting that HEVC did not consistently outperform JPEG in our setup: our HEVC baseline relies on the HEIF algorithm derived from HEVC, which inadequately supports the motion prediction module, resulting in compromised performance.
In Fig. 6, we analyze the impact of increasing the number of delayed cameras on the communication cost for various algorithms.

[Fig. 6: Delayed cameras vs communication cost, with PIB reductions of -57.8% and -66.7% annotated.]

The PIB algorithm demonstrates a significant reduction in communication costs as the number of delayed cameras grows. This efficiency is due to the algorithm's priority mechanism that adeptly assigns weights, filtering out the adverse
information caused by delays. Consequently, PIB prioritizes the
transmission of high-quality features from cameras with more
accurate occupancy predictions. When compared to TOCOM-
TEM, PIB achieves a remarkable 66.7% decrease in communi-
cation costs while still retaining the precision of multi-camera
pedestrian occupancy predictions. For a fair comparison, both
JPEG and HEVC methods were set to a uniform compression
threshold of 30 KB in this experiment. However, as indicated
in Fig. 5, they have not surpassed the performance of PIB and
TOCOM-TEM.
V. CONCLUSION
In this paper, we have proposed the Prioritized Information
Bottleneck (PIB) framework as a robust solution for collabora-
tive edge video analytics. Our contributions are two-fold. First,
we developed a prioritized inference mechanism to intelligently
determine the importance of different cameras' FoVs, effectively
addressing the constraints imposed by channel capacity and
data redundancy. Second, the PIB framework showcases its
effectiveness by notably decreasing communication overhead
and improving tracking accuracy without requiring video recon-
struction at the edge server. Extensive numerical results show
that: PIB not only surpasses the performance of conventional
methods like TOCOM-TEM, JPEG, and HEVC with a marked
improvement of up to 15.1% in MODA but also achieves a
considerable reduction in communication costs by 66.7%, while
retaining low latency and high-quality multi-view sensory data
processing under less favorable channel conditions.
VI. ACKNOWLEDGEMENT
This work was supported in part by the Hong Kong SAR
Government under the Global STEM Professorship and Re-
search Talent Hub, the Hong Kong Jockey Club under the Hong
Kong JC STEM Lab of Smart City (Ref.: 2023-0108), and
the Hong Kong Innovation and Technology Commission under
InnoHK Project CIMDA. The work of Y. Deng was supported
in part by the National Natural Science Foundation of China
under Grant No. 62301300. The work of X. Chen was supported
in part by HKU-SCF FinTech Academy R&D Funding.
REFERENCES
[1] A. Padmanabhan, N. Agarwal, A. Iyer, G. Ananthanarayanan, Y. Shu, N. Karianakis, G. H. Xu, and R. Netravali, "Gemel: Model merging for memory-efficient, real-time video analytics at the edge," in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, 2023, pp. 973–994.
[2] Z. Fang, J. Wang, Y. Ren, Z. Han, H. V. Poor, and L. Hanzo, "Age of information in energy harvesting aided massive multiple access networks," IEEE Journal on Selected Areas in Communications, vol. 40, no. 5, pp. 1441–1456, May 2022.
[3] Z. Fang, J. Wang, J. Du, X. Hou, Y. Ren, and Z. Han, "Stochastic optimization-aided energy-efficient information collection in Internet of Underwater Things networks," IEEE Internet of Things Journal, vol. 9, no. 3, pp. 1775–1789, Feb. 2021.
[4] H. Wang, J. Huang, G. Wang, H. Lu, and W. Wang, "Contactless patient care using hospital IoT: CCTV camera based physiological monitoring in ICU," IEEE Internet of Things Journal, vol. 11, no. 4, pp. 5781–5797, Aug. 2023.
[5] Z. Fang, S. Hu, H. An, Y. Zhang, J. Wang, H. Cao, X. Chen, and Y. Fang, "PACP: Priority-aware collaborative perception for connected and autonomous vehicles," IEEE Transactions on Mobile Computing, (DOI: 10.1109/TMC.2024.3449371), Aug. 2024.
[6] L. Corneo, N. Mohan, A. Zavodovski, W. Wong, C. Rohner, P. Gunningberg, and J. Kangasharju, "(How much) can edge computing change network latency?" in IFIP Networking Conference (IFIP Networking), Espoo and Helsinki, Finland: IEEE, Jun. 2021, pp. 1–9.
[7] L. Marelli and G. Testa, "Scrutinizing the EU general data protection regulation," Science, vol. 360, no. 6388, pp. 496–498, May 2018.
[8] Ponemon Institute, "New Ponemon Institute study finds 60% of IT and security leaders are not confident in their ability to secure access to cloud environments," https://www.securitymagazine.com/articles/98044-60-of-cybersecurity-leaders-not-confident-in-their-cloud-security-tactics, 2021, accessed: 2022-07-20.
[9] J. Shao, X. Zhang, and J. Zhang, "Task-oriented communication for edge video analytics," IEEE Transactions on Wireless Communications, vol. 23, no. 5, pp. 4141–4154, May 2024.
[10] M. Al-Qizwini, I. Barjasteh, H. Al-Qassab, and H. Radha, "Deep learning algorithm for autonomous driving using GoogLeNet," in IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, Jun. 2017, pp. 89–96.
[11] K. Gao, H. Wang, H. Lv, and W. Liu, "Localization-oriented digital twinning in 6G: A new indoor-positioning paradigm and proof-of-concept," IEEE Transactions on Wireless Communications, 2024.
[12] A. Yaqoob, T. Bi, and G.-M. Muntean, "A survey on adaptive 360 video streaming: Solutions, challenges and opportunities," IEEE Communications Surveys & Tutorials, vol. 22, no. 4, pp. 2801–2838, 2020.
[13] Z. Jiang, X. Zhang, Y. Xu, Z. Ma, J. Sun, and Y. Zhang, "Reinforcement learning based rate adaptation for 360-degree video streaming," IEEE Transactions on Broadcasting, vol. 67, no. 2, pp. 409–423, Oct. 2020.
[14] Y. Hou, L. Zheng, and S. Gould, "Multiview detection with feature perspective transformation," in The European Conference on Computer Vision (ECCV), Glasgow, Scotland, 2020, pp. 1–18.
[15] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint physics/0004057, 2000.
[16] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, "Deep variational information bottleneck," in Int. Conf. on Learning Representations (ICLR), Toulon, France, Apr. 2017, pp. 1–9.
[17] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret, "Wildtrack: A multi-camera HD dataset for dense unscripted pedestrian detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun. 2018, pp. 5030–5039.
[18] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, Feb. 1992.
[19] F. Bossen, B. Bross, K. Sühring, and D. Flynn, "HEVC complexity and implementation analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1685–1696, Oct. 2012.