PIB: Prioritized Information Bottleneck
Framework for Collaborative Edge Video Analytics
Zhengru Fang⋆, Senkang Hu⋆, Liyan Yang⋆, Yiqin Deng‡, Xianhao Chen†, Yuguang Fang⋆
⋆City University of Hong Kong, Hong Kong, China
‡Shandong University, Jinan, China, †The University of Hong Kong, Hong Kong, China
Email: {zhefang4-c, senkang.forest, liyanyang3-c}@my.cityu.edu.hk,
yiqin.deng@email.sdu.edu.cn, xchen@eee.hku.hk, my.fang@cityu.edu.hk
Abstract—Collaborative edge sensing systems, particularly collaborative perception systems in autonomous driving, can significantly enhance tracking accuracy and reduce blind spots with
multi-view sensing capabilities. However, their limited channel
capacity and the redundancy in sensory data pose significant chal-
lenges, affecting the performance of collaborative inference tasks.
To tackle these issues, we introduce a Prioritized Information
Bottleneck (PIB) framework for collaborative edge video analytics.
We first propose a priority-based inference mechanism that
jointly considers the signal-to-noise ratio (SNR) and the camera’s
coverage area of the region of interest (RoI). To enable efficient
inference, PIB reduces video redundancy in both spatial and
temporal domains and transmits only the essential information
for the downstream inference tasks. This eliminates the need
to reconstruct videos on the edge server while maintaining low
latency. Specifically, it derives compact, task-relevant features by
employing the deterministic information bottleneck (IB) method,
which strikes a balance between feature informativeness and
communication costs. Given the computational challenges caused
by IB-based objectives with high-dimensional data, we resort to
variational approximations for feasible optimization. Compared to
TOCOM-TEM, JPEG, and HEVC, PIB achieves an improvement of up to 15.1% in multiple object detection accuracy (MODA) and reduces communication costs by 66.7% when edge cameras
experience poor channel conditions.
Index Terms—Collaborative edge inference, information bottle-
neck, network compression, variational approximations.
I. INTRODUCTION
Video analytics is rapidly transforming various sectors such
as urban planning, retail analysis, and autonomous navigation
by converting visual data streams into useful insights [1]. When
cameras are deployed for monitoring, they tend to produce vast
amounts of video data constantly. There is often a requirement
for quick analysis of these real-time streams [2], [3]. Further-
more, numerous developing applications such as remote patient
care [4], autonomous driving [5], and virtual reality heavily
depend on efficient video analytics with minimal delay.
The increasing number of deployed smart devices requires a computational paradigm shift towards edge processing. This
approach involves handling data closer to its source, leading
to several benefits compared to traditional cloud-based models.
Specifically, this method significantly reduces latency, which is crucial in time-sensitive tasks such as detecting strokes and heart attacks. For example, as per the research conducted by Corneo
et al., utilizing remote cloud services for data processing may
result in a 30% increase in latency when compared to local data
handling [6]. Moreover, the importance of privacy, particularly
in regions with strict data protection laws such as the General
Data Protection Regulation (GDPR), makes edge computing
even more attractive [7]. According to the Ponemon Institute,
60% of companies express apprehension toward cloud security and decide to manage their own data on-site in order to mitigate
potential risks [8].
However, the integration of edge devices into video analytics also introduces many significant challenges [9]. The
computational demands of deep neural network (DNN) models,
such as GoogLeNet [10], which requires about 1.5 billion
operations per image classification, place a substantial burden
on the limited processing capacities of edge devices [11].
Additionally, the outputs from high-resolution cameras increase
the communication load. For example, a 4K video stream
requires up to 18 Gbps of bandwidth to transmit raw video
data, potentially overwhelming the capacity of existing wireless
networks [12].
The current communication strategies for integrating edge
devices into video analytics ecosystems are not effective
enough. One major issue is how to handle the computational
complexity and transmission of redundant data generated from
the overlapping fields of view (FOVs) from multiple cameras.
In scenarios with dense camera deployments, up to 60% of data can be redundant due to overlapping FOVs, which unnecessarily
overburdens the network [13]. In addition, these strategies often
lack adaptability in transmitting tailored data features based on
Region of Interest (RoI) and signal-to-noise ratio (SNR), re-
sulting in poor video fusion or alignment. These limitations can
negatively impact collaborative perception, sometimes making
it less effective than single-camera setups.
In this paper, we aim to develop novel multi-camera video
analytics by prioritizing wireless video transmissions. Our
proposed Prioritized Information Bottleneck (PIB) framework
attempts to effectively leverage SNR and RoI to selectively
transmit data features, significantly reducing computational
load and data transmissions. Our method can decrease data
transmissions by up to 66.7%, while simultaneously enhancing the multiple object detection accuracy (MODA) compared
to current state-of-the-art techniques. This approach not only
compresses data but also intelligently selects data for processing
to ensure that only relevant information is transmitted, thus
mitigating noise-induced inaccuracies in collaborative sensing
scenarios. This innovation sets a new benchmark for efficient
and accurate video analytics at the edge.
Fig. 1: System model: edge cameras 1-3 with overlapping FoVs cover a region of interest and transmit bitstreams, each with delay $d_k$, to an edge server that predicts pedestrian occupancy.
II. SYSTEM MODEL AND PROBLEM FORMULATION
As illustrated in Figure 1, our system comprises a set of edge cameras, denoted as $\mathcal{K} = \{1, 2, \ldots, K\}$. These cameras are deployed across various scenes $\mathcal{S} = \{s_1, s_2, \ldots, s_S\}$, each with a specific Field of View (FoV), $\mathrm{FoV}_k$, covering a subset of the total monitored area. The union of FoVs from all cameras covering a scene $s$ ensures $\bigcup_{k \in \mathcal{F}(s)} \mathrm{FoV}_k \supseteq s$ for comprehensive surveillance. For example, in a high-density pedestrian environment, our goal is to facilitate collaborative perception for pedestrian occupancy prediction under the constraints of limited channel capacity due to poor channel conditions.
A. Communication Model
Given the dense deployment and high device density, we adopt Frequency Division Multiple Access (FDMA) to manage the communication among cameras, defining the channel capacity $C_k$ for each camera $k$ using the SNR-based Shannon capacity:
$$C_k = B_k \log_2(1 + \mathrm{SNR}_k), \tag{1}$$
where $B_k$ is the bandwidth allocated to camera $k$ and $\mathrm{SNR}_k$ is its signal-to-noise ratio. The transmission delay $d_k$, critical for real-time applications, is calculated as:
$$d_k = \frac{D}{C_k}, \tag{2}$$
where $D$ is the fixed data amount to be transmitted. This delay is inversely proportional to $C_k$, emphasizing the need for efficient bandwidth allocation and SNR optimization.
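To make the communication model concrete, the following minimal Python sketch evaluates Eqs. (1) and (2); the SNR and payload values are illustrative assumptions rather than parameters reported in this paper.

```python
import math

def channel_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon capacity C_k = B_k * log2(1 + SNR_k), Eq. (1)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

def transmission_delay_s(data_bits: float, capacity_bps: float) -> float:
    """Transmission delay d_k = D / C_k, Eq. (2)."""
    return data_bits / capacity_bps

# Illustrative values (assumed): the 2 MHz bandwidth from Sec. IV,
# a 10 dB SNR, and a 30 KB feature payload.
snr_linear = 10.0 ** (10.0 / 10.0)
C_k = channel_capacity_bps(2e6, snr_linear)   # about 6.9 Mbps
d_k = transmission_delay_s(30e3 * 8, C_k)     # 30 KB payload in bits
print(f"C_k = {C_k / 1e6:.2f} Mbps, d_k = {d_k * 1e3:.2f} ms")
```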
B. Priority Weight Formulation
Dynamic priority weighting is crucial in optimizing network resource allocation. We employ a dual-layer Multilayer Perceptron (MLP) to compute priority weights based on normalized delay and coverage:
$$p_k = \mathrm{MLP}(d_{norm,k}, \mathrm{Coverage}_{norm,k}; \Theta_M), \tag{3}$$
where $p_k$ denotes the computed priority score for camera $k$, and $\Theta_M$ represents the trainable parameters of the MLP. This MLP's two-layer architecture allows it to model the interactions between delay and coverage effectively. Besides, we have $d_{norm,k} = \frac{d_k}{d_{max}}$ and $\mathrm{Coverage}_{norm,k} = \frac{CO_k - CO_L}{CO_U - CO_L}$, where $CO_k$ represents the coverage area provided by camera $k$ within the Region of Interest (RoI), with $CO_U$ and $CO_L$ denoting the upper and lower bounds of desired coverage, respectively.
To transform the raw priority scores into a usable form within the system, we apply a softmax function, which normalizes these scores into a set of weights summing to one:
$$w_k = \frac{e^{p_k}}{\sum_{j=1}^{K} e^{p_j}}, \tag{4}$$
where $w_k$ signifies the priority weight for camera $k$. This method ensures that cameras which are more critical, either due to higher coverage or lower delay, are given higher priority, thereby enhancing the decision-making capabilities and responsiveness of the edge analytics system.
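A minimal PyTorch sketch of Eqs. (3) and (4) might look as follows; the hidden width is our assumption, since the paper specifies only a two-layer MLP over normalized delay and coverage.

```python
import torch
import torch.nn as nn

class PriorityWeightModule(nn.Module):
    """Two-layer MLP producing priority scores p_k (Eq. 3),
    normalized across cameras with a softmax (Eq. 4)."""
    def __init__(self, hidden_dim: int = 16):  # hidden width is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),  # inputs: [d_norm_k, coverage_norm_k]
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, d_norm: torch.Tensor, cov_norm: torch.Tensor) -> torch.Tensor:
        # d_norm, cov_norm: shape (K,) for K cameras
        x = torch.stack([d_norm, cov_norm], dim=-1)  # (K, 2)
        p = self.mlp(x).squeeze(-1)                  # raw scores p_k
        return torch.softmax(p, dim=0)               # weights w_k summing to one

# Usage: three cameras with different delays and coverages.
module = PriorityWeightModule()
w = module(torch.tensor([0.1, 0.5, 0.9]), torch.tensor([0.8, 0.6, 0.2]))
print(w, w.sum())  # after training, low-delay high-coverage cameras dominate
```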
C. Video Feature Generation
In this paper, MVDet serves as the backbone network for
multi-camera perception fusion [14]. Firstly, edge cameras
capture data and extract feature maps, which are subsequently
encoded and sent via wireless channels to an edge server
through base stations. Upon reception, the server decodes the
data and uses a transformation matrix to perform coordinate
transformations on the multi-view features, integrating them
into a unified feature. The process concludes with the appli-
cation of large kernel convolutions on the ground plane feature
map, culminating in the final occupancy decision. This deci-
sion includes predicting the events of interest, e.g., pedestrian
locations, enhancing the system’s perception accuracy in multi-
camera setups.
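To make this pipeline concrete, below is a toy sketch of MVDet-style perspective transformation and ground-plane fusion. It is our own illustration, not the actual MVDet implementation: the shapes are arbitrary and identity homographies stand in for the calibrated camera-to-ground projections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_to_ground_plane(feat, H_img_from_ground, out_hw):
    """Warp a (1, C, h, w) camera feature map onto the ground plane.
    H_img_from_ground: 3x3 homography mapping normalized ground-plane
    coords in [-1, 1]^2 to normalized image coords (assumed given; MVDet
    derives these from camera calibration)."""
    Hg, Wg = out_hw
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, Hg), torch.linspace(-1, 1, Wg), indexing="ij")
    g = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    uvw = g @ H_img_from_ground.T                      # project to image plane
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)       # perspective divide
    grid = uv.reshape(1, Hg, Wg, 2)
    return F.grid_sample(feat, grid, align_corners=True)

# Toy fusion: concatenate K warped views, then a large-kernel conv head.
K, C, Hg, Wg = 3, 8, 64, 64
feats = [torch.randn(1, C, 32, 32) for _ in range(K)]
homs = [torch.eye(3) for _ in range(K)]  # identity homographies for the demo
ground = torch.cat(
    [warp_to_ground_plane(f, h, (Hg, Wg)) for f, h in zip(feats, homs)], dim=1)
head = nn.Conv2d(K * C, 1, kernel_size=7, padding=3)  # large-kernel head
occupancy = head(ground)  # (1, 1, Hg, Wg) pedestrian occupancy logits
```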
D. Prioritized Information Bottleneck Formulation
In the context of information theory, the Information Bottleneck (IB) method seeks an optimal trade-off between the compression of an input variable $X$ and the preservation of relevant information about an output variable $Y$ [15]. We formalize the raw input data from camera $k$ as $X^{(k)}$, its extracted feature as $Z^{(k)}$, and the target prediction as $Y^{(k)}$, corresponding to the pedestrian occupancy labels in the dataset $\mathcal{D}$. The aim is to encode $X^{(k)}$ into a meaningful and concise representation $Z^{(k)}$, i.e., a hidden representation $z^{(k)}$ that captures the essence of the multi-view content for the prediction task. The classical IB problem can be formulated as a constrained optimization task:
$$\max_{\Theta} \sum_{k=1}^{K} I\big(Z^{(k)}; Y^{(k)}\big) \quad \text{s.t.} \quad I\big(X^{(k)}; Z^{(k)}\big) \leq I_c, \; (k = 1, 2, \cdots, K), \tag{5}$$
where $I(Z^{(k)}; Y^{(k)})$ denotes the mutual information between the two random variables $Z^{(k)}$ and $Y^{(k)}$, and $\Theta$ represents the set of all learnable parameters in the PIB framework, including $\Theta_M$ and the variational approximations in the following section. Mutual information is essentially a measure of the amount of information obtained about one random variable by observing the other. $I_c$ is the maximum permissible mutual information that $Z^{(k)}$ can contain about $X^{(k)}$.
Fig. 2: The procedure of video encoding: raw video data $X_t^{(k)}$ undergoes feature map generation, guided by the priority weight module ($w^{(k)}$) and the multi-frame correlation model $q(z_t^{(k)} \mid z_{t-1}^{(k)}, z_{t-2}^{(k)}, \cdots, z_{t-\tau}^{(k)})$, and the intermediate feature $Z_t^{(k)}$ is entropy-coded into the transmitted bitstream.
The objective is to ensure that $Z^{(k)}$ captures the most relevant information about $X^{(k)}$ for predicting $Y^{(k)}$ while remaining as concise as possible. Introducing a Lagrange multiplier $\lambda$, the problem is equivalently expressed as:
$$\max_{\Theta} \; \mathcal{R}_{IB} = \sum_{k=1}^{K} \left[ I\big(Z^{(k)}; Y^{(k)}\big) - \lambda \cdot I\big(X^{(k)}; Z^{(k)}\big) \right], \tag{6}$$
where $\mathcal{R}_{IB}$ represents the IB functional, balancing the compression of $X^{(k)}$ against the necessity of accurately predicting $Y^{(k)}$. Then, we extend the IB framework to a multi-camera setting by introducing priority weights to the mutual information terms, adapting the optimization for an edge analytics network:
$$\min_{\Theta} \sum_{k=1}^{K} \left[ -I_w\big(Z^{(k)}; Y^{(k)}\big) + \lambda I_w\big(X^{(k)}; Z^{(k)}\big) \right], \tag{7}$$
where the weighted mutual information terms are defined as $I_w\big(Z^{(k)}; Y^{(k)}\big) = w_k \cdot I\big(Z^{(k)}; Y^{(k)}\big)$ and $I_w\big(X^{(k)}; Z^{(k)}\big) = e^{w_0 - w_k} \cdot I\big(X^{(k)}; Z^{(k)}\big)$. The non-negative value $w_0$ denotes the maximum value of the weight parameters $w_k$.
The first term with linear weights, $I_w\big(Z^{(k)}; Y^{(k)}\big)$, is the weighted mutual information between the compressed representation $Z^{(k)}$ from camera $k$ and the target $Y^{(k)}$. Linear weighting by $w_k$ ensures each camera's influence is proportional to its priority: higher $w_k$ values increase the emphasis on $I\big(Z^{(k)}; Y^{(k)}\big)$ in the objective function, favoring cameras that provide high-quality data for precise predictions. The second term with negative exponential weights, $I_w\big(X^{(k)}; Z^{(k)}\big)$, measures the mutual information between the original $X^{(k)}$ and its compressed $Z^{(k)}$, scaled by $e^{w_0 - w_k}$. This scaling decays exponentially as $w_k$ rises, so cameras with lower $w_k$ incur a larger rate penalty and undergo more aggressive data compression, optimizing bandwidth and storage without significantly impacting overall system performance. This weighting approach, chosen for this proof of concept, will be further explored with more general methods in future work; a small numeric illustration follows below.
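The sketch below illustrates the asymmetry between the two weightings; the weight values are arbitrary and purely illustrative.

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])   # softmax priority weights from Eq. (4)
w0 = w.max()                    # w0: the maximum weight parameter

utility_scale = w               # linear weight on I(Z;Y) per camera
rate_scale = np.exp(w0 - w)     # exponential weight on I(X;Z) per camera
for k, (u, r) in enumerate(zip(utility_scale, rate_scale)):
    print(f"camera {k}: utility x{u:.2f}, rate penalty x{r:.2f}")
# Lower-priority cameras receive a larger rate penalty, hence more
# aggressive compression; the top-priority camera has penalty x1.00.
```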
III. METHODOLOGY
A. Architecture Summary
In this subsection, we outline the workflow of our PIB framework, designed for collaborative edge video analytics. As depicted in Fig. 2, the process starts with each edge camera (denoted by $k$) capturing raw video data $X_t^{(k)}$ and extracting feature maps. Each camera utilizes its priority weight $w_k$ to optimize the balance between communication costs and perception accuracy, adapting to varying channel conditions. The extracted features are then compressed using entropy coding and sent as a bitstream to the edge server for further processing. At the server (see Fig. 3), the video features are reconstructed using shared parameters such as the weights $w_k$ and the variational model parameters $q(Z_t^{(k)} \mid Z_{t-1}^{(k)}, \ldots, Z_{t-\tau}^{(k)})$. The server integrates these multi-view features to estimate the pedestrian occupancy $Y_t$. This approach leverages historical frame correlations through a multi-frame correlation model to enhance prediction accuracy.

Fig. 3: The procedure of video decoding: the received bitstream is entropy-decoded with the shared priority weight module $w^{(k)}$ and multi-frame correlation model $q(z_t^{(k)} \mid z_{t-1}^{(k)}, \cdots, z_{t-\tau}^{(k)})$, and the reconstructed features are fused across views to produce the pedestrian occupancy prediction.
B. Information Bottleneck Analysis
The objective function of the information bottleneck in Eq. (7) can be divided into two parts. The first part, $-\sum_{k=1}^{K} w_k \cdot I\big(Z^{(k)}; Y^{(k)}\big)$, reflects the quality of the prediction recovered by decoding at the edge server. The second part, $\lambda \sum_{k=1}^{K} e^{w_0 - w_k} \cdot I\big(X^{(k)}; Z^{(k)}\big)$, reflects the compression efficiency of the feature extraction. In principle, the decoder $p(Y^{(k)} \mid Z^{(k)})$ can be any valid conditional distribution, but it is usually not smooth enough for straightforward calculation. Because of this complexity, it is highly challenging to directly evaluate and optimize the two mutual information terms in Eq. (7). Accordingly, we adopt the variational approach [16]: we restrict the decoder to a simpler family of distributions $\mathcal{Q}$ and search for the distribution $q(Y^{(k)} \mid Z^{(k)})$ within this family that is closest to the optimal decoder distribution, using the KL divergence to measure closeness.
As a proof of concept, we first focus on deriving a lower bound for the mutual information $I\big(Z^{(k)}; Y^{(k)}\big)$ based on alternative probability distributions. (For simplicity, we omit the superscript $(k)$ when deriving the lower bound of $I(Z; Y)$.) We start with the standard definition of mutual information:
$$I(Z; Y) = \mathbb{E}_{p(Y,Z)}\left[\log \frac{p(Y \mid Z)}{p(Y)}\right]. \tag{8}$$
We then introduce the Kullback–Leibler (KL) divergence, which is always non-negative and measures how well the distribution $q(Y \mid Z)$ approximates the true distribution $p(Y \mid Z)$:
$$D_{KL}\big[p(Y \mid Z) \,\|\, q(Y \mid Z)\big] = \mathbb{E}_{p(Y \mid Z)}\left[\log \frac{p(Y \mid Z)}{q(Y \mid Z)}\right] \geq 0, \tag{9}$$
where the variational distribution $q(Y \mid Z)$ is a weighted exponential-family distribution parameterized by neural network parameters $\Phi$; it is designed to approximate the true conditional distribution $p(Y \mid Z)$ while providing a computationally tractable lower bound on the mutual information. Eq. (9) leads to the inequality:
$$\mathbb{E}_{p(Y \mid Z)}\left[\log p(Y \mid Z)\right] \geq \mathbb{E}_{p(Y \mid Z)}\left[\log q(Y \mid Z)\right], \tag{10}$$
where the expectation over $p(Y \mid Z)$ can be replaced by one over $p(Y, Z)$. The relationship between the joint and conditional probabilities facilitates the simplification of the expression for mutual information:
$$\mathbb{E}_{p(Y,Z)}\left[\log \frac{p(Y \mid Z)}{p(Y)}\right] = \mathbb{E}_{p(Z)}\,\mathbb{E}_{p(Y \mid Z)}\left[\log \frac{p(Y \mid Z)}{p(Y)}\right]. \tag{11}$$
Building on the inequality established by the KL divergence (9), we can express a lower bound for the mutual information:
$$I(Z; Y) \geq \mathbb{E}_{p(Y,Z)}\left[\log q(Y \mid Z)\right] + H(Y), \tag{12}$$
where $H(Y)$ is the entropy of $Y$, a constant that reflects the inherent uncertainty in $Y$ independent of $Z$. This formulation provides a computationally feasible lower bound for mutual information, which is crucial for video analytics and other applications where direct computation of mutual information is hard or even infeasible.
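In practice, maximizing the bound in Ineq. (12) amounts to minimizing a cross-entropy between the decoder output $q(Y \mid Z)$ and the labels, since $H(Y)$ is constant. A minimal sketch of this Monte Carlo estimate follows; the Bernoulli decoder over an occupancy map is our own illustrative assumption.

```python
import torch
import torch.nn as nn

# Monte Carlo estimate of -E_{p(Y,Z)}[log q(Y|Z)]: a per-pixel binary
# cross-entropy when Y is a pedestrian-occupancy map and q(Y|Z) is a
# Bernoulli decoder (an assumption for illustration).
decoder = nn.Conv2d(8, 1, kernel_size=3, padding=1)
z = torch.randn(4, 8, 64, 64)                    # batch of fused features Z
y = torch.randint(0, 2, (4, 1, 64, 64)).float()  # ground-truth occupancy Y

logits = decoder(z)
neg_log_q = nn.functional.binary_cross_entropy_with_logits(logits, y)
# Minimizing neg_log_q maximizes the lower bound on I(Z;Y), up to +H(Y).
```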
As for the second part, we establish an upper bound, since directly minimizing the term $\lambda \sum_{k=1}^{K} e^{w_0 - w_k} \cdot I\big(X^{(k)}; Z^{(k)}\big)$ is intractable. Recognizing that $H(Z^{(k)} \mid X^{(k)}) \geq 0$ from the properties of entropy, we can derive the following inequality:
$$\lambda \sum_{k=1}^{K} I_w\big(X^{(k)}; Z^{(k)}\big) \leq \lambda \sum_{k=1}^{K} \frac{H\big(Z^{(k)}\big)}{e^{w_k - w_0}} \leq \lambda \sum_{k=1}^{K} \frac{H\big(Z^{(k)}, V^{(k)}\big)}{e^{w_k - w_0}}, \tag{13}$$
where we use the latent variables $V^{(k)}$ as side information to encode the quantized features, and we have used $H(Z^{(k)}, V^{(k)}) \geq H(Z^{(k)})$. We begin by recognizing that the joint entropy $H(Z^{(k)}, V^{(k)})$ represents the communication cost. Then, we establish an upper bound by using the non-negativity of the KL divergence:
$$H\big(Z^{(k)}, V^{(k)}\big) \leq \mathbb{E}_{p(Z^{(k)}, V^{(k)})}\left[-\log \Big( q\big(Z^{(k)} \mid V^{(k)}; \Theta_{con}^{(k)}\big) \, q\big(V^{(k)}; \Theta_{l}^{(k)}\big) \Big)\right], \tag{14}$$
where $\Theta_{con}^{(k)}$ and $\Theta_{l}^{(k)}$ are the learnable parameters of the variational distributions $q(Z^{(k)} \mid V^{(k)}; \Theta_{con}^{(k)})$ and $q(V^{(k)}; \Theta_{l}^{(k)})$, respectively, which approximate the true distributions to minimize the communication cost while capturing the essential feature relations for inference. Substituting Eq. (14) into Eq. (13), we obtain the upper bound for the second term in Eq. (7):
$$I_w\big(X^{(k)}; Z^{(k)}\big) \leq \mathbb{E}_{p(Z^{(k)}, V^{(k)})}\left[-\log \Big( q\big(Z^{(k)} \mid V^{(k)}; \Theta_{con}^{(k)}\big) \, q\big(V^{(k)}; \Theta_{l}^{(k)}\big) \Big)\right] e^{w_0 - w_k}. \tag{15}$$
It should be noted that the lower bound in Ineq. (12) and the upper bound in Ineq. (15) together yield an upper bound on the objective of the minimization problem in (7). This makes it amenable to minimization via the corresponding loss function during network training, as discussed in Sec. III-D.
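The expectation in Ineq. (15) can be estimated in code as the negative log-likelihood of the features and side information under the learned variational distributions, scaled by $e^{w_0 - w_k}$. A simplified sketch follows, assuming factorized Gaussian forms for $q(Z \mid V)$ and $q(V)$ (the paper does not fix the parametric families here); in a real model the parameters of $q(Z \mid V)$ would be produced by a network conditioned on $V$.

```python
import torch
from torch.distributions import Normal

def rate_upper_bound(z, v, q_z_given_v: Normal, q_v: Normal,
                     w_k: float, w0: float) -> torch.Tensor:
    """Monte Carlo estimate of E[-log(q(z|v) q(v))] * exp(w0 - w_k), Ineq. (15)."""
    nll = -(q_z_given_v.log_prob(z).sum() + q_v.log_prob(v).sum()) / z.shape[0]
    return nll * torch.exp(torch.tensor(w0 - w_k))

z = torch.randn(32, 16)   # quantized features Z^(k) (toy values)
v = torch.randn(32, 4)    # side-information latents V^(k)
q_z_given_v = Normal(torch.zeros(32, 16), torch.ones(32, 16))
q_v = Normal(torch.zeros(32, 4), torch.ones(32, 4))
bound = rate_upper_bound(z, v, q_z_given_v, q_v, w_k=0.2, w0=0.5)
```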
C. Multi-Frame Correlation Model
Inspired by the previous work [9], we utilize a multi-frame correlation model that leverages variational approximation to capture the temporal dynamics in video sequences. This approach exploits the temporal redundancy across contiguous frames to model the conditional probability distribution effectively. Our model approximates the next feature in the sequence through the variational distribution $q(Z_t^{(k)} \mid Z_{t-1}^{(k)}, \ldots, Z_{t-\tau}^{(k)}; \Theta_{\tau}^{(k)})$, which can be modeled as a Gaussian distribution aimed at mimicking the true conditional distribution of the subsequent frame given the previous frames:
$$q\big(Z_t^{(k)} \mid Z_{t-1}^{(k)}, \ldots, Z_{t-\tau}^{(k)}; \Theta_{\tau}^{(k)}\big) = \mathcal{N}\Big(\mu\big(\Theta_{\tau}^{(k)}\big), \sigma^2\big(\Theta_{\tau}^{(k)}\big)\Big),$$
where $\mu$ and $\sigma^2$ are parametric functions of the preceding frames, encapsulating the temporal dependencies. These functions are modeled using a deep neural network with parameters $\Theta_{\tau}^{(k)}$ that are learned from data. By optimizing the variational parameters, our model aims to closely match the true distribution, thus encoding the features more efficiently.
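A sketch of this conditional Gaussian model: a small network maps the $\tau$ previous features to $(\mu, \sigma^2)$. The GRU architecture below is our assumption, as the paper specifies only a deep network with parameters $\Theta_{\tau}^{(k)}$.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class MultiFrameCorrelationModel(nn.Module):
    """q(Z_t | Z_{t-1}, ..., Z_{t-tau}) = N(mu(Theta), sigma^2(Theta))."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, feat_dim)
        self.log_sigma = nn.Linear(hidden, feat_dim)

    def forward(self, z_past: torch.Tensor) -> Normal:
        # z_past: (batch, tau, feat_dim), ordered oldest to newest
        _, h = self.rnn(z_past)
        h = h.squeeze(0)
        return Normal(self.mu(h), self.log_sigma(h).exp())

model = MultiFrameCorrelationModel(feat_dim=16)
z_past = torch.randn(8, 4, 16)   # tau = 4 previous features
z_t = torch.randn(8, 16)         # current feature
q = model(z_past)
nats = -q.log_prob(z_t).sum(-1).mean()  # coding cost shrinks as prediction improves
```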
D. Network Loss Functions Derivation
In this subsection, we design our network loss functions
to optimize the information flow in a multi-camera setting
according to the IB principle in Sec. II-D.
The first loss function, $\mathcal{L}_1$, aims to minimize the upper bound of the mutual information objective, following the inequalities derived in (12) and (15). $\mathcal{L}_1$ ensures efficient encoding while preserving essential information for accurate prediction:
$$\mathcal{L}_1 = \sum_{k=1}^{K} \underbrace{\mathbb{E}\big[-w_k \log q(Y^{(k)} \mid Z^{(k)})\big]}_{\text{the upper bound of } -I_w(Z^{(k)}; Y^{(k)})} + \lambda \cdot \min\bigg( R_{\max},\; \underbrace{\mathbb{E}\Big[-\log\big(q(Z^{(k)} \mid V^{(k)}; \Theta_{con}^{(k)}) \, q(V^{(k)}; \Theta_{l}^{(k)})\big)\Big] e^{w_0 - w_k}}_{\text{the upper bound of } I_w(X^{(k)}; Z^{(k)})} \bigg).$$
The first term of $\mathcal{L}_1$ excludes $H(Y)$ from Ineq. (12) because it is a constant. The second term addresses the upper bound of the communication cost required to transmit features from the cameras to the edge server. $R_{\max}$ is used to clip the over-relaxation of this upper bound, preventing an excessive communication-cost penalty from degrading the training of the decoder $p(Y^{(k)} \mid Z^{(k)})$. The multi-frame correlation model of Sec. III-C leverages temporal dynamics, which is critical for sequential data processing in video analytics. Accordingly, the second loss function, $\mathcal{L}_2$, is derived to minimize the KL divergence between the true distribution of frame sequences and the modeled variational distribution:
$$\mathcal{L}_2 = \sum_{k=1}^{K} D_{KL}\Big[p\big(Z_t^{(k)} \mid Z_{<t}^{(k)}\big) \,\|\, q\big(Z_t^{(k)} \mid Z_{<t}^{(k)}\big)\Big], \tag{16}$$
where $Z_{<t}^{(k)} = (Z_{t-1}^{(k)}, \ldots, Z_{t-\tau}^{(k)})$. Given the variability in channel quality and the occurrence of delays, we introduce the third loss function, $\mathcal{L}_3$, designed to minimize the impact of unreliable data sources while maximizing inference accuracy:
$$\mathcal{L}_3 = \sum_{k=1}^{K} \Big[\mathbb{1}_{d_{norm,k} < \epsilon}\,(w_k - W_{target})^2 + \mathbb{1}_{d_{norm,k} > \epsilon}\, w_k^2\Big], \tag{17}$$
where $\epsilon$ denotes the maximum permissible delay that does not cause errors in multi-view fusion, and $W_{target}$ represents the target weight for a camera without excessive delay. These loss functions collectively optimize the trade-off between data transmission costs and perceptual accuracy, which is crucial for enhancing the performance of edge analytics in multi-camera systems; a combined training-loss sketch is given below.
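Putting the pieces together, here is a hedged sketch of the total training loss as we read $\mathcal{L}_1$, Eq. (16), and Eq. (17); the values of $\lambda$, $R_{\max}$, $\epsilon$, $W_{target}$, and the unweighted sum of the three losses are our assumptions.

```python
import torch

def total_loss(neg_log_q_y,   # per-camera E[-log q(Y|Z)], shape (K,)
               rate_bound,    # per-camera E[-log q(Z|V) q(V)], shape (K,)
               kl_temporal,   # per-camera KL term of Eq. (16), shape (K,)
               w, d_norm,     # priority weights and normalized delays, (K,)
               lam=0.01, r_max=5.0, eps=0.5, w_target=0.3):
    w0 = w.max()
    # L1: weighted distortion plus clipped, exponentially weighted rate
    l1 = (w * neg_log_q_y).sum() + lam * torch.minimum(
        torch.full_like(rate_bound, r_max),
        rate_bound * torch.exp(w0 - w)).sum()
    # L2: temporal KL divergence, Eq. (16)
    l2 = kl_temporal.sum()
    # L3: push timely cameras toward w_target, suppress delayed ones, Eq. (17)
    timely = (d_norm < eps).float()
    l3 = (timely * (w - w_target) ** 2 + (1 - timely) * w ** 2).sum()
    return l1 + l2 + l3

K = 3
loss = total_loss(torch.rand(K), torch.rand(K) * 10, torch.rand(K),
                  torch.softmax(torch.randn(K), 0),
                  torch.tensor([0.1, 0.4, 0.9]))
```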
IV. PERFORMANCE EVALUATION
A. Simulation Setup
We set up simulations to evaluate our PIB framework, aimed
at predicting pedestrian occupancy in urban settings using mul-
tiple cameras. These simulations replicate a city environment,
with variables like signal frequency and device density affecting
the outcomes.
Our simulations use a 2.4 GHz operating frequency, a path
loss exponent of 3.5, and a shadowing deviation of 8 dB.
Devices emit an interference power of 0.1 Watts, with densities
ranging from 10 to 100 devices per 100 square meters, allowing
us to test different levels of congestion. The bandwidth is set at
2 MHz, with cameras located about 500 meters from the edge
server. We employ the Wildtrack dataset from EPFL, which
features high-resolution images from seven cameras located in
a public area, capturing unscripted pedestrian movements [17].
This dataset provides 400 frames per camera at 2 frames per
second, documenting over 40,000 bounding boxes that highlight
individual movements across more than 300 pedestrians.
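The channel parameters above can be turned into per-camera SNRs with a standard log-distance path-loss model; the sketch below is our illustrative reading, in which the transmit power, 1 m reference loss, and noise floor are assumed values not stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def snr_db(dist_m, tx_power_dbm=20.0, ref_loss_db=40.0,
           path_loss_exp=3.5, shadow_sigma_db=8.0, noise_dbm=-96.0):
    """Log-distance path loss with lognormal shadowing; the exponent 3.5
    and 8 dB shadowing match Sec. IV, the other values are assumptions."""
    pl = (ref_loss_db + 10.0 * path_loss_exp * np.log10(dist_m)
          + rng.normal(0.0, shadow_sigma_db))
    return tx_power_dbm - pl - noise_dbm

for d in (100, 500):
    print(f"{d} m: SNR ~ {snr_db(d):.1f} dB")  # 500 m cameras see poor SNR
```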
The primary measure we use is the multiple object detection accuracy (MODA), which assesses the system's ability to accurately detect pedestrians based on missed and false detections.
We also look at the rate-performance tradeoff to understand
how communication overhead affects system performance.
For comparative analysis, we consider three baselines:
• TOCOM-TEM [9]: A task-oriented communication framework using a temporal entropy model for edge video analytics. It leverages the deterministic Information Bottleneck principle to extract and transmit compact, task-relevant features, integrating spatial-temporal data on the server for enhanced inference accuracy.
• JPEG [18]: A widely adopted image compression standard that reduces the data size of digital images via lossy compression algorithms, commonly used for reducing the communication load in networked camera systems.
• High Efficiency Video Coding (HEVC) [19]: Also known as H.265 and MPEG-H Part 2, this standard provides up to 50% better data compression than its predecessor AVC (H.264 or MPEG-4 Part 10) at the same video quality, which is critical for efficient data transmission in high-density camera networks.
Our code will be available at github.com/fangzr/PIB-Prioritized-Information-Bottleneck-Framework.
In the simulation study, we examine the effectiveness of multiple camera systems in forecasting pedestrian presence. Unlike a single-camera configuration, this method minimizes obstructions commonly found in crowded locations by integrating perspectives from various angles. Nevertheless, this benefit is accompanied by heightened communication overhead.
Fig. 4: Communication cost (KB) vs MODA (%) for PIB (ours), TOCOM-TEM, JPEG, and HEVC.

Fig. 5: Delayed cameras vs MODA (%), with annotated gains of +5.6% and +15.1% for PIB.
In Fig. 4, we observe the relationship between communication costs and MODA, a metric for multi-camera perception. The PIB
algorithm exhibits a higher MODA across varying commu-
nication costs when compared to TOCOM-TEM, JPEG, and
HEVC. This superior performance can be attributed to PIB’s
strategic fusion of multi-view features, which is informed by
both channel quality and the selection of RoI with appropriate
priorities. By prioritizing information, PIB effectively mitigates
the detrimental effects of delayed information that could poten-
tially degrade the perception accuracy in multi-camera systems.
Fig. 5 depicts the performance of different compression techniques in a multi-view scenario as a function of the number of delayed cameras. Our proposed PIB method and TOCOM-TEM, both utilizing multi-frame correlation models, successfully reduce redundancy across multiple frames, achieving superior MODA at equivalent compression rates. PIB, in particular, relies on a prioritized IB framework whose weighting mechanism enables an adaptive balance between compression rate and collaborative sensing accuracy, optimizing MODA across various channel conditions. It is worth noting that HEVC did not consistently outperform JPEG: our HEVC baseline uses the HEIF still-image format derived from HEVC, which inadequately supports the motion prediction module and thus compromises performance.
In Fig. 6, we analyze the impact of increasing the number of delayed cameras on the communication cost for various algorithms.

Fig. 6: Delayed cameras vs communication cost (KB), with annotated reductions of 57.8% and 66.7% for PIB.

The PIB algorithm demonstrates a significant reduction in communication costs as the number of delayed cameras grows. This efficiency is due to the algorithm's priority mechanism, which adeptly assigns weights and filters out the adverse information caused by delays. Consequently, PIB prioritizes the transmission of high-quality features from cameras with more accurate occupancy predictions. Compared to TOCOM-TEM, PIB achieves a remarkable 66.7% decrease in communication costs while still retaining the precision of multi-camera pedestrian occupancy predictions. For a fair comparison, both the JPEG and HEVC methods were set to a uniform compression threshold of 30 KB in this experiment; however, as indicated in Fig. 5, they do not surpass the performance of PIB and TOCOM-TEM.
V. CONCLUSION
In this paper, we have proposed the Prioritized Information
Bottleneck (PIB) framework as a robust solution for collabora-
tive edge video analytics. Our contributions are two-fold. First,
we developed a prioritized inference mechanism to intelligently
determine the importance of different cameras' FoVs, effectively
addressing the constraints imposed by channel capacity and
data redundancy. Second, the PIB framework showcases its
effectiveness by notably decreasing communication overhead
and improving tracking accuracy without requiring video recon-
struction at the edge server. Extensive numerical results show that PIB not only surpasses conventional methods such as TOCOM-TEM, JPEG, and HEVC, with a marked improvement of up to 15.1% in MODA, but also achieves a considerable 66.7% reduction in communication costs, while retaining low latency and high-quality multi-view sensory data processing under less favorable channel conditions.
VI. ACKNOWLEDGEMENT
This work was supported in part by the Hong Kong SAR
Government under the Global STEM Professorship and Re-
search Talent Hub, the Hong Kong Jockey Club under the Hong
Kong JC STEM Lab of Smart City (Ref.: 2023-0108), and
the Hong Kong Innovation and Technology Commission under
InnoHK Project CIMDA. The work of Y. Deng was supported
in part by the National Natural Science Foundation of China
under Grant No. 62301300. The work of X. Chen was supported
in part by HKU-SCF FinTech Academy R&D Funding.
REFERENCES
[1] A. Padmanabhan, N. Agarwal, A. Iyer, G. Ananthanarayanan, Y. Shu,
N. Karianakis, G. H. Xu, and R. Netravali, “Gemel: Model merging for
memory-efficient, real-time video analytics at the edge,” in 20th USENIX
Symposium on Networked Systems Design and Implementation (NSDI 23),
Boston, MA, 2023, pp. 973–994.
[2] Z. Fang, J. Wang, Y. Ren, Z. Han, H. V. Poor, and L. Hanzo, “Age of
information in energy harvesting aided massive multiple access networks,”
IEEE Journal on Selected Areas in Communications, vol. 40, no. 5, pp.
1441–1456, May 2022.
[3] Z. Fang, J. Wang, J. Du, X. Hou, Y. Ren, and Z. Han, “Stochastic
optimization-aided energy-efficient information collection in Internet of
Underwater Things networks,” IEEE Internet of Things Journal, vol. 9,
no. 3, pp. 1775–1789, Feb. 2021.
[4] H. Wang, J. Huang, G. Wang, H. Lu, and W. Wang, “Contactless patient
care using hospital IoT: CCTV camera based physiological monitoring
in ICU,” IEEE Internet of Things Journal, vol. 11, no. 4, pp. 5781–5797,
Aug. 2023.
[5] Z. Fang, S. Hu, H. An, Y. Zhang, J. Wang, H. Cao, X. Chen, and
Y. Fang, “PACP: Priority-aware collaborative perception for connected
and autonomous vehicles,” IEEE Transactions on Mobile Computing,
(DOI: 10.1109/TMC.2024.3449371), Aug. 2024.
[6] L. Corneo, N. Mohan, A. Zavodovski, W. Wong, C. Rohner, P. Gun-
ningberg, and J. Kangasharju, “(How much) can edge computing change
network latency?” in IFIP Networking Conference (IFIP Networking).
Espoo and Helsinki, Finland: IEEE, Jun. 2021, pp. 1–9.
[7] L. Marelli and G. Testa, “Scrutinizing the EU general data protection
regulation,” Science, vol. 360, no. 6388, pp. 496–498, May 2018.
[8] Ponemon Institute, “New Ponemon Institute study finds 60% of IT and security leaders are not confident in their ability to secure access to cloud environments,” https://www.securitymagazine.com/articles/98044-60-of-cybersecurity-leaders-not-confident-in-their-cloud-security-tactics, 2021, accessed: 2022-07-20.
[9] J. Shao, X. Zhang, and J. Zhang, “Task-oriented communication for edge
video analytics,” IEEE Transactions on Wireless Communications, vol. 23,
no. 5, pp. 4141–4154, May 2024.
[10] M. Al-Qizwini, I. Barjasteh, H. Al-Qassab, and H. Radha, “Deep learning algorithm for autonomous driving using GoogLeNet,” in IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, Jun. 2017, pp. 89–96.
[11] K. Gao, H. Wang, H. Lv, and W. Liu, “Localization-oriented digital twin-
ning in 6G: A new indoor-positioning paradigm and proof-of-concept,”
IEEE Transactions on Wireless Communications, 2024.
[12] A. Yaqoob, T. Bi, and G.-M. Muntean, “A survey on adaptive 360 video
streaming: Solutions, challenges and opportunities,” IEEE Communica-
tions Surveys & Tutorials, vol. 22, no. 4, pp. 2801–2838, 2020.
[13] Z. Jiang, X. Zhang, Y. Xu, Z. Ma, J. Sun, and Y. Zhang, “Reinforcement
learning based rate adaptation for 360-degree video streaming,” IEEE
Transactions on Broadcasting, vol. 67, no. 2, pp. 409–423, Oct. 2020.
[14] Y. Hou, L. Zheng, and S. Gould, “Multiview detection with feature
perspective transformation,” in The European Conference on Computer
Vision (ECCV), Glasgow, Scotland, 2020, pp. 1–18.
[15] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck
method,” arXiv preprint physics/0004057, 2000.
[16] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational
information bottleneck,” in Conf. on Learning Representations (ICLR),
Toulon, France, Apr. 2017, pp. 1–9.
[17] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret, “WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun. 2018, pp. 5030–5039.
[18] G. K. Wallace, “The JPEG still picture compression standard,” IEEE
Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv,
Feb. 1992.
[19] F. Bossen, B. Bross, K. Sühring, and D. Flynn, “HEVC complexity and implementation analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1685–1696, Oct. 2012.