All content in this area was uploaded by Keshav Singh on Feb 27, 2025
Quantum-Enhanced DRL Optimization for DoA
Estimation and Task Offloading in ISAC Systems
Anal Paul, Member, IEEE, Keshav Singh, Member, IEEE, Aryan Kaushik, Member, IEEE, Chih-Peng
Li, Fellow, IEEE, Octavia A. Dobre, Fellow, IEEE, Marco Di Renzo, Fellow, IEEE,
and Trung Q. Duong, Fellow, IEEE
Abstract—This work proposes a quantum-aided deep reinforce-
ment learning (DRL) framework designed to enhance the accuracy
of direction-of-arrival (DoA) estimation and the efficiency of com-
putational task offloading in integrated sensing and communication
systems. Traditional DRL approaches face challenges in handling
high-dimensional state spaces and ensuring convergence to optimal
policies within complex operational environments. The proposed
quantum-aided DRL framework that operates in a military surveil-
lance system exploits quantum computing’s parallel processing
capabilities to encode operational states and actions into quantum
states, significantly reducing the dimensionality of the decision
space. For the very first time in literature, we propose a quantum-
enhanced actor-critic method, utilizing quantum circuits for policy
representation and optimization. Through comprehensive simulations,
we demonstrate that our framework improves DoA estimation
accuracy by 91.66% and 82.61% over existing DRL algorithms
with a faster convergence rate, and effectively manages the trade-off
between sensing and communication while optimizing task offloading
decisions under stringent ultra-reliable low-latency communication
requirements. Comparative analysis also reveals that our approach
reduces the overall task offloading latency by 43.09% and 32.35%
compared to the DRL-based deep deterministic policy gradient and
proximal policy optimization algorithms, respectively.
A. Paul, K. Singh, and C-P. Li are with the Institute of Communications
Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan (Email:
apaul@ieee.org, keshav.singh@mail.nsysu.edu.tw, cpli@faculty.nsysu.edu.tw).
The work of K. Singh and C.-P. Li was supported in part by the National
Science and Technology Council of Taiwan under Grants NSTC 112-2221-
E-110-038-MY3 and NSTC 112-2221-E-110-029-MY3 and also supported in
part by the Sixth Generation Communication and Sensing Research Center
funded by the Higher Education SPROUT Project, the Ministry of Education
of Taiwan. The work of O. A. Dobre was supported by Canada Research
Chairs Program CRC-2022-00187. The work of A. Kaushik was supported by
the Higher Education Innovation Fund supported project “Green and Intelligent
6G Connectivity: AI and Holographic Surfaces-assisted Integrated Sensing and
Communications”. The work of M. Di Renzo was supported in part by the
European Commission through the Horizon Europe project titled COVER under
grant agreement number 101086228, the Horizon Europe project titled UNITE
under grant agreement number 101129618, and the Horizon Europe project
titled INSTINCT under grant agreement number 101139161, as well as by the
Agence Nationale de la Recherche (ANR) through the France 2030 project titled
ANR-PEPR Networks of the Future under grant agreement NF-YACARI 22-
PEFT-0005, and by the CHIST-ERA project titled PASSIONATE under grant
agreements CHIST-ERA-22-WAI-04 and ANR-23-CHR4-0003-01. The work of
T. Q. Duong was supported in part by the Canada Excellence Research Chair
(CERC) Program CERC-2022-00109. (Corresponding author: Keshav Singh.)
A. Kaushik is with the School of Engineering and Informatics, University of
Sussex, Brighton BN1 9RH, UK (E-mail: aryan.kaushik@sussex.ac.uk).
O. A. Dobre is with the Faculty of Engineering and Applied Science,
Memorial University of Newfoundland, St. John’s, NL A1C 5S7, Canada (E-
mail: odobre@mun.ca).
M. Di Renzo is with Université Paris-Saclay, CNRS, CentraleSupélec,
Laboratoire des Signaux et Systèmes, 3 Rue Joliot-Curie, 91192 Gif-sur-Yvette,
France (e-mail: marco.di-renzo@universite-paris-saclay.fr).
T. Q. Duong is with the Faculty of Engineering and Applied Science, Memorial
University of Newfoundland, St. John’s, NL A1C 5S7, Canada, and also with
the School of Electronics, Electrical Engineering and Computer Science, Queen’s
University Belfast, BT7 1NN Belfast, U.K. (e-mail: tduong@mun.ca).
Index Terms—Quantum computing, deep reinforcement learning,
direction-of-arrival estimation, vehicular task offloading, surveil-
lance systems, ultra-reliable low-latency communication.
I. INTRODUCTION
THE advent of integrated sensing and communication (ISAC)
systems marks a transformative era in military surveillance,
essential for modern warfare [1]. The integration of ground,
aerial, and space networks in the sixth-generation (6G) com-
munications is a game-changer that delivers unmatched levels of
global connectivity, low-latency communication, accurate sens-
ing capabilities, and distributed task offloading [2]–[4]. These ca-
pabilities are instrumental in time-sensitive military surveillance
systems, making them an essential component for the next level
of military operations. ISAC systems are essential in achieving
a dual-purpose goal: real-time environmental sensing for threat
detection and dynamic communication for command and control
[5]. The authors in [5] proposed a reconfigurable intelligent surface
(RIS)-aided ISAC system to maximize the weighted
performance metrics while maintaining robust communication
links. Pan et al. investigated the use of unmanned aerial vehicles
(UAVs) to provide ISAC services [6]. They take advantage of
UAV mobility to improve accuracy in target location estimation
and ensure quality-of-service (QoS) in communication [6], [7].
Further study revealed that the UAV-mounted RIS is capable of
providing reliable coverage for ISAC systems [8], [9].
The direction-of-arrival (DoA) estimation is crucial in the
sensing component of ISAC systems, particularly if we consider
military surveillance applications [6], [10]. It involves deter-
mining the angle at which a received signal arrives, which
is crucial for accurately localizing and tracking unauthorized
flying objects (UFOs) or potential threats. In their research,
Chen et al. [11] investigated the use of passive beamforming
with RIS for estimating the DoA of ground vehicles. Multiple
measurements were analyzed to determine the ISAC system’s
theoretical Cramer-Rao lower bound (CRLB) in estimating the
DoA [6], [11], [12].
The authors in [13], [14] found that the ISAC-aided military
surveillance systems require precise DoA estimation as well
as effective task offloading to mobile edge computing (MEC)
due to the high computational intensity of processing tasks.
Task offloading to the MEC node for the armored vehicles is
essential as those vehicles are not equipped with specialized
hardware units to process some computationally heavy tasks
[14]. This scenario is further complicated in operations where
ultra-reliable low-latency communications (URLLC) are essen-
tial, necessitating swift processing and dissemination of critical
information with minimal delay [15]. Therefore, effective task
offloading becomes not just a strategic choice but a necessity
to meet the stringent URLLC service constraints. Herein, MEC-
enabled non-terrestrial networks (i.e., satellite communication)
emerge as a pivotal element, offering seamless integration with
terrestrial networks and facilitating effective task offloading [16]. The
satellite-terrestrial-integrated ISAC framework differs from the
MEC framework in key ways. ISAC integrates sensing and com-
munication for environmental monitoring, while MEC focuses on
reducing latency by offloading tasks to edge servers. ISAC uses
a unified infrastructure for efficient resource use, whereas MEC
employs distributed edge nodes. ISAC is suited for real-time
monitoring and autonomous systems like self-driving cars and
drones, while MEC is ideal for low-latency applications such as
AR/VR and IoT services. ISAC optimizes spectrum efficiency
and ensures accurate object detection, while MEC focuses on
reducing latency and improving offloading efficiency.
To optimize military surveillance systems with limited re-
sources for sensing, communication, and computation, imple-
menting a robust resource allocation strategy within the ISAC
framework is crucial. The evolution of artificial intelligence and
machine learning, followed by advancements in deep learning
(DL) strategy, significantly overcomes the hurdles of tradi-
tional optimization techniques [17], [18]. AI and ML algo-
rithms introduce unprecedented adaptability, learning capabil-
ity, and predictive analysis, enabling sophisticated and efficient
resource allocation strategies that were previously unattainable
[19], [20]. Wang et al. demonstrated that the DL technique for
ISAC-enabled predictive beamforming outperforms traditional
methods by bypassing intermediate state parameter estimation
[17]. The authors in [18] used advanced deep reinforcement
learning (DRL) algorithms in securing ISAC systems, balanc-
ing communication efficacy with security against eavesdropping
threats. DRL algorithms excel in sequential decision-making
environments, making them ideal for dynamic, uncertain systems
through continuous interaction with the environment [18]–[20].
Using a DRL-based deep deterministic policy gradient (DDPG)
algorithm, Gong et al. solved the joint optimization problem of
vehicular task scheduling and resource allocation in an ISAC
framework [19]. In another work, Liu et al. explored a DRL-
based proximal policy optimization (PPO) algorithm in a multi-
user multiple-input single-output (MISO) scenario to maximize
system capacity in an RIS-aided ISAC system [20].
A. Motivations and Contributions
The proliferation of non-terrestrial networks using 6G tech-
nology, encompassing both satellite and aerial platforms, offers
a new frontier for enhancing military surveillance capabilities
[2]. The potential absence of direct line-of-sight (LoS) from
the ground radar system to UFOs due to several obstructions
or geographical constraints necessitates innovative solutions [5].
Installing RIS in a high-rising building or deploying UAVs could
be helpful in the DoA estimation process [6]. The application
of RIS in DoA estimation is well investigated, but their fixed
point installation limits full exploitation in such scenarios [5].
Conversely, some existing works employ UAVs for sensing
in DoA estimation, yet the continuous hovering and sensing
operations quickly deplete the UAVs’ battery power. Therefore,
UAV-mounted passive RIS emerges as a promising approach
for enhancing the DoA estimation process by combining the
UAV’s mobility and RIS’s passive beamforming capabilities
while reporting to the ground radar system [21]. However,
integrating these technologies into a single framework that
supports robust detection of multiple targets (i.e., enemy UFOs) in
its airspace and efficient task offloading under stringent URLLC
QoS requirements poses a significant challenge for the
optimization algorithms. While DRL offers a promising solution for
the optimization of this joint framework, it is limited by long
training times and poor scalability in high-dimensional state and
action spaces [22].
Quantum computing emerges as a potential game-changer
for DRL, offering a new prospect to address these limitations
[23]. Its unparalleled processing power and ability to handle
complex computations simultaneously (parallelism) hold im-
mense promise to accelerate DRL training and improve its
efficiency [24]. In this study, we propose a quantum-assisted
DRL framework, carefully developed to enhance the accuracy
of DoA estimation of UFOs and to optimize non-terrestrial task
offloading for URLLC demands within military surveillance sys-
tems using a comprehensive ISAC framework. Our solution relies
on the exceptional computational power of quantum computing.
By translating operational states and actions into quantum states,
it substantially compresses the decision space, facilitating a more
efficient learning trajectory. Furthermore, the integration of a
quantum-enhanced actor-critic algorithm, employing quantum
circuits for policy representation and optimization, exhibits the
cutting-edge application of quantum computing to address the
multifaceted optimization challenges encountered in military
surveillance operations. The main contributions of this proposed
work are summarized as follows:
• Optimized ISAC System Balance: The framework
achieves an optimal balance between sensing for DoA
estimation and communication for task offloading. Employ-
ing UAV-mounted passive RIS enables simultaneous DoA
estimation and sensing of UFOs. The framework’s novel
integration of root mean squared error (RMSE), CRLB, and
semidefinite programming (SDP) establishes a statistically
robust foundation for DoA estimation, enhancing accuracy
and reducing potential errors within a quantum-DRL envi-
ronment.
• Sophisticated Task Offloading Scheme: By utilizing
non-terrestrial networks, including satellites and aerial platforms,
the proposed scheme addresses terrestrial network
limitations and facilitates the execution of computationally
demanding tasks within tolerable latency. The framework
uniquely minimizes task offloading latency and DoA esti-
mation errors by incorporating a novel Nash equilibrium-
based reward mechanism.
• Quantum-Enhanced Actor-Critic Method: The novel
quantum-enhanced DRL framework, built on actor-critic
networks, employs quantum circuits for
policy representation and optimization within a multi-agent
setting. Demonstrated through extensive simulations, this
approach significantly enhances DoA estimation accuracy
and task offloading efficiency, meeting URLLC's rigorous
demands. The proposed framework accelerates convergence
during the training process, surpassing the performance
of the existing conventional DRL-based DDPG [19]
and PPO [20] algorithms. Further comparative analysis
highlights substantial task offloading latency reductions
over traditional DRL algorithms, emphasizing quantum
computing's transformative impact on military surveillance
operations.
The rest of the paper is organized as follows: Section II
outlines the proposed system model and its application to ISAC
within military surveillance. Section III provides insight into
problem formulation for DoA estimation and task offloading.
Section IV unveils our novel quantum-enhanced DRL algorithm
and its implementation. Section V assesses our framework
through simulations, metrics, and comparisons. Section VI con-
cludes our contributions.
II. SYSTEM MODEL
Fig. 1 depicts a complex military surveillance system that
operates in urban, geographically challenging environments.
The system comprises a set of ground vehicles, denoted as
V={1, . . . , 𝑉 }, strategically deployed in a spatially distributed
area. Each vehicle 𝑣∈ V is equipped with a ground radar
detection system, which is enhanced through the deployment of
U={1, . . . , 𝑈 }UAVs. The UAVs are tasked with scanning
the airspace to detect W={1,2, . . . , 𝑊 }UFOs in its airspace.
Inspired by the energy efficiency achieved by RIS for the rate
demands of the end-users [25], we consider that each UAV
𝑢∈ U is equipped with a passive 2-bit RIS consisting of
N={1, . . . , 𝑁 } elements. For the reader’s convenience, Table I
provides a comprehensive list of symbols. Note that a few symbols
are reused for different purposes within this manuscript; each such
reuse is stated explicitly, and its scope is limited to
the specific context in which it is discussed.
The total operational timeframe, 𝑇, is dynamically partitioned
into two segments: 𝜏𝑇 and (1−𝜏)𝑇, as shown in Fig. 2. During
the 𝜏𝑇 phase, each vehicle-based ground radar system determines
the presence or absence of UFOs and estimates their DoA in the
airspace. Simultaneously, the ground vehicles collect data from
G={1, . . . , 𝐺 } neighboring sensors for their next military
operations. The accumulated data from all the sensors requires heavy
processing, which demands dedicated computational resources.
However, not all military vehicles have high-end processing
units due to cost, resource, and risk feasibility constraints. To
address this, we use a task-offloading technology [16] wherein
ground vehicles offload tasks to the UAVs and a network of
L={1, . . . , 𝐿 }satellites during the sub-slot of (1−𝜏)𝑇, as
depicted in Fig. 2. The UAVs assist in quick task caching,
reducing the need for hardware-based task processing, while
the satellites perform heavy computation in their onboard edge
processing units. It is worth mentioning here that if the radar
system detects the presence of a UFO in the airspace, the ground
vehicle halts the task offloading process due to the potential
threat of intercepting the sensitive data.
Fig. 1: DoA estimation and task offloading using quantum-DRL.
Fig. 2: Time-frame model of surveillance system framework. (The frame
of duration 𝑇 is split into a UFO detection phase of length 𝜏𝑇
followed by a task offloading phase of length (1−𝜏)𝑇.)
A. UFO Sensing and 3D DoA Estimation
The vehicular ground-based radar system initiates the DoA
estimation by transmitting a scanning signal $s(t)$ towards the
UAV-mounted RIS in the present surveillance framework. This
signal, represented by $s(t) = A(t)\,e^{j(2\pi f_c t + \psi(t))}$, where
$A(t)$ is the amplitude, $f_c$ the carrier frequency, and $\psi(t)$ the
phase, propagates through the airspace and is received by UAVs
equipped with 2-bit passive RIS. Each UAV's RIS, consisting of
$\mathcal{N} = \{1, 2, \ldots, N\}$ elements, manipulates this signal via
discrete phase adjustments corresponding to the 2-bit control. The
phase-shifted signal from the $n$-th element of the RIS is expressed as
$s_n(t) = s(t)\,e^{j\phi_n}$, where $\phi_n$ is the phase shift induced by
the RIS.
The echo signal $\mathbf{s}_{\mathrm{echo}}(t) \in \mathbb{C}^{M \times 1}$
received by the ground radar is an aggregate of signals that are
reflected from multiple UFOs $w \in \mathcal{W}$ towards the UAVs. The
phase-shifted signal after interacting with the UFO is given by
$\mathbf{x}_w(t) = \mathbf{s}(t)\,e^{j\Phi_u(t)}$, where $\Phi_u(t)$
encompasses the collective phase shifts introduced by the RIS on UAV $u$.
The incident echo signal's azimuth angle $\theta_w$ and elevation angle
$\phi_w$ are critical for three-dimensional (3D) DoA estimation.
Incorporating the mobility of the vehicle-based ground radar, UAVs, and
UFOs, the Doppler effect is integrated into the signal model. The
received echo signal is written as:
$$\mathbf{s}_{\mathrm{echo}}(t) = \sum_{u=1}^{U} \sum_{w=1}^{W} \mathbf{G}_{vu}(t)\,\mathbf{h}_{uw}(t)\,\mathbf{a}_u(\theta_w, \phi_w)(t)\,\boldsymbol{\Phi}^{u}_{\mathrm{RIS}}(t)\,\mathbf{x}_w\big(t - \boldsymbol{\tau}_{uw}(t)\big)\,e^{j 2\pi f_{d_{uw}} t} + \boldsymbol{\eta}_v(t), \tag{1}$$
where $\mathbf{G}_{vu}(t) \in \mathbb{C}^{M \times N}$ is the channel gain
matrix from the $u$-th UAV to the ground radar $v$ with $M$ antennas,
$\mathbf{h}_{uw}(t) \in \mathbb{C}^{N \times 1}$ represents the channel
gain vector from the $w$-th UFO to the RIS elements on the $u$-th UAV,
$\mathbf{a}_u(\theta_w, \phi_w)(t) \in \mathbb{C}^{N \times 1}$ is the
steering vector of the RIS incorporating the azimuth and elevation
angles, and $\boldsymbol{\Phi}^{u}_{\mathrm{RIS}}(t) \in \mathbb{C}^{N \times N}$
is the RIS phase shift matrix on UAV $u$. For a given measurement
instance $t$ during the operational phase $\tau T$, the phase shift
matrix is represented as:
$$\boldsymbol{\Phi}^{u}_{\mathrm{RIS}}(t) = \mathrm{diag}\big(e^{j\varphi_1}, e^{j\varphi_2}, \ldots, e^{j\varphi_N}\big), \tag{2}$$
where $\mathrm{diag}(\cdot)$ denotes a diagonal matrix with $N$ elements,
$e^{j\varphi_n}$ corresponds to the phase shift induced by the $n$-th
element of the RIS on UAV $u$, and $\varphi_n$ is the phase shift value
for the $n$-th RIS element at time $t$. The term $\mathbf{x}_w(t)$
signifies the phase-shifted radar signal incident on UFO $w$,
$\boldsymbol{\tau}_{uw}(t)$ is the vector of propagation delays from the
RIS elements on UAV $u$ to UFO $w$, $f_{d_{uw}}$ denotes the Doppler
frequency shift for the link between UAV $u$ and UFO $w$, and
$\boldsymbol{\eta}_v(t) \in \mathbb{C}^{M \times 1}$ is the noise vector
at the radar.
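As a rough illustration of the 2-bit phase control behind (2), the sketch below quantizes arbitrary phases to four levels and builds the diagonal phase shift matrix. The uniform level set {0, π/2, π, 3π/2} and the helper names `quantize_phase` and `ris_phase_matrix` are assumptions made for this example; the paper does not specify the quantization grid.

```python
import cmath
import math

# Assumption: a 2-bit RIS element selects one of four uniformly spaced phases.
LEVELS = [k * math.pi / 2 for k in range(4)]

def quantize_phase(phi):
    """Map an arbitrary phase (radians) to the nearest assumed 2-bit level."""
    phi = phi % (2 * math.pi)
    idx = round(phi / (math.pi / 2)) % 4  # wrap 2*pi back to level 0
    return LEVELS[idx]

def ris_phase_matrix(phases):
    """Build the N x N diagonal matrix diag(e^{j*phi_1}, ..., e^{j*phi_N})."""
    n = len(phases)
    return [
        [cmath.exp(1j * quantize_phase(phases[r])) if r == c else 0j
         for c in range(n)]
        for r in range(n)
    ]

phi_ris = ris_phase_matrix([0.1, 1.5, 3.0])
```

Every diagonal entry has unit modulus, which reflects the passive nature of the RIS: it changes the phase of the signal but does not amplify it.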
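TABLE I: A comprehensive symbol table for the proposed work.
Symbol | Description | Symbol | Description | Symbol | Description
$c$ | Speed of light | $Z$ | Number of measurements | $d$ | Inter-element spacing
$f_c$ | Carrier frequency | $A$ | Amplitude of the scanning signal | $\psi$ | Phase of the scanning signal
$M$ | Number of antennas in vehicle | $\kappa$ | Rician factor | $\lambda$ | Wavelength
$\mathbf{x}_v$ | Position of vehicle | $\mathbf{x}_u$ | Position of UAV | $\mathbf{X}_l$ | Position of satellite
$\mathbf{v}_v$ | Velocity vector of vehicle | $\mathbf{v}_u$ | Velocity vector of UAV | $\mathbf{V}_l$ | Velocity vector of satellite
$\mathbf{w}_v$ | Waypoint vector of vehicle | $\mathbf{w}_u$ | Waypoint vector of UAV | $\theta_v$ | Heading direction of vehicle
$\theta_u$ | Horizontal direction of UAV | $\theta_l$ | Horizontal direction of satellite | $\phi_u$ | Vertical inclination of UAV
$\theta_w$ | Azimuth angle from UFO to UAV | $\phi_w$ | Elevation angle from UFO to UAV | $\phi_{i,uw}$ | Phase shift for NLoS
$\Delta t$ | Time interval | $\beta_1$ | Proportion for local processing | $\beta_2$ | Proportion for UAV caching
$\beta_3$ | Proportion for satellite offloading | $T$ | Total time duration | $f_l$ | CPU capability of satellite
$f_v$ | CPU capability of vehicle | $T_{\mathrm{Lat}}$ | Total processing latency | $\alpha_{i,uw}$ | Amplitude gain for NLoS
$s$ | Scanning signal | $\epsilon_{\max}$ | Max error norm | $f_D^{vum}$ | Doppler shift from vehicle to UAV
$\mathrm{PL}$ | Path loss | $\mathrm{PL}_0$ | Reference path loss | $X_\sigma$ | Shadow fading
$\mathbf{R}_n$ | Noise covariance matrix | $T^{vj}_{com}$ | Computation latency | $T^{vj}_{tra}$ | Transmission latency
$\Gamma^u_v$ | SINR vehicle to UAV | $\Gamma^l_v$ | SINR vehicle to satellite | $\upsilon_{vl}$ | Channel dispersion
$\upsilon_{vu}$ | Channel dispersion vehicle to UAV | $\varsigma_{vu}$ | Codeword length vehicle to UAV | $\varsigma_{vl}$ | Codeword length vehicle to satellite
$\xi_{vu}$ | QoS for URLLC vehicle to UAV | $\xi_{vl}$ | QoS for URLLC vehicle to satellite | $\mathbf{h}_{vu}$ | Channel vector vehicle to UAV
$\mathbf{h}_{vl}$ | Channel vector vehicle to satellite | $\mathbf{h}^{\mathrm{LoS}}_{vu}$ | LoS channel vector vehicle to UAV | $\mathbf{h}^{\mathrm{NLoS}}_{vu}$ | NLoS channel from vehicle to UAV
$\mathbf{h}^{\mathrm{LoS}}_{vl}$ | LoS channel from vehicle to satellite | $\mathbf{h}^{\mathrm{NLoS}}_{vl}$ | NLoS channel | $\mathbf{e}_{vu}$ | Channel estimation error
$\mathbf{e}_{vl}$ | Channel estimation error in satellite | $D_v$ | Task of vehicle | $\Omega_{vj}$ | Sub-task data size
$\mathcal{V}$ | Set of vehicles | $\mathcal{U}$ | Set of UAVs | $\mathcal{W}$ | Set of UFOs
$\mathcal{N}$ | Set of RIS elements | $\mathcal{B}$ | Set of base stations | $\mathcal{J}$ | Set of sub-tasks
$I(\theta, \phi)$ | Fisher information matrix | $\mathbf{G}_{vu}$ | Channel gain UAV to radar | $\mathbf{R}_{s_{\mathrm{echo}}}$ | Covariance matrix of the echo signal
$\boldsymbol{\Phi}^u_{\mathrm{RIS}}$ | RIS phase shift matrix | $\mathbf{x}_w$ | Phase-shifted radar signal | $\mathbf{a}_u$ | Steering vector of RIS
$\varphi_n$ | Phase shift induced by RIS element | $\boldsymbol{\tau}_{uw}$ | Propagation delays | $f_{d_{uw}}$ | Doppler shift
$\delta_\theta$ | Adjustment factor for orbital | $\theta_{vum}$ | AoD from vehicle to UAV | $\mathbf{W}$ | Beamforming matrix
$P_v$ | Transmission power of vehicle | $B$ | Communication bandwidth | $c_{vj}$ | Computational complexity
$\Theta_v$ | Heading update for vehicle | $\Theta_u$ | Desired horizontal angle for UAV | $\Phi_u$ | Desired vertical angle for UAV
$\mathcal{S}_{\mathrm{oa}}$ | Whole operational state space | $\mathbf{s}_{\mathrm{oa}}$ | Temporal operational state | $\mathcal{A}_{\mathrm{oa}}$ | Operational action space
$\mathbf{a}_{\mathrm{oa}}$ | Operational action | $\mathbf{s}_{\mathrm{DoA}}$ | DoA estimation state | $\mathbf{s}_{\mathrm{TASK}}$ | Task offloading state
$\mathbf{a}_{\mathrm{DoA}}$ | DoA estimation action | $\mathbf{a}_{\mathrm{TASK}}$ | Task offloading action | $r_{\mathrm{oa}}$ | Reward
$\tau$ | Time step | $\mathcal{E}$ | Quantum environment | $\mathbf{s}^{Q}_{\mathrm{oa}}$ | Quantum-encoded operational state
$\mathbf{a}_{\mathrm{opt}}$ | Optimal action | $R_{(\theta_w, \phi_w)}$ | DoA estimation decision | $\boldsymbol{\Phi}_{\mathrm{RIS}}$ | RIS configuration
$\widetilde{R}_{vu}$ | Communication rate from $v$ to $u$ | $\widetilde{R}_{vl}$ | Communication rate from $v$ to $l$ | $\lambda_{\mathrm{D}}$ | Penalty coefficient for DoA
$C_{\mathrm{D}}$ | Cost function for DoA estimation | $\lambda_{\mathrm{T}}$ | Penalty term for task offloading | $C_{\mathrm{T}}$ | Cost function for task offloading
$\gamma$ | Learning rate | $\mathbf{a}$ | Action vector | $\mathbb{E}$ | Expectation operator
$|\psi\rangle$ | Quantum state | $U$ | Unitary operation | $M$ | Measurement operator
$Q$ | Quantum circuit | $\varrho_{\mathrm{oa}}$ | Quantum parameter | $\alpha$ | Learning rate
$\theta^{\pi}_{\mathrm{oa}}$ | Actor network parameter | $\theta^{Q}_{\mathrm{oa}}$ | Critic network parameter | $\mathcal{D}$ | Replay buffer
$\delta_{\mathrm{D}}$ | Temporal difference error | $V_{\theta^{V}_{\mathrm{oa}}}$ | Value function | $D_{\mathrm{KL}}$ | Kullback-Leibler divergence
$\mathcal{L}$ | Loss function | $\beta$ | Balancing coefficient | $\pi$ | Policy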
1) Steering Vector Formulation: Consider a uniform linear
array of RIS elements with inter-element spacing $d$. For
an incident signal with wavelength $\lambda$, the steering vector
$\mathbf{a}_u(\theta_w, \phi_w)(t)$ for the $u$-th UAV is expressed as a
function of the azimuth angle $\theta_w$ and elevation angle $\phi_w$. The
phase shift at the $n$-th RIS element, due to a signal arriving from
direction $(\theta_w, \phi_w)$, is given by
$\frac{2\pi}{\lambda} n d \sin(\theta_w)\sin(\phi_w)$. Therefore,
the steering vector is formulated as:
$$\mathbf{a}_u(\theta_w, \phi_w)(t) = \Big[1,\; e^{j\frac{2\pi}{\lambda} d \sin(\theta_w(t))\sin(\phi_w(t))},\; \ldots,\; e^{j(N-1)\frac{2\pi}{\lambda} d \sin(\theta_w(t))\sin(\phi_w(t))}\Big]^{\mathsf{T}}, \tag{3}$$
where $[\cdot]^{\mathsf{T}}$ indicates the transpose operation.
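The steering vector in (3) can be evaluated numerically as in the following minimal sketch. The function name `steering_vector` and the example values (8 elements, half-wavelength spacing, 30° azimuth, 60° elevation) are illustrative assumptions, not parameters taken from the paper.

```python
import cmath
import math

def steering_vector(n_elements, spacing, wavelength, azimuth, elevation):
    """Return the N-element ULA steering vector of Eq. (3) as a list of complex numbers.

    Angles are in radians; spacing and wavelength share the same length unit.
    """
    # Phase increment between adjacent elements:
    # (2*pi/lambda) * d * sin(theta) * sin(phi)
    phase_step = (2.0 * math.pi / wavelength) * spacing \
        * math.sin(azimuth) * math.sin(elevation)
    # The n-th entry is exp(j * n * phase_step), n = 0, ..., N-1
    return [cmath.exp(1j * n * phase_step) for n in range(n_elements)]

# Example values (assumed): half-wavelength spacing with wavelength normalized to 1.
a = steering_vector(n_elements=8, spacing=0.5, wavelength=1.0,
                    azimuth=math.radians(30.0), elevation=math.radians(60.0))
```

Because consecutive entries differ by a constant phase factor, the DoA information is carried entirely by that per-element phase increment.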
2) Imperfect Channel and Pathloss Model: Channel imperfections,
such as multipath fading, shadowing, and environmental
obstructions, along with imperfect channel state information
(CSI), can significantly distort the transmitted signal. The
channels from UAVs to the ground radar ($\mathbf{G}_{vu}(t)$) and from UFOs
to UAVs ($\mathbf{h}_{uw}(t)$) are modelled using a Rician fading channel.
This model includes LoS and non-LoS (NLoS) components and
is expressed as:
$$\mathbf{h}_{uw}(t) = \sqrt{\frac{K}{K+1}}\,\mathbf{h}^{\mathrm{LoS}}_{uw}(t) + \sqrt{\frac{1}{K+1}}\,\mathbf{h}^{\mathrm{NLoS}}_{uw}(t), \tag{4}$$
where $K$ is the Rician factor. The LoS component is modelled
as:
$$\mathbf{h}^{\mathrm{LoS}}_{uw}(t) = \frac{\lambda}{4\pi d_{uw}(t)}\,e^{-j\frac{2\pi}{\lambda} d_{uw}(t)}, \tag{5}$$
and the NLoS component is modelled as [26]:
$$\mathbf{h}^{\mathrm{NLoS}}_{uw}(t) = \sum_{i=1}^{N} \alpha_{i,uw}(t)\,e^{-j\phi_{i,uw}(t)}, \tag{6}$$
where $\alpha_{i,uw}$ and $\phi_{i,uw}$ are the amplitude gains and the phase
shifts for each NLoS multipath component, respectively. The
channel between UAVs and the ground radar, $\mathbf{G}_{vu}(t)$, follows
a similar Rician model to $\mathbf{h}_{uw}(t)$ as given in (4). The path loss
for both UAV-ground radar and UAV-UFO links is expressed as:
$$\mathrm{PL}(d) = \mathrm{PL}_0 + 10\gamma \log_{10}\!\left(\frac{d}{d_0}\right) + X_\sigma, \tag{7}$$
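where $\mathrm{PL}_0$ is the reference path loss at the reference distance
$d_0$, $\gamma$ is the path loss exponent, and $X_\sigma$ is the shadowing
component, which is a Gaussian random variable with zero mean and
standard deviation $\sigma$.
For imperfect CSI, we model the actual channel $\tilde{\mathbf{h}}_{uw}(t)$ as the
sum of the estimated channel and an error term ($\mathbf{e}_{uw}$):
$$\tilde{\mathbf{h}}_{uw}(t) = \mathbf{h}_{uw}(t) + \mathbf{e}_{uw}(t), \tag{8}$$
with the error norm bounded by $\epsilon_{\max}$:
$$0 < \|\mathbf{e}_{uw}(t)\| \le \epsilon_{\max}. \tag{9}$$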
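To make the channel model of (4)-(9) concrete, the following sketch evaluates a scalar Rician channel coefficient, the log-distance path loss, and a bounded CSI error. It assumes the common square-root weighting of the Rician factor in (4) and treats the path loss in dB; all function names and numeric values are hypothetical choices for illustration.

```python
import cmath
import math
import random

def los_component(distance, wavelength):
    """LoS term of Eq. (5): free-space amplitude with a distance-dependent phase."""
    return wavelength / (4 * math.pi * distance) \
        * cmath.exp(-1j * 2 * math.pi * distance / wavelength)

def nlos_component(gains, phases):
    """NLoS term of Eq. (6): sum over multipath components."""
    return sum(a * cmath.exp(-1j * p) for a, p in zip(gains, phases))

def rician_channel(k_factor, h_los, h_nlos):
    """Eq. (4), assuming square-root weights on the LoS and NLoS parts."""
    return math.sqrt(k_factor / (k_factor + 1)) * h_los \
        + math.sqrt(1 / (k_factor + 1)) * h_nlos

def path_loss_db(distance, ref_distance, ref_loss_db, exponent, shadow_sigma_db, rng):
    """Eq. (7): log-distance path loss with Gaussian shadowing (dB, assumed)."""
    shadowing = rng.gauss(0.0, shadow_sigma_db)
    return ref_loss_db + 10 * exponent * math.log10(distance / ref_distance) + shadowing

def perturb(h, eps_max, rng):
    """Eqs. (8)-(9): add an error with random phase and norm in (0, eps_max]."""
    magnitude = eps_max * (1.0 - rng.random())  # rng.random() is in [0, 1)
    return h + cmath.rect(magnitude, rng.uniform(0.0, 2 * math.pi))

rng = random.Random(0)
h_los = los_component(distance=100.0, wavelength=0.1)
h_nlos = nlos_component(gains=[1e-5, 2e-5], phases=[0.3, 1.2])
h = rician_channel(k_factor=10.0, h_los=h_los, h_nlos=h_nlos)
h_actual = perturb(h, eps_max=1e-6, rng=rng)
```

With a large Rician factor the coefficient is dominated by the LoS term, which is the regime one would expect for an unobstructed UAV-to-UFO link.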
3) Estimation with Temporally Distributed Measurements: In
our DoA estimation process, the echo signal secho (𝑡)received
during the 𝜏𝑇 phase is systematically measured 𝑍times. These
measurements, spaced at regular intervals over the entire dura-
tion of the 𝜏𝑇 phase, enable a time-distributed capture of the
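signal dynamics. Thus the sample duration can be written as
$t_z = \frac{\tau T}{Z}$ with $t_z < \tau T < T$. This measurement strategy is
crucial to accurately characterizing the signal reflections from the
UFOs, considering their potential movement and environmental
variations. The aggregated echo signal (using $Z$ measurements)
during the $\tau T$ phase is represented as follows:
$$\mathbf{s}_{\mathrm{echo}}(t) = \sum_{z=1}^{Z} \sum_{u=1}^{U} \sum_{w=1}^{W} \mathbf{G}_{vu}(t_z)\,\mathbf{h}_{uw}(t_z)\,\mathbf{a}_u(\theta_w, \phi_w)(t_z)\,\boldsymbol{\Phi}^{u}_{\mathrm{RIS}}(t_z)\,\mathbf{x}_w\big(t_z - \boldsymbol{\tau}_{uw}(t_z)\big)\,e^{j 2\pi f_{d_{uw}} t_z} + \boldsymbol{\eta}_v(t_z). \tag{10}$$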
B. Vehicular Communication Model with UAVs and Satellites
In our urban-focused vehicular network system, a MISO uplink
communication model is implemented. Each vehicle $v \in \mathcal{V}$
is equipped with $M$ transmitting antennas, indexed by
$\mathcal{M} = \{1, 2, \ldots, M\}$, and is complemented by a fleet of UAVs
and a satellite, each having a single receiving antenna.
1) Mobility Model: Our proposed model extends the classical
random waypoint model [27] incorporating advanced dynamics
to simulate the movement of three types of entities: ground ve-
hicles, UAVs, and satellites. The entities are initially distributed
randomly within a 3D space bounded by X ∈ (−𝑋min,+𝑋max ),
Y ∈ (−𝑌min,+𝑌max), and Z ∈ (−𝑍min,+𝑍max ). Their positions
are considered static within the interval [Δ𝑡 ,𝑡 −1].
The mobility model for ground vehicles integrates urban
mobility factors, considering the complex movement patterns
in city environments. The position of vehicle 𝑣at time slot 𝑡,
denoted as x𝑣(𝑡)=[𝑥𝑣(𝑡), 𝑦𝑣(𝑡)]T, is determined by its velocity
vector v𝑣(𝑡)=[𝑣𝑥, 𝑣𝑦]and heading direction 𝜃𝑣(𝑡). The velocity
and directional updates for ground vehicles at each time slot are
given by:
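$$\mathbf{v}_v(t) = \lambda\,\mathbf{v}_v(t-1) + (1-\lambda)\,\mathbf{w}_v, \tag{11}$$
$$\theta_v(t) = \mu\,\theta_v(t-1) + (1-\mu)\,\Theta_v, \tag{12}$$
where $\lambda$ and $\mu$ are momentum and adaptation factors in the range
$(0.1, 0.2)$, $\mathbf{w}_v$ is the waypoint vector, and $\Theta_v$ denotes
the heading direction. To translate the velocity and direction into
positional changes, we incorporate trigonometric functions as:
$$x_v(t) = x_v(t-1) + \Delta t\,\mathbf{v}_v(t)\cos(\theta_v(t)), \tag{13}$$
$$y_v(t) = y_v(t-1) + \Delta t\,\mathbf{v}_v(t)\sin(\theta_v(t)), \tag{14}$$
where $\Delta t$ is the time interval between updates.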
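One time-slot update of the ground vehicle mobility model in (11)-(14) could be sketched as follows. The function name `update_vehicle` is hypothetical, and because (13)-(14) multiply a velocity vector by a scalar cosine or sine, this sketch assumes that the speed entering those equations is the Euclidean norm of the updated velocity vector.

```python
import math

def update_vehicle(pos, vel, heading, waypoint, target_heading, lam, mu, dt):
    """Advance one vehicle by one time slot following Eqs. (11)-(14).

    pos, vel, waypoint are (x, y) tuples; headings are in radians.
    """
    # Eq. (11): blend previous velocity with the waypoint vector
    vel_new = tuple(lam * v + (1 - lam) * w for v, w in zip(vel, waypoint))
    # Eq. (12): blend previous heading with the target heading
    heading_new = mu * heading + (1 - mu) * target_heading
    # Assumption: scalar speed is the norm of the velocity vector
    speed = math.hypot(*vel_new)
    # Eqs. (13)-(14): project the displacement onto the x and y axes
    x = pos[0] + dt * speed * math.cos(heading_new)
    y = pos[1] + dt * speed * math.sin(heading_new)
    return (x, y), vel_new, heading_new

new_pos, new_vel, new_heading = update_vehicle(
    pos=(0.0, 0.0), vel=(10.0, 0.0), heading=0.0,
    waypoint=(20.0, 0.0), target_heading=0.0,
    lam=0.15, mu=0.15, dt=1.0)
```

With factors in the stated range (0.1, 0.2), most of the weight falls on the waypoint and target heading, so the vehicle converges quickly toward its commanded motion.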
The mobility model for UAVs accounts for three-dimensional
space, reflecting their operational dynamics comprehensively.
The model includes horizontal movements, altitude changes,
and responses to atmospheric fluctuations. The UAV’s position
at time slot 𝑡, denoted as x𝑢(𝑡)=[𝑥𝑢(𝑡), 𝑦𝑢(𝑡), 𝑧𝑢(𝑡)]T, is
influenced by its velocity vector v𝑢(𝑡)and heading angles
(𝜃𝑢(𝑡), 𝜙𝑢(𝑡)), where 𝜃𝑢(𝑡)represents the horizontal direction
and 𝜙𝑢(𝑡)indicates the vertical inclination. The velocity and
direction of the UAV at each time slot are updated as follows:
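$$\mathbf{v}_u(t) = \alpha_u\,\mathbf{v}_u(t-1) + (1-\alpha_u)\,\mathbf{w}_u + \boldsymbol{\vartheta}_v, \tag{15}$$
$$\theta_u(t) = \beta_u\,\theta_u(t-1) + (1-\beta_u)\,\Theta_u + \vartheta_\theta, \tag{16}$$
$$\phi_u(t) = \gamma_u\,\phi_u(t-1) + (1-\gamma_u)\,\Phi_u + \vartheta_\phi, \tag{17}$$
where $\alpha_u, \beta_u, \gamma_u$ are persistence factors, $\mathbf{w}_u$ is the
waypoint vector for velocity, $\Theta_u$ and $\Phi_u$ are the desired
horizontal and vertical angles, and $\boldsymbol{\vartheta}_v, \vartheta_\theta, \vartheta_\phi$
are random perturbation terms. To translate these velocities and angles
into the UAV's positional changes, we use trigonometric functions:
$$x_u(t) = x_u(t-1) + \Delta t\,\mathbf{v}_u(t)\cos(\theta_u(t))\cos(\phi_u(t)), \tag{18}$$
$$y_u(t) = y_u(t-1) + \Delta t\,\mathbf{v}_u(t)\sin(\theta_u(t))\cos(\phi_u(t)), \tag{19}$$
$$z_u(t) = z_u(t-1) + \Delta t\,\mathbf{v}_u(t)\sin(\phi_u(t)), \tag{20}$$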
where Δ𝑡is the time interval between updates. These equations
allow for a precise and realistic representation of the UAV’s
trajectory in 3D space, accommodating complex flight patterns
and environmental influences.
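The UAV update in (15)-(20) follows the same pattern in three dimensions. The sketch below is illustrative only: the function name `update_uav` is hypothetical, the perturbation terms are assumed to be zero-mean Gaussian with a common standard deviation `noise_std`, and the scalar speed is assumed to be the norm of the velocity vector.

```python
import math
import random

def update_uav(pos, vel, theta, phi, waypoint, theta_des, phi_des,
               alpha, beta, gamma, dt, noise_std=0.0, rng=None):
    """Advance one UAV by one time slot following Eqs. (15)-(20)."""
    rng = rng or random.Random()
    # Eq. (15): velocity persistence plus per-component perturbation (assumed Gaussian)
    vel_new = tuple(alpha * v + (1 - alpha) * w + rng.gauss(0.0, noise_std)
                    for v, w in zip(vel, waypoint))
    # Eqs. (16)-(17): horizontal and vertical heading updates
    theta_new = beta * theta + (1 - beta) * theta_des + rng.gauss(0.0, noise_std)
    phi_new = gamma * phi + (1 - gamma) * phi_des + rng.gauss(0.0, noise_std)
    # Assumption: scalar speed is the norm of the velocity vector
    speed = math.sqrt(sum(c * c for c in vel_new))
    # Eqs. (18)-(20): spherical-to-Cartesian displacement
    x = pos[0] + dt * speed * math.cos(theta_new) * math.cos(phi_new)
    y = pos[1] + dt * speed * math.sin(theta_new) * math.cos(phi_new)
    z = pos[2] + dt * speed * math.sin(phi_new)
    return (x, y, z), vel_new, theta_new, phi_new

# Noise-free example: the UAV is commanded to climb vertically.
uav_pos, uav_vel, uav_theta, uav_phi = update_uav(
    pos=(0.0, 0.0, 100.0), vel=(10.0, 0.0, 0.0), theta=0.0, phi=0.0,
    waypoint=(10.0, 0.0, 0.0), theta_des=0.0, phi_des=math.pi / 2,
    alpha=0.5, beta=0.5, gamma=0.0, dt=1.0, noise_std=0.0,
    rng=random.Random(0))
```

Setting `gamma=0.0` removes all persistence in the vertical angle, so the UAV adopts the desired inclination immediately; larger persistence factors yield smoother trajectories.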
The satellite's position at time slot $t$, represented as
$\mathbf{X}_l(t) = [x_l(t), y_l(t), z_l(t)]^{\mathsf{T}}$, is characterized by a
constant altitude (z-coordinate) due to its fixed low Earth orbit. The
updates for the horizontal movement of the satellite can be represented as:
$$\mathbf{V}_l(t) = \mathbf{V}_l(t-1), \tag{21}$$
$$\theta_l(t) = \theta_l(t-1) + \delta_\theta\,\Delta t\,\theta_l, \tag{22}$$
where $\delta_\theta$ is an adjustment factor for orbital movements. Similar
to the ground vehicle positioning model, the new satellite coordinates
$\mathbf{X}_l(t) = [x_l(t), y_l(t), z_l(t)]^{\mathsf{T}}$ at each time slot are
calculated while keeping the z-coordinate constant.
2) Channel Model: The Rician channel model is employed
to describe the LoS and NLoS propagation components, taking
into account the angle-of-departure (AoD) and Doppler shift
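phenomena. The LoS channel vector for the vehicle-to-UAV link from
vehicle $v \in \mathcal{V}$ to UAV $u \in \mathcal{U}$, denoted as
$\mathbf{h}^{\mathrm{LoS}}_{vu}(t)$, is modeled for each transmitting antenna as:
$$\mathbf{h}^{\mathrm{LoS}}_{vu}(t) = \left[\frac{\lambda}{4\pi d_{vu}}\,e^{-j\frac{2\pi d_{vu}}{\lambda}}\,e^{-j 2\pi f_c \frac{v_{vu}}{c}\cos(\theta_{vu1})\,t},\; \ldots,\; \frac{\lambda}{4\pi d_{vu}}\,e^{-j\frac{2\pi d_{vu}}{\lambda}}\,e^{-j 2\pi f_c \frac{v_{vu}}{c}\cos(\theta_{vuM})\,t}\right]^{\mathsf{T}}, \tag{23}$$
where $\lambda$ is the wavelength, $d_{vu}$ the distance, $v_{vu}$ the relative
velocity, $c$ the speed of light, and $\theta_{vum}$ the AoD for the $m$-th
antenna. The Doppler shift for each antenna is expressed as:
$$f_D^{vum} = \frac{v_{vu}\,f_c\cos(\theta_{vum})}{c}. \tag{24}$$
An analogous vector formulation applies to the link from the $v$-th
vehicle to the $l$-th satellite ($v \to l$), $\mathbf{h}^{\mathrm{LoS}}_{vl}(t)$,
incorporating the parameters $d_{vl}$, $v_{vl}$, and $\theta_{vlm}$ for each
transmitting antenna.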
In our MISO communication framework, the aggregate channel
gain vector from the $v$-th vehicle to the $u$-th UAV is derived
using the Rician model and expressed as:
$$\mathbf{h}_{vu}(t) = PL_{vu}\left(\sqrt{\frac{\kappa}{\kappa+1}}\,\mathbf{h}^{\mathrm{LoS}}_{vu}(t) + \sqrt{\frac{1}{\kappa+1}}\,\mathbf{h}^{\mathrm{NLoS}}_{vu}(t)\right), \tag{25}$$
where $\mathbf{h}^{\mathrm{LoS}}_{vu}(t)$ and $\mathbf{h}^{\mathrm{NLoS}}_{vu}(t)$
represent the LoS and NLoS components, respectively, for the
vehicle-to-UAV link. The symbol $PL_{vu}$ represents the path loss model
defined in (27). The channel vector $\mathbf{h}^{\mathrm{NLoS}}_{vu}(t)$ for the
vehicle-to-UAV link is modeled as a complex Gaussian random vector,
represented as:
$$\mathbf{h}^{\mathrm{NLoS}}_{vu}(t) = \big[h^{\mathrm{nlos}}_{vu1}(t),\, h^{\mathrm{nlos}}_{vu2}(t),\, \ldots,\, h^{\mathrm{nlos}}_{vuM}(t)\big]^{\mathsf{T}}, \tag{26}$$
where each component $h^{\mathrm{nlos}}_{vum}(t)$, $m = 1, 2, \ldots, M$, is
modeled as a complex Gaussian random variable,
$h^{\mathrm{nlos}}_{vum}(t) \sim \mathcal{CN}(0, \sigma^2_{\mathrm{nlos}})$,
where $\mathcal{CN}$ represents the complex normal distribution with zero
mean and variance $\sigma^2_{\mathrm{nlos}}$, encapsulating the NLoS
propagation characteristics. The expression for the vehicle-to-satellite
link $\mathbf{h}_{vl}(t)$ follows a similar structure, accounting for the
respective parameters of the satellite link. The path loss for the
vehicle-to-UAV link is modeled as follows:
$$PL_{vu} = \left(\frac{4\pi d_{vu}}{\lambda}\right)^{2}. \tag{27}$$
The path loss model for the vehicle-to-satellite link $PL_{vl}$ adheres
to a similar formulation, with $d_{vl}$ indicating the distance to the
satellite.
3) Imperfect CSI and SINR Calculation: Considering imperfect CSI, the estimated channel vectors $\hat{\mathbf{h}}_{vu}$ and $\hat{\mathbf{h}}_{vl}$ are:
$$\hat{\mathbf{h}}_{vu} = \mathbf{h}_{vu}(t) + \mathbf{e}_{vu}, \qquad \hat{\mathbf{h}}_{vl} = \mathbf{h}_{vl}(t) + \mathbf{e}_{vl}, \tag{28}$$
where $\mathbf{e}_{vu}$ and $\mathbf{e}_{vl}$ are the estimation error vectors, each element of which is a complex Gaussian random variable. The signal-to-interference-plus-noise ratio (SINR) of vehicle $v$ communicating with UAV $u$ and satellite $l$ under non-orthogonal multiple access (NOMA) is derived as:
$$\Gamma^{u}_{v}(t) = \frac{P_v \,\|\hat{\mathbf{h}}_{vu}(t)\|^2}{\sum_{k=v+1}^{V} P_k \,\|\hat{\mathbf{h}}_{ku}(t)\|^2 + \sigma^2_{n_u}}, \tag{29}$$
$$\Gamma^{l}_{v}(t) = \frac{P_v \,\|\hat{\mathbf{h}}_{vl}(t)\|^2}{\sum_{k=v+1}^{V} P_k \,\|\hat{\mathbf{h}}_{kl}(t)\|^2 + \sigma^2_{n_l}}, \tag{30}$$
where $\|\hat{\mathbf{h}}_{vu}\|$ and $\|\hat{\mathbf{h}}_{vl}\|$ denote the norms of the estimated channel vectors, capturing the combined effect of all transmitting antennas of the vehicle. In this NOMA setup, successive interference cancellation is employed for decoding.
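The SINR expressions (29) and (30) can be sketched in a few lines. The snippet below is an illustrative sketch only: it assumes the vehicles are already indexed in the successive interference cancellation order, so that vehicle $v$ sees interference only from vehicles $k > v$, and the function names and numbers are hypothetical.

```python
def squared_norm(channel):
    """Return ||h||^2 for a channel vector given as a list of complex gains."""
    return sum(abs(g) ** 2 for g in channel)


def noma_sinr(powers, channels, noise_var):
    """SINR of every vehicle on one link, following the form of (29)/(30).

    powers[v]   : transmit power P_v
    channels[v] : estimated channel vector of vehicle v (list of complex)
    noise_var   : receiver noise variance
    Assumption: index order equals the SIC decoding order.
    """
    gains = [squared_norm(h) for h in channels]
    sinr = []
    for v in range(len(powers)):
        # Residual interference comes only from vehicles decoded later.
        interference = sum(powers[k] * gains[k] for k in range(v + 1, len(powers)))
        sinr.append(powers[v] * gains[v] / (interference + noise_var))
    return sinr


# Hypothetical two-vehicle example with two transmit antennas each.
example = noma_sinr(
    powers=[1.0, 1.0],
    channels=[[2.0 + 0.0j, 0.0 + 0.0j], [1.0 + 0.0j, 1.0 + 0.0j]],
    noise_var=1.0,
)
```

The last vehicle in the decoding order experiences no intra-cell interference, so its SINR reduces to a plain SNR.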
4) Data Rate Calculation under URLLC Requirements: The data rate of vehicle $v$ communicating with UAV $u$ or satellite $l$ at time $t$, considering URLLC requirements and imperfect CSI, is expressed as follows.
For the vehicle-to-UAV link:
$$\mathcal{R}_{vu}(t) = B\left(\log_2\!\left(1 + \Gamma^{u}_{v}(t)\right) - \frac{\upsilon_{vu}(\Gamma^{u}_{v})(t)}{\varsigma_{vu}(t)}\,\xi_{vu}(t)\right), \tag{31}$$
where $B$ represents the bandwidth of the communication channel. The term $\varsigma_{vu}(t)$ signifies the codeword (block) length, and $\xi_{vu}(t)$ is a QoS parameter that adjusts the data rate to meet the URLLC reliability requirement, defined as:
$$\xi_{vu}(t) = \frac{Q^{-1}(\nu_{vu})}{\log_e 2}, \tag{32}$$
where $Q^{-1}(\nu_{vu})$ is the inverse of the Q-function evaluated at the packet error rate $\nu_{vu}$, which accounts for the decoding error probability. The channel dispersion $\upsilon_{vu}(\Gamma^{u}_{v})(t)$ is given by:
$$\upsilon_{vu}(\Gamma^{u}_{v})(t) = 1 - \left(1 + \Gamma^{u}_{v}(t)\right)^{-2}. \tag{33}$$
Similarly, for the vehicle-to-satellite link:
$$\mathcal{R}_{vl}(t) = B\left(\log_2\!\left(1 + \Gamma^{l}_{v}(t)\right) - \frac{\upsilon_{vl}(\Gamma^{l}_{v})(t)}{\varsigma_{vl}(t)}\,\xi_{vl}(t)\right). \tag{34}$$
C. Partial Task Offloading with UAV Caching
In parallel, the ground vehicles are tasked with receiving
critical information from a network of military sensors deployed
in their vicinity. Processing this cumulative sensory data, coupled
with the radar information, requires substantial computational
resources. Given practical constraints such as cost, maintenance,
and sensitivity, installing high-end processing units in each
ground military vehicle is unfeasible. We propose a hybrid
task offloading mechanism involving UAVs and satellites to
tackle this. Specifically, ground vehicles offload their latency-
sensitive URLLC-enabled computational tasks to the UAVs for
task caching and to a satellite for processing. The task of the $v$-th vehicle, denoted as $D_v$, is decomposed into a series of smaller sub-tasks as follows:
$$D_v = \bigcup_{j=1}^{J} \Omega_{vj}, \tag{35}$$
where $\Omega_{vj}$ signifies the data size of the $j$-th sub-task of the $v$-th vehicle, $\mathcal{J} = \{1, 2, \ldots, J\}$ is the index set of the sub-tasks, and $J$ is the number of sub-tasks partitioning the original task $D_v$.
Each sub-task $\Omega_{vj}$ can be processed locally within the vehicle, offloaded for computation to a proximate node, or cached in the UAVs. The offloading decision is modeled as a binary variable $x^{l}_{vj}$, where $x^{l}_{vj} = 1$ indicates offloading to the satellite and $x^{l}_{vj} = 0$ denotes local processing. The caching decision is modeled as a binary variable $x^{u}_{vj}$, where $x^{u}_{vj} = 1$ indicates caching at the $u$-th UAV and $x^{u}_{vj} = 0$ signifies no caching service. Note that $x^{l}_{vj} + x^{u}_{vj} \le 1$, where $l \in \mathcal{L}$ and $u \in \mathcal{U}$.
In our task offloading strategy, we ensure an equitable distribution of computational tasks across local processing units, UAVs, and satellites by imposing constraints on the offloading proportions. Specifically, we have $\sum_{j=1}^{J} x^{l}_{vj} = \beta_1 J$, $\sum_{j=1}^{J} x^{u}_{vj} = \beta_2 J$, and $\sum_{j=1}^{J} (1 - x^{l}_{vj} - x^{u}_{vj}) = \beta_3 J$, where $\beta_1$, $\beta_2$, and $\beta_3$ denote the proportions of sub-tasks allocated to satellite offloading, UAV caching, and local processing, respectively. To keep these proportions feasible, we impose the boundary conditions $0 < \beta_{\min} \le \beta_i < \beta_{\max} \le 1$, $i = 1, 2, 3$, under the constraint $\beta_1 + \beta_2 + \beta_3 = 1$. This approach yields a balanced task allocation and efficient resource utilization within our system's operational parameters.
The computation latency $T^{vj}_{com}$ of each sub-task $\Omega_{vj} \in D_v$ is determined by the caching decision:
$$T^{vj}_{com}(t) = \begin{cases} 0, & \text{if } x^{u}_{vj} = 1, \\ x^{l}_{vj}\,\dfrac{c_{vj}(t)}{f_l(t)} + (1 - x^{l}_{vj})\,\dfrac{c_{vj}(t)}{f_v(t)}, & \text{otherwise}. \end{cases} \tag{36}$$
Here, $c_{vj}$ denotes the computational complexity of sub-task $\Omega_{vj}$, and $f_l$ and $f_v$ represent the available CPU processing capability of the satellite and the vehicle, respectively, with $f_{\min}(t) < f_i(t) < f_{\max}(t)$ for $i \in \{l, v\}$. The transmission latency $T^{vj}_{tra}$ of offloading $\Omega_{vj}$ depends on the data rate $\mathcal{R}_{vu}(t)$ or $\mathcal{R}_{vl}(t)$, and is expressed as:
$$T^{vj}_{tra}(t) = x^{l}_{vj}\,\frac{\Omega_{vj}}{\mathcal{R}_{vl}(t)} + x^{u}_{vj}\,\frac{\Omega_{vj}}{\mathcal{R}_{vu}(t)}, \quad \forall\, \Omega_{vj} \in D_v. \tag{37}$$
The total latency $T_{Lat}$ of processing $D_v$ is the sum of the per-sub-task transmission and computation latencies:
$$T_{Lat}(t) = \sum_{j=1}^{J} T^{vj}_{Lat}(t), \qquad T^{vj}_{Lat}(t) = T^{vj}_{tra}(t) + T^{vj}_{com}(t), \quad \forall\, \Omega_{vj} \in D_v. \tag{38}$$
We focus on optimizing the task offloading and computa-
tional resource distribution, excluding detailed consideration of
response times due to their minimal impact on the system’s
overall performance, given the disparity in data size between
response payloads and processing tasks.
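The latency model in (36)-(38) can be sketched as follows. This is a minimal illustration under stated assumptions: the `SubTask` container, the function names, and all numerical values are hypothetical, and the rates and CPU frequencies are treated as constants over the slot.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    size_bits: float  # data size Omega_vj
    cycles: float     # computational complexity c_vj (CPU cycles)
    x_l: int          # 1 if offloaded to the satellite, 0 otherwise
    x_u: int          # 1 if cached at the UAV, 0 otherwise


def computation_latency(task, f_l, f_v):
    """Per-sub-task computation latency, following the form of (36)."""
    if task.x_u == 1:
        return 0.0  # a cached sub-task incurs no computation latency
    return task.x_l * task.cycles / f_l + (1 - task.x_l) * task.cycles / f_v


def transmission_latency(task, rate_vl, rate_vu):
    """Per-sub-task transmission latency, following the form of (37)."""
    return task.x_l * task.size_bits / rate_vl + task.x_u * task.size_bits / rate_vu


def total_latency(tasks, f_l, f_v, rate_vl, rate_vu):
    """Sum of transmission and computation latencies over all sub-tasks, as in (38)."""
    total = 0.0
    for task in tasks:
        if task.x_l + task.x_u > 1:
            raise ValueError("a sub-task cannot be both offloaded and cached")
        total += transmission_latency(task, rate_vl, rate_vu)
        total += computation_latency(task, f_l, f_v)
    return total


# Hypothetical example: one offloaded, one cached, one local sub-task.
tasks = [
    SubTask(size_bits=1e6, cycles=1e9, x_l=1, x_u=0),
    SubTask(size_bits=1e6, cycles=1e9, x_l=0, x_u=1),
    SubTask(size_bits=1e6, cycles=1e9, x_l=0, x_u=0),
]
latency = total_latency(tasks, f_l=1e10, f_v=1e9, rate_vl=1e7, rate_vu=2e7)
```

In this example the locally processed sub-task dominates the total latency, which is the situation that the offloading and caching decisions aim to avoid.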
III. PROBLEM FORMULATION
In this integrated sensing and communication system, our
primary objective is to optimize the operational efficiency of the
military surveillance network while ensuring robust and reliable
communication between ground vehicles, UAVs, and satellites.
The problem encompasses several key components: accurate 3D
DoA estimation for detecting UFOs, efficient management of
communication resources, and effective URLLC-enabled task-
offloading strategies to balance the computational load.
A. DoA Estimation and Obtaining Sensing Decision
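The primary sensing objective is to enhance the 3D DoA estimation for UFO detection using the UAVs' RIS-enhanced radar systems. Given the received echo signal $\mathbf{s}_{echo}(t_z)$ for each measurement instance $t_z$ within the $\tau T$ phase, we can define $Z$ measurements using (10) as $\mathbf{s}_{echo}(t) = \mathbf{a}_u(\theta_w, \phi_w)(t)\,\mathbf{s}(t) + \boldsymbol{\eta}_v(t)$, where
$$\mathbf{s}(t) = \sum_{z=1}^{Z}\sum_{u=1}^{U}\sum_{w=1}^{W} \mathbf{G}_{vu}(t_z)\,\mathbf{h}_{uw}(t_z)\,\boldsymbol{\Phi}^{u}_{RIS}(t_z)\,\mathbf{x}_w\!\left(t_z - \boldsymbol{\tau}_{uw}(t_z)\right) e^{j 2\pi f_{d_{uw}} t_z}. \tag{39}$$
To facilitate DoA estimation, the covariance matrix of the echo signal, $\mathbf{R}_{s_{echo}}$, is calculated as:
$$\mathbf{R}_{s_{echo}}(t) = \mathbb{E}\!\left[\mathbf{s}_{echo}(t)\,\mathbf{s}^{H}_{echo}(t)\right] = \mathbf{a}_u(\theta_w, \phi_w)(t)\,\mathbf{R}_s(t)\,\mathbf{a}^{H}_u(\theta_w, \phi_w)(t) + \mathbf{R}_n(t), \tag{40}$$
where $\mathbf{R}_s$ is the covariance matrix of the signal and $\mathbf{R}_n$ is the noise covariance matrix. To refine the accuracy of DoA estimation, we deploy three key methodologies: RMSE, CRLB [6], and SDP. Each method offers unique insights and optimization capabilities for our DoA estimation scenario.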
1) Formulation of DoA Estimation: The RMSE is our initial step in quantifying the accuracy of DoA estimation. It directly measures the average deviation between the estimated and true DoA values. Mathematically, the RMSE is defined as:
$$\mathrm{RMSE} = \sqrt{\mathbb{E}\!\left[(\theta_{true} - \hat{\theta})^2 + (\phi_{true} - \hat{\phi})^2\right]}, \tag{41}$$
where $\theta_{true}$ and $\phi_{true}$ are the true azimuth and elevation angles of the UFOs, while $\hat{\theta}$ and $\hat{\phi}$ are their estimated counterparts. This RMSE metric serves as a practical gauge for the performance of our estimation process.
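Building upon the insights from RMSE, the CRLB provides the lower bound on the variance of any unbiased estimator, which, in our case, relates to the DoA parameters $\theta$ and $\phi$. The probability density function of $\mathbf{s}_{echo}(t)$ is expressed as
$$p(\mathbf{s}_{echo}(t)) = \frac{1}{(2\pi)^{Z/2}\det(\mathbf{R}_{s_{echo}})} \exp\!\left(-\frac{1}{2}\left(\mathbf{s}_{echo}(t) - \mu\right)^{H}\mathbf{R}^{-1}_{s_{echo}}\left(\mathbf{s}_{echo}(t) - \mu\right)\right), \tag{42}$$
where $\mu = \mathbb{E}[\mathbf{s}_{echo}(t)]$. The Fisher information matrix is derived as follows [28]:
$$I(\theta, \phi) = -\mathbb{E}\begin{bmatrix} \dfrac{\partial^2 \ln p}{\partial\theta^2} & \dfrac{\partial^2 \ln p}{\partial\theta\,\partial\phi} \\[2ex] \dfrac{\partial^2 \ln p}{\partial\phi\,\partial\theta} & \dfrac{\partial^2 \ln p}{\partial\phi^2} \end{bmatrix}. \tag{43}$$
The CRLB is then given by:
$$\mathrm{CRLB}(\theta, \phi) = \mathrm{diag}\!\left(I^{-1}(\theta, \phi)\right). \tag{44}$$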
The CRLB acts as a benchmark to assess the effectiveness
of our estimation methods against the theoretical best possible
performance.
Finally, we employ SDP to optimize our DoA estimation process, striving to achieve performance as close to the CRLB as possible. SDP handles complex-valued matrices well, making it suitable for DoA estimation. We formulate the optimization problem as follows:
$$\begin{aligned} \min_{\mathbf{R}(\theta,\phi)} \quad & \mathrm{trace}\!\left(\mathbf{R}(\theta,\phi)\,\mathbf{s}_{echo}(t)\,\mathbf{s}^{H}_{echo}(t)\right) \\ \text{s.t.} \quad & \mathbf{R}(\theta,\phi) \succeq 0, \\ & \mathrm{rank}(\mathbf{R}(\theta,\phi)) = 1, \quad \mathrm{trace}(\mathbf{R}(\theta,\phi)) = 1, \end{aligned} \tag{45}$$
where $\mathbf{R}(\theta,\phi)$ is the covariance matrix associated with the estimated DoA parameters. This matrix is optimized to minimize the difference between the projected and actual covariance matrices of the echo signal.
B. Detailed derivation process for R(𝜃, 𝜙 )
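The covariance matrix of the echo signal, $\mathbf{R}_{s_{echo}}$, is defined as:
$$\mathbf{R}_{s_{echo}}(t) = \mathbb{E}\!\left[\mathbf{s}_{echo}(t)\,\mathbf{s}^{H}_{echo}(t)\right]. \tag{46}$$
Substituting $\mathbf{s}_{echo}(t)$ into the above equation, and assuming the noise is zero-mean and uncorrelated with the signal so that the cross terms vanish, we get:
$$\begin{aligned} \mathbf{R}_{s_{echo}}(t) &= \mathbb{E}\!\left[\left(\mathbf{a}_u(\theta_w,\phi_w)(t)\,\mathbf{s}(t) + \boldsymbol{\eta}_v(t)\right)\left(\mathbf{a}_u(\theta_w,\phi_w)(t)\,\mathbf{s}(t) + \boldsymbol{\eta}_v(t)\right)^{H}\right] \\ &= \mathbb{E}\!\left[\mathbf{a}_u(\theta_w,\phi_w)(t)\,\mathbf{s}(t)\,\mathbf{s}^{H}(t)\,\mathbf{a}^{H}_u(\theta_w,\phi_w)(t)\right] + \mathbb{E}\!\left[\boldsymbol{\eta}_v(t)\,\boldsymbol{\eta}^{H}_v(t)\right] \\ &= \mathbf{a}_u(\theta_w,\phi_w)(t)\,\mathbb{E}\!\left[\mathbf{s}(t)\,\mathbf{s}^{H}(t)\right]\mathbf{a}^{H}_u(\theta_w,\phi_w)(t) + \mathbf{R}_n(t), \end{aligned} \tag{47}$$
where $\mathbf{R}_n(t) = \mathbb{E}[\boldsymbol{\eta}_v(t)\,\boldsymbol{\eta}^{H}_v(t)]$ is the noise covariance matrix. Let $\mathbf{R}_s(t) = \mathbb{E}[\mathbf{s}(t)\,\mathbf{s}^{H}(t)]$ be the covariance matrix of the signal $\mathbf{s}(t)$. Thus, the covariance matrix of the echo signal is expressed as (40). To gain further insight into the derivation, consider the structure of the covariance matrix $\mathbf{R}_s(t)$. The signal $\mathbf{s}(t)$ is a superposition of multiple components, each subject to different time delays $\boldsymbol{\tau}_{uw}(t_z)$ and Doppler shifts $f_{d_{uw}}$. The covariance matrix $\mathbf{R}_s(t)$ captures these effects:
$$\begin{aligned} \mathbf{R}_s(t) = \mathbb{E}\Bigg[ &\left(\sum_{z=1}^{Z}\sum_{u=1}^{U}\sum_{w=1}^{W} \mathbf{G}_{vu}(t_z)\,\mathbf{h}_{uw}(t_z)\,\boldsymbol{\Phi}^{u}_{RIS}(t_z)\,\mathbf{x}_w\!\left(t_z - \boldsymbol{\tau}_{uw}(t_z)\right) e^{j2\pi f_{d_{uw}} t_z}\right) \\ &\left(\sum_{z'=1}^{Z}\sum_{u'=1}^{U}\sum_{w'=1}^{W} \mathbf{G}_{vu'}(t_{z'})\,\mathbf{h}_{u'w'}(t_{z'})\,\boldsymbol{\Phi}^{u'}_{RIS}(t_{z'})\,\mathbf{x}_{w'}\!\left(t_{z'} - \boldsymbol{\tau}_{u'w'}(t_{z'})\right) e^{j2\pi f_{d_{u'w'}} t_{z'}}\right)^{H} \Bigg] \end{aligned} \tag{48}$$
$$\begin{aligned} = \sum_{z=1}^{Z}\sum_{z'=1}^{Z}\sum_{u=1}^{U}\sum_{u'=1}^{U}\sum_{w=1}^{W}\sum_{w'=1}^{W} & \mathbf{G}_{vu}(t_z)\,\mathbf{h}_{uw}(t_z)\,\boldsymbol{\Phi}^{u}_{RIS}(t_z)\,\mathbf{R}_{x_w}\!\left(t_z - \boldsymbol{\tau}_{uw}(t_z),\, t_{z'} - \boldsymbol{\tau}_{uw}(t_{z'})\right) \\ & \boldsymbol{\Phi}^{u'}_{RIS}(t_{z'})^{H}\,\mathbf{h}_{u'w'}(t_{z'})^{H}\,\mathbf{G}_{vu'}(t_{z'})^{H}\, e^{j2\pi f_{d_{uw}} t_z}\, e^{-j2\pi f_{d_{u'w'}} t_{z'}}. \end{aligned} \tag{49}$$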
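The matrix $\mathbf{R}_{x_w}\!\left(t_z - \boldsymbol{\tau}_{uw}(t_z),\, t_{z'} - \boldsymbol{\tau}_{uw}(t_{z'})\right)$ represents the cross-correlation of the transmitted signals at different delays and Doppler shifts. This structure accounts for the varying propagation paths and their impact on the received signal. Finally, incorporating these details, the covariance matrix of the echo signal becomes:
$$\begin{aligned} \mathbf{R}_{s_{echo}}(t) = \mathbf{a}_u(\theta_w,\phi_w)(t) \Bigg( \sum_{z=1}^{Z}\sum_{z'=1}^{Z}\sum_{u=1}^{U}\sum_{u'=1}^{U}\sum_{w=1}^{W}\sum_{w'=1}^{W} & \mathbf{G}_{vu}(t_z)\,\mathbf{h}_{uw}(t_z)\,\boldsymbol{\Phi}^{u}_{RIS}(t_z)\,\mathbf{R}_{x_w}\!\left(t_z - \boldsymbol{\tau}_{uw}(t_z),\, t_{z'} - \boldsymbol{\tau}_{uw}(t_{z'})\right) \\ & \boldsymbol{\Phi}^{u'}_{RIS}(t_{z'})^{H}\,\mathbf{h}_{u'w'}(t_{z'})^{H}\,\mathbf{G}_{vu'}(t_{z'})^{H}\, e^{j2\pi f_{d_{uw}} t_z}\, e^{-j2\pi f_{d_{u'w'}} t_{z'}} \Bigg)\, \mathbf{a}^{H}_u(\theta_w,\phi_w)(t) + \mathbf{R}_n(t). \end{aligned} \tag{50}$$
This derivation shows how the covariance matrix $\mathbf{R}(\theta,\phi)(t) = \mathbf{R}_{s_{echo}}(t)$ is obtained from the received echo signal model. The primary goal in our DoA estimation framework is to minimize the difference between the projected covariance matrix, $\mathbf{R}(\theta_w,\phi_w)$, and the actual covariance matrix, $\mathbf{R}_{s_{echo}}$. This approach aims to reduce the RMSE of our estimates. The optimization problem is formulated as follows:
$$\mathrm{RMSE} \propto \min_{\mathbf{R}(\theta_w,\phi_w)} \mathrm{trace}\!\left(\mathbf{R}_{s_{echo}} - \mathbf{R}(\theta_w,\phi_w)\,\mathbf{s}_{echo}(t)\,\mathbf{s}^{H}_{echo}(t)\right). \tag{51}$$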
1) Obtaining UFO Sensing Decision: UFO detection is approached as a binary hypothesis testing problem in our radar-based surveillance system. We assess the presence ($H_1$) or absence ($H_0$) of UFOs based on the received echo signal $\mathbf{s}_{echo}(t)$. The hypotheses are defined using the indicator function $\Psi$, where $\Psi = 1$ indicates UFO presence and $\Psi = 0$ indicates absence:
$$H_0\,(\Psi = 0): \; \mathbf{s}_{echo}(t) = \boldsymbol{\eta}_v(t), \tag{52}$$
$$H_1\,(\Psi = 1): \; \mathbf{s}_{echo}(t) = \mathbf{a}_u(\theta_w,\phi_w)(t)\,\mathbf{s}(t) + \boldsymbol{\eta}_v(t). \tag{53}$$
To facilitate the detection decision, a threshold $\lambda$ is established based on the desired probability of false alarm ($P_{fa}$). The test statistic for energy detection, denoted by $Y$, is given by:
$$Y = \frac{1}{Z}\sum_{z=1}^{Z} \left|\mathbf{s}_{echo}(z)\right|^2, \tag{54}$$
where $Z$ represents the number of temporally distributed measurements. The probability of detection ($P_d$) is the likelihood of correctly detecting a UFO when it is present ($\Psi = 1$). It is calculated as [29]:
$$P_d = Q\!\left(\left(\frac{\lambda}{P_n} - \gamma - 1\right)\frac{\sqrt{Z}}{\gamma + 1}\right), \tag{55}$$
where $\gamma$ denotes the signal-to-noise ratio (SNR), $P_n$ is the noise power, calculated as $P_n = \mathbb{E}[\boldsymbol{\eta}_v(t)\,\boldsymbol{\eta}^{H}_v(t)]$, and $Q(\cdot)$ is the Q-function. Similarly, the probability of a false alarm ($P_{fa}$) is the likelihood of incorrectly detecting a UFO when it is absent ($\Psi = 0$). This probability is determined as [29]:
$$P_{fa} = Q\!\left(\left(\frac{\lambda}{P_n} - 1\right)\sqrt{Z}\right). \tag{56}$$
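The energy detector in (54)-(56) can be evaluated numerically with the standard identity $Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$. The sketch below follows the expressions as written above; the function names and the threshold, SNR, and sample-count values are hypothetical and serve only as an illustration.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def energy_statistic(samples) -> float:
    """Test statistic Y of (54): average energy of the Z echo samples."""
    return sum(abs(s) ** 2 for s in samples) / len(samples)


def prob_false_alarm(threshold: float, noise_power: float, z: int) -> float:
    """P_fa following the form of (56)."""
    return q_function((threshold / noise_power - 1.0) * math.sqrt(z))


def prob_detection(threshold: float, noise_power: float, snr: float, z: int) -> float:
    """P_d following the form of (55)."""
    return q_function((threshold / noise_power - snr - 1.0) * math.sqrt(z) / (snr + 1.0))


# Hypothetical operating point: threshold 20% above the noise power, Z = 100.
p_fa = prob_false_alarm(threshold=1.2, noise_power=1.0, z=100)
p_d = prob_detection(threshold=1.2, noise_power=1.0, snr=1.0, z=100)
```

Raising the threshold lowers both probabilities, so constraints on $P_d$ and $P_{fa}$ pull the threshold in opposite directions; increasing $Z$ relaxes this tension at the cost of a longer sensing phase.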
C. Communication for Task Offloading
In our military surveillance system, the communication strategy is optimized for data throughput and latency minimization across the vehicle-to-UAV and vehicle-to-satellite links within the remaining sub-slots $(T - \tau T)$, emphasizing the URLLC requirements. Task offloading decisions are dynamically adjusted based on UFO detection: offloading is suspended while a UFO is present ($\mathbb{P}(H_1)$) to prioritize surveillance, except in missed-detection scenarios, which occur with probability $(1 - P_d)$. Conversely, normal offloading resumes in the absence of a UFO ($\mathbb{P}(H_0)$), provided no false alarm occurs, which happens with probability $(1 - P_{fa})$. Given that each vehicle's task is divided into $J$ sub-tasks, the effective throughputs of the vehicle-to-UAV link, $\mathcal{R}_{vu}$ in (31), and of the vehicle-to-satellite link, $\mathcal{R}_{vl}$ in (34), are recalculated to accommodate this division. The adjusted throughputs are expressed as:
$$\widetilde{\mathcal{R}}_{vu}(t) = \frac{T - \tau T}{J\,T}\,\mathbb{P}(H_0)\,(1 - P_{fa})\,\mathcal{R}_{vu}(t), \tag{57}$$
$$\widetilde{\mathcal{R}}_{vl}(t) = \frac{T - \tau T}{J\,T}\,\mathbb{P}(H_0)\,(1 - P_{fa})\,\mathcal{R}_{vl}(t). \tag{58}$$
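The scaling in (57) and (58) is straightforward to compute once the factor $(T - \tau T)/(JT)$ is simplified to $(1 - \tau)/J$. The following sketch is illustrative only; the function name and the example values are hypothetical.

```python
def adjusted_throughput(rate: float, tau: float, num_subtasks: int,
                        p_h0: float, p_fa: float) -> float:
    """Effective per-sub-task throughput following the form of (57)/(58).

    rate         : unadjusted link rate (R_vu or R_vl)
    tau          : fraction of the slot T reserved for sensing
    num_subtasks : J, number of sub-tasks sharing the link
    p_h0         : prior probability that no UFO is present
    p_fa         : probability of false alarm
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if num_subtasks <= 0:
        raise ValueError("num_subtasks must be positive")
    # (T - tau*T) / (J*T) simplifies to (1 - tau) / J.
    return (1.0 - tau) / num_subtasks * p_h0 * (1.0 - p_fa) * rate


# Hypothetical example: 1 Mbit/s link, 20% sensing time, 4 sub-tasks.
r_tilde = adjusted_throughput(rate=1e6, tau=0.2, num_subtasks=4, p_h0=0.9, p_fa=0.1)
```

The expression makes the sensing-communication trade-off explicit: a longer sensing phase improves detection but shrinks the factor $(1 - \tau)$ that multiplies the achievable rate.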
D. Integrated Optimization Problem
The underlying problem is integrating the sensing, commu-
nication, and computational aspects into a unified optimization
framework. This involves jointly optimizing the DoA estimation
process, UFO sensing performance, communication link param-
eters, and task offloading decisions to achieve the best overall
system performance.
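In the context of DoA estimation, the SNR at the receiver significantly influences the estimation accuracy and therefore directly affects the sensing performance. The power of the received signal from (10), $P_s$, is defined as $P_s = \mathbb{E}[\mathbf{s}_{echo}(t)\,\mathbf{s}^{H}_{echo}(t)] - \mathbb{E}[\boldsymbol{\eta}_v(t)\,\boldsymbol{\eta}^{H}_v(t)]$. The noise power, $P_n$, is given by $P_n = \mathbb{E}[\boldsymbol{\eta}_v(t)\,\boldsymbol{\eta}^{H}_v(t)]$, with $\boldsymbol{\eta}_v(t)$ denoting the noise vector at the radar receiver. Finally, the SNR, a critical metric for assessing the received signal quality relative to background noise, is defined as $\mathrm{SNR} = P_s / P_n$. The $\tau T$ sub-slot in Fig. 2 is dedicated to DoA optimization while adhering to stringent detection performance standards. The optimization task, focused on refining the accuracy of DoA estimation, is formulated as:
$$\begin{aligned} f_{DoA} = \min_{\mathbf{R}(\theta_w,\phi_w),\, Z,\, \boldsymbol{\Phi}_{RIS}} \quad & \left\|\mathbf{R}_{s_{echo}} - \mathbf{R}(\theta_w,\phi_w)\,\mathbf{s}_{echo}(t)\,\mathbf{s}^{H}_{echo}(t)\right\|^{2}_{F}, \\ \text{subject to} \quad & \text{C1: } \mathrm{SNR} \ge \mathrm{SNR}_{\min}, \\ & \text{C2: } \varphi_n \in \{0, \pi/2, \pi, 3\pi/2\}, \; \forall n \in \mathcal{N}, \\ & \text{C3: } \theta \in [0, 2\pi], \; \phi \in [0, \pi], \\ & \text{C4: } P_d \ge P^{th}_{d}, \\ & \text{C5: } P_{fa} \le P^{th}_{fa}, \\ & \text{C6: } 0 < Z \le Z_{\max}, \end{aligned} \tag{59}$$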
where constraint C1 guarantees a minimum $\mathrm{SNR}_{\min}$ at the receiver, ensuring the received signal's integrity for dependable DoA estimation. Constraint C2 restricts the phase shifts introduced by the RIS to a discrete set. Constraint C3 confines the optimization search within realistic azimuth ($\theta$) and elevation ($\phi$) angle bounds, ensuring that the DoA estimation adheres to feasible operational domains. Furthermore, constraint C4 ensures that the probability of detecting a UFO ($P_d$) is at least a predetermined threshold $P^{th}_{d}$. Conversely, C5 ensures that the system's $P_{fa}$ does not exceed an upper limit $P^{th}_{fa}$. Constraint C6 bounds the maximum number of measurements (i.e., $\tau T = Z \times t_z$) in the DoA estimation process.
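On the other hand, given the stringent requirements of URLLC services, our optimization problem focuses on minimizing the total latency involved in task offloading, comprising both communication and computation latencies, while ensuring that the reliability requirements are met. The objective function, alongside the constraints for URLLC services, is formulated using (38) as follows:
$$\begin{aligned} f_{TASK} = \min_{P_v(t),\, \widetilde{\mathcal{R}}_{vu}(t),\, \widetilde{\mathcal{R}}_{vl},\, f_v,\, f_l} \quad & \sum_{j=1}^{J}\left(T^{vj}_{tra}(t) + T^{vj}_{com}\right), \\ \text{subject to} \quad & \text{C7: } T^{vj}_{com} \le T^{th}_{com}, \; \forall j, \\ & \text{C8: } T^{vj}_{tra} \le T^{th}_{tra}, \; \forall j, \\ & \text{C9: } \widetilde{\mathcal{R}}_{vu}(t) \ge \mathcal{R}^{min}_{vu}, \; \forall t, \\ & \text{C10: } \widetilde{\mathcal{R}}_{vl}(t) \ge \mathcal{R}^{min}_{vl}, \; \forall t, \\ & \text{C11: } P_v(t) \le P_{\max}, \; \forall t, \end{aligned} \tag{60}$$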
where constraints C7 and C8 ensure that the computation and communication latencies of each sub-task $j$ do not exceed their respective thresholds, $T^{th}_{com}$ and $T^{th}_{tra}$, which is crucial for maintaining the low-latency requirements of URLLC. Constraint C9 requires the adjusted data rate of the vehicle-to-UAV link, $\widetilde{\mathcal{R}}_{vu}(t)$, to meet or exceed a minimum $\mathcal{R}^{min}_{vu}$, which is vital for link reliability and for meeting the URLLC latency criteria. Similarly, C10 requires the adjusted data rate of the vehicle-to-satellite link, $\widetilde{\mathcal{R}}_{vl}(t)$, to meet or exceed $\mathcal{R}^{min}_{vl}$, ensuring efficient data transmission in line with the URLLC standards. Constraint C11 ensures that the transmission power $P_v(t)$ does not exceed the maximum power budget $P_{\max}$ at any given time $t$.
To achieve an optimal balance between sensing accuracy and communication efficiency, we propose a joint optimization framework formulated as a multi-objective optimization problem:
$$\begin{aligned} \underset{\substack{\mathbf{R}(\theta_w,\phi_w),\, Z,\, \boldsymbol{\Phi}_{RIS}, \\ P_v(t),\, \widetilde{\mathcal{R}}_{vu}(t),\, \widetilde{\mathcal{R}}_{vl}(t),\, f_v,\, f_l}}{\text{minimize}} \quad & f = \omega_1 f_{DoA} + (1 - \omega_1) f_{TASK}, \\ \text{s.t.} \quad & \text{C1 - C11}, \end{aligned} \tag{61}$$
where the weight $0 < \omega_1 < 1$ adjusts the importance of each function in the overall objective, allowing flexibility in prioritizing between sensing accuracy and communication efficiency based on the application's needs.
IV. QUANTUM-AIDED MULTI-AGENT DRL SOLUTION
In addressing the intricate and high-dimensional state space
challenges of military surveillance systems, we introduce a
quantum-aided multi-agent DRL solution. This approach exploits the parallel processing power of quantum computing through phenomena such as superposition and entanglement to transcend the limitations of traditional optimization methods [23]. The
fusion of quantum computing with multi-agent DRL facilitates
enhanced distributed decision-making and learning, strengthen-
ing the system’s adaptability, resilience, and performance in
dynamic operational scenarios [24], [30].
Our proposed architecture introduces a dual-tiered agent
framework, as depicted in Fig. 1. Within this framework, we
define two distinct operational agents, denoted by AgDoA and
AgTASK, which are responsible for the immediate execution
of DoA estimation and task offloading decisions, respectively.
These agents interact directly with the operational environment
to fulfill their designated tasks.
A. MDP Model for Operational Agents
The decision-making processes of the operational agents are
modeled using an MDP framework to apply the DRL model [20].
Each operational agent has a distinct role and operates within its
unique state and action spaces, which are outlined as follows:
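1) Operational State Space ($\mathcal{S}_{oa}$): At any discrete time instance $t$, the operational state space of an agent is represented by $\mathbf{s}_{oa}(t) \in \mathcal{S}_{oa}$, which is composed of the state vectors for the DoA estimation and task offloading decision processes. The state space is formally given by
$$\mathbf{s}_{oa}(t) = \{\mathbf{s}_{DoA}(t),\, \mathbf{s}_{TASK}(t) \mid DoA, TASK \in oa\}, \tag{62}$$
where $\mathbf{s}_{DoA}(t)$ is defined for DoA estimation as
$$\mathbf{s}_{DoA}(t) = \{s(t),\, \mathbf{G}_{vu},\, \mathbf{h}_{uw},\, \boldsymbol{\eta}_v,\, \mathbf{x}_v,\, \mathbf{v}_v,\, \mathbf{w}_v,\, \Theta_v,\, \mathbf{x}_u,\, \mathbf{v}_u,\, \mathbf{w}_u,\, \Theta_u\}.$$
The state vector for task offloading, $\mathbf{s}_{TASK}(t)$, comprises
$$\mathbf{s}_{TASK}(t) = \{\mathbf{h}_{vu},\, \mathbf{h}_{vl},\, D_v,\, \xi_{vu},\, \xi_{vl},\, c_{vj},\, f_v,\, \sigma^2_{n_u},\, x^{u}_{vj},\, \sigma^2_{n_l},\, x^{l}_{vj},\, f_l\}.$$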
2) Operational Action Space ($\mathcal{A}_{oa}$): The action space of each operational agent is defined to align with its specific operational role within the surveillance system. The collective action space, denoted by $\mathbf{a}_{oa}(t) \in \mathcal{A}_{oa}$, is composed of the individual action sets of the agents involved in DoA estimation and task offloading, respectively:
$$\mathbf{a}_{oa}(t) = \{\mathbf{a}_{DoA}(t),\, \mathbf{a}_{TASK}(t) \mid DoA, TASK \in oa\}, \tag{63}$$
where the action vector $\mathbf{a}_{DoA}(t)$ of agent $\mathrm{Ag}_{DoA}$ at time $t$ includes the decisions related to DoA estimation, namely $\mathbf{a}_{DoA}(t) = \{\mathbf{R}(\theta_w,\phi_w),\, \boldsymbol{\Phi}_{RIS},\, Z\}$. Concurrently, the action set for task offloading decisions of agent $\mathrm{Ag}_{TASK}$ is $\mathbf{a}_{TASK}(t) = \{P_v(t),\, \widetilde{\mathcal{R}}_{vu}(t),\, \widetilde{\mathcal{R}}_{vl}(t),\, f_v(t),\, f_l(t)\}$, encompassing power allocation, communication rate adjustments, and computational resource management.
3) Nash Equilibrium-based Rewards Calculation: The multi-agent framework employs a reward structure grounded in a joint objective function, aiming to balance DoA estimation accuracy and task offloading efficiency. This balance is regulated by dynamically allocating time between DoA estimation ($\tau$) and task offloading, which directly influences the system performance described by the joint minimization problem in (61). To achieve Nash equilibrium [31], where no agent benefits from unilaterally changing its strategy, we define dynamic penalty coefficients and cost functions. These components are designed to penalize deviations from desired performance thresholds, thus incentivizing agents toward optimal behaviour:
$$r_{oa}(t) = \begin{cases} \dfrac{\omega_1}{f_{DoA}(t)} - \lambda_D(t)\,C_D(t), & \text{for } \mathrm{Ag}_{DoA}, \\[2ex] \dfrac{1 - \omega_1}{f_{TASK}(t)} - \lambda_T(t)\,C_T(t), & \text{for } \mathrm{Ag}_{TASK}, \end{cases} \tag{64}$$
where $\lambda_D(t) = \max\!\left(0, \sum_{c=1}^{6} \kappa_c\, \mathbb{I}(\text{C}c \text{ violation})\right)$ and $\lambda_T(t) = \max\!\left(0, \sum_{c=7}^{11} \kappa_c\, \mathbb{I}(\text{C}c \text{ violation})\right)$ denote the dynamic penalty coefficients for DoA estimation and task offloading, respectively. Here, $\kappa_c$ is the penalty weight for violating constraint C$c$, and $\mathbb{I}(\cdot)$ is the indicator function, which equals one when the constraint is violated and zero otherwise. The cost functions are defined as $C_D(t) = |\text{estimated DoA} - \text{actual DoA}|^2$ for the DoA estimation error, and $C_T(t) = \max(0, T_{lat_{\max}}(t) - T_{lat}(t))$ for the task offloading latency, so that penalties are directly tied to the magnitude of the performance deviation. This structured approach to reward calculation drives the system toward a Nash equilibrium, optimizing the overall surveillance operation. Therefore, the cumulative reward can be expressed as:
$$r^{sum}_{oa}(t) = \frac{\omega_1}{f_{DoA}(t)} - \lambda_D(t)\,C_D(t) + \frac{1 - \omega_1}{f_{TASK}(t)} - \lambda_T(t)\,C_T(t). \tag{65}$$
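Algorithm 1 Quantum State Encoding and Initialization
Require: Classical state vectors $\mathbf{s}_{oa}(t)$
Ensure: Quantum-encoded operational state $\mathbf{s}^{Q}_{oa}(t)$
1: Normalize $\mathbf{s}_{DoA}(t)$ and $\mathbf{s}_{TASK}(t)$ to $\|\mathbf{s}(t)\|_2 = 1$
2: for $\mathbf{s} \in \{\mathbf{s}_{DoA}(t), \mathbf{s}_{TASK}(t)\}$ do
3:   Encode $\mathbf{s}$ into $|\psi_{oa}(t)\rangle = \sum_{i=0}^{N-1} \frac{s_i(t)}{\|\mathbf{s}_{oa}(t)\|_2}\,|i\rangle$
4: end for
5: Set $\mathbf{s}^{Q}_{oa}(t) = \{|\psi_{DoA}(t)\rangle,\, |\psi_{TASK}(t)\rangle\}$
6: Initialize quantum system with $\mathbf{s}^{Q}_{oa}(t)$
7: return $\mathbf{s}^{Q}_{oa}(t)$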
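The normalization and encoding steps of Algorithm 1 can be simulated classically. The sketch below computes the amplitude vector that an amplitude-encoding circuit would prepare; it does not build the circuit itself. Zero-padding the vector to the next power of two, and the function name, are assumptions made here for illustration.

```python
import math


def amplitude_encode(state):
    """Classical simulation of amplitude encoding (steps 1 and 3 of Algorithm 1).

    Returns (amplitudes, n_qubits), where amplitudes[i] is the coefficient
    of the computational basis state |i>.
    Assumption: the vector is zero-padded to length 2**n_qubits.
    """
    norm = math.sqrt(sum(abs(x) ** 2 for x in state))
    if norm == 0.0:
        raise ValueError("cannot encode the all-zero vector")
    # Smallest number of qubits whose Hilbert space holds len(state) entries.
    n_qubits = max(1, (len(state) - 1).bit_length())
    dim = 2 ** n_qubits
    amplitudes = [x / norm for x in state] + [0.0] * (dim - len(state))
    return amplitudes, n_qubits


# Hypothetical three-feature state vector; it fits into a 2-qubit register.
amps, n = amplitude_encode([3.0, 4.0, 0.0])
total_probability = sum(abs(a) ** 2 for a in amps)
```

A classical vector of length $N$ therefore occupies only $\lceil \log_2 N \rceil$ qubits, which is the compression that the quantum-encoded state space relies on.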
B. Quantum-Aided Multi-agent DRL Framework
By leveraging the power of quantum computation, our ap-
proach aims to address the high-dimensional challenges prevalent
in military surveillance systems, improving both the DoA estima-
tion accuracy and computational task offloading efficiency. The
proposed framework’s procedural details and operational insights
are thoroughly discussed, with Algorithm 3 serving as the core
for our quantum-aided DRL optimization process.
1) Quantum-Encoded Operational State Space ($\mathcal{S}^{Q}_{oa}$): For a given discrete time instant $t$, the quantum-encoded operational state space [30], [32], denoted by $\mathbf{s}^{Q}_{oa}(t) \in \mathcal{S}^{Q}_{oa}$, describes the quantum states relevant to DoA estimation and task offloading decisions. Specifically, $\mathbf{s}^{Q}_{oa}(t)$ comprises:
$$\mathbf{s}^{Q}_{oa}(t) = \{|\psi_{DoA}(t)\rangle,\, |\psi_{TASK}(t)\rangle\}, \tag{66}$$
where $|\psi_{DoA}(t)\rangle$ and $|\psi_{TASK}(t)\rangle$ represent the quantum-encoded states derived from their classical counterparts, $\mathbf{s}_{DoA}(t)$ and $\mathbf{s}_{TASK}(t)$, through amplitude encoding [33]; here, $N$ denotes the size of the quantum state space. The encoding process starts by normalizing the classical vectors to unit norm, followed by the amplitude encoding [33], which maps each vector $\mathbf{s}$ into a quantum state $|\psi\rangle$, as detailed in Algorithm 1. This quantum-encoded state space, exploiting quantum superposition, affords a quantum computational advantage by enabling parallel processing of multiple states.
Lemma 1. Quantum encoding of operational states and actions into quantum states significantly reduces the dimensionality of the decision space, thereby enhancing the efficiency of the learning process in the quantum-aided DRL framework.
Proof. For an $n$-qubit quantum system, operational agent states and actions are encoded into a quantum state $|\psi(t)\rangle$ within a Hilbert space $\mathcal{H}$ of dimension $2^n$ [34]. Utilizing the principle of superposition, this encoding is represented mathematically as:
$$|\psi(t)\rangle = \sum_{i=0}^{2^n - 1} \alpha_i |i\rangle, \quad \text{with } \sum_{i=0}^{2^n - 1} |\alpha_i|^2 = 1, \tag{67}$$
where $\alpha_i \in \mathbb{C}$ are probability amplitudes, whose squared magnitudes $|\alpha_i|^2$ give the probability of finding the system in each basis state upon measurement. The set $\{|i\rangle\}_{i=0}^{2^n - 1}$ denotes the computational basis, where each basis state $|i\rangle$ is a direct representation of the binary equivalent of the integer $i$. The computational basis can be formally defined as $\mathbf{b} = \{|i\rangle : i \in \{0, 1, \ldots, 2^n - 1\}\}$.
Unitary transformations $U(t)$ evolve $|\psi(t)\rangle$ into:
$$|\psi'(t)\rangle = U(t)|\psi(t)\rangle = \sum_{i=0}^{2^n - 1} \alpha'_i |i\rangle, \tag{68}$$
with $\alpha'_i = U(t)\alpha_i$, indicating that the complexity of operations scales as $O(\mathrm{poly}(n))$. The Grover search algorithm [35] highlights quantum computational advantages by requiring $O(\sqrt{2^n})$ queries to identify a marked item in a search space $S$, in contrast to the classical search complexity of $O(2^N)$, where $n \ll N$. □
2) Quantum-Enhanced Actor-Critic Framework:Building
upon the quantum-encoded operational state spaces, our ap-
proach employs a quantum-enhanced actor-critic method for each
operational agent. This method employs separate networks for
the policy (actor) and value function (critic), optimized to work
within the quantum computing paradigm.
a) Quantum Circuit Initialization and Actor Network: Quantum circuits, parameterized by the action vectors $\mathbf{a}_{oa}(t)$, process the quantum-encoded operational state spaces $\mathbf{s}^{Q}_{oa}(t)$ through unitary transformations $U(\varrho_{oa}(t))$, reflecting the decision-making policies. The initialization of these quantum circuits ($\mathcal{Q}_{oa}$) is formalized as follows [36]:
$$\mathcal{Q}_{oa}(t, \mathbf{a}_{oa}) = U(\varrho_{oa}(t))\,|\psi_{oa}(t)\rangle, \tag{69}$$
where $U(\varrho_{oa}(t))$ represents the quantum equivalent of actions, optimized to maximize the expected reward. This unitary operation transforms the quantum-encoded states according to the agent-specific actions $\mathbf{a}_{oa}(t)$, mapping the initial state $|\psi_{oa}(t)\rangle$ to a new state $|\psi'_{oa}(t)\rangle$ as:
$$U(\varrho_{oa}(t)) : |\psi_{oa}(t)\rangle \mapsto |\psi'_{oa}(t)\rangle. \tag{70}$$
The evolution of these states under the influence of actions is governed by the Hamiltonian $H_{oa}(\varrho_{oa})$, which encodes the total energy of the system, with the unitary operation expressed as $U(\varrho_{oa}(t)) = e^{-i H_{oa}(\varrho_{oa}(t))}$. This is not the physical energy in the conventional sense but rather a mathematical representation of the system's energy states and transitions within the quantum computational model. Mathematically, $H_{oa}(\varrho_{oa})$ is defined as:
$$H_{oa} = \sum_{i} \epsilon_i\, |i\rangle\langle i| + \sum_{i \ne j} \tau_{ij}\left(|i\rangle\langle j| + |j\rangle\langle i|\right), \tag{71}$$
where $\epsilon_i$ represents the energy associated with the system being in a particular state $|i\rangle$, and $\tau_{ij}$ represents the transition energy between states $|i\rangle$ and $|j\rangle$.
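The actor network employs variational quantum circuits, parameterized by $\theta^{\pi}_{oa}$, to efficiently explore action probabilities through quantum superposition and entanglement [37], written as:
$$\pi_{\theta^{\pi}_{oa}}\!\left(\mathbf{a}_{oa}(t) \mid \mathbf{s}^{Q}_{oa}(t)\right) = \langle\psi_{oa}(t)|\,U^{\dagger}(\varrho_{oa}(t))\,U(\varrho_{oa}(t))\,|\psi_{oa}(t)\rangle, \tag{72}$$
where $U^{\dagger}(\varrho_{oa}(t))$ is the Hermitian adjoint of $U(\varrho_{oa}(t))$, ensuring reversibility and the preservation of quantum state properties during policy application. To optimize $\theta^{\pi}_{oa}$, we use the parameter shift rule to estimate the gradient as follows:
$$\nabla_{\theta^{\pi}_{oa}} J(\theta^{\pi}_{oa}) = \frac{1}{2}\Big(\langle\partial_{\theta^{\pi}_{oa}}\psi_{oa}(t)|\,U^{\dagger}(\varrho_{oa}(t))\,U(\varrho_{oa}(t))\,|\psi_{oa}(t)\rangle + \langle\psi_{oa}(t)|\,U^{\dagger}(\varrho_{oa}(t))\,U(\varrho_{oa}(t))\,|\partial_{\theta^{\pi}_{oa}}\psi_{oa}(t)\rangle\Big). \tag{73}$$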
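For readers unfamiliar with the parameter shift rule, the sketch below demonstrates its standard textbook form on the simplest possible circuit: a single qubit rotated by $R_Y(\theta)$ and measured in the Pauli-$Z$ basis, for which $\langle Z \rangle = \cos\theta$. This toy circuit and the function names are assumptions chosen for illustration; it is not the actor circuit of (72)-(73).

```python
import math


def expectation_z(theta: float) -> float:
    """<Z> after applying RY(theta) to |0>.

    RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>, so
    <Z> = cos^2(theta/2) - sin^2(theta/2) = cos(theta).
    """
    a0 = math.cos(theta / 2.0)
    a1 = math.sin(theta / 2.0)
    return a0 * a0 - a1 * a1


def parameter_shift_gradient(expectation, theta: float) -> float:
    """Gradient of an expectation value via the standard parameter shift rule.

    Valid for gates generated by a Pauli operator (such as RY):
    d<O>/dtheta = [<O>(theta + pi/2) - <O>(theta - pi/2)] / 2.
    """
    shift = math.pi / 2.0
    return 0.5 * (expectation(theta + shift) - expectation(theta - shift))


theta = 0.7
grad = parameter_shift_gradient(expectation_z, theta)
analytic = -math.sin(theta)  # derivative of cos(theta)
```

Unlike a finite-difference estimate, the two shifted evaluations give the exact gradient, which is why the rule is attractive when expectation values are obtained from repeated circuit measurements.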
b) Critic Network Evaluation: The critic network, parameterized by $\theta^{Q}_{oa}$, evaluates the expected return of taking an action $\mathbf{a}$ in state $\mathbf{s}$ under the current policy. This evaluation guides policy improvement by providing feedback on the action value, and is quantitatively expressed as:
$$Q_{\theta^{Q}_{oa}}\!\left(\mathbf{s}^{Q}_{oa}(t), \mathbf{a}_{oa}(t)\right) = \langle\psi'_{oa}(t)|\,M_{\theta^{Q}_{oa}}\,|\psi'_{oa}(t)\rangle, \tag{74}$$
where $M_{\theta^{Q}_{oa}}$ represents the measurement operator parameterized by the critic network.
c) Replay Buffer: A quantum-enhanced replay buffer is utilized to store experience tuples $(\mathbf{s}, \mathbf{a}, r, \mathbf{s}')$ collected from interactions with the environment. This buffer serves as a database for sampling mini-batches of experiences [19], reducing the correlation in the observation sequence and improving the stability and efficiency of learning:
$$\mathcal{D} = \left\{\left(\mathbf{s}^{Q}_{oa}(t_i),\, \mathbf{a}(t_i),\, r(t_i),\, \mathbf{s}^{Q}_{oa}(t_{i+1})\right)\right\}_{i=1}^{N}, \tag{75}$$
where $\mathcal{D}$ denotes the replay buffer containing $N$ experiences, facilitating the training of both the actor and critic networks within the quantum-augmented DRL framework.
d) Learning with Quantum-Enhanced TD Error: The optimization of the actor and critic networks within the quantum-augmented DRL framework utilizes a quantum-enhanced temporal difference (TD) learning approach [38]. This involves computing the TD error in a manner that accounts for the quantum-encoded states and the probabilistic nature of quantum measurements. Given a quantum-encoded state $\mathbf{s}^{Q}_{oa}(t)$, its successor $\mathbf{s}^{Q}_{oa}(t+1)$, and the reward $r(t)$, the quantum-enhanced TD error, accounting for experiences sampled from the replay buffer, is defined as [38]:
$$\delta_{\mathcal{D}}(t) = r_{oa}(t) + \gamma\,\langle\psi_{\mathbf{s}^{\prime Q}_{oa}}|\,Q_{\theta^{Q}_{oa}}\,|\psi_{\mathbf{s}^{\prime Q}_{oa}}\rangle - \langle\psi_{\mathbf{s}^{Q}_{oa}}|\,Q_{\theta^{Q}_{oa}}\,|\psi_{\mathbf{s}^{Q}_{oa}}\rangle, \tag{76}$$
where $\gamma$ here denotes the discount factor, and $\psi_{\mathbf{s}^{Q}_{oa}}$ and $\psi_{\mathbf{s}^{\prime Q}_{oa}}$ denote the quantum-encoded current and next states sampled from the replay buffer $\mathcal{D}$, enhancing the learning stability and efficiency.
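The TD error in (76) combines a reward with two expectation values of the same observable. The following sketch evaluates it for small state vectors stored as Python lists; the observable (a Pauli-$Z$ matrix), the states, and the function names are hypothetical placeholders for the critic's measurement operator and the encoded states.

```python
def expectation(state, observable):
    """Compute <psi|Q|psi> for a state vector and a Hermitian matrix.

    state      : list of (complex) amplitudes
    observable : square matrix as a list of rows
    """
    total = 0j
    for i, amp_i in enumerate(state):
        for j, amp_j in enumerate(state):
            total += complex(amp_i).conjugate() * observable[i][j] * amp_j
    return total.real  # Hermitian observables have real expectation values


def td_error(reward, discount, state, next_state, observable):
    """TD error following the form of (76)."""
    return (reward
            + discount * expectation(next_state, observable)
            - expectation(state, observable))


# Hypothetical single-qubit example with the Pauli-Z observable.
pauli_z = [[1.0, 0.0], [0.0, -1.0]]
current = [1.0, 0.0]   # |0>, expectation +1
successor = [0.0, 1.0]  # |1>, expectation -1
delta = td_error(reward=0.5, discount=0.9, state=current,
                 next_state=successor, observable=pauli_z)
```

On hardware each expectation value would be estimated from a finite number of measurement shots, so the TD error becomes a noisy estimate; averaging over a mini-batch drawn from the replay buffer reduces that variance.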
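To incorporate experiences from the replay buffer in the optimization of the critic network parameters $\theta^{Q}_{oa}$, the loss function is defined as:
$$\mathcal{L}_{\mathcal{D}}(\theta^{Q}_{oa}) = \mathbb{E}_{(\mathbf{s},\mathbf{a},r,\mathbf{s}') \sim \mathcal{D}}\!\left[\delta_{\mathcal{D}}(t)^2\right] + \lambda_1 \|\theta^{Q}_{oa}\|^{2}_{2} - \lambda_2\, \mathbb{E}_{(\mathbf{s},\mathbf{a},r,\mathbf{s}') \sim \mathcal{D}}\!\left[\mathcal{F}\!\left(\rho_{\psi_{\mathbf{s}^{\prime Q}_{oa}}},\, \sigma_{\psi_{\mathbf{s}^{\prime Q}_{oa}}}(\theta^{Q}_{oa})\right)\right], \tag{77}$$
where the expectations are taken over the distribution of experiences $(\mathbf{s}, \mathbf{a}, r, \mathbf{s}')$ sampled from the replay buffer $\mathcal{D}$, and $\mathbb{E}[\delta(t)^2]$ denotes the expected squared TD error, which is minimized to reduce the discrepancy between predicted and actual rewards. The L2 regularization term, $\lambda_1 \|\theta^{Q}_{oa}\|^{2}_{2}$, is used to prevent overfitting by penalizing large weights. We use a quantum fidelity term, $\mathcal{F}$, to encourage the critic network to accurately reflect the underlying quantum state dynamics by maximizing the fidelity between the target and predicted quantum states.
Algorithm 2 Quantum State Optimization with Feedback
Require: $\{\mathbf{s}^{Q}_{oa}(t)\}_{oa=1}^{N}$, $\mathcal{A}_{oa}$.
Ensure: $\mathbf{a}_{opt}(t)$.
1: $\Psi_{init} \leftarrow \bigotimes_{oa=1}^{N} |\psi_{oa}(t)\rangle$
2: for $i \leftarrow 1$ to $N$ do
3:   $\mathcal{Q}_{\mathrm{Ag}_i} \leftarrow U_{encode}(\mathbf{s}_{\mathrm{Ag}_i})\,|0\rangle^{\otimes n}$
4: end for
5: $H_{global} \leftarrow \sum_{oa, oa'} H_{oa, oa'}$
6: Prepare an initial quantum state $|\Psi_{init}\rangle$
7: Define $\mathcal{Q}(\varrho_{oa}) = U_n(\varrho_{oa_n})\,U_{n-1}(\varrho_{oa_{n-1}}) \cdots U_1(\varrho_{oa_1})$
8: while not converged do
9:   Apply $\mathcal{Q}(\varrho_{oa})$ to $|\Psi_{init}\rangle$ to get $|\Psi(\varrho_{oa})\rangle$
10:   Measure $E(\varrho_{oa}) = \langle\Psi(\varrho_{oa})|\,H_{global}\,|\Psi(\varrho_{oa})\rangle$
11:   $\varrho_{min} = \arg\min_{\varrho_{oa}} E(\varrho_{oa})$
12:   Update $\varrho_{oa} \leftarrow \varrho_{min}$
13: end while
14: $\Psi_{ground} \leftarrow |\Psi(\varrho_{min})\rangle$
15: $\mathbf{a}_{opt}(t) \leftarrow \mathrm{Measure}(\Psi_{ground})$
16: return $\mathbf{a}_{opt}(t)$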
e) Quantum State Optimization with Feedback Loop: In Algorithm 2, each $|\psi_{oa}(t)\rangle$ represents the quantum-encoded state of an individual agent at time $t$. By taking the tensor product $\bigotimes_{oa=1}^{N} |\psi_{oa}(t)\rangle$, we construct a multi-agent quantum state that encompasses the entire system's state information. The optimization of quantum states, coupled with a feedback mechanism, plays a pivotal role in enhancing the performance of DRL agents in dynamic environments [38], [39]. The algorithmic framework outlined in Algorithm 2 directs the optimization process based on quantum principles. The core of the optimization lies in the application of the unitary transformations $\mathcal{Q}(\varrho_{oa})$, which evolve the quantum state to explore the decision space. These transformations are defined as:
$$\mathcal{Q}(\varrho_{oa}) = U_n(\varrho_{oa_n})\,U_{n-1}(\varrho_{oa_{n-1}}) \cdots U_1(\varrho_{oa_1}), \tag{78}$$
where each $U_i(\varrho_{oa_i})$ represents a parameterized unitary operation, reflecting the decision-making policy. The objective is to find the optimal parameters $\varrho_{oa}$ that maximize the reward, as quantified by the measurement
$$E(\varrho_{oa}) = \langle\Psi(\varrho_{oa})|\,H_{global}\,|\Psi(\varrho_{oa})\rangle, \tag{79}$$
where $H_{global}$ represents the global Hamiltonian of the system, encapsulating the interaction between the agents and the environment. The optimization process iteratively adjusts $\varrho_{oa}$ to find the minimum of $E(\varrho_{oa})$, indicative of the optimal decision-making strategy.
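The iterative loop of Algorithm 2 (apply the circuit, measure the energy, update the parameters) can be illustrated on a one-qubit toy problem. The sketch below assumes $H_{global} = Z$ and a single $R_Y(\varrho)$ rotation, so that $E(\varrho) = \cos\varrho$, and it replaces the argmin step of Algorithm 2 with plain gradient descent using parameter-shift gradients. All of these choices, including the learning rate and starting point, are hypothetical simplifications.

```python
import math


def energy(rho: float) -> float:
    """E(rho) = <Psi(rho)|Z|Psi(rho)> with |Psi(rho)> = RY(rho)|0>."""
    return math.cos(rho / 2.0) ** 2 - math.sin(rho / 2.0) ** 2


def minimize_energy(rho0: float, learning_rate: float = 0.2,
                    tol: float = 1e-10, max_iter: int = 10000):
    """Gradient descent on E(rho); returns (rho_min, E(rho_min))."""
    rho = rho0
    for _ in range(max_iter):
        # Parameter-shift gradient: exact for a Pauli-generated rotation.
        grad = 0.5 * (energy(rho + math.pi / 2.0) - energy(rho - math.pi / 2.0))
        if abs(grad) < tol:
            break  # converged: the energy landscape is flat here
        rho -= learning_rate * grad
    return rho, energy(rho)


# Starting away from the stationary point rho = 0 (an energy maximum).
rho_min, e_min = minimize_energy(rho0=0.5)
```

The loop converges to $\varrho = \pi$, where the state is $|1\rangle$ and the energy reaches its minimum of $-1$. Note that starting exactly at $\varrho = 0$ would stall, since the gradient vanishes at the maximum; practical optimizers add random initialization to avoid this.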
The efficacy of actions is evaluated through projective measurement operators $\{M_m\}$ applied to the post-action quantum states [38]. The feedback for actions taken by any agent $\mathrm{Ag}_{oa}$ is computed as:
$$b_{oa}(t) = \sum_{m} m\, \langle\psi'_{oa}(t)|\,M^{\dagger}_m M_m \otimes |\mathbf{a}_{opt}(t)\rangle\langle\mathbf{a}_{opt}(t)|\,|\psi'_{oa}(t)\rangle, \tag{80}$$
where $\mathbf{a}_{opt}(t)$ enters the measurement process, allowing the action efficacy $b_{oa}(t)$ to be evaluated from the post-action quantum states and the optimized actions taken by the agents. $b_{oa}(t)$ denotes the weighted sum of all possible measurement outcomes for actions undertaken by the agent $\mathrm{Ag}_{oa}$. Higher values of $b_{oa}(t)$ indicate favourable actions, while lower values suggest less desirable actions.
Algorithm 3 Quantum-aided DRL with Actor-Critic Networks
1: Initialize actor network $\theta^{\pi}_{oa}$, critic network $\theta^{Q}_{oa}$, and replay buffer $\mathcal{D}$
2: Prepare initial quantum-encoded state $\mathbf{s}^{Q}_{oa}(t)$ using Algorithm 1
3: while not converged do
4:   for each timestep $t$ do
5:     Sample a mini-batch $\mathcal{B}_t = \{(\mathbf{s}_i, \mathbf{a}_i, r_i, \mathbf{s}'_i)\}_{i=1}^{N_b} \in \mathcal{D}$
      Actor Network Optimization:
6:     Compute policy $\pi_{\theta^{\pi}_{oa}}(\mathbf{a}_{oa}(t) \mid \mathbf{s}^{Q}_{oa}(t))$ using (72)
7:     Estimate gradient using the rule in (73)
8:     Update: $\theta^{\pi}_{oa} \leftarrow \theta^{\pi}_{oa} + \alpha \nabla_{\theta^{\pi}_{oa}} J(\theta^{\pi}_{oa})$
      Critic Network Evaluation:
9:     Compute action value $Q_{\theta^{Q}_{oa}}(\mathbf{s}^{Q}_{oa}(t), \mathbf{a}_{oa}(t))$ using (74)
10:     Calculate TD error for critic network using (76)
11:     Update $\theta^{Q}_{oa}$ by minimizing $\mathcal{L}(\theta^{Q}_{oa})$ in (77)
      Feedback Loop via Quantum Measurement:
12:     Call Algorithm 2
13:     Evaluate action efficacy $b_{oa}(t)$ via (80)
      Optimization of Action Parameters:
14:     Update action parameters $\varrho_{oa}$ using (81)
15:   end for
16: end while
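f) Optimization of Action Parameters: The iterative refinement of the action parameters $\varrho_{\mathrm{oa}}$ for each agent is instrumental. The following expression guides this refinement process:
$$\varrho_{\mathrm{oa}}(t+1) = \varrho_{\mathrm{oa}}(t) - \alpha \nabla_{\varrho_{\mathrm{oa}}} \mathcal{L}_{\mathrm{act}}(\mathbf{s}_{\mathrm{oa}}(t), \mathbf{a}_{\mathrm{oa}}(t), \varrho_{\mathrm{oa}}(t)), \tag{81}$$
where $\alpha$ represents the learning rate, and $\nabla_{\varrho_{\mathrm{oa}}}$ denotes the gradient of the loss function, $\mathcal{L}_{\mathrm{act}}$, which is defined as:
$$\mathcal{L}_{\mathrm{act}}(\mathbf{s}_{\mathrm{oa}}(t), \mathbf{a}_{\mathrm{oa}}(t), \varrho_{\mathrm{oa}}(t)) = \beta \left( Q_{\theta^{Q}_{\mathrm{oa}}}(\mathbf{s}_{\mathrm{oa}}(t), \mathbf{a}_{\mathrm{oa}}(t)) - V_{\theta^{V}_{\mathrm{oa}}}(\mathbf{s}_{\mathrm{oa}}(t)) \right)^{2} + (1-\beta)\, D_{\mathrm{KL}}\!\left( \pi_{\theta^{\pi}_{\mathrm{oa}}} \,\|\, \pi_{\theta^{\pi}_{\mathrm{oa}'}} \right), \tag{82}$$
where $\beta$ is a balancing coefficient. $V_{\theta^{V}_{\mathrm{oa}}}$ denotes the critic's estimate of the expected return from state $\mathbf{s}_{\mathrm{oa}}(t)$, parameterized by the weights $\theta^{V}_{\mathrm{oa}}$. $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence measuring the difference between the current policy $\pi_{\theta^{\pi}_{\mathrm{oa}}}$ and a target policy $\pi_{\theta^{\pi}_{\mathrm{oa}'}}$.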
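As an illustration of the loss in (82) and the update in (81), the following is a minimal sketch, assuming scalar action-value and state-value estimates and discrete policy distributions over a common action set. The function names, the clipping constant `eps`, and the default `beta` are hypothetical choices for this example and are not taken from the paper; the gradient is passed in rather than derived from a quantum circuit.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same action set."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Clip to avoid log(0); eps is an illustrative safeguard, not from the paper.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def action_loss(q_value, v_value, pi_current, pi_target, beta=0.5):
    """Loss of the form in (82): beta * (Q - V)^2 + (1 - beta) * KL(pi || pi')."""
    squared_gap = (q_value - v_value) ** 2
    return beta * squared_gap + (1.0 - beta) * kl_divergence(pi_current, pi_target)

def update_action_params(rho, grad, alpha=1e-4):
    """Gradient-descent step of the form in (81): rho <- rho - alpha * grad."""
    return np.asarray(rho, dtype=float) - alpha * np.asarray(grad, dtype=float)
```

When the two policies coincide, the KL term vanishes and the loss reduces to the weighted squared gap between the action value and the state value.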
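C. Computational Complexity of Quantum-aided DRL
Given the quantum-encoded state space $\mathcal{S}^{Q}_{\mathrm{oa}}$ for each agent $\mathrm{Ag}_{\mathrm{DoA}}$ and $\mathrm{Ag}_{\mathrm{TASK}}$, the quantum state encoding exhibits a complexity of $\mathcal{O}(\log \mathcal{D})$, utilizing amplitude encoding within an $n$-qubit system, where $\mathcal{D} = 2^{n}$. The quantum-encoded operational state $|\psi^{Q}_{\mathrm{oa}}(t)\rangle$ is defined as $|\psi^{Q}_{\mathrm{oa}}(t)\rangle = \sum_{i=0}^{2^{n}-1} \alpha_i |i\rangle$, with the normalization condition $\sum_{i=0}^{2^{n}-1} |\alpha_i|^{2} = 1$. The computational complexity associated with the quantum decision-making, facilitated by the unitary transformations $\mathcal{U}(\varrho_{\mathrm{oa}}(t))$, is $\mathcal{O}(\mathcal{G}_{\mathcal{U}} n_{\mathcal{U}})$, where $\mathcal{G}_{\mathcal{U}}$ represents the gate count and $n_{\mathcal{U}}$ denotes the qubit count involved in $\mathcal{U}$. The optimization process follows (81) and iteratively adjusts the parameters $\varrho_{\mathrm{oa}}$ over $\mathcal{I}$ iterations. Therefore, the total computational complexity is expressed as
$$\mathcal{C}_{\mathrm{total}} = \mathcal{O}\left(\log \mathcal{D} + \mathcal{G}_{\mathcal{U}} n_{\mathcal{U}} + (\mathcal{I} \times \mathcal{U} \times n)\right). \tag{83}$$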
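The relation between the state dimension and the qubit count in amplitude encoding can be made concrete with a small classical sketch. It assumes a real-valued feature vector that is zero-padded to the next power of two and normalized so that the squared amplitudes sum to one; the function name `amplitude_encode` and the padding rule are illustrative assumptions, and the sketch only computes the amplitudes classically rather than preparing a state on quantum hardware.

```python
import numpy as np

def amplitude_encode(features):
    """Return (amplitudes, n_qubits) for a real feature vector.

    The vector is zero-padded to length 2**n_qubits and normalized so that
    sum(|alpha_i|^2) == 1, matching the normalization condition in the text.
    """
    x = np.asarray(features, dtype=float)
    if not np.any(x):
        raise ValueError("cannot encode an empty or all-zero vector")
    # Smallest n with 2**n >= len(x); at least one qubit.
    n_qubits = max(1, (x.size - 1).bit_length())
    padded = np.zeros(2 ** n_qubits)
    padded[: x.size] = x
    amplitudes = padded / np.linalg.norm(padded)
    return amplitudes, n_qubits
```

For example, a vector with five entries is padded to eight amplitudes and therefore needs three qubits, consistent with $\mathcal{D} = 2^{n}$.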
D. Convergence Analysis
To establish the reliability and effectiveness of the proposed quantum-aided multi-agent DRL framework, we present a theoretical analysis demonstrating the convergence of our solution. The proof is predicated on the principles of quantum computation and RL theory, ensuring a systematic approach towards achieving an optimal policy.
Theorem 1. Given a quantum-aided multi-agent DRL framework with the Hilbert space $\mathcal{H}$ for quantum state encodings $|\psi^{Q}_{\mathrm{oa}}(t)\rangle$, and unitary operations $\mathcal{U}(\varrho_{\mathrm{oa}}(t))$ for policy representation, the framework converges to an optimal policy $\pi^{*}$.
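Proof. Consider a quantum-aided DRL framework wherein the state of each agent at time $t$ is quantum-encoded as $|\psi^{Q}_{\mathrm{oa}}(t)\rangle \in \mathcal{H}$, with actions executed through parameterized unitary operations $\mathcal{U}(\varrho_{\mathrm{oa}}(t))$. The evolution under action $a$ is described by:
$$|\psi^{\prime Q}_{\mathrm{oa}}(t)\rangle = \mathcal{U}(\varrho_{\mathrm{oa}}(t)) |\psi^{Q}_{\mathrm{oa}}(t)\rangle. \tag{84}$$
The policy $\pi_{\theta^{\pi}_{\mathrm{oa}}}$, parameterized by $\theta^{\pi}_{\mathrm{oa}}$, is optimized by updating $\theta^{\pi}_{\mathrm{oa}}$ to maximize the expected cumulative reward. The policy gradient, derived using the parameter shift rule, is:
$$\nabla_{\theta^{\pi}_{\mathrm{oa}}} J = \frac{1}{2} \langle \partial_{\theta^{\pi}_{\mathrm{oa}}} \psi_{\mathrm{oa}} | \mathcal{U}^{\dagger} \mathcal{U} | \psi_{\mathrm{oa}} \rangle + c^{*}, \tag{85}$$
where $c^{*}$ denotes the complex conjugate. The Born rule provides feedback for policy updates after quantum measurement collapses $|\psi^{Q}_{\mathrm{oa}}(t)\rangle$ to classical outcomes [36].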
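Define $V^{\pi}(s)$ as the expected return from state $s$ under policy $\pi$, and $Q^{\pi}(s, a)$ as the expected return from taking action $a$ in state $s$ and following $\pi$. The Bellman optimality equations are given by
$$V^{*}(|\psi^{Q}_{s}\rangle) = \max_{a \in \mathcal{A}} Q^{*}(|\psi^{Q}_{s}\rangle, a), \tag{86}$$
$$Q^{*}(|\psi^{Q}_{s}\rangle, a) = \mathbb{E}\left[ \mathcal{O}_{R}(s, a) + \gamma V^{*}(|\psi^{Q}_{s'}\rangle) \,\middle|\, |\psi^{Q}_{s}\rangle, a \right], \tag{87}$$
where $\mathcal{O}_{R}(s, a)$, the quantum observable corresponding to the reward for taking action $a$ in state $|\psi^{Q}_{s}\rangle$, is a Hermitian operator acting on the Hilbert space $\mathcal{H}$, whose expectation can be expressed as
$$\mathbb{E}_{\mathcal{O}_{R}(s, a)} = \langle \psi^{Q}_{s} | \mathcal{O}_{R}(s, a) | \psi^{Q}_{s} \rangle. \tag{88}$$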
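The compactness of $\mathcal{H}$ and continuity of $\mathcal{U}(\varrho_{\mathrm{oa}}(t))$ imply that for any $\epsilon > 0$, there exists a $\delta > 0$ such that $\|\mathcal{U}(\varrho_{\mathrm{oa}}(t)) - \mathcal{U}(\varrho_{\mathrm{oa}}(t+\delta))\| < \epsilon$ for all $t$. Thus, as $t \to \infty$, we have:
$$\lim_{t \to \infty} \|\nabla_{\theta^{\pi}_{\mathrm{oa}}} J(\theta^{\pi}_{\mathrm{oa}}(t))\| = 0, \tag{89}$$
ensuring convergence of $V^{\pi}(|\psi^{Q}_{s}\rangle)$ to $V^{*}(|\psi^{Q}_{s}\rangle)$ and $Q^{\pi}(|\psi^{Q}_{s}\rangle, a)$ to $Q^{*}(|\psi^{Q}_{s}\rangle, a)$ for all $|\psi^{Q}_{s}\rangle \in \mathcal{H}$ and $a \in \mathcal{A}$, thereby establishing convergence to the optimal policy $\pi^{*}$. □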
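The parameter shift rule invoked in the proof can be illustrated on a single qubit. The sketch below is not the authors' circuit: it assumes one $R_y(\theta)$ rotation applied to $|0\rangle$ and a Pauli-$Z$ measurement, for which the expectation is $\cos\theta$, and it uses the standard form of the rule with a shift of $\pi/2$. All function names are hypothetical.

```python
import numpy as np

PAULI_Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def ry(theta):
    """Matrix of the single-qubit rotation R_y(theta) = exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])

def expectation_z(theta):
    """<psi|Z|psi> for |psi> = R_y(theta)|0>, which equals cos(theta)."""
    state = ry(theta) @ np.array([1.0, 0.0])
    return float(state @ PAULI_Z @ state)

def parameter_shift_gradient(f, theta, shift=np.pi / 2.0):
    """Gradient of an expectation value from two shifted circuit evaluations."""
    return 0.5 * (f(theta + shift) - f(theta - shift))
```

For this circuit the rule reproduces the analytic derivative $-\sin\theta$ exactly, using only two evaluations of the expectation value and no finite-difference approximation.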
From (65), we consider the reward sequence $\{r_t = r^{\mathrm{sum}}_{\mathrm{oa}}(t)\}_{t=1}^{T}$, where $T$ is the total number of episodes. The moving average $\mu_t$ and variance $\sigma^{2}_{t}$ over a window of size $W$ are defined as:
$$\mu_t = \frac{1}{W} \sum_{i=t-W+1}^{t} r_i, \qquad \sigma^{2}_{t} = \frac{1}{W} \sum_{i=t-W+1}^{t} (r_i - \mu_t)^{2}. \tag{90}$$
Convergence is determined when $\sigma^{2}_{t}$ remains below the threshold for the last $W$ episodes. In our analysis, we set $W = 100$. The reward variance over the final window, which spans episodes $T-W+1$ to $T$, can be expressed as:
$$\sigma^{2}_{T} = \frac{1}{100} \sum_{i=T-99}^{T} (r_i - \mu_{T})^{2}. \tag{91}$$
If $\sigma^{2}_{T} < 0.05$, we conclude the algorithm has converged.
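The windowed convergence test described above can be sketched as follows. The code assumes the rewards are available as a plain sequence indexed by episode, uses the population variance with the $1/W$ factor from (90), and takes the window size of 100 and threshold of 0.05 stated in the text as defaults; the function names are hypothetical.

```python
import numpy as np

def window_stats(rewards, t, window=100):
    """Moving average and variance over episodes t-W+1 .. t (1-indexed)."""
    r = np.asarray(rewards, dtype=float)
    if t < window or t > r.size:
        raise ValueError("window does not fit inside the reward sequence")
    segment = r[t - window : t]  # 0-indexed slice covering episodes t-W+1 .. t
    mu = segment.mean()
    var = np.mean((segment - mu) ** 2)  # 1/W normalization, as in (90)
    return float(mu), float(var)

def has_converged(rewards, window=100, threshold=0.05):
    """True if the reward variance over the final window is below the threshold."""
    _, var = window_stats(rewards, len(rewards), window)
    return var < threshold
```

A reward trace that is constant over its last 100 episodes passes the test, whereas one that keeps alternating between 0 and 1 has a window variance of 0.25 and does not.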
V. NUMERICAL RESULTS AND ANALYSIS
Fig. 3: Convergence plot: reward vs. episode (normalized reward vs. number of episodes for the proposed quantum-DRL, PPO [20], and DDPG [19]).

The present military surveillance system operates within a simulated 4 km² urban environment [4], with UAVs, satellites, and ground vehicles operating under LoS and NLoS conditions with a Rician factor $\kappa = 6.46$ [16] and $\epsilon_{\max} = 0.05$. We further consider a 100 ms time frame duration, and dynamic UAV altitudes of up to 100 meters [2]. UAV and ground vehicle velocities are limited to 15 m/s and 10 m/s, respectively, while satellites maintain a fixed orbit at 550 km [4] with a velocity of 7.8 km/s. The number of UAVs, UFOs, satellites, and ground vehicles is set to 16, 4, 8, and 2, respectively, with each UAV equipped with a 64-element RIS [11]. The maximum number of measurements ($Z_{\max}$) is fixed to 32 [11]. The carrier frequency for the radar signals is 2.4 GHz, and the bandwidth allocated to each military vehicle is 20 MHz [4], [11]. The URLLC packet error rate is fixed to $10^{-5}$, and the block length is set to 256 bits. The computational task size is 1 Mbit [2], with computational complexities of (100, 300) cycles/bit [2]. The satellites and vehicles have computational capacities of 12 GHz [2] and 4.8 GHz, respectively. We set the thresholds $P^{th}_{d} = 0.9$ and $P^{th}_{fa} = 0.1$ [29]. The maximum transmission power of vehicles is fixed at 2 W [2]. We also set the minimum required data rate for communications from vehicles to UAVs and satellites at 20 Mbps [4]. The framework imposes a maximum tolerable latency of 15 ms and 20 ms for task transmission and computing processes, respectively. Noise levels at the receivers are consistently maintained at −130 dBm [2]. Except for the plot in Fig. 3, all results are averaged over 100 simulation runs to ensure statistical reliability and to smooth out fluctuations and irregularities.
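To give a feel for the computing-latency budget implied by these parameters, the following worked arithmetic assumes the common model in which computing latency equals task size times cycles per bit divided by processor frequency. This model, the function name, and the reading of (100, 300) cycles/bit as two endpoint values are assumptions made for illustration; the paper's own latency expressions are defined in its system model.

```python
def computing_latency_ms(task_bits, cycles_per_bit, cpu_hz):
    """Latency in milliseconds to process task_bits on a processor of cpu_hz.

    Assumed model: latency = task_bits * cycles_per_bit / cpu_hz.
    """
    return 1e3 * task_bits * cycles_per_bit / cpu_hz

TASK_BITS = 1e6        # 1 Mbit task, as stated in the simulation setup
SATELLITE_HZ = 12e9    # satellite computational capacity
VEHICLE_HZ = 4.8e9     # ground vehicle computational capacity

for name, hz in (("satellite", SATELLITE_HZ), ("vehicle", VEHICLE_HZ)):
    for cycles in (100, 300):
        latency = computing_latency_ms(TASK_BITS, cycles, hz)
        print(f"{name}, {cycles} cycles/bit: {latency:.2f} ms")
```

Under this assumed model, processing the whole 1 Mbit task takes about 8.33 ms and 25 ms on a satellite, and about 20.83 ms and 62.5 ms on a vehicle, for 100 and 300 cycles/bit respectively, which can be compared against the 20 ms computing-latency limit.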
We employ an 8-qubit quantum system on Google Cirq with the QSimhSimulator to construct the variational quantum circuits that form the actor-critic networks essential to our DRL framework. The variational circuits are composed of four layers, each hosting 128 neurons and employing ReLU activation functions to approximate the policy and value function estimations. These layers utilize parameterized quantum gates, including $R_x$, $R_y$, $R_z$, and CNOT gates, to encode the system's state and execute complex network interactions through quantum gradient descent. For the quantitative analysis, the framework utilizes a discount factor ($\gamma$) of 0.99 [2] and a learning rate ($\alpha$) of 0.0001 [4]. The replay buffer accommodates 10,000 experiences, the batch size is 32, and the maximum number of training episodes is 100 [2], ensuring a robust dataset for network training and updates. Comparative benchmarks include the DDPG [19] and PPO [20] algorithms, executed on a computing setup featuring an Intel i9 processor, 64 GB RAM, and an NVIDIA RTX GPU.
Fig. 3 illustrates the performance gains achieved by our proposed quantum-DRL framework when applied to the present military surveillance system, in comparison to conventional DRL techniques such as DDPG [19] and PPO [20]. For better visualization, we normalize the actual reward of all outcomes to the range [0, 1], using $r^{\mathrm{norm}}_{i} = (r_i - r_{\min}) / (r_{\max} - r_{\min})$. Quantum-DRL outperforms DDPG and PPO by approximately 76.32% and 48.73% in normalized reward at episode 1000, respectively. This enhancement highlights quantum-DRL's capability to navigate complex state-action spaces efficiently. Furthermore, a critical observation from our experiments is the convergence rate of quantum-DRL. For our proposed quantum-DRL algorithm, we observe that the reward reaches a relatively stable state after approximately 650 episodes, although minor oscillations are present. These oscillations are attributed to the inherent exploration-exploitation trade-off in reinforcement learning, particularly in complex environments. As a more formal and quantitative criterion, we consider an algorithm to have converged if the variance of the normalized reward over the last 100 episodes falls below a predefined threshold. Specifically, we measure the moving average and standard deviation of the reward over a sliding window of 100 episodes. Convergence is achieved when the standard deviation remains consistently low (we consider below 0.05) over the window. In the case of the proposed quantum-DRL, the normalized reward stabilizes with minor fluctuations, suggesting effective learning and adaptation. In comparison, the benchmarks PPO [20] and DDPG [19] exhibit slower convergence, with differing stability levels: PPO stabilizes earlier than DDPG, which shows more variability. The faster convergence rate of quantum-DRL is attributed to its quantum-enhanced decision-making process, which utilizes quantum parallelism and entanglement to explore and exploit the decision space more comprehensively than classical approaches.
Fig. 4: Comparison of RMSE in DoA under various conditions. (a) RMSE in DoA (deg) vs. sensing duration (ms) for random, DDPG-based, PPO-based, and proposed quantum-DRL phase shift designs. (b) RMSE (deg, log scale) vs. number of RIS elements for CRLB [6], quantum-DRL, PPO [20], and DDPG [19]. (c) RMSE (deg, log scale) vs. SNR at the ground vehicle's radar (dB) for CRLB, quantum-DRL, PPO, and DDPG.
Fig. 5: Offloading latency vs. no. of sub-tasks. Total latency in task offloading (ms) vs. number of sub-tasks for quantum-DRL, PPO [20], and DDPG [19]. (a) Equal sub-task size partitioning. (b) Random sub-task size partitioning.
Fig. 6: Offloading latency vs. wireless bandwidth. Task offloading latency (ms) vs. communication bandwidth (MHz) under the distributed task offloading policy, for ($\beta_1$, $\beta_2$, $\beta_3$) = (0.0, 0.0, 1.0), (0.0, 0.5, 0.5), and (0.2, 0.3, 0.5).
The unique ability of the quantum framework to process and
encode high-dimensional data allows for a deeper understanding
of the operational environment, thereby significantly contributing
to the overall system performance.
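The min-max normalization applied to the rewards for Fig. 3 can be sketched in a few lines. The function name is hypothetical, and the handling of a constant reward sequence (returning zeros to avoid division by zero) is an assumption added for robustness rather than something stated in the paper.

```python
import numpy as np

def min_max_normalize(rewards):
    """Scale rewards to [0, 1] via (r_i - r_min) / (r_max - r_min)."""
    r = np.asarray(rewards, dtype=float)
    r_min, r_max = r.min(), r.max()
    if r_max == r_min:
        # Degenerate case (assumption): a constant sequence maps to all zeros.
        return np.zeros_like(r)
    return (r - r_min) / (r_max - r_min)
```

Because the minimum and maximum are taken over all outcomes being compared, the curves of different algorithms remain comparable only when they are normalized with the same $r_{\min}$ and $r_{\max}$.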
In Fig. 4a, the advantage of the quantum-DRL-based RIS phase shift design over its counterparts is evident from the substantial reduction in RMSE for DoA estimation across sensing durations. At a sensing duration of 16 ms, the quantum-DRL-based approach reduces the RMSE by 94.10%, 91.66%, and 82.61% compared to the random, DDPG-based, and PPO-based phase shift designs, respectively. This result highlights the quantum-DRL framework's capability to explore and optimize the complex, high-dimensional solution space efficiently.
Fig. 4b evaluates the performance of the quantum-DRL method against the theoretical lower bound for RMSE in DoA estimation, represented by the CRLB [6]. The reduction in RMSE with the quantum-DRL approach compared to the PPO [20] and DDPG [19] methods demonstrates its superior efficiency and closer adherence to the CRLB. When the number of RIS elements is 64, the RMSE reduction of the quantum-DRL approach is approximately 69.16% compared to PPO and approximately 73.37% compared to DDPG. These findings underscore the quantum-DRL framework's enhanced efficacy in approaching the theoretical accuracy limits set by the CRLB.
In Fig. 4c, the quantum-DRL approach significantly outperforms its counterparts in reducing the RMSE for DoA estimation. At 5 dB SNR, quantum-DRL reduces the RMSE by approximately 64.59% and 93.23% compared to PPO and DDPG, respectively. This performance highlights quantum-DRL's capability to optimize the RIS phase shift design even in low-SNR scenarios, demonstrating near-theoretical accuracy and robustness to noise.
Fig. 5 presents a comparative analysis of task offloading latency under different partitioning schemes. Our quantum-DRL approach demonstrates superior performance when the main task is partitioned into 10 sub-tasks, with a notable decrease in task latency relative to DDPG [19] and PPO [20]. With equal sub-task sizing, as shown in Fig. 5a, quantum-DRL achieves a performance gain of 23.18% and 14.36% compared to DDPG and PPO, respectively. With random sub-task sizing, as shown in Fig. 5b, the advantage of quantum-DRL is more pronounced, with performance gains of 43.09% over DDPG and 32.35% over PPO.

Fig. 7: Offloading latency vs. blockcode length. Total offloading latency (ms) vs. number of bits in each blockcode for $P_d$ = 1.00, 0.95, and 0.90.
Fig. 6 evaluates the task offloading latency versus allocated system bandwidth for various task distribution strategies, as detailed in Section II-C. At a bandwidth of 20 MHz, the setting employing a balanced distribution with $\beta_1 = 0.2$, $\beta_2 = 0.3$, and $\beta_3 = 0.5$ reduces the task offloading latency by 47.83% compared to a satellite MEC-only scenario ($\beta_3 = 1$), which does not utilize local processing or UAV caching. Moreover, an arrangement with $\beta_1 = 0$, $\beta_2 = 0.5$, and $\beta_3 = 0.5$ also improves on the satellite MEC-only configuration, with a latency reduction of 29.82%. This configuration demonstrates the critical interaction between UAV caching and satellite MEC, emphasizing the significance of strategic task distribution.
Fig. 7 shows how the detection probability $P_d$ influences vehicular task offloading latency as the block length changes. With a block length of 256 bits, setting $P_d = 1$ yields a latency reduction of approximately 31.56% compared to $P_d = 0.95$, and a more pronounced reduction of about 41.30% compared to $P_d = 0.9$. This trend persists at larger block lengths, where a higher $P_d$ consistently correlates with lower task offloading latency, and the reduction in latency becomes more substantial as the block length increases. This behavior highlights the importance of accurate detection of UFOs in enhancing system performance: higher $P_d$ values directly contribute to more efficient task processing and reduced task offloading latency in vehicular networks, emphasizing the need for optimized detection.
A. Runtime Complexity Analysis
An analysis of the runtime for each algorithm can offer valuable insights into its efficiency and practicality. However, directly reported running times are influenced by external factors such as programming language, hardware architecture, and coding style. To remove this variability, we present a runtime complexity analysis of the key algorithms depicted in Fig. 4, namely PPO, DDPG, and QDRL.

Fig. 8: Comparative analysis of runtime complexity across varying network dimensions for PPO, DDPG, and QDRL algorithms. Runtime complexity (log scale) vs. batch size ($B$). (a) Runtime complexity at $d = 10$. (b) Runtime complexity at $d = 100$. (c) Runtime complexity at $d = 1000$.

For PPO, the policy update requires gradient computation with a complexity of $\mathcal{O}(TBd)$, where $T$ is the number of time steps per update, $B$ is the batch size, and $d$ is the policy network dimension. The clipped objective function adds a complexity of $\mathcal{O}(Bd^{2})$. The value function update has a complexity similar to that of the policy update, namely $\mathcal{O}(TBd)$. Therefore, the minimum runtime complexity for PPO becomes $\mathcal{O}(TBd + Bd^{2})$.

For DDPG, on the other hand, the actor update complexity through gradient computation is $\mathcal{O}(Bd)$. For the critic network update, the gradient computation complexity is $\mathcal{O}(Bd^{2})$, along with the value network update of $\mathcal{O}(Bd^{2})$. Considering the replay buffer sampling with a complexity of $\mathcal{O}(B)$, the overall runtime complexity of DDPG becomes $\mathcal{O}(B + Bd + Bd^{2} + Bd^{2})$, which simplifies to $\mathcal{O}(Bd + Bd^{2})$ in terms of the neural network dimension.

For QDRL, the runtime complexity is given in (83). Here, the state space dimension $\mathcal{D}$ plays a role similar to the network dimension $d$ in PPO and DDPG. Since $\mathcal{D} = 2^{n}$, we can set $n$ such that $\mathcal{D}$ is comparable to $d$. The number of iterations ($\mathcal{I}$) and unitary transformations ($\mathcal{U}$) in QDRL should correspond to the number of updates in PPO and DDPG. Additionally, the gate count ($\mathcal{G}_{\mathcal{U}}$) and qubit count ($n_{\mathcal{U}}$) can be aligned with the network dimensions and update mechanisms of PPO and DDPG. Given these considerations, we align the QDRL parameters as follows: let $n = \log_2(d)$, making $\mathcal{D}$ equivalent to the network dimension $d$, and let $\mathcal{I}$ and $\mathcal{U}$ correspond to the time steps $T$ and batch size $B$ for PPO and DDPG. With these adjustments, we redefine the complexity equations and generate the runtime complexity plot shown in Fig. 8 for a comparative analysis.
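The three complexity expressions can be evaluated side by side with a short sketch. It drops the constants hidden by the big-O notation, takes the logarithm in base 2 so that $\log \mathcal{D} = n$, and substitutes $\mathcal{I} = T$ and $\mathcal{U} = B$ as described above. The gate count and the default choice of $n_{\mathcal{U}} = n$ are hypothetical inputs, since the text does not fix their values, and $n = \log_2(d)$ is treated as a real number when $d$ is not a power of two.

```python
import math

def ppo_complexity(T, B, d):
    """Operation count for PPO: T*B*d + B*d^2."""
    return T * B * d + B * d ** 2

def ddpg_complexity(B, d):
    """Operation count for DDPG: B*d + B*d^2."""
    return B * d + B * d ** 2

def qdrl_complexity(T, B, d, gate_count, qubit_count=None):
    """Operation count for QDRL following (83) with I = T, U = B, n = log2(d).

    gate_count (G_U) and qubit_count (n_U) are illustrative inputs; n_U
    defaults to n, which is an assumption of this sketch.
    """
    n = math.log2(d)
    n_u = n if qubit_count is None else qubit_count
    return n + gate_count * n_u + T * B * n
```

Sweeping `B` over the range shown in Fig. 8 for a fixed `d` gives curves of the same kind as those in the figure, although their absolute values depend on the assumed gate count and time steps.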
Fig. 8 demonstrates the comparative analysis of runtime
complexity across varying network dimensions (i.e., 𝑑=10
in Fig. 8a, 𝑑=100 in Fig. 8b, and 𝑑=1000 in Fig. 8c)
for PPO, DDPG, and QDRL algorithms. PPO and DDPG have
higher complexities compared to QDRL for smaller batch sizes
due to the quadratic term 𝐵𝑑2. As batch size increases, QDRL’s
complexity rises more gradually compared to PPO and DDPG,
showing potential efficiency for larger batch sizes. For larger
network dimensions (𝑑=1000), the relative performance differ-
ence becomes more pronounced, with QDRL maintaining lower
complexity increases compared to PPO and DDPG.
VI. CONCLUSIONS
This work presented a quantum-aided DRL framework to enhance DoA estimation accuracy and reduce computational task offloading latency in ISAC systems for military surveillance. By utilizing quantum computing's parallelism, it reduces the decision space dimensionality by encoding operational states and actions into quantum states, and it introduces a quantum-enhanced actor-critic method for policy optimization. Comparative analyses demonstrated significant outperformance, with faster convergence and a 76.32% and 48.73% improvement in normalized reward over DDPG and PPO, respectively. The quantum-DRL approach reduced the RMSE in DoA estimation by 94.10% compared to the random phase shift method, and by 91.66% and 82.61% against DDPG and PPO, respectively. Additionally, it minimized task offloading latency under URLLC requirements, achieving up to 43.09% latency reduction compared to DDPG and 32.35% against PPO, evidencing its efficacy.
REFERENCES
[1] A. Aubry, A. D. Maio, and L. Pallotta, “Power-aperture resource allocation
for a MPAR with communications capabilities,” IEEE Trans. Veh. Technol.,
pp. 1–14, 2024.
[2] D. S. Lakew, A.-T. Tran, N.-N. Dao, and S. Cho, “Intelligent self-
optimization for task offloading in LEO-MEC-assisted energy-harvesting-
UAV systems,” IEEE Trans. Netw. Sci. Eng., pp. 1–14, 2024.
[3] G. Geraci et al., “What will the future of UAV cellular communications
be? a flight from 5G to 6G,” IEEE Commun. Surv. Tutor., vol. 24, no. 3,
pp. 1304–1335, 3rd Quart., 2022.
[4] D. Han et al., “Two-timescale learning-based task offloading for remote
IoT in integrated satellite–terrestrial networks,” IEEE Internet Things J.,
vol. 10, no. 12, pp. 10 131–10 145, Jun. 2023.
[5] Y. Xu, Y. Li, J. A. Zhang, M. Di Renzo, and T. Q. S. Quek, “Joint beam-
forming for RIS-assisted integrated sensing and communication systems,”
IEEE Trans. Commun., pp. 1–1, 2023.
[6] Y. Pan, R. Li, X. Da, H. Hu, M. Zhang, D. Zhai, K. Cumanan, and O. A.
Dobre, “Cooperative trajectory planning and resource allocation for UAV-
enabled integrated sensing and communication systems,” IEEE Trans. Veh.
Technol., pp. 1–16, 2023.
[7] S. Li, B. Duo, X. Yuan, Y.-C. Liang, and M. Di Renzo, “Reconfigurable
intelligent surface assisted UAV communication: Joint trajectory design
and passive beamforming,” IEEE Wirel. Commun. Lett., vol. 9, no. 5, pp.
716–720, May 2020.
[8] R. Liu, M. Li, H. Luo, Q. Liu, and A. L. Swindlehurst, “Integrated sensing
and communication with reconfigurable intelligent surfaces: Opportunities,
applications, and future directions,” IEEE Wirel. Commun., vol. 30, no. 1,
pp. 50–57, Feb. 2023.
[9] A. Magbool et al., “A survey on integrated sensing and communication
with intelligent metasurfaces: Trends, challenges, and opportunities,” Jan.
2024.
[10] A. M. Elbir, K. V. Mishra, M. R. B. Shankar, and S. Chatzinotas, “The rise
of intelligent reflecting surfaces in integrated sensing and communications
paradigms,” IEEE Netw., pp. 1–8, 2022.
[11] Z. Chen, P. Chen, Z. Guo, Y. Zhang, and X. Wang, “A RIS-based vehi-
cle DOA estimation method with integrated sensing and communication
system,” IEEE Trans. Intell. Transp. Syst., pp. 1–13, 2023.
[12] X. Wang, Z. Fei, J. Huang, and H. Yu, “Joint waveform and discrete phase
shift design for RIS-assisted integrated sensing and communication system
under Cramér-Rao bound constraint,” IEEE Trans. Veh. Technol., vol. 71,
no. 1, pp. 1004–1009, Jan. 2022.
[13] Z. Fei, X. Wang, N. Wu, J. Huang, and J. A. Zhang, “Air-ground
integrated sensing and communications: Opportunities and challenges,”
IEEE Commun. Mag., vol. 61, no. 5, pp. 55–61, May 2023.
[14] Q. Liu, R. Luo, H. Liang, and Q. Liu, “Energy-efficient joint computation
offloading and resource allocation strategy for ISAC-aided 6G V2X net-
works,” IEEE Trans. Green Commun. Netw., vol. 7, no. 1, pp. 413–423,
Mar. 2023.
[15] Q. Wu, J. Xu, Y. Zeng, D. W. K. Ng, N. Al-Dhahir, R. Schober, and A. L.
Swindlehurst, “A comprehensive overview on 5G-and-beyond networks
with UAVs: From communications to sensing and intelligence,” IEEE J.
Sel. Areas Commun., vol. 39, no. 10, pp. 2912–2945, Oct. 2021.
[16] A. Paul, K. Singh, M.-H. T. Nguyen, C. Pan, and C.-P. Li, “Digital twin-
assisted space-air-ground integrated networks for vehicular edge comput-
ing,” IEEE J. Sel. Top. Signal Process., pp. 1–16, 2023.
[17] Z. Wang and V. W. Wong, “Deep learning for ISAC-enabled end-to-end
predictive beamforming in vehicular networks,” in Proc. IEEE International
Conference on Communications, Oct. 2023, pp. 5713–5718.
[18] Q. Liu, Y. Zhu, M. Li, R. Liu, Y. Liu, and Z. Lu, “DRL-based secrecy
rate optimization for RIS-assisted secure ISAC systems,” IEEE Trans. Veh.
Technol., vol. 72, no. 12, pp. 16 871–16 875, Dec. 2023.
[19] Y. Gong, Y. Wei, Z. Feng, F. R. Yu, and Y. Zhang, “Resource allocation for
integrated sensing and communication in digital twin enabled internet of
vehicles,” IEEE Trans. Veh. Technol., vol. 72, no. 4, pp. 4510–4524, 2023.
[20] X. Liu, H. Zhang, K. Long, M. Zhou, Y. Li, and H. V. Poor, “Proximal
policy optimization-based transmit beamforming and phase-shift design in
an IRS-aided ISAC system for the THz band,” IEEE J. Sel. Areas Commun.,
vol. 40, no. 7, pp. 2056–2069, Jul. 2022.
[21] M. Di Renzo, A. Zappone, M. Debbah, M.-S. Alouini, C. Yuen, J. de Rosny,
and S. Tretyakov, “Smart radio environments empowered by reconfigurable
intelligent surfaces: How it works, state of research, and the road ahead,”
IEEE J. Sel. Areas Commun., vol. 38, no. 11, pp. 2450–2525, Nov. 2020.
[22] W. Chen, X. Qiu, T. Cai, H.-N. Dai, Z. Zheng, and Y. Zhang, “Deep
reinforcement learning for internet of things: A comprehensive survey,”
IEEE Commun. Surv. Tutor., vol. 23, no. 3, pp. 1659–1692, 3rd Quart.,
2021.
[23] R. Yan, Y. Wang, Y. Xu, and J. Dai, “A multiagent quantum deep
reinforcement learning method for distributed frequency control of islanded
microgrids,” IEEE Trans. Control Netw. Syst., vol. 9, no. 4, pp. 1622–1632,
Dec. 2022.
[24] Silvirianti, B. Narottama, and S. Y. Shin, “Layerwise quantum deep rein-
forcement learning for joint optimization of UAV trajectory and resource
allocation,” IEEE Internet Things J., vol. 11, no. 1, pp. 430–443, Jan. 2024.
[25] K. Wang, N. Qi, H. Liu, A.-A. A. Boulogeorgos, T. A. Tsiftsis, M. Xiao,
and K.-K. Wong, “Reconfigurable intelligent surfaces aided energy effi-
ciency maximization in cell-free networks,” IEEE Wireless Commun. Lett.,
vol. 13, no. 6, pp. 1596–1600, 2024.
[26] J. Xie, W. Wang, X. Liu, I. Rashdan, C. Di, and J. Qin, “Identification
of NLOS based on soft decision method,” IEEE Wireless Commun. Lett.,
vol. 12, no. 4, pp. 703–707, Apr. 2023.
[27] H. A. Ammar, R. Adve, S. Shahbazpanahi, G. Boudreau, and K. V. Srinivas,
“RWP+: A new random waypoint model for high-speed mobility,” IEEE
Commun. Lett., vol. 25, no. 11, pp. 3748–3752, Nov. 2021.
[28] M. Chen, Q. Li, L. Huang, L. Feng, and M. Rihan, “One-bit Cramér–Rao
bound of direction of arrival estimation for deterministic signals,” IEEE
Trans. Circuits Syst. II, Exp. Briefs, vol. 71, no. 2, pp. 957–961, Feb. 2024.
[29] A. Paul and S. P. Maity, “Outage analysis in cognitive radio networks with
energy harvesting and Q-routing,” IEEE Trans. Veh. Technol., vol. 69, no. 6,
pp. 6755–6765, Jun. 2020.
[30] A. Paul, K. Singh, C.-P. Li, O. A. Dobre, and T. Q. Duong, “Digital
twin-aided vehicular edge network: A large-scale model optimization by
quantum-DRL,” IEEE Trans. Veh. Technol., pp. 1–17, 2024.
[31] N. Yang, L. Han, R. Liu, Z. Wei, H. Liu, and C. Xiang, “Multiobjective
intelligent energy management for hybrid electric vehicles based on multi-
agent reinforcement learning,” IEEE Trans. Transp. Electrification, vol. 9,
no. 3, pp. 4294–4305, Sept. 2023.
[32] F. Metz and M. Bukov, “Self-correcting quantum many-body control using
reinforcement learning with tensor networks,” Nat. Mach. Intell., vol. 5,
no. 7, pp. 780–791, Jul. 2023.
[33] K. Miyamoto and H. Ueda, “Extracting a function encoded in amplitudes
of a quantum state by tensor network and orthogonal function expansion,”
Quantum Information Processing, vol. 22, no. 6, p. 239, Jun. 2023.
[34] M. Schuld and N. Killoran, “Quantum machine learning in feature Hilbert
spaces,” Physical Review Letters, vol. 122, no. 4, p. 040504, Feb. 2019.
[35] Z. Qu and H. Sun, “A secure information transmission protocol for
healthcare cyber based on quantum image expansion and Grover search
algorithm,” IEEE Trans. Netw. Sci. Eng., vol. 10, no. 5, pp. 2551–2563,
Sept.-Oct. 2023.
[36] M. S. Rudolph et al., “Synergistic pretraining of parametrized quantum
circuits via tensor networks,” Nat. Commun., vol. 14, no. 1, p. 8367, Dec.
2023.
[37] Z. Li, K. Xue, J. Li, L. Chen, R. Li, Z. Wang, N. Yu, D. S. L. Wei, Q. Sun,
and J. Lu, “Entanglement-assisted quantum networks: Mechanics, enabling
technologies, challenges, and research directions,” IEEE Commun. Surv.
Tutor., vol. 25, no. 4, pp. 2133–2189, 4th Quart., 2023.
[38] J. A. Ansere, E. Gyamfi, V. Sharma, H. Shin, O. A. Dobre, and T. Q. Duong,
“Quantum deep reinforcement learning for dynamic resource allocation
in mobile edge computing-based IoT systems,” IEEE Trans. Wireless
Commun., pp. 1–1, 2023.
[39] J. A. Ansere, D. T. Tran, O. A. Dobre, H. Shin, G. K. Karagiannidis, and
T. Q. Duong, “Energy-efficient optimization for mobile edge computing
with quantum machine learning,” IEEE Wireless Commun. Lett., pp. 1–1,
2023.
Anal Paul (Member, IEEE) received his Bachelor of
Technology degree from the Government College of
Engineering and Ceramic Technology, India, in 2008,
and his Master of Engineering degree from Jadavpur
University, India, in 2010. In 2021, he received his
Ph.D. degree from the Indian Institute of Engineering
Science and Technology, Shibpur. From July to Decem-
ber 2022, he worked as a postdoctoral researcher in the
Department of Information and Communication Engi-
neering at Yeungnam University, South Korea. Since
January 2023, he has been a Postdoctoral Researcher
at National Sun Yat-sen University, Taiwan, conducting research in Digital Twin
and Metaverse applications for Wireless Communication Systems.
Keshav Singh (Member, IEEE) received the Ph.D.
degree in Communication Engineering from National
Central University, Taiwan, in 2015. He currently works
at the Institute of Communications Engineering, Na-
tional Sun Yat-sen University (NSYSU), Taiwan as an
Associate Professor. Prior to this, he held the position of
Research Associate from 2016 to 2019 at the Institute
of Digital Communications, University of Edinburgh,
U.K. From 2019 to 2020, he was associated with
the University College Dublin, Ireland as a Research
Fellow. He chaired workshops on conferences like
IEEE GLOBECOM 2023 and IEEE WCNC, 2024. He also serves as leading
guest editor of IEEE Transactions on Green Communications and Networking
Special Issue on Design of Green Near-Field Wireless Communication Networks
and IEEE Internet of Things Journal Special Issue on Positioning and Sensing
for Near-Field (NF)-driven Internet-of-Everything. He leads research in the
areas of green communications, resource allocation, transceiver design for full-
duplex radio, ultra-reliable low-latency communication, non-orthogonal multiple
access, machine learning for wireless communications, integrated sensing and
communications, non-terrestrial networks, and large intelligent surface-assisted
communications.
Aryan Kaushik (Member, IEEE) is currently an
Assistant Professor on senior grade with the University
of Sussex, UK, since 2021. Prior to that, he has been
with University College London, UK (2020-21), Uni-
versity of Edinburgh, UK (2015-19), and Hong Kong
University of Science and Technology, Hong Kong
(2014-15). He has also held visiting appointments at
Imperial College London, UK (2019-20), University
of Bologna, Italy (2024), University of Luxembourg,
Luxembourg (2018), Athena RC, Greece (2021), and
Beihang University, China (2017-19, 2022). He has
been External PhD Examiner internationally such as at Universidad Carlos
III de Madrid, Spain, in 2023. He has been an Invited Panel Member at the
UK EPSRC ICT Prioritisation Panel in 2023, and has led several collaborative
projects forging industry and academic collaborations on topics of strategic
importance. He has been Editor of three books on ISAC (2024 Edition), 6G
NTN (2025 Edition) and ESIT (2025 Edition) by Elsevier, and several journals
such as IEEE Open Journal of the Communications Society (Best Editor Award
2023), IEEE Communications Letters (Exemplary Editor 2023), IEEE Internet of
Things Magazine (including the AI for IoT miniseries), IEEE Communications
Technology News (initiated the IEEE ComSoc Podcasts series), and several
special issues such as in IEEE Wireless Communications Magazine, IEEE
Network Magazine, and many other IEEE venues. He has been an invited/keynote
and tutorial speaker for over 75 academic and industry events, and conferences
globally such as at IEEE ICC 2024, IEEE GLOBECOM 2023 and 2024, IEEE
VTC-Spring 2023 and 2024, IEEE ICMLCN 2024, IEEE WCNC 2023, IEEE
MeditCom 2023 and 2024, One6G Summit 2023 and 2024, and many other
events worldwide. He has been chairing in Organizing and Technical Program
Committees of 10 flagship IEEE conferences such as IEEE ICC 2024 and 2025,
IEEE ICMLCN 2024 and 2025, and IEEE WCNC 2023 and 2024, etc. He has
been General Chair of over 18 workshops for IEEE ComSoc conferences such
as at IEEE ICC 2024, IEEE GLOBECOM 2023 and 24, IEEE WCNC 2023 and
2024, IEEE PIMRC 2022, 2023 and 2024, and many others.
Chih-Peng Li (Fellow, IEEE) received the B.S. de-
gree in Physics from National Tsing Hua University,
Hsin Chu, Taiwan, and the Ph.D. degree in Electrical
Engineering from Cornell University, NY, USA. Dr.
Li was a Member of the Technical Staff with Lucent
Technologies. Since 2002, he has been with National
Sun Yat-sen University (NSYSU), Kaohsiung, Taiwan,
where he is currently a Distinguished Professor. Dr. Li
has served in various positions with NSYSU, including
the Chairman of the Electrical Engineering Department,
the VP of General Affairs, the Dean of Engineering
College, and the VP of Academic Affairs. His research interests include wireless
communications, baseband signal processing, and data networks. He is now
the Director General of the Engineering and Technologies Department at the
National Science and Technology Council, Taiwan.
Dr. Li is currently the Chapter Chair of the IEEE Broadcasting Technology
Society Tainan Section. Dr. Li has also served as the Chapter Chair of the IEEE
Communication Society Tainan Section, the President of the Taiwan Institute
of Electrical and Electronics Engineering, the Editor of IEEE Transactions
on Wireless Communications, the Associate Editor of IEEE Transactions on
Broadcasting, and a Member of the Board of Governors with the IEEE Tainan Section.
Dr. Li has received various awards, including the Outstanding Research Award
from the Ministry of Science and Technology. Dr. Li is a Fellow of the IEEE.
Octavia A. Dobre (Fellow, IEEE) is a Professor
and Tier-1 Canada Research Chair with Memorial
University, Canada. She was a Visiting Professor with
Massachusetts Institute of Technology, USA and Université
de Bretagne Occidentale, France. Her research
interests encompass wireless communication and net-
working technologies, as well as optical and underwa-
ter communications. She has (co-)authored over 500
refereed papers in these areas. Dr. Dobre serves as the
VP Publications of the IEEE Communications Society.
She was the inaugural Editor-in-Chief (EiC) of the
IEEE Open Journal of the Communications Society and the EiC of the IEEE
Communications Letters.
Dr. Dobre was a Fulbright Scholar, Royal Society Scholar, and Distinguished
Lecturer of the IEEE Communications Society. She obtained 7 IEEE Best Paper
Awards including the 2024 Heinrich Hertz Award. Dr. Dobre is an elected
member of the European Academy of Sciences and Arts, a Fellow of the
Engineering Institute of Canada, and a Fellow of the Canadian Academy of
Engineering.
Marco Di Renzo (Fellow, IEEE) received the Laurea
(cum laude) and Ph.D. degrees in electrical engineering
from the University of L’Aquila, Italy, in 2003 and
2007, respectively, and the Habilitation à Diriger des
Recherches (Doctor of Science) degree from University
Paris-Sud (currently Paris-Saclay University), France,
in 2013. Currently, he is a CNRS Research Director
(Professor) and the Head of the Intelligent Physical
Communications group in the Laboratory of Signals
and Systems (L2S) at Paris-Saclay University – CNRS
and CentraleSupelec, Paris, France. Also, he is an
elected member of the L2S Board Council and a member of the L2S Management
Committee, and is a Member of the Admission and Evaluation Committee of
the Ph.D. School on Information and Communication Technologies, Paris-Saclay
University. He is a Founding Member and the Academic Vice Chair of the
Industry Specification Group (ISG) on Reconfigurable Intelligent Surfaces (RIS)
within the European Telecommunications Standards Institute (ETSI), where
he served as the Rapporteur for the work item on communication models,
channel models, and evaluation methodologies. He is a Fellow of the IEEE,
IET, EURASIP, and AAIA; an Academician of AIIA; an Ordinary Member
of the European Academy of Sciences and Arts, an Ordinary Member of the
Academia Europaea; an Ambassador of the European Association on Antennas
and Propagation; and a Highly Cited Researcher. He also holds the 2023
France-Nokia Chair of Excellence in ICT at the University of Oulu, Finland, and the
Tan Chin Tuan Exchange Fellowship in Engineering at Nanyang Technological
University, Singapore. He was a Fulbright Fellow at the City University of
New York, USA; a Nokia Foundation Visiting Professor with Aalto University,
Finland; and a Royal Academy of Engineering Distinguished Visiting Fellow
with Queen’s University Belfast, U.K. His recent research awards include the
2021 EURASIP Best Paper Award, the 2022 IEEE COMSOC Outstanding Paper
Award, the 2022 Michel Monpetit Prize conferred by the French Academy of
Sciences, the 2023 EURASIP Best Paper Award, the 2023 IEEE ICC Best
Paper Award, the 2023 IEEE COMSOC Fred W. Ellersick Prize, the 2023
IEEE COMSOC Heinrich Hertz Award, the 2023 IEEE VTS James Evans Avant
Garde Award, the 2023 IEEE COMSOC Technical Recognition Award from the
Signal Processing and Computing for Communications Technical Committee,
the 2024 IEEE COMSOC Fred W. Ellersick Prize, the 2024 Best Tutorial Paper
Award, and the 2024 IEEE COMSOC Marconi Prize Paper Award in Wireless
Communications. He served as the Editor-in-Chief of IEEE Communications
Letters during the period 2019-2023, and he is now serving on the Advisory
Board. He is currently serving as a Voting Member of the Fellow Evaluation
Standing Committee and as the Director of Journals of the IEEE Communications
Society.
Trung Q. Duong (Fellow, IEEE) is a Canada Excel-
lence Research Chair (CERC) and a Full Professor at
Memorial University, Canada. He is also the adjunct
Chair Professor in Telecommunications at Queen’s Uni-
versity Belfast, UK and a Research Chair of Royal
Academy of Engineering, UK. He was a Distinguished
Advisory Professor at Inje University, South Korea
(2017-2019), an Adjunct Professor and the Director of
Institute for AI and Big Data at Duy Tan University,
Vietnam (2012-present), and a Visiting Professor (under
Eminent Scholar program) at Kyung Hee University,
South Korea (2023-2025). His current research interests include quantum com-
munications, wireless communications, quantum machine learning, and quantum
optimisation.
Dr. Duong has served as an Editor/Guest Editor for the IEEE TRANSACTIONS
ON WIRELESS COMMUNICATIONS, IEEE TRANSACTIONS ON COMMUNICATIONS,
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE COMMUNICATIONS LETTERS,
IEEE WIRELESS COMMUNICATIONS LETTERS, IEEE WIRELESS COMMUNICATIONS,
IEEE COMMUNICATIONS MAGAZINE, and IEEE JOURNAL ON SELECTED AREAS IN
COMMUNICATIONS. He received
the Best Paper Award at the IEEE VTC-Spring 2013, IEEE ICC 2014, IEEE
GLOBECOM 2016, 2019, 2022, IEEE DSP 2017, IWCMC 2019, 2023, and
IEEE CAMAD 2023. He has received two prestigious awards from the Royal
Academy of Engineering (RAEng): RAEng Research Chair (2021-2025) and
the RAEng Research Fellow (2015-2020). He is the recipient of the prestigious
Newton Prize 2017.