Citation: Huang, X.; Xia, X.; Wang, Z.; Peng, M. Joint Drone Access and LEO Satellite Backhaul for a Space–Air–Ground Integrated Network: A Multi-Agent Deep Reinforcement Learning-Based Approach. Drones 2024, 8, 218. https://doi.org/10.3390/drones8060218

Academic Editor: Carlos Tavares Calafate

Received: 19 April 2024; Revised: 23 May 2024; Accepted: 24 May 2024; Published: 25 May 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Joint Drone Access and LEO Satellite Backhaul for a
Space–Air–Ground Integrated Network: A Multi-Agent Deep
Reinforcement Learning-Based Approach
Xuan Huang 1,*, Xu Xia 1, Zhibo Wang 2 and Mugen Peng 3

1 6G Research Center, China Telecom Research Institute, Beijing 102209, China; xiaxu@chinatelecom.cn
2 Hisilicon Technologies Co., Ltd., Beijing 100085, China; wangzhibo9@hisilicon.com
3 Department of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China; pmg@bupt.edu.cn
* Correspondence: huangx21@chinatelecom.cn
Abstract: The space–air–ground integrated network can provide services to ground users in remote
areas by utilizing high-altitude platform (HAP) drones to support stable user access and using low
earth orbit (LEO) satellites to provide large-scale traffic backhaul. However, the rapid movement of
LEO satellites requires dynamic maintenance of the matching relationship between LEO satellites
and HAP drones. Additionally, different traffic types generated at HAP drones hold varying levels
of values. Therefore, a tripartite matching problem among LEO satellites, HAP drones, and traffic
types jointly considering multi-dimensional characteristics such as remaining visible time, channel
condition, handover latency, and traffic storage capacity is formulated as mixed integer nonlinear
programming to maximize the average transmitted traffic value. The traffic generation state for
HAP drones is modeled as a mixture of stochasticity and determinism, which aligns with real-world
scenarios, posing challenges for traditional optimization solvers. Thus, the original problem is
decoupled into two independent sub-problems: traffic–drone matching and LEO–drone matching,
which are addressed by mathematical simplification and multi-agent deep reinforcement learning
with centralized training and decentralized execution, respectively. Simulation results verify the
effectiveness and superiority of the proposed tripartite matching approach.
Keywords: space–air–ground integrated network; high altitude platform drone; LEO satellite; matching problem; reinforcement learning
1. Introduction
Densely deployed ground communication infrastructures can provide access services
for mobile and Internet of Things (IoT) users in urban areas, with the advantages of high
data rates and small propagation delay. However, deploying infrastructures in remote
areas such as the ocean and desert is challenging and expensive. Various applications in
remote areas, such as forest monitoring, desert communication, and maritime logistics,
are difficult to serve [1,2]. There are still approximately three billion people all over the world living without Internet access, presenting an obstacle for 6G in realizing seamless connectivity and ubiquitous access [3,4]. How to achieve user access and traffic backhaul for mobile and IoT users in remote areas has become crucial [5].
Satellite communication makes up for the shortage of terrestrial networks and provides
users with large-scale access services with wide coverage. The utilization of low earth
orbit (LEO) satellites for global Internet access and traffic backhaul has garnered attention
due to their lower development and launch cost and transmission latency compared with
geostationary earth orbit (GEO) and medium earth orbit (MEO) satellites [6]. The use of inter-satellite links (ISLs) enables the traffic generated by ground IoT users to be relayed among LEO satellites and transmitted back to the terrestrial traffic center [7]. However, the severe path loss between LEO satellites and ground IoT users makes it difficult for users to directly access LEO satellites due to limited transmission power.
In order to reduce the demand for user-side transmission power, the space–air–ground
integrated network has attracted a lot of attention from academia and industry in 6G [8,9]. Compared to the orbital altitude of hundreds or thousands of kilometers of LEO satellites, the altitude of drones is much lower, thus requiring lower transmission power from ground IoT users [10]. In the space–air–ground integrated network, drones in the air are utilized to support user access with lower transmission energy costs, and satellites in space are used to provide traffic backhaul with global coverage [11]. They work together with
communication infrastructures on the ground to provide users with various application
services. In recent years, a category of drone that can provide users with more stable access,
namely high altitude platform (HAP) drone, has become a research hotspot. Different from
traditional drones, HAP drones hover at an altitude of about 20 km in the stratosphere, with
base stations deployed on them to provide users with ubiquitous and stable access. HAP
drones can extend communication capabilities across the space, air, and ground domains.
Specifically, aerial networks composed of HAP drones are utilized to support user access
and collect traffic generated by users in remote or inaccessible areas lacking communication
infrastructures. Then, LEO satellites are used to support traffic backhaul to the terrestrial
traffic center, thus supplying stable access and traffic backhaul [12].
Due to the advantages of low deployment cost, flexible on-demand deployment,
and reliable line-of-sight communication link, HAP drones have been employed in the
satellite-ground network for user access, traffic backhaul, and task execution. However,
practical issues in the space–air–ground integrated network have been overlooked in
existing research. For instance, due to the high mobility of LEO satellites, HAP drones
need to be switched between different LEO satellites. Therefore, the calculation of the
available traffic transmission time of HAP drones must jointly consider the remaining
visible time and handover latency. Furthermore, different traffic types generated at HAP
drones hold varying values, suggesting a preference for establishing matching for high-
value traffic types first. Lastly, the assumption of a specific constant traffic generation state
at HAP drones in existing research does not align with the stochastic and deterministic
nature of traffic generation in practice, rendering conventional static matching algorithms
inapplicable [13].
Therefore, in order to address the issues mentioned above, a tripartite matching
problem among LEO satellites, HAP drones, and traffic types is investigated for the space–
air–ground integrated network in this paper. Specifically, the main contributions of this
paper are as follows:
First, the network architecture and working mechanism of the space–air–ground integrated network are introduced, which aim at achieving user access and traffic backhaul in remote areas. Different from the conventional static traffic generation state
with deterministic variables, the traffic generation state at HAP drones is modeled as
a mixture of stochasticity and determinism, which aligns with real-world scenarios.
Then, different from the conventional schemes that treat all traffic types as equally im-
portant, we develop a tripartite matching problem among satellites, HAP drones, and
traffic types based on the different values of different traffic types. The problem can
be decoupled into two sub-problems: traffic–drone matching and LEO–drone match-
ing. Traffic–drone matching is simplified into multiple separate sub-subproblems
through mathematical analysis, which can be addressed independently. LEO–drone
matching cannot be solved by conventional optimization solvers since the traffic
generation state at drones is a mixture of stochasticity and determinism. Thus, rein-
forcement learning is adopted. Moreover, due to the significant propagation latency
between terrestrial traffic center and LEO satellites, a conventional centralized scheme
cannot obtain the latest status of the network. Therefore, it cannot devise LEO–
drone matching strategies in a timely manner. In addition, the state space of the
LEO–drone matching sub-problem is continuous. Therefore, a multi-agent deep reinforcement learning approach with centralized training and decentralized execution
is proposed, in which the value network is centrally trained at the terrestrial traffic
center and the LEO–drone matching strategy is timely devised at LEO satellites in
a decentralized manner.
Finally, the convergence performance of the proposed matching approach is discussed
and analyzed through simulations. In addition, the proposed algorithm is com-
pared with state-of-the-art algorithms under different network parameters to validate
its effectiveness.
The rest of the paper is organized as follows. The related works are discussed in
Section 2. The system model and working mechanism are illustrated in Section 3. Section 4
formulates and simplifies the tripartite matching problem. In Section 5, the formulated
problem is solved by the multi-agent deep reinforcement learning algorithm. Simulation
results are presented and discussed in Section 6. Future work is summarized in Section 7.
Finally, conclusions are drawn in Section 8.
2. Related Works
Abbasi et al. first presented the potential use cases, open challenges, and possible
solutions of HAP drones for next-generation networks [14]. The main communication
links between HAP drones and other non-terrestrial network (NTN) platforms, along with
their advantages and challenges, are presented in [15]. Due to the rapid movement of LEO
satellites, the matching relationship between HAP drones and LEO satellites is not fixed, so
efficient matching and association strategies need to be developed. In [16], the matching
relationship between user equipment (UE), HAP drones, and terrestrial base stations (BS)
is formulated as a mixed discrete–continuous optimization problem under the HAP drone
payload connectivity constraints, HAP drones and BSs power constraints, and backhaul
constraints to maximize the network throughput. The formulated problem is solved using a
combination of integer linear programming and generalized assignment problems. A deep
Q-learning (DQL) approach is proposed in [17] to perform the user association between
a terrestrial base station and a HAP drone based on the channel state information of the
previous time slot. In addition to the above-mentioned UE’s selection between terrestrial
networks and non-terrestrial networks, there has been relevant research on the three-
party matching problem among users, HAP drones and satellites in remote areas without
terrestrial network coverage. In [18], the matching problem among users, HAP drones,
and satellites is formulated to maximize the total revenue, and it is solved by a satellite-oriented restricted three-sided matching algorithm. In [19], a throughput maximization
problem is formulated for ground users in an integrated satellite–aerial–ground network
by comprehensively optimizing user association, transmission power, and unmanned
aerial vehicle (UAV) trajectory. In [20], a UAV-LEO integrated traffic collection network is proposed to maximize the uploaded traffic volume under energy consumption constraints by comprehensively considering bandwidth allocation, UAV trajectory design, power allocation, and LEO satellite selection. The maximum computation delay among terminals is minimized in [21] by jointly considering the matching relationship, resource allocation, and
deployment location optimization. An alternating optimization algorithm based on block
coordinate descent and successive convex approximation is proposed to solve this. A joint
association and power allocation approach is proposed for the space–air–ground network
in [22] to maximize the transmitted traffic amount while minimizing the transmit power
under the constraints of the power budget and quality of service (QoS) requirements of
HAP drones and the data storage and visibility time of LEO satellites. The association
problem and power allocation problem are alternately addressed by the GUROBI optimizer
and the whale optimization algorithm, respectively.
It is worth mentioning that reinforcement learning (RL) algorithms are widely used for
HAP drone problems. HAP drones form a distributed network, and with multi-agent RL,
the space–air–ground integrated network can effectively become self-organizing. In [23], a multi-agent Q-learning approach is proposed to tackle the service function chain placement
problem for LEO satellite networks in a discrete-time stochastic control framework, thus
optimizing the long-term system performance. In [24], a multi-agent deep reinforcement
learning algorithm with global rewards is proposed to optimize the transmit power, CPU
frequency, bit allocation, offloading decision, and bandwidth allocation via a decentralized
method, thus achieving the computation offloading and resource allocation for the LEO
satellite edge computing network. In [25], the utility of HAP drones is maximized by jointly
optimizing association and resource allocation, which is formulated as a Stackelberg game.
The formulated problem is transformed into a stochastic game model, and a multi-agent
deep RL algorithm is adopted to solve it.
3. System Model and Working Mechanism
In order to provide services for mobile users and IoT users in remote areas, the space–
air–ground integrated network is investigated in this paper, and its network architecture is
shown in Figure 1. It utilizes an aerial network composed of HAP drones to collect traffic
generated by various IoT users, thus providing stable and large-scale access services for
areas without ground communication infrastructures. Via a drone–LEO link and multiple
LEO–LEO links, the collected traffic is then relayed to the LEO satellite connected to a
ground station to achieve traffic backhaul. Finally, the ground station downloads the traffic
via the LEO–ground link and transmits it back to the terrestrial traffic center for processing
via optical fibers. Ground devices access HAP drones through the C-band. HAP drones
are directly connected to LEO satellites through the Ka-band to achieve high-rate traffic
backhaul [26].
Figure 1. Network architecture of the space–air–ground integrated network: HAP drones collect traffic from forest monitoring, desert communication, and maritime logistics users over C-band links; Ka-band drone–LEO links, LEO–LEO inter-satellite links (ISLs), and a Ka-band LEO–ground link relay the traffic to a ground station, which forwards it to the terrestrial traffic center via optical fiber.
3.1. Traffic Generation Model for HAP Drones
For the space–air–ground integrated network, the drone–LEO link needs to transmit
the traffic generated by the HAP drone itself and the traffic collected from various mobile
and IoT users on the ground. This traffic can be divided into traffic types generated with
a determined rate, which mainly include HAP drone health status and UE location, and
traffic types generated abruptly with random probability, such as malfunction diagnosis
and signaling execution. Therefore, the traffic generation state at HAP drones is modeled
as a mixture of stochasticity and determinism. Markov chains can be used to describe the
traffic generation models of various types uniformly, as shown in Figure 2. Specifically, the
generation of each traffic type at HAP drones is modeled as a Markov chain with two states:
on and off. In the on state, traffic is generated at a constant rate, whereas traffic generation
ceases in the off state. Denote the self-transition probabilities of the $q$-th traffic type from on to on as $p_{1,q}$ and from off to off as $p_{2,q}$, where $q \in \{1, 2, \cdots, Q\}$ and $Q$ is the total number of traffic types. For traffic types generated at a constant rate, $p_{1,q} = 1$ and $p_{2,q} = 0$. For traffic types generated abruptly with random probability, $0 < p_{1,q} < 1$ and $0 < p_{2,q} < 1$, which means that the state switches randomly between on and off.
Figure 2. Traffic generation model for each HAP drone: in each time slot $m$, each of the $Q$ traffic types at HAP drone $i$ evolves as a two-state (on/off) Markov chain before transmission to the LEO satellite.
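To make the on/off chain concrete, the following Python sketch simulates the per-slot volume generated by a single traffic type; the generation rate, slot count, initial state, and random seed are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def simulate_traffic(p1, p2, rate, num_slots, seed=0):
    """Simulate one traffic type's per-slot generated volume G[m].

    p1: P(on -> on); p2: P(off -> off); rate: volume generated per slot
    while the chain is in the on state (illustrative units).
    """
    rng = np.random.default_rng(seed)
    on = True  # assumed initial state
    volumes = np.zeros(num_slots)
    for m in range(num_slots):
        volumes[m] = rate if on else 0.0
        # Sample the next state from the self-transition probabilities.
        on = (rng.random() < p1) if on else (rng.random() >= p2)
    return volumes

# Deterministic type (p1 = 1, p2 = 0): always on, constant rate.
print(simulate_traffic(1.0, 0.0, rate=5.0, num_slots=8))
# Bursty type (0 < p1, p2 < 1): random switching between on and off.
print(simulate_traffic(0.9, 0.95, rate=5.0, num_slots=8))
```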
In addition, different traffic types have varying levels of importance in practical
scenarios. For instance, the traffic carrying the remaining power of HAP drones is more
valuable than other traffic types. To account for this, we introduce a value factor $\mu_q$ to represent the value of the $q$-th traffic type. The optimization objective is to maximize the
average transmitted traffic value of the network in each time slot. Unlike the conventional
approach, which treats all traffic types equally, we prioritize the transmission of high-value
traffic when system resources are restricted, which aligns better with actual transmission
requirements.
3.2. Traffic Transmission Model between LEO Satellites and HAP Drones
Suppose that there are $I$ HAP drones at an altitude of $h_1$ and $J$ LEO satellites at an altitude of $h_2$ in the space–air–ground integrated network. The LEO satellite set is denoted as $\mathcal{J} = \{1, 2, \cdots, J\}$, and the HAP drone set is denoted as $\mathcal{I} = \{1, 2, \cdots, I\}$. Each HAP drone is equipped with an omnidirectional antenna, and each LEO satellite is equipped with $L$ steerable beams. The time interval is divided into $M$ time slots with a length of $T_0$, and the time slot set is denoted as $\mathcal{M} = \{1, 2, \cdots, M\}$. When $T_0$ is sufficiently small, the matching between LEO satellites and HAP drones in each time slot can be treated as quasi-static. In each time slot, one LEO satellite beam can provide services for no more than one HAP drone, and one HAP drone can establish a connection with at most one LEO satellite. We define a LEO–drone matching matrix $X_{I \times J}[m]$ to describe the matching relationship between LEO satellites and HAP drones in the $m$-th time slot. If the $i$-th HAP drone is served by the $j$-th LEO satellite in the $m$-th time slot, then $x_{i,j}[m] = 1$; otherwise, $x_{i,j}[m] = 0$.
This work focuses on mobile users and IoT users in depopulated regions with almost
no obstacles. Therefore, small-scale fading due to multi-path effects can be neglected. The
channel gain from the i-th HAP drone to the j-th LEO satellite in the m-th time slot can be
expressed as follows [27]:
$$h_{i,j}[m] = \left( \frac{c}{4 \pi f_c d_{i,j}[m]} \right)^2, \qquad (1)$$
where $c$ and $f_c$ represent the speed of light and the carrier frequency, respectively, and $d_{i,j}[m]$ represents the distance between the $i$-th HAP drone and the $j$-th LEO satellite in the $m$-th time slot. Based on this, the traffic transmission rate between the $i$-th HAP drone and the $j$-th LEO satellite can be expressed as follows:

$$R_{i,j}[m] = W \log_2 \left( 1 + \frac{P_h G_i G_j h_{i,j}[m]}{k_B T_b W} \right), \qquad (2)$$
where $W$ is the bandwidth of LEO beams, $P_h$ is the transmit power of HAP drones, and $G_i$ and $G_j$ represent the antenna gains of the HAP drone transmitter and the LEO satellite receiver, respectively [28]. $k_B$ is Boltzmann's constant, and $T_b$ is the system noise temperature. When the channel gain between a HAP drone and the $j$-th LEO satellite exceeds a given threshold $h_0$, this HAP drone is considered to be within the visible range of the $j$-th LEO satellite. In the $m$-th time slot, the set of HAP drones within the visible range of the $j$-th LEO satellite can be expressed as follows:

$$\mathcal{I}_j[m] = \left\{ i \,\middle|\, h_{i,j}[m] \geq h_0 \right\}. \qquad (3)$$
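As a quick illustration of Equations (1)–(3), the sketch below computes the free-space channel gain, the Shannon rate, and the visible-drone set; the function names, antenna gains, and the dictionary of drone distances are our own illustrative conventions, not values from the paper.

```python
import math

C = 299_792_458.0    # speed of light (m/s)
K_B = 1.380649e-23   # Boltzmann's constant (J/K)

def channel_gain(d, fc):
    """Free-space channel gain of Eq. (1): h = (c / (4*pi*fc*d))^2."""
    return (C / (4.0 * math.pi * fc * d)) ** 2

def tx_rate(d, fc, W, Ph, Gi, Gj, Tb):
    """Shannon rate of Eq. (2) in bit/s."""
    snr = Ph * Gi * Gj * channel_gain(d, fc) / (K_B * Tb * W)
    return W * math.log2(1.0 + snr)

def visible_drones(distances, fc, h0):
    """Visible set of Eq. (3): drones whose channel gain reaches h0."""
    return {i for i, d in distances.items() if channel_gain(d, fc) >= h0}

# Illustrative numbers: a drone-satellite distance of 580 km at 10 GHz.
print(tx_rate(580e3, fc=10e9, W=10e6, Ph=20.0, Gi=1e3, Gj=1e4, Tb=290.0))
```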
As a result of the high-speed movement of LEO satellites, handover is required when a HAP drone moves outside the visible range of its LEO satellite. HAP drones are unable to send traffic to LEO satellites during the handover duration $T_h$, which can be approximately expressed as follows:

$$T_h = \kappa \times \frac{d_{i,j}[m]}{c}, \qquad (4)$$

where $\kappa$ is the number of signaling exchanges that must be transmitted between the HAP drone and the LEO satellite during handover. Therefore, the available traffic transmission time in the $m$-th time slot can be represented as follows:

$$T_{i,j}[m] = \begin{cases} T_0 - \left(1 - x_{i,j}[m-1]\right) \times \kappa \times \dfrac{d_{i,j}[m]}{c}, & T^{\mathrm{remain}}_{i,j} \geq T_0 \\ T^{\mathrm{remain}}_{i,j}, & T^{\mathrm{remain}}_{i,j} < T_0 \end{cases}, \qquad (5)$$

where $T^{\mathrm{remain}}_{i,j}$ represents the remaining visible time between the $j$-th LEO satellite and the $i$-th HAP drone.
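The available-time computation of Equations (4) and (5) can then be sketched as follows; note that the first branch assumes, per our reading of the reconstructed equation, that handover latency is deducted only when the drone was not served by this satellite in the previous slot.

```python
C = 299_792_458.0  # speed of light (m/s)

def available_time(T0, d, kappa, x_prev, T_remain):
    """Available transmission time of Eq. (5).

    T0: slot length (s); d: drone-satellite distance (m); kappa: number
    of handover signaling exchanges; x_prev: x_{i,j}[m-1] (1 if already
    connected to this satellite); T_remain: remaining visible time (s).
    """
    if T_remain < T0:
        return T_remain
    handover = kappa * d / C            # Eq. (4)
    # Assumption: handover time is charged only after a satellite switch.
    return T0 - (1 - x_prev) * handover
```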
In each time slot, a HAP drone can only choose one of the $Q$ traffic types for transmission. We define a traffic–drone matching matrix $Y_{I \times Q}[m]$ to describe the transmission status of different traffic types at each HAP drone in the $m$-th time slot. If the $q$-th traffic type of the $i$-th HAP drone is sent in the $m$-th time slot, then $y_{i,q}[m] = 1$; otherwise, $y_{i,q}[m] = 0$. Thus, in the $m$-th time slot, the maximum traffic volume from the $q$-th traffic type of the $i$-th HAP drone to the $j$-th LEO satellite can be expressed as follows:

$$U_{i,q,j}[m] = x_{i,j}[m] \, y_{i,q}[m] \, R_{i,j}[m] \, T_{i,j}[m]. \qquad (6)$$

The transmitted traffic value of the $q$-th traffic type of the $i$-th HAP drone in the $m$-th time slot can be represented as follows:

$$\widetilde{U}_{i,q}[m] = \min \left( S_{i,q}[m], \; \sum_{j=1}^{J} U_{i,q,j}[m] \right), \qquad (7)$$
where $S_{i,q}[m]$ is the traffic volume of the $q$-th traffic type stored at the $i$-th HAP drone in the $m$-th time slot, which can be obtained as follows:

$$S_{i,q}[m] = S_{i,q}[m-1] - \widetilde{U}_{i,q}[m-1] + G_{i,q}[m-1], \qquad (8)$$

where $G_{i,q}[m-1]$ denotes the traffic volume of the $q$-th traffic type newly generated at the $i$-th HAP drone in the $(m-1)$-th time slot. It is a random variable that follows the traffic generation model defined in Section 3.1.

Therefore, the total transmitted traffic value of the space–air–ground integrated network can be given as follows:

$$U_{\mathrm{total}} = \sum_{m=1}^{M} \sum_{i=1}^{I} \sum_{q=1}^{Q} \mu_q \cdot \widetilde{U}_{i,q}[m]. \qquad (9)$$
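The per-slot bookkeeping of Equations (6)–(9) reduces to a few array operations; a minimal numpy sketch, with array shapes following the notation above:

```python
import numpy as np

def step_traffic_value(X, Y, R, T, S, G, mu):
    """One slot of Eqs. (6)-(9).

    X: (I, J) LEO-drone matching; Y: (I, Q) traffic-drone matching;
    R, T: (I, J) rates and available times; S: (I, Q) stored volumes;
    G: (I, Q) volumes generated in this slot; mu: (Q,) value factors.
    Returns (transmitted value of this slot, updated storage).
    """
    cap = (X * R * T).sum(axis=1)               # per-drone link capacity
    U_tilde = np.minimum(S, Y * cap[:, None])   # Eq. (7); zero where y = 0
    value = (mu * U_tilde).sum()                # one slot term of Eq. (9)
    S_next = S - U_tilde + G                    # Eq. (8)
    return value, S_next
```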
4. Problem Formulation and Transformation
The optimization objective is to establish tripartite matching among LEO satellites, HAP drones, and traffic types by choosing the most suitable LEO–drone matching matrix $X_{I \times J}[m]$ and traffic–drone matching matrix $Y_{I \times Q}[m]$ in each time slot, so as to maximize the average transmitted traffic value of the network. The objective function and constraints can be formulated as follows:
$$\begin{aligned} \max_{X, Y} \quad & \lim_{M \to \infty} \frac{U_{\mathrm{total}}}{M} & (10a) \\ \text{s.t.} \quad & \sum_{j=1}^{J} x_{i,j}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, & (10b) \\ & \sum_{i=1}^{I} x_{i,j}[m] = L, \; \forall j \in \mathcal{J}, \forall m \in \mathcal{M}, & (10c) \\ & \sum_{q=1}^{Q} y_{i,q}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, & (10d) \\ & x_{i,j}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall j \in \mathcal{J}, \forall m \in \mathcal{M}, & (10e) \\ & y_{i,q}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall q \in \mathcal{Q}, \forall m \in \mathcal{M}. & (10f) \end{aligned}$$
Constraint (10b) specifies that each HAP drone can connect to a maximum of one LEO satellite in each time slot. Constraint (10c) specifies that the number of HAP drones served by each LEO satellite is equal to the beam number $L$. Note that even though each LEO satellite could serve fewer than $L$ HAP drones, doing so would lead to inefficient use of satellite beams. Thus, in order to achieve the maximum average transmitted traffic value of the network in each time slot, all beams of each satellite will be utilized. Constraint (10d) specifies that each HAP drone can transmit a maximum of one traffic type in each time slot. Constraints (10e) and (10f) restrict the elements of the LEO–drone matching matrix and the traffic–drone matching matrix, respectively.
The formulated problem (10) is a mixed integer nonlinear programming problem. In the following content, we will analyze and simplify it. Given a specific $X_{I \times J}[m]$ and substituting (9) into (10a), the original problem is as follows:

$$\begin{aligned} \max_{Y} \quad & \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{I} \left( \sum_{q=1}^{Q} \mu_q \cdot \widetilde{U}_{i,q}[m] \right) & (11a) \\ \text{s.t.} \quad & \sum_{q=1}^{Q} y_{i,q}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, & (11b) \\ & y_{i,q}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall q \in \mathcal{Q}, \forall m \in \mathcal{M}. & (11c) \end{aligned}$$
Through analysis, it becomes evident that $\sum_{q=1}^{Q} \mu_q \cdot \widetilde{U}_{i,q}[m]$ is solely dependent on the matching $\left\{ y_{i,q}[m] \mid q \in \mathcal{Q} \right\}$ between the $i$-th HAP drone and all traffic types in the $m$-th time slot, and is independent of the matching $\left\{ y_{i',q}[m'] \mid q \in \mathcal{Q}, i' \neq i, m' \neq m \right\}$ between other HAP drones and traffic types in other time slots. Consequently, maximizing (11a) can be achieved by maximizing each term within the brackets of (11a). Thus, (11a) can be rephrased as follows:
$$\begin{aligned} & \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{I} \left( \max_{\{ y_{i,q}[m] \mid q \in \mathcal{Q} \}} \sum_{q=1}^{Q} \mu_q \cdot \widetilde{U}_{i,q}[m] \right) & (12a) \\ \text{s.t.} \quad & \sum_{q=1}^{Q} y_{i,q}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, & (12b) \\ & y_{i,q}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall q \in \mathcal{Q}, \forall m \in \mathcal{M}. & (12c) \end{aligned}$$
Formulation (12) is equivalent to optimizing $I \times M$ independent sub-subproblems. For $\forall i \in \mathcal{I}, \forall m \in \mathcal{M}$, the sub-subproblem can be formulated as follows:

$$\begin{aligned} \max_{\{ y_{i,q}[m] \mid q \in \mathcal{Q} \}} \quad & \sum_{q=1}^{Q} \mu_q \cdot \widetilde{U}_{i,q}[m] & (13a) \\ \text{s.t.} \quad & \sum_{q=1}^{Q} y_{i,q}[m] \in \{0, 1\}, & (13b) \\ & y_{i,q}[m] \in \{0, 1\}, \; \forall q \in \mathcal{Q}. & (13c) \end{aligned}$$
The feasible region can be expressed as either

$$y_{i,q}[m] = 0, \; \forall q \in \mathcal{Q}, \qquad (14)$$

or

$$y_{i,q}[m] = \begin{cases} 1, & q = q_0 \\ 0, & q \in \mathcal{Q}, q \neq q_0 \end{cases}. \qquad (15)$$
Regarding the former, the optimal value of (13a) is 0, whereas for the latter, the optimal value is greater than or equal to 0. Hence, the optimal solution of (13a) must adhere to (15), so as to maximize the objective function. By substituting (15) into (13a), it is equivalent to addressing the following:

$$\max_{q \in \mathcal{Q}} \; \mu_q \cdot \min \left( S_{i,q}[m], \; \sum_{j=1}^{J} x_{i,j}[m] R_{i,j}[m] T_{i,j}[m] \right). \qquad (16)$$

Its optimal solution can be expressed as follows:

$$q^{\ast}_i[m] = \arg \max_{q \in \mathcal{Q}} \; \mu_q \min \left( S_{i,q}[m], \; \sum_{j=1}^{J} x_{i,j}[m] R_{i,j}[m] T_{i,j}[m] \right). \qquad (17)$$

Based on this, the optimal solution of (11a) can be expressed as follows:

$$y^{\ast}_{i,q}[m] = \begin{cases} 1, & q = q^{\ast}_i[m] \\ 0, & q \neq q^{\ast}_i[m] \end{cases}, \quad \forall i \in \mathcal{I}, \forall m \in \mathcal{M}. \qquad (18)$$
At this point, we have successfully decomposed the optimization sub-problem (11a) into $I \times M$ independent sub-subproblems through mathematical analysis. The optimal traffic–drone matching matrix $Y_{I \times Q}[m]$ can be obtained according to (18). Intuitively, once the LEO–drone matching of each time slot is determined, the maximum average transmitted traffic value of the network can be achieved by choosing the traffic type with the highest value for each HAP drone to transmit.
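In other words, once the link capacity of each HAP drone is fixed by the LEO–drone matching, Equations (17) and (18) amount to a weighted argmax per drone; a minimal sketch:

```python
import numpy as np

def best_traffic_type(S_i, cap_i, mu):
    """Eq. (17): choose the type maximizing mu_q * min(S_iq, capacity).

    S_i: (Q,) stored volumes of drone i; cap_i: scalar link capacity
    sum_j x_ij R_ij T_ij; mu: (Q,) value factors.
    """
    scores = mu * np.minimum(S_i, cap_i)
    q_star = int(np.argmax(scores))   # Eq. (17)
    y_i = np.zeros_like(mu)
    y_i[q_star] = 1.0                 # Eq. (18)
    return q_star, y_i

# Example: value factors as in Table 1, illustrative volumes/capacity.
print(best_traffic_type(np.array([9.0, 4.0, 2.0, 0.5]), 3.0,
                        np.array([1.0, 2.0, 3.0, 4.0])))
```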
Substituting the optimal solution (18) into the objective function (10a) yields the following:

$$\begin{aligned} \max_{X} \quad & \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{I} \max_{q \in \mathcal{Q}} \; \mu_q \min \left( S_{i,q}[m], \; \sum_{j=1}^{J} x_{i,j}[m] R_{i,j}[m] T_{i,j}[m] \right) & (19a) \\ \text{s.t.} \quad & \sum_{j=1}^{J} x_{i,j}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall m \in \mathcal{M}, & (19b) \\ & \sum_{i=1}^{I} x_{i,j}[m] = L, \; \forall j \in \mathcal{J}, \forall m \in \mathcal{M}, & (19c) \\ & x_{i,j}[m] \in \{0, 1\}, \; \forall i \in \mathcal{I}, \forall j \in \mathcal{J}, \forall m \in \mathcal{M}, & (19d) \end{aligned}$$

which is solely associated with the LEO–drone matching matrix $X_{I \times J}[m]$.
5. Problem Solving and Algorithm Design
Typically, conventional optimization solvers are employed to solve problems with deterministic variables [29]. Problems with random variables are difficult to solve using these solvers. Nevertheless, the tripartite matching problem that this paper focuses on is a mixture of stochasticity and determinism. Therefore, we adopt reinforcement learning to dynamically solve the LEO–drone matching sub-problem (19a). Specifically, the matching between each LEO satellite and HAP drones is modeled as a Markov decision process [30], where each LEO satellite is treated as an agent. The state, action, and reward of the $j$-th LEO satellite are defined as follows:

State: $s_j[m] = \left\{ \{T_{i,j}[m] \mid i \in \mathcal{I}\}, \{R_{i,j}[m] \mid i \in \mathcal{I}\}, \{S_{i,q}[m-1] \mid i \in \mathcal{I}, q \in \mathcal{Q}\}, \{G_{i,q}[m-1] \mid i \in \mathcal{I}, q \in \mathcal{Q}\} \right\}$.
In the $m$-th time slot, the $j$-th LEO satellite obtains the state of each HAP drone within its visible range, which includes the available traffic transmission time $T_{i,j}[m]$ and the traffic transmission rate $R_{i,j}[m]$ of the current $m$-th time slot, as well as the stored traffic volume $S_{i,q}[m-1]$ and the traffic generation rate $G_{i,q}[m-1]$ of each traffic type in the previous $(m-1)$-th time slot. For a HAP drone $i_0$ that is not within the visible range of the $j$-th LEO satellite, i.e., $i_0 \notin \mathcal{I}_j[m]$, we set $T_{i_0,j}[m] = 0$, $R_{i_0,j}[m] = 0$, $S_{i_0,q}[m-1] = 0, \forall q \in \mathcal{Q}$, and $G_{i_0,q}[m-1] = 0, \forall q \in \mathcal{Q}$.
Action: $a_j[m] = \left\{ x_{i,j}[m] \,\middle|\, \sum_{i=1}^{I} x_{i,j}[m] = L \right\}$.

In the $m$-th time slot, the action of the $j$-th LEO satellite is to determine which $L$ HAP drones to serve. If multiple LEO satellites decide to provide services to the same HAP drone, this HAP drone will actively choose to connect to the LEO satellite offering the highest transmitted traffic value.

Reward: $r_j\left(s_j[m], a_j[m]\right) = \sum_{i=1}^{I} \sum_{q=1}^{Q} \mu_q U_{i,q,j}[m]$.
In the $m$-th time slot, the reward obtained by the $j$-th LEO satellite after taking action $a_j[m]$ in state $s_j[m]$ is defined as the total transmitted traffic value of the $j$-th LEO satellite in the current time slot.

Then, reinforcement learning is employed to solve (19a) based on the above definitions. The discounted return of the $j$-th LEO satellite in the $m$-th time slot is defined as follows:

$$G_j[m] = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_j\left(s_j[m+\tau], a_j[m+\tau]\right), \qquad (20)$$
where $\gamma \in [0, 1)$ represents the discount rate, which is used to balance the impact of short-term and long-term rewards. If $\gamma$ is close to 0, the discounted return mainly depends on recent rewards. Conversely, if $\gamma$ approaches 1, the discounted return primarily depends on future rewards. Q-values can be used to evaluate the expected return that the $j$-th LEO satellite can achieve by taking action $a_j$ based on policy $\pi_j$ in state $s_j$, which can be expressed as follows:

$$q_{\pi_j}\left(s_j, a_j\right) = \mathbb{E}\left[ G_j[m] \,\middle|\, s_j[m] = s_j, a_j[m] = a_j \right]. \qquad (21)$$
In conventional Q-learning, the Q-values of the optimal policy $\pi^{\ast}_j$ can be continuously updated through iterations. Generate an episode of length $T_{\max}$. For the $t$-th iteration, the Q-value of the state–action pair $\left(s^t_j, a^t_j\right)$ can be obtained as follows [31]:

$$q^{t+1}_j\left(s^t_j, a^t_j\right) = q^t_j\left(s^t_j, a^t_j\right) - \vartheta^t_j \left[ q^t_j\left(s^t_j, a^t_j\right) - \left( \bar{r}^t_j\left(s^t_j, a^t_j\right) + \gamma \max_{a \in \mathcal{A}_j} q^t_j\left(s^{t+1}_j, a\right) \right) \right], \qquad (22)$$
where $t \in \{1, 2, \cdots, T_{\max}\}$, $s^t_j$ represents the state at the $t$-th step of the episode, and $a^t_j$ denotes the action taken in state $s^t_j$. $\vartheta^t_j$ represents the learning rate, and $\mathcal{A}_j$ denotes the action space of the $j$-th LEO satellite. $\bar{r}^t_j\left(s^t_j, a^t_j\right)$ denotes the average one-step immediate reward acquired after taking action $a^t_j$ in state $s^t_j$, which can be represented as follows:

$$\bar{r}^t_j\left(s^t_j, a^t_j\right) = \mathbb{E}\left[ r_j(s, a) \,\middle|\, s = s^t_j, a = a^t_j \right]. \qquad (23)$$
Supposing that the proposed approach converges after $C$ iterations, the optimal policy can be expressed as follows [32]:

$$\pi^{\ast}_j\left(a \mid s_j\right) = \begin{cases} 1, & a = \arg\max_{a' \in \mathcal{A}_j} q^{C}_j\left(s_j, a'\right) \\ 0, & \text{otherwise} \end{cases}. \qquad (24)$$
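For reference, the tabular update of Equation (22) takes only a few lines; the dictionary-based Q-table and the precomputed average reward are illustrative simplifications.

```python
def q_update(q_table, s, a, r_bar, s_next, actions, theta, gamma):
    """Tabular Q-learning update of Eq. (22).

    q_table: dict mapping (state, action) -> Q-value; r_bar: average
    one-step reward of Eq. (23); theta: learning rate; gamma: discount.
    """
    best_next = max(q_table.get((s_next, a2), 0.0) for a2 in actions)
    td_error = q_table.get((s, a), 0.0) - (r_bar + gamma * best_next)
    q_table[(s, a)] = q_table.get((s, a), 0.0) - theta * td_error
```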
The aforementioned conventional Q-learning algorithm stores the calculated Q-values $q^t_j\left(s^t_j, a^t_j\right)$ in the form of tables, known as a Q-table, which has the advantage of being intuitive and easy to analyze. However, due to the continuous state space of (19a), using the conventional tabular Q-learning algorithm requires storing a large volume of data, thereby increasing storage costs. Furthermore, the generalization ability of the conventional Q-learning algorithm is poor. To address these issues, a deep Q-learning algorithm is employed in this paper, which is one of the earliest and most successful algorithms introducing deep neural networks into reinforcement learning [32]. In deep Q-learning, the high-dimensional Q-table can be approximated by a deep Q network with low-dimensional parameters, thereby significantly reducing the storage cost. In addition, the Q-values of unvisited state-action pairs can be calculated through value function approximation, giving the algorithm strong generalization ability.
In addition, the aforementioned algorithm is fully decentralized, in which each satel-
lite calculates its Q-values according to its own local states, local actions, and local rewards.
However, LEO satellites are not completely independent, but influence each other. For
example, if the $i$-th HAP drone is connected to the $j$-th LEO satellite at the current moment, other LEO satellites cannot provide service for this HAP drone. Therefore, the
aforementioned fully decentralized reinforcement learning algorithm cannot obtain high
performance and may not even converge in some cases. An alternative solution is to use a
fully centralized reinforcement learning algorithm. In each time slot, each LEO satellite
sends its experience obtained from its interaction with the environment to the terrestrial
traffic center. Then, both value network training and strategy making are performed at
the center based on global experiences. Nevertheless, the experience of each satellite must
pass through multiple ISLs, an LEO-ground link, and an optical fiber link to be transmitted
back to the terrestrial traffic center, facing high propagation latency. The terrestrial traffic
center is unable to obtain the latest status of the space–air–ground integrated network, so it cannot make timely LEO–drone matching strategies. To address these issues, we employ multi-agent deep reinforcement learning with centralized training and decentralized execution. The value network of each LEO satellite is trained in a centralized manner at the terrestrial traffic center. Then, the trained value networks are distributed to the corresponding LEO satellites [33]. Each satellite distributively trains its policy network based on the received value network and the latest local observations, so it can devise LEO–drone matching strategies in a timely manner.
Specifically, when training the value network, each LEO satellite sends its local experience $\left(s_j, a_j, r_j, s'_j\right)$, obtained from its interaction with the environment, to the terrestrial traffic center, where $s'_j$ is the state reached after taking action $a_j$ in state $s_j$. Based on the collected local experiences of the various LEO satellites, the terrestrial traffic center forms the global experience, including the global state $s = \left[s_1, s_2, \cdots, s_J\right]$, the global action $a = \left[a_1, a_2, \cdots, a_J\right]$, and the global reached state $s' = \left[s'_1, s'_2, \cdots, s'_J\right]$, and stores $\left(s, a, r_j, s'\right)$ in the replay buffer $\mathcal{D}_j$. Afterwards, the terrestrial traffic center trains the value network of the $j$-th LEO satellite based on $\left(s, a, r_j, s'\right)$ to evaluate the quality of the matching approach. As previously mentioned, the deep Q-learning algorithm is adopted, where the true Q-values of the optimal strategy are approximated by the Q-values calculated by the trained value network, which can be obtained through the quasi-static target network scheme [34]. Specifically, two networks need to be defined: the target network $\hat{q}_j\left(S, A, \omega_{j,\mathrm{target}}\right)$ and the main network $\hat{q}_j\left(S, A, \omega_{j,\mathrm{main}}\right)$, described by parameters $\omega_{j,\mathrm{target}}$ and $\omega_{j,\mathrm{main}}$, respectively, where $S$ and $A$ are the global states and global actions collected by the terrestrial traffic center in the form of random variables. The objective of parameter iteration is to minimize the mean square error between the Q-values calculated by the target network and the main network. This can be achieved by minimizing the loss function, which can be expressed as follows:

$$J_j\left(\omega_{j,\mathrm{main}}\right) = \mathbb{E}\left[ \left( \hat{q}_j\left(S, A, \omega_{j,\mathrm{main}}\right) - \left( R_j + \gamma \max_{a} \hat{q}_j\left(S', a, \omega_{j,\mathrm{target}}\right) \right) \right)^2 \right], \qquad (25)$$
where $S'$ and $R_j$ represent the reached state and the acquired reward after taking action $A$ in state $S$, respectively. The gradient-descent algorithm is then adopted to minimize the objective function. The gradient of (25) can be calculated as follows:

$$\nabla_{\omega_{j,\mathrm{main}}} J_j\left(\omega_{j,\mathrm{main}}\right) = -\mathbb{E}\left[ \left( R_j + \gamma \max_{a} \hat{q}_j\left(S', a, \omega_{j,\mathrm{target}}\right) - \hat{q}_j\left(S, A, \omega_{j,\mathrm{main}}\right) \right) \nabla_{\omega_{j,\mathrm{main}}} \hat{q}_j\left(S, A, \omega_{j,\mathrm{main}}\right) \right], \qquad (26)$$

where $\nabla_{\omega_{j,\mathrm{main}}} \hat{q}_j\left(S, A, \omega_{j,\mathrm{main}}\right)$ can be obtained through the gradient back-propagation algorithm [35]. In each iteration, an experience batch $\mathcal{D}^{\mathrm{batch}}_j$ is randomly sampled from the replay buffer $\mathcal{D}_j$ to train the value network. For each sample $\left(s, a, r_j, s'\right)$ in $\mathcal{D}^{\mathrm{batch}}_j$, the parameter $\omega_{j,\mathrm{main}}$ of the main network is updated as follows:

$$\omega_{j,\mathrm{main}} \leftarrow \omega_{j,\mathrm{main}} - \beta \nabla_{\omega_{j,\mathrm{main}}} J_j\left(\omega_{j,\mathrm{main}}\right), \qquad (27)$$

where $\beta$ is the learning rate. After $\Delta$ iterations, the parameter $\omega_{j,\mathrm{target}}$ of the target network is updated to $\omega_{j,\mathrm{main}}$:

$$\omega_{j,\mathrm{target}} \leftarrow \omega_{j,\mathrm{main}}. \qquad (28)$$
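A compact PyTorch sketch of the value-network update in Equations (25)–(28), written for one agent and treating the action space as a small discrete set; the network sizes, the optimizer choice, and the batch interface are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 64, 16   # illustrative dimensions
main_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                         nn.Linear(128, action_dim))
target_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                           nn.Linear(128, action_dim))
target_net.load_state_dict(main_net.state_dict())
optimizer = torch.optim.SGD(main_net.parameters(), lr=0.1)  # beta = 0.1

def train_step(s, a, r, s_next, gamma=0.98):
    """One gradient step on the loss of Eq. (25), i.e., update (27).

    s, s_next: (B, state_dim) float tensors; a: (B,) long tensor of
    action indices; r: (B,) float tensor of rewards.
    """
    with torch.no_grad():
        # TD target R_j + gamma * max_a q_target(S', a) from Eq. (25).
        target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()   # back-propagation computes the gradient of Eq. (26)
    optimizer.step()  # gradient step of Eq. (27)

def sync_target():
    """Hard target update of Eq. (28), applied every Delta steps."""
    target_net.load_state_dict(main_net.state_dict())
```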
Algorithm 1 presents the matching algorithm based on multi-agent deep reinforcement learning, in which an $\epsilon$-greedy strategy is used to balance exploitation and exploration. The value network of each LEO satellite is centrally trained at the terrestrial traffic center based on the global states, global actions, and the local reward of each LEO satellite. Then, the trained value network is sent to the corresponding LEO satellite. At the $j$-th LEO satellite, its policy network can be trained in a decentralized manner based on its received value network with parameter $\omega_{j,\mathrm{target}}$ and its local observations. Afterwards, each LEO satellite develops its own optimal strategy based on its trained policy network to maximize the long-term return. Finally, each LEO satellite broadcasts the matching strategy to all HAP drones within its visible range.
Algorithm 1 Matching approach based on multi-agent deep reinforcement learning.

Input: Episode length $T_{\max}$, learning rate $\beta$, greedy factor $\epsilon$, discount factor $\gamma$, target-update interval $\Delta$; randomly initialize parameters $\omega_{j,\mathrm{main}}$ and states $s^1_j$; let $\omega_{j,\mathrm{target}} = \omega_{j,\mathrm{main}}$, $\delta = 0$, $\mathcal{D}_j = \Phi$, and $\mathcal{D}^{\mathrm{batch}}_j = \Phi$;
Output: Optimal strategy for each LEO satellite.
1: for $t = 1$ to $T_{\max}$ do
2:   for $j = 1$ to $J$ do
3:     The $j$-th LEO satellite takes action $a^t_j$ according to the $\epsilon$-greedy strategy, where the optimal action is $\arg\max_{a \in \mathcal{A}_j} \hat{q}_j(s^t_j, a, \omega_{j,\mathrm{target}})$;
4:     Interact with the environment to get the reward $r^t_j$ and the reached state $s^{t+1}_j$;
5:   end for
6:   Form the global state $s^t = [s^t_1, \cdots, s^t_J]$, the global action $a^t = [a^t_1, \cdots, a^t_J]$, and the global reached state $s'^t = [s^{t+1}_1, \cdots, s^{t+1}_J]$;
7:   for $j = 1$ to $J$ do
8:     Store $(s^t, a^t, r^t_j, s'^t)$ in the replay buffer $\mathcal{D}_j$;
9:     Randomly sample an experience batch $\mathcal{D}^{\mathrm{batch}}_j$ from $\mathcal{D}_j$;
10:    Update $\omega_{j,\mathrm{main}}$ based on $\mathcal{D}^{\mathrm{batch}}_j$ according to (27);
11:  end for
12:  $\delta = \delta + 1$;
13:  if $\delta == \Delta$ then
14:    $\omega_{j,\mathrm{target}} = \omega_{j,\mathrm{main}}$ for $j = 1, \cdots, J$;
15:    $\delta = 0$;
16:  end if
17:  for $j = 1$ to $J$ do
18:    Send the trained value network $\hat{q}_j(s_j, a_j, \omega_{j,\mathrm{target}})$ to the $j$-th LEO satellite;
19:    The $j$-th LEO satellite trains its own policy network based on $s_j$ and $\hat{q}_j(s_j, a_j, \omega_{j,\mathrm{target}})$;
20:    Develop the optimal strategy of the $j$-th LEO satellite based on its trained policy network;
21:  end for
22: end for
6. Simulation Results
In order to verify the effectiveness of the proposed matching algorithm, preliminary
simulations are conducted. The main simulation parameters are listed in Table 1. We
compare the proposed approach with some state-of-the-art algorithms, including deep
deterministic policy gradient (DDPG), deep Q-network (DQN), and two greedy methods.
For the first greedy method (abbreviated as Greedy 1), each LEO satellite will choose the $L$ HAP drones with the highest channel gains within its visible range to establish connections.
For the second greedy method (abbreviated as Greedy 2), each LEO satellite will choose the $L$ HAP drones with the longest remaining visible time within its visible range to establish connections.
For both Greedy 1 and Greedy 2, each HAP drone that has established a connection with an LEO satellite will choose the traffic type with the largest transmitted traffic value for transmission. A sketch of the top-$L$ drone selection common to both greedy methods follows this list.
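Both greedy baselines reduce to a per-satellite top-$L$ selection over a scalar metric; a minimal sketch, assuming non-visible drones are pre-masked with $-\infty$:

```python
import numpy as np

def greedy_match(metric, L):
    """Select the L HAP drones with the largest metric for one satellite.

    metric: (I,) per-drone score -- channel gain h_ij[m] for Greedy 1,
    remaining visible time for Greedy 2; non-visible drones are assumed
    to carry -inf so they are never selected.
    """
    chosen = np.argsort(metric)[-L:]   # indices of the top-L drones
    x_j = np.zeros(len(metric))
    x_j[chosen] = 1.0
    return x_j

# Example: 6 drones, 2 beams; drones 0 and 3 are not visible.
gains = np.array([-np.inf, 0.8, 0.3, -np.inf, 0.9, 0.1])
print(greedy_match(gains, L=2))  # selects drones 1 and 4
```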
Table 1. System parameters.

| Description | Notation | Value |
| --- | --- | --- |
| Self-transition probability from on to on | $p_1$ | 0.9 |
| Self-transition probability from off to off | $p_2$ | 0.95 |
| Total number of traffic types | $Q$ | 4 |
| Value factors | $\mu_q$, $q = 1, 2, 3, 4$ | 1, 2, 3, 4 |
| Number of LEO satellites | $J$ | 5 |
| Number of HAP drones | $I$ | 20 |
| Number of beams for each LEO satellite | $L$ | 4 |
| Beam bandwidth | $W$ | 10 MHz |
| Number of time slots | $M$ | 1000 |
| Length of time slot | $T_0$ | 0.1 s |
| HAP drone altitude | $h_1$ | 20 km |
| LEO altitude | $h_2$ | 600 km |
| Carrier frequency | $f_c$ | 10 GHz |
| Number of handover signaling exchanges | $\kappa$ | 8 |
| Discount rate | $\gamma$ | 0.98 |
| Learning rate | $\beta$ | 0.1 |
Figure 3 illustrates the transmitted traffic values of the proposed matching approach
in one time slot under episode lengths of 500, 1000, 1500, 2000, 2500, and 3000, respectively.
When the length of the episode does not exceed 2000, the transmitted traffic values in one
time slot increase significantly with the increase of the episode length. However, when
the length of the episode exceeds 2000, the transmitted traffic values in one time slot are
basically the same for various episode lengths. Thus, the length of the episode is set to
2000 in subsequent simulations, thereby saving computational resources while ensuring
performance. Furthermore, it can be observed that for any episode length, the transmitted
traffic value will first increase and then remain essentially stable, which validates the
convergence of the proposed matching algorithm.
Figure 3. Transmitted traffic values in one time slot with different episode lengths ($T_{\max}$ = 500, 1000, 1500, 2000, 2500, 3000; x-axis: time slot, y-axis: traffic value in one time slot, $\times 10^6$).
Figure 4 illustrates the variation of the relative mean square error of the Q-values obtained by the target network and the main network under learning rates of 0.15, 0.1, 0.08, and 0.05, respectively. As the learning rate $\beta$ increases from 0.05 to 0.1, the rate of decrease in the relative mean square error accelerates. Nevertheless, as the learning rate continues to increase from 0.1 to 0.15, the rate of decrease in the relative mean square error remains almost unchanged, but its fluctuations increase. Therefore, in order to balance convergence speed and stability, we set the learning rate $\beta$ to 0.1 in subsequent simulations.
Figure 4. Relative mean square error of the Q-values obtained by the target network and the main network (learning rates $\beta$ = 0.15, 0.1, 0.08, 0.05; x-axis: episode, y-axis: relative mean squared error).
Figure 5 illustrates the total transmitted traffic values of different algorithms under varying HAP drone transmission powers. It can be seen that as the transmission power increases, the total transmitted traffic values of all algorithms increase. This is because, according to (2), increasing the transmission power of HAP drones improves the traffic transmission rates, thereby increasing the total transmitted traffic value of the space–air–ground integrated network. From Figure 5, we can see that the proposed multi-agent deep RL algorithm performs best. Since multi-agent deep RL utilizes centralized training and decentralized execution to reduce the interference of non-stationary environments among agents, the proposed algorithm can increase the transmitted traffic value compared with DDPG and DQN. Furthermore, all three RL-based algorithms perform better than the greedy methods due to the following two reasons.
Figure 5. Total transmitted traffic value under different HAP drone transmission powers (10–40 W; y-axis: total transmitted traffic value, $\times 10^9$; curves: Proposed, DDPG, DQN, Greedy 1, Greedy 2).
Greedy 1 aims to improve the transmission rate between LEO satellites and HAP drones by choosing HAP drones with higher channel gains, thereby increasing the total transmitted traffic value. Similarly, Greedy 2 focuses on reducing the handover latency by choosing HAP drones with long remaining visible time, thereby improving the available traffic transmission time of HAP drones, so as to increase the total transmitted traffic value. In contrast, the RL-based algorithms take a more comprehensive perspective by jointly considering multi-dimensional characteristics such as remaining visible time, channel condition, handover latency, and traffic storage capacity. Thus, the RL-based algorithms can improve the total transmitted traffic value of the network from a global perspective, surpassing the performance of Greedy 1 and Greedy 2.

Both Greedy 1 and Greedy 2 rely on static matching algorithms, which fail to account for the randomness of traffic generation at HAP drones. In contrast, the RL-based algorithms can learn the randomness of the traffic generation at HAP drones and make the matching strategy based on this learning.
Figure 6 illustrates the total transmitted traffic values of different algorithms with respect to the LEO satellite beam number $L$. As the number of LEO satellite beams increases, the total transmitted traffic values of all algorithms also increase. This is because increasing the number of LEO satellite beams relaxes constraint (10c), thereby allowing more HAP drones to transmit traffic to LEO satellites simultaneously, so as to increase the total transmitted traffic value of the space–air–ground integrated network. From Figure 6, we can see that the proposed multi-agent deep reinforcement learning algorithm performs best since it can learn from the experience of the other LEO satellites. Furthermore, all three RL-based algorithms perform better than the greedy methods for the same reasons discussed for Figure 5.
Figure 6. Total transmitted traffic value under different beam numbers (4–28; y-axis: total transmitted traffic value, $\times 10^9$; curves: Proposed, DDPG, DQN, Greedy 1, Greedy 2).
7. Future Work
Although the proposed approach can effectively address the tripartite matching prob-
lem among LEO satellites, HAP drones, and traffic types, there are some limitations.
7.1. Matching among Various Network Nodes
In this paper, only the matching problem between HAP drones and LEO satellites is
considered. However, in the space–air–ground integrated network, in addition to HAP
drones and LEO satellites, there are also a variety of network nodes, such as ground users,
gateway stations, and geostationary earth orbit satellites. In the future, it is necessary to
investigate the matching relationships among different nodes to improve the topology of
the space–air–ground integrated network. For example, the matching problem between
ground users and HAP drones should be addressed by comprehensively considering
multiple factors such as the location, movement speed, and service requirements of ground
users and the payloads of HAP drones.
7.2. Computing Task Assignment and Resource Allocation
Our research only considers how to perform user access and traffic backhaul in remote
areas where ground base stations are difficult to deploy. However, in addition to serving remote areas, HAP drones can also provide low-latency edge computing services for IoT
devices in urban areas with ground base station coverage. In the future, the great pressure
that computing-intensive applications place on resource-constrained IoT devices with limited computing capability and energy storage can be alleviated by
offloading latency-sensitive computing tasks to nearby edge nodes. A matching strategy
for ground users, HAP drones, and ground base stations should be developed by jointly
optimizing computing task assignment and resource allocation, thus improving the perfor-
mance of the space–air–ground integrated network, such as minimizing the maximum task
execution latency among IoT devices or maximizing the amount of transmitted traffic per
unit time.
7.3. HAP Drone Localization
The positions of HAP drones are assumed to be stationary and known in our pa-
per. However, the positions of HAP drones will constantly change due to jitter. Only
by knowing the exact location of HAP drones can we accurately calculate the distance
between HAP drone and LEO satellite, the remaining visible time, and the channel capacity.
Therefore, the exact location of HAP drone is essential for making the user access and traffic
backhaul strategy of the space–air–ground integrated network. In the future, the HAP
drone localization problem needs to be solved. Other positioning systems can be added
to estimate the exact location of HAP drone. For example, reinforcement learning-based
algorithms can be used to regularly predict the exact location of HAP drone by inputting
atmospheric data such as wind speed.
8. Conclusions
In this paper, the matching problem between HAP drones and LEO satellites in the
space–air–ground integrated network has been investigated. First, we introduced the
network architecture and working mechanism, including the traffic generation model and
the traffic transmission model. Then, a tripartite matching problem that takes comprehen-
sive consideration of multi-dimensional characteristics has been formulated to maximize
the average transmitted traffic value of the network. Through mathematical simplification, the optimization problem is then simplified into two independent sub-problems: traffic–drone matching and LEO–drone matching. The former can be decoupled into multiple independent and easily solvable sub-subproblems. Considering the mixed stochastic and deterministic traffic generation model, the long propagation latency between LEO satellites and the terrestrial traffic center, and the continuous state space, we proposed a multi-agent deep reinforcement learning approach with centralized training and decentralized execution to
solve the LEO–drone matching problem. In this approach, the value network is trained
in a centralized manner at the terrestrial traffic center and the matching strategy is timely
formulated in a decentralized manner at LEO satellites. Finally, the proposed approach has
been compared with multiple state-of-the-art algorithms through simulations, and results
have proven the effectiveness and efficiency of the proposed algorithm.
Author Contributions: Conceptualization, X.H.; methodology, Z.W. and X.X.; validation, X.H.,
Z.W. and X.X.; formal analysis, X.H. and M.P.; investigation, X.H. and Z.W.; writing—original draft
preparation, X.H.; writing—review and editing, X.H., Z.W., X.X. and M.P.; supervision, X.H. and
Z.W.; project administration, X.X.; funding acquisition, X.X. All authors have read and agreed to the
published version of the manuscript.
Funding: This work was supported by the 2020 National Key R&D Program “Broadband Communica-
tion and New Network” special “6G Network Architecture and Key Technologies” 2020YFB1806700.
Data Availability Statement: Data underlying the results presented in this paper are not publicly
available at this time but may be obtained from the authors upon reasonable request.
Conflicts of Interest: Author Zhibo Wang is employed by Hisilicon Technologies Co., Ltd.; all authors declare that there are no conflicts of interest.
References
1. Jia, Z.; Sheng, M.; Li, J.; Han, Z. Toward data collection and transmission in 6G space-air-ground integrated networks: Cooperative HAP and LEO satellite schemes. IEEE Internet Things J. 2022, 9, 10516–10528.
2. Li, Z.; Wang, Y.; Liu, M.; Sun, R.; Chen, Y.; Yuan, J.; Li, J. Energy efficient resource allocation for UAV-assisted space-air-ground internet of remote things networks. IEEE Access 2019, 7, 145348–145362.
3. Liu, J.; Shi, Y.; Fadlullah, Z.M.; Kato, N. Space-air-ground integrated network: A survey. IEEE Commun. Surv. Tutor. 2018, 20, 2714–2741.
4. Heng, M.; Wang, S.Y.; Li, J.; Liu, R.; Zhou, D.; He, L. Toward a flexible and reconfigurable broadband satellite network: Resource management architecture and strategies. IEEE Wirel. Commun. 2017, 24, 127–133.
5. Qiu, J.; Grace, D.; Ding, G.; Zakaria, M.D.; Wu, Q. Air-ground heterogeneous networks for 5G and beyond via integrating high and low altitude platforms. IEEE Wirel. Commun. 2019, 26, 140–148.
6. Zhou, D.; Sheng, M.; Luo, J.; Liu, R.; Li, J.; Han, Z. Collaborative data scheduling with joint forward and backward induction in small satellite networks. IEEE Trans. Commun. 2019, 67, 3443–3456.
7. Karapantazis, S.; Pavlidou, F. Broadband communications via high-altitude platforms: A survey. IEEE Commun. Surv. Tutor. 2005, 7, 2–31.
8. Nafees, M.; Huang, S.; Thompson, J.; Safari, M. Backhaul-aware user association and throughput maximization in UAV-aided hybrid FSO/RF network. Drones 2023, 7, 74.
9. Ding, C.; Wang, J.B.; Zhang, H.; Lin, M.; Li, G.Y. Joint optimization of transmission and computation resources for satellite and high altitude platform assisted edge computing. IEEE Trans. Wirel. Commun. 2022, 21, 1362–1377.
10. Gonzalo, J.; López, D.; Domínguez, D.; García, A.; Escapa, A. On the capabilities and limitations of high altitude pseudo-satellites. Prog. Aerosp. Sci. 2018, 98, 37–56.
11. Wang, W.; Li, H.; Liu, Y.; Cheng, W.; Liang, R. Files cooperative caching strategy based on physical layer security for air-to-ground integrated IoV. Drones 2023, 7, 163.
12. Huang, X.; Chen, P.; Xia, X. Heterogeneous optical network and power allocation scheme for inter-cubesat communication. Opt. Lett. 2024, 49, 1213–1216.
13. Pham, Q.-V.; Mirjalili, S.; Kumar, N.; Alazab, M.; Hwang, W.-J. Whale optimization algorithm with applications to resource allocation in wireless networks. IEEE Trans. Veh. Technol. 2020, 69, 4285–4297.
14. Abbasi, O.; Yadav, A.; Yanikomeroglu, H.; Dao, N.-D.; Senarath, G.; Zhu, P. HAPS for 6G networks: Potential use cases, open challenges, and possible solutions. IEEE Wirel. Commun. 2024, 1–8. https://doi.org/10.1109/MWC.012.2200365.
15. Lou, Z.; Youcef Belmekki, B.E.; Alouini, M.-S. HAPS in the non-terrestrial network nexus: Prospective architectures and performance insights. IEEE Wirel. Commun. 2024, 30, 52–58.
16. Liu, S.; Dahrouj, H.; Alouini, M.-S. Joint user association and beamforming in integrated satellite-HAPS-ground networks. IEEE Trans. Veh. Technol. 2024, 73, 5162–5178.
17. Khoshkbari, H.; Sharifi, S.; Kaddoum, G. User association in a VHetNet with delayed CSI: A deep reinforcement learning approach. IEEE Commun. Lett. 2023, 27, 2257–2261.
18. Jia, Z.; Sheng, M.; Li, J.; Zhou, D.; Han, Z. Joint HAP access and LEO satellite backhaul in 6G: Matching game-based approaches. IEEE J. Sel. Areas Commun. 2021, 39, 1147–1159.
19. Pervez, F.; Zhao, L.; Yang, C. Joint user association, power optimization and trajectory control in an integrated satellite-aerial-terrestrial network. IEEE Trans. Wirel. Commun. 2022, 21, 3279–3290.
20. Ma, T.; Zhou, H.; Qian, B.; Cheng, N.; Shen, X.; Chen, X.; Bai, B. UAV-LEO integrated backbone: A ubiquitous data collection approach for B5G internet of remote things networks. IEEE J. Sel. Areas Commun. 2021, 39, 3491–3505.
21. Mao, S.; He, S.; Wu, J. Joint UAV position optimization and resource scheduling in space-air-ground integrated networks with mixed cloud-edge computing. IEEE Syst. J. 2021, 15, 3992–4002.
22. Ei, N.N.; Aung, P.S.; Park, S.-B.; Huh, E.-N.; Hong, C.S. Joint association and power allocation for data collection in HAP-LEO-assisted IoT networks. In Proceedings of the International Conference on Information Networking (ICOIN), Bangkok, Thailand, 11–14 January 2023; pp. 206–211.
23. Doan, K.; Avgeris, M.; Leivadeas, A.; Lambadaris, I.; Shin, W. Service function chaining in LEO satellite networks via multi-agent reinforcement learning. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Kuala Lumpur, Malaysia, 4–8 December 2023; pp. 7145–7150.
24. Li, H.; Yu, J.; Cao, L.; Zhang, Q.; Hou, S.; Song, Z. Multi-agent reinforcement learning based computation offloading and resource allocation for LEO satellite edge computing networks. Comput. Commun. 2024, 222, 268–276.
25. Seid, A.M.; Erbad, A. Multi-agent RL for SDN-based resource allocation in HAPS-assisted IoV networks. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 1664–1669.
26. Mei, C.; Gao, C.; Wang, H.; Xing, Y.; Ju, N.; Hu, B. Joint task offloading and resource allocation for space-air-ground collaborative network. Drones 2023, 7, 482.
27. Dong, F.; Li, H.; Gong, X.; Liu, Q.; Wang, J. Energy-efficient transmissions for remote wireless sensor networks: An integrated HAP/satellite architecture for emergency scenarios. Sensors 2015, 15, 22266–22290.
28. Leyva-Mayorga, I.; Gala, V.; Chiariotti, F.; Popovski, P. Continent-wide efficient and fair downlink resource allocation in LEO satellite constellations. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 6689–6694.
29. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 1998, 9, 1054–1054.
30. Badini, N.; Jaber, M.; Marchese, M.; Patrone, F. Reinforcement learning-based load balancing satellite handover using NS-3. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 2595–2600.
31. Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.-C.; Kim, D.I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174.
32. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
33. Wang, G.; Yang, F.; Song, J.; Han, Z. Multi-agent deep reinforcement learning for dynamic laser inter-satellite link scheduling. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Kuala Lumpur, Malaysia, 4–8 December 2023; pp. 5751–5756.
34. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
35. Amari, S.I. Backpropagation and stochastic gradient descent method. Neurocomputing 1993, 5, 185–196.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.