A Triple Learner Based Energy Efficient Scheduling
for Multi-UAV Assisted Mobile Edge Computing
Jiayuan Chen∗, Changyan Yi∗, Jialiuyuan Li∗, Kun Zhu∗and Jun Cai†
∗College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
†Department of Electrical and Computer Engineering, Concordia University, Montréal, QC, H3G 1M8, Canada
Email: {jiayuan.chen, changyan.yi, jialiuyuan.li, zhukun}@nuaa.edu.cn, jun.cai@concordia.ca
Abstract—In this paper, an energy efficient scheduling problem for multiple unmanned aerial vehicle (UAV) assisted mobile edge computing is studied. In the considered model, UAVs act as mobile edge servers to provide computing services to end-users with task offloading requests. Unlike existing works, we allow UAVs to determine not only their trajectories but also whether to return to the depot for replenishing energy and updating application placements (due to limited batteries and storage capacities). Aiming to maximize the long-term energy efficiency of all UAVs, i.e., the total amount of offloaded tasks computed by all UAVs over their total energy consumption, a joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated. Taking into account the underlying cooperation and competition among intelligent UAVs, we reformulate such a problem as three coupled multi-agent stochastic games, and then propose a novel triple learner based reinforcement learning approach, integrating a trajectory learner, an energy learner and an application learner, for reaching equilibriums. Simulations evaluate the performance of the proposed solution and demonstrate its superiority over counterparts.
I. INTRODUCTION
Recently, multi-unmanned aerial vehicle (UAV) assisted mobile edge computing (MEC) has attracted significant attention due to its high flexibility in providing MEC services for end-users (e.g., IoT devices). In particular, UAVs with computing resources can dynamically adjust their positions to get close to end-users or fly to areas that cannot be covered by fixed MEC infrastructures [1], [2]. Thus, compared to traditional MEC systems, multi-UAV assisted MEC can provide a better quality of experience for end-users [3].
Although multi-UAV assisted MEC is envisioned as a light-weight but highly efficient paradigm for alleviating computation burdens on end-users, it also suffers from several inherent restrictions. For instance, computing tasks offloaded from different end-users must be processed by specific service applications, while the limited storage capacities of UAVs impede their ability to store all applications. Additionally, the limited energy capacities of UAVs also hinder the implementation of this paradigm in providing long-term MEC services. Recent research efforts in this area include trajectory optimization [4], [5], service caching [6], UAV deployment [7], [8], etc. Nevertheless, there are still some critical issues, especially how UAVs' installed applications should be updated (with severely restricted wireless backhauls) and how UAVs' energy replenishment should be jointly scheduled, which are of great importance but have not yet been well investigated.
In this paper, we study a joint optimization of trajectory
planning, energy renewal, and application placement for multi-
UAV assisted MEC to maximize the long-term energy effi-
ciency of all UAVs, i.e., the total amount of offloaded tasks
computed by all UAVs over their total energy consumption,
when providing MEC services. Specifically, in the considered system, each UAV working over a target region has to decide its next action after finishing the current one, i.e., either a flight direction for serving IoT devices in other areas or a return to the depot for replenishing its energy and simultaneously updating its application placement (through wired connections), with the aim of maximizing the long-term energy efficiency of all UAVs. Since UAVs are intelligent, we allow each of them to make its own decisions while regulating the underlying cooperation and competition among them. Additionally, we
take into account the uncertainty that the future environment
information (e.g., positions and task requirements of IoT
devices) is unavailable to UAVs. To this end, we reformulate
the joint optimization problem as three coupled multi-agent
stochastic games, namely, trajectory planning stochastic game
(TPSG), energy renewal stochastic game (ERSG) and appli-
cation placement stochastic game (APSG), and then propose
a novel triple learner based reinforcement learning (TLRL)
approach to obtain corresponding equilibriums of these games.
The main contributions of this paper are as follows.
• A joint optimization of trajectory planning, energy renewal and application placement for multi-UAV assisted MEC is formulated, where the objective is to maximize the long-term energy efficiency of all UAVs.
• Observing the underlying cooperation and competition among UAVs, the optimization problem is reformulated as three coupled multi-agent stochastic games, i.e., TPSG, ERSG and APSG, and then a novel approach, called TLRL, is proposed to derive corresponding equilibriums.
• Extensive simulations are conducted to show the superiority of the proposed TLRL approach over counterparts.
The rest of this paper is organized as follows: Section II
introduces the system model and problem formulation. In
Section III, a problem reformulation based on multi-agent
stochastic game is proposed and analyzed, along with the
developed TLRL approach. Simulation results are provided
in Section IV, followed by the conclusion in Section V.
Fig. 1: An illustration of the considered multi-UAV assisted MEC.
II. SYSTEM MODEL AND PROBLEM FORMULATION
A. Network Model
Consider a multi-UAV assisted MEC system deployed in a target region, as illustrated in Fig. 1, consisting of a group of heterogeneous UAVs (acting as mobile edge servers) $\mathcal{M}$ with cardinality $|\mathcal{M}| = M$ and a set of randomly scattered IoT devices $\mathcal{N}$ with cardinality $|\mathcal{N}| = N$. There is a depot located at the edge of the target region, which can be used by UAVs for both energy replenishment and application placement through wired connections. A time-slotted operation framework is studied, in which we define $t \in \{1, 2, \ldots, T\}$ as the index of time slots. The target region is equally divided into small square grids with side length $q$, and similar to [9], we assume that the downlink transmission range of each UAV is $\frac{\sqrt{2}}{2}q$, which totally covers a grid (for feeding back computation outcomes). All IoT devices are required to offload their tasks to their associated UAVs via uplink communications using the same frequency band $B$, and the set of IoT devices served by (or associated with) a certain UAV $m$ is denoted by $\mathcal{G}_m \subseteq \mathcal{N}$. The horizontal coordinates of IoT device $n \in \mathcal{N}$ and UAV $m \in \mathcal{M}$ at time slot $t$ are represented as $I_n(t) = (x^I_n(t), y^I_n(t))$ and $U_m(t) = (x^U_m(t), y^U_m(t))$, respectively. Then, the distance between IoT device $n \in \mathcal{N}$ and UAV $m \in \mathcal{M}$ at time slot $t$ can be expressed as
$$d_{m,n}(t) = \sqrt{(x^U_m(t) - x^I_n(t))^2 + (y^U_m(t) - y^I_n(t))^2 + H^2},$$
where $H$ denotes the fixed flight altitude of all UAVs. Following the literature [10], the line-of-sight (LoS) probability between IoT device $n \in \mathcal{G}_m$ and UAV $m \in \mathcal{M}$ at time slot $t$ is given by $\delta_{m,n}(t) = a \cdot \exp(-b(\arctan(H/d_{m,n}(t)) - a))$, where $a$ and $b$ are constant values depending on the environment. Then, the path loss between IoT device $n \in \mathcal{G}_m$ and UAV $m \in \mathcal{M}$ at time slot $t$ can be expressed as
$$\lambda_{m,n}(t) = 20\log\left(\sqrt{H^2 + d_{m,n}(t)^2}\right) + \delta_{m,n}(t)(\eta_{LoS} - \eta_{NLoS}) + 20\log\left(\frac{4\pi f}{c}\right) + \eta_{NLoS},$$
where $f$ and $c$ signify the carrier frequency and the speed of light, respectively; $\eta_{LoS}$ and $\eta_{NLoS}$ are the losses corresponding to the LoS and non-LoS links, respectively.
Since a common frequency band is reused among all links, the signal-to-interference-plus-noise ratio (SINR) at UAV $m \in \mathcal{M}$ with regard to the uplink communication of IoT device $n \in \mathcal{G}_m$ at time slot $t$ can be calculated as
$$\sigma_{m,n}(t) = \frac{\mathbf{v}_n(t)\mathbf{w}_m(t)^\top p^{tran}_n 10^{-\lambda_{m,n}(t)/10}}{\sum_{i=1, i \neq n}^{N} \mathbf{v}_i(t)\mathbf{w}_m(t)^\top p^{tran}_i 10^{-\lambda_{m,i}(t)/10} + \varphi B},$$
where $p^{tran}_n$ is the transmission power of IoT device $n$, and $\varphi$ indicates the power spectral density of noise. At time slot $t$, we consider that IoT device $n \in \mathcal{G}_m$ can offload no more than one task to its associated UAV $m$. Let $\mathbf{v}_n(t) = \{v_{n,1}(t), v_{n,2}(t), \ldots, v_{n,c}(t), \ldots, v_{n,C}(t)\}$, where $c \in \{1, 2, \ldots, C\}$ is the index of the task type, $v_{n,c}(t) = 1$ signifies that IoT device $n$ requests to offload a task of type $c$, and $v_{n,c}(t) = 0$, otherwise. Meanwhile, the applications placed in UAV $m$ can be defined as $\mathbf{w}_m(t) = \{w_{m,1}(t), w_{m,2}(t), \ldots, w_{m,c}(t), \ldots, w_{m,C}(t)\}$, where $w_{m,c}(t) \in \{0, 1\}$ signifies whether UAV $m$ places application type $c$. Note that any UAV $m \in \mathcal{M}$ can only process the types of tasks fitting the types of its placed applications. Based on these, the transmission time of IoT device $n \in \mathcal{G}_m$ in offloading a task to UAV $m \in \mathcal{M}$ can be written as
$$t^{off}_{m,n}(t) = \frac{\mathbf{v}_n(t)\mathbf{w}_m(t)^\top D_n}{B\log_2(1 + \sigma_{m,n}(t))},$$
where $D_n$ is the size of the task offloaded by IoT device $n$.
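For illustration only (this code is not part of the paper), the following Python sketch evaluates the channel model above as written, i.e., the LoS probability, path loss, uplink SINR and offloading time for one served device, using the environment constants of Table I where available; the LoS/NLoS excess losses, distances and transmission powers are assumed values.

```python
import numpy as np

# Illustrative channel-model sketch for one UAV m and its associated devices
# (assumed parameters; Table I values used where applicable).
A, B_ENV = 9.6117, 0.1581              # environment constants a, b
ETA_LOS, ETA_NLOS = 1.0, 20.0          # LoS / NLoS excess losses in dB (assumed)
F_C, C_LIGHT = 3e9, 3e8                # carrier frequency f and speed of light c
BW = 10e6                              # uplink bandwidth B (Hz)
NOISE_PSD = 10 ** (-174 / 10) * 1e-3   # -174 dBm/Hz converted to W/Hz
H = 120.0                              # flight altitude (m)

def path_loss_db(d):
    """Mean path loss lambda_{m,n}(t) in dB for the 3D distance d_{m,n}(t)."""
    los_prob = A * np.exp(-B_ENV * (np.arctan(H / d) - A))
    return (20 * np.log10(np.sqrt(H**2 + d**2))
            + los_prob * (ETA_LOS - ETA_NLOS)
            + 20 * np.log10(4 * np.pi * F_C / C_LIGHT) + ETA_NLOS)

def sinr(d_serving, p_serving, d_interf, p_interf):
    """SINR sigma_{m,n}(t) of the served device against co-channel interferers."""
    signal = p_serving * 10 ** (-path_loss_db(d_serving) / 10)
    interference = sum(p * 10 ** (-path_loss_db(d) / 10)
                       for d, p in zip(d_interf, p_interf))
    return signal / (interference + NOISE_PSD * BW)

def offload_time(task_bits, sinr_value):
    """Transmission time t^off_{m,n}(t) of a task of size task_bits."""
    return task_bits / (BW * np.log2(1 + sinr_value))

# Example: one served device at 150 m and two interferers reusing the band.
s = sinr(150.0, 0.3, [400.0, 600.0], [0.3, 0.3])
print(offload_time(3e6 * 8, s))   # a 3 MB task
```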
Within each time slot $t$, we consider that UAV $m \in \mathcal{M}$ hovers over the center of a certain grid to provide MEC services for a duration $t^{hover}$, with $t^{off}_{m,n}(t) < t^{hover} < |t|$, $\forall n \in \mathcal{G}_m, \forall m \in \mathcal{M}$, which means that $t^{hover}$ is large enough for UAV $m$ to receive any task offloaded by any IoT device and is shorter than the duration of a time slot. Then, the size of tasks computed by UAV $m \in \mathcal{M}$ can be expressed as
$$Task^{comp}_m(t) = \min\Big\{\sum\nolimits_{n \in \mathcal{G}_m} \mathbf{v}_n(t)\mathbf{w}_m(t)^\top D_n,\; \big(t^{hover} - \min\nolimits_{n \in \mathcal{G}_m}\{t^{off}_{m,n}(t)\}\big) f^U_m\Big\},$$
where $f^U_m$ is the computing capacity of UAV $m$ (in the number of CPU cycles per second), and $\big(t^{hover} - \min_{n \in \mathcal{G}_m}\{t^{off}_{m,n}(t)\}\big)$ indicates that UAV $m$ starts edge computing once the first task has been completely received. Correspondingly, the energy consumption of UAV $m \in \mathcal{M}$ for computing tasks at slot $t$ is calculated as
$$E^{comp}_m(t) = \xi (f^U_m)^2 Task^{comp}_m(t),$$
where $\xi$ denotes the capacitance coefficient of UAV $m \in \mathcal{M}$.
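As an illustrative sketch (not the authors' implementation), the per-slot computation model can be evaluated as follows; the request tuples and placement dictionary are assumed data structures, while $\xi$, $f^U_m$ and $t^{hover}$ take the values listed in Table I.

```python
# Sketch of the per-slot computation model of one UAV. 'requests' maps each served
# device to (task type c, D_n in bits, t^off_{m,n}); 'placed' is the UAV's
# application placement w_m(t).
XI = 1e-18            # capacitance coefficient xi (Table I)
F_U = 2e6             # computing capacity f^U_m per second (Table I)
T_HOVER = 5.0         # hovering duration t^hover (s)

def computed_task_and_energy(requests, placed):
    # Only tasks whose application type is placed on the UAV can be served,
    # i.e., v_n(t) w_m(t)^T = 1.
    servable = [(c, bits, t_off) for (c, bits, t_off) in requests if placed[c]]
    if not servable:
        return 0.0, 0.0
    offered_bits = sum(bits for (_, bits, _) in servable)
    # Computing starts once the first task has been fully received.
    compute_window = T_HOVER - min(t_off for (_, _, t_off) in servable)
    task_comp = min(offered_bits, compute_window * F_U)      # Task^comp_m(t)
    e_comp = XI * (F_U ** 2) * task_comp                     # E^comp_m(t)
    return task_comp, e_comp

# Example: three requests, application types 0 and 2 placed on the UAV.
placed = {0: True, 1: False, 2: True}
print(computed_task_and_energy([(0, 2.4e7, 0.8), (1, 3.2e7, 1.1), (2, 1.6e7, 0.6)], placed))
```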
Furthermore, let $\kappa_m(t) \in \{0, 1\}$ denote the decision of whether UAV $m \in \mathcal{M}$ returns to the depot at the beginning of time slot $t$. If UAV $m \in \mathcal{M}$ decides not to return to the depot (denoted by $\kappa_m(t) = 1$), it will select a direction among forward, backward, left and right, and then move to the center of an adjacent grid with a constant velocity $V$. The propulsion energy consumption (consisting of the energy consumed by horizontal movement and hovering) of UAV $m$ can be expressed as $E^{pro}_m = P^{pro}_m(V)\frac{q}{V} + P^{pro}_m(0)\, t^{hover}$, where $P^{pro}_m$ is the propulsion power model of UAVs, whose description follows from [11] and is omitted here. If UAV $m \in \mathcal{M}$ decides to return to the depot (denoted by $\kappa_m(t) = 0$), the energy consumption of UAV $m$ moving between the target region and the depot with the constant velocity $V$ can be written as $E^{dep}_m = 2 P^{pro}_m(V)\frac{d_{m,dep}(t)}{V}$, where $d_{m,dep}(t)$ is the distance between UAV $m$ and the depot at time slot $t$. At the depot, UAV $m$ can quickly renew its energy and also update its application placement for better serving IoT devices. Note that the total size of applications placed at UAV $m \in \mathcal{M}$ should be smaller than its storage capacity $S_m$, i.e., $\sum_{c=1}^{C} \mu_c w_{m,c}(t) \leq S_m$, where $\mu_c$ stands for the size of application type $c$. Additionally, to guarantee the quality of service (QoS) of IoT devices, each type of application should be placed in at least one UAV hovering over the target region at each time slot $t$, i.e., $\sum_{m=1}^{M} w_{m,c}(t)\kappa_m(t) \geq 1, \forall c \in \{1, \ldots, C\}$. After replenishing its energy and updating its application placement, UAV $m$ will fly back to the target region and continue to provide MEC services.
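The following sketch (illustrative; the propulsion power values are assumptions, since $P^{pro}_m$ follows the model of [11] and is not reproduced here) summarizes the per-slot energy terms and the two placement constraints as simple helper functions.

```python
# Sketch of the per-slot energy bookkeeping and placement feasibility checks
# (illustrative helper names; P^pro_m(V) and P^pro_m(0) are assumed constants).
Q_GRID, V = 100.0, 20.0               # grid side length q (m) and speed V (m/s)
T_HOVER = 5.0
P_PRO_V, P_PRO_HOVER = 180.0, 160.0   # assumed propulsion powers in W

def propulsion_energy():
    """E^pro_m: move one grid at speed V, then hover for t^hover."""
    return P_PRO_V * Q_GRID / V + P_PRO_HOVER * T_HOVER

def depot_energy(d_to_depot):
    """E^dep_m: round trip between the current grid and the depot."""
    return 2.0 * P_PRO_V * d_to_depot / V

def placement_feasible(w, sizes, s_m):
    """Constraint (5): total size of placed applications within storage S_m."""
    return sum(mu for mu, placed in zip(sizes, w) if placed) <= s_m

def coverage_satisfied(w_all, kappa_all, num_types):
    """Constraint (6): every type placed on at least one UAV staying aloft."""
    return all(any(w[c] and k for w, k in zip(w_all, kappa_all))
               for c in range(num_types))
```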
B. Problem Formulation
In this work, we aim to maximize the energy efficiency of all UAVs, i.e., the total amount of offloaded tasks computed by all UAVs over their total energy consumption, and we have
$$E^{effi}(t) = \frac{\sum_{m=1}^{M} \kappa_m(t)\, Task^{comp}_m(t)}{\sum_{m=1}^{M}\big(\kappa_m(t)(E^{comp}_m(t) + E^{pro}_m) + (1 - \kappa_m(t))E^{dep}_m\big)}. \tag{1}$$
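A minimal sketch of evaluating (1) for one time slot, assuming the per-UAV quantities defined in Section II-A have already been computed, is given below; the numerical example is purely illustrative.

```python
def energy_efficiency(kappa, task_comp, e_comp, e_pro, e_dep):
    """Per-slot system energy efficiency E^effi(t) in (1).

    All arguments are lists indexed by UAV m; kappa[m] = 1 if UAV m stays
    in the target region at slot t, 0 if it returns to the depot.
    """
    served = sum(k * task for k, task in zip(kappa, task_comp))
    spent = sum(k * (ec + e_pro[m]) + (1 - k) * e_dep[m]
                for m, (k, ec) in enumerate(zip(kappa, e_comp)))
    return served / spent if spent > 0 else 0.0

# Example with M = 3 UAVs, the second one returning to the depot this slot.
print(energy_efficiency([1, 0, 1], [8e6, 0.0, 6e6], [32.0, 0.0, 24.0],
                        [1700.0, 1700.0, 1700.0], [900.0, 900.0, 900.0]))
```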
Then, the joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated as
$$[\mathbf{P1}]: \max_{U_m(t), \mathbf{w}_m(t), \kappa_m(t)} \; \lim_{T \to +\infty} \frac{1}{T}\sum\nolimits_{t=1}^{T} E^{effi}(t) \tag{2}$$
$$\text{s.t.,} \quad \kappa_m(t) \in \{0, 1\}, \; \forall m \in \mathcal{M}, \tag{3}$$
$$w_{m,c}(t) \in \{0, 1\}, \; \forall m \in \mathcal{M}, \forall c \in \{1, \ldots, C\}, \tag{4}$$
$$\sum\nolimits_{c=1}^{C} \mu_c w_{m,c}(t) \leq S_m, \; \forall m \in \mathcal{M}, \tag{5}$$
$$\sum\nolimits_{m=1}^{M} w_{m,c}(t)\kappa_m(t) \geq 1, \; \forall c \in \{1, \ldots, C\}, \tag{6}$$
$$|U_m(t) - U_m(t-1)|^2 \kappa_m(t) = q^2, \; \forall m \in \mathcal{M}, \tag{7}$$
$$(x^U_m(t) - x^U_m(t-1))(y^U_m(t) - y^U_m(t-1))\kappa_m(t) = 0, \tag{8}$$
$$|U_m(t) - U_{m'}(t)|\kappa_m(t) \geq q, \; \forall m \neq m', \tag{9}$$
where constraint (5) means that the total size of applications placed at each UAV should be less than its storage capacity; constraint (6) states that the QoS of serving IoT devices should be met; constraints (7) and (8) imply that each UAV can only move to the center of an adjacent grid if it does not return to the depot; and constraint (9) indicates that each grid can only be covered by one UAV so as to avoid potential collisions. In the following section, we will first analyze problem [P1], and then propose a novel approach to derive the solution.
III. PROBLEM REFORMULATION AND SOLUTION
A. Problem Reformulation
Since UAVs are intelligent, to solve problem [P1], we can
allow each UAV to make its own decisions while regulating
the underlying cooperation and competition among them.
Specifically, UAVs are expected to cooperatively conduct the
trajectory planning, energy renewal and application placement
to maximize the energy efficiency of all UAVs while guar-
anteeing QoS of IoT devices. Meanwhile, allowing UAVs to
make decisions themselves may also lead to competition in trajectory planning, energy renewal and application placement among them. Additionally, considering the uncertainty that future environment information (e.g., task requirements of IoT devices) is not available to UAVs, we reformulate [P1] as three coupled multi-agent stochastic games as follows.
[P1] is reformulated as three coupled multi-agent stochastic games, i.e., TPSG $\langle \mathcal{M}, \mathcal{S}^{TPSG}, \mathcal{A}^{TPSG}, \mathcal{P}^{TPSG}, \mathcal{R}^{TPSG} \rangle$, ERSG $\langle \mathcal{M}, \mathcal{S}^{ERSG}, \mathcal{A}^{ERSG}, \mathcal{P}^{ERSG}, \mathcal{R}^{ERSG} \rangle$ and APSG $\langle \mathcal{M}, \mathcal{S}^{APSG}, \mathcal{A}^{APSG}, \mathcal{P}^{APSG}, \mathcal{R}^{APSG} \rangle$, where $\mathcal{M}$ indicates the set of agents (i.e., UAVs in this paper), $\mathcal{S}$ stands for the set of environment states, $\mathcal{A}$ represents the set of joint actions of all agents, $\mathcal{P}$ signifies the set of state transition probabilities, and $\mathcal{R}$ is the set of reward functions. Particularly, for TPSG, each UAV $m \in \mathcal{M}$ will choose an action individually based on the current environment state $s^{TPSG}(t) \in \mathcal{S}^{TPSG}$ at each time slot $t$, and these individual actions together form a joint action $\mathbf{a}^{TPSG}(t) \in \mathcal{A}^{TPSG}$. After executing the joint action, rewards will be obtained according to $\mathcal{R}^{TPSG}$, and the environment state will transition to the next one following $\mathcal{P}^{TPSG}$. The descriptions of ERSG and APSG are similar to that of TPSG and are omitted here. Note that TPSG, ERSG and APSG are inherently coupled. In the following subsection, we propose a novel approach, called TLRL, to obtain equilibriums of these three coupled multi-agent stochastic games.
B. TLRL Approach
The transitions of states and actions in TPSG, ERSG and APSG satisfy the Markov property, because all joint actions, i.e., $\mathbf{a}^{TPSG}(t)$, $\mathbf{a}^{ERSG}(t)$ and $\mathbf{a}^{APSG}(t)$, at time slot $t$ only depend on the environment states at time slot $t$, i.e., $s^{TPSG}(t)$, $s^{ERSG}(t)$ and $s^{APSG}(t)$. Thereby, in this paper, we characterize each UAV's strategic decision process in TPSG, ERSG and APSG by three Markov decision processes (MDPs).
MDP for each UAV in TPSG: With the aim of finding the optimal trajectories for all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in TPSG can be modelled as an MDP $(\mathcal{S}^{TPSG}, \mathcal{A}^{TPSG}_m, \mathcal{R}^{TPSG}_m, \mathcal{P}^{TPSG})$.
1) Environment State for Each UAV in TPSG: The environment state $s^{TPSG}(t) \in \mathcal{S}^{TPSG}$ for UAV $m \in \mathcal{M}$ in TPSG at time slot $t$ consists of all UAVs' positions $U_m(t), m \in \mathcal{M}$ and application placements $\mathbf{w}_m(t), m \in \mathcal{M}$, which can be expressed as $s^{TPSG}(t) = (U_m(t), \mathbf{w}_m(t))_{m \in \mathcal{M}}$.
2) Action for Each UAV in TPSG: At time slot $t$, UAV $m \in \mathcal{M}$ chooses an action $a^{TPSG}_m(t) \in \mathcal{A}^{TPSG}_m$, where $\mathcal{A}^{TPSG}_m$ is the set consisting of four possible actions, i.e., moving forward, backward, left or right.
3) Reward of Each UAV in TPSG: The immediate reward of UAV $m \in \mathcal{M}$ at time slot $t$ is given by
$$R^{TPSG}_m(t) = \frac{\kappa_m(t)\, Task^{comp}_m(t)}{E^{comp}_m(t) + E^{pro}_m}, \tag{10}$$
where the numerator indicates the size of tasks computed by UAV $m$ at time slot $t$, and the denominator represents the energy consumption of UAV $m$ at time slot $t$.
4) State Transition Probabilities of UAVs in TPSG: The state transition probability from state $s^{TPSG}$ to $s^{TPSG\prime}$ by taking the joint action $\mathbf{a}^{TPSG}(t) = (a^{TPSG}_1(t), a^{TPSG}_2(t), \ldots, a^{TPSG}_M(t))$ can be expressed as $\mathcal{P}^{TPSG}_{s^{TPSG}, s^{TPSG\prime}}(\mathbf{a}^{TPSG}(t)) = \Pr(s^{TPSG}(t+1) = s^{TPSG\prime} \mid s^{TPSG}(t) = s^{TPSG}, \mathbf{a}^{TPSG}(t))$.
MDP for each UAV in ERSG: With the aim of designing the optimal schedule of energy renewal for all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in ERSG can be modelled as an MDP $(\mathcal{S}^{ERSG}, \mathcal{A}^{ERSG}_m, \mathcal{R}^{ERSG}_m, \mathcal{P}^{ERSG})$.
1) Environment State for Each UAV in ERSG: The environment state $s^{ERSG}(t) \in \mathcal{S}^{ERSG}$ for UAV $m \in \mathcal{M}$ in ERSG at time slot $t$ consists of all UAVs' remaining energy $E^{remain}_m(t), m \in \mathcal{M}$ and positions $U_m(t), m \in \mathcal{M}$, which can be expressed as $s^{ERSG}(t) = (E^{remain}_m(t), U_m(t))_{m \in \mathcal{M}}$.
2) Action for Each UAV in ERSG: UAV $m \in \mathcal{M}$ chooses an action $a^{ERSG}_m(t) \in \mathcal{A}^{ERSG}_m$ at time slot $t$, where $\mathcal{A}^{ERSG}_m$ is the set consisting of two actions, i.e., deciding to return to the depot or not.
3) Reward of Each UAV in ERSG: The immediate reward of UAV $m \in \mathcal{M}$ at time slot $t$ is given by
$$R^{ERSG}_m(t) = \begin{cases} -10, & \text{if constraint (6) is violated}, \\ \kappa_m(t), & \text{otherwise}. \end{cases} \tag{11}$$
This reward function can prompt UAVs to hover over the target region providing MEC services without violating (6).
The definition of the state transition probabilities of UAVs in ERSG, $\mathcal{P}^{ERSG}$, is similar to that in TPSG and is omitted here.
MDP for each UAV in APSG: With the aim of producing the optimal policy for updating the application placement of all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in APSG can be defined as an MDP $(\mathcal{S}^{APSG}, \mathcal{A}^{APSG}_m, \mathcal{R}^{APSG}_m, \mathcal{P}^{APSG})$.
1) Environment State for Each UAV in APSG: The environment state $s^{APSG}(t) \in \mathcal{S}^{APSG}$ for UAV $m \in \mathcal{M}$ at time slot $t$ consists of the applications placed in all UAVs $\mathbf{w}_m(t), m \in \mathcal{M}$ and the amount of task requests from IoT devices covered by UAV $m$ before $t$, i.e., $\theta_m(t) = \sum_{\tau=1}^{t}\sum_{n \in \mathcal{G}_m} \mathbf{v}_n(\tau), m \in \mathcal{M}$, and thus $s^{APSG}(t) = (\mathbf{w}_m(t), \theta_m(t))_{m \in \mathcal{M}}$.
2) Action for Each UAV in APSG: UAV $m \in \mathcal{M}$ chooses an action $a^{APSG}_m(t) \in \mathcal{A}^{APSG}_m$ at time slot $t$, signifying that it selects $S_m$ types of applications from the total $C$ types.
3) Reward of Each UAV in APSG: The immediate reward of UAV $m \in \mathcal{M}$ in APSG at time slot $t$ is given by
$$R^{APSG}_m(t) = \frac{e(t)}{C}\sum_{\tau=1}^{t}\sum_{n \in \mathcal{G}_m} \mathbf{v}_n(\tau)\mathbf{w}_m(\tau)^\top, \tag{12}$$
where $e(t)$ indicates the number of application types placed in all UAVs at time slot $t$. This reward function would guide UAVs to update more popular but diverse applications according to the history of providing MEC services.
The definition of the state transition probabilities $\mathcal{P}^{APSG}$ is similar to that in TPSG and is omitted here.
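For concreteness, a small sketch of how the APSG reward (12) could be computed for one UAV from its request and placement history is given below; the data layout is an assumption made for illustration and is not specified in the paper.

```python
# Sketch of the APSG reward (12) for UAV m. 'request_history[tau]' is the list of
# task-type indicator vectors v_n(tau) observed by UAV m up to slot t, and
# 'placement_history[tau]' is its placement vector w_m(tau) at that slot.
def apsg_reward(request_history, placement_history, types_placed_all_uavs, num_types):
    """R^APSG_m(t) = e(t)/C * sum_{tau<=t} sum_{n in G_m} v_n(tau) w_m(tau)^T."""
    matched = 0
    for v_slot, w in zip(request_history, placement_history):
        for v in v_slot:                       # one indicator vector per served device
            matched += sum(vc * wc for vc, wc in zip(v, w))
    return types_placed_all_uavs / num_types * matched

# Example: C = 4 application types, 3 of them currently placed across all UAVs.
hist_v = [[[1, 0, 0, 0], [0, 0, 1, 0]], [[0, 1, 0, 0]]]   # two slots of requests
hist_w = [[1, 0, 1, 0], [1, 0, 1, 0]]                      # w_m(tau) per slot
print(apsg_reward(hist_v, hist_w, types_placed_all_uavs=3, num_types=4))
```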
Based on the above three MDP formulations, we develop a
novel triple learner (i.e., trajectory learner, energy learner and
application learner) based reinforcement learning approach
to obtain equilibriums of these three coupled multi-agent
stochastic games. Specifically, each UAV runs three Q-learning
algorithms to learn the optimal Q values of each state-action
pair, and obtain the optimal local policies for trajectory learner,
energy learner, and application learner. It is worth noting
that, since trajectory planning, energy renewal and application
placement are tightly coupled, these three learners have to run
in a back-and-forth manner.
1) Settings for Trajectory Learner: The policy $\pi^{TPSG}_m : \mathcal{S}^{TPSG} \to \mathcal{A}^{TPSG}_m$ of the trajectory learner in UAV $m \in \mathcal{M}$, meaning a mapping from the environment state set to the action set, signifies a probability distribution over actions $a^{TPSG}_m \in \mathcal{A}^{TPSG}_m$ in a given state $s^{TPSG}$. Particularly, for UAV $m$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$, the trajectory policy of the trajectory learner in UAV $m$ can be presented as $\pi^{TPSG}_m(s^{TPSG}) = \{\pi^{TPSG}_m(s^{TPSG}, a^{TPSG}_m) \mid a^{TPSG}_m \in \mathcal{A}^{TPSG}_m\}$, where $\pi^{TPSG}_m(s^{TPSG}, a^{TPSG}_m)$ is the probability of UAV $m$ selecting action $a^{TPSG}_m$ in state $s^{TPSG}$.
In Q-learning, the process of building the trajectory policy $\pi^{TPSG}_m$ is significantly affected by the trajectory learner's Q function. The Q function of the trajectory learner in UAV $m$ is the expected reward obtained by executing action $a^{TPSG}_m \in \mathcal{A}^{TPSG}_m$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$ under the given policy $\pi^{TPSG}_m$, which can be expressed by
$$Q^{TPSG}_m(s^{TPSG}, \mathbf{a}^{TPSG}, \pi^{TPSG}_m) = \mathbb{E}\Big(\sum\nolimits_{\tau=0}^{\infty} \gamma^{\tau} R^{TPSG}_m(t + \tau + 1) \,\Big|\, s^{TPSG}(t) = s^{TPSG}, \mathbf{a}^{TPSG}(t) = \mathbf{a}^{TPSG}, \pi^{TPSG}_m\Big), \tag{13}$$
where $\gamma \in [0, 1]$ is a constant discount factor, and the results of (13) are termed action values, i.e., Q values.
The trajectory learner in UAV $m \in \mathcal{M}$ selects an action $a^{TPSG}_m(t) \in \mathcal{A}^{TPSG}_m$ according to its Q function at slot $t$. To strike a balance between exploration and exploitation, we consider an $\epsilon$-greedy exploration strategy for the trajectory learner. Specifically, the trajectory learner in UAV $m \in \mathcal{M}$ selects a random action $a^{TPSG}_m \in \mathcal{A}^{TPSG}_m$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$ with probability $\epsilon$ and selects the best action $a^{TPSG*}_m$ with probability $(1 - \epsilon)$, where the best action satisfies $Q^{TPSG}_m(s^{TPSG}, \mathbf{a}^{TPSG*}, \pi^{TPSG}_m) \geq Q^{TPSG}_m(s^{TPSG}, \mathbf{a}^{TPSG}, \pi^{TPSG}_m), \forall \mathbf{a}^{TPSG} \in \mathcal{A}^{TPSG}$, with $a^{TPSG*}_m$ being the $m$-th element of $\mathbf{a}^{TPSG*}$. Besides, if the energy learner (described later) in UAV $m$ selects to return to the depot, the trajectory learner will not choose any action in $\mathcal{A}^{TPSG}_m$. Then, the probability of selecting action $a^{TPSG}_m \in \mathcal{A}^{TPSG}_m$ in state $s^{TPSG}$ can be expressed by
$$\pi^{TPSG}_m(s^{TPSG}, a^{TPSG}_m) = \begin{cases} 0, & \text{if UAV } m \text{ decides to return to the depot}, \\ 1 - \epsilon, & \text{if } Q^{TPSG}_m(s^{TPSG}, \cdot, \cdot) \text{ of } a^{TPSG}_m \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$
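A minimal sketch of this $\epsilon$-greedy selection rule, gated by the energy learner's depot decision, is shown below; the action names and Q-value layout are illustrative assumptions, not the authors' implementation.

```python
import random

def select_trajectory_action(q_row, epsilon, returning_to_depot):
    """Epsilon-greedy choice over {forward, backward, left, right} for one state.

    q_row: dict mapping each trajectory action to its current Q value.
    Returns None when the energy learner has decided to return to the depot,
    mirroring pi^TPSG_m(s, a) = 0 in that case.
    """
    if returning_to_depot:
        return None
    if random.random() < epsilon:
        return random.choice(list(q_row))      # explore a random action
    return max(q_row, key=q_row.get)           # exploit the best action

# Example: Q values of one state of the trajectory learner.
q_row = {"forward": 0.8, "backward": 0.1, "left": 0.4, "right": 0.3}
print(select_trajectory_action(q_row, epsilon=0.1, returning_to_depot=False))
```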
In the Q value update step of Q-learning, the trajectory learner in each UAV $m \in \mathcal{M}$ follows the update rule:
$$Q^{TPSG}_m(s^{TPSG}, \mathbf{a}^{TPSG}, t+1) = Q^{TPSG}_m(s^{TPSG}, \mathbf{a}^{TPSG}, t) + \beta^{TPSG}\Big(R^{TPSG}_m(t) + \gamma \max_{\mathbf{a}^{TPSG\prime} \in \mathcal{A}^{TPSG}} Q^{TPSG}_m(s^{TPSG\prime}, \mathbf{a}^{TPSG\prime}, t) - Q^{TPSG}_m(s^{TPSG}, \mathbf{a}^{TPSG}, t)\Big), \tag{14}$$
where $\beta^{TPSG}$ denotes the learning rate in TPSG.
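The tabular update (14) can be sketched as follows (illustrative; the same rule is reused for (16) and (18) with the corresponding states, actions and rewards).

```python
from collections import defaultdict

def q_update(q_table, state, action, reward, next_state, actions, beta, gamma):
    """Q(s,a) <- Q(s,a) + beta * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += beta * td_error
    return q_table[(state, action)]

# Example usage with a default-initialized table (Q = 0, as in Algorithm 1).
q_table = defaultdict(float)
actions = ["forward", "backward", "left", "right"]
print(q_update(q_table, state=(2, 3), action="forward", reward=1.2,
               next_state=(2, 4), actions=actions, beta=0.1, gamma=0.9))
```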
2) Settings for Energy Learner: The policy of the energy learner in UAV $m \in \mathcal{M}$ is expressed as $\pi^{ERSG}_m : \mathcal{S}^{ERSG} \to \mathcal{A}^{ERSG}_m$. Here, the Q function of the energy learner in UAV $m \in \mathcal{M}$ can be expressed by
$$Q^{ERSG}_m(s^{ERSG}, \mathbf{a}^{ERSG}, \pi^{ERSG}_m) = \mathbb{E}\Big(\sum\nolimits_{\tau=0}^{\infty}\gamma^{\tau} R^{ERSG}_m(t+\tau+1) \,\Big|\, s^{ERSG}(t) = s^{ERSG}, \mathbf{a}^{ERSG}(t) = \mathbf{a}^{ERSG}, \pi^{ERSG}_m\Big). \tag{15}$$
The energy learner in UAV $m \in \mathcal{M}$ selects an action $a^{ERSG}_m \in \mathcal{A}^{ERSG}_m$ (i.e., whether to return to the depot) also according to the $\epsilon$-greedy exploration strategy. Then, we have
$$\pi^{ERSG}_m(s^{ERSG}, a^{ERSG}_m) = \begin{cases} 1 - \epsilon, & \text{if } Q^{ERSG}_m(s^{ERSG}, \cdot, \cdot) \text{ of } a^{ERSG}_m \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$
The energy learner in UAV $m \in \mathcal{M}$ follows the update rule:
$$Q^{ERSG}_m(s^{ERSG}, \mathbf{a}^{ERSG}, t+1) = Q^{ERSG}_m(s^{ERSG}, \mathbf{a}^{ERSG}, t) + \beta^{ERSG}\Big(R^{ERSG}_m(t) + \gamma\max_{\mathbf{a}^{ERSG\prime} \in \mathcal{A}^{ERSG}} Q^{ERSG}_m(s^{ERSG\prime}, \mathbf{a}^{ERSG\prime}, t) - Q^{ERSG}_m(s^{ERSG}, \mathbf{a}^{ERSG}, t)\Big), \tag{16}$$
where $\beta^{ERSG}$ denotes the learning rate in ERSG.
3) Settings for Application Learner: The policy of the application learner in UAV $m \in \mathcal{M}$ is $\pi^{APSG}_m : \mathcal{S}^{APSG} \to \mathcal{A}^{APSG}_m$. Here, the Q function of the application learner in UAV $m \in \mathcal{M}$ can be expressed by
$$Q^{APSG}_m(s^{APSG}, \mathbf{a}^{APSG}, \pi^{APSG}_m) = \mathbb{E}\Big(\sum\nolimits_{\tau=0}^{\infty}\gamma^{\tau} R^{APSG}_m(t+\tau+1) \,\Big|\, s^{APSG}(t) = s^{APSG}, \mathbf{a}^{APSG}(t) = \mathbf{a}^{APSG}, \pi^{APSG}_m\Big). \tag{17}$$
The application learner in UAV $m \in \mathcal{M}$ selects an action $a^{APSG}_m \in \mathcal{A}^{APSG}_m$ also according to the $\epsilon$-greedy exploration strategy. Then, we have
$$\pi^{APSG}_m(s^{APSG}, a^{APSG}_m) = \begin{cases} 1 - \epsilon, & \text{if } Q^{APSG}_m(s^{APSG}, \cdot, \cdot) \text{ of } a^{APSG}_m \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$
The update rule of the application learner in UAV $m \in \mathcal{M}$ is
$$Q^{APSG}_m(s^{APSG}, \mathbf{a}^{APSG}, t+1) = Q^{APSG}_m(s^{APSG}, \mathbf{a}^{APSG}, t) + \beta^{APSG}\Big(R^{APSG}_m(t) + \gamma\max_{\mathbf{a}^{APSG\prime} \in \mathcal{A}^{APSG}} Q^{APSG}_m(s^{APSG\prime}, \mathbf{a}^{APSG\prime}, t) - Q^{APSG}_m(s^{APSG}, \mathbf{a}^{APSG}, t)\Big), \tag{18}$$
where $\beta^{APSG}$ denotes the learning rate in APSG.
In summary, the proposed TLRL approach is illustrated in detail in Algorithm 1.
Algorithm 1: TLRL Approach
1: for m = 1 to M do
2:     Initialize Q values $Q^{TPSG}_m = Q^{ERSG}_m = Q^{APSG}_m = 0$;
3: Set the maximal iteration counter LOOP and loop = 0;
4: for loop < LOOP do
5:     t = 0;
6:     for m = 1 to M do
7:         Send $Q^{TPSG}_m$, $Q^{ERSG}_m$ and $Q^{APSG}_m$ to other UAVs;
8:     while t ≤ T do
9:         Observe states $s^{TPSG}$, $s^{ERSG}$ and $s^{APSG}$;
10:        for m = 1 to M do
11:            UAV m selects $a^{ERSG}_m$ according to $\pi^{ERSG}_m$;
12:            if UAV m returns to the depot then
13:                UAV m selects $a^{APSG}_m$ according to $\pi^{APSG}_m$;
14:            else
15:                UAV m selects $a^{TPSG}_m$ according to $\pi^{TPSG}_m$;
16:            Obtain rewards $R^{TPSG}_m$, $R^{ERSG}_m$ and $R^{APSG}_m$;
17:            Update $Q^{TPSG}_m$, $Q^{ERSG}_m$ and $Q^{APSG}_m$ according to (14), (16) and (18), respectively;
18:            Send $Q^{TPSG}_m$, $Q^{ERSG}_m$ and $Q^{APSG}_m$ to other UAVs;
19:        Set t = t + 1;
20:    Set loop = loop + 1.
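To summarize the control flow of Algorithm 1, the following condensed Python sketch is provided for illustration; the environment object and the per-UAV learner methods used here are assumed interfaces and are not part of the paper.

```python
# Condensed sketch of the TLRL loop in Algorithm 1 (illustrative; 'env' and the
# per-UAV learners are assumed to expose the methods used below).
def tlrl_train(env, uavs, num_loops, horizon):
    for uav in uavs:
        uav.reset_q_tables()                      # Q^TPSG = Q^ERSG = Q^APSG = 0
    for _ in range(num_loops):
        env.reset()
        for uav in uavs:                          # share Q tables before the episode
            uav.broadcast_q_tables(uavs)
        for t in range(horizon):
            s_tpsg, s_ersg, s_apsg = env.observe()
            actions = {}
            for uav in uavs:
                go_back = uav.energy_learner.act(s_ersg)        # ERSG decision
                if go_back:                                     # update placement
                    actions[uav] = ("depot", uav.app_learner.act(s_apsg))
                else:                                           # fly one grid
                    actions[uav] = ("move", uav.traj_learner.act(s_tpsg))
            rewards, next_states = env.step(actions)
            for uav in uavs:
                uav.update_q_tables(rewards[uav], next_states)  # rules (14),(16),(18)
                uav.broadcast_q_tables(uavs)                    # keep UAVs in sync
```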
TABLE I: Simulation Parameters
M = 3; B = 10 MHz; C = 10;
N = 300; D_n = [2, 5] MB; V = 20 m/s;
t^{hover} = 5 s; ξ = 10^{-18}; f = 3 GHz;
q = 100 m; p^{tran}_n = [0.2, 0.5] W; H = 120 m;
S_m = 6 GB; µ_c = [1, 3] GB; φ = -174 dBm/Hz;
a, b = 9.6117, 0.1581; f^U_m = 2 Mbps; Target region = 1000 m × 1000 m.
IV. SIMULATION RESULTS
In this section, simulations are conducted to evaluate the
performance of the proposed TLRL approach for [P1]. Table I
lists the values of all simulation parameters, and the propulsion
power model follows [11]. Similar settings have also been
employed in [9], [12].
For comparison purposes, we introduce an energy efficient oriented trajectory planning (EOTP) algorithm and an existing algorithm, called decentralized multiple UAVs cooperative reinforcement learning (DMUCRL) [9], as benchmarks: EOTP determines the trajectories of all UAVs with the aim of maximizing the energy efficiency, but asks UAVs to return to the depot for energy renewal only when their batteries are exhausted, and does not enable the update of application placement; DMUCRL is originally designed to maximize the energy efficiency of UAVs in downlink content sharing by controlling all UAVs to work collaboratively based on double Q-learning (each UAV contains a trajectory learner and an energy learner).
Fig. 2: Energy efficiency w.r.t. transmission power of IoT devices (proposed TLRL vs. DMUCRL and EOTP).
Fig. 3: Energy efficiency w.r.t. storage capacity of each UAV (grid sizes q = 50 m, 100 m and 200 m).
Fig. 4: Energy efficiency w.r.t. UAV hovering time (proposed TLRL vs. DMUCRL and EOTP).

It can be observed from Fig. 2 that the energy efficiency first increases and then becomes stable with the increase of IoT devices' transmission power. This is because, with a larger transmission power, IoT devices would offload more tasks to their associated UAVs, thereby increasing the amount of tasks processed by UAVs. However, since the computing capacity of each UAV is still limited, such an increasing trend slows down as the limit is approached. More importantly, this figure shows that the proposed TLRL outperforms both DMUCRL and EOTP. The reasons are that i) each UAV under EOTP returns to the depot directly once its energy is exhausted, regardless of other UAVs; ii) each UAV's applications are fixed under DMUCRL, making it capable of serving only very limited IoT devices; and iii) our proposed TLRL well addresses the shortcomings of DMUCRL and EOTP.
Fig. 3 shows all UAVs' energy efficiency with different UAV storage capacities under different grid size settings. Specifically, UAVs can adjust their downlink transmission ranges so as to adjust the grid size q. It can be seen from Fig. 3 that the larger the grid size, the higher the energy efficiency of all UAVs. This is because, with a larger grid size, more IoT devices are included in a grid, and thereby each UAV can potentially process more offloaded tasks. Besides, it is also shown that the energy efficiency of all UAVs increases monotonically with the storage capacity of each UAV. The reason is that, with the increase of storage capacity, more types of applications can be placed in each UAV, so that more tasks may be processed.
It can be observed from Fig. 4 that the energy efficiency of all UAVs first increases with the UAV hovering time, and then decreases. This is because, with the growth of UAV hovering time, more offloaded tasks from IoT devices can be computed by UAVs during hovering. However, once all tasks have been completely processed, UAVs become idle and consume hovering energy over the target region until the hovering time expires. Additionally, it is also shown that the proposed TLRL outperforms both DMUCRL and EOTP, and the explanations for this are similar to those for Fig. 2.
V. CONCLUSION
In this paper, an energy efficient scheduling problem for multi-UAV assisted MEC has been studied. With the aim of maximizing the long-term energy efficiency of all UAVs, a joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated. By taking into account the inherent cooperation and competition among UAVs, we reformulate such an optimization problem as three coupled multi-agent stochastic games, and then propose a novel TLRL approach for reaching equilibriums. Simulation results show that, compared to counterparts, the proposed TLRL approach can significantly increase the energy efficiency of all UAVs.
ACKNOWLEDGMENTS
This work was supported by National Natural Science
Foundation of China (NSFC) under Grants No. 62002164, No.
62176122, and by the Postgraduate Research & Practice In-
novation Program of NUAA under grant No. xcxjh20221614.
REFERENCES
[1] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and A. Nallanathan,
“Deep reinforcement learning based dynamic trajectory control for UAV-
assisted mobile edge computing,” IEEE Trans. Mob. Comput., vol. 21,
no. 10, pp. 3536–3550, Oct. 2020.
[2] Y. Shi, C. Yi, B. Chen, C. Yang, K. Zhu, and J. Cai, “Joint online opti-
mization of data sampling rate and preprocessing mode for edge–cloud
collaboration-enabled industrial IoT,” IEEE Internet Things J., vol. 9,
no. 17, pp. 16402–16417, 2022.
[3] C. Dai, K. Zhu, and E. Hossain, “Multi-agent deep reinforcement
learning for joint decoupled user association and trajectory design in
full-duplex multi-UAV networks,” IEEE Trans. Mob. Comput., pp. 1–
15, 2022.
[4] J. Ji, K. Zhu et al., “Energy consumption minimization in UAV-assisted
mobile-edge computing systems: Joint resource allocation and trajectory
design,” IEEE Internet Things J., vol. 8, no. 10, pp. 8570–8584, 2021.
[5] J. Chen, C. Yi et al., “Learning aided joint sensor activation and mobile
charging vehicle scheduling for energy-efficient WRSN-based industrial
IoT,” IEEE Trans. Veh. Technol., pp. 1–15, 2022.
[6] G. Zheng, C. Xu, M. Wen, and X. Zhao, “Service caching based aerial
cooperative computing and resource allocation in multi-UAV enabled
MEC systems,” IEEE Trans. Veh. Technol., pp. 1–14, 2022.
[7] Y. Zhao, Z. Li, N. Cheng, R. Zhang, B. Hao, and X. Shen, “UAV
deployment strategy for range-based space-air integrated localization
network,” in Proc. IEEE GLOBECOM, 2019, pp. 1–6.
[8] L. Yang, H. Yao et al., “Multi-UAV deployment for MEC enhanced IoT
networks,” in Proc. IEEE ICCC, 2020, pp. 436–441.
[9] C. Zhao, J. Liu, M. Sheng, W. Teng, Y. Zheng, and J. Li, “Multi-UAV
trajectory planning for energy-efficient content coverage: A decentral-
ized learning-based approach,” IEEE J. Sel. Areas Commun., vol. 39,
no. 10, pp. 3193–3207, Oct. 2021.
[10] H. Mei, K. Yang, Q. Liu, and K. Wang, “Joint trajectory-resource
optimization in UAV-enabled edge-cloud system with virtualized mobile
clone,” IEEE Internet Things J., vol. 7, no. 7, pp. 5906–5921, Jul. 2020.
[11] Y. Zeng, J. Xu, and R. Zhang, “Energy minimization for wireless
communication with rotary-wing UAV,” IEEE Trans. Wirel. Commun.,
vol. 18, no. 4, pp. 2329–2345, Apr. 2019.
[12] B. Liu, Y. Wan, F. Zhou, Q. Wu, and R. Hu, “Resource allocation and
trajectory design for MISO UAV-assisted MEC networks,” IEEE Trans.
Veh. Technol., vol. 71, no. 5, pp. 4933–4948, May 2022.