
A Triple Learner Based Energy Efficient Scheduling for Multi-UAV Assisted Mobile Edge Computing

Jiayuan Chen∗, Changyan Yi∗, Jialiuyuan Li∗, Kun Zhu∗ and Jun Cai†

∗College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
†Department of Electrical and Computer Engineering, Concordia University, Montréal, QC, H3G 1M8, Canada

Email: {jiayuan.chen, changyan.yi, jialiuyuan.li, zhukun}@nuaa.edu.cn, jun.cai@concordia.ca

Abstract—In this paper, an energy efficient scheduling problem for multiple unmanned aerial vehicle (UAV) assisted mobile edge computing is studied. In the considered model, UAVs act as mobile edge servers to provide computing services to end-users with task offloading requests. Unlike existing works, we allow UAVs to determine not only their trajectories but also whether to return to the depot for replenishing energy and updating application placements (due to limited batteries and storage capacities). Aiming to maximize the long-term energy efficiency of all UAVs, i.e., the total amount of offloaded tasks computed by all UAVs over their total energy consumption, a joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated. Taking into account the underlying cooperation and competition among intelligent UAVs, we reformulate this problem as three coupled multi-agent stochastic games, and then propose a novel triple learner based reinforcement learning approach, integrating a trajectory learner, an energy learner and an application learner, for reaching equilibria. Simulations evaluate the performance of the proposed solution and demonstrate its superiority over counterparts.

I. INTRODUCTION

Recently, multi-unmanned aerial vehicle (UAV) assisted mobile edge computing (MEC) has attracted considerable attention due to its high flexibility in providing MEC services for end-users (e.g., IoT devices). In particular, UAVs with computing resources can dynamically adjust their positions to get close to end-users or fly to areas that cannot be covered by fixed MEC infrastructures [1], [2]. Thus, compared to traditional MEC systems, multi-UAV assisted MEC can provide a better quality of experience for end-users [3].

Although multi-UAV assisted MEC is envisioned as a lightweight but highly efficient paradigm for alleviating computation burdens on end-users, it also suffers from several inherent restrictions. For instance, computing tasks offloaded from different end-users must be processed by specific service applications, while the limited storage capacities of UAVs prevent them from storing all applications. Additionally, the limited energy capacities of UAVs hinder the provision of long-term MEC services. Recent research efforts in this area include trajectory optimization [4], [5], service caching [6], UAV deployment [7], [8], etc. Nevertheless, some critical issues, especially how UAVs' installed applications should be updated (given severely restricted wireless backhauls) and how UAVs' energy replenishment should be jointly scheduled, are of great importance but have not yet been well investigated.

In this paper, we study a joint optimization of trajectory planning, energy renewal, and application placement for multi-UAV assisted MEC to maximize the long-term energy efficiency of all UAVs, i.e., the total amount of offloaded tasks computed by all UAVs over their total energy consumption, when providing MEC services. Specifically, in the considered system, each UAV working over a target region has to decide its next action after finishing the last one, i.e., a flight direction for serving IoT devices in other areas, or returning to the depot for replenishing its energy and simultaneously updating its application placement (through wired connections), with the aim of maximizing the long-term energy efficiency of all UAVs. Since UAVs are intelligent, we allow each of them to make its own decisions while regulating the underlying cooperation and competition among them. Additionally, we take into account the uncertainty that future environment information (e.g., positions and task requirements of IoT devices) is unavailable to UAVs. To this end, we reformulate the joint optimization problem as three coupled multi-agent stochastic games, namely, the trajectory planning stochastic game (TPSG), the energy renewal stochastic game (ERSG) and the application placement stochastic game (APSG), and then propose a novel triple learner based reinforcement learning (TLRL) approach to obtain the corresponding equilibria of these games.

The main contributions of this paper are as follows.

• A joint optimization of trajectory planning, energy renewal and application placement for multi-UAV assisted MEC is formulated, where the objective is to maximize the long-term energy efficiency of all UAVs.
• Observing the underlying cooperation and competition among UAVs, the optimization problem is reformulated as three coupled multi-agent stochastic games, i.e., TPSG, ERSG and APSG, and then a novel approach, called TLRL, is proposed to derive the corresponding equilibria.
• Extensive simulations are conducted to show the superiority of the proposed TLRL approach over counterparts.

The rest of this paper is organized as follows: Section II introduces the system model and problem formulation. In Section III, a problem reformulation based on multi-agent stochastic games is proposed and analyzed, along with the developed TLRL approach. Simulation results are provided in Section IV, followed by the conclusion in Section V.

Fig. 1: An illustration of the considered multi-UAV assisted MEC: UAVs with different remaining energies and placed applications cover grids, receive tasks offloaded by IoT devices, interfere with one another, and return to a depot for charging and updating applications.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. Network Model

Consider a multi-UAV assisted MEC system deployed in a target region, as illustrated in Fig. 1, consisting of a group of heterogeneous UAVs (acting as mobile edge servers) $\mathcal{M}$ with cardinality $|\mathcal{M}| = M$ and a set of randomly scattered IoT devices $\mathcal{N}$ with cardinality $|\mathcal{N}| = N$. There is a depot located at the edge of the target region, which can be used by UAVs for both energy replenishment and application placement through wired connections. A time-slotted operation framework is studied, in which we define $t \in \{1, 2, \ldots, T\}$ as the index of time slots. The target region is equally divided into small square grids with side length $q$, and similar to [9], we assume that the downlink transmission range of each UAV is $\frac{\sqrt{2}}{2}q$, i.e., the half-diagonal of a grid, so that a UAV hovering over a grid center totally covers that grid (for feeding back computation outcomes).

back computation outcomes). All IoT devices are required

to ofﬂoad their tasks to their associated UAVs via uplink

communications using the same frequency band B, and the set

of IoT devices served by (or associated with) a certain UAV

mis denoted by Gm⊆ N . The horizontal coordinates of IoT

device n∈ N and UAV m∈ M at time slot tare represented

as In(t)=(xI

n(t), yI

n(t)) and Um(t) = (xU

m(t), yU

m(t)),

respectively. Then, the distance of IoT device n∈ N and

UAV m∈ M at time slot tcan be expressed as

dm,n(t) = q(xU

m(t)−xI

n(t))2+ (yU

m(t)−yI

n(t))2+H2,

where $H$ denotes the fixed flight altitude of all UAVs. Following the literature [10], the line-of-sight (LoS) probability between IoT device $n \in \mathcal{G}_m$ and UAV $m \in \mathcal{M}$ at time slot $t$ is given by $\delta_{m,n}(t) = 1/\big(1 + a\exp(-b(\arctan(H/d_{m,n}(t)) - a))\big)$, where $a$ and $b$ are constant values depending on the environment. Then, the path loss between IoT device $n \in \mathcal{G}_m$ and UAV $m \in \mathcal{M}$ at time slot $t$ can be expressed as
$$\lambda_{m,n}(t) = 20\log\Big(\sqrt{H^2 + d_{m,n}(t)^2}\Big) + \delta_{m,n}(t)(\eta_{LoS} - \eta_{NLoS}) + 20\log\Big(\frac{4\pi f}{c}\Big) + \eta_{NLoS},$$
where $f$ and $c$ signify the carrier frequency and the speed of light, respectively; $\eta_{LoS}$ and $\eta_{NLoS}$ are the losses corresponding to the LoS and non-LoS links, respectively.

Since a common frequency band is reused among all links, the signal-to-interference-plus-noise ratio (SINR) at UAV $m \in \mathcal{M}$ with regard to the uplink communication of IoT device $n \in \mathcal{G}_m$ at time slot $t$ can be calculated as
$$\sigma_{m,n}(t) = \frac{v_n(t)\, w_m(t)^\top\, p_n^{tran}\, 10^{-\lambda_{m,n}(t)/10}}{\sum_{i=1, i \neq n}^{N} v_i(t)\, w_m(t)^\top\, p_i^{tran}\, 10^{-\lambda_{m,i}(t)/10} + \varphi B},$$
where $p_n^{tran}$ is the transmission power of IoT device $n$, and $\varphi$ indicates the power spectral density of the noise. At time slot $t$, we consider that IoT device $n \in \mathcal{G}_m$ can offload no more than one task to its associated UAV $m$.

Let $v_n(t) = \{v_{n,1}(t), v_{n,2}(t), \ldots, v_{n,c}(t), \ldots, v_{n,C}(t)\}$, where $c \in \{1, 2, \ldots, C\}$ is the index of the task type, $v_{n,c}(t) = 1$ signifies that IoT device $n$ requests to offload a task of type $c$, and $v_{n,c}(t) = 0$ otherwise. Meanwhile, the applications placed in UAV $m$ can be defined as $w_m(t) = \{w_{m,1}(t), w_{m,2}(t), \ldots, w_{m,c}(t), \ldots, w_{m,C}(t)\}$, where $w_{m,c}(t) \in \{0, 1\}$ signifies whether UAV $m$ places the application of type $c$. Note that any UAV $m \in \mathcal{M}$ can only process the types of tasks fitting the types of its placed applications. Based on these, the transmission time of IoT device $n \in \mathcal{G}_m$ in offloading a task to UAV $m \in \mathcal{M}$ can be written as
$$t_{m,n}^{off}(t) = \frac{v_n(t)\, w_m(t)^\top\, D_n}{B \log_2(1 + \sigma_{m,n}(t))},$$
where $D_n$ is the size of the task offloaded by IoT device $n$.
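To make the channel model concrete, the following Python sketch evaluates the above expressions end-to-end. It is an illustrative implementation, not the authors' code: the function and variable names are ours, the LoS/NLoS losses $\eta_{LoS}$ and $\eta_{NLoS}$ are assumed example values (they are not listed in Table I), and the elevation angle in the LoS model is taken in degrees, which is the standard convention for the constants $a = 9.6117$, $b = 0.1581$.

```python
import math

# Parameters from Table I; ETA_LOS / ETA_NLOS are assumed example values.
A_ENV, B_ENV = 9.6117, 0.1581   # environment constants a, b
F_C, C_LIGHT = 3e9, 3e8         # carrier frequency f (Hz), speed of light c (m/s)
ETA_LOS, ETA_NLOS = 1.0, 20.0   # LoS / NLoS losses (dB), assumed
BAND = 10e6                     # uplink bandwidth B (Hz)
PHI_DBM = -174.0                # noise power spectral density (dBm/Hz)
H_ALT = 120.0                   # flight altitude H (m)

def distance(uav_xy, iot_xy, h=H_ALT):
    """d_{m,n}(t): distance between UAV m and IoT device n at altitude h."""
    return math.sqrt((uav_xy[0] - iot_xy[0]) ** 2
                     + (uav_xy[1] - iot_xy[1]) ** 2 + h ** 2)

def los_probability(d, h=H_ALT):
    """delta_{m,n}(t) = 1 / (1 + a exp(-b (theta - a))), theta in degrees."""
    theta = math.degrees(math.atan(h / d))
    return 1.0 / (1.0 + A_ENV * math.exp(-B_ENV * (theta - A_ENV)))

def path_loss_db(d, h=H_ALT):
    """lambda_{m,n}(t): probabilistic LoS/NLoS path loss (dB)."""
    delta = los_probability(d, h)
    return (20 * math.log10(math.sqrt(h ** 2 + d ** 2))
            + delta * (ETA_LOS - ETA_NLOS)
            + 20 * math.log10(4 * math.pi * F_C / C_LIGHT)
            + ETA_NLOS)

def sinr(n, p_tran, loss_db, offloads):
    """sigma_{m,n}(t): device n's received power over interference plus noise.

    p_tran[i]   - transmit power of device i (W)
    loss_db[i]  - path loss lambda_{m,i}(t) (dB)
    offloads[i] - 1 if device i offloads a task UAV m can process, else 0
    """
    noise_w = 10 ** (PHI_DBM / 10) / 1000 * BAND   # dBm/Hz -> W over band B
    rx = [offloads[i] * p_tran[i] * 10 ** (-loss_db[i] / 10)
          for i in range(len(p_tran))]
    return rx[n] / (sum(rx) - rx[n] + noise_w)

def offload_time(d_bits, sinr_val):
    """t^off_{m,n}(t) = D_n / (B log2(1 + sigma_{m,n}(t)))."""
    return d_bits / (BAND * math.log2(1 + sinr_val))
```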

Within each time slot $t$, we consider that UAV $m \in \mathcal{M}$ hovers over the center of a certain grid to provide MEC services for a duration $t^{hover}$, with $t_{m,n}^{off}(t) < t^{hover} < |t|$, $\forall n \in \mathcal{G}_m, \forall m \in \mathcal{M}$, which means that $t^{hover}$ is large enough for UAV $m$ to receive any task offloaded by any IoT device and is shorter than the duration of a time slot. Then, the size of the tasks computed by UAV $m \in \mathcal{M}$ can be expressed as
$$Task_m^{comp}(t) = \min\Big\{\sum_{n \in \mathcal{G}_m} v_n(t)\, w_m(t)^\top\, D_n,\; \big(t^{hover} - \min_{n \in \mathcal{G}_m}\{t_{m,n}^{off}(t)\}\big) f_m^U\Big\},$$
where $f_m^U$ is the computing capacity of UAV $m$ (in the number of CPU cycles per second), and the term $\big(t^{hover} - \min_{n \in \mathcal{G}_m}\{t_{m,n}^{off}(t)\}\big)$ indicates that UAV $m$ starts edge computing once the first task is completely received. Correspondingly, the energy consumption of UAV $m \in \mathcal{M}$ for computing tasks at slot $t$ is calculated as
$$E_m^{comp}(t) = \xi (f_m^U)^2\, Task_m^{comp}(t),$$
where $\xi$ denotes the capacitance coefficient of UAV $m \in \mathcal{M}$.
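Continuing the sketch above, the per-slot computation model can be written as follows; again the names are ours, and $f_m^U$ is treated in bit/s, consistently with its Table I value of 2 Mbps.

```python
def computed_task_size(task_bits, off_times, t_hover=5.0, f_u=2e6):
    """Task^comp_m(t): min of offloaded bits and what fits in the compute window.

    task_bits - total bits offloaded by served devices whose task types match
                UAV m's placed applications (sum of v_n(t) w_m(t)^T D_n)
    off_times - the corresponding offloading times t^off_{m,n}(t)
    f_u       - computing capacity f^U_m (bit/s, as in Table I)
    """
    if not off_times:
        return 0.0
    window = t_hover - min(off_times)  # computing starts once the first task arrives
    return min(task_bits, window * f_u)

def computing_energy(task_comp_bits, f_u=2e6, xi=1e-18):
    """E^comp_m(t) = xi * (f^U_m)^2 * Task^comp_m(t)."""
    return xi * f_u ** 2 * task_comp_bits
```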

Furthermore, let $\kappa_m(t) \in \{0, 1\}$ denote the decision of whether UAV $m \in \mathcal{M}$ returns to the depot at the beginning of time slot $t$. If UAV $m \in \mathcal{M}$ decides not to return to the depot (denoted by $\kappa_m(t) = 1$), it will select a direction among forward, backward, left and right, and then move to the center of an adjacent grid with a constant velocity $V$. The propulsion energy consumption (consisting of the energy consumption of horizontal moving and hovering) of UAV $m$ can be expressed as $E_m^{pro} = P_m^{pro}(V)\frac{q}{V} + P_m^{pro}(0)\, t^{hover}$, where $P_m^{pro}$ is the propulsion power model of UAVs, whose description follows [11] and is omitted here. If UAV $m \in \mathcal{M}$ decides to return to the depot (denoted by $\kappa_m(t) = 0$), the energy consumption of UAV $m$ moving between the target region and the depot with the constant velocity $V$ can be written as $E_m^{dep} = 2 P_m^{pro}(V)\frac{d_{m,dep}(t)}{V}$, where $d_{m,dep}(t)$ is the distance between UAV $m$ and the depot at time slot $t$. At the depot, UAV $m$ can quickly renew its energy and also update its application placement for better serving IoT devices. Note that the total size of the applications placed at UAV $m \in \mathcal{M}$ should not exceed its storage capacity $S_m$, that is, $\sum_{c=1}^{C} \mu_c w_{m,c}(t) \le S_m$, where $\mu_c$ stands for the size of application type $c$. Additionally, to guarantee the quality of service (QoS) of IoT devices, each type of application should be placed in at least one UAV hovering over the target region at each time slot $t$, i.e., $\sum_{m=1}^{M} w_{m,c}(t)\kappa_m(t) \ge 1, \forall c \in \{1, \ldots, C\}$. After replenishing its energy and updating its application placement, a UAV will return to the target region and continue to provide MEC services.
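The movement-related energy terms and the two placement constraints translate directly into code. A minimal sketch follows, assuming placeholder power values `p_move_w` and `p_hover_w` in place of the rotary-wing power model $P_m^{pro}$ of [11], which the paper does not reproduce:

```python
def propulsion_energy(p_move_w, p_hover_w, q=100.0, v=20.0, t_hover=5.0):
    """E^pro_m = P^pro_m(V) * q/V + P^pro_m(0) * t^hover (move one grid, then hover)."""
    return p_move_w * q / v + p_hover_w * t_hover

def depot_energy(p_move_w, d_depot, v=20.0):
    """E^dep_m = 2 * P^pro_m(V) * d_{m,dep}(t) / V (round trip to the depot)."""
    return 2 * p_move_w * d_depot / v

def storage_ok(w_m, mu, s_m=6.0):
    """Constraint (5): sizes of placed applications fit UAV m's storage S_m (GB)."""
    return sum(mu[c] for c, placed in enumerate(w_m) if placed) <= s_m

def qos_ok(w, kappa):
    """Constraint (6): each application type is on >= 1 UAV hovering over the region."""
    n_types = len(w[0])
    return all(sum(w[m][c] * kappa[m] for m in range(len(w))) >= 1
               for c in range(n_types))
```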

B. Problem Formulation

In this work, we aim to maximize the energy efficiency of all UAVs, i.e., the total amount of offloaded tasks computed by all UAVs over their total energy consumption:
$$E^{effi}(t) = \frac{\sum_{m=1}^{M} \kappa_m(t)\, Task_m^{comp}(t)}{\sum_{m=1}^{M} \big(\kappa_m(t)(E_m^{comp}(t) + E_m^{pro}) + (1 - \kappa_m(t))E_m^{dep}\big)}. \quad (1)$$
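Expressed in code, (1) is a straightforward ratio over the $M$ UAVs; the following helper (our own naming) composes the quantities computed above:

```python
def energy_efficiency(task_comp, e_comp, e_pro, e_dep, kappa):
    """E^effi(t) of (1): total computed bits over total energy across M UAVs.

    All arguments are length-M sequences; kappa[m] = 1 if UAV m stays over
    the target region at slot t, and 0 if it returns to the depot.
    """
    bits = sum(k * task for k, task in zip(kappa, task_comp))
    energy = sum(k * (ec + ep) + (1 - k) * ed
                 for k, ec, ep, ed in zip(kappa, e_comp, e_pro, e_dep))
    return bits / energy
```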

Then, the joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated as
$$[\mathbf{P1}]: \max_{U_m(t), w_m(t), \kappa_m(t)} \; \lim_{T \to +\infty} \frac{1}{T} \sum_{t=1}^{T} E^{effi}(t) \quad (2)$$
$$\text{s.t.,}\; \kappa_m(t) \in \{0, 1\}, \; \forall m \in \mathcal{M}, \quad (3)$$
$$w_{m,c}(t) \in \{0, 1\}, \; \forall m \in \mathcal{M}, \forall c \in \{1, \ldots, C\}, \quad (4)$$
$$\sum\nolimits_{c=1}^{C} \mu_c w_{m,c}(t) \le S_m, \; \forall m \in \mathcal{M}, \quad (5)$$
$$\sum\nolimits_{m=1}^{M} w_{m,c}(t)\kappa_m(t) \ge 1, \; \forall c \in \{1, \ldots, C\}, \quad (6)$$
$$|U_m(t) - U_m(t-1)|^2 \kappa_m(t) = q^2 \kappa_m(t), \; \forall m \in \mathcal{M}, \quad (7)$$
$$(x_m^U(t) - x_m^U(t-1))(y_m^U(t) - y_m^U(t-1))\, \kappa_m(t) = 0, \quad (8)$$
$$|U_m(t) - U_{m'}(t)|\, \kappa_m(t)\kappa_{m'}(t) \ge q\, \kappa_m(t)\kappa_{m'}(t), \; \forall m \neq m', \quad (9)$$
where constraint (5) means that the total size of the applications placed at each UAV should not exceed its storage capacity; constraint (6) states that the QoS of serving IoT devices should be met; constraints (7) and (8) imply that each UAV, if it does not return to the depot, can only move to the center of an adjacent grid; and constraint (9) indicates that each grid can be covered by at most one hovering UAV to avoid potential collisions. In the following section, we will first analyze problem [P1], and then propose a novel approach to derive the solution.

III. PROBLEM REFORMULATION AND SOLUTION

A. Problem Reformulation

Since UAVs are intelligent, to solve problem [P1], we can allow each UAV to make its own decisions while regulating the underlying cooperation and competition among them. Specifically, UAVs are expected to cooperatively conduct trajectory planning, energy renewal and application placement so as to maximize the energy efficiency of all UAVs while guaranteeing the QoS of IoT devices. Meanwhile, allowing UAVs to make decisions themselves may also lead to competition in trajectory planning, energy renewal and application placement among them. Additionally, future environment information (e.g., task requirements of IoT devices) is not available to UAVs. To this end, we reformulate [P1] as three coupled multi-agent stochastic games as follows.

[P1] is reformulated as three coupled multi-agent stochastic games, i.e., TPSG $\langle \mathcal{M}, \mathcal{S}^{TPSG}, \mathcal{A}^{TPSG}, \mathcal{P}^{TPSG}, \mathcal{R}^{TPSG} \rangle$, ERSG $\langle \mathcal{M}, \mathcal{S}^{ERSG}, \mathcal{A}^{ERSG}, \mathcal{P}^{ERSG}, \mathcal{R}^{ERSG} \rangle$ and APSG $\langle \mathcal{M}, \mathcal{S}^{APSG}, \mathcal{A}^{APSG}, \mathcal{P}^{APSG}, \mathcal{R}^{APSG} \rangle$, where $\mathcal{M}$ indicates the set of agents (i.e., UAVs in this paper), $\mathcal{S}$ stands for the set of environment states, $\mathcal{A}$ represents the set of joint actions of all agents, $\mathcal{P}$ signifies the set of state transition probabilities, and $\mathcal{R}$ is the set of reward functions. Particularly, for TPSG, each UAV $m \in \mathcal{M}$ will choose an action individually based on the current environment state $s^{TPSG}(t) \in \mathcal{S}^{TPSG}$ at each time slot $t$, and these individual actions form a joint action $a^{TPSG}(t) \in \mathcal{A}^{TPSG}$. After executing the joint action, rewards are obtained according to $\mathcal{R}^{TPSG}$, and the environment transitions to the next state following $\mathcal{P}^{TPSG}$. The descriptions of ERSG and APSG are similar to that of TPSG and are omitted here. Note that TPSG, ERSG and APSG are inherently coupled. In the following subsection, we propose a novel approach, called TLRL, to obtain the equilibria of these three coupled multi-agent stochastic games.

B. TLRL Approach

The transitions of states and actions of TPSG, ERSG, and APSG satisfy the Markov property, because all joint actions, i.e., $a^{TPSG}(t)$, $a^{ERSG}(t)$ and $a^{APSG}(t)$, at time slot $t$ only depend on the environment states at time slot $t$, i.e., $s^{TPSG}(t)$, $s^{ERSG}(t)$ and $s^{APSG}(t)$. Thereby, in this paper, we characterize each UAV's strategic decision process in TPSG, ERSG and APSG by three Markov decision processes (MDPs).

MDP for each UAV in TPSG: With the aim of finding the optimal trajectories for all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in TPSG can be modelled as an MDP $(\mathcal{S}^{TPSG}, \mathcal{A}_m^{TPSG}, \mathcal{R}_m^{TPSG}, \mathcal{P}^{TPSG})$.

1) Environment State for Each UAV in TPSG: The environment state $s^{TPSG}(t) \in \mathcal{S}^{TPSG}$ for UAV $m \in \mathcal{M}$ in TPSG at time slot $t$ consists of all UAVs' positions $U_m(t), m \in \mathcal{M}$, and application placements $w_m(t), m \in \mathcal{M}$, which can be expressed as $s^{TPSG}(t) = (U_m(t), w_m(t))_{m \in \mathcal{M}}$.

2) Action for Each UAV in TPSG: At time slot $t$, UAV $m \in \mathcal{M}$ chooses an action $a_m^{TPSG}(t) \in \mathcal{A}_m^{TPSG}$, where $\mathcal{A}_m^{TPSG}$ is the set consisting of four possible actions, i.e., moving forward, backward, left or right.

3) Reward of Each UAV in TPSG: The immediate reward of UAV $m \in \mathcal{M}$ at time slot $t$ is given by
$$R_m^{TPSG}(t) = \frac{\kappa_m(t)\, Task_m^{comp}(t)}{E_m^{comp}(t) + E_m^{pro}}, \quad (10)$$
where the numerator indicates the size of the tasks computed by UAV $m$ at time slot $t$, and the denominator represents the energy consumption of UAV $m$ at time slot $t$.

4) State Transition Probabilities of UAVs in TPSG: The state transition probability from state $s^{TPSG}$ to $s^{TPSG\prime}$ by taking the joint action $a^{TPSG}(t) = (a_1^{TPSG}(t), a_2^{TPSG}(t), \ldots, a_M^{TPSG}(t))$ can be expressed as $\mathcal{P}_{s^{TPSG}, s^{TPSG\prime}}^{TPSG}(a^{TPSG}(t)) = Pr\big(s^{TPSG}(t+1) = s^{TPSG\prime} \,\big|\, s^{TPSG}(t) = s^{TPSG}, a^{TPSG}(t)\big)$.

MDP for each UAV in ERSG: With the aim of designing the optimal schedule of energy renewal for all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in ERSG can be modelled as an MDP $(\mathcal{S}^{ERSG}, \mathcal{A}_m^{ERSG}, \mathcal{R}_m^{ERSG}, \mathcal{P}^{ERSG})$.

1) Environment State for Each UAV in ERSG: The environment state $s^{ERSG}(t) \in \mathcal{S}^{ERSG}$ for UAV $m \in \mathcal{M}$ in ERSG at time slot $t$ consists of all UAVs' remaining energies $E_m^{remain}(t), m \in \mathcal{M}$, and positions $U_m(t), m \in \mathcal{M}$, which can be expressed as $s^{ERSG}(t) = (E_m^{remain}(t), U_m(t))_{m \in \mathcal{M}}$.

2) Action for Each UAV in ERSG: UAV $m \in \mathcal{M}$ chooses an action $a_m^{ERSG}(t) \in \mathcal{A}_m^{ERSG}$ at time slot $t$, where $\mathcal{A}_m^{ERSG}$ is the set consisting of two actions, i.e., returning to the depot or not.

3) Reward of Each UAV in ERSG: The immediate reward of UAV $m \in \mathcal{M}$ at time slot $t$ is given by
$$R_m^{ERSG}(t) = \begin{cases} -10, & \text{if constraint (6) is violated}, \\ \kappa_m(t), & \text{otherwise}. \end{cases} \quad (11)$$
This reward function prompts UAVs to hover over the target region providing MEC services without violating (6).

The definition of the state transition probabilities $\mathcal{P}^{ERSG}$ of UAVs in ERSG is similar to that in TPSG and is omitted here.

MDP for each UAV in APSG: With the aim of producing the optimal policy for updating the application placement of all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in APSG can be defined as an MDP $(\mathcal{S}^{APSG}, \mathcal{A}_m^{APSG}, \mathcal{R}_m^{APSG}, \mathcal{P}^{APSG})$.

1) Environment State for Each UAV in APSG: The environment state $s^{APSG}(t) \in \mathcal{S}^{APSG}$ for UAV $m \in \mathcal{M}$ at time slot $t$ consists of the applications placed in all UAVs, $w_m(t), m \in \mathcal{M}$, and the amount of task requests from IoT devices covered by UAV $m$ before $t$, i.e., $\theta_m(t) = \sum_{\tau=1}^{t} \sum_{n \in \mathcal{G}_m} v_n(\tau), m \in \mathcal{M}$, and thus $s^{APSG}(t) = (w_m(t), \theta_m(t))_{m \in \mathcal{M}}$.

2) Action for Each UAV in APSG: UAV $m \in \mathcal{M}$ chooses an action $a_m^{APSG}(t) \in \mathcal{A}_m^{APSG}$ at time slot $t$, signifying that it selects, from the total $C$ application types, a subset fitting its storage capacity $S_m$.

3) Reward of Each UAV in APSG: The immediate reward of UAV $m \in \mathcal{M}$ in APSG at time slot $t$ is given by
$$R_m^{APSG}(t) = \frac{e(t)}{C} \sum_{\tau=1}^{t} \sum_{n \in \mathcal{G}_m} v_n(\tau)\, w_m(\tau)^\top, \quad (12)$$
where $e(t)$ indicates the number of application types placed in all UAVs at time slot $t$. This reward function guides UAVs to place popular yet diverse applications according to the history of providing MEC services.

The definition of the state transition probabilities $\mathcal{P}^{APSG}$ is similar to that in TPSG and is omitted here.
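For concreteness, the two reward functions (11) and (12) can be sketched as below; the argument layout is our own choice, not the authors' interface.

```python
def ersg_reward(kappa_m, qos_satisfied):
    """R^ERSG_m(t) of (11): -10 if constraint (6) is violated, else kappa_m(t)."""
    return -10.0 if not qos_satisfied else float(kappa_m)

def apsg_reward(e_t, n_types, requests, placements):
    """R^APSG_m(t) of (12): (e(t)/C) * sum_{tau<=t} sum_{n in G_m} v_n(tau) w_m(tau)^T.

    requests[tau]   - list of 0/1 task-request vectors v_n(tau), n in G_m
    placements[tau] - 0/1 placement vector w_m(tau)
    e_t             - number of application types placed across all UAVs at slot t
    """
    matched = 0
    for tau in range(len(requests)):
        w = placements[tau]
        for v in requests[tau]:
            matched += sum(vc * wc for vc, wc in zip(v, w))
    return e_t / n_types * matched
```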

Based on the above three MDP formulations, we develop a novel triple learner (i.e., trajectory learner, energy learner and application learner) based reinforcement learning approach to obtain the equilibria of these three coupled multi-agent stochastic games. Specifically, each UAV runs three Q-learning algorithms to learn the optimal Q value of each state-action pair and to obtain the optimal local policies of the trajectory learner, energy learner and application learner. It is worth noting that, since trajectory planning, energy renewal and application placement are tightly coupled, these three learners have to run in a back-and-forth manner.

1) Settings for Trajectory Learner: The policy $\pi_m^{TPSG}: \mathcal{S}^{TPSG} \to \mathcal{A}_m^{TPSG}$ of the trajectory learner in UAV $m \in \mathcal{M}$, meaning a mapping from the environment state set to the action set, signifies a probability distribution over actions $a_m^{TPSG} \in \mathcal{A}_m^{TPSG}$ in a given state $s^{TPSG}$. Particularly, for UAV $m$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$, the trajectory policy of the trajectory learner in UAV $m$ can be presented as $\pi_m^{TPSG}(s^{TPSG}) = \{\pi_m^{TPSG}(s^{TPSG}, a_m^{TPSG}) \mid a_m^{TPSG} \in \mathcal{A}_m^{TPSG}\}$, where $\pi_m^{TPSG}(s^{TPSG}, a_m^{TPSG})$ is the probability of UAV $m$ selecting action $a_m^{TPSG}$ in state $s^{TPSG}$.

In Q-learning, the process of building the trajectory policy $\pi_m^{TPSG}$ is significantly affected by the trajectory learner's Q function. The Q function of the trajectory learner in UAV $m$ is the expected reward obtained by executing action $a_m^{TPSG} \in \mathcal{A}_m^{TPSG}$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$ under the given policy $\pi_m^{TPSG}$, which can be expressed by
$$Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, \pi_m^{TPSG}) = \mathbb{E}\Big(\sum_{\tau=0}^{\infty} \gamma^\tau R_m^{TPSG}(t+\tau+1) \,\Big|\, s^{TPSG}(t) = s^{TPSG},\, a^{TPSG}(t) = a^{TPSG},\, \pi_m^{TPSG}\Big), \quad (13)$$
where $\gamma \in [0, 1]$ is a constant discount factor, and the results of (13) are termed action values, i.e., Q values.

The trajectory learner in UAV $m \in \mathcal{M}$ selects an action $a_m^{TPSG}(t) \in \mathcal{A}_m^{TPSG}$ according to its Q function at slot $t$. To strike a balance between exploration and exploitation, we consider an $\epsilon$-greedy exploration strategy for the trajectory learner. Specifically, the trajectory learner in UAV $m \in \mathcal{M}$ selects a random action $a_m^{TPSG} \in \mathcal{A}_m^{TPSG}$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$ with probability $\epsilon$, and selects the best action $a_m^{TPSG*}$ with probability $(1 - \epsilon)$, where the best action satisfies $Q_m^{TPSG}(s^{TPSG}, a^{TPSG*}, \pi_m^{TPSG}) \ge Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, \pi_m^{TPSG}), \forall a^{TPSG} \in \mathcal{A}^{TPSG}$, with $a_m^{TPSG*}$ being the $m$-th element of $a^{TPSG*}$. Besides, if the later-described energy learner in UAV $m$ selects to return to the depot, the trajectory learner will not choose any action in $\mathcal{A}_m^{TPSG}$. Then, the probability of selecting action $a_m^{TPSG} \in \mathcal{A}_m^{TPSG}$ in state $s^{TPSG}$ can be expressed by
$$\pi_m^{TPSG}(s^{TPSG}, a_m^{TPSG}) = \begin{cases} 0, & \text{if UAV } m \text{ decides to return to the depot}, \\ 1 - \epsilon, & \text{if } Q_m^{TPSG}(s^{TPSG}, \cdot, \cdot) \text{ of } a_m^{TPSG} \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$

In the Q value update step of Q-learning, the trajectory learner in each UAV $m \in \mathcal{M}$ follows the update rule:
$$Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, t+1) = Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, t) + \beta^{TPSG}\Big(R_m^{TPSG}(t) + \gamma \max_{a^{TPSG\prime} \in \mathcal{A}^{TPSG}} Q_m^{TPSG}(s^{TPSG\prime}, a^{TPSG\prime}, t) - Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, t)\Big), \quad (14)$$
where $\beta^{TPSG}$ denotes the learning rate in TPSG.
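The trajectory learner thus reduces to standard tabular Q-learning with an $\epsilon$-greedy policy. The following sketch (our own illustrative code, not the authors' implementation) realizes the policy above and the update rule (14) for one UAV, with states encoded as hashable keys:

```python
import random
from collections import defaultdict

class TrajectoryLearner:
    """Minimal tabular Q-learner for TPSG, for a single UAV m."""

    ACTIONS = ("forward", "backward", "left", "right")  # A^TPSG_m

    def __init__(self, beta=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)  # Q^TPSG_m(s, a), initialized to 0
        self.beta, self.gamma, self.epsilon = beta, gamma, epsilon

    def select_action(self, state, returning_to_depot=False):
        """pi^TPSG_m: explore with probability epsilon, otherwise exploit;
        no trajectory action is taken if the energy learner chose the depot."""
        if returning_to_depot:
            return None
        if random.random() < self.epsilon:
            return random.choice(self.ACTIONS)
        return max(self.ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One application of the update rule (14)."""
        best_next = max(self.q[(next_state, a)] for a in self.ACTIONS)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.beta * td_error
```

The energy and application learners below follow the same pattern, differing only in their action sets and rewards.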

2) Settings for Energy Learner: The policy of the energy learner in UAV $m \in \mathcal{M}$ is expressed as $\pi_m^{ERSG}: \mathcal{S}^{ERSG} \to \mathcal{A}_m^{ERSG}$. Here, the Q function of the energy learner in UAV $m \in \mathcal{M}$ can be expressed by
$$Q_m^{ERSG}(s^{ERSG}, a^{ERSG}, \pi_m^{ERSG}) = \mathbb{E}\Big(\sum_{\tau=0}^{\infty} \gamma^\tau R_m^{ERSG}(t+\tau+1) \,\Big|\, s^{ERSG}(t) = s^{ERSG},\, a^{ERSG}(t) = a^{ERSG},\, \pi_m^{ERSG}\Big). \quad (15)$$
The energy learner in UAV $m \in \mathcal{M}$ selects an action $a_m^{ERSG} \in \mathcal{A}_m^{ERSG}$ (i.e., whether to return to the depot) also according to the $\epsilon$-greedy exploration strategy. Then, we have
$$\pi_m^{ERSG}(s^{ERSG}, a_m^{ERSG}) = \begin{cases} 1 - \epsilon, & \text{if } Q_m^{ERSG}(s^{ERSG}, \cdot, \cdot) \text{ of } a_m^{ERSG} \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$
The energy learner in UAV $m \in \mathcal{M}$ follows the update rule:
$$Q_m^{ERSG}(s^{ERSG}, a^{ERSG}, t+1) = Q_m^{ERSG}(s^{ERSG}, a^{ERSG}, t) + \beta^{ERSG}\Big(R_m^{ERSG}(t) + \gamma \max_{a^{ERSG\prime} \in \mathcal{A}^{ERSG}} Q_m^{ERSG}(s^{ERSG\prime}, a^{ERSG\prime}, t) - Q_m^{ERSG}(s^{ERSG}, a^{ERSG}, t)\Big), \quad (16)$$
where $\beta^{ERSG}$ denotes the learning rate in ERSG.

3) Settings for Application Learner: The policy of the application learner in UAV $m \in \mathcal{M}$ is $\pi_m^{APSG}: \mathcal{S}^{APSG} \to \mathcal{A}_m^{APSG}$. Here, the Q function of the application learner in UAV $m \in \mathcal{M}$ can be expressed by
$$Q_m^{APSG}(s^{APSG}, a^{APSG}, \pi_m^{APSG}) = \mathbb{E}\Big(\sum_{\tau=0}^{\infty} \gamma^\tau R_m^{APSG}(t+\tau+1) \,\Big|\, s^{APSG}(t) = s^{APSG},\, a^{APSG}(t) = a^{APSG},\, \pi_m^{APSG}\Big). \quad (17)$$
The application learner in UAV $m \in \mathcal{M}$ selects an action $a_m^{APSG} \in \mathcal{A}_m^{APSG}$ also according to the $\epsilon$-greedy exploration strategy. Then, we have
$$\pi_m^{APSG}(s^{APSG}, a_m^{APSG}) = \begin{cases} 1 - \epsilon, & \text{if } Q_m^{APSG}(s^{APSG}, \cdot, \cdot) \text{ of } a_m^{APSG} \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$
The update rule of the application learner in UAV $m \in \mathcal{M}$ is
$$Q_m^{APSG}(s^{APSG}, a^{APSG}, t+1) = Q_m^{APSG}(s^{APSG}, a^{APSG}, t) + \beta^{APSG}\Big(R_m^{APSG}(t) + \gamma \max_{a^{APSG\prime} \in \mathcal{A}^{APSG}} Q_m^{APSG}(s^{APSG\prime}, a^{APSG\prime}, t) - Q_m^{APSG}(s^{APSG}, a^{APSG}, t)\Big), \quad (18)$$
where $\beta^{APSG}$ denotes the learning rate in APSG.

In summary, the proposed TLRL approach is illustrated in detail in Algorithm 1.

Algorithm 1: TLRL Approach
1:  for m = 1 to M do
2:      Initialize Q values Q_m^TPSG = Q_m^ERSG = Q_m^APSG = 0;
3:  Set the maximal iteration counter LOOP and loop = 0;
4:  for loop < LOOP do
5:      t = 0;
6:      for m = 1 to M do
7:          Send Q_m^TPSG, Q_m^ERSG and Q_m^APSG to the other UAVs;
8:      while t ≤ T do
9:          Observe states s^TPSG, s^ERSG and s^APSG;
10:         for m = 1 to M do
11:             UAV m selects a_m^ERSG according to π_m^ERSG;
12:             if UAV m returns to the depot then
13:                 UAV m selects a_m^APSG according to π_m^APSG;
14:             else
15:                 UAV m selects a_m^TPSG according to π_m^TPSG;
16:             Obtain rewards R_m^TPSG, R_m^ERSG and R_m^APSG;
17:             Update Q_m^TPSG, Q_m^ERSG and Q_m^APSG according to (14), (16) and (18), respectively;
18:             Send Q_m^TPSG, Q_m^ERSG and Q_m^APSG to the other UAVs;
19:         Set t = t + 1;
20:     Set loop = loop + 1.
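Read as code, the main loop of Algorithm 1 could be organized as below. This is a hedged sketch: the environment object `env` and the per-UAV bundle of three learners are hypothetical stand-ins with the `select_action`/`update` interface of the `TrajectoryLearner` above, and the Q-table exchange of lines 7 and 18 is abbreviated to a comment.

```python
def tlrl(env, uavs, n_loops, horizon):
    """Compact sketch of Algorithm 1; `uavs[m]` bundles .trajectory, .energy
    and .application learners, `env` is a hypothetical simulator."""
    for _ in range(n_loops):
        env.reset()
        for _ in range(horizon):
            s_tp, s_er, s_ap = env.observe()                    # line 9
            actions = []
            for uav in uavs:                                    # lines 10-15
                a_er = uav.energy.select_action(s_er)           # return to depot or not
                if a_er == "return":
                    a = uav.application.select_action(s_ap)     # pick new placement
                else:
                    a = uav.trajectory.select_action(s_tp)      # pick flight direction
                actions.append((a_er, a))
            rewards, (s_tp2, s_er2, s_ap2) = env.step(actions)  # line 16
            for m, uav in enumerate(uavs):                      # line 17
                r_tp, r_er, r_ap = rewards[m]
                a_er, a = actions[m]
                uav.energy.update(s_er, a_er, r_er, s_er2)
                if a_er == "return":
                    uav.application.update(s_ap, a, r_ap, s_ap2)
                else:
                    uav.trajectory.update(s_tp, a, r_tp, s_tp2)
                # line 18: broadcast the updated Q tables to the other UAVs
            s_tp, s_er, s_ap = s_tp2, s_er2, s_ap2
```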

TABLE I: Simulation Parameters

| Param. | Value | Param. | Value | Param. | Value |
|---|---|---|---|---|---|
| M | 3 | B | 10 MHz | C | 10 |
| N | 300 | D_n | [2, 5] MB | V | 20 m/s |
| t^hover | 5 s | ξ | 10^-18 | f | 3 GHz |
| q | 100 m | p_n^tran | [0.2, 0.5] W | H | 120 m |
| S_m | 6 GB | μ_c | [1, 3] GB | φ | -174 dBm/Hz |
| a, b | 9.6117, 0.1581 | f_m^U | 2 Mbps | Target region | 1000 m × 1000 m |

IV. SIMULATION RESULTS

In this section, simulations are conducted to evaluate the performance of the proposed TLRL approach for [P1]. Table I lists the values of all simulation parameters, and the propulsion power model follows [11]. Similar settings have also been employed in [9], [12].
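For reference, the Table I values map directly onto a configuration object; one possible encoding (our own naming, with units noted inline and MB/Mbps converted to bits) is:

```python
SIM_PARAMS = {
    "M": 3,                      # number of UAVs
    "N": 300,                    # number of IoT devices
    "C": 10,                     # number of task/application types
    "B": 10e6,                   # uplink bandwidth (Hz)
    "D_n": (2 * 8e6, 5 * 8e6),   # task size range (bits, from [2, 5] MB)
    "V": 20.0,                   # UAV velocity (m/s)
    "t_hover": 5.0,              # hovering time (s)
    "xi": 1e-18,                 # capacitance coefficient
    "f": 3e9,                    # carrier frequency (Hz)
    "q": 100.0,                  # grid side length (m)
    "p_tran": (0.2, 0.5),        # device transmit power range (W)
    "H": 120.0,                  # flight altitude (m)
    "S_m": 6.0,                  # UAV storage (GB)
    "mu_c": (1.0, 3.0),          # application size range (GB)
    "phi": -174.0,               # noise PSD (dBm/Hz)
    "a_b": (9.6117, 0.1581),     # LoS model constants
    "f_U": 2e6,                  # UAV computing capacity (bit/s, 2 Mbps)
    "region": (1000.0, 1000.0),  # target region (m x m)
}
```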

For comparison purposes, we introduce an energy efficiency oriented trajectory planning (EOTP) algorithm and an existing algorithm called decentralized multiple UAVs cooperative reinforcement learning (DMUCRL) [9] as benchmarks. EOTP determines the trajectories of all UAVs with the aim of maximizing the energy efficiency, but asks UAVs to return to the depot for energy renewal only when their batteries are exhausted, and it does not enable the update of application placements. DMUCRL is originally designed to maximize the energy efficiency of UAVs in downlink content sharing by controlling all UAVs to work collaboratively based on double Q-learning (each UAV contains a trajectory learner and an energy learner).

Fig. 2: Energy efficiency w.r.t. transmission power of IoT devices. Fig. 3: Energy efficiency w.r.t. storage capacity of each UAV (for grid sizes q = 50 m, 100 m and 200 m). Fig. 4: Energy efficiency w.r.t. UAV hovering time.

It can be observed from Fig. 2 that the energy efficiency first increases and then becomes stable as the transmission power of IoT devices grows. This is because, with a larger transmission power, IoT devices offload more tasks to their associated UAVs, thereby increasing the amount of tasks processed by UAVs.

However, since the computing capacity of each UAV is limited, this increasing trend slows down as the capacity limit is approached. More importantly, the figure shows that the proposed TLRL outperforms both DMUCRL and EOTP. The reasons are that i) each UAV under EOTP returns to the depot directly once its energy is exhausted, regardless of other UAVs; ii) each UAV's application placement is fixed under DMUCRL, so it can serve only a limited set of IoT devices; and iii) the proposed TLRL addresses both of these shortcomings.

Fig. 3 shows all UAVs' energy efficiency under different UAV storage capacities and grid size settings. Specifically, UAVs can adjust their downlink transmission ranges so as to adjust the side length q of the grids. It can be seen from Fig. 3 that the larger the grid size is, the higher the energy efficiency of all UAVs becomes. This is because, with a larger grid size, more IoT devices are included in a grid, and thereby each UAV can potentially process more offloaded tasks. Besides, the energy efficiency of all UAVs increases monotonically with the storage capacity of each UAV. The reason is that, with a larger storage capacity, more types of applications can be placed in each UAV, so that more tasks may be processed.

It can be observed from Fig. 4 that the energy efficiency of all UAVs first increases with the UAV hovering time and then decreases. This is because, as the hovering time grows, more offloaded tasks from IoT devices can be computed by UAVs during hovering. However, once all tasks have been completely processed, the UAVs become idle and consume hovering energy over the target region until the hovering time expires. Additionally, the proposed TLRL again outperforms both DMUCRL and EOTP, and the explanations are similar to those for Fig. 2.

V. CONCLUSION

In this paper, an energy efficient scheduling problem for multi-UAV assisted MEC has been studied. With the aim of maximizing the long-term energy efficiency of all UAVs, a joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated. By taking into account the inherent cooperation and competition among UAVs, we reformulate this optimization problem as three coupled multi-agent stochastic games, and then propose a novel TLRL approach for reaching equilibria. Simulation results show that, compared to counterparts, the proposed TLRL approach can significantly increase the energy efficiency of all UAVs.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 62002164 and No. 62176122, and by the Postgraduate Research & Practice Innovation Program of NUAA under Grant No. xcxjh20221614.

REFERENCES

[1] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and A. Nallanathan, "Deep reinforcement learning based dynamic trajectory control for UAV-assisted mobile edge computing," IEEE Trans. Mob. Comput., vol. 21, no. 10, pp. 3536–3550, Oct. 2020.
[2] Y. Shi, C. Yi, B. Chen, C. Yang, K. Zhu, and J. Cai, "Joint online optimization of data sampling rate and preprocessing mode for edge–cloud collaboration-enabled industrial IoT," IEEE Internet Things J., vol. 9, no. 17, pp. 16402–16417, 2022.
[3] C. Dai, K. Zhu, and E. Hossain, "Multi-agent deep reinforcement learning for joint decoupled user association and trajectory design in full-duplex multi-UAV networks," IEEE Trans. Mob. Comput., pp. 1–15, 2022.
[4] J. Ji, K. Zhu et al., "Energy consumption minimization in UAV-assisted mobile-edge computing systems: Joint resource allocation and trajectory design," IEEE Internet Things J., vol. 8, no. 10, pp. 8570–8584, 2021.
[5] J. Chen, C. Yi et al., "Learning aided joint sensor activation and mobile charging vehicle scheduling for energy-efficient WRSN-based industrial IoT," IEEE Trans. Veh. Technol., pp. 1–15, 2022.
[6] G. Zheng, C. Xu, M. Wen, and X. Zhao, "Service caching based aerial cooperative computing and resource allocation in multi-UAV enabled MEC systems," IEEE Trans. Veh. Technol., pp. 1–14, 2022.
[7] Y. Zhao, Z. Li, N. Cheng, R. Zhang, B. Hao, and X. Shen, "UAV deployment strategy for range-based space-air integrated localization network," in Proc. IEEE GLOBECOM, 2019, pp. 1–6.
[8] L. Yang, H. Yao et al., "Multi-UAV deployment for MEC enhanced IoT networks," in Proc. IEEE ICCC, 2020, pp. 436–441.
[9] C. Zhao, J. Liu, M. Sheng, W. Teng, Y. Zheng, and J. Li, "Multi-UAV trajectory planning for energy-efficient content coverage: A decentralized learning-based approach," IEEE J. Sel. Areas Commun., vol. 39, no. 10, pp. 3193–3207, Oct. 2021.
[10] H. Mei, K. Yang, Q. Liu, and K. Wang, "Joint trajectory-resource optimization in UAV-enabled edge-cloud system with virtualized mobile clone," IEEE Internet Things J., vol. 7, no. 7, pp. 5906–5921, Jul. 2020.
[11] Y. Zeng, J. Xu, and R. Zhang, "Energy minimization for wireless communication with rotary-wing UAV," IEEE Trans. Wirel. Commun., vol. 18, no. 4, pp. 2329–2345, Apr. 2019.
[12] B. Liu, Y. Wan, F. Zhou, Q. Wu, and R. Hu, "Resource allocation and trajectory design for MISO UAV-assisted MEC networks," IEEE Trans. Veh. Technol., vol. 71, no. 5, pp. 4933–4948, May 2022.