
A Triple Learner Based Energy Efficient Scheduling for Multi-UAV Assisted Mobile Edge Computing

Jiayuan Chen∗, Changyan Yi∗, Jialiuyuan Li∗, Kun Zhu∗ and Jun Cai†

∗College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
†Department of Electrical and Computer Engineering, Concordia University, Montréal, QC, H3G 1M8, Canada

Email: {jiayuan.chen, changyan.yi, jialiuyuan.li, zhukun}@nuaa.edu.cn, jun.cai@concordia.ca

Abstract—In this paper, an energy efficient scheduling problem for multiple unmanned aerial vehicle (UAV) assisted mobile edge computing is studied. In the considered model, UAVs act as mobile edge servers to provide computing services to end-users with task offloading requests. Unlike existing works, we allow UAVs to determine not only their trajectories but also whether to return to the depot for replenishing energy and updating application placements (due to limited batteries and storage capacities). Aiming to maximize the long-term energy efficiency of all UAVs, i.e., the total amount of offloaded tasks computed by all UAVs over their total energy consumption, a joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated. Taking into account the underlying cooperation and competition among intelligent UAVs, we reformulate this problem as three coupled multi-agent stochastic games, and then propose a novel triple learner based reinforcement learning approach, integrating a trajectory learner, an energy learner and an application learner, for reaching equilibria. Simulations evaluate the performance of the proposed solution and demonstrate its superiority over counterparts.

I. INTRODUCTION

Recently, multi-unmanned aerial vehicle (UAV) assisted mobile edge computing (MEC) has attracted considerable attention due to its high flexibility in providing MEC services for end-users (e.g., IoT devices). In particular, UAVs with computing resources can dynamically adjust their positions to get close to end-users or fly to areas that cannot be covered by fixed MEC infrastructures [1], [2]. Thus, compared to traditional MEC systems, multi-UAV assisted MEC can provide a better quality of experience for end-users [3].

Although multi-UAV assisted MEC is envisioned as a lightweight but highly efficient paradigm for alleviating computation burdens on end-users, it also suffers from several inherent restrictions. For instance, computing tasks offloaded from different end-users must be processed by specific service applications, while the limited storage capacities of UAVs prevent them from storing all applications. Additionally, the limited energy capacities of UAVs hinder the provision of long-term MEC services. Recent research efforts in this area include trajectory optimization [4], [5], service caching [6], UAV deployment [7], [8], etc. Nevertheless, some critical issues, especially how UAVs' installed applications should be updated (given severely restricted wireless backhauls) and how UAVs' energy replenishment should be jointly scheduled, are of great importance but have not yet been well investigated.

In this paper, we study a joint optimization of trajectory planning, energy renewal, and application placement for multi-UAV assisted MEC to maximize the long-term energy efficiency of all UAVs, i.e., the total amount of offloaded tasks computed by all UAVs over their total energy consumption, when providing MEC services. Specifically, in the considered system, each UAV working over a target region has to decide its next action after finishing the last one, i.e., a flight direction for serving IoT devices in other areas, or returning to the depot for replenishing its energy and simultaneously updating its application placement (through wired connections), with the aim of maximizing the long-term energy efficiency of all UAVs. Since UAVs are intelligent, we allow each of them to make its own decisions while regulating the underlying cooperation and competition among them. Additionally, we take into account the uncertainty that future environment information (e.g., positions and task requirements of IoT devices) is unavailable to UAVs. To this end, we reformulate the joint optimization problem as three coupled multi-agent stochastic games, namely, the trajectory planning stochastic game (TPSG), the energy renewal stochastic game (ERSG) and the application placement stochastic game (APSG), and then propose a novel triple learner based reinforcement learning (TLRL) approach to obtain the corresponding equilibria of these games.

The main contributions of this paper are as follows.

• A joint optimization of trajectory planning, energy renewal and application placement for multi-UAV assisted MEC is formulated, where the objective is to maximize the long-term energy efficiency of all UAVs.
• Observing the underlying cooperation and competition among UAVs, the optimization problem is reformulated as three coupled multi-agent stochastic games, i.e., TPSG, ERSG and APSG, and then a novel approach, called TLRL, is proposed to derive the corresponding equilibria.
• Extensive simulations are conducted to show the superiority of the proposed TLRL approach over counterparts.

The rest of this paper is organized as follows: Section II introduces the system model and problem formulation. In Section III, a problem reformulation based on multi-agent stochastic games is proposed and analyzed, along with the developed TLRL approach. Simulation results are provided in Section IV, followed by the conclusion in Section V.

Fig. 1: An illustration of the considered multi-UAV assisted MEC: UAVs with different remaining energies and placed applications cover grids, receive tasks offloaded by IoT devices, interfere with one another, and return to a depot for charging and updating applications.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. Network Model

Consider a multi-UAV assisted MEC system deployed in a target region, as illustrated in Fig. 1, consisting of a group of heterogeneous UAVs (acting as mobile edge servers) $\mathcal{M}$ with cardinality $|\mathcal{M}| = M$ and a set of randomly scattered IoT devices $\mathcal{N}$ with cardinality $|\mathcal{N}| = N$. There is a depot located at the edge of the target region, which can be used by UAVs for both energy replenishment and application placement through wired connections. A time-slotted operation framework is studied, in which we define $t \in \{1, 2, \ldots, T\}$ as the index of time slots. The target region is equally divided into small square grids with side length $q$, and similar to [9], we assume that the downlink transmission range of each UAV is $\frac{\sqrt{2}}{2}q$, i.e., the half-diagonal of a grid, so that a UAV hovering over a grid center totally covers that grid (for feeding back computation outcomes).

back computation outcomes). All IoT devices are required

to ofﬂoad their tasks to their associated UAVs via uplink

communications using the same frequency band B, and the set

of IoT devices served by (or associated with) a certain UAV

mis denoted by Gm⊆ N . The horizontal coordinates of IoT

device n∈ N and UAV m∈ M at time slot tare represented

as In(t)=(xI

n(t), yI

n(t)) and Um(t) = (xU

m(t), yU

m(t)),

respectively. Then, the distance of IoT device n∈ N and

UAV m∈ M at time slot tcan be expressed as

dm,n(t) = q(xU

m(t)−xI

n(t))2+ (yU

m(t)−yI

n(t))2+H2,

where $H$ denotes the fixed flight altitude of all UAVs. Following the literature [10], the line-of-sight (LoS) probability between IoT device $n \in \mathcal{G}_m$ and UAV $m \in \mathcal{M}$ at time slot $t$ is given by $\delta_{m,n}(t) = 1/\big(1 + a\exp(-b(\arctan(H/d_{m,n}(t)) - a))\big)$, where $a$ and $b$ are constant values depending on the environment. Then, the path loss between IoT device $n \in \mathcal{G}_m$ and UAV $m \in \mathcal{M}$ at time slot $t$ can be expressed as
$$\lambda_{m,n}(t) = 20\log\Big(\sqrt{H^2 + d_{m,n}(t)^2}\Big) + \delta_{m,n}(t)(\eta_{LoS} - \eta_{NLoS}) + 20\log\Big(\frac{4\pi f}{c}\Big) + \eta_{NLoS},$$
where $f$ and $c$ signify the carrier frequency and the speed of light, respectively; $\eta_{LoS}$ and $\eta_{NLoS}$ are the losses corresponding to the LoS and non-LoS links, respectively.

Since a common frequency band is reused among all links, the signal-to-interference-plus-noise ratio (SINR) at UAV $m \in \mathcal{M}$ with regard to the uplink communication of IoT device $n \in \mathcal{G}_m$ at time slot $t$ can be calculated as
$$\sigma_{m,n}(t) = \frac{v_n(t)\, w_m(t)^\top\, p_n^{tran}\, 10^{-\lambda_{m,n}(t)/10}}{\sum_{i=1, i \neq n}^{N} v_i(t)\, w_m(t)^\top\, p_i^{tran}\, 10^{-\lambda_{m,i}(t)/10} + \varphi B},$$
where $p_n^{tran}$ is the transmission power of IoT device $n$, and $\varphi$ indicates the power spectral density of the noise. At time slot $t$, we consider that IoT device $n \in \mathcal{G}_m$ can offload no more than one task to its associated UAV $m$.

Let $v_n(t) = \{v_{n,1}(t), v_{n,2}(t), \ldots, v_{n,c}(t), \ldots, v_{n,C}(t)\}$, where $c \in \{1, 2, \ldots, C\}$ is the index of the task type, $v_{n,c}(t) = 1$ signifies that IoT device $n$ requests to offload a task of type $c$, and $v_{n,c}(t) = 0$ otherwise. Meanwhile, the applications placed in UAV $m$ can be defined as $w_m(t) = \{w_{m,1}(t), w_{m,2}(t), \ldots, w_{m,c}(t), \ldots, w_{m,C}(t)\}$, where $w_{m,c}(t) \in \{0, 1\}$ signifies whether UAV $m$ places the application of type $c$. Note that any UAV $m \in \mathcal{M}$ can only process the types of tasks fitting the types of its placed applications. Based on these, the transmission time of IoT device $n \in \mathcal{G}_m$ in offloading a task to UAV $m \in \mathcal{M}$ can be written as
$$t_{m,n}^{off}(t) = \frac{v_n(t)\, w_m(t)^\top\, D_n}{B \log_2(1 + \sigma_{m,n}(t))},$$
where $D_n$ is the size of the task offloaded by IoT device $n$.
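To make the channel model concrete, the following Python sketch evaluates the above expressions end-to-end. It is an illustrative implementation, not the authors' code: the function and variable names are ours, the LoS/NLoS losses $\eta_{LoS}$ and $\eta_{NLoS}$ are assumed example values (they are not listed in Table I), and the elevation angle in the LoS model is taken in degrees, which is the standard convention for the constants $a = 9.6117$, $b = 0.1581$.

```python
import math

# Parameters from Table I; ETA_LOS / ETA_NLOS are assumed example values.
A_ENV, B_ENV = 9.6117, 0.1581   # environment constants a, b
F_C, C_LIGHT = 3e9, 3e8         # carrier frequency f (Hz), speed of light c (m/s)
ETA_LOS, ETA_NLOS = 1.0, 20.0   # LoS / NLoS losses (dB), assumed
BAND = 10e6                     # uplink bandwidth B (Hz)
PHI_DBM = -174.0                # noise power spectral density (dBm/Hz)
H_ALT = 120.0                   # flight altitude H (m)

def distance(uav_xy, iot_xy, h=H_ALT):
    """d_{m,n}(t): distance between UAV m and IoT device n at altitude h."""
    return math.sqrt((uav_xy[0] - iot_xy[0]) ** 2
                     + (uav_xy[1] - iot_xy[1]) ** 2 + h ** 2)

def los_probability(d, h=H_ALT):
    """delta_{m,n}(t) = 1 / (1 + a exp(-b (theta - a))), theta in degrees."""
    theta = math.degrees(math.atan(h / d))
    return 1.0 / (1.0 + A_ENV * math.exp(-B_ENV * (theta - A_ENV)))

def path_loss_db(d, h=H_ALT):
    """lambda_{m,n}(t): probabilistic LoS/NLoS path loss (dB)."""
    delta = los_probability(d, h)
    return (20 * math.log10(math.sqrt(h ** 2 + d ** 2))
            + delta * (ETA_LOS - ETA_NLOS)
            + 20 * math.log10(4 * math.pi * F_C / C_LIGHT)
            + ETA_NLOS)

def sinr(n, p_tran, loss_db, offloads):
    """sigma_{m,n}(t): device n's received power over interference plus noise.

    p_tran[i]   - transmit power of device i (W)
    loss_db[i]  - path loss lambda_{m,i}(t) (dB)
    offloads[i] - 1 if device i offloads a task UAV m can process, else 0
    """
    noise_w = 10 ** (PHI_DBM / 10) / 1000 * BAND   # dBm/Hz -> W over band B
    rx = [offloads[i] * p_tran[i] * 10 ** (-loss_db[i] / 10)
          for i in range(len(p_tran))]
    return rx[n] / (sum(rx) - rx[n] + noise_w)

def offload_time(d_bits, sinr_val):
    """t^off_{m,n}(t) = D_n / (B log2(1 + sigma_{m,n}(t)))."""
    return d_bits / (BAND * math.log2(1 + sinr_val))
```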

Within each time slot $t$, we consider that UAV $m \in \mathcal{M}$ hovers over the center of a certain grid to provide MEC services for a duration $t^{hover}$, with $t_{m,n}^{off}(t) < t^{hover} < |t|$, $\forall n \in \mathcal{G}_m, \forall m \in \mathcal{M}$, which means that $t^{hover}$ is large enough for UAV $m$ to receive any task offloaded by any IoT device and is shorter than the duration of a time slot. Then, the size of the tasks computed by UAV $m \in \mathcal{M}$ can be expressed as
$$Task_m^{comp}(t) = \min\Big\{\sum_{n \in \mathcal{G}_m} v_n(t)\, w_m(t)^\top\, D_n,\; \big(t^{hover} - \min_{n \in \mathcal{G}_m}\{t_{m,n}^{off}(t)\}\big) f_m^U\Big\},$$
where $f_m^U$ is the computing capacity of UAV $m$ (in the number of CPU cycles per second), and the term $\big(t^{hover} - \min_{n \in \mathcal{G}_m}\{t_{m,n}^{off}(t)\}\big)$ indicates that UAV $m$ starts edge computing once the first task is completely received. Correspondingly, the energy consumption of UAV $m \in \mathcal{M}$ for computing tasks at slot $t$ is calculated as
$$E_m^{comp}(t) = \xi (f_m^U)^2\, Task_m^{comp}(t),$$
where $\xi$ denotes the capacitance coefficient of UAV $m \in \mathcal{M}$.
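Continuing the sketch above, the per-slot computation model can be written as follows; again the names are ours, and $f_m^U$ is treated in bit/s, consistently with its Table I value of 2 Mbps.

```python
def computed_task_size(task_bits, off_times, t_hover=5.0, f_u=2e6):
    """Task^comp_m(t): min of offloaded bits and what fits in the compute window.

    task_bits - total bits offloaded by served devices whose task types match
                UAV m's placed applications (sum of v_n(t) w_m(t)^T D_n)
    off_times - the corresponding offloading times t^off_{m,n}(t)
    f_u       - computing capacity f^U_m (bit/s, as in Table I)
    """
    if not off_times:
        return 0.0
    window = t_hover - min(off_times)  # computing starts once the first task arrives
    return min(task_bits, window * f_u)

def computing_energy(task_comp_bits, f_u=2e6, xi=1e-18):
    """E^comp_m(t) = xi * (f^U_m)^2 * Task^comp_m(t)."""
    return xi * f_u ** 2 * task_comp_bits
```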

Furthermore, let $\kappa_m(t) \in \{0, 1\}$ denote the decision of whether UAV $m \in \mathcal{M}$ returns to the depot at the beginning of time slot $t$. If UAV $m \in \mathcal{M}$ decides not to return to the depot (denoted by $\kappa_m(t) = 1$), it will select a direction among forward, backward, left and right, and then move to the center of an adjacent grid with a constant velocity $V$. The propulsion energy consumption (consisting of the energy consumption of horizontal moving and hovering) of UAV $m$ can be expressed as $E_m^{pro} = P_m^{pro}(V)\frac{q}{V} + P_m^{pro}(0)\, t^{hover}$, where $P_m^{pro}$ is the propulsion power model of UAVs, whose description follows [11] and is omitted here. If UAV $m \in \mathcal{M}$ decides to return to the depot (denoted by $\kappa_m(t) = 0$), the energy consumption of UAV $m$ moving between the target region and the depot with the constant velocity $V$ can be written as $E_m^{dep} = 2 P_m^{pro}(V)\frac{d_{m,dep}(t)}{V}$, where $d_{m,dep}(t)$ is the distance between UAV $m$ and the depot at time slot $t$. At the depot, UAV $m$ can quickly renew its energy and also update its application placement for better serving IoT devices. Note that the total size of the applications placed at UAV $m \in \mathcal{M}$ should not exceed its storage capacity $S_m$, that is, $\sum_{c=1}^{C} \mu_c w_{m,c}(t) \le S_m$, where $\mu_c$ stands for the size of application type $c$. Additionally, to guarantee the quality of service (QoS) of IoT devices, each type of application should be placed in at least one UAV hovering over the target region at each time slot $t$, i.e., $\sum_{m=1}^{M} w_{m,c}(t)\kappa_m(t) \ge 1, \forall c \in \{1, \ldots, C\}$. After replenishing its energy and updating its application placement, a UAV will return to the target region and continue to provide MEC services.
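The movement-related energy terms and the two placement constraints translate directly into code. A minimal sketch follows, assuming placeholder power values `p_move_w` and `p_hover_w` in place of the rotary-wing power model $P_m^{pro}$ of [11], which the paper does not reproduce:

```python
def propulsion_energy(p_move_w, p_hover_w, q=100.0, v=20.0, t_hover=5.0):
    """E^pro_m = P^pro_m(V) * q/V + P^pro_m(0) * t^hover (move one grid, then hover)."""
    return p_move_w * q / v + p_hover_w * t_hover

def depot_energy(p_move_w, d_depot, v=20.0):
    """E^dep_m = 2 * P^pro_m(V) * d_{m,dep}(t) / V (round trip to the depot)."""
    return 2 * p_move_w * d_depot / v

def storage_ok(w_m, mu, s_m=6.0):
    """Constraint (5): sizes of placed applications fit UAV m's storage S_m (GB)."""
    return sum(mu[c] for c, placed in enumerate(w_m) if placed) <= s_m

def qos_ok(w, kappa):
    """Constraint (6): each application type is on >= 1 UAV hovering over the region."""
    n_types = len(w[0])
    return all(sum(w[m][c] * kappa[m] for m in range(len(w))) >= 1
               for c in range(n_types))
```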

B. Problem Formulation

In this work, we aim to maximize the energy efficiency of all UAVs, i.e., the total amount of offloaded tasks computed by all UAVs over their total energy consumption:
$$E^{effi}(t) = \frac{\sum_{m=1}^{M} \kappa_m(t)\, Task_m^{comp}(t)}{\sum_{m=1}^{M} \big(\kappa_m(t)(E_m^{comp}(t) + E_m^{pro}) + (1 - \kappa_m(t))E_m^{dep}\big)}. \quad (1)$$
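Expressed in code, (1) is a straightforward ratio over the $M$ UAVs; the following helper (our own naming) composes the quantities computed above:

```python
def energy_efficiency(task_comp, e_comp, e_pro, e_dep, kappa):
    """E^effi(t) of (1): total computed bits over total energy across M UAVs.

    All arguments are length-M sequences; kappa[m] = 1 if UAV m stays over
    the target region at slot t, and 0 if it returns to the depot.
    """
    bits = sum(k * task for k, task in zip(kappa, task_comp))
    energy = sum(k * (ec + ep) + (1 - k) * ed
                 for k, ec, ep, ed in zip(kappa, e_comp, e_pro, e_dep))
    return bits / energy
```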

Then, the joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated as
$$[\mathbf{P1}]: \max_{U_m(t), w_m(t), \kappa_m(t)} \; \lim_{T \to +\infty} \frac{1}{T} \sum_{t=1}^{T} E^{effi}(t) \quad (2)$$
$$\text{s.t.,}\; \kappa_m(t) \in \{0, 1\}, \; \forall m \in \mathcal{M}, \quad (3)$$
$$w_{m,c}(t) \in \{0, 1\}, \; \forall m \in \mathcal{M}, \forall c \in \{1, \ldots, C\}, \quad (4)$$
$$\sum\nolimits_{c=1}^{C} \mu_c w_{m,c}(t) \le S_m, \; \forall m \in \mathcal{M}, \quad (5)$$
$$\sum\nolimits_{m=1}^{M} w_{m,c}(t)\kappa_m(t) \ge 1, \; \forall c \in \{1, \ldots, C\}, \quad (6)$$
$$|U_m(t) - U_m(t-1)|^2 \kappa_m(t) = q^2 \kappa_m(t), \; \forall m \in \mathcal{M}, \quad (7)$$
$$(x_m^U(t) - x_m^U(t-1))(y_m^U(t) - y_m^U(t-1))\, \kappa_m(t) = 0, \quad (8)$$
$$|U_m(t) - U_{m'}(t)|\, \kappa_m(t)\kappa_{m'}(t) \ge q\, \kappa_m(t)\kappa_{m'}(t), \; \forall m \neq m', \quad (9)$$
where constraint (5) means that the total size of the applications placed at each UAV should not exceed its storage capacity; constraint (6) states that the QoS of serving IoT devices should be met; constraints (7) and (8) imply that each UAV, if it does not return to the depot, can only move to the center of an adjacent grid; and constraint (9) indicates that each grid can be covered by at most one hovering UAV to avoid potential collisions. In the following section, we will first analyze problem [P1], and then propose a novel approach to derive the solution.

III. PROBLEM REFORMULATION AND SOLUTION

A. Problem Reformulation

Since UAVs are intelligent, to solve problem [P1], we can allow each UAV to make its own decisions while regulating the underlying cooperation and competition among them. Specifically, UAVs are expected to cooperatively conduct trajectory planning, energy renewal and application placement so as to maximize the energy efficiency of all UAVs while guaranteeing the QoS of IoT devices. Meanwhile, allowing UAVs to make decisions themselves may also lead to competition in trajectory planning, energy renewal and application placement among them. Additionally, future environment information (e.g., task requirements of IoT devices) is not available to UAVs. To this end, we reformulate [P1] as three coupled multi-agent stochastic games as follows.

[P1] is reformulated as three coupled multi-agent stochastic games, i.e., TPSG $\langle \mathcal{M}, \mathcal{S}^{TPSG}, \mathcal{A}^{TPSG}, \mathcal{P}^{TPSG}, \mathcal{R}^{TPSG} \rangle$, ERSG $\langle \mathcal{M}, \mathcal{S}^{ERSG}, \mathcal{A}^{ERSG}, \mathcal{P}^{ERSG}, \mathcal{R}^{ERSG} \rangle$ and APSG $\langle \mathcal{M}, \mathcal{S}^{APSG}, \mathcal{A}^{APSG}, \mathcal{P}^{APSG}, \mathcal{R}^{APSG} \rangle$, where $\mathcal{M}$ indicates the set of agents (i.e., UAVs in this paper), $\mathcal{S}$ stands for the set of environment states, $\mathcal{A}$ represents the set of joint actions of all agents, $\mathcal{P}$ signifies the set of state transition probabilities, and $\mathcal{R}$ is the set of reward functions. Particularly, for TPSG, each UAV $m \in \mathcal{M}$ will choose an action individually based on the current environment state $s^{TPSG}(t) \in \mathcal{S}^{TPSG}$ at each time slot $t$, and these individual actions form a joint action $a^{TPSG}(t) \in \mathcal{A}^{TPSG}$. After executing the joint action, rewards are obtained according to $\mathcal{R}^{TPSG}$, and the environment transitions to the next state following $\mathcal{P}^{TPSG}$. The descriptions of ERSG and APSG are similar to that of TPSG and are omitted here. Note that TPSG, ERSG and APSG are inherently coupled. In the following subsection, we propose a novel approach, called TLRL, to obtain the equilibria of these three coupled multi-agent stochastic games.

B. TLRL Approach

The transitions of states and actions of TPSG, ERSG, and APSG satisfy the Markov property, because all joint actions, i.e., $a^{TPSG}(t)$, $a^{ERSG}(t)$ and $a^{APSG}(t)$, at time slot $t$ only depend on the environment states at time slot $t$, i.e., $s^{TPSG}(t)$, $s^{ERSG}(t)$ and $s^{APSG}(t)$. Thereby, in this paper, we characterize each UAV's strategic decision process in TPSG, ERSG and APSG by three Markov decision processes (MDPs).

MDP for each UAV in TPSG: With the aim of finding the optimal trajectories for all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in TPSG can be modelled as an MDP $(\mathcal{S}^{TPSG}, \mathcal{A}_m^{TPSG}, \mathcal{R}_m^{TPSG}, \mathcal{P}^{TPSG})$.

1) Environment State for Each UAV in TPSG: The environment state $s^{TPSG}(t) \in \mathcal{S}^{TPSG}$ for UAV $m \in \mathcal{M}$ in TPSG at time slot $t$ consists of all UAVs' positions $U_m(t), m \in \mathcal{M}$, and application placements $w_m(t), m \in \mathcal{M}$, which can be expressed as $s^{TPSG}(t) = (U_m(t), w_m(t))_{m \in \mathcal{M}}$.

2) Action for Each UAV in TPSG: At time slot $t$, UAV $m \in \mathcal{M}$ chooses an action $a_m^{TPSG}(t) \in \mathcal{A}_m^{TPSG}$, where $\mathcal{A}_m^{TPSG}$ is the set consisting of four possible actions, i.e., moving forward, backward, left or right.

3) Reward of Each UAV in TPSG: The immediate reward of UAV $m \in \mathcal{M}$ at time slot $t$ is given by
$$R_m^{TPSG}(t) = \frac{\kappa_m(t)\, Task_m^{comp}(t)}{E_m^{comp}(t) + E_m^{pro}}, \quad (10)$$
where the numerator indicates the size of the tasks computed by UAV $m$ at time slot $t$, and the denominator represents the energy consumption of UAV $m$ at time slot $t$.

4) State Transition Probabilities of UAVs in TPSG: The state transition probability from state $s^{TPSG}$ to $s^{TPSG\prime}$ by taking the joint action $a^{TPSG}(t) = (a_1^{TPSG}(t), a_2^{TPSG}(t), \ldots, a_M^{TPSG}(t))$ can be expressed as $\mathcal{P}_{s^{TPSG}, s^{TPSG\prime}}^{TPSG}(a^{TPSG}(t)) = Pr\big(s^{TPSG}(t+1) = s^{TPSG\prime} \,\big|\, s^{TPSG}(t) = s^{TPSG}, a^{TPSG}(t)\big)$.

MDP for each UAV in ERSG: With the aim of designing the optimal schedule of energy renewal for all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in ERSG can be modelled as an MDP $(\mathcal{S}^{ERSG}, \mathcal{A}_m^{ERSG}, \mathcal{R}_m^{ERSG}, \mathcal{P}^{ERSG})$.

1) Environment State for Each UAV in ERSG: The environment state $s^{ERSG}(t) \in \mathcal{S}^{ERSG}$ for UAV $m \in \mathcal{M}$ in ERSG at time slot $t$ consists of all UAVs' remaining energies $E_m^{remain}(t), m \in \mathcal{M}$, and positions $U_m(t), m \in \mathcal{M}$, which can be expressed as $s^{ERSG}(t) = (E_m^{remain}(t), U_m(t))_{m \in \mathcal{M}}$.

2) Action for Each UAV in ERSG: UAV $m \in \mathcal{M}$ chooses an action $a_m^{ERSG}(t) \in \mathcal{A}_m^{ERSG}$ at time slot $t$, where $\mathcal{A}_m^{ERSG}$ is the set consisting of two actions, i.e., returning to the depot or not.

3) Reward of Each UAV in ERSG: The immediate reward of UAV $m \in \mathcal{M}$ at time slot $t$ is given by
$$R_m^{ERSG}(t) = \begin{cases} -10, & \text{if constraint (6) is violated}, \\ \kappa_m(t), & \text{otherwise}. \end{cases} \quad (11)$$
This reward function prompts UAVs to hover over the target region providing MEC services without violating (6).

The definition of the state transition probabilities $\mathcal{P}^{ERSG}$ of UAVs in ERSG is similar to that in TPSG and is omitted here.

MDP for each UAV in APSG: With the aim of producing the optimal policy for updating the application placement of all UAVs, the individual decision making problem for each UAV $m \in \mathcal{M}$ in APSG can be defined as an MDP $(\mathcal{S}^{APSG}, \mathcal{A}_m^{APSG}, \mathcal{R}_m^{APSG}, \mathcal{P}^{APSG})$.

1) Environment State for Each UAV in APSG: The environment state $s^{APSG}(t) \in \mathcal{S}^{APSG}$ for UAV $m \in \mathcal{M}$ at time slot $t$ consists of the applications placed in all UAVs, $w_m(t), m \in \mathcal{M}$, and the amount of task requests from IoT devices covered by UAV $m$ before $t$, i.e., $\theta_m(t) = \sum_{\tau=1}^{t} \sum_{n \in \mathcal{G}_m} v_n(\tau), m \in \mathcal{M}$, and thus $s^{APSG}(t) = (w_m(t), \theta_m(t))_{m \in \mathcal{M}}$.

2) Action for Each UAV in APSG: UAV $m \in \mathcal{M}$ chooses an action $a_m^{APSG}(t) \in \mathcal{A}_m^{APSG}$ at time slot $t$, signifying that it selects, from the total $C$ application types, a subset fitting its storage capacity $S_m$.

3) Reward of Each UAV in APSG: The immediate reward of UAV $m \in \mathcal{M}$ in APSG at time slot $t$ is given by
$$R_m^{APSG}(t) = \frac{e(t)}{C} \sum_{\tau=1}^{t} \sum_{n \in \mathcal{G}_m} v_n(\tau)\, w_m(\tau)^\top, \quad (12)$$
where $e(t)$ indicates the number of application types placed in all UAVs at time slot $t$. This reward function guides UAVs to place popular yet diverse applications according to the history of providing MEC services.

The definition of the state transition probabilities $\mathcal{P}^{APSG}$ is similar to that in TPSG and is omitted here.
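For concreteness, the two reward functions (11) and (12) can be sketched as below; the argument layout is our own choice, not the authors' interface.

```python
def ersg_reward(kappa_m, qos_satisfied):
    """R^ERSG_m(t) of (11): -10 if constraint (6) is violated, else kappa_m(t)."""
    return -10.0 if not qos_satisfied else float(kappa_m)

def apsg_reward(e_t, n_types, requests, placements):
    """R^APSG_m(t) of (12): (e(t)/C) * sum_{tau<=t} sum_{n in G_m} v_n(tau) w_m(tau)^T.

    requests[tau]   - list of 0/1 task-request vectors v_n(tau), n in G_m
    placements[tau] - 0/1 placement vector w_m(tau)
    e_t             - number of application types placed across all UAVs at slot t
    """
    matched = 0
    for tau in range(len(requests)):
        w = placements[tau]
        for v in requests[tau]:
            matched += sum(vc * wc for vc, wc in zip(v, w))
    return e_t / n_types * matched
```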

Based on the above three MDP formulations, we develop a novel triple learner (i.e., trajectory learner, energy learner and application learner) based reinforcement learning approach to obtain the equilibria of these three coupled multi-agent stochastic games. Specifically, each UAV runs three Q-learning algorithms to learn the optimal Q value of each state-action pair and to obtain the optimal local policies of the trajectory learner, energy learner and application learner. It is worth noting that, since trajectory planning, energy renewal and application placement are tightly coupled, these three learners have to run in a back-and-forth manner.

1) Settings for Trajectory Learner: The policy $\pi_m^{TPSG}: \mathcal{S}^{TPSG} \to \mathcal{A}_m^{TPSG}$ of the trajectory learner in UAV $m \in \mathcal{M}$, meaning a mapping from the environment state set to the action set, signifies a probability distribution over actions $a_m^{TPSG} \in \mathcal{A}_m^{TPSG}$ in a given state $s^{TPSG}$. Particularly, for UAV $m$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$, the trajectory policy of the trajectory learner in UAV $m$ can be presented as $\pi_m^{TPSG}(s^{TPSG}) = \{\pi_m^{TPSG}(s^{TPSG}, a_m^{TPSG}) \mid a_m^{TPSG} \in \mathcal{A}_m^{TPSG}\}$, where $\pi_m^{TPSG}(s^{TPSG}, a_m^{TPSG})$ is the probability of UAV $m$ selecting action $a_m^{TPSG}$ in state $s^{TPSG}$.

In Q-learning, the process of building the trajectory policy $\pi_m^{TPSG}$ is significantly affected by the trajectory learner's Q function. The Q function of the trajectory learner in UAV $m$ is the expected reward obtained by executing action $a_m^{TPSG} \in \mathcal{A}_m^{TPSG}$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$ under the given policy $\pi_m^{TPSG}$, which can be expressed by
$$Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, \pi_m^{TPSG}) = \mathbb{E}\Big(\sum_{\tau=0}^{\infty} \gamma^\tau R_m^{TPSG}(t+\tau+1) \,\Big|\, s^{TPSG}(t) = s^{TPSG},\, a^{TPSG}(t) = a^{TPSG},\, \pi_m^{TPSG}\Big), \quad (13)$$
where $\gamma \in [0, 1]$ is a constant discount factor, and the results of (13) are termed action values, i.e., Q values.

The trajectory learner in UAV $m \in \mathcal{M}$ selects an action $a_m^{TPSG}(t) \in \mathcal{A}_m^{TPSG}$ according to its Q function at slot $t$. To strike a balance between exploration and exploitation, we consider an $\epsilon$-greedy exploration strategy for the trajectory learner. Specifically, the trajectory learner in UAV $m \in \mathcal{M}$ selects a random action $a_m^{TPSG} \in \mathcal{A}_m^{TPSG}$ in state $s^{TPSG} \in \mathcal{S}^{TPSG}$ with probability $\epsilon$, and selects the best action $a_m^{TPSG*}$ with probability $(1 - \epsilon)$, where the best action satisfies $Q_m^{TPSG}(s^{TPSG}, a^{TPSG*}, \pi_m^{TPSG}) \ge Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, \pi_m^{TPSG}), \forall a^{TPSG} \in \mathcal{A}^{TPSG}$, with $a_m^{TPSG*}$ being the $m$-th element of $a^{TPSG*}$. Besides, if the later-described energy learner in UAV $m$ selects to return to the depot, the trajectory learner will not choose any action in $\mathcal{A}_m^{TPSG}$. Then, the probability of selecting action $a_m^{TPSG} \in \mathcal{A}_m^{TPSG}$ in state $s^{TPSG}$ can be expressed by
$$\pi_m^{TPSG}(s^{TPSG}, a_m^{TPSG}) = \begin{cases} 0, & \text{if UAV } m \text{ decides to return to the depot}, \\ 1 - \epsilon, & \text{if } Q_m^{TPSG}(s^{TPSG}, \cdot, \cdot) \text{ of } a_m^{TPSG} \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$

In the Q value update step of Q-learning, the trajectory learner in each UAV $m \in \mathcal{M}$ follows the update rule:
$$Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, t+1) = Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, t) + \beta^{TPSG}\Big(R_m^{TPSG}(t) + \gamma \max_{a^{TPSG\prime} \in \mathcal{A}^{TPSG}} Q_m^{TPSG}(s^{TPSG\prime}, a^{TPSG\prime}, t) - Q_m^{TPSG}(s^{TPSG}, a^{TPSG}, t)\Big), \quad (14)$$
where $\beta^{TPSG}$ denotes the learning rate in TPSG.
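The trajectory learner thus reduces to standard tabular Q-learning with an $\epsilon$-greedy policy. The following sketch (our own illustrative code, not the authors' implementation) realizes the policy above and the update rule (14) for one UAV, with states encoded as hashable keys:

```python
import random
from collections import defaultdict

class TrajectoryLearner:
    """Minimal tabular Q-learner for TPSG, for a single UAV m."""

    ACTIONS = ("forward", "backward", "left", "right")  # A^TPSG_m

    def __init__(self, beta=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)  # Q^TPSG_m(s, a), initialized to 0
        self.beta, self.gamma, self.epsilon = beta, gamma, epsilon

    def select_action(self, state, returning_to_depot=False):
        """pi^TPSG_m: explore with probability epsilon, otherwise exploit;
        no trajectory action is taken if the energy learner chose the depot."""
        if returning_to_depot:
            return None
        if random.random() < self.epsilon:
            return random.choice(self.ACTIONS)
        return max(self.ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One application of the update rule (14)."""
        best_next = max(self.q[(next_state, a)] for a in self.ACTIONS)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.beta * td_error
```

The energy and application learners below follow the same pattern, differing only in their action sets and rewards.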

2) Settings for Energy Learner: The policy of the energy learner in UAV $m \in \mathcal{M}$ is expressed as $\pi_m^{ERSG}: \mathcal{S}^{ERSG} \to \mathcal{A}_m^{ERSG}$. Here, the Q function of the energy learner in UAV $m \in \mathcal{M}$ can be expressed by
$$Q_m^{ERSG}(s^{ERSG}, a^{ERSG}, \pi_m^{ERSG}) = \mathbb{E}\Big(\sum_{\tau=0}^{\infty} \gamma^\tau R_m^{ERSG}(t+\tau+1) \,\Big|\, s^{ERSG}(t) = s^{ERSG},\, a^{ERSG}(t) = a^{ERSG},\, \pi_m^{ERSG}\Big). \quad (15)$$
The energy learner in UAV $m \in \mathcal{M}$ selects an action $a_m^{ERSG} \in \mathcal{A}_m^{ERSG}$ (i.e., whether to return to the depot) also according to the $\epsilon$-greedy exploration strategy. Then, we have
$$\pi_m^{ERSG}(s^{ERSG}, a_m^{ERSG}) = \begin{cases} 1 - \epsilon, & \text{if } Q_m^{ERSG}(s^{ERSG}, \cdot, \cdot) \text{ of } a_m^{ERSG} \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$
The energy learner in UAV $m \in \mathcal{M}$ follows the update rule:
$$Q_m^{ERSG}(s^{ERSG}, a^{ERSG}, t+1) = Q_m^{ERSG}(s^{ERSG}, a^{ERSG}, t) + \beta^{ERSG}\Big(R_m^{ERSG}(t) + \gamma \max_{a^{ERSG\prime} \in \mathcal{A}^{ERSG}} Q_m^{ERSG}(s^{ERSG\prime}, a^{ERSG\prime}, t) - Q_m^{ERSG}(s^{ERSG}, a^{ERSG}, t)\Big), \quad (16)$$
where $\beta^{ERSG}$ denotes the learning rate in ERSG.

3) Settings for Application Learner: The policy of the application learner in UAV $m \in \mathcal{M}$ is $\pi_m^{APSG}: \mathcal{S}^{APSG} \to \mathcal{A}_m^{APSG}$. Here, the Q function of the application learner in UAV $m \in \mathcal{M}$ can be expressed by
$$Q_m^{APSG}(s^{APSG}, a^{APSG}, \pi_m^{APSG}) = \mathbb{E}\Big(\sum_{\tau=0}^{\infty} \gamma^\tau R_m^{APSG}(t+\tau+1) \,\Big|\, s^{APSG}(t) = s^{APSG},\, a^{APSG}(t) = a^{APSG},\, \pi_m^{APSG}\Big). \quad (17)$$
The application learner in UAV $m \in \mathcal{M}$ selects an action $a_m^{APSG} \in \mathcal{A}_m^{APSG}$ also according to the $\epsilon$-greedy exploration strategy. Then, we have
$$\pi_m^{APSG}(s^{APSG}, a_m^{APSG}) = \begin{cases} 1 - \epsilon, & \text{if } Q_m^{APSG}(s^{APSG}, \cdot, \cdot) \text{ of } a_m^{APSG} \text{ is the highest}, \\ \epsilon, & \text{otherwise}. \end{cases}$$
The update rule of the application learner in UAV $m \in \mathcal{M}$ is
$$Q_m^{APSG}(s^{APSG}, a^{APSG}, t+1) = Q_m^{APSG}(s^{APSG}, a^{APSG}, t) + \beta^{APSG}\Big(R_m^{APSG}(t) + \gamma \max_{a^{APSG\prime} \in \mathcal{A}^{APSG}} Q_m^{APSG}(s^{APSG\prime}, a^{APSG\prime}, t) - Q_m^{APSG}(s^{APSG}, a^{APSG}, t)\Big), \quad (18)$$
where $\beta^{APSG}$ denotes the learning rate in APSG.

In summary, the proposed TLRL approach is illustrated in detail in Algorithm 1.

Algorithm 1: TLRL Approach
1:  for m = 1 to M do
2:      Initialize Q values Q_m^TPSG = Q_m^ERSG = Q_m^APSG = 0;
3:  Set the maximal iteration counter LOOP and loop = 0;
4:  for loop < LOOP do
5:      t = 0;
6:      for m = 1 to M do
7:          Send Q_m^TPSG, Q_m^ERSG and Q_m^APSG to the other UAVs;
8:      while t ≤ T do
9:          Observe states s^TPSG, s^ERSG and s^APSG;
10:         for m = 1 to M do
11:             UAV m selects a_m^ERSG according to π_m^ERSG;
12:             if UAV m returns to the depot then
13:                 UAV m selects a_m^APSG according to π_m^APSG;
14:             else
15:                 UAV m selects a_m^TPSG according to π_m^TPSG;
16:             Obtain rewards R_m^TPSG, R_m^ERSG and R_m^APSG;
17:             Update Q_m^TPSG, Q_m^ERSG and Q_m^APSG according to (14), (16) and (18), respectively;
18:             Send Q_m^TPSG, Q_m^ERSG and Q_m^APSG to the other UAVs;
19:         Set t = t + 1;
20:     Set loop = loop + 1.
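Read as code, the main loop of Algorithm 1 could be organized as below. This is a hedged sketch: the environment object `env` and the per-UAV bundle of three learners are hypothetical stand-ins with the `select_action`/`update` interface of the `TrajectoryLearner` above, and the Q-table exchange of lines 7 and 18 is abbreviated to a comment.

```python
def tlrl(env, uavs, n_loops, horizon):
    """Compact sketch of Algorithm 1; `uavs[m]` bundles .trajectory, .energy
    and .application learners, `env` is a hypothetical simulator."""
    for _ in range(n_loops):
        env.reset()
        for _ in range(horizon):
            s_tp, s_er, s_ap = env.observe()                    # line 9
            actions = []
            for uav in uavs:                                    # lines 10-15
                a_er = uav.energy.select_action(s_er)           # return to depot or not
                if a_er == "return":
                    a = uav.application.select_action(s_ap)     # pick new placement
                else:
                    a = uav.trajectory.select_action(s_tp)      # pick flight direction
                actions.append((a_er, a))
            rewards, (s_tp2, s_er2, s_ap2) = env.step(actions)  # line 16
            for m, uav in enumerate(uavs):                      # line 17
                r_tp, r_er, r_ap = rewards[m]
                a_er, a = actions[m]
                uav.energy.update(s_er, a_er, r_er, s_er2)
                if a_er == "return":
                    uav.application.update(s_ap, a, r_ap, s_ap2)
                else:
                    uav.trajectory.update(s_tp, a, r_tp, s_tp2)
                # line 18: broadcast the updated Q tables to the other UAVs
            s_tp, s_er, s_ap = s_tp2, s_er2, s_ap2
```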

TABLE I: Simulation Parameters

| Param. | Value | Param. | Value | Param. | Value |
|---|---|---|---|---|---|
| M | 3 | B | 10 MHz | C | 10 |
| N | 300 | D_n | [2, 5] MB | V | 20 m/s |
| t^hover | 5 s | ξ | 10^-18 | f | 3 GHz |
| q | 100 m | p_n^tran | [0.2, 0.5] W | H | 120 m |
| S_m | 6 GB | μ_c | [1, 3] GB | φ | -174 dBm/Hz |
| a, b | 9.6117, 0.1581 | f_m^U | 2 Mbps | Target region | 1000 m × 1000 m |

IV. SIMULATION RESULTS

In this section, simulations are conducted to evaluate the performance of the proposed TLRL approach for [P1]. Table I lists the values of all simulation parameters, and the propulsion power model follows [11]. Similar settings have also been employed in [9], [12].
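For reference, the Table I values map directly onto a configuration object; one possible encoding (our own naming, with units noted inline and MB/Mbps converted to bits) is:

```python
SIM_PARAMS = {
    "M": 3,                      # number of UAVs
    "N": 300,                    # number of IoT devices
    "C": 10,                     # number of task/application types
    "B": 10e6,                   # uplink bandwidth (Hz)
    "D_n": (2 * 8e6, 5 * 8e6),   # task size range (bits, from [2, 5] MB)
    "V": 20.0,                   # UAV velocity (m/s)
    "t_hover": 5.0,              # hovering time (s)
    "xi": 1e-18,                 # capacitance coefficient
    "f": 3e9,                    # carrier frequency (Hz)
    "q": 100.0,                  # grid side length (m)
    "p_tran": (0.2, 0.5),        # device transmit power range (W)
    "H": 120.0,                  # flight altitude (m)
    "S_m": 6.0,                  # UAV storage (GB)
    "mu_c": (1.0, 3.0),          # application size range (GB)
    "phi": -174.0,               # noise PSD (dBm/Hz)
    "a_b": (9.6117, 0.1581),     # LoS model constants
    "f_U": 2e6,                  # UAV computing capacity (bit/s, 2 Mbps)
    "region": (1000.0, 1000.0),  # target region (m x m)
}
```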

For comparison purposes, we introduce an energy efficiency oriented trajectory planning (EOTP) algorithm and an existing algorithm called decentralized multiple UAVs cooperative reinforcement learning (DMUCRL) [9] as benchmarks. EOTP determines the trajectories of all UAVs with the aim of maximizing the energy efficiency, but asks UAVs to return to the depot for energy renewal only when their batteries are exhausted, and it does not enable the update of application placements. DMUCRL is originally designed to maximize the energy efficiency of UAVs in downlink content sharing by controlling all UAVs to work collaboratively based on double Q-learning (each UAV contains a trajectory learner and an energy learner).

Fig. 2: Energy efficiency w.r.t. transmission power of IoT devices. Fig. 3: Energy efficiency w.r.t. storage capacity of each UAV (for grid sizes q = 50 m, 100 m and 200 m). Fig. 4: Energy efficiency w.r.t. UAV hovering time.

It can be observed from Fig. 2 that the energy efficiency first increases and then becomes stable as the transmission power of IoT devices grows. This is because, with a larger transmission power, IoT devices offload more tasks to their associated UAVs, thereby increasing the amount of tasks processed by UAVs.

However, since the computing capacity of each UAV is limited, this increasing trend slows down as the capacity limit is approached. More importantly, the figure shows that the proposed TLRL outperforms both DMUCRL and EOTP. The reasons are that i) each UAV under EOTP returns to the depot directly once its energy is exhausted, regardless of other UAVs; ii) each UAV's application placement is fixed under DMUCRL, so it can serve only a limited set of IoT devices; and iii) the proposed TLRL addresses both of these shortcomings.

Fig. 3 shows all UAVs' energy efficiency under different UAV storage capacities and grid size settings. Specifically, UAVs can adjust their downlink transmission ranges so as to adjust the side length q of the grids. It can be seen from Fig. 3 that the larger the grid size is, the higher the energy efficiency of all UAVs becomes. This is because, with a larger grid size, more IoT devices are included in a grid, and thereby each UAV can potentially process more offloaded tasks. Besides, the energy efficiency of all UAVs increases monotonically with the storage capacity of each UAV. The reason is that, with a larger storage capacity, more types of applications can be placed in each UAV, so that more tasks may be processed.

It can be observed from Fig. 4 that the energy efficiency of all UAVs first increases with the UAV hovering time and then decreases. This is because, as the hovering time grows, more offloaded tasks from IoT devices can be computed by UAVs during hovering. However, once all tasks have been completely processed, the UAVs become idle and consume hovering energy over the target region until the hovering time expires. Additionally, the proposed TLRL again outperforms both DMUCRL and EOTP, and the explanations are similar to those for Fig. 2.

V. CONCLUSION

In this paper, an energy efficient scheduling problem for multi-UAV assisted MEC has been studied. With the aim of maximizing the long-term energy efficiency of all UAVs, a joint optimization of UAVs' trajectory planning, energy renewal and application placement is formulated. By taking into account the inherent cooperation and competition among UAVs, we reformulate this optimization problem as three coupled multi-agent stochastic games, and then propose a novel TLRL approach for reaching equilibria. Simulation results show that, compared to counterparts, the proposed TLRL approach can significantly increase the energy efficiency of all UAVs.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 62002164 and No. 62176122, and by the Postgraduate Research & Practice Innovation Program of NUAA under Grant No. xcxjh20221614.

REFERENCES

[1] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and A. Nallanathan, "Deep reinforcement learning based dynamic trajectory control for UAV-assisted mobile edge computing," IEEE Trans. Mob. Comput., vol. 21, no. 10, pp. 3536–3550, Oct. 2020.
[2] Y. Shi, C. Yi, B. Chen, C. Yang, K. Zhu, and J. Cai, "Joint online optimization of data sampling rate and preprocessing mode for edge–cloud collaboration-enabled industrial IoT," IEEE Internet Things J., vol. 9, no. 17, pp. 16402–16417, 2022.
[3] C. Dai, K. Zhu, and E. Hossain, "Multi-agent deep reinforcement learning for joint decoupled user association and trajectory design in full-duplex multi-UAV networks," IEEE Trans. Mob. Comput., pp. 1–15, 2022.
[4] J. Ji, K. Zhu et al., "Energy consumption minimization in UAV-assisted mobile-edge computing systems: Joint resource allocation and trajectory design," IEEE Internet Things J., vol. 8, no. 10, pp. 8570–8584, 2021.
[5] J. Chen, C. Yi et al., "Learning aided joint sensor activation and mobile charging vehicle scheduling for energy-efficient WRSN-based industrial IoT," IEEE Trans. Veh. Technol., pp. 1–15, 2022.
[6] G. Zheng, C. Xu, M. Wen, and X. Zhao, "Service caching based aerial cooperative computing and resource allocation in multi-UAV enabled MEC systems," IEEE Trans. Veh. Technol., pp. 1–14, 2022.
[7] Y. Zhao, Z. Li, N. Cheng, R. Zhang, B. Hao, and X. Shen, "UAV deployment strategy for range-based space-air integrated localization network," in Proc. IEEE GLOBECOM, 2019, pp. 1–6.
[8] L. Yang, H. Yao et al., "Multi-UAV deployment for MEC enhanced IoT networks," in Proc. IEEE ICCC, 2020, pp. 436–441.
[9] C. Zhao, J. Liu, M. Sheng, W. Teng, Y. Zheng, and J. Li, "Multi-UAV trajectory planning for energy-efficient content coverage: A decentralized learning-based approach," IEEE J. Sel. Areas Commun., vol. 39, no. 10, pp. 3193–3207, Oct. 2021.
[10] H. Mei, K. Yang, Q. Liu, and K. Wang, "Joint trajectory-resource optimization in UAV-enabled edge-cloud system with virtualized mobile clone," IEEE Internet Things J., vol. 7, no. 7, pp. 5906–5921, Jul. 2020.
[11] Y. Zeng, J. Xu, and R. Zhang, "Energy minimization for wireless communication with rotary-wing UAV," IEEE Trans. Wirel. Commun., vol. 18, no. 4, pp. 2329–2345, Apr. 2019.
[12] B. Liu, Y. Wan, F. Zhou, Q. Wu, and R. Hu, "Resource allocation and trajectory design for MISO UAV-assisted MEC networks," IEEE Trans. Veh. Technol., vol. 71, no. 5, pp. 4933–4948, May 2022.