Actor-Critic Deep Reinforcement Learning for
Energy Minimization in UAV-Aided Networks
Yaxiong Yuan, Lei Lei, Thang X. Vu, Symeon Chatzinotas, and Björn Ottersten
Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg
Emails: {yaxiong.yuan; lei.lei; thang.vu; symeon.chatzinotas; bjorn.ottersten}@uni.lu
Abstract—In this paper, we investigate a user-timeslot scheduling problem for downlink unmanned aerial vehicle (UAV)-aided networks, where the UAV serves as an aerial base station. We formulate an optimization problem that jointly determines user scheduling and hovering time to minimize the UAV's transmission and hovering energy. An offline algorithm is proposed to solve the problem based on the branch and bound method and the golden section search. However, executing the offline algorithm suffers from the exponential growth of computational time. Therefore, we apply a deep reinforcement learning (DRL) method to design an online algorithm with less computational time. To this end, we first reformulate the original user scheduling problem as a Markov decision process (MDP). Then, an actor-critic-based RL algorithm is developed to determine the scheduling policy under the guidance of two deep neural networks. Numerical results show that the proposed online algorithm achieves a good tradeoff between performance gain and computational time.
Index Terms—UAV-aided networks, deep reinforcement learning, actor-critic, user scheduling, energy minimization.
I. INTRODUCTION
Unmanned aerial vehicles (UAVs) are widely used in many areas. Two promising features of UAVs are flexibility and mobility, which allow them to be deployed in dynamic, distributed, or plug-and-play environments, e.g., disaster rescue and live concerts [1]. Since UAVs are more likely to enjoy line-of-sight (LoS) connections with ground users, they are favorable for reliable communications [2]. Owing to these benefits, applications of UAV-aided wireless networks have been emerging. We consider a UAV-aided downlink scenario where the UAV serves as an aerial base station (BS) to deliver data to ground users when some of the terrestrial BSs are destroyed after a disaster. The design of an energy-efficient UAV network is necessary, as the battery storage of the UAV is limited.
The energy consumption of the UAV mainly comes from propulsion, i.e., the energy used for flying and hovering, and from communication. A proper user-timeslot scheduling scheme is effective in achieving energy conservation for UAV systems [3]-[5]. In [3], the authors introduced an energy model and proposed a user scheduling method to minimize the energy consumption of the UAV. In [4] and [5], the authors studied an energy-efficiency maximization problem via joint UAV-to-user scheduling, power control, and trajectory design. The above papers focused on single-antenna UAV networks where the users are served in time division multiple access (TDMA) mode. Equipped with multiple antennas, the UAV can transmit data to multiple users simultaneously to improve the network capacity, which is known as space division multiple access (SDMA). In [6], the authors proposed a semiorthogonal user selection (SUS) algorithm for terrestrial multiple-input multiple-output (MIMO) systems. However, the SDMA-based user scheduling problem for UAV networks is more difficult than its TDMA counterpart due to the combinatorial explosion of the user groups. Moreover, the diversified sources of the UAV's energy consumption make the problem more complicated.
Deep reinforcement learning (DRL) combines deep neural networks with a reinforcement learning architecture and can provide efficient algorithms. Since DRL makes decisions based on the current environment state, it is well suited to dynamic systems with, e.g., UAV movement, time-varying channels, and new user arrivals. In [7], the authors proposed a user association algorithm based on a deep Q-network, where non-linear deep neural networks are used to approximate the value function. In [8], the authors proposed an echo state network-based DRL algorithm for joint path selection, UAV-BS association, and power control. In [9], an actor-critic-based deep deterministic policy gradient algorithm was applied to the problem of flying direction and distance selection. Actor-critic learning can acquire a stochastic policy to deal with a very large or continuous action space. In our problem, the combinatorial nature of the scheduling problem leads to a huge action space. Therefore, we employ actor-critic-based DRL to develop an online user scheduling algorithm. Our major contributions are as follows:
• For energy minimization, we formulate a combinatorial optimization problem and propose an offline algorithm to solve it.
• Implementing the offline algorithm in practice is difficult since it suffers from long computational time. To overcome this difficulty, we reformulate the original optimization problem as an MDP and design an actor-critic-based DRL algorithm.
• Simulation results demonstrate that the proposed DRL algorithm strikes a good tradeoff between performance gain and computational time.
II. SYSTEM MODEL
In the UAV-aided downlink system, the UAV serves as the BS to deliver data to the ground users. The service area is divided into $N$ clusters due to the limited service range of the UAV. Before the service starts, the UAV is fully charged and prepared at a dock station. Then, the UAV flies through all the clusters successively at a fixed altitude and transmits data to the ground users. We denote $K_n$ and $q_{k,n}$ (in bits) as the number of users and the $k$-th user's demand in the $n$-th cluster, respectively. When all the demands in the current cluster are satisfied, the UAV leaves for the next cluster. After the service, the UAV flies back to the dock station and prepares for the next round. New users may arrive during the service; their requests will be processed in the next round. Fig. 1 illustrates an example of the system model.
Fig. 1. A UAV network with $N = 3$ clusters.
The delivery process evolves over a sequence of frames whose structure is standardized. Since the data collected by the UAV have a certain life span, in each round all the tasks must be completed within a limited time $T_L$ (in frames). We assume that a frame lasts $T_F$ (in seconds) and consists of $I$ timeslots; thus, each timeslot lasts $T_I = T_F / I$ (in seconds). Under SDMA, the UAV can schedule more than one user in each timeslot. As shown in Fig. 2, a shaded block indicates that the user is scheduled. We define the scheduled users as a user group; if no user is scheduled, the group is the empty set. The number of users in each cluster is up to $K$, so the maximum number of candidate groups is $G = 2^K$, which increases exponentially with $K$; for $K = 10$, for instance, $G = 1024$ (see the sketch after Fig. 2).
Fig. 2. An illustration of the structure of the frame.
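To make this combinatorial structure concrete, the following minimal Python sketch (illustrative only, not part of the original paper) enumerates the candidate user groups of one cluster as bitmasks:

```python
# Illustrative sketch: enumerating the G = 2**K candidate user groups of
# one cluster as bitmasks, showing why the action space grows
# exponentially with K.
K = 4                      # users in the cluster (K = 10 in the paper)
G = 2 ** K                 # number of candidate groups, empty set included

def group_members(g: int) -> list[int]:
    """Users contained in group g, where bit k of g selects user k."""
    return [k for k in range(K) if g & (1 << k)]

for g in range(G):
    print(g, group_members(g))   # group 0 is the empty group
```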
For the UAV-to-ground communication links, we consider a quasi-static Rician fading channel, as it comprises both a deterministic LoS component and a random multipath component [10]. The channel states are fixed within a transmission frame. We denote $\mathbf{h}_{k,n} = \mathbf{g}_{k,n} 10^{-\xi_{k,n}/10}$ as the channel vector between the UAV and ground user $k$ in cluster $n$, where $\xi_{k,n}$ is the path loss. We assume $L$ is the number of antennas of the UAV, while the ground users are each equipped with a single antenna. $\mathbf{g}_{k,n} = [g_{k,n,1}, \dots, g_{k,n,L}]$ is the Rician fading vector, whose elements are independent of each other. We collect the users' channel vectors to form a matrix $\mathbf{H}_n \in \mathbb{C}^{K_n \times L}$.
Towards eliminating multi-user interference within a group, minimum mean square error (MMSE) precoding is applied [11]. The precoding vector for user $k$ is calculated by $\mathbf{w}_{k,n} = \sqrt{p}\,\hat{\mathbf{h}}_{k,n}$, where $\hat{\mathbf{h}}_{k,n}$ is the column corresponding to user $k$ in $\mathbf{H}_n^H(\sigma^2\mathbf{I} + \mathbf{H}_n\mathbf{H}_n^H)^{-1}$. We normalize the precoder $\hat{\mathbf{h}}_{k,n}$ and assume the power allocation $p$ is the same for all the users. $\sigma^2$ refers to the noise power. Denote $\beta_n^{(kj)} = |\mathbf{h}_{k,n}^H \hat{\mathbf{h}}_{j,n}|^2$ as the channel coefficient after precoding. The signal-to-interference-plus-noise ratio (SINR) for user $k$ is given by:

$$\mathrm{SINR}_{k,g,n} = \frac{\beta_n^{(kk)}\, p}{\sum_{j \in \mathcal{K}_g \setminus \{k\}} \beta_n^{(kj)}\, p + \sigma^2},\quad k \in \mathcal{K}_g,\ g \in \mathcal{G}_n, \qquad (1)$$
where $\mathcal{K}_g$ is the set of users in group $g$ and $\mathcal{G}_n$ is the set of candidate groups in cluster $n$. Let $B$ be the system bandwidth. The data transmitted in each timeslot can be expressed by the Shannon equation:

$$d_{k,g,n} = T_I B \log_2(1 + \mathrm{SINR}_{k,g,n}),\quad k \in \mathcal{K}_g,\ g \in \mathcal{G}_n. \qquad (2)$$

The communication energy of group $g$ is given by:

$$e_{g,n}^{(c)} = T_I \sum_{k \in \mathcal{K}_g} \beta_n^{(kk)}\, p,\quad g \in \mathcal{G}_n. \qquad (3)$$

To facilitate the following calculations, we collect $d_{k,g,n}$ to form a data matrix for each cluster, $\mathbf{D}_n = \{d_{k,g,n}\}_{K \times G}$, where the element $d_{k,g,n}$ is set to 0 if $k \notin \mathcal{K}_g$ or $g \notin \mathcal{G}_n$. $\mathbf{e}_n^{(c)} = [e_{1,n}^{(c)}, \dots, e_{G,n}^{(c)}]$ is denoted as the vector of communication energy for all the groups in cluster $n$.
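As a concrete illustration of Eqs. (1)-(3), the following NumPy sketch (our own, with assumed parameter values; the Rician structure is simplified to an i.i.d. complex Gaussian channel) computes the MMSE precoders, per-user SINRs, and a group's communication energy:

```python
# A minimal NumPy sketch (not the authors' code) of Eqs. (1)-(3):
# MMSE precoding, per-user SINR within a group, and the group's
# communication energy. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
K_n, L = 4, 10            # users in the cluster, UAV antennas
p, sigma2 = 3.0, 1e-4     # per-user power (W) and noise power (W)
T_I, B = 0.01, 10e6       # assumed timeslot length (s) and bandwidth (Hz)

# Channel matrix H_n (K_n x L); row k holds h_k^H. An i.i.d. complex
# Gaussian stands in for g_{k,n} * 10^(-xi_{k,n}/10) for simplicity.
H = (rng.standard_normal((K_n, L)) + 1j * rng.standard_normal((K_n, L))) / np.sqrt(2)

# MMSE precoders: columns of H^H (sigma^2 I + H H^H)^{-1}, normalized.
W = H.conj().T @ np.linalg.inv(sigma2 * np.eye(K_n) + H @ H.conj().T)
W = W / np.linalg.norm(W, axis=0, keepdims=True)

beta = np.abs(H @ W) ** 2          # beta[k, j] = |h_k^H h_hat_j|^2

group = [0, 2, 3]                  # an example user group K_g
for k in group:
    interf = sum(beta[k, j] for j in group if j != k) * p
    sinr = beta[k, k] * p / (interf + sigma2)              # Eq. (1)
    d = T_I * B * np.log2(1 + sinr)                        # Eq. (2)
    print(f"user {k}: SINR={sinr:.2f}, data={d/1e6:.3f} Mbit")

e_c = T_I * sum(beta[k, k] for k in group) * p             # Eq. (3)
print(f"group communication energy: {e_c:.6f} J")
```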
To analyze the propulsion energy, we employ the UAV energy model of [3]. The flying power $P(V)$ is given by:

$$P(V) = P_0\left(1 + \frac{3V^2}{U_{tip}^2}\right) + P_1\left(\sqrt{1 + \frac{V^4}{4v_0^4}} - \frac{V^2}{2v_0^2}\right)^{1/2} + \frac{1}{2} d_0 \rho s A V^3, \qquad (4)$$

where $V$ is the flying speed, while $P_0$ and $P_1$ refer to the blade profile power and the induced power in hovering status, respectively. $U_{tip}$ is the tip speed of the rotor blade, $v_0$ denotes the mean rotor induced velocity, $d_0$ and $s$ are the fuselage drag ratio and rotor solidity, respectively, and $\rho$ and $A$ denote the air density and rotor disc area, respectively. By substituting $V = 0$, the hovering power is given by $P^{(h)} = P(0) = P_0 + P_1$. We assume the UAV travels between clusters at a constant speed $V_{mr}$ along a predetermined flying path (with total flying distance $S$). $V_{mr}$ refers to the maximum-range speed, which maximizes the total traveling distance for any given battery storage [3]. Hence, the flying power is fixed at $P^{(f)} = P(V_{mr})$ and the flying energy can be calculated by $E^{(f)} = S P^{(f)} / V_{mr}$.
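A small sketch of this power model may help; the parameter values below are representative rotary-wing figures of the kind used in [3] and are assumptions, not values reported in this paper:

```python
# A small sketch of the rotary-wing power model in Eq. (4). It reproduces
# P^(h) = P(0) = P0 + P1 and the flying energy E^(f). All values assumed.
import math

P0, P1 = 79.86, 88.63     # blade profile / induced power in hover (W)
U_tip, v0 = 120.0, 4.03   # rotor tip speed (m/s), mean induced velocity (m/s)
d0, rho, s, A = 0.6, 1.225, 0.05, 0.503  # drag ratio, air density, solidity, disc area

def flying_power(V: float) -> float:
    """Propulsion power P(V) in Watts at horizontal speed V (m/s), Eq. (4)."""
    blade = P0 * (1 + 3 * V**2 / U_tip**2)
    induced = P1 * math.sqrt(math.sqrt(1 + V**4 / (4 * v0**4)) - V**2 / (2 * v0**2))
    parasite = 0.5 * d0 * rho * s * A * V**3
    return blade + induced + parasite

P_h = flying_power(0.0)              # hovering power P0 + P1
V_mr, S = 18.0, 1500.0               # assumed max-range speed (m/s), path length (m)
E_f = S * flying_power(V_mr) / V_mr  # flying energy E^(f) = S * P^(f) / V_mr
print(f"P(h) = {P_h:.1f} W, E(f) = {E_f:.1f} J")
```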
III. PROBLEM FORMULATION AND OFFLINE APPROACH
A. Problem Formulation
The user scheduling scheme varies across frames. We denote $\boldsymbol{\Lambda}_n[t] = \{\lambda_{g,i,n}\}_{G \times I}$ as the scheduling matrix on frame $t$, where the element $\lambda_{g,i,n} \in \{0,1\}$ indicates whether the $g$-th group is selected at the $i$-th timeslot. $\nu_n[t] \in \{0,1\}$ refers to the hovering indicator; $\nu_n[t] = 1$ means the UAV is hovering above cluster $n$ at frame $t$. The variable $\nu_n[t]$ has the following constraints:

$$\nu_n[t+1] + \nu_{n+1}[t+1] = 1,\ \text{if}\ \nu_n[t] = 1,\ \forall t, \qquad (5)$$

$$\sum\nolimits_{n=1}^{N} \nu_n[t] \le 1,\ \forall t. \qquad (6)$$

Eq. (5) shows that, at the end of each frame, the UAV must either stay at the current cluster or fly to the next cluster, but cannot fly back to a previous cluster. Eq. (6) means the UAV can serve at most one cluster in each frame. The energy minimization problem is formulated as P1:
$$\text{P1:}\quad \min_{\boldsymbol{\Lambda}_n[t],\, \nu_n[t]}\ E^{(f)} + E^{(c)} + E^{(h)} \qquad (7)$$
$$\text{s.t.}\quad \mathbf{d}_n \succeq \mathbf{q}_n,\ \forall n, \qquad (7a)$$
$$\boldsymbol{\Lambda}_n[t]^T \mathbf{1} \preceq \mathbf{1},\ \forall n, t, \qquad (7b)$$
$$\boldsymbol{\Lambda}_n[t] \in \mathbb{B},\ \nu_n[t] \in \{0,1\},\ \forall n, t, \qquad (7c)$$
$$(5), (6),$$

where:
• $E^{(c)}$ is the UAV's communication energy, given by $\sum_{t=1}^{T_L} \sum_{n=1}^{N} \nu_n[t]\, \mathbf{e}_n^{(c)}[t]\, \boldsymbol{\Lambda}_n[t]\, \mathbf{1}$.
• $E^{(h)}$ is the UAV's hovering energy, given by $\sum_{t=1}^{T_L} \big(\sum_{n=1}^{N} \nu_n[t]\big)\, T_F P^{(h)}$.
• $\mathbf{d}_n$ is the received data of all the users in the $n$-th cluster, given by $\sum_{t=1}^{T_L} \nu_n[t]\, \mathbf{D}_n[t]\, \boldsymbol{\Lambda}_n[t]\, \mathbf{1}$.
• $\mathbf{q}_n$ is the demand vector of the $n$-th cluster, consisting of the users' requests $q_{k,n}$.
Constraints (7a) mean that all the users' requests have to be satisfied. Constraints (7b) indicate that no more than one group can be scheduled per timeslot. Constraints (7c) confine both $\boldsymbol{\Lambda}_n[t]$ and $\nu_n[t]$ to binary values. P1 has several characteristics that make it difficult to solve:
• P1 is a combinatorial optimization problem whose goal is to find an optimal user group from a finite set, and the size of the set increases exponentially with the user scale.
• The two variables $\boldsymbol{\Lambda}_n[t]$ and $\nu_n[t]$ are coupled: a change of $\nu_n[t]$ affects the scheduling policy.
• Constraints (5) and (6) make the decisions on $\nu_n[t]$ time-correlated: the hovering time allocated to the current cluster affects the decisions for the next cluster.
B. Offline Approach

To make P1 more tractable, we first investigate the scheduling policy under a fixed hovering indicator $\nu_n[t]$; the constraints (5) and (6) can then be removed. We denote $t_n = \sum_{t=1}^{T_L} \nu_n[t]$ as the hovering time for each cluster $n$. When $\nu_n[t]$ is fixed, the scheduling policies are independent between clusters, so P1 can be divided into $N$ sub-problems P1($n$) to find the optimal $\boldsymbol{\Lambda}_n[t]$. For concise expression, we replace $\nu_n[t]$ with $t_n$ and delete the constant $E^{(f)}$ from the objective:

$$\text{P1}(n):\quad \min_{\boldsymbol{\Lambda}_n[t]}\ E_n^{(c)} + E_n^{(h)} \qquad (8)$$
$$\text{s.t.}\quad (7a), (7b), (7c),$$

where $E_n^{(c)} = \sum_{t=\tau_n+1}^{\tau_n+t_n} \mathbf{e}_n^{(c)}[t]\, \boldsymbol{\Lambda}_n[t]\, \mathbf{1}$ and $E_n^{(h)} = T_F P^{(h)} t_n$. Here $\tau_n$ refers to the time consumed before the UAV arrives at cluster $n$, which can be calculated by $\tau_n = \sum_{t=1}^{T_L} \sum_{n'=1}^{n-1} \nu_{n'}[t]$. We can observe that P1($n$) is a binary linear programming problem that can be solved optimally by branch and bound (B&B) [12].
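For illustration, P1($n$) can be written down directly as a binary linear program. The sketch below is our own, using SciPy's MILP interface (a branch-and-cut solver) in place of a hand-written B&B; `D`, `e_c`, and `q` are NumPy arrays holding the data matrix, group-energy vector, and demand vector defined above, and the hovering energy is omitted since it is constant once $t_n$ is fixed:

```python
# A hedged sketch of the inner problem P1(n) as a binary linear program.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def solve_p1n(D, e_c, q, n_slots):
    """Pick at most one group per timeslot to meet demands q at minimum energy."""
    K, G = D.shape
    n_var = G * n_slots                       # lambda_{g,i}, flattened slot-major
    c = np.tile(e_c, n_slots)                 # objective: sum_i e_c[g] * lambda_{g,i}

    # (7b): at most one group per timeslot.
    A_slot = np.kron(np.eye(n_slots), np.ones((1, G)))
    slot_con = LinearConstraint(A_slot, -np.inf, 1)

    # (7a): each user's delivered data must reach its demand.
    A_dem = np.tile(D, (1, n_slots))          # sum over slots of D @ lambda_i
    dem_con = LinearConstraint(A_dem, q, np.inf)

    res = milp(c, constraints=[slot_con, dem_con],
               integrality=np.ones(n_var),    # (7c): binary via integrality+bounds
               bounds=Bounds(0, 1))
    return (res.x.reshape(n_slots, G).T, res.fun) if res.success else (None, np.inf)
```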
Determining the optimal hovering time $t_n$ is a non-trivial task since $t_n$ and $\boldsymbol{\Lambda}_n[t]$ are coupled. Brute-force search is the most direct approach but is not efficient. In our study, an efficient search method, golden section search (GSS), is employed to provide sub-optimal solutions [14]. The offline algorithm is summarized in Alg. 1. In the outer loop (line 4, Alg. 1), GSS is used to determine the hovering time $t_n$ for each cluster; a Python sketch of this search loop is given after Alg. 1. In the inner loop (line 5, Alg. 1), we apply B&B to find the user scheduling scheme.
Algorithm 1 Offline Algorithm
Inputs: Users' demands $\mathbf{q}_1, \dots, \mathbf{q}_N$, maximal time limitation $T_L$, channel coefficients $\{\beta_1^{(kj)}[t]\}, \dots, \{\beta_N^{(kj)}[t]\}$.
Outputs: Hovering time $t_n^*$, user scheduling $\boldsymbol{\Lambda}_n^*[t]$.
1: for $n = 1 : N$ do:
2:   $\phi = 0.618$, $a_1 = 0$, $b_1 = T_L$, $i = 1$,
3:   $u_1 = \lceil b_1 - \phi(b_1 - a_1) \rceil$, $v_1 = \lceil a_1 + \phi(b_1 - a_1) \rceil$.
4:   while $|b_i - a_i| \neq 1$ do: ▷ golden section search
5:     Solve P1($n$) with $t_n^1 = u_i$ and $t_n^2 = v_i$ by B&B.
6:     Obtain the user scheduling $\boldsymbol{\Lambda}_n^1[t]$ and $\boldsymbol{\Lambda}_n^2[t]$.
7:     Obtain the objective energies $E_n^1$ and $E_n^2$.
8:     if $E_n^1 < E_n^2$ then:
9:       $a_{i+1} = a_i$, $b_{i+1} = v_i$, $v_{i+1} = u_i$, $u_{i+1} = \lceil b_{i+1} - \phi(b_{i+1} - a_{i+1}) \rceil$.
10:    else:
11:      $a_{i+1} = u_i$, $b_{i+1} = b_i$, $u_{i+1} = v_i$, $v_{i+1} = \lceil a_{i+1} + \phi(b_{i+1} - a_{i+1}) \rceil$.
12:    end if
13:    $i = i + 1$.
14:  end while
15:  $t_n^* = t_n^2$, $\boldsymbol{\Lambda}_n^*[t] = \boldsymbol{\Lambda}_n^2[t]$.
16: end for
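The sketch below renders Alg. 1's outer loop in Python, reusing `solve_p1n()` from the previous sketch; it is a simplified rendering under the assumption that the channel, and hence `D` and `e_c`, stay fixed during the search, with `T_F` and `P_h` passed in as the frame length and hovering power:

```python
# A hedged sketch of Alg. 1's outer loop: golden-section search over the
# integer hovering time t_n for one cluster.
import math

def offline_search(D, e_c, q, I, T_L, T_F, P_h):
    """Return (t_n*, schedule*) minimizing E_n^(c) + E_n^(h) over t_n."""
    phi = 0.618
    a, b = 0, T_L

    def objective(t_n):
        sched, E_c = solve_p1n(D, e_c, q, I * t_n)     # inner loop (B&B / MILP)
        return E_c + T_F * P_h * t_n, sched            # add hovering energy

    u = math.ceil(b - phi * (b - a))
    v = math.ceil(a + phi * (b - a))
    while abs(b - a) != 1:
        E_u, _ = objective(u)
        E_v, _ = objective(v)
        if E_u < E_v:                                  # minimum lies in [a, v]
            a, b, v = a, v, u
            u = math.ceil(b - phi * (b - a))
        else:                                          # minimum lies in [u, b]
            a, u = u, v
            v = math.ceil(a + phi * (b - a))
    E_v, sched_v = objective(v)
    return v, sched_v
```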
However, the offline approach still has some limitations in practical use. The computational time of the inner loop increases exponentially with the user scale: in the worst case, B&B enumerates all $2^K I t_n$ binary variables of a cluster, so its time complexity is $\mathcal{O}\big((2^{2^K})^{I t_n}\big)$. Besides, when a new round starts, the problem needs to be recalculated due to new user arrivals and channel condition changes. We therefore propose a deep reinforcement learning algorithm that enables the UAV to make decisions intelligently and overcomes these difficulties.
IV. ACTOR-CRITIC-BASED ONLINE ALGORITHM
Actor-critic is a reinforcement learning framework that splits the model into two components to leverage the strengths of both value-based and policy-based methods [13]. For the actor part, the policy $\pi$ can be described as an action probability distribution $\pi(a|s)$, where $a \in \mathcal{A}$ and $s \in \mathcal{S}$ are the current action and state, respectively. We usually call $\pi(a|s)$ a stochastic policy, which can handle a huge or continuous action space. The objective of reinforcement learning is to find a policy $\pi$ which maximizes

$$J = \mathbb{E}_\pi[Q^\pi(s,a)] = \int_{\mathcal{S}} p^\pi(s) \int_{\mathcal{A}} \pi(a|s)\, Q^\pi(s,a)\, da\, ds, \qquad (9)$$

where $p^\pi(s)$ is the state distribution and $Q^\pi(s,a)$ is the action-value (or Q-value) function under policy $\pi$. To predict the action distribution, a parameterized approximator $\pi_\theta(a|s)$ is built. Assume $\pi_\theta(a|s)$ is differentiable with respect to $\theta$, such that the gradient of $J$ can be expressed as:

$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} p^{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\, da\, ds = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\big]. \qquad (10)$$
Based on gradient ascent, the parameter $\theta$ is updated by:

$$\theta' = \theta + \alpha_a \nabla_\theta J(\theta), \qquad (11)$$

where $\alpha_a$ is the learning rate of the actor.

The critic is responsible for estimating $Q^{\pi_\theta}(s,a)$. Like the actor, the critic also has a parameterized approximator $Q_w(s,a)$. Temporal difference (TD) learning can be applied in the critic to enhance the learning efficiency [13]. The objective of the critic is to minimize the squared TD error

$$L(w) = \big[\big(r + Q_w(s', a')\big) - Q_w(s,a)\big]^2, \qquad (12)$$

where $a'$ and $s'$ are the next action and state, respectively. The update rule for the parameter $w$ follows gradient descent:

$$w' = w - \alpha_c \nabla_w L(w), \qquad (13)$$

where $\alpha_c$ is the learning rate of the critic. In our study, we use two fully connected deep neural networks (DNNs) as the approximators for the actor and the critic; thus, $\theta$ and $w$ represent the weights of the neural networks.
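The following PyTorch sketch (assumed glue code, not the authors' implementation) shows what the two approximators and one update step of Eqs. (11)-(13) can look like; the layer sizes and the Gaussian policy follow the settings reported in Section V, while the state and action dimensions are illustrative:

```python
# A minimal PyTorch sketch of the actor and critic DNNs and one
# actor-critic update per Eqs. (11)-(13). Sizes and sigma are assumed.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=300, layers=5):
    """Fully connected DNN with `layers` hidden layers of `hidden` nodes."""
    seq, d = [], in_dim
    for _ in range(layers):
        seq += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    seq.append(nn.Linear(d, out_dim))
    return nn.Sequential(*seq)

state_dim, action_dim = 64, 10            # illustrative sizes (I = 10 slots)
actor = mlp(state_dim, action_dim)        # outputs the mean of pi_theta(a|s)
critic = mlp(state_dim + action_dim, 1)   # outputs Q_w(s, a)
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)   # alpha_a
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)  # alpha_c

def update(s, a, r, s2, a2, sigma=1.0):
    """One actor-critic step for the transition (s, a, r, s', a')."""
    q = critic(torch.cat([s, a]))
    with torch.no_grad():
        td_target = r + critic(torch.cat([s2, a2]))     # Eq. (12), no discount
    critic_loss = (td_target - q).pow(2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()   # Eq. (13)

    dist = torch.distributions.Normal(actor(s), sigma)  # Gaussian policy
    log_prob = dist.log_prob(a).sum()
    actor_loss = -(log_prob * q.detach())               # ascend Eqs. (10)-(11)
    opt_a.zero_grad(); actor_loss.sum().backward(); opt_a.step()
```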
To apply the DRL algorithm, we reformulate the optimization problem P1 as an MDP.

1) System States: The system state $s_t$ is jointly determined by the channel state, the remaining requests, and the cluster indicator. To characterize the fading nature of the wireless channel, the time-varying channel is modeled as a finite-state Markov channel (FSMC) [7]; in our study, the channel coefficient $\beta_n^{(kj)}[t]$ is modeled as a Markov random variable. The remaining requests are the difference between the required and the delivered data,

$$\mathbf{b}_n[t] = \mathbf{q}_n - \sum\nolimits_{t'=1}^{t} \mathbf{d}_n[t']. \qquad (14)$$

We denote $c[t]$ as the cluster indicator, which shows which cluster the UAV is serving on frame $t$. When all the users' requests in the current cluster are completed, i.e., $\mathbf{b}_n[t] = \mathbf{0}$, the UAV moves to the next cluster, such that $c[t+1] = c[t] + 1$. The state is defined as:

$$s_t = \{\beta_1^{(kj)}[t], \dots, \beta_N^{(kj)}[t], \mathbf{b}_1[t], \dots, \mathbf{b}_N[t], c[t]\}. \qquad (15)$$
2) System Actions: The action of the UAV is to choose a group of users to serve. On each frame $t$, we define the action $a_t$ as:

$$a_t = \{a_{1,t}, \dots, a_{I,t}\},\quad a_{i,t} \in [0, G], \qquad (16)$$

where $a_{i,t}$ is a continuous value. After selecting an action $a_t$, we round all the elements to integer values $\hat{a}_{i,t}$; $\hat{a}_{i,t} = g$ means the $g$-th group is selected at the $i$-th timeslot. The hovering time of each cluster can then be determined by the time difference between the UAV arriving at and leaving the cluster.

3) Rewards Design: The reward of a DRL algorithm is commonly related to the objective of the problem. Since P1 is an energy minimization problem, the reward function is designed as:

$$r_t = 1/e[t], \qquad (17)$$

where $e[t]$ is the energy consumed on frame $t$. The reward function is monotonically decreasing in the energy, which drives the UAV's decisions toward reducing energy consumption. If the learned policy violates constraint (7a), the agent receives a penalty $\phi$, which is a negative value.
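Putting Eqs. (14)-(17) together, a single MDP transition can be sketched as follows; `delivered_data()`, `frame_energy()`, and `markov_channel_step()` are assumed helpers that evaluate Eqs. (2)-(3) for the chosen groups and the FSMC transition:

```python
# A hedged sketch of one MDP step: round the continuous action, update
# the remaining demand per Eq. (14), and return the reward per Eq. (17).
import numpy as np

PENALTY = -100.0                       # phi, as set in Section V

def step(state, a_t, T_L):
    """Advance one frame; state = (beta, b, c, t) per Eq. (15)."""
    beta, b, c, t = state
    a_hat = np.rint(a_t).astype(int)           # round a_{i,t} to group indices
    d = delivered_data(beta, a_hat, c)         # data per user this frame, Eq. (2)
    e = frame_energy(beta, a_hat, c)           # communication + hovering energy
    b[c] = np.maximum(b[c] - d, 0.0)           # Eq. (14): remaining requests
    if not b[c].any() and c < len(b) - 1:      # cluster done -> next cluster
        c += 1
    r = 1.0 / e                                # Eq. (17)
    if t + 1 >= T_L and any(bn.any() for bn in b):
        r = PENALTY                            # (7a) violated within T_L
    beta = markov_channel_step(beta)           # FSMC transition (assumed)
    return (beta, b, c, t + 1), r
```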
Based on the above definitions, we propose an actor-critic-based DRL algorithm in Alg. 2. When a new user arrives, the DRL algorithm only needs to update the state information at the beginning of the next round, without restarting the training.
Algorithm 2 Actor-Critic-Based DRL Algorithm
Inputs: The current state $s_t$.
Outputs: The current action $a_t$.
1: Initialize the parameters $\theta_1$ and $w_1$.
2: for each learning round do:
3:   for $t = 1 : T_L$ do:
4:     Approximate a distribution $\pi(a_t|s_t, \theta_t)$ by the actor.
5:     Randomly choose $a_t$ according to $\pi$.
6:     Round $a_t$ to $\hat{a}_t$ and observe $r_t$, $s_{t+1}$.
7:     Pass $r_t$ and $s_{t+1}$ to the critic.
8:     Approximate $Q(a_t, s_t|w_t)$, $Q(a_{t+1}, s_{t+1}|w_t)$ by the critic.
9:     Calculate the TD error $L(w_t)$ by Eq. (12).
10:    Collect the tuple $\{a_t, s_t, s_{t+1}, r_t, L(w_t)\}$.
11:    Update $\theta_t$ and $w_t$ by Eq. (11) and Eq. (13).
12:  end for
13: end for
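For completeness, the sketches above can be tied together in the shape of Alg. 2; `encode_state()` is an assumed featurizer that flattens the state of Eq. (15) into a vector, and `actor`, `update()`, and `step()` come from the earlier sketches:

```python
# Assumed glue code: one learning round of T_L frames in the shape of Alg. 2.
import torch

def learning_round(state, T_L, sigma=1.0):
    s_vec = encode_state(state)                     # featurizer for Eq. (15)
    for t in range(T_L):
        mean = actor(s_vec)                         # line 4: pi(a_t | s_t, theta_t)
        a = torch.distributions.Normal(mean, sigma).sample()   # line 5
        next_state, r = step(state, a.numpy(), T_L)            # line 6
        s2_vec = encode_state(next_state)
        a2 = actor(s2_vec).detach()                 # next action for the TD target
        update(s_vec, a, torch.tensor(r), s2_vec, a2, sigma)   # lines 7-11
        state, s_vec = next_state, s2_vec
```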
V. SIMULATION RESULTS
In this section, we compare the proposed actor-critic-based DRL algorithm with the semiorthogonal user selection (SUS) algorithm in [6] and the offline approach. The UAV is equipped with $L = 10$ antennas and serves $N = 3$ clusters. The users are randomly scattered in the service area, and user arrivals and departures follow a Poisson distribution. In each round, the users' demands $q_{k,n}$ are randomly selected from {1, 1.5, 2, 3, 4.45, 5} (Mbit). Each cluster holds at most $K = 10$ users, so the number of candidate groups is $G = 2^{10}$. We assume the bandwidth $B = 10$ MHz, noise power $\sigma^2 = 0.1$ mW, transmission power $p = 3$ W, and hovering power $P^{(h)} = 10$ W. Based on the FSMC model, we quantize the channel coefficient $\beta_n^{(kj)}$ into 10 levels, $\mathcal{B} = \{0, 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4\}$, and apply the transition probability matrix of [7]. For the DRL algorithm, we set the learning rates $\alpha_a = \alpha_c = 0.001$ and the penalty $\phi = -100$. We use two 5-layer (300 nodes per layer) DNNs to build the approximators, and the stochastic policy $\pi(a|s)$ follows a Gaussian distribution.
Fig. 3. Energy vs. $T_L$ (objective energy in Joule; curves: offline algorithm, actor-critic DRL algorithm, SUS algorithm in [6]).

Fig. 4. Energy vs. $K$ (objective energy in Joule; same three algorithms).

Fig. 5. Computational time vs. $K$ (in seconds; same three algorithms).

Fig. 6. Outage ratio vs. $T_L$ (same three algorithms).
Fig. 3 compares the energy consumption of the three schemes under different time limitations $T_L$. We can observe that the energy first drops to a minimum and then increases steadily. This is because, when $T_L$ is not sufficient, the users have to share timeslots, leading to inter-user interference and a degradation of the average transmission rates; in this case, the UAV needs to consume more transmission energy to satisfy the users' requests. Thus, the communication energy undergoes a sharp decrease and then becomes stable once $T_L$ is large enough. On the other hand, the hovering energy grows linearly with $T_L$. Among the three algorithms, the offline approach has the best performance in energy conservation, followed by the proposed DRL algorithm. Fig. 4 shows the energy consumption for different $K$. We can observe that a larger user scale leads to more energy consumption. The offline algorithm consumes the least energy compared to the others, and the proposed DRL algorithm saves more than 25% of the energy consumed by SUS.
Fig. 5 compares the computational time for different $K$. The computational time refers to the average time needed to generate a scheduling solution for each frame. It can be seen that the proposed DRL algorithm is more time-efficient: the computational time of the offline algorithm grows exponentially, while those of the DRL algorithm and SUS increase linearly. Since the DRL algorithm has an extra learning process, it takes more time than the non-learning SUS.
If a user’s request is not completed within TL, the service
of the user will be interrupted. We define the outage ratio as
the percentage of the number of interrupted users to the total
number of users. Fig. 6 shows the outage ratio with regards
to TL. We can observe that, if TLis sufficient, almost all the
users’ requests can be completed without interruption. When
the TLis small, the proposed DRL algorithm has a lower
outage ratio compared with SUS.
VI. CONCLUSION
In this paper, we have investigated an energy-efficient user scheduling problem in UAV-aided communication systems. We first proposed an offline algorithm that provides user scheduling solutions to minimize the energy consumption of the UAV. To reduce the computational time, an actor-critic-based DRL algorithm was developed by reformulating the problem as an MDP. Numerical results show that the proposed DRL algorithm achieves good performance in energy conservation with less computational time.
ACKNOWLEDGMENT
This work has been supported by the ERC project AGNOSTIC (742648), by the FNR CORE projects ROSETTA (11632107), ProCAST (C17/IS/11691338), and 5G-Sky (C19/IS/13713801), and by the FNR bilateral project LARGOS (12173206).
REFERENCES
[1] Y. Zeng, R. Zhang, and T. J. Lim, "Wireless Communications with Unmanned Aerial Vehicles: Opportunities and Challenges," IEEE Communications Magazine, vol. 52, no. 5, pp. 36–42, May 2016.
[2] M. Mozaffari, W. Saad, M. Bennis et al., "A Tutorial on UAVs for Wireless Networks: Applications, Challenges, and Open Problems," IEEE Communications Surveys & Tutorials, Mar. 2019.
[3] Y. Zeng, J. Xu, and R. Zhang, "Energy Minimization for Wireless Communication with Rotary-Wing UAV," IEEE Transactions on Wireless Communications, vol. 18, no. 4, pp. 2329–2345, Mar. 2019.
[4] Y. Cai, Z. Wei, R. Li et al., "Energy-Efficient Resource Allocation for Secure UAV Communication Systems," in 2019 IEEE Wireless Communications and Networking Conference (WCNC), Apr. 2019.
[5] S. Ahmed, M. Z. Chowdhury, and Y. M. Jang, "Energy-Efficient UAV-to-User Scheduling to Maximize Throughput in Wireless Networks," IEEE Access, vol. 8, pp. 21215–21225, Jan. 2020.
[6] T. Yoo and A. Goldsmith, "On the Optimality of Multiantenna Broadcast Scheduling Using Zero-Forcing Beamforming," IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 528–541, Mar. 2006.
[7] Y. He, Z. Zhang, F. R. Yu et al., "Deep-Reinforcement-Learning-Based Optimization for Cache-Enabled Opportunistic Interference Alignment Wireless Networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 11, pp. 10433–10445, Sep. 2017.
[8] U. Challita, W. Saad, and C. Bettstetter, "Cellular-Connected UAVs over 5G: Deep Reinforcement Learning for Interference Management," arXiv preprint arXiv:1801.05500, Jan. 2018.
[9] C. H. Liu, Z. Chen, J. Tang et al., "Energy-Efficient UAV Control for Effective and Fair Communication Coverage: A Deep Reinforcement Learning Approach," IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 2059–2070, Aug. 2018.
[10] C. You and R. Zhang, "3D Trajectory Optimization in Rician Fading for UAV-Enabled Data Harvesting," arXiv preprint arXiv:1901.04106, Jan. 2019.
[11] C. Zhang, W. Xu, and M. Chen, "Robust MMSE Beamforming for Multiuser MISO Systems With Limited Feedback," IEEE Signal Processing Letters, vol. 16, no. 7, pp. 588–591, Jul. 2009.
[12] W. Zhang, "Branch-and-Bound Search Algorithms and Their Computational Complexity," Research Report No. ISI/RR-96-443, University of Southern California, May 1996.
[13] V. R. Konda and J. N. Tsitsiklis, "Actor-Critic Algorithms," in Advances in Neural Information Processing Systems, pp. 1008–1014, 2000.
[14] J. Guillot, D. R. Leal, C. R. Algarín et al., "Search for Global Maxima in Multimodal Functions by Applying Numerical Optimization Algorithms: A Comparison Between Golden Section and Simulated Annealing," Computation, vol. 7, no. 3, 2019.