Deep Reinforcement Learning for Real-Time
Optimization in NB-IoT Networks
Nan Jiang, Student Member, IEEE, Yansha Deng, Member, IEEE, Arumugam Nallanathan,
Fellow, IEEE, and Jonathon A. Chambers, Fellow, IEEE
Abstract
NarrowBand-Internet of Things (NB-IoT) is an emerging cellular-based technology that offers a range of flexible
configurations for massive IoT radio access from groups of devices with heterogeneous requirements. A configuration
specifies the amount of radio resource allocated to each group of devices for random access and for data transmission.
Assuming no knowledge of the traffic statistics, there exists an important challenge in “how to determine the
configuration that maximizes the long-term average number of served IoT devices at each Transmission Time Interval
(TTI) in an online fashion”. Given the complexity of searching for optimal configuration, we first develop real-time
configuration selection based on the tabular Q-learning (tabular-Q), the Linear Approximation based Q-learning (LA-
Q), and the Deep Neural Network based Q-learning (DQN) in the single-parameter single-group scenario. Our results
show that the proposed reinforcement learning based approaches considerably outperform the conventional heuristic
approaches based on load estimation (LE-URC) in terms of the number of served IoT devices. This result also
indicates that LA-Q and DQN can be good alternatives to tabular-Q, achieving almost the same performance with
much less training time. We further advance LA-Q and DQN via Actions Aggregation (AA-LA-Q and AA-DQN) and
via Cooperative Multi-Agent learning (CMA-DQN) for the multi-parameter multi-group scenario, thereby solving the
problem that Q-learning agents do not converge in high-dimensional configurations. In this scenario, the superiority of
the proposed Q-learning approaches over the conventional LE-URC approach significantly improves with the increase
of configuration dimensions, and the CMA-DQN approach outperforms the other approaches in both throughput and
training efficiency.
I. INTRODUCTION
To effectively support the emerging massive Internet of Things (mIoT) ecosystem, the 3rd Generation
Partnership Project (3GPP) partners have standardized a new radio access technology, namely NarrowBand-
IoT (NB-IoT) [1]. NB-IoT is expected to provide reliable wireless access for IoT devices with various
N. Jiang, and A. Nallanathan are with the School of Electronic Engineering and Computer Science, Queen Mary University of London,
London E1 4NS, UK (e-mail: {nan.jiang, a.nallanathan}@qmul.ac.uk).
Y. Deng is with the Department of Informatics, King’s College London, London WC2R 2LS, UK (e-mail: yansha.deng@kcl.ac.uk)
(Corresponding author: Yansha Deng).
J. A. Chambers is with the Department of Engineering, University of Leicester, Leicester LE1 7RH, UK (e-mail: jonathon.chambers@le.ac.uk).
types of data traffic, and to meet the requirement of extended coverage. As most mIoT applications favor
delay-tolerant data traffic with small size, such as data from alarms, meters, and monitors, the key target of
NB-IoT design is to deal with the sporadic uplink transmissions of massive IoT devices [2].
NB-IoT is built on the legacy Long-Term Evolution (LTE) design, but is deployed in a narrow bandwidth (180 kHz) for Coverage Enhancement (CE) [3]. Different from legacy LTE, NB-IoT defines only two uplink physical channel resources to perform all uplink transmissions: the Random Access CHannel (RACH) resource (i.e., using the NarrowBand Physical Random Access CHannel, a.k.a. NPRACH) for RACH preamble transmission, and the data resource (i.e., using the NarrowBand Physical Uplink Shared CHannel, a.k.a. NPUSCH) for control information and data transmission. To support various traffic with different coverage requirements, NB-IoT supports up to three CE groups of IoT devices sharing the uplink resource in the same band. Each group serves IoT devices with different coverage requirements, distinguished based on the same broadcast signal from the evolved Node B (eNB) [3]. At the beginning of each uplink
Transmission Time Interval (TTI), eNB selects a system configuration that specifies the radio resource
allocated to each group in order to accommodate the RACH procedure along with the remaining resource
for data transmission. The key challenge is to optimally balance the allocation of channel resource between the RACH procedure and data transmission, so as to maximize the number of successful accesses and transmissions in massive IoT networks. Allocating too much resource for RACH enhances the random access performance, but leaves insufficient resource for data transmission.
Unfortunately, dynamic RACH and data transmission resource configuration optimization is an untreated
problem in NB-IoT. Generally speaking, the eNB observes the transmission receptions of both RACH
(e.g., the numbers of successfully received preambles and collisions) and data transmission (e.g., the numbers of scheduled and unscheduled devices) for all groups at the end of each TTI. This historical information
can be potentially used to predict traffic from all groups and to facilitate the optimization of future TTIs’
configurations. Even if one knew all the relevant statistics, tackling this problem in an exact manner would
result in a Partially Observable Markov Decision Process (POMDP) with large state and action spaces,
which would be generally intractable. The complexity of the problem is compounded by the lack of prior knowledge at the eNB regarding the stochastic traffic and unobservable channel statistics (i.e., random collisions, and physical-radio effects including path loss and fading). The related works are briefly introduced in the following two subsections.
1) Related works on real-time optimization in cellular-based networks: In light of this POMDP challenge,
prior works [4, 5] studied real-time resource configuration of RACH procedure and/or data transmission by
proposing dynamic Access Class Barring (ACB) schemes to optimize the number of served IoT devices.
These optimization problems have been tackled under the simplified assumptions that at most two config-
urations are allowed and that the optimization is executed for a single group without considering errors
due to wireless transmission. In order to consider more complex and practical formulations, Reinforcement
Learning (RL) emerges as a natural solution given its capability of interacting with the practical environment and learning from feedback in the form of the number of successful and unsuccessful transmissions per TTI. Distributed
RL based on tabular Q-learning (tabular-Q) has been proposed in [6–9]. In [6–8], the authors studied
distributed tabular-Q in slotted-Aloha networks, where each device learns how to avoid collisions by finding
a proper time slot to transmit packets. In [9], the authors implemented tabular-Q agents at the relay nodes to cooperatively select their transmit power and transmission probability in order to optimize the total number of useful received packets per unit of consumed energy. Centralized RL has also been studied in [10–12], where the
RL agent was implemented at the base station site. In [10], a learning-based scheme was proposed for radio
resource management in multimedia wide-band code-division multiple access systems to improve spectrum
utilization. In [11, 12], the authors studied the tabular-Q based ACB schemes in cellular networks, where a
Q-agent was implemented at an eNB, aiming at selecting the optimal ACB factor to maximize the access success probability of the RACH procedure.
2) Related works on optimization in NB-IoT: In NB-IoT networks, most existing studies focused either on the resource allocation during the RACH procedure [13, 14] or on that during data transmission [15, 16]. For the RACH procedure, the access success probability was statistically optimized in [13] using exhaustive search,
and the authors in [14] studied the fixed-size data resource scheduling for various resource requirements.
For the data transmission, [15] presented an uplink data transmission time slot and power allocation scheme
to optimize the overall channel gain, and [16] proposed a link adaptation scheme, which dynamically
selects modulation and coding level, and the repetition value according to the acknowledgment/negative-
acknowledgment feedback to reduce the uplink data transmission block error ratio. More importantly, these works ignored the time-varying heterogeneous traffic of massive IoT devices, and considered a snapshot [13, 15, 16] or the steady-state behavior [14] of NB-IoT networks. The work most relevant to ours is [17], where the authors studied the steady-state behavior of NB-IoT networks from the perspective of a single device. Optimizing some of the parameters of the NB-IoT configuration, namely the repetition value (to be defined below) and the time intervals between two consecutive scheduling occasions of NPRACH and NPDCCH, was carried out in terms of
latency and power consumption in [17] using a queuing framework.
Unfortunately, the tabular-Q framework in [11, 12] cannot be used to solve the multi-parameter multi-group optimization problem in the uplink resource configuration of NB-IoT networks, due to its incapability of addressing high-dimensional state spaces and variable selection. More importantly, whether their proposed
RL-based resource configuration approaches [11, 12] outperform the conventional resource configuration
approaches [4, 5] is still unknown. In this paper, we develop RL-based uplink resource configuration ap-
proaches to dynamically optimize the number of served IoT devices in NB-IoT networks. To showcase the
efficiency, we compare the proposed RL-based approaches with the conventional heuristic uplink resource
allocation approaches. The contributions can be summarized as follows:
•We develop an RL-based framework to optimize the number of served IoT devices by adaptively
configuring uplink resource in NB-IoT networks. The uplink communication procedure in NB-IoT is
simulated by taking into account the heterogeneous IoT traffic, the CE group selection, the RACH
procedure, and the uplink data transmission resource scheduling. This generated simulation environment
is used for training the RL-based agents before deployment, and these agents will be updated according
to the real traffic in practical NB-IoT networks in an online manner.
•We first study a simplified NB-IoT scenario considering the single parameter and the single CE group,
where a basic tabular-Q is developed and compared with the revised conventional Load Estimation based Uplink Resource Configuration (LE-URC) scheme. The tabular-Q is further advanced by implementing function approximators with different computational complexities, namely, the Linear Approximator (LA-Q) and Deep Neural Networks (Deep Q-Network, a.k.a. DQN), to demonstrate their capability and efficiency in dealing with high-dimensional state spaces.
•We then study a more practical NB-IoT scenario with multiple parameters and multiple CE groups,
where direct implementation of LA-Q or DQN is not feasible due to the large number of parameter combinations. To solve this, we propose Action Aggregation approaches based on LA-Q and DQN, namely, AA-LA-Q and AA-DQN, which guarantee convergence capability by sacrificing a certain accuracy in the parameter selection. Finally, a Cooperative Multi-Agent learning based DQN (CMA-DQN) approach is developed to break down the selection of high-dimensional parameters into multiple parallel sub-tasks, in which a number of DQN agents are cooperatively trained to produce each parameter for each CE group.
•In the simplified scenario, our results show that the number of served IoT devices with tabular-Q considerably outperforms that with LE-URC, while LA-Q and DQN achieve almost the same performance as tabular-Q using much less training time. In the practical scenario, the superiority of the Q-learning based approaches over LE-URC significantly improves. In particular, CMA-DQN outperforms all other approaches in terms of both throughput and training efficiency, which is mainly due to the use of DQN enabling operation over a large state space and the use of multiple agents dealing with the large dimensionality of parameter selection.
The rest of the paper is organized as follows. Section II provides the problem formulation and system
model. Section III presents the preliminaries and the conventional LE-URC. Section IV proposes Q-learning based uplink resource configuration approaches in the single-parameter single-group scenario. Section V
presents the advanced Q-learning based approaches in the multi-parameter multi-group scenario. Section VI
elaborates the numerical results, and finally, Section VII concludes the paper.
II. PROBLEM FORMULATION AND SYSTEM MODEL
As illustrated in Fig. 1(a), we consider a single-cell NB-IoT network composed of an eNB located at the center of the cell and a set of IoT devices randomly located in an area of the plane $\mathbb{R}^2$ that remain spatially static once deployed. The devices are divided into three CE groups, as further discussed below, and the eNB is unaware of the status of these IoT devices; hence no uplink channel resource is scheduled to them in advance. In each IoT device, uplink data is generated according to random inter-arrival processes over the TTIs, which are Markovian and possibly time-varying.
[Fig. 1: (a) Illustration of the system model, where IoT devices are assigned to CE group 0 if $P_{\text{RSRP}} > \gamma_{\text{RSRP1}}$, CE group 1 if $\gamma_{\text{RSRP1}} \ge P_{\text{RSRP}} \ge \gamma_{\text{RSRP2}}$, and CE group 2 if $P_{\text{RSRP}} < \gamma_{\text{RSRP2}}$; (b) Uplink channel frame structure in the $t$th TTI, defined by the number of RACH periods $n^t_{\text{Rach},i}$, the repetition value $n^t_{\text{Repe},i}$, and the number of preambles $f^t_{\text{Prea},i}$ for each CE group $i$.]
A. Problem Formulation
With packets waiting for service, an IoT device executes the contention-based RACH procedure in order to
establish a Radio Resource Control (RRC) connection with the eNB. The contention-based RACH procedure
consists of four steps: an IoT device transmits a randomly selected preamble, repeated a given number of times according to the repetition value $n^t_{\text{Repe},i}$ [1], to initiate the RACH procedure in step 1, and exchanges control information with the eNB in the next three steps [18]. The RACH procedure can fail if: (i) a collision occurs when two or more IoT devices select the same preamble; or (ii) there is no collision, but the eNB cannot detect the preamble due to low SNR. Note that a collision can still be detected in step 3 of the RACH procedure when the collided preambles are not detected in step 1, following the 3GPP report [19]. This assumption differs from our previous works [20, 21], which only focus on the preamble detection analysis in step 1 of the RACH procedure.
As shown in Fig. 1(b), for each TTI $t$ and for each CE group $i = 0, 1, 2$, in order to reduce the chance of a collision, the eNB can increase the number $n^t_{\text{Rach},i}$ of RACH periods in the TTI or the number $f^t_{\text{Prea},i}$ of preambles available in each RACH period [22]. Furthermore, in order to mitigate the SNR outage, the eNB can increase the number $n^t_{\text{Repe},i}$ of times that a preamble transmission is repeated by a device in group $i$ in one RACH period [22] of the TTI.
After the RRC connection is established, the IoT device requests uplink channel resource from the eNB for control information and data transmission. As shown in Fig. 1(b), given a total number of resource $R_{\text{Uplink}}$ for uplink transmission in the TTI, the number of available resource for data transmission $R^t_{\text{DATA}}$ is written as $R^t_{\text{DATA}} = R_{\text{Uplink}} - R^t_{\text{RACH}}$, where $R^t_{\text{RACH}}$ is the overall number of Resource Elements (REs)¹ allocated for the RACH procedure. This can be computed as $R^t_{\text{RACH}} = B_{\text{RACH}} \sum_{i=0}^{2} n_{\text{Rach},i}\, n_{\text{Repe},i}\, f_{\text{Prea},i}$, where $B_{\text{RACH}}$ measures the number of REs required for one preamble transmission.
In this work, we tackle the problem of optimizing the RACH configuration defined by the parameters $A^t = \{n^t_{\text{Rach},i}, f^t_{\text{Prea},i}, n^t_{\text{Repe},i}\}_{i=0}^{2}$ for each $i$th group in an online manner for every TTI $t$. In order to make this decision at the beginning of every TTI $t$, the eNB accesses all prior history $U^{t'}$ in TTIs $t' = 1, \ldots, t-1$, consisting of the following variables of the $i$th CE group: the number of collided preambles $V^{t'}_{\text{cp},i}$, the number of successfully received preambles $V^{t'}_{\text{sp},i}$, and the number of idle preambles $V^{t'}_{\text{ip},i}$ for the RACH, as well as the number of IoT devices that have successfully sent data $V^{t'}_{\text{su},i}$ and the number of IoT devices that are waiting to be allocated data resource $V^{t'}_{\text{un},i}$. We denote $O^t = \{A^{t-1}, U^{t-1}, A^{t-2}, U^{t-2}, \cdots, A^1, U^1\}$ as the observed history of all such measurements and past actions.
The eNB aims at maximizing the long-term average number of devices that successfully transmit data with respect to the stochastic policy $\pi$ that maps the current observation history $O^t$ to the probabilities of selecting each possible configuration $A^t$. This problem can be formulated as the optimization
$$(\text{P1}): \quad \max_{\{\pi(A^t|O^t)\}} \; \sum_{k=t}^{\infty} \sum_{i=0}^{2} \gamma^{k-t}\, \mathbb{E}_{\pi}\big[V^{k}_{\text{su},i}\big], \qquad (1)$$
¹The uplink channel consists of 48 sub-carriers within a 180 kHz bandwidth. With a 3.75 kHz tone spacing, one RE is composed of one time slot of 2 ms and one sub-carrier of 3.75 kHz [1]. Note that NB-IoT also supports 12 sub-carriers with 15 kHz tone spacing for NPUSCH, but NPRACH only supports 3.75 kHz tone spacing [1].
where $\gamma \in [0, 1)$ is the discount rate for the performance in future TTIs and the index $i$ runs over the CE groups. Since the dynamics of the system are Markovian over the TTIs and are defined by the NB-IoT protocol to be further discussed below, this is a POMDP problem that is generally intractable. Approximate solutions will be discussed in Sections III, IV, and V.
B. NB-IoT Access Network
We now provide additional details on the model and on the NB-IoT protocol. To capture the effects of the physical radio, we consider the standard power-law path-loss model, in which the path-loss attenuation is $u^{-\eta}$, with propagation distance $u$ and path-loss exponent $\eta$. The system operates in a Rayleigh flat-fading environment, where the channel power gains $h$ are i.i.d. exponentially distributed random variables with unit mean. Fig. 2 presents the uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks, which consists of the four stages explained in the following four subsections.
[Fig. 2: Uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks, comprising four stages: (A) traffic inter-arrival, (B) CE group determination, (C) RACH procedure, and (D) data resource scheduling. The flowchart uses the CE counter $c_{\text{pCE}}$, the RACH counter $c_{\text{pMax}}$, and the RRC counter $c_{\text{RRC}}$, bounded by the maximum allowed RACH attempts in the $i$th CE group $\gamma_{\text{CE},i}$, the maximum allowed RACH attempts over all CE groups $\gamma_{\text{pMax}}$, and the maximum allowed channel resource requests $\gamma_{\text{RRC}}$.]
1) Traffic Inter-Arrival: We consider two types of IoT devices with different traffic models, namely periodical traffic and bursty traffic, representing a heterogeneous traffic scenario for diverse IoT applications [23, 24]. The periodical traffic, coming from periodic uplink reporting tasks such as metering or environmental monitoring, is the most common traffic model in NB-IoT networks [25]. The bursty traffic, due to emergency events such as fire alarms and earthquake alarms, captures the complementary scenario in which a massive number of IoT devices try to establish RRC connections with the eNB [19]. Due to the nature of slotted-Aloha, an IoT device can only transmit a preamble at the beginning of a RACH period, which means that the IoT devices executing RACH in a given RACH period are those whose packets arrived during the interval since the last RACH period. For the periodical traffic, the first packet is generated using a uniform distribution over $T_{\text{periodic}}$ (ms), and is then repeated every $T_{\text{periodic}}$ ms. The packet inter-arrival rate measured in each RACH period at each IoT device is hence expressed by
$$\mu^t_{\text{period}} = \frac{T_{\text{TTI}}}{n^t_{\text{Rach},i}} \times \frac{1}{T_{\text{periodic}}}, \qquad (2)$$
where $n^t_{\text{Rach},i}$ is the number of RACH periods in the $t$th TTI, and $T_{\text{TTI}}/n^t_{\text{Rach},i}$ is the duration between neighboring RACH periods. The bursty traffic is generated within a short period of time $T_{\text{bursty}}$ starting from a random time $\tau_0$. The instantaneous traffic rate in packets is described by a function $p(\tau)$, so that the packet arrival rate in the $j$th RACH period of the $t$th TTI is given by
$$\mu^{t,j}_{\text{bursty}} = \int_{\tau_{j-1}}^{\tau_j} p(\tau)\, d\tau, \qquad (3)$$
where $\tau_j$ is the starting time of the $j$th RACH period in the $t$th TTI, $\tau_j - \tau_{j-1} = T_{\text{TTI}}/n^t_{\text{Rach},i}$, and the distribution $p(\tau)$ follows the time-limited Beta profile given as [19, Section 6.1.1]
$$p(\tau) = \frac{\tau^{\alpha-1}\,(T_{\text{bursty}} - \tau)^{\beta-1}}{T_{\text{bursty}}^{\alpha+\beta-2}\,\text{Beta}(\alpha, \beta)}. \qquad (4)$$
In (4), $\text{Beta}(\alpha, \beta)$ is the Beta function with constant parameters $\alpha$ and $\beta$ [26].
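To make the traffic models concrete, the following Python sketch (illustrative only; the values of $T_{\text{TTI}}$, $T_{\text{periodic}}$, $T_{\text{bursty}}$, and the Beta parameters are assumptions, not taken from the paper) computes the per-RACH-period arrival rates of Eqs. (2)-(4).

```python
# Minimal sketch (not the paper's code) of the traffic models in Eqs. (2)-(4), assuming
# illustrative values T_TTI = 1280 ms, T_periodic = 60000 ms, T_bursty = 10000 ms and
# the Beta(3, 4) profile of [19]; only numpy/scipy are used.
import numpy as np
from scipy.stats import beta

T_TTI, T_periodic, T_bursty = 1280.0, 60000.0, 10000.0   # ms (assumed values)
ALPHA, BETA = 3.0, 4.0                                   # Beta profile parameters

def periodic_rate(n_rach):
    """Eq. (2): packet arrival rate per RACH period for periodical traffic."""
    return (T_TTI / n_rach) * (1.0 / T_periodic)

def bursty_fraction(tau_prev, tau_j):
    """Eq. (3): fraction of the burst arriving in (tau_{j-1}, tau_j], obtained by
    integrating the normalized Beta profile of Eq. (4) over one RACH period."""
    cdf = beta.cdf([tau_prev / T_bursty, tau_j / T_bursty], ALPHA, BETA)
    return cdf[1] - cdf[0]

# Example: arrival rates in the 4 RACH periods of one TTI with n_Rach = 4.
n_rach = 4
taus = np.linspace(0.0, T_TTI, n_rach + 1)
print(periodic_rate(n_rach),
      [bursty_fraction(taus[j], taus[j + 1]) for j in range(n_rach)])
```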
2) CE Group Determination: Once an IoT device is backlogged, it first determines its associated CE group by comparing the received power of the broadcast signal $P_{\text{RSRP}}$ to the Reference Signal Received Power (RSRP) thresholds $\{\gamma_{\text{RSRP1}}, \gamma_{\text{RSRP2}}\}$ according to the rule [27]
$$\begin{cases} \text{CE group 0}, & \text{if } P_{\text{RSRP}} > \gamma_{\text{RSRP1}},\\ \text{CE group 1}, & \text{if } \gamma_{\text{RSRP1}} \ge P_{\text{RSRP}} \ge \gamma_{\text{RSRP2}},\\ \text{CE group 2}, & \text{if } P_{\text{RSRP}} < \gamma_{\text{RSRP2}}. \end{cases} \qquad (5)$$
In (5), the received power of the broadcast signal $P_{\text{RSRP}}$ is expressed as
$$P_{\text{RSRP}} = P_{\text{NPBCH}}\, u^{-\eta}, \qquad (6)$$
where $u$ is the device's distance from the eNB, and $P_{\text{NPBCH}}$ is the broadcast power of the eNB [27]. Note that $P_{\text{RSRP}}$ is obtained by averaging out the small-scale Rayleigh fading of the received power [27].
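A minimal sketch of the CE group rule in Eqs. (5)-(6) follows; the broadcast power, path-loss exponent, and RSRP thresholds used here are assumed values for illustration only.

```python
# Minimal sketch of Eqs. (5)-(6) in dB form: RSRP is the broadcast power attenuated by
# path loss (fading averaged out), compared against two assumed thresholds.
import math

def ce_group(u, p_npbch_dbm=35.0, eta=3.5,
             gamma_rsrp1_dbm=-60.0, gamma_rsrp2_dbm=-90.0):
    """Return the CE group index (0, 1, or 2) for a device at distance u (meters)."""
    p_rsrp_dbm = p_npbch_dbm - 10.0 * eta * math.log10(u)   # Eq. (6) in dB
    if p_rsrp_dbm > gamma_rsrp1_dbm:
        return 0
    elif p_rsrp_dbm >= gamma_rsrp2_dbm:
        return 1
    return 2

# Nearby, mid-range, and cell-edge devices map to groups 0, 1, and 2 respectively.
print([ce_group(u) for u in (100.0, 1000.0, 5000.0)])
```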
3) RACH Procedure: After CE group determination, each backlogged IoT device in group $i$ repeats a randomly selected preamble $n^t_{\text{Repe},i}$ times in the next RACH period using a pseudo-random frequency hopping schedule. The pseudo-random hopping rule is based on the current repetition time as well as the Narrowband Physical Cell ID, and in one repetition, a preamble consists of four symbol groups, which are transmitted with fixed-size frequency hopping [1, 20, 28]. Therefore, a preamble is successfully detected if at least one preamble repetition succeeds, which in turn happens if all of its four symbol groups are correctly decoded [20]. Assuming that correct detection is determined by the SNR level $\text{SNR}^t_{\text{sg},j,k}$ for the $j$th repetition and the $k$th symbol group, the correct detection event $S_{\text{pd}}$ can be expressed as
$$S_{\text{pd}} \triangleq \bigcup_{j=1}^{n^t_{\text{Repe},i}} \bigcap_{k=1}^{4} \big\{\text{SNR}^t_{\text{sg},j,k} \ge \gamma_{\text{th}}\big\}, \qquad (7)$$
where $k$ is the index of the symbol group in the $j$th repetition, $n^t_{\text{Repe},i}$ is the repetition value of the $i$th CE group in the $t$th TTI, $\text{SNR}^t_{\text{sg},j,k} \ge \gamma_{\text{th}}$ means that the preamble symbol group is successfully decoded when its received SNR $\text{SNR}^t_{\text{sg},j,k}$ is above a threshold $\gamma_{\text{th}}$, and $\text{SNR}^t_{\text{sg},j,k}$ is expressed as
$$\text{SNR}^t_{\text{sg},j,k} = P_{\text{RACH},i}\, u^{-\eta} h / \sigma^2. \qquad (8)$$
In (8), $u$ is the Euclidean distance between the IoT device and the eNB, $\eta$ is the path-loss exponent, $h$ is the Rayleigh fading channel power gain from the IoT device to the eNB, $\sigma^2$ is the noise power, and $P_{\text{RACH},i}$ is the preamble transmit power in the $i$th CE group defined as
$$P_{\text{RACH},i} = \begin{cases} \min\{\rho u^{\eta},\, P_{\text{RACHmax}}\}, & i = 0,\\ P_{\text{RACHmax}}, & i = 1 \text{ or } 2, \end{cases} \qquad (9)$$
where $i$ is the index of the CE group. IoT devices in CE group 0 ($i = 0$) transmit the preamble using full path-loss inversion power control [27], which maintains the received signal power at the eNB at the same threshold $\rho$ for IoT devices at different distances, and $P_{\text{RACHmax}}$ is the maximal transmit power of an IoT device. The IoT devices in CE groups 1 and 2 always transmit the preamble using the maximum transmit power [27].
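The preamble detection model of Eqs. (7)-(9) can be sketched as follows. This is not the paper's simulator; the power control threshold, maximum transmit power, noise power, and SNR threshold are assumed values.

```python
# Minimal sketch of the detection event of Eq. (7): one repetition succeeds only if all
# four symbol groups exceed the SNR threshold, and a preamble is detected if at least one
# repetition succeeds. Transmit power follows Eq. (9); fading follows Eq. (8).
import numpy as np

def preamble_detected(u, group, n_repe, rho_dbm=-100.0, p_max_dbm=23.0, eta=3.5,
                      noise_dbm=-120.0, gamma_th_db=0.0, rng=np.random):
    path_loss_db = 10.0 * eta * np.log10(u)
    if group == 0:                                   # Eq. (9): full path-loss inversion
        p_tx_dbm = min(rho_dbm + path_loss_db, p_max_dbm)
    else:                                            # CE groups 1 and 2: maximum power
        p_tx_dbm = p_max_dbm
    mean_snr = 10.0 ** ((p_tx_dbm - path_loss_db - noise_dbm) / 10.0)
    h = rng.exponential(1.0, size=(n_repe, 4))       # Rayleigh fading power gains, Eq. (8)
    snr = mean_snr * h
    return bool(np.any(np.all(snr >= 10.0 ** (gamma_th_db / 10.0), axis=1)))  # Eq. (7)

rng = np.random.default_rng(0)
print(np.mean([preamble_detected(4000.0, 1, n_repe=8, rng=rng) for _ in range(1000)]))
```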
As shown in the RACH procedure of Fig. 2, if a RACH attempt fails, the IoT device reattempts the procedure until receiving a positive acknowledgement that the RRC connection is established, or until exceeding $\gamma_{\text{pCE},i}$ RACH attempts while being part of one CE group. If these attempts exceed $\gamma_{\text{pCE},i}$, the device switches to a higher CE group if possible [29]. Moreover, the IoT device is allowed to attempt the RACH procedure no more than $\gamma_{\text{pMax}}$ times before dropping its packets. These two features are counted by $c_{\text{pCE}}$ and $c_{\text{pMax}}$, respectively.
4) Data Resource Scheduling: After the RACH procedure succeeds, the RRC connection is successfully established, and the eNB schedules resource from the data channel resource $R^t_{\text{DATA}}$ to the associated IoT device for control information and data transmission, as shown in Fig. 1(b). To allocate data resource among these devices, we adopt a basic random scheduling strategy, whereby an ordered list of all devices that have successfully completed the RACH procedure but have not received a data channel is compiled using a random order. In each TTI, devices in the list are considered in order for access to the data channel until the data resource is insufficient to serve the next device in the list. The remaining RRC connections between the unscheduled IoT devices and the eNB are preserved for at most $\gamma_{\text{RRC}}$ subsequent TTIs, counted by $c_{\text{RRC}}$, and attempts will be made to schedule the device's data during these TTIs [29, 30]. The condition that the data resource is sufficient in TTI $t$ is expressed as
$$R^t_{\text{DATA}} \ge \sum_{i=0}^{2} r^t_{\text{DATA},i}\, V^t_{\text{sch},i}, \qquad (10)$$
where $\sum_{i=0}^{2} V^t_{\text{sch},i} \le \sum_{i=0}^{2} \big(V^t_{\text{sp},i} + V^{t-1}_{\text{un},i}\big)$ is the number of scheduled devices, upper bounded by the number of IoT devices with successful RACH $V^t_{\text{sp},i}$ in the current TTI $t$ and the unscheduled IoT devices $V^{t-1}_{\text{un},i}$ in the last TTI $(t-1)$; $r^t_{\text{DATA},i} = B_{\text{DATA}} \times n^t_{\text{Repe},i}$ is the number of REs required for serving one IoT device within the $i$th CE group, and $B_{\text{DATA}}$ is the number of REs per repetition for control signal and data transmission². Note that $n^t_{\text{Repe},i}$ is the repetition value for the $i$th CE group in the $t$th TTI, which is the same as for preamble transmission [1].
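The random scheduling rule described above can be sketched as follows; the backlog sizes and resource figures used in the example are assumed for illustration only.

```python
# Minimal sketch of the random data-resource scheduling of Section II-B.4): devices that
# passed RACH plus previously unscheduled devices are served in random order until the
# remaining REs cannot cover r_DATA,i = B_DATA * n_Repe,i for the next device, cf. Eq. (10).
import random

def schedule_data(r_data, waiting, b_data, n_repe):
    """waiting: list of CE-group indices of devices awaiting data resource.
    Returns (served_per_group, unscheduled_list) for the current TTI."""
    order = waiting[:]
    random.shuffle(order)                     # basic random scheduling strategy
    served, left = [0, 0, 0], []
    for idx, i in enumerate(order):
        cost = b_data * n_repe[i]             # REs needed to serve one device of group i
        if r_data < cost:                     # stop at the first device that cannot be served
            left = order[idx:]                # kept as V_un for the next TTI (up to gamma_RRC)
            break
        r_data -= cost
        served[i] += 1
    return served, left

waiting = [0] * 30 + [1] * 10 + [2] * 5       # assumed backlog per CE group
print(schedule_data(r_data=1500, waiting=waiting, b_data=20, n_repe=[1, 8, 32]))
```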
III. PRELIMINARY AND CONVENTIONAL SOLUTIONS
A. Preliminary
The long-term optimization of the number of served IoT devices given in Eq. (1) is highly complex, and cannot easily be solved via conventional uplink resource configuration approaches. Therefore, most prior works simplified the objective by dynamically optimizing a single parameter to maximize the number of served IoT devices in a single group without consideration of future performance [4, 5], which is expressed as
$$(\text{P2}): \quad \max_{\pi(x|O^t)} \; \mathbb{E}_{\pi}\big[V^t_{\text{su},0}\big], \qquad (11)$$
where $x$ is the single parameter to be optimized.

To maximize the number of served IoT devices in the $t$th TTI, the configuration $x$ is expected to be dynamically adjusted according to the actual number of IoT devices that will execute RACH attempts $D^t_{\text{RACH}}$, which represents the current load of the network. Note that in practice, this load information cannot be detected at the eNB. Thus, it is necessary to estimate the load based on the previous transmission receptions $O^t$ from the 1st to the $(t-1)$th TTI before configuring the uplink resource in the $t$th TTI.
In [5], the authors designed a dynamic ACB scheme to optimize the problem given in Eq. (1) by adjusting the ACB factor. The ACB factor is adapted based on knowledge of the traffic load, which is estimated via moment matching. The estimated number of RACH-attempting IoT devices in the $t$th TTI, $\hat{D}^t_{\text{RACH}}$, is expressed as
$$\hat{D}^t_{\text{RACH}} = \max\Big\{0,\; \hat{D}^{t-1}_{\text{RACH}} + \max\big\{-f^{t-1}_{\text{Prea},0},\; \hat{D}^t_{\text{RACH}} - \hat{D}^{t-1}_{\text{RACH}}\big\}\Big\}, \qquad (12)$$
where $f^{t-1}_{\text{Prea},0}$ is the number of allocated preambles in the $(t-1)$th TTI, and $\hat{D}^{t-1}_{\text{RACH}}$ is the estimated number of devices performing RACH attempts in the $(t-1)$th TTI, given as
$$\hat{D}^{t-1}_{\text{RACH}} = f^{t-1}_{\text{Prea},0} \Big/ \Big[\min\Big\{1,\; p^{t-1}_{\text{ACB}}\Big(1 + \big(V^{t-1}_{\text{cp},0} - u_{M,p^*}\big)\, e^{\frac{2}{f^{t-1}_{\text{Prea},0}}}\Big)^{-1}\Big\}\Big]. \qquad (13)$$
In Eq. (13), $p^{t-1}_{\text{ACB}}$, $f^{t-1}_{\text{Prea},0}$, and $V^{t-1}_{\text{cp},0}$ are the ACB factor, the number of preambles, and the observed number of collided preambles in the $(t-1)$th TTI, respectively, and $u_{M,p^*}$ is an estimated factor given in [5, Eq. (32)].

²The basic scheduling unit of NPUSCH is the resource unit (RU), which has two formats: NPUSCH format 1 (NPUSCH-1) with 16 REs for data transmission, and NPUSCH format 2 (NPUSCH-2) with 4 REs for carrying control information [3, 22].
In Eq. (12), $\hat{D}^t_{\text{RACH}} - \hat{D}^{t-1}_{\text{RACH}} \approx \hat{D}^{t-1}_{\text{RACH}} - \hat{D}^{t-2}_{\text{RACH}}$ is the difference between the estimated numbers of RACH-requesting IoT devices in the $(t-1)$th and the $t$th TTIs, which is obtained by assuming that the number of IoT devices with successful RACH does not change significantly over these two TTIs [5].
This dynamic control approach is designed for an ACB scheme, which is only triggered when the exact traffic load is larger than the number of preambles (i.e., $D^t_{\text{RACH}} > f^t_{\text{Prea},0}$). Accordingly, the related backlog estimation approach is only used when $D^t_{\text{RACH}} > f^t_{\text{Prea},0}$. However, it cannot estimate the load when $D^t_{\text{RACH}} < f^t_{\text{Prea},0}$, which is required in our problem.
B. Resource Configuration in Single Parameter Single CE Group Scenario
In this subsection, we modify the load estimation approach given in [5] by estimating the load based on the last number of collided preambles $V^{t-1}_{\text{cp},0}$ and the previous numbers of idle preambles $V^{t-1}_{\text{ip},0}, V^{t-2}_{\text{ip},0}, \cdots$. We then propose an uplink resource configuration approach based on this revised load estimation, namely, LE-URC.
1) Load Estimation: By definition, $\mathcal{F}_{\text{Prea}}$ is the set of valid numbers of preambles that the eNB can choose, where each IoT device selects a RACH preamble from $f^t_{\text{Prea},0}$ available preambles with an equal probability given by $1/f^t_{\text{Prea},0}$. For a given preamble $j$ transmitted to the eNB, let $d_j$ denote the number of IoT devices that select preamble $j$. The probability that no IoT device selects preamble $j$ is
$$\mathbb{P}\big\{d_j = 0 \mid D^{t-1}_{\text{RACH},0} = n\big\} = \Big(1 - \frac{1}{f^{t-1}_{\text{Prea},0}}\Big)^{n}. \qquad (14)$$
The expected number of idle preambles $\mathbb{E}\big\{V^{t-1}_{\text{ip},0} \mid D^{t-1}_{\text{RACH},0} = n\big\}$ in the $(t-1)$th TTI is given by
$$\mathbb{E}\big\{V^{t-1}_{\text{ip},0} \mid D^{t-1}_{\text{RACH},0} = n\big\} = \sum_{j=1}^{f^{t-1}_{\text{Prea},0}} \mathbb{P}\big\{d_j = 0 \mid D^{t-1}_{\text{RACH}} = n\big\} = f^{t-1}_{\text{Prea},0}\Big(1 - \frac{1}{f^{t-1}_{\text{Prea},0}}\Big)^{n}. \qquad (15)$$
Since the actual number of idle preambles $V^{t-1}_{\text{ip},0}$ can be observed at the eNB, the number of RACH-attempting IoT devices in the $(t-1)$th TTI, $\zeta^{t-1}$, can be estimated as
$$\zeta^{t-1} = f^{-1}\big(\mathbb{E}\{V^{t-1}_{\text{ip},0} \mid D^{t-1}_{\text{RACH},0}\}\big) = \log_{\left(\frac{f^{t-1}_{\text{Prea},0}-1}{f^{t-1}_{\text{Prea},0}}\right)}\left(\frac{V^{t-1}_{\text{ip},0}}{f^{t-1}_{\text{Prea},0}}\right). \qquad (16)$$
To obtain the estimated number of RACH-attempting IoT devices in the $t$th TTI, $\tilde{D}^t_{\text{RACH},0}$, we also need to know the difference between the estimated numbers of RACH-attempting IoT devices in the $(t-1)$th and the $t$th TTIs, denoted by $\delta^t$, where $\delta^t = \tilde{D}^t_{\text{RACH},0} - \tilde{D}^{t-1}_{\text{RACH},0}$ for $t = 1, 2, \cdots$, and $\tilde{D}^0_{\text{RACH},0} = 0$. However, $\tilde{D}^t_{\text{RACH},0}$ cannot be obtained before the $t$th TTI. To solve this, we can assume $\delta^t \approx \delta^{t-1}$ according to [5]. This is because the time between two consecutive TTIs is small and the available preambles are gradually updated, so that the number of IoT devices with successful RACH does not change significantly over these two TTIs [5]. Therefore, the number of RACH-attempting IoT devices in the $t$th TTI is estimated as
$$\tilde{D}^t_{\text{RACH},0} = \max\big\{2 V^{t-1}_{\text{cp},0},\; \zeta^{t-1} + \delta^{t-1}\big\}, \qquad (17)$$
where $2 V^{t-1}_{\text{cp},0}$ accounts for the fact that at least $2 V^{t-1}_{\text{cp},0}$ IoT devices collided in the last TTI.
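The load estimation of Eqs. (16)-(17) can be sketched as follows, assuming the eNB has observed the idle and collided preamble counts of the previous TTI; the numeric example values are assumptions.

```python
# Minimal sketch of the revised load estimation of Eqs. (14)-(17).
import math

def estimate_load(v_ip_prev, v_cp_prev, f_prea_prev, delta_prev):
    """Return (D_tilde_t, zeta_prev): estimated RACH load for TTI t and for TTI t-1."""
    base = (f_prea_prev - 1) / f_prea_prev
    ratio = max(v_ip_prev, 1) / f_prea_prev            # avoid log(0) when no preamble is idle
    zeta_prev = math.log(ratio, base)                  # Eq. (16): invert E[idle preambles]
    d_est = max(2 * v_cp_prev, zeta_prev + delta_prev) # Eq. (17)
    return d_est, zeta_prev

# Example: 12 of 48 preambles idle, 10 collided, load previously growing by 5 devices/TTI.
print(estimate_load(v_ip_prev=12, v_cp_prev=10, f_prea_prev=48, delta_prev=5.0))
```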
2) Uplink Resource Configuration Based on Load Estimation: In the following, we propose LE-URC by taking into account the resource condition given in Eq. (10). The number of RACH periods $n_{\text{Rach},0}$ and the repetition value $n_{\text{Repe},0}$ are fixed, and only the number of preambles in each RACH period $f_{\text{Prea},0}$ is dynamically configured in each TTI. Using the estimated number of RACH-attempting IoT devices in the $t$th TTI $\tilde{D}^t_{\text{RACH},0}$, the probability that only one IoT device selects preamble $j$ (i.e., no collision occurs) is expressed as
$$\mathbb{P}\big\{d_j = 1 \mid \tilde{D}^t_{\text{RACH},0} = n\big\} = \binom{n}{1}\frac{1}{f^t_{\text{Prea},0}}\Big(1 - \frac{1}{f^t_{\text{Prea},0}}\Big)^{n-1}. \qquad (18)$$
The expected number of IoT devices with successful RACH in the $t$th TTI is derived as
$$\mathbb{E}\big\{V^t_{\text{RACH},0} \mid \tilde{D}^t_{\text{RACH},0} = n\big\} = \sum_{j=1}^{f^t_{\text{Prea},0}} \mathbb{P}\big\{d_j = 1 \mid \tilde{D}^t_{\text{RACH},0} = n\big\} = n\Big(1 - \frac{1}{f^t_{\text{Prea},0}}\Big)^{n-1}, \qquad (19)$$
Based on (19), the expected number of IoT devices requesting uplink resource in the $t$th TTI is derived as
$$\mathbb{E}\big\{V^t_{\text{reqs}} \mid \tilde{D}^t_{\text{RACH},0} = n\big\} = \mathbb{E}\big\{V^t_{\text{RACH},0} \mid \tilde{D}^t_{\text{RACH},0} = n\big\} + V^{t-1}_{\text{un},0} = n\Big(1 - \frac{1}{f^t_{\text{Prea},0}}\Big)^{n-1} + V^{t-1}_{\text{un},0}, \qquad (20)$$
where $V^{t-1}_{\text{un},0}$ is the number of unscheduled IoT devices in the last TTI. Note that $V^{t-1}_{\text{un},0}$ can be observed.
However, if the data resource is not sufficient (i.e., when Eq. (10) does not hold), some IoT devices may not be scheduled in the $t$th TTI. The upper bound on the number of scheduled IoT devices $V^t_{\text{up},0}$ is expressed as
$$V^t_{\text{up},0} = \frac{R^t_{\text{DATA}}}{r^t_{\text{DATA},i}} = \frac{R_{\text{Uplink}} - R^t_{\text{RACH}}}{r^t_{\text{DATA},i}}, \qquad (21)$$
where $R_{\text{Uplink}}$ is the total number of REs reserved for uplink transmission in a TTI, $R^t_{\text{RACH}}$ is the uplink resource configured for RACH in the $t$th TTI, and $r^t_{\text{DATA},0}$ is the number of REs required for serving one IoT device, given in Eq. (10).
According to (20) and (21), the expected number of successfully served IoT devices is given by
$$V^t_{\text{suss}}(f^t_{\text{Prea},0}) = \min\big\{\mathbb{E}\{V^t_{\text{reqs}} \mid \tilde{D}^t_{\text{RACH},0} = n\},\; V^t_{\text{up},0}\big\}. \qquad (22)$$
The maximal expected number of successfully served IoT devices is obtained by selecting the number of preambles $f^{t*}_{\text{Prea},0}$ using
$$f^{t*}_{\text{Prea},0} = \underset{f \in \mathcal{F}_{\text{Prea}}}{\arg\max}\; V^t_{\text{suss}}(f). \qquad (23)$$
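The selection rule of Eqs. (18)-(23) amounts to a one-dimensional search over the candidate numbers of preambles, as in the following sketch; all numeric inputs are assumed for illustration.

```python
# Minimal sketch of LE-URC: for each candidate number of preambles, compare the expected
# number of devices requesting data resource with the scheduling upper bound, and pick
# the candidate maximizing the minimum of the two (Eqs. (19)-(23)).
def le_urc_select(d_est, v_un_prev, preamble_set, r_uplink, b_rach, b_data,
                  n_rach=1, n_repe=1):
    best_f, best_v = None, -1.0
    for f in preamble_set:
        e_success = d_est * (1.0 - 1.0 / f) ** (d_est - 1)        # Eq. (19)
        e_requests = e_success + v_un_prev                        # Eq. (20)
        r_rach = b_rach * n_rach * n_repe * f                     # REs spent on RACH
        v_upper = (r_uplink - r_rach) / (b_data * n_repe)         # Eq. (21)
        v_suss = min(e_requests, max(v_upper, 0.0))               # Eq. (22)
        if v_suss > best_v:
            best_f, best_v = f, v_suss                            # Eq. (23): argmax
    return best_f, best_v

print(le_urc_select(d_est=70.9, v_un_prev=4, preamble_set=[12, 24, 36, 48],
                    r_uplink=1920, b_rach=4, b_data=20))
```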
The LE-URC approach based on the estimated load $\tilde{D}^t_{\text{RACH},0}$ is detailed in Algorithm 1. For comparison, we consider an ideal scenario in which the actual number of RACH-requesting IoT devices $D^t_{\text{RACH}}$ is available at the eNB, namely, Full State Information based URC (FSI-URC). FSI-URC configures $f^{t*}_{\text{Prea},0}$ still using the approach given in Eq. (23), while the load estimation approach given in Section III-B.1) is not required.
Algorithm 1: Load Estimation Based Uplink Resource Configuration (LE-URC)
Input: The set of numbers of preambles in each RACH period $\mathcal{F}_{\text{Prea},0}$, number of IoT devices $D$, operation iterations $I$.
1: for Iteration = 1 to $I$ do
2:   Initialize $V^0_{\text{ip},0} := 12$, $V^0_{\text{cp},0} := 0$, $\tilde{D}^0_{\text{RACH},0} := 0$, $\delta^1 := 0$, and the bursty traffic arrival rate $\mu^0_{\text{bursty}} = 0$;
3:   for $t$ = 1 to $T$ do
4:     Generate $\mu^t_{\text{bursty}}$ using Eq. (3);
5:     The eNB observes $V^{t-1}_{\text{ip},0}$ and $V^{t-1}_{\text{cp},0}$, and calculates $\zeta^{t-1}$ using Eq. (16);
6:     Estimate the number of RACH-requesting IoT devices $\tilde{D}^t_{\text{RACH},0}$ using Eq. (17);
7:     Select the number of preambles $f^{t*}_{\text{Prea},0}$ using Eq. (23) based on the estimated load $\tilde{D}^t_{\text{RACH},0}$;
8:     The eNB broadcasts $f^{t*}_{\text{Prea},0}$, and backlogged IoT devices attempt communication in the $t$th TTI;
9:     Update $\delta^{t+1} := \tilde{D}^t_{\text{RACH},0} - \tilde{D}^{t-1}_{\text{RACH},0}$.
10:   end
11: end
3) LE-URC for Multiple CE Groups: We slightly revise the single-parameter single-group LE-URC approach (given in Section III-B) to dynamically configure resource for multiple CE groups. Note that the repetition value $n_{\text{Repe},i}$ in the LE-URC approach is still kept constant to preserve the validity of the load estimation in Eq. (17). Recall that the principle of the LE-URC approach is to optimize the expected number of successfully served IoT devices while balancing $R^t_{\text{RACH}}$ and $R^t_{\text{DATA}}$ under the limited uplink resource $R_{\text{Uplink}} = R^t_{\text{DATA}} + R^t_{\text{RACH}}$. In the multiple CE group scenario, the resource $R^t_{\text{DATA}}$ is allocated to IoT devices in any CE group without bias, whereas $R^t_{\text{RACH}}$ is specifically allocated to each CE group.

Under this condition, the expected number of successfully served IoT devices $V^t_{\text{suss},i}$ given in Eq. (22) needs to be modified to take into account multiple variables, which makes the problem non-convex and considerably complicates the optimization. To solve it, we use a sub-optimal solution by artificially setting an uplink resource constraint $R_{\text{Uplink},i}$ for each CE group ($R_{\text{Uplink}} = \sum_{i=0}^{2} R_{\text{Uplink},i}$). Each CE group can then independently allocate the resource between $R^t_{\text{DATA},i}$ and $R^t_{\text{RACH},i}$ according to the approach given in Eq. (23).
IV. Q-LEARNING BASED RESOURCE CONFIGURATION IN SINGLE-PARAMETER SINGLE-GROUP
SCENARIO
RL approaches are well known for addressing dynamic control problems in complex POMDPs [31]. Nevertheless, they have rarely been studied for handling the resource configuration in slotted-Aloha based wireless communication systems. Therefore, it is worthwhile to first evaluate the capability of RL in the single-parameter single-group scenario, in order to compare it with conventional heuristic approaches. In this section, we consider a single CE group with a fixed number of RACH periods $n_{\text{Rach},0}$ and a fixed repetition value $n_{\text{Repe},0}$, and only the number of preambles $f_{\text{Prea},0}$ is dynamically configured at the beginning of each TTI. In the following, we first study tabular-Q based on the tabular representation of the value function, which is the simplest form of Q-learning with guaranteed convergence [31], but requires an extremely long training time. We then study Q-learning with function approximators to improve training efficiency, where LA-Q and DQN are used to construct an approximation of the desired value function.
A. Q-Learning and Tabular Value Function
Considering a Q-agent deployed at the eNB to optimize the number of successfully served IoT devices in real time, the Q-agent needs to explore the environment in order to choose appropriate actions that progressively lead to the optimization goal. We define $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $r \in \mathcal{R}$ as any state, action, and reward from their corresponding sets, respectively. At the beginning of the $t$th TTI ($t \in \{0, 1, 2, \cdots\}$), the Q-agent first observes the current state $S^t$, corresponding to a set of previous observations ($O^t = \{U^{t-1}, U^{t-2}, \cdots, U^1\}$), in order to select a specific action $A^t \in \mathcal{A}(S^t)$. The action $A^t$ corresponds to the number of preambles in each RACH period $f^t_{\text{Prea},0}$ in the single CE group scenario.

As shown in Fig. 3, we consider a basic state function in the single CE group scenario, where $S^t$ is a set of indices mapped to the current observed information $U^{t-1} = [V^{t-1}_{\text{su},0}, V^{t-1}_{\text{un},0}, V^{t-1}_{\text{cp},0}, V^{t-1}_{\text{sp},0}, V^{t-1}_{\text{ip},0}]$. With the knowledge of the state $S^t$, the Q-agent chooses an action $A^t$ from the set $\mathcal{A}$, which is a set of indices mapped to the set of numbers of available preambles $\mathcal{F}_{\text{Prea}}$. Once an action $A^t$ is performed, the Q-agent receives a scalar reward $R^{t+1}$ and observes a new state $S^{t+1}$. The reward $R^{t+1}$ indicates to what extent the executed action $A^t$ achieves the optimization goal, and is determined by the newly observed state $S^{t+1}$.
[Fig. 3: The tabular-Q agent and environment interaction in the POMDP. The Actor selects either a random action (with probability $\epsilon$) or the action maximizing $Q(S^t, a)$; the environment executes the communication procedures of Fig. 2 and returns the observations $U^t = [V^t_{\text{su},0}, V^t_{\text{un},0}, V^t_{\text{cp},0}, V^t_{\text{sp},0}, V^t_{\text{ip},0}]$, the reward $R^{t+1} = V^t_{\text{su}}/c_{\text{su}}$, and the next state $S^{t+1}$; the Learner updates the Q-value table using Eq. (25).]
As the optimization goal is to maximize the number of successfully served IoT devices, we define the reward $R^{t+1}$ as a function positively proportional to the observed number of successfully served IoT devices $V^t_{\text{su}} \in O^t$, given by
$$R^{t+1} = V^t_{\text{su}} / c_{\text{su}}, \qquad (24)$$
where $c_{\text{su}}$ is a constant used to normalize the reward function.
Q-learning is a value-based RL approach [31, 32], where the policy mapping states to actions, $\pi(s) = a$, is learned using a state-action value function $Q(s, a)$ that determines an action for the state $s$. We first use a lookup table to represent the state-action value function $Q(s, a)$ (tabular-Q), which consists of value scalars for all states and actions. To obtain an action $A^t$, we select the highest value scalar from the numerical value vector $Q(S^t, a)$, which maps all possible actions under $S^t$ to the Q-value table $Q(s, a)$. Accordingly, our objective is to find an optimal Q-value table $Q^*(s, a)$ with an optimal policy $\pi^*$ that selects actions dynamically optimizing the number of served IoT devices. To do so, we train an initial Q-value table $Q(s, a)$ in the environment using the Q-learning algorithm, where $Q(s, a)$ is immediately updated using the currently observed reward $R^{t+1}$ after each action as
$$Q(S^t, A^t) = Q(S^t, A^t) + \lambda\Big[R^{t+1} + \gamma \max_{a \in \mathcal{A}} Q(S^{t+1}, a) - Q(S^t, A^t)\Big], \qquad (25)$$
where $\lambda$ is a constant step-size learning rate that affects how fast the algorithm adapts to a new environment, $\gamma \in [0, 1)$ is the discount rate that determines how future rewards affect the value function update, and $\max_{a \in \mathcal{A}} Q(S^{t+1}, a)$ approximates the value in the optimal Q-value table $Q^*(s, a)$ via the up-to-date Q-value table $Q(s, a)$ and the newly obtained state $S^{t+1}$. Note that $Q(S^t, A^t)$ in Eq. (25) is a scalar, which means that we can only update one value scalar in the Q-value table $Q(s, a)$ with one received reward $R^{t+1}$.
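A minimal sketch of the update in Eq. (25), with a dictionary-based table and assumed hyperparameter values, is given below.

```python
# Minimal sketch of the tabular-Q update in Eq. (25): a single scalar entry Q(S_t, A_t)
# is moved toward the bootstrapped target. Hyperparameters lam and gamma are assumed.
from collections import defaultdict

q_table = defaultdict(float)          # Q(s, a) initialized to 0 for unseen (state, action)

def q_update(s, a, reward, s_next, actions, lam=0.01, gamma=0.5):
    target = reward + gamma * max(q_table[(s_next, a2)] for a2 in actions)
    q_table[(s, a)] += lam * (target - q_table[(s, a)])

# Example: states are tuples of quantized observations, actions index F_Prea.
actions = range(4)
q_update(s=(3, 1, 0, 2, 5), a=2, reward=0.8, s_next=(4, 0, 1, 3, 2), actions=actions)
print(q_table[((3, 1, 0, 2, 5), 2)])
```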
As shown in Fig. 3, we consider the $\epsilon$-greedy approach to balance exploitation and exploration in the Actor of the Q-agent, where $\epsilon$ is a positive real number with $\epsilon \le 1$. In each TTI $t$, the Q-agent randomly generates a probability $p^t_{\epsilon}$ to compare with $\epsilon$. Then, with probability $\epsilon$, the algorithm randomly chooses an action from the remaining feasible actions to improve its estimate of the non-greedy actions' values. With probability $1 - \epsilon$, the algorithm exploits the current knowledge of the Q-value table to choose the action that maximizes the expected reward.

In particular, the learning rate $\lambda$ is suggested to be set to a small number (e.g., $\lambda = 0.01$) to guarantee stable convergence of the Q-value table in this NB-IoT communication system. This is because a single reward in a specific TTI can be severely biased, since the state function is composed of multiple pieces of unobserved information with unpredictable distributions (e.g., an action may allow a setting with a large number of preambles $f^t_{\text{Prea}}$, yet massive random collisions may accidentally occur, leading to an unusually low reward). The implementation of uplink resource configuration using tabular-Q based real-time optimization is shown in Algorithm 2.
Algorithm 2: Tabular-Q Based Uplink Resource Configuration
Input: Valid numbers of preambles $\mathcal{F}_{\text{Prea}}$, number of IoT devices $D$, operation iterations $I$.
1: Algorithm hyperparameters: learning rate $\lambda \in (0, 1]$, discount rate $\gamma \in [0, 1)$, $\epsilon$-greedy rate $\epsilon \in (0, 1]$;
2: Initialize the Q-value table $Q(s, a)$ with zero value scalars;
3: for Iteration = 1 to $I$ do
4:   Initialize $S^1$ by executing a random action $A^0$, and set the bursty traffic arrival rate $\mu^0_{\text{bursty}} = 0$;
5:   for $t$ = 1 to $T$ do
6:     Update $\mu^t_{\text{bursty}}$ using Eq. (3);
7:     if $p^t_{\epsilon} < \epsilon$ then select a random action $A^t$ from $\mathcal{A}$;
8:     else select $A^t = \underset{a \in \mathcal{A}}{\arg\max}\, Q(S^t, a)$;
9:     The eNB broadcasts $f^t_{\text{Prea}} = \mathcal{F}_{\text{Prea}}(A^t)$ and backlogged IoT devices attempt communication in the $t$th TTI;
10:     The eNB observes $S^{t+1}$, calculates the related $R^{t+1}$ using Eq. (24), and updates $Q(S^t, A^t)$ using Eq. (25).
11:   end
12: end
B. Value Function Approximation
Since tabular-Q requires each of its elements to be updated in order to converge, searching for an optimal policy can be difficult with limited time and computational resource. To solve this problem, we use a value function approximator instead of a Q-value table to find a sub-optimal approximated policy. Generally, selecting an efficient approximation approach to represent the value function for different learning scenarios is a common problem in RL [31, 33–35]. A variety of function approximation approaches can be applied, such as LA, DNNs, and tree search, and the choice of approach can critically influence the success of learning [31, 34, 35]. The function approximation should fit the complexity of the desired value function and be efficient in obtaining good solutions. Unfortunately, most function approximation approaches require a specific design for different learning problems, and there is no basis function that is both reliable and efficient enough to suit all learning problems.
In this subsection, we first focus on linear function approximation for Q-learning, due to its simplicity, efficiency, and guaranteed convergence [31, 36, 37]. We then employ a DNN for Q-learning as a more effective but more complicated function approximator, which is also known as DQN [32]. The reasons we adopt DQN are that: 1) DNN function approximation is able to deal with several kinds of partially observable problems [31, 32]; 2) DQN has the potential to accurately approximate the desired value function while addressing a problem with very large state spaces [32], which is favorable for learning in the multiple CE group scenario; 3) DQN has high scalability, where the scale of its value function can easily be fitted to a more complicated problem; and 4) a variety of libraries have been established to facilitate building DNN architectures and accelerating experiments, such as TensorFlow, PyTorch, Theano, and Keras.
1) Linear Approximation: LA-Q uses a linear weight matrix $w$ to approximate the value function $Q(s, a)$ with a feature vector $\vec{x} = x(s)$ corresponding to the state $S^t$. The dimension of the weight matrix $w$ is $|\mathcal{A}| \times |\vec{x}|$, where $|\mathcal{A}|$ is the total number of available actions and $|\vec{x}|$ is the size of the feature vector $\vec{x}$. Here, we consider polynomial regression (as in [31, Eq. 9.17]) to construct the real-valued feature vector $x(s)$ due to its efficiency³. In the training process, the exploration is the same as in tabular Q-learning by generating random actions, but the exploitation is calculated using the weight matrix $w$ of the value function. In detail, to predict an action using the LA value function $Q(S^t, a, w)$ with state $S^t$ in the $t$th TTI, the approximated value function scalar for each action $a$ is obtained as the inner product between the weight matrix $w$ and the feature vector $x(s)$:
$$Q(S^t, a, w) = w \cdot x(S^t)^{\mathrm{T}} = \Big[\sum_{j=0}^{|\vec{x}|-1} w_{(0,j)} x_j(S^t),\; \sum_{j=0}^{|\vec{x}|-1} w_{(1,j)} x_j(S^t),\; \cdots,\; \sum_{j=0}^{|\vec{x}|-1} w_{(|\mathcal{A}|-1,j)} x_j(S^t)\Big]^{\mathrm{T}}. \qquad (26)$$
By searching for the maximal value function scalar in $Q(S^t, a, w)$ given in Eq. (26), we can obtain the matching action $A^t$ that maximizes future rewards. To obtain the optimal policy, we update the weight matrix $w$ in the value function $Q(s, a; w)$ using Stochastic Gradient Descent (SGD) [31, 39]. SGD minimizes the prediction error on each observed example, where the error is reduced by a small amount in the direction of the optimal target policy $Q^*(s, a)$. As it is infeasible to obtain the optimal target policy by summing over all states, we instead estimate
³The polynomial case is the best-understood feature constructor and generally performs well in practice with an appropriate setting [31, 33]. Furthermore, the results in [38] show that there is a rough correspondence between a fitted neural network and a fitted ordinary parametric polynomial regression model. These reasons encourage us to compare the polynomial-based LA-Q with DQN.
the desired action-value function by considering just one learning sample, $Q^*(s, a) \approx Q^*(S^t, a, w^t)$ [31]. In each TTI, the weight matrix $w$ is updated following
$$w^{t+1} = w^t - \lambda \nabla L(w^t), \qquad (27)$$
where $\lambda$ is the learning rate and $\nabla L(w^t)$ is the gradient of the loss function $L(w^t)$ used to train the Q-function approximator. This is given as
$$\nabla L(w^t) = \Big[R^{t+1} + \gamma \max_{a} Q(S^{t+1}, a; w^t) - Q(S^t, a, w^t)\Big] \cdot x(A^t, S^t)^{\mathrm{T}} \cdot \nabla_w Q(S^t, A^t, w^t), \qquad (28)$$
where $w^t$ is the weight matrix and $x(A^t, S^t)$ is a feature matrix with the same shape as $w^t$; $x(A^t, S^t)$ is constructed from zeros, with the feature vector located in the row corresponding to the index of the action $A^t$ selected in the $t$th TTI. Note that $Q(S^{t+1}, a; w^t)$ is a scalar. The learning procedure follows Algorithm 2 by changing the Q-table $Q(s, a)$ to the LA value function $Q(s, a; w)$ with linear weight matrix $w$, and updating $Q(s, a; w)$ with SGD given in (28) in step 10 of Algorithm 2.
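The LA-Q update of Eqs. (26)-(28) can be sketched as follows; the polynomial feature construction shown is an assumed stand-in for the construction of [31, Eq. 9.17], and the hyperparameters are illustrative.

```python
# Minimal sketch of the LA-Q update: Q(s, a; w) = w x(s)^T, and only the row of w for the
# taken action receives a gradient step toward the bootstrapped target (Eqs. (26)-(28)).
import numpy as np

def poly_features(obs, degree=2):
    """Polynomial feature vector [1, o_i, o_i*o_j, ...] (assumed construction)."""
    o = np.asarray(obs, dtype=float)
    return np.concatenate(([1.0], o, np.outer(o, o)[np.triu_indices(len(o))]))

def la_q_update(w, obs, a, reward, obs_next, lam=1e-4, gamma=0.5):
    x, x_next = poly_features(obs), poly_features(obs_next)
    q_next = w @ x_next                                    # Eq. (26) for the next state
    td_error = reward + gamma * np.max(q_next) - w[a] @ x  # bootstrapped target minus Q(s,a;w)
    w[a] += lam * td_error * x                             # Eqs. (27)-(28): update one row only
    return w

n_actions, obs = 4, [0.3, 0.1, 0.2, 0.5, 0.4]              # normalized observations (assumed)
w = np.zeros((n_actions, poly_features(obs).size))
w = la_q_update(w, obs, a=1, reward=0.8, obs_next=[0.5, 0.0, 0.1, 0.6, 0.3])
print(w[1][:3])
```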
2) Deep Q-Network: The DQN agent parameterizes the action-state value function $Q(s, a)$ by a function $Q(s, a; \theta)$, where $\theta$ represents the weights of a DNN with multiple layers. We consider a conventional DNN, where neurons between two adjacent layers are fully pairwise connected, namely, fully-connected layers. The input of the DNN is given by the variables in state $S^t$; the intermediate hidden layers are Rectified Linear Units (ReLUs) using the function $f(x) = \max(0, x)$; and the output layer is composed of linear units⁴, which are in one-to-one correspondence with all available actions in $\mathcal{A}$.
[Fig. 4: The DQN agent and environment interaction in the POMDP. The Actor selects either a random action (with probability $\epsilon$) or the action maximizing $Q(S^t, a, \theta)$ given by the primary Q-network; the environment executes the communication procedures of Fig. 2 and returns the observations $U^t$, the reward $R^{t+1} = V^t_{\text{su}}/c_{\text{su}}$, and the next state $S^{t+1}$; transitions $(S^t, A^t, R^{t+1}, S^{t+1})$ are stored in the replay memory $M_r$, from which the Learner samples minibatches to update the primary Q-network $\theta$ via SGD on the loss $L_{\text{DDQN}}(\theta)$, using a periodically synchronized target Q-network $\bar{\theta}$.]
The exploitation is obtained by performing forward propagation of the Q-function $Q(s, a; \theta)$ with respect to the observed state $S^t$. The weight matrix $\theta$ is updated online along each training episode by using double deep Q-learning (DDQN) [40], which to some extent reduces the substantial overestimation⁵ of the value function.

⁴Linear activation is used here according to [32]. Note that Q-learning is value-based; thus the desired value function given in Eq. (25) can be larger than 1, rather than being a probability, and activation functions with return values limited to $[-1, 1]$ (such as the sigmoid and tanh functions) can therefore lead to convergence difficulty.
Algorithm 3: DQN Based Uplink Resource Configuration
Input: The set of numbers of preambles in each RACH period $\mathcal{F}_{\text{Prea}}$, number of IoT devices $D$, operation iterations $I$.
1: Algorithm hyperparameters: learning rate $\lambda \in (0, 1]$, discount rate $\gamma \in [0, 1)$, $\epsilon$-greedy rate $\epsilon \in (0, 1]$, target network update frequency $K$;
2: Initialize the replay memory $M$ to capacity $C$, the primary Q-network $\theta$, and the target Q-network $\bar{\theta}$;
3: for Iteration = 1 to $I$ do
4:   Initialize $S^1$ by executing a random action $A^0$, and set the bursty traffic arrival rate $\mu^0_{\text{bursty}} = 0$;
5:   for $t$ = 1 to $T$ do
6:     Update $\mu^t_{\text{bursty}}$ using Eq. (3);
7:     if $p_{\epsilon} < \epsilon$ then select a random action $A^t$ from $\mathcal{A}$;
8:     else select $A^t = \underset{a \in \mathcal{A}}{\arg\max}\, Q(S^t, a, \theta)$;
9:     The eNB broadcasts $\mathcal{F}_{\text{Prea}}(A^t)$ and backlogged IoT devices attempt communication in the $t$th TTI;
10:     The eNB observes $S^{t+1}$ and calculates the related $R^{t+1}$ using Eq. (24);
11:     Store the transition $(S^t, A^t, R^{t+1}, S^{t+1})$ in the replay memory $M$;
12:     Sample a random minibatch of transitions $(S^j, A^j, R^{j+1}, S^{j+1})$ from the replay memory $M$;
13:     Perform a gradient descent step for $Q(s, a; \theta)$ using Eq. (30);
14:     Every $K$ steps, update the target Q-network $\bar{\theta} = \theta$.
15:   end
16: end
Accordingly, learning takes place over multiple training episodes, each of duration $N_{\text{TTI}}$ TTI periods. In each TTI, the parameter $\theta$ of the Q-function approximator $Q(s, a; \theta)$ is updated using SGD as
$$\theta^{t+1} = \theta^t - \lambda_{\text{RMS}} \nabla L_{\text{DDQN}}(\theta^t), \qquad (29)$$
where $\lambda_{\text{RMS}}$ is the RMSProp learning rate [41] and $\nabla L_{\text{DDQN}}(\theta^t)$ is the gradient of the loss function $L_{\text{DDQN}}(\theta^t)$ used to train the Q-function approximator. This is given as
$$\nabla L_{\text{DDQN}}(\theta^t) = \mathbb{E}_{S^i, A^i, R^{i+1}, S^{i+1}}\Big[\big(R^{i+1} + \gamma \max_{a} Q(S^{i+1}, a; \bar{\theta}^t) - Q(S^i, A^i; \theta^t)\big)\, \nabla_{\theta} Q(S^i, A^i; \theta^t)\Big], \qquad (30)$$
where the expectation is taken with respect to a so-called minibatch of randomly selected previous samples $(S^i, A^i, S^{i+1}, R^{i+1})$ for some $i \in \{t - M_r, \ldots, t\}$, with $M_r$ being the replay memory size [32]. When $t - M_r$ is negative, this is interpreted as including samples from the previous episode. The use of a minibatch, instead of a single sample, to update the value function $Q(s, a; \theta)$ improves the convergence reliability of the value function [32]. Furthermore, following DDQN [40], $\bar{\theta}^t$ in (30) is a so-called target Q-network that is used to estimate the future value of the Q-function in the update rule. This parameter is periodically copied from the current value $\theta^t$ and kept fixed for a number of episodes [40].

⁵Overestimation refers to the phenomenon that some suboptimal actions are regularly given higher Q-values than optimal actions, which can negatively influence the convergence capability and training efficiency of the algorithm [34, 40].
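A minimal PyTorch sketch of one such update step is given below. The layer sizes, hyperparameters, and dummy transitions (standing in for the NB-IoT simulator) are assumptions for illustration only.

```python
# Minimal sketch of one gradient step on the loss whose gradient is Eq. (30), using a
# replay memory and a periodically synchronized target network as in Algorithm 3.
import random
import torch
import torch.nn as nn

def make_q_net(state_dim, n_actions):
    # Fully connected ReLU hidden layers with a linear output (Section IV-B.2)).
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

state_dim, n_actions, gamma = 10, 8, 0.5
q_net, target_net = make_q_net(state_dim, n_actions), make_q_net(state_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-4)   # lambda_RMS in Eq. (29)
memory = [(torch.randn(state_dim), random.randrange(n_actions),
           random.random(), torch.randn(state_dim)) for _ in range(200)]  # dummy transitions

batch = random.sample(memory, 32)
s = torch.stack([b[0] for b in batch])
a = torch.tensor([b[1] for b in batch])
r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
s_next = torch.stack([b[3] for b in batch])
with torch.no_grad():
    target = r + gamma * target_net(s_next).max(dim=1).values   # bootstrapped target, Eq. (30)
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(S^i, A^i; theta)
loss = nn.functional.mse_loss(q_sa, target)
optimizer.zero_grad(); loss.backward(); optimizer.step()        # SGD step of Eq. (29)
print(float(loss))
```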
V. Q-LEARNING BASED RESOURCE CONFIGURATION IN MULTI-PARAMETER MULTI-GROUP
SCENARIO
Practically, NB-IoT is always deployed with multiple CE groups to serve IoT devices with various coverage requirements. In this section, we study the problem (1) of optimizing the resource configuration for three CE groups, each with parameters $A^t = \{n^t_{\text{Rach},i}, f^t_{\text{Prea},i}, n^t_{\text{Repe},i}\}_{i=0}^{2}$. This joint optimization of every parameter in every CE group can improve the overall data access and transmission performance. Note that all CE groups share the uplink resource in the same bandwidth, and the eNB schedules data resource to all RRC-connected IoT devices without CE group bias, as introduced in Section II-B.4). To optimize the number of served IoT devices in real time, the eNB should not only balance the uplink resource between RACH and data, but also balance it among the CE groups.
The Q-learning algorithms for the single CE group provided in Section IV are model-free, and thus their learning structure can be directly used in this multi-parameter multi-group scenario. However, considering multiple CE groups enlarges the observation space, which exponentially increases the size of the state space. Training a Q-agent over this expanded space greatly increases the required time and computational resource. In such a case, tabular-Q would be extremely inefficient, as not only does the state-action value table require a large memory, but it is also impossible to repeatedly experience every state to achieve convergence within a limited time. In view of this, we only study Q-learning with value function approximation (LA-Q and DQN) to design uplink resource configuration approaches for the multi-parameter multi-group scenario.

LA-Q and DQN have a high capability of handling massive state spaces, and thus we can considerably enrich the state space with more observed information to support the optimization of the Q-agent. Here, we define the current state $S^t$ to include information about the last $M_o$ TTIs $(U^{t-1}, U^{t-2}, U^{t-3}, \cdots, U^{t-M_o})$. This design improves the Q-agent by enabling it to estimate the trend of the traffic. As our goal is to optimize the number of served IoT devices, the reward function should be defined according to the number of successfully served IoT devices $V_{\text{su},i}$ of each CE group, which is expressed as
$$R^{t+1} = \sum_{i=0}^{2} V^t_{\text{su},i} / c_{\text{su}}. \qquad (31)$$
Like the state space, the available action space also increases exponentially with the number of adjustable configurations. The number of available actions corresponds to the number of possible combinations of configurations, $|\mathcal{A}| = \prod_{i=0}^{2} \big(|\mathcal{N}_{\text{Rach},i}| \times |\mathcal{N}_{\text{Repe},i}| \times |\mathcal{F}_{\text{Prea},i}|\big)$, where $|\cdot|$ denotes the number of elements in a vector, $\mathcal{A}$ is the set of actions, and $\mathcal{N}_{\text{Rach},i}$, $\mathcal{N}_{\text{Repe},i}$, and $\mathcal{F}_{\text{Prea},i}$ are the sets of the numbers of RACH periods, the repetition values, and the numbers of preambles in each RACH period, respectively. Unfortunately, it is extremely hard to optimize the system over such a large action space (i.e., $|\mathcal{A}|$ can exceed fifty thousand), because the system would update its policy using only a small part of the actions in $\mathcal{A}$, which finally leads to convergence difficulty. To solve this problem, we provide two approaches that reduce the dimension of the action space to enable LA-Q and DQN in the multi-parameter multi-group scenario.
A. Actions Aggregated Approach
We first provide AA-based Q-learning approaches, which guarantee convergence capability by sacrificing the accuracy of action selection⁶. In detail, the selection of a specific action is converted to the selection of an increasing or decreasing trend. Instead of selecting exact values from the sets $\mathcal{N}_{\text{Rach},i}$, $\mathcal{N}_{\text{Repe},i}$, and $\mathcal{F}_{\text{Prea},i}$, we convert the selection to a single-step ascent/descent based on the last action, represented by $A^t_{\text{Rach},i} \in \{0, 1\}$, $A^t_{\text{Repe},i} \in \{0, 1\}$, and $A^t_{\text{Prea},i} \in \{0, 1\}$ for the number of RACH periods $n^t_{\text{Rach},i}$, the repetition value $n^t_{\text{Repe},i}$, and the number of preambles in each RACH period $f^t_{\text{Prea},i}$ in the $t$th TTI. Consequently, the size of the total action space for the three CE groups is reduced to $|\mathcal{A}| = 2^9 = 512$; an example of this encoding is sketched below. The algorithms for training with the LA function approximator and DQN in the multi-parameter multi-group scenario can then be deployed following Algorithm 2 and Algorithm 3, respectively.
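The following sketch illustrates the single-step ascent/descent encoding; the candidate sets are assumed for illustration and are not taken from the paper.

```python
# Minimal sketch of the action-aggregation mapping of Section V-A: each of the nine
# binary decisions moves one configuration parameter a single step up or down its list.
N_RACH = [1, 2, 4]               # candidate numbers of RACH periods per TTI (assumed)
N_REPE = [1, 2, 4, 8, 16, 32]    # candidate repetition values (assumed)
F_PREA = [12, 24, 36, 48]        # candidate numbers of preambles per RACH period (assumed)

def aa_step(index, bit, candidates):
    """Move one step up (bit = 1) or down (bit = 0) within the candidate list."""
    return min(index + 1, len(candidates) - 1) if bit else max(index - 1, 0)

def apply_aa_action(config_idx, bits):
    """config_idx: dict of current indices per CE group; bits: 9 binary decisions."""
    sets = (N_RACH, N_REPE, F_PREA)
    return {i: [aa_step(config_idx[i][p], bits[3 * i + p], sets[p]) for p in range(3)]
            for i in range(3)}

idx = {0: [1, 0, 3], 1: [0, 2, 1], 2: [0, 4, 0]}
print(apply_aa_action(idx, bits=[1, 1, 0, 0, 1, 1, 1, 0, 1]))
# Each new index maps back to (n_Rach, n_Repe, f_Prea), e.g. group 0 -> (4, 2, 36).
```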
B. Cooperative Multi-agent Learning Approach
Although the uplink resource configuration is managed by a central authority, treating the control of each parameter as one sub-task that is cooperatively handled by an independent Q-agent is sufficient to deal with an otherwise unsolvable action space [42]. As shown in Fig. 5, we consider multiple DQN agents centralized at the eNB, each with the same structure of value function approximator⁷, following Section IV-B.2). We break down the action space by considering nine separate action variables in $A^t$, where each DQN agent controls its own action variable as shown in Fig. 5. Recall that we have three variables for each group $i$, namely $n_{\text{Rach},i}$, $n_{\text{Repe},i}$, and $f_{\text{Prea},i}$.

We introduce a separate DQN agent for each output variable in $A^t$, defined as the action $A^t_k$ selected by the $k$th agent, where each $k$th agent is responsible for updating the value $Q(S^t, A^t_k; \theta_k)$ of action $A^t_k$ in the shared state $S^t$.
⁶Action aggregation has rarely been evaluated, but the same idea under the name of state aggregation has been well studied and is a basic function approximation approach [31].
⁷The structures of the value function approximators could also be specifically designed for RL agents with sub-tasks of significantly different complexity. However, there is no such requirement in our problem, so this is not considered.
Fig. 5: The CMA-DQN agents and environment interaction in the POMDP.
The DQN agents are trained in parallel and receive the same reward signal given in Eq. (31) at the
end of each TTI as per problem (1). The use of this common reward signal ensures that all DQN agents aim to cooperatively increase the objective in (1). Note that the approach can be interpreted as applying a factorization of the overall value function akin to the approach proposed in [43] for multi-agent systems.
The challenge of this approach is how to evaluate each action according to the common reward function. For each DQN agent, the received reward is corrupted by substantial noise, because its own effect on the reward is hidden among the effects of all the other DQN agents. For instance, a good action can receive a mismatched low reward due to the poor actions of other DQN agents. Fortunately, in our scenario, all DQN agents are centralized at the eNB, which means that they have full information about one another. Accordingly, we include the action selection histories of all DQN agents as part of the state8, so that each agent can learn how the reward is influenced by different combinations of actions. To do so, we define the state variable St as
$S_t = [A_{t-1}, U_{t-1}, A_{t-2}, U_{t-2}, \cdots, A_{t-M_o}, U_{t-M_o}]$,   (32)
where $M_o$ is the number of stored observations, $A_{t-1}$ is the set of actions selected by the DQN agents in the $(t-1)$th TTI, corresponding to nRach,i, nRepe,i, and fPrea,i for the ith CE group, and $U_{t-1}$ is the set of observed transmission receptions.
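As an illustration of Eq. (32), the sketch below stacks the last Mo action/observation pairs into the shared state vector; the dimensions are assumed values (nine configuration variables per action and the five per-group counters of Ut shown in Fig. 5), and warm-up padding before Mo TTIs have elapsed is omitted.

```python
from collections import deque

import numpy as np

M_O = 4            # number of stored (action, observation) pairs (assumed value)
ACTION_DIM = 9     # n_Rach,i, n_Repe,i, f_Prea,i for the three CE groups
OBS_DIM = 15       # V_su,i, V_un,i, V_cp,i, V_sp,i, V_ip,i for the three CE groups

history = deque(maxlen=M_O)     # keeps only the last M_O TTIs

def update_history(action_vec, obs_vec):
    """Store the (A_t, U_t) pair observed at the end of a TTI."""
    history.append((np.asarray(action_vec, dtype=np.float32),
                    np.asarray(obs_vec, dtype=np.float32)))

def build_state():
    """Concatenate [A_{t-1}, U_{t-1}, ..., A_{t-M_o}, U_{t-M_o}] into one vector."""
    parts = []
    for action_vec, obs_vec in reversed(history):   # most recent TTI first
        parts.extend([action_vec, obs_vec])
    return np.concatenate(parts)   # length M_O * (ACTION_DIM + OBS_DIM) = 96
```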
In each TTI, the parameters $\theta_k$ of the Q-function approximator $Q(S_t, A^t_k; \theta_k)$ are updated using SGD at every agent $k$ as in Eq. (29). The learning algorithm can be implemented following Algorithm 3. Different from the single-parameter single-group scenario, we first need to initialize nine primary networks $\theta_k$, target networks $\bar{\theta}_k$, and replay memories $M_k$, one set for each DQN agent. In step 11 of Algorithm 3, the current transaction of each DQN agent is stored in its own memory separately. In steps 12 and 13 of Algorithm 3, the minibatch of transactions is sampled separately from each memory to train the corresponding DQN agent.
8 The state can be designed to collect more information according to the complexity requirements, such as sharing the value functions between the DQN agents [42].
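The following PyTorch-style sketch shows one CMA-DQN interaction step under assumed class and dimension names; it illustrates the idea rather than reproducing Algorithm 3. All nine agents read the same state and the common reward of Eq. (31), but each keeps its own network, replay memory, and minibatch SGD update (the periodic target-network synchronization is omitted for brevity).

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, GAMMA, BATCH = 96, 0.5, 32   # 96 assumes M_o = 4 stored (A, U) pairs

class QNet(nn.Module):
    """Three hidden layers of 128 ReLU units; one Q-value per candidate action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, s):
        return self.net(s)

class Agent:
    """One DQN agent per configuration variable (nine agents in total)."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.q, self.q_target = QNet(n_actions), QNet(n_actions)
        self.q_target.load_state_dict(self.q.state_dict())
        self.opt = torch.optim.RMSprop(self.q.parameters(), lr=1e-4)
        self.memory = deque(maxlen=10000)   # each agent keeps its own replay memory

    def act(self, state, epsilon):
        if random.random() < epsilon:       # epsilon-greedy exploration
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.q(state).argmax())

    def update(self):
        if len(self.memory) < BATCH:
            return
        s, a, r, s2 = zip(*random.sample(self.memory, BATCH))
        s, s2 = torch.stack(s), torch.stack(s2)
        a = torch.tensor(a).unsqueeze(1)
        r = torch.tensor(r, dtype=torch.float32)
        q_sa = self.q(s).gather(1, a).squeeze(1)
        with torch.no_grad():               # bootstrapped one-step target
            target = r + GAMMA * self.q_target(s2).max(1).values
        loss = nn.functional.smooth_l1_loss(q_sa, target)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

# Three variables per CE group with 3, 6, and 4 candidate values (Table I).
agents = [Agent(n) for _ in range(3) for n in (3, 6, 4)]

def select_actions(state, epsilon):
    """Joint action A_t: one variable chosen by each of the nine agents."""
    return [agent.act(state, epsilon) for agent in agents]

def store_and_update(state, actions, reward, next_state):
    """Shared state and common reward, but separate memories and minibatches."""
    for agent, a in zip(agents, actions):
        agent.memory.append((state, a, reward, next_state))
        agent.update()
```

The hyperparameters used here (discount rate 0.5, RMSProp learning rate 0.0001, minibatch size 32, replay memory 10000) follow Table II.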
VI. SIMULATION RESULTS
In this section, we evaluate the performance of the proposed Q-learning approaches and compare them with the conventional LE-URC and FSI-URC approaches described in Sec. III via numerical experiments. We adopt the standard network parameters listed in Table I following [1, 3, 22, 25, 29], and the Q-learning hyperparameters listed in Table II. Accordingly, one epoch consists of 937 TTIs (i.e., 10 minutes). The RL agents will first be trained
in a so-called learning phase, and after convergence, their performance will be compared with LE-URC and
FSI-URC in a so-called testing phase. All testing performance results are obtained by averaging over 1000
episodes. In the following, we present our simulation results of the single-parameter single-group scenario
and the multi-parameter multi-group scenario in Section VI-A and Section VI-B, respectively.
TABLE I: Simulation Parameters
Parameter | Setting
Path-loss exponent η | 4
Noise power σ² | −138 dBm
eNB broadcast power PNPBCH | 35 dBm
Path-loss inverse power control threshold ρ | 120 dB
Maximal preamble transmit power PRACHmax | 23 dBm
Received SNR threshold γth | 0 dB
Duration of periodic traffic Tperiodic | 1 hour
TTI | 640 ms
Duration of bursty traffic Tbursty | 10 minutes
Set of numbers of preambles FPrea | {12, 24, 36, 48}
Maximum allowed resource requests γRRC | 5
Set of repetition values NRepe | {1, 2, 4, 8, 16, 32}
Maximum RACH attempts γpMax | 10
Set of numbers of RACH periods NRach | {1, 2, 4}
Maximum allowed RACH attempts in one CE group γpCE,i | 5
REs required for BRACH | 4
Bursty traffic parameter Beta(α, β) | (3, 4)
REs required for BDATA | 32
TABLE II: Q-learning Hyperparameters
Hyperparameter | Value
Learning rate λ for Tabular-Q and LA-Q | 0.01
RMSProp learning rate λRMS for DQN | 0.0001
Initial exploration | 1
Final exploration | 0.1
Discount rate γ | 0.5
Minibatch size | 32
Replay memory size | 10000
Target Q-network update frequency | 1000
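The exploration schedule connecting the initial and final exploration values in Table II can be sketched as below; the annealing horizon DECAY_TTIS is a hypothetical choice, since the exact schedule length is not restated here.

```python
EPS_START, EPS_END = 1.0, 0.1    # initial and final exploration (Table II)
DECAY_TTIS = 10_000              # assumed annealing horizon (not given in Table II)

def epsilon(t):
    """Linearly anneal the epsilon-greedy exploration rate, then hold it constant."""
    frac = min(t / DECAY_TTIS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)
```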
A. Single-Parameter Single-Group Scenario
In the single-parameter single-group scenario, the eNB is located at the center of a circular area with a 10
km radius, and the IoT devices are randomly located within the cell. We set the number of RACH periods
as nRach = 1, the repetition value as nRepe = 4, and the limited uplink resource as Ruplink = 1536 REs (i.e.,
32 slots with 48 sub-carriers). Unless otherwise stated, we consider the number of periodical IoT devices to
be Dperiodic = 10000, and the number of bursty IoT devices to be Dbursty = 5000. The DQN is set with three
hidden layers, each with 128 ReLU units. The Tabular-Q, LA-Q, and DQN approaches are proposed in Sec. IV.A, IV.B.1), and IV.B.2), respectively; the conventional LE-URC and FSI-URC approaches are described in Sec. III.B.
Fig. 6: The real-time traffic load and Vsu for FSI-URC, LE-URC, and DQN.
Fig. 7: Vsu and the average received reward for Tabular-Q, LA-Q, and DQN.
Throughout an epoch, each device has either a periodical traffic profile (i.e., the uniform distribution given in Eq. (2)) or a bursty traffic profile (i.e., the time-limited Beta profile defined in Eq. (4) with parameters (3, 4)) that peaks around the 400th TTI. The resulting average number of newly generated packets is shown as the dashed line in Fig. 6(a). Fig. 6(b) plots the number of successfully served IoT devices Vsu with the FSI-URC, LE-URC, and DQN approaches. In Fig. 6(b), Vsu first increases gradually with the rising traffic shown in Fig. 6(a), until it reaches the serving capacity of the eNB. Then, Vsu decreases slowly due to the increasing collisions and scheduling failures as the traffic keeps growing. After that, Vsu increases gradually again as the collisions and scheduling failures decrease with the declining traffic. Finally, Vsu decreases slowly as the traffic dies out.
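For reference, the bursty and periodic arrival processes can be sketched as follows, assuming the standard time-limited Beta traffic model (the exact forms of Eq. (2) and Eq. (4) are given earlier in the paper); the expected number of new arrivals per 640 ms TTI then peaks in the first half of the epoch, consistent with the dashed line in Fig. 6(a).

```python
import numpy as np
from scipy.stats import beta

TTI = 0.64                                  # seconds per TTI
T_BURSTY = 10 * 60                          # bursty traffic duration (s)
T_PERIODIC = 60 * 60                        # periodic traffic duration (s)
D_BURSTY, D_PERIODIC = 5000, 10000          # numbers of devices
N_TTIS = 937                                # TTIs per 10-minute epoch

t_edges = np.arange(N_TTIS + 1) * TTI
# Bursty devices: activation times follow Beta(3, 4) scaled to [0, T_BURSTY].
cdf = beta.cdf(np.clip(t_edges / T_BURSTY, 0.0, 1.0), a=3, b=4)
bursty_per_tti = D_BURSTY * np.diff(cdf)
# Periodic devices: activations spread uniformly over T_PERIODIC.
periodic_per_tti = D_PERIODIC * TTI / T_PERIODIC

arrivals = bursty_per_tti + periodic_per_tti
print(arrivals.argmax(), arrivals.max())    # peak around the 400th TTI (cf. Fig. 6(a))
```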
In Fig. 6(b), we see that the ideal FSI-URC approach outperforms the LE-URC approach, because the FSI-URC approach uses the actual network load to optimize Vsu at a single time instant as in Eq. (11). DQN not only always outperforms LE-URC, but also exceeds the ideal FSI-URC approach in most TTIs. This is because both LE-URC and FSI-URC only optimize Vsu at a single time instant, whereas DQN optimizes the long-term number of served IoT devices. The single-instant optimization (LE-URC and FSI-URC) only takes into account the current trade-off between RACH resource and DATA resource given in Eq. (22), while the long-term optimization (DQN) also accounts for hidden long-term effects, such as packets dropped after exceeding the maximum number of RACH attempts γpMax or the maximum number of resource requests γRRC. The DQN approach can capture these hidden features to optimize the long-term performance of Vsu as in Eq. (1).
Fig. 7(a) compares the number of successfully served IoT devices Vsu under the Tabular-Q, LA-Q, and DQN approaches. We observe that all three approaches achieve similar values of Vsu, which indicates that both LA-Q and DQN can estimate the optimal value function $Q^*(s, a)$ as well as the converged Tabular-Q in this low-complexity single CE group scenario. Fig. 7(b) plots the average received reward over each bursty duration, $\mathbb{E}\{R\} = \frac{1}{T_{bursty}} \sum_{t=0}^{T_{bursty}} R_t$ (i.e., one epoch consists of one bursty duration Tbursty), from the beginning of the training versus the required training time. It can be seen that LA-Q and DQN converge to the optimal value function $Q^*(s, a)$ in about 10 minutes, much faster than Tabular-Q (about 5 days). The observations in Fig. 7 demonstrate that LA-Q and DQN are good alternatives to Tabular-Q, achieving almost the same number of served IoT devices with much less training time.
Fig. 8(a) and Fig. 8(b) plot the average number of successfully served IoT devices E{Vsu} and the average number of dropped packets E{Vdrop} (a quantity that can only be obtained in simulation) over a bursty duration Tbursty versus the number of bursty IoT devices Dbursty. In Fig. 8(a), we observe that E{Vsu} first increases and then decreases as the number of bursty devices grows; the decreasing trend starts when the eNB can no longer serve the growing number of IoT devices due to the increasing collisions and scheduling failures. These collisions and scheduling failures also result in an increasing number of packet drops as the traffic grows, as shown in Fig. 8(b). In Fig. 8, we also notice that DQN always outperforms LE-URC (especially for relatively large Dbursty), which indicates the superiority of the DQN approach in handling massive bursty IoT devices. Interestingly, DQN also serves more IoT devices and drops fewer packets than the ideal FSI-URC approach in most cases, thanks to the long-term optimization capability of DQN.
B. Multi-Parameter Multi-Group Scenario
The eNB is located at the center of a circular area with a 12 km radius. We set the RSRP thresholds for CE group selection to $\{\gamma_{RSRP1}, \gamma_{RSRP2}\} = \{0, -5\}$ dB, the uplink resource to Ruplink = 15360 REs (i.e., 320 slots with 48 sub-carriers), and the NPUSCH constraint for LE-URC following Ruplink,0 : Ruplink,1 : Ruplink,2 = 1 : 1 : 1. To model massive IoT traffic, both the number of periodical IoT devices Dperiodic and the number of bursty IoT devices Dbursty are increased to 30000. In AA-DQN, we use one Q-network with three hidden layers, each consisting of 2048 ReLU units. In CMA-DQN, nine DQNs are used to control each of the nine configurations (i.e., nRach,i, nRepe,i, fPrea,i for the three CE groups), where each DQN has three hidden layers, each with 128 ReLU units. The AA-LA-Q and AA-DQN approaches are proposed in Sec. V.A, and the CMA-DQN approach is proposed in Sec. V.B.
Fig. 8: E{Vsu} and E{Vdrop} for FSI-URC, LE-URC, and DQN.
Fig. 9: Vsu and the average received reward.
Fig. 9(a) compares the number of successfully served IoT devices Vsu during one epoch using AA-LA-Q, AA-DQN, CMA-DQN, and LE-URC. The “LE-URC-[1,4,8]” and “LE-URC-[2,8,16]” curves represent the LE-URC approach with the repetition values {nRepe,0, nRepe,1, nRepe,2} set to {1,4,8} and {2,8,16}, respectively. We observe that the number of successfully served IoT devices Vsu follows CMA-DQN > AA-DQN > AA-LA-Q > LE-URC-[1,4,8] > LE-URC-[2,8,16]. As can be seen, all Q-learning based approaches outperform the LE-URC approaches, because the Q-learning based approaches can dynamically optimize the number of served IoT devices by accurately configuring each parameter. We also observe that CMA-DQN slightly outperforms the others in the light-traffic regions at the beginning and end of the epoch, but substantially outperforms them during the heavy traffic in the middle of the epoch. This demonstrates the capability of CMA-DQN in better managing the scarce channel resource in the presence of heavy traffic. It is also observed that increasing the repetition value of each CE group with LE-URC improves the received SNR, and thus the RACH success rate, in the light-traffic region, but degrades the scheduling success rate due to the limited channel resource in the heavy-traffic region.
Fig. 9(b) plots the average received reward over each bursty duration, $\mathbb{E}\{R\} = \frac{1}{T_{bursty}} \sum_{t=0}^{T_{bursty}} R_t$, from the beginning of the training versus the consumed training time. It can be seen that CMA-DQN and AA-DQN converge with much less training time than AA-LA-Q. Compared with the results in the single CE group scenario shown in Fig. 7, the DNN is a better value function approximator for the three CE group scenario due to its efficiency and capability in solving high-complexity problems. We also observe that CMA-DQN achieves a higher E{R} than AA-DQN, because CMA-DQN can select the exact values from the sets of actions {NRepe, NRach, FPrea}, whereas AA-DQN can only select ascent/descent actions, which leads to a sub-optimal solution.
Fig. 10: The average number of successfully served IoT devices Vsucc,i for each CE group i.
Fig. 11: The allocated repetition value $n^t_{Repe,i}$, and RAOs produced by $n^t_{Rach,i} \times f^t_{Prea,i}$.
To gain more insight into the operation of CMA-DQN, Fig. 10 plots the average number of successfully served IoT devices Vsucc,i for each CE group i, and Fig. 11 plots the average repetition value $n^t_{Repe,i}$ and the average number of Random Access Opportunities (RAOs), defined as the product $n^t_{Rach,i} \times f^t_{Prea,i}$, for each CE group i selected by CMA-DQN over the testing episodes. As seen in Fig. 10, CMA-DQN substantially outperforms the LE-URC approaches for each CE group i; the reasons for this performance gain are showcased in Fig. 11. As seen in Fig. 11(a)-(c), CMA-DQN increases the number of repetitions in the light-traffic region in order to improve the SNR and reduce RACH failures, while decreasing it in the heavy-traffic region so as to reduce scheduling failures. Surprisingly, CMA-DQN simultaneously increases the repetition value nRepe,0 of group 0, which is the opposite of its actions for nRepe,1 and nRepe,2. This is because CMA-DQN learns that the key to optimizing the overall performance Vsu is to guarantee Vsucc,0, since the IoT devices in CE group 0 are easier to serve: they are located close to the eNB and consume less resource. As illustrated in Fig. 11(d)-(f), this allows CMA-DQN to increase the number of RAOs in the high-traffic regime, mitigating the impact of collisions on the throughput. In contrast, for CE groups 1 and 2 in the heavy-traffic region, LE-URC decreases the number of RAOs in order to reduce resource scheduling failures, causing an overall lower throughput as seen in Fig. 10.
Fig. 12: The average number of successfully served IoT devices per TTI over each epoch in online updating.
Realistic network conditions can differ from the simulation environment, because the practical traffic and physical channel vary and can be unpredictable. This difference may lead to inaccurate configurations that degrade the system performance of each approach. Fortunately, the proposed RL-based approaches can self-update after deployment according to the practical observations of the NB-IoT network in an online manner. To model this, we use the trained CMA-DQN agents from Fig. 11 (i.e., trained with bursty traffic following the time-limited Beta profile with parameters (3, 4)), and test them in a slightly modified traffic scenario in which the bursty traffic follows Beta(5, 6), with a constant exploration rate of 0.001. Fig. 12 plots the average number of successfully served IoT devices E{Vsu} per TTI over each episode versus the number of epochs. Our result shows that, as expected, E{Vsu} follows CMA-DQN > LE-URC-[1,4,8] > LE-URC-[2,8,16] at every epoch. More importantly, the performance of CMA-DQN gradually improves over the epochs, which sheds light on the online self-updating capability of the proposed RL-based approaches.
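A minimal sketch of this online self-updating mode, reusing the hypothetical helpers from the earlier CMA-DQN snippet and an assumed environment interface, is given below: the agents act almost greedily with a small constant exploration rate and keep performing per-agent SGD updates on the live traffic.

```python
EPSILON_ONLINE = 0.001   # small constant exploration rate used after deployment

def online_epoch(env, state):
    """Run one epoch (937 TTIs) while continuing to learn from live traffic.

    `env.step` is an assumed interface returning the next state, the common
    reward of Eq. (31), and the number of devices served in that TTI.
    """
    served = 0
    for _ in range(937):
        actions = select_actions(state, EPSILON_ONLINE)       # mostly greedy
        next_state, reward, n_served = env.step(actions)
        store_and_update(state, actions, reward, next_state)  # keep updating online
        served += n_served
        state = next_state
    return served / 937, state    # average number of served devices per TTI
```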
VII. CONCLUSION
In this paper, we developed Q-learning based uplink resource configuration approaches to optimize the
number of served IoT devices in real-time in NB-IoT networks. We first developed tabular-Q, LA-Q, and
DQN based approaches for the single-parameter single-group scenario, which are shown to outperform
the conventional LE-URC and FSI-URC approaches in terms of the number of served IoT devices. Our
results demonstrated that LA-Q and DQN can be good alternatives for tabular-Q to achieve almost the same
system performance with much less training time. To support traffic with different coverage requirements, we
then studied the multi-parameter multi-group scenario defined in the NB-IoT standard, which introduces a
high-dimensional configuration problem. To solve it, we advanced the proposed LA-Q and DQN using the
Actions Aggregation technique (AA-LA-Q and AA-DQN), which guarantees the convergent capability of Q-
learning by sacrificing the accuracy in resource configuration. We further developed CMA-DQN by dividing
high-dimensional configurations into multiple parallel sub-tasks, which achieved the best performance in
terms of the number of successfully served IoT devices Vsu with the least training time.
REFERENCES
[1] J. Schlienz and D. Raddino, “Narrowband internet of things whitepaper,” IEEE Microw. Mag., vol. 8, no. 1, pp. 76–82, Aug. 2016.
[2] H. S. Dhillon, H. Huang, and H. Viswanathan, “Wide-area wireless communication challenges for the internet of things,” IEEE Commun.
Mag., vol. 55, no. 2, pp. 168–174, Feb. 2017.
[3] Y.-P. E. Wang, X. Lin, A. Adhikary, A. Grovlen, Y. Sui, Y. Blankenship, J. Bergman, and H. S. Razaghi, “A primer on 3GPP narrowband
internet of things (NB-IoT),” IEEE Commun. Mag., vol. 55, no. 3, pp. 117–123, Mar. 2017.
[4] D. T. Wiriaatmadja and K. W. Choi, “Hybrid random access and data transmission protocol for machine-to-machine communications in
cellular networks,” IEEE Trans. Wireless Commun., vol. 14, no. 1, pp. 33–46, Jan. 2015.
[5] S. Duan, V. Shah-Mansouri, Z. Wang, and V. W. Wong, “D-ACB: Adaptive congestion control algorithm for bursty M2M traffic in LTE
networks,” IEEE Trans. Veh. Technol., vol. 65, no. 12, pp. 9847–9861, Dec. 2016.
[6] L. M. Bello, P. Mitchell, and D. Grace, “Application of Q-learning for RACH access to support M2M traffic over a cellular network,” in
Proc. European Wireless Conf., 2014, pp. 1–6.
[7] Y. Chu, P. D. Mitchell, and D. Grace, “ALOHA and Q-learning based medium access control for wireless sensor networks,” in Int. Symp.
Wireless Commun. Syst. (ISWCS), 2012, pp. 511–515.
[8] Y. Yan, P. Mitchell, T. Clarke, and D. Grace, “Distributed frame size selection for a Q learning based slotted ALOHA protocol,” in Int.
Symp. Wireless Commun. Syst. (ISWCS), 2013, pp. 1–5.
[9] G. Naddafzadeh-Shirazi, P.-Y. Kong, and C.-K. Tham, “Distributed reinforcement learning frameworks for cooperative retransmission in
wireless networks,” IEEE Trans. Veh. Technol., vol. 59, no. 8, pp. 4157–4162, Oct. 2010.
[10] Y.-S. Chen, C.-J. Chang, and F.-C. Ren, “Q-learning-based multirate transmission control scheme for RRM in multimedia WCDMA
systems,” IEEE Trans. Veh. Technol., vol. 53, no. 1, pp. 38–48, Jan. 2004.
[11] M. ihun and L. Yujin, “A reinforcement learning approach to access management in wireless cellular networks,” in Wireless Commun.
Mobile Comput., May. 2017, pp. 1–7.
[12] T.-O. Luis, P.-P. Diego, P. Vicent, and M.-B. Jorge, “Reinforcement learning-based ACB in LTE-A networks for handling massive M2M
and H2H communications,” in IEEE Int. Commun. Conf. (ICC), May. 2018, pp. 1–7.
[13] R. Harwahyu, R.-G. Cheng, C.-H. Wei, and R. F. Sari, “Optimization of random access channel in NB-IoT,” IEEE Internet Things J.,
vol. 5, no. 1, pp. 391–402, Feb. 2018.
[14] S.-M. Oh and J. Shin, “An efficient small data transmission scheme in the 3GPP NB-IoT system,” IEEE Commun. Lett., vol. 21, no. 3,
pp. 660–663, Mar. 2017.
[15] H. Malik, H. Pervaiz, M. M. Alam, Y. Le Moullec, A. Kuusik, and M. A. Imran, “Radio resource management scheme in NB-IoT systems,”
IEEE Access, vol. 6, pp. 15 051–15 064, Jun. 2018.
[16] C. Yu, L. Yu, Y. Wu, Y. He, and Q. Lu, “Uplink scheduling and link adaptation for narrowband internet of things systems,” IEEE Access,
vol. 5, pp. 1724–1734, 5 2017.
[17] A. Azari, G. Miao, C. Stefanovic, and P. Popovski, “Latency-energy tradeoff based on channel scheduling and repetitions in NB-IoT
systems,” arXiv preprint arXiv:1807.05602, Jul. 2018.
[18] E. Dahlman, S. Parkvall, and J. Skold, 4G: LTE/LTE-advanced for mobile broadband. Academic press, 2013.
[19] “Study on RAN improvements for machine-type communications,” 3GPP TR 37.868 V11.0.0, Sep. 2011.
[20] N. Jiang, Y. Deng, M. Condoluci, W. Guo, A. Nallanathan, and M. Dohler, “RACH preamble repetition in NB-IoT network,” IEEE
Commun. Lett., vol. 22, no. 6, pp. 1244–1247, Jun. 2018.
[21] N. Jiang, Y. Deng, A. Nallanathan, X. Kang, and T. Q. S. Quek, “Analyzing random access collisions in massive IoT networks,” IEEE
Trans. Wireless Commun., vol. 17, no. 10, pp. 6853–6870, Oct. 2018.
[22] “Evolved universal terrestrial radio access (E-UTRA); Physical channels and modulation,” 3GPP TS 36.211 v.14.2.0, Apr. 2017.
[23] M. Z. Shafiq, L. Ji, A. X. Liu, J. Pang, and J. Wang, “A first look at cellular machine-to-machine traffic: large scale measurement and
characterization,” ACM SIGMETRICS Performance Evaluation Rev., vol. 40, no. 1, pp. 65–76, Jun. 2012.
[24] J. Kim, J. Lee, J. Kim, and J. Yun, “M2M service platforms: Survey, issues, and enabling technologies.” IEEE Commun. Surveys Tuts.,
vol. 16, no. 1, pp. 61–76, Jan. 2014.
[25] “Cellular system support for ultra-low complexity and low throughput Internet of Things (CIoT),” 3GPP TR 45.820 V13.1.0, Nov. 2015.
[26] A. K. Gupta and S. Nadarajah, Handbook of Beta distribution and its applications. New York, USA: CRC press, 2004.
[27] “Evolved universal terrestrial radio access (E-UTRA); Physical layer measurements,” 3GPP TS 36.214 v. 14.2.0, Apr. 2017.
[28] X. Lin, A. Adhikary, and Y.-P. E. Wang, “Random access preamble design and detection for 3GPP narrowband IoT systems,” IEEE
Wireless Commun. Lett., vol. 5, no. 6, pp. 640–643, Jun. 2016.
[29] “Evolved universal terrestrial radio access (E-UTRA); Medium Access Control protocol specification,” 3GPP TS 36.321 v.14.2.1, May.
2017.
[30] “Evolved universal terrestrial radio access (E-UTRA); Requirements for support of radio resource management,” 3GPP TS 36.133 v.
14.3.0, Apr. 2017.
[31] R. Sutton and A. Barto, “Reinforcement learning: An introduction (draft),” URL: http://www.incompleteideas.net/book/bookdraft2017nov5.pdf, 2017.
[32] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, Feb. 2015.
[33] G. Konidaris, S. Osentoski, and P. S. Thomas, “Value function approximation in reinforcement learning using the Fourier basis.” in Assoc.
Adv. AI (AAAI), vol. 6, Aug. 2011, p. 7.
[34] S. Thrun and A. Schwartz, “Issues in using function approximation for reinforcement learning,” in Proc. Connectionist Models Summer
School Hillsdale, NJ. Lawrence Erlbaum, 1993.
[35] M. Hauskrecht, “Value-function approximations for partially observable markov decision processes,” J. AI Res., vol. 13, pp. 33–94, Aug.
2000.
[36] A. Geramifard et al., “A tutorial on linear function approximators for dynamic programming and reinforcement learning,” Found. Trends
Mach. Learn., vol. 6, no. 4, pp. 375–451, Dec. 2013.
[37] F. S. Melo and M. I. Ribeiro, “Q-learning with linear function approximation,” in Springer Int. Conf. Comput. Learn. Theory, Jun. 2007,
pp. 308–322.
[38] C. Xi, K. Bohdan, M. Norman, and M. Pete, “Polynomial regression as an alternative to neural nets,” arXiv preprint arXiv:1806.06850,
2018.
[39] C. M. Bishop, Pattern Recognition and Machine Learning. New York, USA: Springer print, 2006.
[40] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning.” in Assoc. Adv. AI (AAAI), vol. 2, Feb.
2016, p. 5.
[41] T. Tieleman and G. Hinton, “Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude,” COURSERA:
Neural Netw. Mach. Learn., vol. 4, no. 2, pp. 26–31, Oct. 2012.
[42] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 2, Mar. 2008.
[43] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and
T. Graepel, “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in Pro. Int. Conf. Auton. Agents
MultiAgent Syst. (AAMAS), Jul. 2018, pp. 2085–2087.