
Abstract

NarrowBand-Internet of Things (NB-IoT) is an emerging cellular-based technology that offers a range of flexible configurations for massive IoT radio access from groups of devices with heterogeneous requirements. A configuration specifies the amount of radio resource allocated to each group of devices for random access and for data transmission. Assuming no knowledge of the traffic statistics, there exists an important challenge in "how to determine the configuration that maximizes the long-term average number of served IoT devices at each Transmission Time Interval (TTI) in an online fashion". Given the complexity of searching for the optimal configuration, we first develop real-time configuration selection based on tabular Q-learning (tabular-Q), Linear Approximation based Q-learning (LA-Q), and Deep Neural Network based Q-learning (DQN) in the single-parameter single-group scenario. Our results show that the proposed reinforcement learning based approaches considerably outperform the conventional heuristic approaches based on load estimation (LE-URC) in terms of the number of served IoT devices. This result also indicates that LA-Q and DQN can be good alternatives to tabular-Q, achieving almost the same performance with much less training time. We further advance LA-Q and DQN via Action Aggregation (AA-LA-Q and AA-DQN) and via Cooperative Multi-Agent learning (CMA-DQN) for the multi-parameter multi-group scenario, thereby solving the problem that Q-learning agents do not converge in high-dimensional configurations. In this scenario, the superiority of the proposed Q-learning approaches over the conventional LE-URC approach improves significantly as the configuration dimension increases, and the CMA-DQN approach outperforms the other approaches in both throughput and training efficiency.
Deep Reinforcement Learning for Real-Time
Optimization in NB-IoT Networks
Nan Jiang, Student Member, IEEE, Yansha Deng, Member, IEEE, Arumugam Nallanathan,
Fellow, IEEE, and Jonathon A. Chambers, Fellow, IEEE
To effectively support the emerging massive Internet of Things (mIoT) ecosystem, the 3rd Generation
Partnership Project (3GPP) partners have standardized a new radio access technology, namely NarrowBand-
IoT (NB-IoT) [1]. NB-IoT is expected to provide reliable wireless access for IoT devices with various
N. Jiang and A. Nallanathan are with the School of Electronic Engineering and Computer Science, Queen Mary University of London,
London E1 4NS, UK (e-mail: {nan.jiang, a.nallanathan}
Y. Deng is with the Department of Informatics, King’s College London, London WC2R 2LS, UK (e-mail:
(Corresponding author: Yansha Deng).
J. A. Chambers is with the Department of Engineering, University of Leicester, Leicester LE1 7RH, UK (e-mail:
arXiv:1812.09026v1 [cs.NI] 21 Dec 2018
types of data traffic, and to meet the requirement of extended coverage. As most mIoT applications favor delay-tolerant traffic with small data sizes, such as reports from alarms, meters, and monitors, the key target of NB-IoT design is to deal with the sporadic uplink transmissions of massive numbers of IoT devices [2].
NB-IoT is built on the legacy Long-Term Evolution (LTE) design, but is deployed in a narrow bandwidth (180 kHz) for Coverage Enhancement (CE) [3]. Different from legacy LTE, NB-IoT defines only two uplink physical channel resources to perform all uplink transmissions: the Random Access CHannel (RACH) resource (i.e., the NarrowBand Physical Random Access CHannel, a.k.a. NPRACH) for RACH preamble transmission, and the data resource (i.e., the NarrowBand Physical Uplink Shared CHannel, a.k.a. NPUSCH) for control information and data transmission. To support various traffic with different coverage requirements, NB-IoT allows up to three CE groups of IoT devices to share the uplink resource in the same band. Each group serves IoT devices with a different coverage requirement, distinguished based on the same broadcast signal from the evolved Node B (eNB) [3]. At the beginning of each uplink Transmission Time Interval (TTI), the eNB selects a system configuration that specifies the radio resource allocated to each group to accommodate the RACH procedure, with the remaining resource used for data transmission. The key challenge is to optimally balance the allocation of channel resources between the RACH procedure and data transmission so as to maximize the number of successful accesses and transmissions in massive IoT networks: allocating too many resources for RACH enhances the random access performance, while leaving insufficient resources for data transmission.
Unfortunately, dynamic RACH and data transmission resource configuration optimization is an untreated problem in NB-IoT. Generally speaking, the eNB observes the transmission receptions of both RACH (e.g., the numbers of successfully received preambles and of collisions) and data transmission (e.g., the numbers of scheduled and unscheduled devices) for all groups at the end of each TTI. This historical information can potentially be used to predict traffic from all groups and to facilitate the optimization of future TTIs'
configurations. Even if one knew all the relevant statistics, tackling this problem in an exact manner would
result in a Partially Observable Markov Decision Process (POMDP) with large state and action spaces,
which would be generally intractable. The complexity of the problem is compounded by the lack of prior knowledge at the eNB regarding the stochastic traffic and unobservable channel statistics (i.e., random collisions, and physical radio effects including path-loss and fading). The related works are briefly introduced in the following two subsections.
1) Related works on real-time optimization in cellular-based networks: In light of this POMDP challenge,
prior works [4, 5] studied real-time resource configuration of RACH procedure and/or data transmission by
proposing dynamic Access Class Barring (ACB) schemes to optimize the number of served IoT devices.
These optimization problems have been tackled under the simplified assumptions that at most two configurations are allowed and that the optimization is executed for a single group without considering errors due to wireless transmission. In order to consider more complex and practical formulations, Reinforcement
Learning (RL) emerges as a natural solution given its capability in interacting with the practical environment
and feedback in the form of the number of successful and unsuccessful transmissions per TTI. Distributed
RL based on tabular Q-learning (tabular-Q) has been proposed in [6–9]. In [6–8], the authors studied
distributed tabular-Q in slotted-Aloha networks, where each device learns how to avoid collisions by finding
a proper time slot to transmit packets. In [9], the authors implemented tabular-Q agents at the relay nodes
for cooperatively selecting their transmit power and transmission probability to optimize the total number of useful received packets per unit of consumed energy. Centralized RL has also been studied in [10–12], where the
RL agent was implemented at the base station site. In [10], a learning-based scheme was proposed for radio
resource management in multimedia wide-band code-division multiple access systems to improve spectrum
utilization. In [11, 12], the authors studied the tabular-Q based ACB schemes in cellular networks, where a
Q-agent was implemented at an eNB aiming at selecting the optimal ACB factor to maximize the access
success probability of RACH procedure.
2) Related works on optimization in NB-IoT: In NB-IoT networks, most existing studies either focused
on the resource allocation during RACH procedure [13, 14], or that during the data transmission [15, 16]. For
RACH procedure, the access success probability was statistically optimized in [13] using exhaustive search,
and the authors in [14] studied the fixed-size data resource scheduling for various resource requirements.
For the data transmission, [15] presented an uplink data transmission time slot and power allocation scheme
to optimize the overall channel gain, and [16] proposed a link adaptation scheme, which dynamically
selects modulation and coding level, and the repetition value according to the acknowledgment/negative-
acknowledgment feedback to reduce the uplink data transmission block error ratio. More importantly, these works ignored the time-varying heterogeneous traffic of massive IoT devices, and considered only a snapshot [13, 15, 16] or the steady-state behavior [14] of NB-IoT networks. The work most relevant to ours is [17], where the authors studied the steady-state behavior of NB-IoT networks from the perspective of a single device. Optimizing
some of the parameters of the NB-IoT configuration, namely the repetition value (to be defined below) and
time intervals between two consecutive scheduling of NPRACH and NPDCCH, was carried out in terms of
latency and power consumption in [17] using a queuing framework.
Unfortunately, the tabular-Q framework in [11, 12] cannot be used to solve the multi-parameter multi-group optimization problem in the uplink resource configuration of NB-IoT networks, due to its inability to handle high-dimensional state spaces and parameter selection. More importantly, whether the proposed RL-based resource configuration approaches [11, 12] outperform the conventional resource configuration approaches [4, 5] is still unknown. In this paper, we develop RL-based uplink resource configuration approaches to dynamically optimize the number of served IoT devices in NB-IoT networks. To showcase their
efficiency, we compare the proposed RL-based approaches with the conventional heuristic uplink resource
allocation approaches. The contributions can be summarized as follows:
We develop an RL-based framework to optimize the number of served IoT devices by adaptively
configuring uplink resource in NB-IoT networks. The uplink communication procedure in NB-IoT is
simulated by taking into account the heterogeneous IoT traffic, the CE group selection, the RACH
procedure, and the uplink data transmission resource scheduling. This generated simulation environment
is used for training the RL-based agents before deployment, and these agents will be updated according
to the real traffic in practical NB-IoT networks in an online manner.
We first study a simplified NB-IoT scenario with a single parameter and a single CE group, where a basic tabular-Q is developed and compared with the revised conventional Load Estimation based Uplink Resource Configuration (LE-URC) scheme. The tabular-Q is further advanced by implementing
function approximators with different computational complexities, namely, Linear Approximator (LA-Q)
and Deep Neural Networks (Deep Q-Network, a.k.a. DQN) to elaborate their capability and efficiency
in dealing with high-dimensional state space.
We then study a more practical NB-IoT scenario with multiple parameters and multiple CE groups,
where direct implementation of the LA-Q or DQN is not feasible due to the increasing size of the
parameter combinations. To solve it, we propose Action Aggregation approaches based on LA-Q and
DQN, namely, AA-LA-Q and AA-DQN, which guarantee convergence capability by sacrificing certain
accuracy in the parameter selection. Finally, a Cooperative Multi-Agent learning based on DQN (CMA-DQN) approach is developed to break down the selection of high-dimensional parameters into multiple parallel sub-tasks, in which a number of DQN agents are cooperatively trained to produce each parameter for each CE group.
In the simplified scenario, our results show that the number of served IoT devices with tabular-Q con-
siderably outperforms that with LE-URC, while LA-Q and DQN achieve almost the same performance
as that of tabular-Q using much less training time. In the practical scenario, the superiority of Q-learning
based approaches over LE-URC significantly improves. In particular, CMA-DQN outperforms all other
approaches in terms of both throughput and training efficiency, which is mainly due to the use of
DQN enabling operation over a large state space and the use of multiple agents dealing with the large
dimensionality of parameters selection.
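The tabular-Q baseline that anchors these contributions can be illustrated with a minimal sketch. The toy integer state/action encoding below is hypothetical: in the paper, the state is the observed transmission-reception history and the action is an uplink resource configuration.

```python
import random

# Minimal tabular Q-learning sketch (hypothetical toy encoding: states and
# actions are small integers, unlike the paper's configuration vectors).
class TabularQ:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, eps=0.1):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, s):
        # epsilon-greedy configuration selection
        if random.random() < self.eps:
            return random.randrange(len(self.q[s]))
        row = self.q[s]
        return row.index(max(row))

    def update(self, s, a, r, s_next):
        # standard Q-learning target: r + gamma * max_a' Q(s', a')
        target = r + self.gamma * max(self.q[s_next])
        self.q[s][a] += self.alpha * (target - self.q[s][a])
```

LA-Q and DQN replace the table `q` with a linear function or a neural network, which is what makes the larger state spaces of the multi-group scenario tractable.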
The rest of the paper is organized as follows. Section II provides the problem formulation and system
model. Section III presents the preliminaries and the conventional LE-URC. Section IV proposes Q-learning based uplink resource configuration approaches in the single-parameter single-group scenario. Section V
presents the advanced Q-learning based approaches in the multi-parameter multi-group scenario. Section VI
elaborates the numerical results, and finally, Section VII concludes the paper.
As illustrated in Fig. 1(a), we consider a single-cell NB-IoT network composed of an eNB located at the center of the cell and a set of IoT devices randomly located in an area of the plane $\mathbb{R}^2$, which remain spatially static once deployed. The devices are divided into three CE groups, as further discussed below, and the eNB is unaware of the status of these IoT devices; hence, no uplink channel resource is scheduled to them in advance. In each IoT device, uplink data is generated according to random inter-arrival processes over the TTIs, which are Markovian and possibly time-varying.
Fig. 1: (a) Illustration of the system model; (b) uplink channel frame structure, whose configuration in each TTI is described, for each CE group, by the number of RACH periods, the repetition value, and the number of preambles.
A. Problem Formulation
With packets waiting for service, an IoT device executes the contention-based RACH procedure in order to establish a Radio Resource Control (RRC) connection with the eNB. The contention-based RACH procedure consists of four steps: an IoT device transmits a randomly selected preamble a given number of times, according to the repetition value $n^{t}_{\rm Repe,i}$ [1], to initiate the RACH procedure in step 1, and exchanges control information with the eNB in the next three steps [18]. The RACH procedure can fail if: (i) a collision occurs when two or more IoT devices select the same preamble; or (ii) there is no collision, but the eNB cannot detect the preamble due to low SNR. Note that a collision can still be detected in step 3 of RACH when the collided preambles are not detected in step 1 of RACH, following the 3GPP report [19]. This assumption is different from our previous works [20, 21], which only focus on the preamble detection analysis in step 1 of RACH.
As shown in Fig. 1(b), for each TTI $t$ and for each CE group $i = 0, 1, 2$, in order to reduce the chance of a collision, the eNB can increase the number $n^{t}_{\rm Rach,i}$ of RACH periods in the TTI or the number $f^{t}_{\rm Prea,i}$ of preambles available in each RACH period [22]. Furthermore, in order to mitigate the SNR outage, the eNB can increase the number $n^{t}_{\rm Repe,i}$ of times that a preamble transmission is repeated by a device in group $i$ in one RACH period [22] of the TTI.
After the RRC connection is established, the IoT device requests uplink channel resource from the eNB for control information and data transmission. As shown in Fig. 1(b), given a total number of resources $R_{\rm Uplink}$ for uplink transmission in the TTI, the number of available resources for data transmission is written as $R^{t}_{\rm DATA} = R_{\rm Uplink} - R^{t}_{\rm RACH}$, where $R^{t}_{\rm RACH}$ is the overall number of Resource Elements (REs)$^1$ allocated for the RACH procedure. This can be computed as $R^{t}_{\rm RACH} = B_{\rm RACH} \sum_{i=0}^{2} n^{t}_{\rm Rach,i}\, n^{t}_{\rm Repe,i}\, f^{t}_{\rm Prea,i}$, where $B_{\rm RACH}$ measures the number of REs required for one preamble transmission.
In this work, we tackle the problem of optimizing the RACH configuration defined by the parameters $A^{t} = \{n^{t}_{\rm Rach,i}, f^{t}_{\rm Prea,i}, n^{t}_{\rm Repe,i}\}_{i=0}^{2}$ for each $i$th group in an online manner for every TTI $t$. In order to make this decision at the beginning of every TTI $t$, the eNB accesses all prior history $U^{t'}$ in TTIs $t' = 1, \ldots, t-1$, consisting of the following variables: the number of collided preambles $V^{t'}_{\rm cp,i}$, the number of successfully received preambles $V^{t'}_{\rm sp,i}$, and the number of idle preambles $V^{t'}_{\rm ip,i}$ of the $i$th CE group in the $t'$th TTI for the RACH, as well as the number of IoT devices that have successfully sent data $V^{t'}_{\rm su,i}$ and the number of IoT devices that are waiting to be allocated data resources $V^{t'}_{\rm un,i}$. We denote $O^{t} = \{A^{t-1}, U^{t-1}, A^{t-2}, U^{t-2}, \cdots, A^{1}, U^{1}\}$ as the observed history of all such measurements and past actions.
The eNB aims at maximizing the long-term average number of devices that successfully transmit data with respect to the stochastic policy $\pi$ that maps the current observation history $O^{t}$ to the probabilities of selecting each possible configuration $A^{t}$. This problem can be formulated as the optimization
$$({\rm P1}):\ \max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} \sum_{i=0}^{2} V^{t}_{{\rm su},i}\Big],\qquad (1)$$
where $\gamma \in [0, 1)$ is the discount rate for the performance in future TTIs and the index $i$ runs over the CE groups. Since the dynamics of the system are Markovian over the TTIs and are defined by the NB-IoT protocol to be further discussed below, this is a POMDP problem that is generally intractable. Approximate solutions will be discussed in Sections III, IV, and V.
$^1$The uplink channel consists of 48 sub-carriers within a 180 kHz bandwidth. With a 3.75 kHz tone spacing, one RE is composed of one time slot of 2 ms and one sub-carrier of 3.75 kHz [1]. Note that NB-IoT also supports 12 sub-carriers with a 15 kHz tone spacing for NPUSCH, but NPRACH only supports the 3.75 kHz tone spacing [1].
B. NB-IoT Access Network
We now provide additional details on the model and on the NB-IoT protocol. To capture the effects of the physical radio, we consider the standard power-law path-loss model in which the path-loss attenuation is $u^{-\eta}$, with propagation distance $u$ and path-loss exponent $\eta$. The system operates in a Rayleigh flat-fading environment, where the channel power gains $h$ are i.i.d. exponentially distributed random variables with unit mean. Fig. 2 presents the uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks, which consists of the four stages explained in the following four subsections.
Fig. 2: Uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks, consisting of four stages: (A) traffic inter-arrival, (B) CE group determination, (C) RACH procedure, and (D) data resource scheduling. The procedure tracks three counters: the CE counter $c_{\rm pCE}$ (a device steps up to a higher CE group and resets $c_{\rm pCE}$ when it exceeds the maximum allowed RACH attempts $\gamma_{\rm pCE,i}$ in the $i$th CE group), the RACH counter $c_{\rm pMax}$ (serving fails and the packet is dropped when it exceeds the maximum allowed RACH attempts $\gamma_{\rm pMax}$ over all CE groups), and the RRC counter $c_{\rm RRC}$ (bounded by the maximum allowed channel resource requests $\gamma_{\rm RRC}$).
1) Traffic Inter-Arrival: We consider two types of IoT devices with different traffic models, namely periodical traffic and bursty traffic, forming a heterogeneous traffic scenario for diverse IoT applications [23, 24]. The periodical traffic, coming from periodic uplink reporting tasks such as metering or environmental monitoring, is the most common traffic model in NB-IoT networks [25]. The bursty traffic, due to emergency events such as fire alarms and earthquake alarms, captures the complementary scenario in which a massive number of IoT devices tries to establish RRC connections with the eNB [19]. Due to the nature of slotted-Aloha, an IoT device can only transmit a preamble at the beginning of a RACH period, which means that the IoT devices executing RACH in a RACH period are those that received a packet arrival within the interval since the last RACH period. For the periodical traffic, the first packet is generated using a Uniform distribution over $T_{\rm periodic}$ (ms), and is then repeated every $T_{\rm periodic}$ ms. The packet inter-arrival rate measured in each RACH period at each IoT device is hence expressed by
$$\mu^{t}_{\rm period} = \frac{T_{\rm TTI}}{n^{t}_{\rm Rach,i}\, T_{\rm periodic}},\qquad (2)$$
where $n^{t}_{\rm Rach,i}$ is the number of RACH periods in the $t$th TTI, and $T_{\rm TTI}/n^{t}_{\rm Rach,i}$ is the duration between neighboring RACH periods. The bursty traffic is generated within a short period of time $T_{\rm bursty}$ starting from a random time $\tau_0$. The instantaneous traffic rate in packets is described by a function $p(\tau)$, so that the packet arrival rate in the $j$th RACH period of the $t$th TTI is given by
$$\mu^{t}_{\rm bursty} = \int_{\tau_{j-1}}^{\tau_{j}} p(\tau)\, d\tau,\qquad (3)$$
where $\tau_j$ is the starting time of the $j$th RACH period in the $t$th TTI, $\tau_j - \tau_{j-1} = T_{\rm TTI}/n^{t}_{\rm Rach,i}$, and the distribution $p(\tau)$ follows the time-limited Beta profile given as [19, Section 6.1.1]
$$p(\tau) = \frac{\tau^{\alpha-1}\,(T_{\rm bursty}-\tau)^{\beta-1}}{T_{\rm bursty}^{\alpha+\beta-2}\,{\rm Beta}(\alpha, \beta)}.\qquad (4)$$
In (4), ${\rm Beta}(\alpha, \beta)$ is the Beta function with constant parameters $\alpha$ and $\beta$ [26].
2) CE Group Determination: Once an IoT device is backlogged, it first determines its associated CE group by comparing the received power of the broadcast signal $P_{\rm RSRP}$ to the Reference Signal Received Power (RSRP) thresholds $\{\gamma_{\rm RSRP1}, \gamma_{\rm RSRP2}\}$ according to the rule [27]
$$\begin{cases} \text{CE group 0}, & \text{if } P_{\rm RSRP} > \gamma_{\rm RSRP1},\\ \text{CE group 1}, & \text{if } \gamma_{\rm RSRP1} \geq P_{\rm RSRP} \geq \gamma_{\rm RSRP2},\\ \text{CE group 2}, & \text{if } P_{\rm RSRP} < \gamma_{\rm RSRP2}. \end{cases}\qquad (5)$$
In (5), the received power of the broadcast signal $P_{\rm RSRP}$ is expressed as
$$P_{\rm RSRP} = P_{\rm NPBCH}\, u^{-\eta},\qquad (6)$$
where $u$ is the device's distance from the eNB, and $P_{\rm NPBCH}$ is the broadcast power of the eNB [27]. Note that $P_{\rm RSRP}$ is obtained by averaging out the small-scale Rayleigh fading of the received power [27].
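A minimal sketch of the CE group selection rule of Eq. (5) together with the averaged RSRP computation; all powers and thresholds here are in linear units, and the function names and threshold values in the test are illustrative assumptions.

```python
def rsrp(p_npbch, u, eta):
    """Averaged broadcast received power: P_NPBCH * u^(-eta), fading
    averaged out (unit-mean Rayleigh power gain)."""
    return p_npbch * u ** (-eta)

def ce_group(p_rsrp, gamma1, gamma2):
    """CE group selection rule of Eq. (5); requires gamma1 > gamma2."""
    if p_rsrp > gamma1:
        return 0      # best coverage: path-loss inversion power control
    if p_rsrp >= gamma2:
        return 1
    return 2          # deepest coverage: maximum transmit power
```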
3) RACH Procedure: After CE group determination, each backlogged IoT device in group $i$ repeats a randomly selected preamble $n^{t}_{\rm Repe,i}$ times in the next RACH period by using a pseudo-random frequency hopping schedule. The pseudo-random hopping rule is based on the current repetition time as well as the Narrowband Physical Cell ID, and in one repetition, a preamble consists of four symbol groups, which are transmitted with fixed-size frequency hopping [1, 20, 28]. Therefore, a preamble is successfully detected if at least one preamble repetition succeeds, which in turn happens if all of its four symbol groups are correctly decoded [20]. Assuming that correct detection is determined by the SNR level ${\rm SNR}^{t}_{{\rm sg},j,k}$ for the $j$th repetition and the $k$th symbol group, the correct detection event $S_{\rm pd}$ can be expressed as
$$S_{\rm pd} = \bigcup_{j=1}^{n^{t}_{\rm Repe,i}} \bigcap_{k=1}^{4} \big\{{\rm SNR}^{t}_{{\rm sg},j,k} \geq \gamma_{\rm th}\big\},\qquad (7)$$
where $k$ is the index of the symbol group in the $j$th repetition, $n^{t}_{\rm Repe,i}$ is the repetition value of the $i$th CE group in the $t$th TTI, ${\rm SNR}^{t}_{{\rm sg},j,k} \geq \gamma_{\rm th}$ means that the preamble symbol group is successfully decoded when its received SNR ${\rm SNR}^{t}_{{\rm sg},j,k}$ is above a threshold $\gamma_{\rm th}$, and ${\rm SNR}^{t}_{{\rm sg},j,k}$ is expressed as
$${\rm SNR}^{t}_{{\rm sg},j,k} = P_{{\rm RACH},i}\, u^{-\eta} h / \sigma^{2}.\qquad (8)$$
In (8), $u$ is the Euclidean distance between the IoT device and the eNB, $\eta$ is the path-loss exponent, $h$ is the Rayleigh fading channel power gain from the IoT device to the eNB, $\sigma^{2}$ is the noise power, and $P_{{\rm RACH},i}$ is the preamble transmit power in the $i$th CE group, defined as
$$P_{{\rm RACH},i} = \begin{cases} \min\{\rho\, u^{\eta},\ P_{\rm RACHmax}\}, & i = 0,\\ P_{\rm RACHmax}, & i = 1 \text{ or } 2, \end{cases}\qquad (9)$$
where $i$ is the index of the CE group. IoT devices in CE group 0 ($i = 0$) transmit the preamble using full path-loss inversion power control [27], which maintains the received signal power at the eNB from IoT devices at different distances at the same threshold $\rho$, and $P_{\rm RACHmax}$ is the maximal transmit power of an IoT device. The IoT devices in CE group 1 and group 2 always transmit the preamble using the maximum transmit power [27].
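The detection model of Eqs. (7)-(8) can be checked with a small Monte-Carlo sketch. Here `snr_mean` stands for the average received SNR $P_{{\rm RACH},i} u^{-\eta}/\sigma^2$, the unit-mean Rayleigh power fading is drawn as an exponential variate, and the trial count is an arbitrary assumption.

```python
import random

def preamble_detected(n_repe, snr_mean, gamma_th):
    """One preamble transmission: detected if, in at least one of the
    n_repe repetitions, all four symbol groups exceed gamma_th (Eq. (7)).
    Each symbol group sees an i.i.d. unit-mean exponential power gain."""
    for _ in range(n_repe):
        if all(snr_mean * random.expovariate(1.0) >= gamma_th
               for _ in range(4)):
            return True
    return False

def detection_prob(n_repe, snr_mean, gamma_th, trials=20000):
    """Monte-Carlo estimate of the preamble detection probability."""
    hits = sum(preamble_detected(n_repe, snr_mean, gamma_th)
               for _ in range(trials))
    return hits / trials
```

Running this with increasing `n_repe` reproduces the motivation for the repetition value: more repetitions raise the detection probability at the cost of more RACH resource.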
As shown in the RACH procedure of Fig. 2, if a RACH attempt fails, the IoT device reattempts the procedure until receiving a positive acknowledgement that the RRC connection is established, or until exceeding $\gamma_{\rm pCE,i}$ RACH attempts while being part of one CE group. If the attempts exceed $\gamma_{\rm pCE,i}$, the device switches to a higher CE group if possible [29]. Moreover, the IoT device is allowed to attempt the RACH procedure no more than $\gamma_{\rm pMax}$ times before dropping its packets. These two features are counted by $c_{\rm pCE}$ and $c_{\rm pMax}$, respectively.
4) Data Resource Scheduling: After the RACH procedure succeeds, the RRC connection is successfully established, and the eNB schedules resources from the data channel resource $R^{t}_{\rm DATA}$ to the associated IoT device for control information and data transmission, as shown in Fig. 1(b). To allocate data resources among these devices, we adopt a basic random scheduling strategy, whereby an ordered list of all devices that have successfully completed the RACH procedure but have not received a data channel is compiled in a random order. In each TTI, devices in the list are considered in order for access to the data channel until the data resource is insufficient to serve the next device in the list. The remaining RRC connections between the unscheduled IoT devices and the eNB are preserved for at most $\gamma_{\rm RRC}$ subsequent TTIs, counted by $c_{\rm RRC}$, and attempts will be made to schedule the device's data during these TTIs [29, 30]. The condition
that the data resource is sufficient in TTI $t$ is expressed as
$$\sum_{i=0}^{2} V^{t}_{{\rm sch},i}\, r^{t}_{{\rm DATA},i} \leq R^{t}_{\rm DATA},\qquad (10)$$
where $\sum_{i=0}^{2} V^{t}_{{\rm sch},i} \leq \sum_{i=0}^{2} \big(V^{t}_{{\rm sp},i} + V^{t-1}_{{\rm un},i}\big)$ is the number of scheduled devices, limited by the upper bound given by the IoT devices with successful RACH $V^{t}_{{\rm sp},i}$ in the current TTI $t$ as well as the unscheduled IoT devices $V^{t-1}_{{\rm un},i}$ in the last TTI $(t-1)$; $r^{t}_{{\rm DATA},i} = n^{t}_{\rm Repe,i} B_{\rm DATA}$ is the number of required REs for serving one IoT device within the $i$th CE group, and $B_{\rm DATA}$ is the number of REs per repetition for control signal and data transmission$^2$. Note that $n^{t}_{\rm Repe,i}$ is the repetition value for the $i$th CE group in the $t$th TTI, which is the same as for preamble transmission [1].
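The random scheduling strategy described above can be sketched as follows. The `(device_id, n_repe)` encoding and the parameter names are hypothetical; the per-device cost $n_{\rm Repe} B_{\rm DATA}$ follows the resource condition of Eq. (10).

```python
import random

def schedule(connected, r_uplink, r_rach, b_data):
    """Random-order scheduling sketch (cf. Eq. (10)): serve devices from a
    randomly compiled list until the next one no longer fits the budget.
    `connected` holds (device_id, n_repe) pairs; serving a device costs
    n_repe * b_data REs."""
    budget = r_uplink - r_rach           # R_DATA = R_Uplink - R_RACH
    order = connected[:]
    random.shuffle(order)                # compile the list in random order
    served, unscheduled = [], []
    for idx, (dev, n_repe) in enumerate(order):
        cost = n_repe * b_data           # REs needed for this device
        if cost > budget:                # next device does not fit: stop
            unscheduled = [d for d, _ in order[idx:]]
            break
        budget -= cost
        served.append(dev)
    return served, unscheduled
```

Devices returned in `unscheduled` would keep their RRC connection for up to $\gamma_{\rm RRC}$ further TTIs, as described above.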
A. Preliminary
Maximizing the long-term number of served IoT devices as given in Eq. (1) is challenging and cannot easily be solved via conventional uplink resource approaches. Therefore, most prior works simplified the objective to dynamically optimizing a single parameter to achieve the maximum number of served IoT devices in a single group, without consideration of future performance [4, 5], which is expressed as
$$({\rm P2}):\ \max_{x}\ \mathbb{E}\big[V^{t}_{\rm su,0}\big],\qquad (11)$$
where $x$ is the optimized single parameter.
To maximize the number of served IoT devices in the $t$th TTI, the configuration $x$ is expected to be dynamically adjusted according to the actual number of IoT devices that will execute RACH attempts, $D^{t}_{\rm RACH}$, which represents the current load of the network. Note that, in practice, this load information cannot be observed at the eNB. Thus, it is necessary to estimate the load based on the previous transmission receptions from the 1st to the $(t-1)$th TTI, $O^{t}$, before the uplink resource configuration in the $t$th TTI.
In [5], the authors designed a dynamic ACB scheme to optimize the problem given in Eq. (1) via adjusting the ACB factor. The ACB factor is adapted based on knowledge of the traffic load, which is estimated via moment matching. The estimated number of RACH attempting IoT devices in the $t$th TTI, $\hat{D}^{t}_{\rm RACH}$, is expressed as
$$\hat{D}^{t}_{\rm RACH} = \max\big\{0,\ \hat{D}^{t-1}_{\rm RACH} + \hat{\delta}^{t}\big\},\qquad (12)$$
where $\hat{D}^{t-1}_{\rm RACH}$ is the estimated number of devices performing RACH attempts in the $(t-1)$th TTI, given as
$$\hat{D}^{t-1}_{\rm RACH} = f^{t-1}_{\rm Prea,0}\Big/\Big[\min\big\{1,\ p^{t-1}_{\rm ACB}\big\}\big(1 + (V^{t-1}_{\rm cp,0}/f^{t-1}_{\rm Prea,0})\, u_{M,p}\big)\Big].\qquad (13)$$
In Eq. (13), $p^{t-1}_{\rm ACB}$, $f^{t-1}_{\rm Prea,0}$, and $V^{t-1}_{\rm cp,0}$ are the ACB factor, the number of preambles, and the observed number of collided preambles in the $(t-1)$th TTI, and $u_{M,p}$ is an estimation factor given in [5, Eq. (32)].
In Eq. (12), $\hat{\delta}^{t}$ is the difference between the estimated numbers of RACH requesting IoT devices in the $(t-1)$th and the $t$th TTIs, which is obtained by assuming that the number of successful RACH IoT devices does not change significantly in these two TTIs [5].
$^2$The basic scheduling unit of NPUSCH is the resource unit (RU), which has two formats: NPUSCH format 1 (NPUSCH-1) with 16 REs for data transmission, and NPUSCH format 2 (NPUSCH-2) with 4 REs for carrying control information [3, 22].
This dynamic control approach is designed for an ACB scheme, which is only triggered when the exact traffic load is larger than the number of preambles (i.e., $D^{t}_{\rm RACH} > f^{t}_{\rm Prea,0}$). Accordingly, the related backlog estimation approach is only used when $D^{t}_{\rm RACH} > f^{t}_{\rm Prea,0}$. However, it cannot estimate the load when $D^{t}_{\rm RACH} \leq f^{t}_{\rm Prea,0}$, which is required in our problem.
B. Resource Configuration in the Single-Parameter Single-CE-Group Scenario
In this subsection, we modify the load estimation approach given in [5] by estimating the load based on the last number of collided preambles $V^{t-1}_{\rm cp,0}$ and the previous numbers of idle preambles $V^{t-1}_{\rm ip,0}, V^{t-2}_{\rm ip,0}, \cdots$. We then propose an uplink resource configuration approach based on this revised load estimation, namely, LE-URC.
1) Load Estimation: By definition, $\mathcal{F}_{\rm Prea}$ is the set of valid numbers of preambles that the eNB can choose, where each IoT device selects a RACH preamble from the $f^{t}_{\rm Prea,0}$ available preambles with equal probability $1/f^{t}_{\rm Prea,0}$. For a given preamble $j$ transmitted to the eNB, let $d_j$ denote the number of IoT devices that select preamble $j$. The probability that no IoT device selects preamble $j$ is
$$\mathbb{P}\{d_j = 0 \mid D^{t-1}_{\rm RACH,0} = n\} = \big(1 - 1/f^{t-1}_{\rm Prea,0}\big)^{n}.\qquad (14)$$
The expected number of idle preambles $\mathbb{E}\{V^{t-1}_{\rm ip,0} \mid D^{t-1}_{\rm RACH,0} = n\}$ in the $(t-1)$th TTI is given by
$$\mathbb{E}\{V^{t-1}_{\rm ip,0} \mid D^{t-1}_{\rm RACH,0} = n\} = f^{t-1}_{\rm Prea,0}\big(1 - 1/f^{t-1}_{\rm Prea,0}\big)^{n}.\qquad (15)$$
Since the actual number of idle preambles $V^{t-1}_{\rm ip,0}$ can be observed at the eNB, the number of RACH attempting IoT devices in the $(t-1)$th TTI, $\zeta^{t-1}$, can be estimated by inverting Eq. (15) as
$$\zeta^{t-1} = \log_{\big(1 - 1/f^{t-1}_{\rm Prea,0}\big)}\Big(\frac{V^{t-1}_{\rm ip,0}}{f^{t-1}_{\rm Prea,0}}\Big).\qquad (16)$$
To obtain the estimated number of RACH attempting IoT devices in the $t$th TTI, $\tilde{D}^{t}_{\rm RACH,0}$, we also need to know the difference between the estimated numbers of RACH attempting IoT devices in the $(t-1)$th and the $t$th TTIs, denoted by $\delta^{t}$, where $\delta^{t} = \tilde{D}^{t}_{\rm RACH,0} - \tilde{D}^{t-1}_{\rm RACH,0}$ for $t = 1, 2, \cdots$, and $\tilde{D}^{0}_{\rm RACH,0} = 0$. However, $\tilde{D}^{t}_{\rm RACH,0}$ cannot be obtained before the $t$th TTI. To solve this, we can assume $\delta^{t} \approx \delta^{t-1}$ according to [5]. This is because the time between two consecutive TTIs is small, and the available preambles are gradually updated, so that the number of successful RACH IoT devices does not change significantly over these two TTIs [5]. Therefore, the number of RACH attempting IoT devices in the $t$th TTI is estimated as
$$\tilde{D}^{t}_{\rm RACH,0} = \max\big\{2 V^{t-1}_{\rm cp,0},\ \zeta^{t-1} + \delta^{t-1}\big\},\qquad (17)$$
where $2 V^{t-1}_{\rm cp,0}$ reflects that at least $2 V^{t-1}_{\rm cp,0}$ IoT devices collided in the last TTI.
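The load estimator of Eqs. (16)-(17) amounts to a few lines. In the sketch below, the guard against a zero idle count is our own assumption, since the logarithm in Eq. (16) is undefined when no preamble is idle.

```python
import math

def estimate_load(v_ip_prev, v_cp_prev, f_prea_prev, delta_prev):
    """Idle-preamble load estimate, Eqs. (16)-(17): invert
    E{V_ip} = f * (1 - 1/f)^n to get zeta, then take the max with the
    collision lower bound 2 * V_cp."""
    ratio = max(v_ip_prev, 1) / f_prea_prev   # guard: avoid log(0)
    zeta = math.log(ratio) / math.log(1.0 - 1.0 / f_prea_prev)
    return max(2 * v_cp_prev, zeta + delta_prev)
```

Feeding the estimator the exact expectation of Eq. (15) recovers the underlying load, which is a convenient consistency check.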
2) Uplink Resource Configuration Based on Load Estimation: In the following, we propose LE-URC by taking into account the resource condition given in Eq. (10). The number of RACH periods n_{Rach,0} and the repetition value n_{Repe,0} are fixed, and only the number of preambles in each RACH period f_{Prea,0} is dynamically configured in each TTI. Using the estimated number of RACH-attempting IoT devices in the tth TTI, D̃^t_{RACH,0}, the probability that exactly one IoT device selects preamble j (i.e., no collision occurs) is expressed as
P{d_j = 1} = (D̃^t_{RACH,0}/f^t_{Prea,0})(1 − 1/f^t_{Prea,0})^{D̃^t_{RACH,0}−1}. (18)
The expected number of IoT devices that succeed in RACH in the tth TTI is then derived as
E{V^t_{sp,0}} = f^t_{Prea,0} P{d_j = 1}. (19)
Based on (19), the expected number of IoT devices requesting uplink resource in the tth TTI is derived as
E{V^t_{req,0}} = E{V^t_{sp,0}} + V^{t−1}_{un,0}, (20)
where V^{t−1}_{un,0} is the number of unscheduled IoT devices in the last TTI. Note that V^{t−1}_{un,0} can be observed. However, if the data resource is not sufficient (i.e., when Eq. (10) is violated), some IoT devices may not be scheduled in the tth TTI. The upper bound on the number of scheduled IoT devices V^t_{bound} is expressed as
V^t_{bound} = ⌊(R_Uplink − R^t_{RACH}) / r^t_{DATA,0}⌋, (21)
where R_Uplink is the total number of REs reserved for uplink transmission in a TTI, R^t_{RACH} is the uplink resource configured for RACH in the tth TTI, and r^t_{DATA,0} is the number of REs required to serve one IoT device, given in Eq. (10).
According to (20) and (21), the expected number of successfully served IoT devices is given by
E{V^t_{suss,0}(f^t_{Prea,0})} = min{E{V^t_{req,0}}, V^t_{bound}}. (22)
The maximal expected number of successfully served IoT devices is obtained by selecting the number of preambles f^t_{Prea,0} as
f^t_{Prea,0} = argmax_{f ∈ F_Prea,0} E{V^t_{suss,0}(f)}. (23)
The LE-URC approach based on the estimated load D̃^t_{RACH,0} is detailed in Algorithm 1. For comparison, we consider an ideal scenario in which the actual number of RACH-requesting IoT devices D^t_{RACH} is available at the eNB, namely Full State Information based URC (FSI-URC). FSI-URC still configures f^t_{Prea,0} using the approach given in Eq. (23), but the load estimation approach given in Section III.B.1) is not required.
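The selection rule in Eq. (23) amounts to a one-dimensional search over F_Prea,0, trading RACH resource against data resource. The sketch below illustrates this under assumed per-preamble RACH cost and per-device data cost parameters; all names are ours, and the resource accounting is simplified relative to Eq. (10).

```python
def select_preambles(d_est, v_unscheduled, r_uplink, r_rach_per_preamble,
                     r_data_per_device, preamble_set=(12, 24, 36, 48)):
    """LE-URC preamble selection sketch (Eqs. (18)-(23)).

    d_est : estimated number of RACH-attempting devices in this TTI
    """
    best_f, best_served = preamble_set[0], -1.0
    for f in preamble_set:
        # Eqs. (18)-(19): expected collision-free (singleton) preambles.
        p_single = (d_est / f) * (1 - 1 / f) ** max(d_est - 1, 0)
        e_success = f * p_single
        # Eq. (20): devices expected to request uplink data resource.
        e_request = e_success + v_unscheduled
        # Eq. (21): data-resource bound after reserving RACH resource.
        r_rach = f * r_rach_per_preamble
        v_bound = max(r_uplink - r_rach, 0) // r_data_per_device
        served = min(e_request, v_bound)  # Eq. (22)
        if served > best_served:          # Eq. (23): argmax over F_Prea
            best_f, best_served = f, served
    return best_f
```

With a light RACH cost the search favors the largest preamble set; as the per-preamble cost grows, the data-resource bound pushes the optimum down, reproducing the trade-off in Eq. (22).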
Algorithm 1: Load Estimation Based Uplink Resource Configuration (LE-URC)
Input: The set of numbers of preambles in each RACH period F_Prea,0, number of IoT devices D, operation iterations I.
1: for Iteration = 1 to I do
2:   Initialize V^0_{ip,0} := 12, V^0_{cp,0} := 0, D̃^0_{RACH,0} := 0, δ^1 := 0, and bursty traffic arrival rate μ^0_bursty = 0;
3:   for t = 1 to T do
4:     Generate μ^t_bursty using Eq. (3);
5:     The eNB observes V^{t−1}_{ip,0} and V^{t−1}_{cp,0}, and calculates ζ^{t−1} using Eq. (16);
6:     Estimate the number of RACH-requesting IoT devices D̃^t_{RACH,0} using Eq. (17);
7:     Select the number of preambles f^t_{Prea,0} using Eq. (23) based on the estimated load D̃^t_{RACH,0};
8:     The eNB broadcasts f^t_{Prea,0}, and backlogged IoT devices attempt communication in the tth TTI;
9:     Update δ^{t+1} := D̃^t_{RACH,0} − D̃^{t−1}_{RACH,0};
10:  end for
11: end for
3) LE-URC for Multiple CE Groups: We slightly revise the single-parameter single-group LE-URC approach (given in Section III.B) to dynamically configure resources for multiple CE groups. Note that the repetition value n_{Repe,i} in the LE-URC approach is still kept constant to preserve the validity of the load estimation in Eq. (17). Recall that the principle of the LE-URC approach is to optimize the expected number of successfully served IoT devices while balancing R^t_{RACH} and R^t_{DATA} under the limited uplink resource R_Uplink = R^t_{RACH} + R^t_{DATA}. In the multiple CE group scenario, the data resource R^t_{DATA} is allocated to IoT devices in any CE group without bias, but R^t_{RACH,i} is specifically allocated to each CE group.
Under this condition, the expected number of successfully served IoT devices V^t_{suss,i} given in Eq. (22) needs to be modified to take multiple variables into account, which makes it non-convex and greatly complicates the optimization problem. To solve it, we use a sub-optimal solution by artificially setting an uplink resource constraint R_Uplink,i for each CE group (R_Uplink = Σ_{i=0}^{2} R_Uplink,i). Each CE group can then independently allocate its resource between R^t_{DATA,i} and R^t_{RACH,i} according to the approach given in Eq. (23).
The RL approaches are well-known in addressing dynamic control problems in complex POMDPs [31]. Nevertheless, they have rarely been studied for resource configuration in slotted-Aloha based wireless communication systems. It is therefore worthwhile to first evaluate the capability of RL in the single-parameter single-group scenario, so that it can be compared with conventional heuristic approaches. In this section, we consider a single CE group with a fixed number of RACH periods n_{Rach,0} and a fixed repetition value n_{Repe,0}, and only the number of preambles f_{Prea,0} is dynamically configured at the beginning of each TTI. In the following, we first study tabular-Q, based on a tabular representation of the value function, which is the simplest form of Q-learning with guaranteed convergence [31] but requires extremely long training time. We then study Q-learning with function approximators to improve training efficiency, where LA-Q and DQN are used to construct an approximation of the desired value function.
A. Q-Learning and Tabular Value Function
Considering a Q-agent deployed at the eNB to optimize the number of successfully served IoT devices in real-time, the Q-agent needs to explore the environment in order to progressively choose actions leading to the optimization goal. We define s ∈ S, a ∈ A, and r ∈ R as any state, action, and reward from their corresponding sets, respectively. At the beginning of the tth TTI (t ∈ {0, 1, 2, ...}), the Q-agent first observes the current state S^t, corresponding to a set of previous observations (O^t = {U^{t−1}, U^{t−2}, ..., U^1}), in order to select a specific action A^t ∈ A(S^t). The action A^t corresponds to the number of preambles in each RACH period f^t_{Prea,0} in the single CE group scenario.
As shown in Fig. 3, we consider a basic state function in the single CE group scenario, where S^t is a set of indices mapping to the currently observed information U^{t−1} = [V^{t−1}_{su,0}, V^{t−1}_{un,0}, V^{t−1}_{cp,0}, V^{t−1}_{sp,0}, V^{t−1}_{ip,0}]. With the knowledge of the state S^t, the Q-agent chooses an action A^t from the set A, which is a set of indices mapped to the set of available numbers of preambles F_Prea. Once an action A^t is performed, the Q-agent receives a scalar reward R^{t+1} and observes a new state S^{t+1}. The reward R^{t+1} indicates to what extent the executed action A^t achieves the optimization goal, which is determined by the newly observed state S^{t+1}.
Fig. 3: The Tabular-Q agent and environment interaction in the POMDP.
As the optimization goal is to maximize the number of successfully served IoT devices, we define the reward R^{t+1} as a function positively proportional to the observed number of successfully served IoT devices V^t_{su} ∈ O^t, i.e.,
R^{t+1} = V^t_{su}/c_su, (24)
where c_su is a constant used to normalize the reward function.
Q-learning is a value-based RL approach [31, 32], where the policy mapping states to actions, π(s) = a, is learned using a state-action value function Q(s, a) to determine an action for the state s. We first use a lookup table to represent the state-action value function Q(s, a) (tabular-Q), which consists of value scalars for all state and action spaces. To obtain an action A^t, we select the highest value scalar from the numerical value vector Q(S^t, a), which maps all possible actions under S^t to the Q-value table Q(s, a). Accordingly, our objective is to find an optimal Q-value table Q*(s, a) with an optimal policy π* that can select actions to dynamically optimize the number of served IoT devices. To do so, we train an initial Q-value table Q(s, a) in the environment using the Q-learning algorithm, where Q(s, a) is updated immediately after each action using the currently observed reward R^{t+1} as
Q(S^t, A^t) ← Q(S^t, A^t) + λ[R^{t+1} + γ max_{a∈A} Q(S^{t+1}, a) − Q(S^t, A^t)], (25)
where λ is a constant step-size learning rate that affects how fast the algorithm adapts to a new environment, γ ∈ [0, 1) is the discount rate that determines how current rewards affect the value function update, and max_{a∈A} Q(S^{t+1}, a) approximates the value of the optimal Q-value table Q*(s, a) via the up-to-date Q-value table Q(s, a) and the newly obtained state S^{t+1}. Note that Q(S^t, A^t) in Eq. (25) is a scalar, which means that we can only update one value scalar in the Q-value table Q(s, a) with one received reward R^{t+1}.
As shown in Fig. 3, we adopt the ε-greedy approach to balance exploitation and exploration in the Actor of the Q-agent, where ε is a positive real number with ε ≤ 1. In each TTI t, the Q-agent randomly generates a probability p^t_ε to compare with ε. With probability ε, the algorithm randomly chooses an action from the feasible actions to improve its estimate of the non-greedy actions' values. With probability 1 − ε, the algorithm exploits the current knowledge of the Q-value table to choose the action that maximizes the expected reward.
In particular, the learning rate λ is suggested to be set to a small number (e.g., λ = 0.01) to guarantee stable convergence of the Q-value table in this NB-IoT communication system. This is because a single reward in a specific TTI can be severely biased, since the state function is composed of multiple unobserved pieces of information with unpredictable distributions (e.g., an action may allow a setting with a large number of preambles f^t_{Prea}, but massive random collisions accidentally occur, leading to an unusually low reward). The implementation of uplink resource configuration using tabular-Q based real-time optimization is shown in Algorithm 2.
Algorithm 2: Tabular-Q Based Uplink Resource Configuration
Input: Valid numbers of preambles F_Prea, number of IoT devices D, operation iterations I.
1: Algorithm hyperparameters: learning rate λ ∈ (0, 1], discount rate γ ∈ [0, 1), ε-greedy rate ε ∈ (0, 1];
2: Initialize the Q-value table Q(s, a) with zero value scalars;
3: for Iteration = 1 to I do
4:   Initialize S^1 by executing a random action A^0, and set the bursty traffic arrival rate μ^0_bursty = 0;
5:   for t = 1 to T do
6:     Update μ^t_bursty using Eq. (3);
7:     if p^t_ε < ε then select a random action A^t from A;
8:     else select A^t = argmax_{a∈A} Q(S^t, a);
9:     The eNB broadcasts f^t_{Prea} = F_Prea(A^t), and backlogged IoT devices attempt communication in the tth TTI;
10:    The eNB observes S^{t+1}, calculates the reward R^{t+1} using Eq. (24), and updates Q(S^t, A^t) using Eq. (25).
11:  end for
12: end for
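The core of Algorithm 2, the update in Eq. (25) plus the ε-greedy actor, can be sketched in a few lines, assuming states and actions are hashable indices; the helper names are ours.

```python
import random
from collections import defaultdict

def tabular_q_step(Q, s, a, reward, s_next, actions, lr=0.01, gamma=0.5):
    """One tabular Q-learning update as in Eq. (25); Q maps (state, action) -> value."""
    td_target = reward + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += lr * (td_target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps):
    """The actor of Fig. 3: explore with probability eps, otherwise exploit."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])

# Usage: a zero-initialized table; actions index F_Prea = {12, 24, 36, 48}.
Q = defaultdict(float)
tabular_q_step(Q, s=0, a=1, reward=1.0, s_next=2, actions=[0, 1, 2, 3])
```

After this single update on a zero table, Q[(0, 1)] moves by λ·R = 0.01 toward the TD target, illustrating why a small λ smooths out biased single-TTI rewards.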
B. Value Function Approximation
Since tabular-Q needs its each element to be updated to converge, searching for an optimal policy can
be difficult in limited time and computational resource. To solve this problem, we use a value function
approximator instead of Q-value table to find a sub-optimal approximated policy. Generally, selecting a
efficient approximation approach to represent the value function for different learning scenarios is a usual
problem within the RL [31, 33–35]. A variety of function approximation approaches can be conducted, such
as LA, DNNs, tree search, and which approach to be selected can critically influence the successful learning
[31, 34, 35]. The function approximation should fit the complexity of the desired value function, and be
efficient to obtain good solutions. Unfortunately, most function approximation approaches require specific
design for different learning problems, and there is no basis function, which is both reliable and efficient to
satisfy all learning problems.
In this subsection, we first focus on linear function approximation for Q-learning, due to its simplicity, efficiency, and guaranteed convergence [31, 36, 37]. We then use a DNN for Q-learning as a more effective but more complicated function approximator, which is also known as DQN [32]. The reasons we use DQN are that: 1) the DNN function approximator is able to deal with several kinds of partially observable problems [31, 32]; 2) DQN has the potential to accurately approximate the desired value function while handling problems with very large state spaces [32], which is favorable for learning in the multiple CE group scenario; 3) DQN is highly scalable, as the scale of its value function can easily be fitted to a more complicated problem; and 4) a variety of libraries have been established to facilitate building DNN architectures and accelerate experiments, such as TensorFlow, PyTorch, Theano, and Keras.
1) Linear Approximation: LA-Q uses a linear weight matrix w to approximate the value function Q(s, a) with a feature vector x(s) corresponding to the state S^t. The dimension of the weight matrix w is |A| × |x|, where |A| is the total number of available actions and |x| is the size of the feature vector x. Here, we consider polynomial regression (as in [31, Eq. 9.17]) to construct the real-valued feature vector x(s) due to its efficiency3. In the training process, exploration is the same as in tabular Q-learning, by generating random actions, but exploitation is computed using the weight matrix w of the value function. In detail, to predict an action using the LA value function Q(S^t, a, w) for state S^t in the tth TTI, the approximated value function scalar for each action a is obtained by the inner product between the weight matrix w and the feature vector x(S^t) as
Q(S^t, a, w) = w · x(S^t)^T = [Σ_j w(1, j) x_j(S^t), ..., Σ_j w(|A|, j) x_j(S^t)]. (26)
By searching for the maximal value function scalar in Q(S^t, a, w) given in Eq. (26), we obtain the matched action A^t that maximizes future rewards.
To obtain the optimal policy, we update the weight matrix w in the value function Q(s, a; w) using Stochastic Gradient Descent (SGD) [31, 39]. SGD minimizes the prediction error on the observation after each example, where the error is reduced by a small amount in the direction of the optimal target policy Q*(s, a). As it is infeasible to obtain the optimal target policy by summing over all states, we instead estimate the desired action-value function by considering one learning sample Q*(s, a) ≈ Q(S^t, a, w^t) [31]. In each TTI, the weight matrix w is updated following
w^{t+1} = w^t − λ∇L(w^t), (27)
where λ is the learning rate and ∇L(w^t) is the gradient of the loss function L(w^t) used to train the Q-function approximator. This is given as
∇L(w^t) = −[R^{t+1} + γ max_a Q(S^{t+1}, a; w^t) − Q(S^t, A^t, w^t)] · x(A^t, S^t), (28)
where w^t is the weight matrix, and x(A^t, S^t) = ∇_w Q(S^t, A^t, w^t) is the feature matrix with the same shape as w^t. x(A^t, S^t) is constructed from zeros, with the feature vector located in the row corresponding to the index of the action A^t selected in the tth TTI. Note that Q(S^{t+1}, a; w^t) is a scalar. The learning procedure follows Algorithm 2, changing the Q-table Q(s, a) to the LA value function Q(s, a; w) with the linear weight matrix w, and updating Q(s, a; w) with SGD as given in (28) in step 10 of Algorithm 2.
3The polynomial case is the best-understood feature constructor and generally performs well in practice with an appropriate setting [31, 33]. Furthermore, the results in [38] show that there is a rough correspondence between a fitted neural network and a fitted ordinary parametric polynomial regression model. These reasons encourage us to compare the polynomial based LA-Q with DQN.
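Eqs. (26)-(28) reduce to a semi-gradient update that touches only the row of w matching the chosen action. The sketch below is ours: the feature constructor is a simplified stand-in for [31, Eq. 9.17], and all names are assumptions.

```python
import numpy as np

def poly_features(state, degree=2):
    """Hypothetical polynomial feature constructor (cf. [31, Eq. 9.17]): bias + powers."""
    s = np.asarray(state, dtype=float)
    feats = [np.ones(1)] + [s ** d for d in range(1, degree + 1)]
    return np.concatenate(feats)

def la_q_update(w, s, a, reward, s_next, lr=0.01, gamma=0.5):
    """Semi-gradient update of the |A| x |x| weight matrix, per Eqs. (26)-(28)."""
    x, x_next = poly_features(s), poly_features(s_next)
    q_next = w @ x_next                       # Eq. (26): one value per action
    td_error = reward + gamma * q_next.max() - w[a] @ x
    w[a] += lr * td_error * x                 # Eqs. (27)-(28): only the chosen row moves
    return w
```

Because x(A^t, S^t) is zero outside the selected action's row, the update in Eq. (28) is exactly the single-row step `w[a] += lr * td_error * x`.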
2) Deep Q-Network: The DQN agent parameterizes the state-action value function Q(s, a) by a function Q(s, a; θ), where θ represents the weight matrix of a DNN with multiple layers. We consider a conventional DNN, where neurons between two adjacent layers are fully pairwise connected, namely fully-connected layers. The input of the DNN is given by the variables in the state S^t; the intermediate hidden layers consist of Rectified Linear Units (ReLUs) using the function f(x) = max(0, x); and the output layer is composed of linear units4, which are in one-to-one correspondence with all available actions in A.
Fig. 4: The DQN agent and environment interaction in the POMDP.
Exploitation is performed via forward propagation of the Q-function Q(s, a; θ) with respect to the observed state S^t. The weight matrix θ is updated online along each training episode by using double deep
4Linear activation is used here according to [32]. Note that Q-learning is value-based, so the desired value function given in Eq. (15) can be larger than 1 rather than being a probability; thus an activation function whose output is limited to [−1, 1] (such as the sigmoid or tanh functions) can lead to convergence difficulty.
Algorithm 3: DQN Based Uplink Resource Configuration
Input: The set of numbers of preambles in each RACH period F_Prea, the number of IoT devices D, and operation iterations I.
1: Algorithm hyperparameters: learning rate λ ∈ (0, 1], discount rate γ ∈ [0, 1), ε-greedy rate ε ∈ (0, 1], target network update frequency K;
2: Initialize the replay memory M to capacity C, the primary Q-network θ, and the target Q-network θ̄;
3: for Iteration = 1 to I do
4:   Initialize S^1 by executing a random action A^0, and set the bursty traffic arrival rate μ^0_bursty = 0;
5:   for t = 1 to T do
6:     Update μ^t_bursty using Eq. (3);
7:     if p^t_ε < ε then select a random action A^t from A;
8:     else select A^t = argmax_{a∈A} Q(S^t, a, θ);
9:     The eNB broadcasts F_Prea(A^t), and backlogged IoT devices attempt communication in the tth TTI;
10:    The eNB observes S^{t+1} and calculates the reward R^{t+1} using Eq. (24);
11:    Store the transition (S^t, A^t, R^{t+1}, S^{t+1}) in replay memory M;
12:    Sample a random minibatch of transitions (S^j, A^j, R^{j+1}, S^{j+1}) from replay memory M;
13:    Perform a gradient descent step for Q(s, a; θ) using Eq. (30);
14:    Every K steps, update the target Q-network θ̄ := θ.
15:  end for
16: end for
Q-learning (DDQN) [40], which to some extent reduces the substantial overestimation5 of the value function. Accordingly, learning takes place over multiple training episodes, each of duration N_TTI TTI periods. In each TTI, the parameter θ of the Q-function approximator Q(s, a; θ) is updated using SGD as
θ^{t+1} = θ^t − λ_RMS ∇L_DDQN(θ^t), (29)
where λ_RMS is the RMSProp learning rate [41] and ∇L_DDQN(θ^t) is the gradient of the loss function L_DDQN(θ^t) used to train the Q-function approximator. This is given as
∇L_DDQN(θ^t) = −E_{S^i, A^i, R^{i+1}, S^{i+1}}[(R^{i+1} + γ max_a Q(S^{i+1}, a; θ̄^t) − Q(S^i, A^i; θ^t)) ∇_θ Q(S^i, A^i; θ^t)], (30)
where the expectation is taken with respect to a so-called minibatch, i.e., randomly selected previous samples (S^i, A^i, S^{i+1}, R^{i+1}) for some i ∈ {t − M_r, ..., t}, with M_r being the replay memory size [32]. When t − M_r is negative, this is interpreted as including samples from the previous episode. The use of a minibatch, instead of a single sample, to update the value function Q(s, a; θ) improves the convergence reliability of the value function [32]. Furthermore, following DDQN [40], in (30), θ̄^t is a so-called target Q-network that is used to estimate the future value of the Q-function in the update rule. This parameter is periodically copied from the current value θ^t and kept fixed for a number of episodes [40].
5Overestimation refers to some suboptimal actions regularly being given higher Q-values than optimal actions, which can negatively influence the convergence and training efficiency of the algorithm [34, 40].
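As a concrete sketch of Eqs. (29)-(30), the minibatch gradient below bootstraps from a frozen target network, as in (30). Here `q_fn` and `grad_fn` are hypothetical stand-ins for the primary network's forward pass and its parameter gradient, and a plain gradient step replaces RMSProp.

```python
import numpy as np

def ddqn_minibatch_grad(theta, theta_target, batch, q_fn, grad_fn, gamma=0.5):
    """Average gradient of L_DDQN over a minibatch, per Eq. (30) (sketch)."""
    grads = np.zeros_like(theta)
    for (s, a, r, s_next) in batch:
        # Bootstrap value comes from the frozen target network theta_target.
        target = r + gamma * np.max(q_fn(theta_target, s_next))
        td_error = target - q_fn(theta, s)[a]
        grads += -td_error * grad_fn(theta, s, a)   # minus sign as in Eq. (30)
    return grads / len(batch)

def sgd_step(theta, grad, lr=1e-4):
    """Plain gradient step standing in for the RMSProp update of Eq. (29)."""
    return theta - lr * grad
```

For a linear `q_fn`, a single positive-reward transition on a zero-initialized network produces a negative gradient in the chosen action's row, so the SGD step of Eq. (29) raises that action's value, as expected.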
Practically, NB-IoT is always deployed with multiple CE groups to serve IoT devices with various coverage requirements. In this section, we study problem (1) of optimizing the resource configuration for three CE groups, each with parameters A^t = {n^t_{Rach,i}, f^t_{Prea,i}, n^t_{Repe,i}}^2_{i=0}. This joint optimization, configuring each parameter in each CE group, can improve the overall data access and transmission performance. Note that all CE groups share the uplink resource in the same bandwidth, and the eNB schedules data resource to all RRC-connected IoT devices without CE group bias, as introduced in Sec. II.B.4). To optimize the number of served IoT devices in real-time, the eNB should not only balance the uplink resource between RACH and data, but also balance it among the CE groups.
The Q-learning algorithms for the single CE group provided in Sec. IV are model-free, and thus their learning structure can be directly reused in this multi-parameter multi-group scenario. However, considering multiple CE groups enlarges the observation space, which exponentially increases the size of the state space. Training a Q-agent under this expansion greatly increases the required time and computational resources. In this case, tabular-Q would be extremely inefficient, as the state-action value table not only requires a large memory, but it is also impossible to repeatedly experience every state to achieve convergence in limited time. In view of this, we only study Q-learning with value function approximation (LA-Q and DQN) to design uplink resource configuration approaches for the multi-parameter multi-group scenario.
LA-Q and DQN are highly capable of handling massive state spaces, so we can considerably enrich the state space with more observed information to support the optimization of the Q-agent. Here, we define the current state S^t to include information about the last M_o TTIs (U^{t−1}, U^{t−2}, U^{t−3}, ..., U^{t−M_o}). This design improves the Q-agent by enabling it to estimate the trend of traffic. As our goal is to optimize the number of served IoT devices, the reward function is defined according to the number of successfully served IoT devices V_{su,i} of each CE group, which is expressed as
R^{t+1} = Σ_{i=0}^{2} V^t_{su,i}/c_su. (31)
Like the state space, the action space also grows exponentially with the number of adjustable configurations. The number of available actions corresponds to the number of possible combinations of configurations, |A| = Π_{i=0}^{2}(|N_Rach,i| × |N_Repe,i| × |F_Prea,i|), where |·| denotes the number of elements in a set, A is the set of actions, and N_Rach,i, N_Repe,i, and F_Prea,i are the sets of the number of RACH periods, the repetition value, and the number of preambles in each RACH period, respectively. Unfortunately, it is extremely hard to optimize the system over such a large action space (|A| can exceed fifty thousand), because the system would update its policy using only a small part of the actions in A, which ultimately leads to convergence difficulty. To solve this problem, we provide two approaches that reduce the dimension of the action space to enable LA and DQN in the multi-parameter multi-group scenario.
A. Actions Aggregated Approach
We first provide AA based Q-learning approaches, which guarantee convergence by sacrificing the accuracy of action selection6. In detail, the selection of specific values is converted to the selection of an increasing or decreasing trend. Instead of selecting exact values from the sets N_Rach,i, N_Repe,i, and F_Prea,i, we convert the action to a single-step ascent/descent relative to the last action, represented by A^t_{Rach,i} ∈ {0, 1}, A^t_{Repe,i} ∈ {0, 1}, and A^t_{Prea,i} ∈ {0, 1} for the number of RACH periods n^t_{Rach,i}, the repetition value n^t_{Repe,i}, and the number of preambles in each RACH period f^t_{Prea,i} in the tth TTI. Consequently, the size of the total action space for the three CE groups is reduced to |A| = 2^9 = 512. The algorithms for training with the LA function approximator and DQN in the multi-parameter multi-group scenario can then be deployed following Algorithm 2 and Algorithm 3, respectively.
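The up/down aggregation can be sketched as an index-stepping rule over the nine configuration parameters (three per CE group). The set sizes below follow Table I; the clipping at the set edges and the function name are our assumptions.

```python
def apply_aa_action(bits, current_idx, set_sizes):
    """Map 9 up/down bits to new indices into (N_Rach,i, N_Repe,i, F_Prea,i).

    bits        : tuple of 9 values in {0, 1}; 1 steps the index up, 0 steps it down
    current_idx : current index per parameter (3 parameters x 3 CE groups)
    set_sizes   : size of the configuration set for each parameter
    """
    new_idx = []
    for b, idx, size in zip(bits, current_idx, set_sizes):
        step = 1 if b == 1 else -1
        new_idx.append(min(max(idx + step, 0), size - 1))  # clip at set edges
    return new_idx
```

With |N_Rach| = 3, |N_Repe| = 6, and |F_Prea| = 4 per group (Table I), the exact-value action space of 3·6·4 = 72 combinations per group collapses to 2^3 per group, giving 2^9 = 512 joint actions in total.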
B. Cooperative Multi-agent Learning Approach
Although the uplink resource configuration is managed by a central authority, identifying the control of each parameter as a sub-task that is cooperatively handled by independent Q-agents is sufficient to deal with an otherwise unsolvable action space [42]. As shown in Fig. 5, we consider multiple DQN agents centralized at the eNB, all with the same value function approximator structure7, following Section IV.B.2). We break down the action space by considering nine separate action variables in A^t, where each DQN agent controls its own action variable, as shown in Fig. 5. Recall that we have three variables for each group i, namely n_Rach,i, n_Repe,i, and f_Prea,i.
We introduce a separate DQN agent for each output variable in A^t, defined as the action A^t_k selected by the kth agent, where each kth agent is responsible for updating the value Q(S^t, A^t_k; θ_k) of its action A^t_k in the shared state S^t.
6The action aggregation idea has rarely been evaluated, but the same idea applied to states, namely state aggregation, has been well studied as a basic function approximation approach [31].
7The structure of the value function approximator can also be specifically designed for RL agents whose sub-tasks have significantly different complexity. However, there is no such requirement in our problem, so this is not considered.
Fig. 5: The CMA-DQN agents and environment interaction in the POMDP.
The DQN agents are trained in parallel and receive the same reward signal given in Eq. (31) at the
end of each TTI, as per problem (1). The use of this common reward signal ensures that all DQN agents aim to cooperatively increase the objective in (1). Note that the approach can be interpreted as applying a factorization of the overall value function akin to the approach proposed in [43] for multi-agent systems.
The challenge of this approach is how to evaluate each action according to the common reward function. For each DQN agent, the received reward is corrupted by massive noise, as its own effect on the reward is deeply hidden in the effects of all other DQN agents. For instance, a positive action can receive a mismatched low reward due to other DQN agents' negative actions. Fortunately, in our scenario, all DQN agents are centralized at the eNB, which means that all DQN agents can have full information about each other. Accordingly, we adopt the action selection histories of each DQN agent as part of the state function8, so that the agents are able to learn how the reward is influenced by different combinations of actions. To do so, we define the state variable S^t as
S^t = [A^{t−1}, U^{t−1}, A^{t−2}, U^{t−2}, ..., A^{t−M_o}, U^{t−M_o}], (32)
where M_o is the number of stored observations, A^{t−1} is the set of actions selected by each DQN agent in the (t−1)th TTI, corresponding to n_Rach,i, n_Repe,i, and f_Prea,i for the ith CE group, and U^{t−1} is the set of observed transmission receptions.
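The shared state of Eq. (32) can be maintained as a rolling buffer of (action-vector, observation-vector) pairs. The class below is a minimal sketch; the zero-padding of the first TTIs and all names are our assumptions.

```python
from collections import deque

class SharedState:
    """Rolling history of the last M_o (A, U) pairs, flattened per Eq. (32)."""
    def __init__(self, m_o):
        self.m_o = m_o
        self.hist = deque(maxlen=m_o)  # oldest TTI dropped automatically

    def push(self, action_vec, obs_vec):
        # Most recent TTI goes first, matching [A^{t-1}, U^{t-1}, ...].
        self.hist.appendleft(list(action_vec) + list(obs_vec))

    def vector(self):
        # Flatten the history; zero-pad until M_o TTIs have been observed.
        width = len(self.hist[0]) if self.hist else 0
        flat = [v for pair in self.hist for v in pair]
        return flat + [0.0] * (self.m_o * width - len(flat))
```

Every DQN agent reads this same vector, which is what lets each agent attribute the common reward to the joint action history rather than to its own action alone.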
In each TTI, the parameters θ_k of the Q-function approximator Q(S^t, A^t_k; θ_k) are updated using SGD at all agents k as in Eq. (29). The learning algorithm can be implemented following Algorithm 3. Different from the single-parameter single-group scenario, we first need to initialize nine primary networks θ_k, target networks θ̄_k, and replay memories M_k, one set per DQN agent. In step 11 of Algorithm 3, the current transitions of each DQN agent are stored separately in their own memories. In steps 12 and 13 of Algorithm 3, the minibatches of transitions are separately sampled from each memory to train the corresponding DQN agents.
8The state function can be designed to collect more information according to the complexity requirements, such as sharing the value function between DQN agents [42].
In this section, we evaluate the performance of the proposed Q-learning approaches and compare them with the conventional LE-URC and FSI-URC approaches described in Sec. III via numerical experiments. We adopt the standard network parameters listed in Table I following [1, 3, 22, 25, 29], and the Q-learning hyperparameters listed in Table II. Accordingly, one epoch consists of 937 TTIs (i.e., 10 minutes). The RL agents are first trained in a so-called learning phase and, after convergence, their performance is compared with LE-URC and FSI-URC in a so-called testing phase. All testing performance results are obtained by averaging over 1000 episodes. In the following, we present our simulation results for the single-parameter single-group scenario and the multi-parameter multi-group scenario in Section VI-A and Section VI-B, respectively.
TABLE I: Simulation Parameters
Path-loss exponent η: 4; Noise power σ²: −138 dBm
eNB broadcast power P_NPBCH: 35 dBm; Path-loss inverse power control threshold ρ: 120 dB
Maximal preamble transmit power P_RACHmax: 23 dBm; Received SNR threshold γ_th: 0 dB
Duration of periodic traffic T_periodic: 1 hour; TTI: 640 ms
Duration of bursty traffic T_bursty: 10 minutes; Set of numbers of preambles F_Prea: {12, 24, 36, 48}
Maximum allowed resource requests γ_RRC: 5; Set of repetition values N_Repe: {1, 2, 4, 8, 16, 32}
Maximum RACH attempts γ_pMax: 10; Set of numbers of RACH periods N_Rach: {1, 2, 4}
Maximum allowed RACH attempts in one CE group γ_pCE,i: 5; REs required for B_RACH: 4
Bursty traffic parameter Beta(α, β): (3, 4); REs required for B_DATA: 32
TABLE II: Q-learning Hyperparameters
Learning rate λ for Tabular-Q and LA-Q: 0.01; RMSProp learning rate λ_RMS for DQN: 0.0001
Initial exploration ε: 1; Final exploration ε: 0.1
Discount rate γ: 0.5; Minibatch size: 32
Replay memory size: 10000; Target Q-network update frequency: 1000
A. Single-Parameter Single-Group Scenario
In the single-parameter single-group scenario, the eNB is located at the center of a circular area with a 10 km radius, and the IoT devices are randomly located within the cell. We set the number of RACH periods to n_Rach = 1, the repetition value to n_Repe = 4, and the limited uplink resource to R_Uplink = 1536 REs (i.e., 32 slots with 48 sub-carriers). Unless otherwise stated, we consider the number of periodical IoT devices to be D_periodic = 10000 and the number of bursty IoT devices to be D_bursty = 5000. The DQN is set with three
hidden layers, each with 128 ReLU units. The Tabular-Q, LA-Q, and DQN approaches are proposed in Sec. IV.A, IV.B.1), and IV.B.2), respectively. The conventional LE-URC and FSI-URC approaches are proposed in Sec. III.B.
Fig. 6: The real-time traffic load and V_su for FSI-URC, LE-URC, and DQN.
Fig. 7: V_su and the average received reward for Tabular-Q, LA-Q, and DQN.
Throughout an epoch, each device has either a periodical traffic profile (i.e., the uniform distribution given in Eq. (2)) or a bursty traffic profile (i.e., the time-limited Beta profile defined in Eq. (4) with parameters (3, 4)), which has a peak around the 400th TTI. The resulting average number of newly generated packets is shown as a dashed line in Fig. 6(a). Fig. 6(b) plots the number of successfully served IoT devices V_su under the proposed FSI-URC, LE-URC, and DQN approaches. In Fig. 6(b), V_su first increases gradually with the growing traffic shown in Fig. 6(a), until it reaches the serving capacity of the eNB. Then, V_su decreases slowly due to the increasing collisions and scheduling failures as traffic grows. After that, V_su increases gradually as collisions and scheduling failures decrease with the declining traffic. Finally, V_su decreases slowly as the traffic fades.
In Fig. 6(b), we see that the ideal FSI-URC approach outperforms the LE-URC approach, because the FSI-URC approach uses the actual network load to optimally configure V^t_su at a single time instance as in Eq. (11). DQN not only always outperforms LE-URC, but also exceeds the ideal FSI-URC approach in most TTIs. This is because both LE-URC and FSI-URC only optimize V^t_su at one time instance, whereas DQN optimizes the long-term performance of the number of served IoT devices. The optimization at one time instance (LE-URC and FSI-URC) only takes into account the current trade-off between RACH resource and data resource given in Eq. (22), while the long-term optimization (DQN) also accounts for long-term hidden features, such as packets dropped due to exceeding the maximum RACH attempts γ_pMax or the maximum resource requests γ_RRC. The DQN approach can capture these hidden features well to optimize the long-term performance of V_su as in Eq. (1).
Fig. 7(a) compares the number of successfully served IoT devices Vsu under the Tabular-Q, LA-Q, and DQN
approaches. We observe that all three approaches achieve similar values of Vsu, which indicates that
both LA-Q and DQN estimate the optimal value function Q(s, a) as well as the converged Tabular-Q
in this low-complexity single CE group scenario. Fig. 7(b) plots the average received reward over each
bursty duration, E{R} = (1/Tbursty) Σ_{t=0}^{Tbursty} R_t (i.e., one epoch consists of one bursty duration Tbursty), from the
beginning of training versus the required training time. It can be seen that LA-Q and DQN converge to
the optimal value function Q(s, a) much faster (about 10 minutes) than Tabular-Q (about 5 days).
The observations in Fig. 7 demonstrate that LA-Q and DQN are good alternatives to Tabular-Q, serving
almost the same number of IoT devices with much less training time.
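The training-time gap between Tabular-Q and LA-Q comes from what each update touches: a tabular backup writes a single table cell, while a linear-approximation step adjusts a small shared weight vector that generalizes across states. A minimal sketch, assuming a hypothetical feature map phi(s, a) and illustrative sizes:

```python
import numpy as np

# Hedged sketch contrasting the tabular-Q backup with the LA-Q backup used
# as its function-approximation alternative. The feature map and all sizes
# are illustrative assumptions, not the paper's implementation.

ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor (assumed)

def tabular_update(Q, s, a, r, s_next):
    """One tabular backup: updates exactly one (state, action) cell."""
    target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (target - Q[s, a])

def la_q_update(w, phi_sa, r, phi_next_all):
    """LA-Q: Q(s, a) = w . phi(s, a); semi-gradient step on the TD error."""
    q_next = max(float(w @ p) for p in phi_next_all)
    td_error = r + GAMMA * q_next - float(w @ phi_sa)
    w += ALPHA * td_error * phi_sa  # weights are shared across all states

Q = np.zeros((4, 2))   # tabular: one cell per (state, action) pair
tabular_update(Q, s=0, a=1, r=1.0, s_next=2)

w = np.zeros(3)        # linear: 3 weights regardless of the state count
la_q_update(w, np.array([1.0, 0.5, 0.0]), 1.0,
            [np.array([0.0, 1.0, 1.0])])
```

Because every LA-Q step updates all shared weights, experience generalizes across states, which is consistent with the far shorter training time reported above.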
Fig. 8(a) and Fig. 8(b) plot the average number of successfully served IoT devices E{Vsu} and the average
number of dropped packets E{Vdrop} (a system performance metric that can only be obtained in simulation)
over a bursty duration Tbursty versus the number of bursty IoT devices Dbursty. In Fig. 8(a), we observe
that E{Vsu} first increases and then decreases as the number of bursty devices grows; the decreasing
trend starts once the eNB can no longer serve the growing number of IoT devices due to the increasing
collisions and scheduling failures. These collisions and scheduling failures also cause the number of
dropped packets to grow with traffic, as shown in Fig. 8(b). In Fig. 8, we also notice that DQN
always outperforms LE-URC (especially for relatively large Dbursty), which indicates the superiority of the DQN
approach in handling massive bursty IoT devices. Interestingly, DQN serves more IoT devices
and yields smaller mean errors than the ideal FSI-URC approach in most cases,
thanks to the long-term optimization capability of DQN.
B. Multi-Parameter Multi-Group Scenario
Considering an eNB located at the center of a circular area with a 12 km radius, we set the RSRP thresholds for CE
group selection to {γRSRP1, γRSRP2} = {0, 5} dB, the uplink resource to Ruplink = 15360 REs (i.e., 320 slots with
48 sub-carriers), and the NPUSCH constraints for LE-URC following Ruplink,0 : Ruplink,1 : Ruplink,2 = 1 : 1 : 1.
To model massive IoT traffic, both the number of periodical IoT devices Dperiodic and the number of bursty
IoT devices Dbursty increase to 30000. In AA-DQN, we use one Q-network with three hidden layers, each
consisting of 2048 ReLU units. In CMA-DQN, nine DQNs are used to control each of the nine
configuration parameters (i.e., nRach,i, nRepe,i, fPrea,i for the three CE groups), where each DQN has three hidden layers,
each with 128 ReLU units. The AA-LA-Q and AA-DQN approaches are proposed in Sec. V.A, and the CMA-DQN
approach is proposed in Sec. V.B.
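The two network shapes just described can be sketched as plain MLP forward passes (framework-agnostic; the state dimension and action count below are illustrative assumptions, not the paper's values):

```python
import numpy as np

# Sketch of the two Q-network shapes described above: one large network for
# AA-DQN (three hidden layers of 2048 ReLU units) versus nine small networks
# for CMA-DQN (three hidden layers of 128 ReLU units each). The input and
# output sizes are illustrative assumptions.

def mlp(sizes, rng):
    """Build (weight, bias) pairs for a fully connected network."""
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU hidden units
    return x  # linear output layer: one Q-value per action

rng = np.random.default_rng(0)
state_dim, n_actions = 16, 8                                 # assumed sizes
aa_dqn = mlp([state_dim, 2048, 2048, 2048, n_actions], rng)  # one big net
cma_dqn = [mlp([state_dim, 128, 128, 128, n_actions], rng)   # nine small nets,
           for _ in range(9)]                                # one per parameter

q = forward(cma_dqn[0], np.zeros(state_dim))
print(q.shape)  # (8,)
```

Splitting the nine parameters across nine small networks keeps each agent's action space, and hence each output layer, small, which is the structural idea behind CMA-DQN.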
Fig. 8: E{Vsu} and E{Vdrop} for FSI-URC, LE-URC, and DQN.
Fig. 9: Vsu and the average received reward.
Fig. 9(a) compares the number of successfully served IoT devices Vsu during one epoch using AA-LA-Q,
AA-DQN, CMA-DQN, and LE-URC. The “LE-URC-[1,4,8]” and “LE-URC-[2,8,16]” curves represent
the LE-URC approach with the repetition values {nRepe,0, nRepe,1, nRepe,2} set to {1, 4, 8} and {2, 8, 16},
respectively. We observe that the number of successfully served IoT devices Vsu follows CMA-DQN > AA-
DQN > AA-LA-Q > LE-URC-[1,4,8] > LE-URC-[2,8,16]. As can be seen, all Q-learning based approaches
outperform the LE-URC approaches, because the Q-learning based approaches can dynamically optimize
the number of served IoT devices by accurately configuring each parameter. We also observe that CMA-
DQN slightly outperforms the others in the light-traffic regions at the beginning and end of the epoch,
but substantially outperforms them during the heavy-traffic period in the middle of the epoch. This
demonstrates the capability of CMA-DQN to better manage the scarce channel resource in the presence
of heavy traffic. It is also observed that increasing the repetition value of each CE group with LE-URC
improves the received SNR, and thus the RACH success rate, in the light-traffic region, but degrades the
scheduling success rate in the heavy-traffic region due to the limited channel resource.
Fig. 9(b) plots the average received reward over each bursty duration, E{R} = (1/Tbursty) Σ_{t=0}^{Tbursty} R_t, from the
beginning of training versus the consumed training time. It can be seen that CMA-DQN and AA-DQN
require much less training time than AA-LA-Q. Compared with the single CE group results
shown in Fig. 7, the DNN is a better value function approximator for the three CE groups scenario owing to
its efficiency and capability in solving high-complexity problems. We also observe that CMA-DQN achieves
higher E{R} than AA-DQN, because CMA-DQN can select the exact values from the action
sets {NRepe, NRach, FPrea}, whereas AA-DQN can only select ascent/descent actions, which leads to a
sub-optimal solution.
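The ascent/descent limitation of AA-DQN can be illustrated with a toy action-aggregation step over a sorted candidate list (the candidate values and action encoding below are illustrative assumptions, not the paper's configuration sets):

```python
# Hedged sketch of the action-aggregation idea: rather than picking an
# exact value from the configuration set, an AA-style agent picks a step
# direction over a sorted candidate list. The candidate repetition values
# below are illustrative assumptions.

N_REPE = [1, 2, 4, 8, 16, 32, 64, 128]  # sorted candidate repetition values

def aa_step(index, action):
    """Aggregated actions: 0 = descend, 1 = hold, 2 = ascend (clamped)."""
    delta = {0: -1, 1: 0, 2: +1}[action]
    return min(max(index + delta, 0), len(N_REPE) - 1)

# Exact selection (CMA-DQN style) reaches any candidate in one decision;
# the aggregated agent must walk the list one step per decision, which is
# the source of the sub-optimality discussed above.
i = 0
for a in (2, 2, 2):  # three consecutive "ascend" actions
    i = aa_step(i, a)
print(N_REPE[i])  # 8
```

The clamping at both ends of the list keeps the aggregated action space valid without enlarging it.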
Fig. 10: The average number of successfully served IoT devices Vsucc,i for each CE group i.
Fig. 11: The allocated repetition value n^t_Repe,i, and RAOs produced by n^t_Rach,i × f^t_Prea,i.
To gain more insight into the operation of CMA-DQN, Fig. 10 plots the average number of successfully
served IoT devices Vsucc,i for each CE group i, and Fig. 11 plots the average number of repetitions n^t_Repe,i
and the average number of Random Access Opportunities (RAOs), defined as the product n^t_Rach,i × f^t_Prea,i, for
each CE group i selected by CMA-DQN over the testing episodes. As seen in Fig. 10, CMA-DQN
substantially outperforms the LE-URC approaches for each CE group i; the reasons for this performance
are showcased in Fig. 11. As seen in Fig. 11(a)-(c), CMA-DQN increases the number of repetitions in the
light-traffic region in order to improve the SNR and reduce RACH failures, while decreasing it in the heavy-
traffic region so as to reduce scheduling failures. Surprisingly, CMA-DQN increases the repetition value
nRepe,0 of group 0 at the same time, exactly opposite to its actions on nRepe,1 and nRepe,2. This
is because CMA-DQN learns that the key to optimizing the overall performance Vsu is to guarantee
Vsucc,0, as the IoT devices in CE group 0 are easier to serve: they are located close to the eNB
and consume fewer resources. As illustrated in Fig. 11(d)-(f), this allows CMA-DQN to increase the number
of RAOs in the high-traffic regime, mitigating the impact of collisions on the throughput. In contrast, for
CE groups 1 and 2 in the heavy-traffic region, LE-URC decreases the number of RAOs in order to reduce
resource scheduling failures, causing an overall lower throughput as seen in Fig. 10.
Fig. 12: The average number of successfully served IoT devices per TTI over each epoch in online updating.
Realistic network conditions can differ from the simulation environment, because the
practical traffic and physical channel vary and can be unpredictable. This difference may lead to inaccurate
configurations that degrade the system performance of each approach. Fortunately, the proposed RL-based
approaches can self-update after deployment according to practical observations of the NB-IoT network in an
online manner. To model this, we take the trained CMA-DQN agents given in Fig. 11 (i.e., with bursty traffic
modelled by the time-limited Beta profile with parameters (3, 4)), and test them in a slightly modified traffic
scenario where the bursty traffic follows Beta(5, 6), with a constant exploration rate ε = 0.001. Fig.
12 plots the average number of successfully served IoT devices E{Vsu} per TTI over each episode versus
epochs. Our result shows that, as expected, E{Vsu} follows CMA-DQN > LE-URC-[1,4,8] > LE-URC-[2,8,16]
at every epoch. More importantly, the performance of CMA-DQN gradually improves over the epochs, which
sheds light on the online self-updating capability of the proposed RL-based approaches.
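The online self-updating behaviour rests on keeping a small constant exploration rate after deployment. A minimal ε-greedy selection sketch, with stand-in Q-values rather than the paper's agent:

```python
import random

# Minimal sketch of the online self-updating idea: after deployment the
# agent keeps a small constant exploration rate (epsilon = 0.001, matching
# the experiment above) so it occasionally re-tries non-greedy
# configurations and can adapt when live traffic drifts from the training
# distribution. The Q-values below are illustrative stand-ins.

EPSILON = 0.001  # constant exploration rate during online updating

def select_action(q_values, rng):
    """Epsilon-greedy: mostly exploit the current best, rarely explore."""
    if rng.random() < EPSILON:
        return rng.randrange(len(q_values))  # exploratory configuration
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
q = [3.0, 7.5, 1.2]  # stand-in Q-values for three candidate configurations
actions = [select_action(q, rng) for _ in range(10000)]
# Almost every TTI exploits the current best configuration (index 1); the
# rare exploratory picks supply fresh data for continued Q-updates online.
```

With ε this small, exploration barely costs throughput, yet it keeps generating the off-policy samples that let the deployed agents improve over epochs.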
In this paper, we developed Q-learning based uplink resource configuration approaches to optimize the
number of served IoT devices in real time in NB-IoT networks. We first developed Tabular-Q, LA-Q, and
DQN based approaches for the single-parameter single-group scenario, which are shown to outperform
the conventional LE-URC and FSI-URC approaches in terms of the number of served IoT devices. Our
results demonstrated that LA-Q and DQN are good alternatives to Tabular-Q, achieving almost the same
system performance with much less training time. To support traffic with different coverage requirements, we
then studied the multi-parameter multi-group scenario defined in the NB-IoT standard, which introduces a
high-dimensional configuration problem. To solve it, we advanced the proposed LA-Q and DQN using the
Actions Aggregation technique (AA-LA-Q and AA-DQN), which guarantees the convergence of Q-
learning at the cost of some accuracy in resource configuration. We further developed CMA-DQN by dividing
the high-dimensional configuration into multiple parallel sub-tasks, which achieved the best performance in
terms of the number of successfully served IoT devices Vsu with the least training time.
... G0 through G3 are the values of the lookup table for OGM period, and the Bayes Optimizer changes these values and receives a score based on their performance. The OGM period is calculated as in Eq. (12), where ̄ is normalized dynamicity as shown in Eq. (13) and ̄ is the normalized density as shown in Eq. (14). ...
... In this way, retransmission of messages is taken into consideration for transmission. Other algorithms have used metrics such as packet delivery rate and number of packets delivered to calculate round score [14], [15]. We chose this metric so that both the number of packet transmissions, and the latency of the messages would both be valued in the calculation of performance. ...
Conference Paper
Mobile Ad-hoc Networks are a growing field of interest. They have many real-world applications, such as enabling internet connected sensors to operate in environments without pre-existing infrastructure. In past work, we have demonstrated that the Long Range (LoRa) radio frequency (RF) modulation technique, in conjunction with a mesh network can meet these needs in static networks. To extend this to applications with mobile nodes, several adaptations have been implemented to extend the original B.A.T.M.A.N (Better Approach to Mobile Ad-hoc Networking) mesh network algorithm. Node movement models were developed and tested to improve simulation accuracy. We also implemented situationally aware, machine learning (ML) based, route discovery techniques to ensure adequate network information is available in dynamic environments, without adding excessive overhead in static situations. To optimize these changes, a Black Box Optimizer was used in conjunction with an event-based simulation tool to train the ML model.
... Inspired by these developments, authors in [13], [14] have discussed the architectures based on joint optimal cost and resource-efficient methodologies for application in health and vehicular communications. Furthermore, as the deployment of intelligent architectures are increasingly in demand for 5G and beyond, [15]- [17] have proposed novel radio resource management mechanisms using fully connected deep neural networks. However, with the increase in such learning-based architecture deployments, there is also a need to optimize the use of resources for the same, as discussed in [18], for moving towards green-learning based resource management. ...
... Equation (15) refers to the predicted energy consumption of i th SCBS node, ∀i ∈ {1, 2, . . . , N } for the j th time stamp, ∀j ∈ {1, 2, . . . ...
Full-text available
Optimal resource provisioning and management of the next generation communication networks are crucial for attaining a seamless Quality of Service with reduced environmental impact. Considering the ecological assessment, urban and rural telecommunication infrastructure is moving towards deploying green cellular base stations to cater to the needs of the ever-growing traffic demands of heterogeneous networks. In such scenarios , the existing learning-based renewable resource provision-ing methods lack intelligent and optimal resource management at the Small Cell Base Stations (SCBS). Therefore, in this article, we present a novel machine learning-based framework for intelligent resource provisioning mechanisms for micro-grid connected green SCBSs with a completely modified ring parametric distribution method. In addition, an algorithmic implementation is proposed for prediction-based renewable resource redistribution with Energy Flow Control Unit (EFCU) mechanism for grid-connected SCBS, eliminating the need for centralised hardware. Moreover, this modeling enables the prediction mechanism to estimate the future on-demand traffic provisioning capability of SCBS. Furthermore, we present the numerical analysis of the proposed framework showcasing the systems' ability to attain a balanced energy convergence level of all the SCBS at the end of the periodic cycle, signifying our model's merits.
... DQN (Deep Q-Learning) is a DRL-Based (Deep Reinforcement Learning Based) method, which is often used in various resource allocation fields [11][12][13][14][15][16][17][18]. However, DQN is difficult to handle continuous action space. ...
... Guolin Sun, et al. use the DQN method to balance the energy consumption and user satisfaction issues in the C-RANs system [13]. By improving DQN, [14][15][16][17] allocate network resources and computing resources for edge computing, so that the delay in the system is lower. DQN-Based resource allocation methods enable edge resources to be more reasonably allocated to different tasks. ...
Full-text available
Intelligent video surveillance is important to ensure production safety in coal mines, while cloud-edge cooperation is an effective means to improve the performance of intelligent video monitoring. However, in edge layers, incorrect resource allocation of computing and network resources will result in the waste of resources and low real-time performance. In this paper, a DDPG-Based (Deep deterministic policy gradient-based) edge resource allocation method for cloud-edge cooperation framework is proposed. Firstly, the cloud-edge cooperation framework is designed for different tasks. Secondly, the joint minimizing problem of latency and bandwidth usage caused by edge computing is modeled. To quickly solve the joint optimization problem, we convert it to MDP (Markov Decision Process). In addition, ESPN (Edge status perception network) is proposed, which enhances the ability of feature perception and action output of DDPG. Finally, DDPG-ESPN is proposed to solve the joint optimization problem. Simulation results show that compared with other methods, DDPG-ESPN improves the real-time performance and bandwidth usage by up to 18.88% and 42.81% respectively.
... The traditional SADRL approach (see Section 3.1) has been applied in distributed agents to handle large state and action spaces. For instance, the resource allocation scheme (A.1) in [51,[68][69][70] address the challenge of high dynamicity (C.2) and enhance the throughput performance (P.3) of distributed agents in 4G networks. Using the SADRL approach, distributed agents: (a) do not exchange local information among themselves, and so the signaling overhead is reduced (O.4); and (b) use a lesser amount of data, including states and actions, for learning, and so it increases scalability (O.5) with a lower computational complexity. ...
... Hence, various approaches have been proposed. For instance, Nan et al. enable distributed SADRL agents to use historical knowledge to ensure stability (O.1) in [69], and Arjit et al. detects missing data for improved reliability of medical image analysis in [71]. The rest of this section presents the SADRL approaches applied to multi-agent environments. ...
Full-text available
Recent advancements in deep reinforcement learning (DRL) have led to its application in multi-agent scenarios to solve complex real-world problems, such as network resource allocation and sharing, network routing, and traffic signal controls. Multi-agent DRL (MADRL) enables multiple agents to interact with each other and with their operating environment, and learn without the need for external critics (or teachers), thereby solving complex problems. Significant performance enhancements brought about by the use of MADRL have been reported in multi-agent domains; for instance, it has been shown to provide higher quality of service (QoS) in network resource allocation and sharing. This paper presents a survey of MADRL models that have been proposed for various kinds of multi-agent domains, in a taxonomic approach that highlights various aspects of MADRL models and applications, including objectives, characteristics, challenges, applications, and performance measures. Furthermore, we present open issues and future directions of MADRL.
... The third method is to use the simulation tool and reinforcement learning (RL) methods to determine a metric that optimizes the network based on a round reward. Others have used training round rewards based on the number of packets received [17] or packet delivery rates [18]; however due to the inherent high latency of LoRa, there was a need to incentivize minimizing latency. In this metric, the destination score was modeled as a simple single layer perception as shown in Fig. 2 which is calculated in the general form as (13). ...
Conference Paper
To be useful, wireless sensor networks (WSNs) must be relied upon even when dispersed across environments that lack consistent internet access. To this end, we propose a mesh network architecture based on the Better Approach to Mobile Ad-hoc Networking (B.A.T.M.A.N.) algorithm in conjunction with the long range, low power communication protocol, LoRa, to transmit messages. Adaptations including methods of time synchronization, slotted ALOHA transmission and Quality of Service (QoS) considerations with a network-traffic-aware data routing protocol for a multi-source/multi-sink network configuration have been implemented. With this solution, nodes can create an ad-hoc network, sharing internet access and greatly expanding the network coverage without the need for any additional infrastructure. Our QoS-aware routing metrics have been tested in simulation and show performance improvements over traditional B.A.T.M.A.N. destination routing algorithms in these low data rate systems.
... Reinforcement learning is one of the important tools in the field of machine learning. It is widely used to deal with Markov dynamic programming problems [26,27]. As shown in Figure 1, the AI engine is designed as an agent that combines deep learning and reinforcement learning. ...
Full-text available
Since the birth of narrowband Internet of Things (NB-IoT), the Internet of Things (IoT) industry has made a considerable progress in the application for smart cities, smart manufacturing, and healthcare. Therefore, the number of UEs is increasing exponentially, which brings considerable pressure to the efficient resource allocation for the bandwidth and power constrained NB-IoT networks. In view of the conventional algorithms that cannot dynamically adjust resource allocation, resulting in a low resource utilization and prone to resource fragmentation, this paper proposes a double deep Q-network (DDQN)-based NB-IoT dynamic resource allocation algorithm. It first builds an NB-IoT environment model based on the real environment. Then, the DDQN algorithm interacts with the NB-IoT environment model to learn and optimize resource allocation strategies until it converges to the optimum. Finally, the simulation results show that the DDQN-based NB-IoT dynamic resource allocation algorithm is better than the traditional algorithm in the resource utilization, average transmission rate, and UE average queuing time.
For the easy and flexible management of large scale networks, Software-Defined Networking (SDN) is a strong candidate technology that offers centralisation and programmable interfaces for making complex decisions in a dynamic and seamless manner. On the one hand, there are opportunities for individuals and businesses to build and improve services and applications based on their requirements in the SDN. On the other hand, SDN poses a new array of privacy and security threats, such as Distributed Denial of Service (DDoS) attacks. For detecting and mitigating potential threats, Machine Learning (ML) is an effective approach that has a quick response to anomalies. In this article, we analyse and compare the performance, using different ML techniques, to detect DDoS attacks in SDN, where both experimental datasets and self-generated traffic data are evaluated. Moreover, we propose a simple supervised learning (SL) model to detect flooding DDoS attacks against the SDN controller via the fluctuation of flows. By dividing a test round into multiple pieces, the statistics within each time slot reflects the variation of network behaviours. And this ”trend” can be recruited as samples to train a predictor to understand the network status, as well as to detect DDoS attacks. We verify the outcome through simulations and measurements over a real testbed. Our main goal is to find a lightweight SL model to detect DDoS attacks with data and features that can be easily obtained. Our results show that SL is able to detect DDoS attacks with a single feature. The performance of the analysed SL algorithms is influenced by the size of training set and parameters used. The accuracy of prediction using the same SL model could be entirely different depending on the training set.
Recently, with the development of Internet of Things (IoT) technology, the devices with the various features of traffic and mobility are increasing exponentially, and now the existing traditional resource allocation algorithms are becoming more and more difficult to meet the ever-increasing demand for terminal transmission. Aiming at the problem of radio resource fragment for complex access users of existing traditional algorithms, this paper proposes a dynamic scheduling algorithm based on Double Deep Q-learning Network(DDQN). At the same time, we design and simulate the NPUSCH transmission environment of the NB-IoT as the interactive environment of the agent. After training iterations, the resource utilization rate of the dynamic scheduling algorithm based on DDQN can be stabilized above 81%, which is better than traditional scheduling algorithms.
Full-text available
Reinforcement learning (RL) methods can successfully solve complex optimization problems. Our article gives a systematic overview of major types of RL methods, their applications at the field of Industry 4.0 solutions, and it provides methodological guidelines to determine the right approach that can be fitted better to the different problems, and moreover, it can be a point of reference for R&D projects and further researches.
Full-text available
The cellular-based infrastructure is regarded as one of potential solutions for massive Internet of Things (mIoT), where the Random Access (RA) procedure is used for requesting channel resources in the uplink data transmission. Due to the nature of mIoT network with the sporadic uplink transmissions of a large amount of IoT devices, massive concurrent channel resource requests lead to a high probability of RA failure. To relieve the congestion during the RA in mIoT networks, we model RA procedure, and analyze as well as evaluate the performance improvement due to different RA schemes, including power ramping (PR), back-off (BO), access class barring (ACB), hybrid ACB and back-off schemes (ACB&BO), and hybrid power ramping and back-off (PR&BO). To do so, we develop a traffic-aware spatio-temporal model for the contention-based RA analysis in the mIoT network, where the signal-to-noise-plus-interference ratio (SINR) outage and collision events jointly determine the traffic evolution and the RA success probability. Compared with existing literature only modelled collision from single cell perspective, we model both SINR outage and the collision from the network perspective. Based on this analytical model, we derive the analytical expression for the RA success probabilities to show the effectiveness of different RA schemes. We also derive the average queue lengths and the average waiting delays of each RA scheme to evaluate the packets accumulation status and packets serving efficiency. Our results show that our proposed PR&BO scheme outperforms other schemes in heavy traffic scenario in terms of the RA success probability, the average queue length, and the average waiting delay.
Conference Paper
Full-text available
Narrowband IoT (NB-IoT) is the latest IoT connec-tivity solution presented by the 3GPP. NB-IoT introduces coverage classes and introduces a significant link budget improvement by allowing repeated transmissions by nodes that experience high path loss. However, those repetitions necessarily increase the energy consumption and the latency in the whole NB-IoT system. The extent to which the whole system is affected depends on the scheduling of the uplink and downlink channels. We address this question, not treated previously, by developing a tractable model of NB-IoT access protocol operation, comprising message exchanges in random-access, control, and data channels, both in the uplink and downlink. The model is then used to analyze the impact of channel scheduling as well as the interaction of coexisting coverage classes, through derivation of the expected latency and battery lifetime for each coverage class. These results are subsequently employed in investigation of latency-energy tradeoff in NB-IoT channel scheduling as well as determining the optimized operation points. Simulations results show validity of the analysis and confirm that there is a significant impact of channel scheduling on latency and lifetime performance of NB-IoT devices.
Narrowband Internet of Things (NB-IoT) is a prominent technology that fits the requirements of future Internet of Things (IoT) networks. However, due to the limited spectrum (i.e., 180 kHz) available to NB-IoT systems, one of the key issues is how to efficiently use these resources to support massive numbers of IoT devices. Furthermore, in NB-IoT, to reduce computational complexity and to provide coverage extension, the concepts of time offset and repetition have been introduced. Considering these new features, existing resource management schemes are no longer applicable. Moreover, the allocation of the frequency band for NB-IoT within the LTE band, or as a standalone deployment, might not be synchronous in all cells, resulting in inter-cell interference (ICI) from the neighbouring cells' LTE users or NB-IoT users (in the synchronous case). In this paper, a theoretical framework for the upper bound on the achievable data rate is first formulated in the presence of a control channel and a repetition factor. The conducted analysis shows that the maximum achievable data rates are 89.2 kbps and 92 kbps for the downlink and uplink, respectively. Secondly, we propose an interference-aware resource allocation for NB-IoT by formulating a rate maximization problem that accounts for the overhead of control channels, the time offset, and the repetition factor. Due to the complexity of finding the globally optimal solution of the formulated problem, a sub-optimal solution with an iterative algorithm based on cooperative approaches is proposed. The proposed algorithm is then evaluated to investigate the impact of the repetition factor, time offset, and ICI on the NB-IoT data rate and energy consumption. Furthermore, a detailed comparison between the non-cooperative, cooperative, and optimal (i.e., no repetition) schemes is also presented. Simulation results show that the cooperative scheme provides up to 8% rate improvement and 17% energy reduction compared with the non-cooperative scheme.
NarrowBand-Internet of Things (NB-IoT) is a radio access technology recently standardized by 3GPP. To provide reliable connections with extended coverage, a repetition transmission scheme is applied in both the Random Access CHannel (RACH) procedure and data transmission. In this letter, we model the RACH in the NB-IoT network using stochastic geometry, taking into account repeated preamble transmissions and collisions. We derive the exact expression for the RACH success probability under time-correlated interference, and validate the analysis for different repetition values via independent simulations. Numerical results show that the repetition scheme can efficiently improve the RACH success probability in a light traffic scenario, but only slightly improves that performance, with very inefficient channel resource utilization, in a heavy traffic scenario.
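The light-versus-heavy traffic effect described above can be reproduced with a toy Monte-Carlo sketch. Collision and outage are drawn independently here, which is a simplification of the letter's time-correlated interference analysis, and the per-copy outage probability and all sizes are assumed for illustration:

```python
import random

def rach_success(n_devices, n_preambles, reps, per_copy_outage=0.5,
                 trials=3000, seed=0):
    """Fraction of devices whose preamble avoids collision AND has at least
    one of its `reps` repeated copies survive an outage event."""
    rng = random.Random(seed)
    served = 0
    for _ in range(trials):
        picks = [rng.randrange(n_preambles) for _ in range(n_devices)]
        counts = {}
        for p in picks:
            counts[p] = counts.get(p, 0) + 1
        for p in picks:
            # decoding succeeds if any one of the repeated copies gets through
            decoded = rng.random() < 1.0 - per_copy_outage ** reps
            if counts[p] == 1 and decoded:
                served += 1
    return served / (trials * n_devices)

# Repetition gain is large under light traffic but marginal under heavy
# traffic, where collisions (which repetition cannot fix) dominate.
light_gain = rach_success(5, 54, reps=4) - rach_success(5, 54, reps=1)
heavy_gain = rach_success(150, 54, reps=4) - rach_success(150, 54, reps=1)
```

The sketch also hints at the resource-efficiency point: in the heavy-traffic case, quadrupling the transmitted copies buys only a small success-probability gain.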
We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the "lazy agent" problem, which arises due to partial observability. We address these problems by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially-observable multi-agent domains and show that learning such value-decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.
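A minimal tabular sketch of the value-decomposition idea above: the joint action-value of two cooperating agents is modelled as the SUM of per-agent values, Q_tot(s, a1, a2) = Q1(s, a1) + Q2(s, a2), trained from a single team reward. The sizes, learning rates, and toy task are illustrative assumptions, not the paper's neural architecture:

```python
import numpy as np

n_states, n_actions = 2, 3
q1 = np.zeros((n_states, n_actions))
q2 = np.zeros((n_states, n_actions))

def greedy_joint_action(s):
    # Because Q_tot is additive, its joint argmax decomposes into independent
    # per-agent argmaxes -- the property that makes decentralised execution cheap.
    return int(q1[s].argmax()), int(q2[s].argmax())

def td_update(s, a1, a2, team_reward, s_next, alpha=0.1, gamma=0.9):
    # One TD(0) step on the summed value: the single team reward is credited
    # to both agents through the shared temporal-difference error.
    target = team_reward + gamma * (q1[s_next].max() + q2[s_next].max())
    td_err = target - (q1[s, a1] + q2[s, a2])
    q1[s, a1] += alpha * td_err
    q2[s, a2] += alpha * td_err

# Toy cooperative task: the team is rewarded only when BOTH agents pick
# action 0 in state 0; random exploration plus the shared TD error suffices.
rng = np.random.default_rng(0)
for _ in range(5000):
    a1, a2 = int(rng.integers(n_actions)), int(rng.integers(n_actions))
    td_update(0, a1, a2, 1.0 if (a1, a2) == (0, 0) else 0.0, 0, alpha=0.05)
```

The additive form cannot represent the AND-shaped reward exactly, but the best additive fit still ranks the cooperative joint action highest, which is all greedy execution needs.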
In smart city applications, huge numbers of devices need to be connected in an autonomous manner. The 3rd Generation Partnership Project (3GPP) specifies that Machine Type Communication (MTC) should be used to handle data transmission among large numbers of devices. However, the data transmission rates are highly variable, and this brings about a congestion problem. To tackle this problem, the use of Access Class Barring (ACB) is recommended, restricting the number of access attempts allowed in data transmission via strategic parameters. In this paper, we model the problem of determining these strategic parameters with a reinforcement learning algorithm. In our model, the system evolves to minimize both the collision rate and the access delay. The experimental results show that our scheme improves system performance in terms of the access success rate, the failure rate, the collision rate, and the access delay.
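A hedged sketch of the idea above: a tabular Q-learning agent selects the ACB barring factor in each access slot, rewarded for successes and penalized for collisions. The contention model, reward shaping, and every constant here are illustrative assumptions, not the paper's setup:

```python
import random

ACB_FACTORS = [0.1, 0.3, 0.5, 0.7, 1.0]   # action space: candidate barring factors
N_PREAMBLES, N_DEVICES = 54, 200          # overloaded cell: devices >> preambles

def access_slot(acb, rng):
    """One contention round: return (successful devices, collided preambles)."""
    picks = {}
    for _ in range(N_DEVICES):
        if rng.random() < acb:            # device passes the ACB check
            p = rng.randrange(N_PREAMBLES)
            picks[p] = picks.get(p, 0) + 1
    succ = sum(1 for c in picks.values() if c == 1)
    coll = sum(1 for c in picks.values() if c > 1)
    return succ, coll

def train(episodes=3000, alpha=0.1, eps=0.1, seed=1):
    rng = random.Random(seed)
    q = [0.0] * len(ACB_FACTORS)          # stateless problem: one Q per action
    for _ in range(episodes):
        if rng.random() < eps:            # epsilon-greedy exploration
            a = rng.randrange(len(ACB_FACTORS))
        else:
            a = max(range(len(ACB_FACTORS)), key=q.__getitem__)
        succ, coll = access_slot(ACB_FACTORS[a], rng)
        reward = succ - 0.5 * coll        # reward access success, punish collisions
        q[a] += alpha * (reward - q[a])   # running Q estimate
    return q

q = train()
best = ACB_FACTORS[max(range(len(ACB_FACTORS)), key=q.__getitem__)]
```

Under this overload, the learned barring factor is well below 1.0: the agent discovers that admitting everyone maximizes collisions rather than successes.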
Narrowband Internet of Things (NB-IoT) is a new access technology introduced by 3GPP. This paper presents an analytical model to estimate the access success probability and average access delay of the random access channels, considering the maximum number of preamble transmissions, the size of the backoff window, and the number of sub-carriers in each coverage enhancement (CE) level. A joint optimization technique is proposed to configure these parameters so as to maximize the access success probability under a target delay constraint. The accuracy of the analysis and the effectiveness of the proposed optimization technique are verified by computer simulations and benchmarked against exhaustive search. The results show that the proposed optimization is able to find the optimal configuration under various conditions.
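The joint configuration search can be sketched as a small exhaustive enumeration over (maximum preamble transmissions, backoff window) pairs, keeping only configurations that meet the delay target and picking the one with the highest modelled success probability. The closed-form access model below is a deliberately simplified stand-in for the paper's analysis; the failure probabilities, window sizes, and delay model are all assumptions:

```python
def access_model(max_tx, backoff_w, slot_ms=1.0):
    """Return (success probability, mean access delay in ms) for one config."""
    # larger backoff windows de-synchronise retries, so the per-attempt
    # failure probability is modelled as decreasing in the window size
    p_fail = 0.5 / (1.0 + 0.1 * backoff_w)
    p_succ = 1.0 - p_fail ** max_tx
    # expected attempts: attempt k+1 happens with probability p_fail**k;
    # each attempt costs one slot plus, on average, half the backoff window
    mean_attempts = sum(p_fail ** k for k in range(max_tx))
    delay = mean_attempts * (slot_ms + backoff_w / 2.0)
    return p_succ, delay

def best_config(delay_target_ms):
    """Exhaustively search configurations under the delay constraint."""
    best = None
    for max_tx in range(1, 11):
        for backoff_w in (0, 2, 4, 8, 16):
            p, d = access_model(max_tx, backoff_w)
            if d <= delay_target_ms and (best is None or p > best[0]):
                best = (p, d, max_tx, backoff_w)
    return best
```

The constraint is what makes the problem non-trivial: the largest window gives the best per-attempt reliability in this model, yet it is excluded once its retry delay overshoots the target.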
Narrowband Internet of Things (NB-IoT) is a new narrowband radio technology introduced in Third Generation Partnership Project (3GPP) Release 13 towards the 5th generation (5G) evolution, providing low-power wide-area Internet of Things (IoT). In NB-IoT systems, repeating the transmission of data or control signals has been considered a promising approach for enhancing coverage. Taking into account this new repetition feature, link adaptation for NB-IoT systems needs to be performed in two dimensions, i.e., the modulation and coding scheme (MCS) and the repetition number; existing link adaptation schemes that do not consider the repetition number are therefore no longer applicable. In this paper, a novel uplink link adaptation scheme with repetition number determination is proposed, composed of inner-loop and outer-loop link adaptation, to guarantee transmission reliability and improve the throughput of NB-IoT systems. In particular, the inner loop is designed to cope with Block Error Ratio (BLER) variation by periodically adjusting the repetition number, while the outer loop coordinates the MCS level selection and the repetition number determination. In addition, key uplink scheduling technologies such as power control and transmission gaps are analyzed, and a simple single-tone scheduling scheme is proposed. Link-level simulations are performed to validate the performance of the proposed uplink link adaptation scheme. The results show that it outperforms the repetition-dominated and straightforward methods, particularly for good channel conditions and larger packet sizes. Specifically, it saves more than 14% of the active time and resource consumption compared with the repetition-dominated method, and more than 46% compared with the straightforward method.
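The inner-loop mechanism can be sketched schematically: the loop nudges the repetition number up when measured BLER overshoots a target and down when link quality is ample. The logistic BLER curve, the 3 dB-per-doubling combining gain, and all thresholds below are illustrative assumptions, not constants from the paper:

```python
import math

REPETITIONS = [1, 2, 4, 8, 16, 32]  # allowed repetition numbers (illustrative)

def bler(mcs, rep, snr_db):
    # Toy model: each doubling of repetitions buys ~3 dB of combining gain,
    # each MCS step costs ~2 dB; a logistic maps effective SNR to BLER.
    eff_snr = snr_db + 3.0 * math.log2(rep) - 2.0 * mcs
    return 1.0 / (1.0 + math.exp(eff_snr))

def inner_loop(rep_idx, measured_bler, target=0.1):
    # Raise the repetition number when BLER overshoots the target; lower it
    # (with a target/2 hysteresis band) when the link has margin to spare.
    if measured_bler > target and rep_idx < len(REPETITIONS) - 1:
        return rep_idx + 1
    if measured_bler < target / 2 and rep_idx > 0:
        return rep_idx - 1
    return rep_idx

def adapt(snr_db, mcs=4, rep_idx=0, steps=20):
    # Drive the inner loop with "measured" BLER drawn from the toy model.
    for _ in range(steps):
        rep_idx = inner_loop(rep_idx, bler(mcs, REPETITIONS[rep_idx], snr_db))
    return REPETITIONS[rep_idx]
```

In a full two-loop design, an outer loop would additionally re-select the MCS for the repetition level the inner loop settles on; here only the inner loop is sketched. At low SNR the loop climbs to a high repetition number, while at high SNR a single transmission already meets the BLER target.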
This letter proposes an efficient small data transmission scheme for the NarrowBand Internet of Things (NB-IoT) system. For efficient use of radio resources, the proposed scheme enables devices in the idle state to transmit a small data packet without the radio resource control connection setup process. This can increase the maximum number of supportable devices in the NB-IoT system, which has insufficient radio resources. Numerical results show that the proposed scheme can increase the maximum number of supportable devices by about 60% compared with the conventional scheme.