
Abstract

NarrowBand-Internet of Things (NB-IoT) is an emerging cellular-based technology that offers a range of flexible configurations for massive IoT radio access from groups of devices with heterogeneous requirements. A configuration specifies the amount of radio resource allocated to each group of devices for random access and for data transmission. Assuming no knowledge of the traffic statistics, there exists an important challenge in "how to determine the configuration that maximizes the long-term average number of served IoT devices at each Transmission Time Interval (TTI) in an online fashion". Given the complexity of searching for the optimal configuration, we first develop real-time configuration selection based on tabular Q-learning (tabular-Q), Linear Approximation based Q-learning (LA-Q), and Deep Neural Network based Q-learning (DQN) in the single-parameter single-group scenario. Our results show that the proposed reinforcement learning based approaches considerably outperform the conventional heuristic approaches based on load estimation (LE-URC) in terms of the number of served IoT devices. This result also indicates that LA-Q and DQN can be good alternatives to tabular-Q, achieving almost the same performance with much less training time. We further advance LA-Q and DQN via Action Aggregation (AA-LA-Q and AA-DQN) and via Cooperative Multi-Agent learning (CMA-DQN) for the multi-parameter multi-group scenario, thereby solving the problem that Q-learning agents do not converge in high-dimensional configurations. In this scenario, the superiority of the proposed Q-learning approaches over the conventional LE-URC approach improves significantly as the configuration dimension increases, and the CMA-DQN approach outperforms the other approaches in both throughput and training efficiency.
Deep Reinforcement Learning for Real-Time
Optimization in NB-IoT Networks
Nan Jiang, Student Member, IEEE, Yansha Deng, Member, IEEE, Arumugam Nallanathan,
Fellow, IEEE, and Jonathon A. Chambers, Fellow, IEEE
To effectively support the emerging massive Internet of Things (mIoT) ecosystem, the 3rd Generation
Partnership Project (3GPP) partners have standardized a new radio access technology, namely NarrowBand-
IoT (NB-IoT) [1]. NB-IoT is expected to provide reliable wireless access for IoT devices with various
N. Jiang and A. Nallanathan are with the School of Electronic Engineering and Computer Science, Queen Mary University of London,
London E1 4NS, UK (e-mail: {nan.jiang, a.nallanathan}
Y. Deng is with the Department of Informatics, King’s College London, London WC2R 2LS, UK (e-mail:
(Corresponding author: Yansha Deng).
J. A. Chambers is with the Department of Engineering, University of Leicester, Leicester LE1 7RH, UK (e-mail:
arXiv:1812.09026v1 [cs.NI] 21 Dec 2018
types of data traffic, and to meet the requirement of extended coverage. As most mIoT applications favor delay-tolerant traffic with small data sizes, such as reports from alarms, meters, and monitors, the key target of NB-IoT design is to deal with the sporadic uplink transmissions of massive numbers of IoT devices [2].
NB-IoT is built on the legacy Long-Term Evolution (LTE) design, but is deployed in a narrow bandwidth (180 kHz) for Coverage Enhancement (CE) [3]. Different from legacy LTE, NB-IoT defines only two uplink physical channel resources to perform all uplink transmissions: the Random Access CHannel (RACH) resource (i.e., the NarrowBand Physical Random Access CHannel, a.k.a. NPRACH) for RACH preamble transmission, and the data resource (i.e., the NarrowBand Physical Uplink Shared CHannel, a.k.a. NPUSCH) for control information and data transmission. To support various traffic with different coverage requirements, NB-IoT allows up to three CE groups of IoT devices to share the uplink resource in the same band. Each group serves IoT devices with a different coverage requirement, distinguished based on the same broadcast signal from the evolved Node B (eNB) [3]. At the beginning of each uplink Transmission Time Interval (TTI), the eNB selects a system configuration that specifies the radio resource allocated to each group to accommodate the RACH procedure, with the remaining resource used for data transmission. The key challenge is to optimally balance the allocation of channel resources between the RACH procedure and data transmission so as to maximize the number of successful accesses and transmissions in massive IoT networks: allocating too many resources for RACH enhances the random access performance, while leaving insufficient resources for data transmission.
Unfortunately, dynamic RACH and data transmission resource configuration optimization is an untreated problem in NB-IoT. Generally speaking, the eNB observes the transmission receptions of both RACH (e.g., the numbers of successfully received preambles and of collisions) and data transmission (e.g., the numbers of scheduled and unscheduled devices) for all groups at the end of each TTI. This historical information can potentially be used to predict traffic from all groups and to facilitate the optimization of future TTIs'
configurations. Even if one knew all the relevant statistics, tackling this problem in an exact manner would
result in a Partially Observable Markov Decision Process (POMDP) with large state and action spaces,
which would be generally intractable. The complexity of the problem is compounded by the lack of prior knowledge at the eNB regarding the stochastic traffic and unobservable channel statistics (i.e., random collisions, and physical radio effects including path-loss and fading). The related works are briefly introduced in the following two subsections.
1) Related works on real-time optimization in cellular-based networks: In light of this POMDP challenge,
prior works [4, 5] studied real-time resource configuration of RACH procedure and/or data transmission by
proposing dynamic Access Class Barring (ACB) schemes to optimize the number of served IoT devices.
These optimization problems have been tackled under the simplified assumptions that at most two configurations are allowed and that the optimization is executed for a single group without considering errors due to wireless transmission. In order to consider more complex and practical formulations, Reinforcement
Learning (RL) emerges as a natural solution given its capability in interacting with the practical environment
and feedback in the form of the number of successful and unsuccessful transmissions per TTI. Distributed
RL based on tabular Q-learning (tabular-Q) has been proposed in [6–9]. In [6–8], the authors studied
distributed tabular-Q in slotted-Aloha networks, where each device learns how to avoid collisions by finding
a proper time slot to transmit packets. In [9], the authors implemented tabular-Q agents at the relay nodes
for cooperatively selecting their transmit power and transmission probability to optimize the total number of useful received packets per unit of consumed energy. Centralized RL has also been studied in [10–12], where the
RL agent was implemented at the base station site. In [10], a learning-based scheme was proposed for radio
resource management in multimedia wide-band code-division multiple access systems to improve spectrum
utilization. In [11, 12], the authors studied the tabular-Q based ACB schemes in cellular networks, where a
Q-agent was implemented at an eNB aiming at selecting the optimal ACB factor to maximize the access
success probability of RACH procedure.
2) Related works on optimization in NB-IoT: In NB-IoT networks, most existing studies either focused
on the resource allocation during RACH procedure [13, 14], or that during the data transmission [15, 16]. For
RACH procedure, the access success probability was statistically optimized in [13] using exhaustive search,
and the authors in [14] studied the fixed-size data resource scheduling for various resource requirements.
For the data transmission, [15] presented an uplink data transmission time slot and power allocation scheme
to optimize the overall channel gain, and [16] proposed a link adaptation scheme, which dynamically
selects modulation and coding level, and the repetition value according to the acknowledgment/negative-
acknowledgment feedback to reduce the uplink data transmission block error ratio. More importantly, these works ignored the time-varying heterogeneous traffic of massive IoT devices, and considered only a snapshot [13, 15, 16] or the steady-state behavior [14] of NB-IoT networks. The work most relevant to ours is [17], where the authors studied the steady-state behavior of NB-IoT networks from the perspective of a single device. Optimizing
some of the parameters of the NB-IoT configuration, namely the repetition value (to be defined below) and
time intervals between two consecutive scheduling of NPRACH and NPDCCH, was carried out in terms of
latency and power consumption in [17] using a queuing framework.
Unfortunately, the tabular-Q framework in [11, 12] cannot be used to solve the multi-parameter multi-group optimization problem in the uplink resource configuration of NB-IoT networks, due to its inability to handle high-dimensional state spaces and parameter selection. More importantly, whether the proposed RL-based resource configuration approaches [11, 12] outperform the conventional resource configuration approaches [4, 5] is still unknown. In this paper, we develop RL-based uplink resource configuration approaches to dynamically optimize the number of served IoT devices in NB-IoT networks. To showcase their
efficiency, we compare the proposed RL-based approaches with the conventional heuristic uplink resource
allocation approaches. The contributions can be summarized as follows:
We develop an RL-based framework to optimize the number of served IoT devices by adaptively
configuring uplink resource in NB-IoT networks. The uplink communication procedure in NB-IoT is
simulated by taking into account the heterogeneous IoT traffic, the CE group selection, the RACH
procedure, and the uplink data transmission resource scheduling. This generated simulation environment
is used for training the RL-based agents before deployment, and these agents will be updated according
to the real traffic in practical NB-IoT networks in an online manner.
We first study a simplified NB-IoT scenario with a single parameter and a single CE group, where a basic tabular-Q is developed and compared with the revised conventional Load Estimation based Uplink Resource Configuration (LE-URC) scheme. The tabular-Q is further advanced by implementing
function approximators with different computational complexities, namely, Linear Approximator (LA-Q)
and Deep Neural Networks (Deep Q-Network, a.k.a. DQN) to elaborate their capability and efficiency
in dealing with high-dimensional state space.
We then study a more practical NB-IoT scenario with multiple parameters and multiple CE groups,
where direct implementation of the LA-Q or DQN is not feasible due to the increasing size of the
parameter combinations. To solve it, we propose Action Aggregation approaches based on LA-Q and
DQN, namely, AA-LA-Q and AA-DQN, which guarantee convergence capability by sacrificing certain
accuracy in the parameter selection. Finally, a Cooperative Multi-Agent learning based on DQN (CMA-DQN) approach is developed to break down the selection of high-dimensional parameters into multiple parallel sub-tasks, in which a number of DQN agents are cooperatively trained to produce each parameter for each CE group.
In the simplified scenario, our results show that the number of served IoT devices with tabular-Q con-
siderably outperforms that with LE-URC, while LA-Q and DQN achieve almost the same performance
as that of tabular-Q using much less training time. In the practical scenario, the superiority of Q-learning
based approaches over LE-URC significantly improves. In particular, CMA-DQN outperforms all other
approaches in terms of both throughput and training efficiency, which is mainly due to the use of
DQN enabling operation over a large state space and the use of multiple agents dealing with the large
dimensionality of parameters selection.
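The tabular-Q baseline that anchors these contributions can be illustrated with a minimal sketch. The toy integer state/action encoding below is hypothetical: in the paper, the state is the observed transmission-reception history and the action is an uplink resource configuration.

```python
import random

# Minimal tabular Q-learning sketch (hypothetical toy encoding: states and
# actions are small integers, unlike the paper's configuration vectors).
class TabularQ:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, eps=0.1):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, s):
        # epsilon-greedy configuration selection
        if random.random() < self.eps:
            return random.randrange(len(self.q[s]))
        row = self.q[s]
        return row.index(max(row))

    def update(self, s, a, r, s_next):
        # standard Q-learning target: r + gamma * max_a' Q(s', a')
        target = r + self.gamma * max(self.q[s_next])
        self.q[s][a] += self.alpha * (target - self.q[s][a])
```

LA-Q and DQN replace the table `q` with a linear function or a neural network, which is what makes the larger state spaces of the multi-group scenario tractable.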
The rest of the paper is organized as follows. Section II provides the problem formulation and system
model. Section III presents the preliminaries and the conventional LE-URC. Section IV proposes Q-learning based uplink resource configuration approaches in the single-parameter single-group scenario. Section V
presents the advanced Q-learning based approaches in the multi-parameter multi-group scenario. Section VI
elaborates the numerical results, and finally, Section VII concludes the paper.
As illustrated in Fig. 1(a), we consider a single-cell NB-IoT network composed of an eNB located at the center of the cell and a set of IoT devices randomly located in an area of the plane $\mathbb{R}^2$, which remain spatially static once deployed. The devices are divided into three CE groups, as further discussed below, and the eNB is unaware of the status of these IoT devices; hence, no uplink channel resource is scheduled to them in advance. In each IoT device, uplink data is generated according to random inter-arrival processes over the TTIs, which are Markovian and possibly time-varying.
Fig. 1: (a) Illustration of the system model; (b) uplink channel frame structure, whose configuration in each TTI is described, for each CE group, by the number of RACH periods, the repetition value, and the number of preambles.
A. Problem Formulation
With packets waiting for service, an IoT device executes the contention-based RACH procedure in order to establish a Radio Resource Control (RRC) connection with the eNB. The contention-based RACH procedure consists of four steps: an IoT device transmits a randomly selected preamble a given number of times, according to the repetition value $n^{t}_{\rm Repe,i}$ [1], to initiate the RACH procedure in step 1, and exchanges control information with the eNB in the next three steps [18]. The RACH procedure can fail if: (i) a collision occurs when two or more IoT devices select the same preamble; or (ii) there is no collision, but the eNB cannot detect the preamble due to low SNR. Note that a collision can still be detected in step 3 of RACH when the collided preambles are not detected in step 1 of RACH, following the 3GPP report [19]. This assumption is different from our previous works [20, 21], which only focus on the preamble detection analysis in step 1 of RACH.
As shown in Fig. 1(b), for each TTI $t$ and for each CE group $i = 0, 1, 2$, in order to reduce the chance of a collision, the eNB can increase the number $n^{t}_{\rm Rach,i}$ of RACH periods in the TTI or the number $f^{t}_{\rm Prea,i}$ of preambles available in each RACH period [22]. Furthermore, in order to mitigate the SNR outage, the eNB can increase the number $n^{t}_{\rm Repe,i}$ of times that a preamble transmission is repeated by a device in group $i$ in one RACH period [22] of the TTI.
After the RRC connection is established, the IoT device requests uplink channel resource from the eNB for control information and data transmission. As shown in Fig. 1(b), given a total number of resources $R_{\rm Uplink}$ for uplink transmission in the TTI, the number of available resources for data transmission is written as $R^{t}_{\rm DATA} = R_{\rm Uplink} - R^{t}_{\rm RACH}$, where $R^{t}_{\rm RACH}$ is the overall number of Resource Elements (REs)$^1$ allocated for the RACH procedure. This can be computed as $R^{t}_{\rm RACH} = B_{\rm RACH} \sum_{i=0}^{2} n^{t}_{\rm Rach,i}\, n^{t}_{\rm Repe,i}\, f^{t}_{\rm Prea,i}$, where $B_{\rm RACH}$ measures the number of REs required for one preamble transmission.
In this work, we tackle the problem of optimizing the RACH configuration defined by the parameters $A^{t} = \{n^{t}_{\rm Rach,i}, f^{t}_{\rm Prea,i}, n^{t}_{\rm Repe,i}\}_{i=0}^{2}$ for each $i$th group in an online manner for every TTI $t$. In order to make this decision at the beginning of every TTI $t$, the eNB accesses all prior history $U^{t'}$ in TTIs $t' = 1, \ldots, t-1$, consisting of the following variables: the number of collided preambles $V^{t'}_{\rm cp,i}$, the number of successfully received preambles $V^{t'}_{\rm sp,i}$, and the number of idle preambles $V^{t'}_{\rm ip,i}$ of the $i$th CE group in the $t'$th TTI for the RACH, as well as the number of IoT devices that have successfully sent data $V^{t'}_{\rm su,i}$ and the number of IoT devices that are waiting to be allocated data resources $V^{t'}_{\rm un,i}$. We denote $O^{t} = \{A^{t-1}, U^{t-1}, A^{t-2}, U^{t-2}, \cdots, A^{1}, U^{1}\}$ as the observed history of all such measurements and past actions.
The eNB aims at maximizing the long-term average number of devices that successfully transmit data with respect to the stochastic policy $\pi$ that maps the current observation history $O^{t}$ to the probabilities of selecting each possible configuration $A^{t}$. This problem can be formulated as the optimization
$$({\rm P1}):\ \max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} \sum_{i=0}^{2} V^{t}_{{\rm su},i}\Big],\qquad (1)$$
where $\gamma \in [0, 1)$ is the discount rate for the performance in future TTIs and the index $i$ runs over the CE groups. Since the dynamics of the system are Markovian over the TTIs and are defined by the NB-IoT protocol to be further discussed below, this is a POMDP problem that is generally intractable. Approximate solutions will be discussed in Sections III, IV, and V.
$^1$The uplink channel consists of 48 sub-carriers within a 180 kHz bandwidth. With a 3.75 kHz tone spacing, one RE is composed of one time slot of 2 ms and one sub-carrier of 3.75 kHz [1]. Note that NB-IoT also supports 12 sub-carriers with a 15 kHz tone spacing for NPUSCH, but NPRACH only supports the 3.75 kHz tone spacing [1].
B. NB-IoT Access Network
We now provide additional details on the model and on the NB-IoT protocol. To capture the effects of the physical radio, we consider the standard power-law path-loss model in which the path-loss attenuation is $u^{-\eta}$, with propagation distance $u$ and path-loss exponent $\eta$. The system operates in a Rayleigh flat-fading environment, where the channel power gains $h$ are i.i.d. exponentially distributed random variables with unit mean. Fig. 2 presents the uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks, which consists of the four stages explained in the following four subsections.
Fig. 2: Uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks, consisting of four stages: (A) traffic inter-arrival, (B) CE group determination, (C) RACH procedure, and (D) data resource scheduling. The procedure tracks three counters: the CE counter $c_{\rm pCE}$ (a device steps up to a higher CE group and resets $c_{\rm pCE}$ when it exceeds the maximum allowed RACH attempts $\gamma_{\rm pCE,i}$ in the $i$th CE group), the RACH counter $c_{\rm pMax}$ (serving fails and the packet is dropped when it exceeds the maximum allowed RACH attempts $\gamma_{\rm pMax}$ over all CE groups), and the RRC counter $c_{\rm RRC}$ (bounded by the maximum allowed channel resource requests $\gamma_{\rm RRC}$).
1) Traffic Inter-Arrival: We consider two types of IoT devices with different traffic models, namely periodical traffic and bursty traffic, forming a heterogeneous traffic scenario for diverse IoT applications [23, 24]. The periodical traffic, coming from periodic uplink reporting tasks such as metering or environmental monitoring, is the most common traffic model in NB-IoT networks [25]. The bursty traffic, due to emergency events such as fire alarms and earthquake alarms, captures the complementary scenario in which a massive number of IoT devices tries to establish RRC connections with the eNB [19]. Due to the nature of slotted-Aloha, an IoT device can only transmit a preamble at the beginning of a RACH period, which means that the IoT devices executing RACH in a RACH period are those that received a packet arrival within the interval since the last RACH period. For the periodical traffic, the first packet is generated using a Uniform distribution over $T_{\rm periodic}$ (ms), and is then repeated every $T_{\rm periodic}$ ms. The packet inter-arrival rate measured in each RACH period at each IoT device is hence expressed by
$$\mu^{t}_{\rm period} = \frac{T_{\rm TTI}}{n^{t}_{\rm Rach,i}\, T_{\rm periodic}},\qquad (2)$$
where $n^{t}_{\rm Rach,i}$ is the number of RACH periods in the $t$th TTI, and $T_{\rm TTI}/n^{t}_{\rm Rach,i}$ is the duration between neighboring RACH periods. The bursty traffic is generated within a short period of time $T_{\rm bursty}$ starting from a random time $\tau_0$. The instantaneous traffic rate in packets is described by a function $p(\tau)$, so that the packet arrival rate in the $j$th RACH period of the $t$th TTI is given by
$$\mu^{t}_{\rm bursty} = \int_{\tau_{j-1}}^{\tau_{j}} p(\tau)\, d\tau,\qquad (3)$$
where $\tau_j$ is the starting time of the $j$th RACH period in the $t$th TTI, $\tau_j - \tau_{j-1} = T_{\rm TTI}/n^{t}_{\rm Rach,i}$, and the distribution $p(\tau)$ follows the time-limited Beta profile given as [19, Section 6.1.1]
$$p(\tau) = \frac{\tau^{\alpha-1}\,(T_{\rm bursty}-\tau)^{\beta-1}}{T_{\rm bursty}^{\alpha+\beta-2}\,{\rm Beta}(\alpha, \beta)}.\qquad (4)$$
In (4), ${\rm Beta}(\alpha, \beta)$ is the Beta function with constant parameters $\alpha$ and $\beta$ [26].
2) CE Group Determination: Once an IoT device is backlogged, it first determines its associated CE group by comparing the received power of the broadcast signal $P_{\rm RSRP}$ to the Reference Signal Received Power (RSRP) thresholds $\{\gamma_{\rm RSRP1}, \gamma_{\rm RSRP2}\}$ according to the rule [27]
$$\begin{cases} \text{CE group 0}, & \text{if } P_{\rm RSRP} > \gamma_{\rm RSRP1},\\ \text{CE group 1}, & \text{if } \gamma_{\rm RSRP1} \geq P_{\rm RSRP} \geq \gamma_{\rm RSRP2},\\ \text{CE group 2}, & \text{if } P_{\rm RSRP} < \gamma_{\rm RSRP2}. \end{cases}\qquad (5)$$
In (5), the received power of the broadcast signal $P_{\rm RSRP}$ is expressed as
$$P_{\rm RSRP} = P_{\rm NPBCH}\, u^{-\eta},\qquad (6)$$
where $u$ is the device's distance from the eNB, and $P_{\rm NPBCH}$ is the broadcast power of the eNB [27]. Note that $P_{\rm RSRP}$ is obtained by averaging out the small-scale Rayleigh fading of the received power [27].
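A minimal sketch of the CE group selection rule of Eq. (5) together with the averaged RSRP computation; all powers and thresholds here are in linear units, and the function names and threshold values in the test are illustrative assumptions.

```python
def rsrp(p_npbch, u, eta):
    """Averaged broadcast received power: P_NPBCH * u^(-eta), fading
    averaged out (unit-mean Rayleigh power gain)."""
    return p_npbch * u ** (-eta)

def ce_group(p_rsrp, gamma1, gamma2):
    """CE group selection rule of Eq. (5); requires gamma1 > gamma2."""
    if p_rsrp > gamma1:
        return 0      # best coverage: path-loss inversion power control
    if p_rsrp >= gamma2:
        return 1
    return 2          # deepest coverage: maximum transmit power
```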
3) RACH Procedure: After CE group determination, each backlogged IoT device in group $i$ repeats a randomly selected preamble $n^{t}_{\rm Repe,i}$ times in the next RACH period by using a pseudo-random frequency hopping schedule. The pseudo-random hopping rule is based on the current repetition time as well as the Narrowband Physical Cell ID, and in one repetition, a preamble consists of four symbol groups, which are transmitted with fixed-size frequency hopping [1, 20, 28]. Therefore, a preamble is successfully detected if at least one preamble repetition succeeds, which in turn happens if all of its four symbol groups are correctly decoded [20]. Assuming that correct detection is determined by the SNR level ${\rm SNR}^{t}_{{\rm sg},j,k}$ for the $j$th repetition and the $k$th symbol group, the correct detection event $S_{\rm pd}$ can be expressed as
$$S_{\rm pd} = \bigcup_{j=1}^{n^{t}_{\rm Repe,i}} \bigcap_{k=1}^{4} \big\{{\rm SNR}^{t}_{{\rm sg},j,k} \geq \gamma_{\rm th}\big\},\qquad (7)$$
where $k$ is the index of the symbol group in the $j$th repetition, $n^{t}_{\rm Repe,i}$ is the repetition value of the $i$th CE group in the $t$th TTI, ${\rm SNR}^{t}_{{\rm sg},j,k} \geq \gamma_{\rm th}$ means that the preamble symbol group is successfully decoded when its received SNR ${\rm SNR}^{t}_{{\rm sg},j,k}$ is above a threshold $\gamma_{\rm th}$, and ${\rm SNR}^{t}_{{\rm sg},j,k}$ is expressed as
$${\rm SNR}^{t}_{{\rm sg},j,k} = P_{{\rm RACH},i}\, u^{-\eta} h / \sigma^{2}.\qquad (8)$$
In (8), $u$ is the Euclidean distance between the IoT device and the eNB, $\eta$ is the path-loss exponent, $h$ is the Rayleigh fading channel power gain from the IoT device to the eNB, $\sigma^{2}$ is the noise power, and $P_{{\rm RACH},i}$ is the preamble transmit power in the $i$th CE group, defined as
$$P_{{\rm RACH},i} = \begin{cases} \min\{\rho\, u^{\eta},\ P_{\rm RACHmax}\}, & i = 0,\\ P_{\rm RACHmax}, & i = 1 \text{ or } 2, \end{cases}\qquad (9)$$
where $i$ is the index of the CE group. IoT devices in CE group 0 ($i = 0$) transmit the preamble using full path-loss inversion power control [27], which maintains the received signal power at the eNB from IoT devices at different distances at the same threshold $\rho$, and $P_{\rm RACHmax}$ is the maximal transmit power of an IoT device. The IoT devices in CE group 1 and group 2 always transmit the preamble using the maximum transmit power [27].
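The detection model of Eqs. (7)-(8) can be checked with a small Monte-Carlo sketch. Here `snr_mean` stands for the average received SNR $P_{{\rm RACH},i} u^{-\eta}/\sigma^2$, the unit-mean Rayleigh power fading is drawn as an exponential variate, and the trial count is an arbitrary assumption.

```python
import random

def preamble_detected(n_repe, snr_mean, gamma_th):
    """One preamble transmission: detected if, in at least one of the
    n_repe repetitions, all four symbol groups exceed gamma_th (Eq. (7)).
    Each symbol group sees an i.i.d. unit-mean exponential power gain."""
    for _ in range(n_repe):
        if all(snr_mean * random.expovariate(1.0) >= gamma_th
               for _ in range(4)):
            return True
    return False

def detection_prob(n_repe, snr_mean, gamma_th, trials=20000):
    """Monte-Carlo estimate of the preamble detection probability."""
    hits = sum(preamble_detected(n_repe, snr_mean, gamma_th)
               for _ in range(trials))
    return hits / trials
```

Running this with increasing `n_repe` reproduces the motivation for the repetition value: more repetitions raise the detection probability at the cost of more RACH resource.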
As shown in the RACH procedure of Fig. 2, if a RACH attempt fails, the IoT device reattempts the procedure until receiving a positive acknowledgement that the RRC connection is established, or until exceeding $\gamma_{\rm pCE,i}$ RACH attempts while being part of one CE group. If the attempts exceed $\gamma_{\rm pCE,i}$, the device switches to a higher CE group if possible [29]. Moreover, the IoT device is allowed to attempt the RACH procedure no more than $\gamma_{\rm pMax}$ times before dropping its packets. These two features are counted by $c_{\rm pCE}$ and $c_{\rm pMax}$, respectively.
4) Data Resource Scheduling: After the RACH procedure succeeds, the RRC connection is successfully established, and the eNB schedules resources from the data channel resource $R^{t}_{\rm DATA}$ to the associated IoT device for control information and data transmission, as shown in Fig. 1(b). To allocate data resources among these devices, we adopt a basic random scheduling strategy, whereby an ordered list of all devices that have successfully completed the RACH procedure but have not received a data channel is compiled in a random order. In each TTI, devices in the list are considered in order for access to the data channel until the data resource is insufficient to serve the next device in the list. The remaining RRC connections between the unscheduled IoT devices and the eNB are preserved for at most $\gamma_{\rm RRC}$ subsequent TTIs, counted by $c_{\rm RRC}$, and attempts will be made to schedule the device's data during these TTIs [29, 30]. The condition
that the data resource is sufficient in TTI $t$ is expressed as
$$\sum_{i=0}^{2} V^{t}_{{\rm sch},i}\, r^{t}_{{\rm DATA},i} \leq R^{t}_{\rm DATA},\qquad (10)$$
where $\sum_{i=0}^{2} V^{t}_{{\rm sch},i} \leq \sum_{i=0}^{2} \big(V^{t}_{{\rm sp},i} + V^{t-1}_{{\rm un},i}\big)$ is the number of scheduled devices, limited by the upper bound given by the IoT devices with successful RACH $V^{t}_{{\rm sp},i}$ in the current TTI $t$ as well as the unscheduled IoT devices $V^{t-1}_{{\rm un},i}$ in the last TTI $(t-1)$; $r^{t}_{{\rm DATA},i} = n^{t}_{\rm Repe,i} B_{\rm DATA}$ is the number of required REs for serving one IoT device within the $i$th CE group, and $B_{\rm DATA}$ is the number of REs per repetition for control signal and data transmission$^2$. Note that $n^{t}_{\rm Repe,i}$ is the repetition value for the $i$th CE group in the $t$th TTI, which is the same as for preamble transmission [1].
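The random scheduling strategy described above can be sketched as follows. The `(device_id, n_repe)` encoding and the parameter names are hypothetical; the per-device cost $n_{\rm Repe} B_{\rm DATA}$ follows the resource condition of Eq. (10).

```python
import random

def schedule(connected, r_uplink, r_rach, b_data):
    """Random-order scheduling sketch (cf. Eq. (10)): serve devices from a
    randomly compiled list until the next one no longer fits the budget.
    `connected` holds (device_id, n_repe) pairs; serving a device costs
    n_repe * b_data REs."""
    budget = r_uplink - r_rach           # R_DATA = R_Uplink - R_RACH
    order = connected[:]
    random.shuffle(order)                # compile the list in random order
    served, unscheduled = [], []
    for idx, (dev, n_repe) in enumerate(order):
        cost = n_repe * b_data           # REs needed for this device
        if cost > budget:                # next device does not fit: stop
            unscheduled = [d for d, _ in order[idx:]]
            break
        budget -= cost
        served.append(dev)
    return served, unscheduled
```

Devices returned in `unscheduled` would keep their RRC connection for up to $\gamma_{\rm RRC}$ further TTIs, as described above.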
A. Preliminary
Maximizing the long-term number of served IoT devices as given in Eq. (1) is challenging and cannot easily be solved via conventional uplink resource approaches. Therefore, most prior works simplified the objective to dynamically optimizing a single parameter to achieve the maximum number of served IoT devices in a single group, without consideration of future performance [4, 5], which is expressed as
$$({\rm P2}):\ \max_{x}\ \mathbb{E}\big[V^{t}_{\rm su,0}\big],\qquad (11)$$
where $x$ is the optimized single parameter.
To maximize the number of served IoT devices in the $t$th TTI, the configuration $x$ is expected to be dynamically adjusted according to the actual number of IoT devices that will execute RACH attempts, $D^{t}_{\rm RACH}$, which represents the current load of the network. Note that, in practice, this load information cannot be observed at the eNB. Thus, it is necessary to estimate the load based on the previous transmission receptions from the 1st to the $(t-1)$th TTI, $O^{t}$, before the uplink resource configuration in the $t$th TTI.
In [5], the authors designed a dynamic ACB scheme to optimize the problem given in Eq. (1) via adjusting the ACB factor. The ACB factor is adapted based on knowledge of the traffic load, which is estimated via moment matching. The estimated number of RACH attempting IoT devices in the $t$th TTI, $\hat{D}^{t}_{\rm RACH}$, is expressed as
$$\hat{D}^{t}_{\rm RACH} = \max\big\{0,\ \hat{D}^{t-1}_{\rm RACH} + \hat{\delta}^{t}\big\},\qquad (12)$$
where $\hat{D}^{t-1}_{\rm RACH}$ is the estimated number of devices performing RACH attempts in the $(t-1)$th TTI, given as
$$\hat{D}^{t-1}_{\rm RACH} = f^{t-1}_{\rm Prea,0}\Big/\Big[\min\big\{1,\ p^{t-1}_{\rm ACB}\big\}\big(1 + (V^{t-1}_{\rm cp,0}/f^{t-1}_{\rm Prea,0})\, u_{M,p}\big)\Big].\qquad (13)$$
In Eq. (13), $p^{t-1}_{\rm ACB}$, $f^{t-1}_{\rm Prea,0}$, and $V^{t-1}_{\rm cp,0}$ are the ACB factor, the number of preambles, and the observed number of collided preambles in the $(t-1)$th TTI, and $u_{M,p}$ is an estimation factor given in [5, Eq. (32)].
In Eq. (12), $\hat{\delta}^{t}$ is the difference between the estimated numbers of RACH requesting IoT devices in the $(t-1)$th and the $t$th TTIs, which is obtained by assuming that the number of successful RACH IoT devices does not change significantly in these two TTIs [5].
$^2$The basic scheduling unit of NPUSCH is the resource unit (RU), which has two formats: NPUSCH format 1 (NPUSCH-1) with 16 REs for data transmission, and NPUSCH format 2 (NPUSCH-2) with 4 REs for carrying control information [3, 22].
This dynamic control approach is designed for an ACB scheme, which is only triggered when the exact traffic load is larger than the number of preambles (i.e., $D^{t}_{\rm RACH} > f^{t}_{\rm Prea,0}$). Accordingly, the related backlog estimation approach is only used when $D^{t}_{\rm RACH} > f^{t}_{\rm Prea,0}$. However, it cannot estimate the load when $D^{t}_{\rm RACH} \leq f^{t}_{\rm Prea,0}$, which is required in our problem.
B. Resource Configuration in the Single-Parameter Single-CE-Group Scenario
In this subsection, we modify the load estimation approach given in [5] by estimating the load based on the last number of collided preambles $V^{t-1}_{\rm cp,0}$ and the previous numbers of idle preambles $V^{t-1}_{\rm ip,0}, V^{t-2}_{\rm ip,0}, \cdots$. We then propose an uplink resource configuration approach based on this revised load estimation, namely, LE-URC.
1) Load Estimation: By definition, $\mathcal{F}_{\rm Prea}$ is the set of valid numbers of preambles that the eNB can choose, where each IoT device selects a RACH preamble from the $f^{t}_{\rm Prea,0}$ available preambles with equal probability $1/f^{t}_{\rm Prea,0}$. For a given preamble $j$ transmitted to the eNB, let $d_j$ denote the number of IoT devices that select preamble $j$. The probability that no IoT device selects preamble $j$ is
$$\mathbb{P}\{d_j = 0 \mid D^{t-1}_{\rm RACH,0} = n\} = \big(1 - 1/f^{t-1}_{\rm Prea,0}\big)^{n}.\qquad (14)$$
The expected number of idle preambles $\mathbb{E}\{V^{t-1}_{\rm ip,0} \mid D^{t-1}_{\rm RACH,0} = n\}$ in the $(t-1)$th TTI is given by
$$\mathbb{E}\{V^{t-1}_{\rm ip,0} \mid D^{t-1}_{\rm RACH,0} = n\} = f^{t-1}_{\rm Prea,0}\big(1 - 1/f^{t-1}_{\rm Prea,0}\big)^{n}.\qquad (15)$$
Since the actual number of idle preambles $V^{t-1}_{\rm ip,0}$ can be observed at the eNB, the number of RACH attempting IoT devices in the $(t-1)$th TTI, $\zeta^{t-1}$, can be estimated by inverting Eq. (15) as
$$\zeta^{t-1} = \log_{\big(1 - 1/f^{t-1}_{\rm Prea,0}\big)}\Big(\frac{V^{t-1}_{\rm ip,0}}{f^{t-1}_{\rm Prea,0}}\Big).\qquad (16)$$
To obtain the estimated number of RACH attempting IoT devices in the $t$th TTI, $\tilde{D}^{t}_{\rm RACH,0}$, we also need to know the difference between the estimated numbers of RACH attempting IoT devices in the $(t-1)$th and the $t$th TTIs, denoted by $\delta^{t}$, where $\delta^{t} = \tilde{D}^{t}_{\rm RACH,0} - \tilde{D}^{t-1}_{\rm RACH,0}$ for $t = 1, 2, \cdots$, and $\tilde{D}^{0}_{\rm RACH,0} = 0$. However, $\tilde{D}^{t}_{\rm RACH,0}$ cannot be obtained before the $t$th TTI. To solve this, we can assume $\delta^{t} \approx \delta^{t-1}$ according to [5]. This is because the time between two consecutive TTIs is small, and the available preambles are gradually updated, so that the number of successful RACH IoT devices does not change significantly over these two TTIs [5]. Therefore, the number of RACH attempting IoT devices in the $t$th TTI is estimated as
$$\tilde{D}^{t}_{\rm RACH,0} = \max\big\{2 V^{t-1}_{\rm cp,0},\ \zeta^{t-1} + \delta^{t-1}\big\},\qquad (17)$$
where $2 V^{t-1}_{\rm cp,0}$ reflects that at least $2 V^{t-1}_{\rm cp,0}$ IoT devices collided in the last TTI.
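The load estimator of Eqs. (16)-(17) amounts to a few lines. In the sketch below, the guard against a zero idle count is our own assumption, since the logarithm in Eq. (16) is undefined when no preamble is idle.

```python
import math

def estimate_load(v_ip_prev, v_cp_prev, f_prea_prev, delta_prev):
    """Idle-preamble load estimate, Eqs. (16)-(17): invert
    E{V_ip} = f * (1 - 1/f)^n to get zeta, then take the max with the
    collision lower bound 2 * V_cp."""
    ratio = max(v_ip_prev, 1) / f_prea_prev   # guard: avoid log(0)
    zeta = math.log(ratio) / math.log(1.0 - 1.0 / f_prea_prev)
    return max(2 * v_cp_prev, zeta + delta_prev)
```

Feeding the estimator the exact expectation of Eq. (15) recovers the underlying load, which is a convenient consistency check.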
2) Uplink Resource Configuration Based on Load Estimation: In the following, we propose LE-URC by taking into account the resource condition given in Eq. (10). The number of RACH periods n_{Rach,0} and the repetition value n_{Repe,0} are fixed, and only the number of preambles in each RACH period f_{Prea,0} is dynamically configured in each TTI. Using the estimated number of RACH-attempting IoT devices in the tth TTI, D̃^t_{RACH,0}, the probability that exactly one IoT device selects preamble j (i.e., no collision occurs) is expressed as
P{d_j = 1} = (D̃^t_{RACH,0}/f^t_{Prea,0})(1 − 1/f^t_{Prea,0})^{D̃^t_{RACH,0}−1}. (18)
The expected number of IoT devices that succeed in RACH in the tth TTI is then derived as
E{V^t_{sp,0}} = f^t_{Prea,0} P{d_j = 1}. (19)
Based on (19), the expected number of IoT devices requesting uplink resource in the tth TTI is derived as
E{V^t_{req,0}} = E{V^t_{sp,0}} + V^{t−1}_{un,0}, (20)
where V^{t−1}_{un,0} is the number of unscheduled IoT devices in the last TTI. Note that V^{t−1}_{un,0} can be observed. However, if the data resource is not sufficient (i.e., when Eq. (10) is violated), some IoT devices may not be scheduled in the tth TTI. The upper bound on the number of scheduled IoT devices V^t_{bound} is expressed as
V^t_{bound} = ⌊(R_Uplink − R^t_{RACH}) / r^t_{DATA,0}⌋, (21)
where R_Uplink is the total number of REs reserved for uplink transmission in a TTI, R^t_{RACH} is the uplink resource configured for RACH in the tth TTI, and r^t_{DATA,0} is the number of REs required to serve one IoT device, given in Eq. (10).
According to (20) and (21), the expected number of successfully served IoT devices is given by
E{V^t_{suss,0}(f^t_{Prea,0})} = min{E{V^t_{req,0}}, V^t_{bound}}. (22)
The maximal expected number of successfully served IoT devices is obtained by selecting the number of preambles f^t_{Prea,0} as
f^t_{Prea,0} = argmax_{f ∈ F_Prea,0} E{V^t_{suss,0}(f)}. (23)
The LE-URC approach based on the estimated load D̃^t_{RACH,0} is detailed in Algorithm 1. For comparison, we consider an ideal scenario in which the actual number of RACH-requesting IoT devices D^t_{RACH} is available at the eNB, namely Full State Information based URC (FSI-URC). FSI-URC still configures f^t_{Prea,0} using the approach given in Eq. (23), but the load estimation approach given in Section III.B.1) is not required.
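The selection rule in Eq. (23) amounts to a one-dimensional search over F_Prea,0, trading RACH resource against data resource. The sketch below illustrates this under assumed per-preamble RACH cost and per-device data cost parameters; all names are ours, and the resource accounting is simplified relative to Eq. (10).

```python
def select_preambles(d_est, v_unscheduled, r_uplink, r_rach_per_preamble,
                     r_data_per_device, preamble_set=(12, 24, 36, 48)):
    """LE-URC preamble selection sketch (Eqs. (18)-(23)).

    d_est : estimated number of RACH-attempting devices in this TTI
    """
    best_f, best_served = preamble_set[0], -1.0
    for f in preamble_set:
        # Eqs. (18)-(19): expected collision-free (singleton) preambles.
        p_single = (d_est / f) * (1 - 1 / f) ** max(d_est - 1, 0)
        e_success = f * p_single
        # Eq. (20): devices expected to request uplink data resource.
        e_request = e_success + v_unscheduled
        # Eq. (21): data-resource bound after reserving RACH resource.
        r_rach = f * r_rach_per_preamble
        v_bound = max(r_uplink - r_rach, 0) // r_data_per_device
        served = min(e_request, v_bound)  # Eq. (22)
        if served > best_served:          # Eq. (23): argmax over F_Prea
            best_f, best_served = f, served
    return best_f
```

With a light RACH cost the search favors the largest preamble set; as the per-preamble cost grows, the data-resource bound pushes the optimum down, reproducing the trade-off in Eq. (22).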
Algorithm 1: Load Estimation Based Uplink Resource Configuration (LE-URC)
Input: The set of numbers of preambles in each RACH period F_Prea,0, number of IoT devices D, operation iterations I.
1: for Iteration = 1 to I do
2:   Initialize V^0_{ip,0} := 12, V^0_{cp,0} := 0, D̃^0_{RACH,0} := 0, δ^1 := 0, and bursty traffic arrival rate μ^0_bursty = 0;
3:   for t = 1 to T do
4:     Generate μ^t_bursty using Eq. (3);
5:     The eNB observes V^{t−1}_{ip,0} and V^{t−1}_{cp,0}, and calculates ζ^{t−1} using Eq. (16);
6:     Estimate the number of RACH-requesting IoT devices D̃^t_{RACH,0} using Eq. (17);
7:     Select the number of preambles f^t_{Prea,0} using Eq. (23) based on the estimated load D̃^t_{RACH,0};
8:     The eNB broadcasts f^t_{Prea,0}, and backlogged IoT devices attempt communication in the tth TTI;
9:     Update δ^{t+1} := D̃^t_{RACH,0} − D̃^{t−1}_{RACH,0};
10:  end for
11: end for
3) LE-URC for Multiple CE Groups: We slightly revise the single-parameter single-group LE-URC approach (given in Section III.B) to dynamically configure resources for multiple CE groups. Note that the repetition value n_{Repe,i} in the LE-URC approach is still kept constant to preserve the validity of the load estimation in Eq. (17). Recall that the principle of the LE-URC approach is to optimize the expected number of successfully served IoT devices while balancing R^t_{RACH} and R^t_{DATA} under the limited uplink resource R_Uplink = R^t_{RACH} + R^t_{DATA}. In the multiple CE group scenario, the data resource R^t_{DATA} is allocated to IoT devices in any CE group without bias, but R^t_{RACH,i} is specifically allocated to each CE group.
Under this condition, the expected number of successfully served IoT devices V^t_{suss,i} given in Eq. (22) needs to be modified to take multiple variables into account, which makes it non-convex and greatly complicates the optimization problem. To solve it, we use a sub-optimal solution by artificially setting an uplink resource constraint R_Uplink,i for each CE group (R_Uplink = Σ_{i=0}^{2} R_Uplink,i). Each CE group can then independently allocate its resource between R^t_{DATA,i} and R^t_{RACH,i} according to the approach given in Eq. (23).
The RL approaches are well-known in addressing dynamic control problems in complex POMDPs [31]. Nevertheless, they have rarely been studied for resource configuration in slotted-Aloha based wireless communication systems. It is therefore worthwhile to first evaluate the capability of RL in the single-parameter single-group scenario, so that it can be compared with conventional heuristic approaches. In this section, we consider a single CE group with a fixed number of RACH periods n_{Rach,0} and a fixed repetition value n_{Repe,0}, and only the number of preambles f_{Prea,0} is dynamically configured at the beginning of each TTI. In the following, we first study tabular-Q, based on a tabular representation of the value function, which is the simplest form of Q-learning with guaranteed convergence [31] but requires extremely long training time. We then study Q-learning with function approximators to improve training efficiency, where LA-Q and DQN are used to construct an approximation of the desired value function.
A. Q-Learning and Tabular Value Function
Considering a Q-agent deployed at the eNB to optimize the number of successfully served IoT devices in real-time, the Q-agent needs to explore the environment in order to progressively choose actions leading to the optimization goal. We define s ∈ S, a ∈ A, and r ∈ R as any state, action, and reward from their corresponding sets, respectively. At the beginning of the tth TTI (t ∈ {0, 1, 2, ...}), the Q-agent first observes the current state S^t, corresponding to a set of previous observations (O^t = {U^{t−1}, U^{t−2}, ..., U^1}), in order to select a specific action A^t ∈ A(S^t). The action A^t corresponds to the number of preambles in each RACH period f^t_{Prea,0} in the single CE group scenario.
As shown in Fig. 3, we consider a basic state function in the single CE group scenario, where S^t is a set of indices mapping to the currently observed information U^{t−1} = [V^{t−1}_{su,0}, V^{t−1}_{un,0}, V^{t−1}_{cp,0}, V^{t−1}_{sp,0}, V^{t−1}_{ip,0}]. With the knowledge of the state S^t, the Q-agent chooses an action A^t from the set A, which is a set of indices mapped to the set of available numbers of preambles F_Prea. Once an action A^t is performed, the Q-agent receives a scalar reward R^{t+1} and observes a new state S^{t+1}. The reward R^{t+1} indicates to what extent the executed action A^t achieves the optimization goal, which is determined by the newly observed state S^{t+1}.
Fig. 3: The Tabular-Q agent and environment interaction in the POMDP.
As the optimization goal is to maximize the number of successfully served IoT devices, we define the reward R^{t+1} as a function positively proportional to the observed number of successfully served IoT devices V^t_{su} ∈ O^t, i.e.,
R^{t+1} = V^t_{su}/c_su, (24)
where c_su is a constant used to normalize the reward function.
Q-learning is a value-based RL approach [31, 32], where the policy mapping states to actions, π(s) = a, is learned using a state-action value function Q(s, a) to determine an action for the state s. We first use a lookup table to represent the state-action value function Q(s, a) (tabular-Q), which consists of value scalars for all state and action spaces. To obtain an action A^t, we select the highest value scalar from the numerical value vector Q(S^t, a), which maps all possible actions under S^t to the Q-value table Q(s, a). Accordingly, our objective is to find an optimal Q-value table Q*(s, a) with an optimal policy π* that can select actions to dynamically optimize the number of served IoT devices. To do so, we train an initial Q-value table Q(s, a) in the environment using the Q-learning algorithm, where Q(s, a) is updated immediately after each action using the currently observed reward R^{t+1} as
Q(S^t, A^t) ← Q(S^t, A^t) + λ[R^{t+1} + γ max_{a∈A} Q(S^{t+1}, a) − Q(S^t, A^t)], (25)
where λ is a constant step-size learning rate that affects how fast the algorithm adapts to a new environment, γ ∈ [0, 1) is the discount rate that determines how current rewards affect the value function update, and max_{a∈A} Q(S^{t+1}, a) approximates the value of the optimal Q-value table Q*(s, a) via the up-to-date Q-value table Q(s, a) and the newly obtained state S^{t+1}. Note that Q(S^t, A^t) in Eq. (25) is a scalar, which means that we can only update one value scalar in the Q-value table Q(s, a) with one received reward R^{t+1}.
As shown in Fig. 3, we adopt the ε-greedy approach to balance exploitation and exploration in the Actor of the Q-agent, where ε is a positive real number with ε ≤ 1. In each TTI t, the Q-agent randomly generates a probability p^t_ε to compare with ε. With probability ε, the algorithm randomly chooses an action from the feasible actions to improve its estimate of the non-greedy actions' values. With probability 1 − ε, the algorithm exploits the current knowledge of the Q-value table to choose the action that maximizes the expected reward.
In particular, the learning rate λ is suggested to be set to a small number (e.g., λ = 0.01) to guarantee stable convergence of the Q-value table in this NB-IoT communication system. This is because a single reward in a specific TTI can be severely biased, since the state function is composed of multiple unobserved pieces of information with unpredictable distributions (e.g., an action may allow a setting with a large number of preambles f^t_{Prea}, but massive random collisions accidentally occur, leading to an unusually low reward). The implementation of uplink resource configuration using tabular-Q based real-time optimization is shown in Algorithm 2.
Algorithm 2: Tabular-Q Based Uplink Resource Configuration
Input: Valid numbers of preambles F_Prea, number of IoT devices D, operation iterations I.
1: Algorithm hyperparameters: learning rate λ ∈ (0, 1], discount rate γ ∈ [0, 1), ε-greedy rate ε ∈ (0, 1];
2: Initialize the Q-value table Q(s, a) with zero value scalars;
3: for Iteration = 1 to I do
4:   Initialize S^1 by executing a random action A^0, and set the bursty traffic arrival rate μ^0_bursty = 0;
5:   for t = 1 to T do
6:     Update μ^t_bursty using Eq. (3);
7:     if p^t_ε < ε then select a random action A^t from A;
8:     else select A^t = argmax_{a∈A} Q(S^t, a);
9:     The eNB broadcasts f^t_{Prea} = F_Prea(A^t), and backlogged IoT devices attempt communication in the tth TTI;
10:    The eNB observes S^{t+1}, calculates the reward R^{t+1} using Eq. (24), and updates Q(S^t, A^t) using Eq. (25).
11:  end for
12: end for
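The core of Algorithm 2, the update in Eq. (25) plus the ε-greedy actor, can be sketched in a few lines, assuming states and actions are hashable indices; the helper names are ours.

```python
import random
from collections import defaultdict

def tabular_q_step(Q, s, a, reward, s_next, actions, lr=0.01, gamma=0.5):
    """One tabular Q-learning update as in Eq. (25); Q maps (state, action) -> value."""
    td_target = reward + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += lr * (td_target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps):
    """The actor of Fig. 3: explore with probability eps, otherwise exploit."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])

# Usage: a zero-initialized table; actions index F_Prea = {12, 24, 36, 48}.
Q = defaultdict(float)
tabular_q_step(Q, s=0, a=1, reward=1.0, s_next=2, actions=[0, 1, 2, 3])
```

After this single update on a zero table, Q[(0, 1)] moves by λ·R = 0.01 toward the TD target, illustrating why a small λ smooths out biased single-TTI rewards.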
B. Value Function Approximation
Since tabular-Q needs its each element to be updated to converge, searching for an optimal policy can
be difficult in limited time and computational resource. To solve this problem, we use a value function
approximator instead of Q-value table to find a sub-optimal approximated policy. Generally, selecting a
efficient approximation approach to represent the value function for different learning scenarios is a usual
problem within the RL [31, 33–35]. A variety of function approximation approaches can be conducted, such
as LA, DNNs, tree search, and which approach to be selected can critically influence the successful learning
[31, 34, 35]. The function approximation should fit the complexity of the desired value function, and be
efficient to obtain good solutions. Unfortunately, most function approximation approaches require specific
design for different learning problems, and there is no basis function, which is both reliable and efficient to
satisfy all learning problems.
In this subsection, we first focus on linear function approximation for Q-learning, due to its simplicity, efficiency, and guaranteed convergence [31, 36, 37]. We then use a DNN for Q-learning as a more effective but more complicated function approximator, which is also known as DQN [32]. The reasons we use DQN are that: 1) the DNN function approximator is able to deal with several kinds of partially observable problems [31, 32]; 2) DQN has the potential to accurately approximate the desired value function while handling problems with very large state spaces [32], which is favorable for learning in the multiple CE group scenario; 3) DQN is highly scalable, as the scale of its value function can easily be fitted to a more complicated problem; and 4) a variety of libraries have been established to facilitate building DNN architectures and accelerate experiments, such as TensorFlow, PyTorch, Theano, and Keras.
1) Linear Approximation: LA-Q uses a linear weight matrix w to approximate the value function Q(s, a) with a feature vector x(s) corresponding to the state S^t. The dimension of the weight matrix w is |A| × |x|, where |A| is the total number of available actions and |x| is the size of the feature vector x. Here, we consider polynomial regression (as in [31, Eq. 9.17]) to construct the real-valued feature vector x(s) due to its efficiency3. In the training process, exploration is the same as in tabular Q-learning, by generating random actions, but exploitation is computed using the weight matrix w of the value function. In detail, to predict an action using the LA value function Q(S^t, a, w) for state S^t in the tth TTI, the approximated value function scalar for each action a is obtained by the inner product between the weight matrix w and the feature vector x(S^t) as
Q(S^t, a, w) = w · x(S^t)^T = [Σ_j w(1, j) x_j(S^t), ..., Σ_j w(|A|, j) x_j(S^t)]. (26)
By searching for the maximal value function scalar in Q(S^t, a, w) given in Eq. (26), we obtain the matched action A^t that maximizes future rewards.
To obtain the optimal policy, we update the weight matrix w in the value function Q(s, a; w) using Stochastic Gradient Descent (SGD) [31, 39]. SGD minimizes the prediction error on the observation after each example, where the error is reduced by a small amount in the direction of the optimal target policy Q*(s, a). As it is infeasible to obtain the optimal target policy by summing over all states, we instead estimate the desired action-value function by considering one learning sample Q*(s, a) ≈ Q(S^t, a, w^t) [31]. In each TTI, the weight matrix w is updated following
w^{t+1} = w^t − λ∇L(w^t), (27)
where λ is the learning rate and ∇L(w^t) is the gradient of the loss function L(w^t) used to train the Q-function approximator. This is given as
∇L(w^t) = −[R^{t+1} + γ max_a Q(S^{t+1}, a; w^t) − Q(S^t, A^t, w^t)] · x(A^t, S^t), (28)
where w^t is the weight matrix, and x(A^t, S^t) = ∇_w Q(S^t, A^t, w^t) is the feature matrix with the same shape as w^t. x(A^t, S^t) is constructed from zeros, with the feature vector located in the row corresponding to the index of the action A^t selected in the tth TTI. Note that Q(S^{t+1}, a; w^t) is a scalar. The learning procedure follows Algorithm 2, changing the Q-table Q(s, a) to the LA value function Q(s, a; w) with the linear weight matrix w, and updating Q(s, a; w) with SGD as given in (28) in step 10 of Algorithm 2.
3The polynomial case is the best-understood feature constructor and generally performs well in practice with an appropriate setting [31, 33]. Furthermore, the results in [38] show that there is a rough correspondence between a fitted neural network and a fitted ordinary parametric polynomial regression model. These reasons encourage us to compare the polynomial based LA-Q with DQN.
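Eqs. (26)-(28) reduce to a semi-gradient update that touches only the row of w matching the chosen action. The sketch below is ours: the feature constructor is a simplified stand-in for [31, Eq. 9.17], and all names are assumptions.

```python
import numpy as np

def poly_features(state, degree=2):
    """Hypothetical polynomial feature constructor (cf. [31, Eq. 9.17]): bias + powers."""
    s = np.asarray(state, dtype=float)
    feats = [np.ones(1)] + [s ** d for d in range(1, degree + 1)]
    return np.concatenate(feats)

def la_q_update(w, s, a, reward, s_next, lr=0.01, gamma=0.5):
    """Semi-gradient update of the |A| x |x| weight matrix, per Eqs. (26)-(28)."""
    x, x_next = poly_features(s), poly_features(s_next)
    q_next = w @ x_next                       # Eq. (26): one value per action
    td_error = reward + gamma * q_next.max() - w[a] @ x
    w[a] += lr * td_error * x                 # Eqs. (27)-(28): only the chosen row moves
    return w
```

Because x(A^t, S^t) is zero outside the selected action's row, the update in Eq. (28) is exactly the single-row step `w[a] += lr * td_error * x`.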
2) Deep Q-Network: The DQN agent parameterizes the state-action value function Q(s, a) by a function Q(s, a; θ), where θ represents the weight matrix of a DNN with multiple layers. We consider a conventional DNN, where neurons between two adjacent layers are fully pairwise connected, namely fully-connected layers. The input of the DNN is given by the variables in the state S^t; the intermediate hidden layers consist of Rectified Linear Units (ReLUs) using the function f(x) = max(0, x); and the output layer is composed of linear units4, which are in one-to-one correspondence with all available actions in A.
Fig. 4: The DQN agent and environment interaction in the POMDP.
Exploitation is performed via forward propagation of the Q-function Q(s, a; θ) with respect to the observed state S^t. The weight matrix θ is updated online along each training episode by using double deep
4Linear activation is used here according to [32]. Note that Q-learning is value-based, so the desired value function given in Eq. (15) can be larger than 1 rather than being a probability; thus an activation function whose output is limited to [−1, 1] (such as the sigmoid or tanh functions) can lead to convergence difficulty.
Algorithm 3: DQN Based Uplink Resource Configuration
Input: The set of numbers of preambles in each RACH period F_Prea, the number of IoT devices D, and operation iterations I.
1: Algorithm hyperparameters: learning rate λ ∈ (0, 1], discount rate γ ∈ [0, 1), ε-greedy rate ε ∈ (0, 1], target network update frequency K;
2: Initialize the replay memory M to capacity C, the primary Q-network θ, and the target Q-network θ̄;
3: for Iteration = 1 to I do
4:   Initialize S^1 by executing a random action A^0, and set the bursty traffic arrival rate μ^0_bursty = 0;
5:   for t = 1 to T do
6:     Update μ^t_bursty using Eq. (3);
7:     if p^t_ε < ε then select a random action A^t from A;
8:     else select A^t = argmax_{a∈A} Q(S^t, a, θ);
9:     The eNB broadcasts F_Prea(A^t), and backlogged IoT devices attempt communication in the tth TTI;
10:    The eNB observes S^{t+1} and calculates the reward R^{t+1} using Eq. (24);
11:    Store the transition (S^t, A^t, R^{t+1}, S^{t+1}) in replay memory M;
12:    Sample a random minibatch of transitions (S^j, A^j, R^{j+1}, S^{j+1}) from replay memory M;
13:    Perform a gradient descent step for Q(s, a; θ) using Eq. (30);
14:    Every K steps, update the target Q-network θ̄ := θ.
15:  end for
16: end for
Q-learning (DDQN) [40], which to some extent reduces the substantial overestimation5 of the value function. Accordingly, learning takes place over multiple training episodes, each of duration N_TTI TTI periods. In each TTI, the parameter θ of the Q-function approximator Q(s, a; θ) is updated using SGD as
θ^{t+1} = θ^t − λ_RMS ∇L_DDQN(θ^t), (29)
where λ_RMS is the RMSProp learning rate [41] and ∇L_DDQN(θ^t) is the gradient of the loss function L_DDQN(θ^t) used to train the Q-function approximator. This is given as
∇L_DDQN(θ^t) = −E_{S^i, A^i, R^{i+1}, S^{i+1}}[(R^{i+1} + γ max_a Q(S^{i+1}, a; θ̄^t) − Q(S^i, A^i; θ^t)) ∇_θ Q(S^i, A^i; θ^t)], (30)
where the expectation is taken with respect to a so-called minibatch, i.e., randomly selected previous samples (S^i, A^i, S^{i+1}, R^{i+1}) for some i ∈ {t − M_r, ..., t}, with M_r being the replay memory size [32]. When t − M_r is negative, this is interpreted as including samples from the previous episode. The use of a minibatch, instead of a single sample, to update the value function Q(s, a; θ) improves the convergence reliability of the value function [32]. Furthermore, following DDQN [40], in (30), θ̄^t is a so-called target Q-network that is used to estimate the future value of the Q-function in the update rule. This parameter is periodically copied from the current value θ^t and kept fixed for a number of episodes [40].
5Overestimation refers to some suboptimal actions regularly being given higher Q-values than optimal actions, which can negatively influence the convergence and training efficiency of the algorithm [34, 40].
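As a concrete sketch of Eqs. (29)-(30), the minibatch gradient below bootstraps from a frozen target network, as in (30). Here `q_fn` and `grad_fn` are hypothetical stand-ins for the primary network's forward pass and its parameter gradient, and a plain gradient step replaces RMSProp.

```python
import numpy as np

def ddqn_minibatch_grad(theta, theta_target, batch, q_fn, grad_fn, gamma=0.5):
    """Average gradient of L_DDQN over a minibatch, per Eq. (30) (sketch)."""
    grads = np.zeros_like(theta)
    for (s, a, r, s_next) in batch:
        # Bootstrap value comes from the frozen target network theta_target.
        target = r + gamma * np.max(q_fn(theta_target, s_next))
        td_error = target - q_fn(theta, s)[a]
        grads += -td_error * grad_fn(theta, s, a)   # minus sign as in Eq. (30)
    return grads / len(batch)

def sgd_step(theta, grad, lr=1e-4):
    """Plain gradient step standing in for the RMSProp update of Eq. (29)."""
    return theta - lr * grad
```

For a linear `q_fn`, a single positive-reward transition on a zero-initialized network produces a negative gradient in the chosen action's row, so the SGD step of Eq. (29) raises that action's value, as expected.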
Practically, NB-IoT is always deployed with multiple CE groups to serve IoT devices with various coverage requirements. In this section, we study problem (1) of optimizing the resource configuration for three CE groups, each with parameters A^t = {n^t_{Rach,i}, f^t_{Prea,i}, n^t_{Repe,i}}^2_{i=0}. This joint optimization, configuring each parameter in each CE group, can improve the overall data access and transmission performance. Note that all CE groups share the uplink resource in the same bandwidth, and the eNB schedules data resource to all RRC-connected IoT devices without CE group bias, as introduced in Sec. II.B.4). To optimize the number of served IoT devices in real-time, the eNB should not only balance the uplink resource between RACH and data, but also balance it among the CE groups.
The Q-learning algorithms for the single CE group provided in Sec. IV are model-free, and thus their learning structure can be directly reused in this multi-parameter multi-group scenario. However, considering multiple CE groups enlarges the observation space, which exponentially increases the size of the state space. Training a Q-agent under this expansion greatly increases the required time and computational resources. In this case, tabular-Q would be extremely inefficient, as the state-action value table not only requires a large memory, but it is also impossible to repeatedly experience every state to achieve convergence in limited time. In view of this, we only study Q-learning with value function approximation (LA-Q and DQN) to design uplink resource configuration approaches for the multi-parameter multi-group scenario.
LA-Q and DQN are highly capable of handling massive state spaces, so we can considerably enrich the state space with more observed information to support the optimization of the Q-agent. Here, we define the current state S^t to include information about the last M_o TTIs (U^{t−1}, U^{t−2}, U^{t−3}, ..., U^{t−M_o}). This design improves the Q-agent by enabling it to estimate the trend of traffic. As our goal is to optimize the number of served IoT devices, the reward function is defined according to the number of successfully served IoT devices V_{su,i} of each CE group, which is expressed as
R^{t+1} = Σ_{i=0}^{2} V^t_{su,i}/c_su. (31)
Like the state space, the action space also grows exponentially with the number of adjustable configurations. The number of available actions corresponds to the number of possible combinations of configurations, |A| = Π_{i=0}^{2}(|N_Rach,i| × |N_Repe,i| × |F_Prea,i|), where |·| denotes the number of elements in a set, A is the set of actions, and N_Rach,i, N_Repe,i, and F_Prea,i are the sets of the number of RACH periods, the repetition value, and the number of preambles in each RACH period, respectively. Unfortunately, it is extremely hard to optimize the system over such a large action space (|A| can exceed fifty thousand), because the system would update its policy using only a small part of the actions in A, which ultimately leads to convergence difficulty. To solve this problem, we provide two approaches that reduce the dimension of the action space to enable LA and DQN in the multi-parameter multi-group scenario.
A. Actions Aggregated Approach
We first provide AA based Q-learning approaches, which guarantee convergence by sacrificing the accuracy of action selection6. In detail, the selection of specific values is converted to the selection of an increasing or decreasing trend. Instead of selecting exact values from the sets N_Rach,i, N_Repe,i, and F_Prea,i, we convert the action to a single-step ascent/descent relative to the last action, represented by A^t_{Rach,i} ∈ {0, 1}, A^t_{Repe,i} ∈ {0, 1}, and A^t_{Prea,i} ∈ {0, 1} for the number of RACH periods n^t_{Rach,i}, the repetition value n^t_{Repe,i}, and the number of preambles in each RACH period f^t_{Prea,i} in the tth TTI. Consequently, the size of the total action space for the three CE groups is reduced to |A| = 2^9 = 512. The algorithms for training with the LA function approximator and DQN in the multi-parameter multi-group scenario can then be deployed following Algorithm 2 and Algorithm 3, respectively.
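The up/down aggregation can be sketched as an index-stepping rule over the nine configuration parameters (three per CE group). The set sizes below follow Table I; the clipping at the set edges and the function name are our assumptions.

```python
def apply_aa_action(bits, current_idx, set_sizes):
    """Map 9 up/down bits to new indices into (N_Rach,i, N_Repe,i, F_Prea,i).

    bits        : tuple of 9 values in {0, 1}; 1 steps the index up, 0 steps it down
    current_idx : current index per parameter (3 parameters x 3 CE groups)
    set_sizes   : size of the configuration set for each parameter
    """
    new_idx = []
    for b, idx, size in zip(bits, current_idx, set_sizes):
        step = 1 if b == 1 else -1
        new_idx.append(min(max(idx + step, 0), size - 1))  # clip at set edges
    return new_idx
```

With |N_Rach| = 3, |N_Repe| = 6, and |F_Prea| = 4 per group (Table I), the exact-value action space of 3·6·4 = 72 combinations per group collapses to 2^3 per group, giving 2^9 = 512 joint actions in total.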
B. Cooperative Multi-agent Learning Approach
Although the uplink resource configuration is managed by a central authority, identifying the control of each parameter as a sub-task that is cooperatively handled by independent Q-agents is sufficient to deal with an otherwise unsolvable action space [42]. As shown in Fig. 5, we consider multiple DQN agents centralized at the eNB, all with the same value function approximator structure7, following Section IV.B.2). We break down the action space by considering nine separate action variables in A^t, where each DQN agent controls its own action variable, as shown in Fig. 5. Recall that we have three variables for each group i, namely n_Rach,i, n_Repe,i, and f_Prea,i.
We introduce a separate DQN agent for each output variable in A^t, defined as the action A^t_k selected by the kth agent, where each kth agent is responsible for updating the value Q(S^t, A^t_k; θ_k) of its action A^t_k in the shared state S^t.
6The action aggregation idea has rarely been evaluated, but the same idea applied to states, namely state aggregation, has been well studied as a basic function approximation approach [31].
7The structure of the value function approximator can also be specifically designed for RL agents whose sub-tasks have significantly different complexity. However, there is no such requirement in our problem, so this is not considered.
Fig. 5: The CMA-DQN agents and environment interaction in the POMDP.
The DQN agents are trained in parallel and receive the same reward signal given in Eq. (31) at the
end of each TTI, as per problem (1). The use of this common reward signal ensures that all DQN agents aim to cooperatively increase the objective in (1). Note that the approach can be interpreted as applying a factorization of the overall value function akin to the approach proposed in [43] for multi-agent systems.
The challenge of this approach is how to evaluate each action according to the common reward function. For each DQN agent, the received reward is corrupted by massive noise, as its own effect on the reward is deeply hidden in the effects of all other DQN agents. For instance, a positive action can receive a mismatched low reward due to other DQN agents' negative actions. Fortunately, in our scenario, all DQN agents are centralized at the eNB, which means that all DQN agents can have full information about each other. Accordingly, we adopt the action selection histories of each DQN agent as part of the state function8, so that the agents are able to learn how the reward is influenced by different combinations of actions. To do so, we define the state variable S^t as
S^t = [A^{t−1}, U^{t−1}, A^{t−2}, U^{t−2}, ..., A^{t−M_o}, U^{t−M_o}], (32)
where M_o is the number of stored observations, A^{t−1} is the set of actions selected by each DQN agent in the (t−1)th TTI, corresponding to n_Rach,i, n_Repe,i, and f_Prea,i for the ith CE group, and U^{t−1} is the set of observed transmission receptions.
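The shared state of Eq. (32) can be maintained as a rolling buffer of (action-vector, observation-vector) pairs. The class below is a minimal sketch; the zero-padding of the first TTIs and all names are our assumptions.

```python
from collections import deque

class SharedState:
    """Rolling history of the last M_o (A, U) pairs, flattened per Eq. (32)."""
    def __init__(self, m_o):
        self.m_o = m_o
        self.hist = deque(maxlen=m_o)  # oldest TTI dropped automatically

    def push(self, action_vec, obs_vec):
        # Most recent TTI goes first, matching [A^{t-1}, U^{t-1}, ...].
        self.hist.appendleft(list(action_vec) + list(obs_vec))

    def vector(self):
        # Flatten the history; zero-pad until M_o TTIs have been observed.
        width = len(self.hist[0]) if self.hist else 0
        flat = [v for pair in self.hist for v in pair]
        return flat + [0.0] * (self.m_o * width - len(flat))
```

Every DQN agent reads this same vector, which is what lets each agent attribute the common reward to the joint action history rather than to its own action alone.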
In each TTI, the parameters θ_k of the Q-function approximator Q(S^t, A^t_k; θ_k) are updated using SGD at all agents k as in Eq. (29). The learning algorithm can be implemented following Algorithm 3. Different from the single-parameter single-group scenario, we first need to initialize nine primary networks θ_k, target networks θ̄_k, and replay memories M_k, one set per DQN agent. In step 11 of Algorithm 3, the current transitions of each DQN agent are stored separately in their own memories. In steps 12 and 13 of Algorithm 3, the minibatches of transitions are separately sampled from each memory to train the corresponding DQN agents.
8The state function can be designed to collect more information according to the complexity requirements, such as sharing the value function between DQN agents [42].
In this section, we evaluate the performance of the proposed Q-learning approaches and compare them with the conventional LE-URC and FSI-URC approaches described in Sec. III via numerical experiments. We adopt the standard network parameters listed in Table I following [1, 3, 22, 25, 29], and the Q-learning hyperparameters listed in Table II. Accordingly, one epoch consists of 937 TTIs (i.e., 10 minutes). The RL agents are first trained in a so-called learning phase and, after convergence, their performance is compared with LE-URC and FSI-URC in a so-called testing phase. All testing performance results are obtained by averaging over 1000 episodes. In the following, we present our simulation results for the single-parameter single-group scenario and the multi-parameter multi-group scenario in Section VI-A and Section VI-B, respectively.
TABLE I: Simulation Parameters
Path-loss exponent η: 4; Noise power σ²: −138 dBm
eNB broadcast power P_NPBCH: 35 dBm; Path-loss inverse power control threshold ρ: 120 dB
Maximal preamble transmit power P_RACHmax: 23 dBm; Received SNR threshold γ_th: 0 dB
Duration of periodic traffic T_periodic: 1 hour; TTI: 640 ms
Duration of bursty traffic T_bursty: 10 minutes; Set of numbers of preambles F_Prea: {12, 24, 36, 48}
Maximum allowed resource requests γ_RRC: 5; Set of repetition values N_Repe: {1, 2, 4, 8, 16, 32}
Maximum RACH attempts γ_pMax: 10; Set of numbers of RACH periods N_Rach: {1, 2, 4}
Maximum allowed RACH attempts in one CE group γ_pCE,i: 5; REs required for B_RACH: 4
Bursty traffic parameter Beta(α, β): (3, 4); REs required for B_DATA: 32
TABLE II: Q-learning Hyperparameters
Learning rate λ for Tabular-Q and LA-Q: 0.01; RMSProp learning rate λ_RMS for DQN: 0.0001
Initial exploration ε: 1; Final exploration ε: 0.1
Discount rate γ: 0.5; Minibatch size: 32
Replay memory size: 10000; Target Q-network update frequency: 1000
A. Single-Parameter Single-Group Scenario
In the single-parameter single-group scenario, the eNB is located at the center of a circular area with a 10 km radius, and the IoT devices are randomly located within the cell. We set the number of RACH periods to n_Rach = 1, the repetition value to n_Repe = 4, and the limited uplink resource to R_Uplink = 1536 REs (i.e., 32 slots with 48 sub-carriers). Unless otherwise stated, we consider the number of periodical IoT devices to be D_periodic = 10000 and the number of bursty IoT devices to be D_bursty = 5000. The DQN is set with three
hidden layers, each with 128 ReLU units. The Tabular-Q, LA-Q, and DQN approaches are proposed in Sec. IV.A, IV.B.1), and IV.B.2), respectively. The conventional LE-URC and FSI-URC approaches are proposed in Sec. III.B.
Fig. 6: The real-time traffic load and V_su for FSI-URC, LE-URC, and DQN.
Fig. 7: V_su and the average received reward for Tabular-Q, LA-Q, and DQN.
Throughout an epoch, each device has either a periodical traffic profile (i.e., the uniform distribution given in Eq. (2)) or a bursty traffic profile (i.e., the time-limited Beta profile defined in Eq. (4) with parameters (3, 4)), which has a peak around the 400th TTI. The resulting average number of newly generated packets is shown as a dashed line in Fig. 6(a). Fig. 6(b) plots the number of successfully served IoT devices V_su under the proposed FSI-URC, LE-URC, and DQN approaches. In Fig. 6(b), V_su first increases gradually with the growing traffic shown in Fig. 6(a), until it reaches the serving capacity of the eNB. Then, V_su decreases slowly due to the increasing collisions and scheduling failures as traffic grows. After that, V_su increases gradually as collisions and scheduling failures decrease with the declining traffic. Finally, V_su decreases slowly as the traffic fades.
In Fig. 6(b), we see that the ideal FSI-URC approach outperforms the LE-URC approach, because the FSI-URC approach uses the actual network load to optimally configure V^t_su at a single time instance as in Eq. (11). DQN not only always outperforms LE-URC, but also exceeds the ideal FSI-URC approach in most TTIs. This is because both LE-URC and FSI-URC only optimize V^t_su at one time instance, whereas DQN optimizes the long-term performance of the number of served IoT devices. The optimization at one time instance (LE-URC and FSI-URC) only takes into account the current trade-off between RACH resource and data resource given in Eq. (22), while the long-term optimization (DQN) also accounts for long-term hidden features, such as packets dropped due to exceeding the maximum RACH attempts γ_pMax or the maximum resource requests γ_RRC. The DQN approach can capture these hidden features well to optimize the long-term performance of V_su as in Eq. (1).
Fig. 7(a) compares the number of successfully served IoT devices Vsu under the Tabular-Q, LA-Q, and DQN
approaches. We observe that all three approaches achieve similar values of Vsu, which indicates that
both LA-Q and DQN estimate the optimal value function Q(s, a) as well as the converged Tabular-Q
in this low-complexity single CE group scenario. Fig. 7(b) plots the average received reward over each
bursty duration, E{R} = (1/Tbursty) Σ_{t=0}^{Tbursty} R_t (i.e., one epoch consists of one bursty duration Tbursty), from the
beginning of training versus the required training time. It can be seen that LA-Q and DQN converge to
the optimal value function Q(s, a) much faster (about 10 minutes) than Tabular-Q (about 5 days).
The observations in Fig. 7 demonstrate that LA-Q and DQN are good alternatives to Tabular-Q, serving
almost the same number of IoT devices with much less training time.
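The training-time gap between Tabular-Q and LA-Q comes from what each update touches: a tabular backup writes a single table cell, while a linear-approximation step adjusts a small shared weight vector that generalizes across states. A minimal sketch, assuming a hypothetical feature map phi(s, a) and illustrative sizes:

```python
import numpy as np

# Hedged sketch contrasting the tabular-Q backup with the LA-Q backup used
# as its function-approximation alternative. The feature map and all sizes
# are illustrative assumptions, not the paper's implementation.

ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor (assumed)

def tabular_update(Q, s, a, r, s_next):
    """One tabular backup: updates exactly one (state, action) cell."""
    target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (target - Q[s, a])

def la_q_update(w, phi_sa, r, phi_next_all):
    """LA-Q: Q(s, a) = w . phi(s, a); semi-gradient step on the TD error."""
    q_next = max(float(w @ p) for p in phi_next_all)
    td_error = r + GAMMA * q_next - float(w @ phi_sa)
    w += ALPHA * td_error * phi_sa  # weights are shared across all states

Q = np.zeros((4, 2))   # tabular: one cell per (state, action) pair
tabular_update(Q, s=0, a=1, r=1.0, s_next=2)

w = np.zeros(3)        # linear: 3 weights regardless of the state count
la_q_update(w, np.array([1.0, 0.5, 0.0]), 1.0,
            [np.array([0.0, 1.0, 1.0])])
```

Because every LA-Q step updates all shared weights, experience generalizes across states, which is consistent with the far shorter training time reported above.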
Fig. 8(a) and Fig. 8(b) plot the average number of successfully served IoT devices E{Vsu} and the average
number of dropped packets E{Vdrop} (a system performance metric that can only be obtained in simulation)
over a bursty duration Tbursty versus the number of bursty IoT devices Dbursty. In Fig. 8(a), we observe
that E{Vsu} first increases and then decreases as the number of bursty devices grows; the decreasing
trend starts once the eNB can no longer serve the growing number of IoT devices due to the increasing
collisions and scheduling failures. These collisions and scheduling failures also cause the number of
dropped packets to grow with traffic, as shown in Fig. 8(b). In Fig. 8, we also notice that DQN
always outperforms LE-URC (especially for relatively large Dbursty), which indicates the superiority of the DQN
approach in handling massive bursty IoT devices. Interestingly, DQN serves more IoT devices
and yields smaller mean errors than the ideal FSI-URC approach in most cases,
thanks to the long-term optimization capability of DQN.
B. Multi-Parameter Multi-Group Scenario
Considering an eNB located at the center of a circular area with a 12 km radius, we set the RSRP thresholds for CE
group selection to {γRSRP1, γRSRP2} = {0, 5} dB, the uplink resource to Ruplink = 15360 REs (i.e., 320 slots with
48 sub-carriers), and the NPUSCH constraints for LE-URC following Ruplink,0 : Ruplink,1 : Ruplink,2 = 1 : 1 : 1.
To model massive IoT traffic, both the number of periodical IoT devices Dperiodic and the number of bursty
IoT devices Dbursty increase to 30000. In AA-DQN, we use one Q-network with three hidden layers, each
consisting of 2048 ReLU units. In CMA-DQN, nine DQNs are used to control each of the nine
configuration parameters (i.e., nRach,i, nRepe,i, fPrea,i for the three CE groups), where each DQN has three hidden layers,
each with 128 ReLU units. The AA-LA-Q and AA-DQN approaches are proposed in Sec. V.A, and the CMA-DQN
approach is proposed in Sec. V.B.
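The two network shapes just described can be sketched as plain MLP forward passes (framework-agnostic; the state dimension and action count below are illustrative assumptions, not the paper's values):

```python
import numpy as np

# Sketch of the two Q-network shapes described above: one large network for
# AA-DQN (three hidden layers of 2048 ReLU units) versus nine small networks
# for CMA-DQN (three hidden layers of 128 ReLU units each). The input and
# output sizes are illustrative assumptions.

def mlp(sizes, rng):
    """Build (weight, bias) pairs for a fully connected network."""
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU hidden units
    return x  # linear output layer: one Q-value per action

rng = np.random.default_rng(0)
state_dim, n_actions = 16, 8                                 # assumed sizes
aa_dqn = mlp([state_dim, 2048, 2048, 2048, n_actions], rng)  # one big net
cma_dqn = [mlp([state_dim, 128, 128, 128, n_actions], rng)   # nine small nets,
           for _ in range(9)]                                # one per parameter

q = forward(cma_dqn[0], np.zeros(state_dim))
print(q.shape)  # (8,)
```

Splitting the nine parameters across nine small networks keeps each agent's action space, and hence each output layer, small, which is the structural idea behind CMA-DQN.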
Fig. 8: E{Vsu} and E{Vdrop} for FSI-URC, LE-URC, and DQN.
Fig. 9: Vsu and the average received reward.
Fig. 9(a) compares the number of successfully served IoT devices Vsu during one epoch using AA-LA-Q,
AA-DQN, CMA-DQN, and LE-URC. The “LE-URC-[1,4,8]” and “LE-URC-[2,8,16]” curves represent
the LE-URC approach with the repetition values {nRepe,0, nRepe,1, nRepe,2} set to {1, 4, 8} and {2, 8, 16},
respectively. We observe that the number of successfully served IoT devices Vsu follows CMA-DQN > AA-
DQN > AA-LA-Q > LE-URC-[1,4,8] > LE-URC-[2,8,16]. As can be seen, all Q-learning based approaches
outperform the LE-URC approaches, because the Q-learning based approaches can dynamically optimize
the number of served IoT devices by accurately configuring each parameter. We also observe that CMA-
DQN slightly outperforms the others in the light-traffic regions at the beginning and end of the epoch,
but substantially outperforms them during the heavy-traffic period in the middle of the epoch. This
demonstrates the capability of CMA-DQN to better manage the scarce channel resource in the presence
of heavy traffic. It is also observed that increasing the repetition value of each CE group with LE-URC
improves the received SNR, and thus the RACH success rate, in the light-traffic region, but degrades the
scheduling success rate in the heavy-traffic region due to the limited channel resource.
Fig. 9(b) plots the average received reward over each bursty duration, E{R} = (1/Tbursty) Σ_{t=0}^{Tbursty} R_t, from the
beginning of training versus the consumed training time. It can be seen that CMA-DQN and AA-DQN
require much less training time than AA-LA-Q. Compared with the single CE group results
shown in Fig. 7, the DNN is a better value function approximator for the three CE groups scenario owing to
its efficiency and capability in solving high-complexity problems. We also observe that CMA-DQN achieves
higher E{R} than AA-DQN, because CMA-DQN can select the exact values from the action
sets {NRepe, NRach, FPrea}, whereas AA-DQN can only select ascent/descent actions, which leads to a
sub-optimal solution.
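The ascent/descent limitation of AA-DQN can be illustrated with a toy action-aggregation step over a sorted candidate list (the candidate values and action encoding below are illustrative assumptions, not the paper's configuration sets):

```python
# Hedged sketch of the action-aggregation idea: rather than picking an
# exact value from the configuration set, an AA-style agent picks a step
# direction over a sorted candidate list. The candidate repetition values
# below are illustrative assumptions.

N_REPE = [1, 2, 4, 8, 16, 32, 64, 128]  # sorted candidate repetition values

def aa_step(index, action):
    """Aggregated actions: 0 = descend, 1 = hold, 2 = ascend (clamped)."""
    delta = {0: -1, 1: 0, 2: +1}[action]
    return min(max(index + delta, 0), len(N_REPE) - 1)

# Exact selection (CMA-DQN style) reaches any candidate in one decision;
# the aggregated agent must walk the list one step per decision, which is
# the source of the sub-optimality discussed above.
i = 0
for a in (2, 2, 2):  # three consecutive "ascend" actions
    i = aa_step(i, a)
print(N_REPE[i])  # 8
```

The clamping at both ends of the list keeps the aggregated action space valid without enlarging it.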
Fig. 10: The average number of successfully served IoT devices Vsucc,i for each CE group i.
Fig. 11: The allocated repetition value n^t_Repe,i, and RAOs produced by n^t_Rach,i × f^t_Prea,i.
To gain more insight into the operation of CMA-DQN, Fig. 10 plots the average number of successfully
served IoT devices Vsucc,i for each CE group i, and Fig. 11 plots the average number of repetitions n^t_Repe,i
and the average number of Random Access Opportunities (RAOs), defined as the product n^t_Rach,i × f^t_Prea,i, for
each CE group i selected by CMA-DQN over the testing episodes. As seen in Fig. 10, CMA-DQN
substantially outperforms the LE-URC approaches for each CE group i; the reasons for this performance
are showcased in Fig. 11. As seen in Fig. 11(a)-(c), CMA-DQN increases the number of repetitions in the
light-traffic region in order to improve the SNR and reduce RACH failures, while decreasing it in the heavy-
traffic region so as to reduce scheduling failures. Surprisingly, CMA-DQN increases the repetition value
nRepe,0 of group 0 at the same time, exactly opposite to its actions on nRepe,1 and nRepe,2. This
is because CMA-DQN learns that the key to optimizing the overall performance Vsu is to guarantee
Vsucc,0, as the IoT devices in CE group 0 are easier to serve: they are located close to the eNB
and consume fewer resources. As illustrated in Fig. 11(d)-(f), this allows CMA-DQN to increase the number
of RAOs in the high-traffic regime, mitigating the impact of collisions on the throughput. In contrast, for
CE groups 1 and 2 in the heavy-traffic region, LE-URC decreases the number of RAOs in order to reduce
resource scheduling failures, causing an overall lower throughput as seen in Fig. 10.
Fig. 12: The average number of successfully served IoT devices per TTI over each epoch in online updating.
Realistic network conditions can differ from the simulation environment, because the
practical traffic and physical channel vary and can be unpredictable. This difference may lead to inaccurate
configurations that degrade the system performance of each approach. Fortunately, the proposed RL-based
approaches can self-update after deployment according to practical observations of the NB-IoT network in an
online manner. To model this, we take the trained CMA-DQN agents given in Fig. 11 (i.e., with bursty traffic
modelled by the time-limited Beta profile with parameters (3, 4)), and test them in a slightly modified traffic
scenario where the bursty traffic follows Beta(5, 6), with a constant exploration rate ε = 0.001. Fig.
12 plots the average number of successfully served IoT devices E{Vsu} per TTI over each episode versus
epochs. Our result shows that, as expected, E{Vsu} follows CMA-DQN > LE-URC-[1,4,8] > LE-URC-[2,8,16]
at every epoch. More importantly, the performance of CMA-DQN gradually improves over the epochs, which
sheds light on the online self-updating capability of the proposed RL-based approaches.
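The online self-updating behaviour rests on keeping a small constant exploration rate after deployment. A minimal ε-greedy selection sketch, with stand-in Q-values rather than the paper's agent:

```python
import random

# Minimal sketch of the online self-updating idea: after deployment the
# agent keeps a small constant exploration rate (epsilon = 0.001, matching
# the experiment above) so it occasionally re-tries non-greedy
# configurations and can adapt when live traffic drifts from the training
# distribution. The Q-values below are illustrative stand-ins.

EPSILON = 0.001  # constant exploration rate during online updating

def select_action(q_values, rng):
    """Epsilon-greedy: mostly exploit the current best, rarely explore."""
    if rng.random() < EPSILON:
        return rng.randrange(len(q_values))  # exploratory configuration
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
q = [3.0, 7.5, 1.2]  # stand-in Q-values for three candidate configurations
actions = [select_action(q, rng) for _ in range(10000)]
# Almost every TTI exploits the current best configuration (index 1); the
# rare exploratory picks supply fresh data for continued Q-updates online.
```

With ε this small, exploration barely costs throughput, yet it keeps generating the off-policy samples that let the deployed agents improve over epochs.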
In this paper, we developed Q-learning based uplink resource configuration approaches to optimize the
number of served IoT devices in real time in NB-IoT networks. We first developed Tabular-Q, LA-Q, and
DQN based approaches for the single-parameter single-group scenario, which are shown to outperform
the conventional LE-URC and FSI-URC approaches in terms of the number of served IoT devices. Our
results demonstrated that LA-Q and DQN are good alternatives to Tabular-Q, achieving almost the same
system performance with much less training time. To support traffic with different coverage requirements, we
then studied the multi-parameter multi-group scenario defined in the NB-IoT standard, which introduces a
high-dimensional configuration problem. To solve it, we advanced the proposed LA-Q and DQN using the
Actions Aggregation technique (AA-LA-Q and AA-DQN), which guarantees the convergence of Q-
learning at the cost of some accuracy in resource configuration. We further developed CMA-DQN by dividing
the high-dimensional configuration into multiple parallel sub-tasks, which achieved the best performance in
terms of the number of successfully served IoT devices Vsu with the least training time.
... G0 through G3 are the values of the lookup table for OGM period, and the Bayes Optimizer changes these values and receives a score based on their performance. The OGM period is calculated as in Eq. (12), where ̄ is normalized dynamicity as shown in Eq. (13) and ̄ is the normalized density as shown in Eq. (14). ...
... In this way, retransmission of messages is taken into consideration for transmission. Other algorithms have used metrics such as packet delivery rate and number of packets delivered to calculate round score [14], [15]. We chose this metric so that both the number of packet transmissions, and the latency of the messages would both be valued in the calculation of performance. ...
Conference Paper
Mobile Ad-hoc Networks are a growing field of interest. They have many real-world applications, such as enabling internet connected sensors to operate in environments without pre-existing infrastructure. In past work, we have demonstrated that the Long Range (LoRa) radio frequency (RF) modulation technique, in conjunction with a mesh network can meet these needs in static networks. To extend this to applications with mobile nodes, several adaptations have been implemented to extend the original B.A.T.M.A.N (Better Approach to Mobile Ad-hoc Networking) mesh network algorithm. Node movement models were developed and tested to improve simulation accuracy. We also implemented situationally aware, machine learning (ML) based, route discovery techniques to ensure adequate network information is available in dynamic environments, without adding excessive overhead in static situations. To optimize these changes, a Black Box Optimizer was used in conjunction with an event-based simulation tool to train the ML model.
... Inspired by these developments, authors in [13], [14] have discussed the architectures based on joint optimal cost and resource-efficient methodologies for application in health and vehicular communications. Furthermore, as the deployment of intelligent architectures are increasingly in demand for 5G and beyond, [15]- [17] have proposed novel radio resource management mechanisms using fully connected deep neural networks. However, with the increase in such learning-based architecture deployments, there is also a need to optimize the use of resources for the same, as discussed in [18], for moving towards green-learning based resource management. ...
... Equation (15) refers to the predicted energy consumption of i th SCBS node, ∀i ∈ {1, 2, . . . , N } for the j th time stamp, ∀j ∈ {1, 2, . . . ...
Full-text available
Optimal resource provisioning and management of the next generation communication networks are crucial for attaining a seamless Quality of Service with reduced environmental impact. Considering the ecological assessment, urban and rural telecommunication infrastructure is moving towards deploying green cellular base stations to cater to the needs of the ever-growing traffic demands of heterogeneous networks. In such scenarios , the existing learning-based renewable resource provision-ing methods lack intelligent and optimal resource management at the Small Cell Base Stations (SCBS). Therefore, in this article, we present a novel machine learning-based framework for intelligent resource provisioning mechanisms for micro-grid connected green SCBSs with a completely modified ring parametric distribution method. In addition, an algorithmic implementation is proposed for prediction-based renewable resource redistribution with Energy Flow Control Unit (EFCU) mechanism for grid-connected SCBS, eliminating the need for centralised hardware. Moreover, this modeling enables the prediction mechanism to estimate the future on-demand traffic provisioning capability of SCBS. Furthermore, we present the numerical analysis of the proposed framework showcasing the systems' ability to attain a balanced energy convergence level of all the SCBS at the end of the periodic cycle, signifying our model's merits.
... DQN (Deep Q-Learning) is a DRL-Based (Deep Reinforcement Learning Based) method, which is often used in various resource allocation fields [11][12][13][14][15][16][17][18]. However, DQN is difficult to handle continuous action space. ...
... Guolin Sun, et al. use the DQN method to balance the energy consumption and user satisfaction issues in the C-RANs system [13]. By improving DQN, [14][15][16][17] allocate network resources and computing resources for edge computing, so that the delay in the system is lower. DQN-Based resource allocation methods enable edge resources to be more reasonably allocated to different tasks. ...
Full-text available
Intelligent video surveillance is important to ensure production safety in coal mines, while cloud-edge cooperation is an effective means to improve the performance of intelligent video monitoring. However, in edge layers, incorrect resource allocation of computing and network resources will result in the waste of resources and low real-time performance. In this paper, a DDPG-Based (Deep deterministic policy gradient-based) edge resource allocation method for cloud-edge cooperation framework is proposed. Firstly, the cloud-edge cooperation framework is designed for different tasks. Secondly, the joint minimizing problem of latency and bandwidth usage caused by edge computing is modeled. To quickly solve the joint optimization problem, we convert it to MDP (Markov Decision Process). In addition, ESPN (Edge status perception network) is proposed, which enhances the ability of feature perception and action output of DDPG. Finally, DDPG-ESPN is proposed to solve the joint optimization problem. Simulation results show that compared with other methods, DDPG-ESPN improves the real-time performance and bandwidth usage by up to 18.88% and 42.81% respectively.
... The traditional SADRL approach (see Section 3.1) has been applied in distributed agents to handle large state and action spaces. For instance, the resource allocation scheme (A.1) in [51,[68][69][70] address the challenge of high dynamicity (C.2) and enhance the throughput performance (P.3) of distributed agents in 4G networks. Using the SADRL approach, distributed agents: (a) do not exchange local information among themselves, and so the signaling overhead is reduced (O.4); and (b) use a lesser amount of data, including states and actions, for learning, and so it increases scalability (O.5) with a lower computational complexity. ...
... Hence, various approaches have been proposed. For instance, Nan et al. enable distributed SADRL agents to use historical knowledge to ensure stability (O.1) in [69], and Arjit et al. detects missing data for improved reliability of medical image analysis in [71]. The rest of this section presents the SADRL approaches applied to multi-agent environments. ...
Full-text available
Recent advancements in deep reinforcement learning (DRL) have led to its application in multi-agent scenarios to solve complex real-world problems, such as network resource allocation and sharing, network routing, and traffic signal controls. Multi-agent DRL (MADRL) enables multiple agents to interact with each other and with their operating environment, and learn without the need for external critics (or teachers), thereby solving complex problems. Significant performance enhancements brought about by the use of MADRL have been reported in multi-agent domains; for instance, it has been shown to provide higher quality of service (QoS) in network resource allocation and sharing. This paper presents a survey of MADRL models that have been proposed for various kinds of multi-agent domains, in a taxonomic approach that highlights various aspects of MADRL models and applications, including objectives, characteristics, challenges, applications, and performance measures. Furthermore, we present open issues and future directions of MADRL.
... The third method is to use the simulation tool and reinforcement learning (RL) methods to determine a metric that optimizes the network based on a round reward. Others have used training round rewards based on the number of packets received [17] or packet delivery rates [18]; however due to the inherent high latency of LoRa, there was a need to incentivize minimizing latency. In this metric, the destination score was modeled as a simple single layer perception as shown in Fig. 2 which is calculated in the general form as (13). ...
Conference Paper
To be useful, wireless sensor networks (WSNs) must be relied upon even when dispersed across environments that lack consistent internet access. To this end, we propose a mesh network architecture based on the Better Approach to Mobile Ad-hoc Networking (B.A.T.M.A.N.) algorithm in conjunction with the long range, low power communication protocol, LoRa, to transmit messages. Adaptations including methods of time synchronization, slotted ALOHA transmission and Quality of Service (QoS) considerations with a network-traffic-aware data routing protocol for a multi-source/multi-sink network configuration have been implemented. With this solution, nodes can create an ad-hoc network, sharing internet access and greatly expanding the network coverage without the need for any additional infrastructure. Our QoS-aware routing metrics have been tested in simulation and show performance improvements over traditional B.A.T.M.A.N. destination routing algorithms in these low data rate systems.
... Reinforcement learning is one of the important tools in the field of machine learning. It is widely used to deal with Markov dynamic programming problems [26,27]. As shown in Figure 1, the AI engine is designed as an agent that combines deep learning and reinforcement learning. ...
Full-text available
Since the birth of narrowband Internet of Things (NB-IoT), the Internet of Things (IoT) industry has made a considerable progress in the application for smart cities, smart manufacturing, and healthcare. Therefore, the number of UEs is increasing exponentially, which brings considerable pressure to the efficient resource allocation for the bandwidth and power constrained NB-IoT networks. In view of the conventional algorithms that cannot dynamically adjust resource allocation, resulting in a low resource utilization and prone to resource fragmentation, this paper proposes a double deep Q-network (DDQN)-based NB-IoT dynamic resource allocation algorithm. It first builds an NB-IoT environment model based on the real environment. Then, the DDQN algorithm interacts with the NB-IoT environment model to learn and optimize resource allocation strategies until it converges to the optimum. Finally, the simulation results show that the DDQN-based NB-IoT dynamic resource allocation algorithm is better than the traditional algorithm in the resource utilization, average transmission rate, and UE average queuing time.
For the easy and flexible management of large scale networks, Software-Defined Networking (SDN) is a strong candidate technology that offers centralisation and programmable interfaces for making complex decisions in a dynamic and seamless manner. On the one hand, there are opportunities for individuals and businesses to build and improve services and applications based on their requirements in the SDN. On the other hand, SDN poses a new array of privacy and security threats, such as Distributed Denial of Service (DDoS) attacks. For detecting and mitigating potential threats, Machine Learning (ML) is an effective approach that has a quick response to anomalies. In this article, we analyse and compare the performance, using different ML techniques, to detect DDoS attacks in SDN, where both experimental datasets and self-generated traffic data are evaluated. Moreover, we propose a simple supervised learning (SL) model to detect flooding DDoS attacks against the SDN controller via the fluctuation of flows. By dividing a test round into multiple pieces, the statistics within each time slot reflects the variation of network behaviours. And this ”trend” can be recruited as samples to train a predictor to understand the network status, as well as to detect DDoS attacks. We verify the outcome through simulations and measurements over a real testbed. Our main goal is to find a lightweight SL model to detect DDoS attacks with data and features that can be easily obtained. Our results show that SL is able to detect DDoS attacks with a single feature. The performance of the analysed SL algorithms is influenced by the size of training set and parameters used. The accuracy of prediction using the same SL model could be entirely different depending on the training set.
Recently, with the development of Internet of Things (IoT) technology, the devices with the various features of traffic and mobility are increasing exponentially, and now the existing traditional resource allocation algorithms are becoming more and more difficult to meet the ever-increasing demand for terminal transmission. Aiming at the problem of radio resource fragment for complex access users of existing traditional algorithms, this paper proposes a dynamic scheduling algorithm based on Double Deep Q-learning Network(DDQN). At the same time, we design and simulate the NPUSCH transmission environment of the NB-IoT as the interactive environment of the agent. After training iterations, the resource utilization rate of the dynamic scheduling algorithm based on DDQN can be stabilized above 81%, which is better than traditional scheduling algorithms.
Full-text available
Reinforcement learning (RL) methods can successfully solve complex optimization problems. Our article gives a systematic overview of major types of RL methods, their applications at the field of Industry 4.0 solutions, and it provides methodological guidelines to determine the right approach that can be fitted better to the different problems, and moreover, it can be a point of reference for R&D projects and further researches.
Full-text available
The cellular-based infrastructure is regarded as one of potential solutions for massive Internet of Things (mIoT), where the Random Access (RA) procedure is used for requesting channel resources in the uplink data transmission. Due to the nature of mIoT network with the sporadic uplink transmissions of a large amount of IoT devices, massive concurrent channel resource requests lead to a high probability of RA failure. To relieve the congestion during the RA in mIoT networks, we model RA procedure, and analyze as well as evaluate the performance improvement due to different RA schemes, including power ramping (PR), back-off (BO), access class barring (ACB), hybrid ACB and back-off schemes (ACB&BO), and hybrid power ramping and back-off (PR&BO). To do so, we develop a traffic-aware spatio-temporal model for the contention-based RA analysis in the mIoT network, where the signal-to-noise-plus-interference ratio (SINR) outage and collision events jointly determine the traffic evolution and the RA success probability. Compared with existing literature only modelled collision from single cell perspective, we model both SINR outage and the collision from the network perspective. Based on this analytical model, we derive the analytical expression for the RA success probabilities to show the effectiveness of different RA schemes. We also derive the average queue lengths and the average waiting delays of each RA scheme to evaluate the packets accumulation status and packets serving efficiency. Our results show that our proposed PR&BO scheme outperforms other schemes in heavy traffic scenario in terms of the RA success probability, the average queue length, and the average waiting delay.
Conference Paper
Full-text available
Narrowband IoT (NB-IoT) is the latest IoT connec-tivity solution presented by the 3GPP. NB-IoT introduces coverage classes and introduces a significant link budget improvement by allowing repeated transmissions by nodes that experience high path loss. However, those repetitions necessarily increase the energy consumption and the latency in the whole NB-IoT system. The extent to which the whole system is affected depends on the scheduling of the uplink and downlink channels. We address this question, not treated previously, by developing a tractable model of NB-IoT access protocol operation, comprising message exchanges in random-access, control, and data channels, both in the uplink and downlink. The model is then used to analyze the impact of channel scheduling as well as the interaction of coexisting coverage classes, through derivation of the expected latency and battery lifetime for each coverage class. These results are subsequently employed in investigation of latency-energy tradeoff in NB-IoT channel scheduling as well as determining the optimized operation points. Simulations results show validity of the analysis and confirm that there is a significant impact of channel scheduling on latency and lifetime performance of NB-IoT devices.
Narrowband Internet of Things (NB-IoT) is a prominent technology that fits the requirements of future Internet of Things (IoT) networks. However, due to the limited spectrum (i.e., 180 kHz) available to NB-IoT systems, one of the key issues is how to efficiently use these resources to support massive numbers of IoT devices. Furthermore, in NB-IoT, to reduce computational complexity and to provide coverage extension, the concepts of time offset and repetition have been introduced. Considering these new features, existing resource management schemes are no longer applicable. Moreover, the allocation of the frequency band for NB-IoT within the LTE band, or as a standalone deployment, might not be synchronous in all cells, resulting in inter-cell interference (ICI) from the neighbouring cells' LTE users or NB-IoT users (in the synchronous case). In this paper, a theoretical framework for the upper bound on the achievable data rate is first formulated in the presence of a control channel and a repetition factor. The conducted analysis shows that the maximum achievable data rates are 89.2 kbps and 92 kbps for the downlink and uplink, respectively. Secondly, we propose an interference-aware resource allocation for NB-IoT by formulating a rate maximization problem that accounts for the overhead of control channels, the time offset, and the repetition factor. Due to the complexity of finding the globally optimal solution of the formulated problem, a sub-optimal solution with an iterative algorithm based on cooperative approaches is proposed. The proposed algorithm is then evaluated to investigate the impact of the repetition factor, time offset, and ICI on the NB-IoT data rate and energy consumption. Furthermore, a detailed comparison between the non-cooperative, cooperative, and optimal (i.e., no repetition) schemes is also presented. Simulation results show that the cooperative scheme provides up to 8% rate improvement and 17% energy reduction compared with the non-cooperative scheme.
NarrowBand-Internet of Things (NB-IoT) is a radio access technology recently standardized by 3GPP. To provide reliable connections with extended coverage, a repetition transmission scheme is applied in both the Random Access CHannel (RACH) procedure and data transmission. In this letter, we model the RACH in the NB-IoT network using stochastic geometry, taking into account repeated preamble transmissions and collisions. We derive the exact expression for the RACH success probability under time-correlated interference, and validate the analysis for different repetition values via independent simulations. Numerical results show that the repetition scheme can efficiently improve the RACH success probability in a light traffic scenario, but only slightly improves that performance, with very inefficient channel resource utilization, in a heavy traffic scenario.
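The light-versus-heavy traffic effect described above can be reproduced with a toy Monte-Carlo sketch. Collision and outage are drawn independently here, which is a simplification of the letter's time-correlated interference analysis, and the per-copy outage probability and all sizes are assumed for illustration:

```python
import random

def rach_success(n_devices, n_preambles, reps, per_copy_outage=0.5,
                 trials=3000, seed=0):
    """Fraction of devices whose preamble avoids collision AND has at least
    one of its `reps` repeated copies survive an outage event."""
    rng = random.Random(seed)
    served = 0
    for _ in range(trials):
        picks = [rng.randrange(n_preambles) for _ in range(n_devices)]
        counts = {}
        for p in picks:
            counts[p] = counts.get(p, 0) + 1
        for p in picks:
            # decoding succeeds if any one of the repeated copies gets through
            decoded = rng.random() < 1.0 - per_copy_outage ** reps
            if counts[p] == 1 and decoded:
                served += 1
    return served / (trials * n_devices)

# Repetition gain is large under light traffic but marginal under heavy
# traffic, where collisions (which repetition cannot fix) dominate.
light_gain = rach_success(5, 54, reps=4) - rach_success(5, 54, reps=1)
heavy_gain = rach_success(150, 54, reps=4) - rach_success(150, 54, reps=1)
```

The sketch also hints at the resource-efficiency point: in the heavy-traffic case, quadrupling the transmitted copies buys only a small success-probability gain.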
We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the "lazy agent" problem, which arises due to partial observability. We address these problems by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially-observable multi-agent domains and show that learning such value-decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.
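A minimal tabular sketch of the value-decomposition idea above: the joint action-value of two cooperating agents is modelled as the SUM of per-agent values, Q_tot(s, a1, a2) = Q1(s, a1) + Q2(s, a2), trained from a single team reward. The sizes, learning rates, and toy task are illustrative assumptions, not the paper's neural architecture:

```python
import numpy as np

n_states, n_actions = 2, 3
q1 = np.zeros((n_states, n_actions))
q2 = np.zeros((n_states, n_actions))

def greedy_joint_action(s):
    # Because Q_tot is additive, its joint argmax decomposes into independent
    # per-agent argmaxes -- the property that makes decentralised execution cheap.
    return int(q1[s].argmax()), int(q2[s].argmax())

def td_update(s, a1, a2, team_reward, s_next, alpha=0.1, gamma=0.9):
    # One TD(0) step on the summed value: the single team reward is credited
    # to both agents through the shared temporal-difference error.
    target = team_reward + gamma * (q1[s_next].max() + q2[s_next].max())
    td_err = target - (q1[s, a1] + q2[s, a2])
    q1[s, a1] += alpha * td_err
    q2[s, a2] += alpha * td_err

# Toy cooperative task: the team is rewarded only when BOTH agents pick
# action 0 in state 0; random exploration plus the shared TD error suffices.
rng = np.random.default_rng(0)
for _ in range(5000):
    a1, a2 = int(rng.integers(n_actions)), int(rng.integers(n_actions))
    td_update(0, a1, a2, 1.0 if (a1, a2) == (0, 0) else 0.0, 0, alpha=0.05)
```

The additive form cannot represent the AND-shaped reward exactly, but the best additive fit still ranks the cooperative joint action highest, which is all greedy execution needs.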
In smart city applications, huge numbers of devices need to be connected in an autonomous manner. The 3rd Generation Partnership Project (3GPP) specifies that Machine Type Communication (MTC) should be used to handle data transmission among large numbers of devices. However, the data transmission rates are highly variable, and this brings about a congestion problem. To tackle this problem, the use of Access Class Barring (ACB) is recommended, restricting the number of access attempts allowed in data transmission via strategic parameters. In this paper, we model the problem of determining these strategic parameters with a reinforcement learning algorithm. In our model, the system evolves to minimize both the collision rate and the access delay. The experimental results show that our scheme improves system performance in terms of the access success rate, the failure rate, the collision rate, and the access delay.
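A hedged sketch of the idea above: a tabular Q-learning agent selects the ACB barring factor in each access slot, rewarded for successes and penalized for collisions. The contention model, reward shaping, and every constant here are illustrative assumptions, not the paper's setup:

```python
import random

ACB_FACTORS = [0.1, 0.3, 0.5, 0.7, 1.0]   # action space: candidate barring factors
N_PREAMBLES, N_DEVICES = 54, 200          # overloaded cell: devices >> preambles

def access_slot(acb, rng):
    """One contention round: return (successful devices, collided preambles)."""
    picks = {}
    for _ in range(N_DEVICES):
        if rng.random() < acb:            # device passes the ACB check
            p = rng.randrange(N_PREAMBLES)
            picks[p] = picks.get(p, 0) + 1
    succ = sum(1 for c in picks.values() if c == 1)
    coll = sum(1 for c in picks.values() if c > 1)
    return succ, coll

def train(episodes=3000, alpha=0.1, eps=0.1, seed=1):
    rng = random.Random(seed)
    q = [0.0] * len(ACB_FACTORS)          # stateless problem: one Q per action
    for _ in range(episodes):
        if rng.random() < eps:            # epsilon-greedy exploration
            a = rng.randrange(len(ACB_FACTORS))
        else:
            a = max(range(len(ACB_FACTORS)), key=q.__getitem__)
        succ, coll = access_slot(ACB_FACTORS[a], rng)
        reward = succ - 0.5 * coll        # reward access success, punish collisions
        q[a] += alpha * (reward - q[a])   # running Q estimate
    return q

q = train()
best = ACB_FACTORS[max(range(len(ACB_FACTORS)), key=q.__getitem__)]
```

Under this overload, the learned barring factor is well below 1.0: the agent discovers that admitting everyone maximizes collisions rather than successes.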
Narrowband Internet of Things (NB-IoT) is a new access technology introduced by 3GPP. This paper presents an analytical model to estimate the access success probability and average access delay of the random access channels, considering the maximum number of preamble transmissions, the size of the backoff window, and the number of sub-carriers in each coverage enhancement (CE) level. A joint optimization technique is proposed to configure these parameters so as to maximize the access success probability under a target delay constraint. The accuracy of the analysis and the effectiveness of the proposed optimization technique are verified by computer simulations and benchmarked against exhaustive search. The results show that the proposed optimization is able to find the optimal configuration under various conditions.
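The joint configuration search can be sketched as a small exhaustive enumeration over (maximum preamble transmissions, backoff window) pairs, keeping only configurations that meet the delay target and picking the one with the highest modelled success probability. The closed-form access model below is a deliberately simplified stand-in for the paper's analysis; the failure probabilities, window sizes, and delay model are all assumptions:

```python
def access_model(max_tx, backoff_w, slot_ms=1.0):
    """Return (success probability, mean access delay in ms) for one config."""
    # larger backoff windows de-synchronise retries, so the per-attempt
    # failure probability is modelled as decreasing in the window size
    p_fail = 0.5 / (1.0 + 0.1 * backoff_w)
    p_succ = 1.0 - p_fail ** max_tx
    # expected attempts: attempt k+1 happens with probability p_fail**k;
    # each attempt costs one slot plus, on average, half the backoff window
    mean_attempts = sum(p_fail ** k for k in range(max_tx))
    delay = mean_attempts * (slot_ms + backoff_w / 2.0)
    return p_succ, delay

def best_config(delay_target_ms):
    """Exhaustively search configurations under the delay constraint."""
    best = None
    for max_tx in range(1, 11):
        for backoff_w in (0, 2, 4, 8, 16):
            p, d = access_model(max_tx, backoff_w)
            if d <= delay_target_ms and (best is None or p > best[0]):
                best = (p, d, max_tx, backoff_w)
    return best
```

The constraint is what makes the problem non-trivial: the largest window gives the best per-attempt reliability in this model, yet it is excluded once its retry delay overshoots the target.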
Narrowband Internet of Things (NB-IoT) is a new narrowband radio technology introduced in Third Generation Partnership Project (3GPP) Release 13 towards the 5th generation (5G) evolution, providing low-power wide-area Internet of Things (IoT). In NB-IoT systems, repeating the transmission of data or control signals has been considered a promising approach for enhancing coverage. Taking into account this new repetition feature, link adaptation for NB-IoT systems needs to be performed in two dimensions, i.e., the modulation and coding scheme (MCS) and the repetition number; existing link adaptation schemes that do not consider the repetition number are therefore no longer applicable. In this paper, a novel uplink link adaptation scheme with repetition number determination is proposed, composed of inner-loop and outer-loop link adaptation, to guarantee transmission reliability and improve the throughput of NB-IoT systems. In particular, the inner loop is designed to cope with Block Error Ratio (BLER) variation by periodically adjusting the repetition number, while the outer loop coordinates the MCS level selection and the repetition number determination. In addition, key uplink scheduling technologies such as power control and transmission gaps are analyzed, and a simple single-tone scheduling scheme is proposed. Link-level simulations are performed to validate the performance of the proposed uplink link adaptation scheme. The results show that it outperforms the repetition-dominated and straightforward methods, particularly for good channel conditions and larger packet sizes. Specifically, it saves more than 14% of the active time and resource consumption compared with the repetition-dominated method, and more than 46% compared with the straightforward method.
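The inner-loop mechanism can be sketched schematically: the loop nudges the repetition number up when measured BLER overshoots a target and down when link quality is ample. The logistic BLER curve, the 3 dB-per-doubling combining gain, and all thresholds below are illustrative assumptions, not constants from the paper:

```python
import math

REPETITIONS = [1, 2, 4, 8, 16, 32]  # allowed repetition numbers (illustrative)

def bler(mcs, rep, snr_db):
    # Toy model: each doubling of repetitions buys ~3 dB of combining gain,
    # each MCS step costs ~2 dB; a logistic maps effective SNR to BLER.
    eff_snr = snr_db + 3.0 * math.log2(rep) - 2.0 * mcs
    return 1.0 / (1.0 + math.exp(eff_snr))

def inner_loop(rep_idx, measured_bler, target=0.1):
    # Raise the repetition number when BLER overshoots the target; lower it
    # (with a target/2 hysteresis band) when the link has margin to spare.
    if measured_bler > target and rep_idx < len(REPETITIONS) - 1:
        return rep_idx + 1
    if measured_bler < target / 2 and rep_idx > 0:
        return rep_idx - 1
    return rep_idx

def adapt(snr_db, mcs=4, rep_idx=0, steps=20):
    # Drive the inner loop with "measured" BLER drawn from the toy model.
    for _ in range(steps):
        rep_idx = inner_loop(rep_idx, bler(mcs, REPETITIONS[rep_idx], snr_db))
    return REPETITIONS[rep_idx]
```

In a full two-loop design, an outer loop would additionally re-select the MCS for the repetition level the inner loop settles on; here only the inner loop is sketched. At low SNR the loop climbs to a high repetition number, while at high SNR a single transmission already meets the BLER target.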
This letter proposes an efficient small data transmission scheme for the NarrowBand Internet of Things (NB-IoT) system. For efficient use of radio resources, the proposed scheme enables devices in the idle state to transmit a small data packet without the radio resource control connection setup process. This can increase the maximum number of supportable devices in the NB-IoT system, which has insufficient radio resources. Numerical results show that the proposed scheme can increase the maximum number of supportable devices by about 60% compared with the conventional scheme.