Partially Cooperative Scalable Spectrum Sensing in
Cognitive Radio Networks under SDF Attacks
Sadia Khaf, Student Member, IEEE, Mohammad T. Alkhodary, and Georges Kaddoum, Senior Member, IEEE
Abstract—Massive Internet of Things (IoT) connectivity requires addressing the spectrum congestion caused by spectrum scarcity in wireless communications. Over the past decade, cognitive radio has been proposed as a promising solution to utilize the licensed spectrum efficiently. Conventional spectrum sensing approaches are complex and require statistical information about the behavior of licensed users, which is impractical. To overcome this limitation, several reinforcement learning (RL)-based spectrum sensing approaches have been proposed that are also highly adaptable to the dynamics of IoT environments. Additionally, cooperative RL-based spectrum sensing approaches have been widely used because they are more accurate than non-cooperative approaches. However, this advantage comes at the cost of scalability due to the increased information sharing overhead. Furthermore, cooperative spectrum sensing (CSS) approaches suffer from attacks on the network, such as sensing data falsification (SDF) attacks, which dramatically deteriorate sensing accuracy. In this paper, we present a scalable, partially cooperative spectrum sensing algorithm that is highly resilient to SDF attacks. The novelty of the proposed algorithm lies in partial cooperation through coalition formation, which reduces sensing and information sharing overhead while improving sensing accuracy. Moreover, the algorithm learns to adapt the sensing participation percentage and selects the most rewarding channel for sensing to maximize rewards while minimizing energy consumption. The proposed algorithm outperforms state-of-the-art CSS algorithms in terms of sensing accuracy and overhead. Contrary to centralized CSS algorithms, the proposed algorithm's performance is directly proportional to the number of devices; hence, it is suitable for massive connectivity.

Index Terms—Smart spectrum sensing, cognitive radio Internet of Things (CR-IoT), channel utilization, energy efficiency.
I. INTRODUCTION
THE number of Internet of Things (IoT) devices is ex-
pected to grow to 41.6 billion devices, generating 79.4
zettabytes (ZB) of data, in 2025 [1]. Therefore, the need
for intelligent resource management and efficient bandwidth
utilization is greater than ever before [2]. The radio spectrum is
divided into licensed bands, which are used mainly by cellular,
radio and television networks, and unlicensed bands occupied
primarily by IoT devices. The rapid growth in the number
of IoT devices will quickly consume all available unlicensed
bands, leading to congestion and packet drops. The concept of
cognitive radio (CR) has become much more popular in recent
years to solve the problem of congestion in the unlicensed
spectrum [3]. In CR, secondary (unlicensed) users (SUs) are
Sadia Khaf, Mohammad T. Alkhodary, and Georges Kaddoum are with the Department of Electrical Engineering, École de technologie supérieure, Montreal, QC, H3C 1K3 Canada. e-mail: sadia.khaf.1@ens.etsmtl.ca
Manuscript received xxxxx xx, 2021; revised xxxx xx, 2021.
allowed to utilize the licensed spectrum when it is not occupied
by primary (licensed) users (PUs). Ordinary IoT devices would
inevitably need to be transformed into cognitive IoT devices to
solve spectrum congestion [2]. Introducing CR to IoT opened
the door to massive connectivity in IoT networks and promised
support for an unprecedented number of sensors and devices.
The concept of CR-IoT emerged as a promising solution that
leverages the vacant spaces in the licensed spectrum to solve
the congestion problem in the unlicensed one. The authors
of [4] survey the CR-IoT architectures and frameworks and
propose potential applications. However, CR algorithms face significant challenges in the context of IoT, such as sensing accuracy, energy consumption, and network scalability [5].
Spectrum sharing in the context of CR-IoT relies heavily
on spectrum sensing accuracy to minimize harmful interfer-
ence to PUs caused by SUs. Sensing accuracy is commonly
compromised by device accuracy and sensing data falsification
(SDF) attacks. The former is caused by hardware limitations
of the sensing device, and the latter is due to malicious
attackers falsifying the sensing results of SUs, leading to the
SUs learning incorrect channel statuses [6]. An attack of the
latter type typically affects only a few SUs, so majority voting is an effective strategy to mitigate it and thereby improve sensing accuracy. Cooperative spectrum sensing (CSS) was
introduced to improve the sensing accuracy of individual
spectrum sensing devices [7]. In this vein, several multi-agent
deep learning approaches have been proposed to provide a
high level of accuracy using CSS [8], [9], [10]. The authors
of [8] propose an SU selection strategy that chooses an SU
to sense the channel status based on energy detection and
shares the results with other SUs. The authors of [9] also
use deep reinforcement learning to efficiently explore the
radio environment through CSS. However, these approaches
focus on full cooperation among SUs to improve accuracy
and are not scalable to massive IoT networks, and hence have
limited performance under high load factors. Therefore, an
energy efficient approach that maintains a high level of sensing
accuracy under various load factors is needed to support the
future demands of IoT networks.
Current state-of-the-art spectrum sensing techniques con-
sume a lot of time and energy, which makes it impractical
to implement them with low-powered IoT devices [11]. To
overcome this limitation, spectrum sensing using evolutionary
game theory (EGT) approaches gives SUs the freedom to be
free riders to conserve their energy [12], but suffers from
inaccurate sensing due to probabilistic environment model-
ing. On the other hand, spectrum sensing using multi-agent
reinforcement learning (RL) trains the agents (SUs) through
rewards from interaction with the environment to provide a
realistic environment and accurate sensing [8]. However, the
aforementioned approaches consume a lot of time and energy
under a high load factor [13]. Thus, reducing the energy
consumption (for sensing and collaboration) of multi-agent
CSS is an open research problem.
To tackle CR-IoT’s sensing accuracy, energy consumption
and scalability challenges, this paper presents a novel partially
cooperative multi-agent RL (PCMARL) spectrum sensing
algorithm. Partial cooperation is achieved through coalition
formation, which restricts sensing and information sharing to
small subsets of SUs. The SUs in these subsets collaborate and
use majority voting to fight SDF attacks and improve sensing
accuracy as well as reduce sensing overhead. Our approach
teaches the SUs to optimize their sensing participation by
learning to be free riders to conserve energy while maximizing
rewards. We also give SUs the freedom to select a channel to
sense based on reward history, which gives them the flexibility
to switch channels when a channel is frequently busy or due
to SU mobility.
Novelty and Contribution
This paper presents the first scalable, energy efficient, RL-
based algorithm that takes advantage of coalition formation
and majority voting for reliable spectrum sensing in the
presence of SDF attacks. The contribution of this paper can
be summarized as follows:
• The PCMARL algorithm provides agents with a realistic model of the radio environment, rather than a probabilistic one as in the case of EGT-based algorithms. In other words, the environment is more realistic in the sense that it does not rely on SUs' probability of detection and probability of false alarm, but rather calculates the rewards through the agent's interaction with the radio environment.
• We provide a coalition-based partial cooperation strategy that reduces the amount of energy spent physically sensing the channel and sharing the channel status with other SUs.
• Contrary to traditional CSS approaches, the PCMARL algorithm does not rely on a fusion center to calculate the majority vote. Instead, the SUs in each coalition compute the majority vote locally, which makes the PCMARL algorithm scalable to massive connectivity.
• This is the first approach that gives SUs the freedom to choose the channel for sensing based on channel-specific reward history and teaches them to be free riders to conserve energy.
• In contrast to state-of-the-art spectrum sensing algorithms, the PCMARL algorithm is the first approach whose performance is directly proportional to the number of devices in the network, making it suitable for massive connectivity.
The remainder of this paper is organized as follows. Section
II provides a literature review of the most relevant works. The
system model and the proposed PCMARL spectrum sensing
algorithm are presented in Section III. Section IV discusses the
simulation results. Finally, Section V concludes this work.
II. RELATED WORKS
The challenges of spectrum sensing in the context of IoT
networks can be summarized as spectrum sensing scheduling,
sensing time minimization, energy consumption minimization,
enabling massive connectivity, and maintaining a high level
of accuracy in the presence of SDF attacks [14], [6]. The
solutions proposed to address some of these challenges can be
classified into three categories based on the approach used: A)
Optimization and Heuristic Approaches, B) Game-Theoretic
Approaches, and C) Machine Learning Approaches. A brief
comparison of these approaches in terms of methodology,
advantages, and disadvantages, is presented in Table I with
corresponding references. The key aspects of each approach
are discussed below.
A. Optimization and Heuristic Approaches
Optimization and heuristic approaches were among the
earliest approaches used for CSS [15]. More recently, the authors of [16], [17], [18] have attempted sub-carrier allocation subject to delay, maximum power, and interference constraints. The
proposed methods are computationally expensive. They work
well with a limited number of IoT devices; however, scaling
them to a large number of devices makes them too complex.
The authors of [19] used a Hidden Markov Model (HMM)
for CSS and spectrum hand-off in a CR network (CRN);
however, the fully cooperative nature of the solution makes
it unscalable. In short, classical optimization methods are not
suitable for CR-IoT applications since a complete model of
the environment is not available [29], they are not scalable to
support massive IoT connectivity [30], [31], and they cannot
adapt to the dynamic and evolving environment [31].
B. Game Theoretic Approaches
The use of game theoretic approaches for dynamic spectrum
allocation is studied in [20], and both cooperative and non-
cooperative game models are explored. The application of
EGT for CSS is presented in [21]. The authors present a novel
concept of “free will” for the SUs to participate in spectrum
sensing to reduce energy consumption. An extension of the
work that allows SUs to join multiple coalitions at the same
time is studied in [23]. Adaptive random access for collecting
sensing data is proposed in [22] to optimize sensing time.
In all the game theoretic approaches mentioned above, the
rewards are calculated from SUs’ probability of detection and
probability of false alarm, which provide an unrealistic repre-
sentation of the environment. Additionally, game theoretic ap-
proaches assume homogeneous rewards, are highly complex,
and have poor convergence and low flexibility. Furthermore,
game theoretic approaches can be challenging to implement
in practical scenarios due to the difficulty of distributing the
game rule for the players [32].
TABLE I: Literature on Spectrum Sensing: Benefits and Drawbacks

Optimization and Heuristic Approaches. Methodology: Simplified model of the environment to formulate spectrum sensing as a classic optimization problem; optimization is done using genetic algorithms, e.g., particle swarm optimization. Advantages: Fast implementation with a closed-form solution. Disadvantages: Incomplete environment model; difficult to scale in dynamic IoT environments; does not support heterogeneous players. References: [15], [16], [17], [18], [19].

Game Theoretic Approaches. Methodology: Distributing resources among competing players to reach a Nash equilibrium. Advantages: Fair distribution of resources; dynamically varying in time. Disadvantages: Slow convergence; may not reach equilibrium; difficult to scale in dynamic IoT environments. References: [20], [21], [22], [23].

Machine Learning Approaches. Methodology: Learning from data or the environment to make decisions autonomously; improves with experience. Advantages: No prior environment model is needed; learns complex patterns in the data; reacts to a dynamic environment. Disadvantages: Relies on labeled data; needs training with large datasets in the case of deep learning (DL); curse of dimensionality; supervised machine learning is difficult to scale in dynamic IoT environments. References: [24], [25], [26], [27], [28].

C. Machine Learning Approaches

Several deep learning frameworks for spectrum sensing are proposed in [24], [25], [26] to improve the energy efficiency
of the system and provide a low-complexity alternative to
classical optimization methods. A survey of supervised and
unsupervised machine learning methods for spectrum sensing
can be found in [27]. Our earlier works [33], [34], [35]
provide a foundation for resource sharing among competing
IoT devices using deep learning. Similarly, the authors of
[36] propose a reinforcement learning method, SARSA, for
resource provisioning and horizontal container scaling in fog
computing. The authors further extend their work to deep
reinforcement learning in [37], [38] proposing deep Q-learning
as an alternative to heuristic methods for service provisioning
to IoT devices due to the NP hardness of the problem and
demonstrating the algorithm’s efficiency in making proac-
tive placement and scaling decisions. Several supervised and
unsupervised machine learning approaches such as K-means
clustering, Gaussian mixture model, support vector machine, and
K-nearest neighbors are proposed for opportunistic spectrum
access in [28]. Nevertheless, it should be highlighted that data-
driven machine learning approaches suffer from a number of
inherent drawbacks such as lack of sufficient labeled data,
partial observability of the environment, and lack of support
for incorporating delayed feedback from the environment.
Such shortcomings prevent machine learning algorithms from
being an ideal solution for spectrum allocation in a dynamic
radio environment and motivate the use of RL.
The authors of [39] propose to use RL for spectrum sensing
in a dynamic IoT environment and analyze and compare the
performance of ε-greedy and upper confidence bound (UCB)
exploratory strategies. The idea of using RL with UCB is
further explored in [9], where the authors created several PU
traffic models such as bursty, legacy, and frequency hopping
patterns to show the adaptability of RL-based spectrum sens-
ing methods. The use of RL to facilitate the coexistence of
long-term evolution license-assisted access (LTE-LAA) and
IEEE 802.11 Wi-Fi systems is studied in [40], where agents
compete for channel access, considering the effects of MAC
and physical layers. The authors of [41] proposed a spatial-
correlation-based SU selection mechanism that selects the
most suitable SU for local sensing.
An RL-based multi-agent CSS problem is studied in [42].
A distributed model is used that assumes perfect channel
information. Since the approach is distributed, it is easily
scalable to massive IoT networks. It is assumed that there
will always be only one PU in six channels with a 20%
probability that the PU remains on the same channel and an
80% probability that it switches channel. This is an unrealistic
assumption since, in practice, there can be more than one PU
transmitting at the same time. The authors of [43], [44] also
propose a multi-agent RL framework for spectrum sensing in
CR-IoT. The use of RL for spectrum sensing is also advocated
in more recent works due to the RL agents’ ability to quickly
adapt to a highly dynamic IoT sensor environment [45], [8],
[46], [47], [48], [49].
Cooperative RL approaches exhibit superior performance in
terms of accuracy. To implement cooperative RL algorithms
for low-power CR-IoT devices, a significant reduction in
energy consumption, sensing overhead, and sensing time is
needed. This can be achieved by using coalition formation
and majority vote, and sharing sensing belief with a limited
number of neighbors as shown herein.
Sensing Falsification
The sensing data of individual SUs is vulnerable to sensing
falsification due to device sensing inaccuracies and SDF
attacks. Therefore, the belief-sharing aspect of CSS introduces
new security threats. In an SDF attack, malicious attackers
send false local sensing results to mislead the cooperating
SUs [6]. Attacks may result in either excessive interference
in the PU network or a decrease in spectrum utilization [50].
Attacks of this type are carried out by sparse agents, and
majority vote is an effective strategy to mitigate them. In
addition to SDF attacks, SU sensing results are subject to
falsification due to device-hardware sensing inaccuracies. The
authors of [51] suggest using blockchains to mitigate the
effect of falsified sensing. The authors of [52] propose a
distributed trust model to discredit the malicious and selfish
nodes, thereby reducing the weight of their signals. In other
works [53], the authors discuss probabilistic SDF attacks and
a centralized fusion center (FC)-based mitigation strategy.
Similarly, the authors of [54] propose a sequential fusion-based
mitigation strategy that relies on an FC. The authors of [55]
also propose an FC-based network topology with three levels
of control, where the secondary controller is responsible for
resource sharing and management strategies as well as attack
detection and mitigation. Although an FC-based approach is optimized overall, it
creates a single point of failure and makes the approach non-
scalable. Moreover, the performance of FC-based approaches
degrades in terms of energy efficiency due to large information
sharing overhead. In contrast, we propose an energy efficient
partially cooperative SDF mitigation strategy that computes
the majority vote locally and improves both energy efficiency and accuracy as the number of devices grows.
III. PCMARL ALGORITHM-BASED SPECTRUM SENSING
This section presents the proposed system model, intro-
duces the notation used, explains the structure and flow of
the PCMARL algorithm, lists performance indicators, shows
the algorithm’s resilience against sensing falsification, and
analyzes its complexity.
A. System Model
The proposed CR environment consists of N_p PUs, N_s SUs, and M licensed channels, and we assume M = N_p. The internal clocks of all PUs and SUs are synchronized and sufficiently accurate [56]. The stochastic process governing PUs' access to channels can be modeled as a Markov decision process (MDP). Each PU transmits randomly with transmission length L = kt, where k is a random number and t is the transmission unit "time slot". The SUs have no prior knowledge of the transmission pattern or transmission length of the PUs, and must not interfere with the PUs. Each SU can sense only K channels, where K ≤ M, in each time slot due to energy and hardware limitations. Each SU's sensing results are subject to falsification with probability P due to SDF attacks and hardware sensing inaccuracy. The CR model is distributed, and there is no centralized unit to share sensing information among SUs.
Let S = {s_1, ..., s_{N_s}} be the SU set with N_s user "agents", and C = {c_1, ..., c_M} be a set of M channels occupied by N_p PUs. Each agent interacts with the environment and with other agents (a subset of S that senses the same channel) to collect information about channel occupancy and decide on its sensing policy. The sets of observations and actions are denoted by O = {0, 1, ..., T} and A = {not-sensed, idle, busy}, respectively, with sizes |O| and |A|. The "not-sensed" action represents the agent's action if it decides to be a free rider to conserve energy. The notation a^{s_i}_j ∈ A represents the action taken by agent s_i for channel j. The set D_j = {a^{s_1}_j, ..., a^{s_{N_j}}_j} represents the multi-set of actions of all agents in coalition j, where N_j is the number of agents sensing channel j. Similarly, r^{s_i}_j(o_t) ∈ {0, 1} denotes the reward of agent s_i as a result of selecting channel j for observation o_t ∈ O.
Interaction with the environment happens through coalition
formation, the sharing of sensing belief Dj, majority vote,
and the exchange of rewards. Each agent keeps a record of its
own actions and its rewards from each channel. In each time
slot, the agent picks a channel to sense based on its reward
history. All agents sensing channel j are considered coalition C_j, and they can join or leave a coalition based on their quality
of service (QoS), as depicted in Figure 1.

Fig. 1: The proposed CR environment: all SUs that are trying to access the same channel are considered a coalition. Note that this approach is distributed since there is no centralized control unit or fusion center. SUs can join or leave a coalition based on their QoS.

After joining a coalition, each agent calculates the best action, i.e., channel
status, based on the Q-table and shares it with all members
in the coalition. All agents in the coalition then calculate the
majority vote of the coalition locally using the channel status
shared by other members and update their actions based on
the majority vote. Lastly, the agents receive a reward from
the environment if their predicted channel status matches the
actual status and update their action-reward history.
This paper's PCMARL algorithm is inspired by both EGT and RL in the following ways. From an EGT perspective, the work uses the concepts of coalition formation, belief sharing, and majority vote. From an RL perspective,
the proposed algorithm uses belief sharing, Q-learning, and
rewards through interaction with the environment. Thus, the
algorithm uses belief sharing, which is common to cooperative
games and cooperative RL approaches, to provide a unified
and more realistic model of the radio environment.
B. PCMARL Spectrum Sensing Algorithm
Algorithm 1 highlights the important steps in the PCMARL
spectrum sensing algorithm. Additionally, Figure 2 depicts the
flow of the algorithm. The input parameters α and γ define
the learning rate and the weight associated with the temporal
difference in the Q-learning algorithm, respectively. The Q-
table and agent histories are initialized to zero so that the
initial actions of all agents are always exploratory.
The PCMARL algorithm uses ε-greedy to balance exploration and exploitation such that the agent takes an exploratory action with probability ε and chooses the greedy action, i.e., the action with the highest Q-value, with probability 1 − ε. However, the value of ε decreases exponentially from 1 to 0 with the number of episodes rather than being constant.
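As an illustration, a minimal sketch of this decaying ε-greedy rule is given below; the exponential decay constant is an assumed value, since the paper does not specify it.

import math
import random

def epsilon(episode, decay=0.005):
    # Exploration probability decaying exponentially from 1 towards 0 with
    # the episode index; the decay constant is an illustrative choice.
    return math.exp(-decay * episode)

def epsilon_greedy(q_values, episode):
    # Explore with probability epsilon(episode); otherwise act greedily on
    # the list of Q-values for the current observation.
    if random.random() <= epsilon(episode):
        return random.randrange(len(q_values))                  # exploratory action
    return max(range(len(q_values)), key=q_values.__getitem__)  # greedy action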
The Q-table holds Q-values for all actions and all obser-
vations for each channel. In the case of a random action, the
agent randomly selects the channel and the channel status,
whereas in the case of a greedy action, the agent picks the
best channel to sense based on channel reward history and the
channel status based on the Q-table.

Algorithm 1: PCMARL Cooperative Spectrum Sensing
  Algorithm parameters: learning rate α ∈ (0, 1], γ, majority-vote flag v ∈ [0, 1]
  Initialize Q(o, a) for all o ∈ O, a ∈ A
  Initialize R(o, s) for all o ∈ O, s ∈ S
  foreach episode do
    Reset environment
    foreach step of episode do
      if random float ≤ ε(episode count) then
        Sample a from A
      else
        Choose channel with max(R_c)
        Choose action for the channel from the Q-table
      if v == 1 then
        Form coalition C_j: s_i ∈ C_j if s_i sensed channel j
        Calculate the majority decision for each coalition
        Update the action of each s_i based on the majority vote
      foreach s_i ∈ S do
        Take action a, observe R, o'
        Q(o, a) ← Q(o, a) + α[R + γ max_a Q(o', a) − Q(o, a)]
        o ← o'
The total reward R^{s_i}_j that SU s_i obtains from channel j is

R^{s_i}_j = \sum_{t=1}^{t_o} r^{s_i}_j(o_t),    (1)

where o_t and t_o are the observation at time t and the current time, respectively. The average channel reward \bar{R}^{s_i}_j of s_i for channel j is calculated as

\bar{R}^{s_i}_j = R^{s_i}_j / K^{s_i}_j,    (2)

where K^{s_i}_j is the number of times that channel j was sensed in all past observations up to the current observation. In each time slot, each SU picks the channel with the highest average sensing reward to sense as follows:

c_j = \arg\max_j \bar{R}^{s_i}_j.    (3)

After selecting the best channel, the agent needs to choose the best action for that channel. The action, i.e., the channel status for channel j, is selected by SU s_i based on

a^{s_i}_j = \arg\max_{a \in A} Q(o_t, a).    (4)

Since the action space contains "free riding", "idle", and "busy" as actions, the one with the highest Q-value is selected according to Eq. (4).
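A compact sketch of this greedy channel and action selection (Eqs. (2)-(4)) for a single SU could look as follows; the NumPy array layout and the per-channel indexing of the Q-table are assumptions made for illustration, not the paper's data structures.

import numpy as np

def greedy_channel_and_action(total_reward, sense_count, q_table, obs):
    # total_reward[j], sense_count[j]: this SU's reward history on channel j.
    # q_table[j, o, a]: Q-values per channel, observation, and action
    # (0 = not-sensed, 1 = idle, 2 = busy); this indexing is illustrative.
    avg_reward = total_reward / np.maximum(sense_count, 1)  # Eq. (2), guard against /0
    channel = int(np.argmax(avg_reward))                    # Eq. (3)
    action = int(np.argmax(q_table[channel, obs]))          # Eq. (4)
    return channel, action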
Fig. 2: Flowchart of the proposed PCMARL cooperative sensing algorithm.

In a traditional Q-learning approach, an individual action chosen based on the Q-table is first compared to the PU
transmission, and then the agent receives a reward accordingly.
SUs' individual sensing results can be falsified with probability P due to SDF attacks and hardware sensing inaccuracy.
Coalition formation and majority vote provide a strong defense
against falsification, as explained in detail in Section III-D.
Once all agents in all coalitions have chosen their individual
actions, they share the channel status with the other members
of their coalition [42]. Each member of coalition C_j thus receives a multi-set D_j = {a^{s_1}_j, ..., a^{s_{N_j}}_j} containing the actions of all SUs in the coalition, some of which may have been falsified. Then, each agent calculates the majority action of the coalition as

d_j = mode(D_j),    (5)

and all members of the coalition update their actions as a^{s_i}_j = d_j. Once all members have updated their actions, their actions are compared to the actual transmission pattern of the PUs, and the agents receive a reward r^{s_i}_j ∈ {0, 1} from the radio environment for predicting the channel status. The reward is r^{s_i}_j = 1 for correctly predicting the channel status and r^{s_i}_j = 0 for not sensing or incorrectly predicting it. The goal of the SUs is to maximize their rewards. Each agent s_i then updates its Q-table as follows:
Q(o_t, a^{s_i}_j) = Q(o_t, a^{s_i}_j) + α [ r^{s_i}_j + γ \max_{a ∈ A} Q(o_{t+1}, a) − Q(o_t, a^{s_i}_j) ].    (6)

The agents continue to repeat the above steps until they reach the terminal observation, which marks the end of the episode. At that point, the environment is reset to the initial observation, a new episode begins, and the process repeats itself. The convergence of the algorithm is well established and can be found in [57], [58], [59]. The notations used to describe the above process are listed in Table II.

TABLE II: Table of notations

S = {s_1, ..., s_{N_s}} : set of SUs
C = {c_1, ..., c_M} : set of channels
s_i : SU i
N_s = |S| : number of SUs
M = |C| : number of channels
N_p = M : number of PUs
T : number of steps per episode (horizon)
t : time
o_t, t = 1, ..., T : observation (function of time)
P : probability of falsification due to SDF attack
a^{s_i}_j ∈ A : action of SU s_i for channel j
A = {0, 1, 2} : action space
O = {o_1, o_2, ..., o_T} : observation space
r^{s_i}_j(o_t) ∈ {0, 1} : reward of SU s_i from channel j in o_t
R^{s_i}_j : total reward of SU s_i from channel j
R^{s_i} : total reward of SU s_i regardless of the channel
\bar{R}^{s_i}_j : average channel reward of SU s_i
\hat{R}^{s_i} : total reward-per-contribution of SU s_i
K^{s_i}_j : number of times channel j was sensed by s_i
K^{s_i} : number of times SU s_i participated in sensing
Q^{s_i}(o_t, a^{s_i}_j) : Q-table of SU s_i
D_j = {a^{s_1}_j, ..., a^{s_{N_j}}_j} : multi-set of actions of all SUs in coalition C_j
{s_{N_1}, ..., s_{N_j}} ⊂ S : set of SUs in coalition C_j
d_j = mode(D_j) : majority decision of coalition C_j
N_j : number of SUs sensing channel j
η = N_s / N_p : load factor
L_j ∈ [1, 3] : transmission length of PU j
H^{s_i} : sensing overhead of SU s_i
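To make the per-agent update step concrete, the following sketch implements the majority vote of Eq. (5) and the tabular Q-learning update of Eq. (6); the dictionary-based Q-table layout is an illustrative choice, and the defaults α = γ = 0.1 are simply the values listed in Table III.

from collections import Counter

def majority_vote(reports):
    # Eq. (5): d_j = mode(D_j), computed locally by every coalition member.
    # 'reports' holds the (possibly falsified) channel statuses shared by the
    # contributors of coalition C_j; free riders are excluded beforehand.
    return Counter(reports).most_common(1)[0][0]

def q_update(q, obs, action, reward, next_obs, alpha=0.1, gamma=0.1):
    # Eq. (6): tabular Q-learning update; q maps (observation, action) pairs
    # to Q-values, with missing entries treated as zero.
    best_next = max(q.get((next_obs, a), 0.0) for a in (0, 1, 2))
    q[(obs, action)] = q.get((obs, action), 0.0) + alpha * (
        reward + gamma * best_next - q.get((obs, action), 0.0))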
C. Performance Indicators
Sensing accuracy is the first performance indicator used in
this work. It is defined as the number of correct predictions
normalized by the number of times the SU participated in
spectrum sensing, as shown below:
Sensing Accuracy = (Number of correct predictions / Total sensing contribution) × 100.    (7)
Sensing accuracy is calculated per SU, and the total sensing
contribution is an indicator of the energy consumed by the
SU for spectrum sensing and collaboration. The SU’s goal
is therefore to minimize this while maintaining a high level
of accuracy. In terms of RL, this is achieved by assigning
a small negative "step cost" to all actions other than the "not-sensed" action. However, two agents with different sensing
contributions can have the same sensing accuracy but different
sensing rewards. Therefore, the second performance indicator
used is the total reward per episode for each SU (regardless of the channel), R^{s_i}, which is defined as

R^{s_i} = \sum_{j=1}^{|C|} R^{s_i}_j.    (8)

Since the total reward alone does not reflect an SU's contribution, the reward-per-contribution \hat{R}^{s_i} is another performance indicator, defined as

\hat{R}^{s_i} = R^{s_i} / K^{s_i},    (9)

where K^{s_i} is the number of times s_i participated in spectrum sensing. The last performance indicator used is the sensing overhead H^{s_i}, which shows the amount of energy SUs consume in spectrum sensing without receiving any reward. Sensing overhead is defined as

H^{s_i} = 1 − R^{s_i} / K^{s_i}.    (10)
The PCMARL algorithm's performance is evaluated in more detail using the aforementioned indicators in Section IV.
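For reference, these per-SU indicators (Eqs. (7), (9), and (10)) follow directly from the episode counters, as in the short sketch below; the variable names are illustrative and a non-zero contribution count is assumed.

def performance_indicators(num_correct, contributions, total_reward):
    # num_correct:   correct channel-status predictions in the episode
    # contributions: K^{s_i}, slots in which the SU actually sensed (assumed > 0)
    # total_reward:  R^{s_i}, total reward collected in the episode
    accuracy = 100.0 * num_correct / contributions            # Eq. (7)
    reward_per_contribution = total_reward / contributions    # Eq. (9)
    sensing_overhead = 1.0 - total_reward / contributions     # Eq. (10)
    return accuracy, reward_per_contribution, sensing_overhead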
D. Resilience of the PCMARL Algorithm Against SDF Attacks
CR approaches are vulnerable to sensing inaccuracy due to
hardware limitations and SDF attacks. Majority vote provides
a strong defense against such inaccuracies since the coalition’s
probability of an incorrect decision P_j is always lower than P, where P is the probability that the sensing result reported by an SU has been falsified. We calculate P_j as

P_j = Pr(more than half of the agents in coalition C_j report falsified results)
    = \sum_{k=\lceil (1+|C_j|)/2 \rceil}^{|C_j|} Pr(k agents in C_j falsified).    (11)

The k agents from C_j who reported a falsified channel status form a subset C^k_j. Hence, there are \binom{|C_j|}{k} possibilities for such subsets. Therefore, the coalition's probability of an incorrect decision can be calculated as

P_j = \sum_{k=\lceil (1+|C_j|)/2 \rceil}^{|C_j|} \binom{|C_j|}{k} P^k (1 − P)^{|C_j|−k}.    (12)

Since P_j < P for all |C_j| > 1 and P ∈ (0, 0.5), majority vote is an effective remedy against SDF attacks. Figure 3 shows this phenomenon by plotting a coalition's probability of an incorrect decision P_j against P for various |C_j|. It also demonstrates that larger coalitions provide a stronger defense against SDF attacks, which is advantageous for supporting scalability. Furthermore, Eq. (12) provides a relationship between |C_j| and P_j, where |C_j| is the number of SUs that actually participated in sensing. This relationship helps determine the number of contributors required in a coalition to achieve a certain probability of an incorrect decision. Hence, coalition formation and majority vote not only provide a defense against SDF attacks but also enable greater energy savings by providing a threshold of
contributors required to achieve a certain level of accuracy. If there are more SUs in a coalition than required, the remaining SUs can be free riders.

Fig. 3: Coalition's probability of an incorrect decision against P for various |C_j|.
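Eq. (12) can be evaluated directly to size a coalition for a target error probability, as the short sketch below shows; for example, |C_j| = 5 and P = 0.3 give P_j ≈ 0.163 < P.

from math import ceil, comb

def coalition_error_probability(coalition_size, p):
    # Eq. (12): probability that more than half of the |C_j| contributors
    # report a falsified channel status, so the majority vote itself is wrong.
    k_min = ceil((1 + coalition_size) / 2)
    return sum(comb(coalition_size, k) * p ** k * (1 - p) ** (coalition_size - k)
               for k in range(k_min, coalition_size + 1))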
E. On the Complexity of the PCMARL Algorithm
The authors of [60] determine the complexity of general
MDPs to be bounded by polynomials in experiment time, and
the size of the state-space and action-space. The time com-
plexity of model-free algorithms, including Q-learning with ε-greedy, UCB, and UCB-H exploration strategies, is shown in [61] to be O(E), where E denotes the total experiment time. The space complexity of Q-learning is proven in [62] to be O(|S||A|), where |S| is the size of the state-space and |A| is
the size of the action-space. The complexity of the PCMARL
algorithm differs in only two aspects:
1) Reward history-based channel selection, which is a max
operation over the number of channels and, hence, has
a complexity of O(M).
2) Majority vote, which is a mode operation over the action set D_j and has a complexity of O(|D_j|). Note that
free-riders reduce this complexity, since the mode is
performed only on contributors.
Therefore, the additional complexity is linear in channel
quantity and coalition size but constant in load factor since
each agent performs all computations locally.
The simulation results are discussed in the next section.
IV. PERFORMANCE EVALUATION
In this section, we present the simulation results, discuss
the PCMARL sensing algorithm and evaluate various perfor-
mance indicators under several load factors. The PCMARL
algorithm’s performance is compared to that of traditional Q-
learning with constant ε proposed in [9], Q-learning with UCB as used in [59], and UCB-H as used in [9] for CSS.
Fig. 4: Average sensing accuracy under η = 10, P = 0.3 for all approaches.
The sensing capacity K of each SU is assumed to be 1, i.e., each SU can sense only one channel in one time slot, which is a realistic assumption for most low-powered IoT devices. The SUs are assumed to be capable of switching between available channels. We use N_p = 3 PUs for all simulations and N_s = ηN_p, where η ∈ {10, 25, 50, 100, 200}. The simulations are carried out for 1,000 episodes, and each episode has T = 20 steps. If an SU participates in spectrum sensing, its reported channel status is subject to change due to SDF attacks and sensing inaccuracy with probability P. The simulation parameters are given in Table III.
TABLE III: Simulation Parameters

Parameter : Value(s)
η : [2, 10, 25, 50, 100, 200]
N_S : ηN_P
P : [0.01, 0.05, 0.1, 0.2, 0.3]
A : {0, 1, 2}
Number of episodes : [100, 500, 1000]
T : 20
N_P : 3
σ : 5
γ : 0.1
α : 0.1
Figure 4 compares the sensing accuracy of the proposed PCMARL algorithm with the three related approaches mentioned above, under load factor η = 5 and falsification probability P = 0.3 due to SDF attacks. Since the PCMARL algorithm uses majority voting, it is significantly more accurate than traditional Q-learning with constant ε, the Q-UCB approach of [59], and the Q-UCB-H approach of [9]. Similarly, Figure 5 shows the sensing accuracy of all four approaches for η = 5 and P = 0.1. The lower values of η and P compared to Figure 4 bring the performance of the other approaches closer to that of the PCMARL algorithm; nevertheless, the PCMARL algorithm is still significantly more accurate.
Fig. 5: Average sensing accuracy under η = 5, P = 0.1 for all approaches.

Fig. 6: Average sensing accuracy of all approaches for η = 2 and η = 100 with P = 0.3.

In order to show the effect of the load factor on sensing accuracy, Figure 6 plots the sensing accuracy of all approaches for two different load factors: η = 2 and η = 100. With a
high load factor, the PCMARL algorithm converges to 100%
accuracy in under 200 episodes due to cooperation among a
large number of SUs, which provides a strong defense against
SDF attacks.
Figure 7 shows the sensing accuracy of the PCMARL
algorithm under various load factors to show scalability. As
the load factor increases, majority vote heavily influences
individual decisions. This figure also shows our approach’s
resilience against SDF attacks when the probability of falsi-
fication is high, i.e., P = 0.3. The system was tested with various load factor values up to η = 200, and it achieved
100% sensing accuracy in about half as many episodes as the
other approaches for all load factors.
Fig. 7: Average sensing accuracy of PCMARL for different load factors.

Fig. 8: Average sensing accuracy of all approaches under various P.

Figure 8 compares the accuracy of all approaches with
different probabilities of falsification in order to show the
proposed algorithm’s resilience. The PCMARL algorithm is
the most resilient to SDF attacks even when 30% of the
reported channel statuses have been falsified. The performance
of all approaches degrades with increasing P. Figure 9 shows
the PCMARL algorithm’s resilience to SDF attacks by plotting
its accuracy for various probabilities of sensing falsification.
Figure 10 shows the average rewards per episode over all
SUs for all approaches. Each episode has T = 20 slots, and SU s_i receives reward r^{s_i}_j(o_t) = 1 for sensing and correctly predicting the status of channel j in state o_t. The range of rewards per episode is therefore [0, 20], which is the y-axis range in all reward-related figures.
Fig. 9: Average sensing accuracy of PCMARL under various P.

Fig. 10: Average reward over all SUs per episode for all approaches with η = 10 and P = 0.3.

Figure 11 shows the average reward per episode for all ap-
proaches under various load factors. The PCMARL algorithm
converges faster to higher rewards than the other approaches
despite the probability of falsification being P = 0.3. Similarly, Figure 12 shows the average reward per episode with different probabilities of falsification for η = 10. It emphasizes the PCMARL algorithm's resilience against SDF attacks, as its average reward even when P = 0.3 is higher than that of the other approaches when P = 0.1.
Fig. 11: Average reward per episode for all approaches under various load factors.

Fig. 12: Average reward per episode for all approaches under various P.

One of the contributions of this work is giving SUs the freedom to choose a channel for sensing based on their reward history. If a channel frequently becomes busy, an SU can decide to stop sensing it, leave the coalition, and join another coalition (the one corresponding to the channel from which it has historically received the highest reward) for sensing in the
hope of finding a more frequently available channel. Another
reason for SUs to switch between coalitions is their mobility,
which affects their sensing accuracy. The SUs can thus pick
a channel for which they are able to predict the channel status more accurately, thereby leading to higher rewards. In other
words, the SUs also learn to join the most suitable coalition
for higher rewards. This concept is reflected in the PCMARL
algorithm’s behavior as shown in Figures 10, 11 and 12.
The SUs have the free will to be contributors or free riders. A small per-step negative cost is associated with participating in spectrum sensing, which encourages the SUs to learn to participate in sensing only when a high reward is expected, thereby conserving their energy. In other words, SUs can
learn to settle for a lower reward to conserve their energy.

Fig. 13: Average contribution percentage for all approaches.

Fig. 14: Average energy consumption per unit reward for all approaches.

The
number of times that SUs participate in spectrum sensing per
episode is indicated by their contribution percentage, shown in
Figure 13, which also reflects their total energy consumption.
We assume that an SU spends one unit of energy whenever
it participates in spectrum sensing in a particular time slot
and zero units of energy whenever it is a free rider. Note
that the epsilon-greedy algorithm exhibits better exploration
of the environment than the UCB and UCB-H algorithms that
prioritize saving energy.
Fig. 15: Average energy consumption per unit reward for PCMARL under various load factors.

Fig. 16: Average sensing overhead of all approaches for P = 0.1 and P = 0.3.

However, total rewards and contribution percentage alone do not give a complete measure of energy efficiency for SUs since they do not reflect whether the reward is worth the energy spent for spectrum sensing. For example, an SU may have participated in spectrum sensing only three times per episode to conserve energy and predicted correctly each time. This SU's total reward per episode might still be lower than that of the
other SUs that participated more in sensing. Thus, rewards
and contribution percentage, when looked at separately, do
not fully represent energy efficiency.
A better indicator of energy-reward balance is energy-per-
reward, which reflects the total energy the SU consumed
per episode normalized by the total rewards it received per
episode. Figure 14 shows total energy-per-reward for the SUs
for all approaches. A lower energy-per-reward ratio reflects
higher energy efficiency since it shows that the SUs learned
to participate in sensing only when they expected a reward.
Fig. 17: Average sensing overhead of PCMARL under various P.

Ideally, the energy-per-reward ratio should be equal to 1. The PCMARL algorithm achieves a near-ideal energy-per-reward ratio, whereas UCB-H is the most energy-inefficient approach
and does not learn to lower the participation percentage to
conserve energy. Figure 15 shows the PCMARL algorithm’s
energy-per-reward ratio under various load factors, and it is
clear that as the load factor increases, the SUs have more
opportunities to conserve their energy and still achieve a near
ideal energy-per-reward ratio.
A related measure of energy wastage is sensing overhead,
which indicates the amount of energy the SUs spent sensing a
channel when they did not receive any reward. Figure 16 shows
the average sensing overhead of all approaches. The PCMARL
algorithm has extremely low sensing overhead after half as
many episodes as the other approaches. Sensing overhead
is high at the beginning when the SUs are exploring the
environment, but drops as the number of episodes increases. In
an ideal system with zero wasted energy, the sensing overhead
will be zero. Figure 17 shows the PCMARL algorithm’s
sensing overhead with various P values and that it has very low overhead even with high P values.
V. CONCLUSION
In this paper, we propose a PCMARL spectrum sensing
algorithm that takes advantage of coalition formation and
majority voting to combat SDF attacks in CR-IoT. The pro-
posed PCMARL algorithm outperforms state-of-the-art RL-based sensing algorithms, namely Q-learning with constant ε, Q-UCB, and Q-UCB-H, in terms of sensing accuracy and energy efficiency. The
adopted majority vote mechanism makes the PCMARL al-
gorithm resilient to SDF attacks even when 30% of the
reported sensing results have been falsified. Unlike centralized
fully CSS approaches, the proposed PCMARL algorithm has
a coalition-specific partially cooperative sensing and belief-
sharing mechanism that makes its performance directly propor-
tional to the number of devices in the network. Moreover, the
algorithm forces agents to learn to optimize their participation
percentage to conserve energy. Agents also learn to switch
channels based on their reward history to maximize their
rewards. Future work may consider approximating the Q-table
to extend the work to very large state-spaces and implement
it in low-powered devices.
ACKNOWLEDGMENT
This work is supported by FRQNT scholarship number
305094 and the Canada Research Chair Program tier-II entitled
"Towards a Novel and Intelligent Framework for the Next Generations of IoT Networks".
REFERENCES
[1] C. MacGillivray and D. Reinsel, “Worldwide global datasphere
iot device and data forecast, 2019-2023,” Available at
https://www.idc.com/getdoc.jsp?containerId=prUS45213219
(2019/06/18).
[2] A. A. Khan, M. H. Rehmani, and A. Rachedi, “When Cognitive Radio
meets the Internet of Things?” 2016 International Wireless Communica-
tions and Mobile Computing Conference, IWCMC, pp. 469–474, 2016.
[3] J. Mitola and G. Maguire, “Cognitive radio: making software radios
more personal,” IEEE Personal Communications, vol. 6, no. 4, pp. 13–
18, 1999.
[4] A. A. Khan, M. H. Rehmani, and A. Rachedi, “Cognitive-Radio-
Based Internet of Things: Applications, architectures, spectrum related
functionalities, and future research directions,” IEEE Wireless Commu-
nications, vol. 24, no. 3, pp. 17–25, 2017.
[5] P. Cheng, Z. Chen, M. Ding, Y. Li, B. Vucetic, and D. Niyato, “Spectrum
Intelligent Radio: Technology, Development, and Future Trends,” IEEE
Communications Magazine, vol. 58, no. 1, pp. 12–18, 2020.
[6] Y. Zou, J. Zhu, L. Yang, Y. C. Liang, and Y. D. Yao, “Securing
physical-layer communications for cognitive radio networks,” IEEE
Communications Magazine, vol. 53, no. 9, pp. 48–54, 2015.
[7] H. Vu-Van and I. Koo, “Cooperative spectrum sensing with collaborative
users using individual sensing credibility for cognitive radio network,”
IEEE Transactions on Consumer Electronics, vol. 57, no. 2, pp. 320–
326, 2011.
[8] R. Sarikhani and F. Keynia, “Cooperative Spectrum Sensing Meets
Machine Learning: Deep Reinforcement Learning Approach,” IEEE
Communications Letters, vol. 24, no. 7, pp. 1459–1462, 2020.
[9] Y. Zhang, P. Cai, C. Pan, and S. Zhang, “Multi-Agent Deep Rein-
forcement Learning-Based Cooperative Spectrum Sensing With Upper
Confidence Bound Exploration,” IEEE Access, vol. 7, pp. 118898–
118906, 2019.
[10] Y. Alghorani, G. Kaddoum, S. Muhaidat, and S. Pierre, “On the
Approximate Analysis of Energy Detection over n Rayleigh Fading
Channels Through Cooperative Spectrum Sensing,” IEEE Wireless Com-
munications Letters, vol. 4, no. 4, pp. 413–416, 2015.
[11] F. Hussain, S. A. Hassan, R. Hussain, and E. Hossain, “Machine
Learning for Resource Management in Cellular and IoT Networks:
Potentials, Current Solutions, and Open Challenges,” IEEE Communi-
cations Surveys and Tutorials, vol. 22, no. 2, pp. 1251–1275, 2020.
[12] H. Li, X. Xing, J. Zhu, X. Cheng, K. Li, R. Bie, and T. Jing, “Utility-
Based Cooperative Spectrum Sensing Scheduling in Cognitive Radio
Networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 1,
pp. 645–655, 2017.
[13] K. Jagannathan, I. Menache, E. Modiano, and G. Zussman, “Non-
cooperative spectrum access the dedicated vs. free spectrum choice,”
IEEE Journal on Selected Areas in Communications, vol. 30, no. 11,
pp. 2251–2261, 2012.
[14] A. Ali and W. Hamouda, “Advances on Spectrum Sensing for Cognitive
Radio Networks: Theory and Applications,” IEEE Communications
Surveys and Tutorials, vol. 19, no. 2, pp. 1277–1304, 2017.
[15] Z. Quan, S. Cui, and A. H. Sayed, “Optimal linear cooperation for
spectrum sensing in cognitive radio networks,” IEEE Journal of selected
topics in signal processing, vol. 2, no. 1, pp. 28–40, 2008.
[16] J. Oueis, E. C. Strinati, and S. Barbarossa, “The fog balancing: Load dis-
tribution for small cell cloud computing,” in 2015 IEEE 81st Vehicular
Technology Conference (VTC Spring), 2015, pp. 1–6.
[17] M. Kim and I.-Y. Ko, “An efficient resource allocation approach based
on a genetic algorithm for composite services in IoT environments,” in
2015 IEEE International Conference on Web Services. IEEE, 2015,
pp. 543–550.
[18] Y. Xu, Y. Yang, Q. Liu, and Z. Li, “Joint energy-efficient resource
allocation and transmission duration for cognitive HetNets under
imperfect CSI,” Signal Processing, vol. 167, p. 107309, 2020. [Online].
Available: https://doi.org/10.1016/j.sigpro.2019.107309
[19] C. Pham, N. H. Tran, C. T. Do, S. I. Moon, and C. S. Hong, “Spectrum
handoff model based on hidden markov model in cognitive radio
networks,” in The International Conference on Information Networking
2014 (ICOIN2014), 2014, pp. 406–411.
[20] Q. Ni, R. Zhu, Z. Wu, Y. Sun, L. Zhou, and B. Zhou, “Spectrum
allocation based on game theory in cognitive radio networks.” JNW,
vol. 8, no. 3, pp. 712–722, 2013.
[21] H. Li, X. Xing, J. Zhu, X. Cheng, K. Li, R. Bie, and T. Jing, “Utility-
Based Cooperative Spectrum Sensing Scheduling in Cognitive Radio
Networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 1,
pp. 645–655, 2017.
[22] D. J. Lee, “Adaptive random access for cooperative spectrum sensing in
cognitive radio networks,” IEEE Transactions on Wireless Communica-
tions, vol. 14, no. 2, pp. 831–840, 2015.
[23] Z. Dai, Z. Wang, and V. W. Wong, “An Overlapping Coalitional Game
for Cooperative Spectrum Sensing and Access in Cognitive Radio
Networks,” IEEE Transactions on Vehicular Technology, vol. 65, no. 10,
pp. 8400–8413, 2016.
[24] M. C. Hlophe and S. B. Maharaj, “Spectrum Occupancy Reconstruction
in Distributed Cognitive Radio Networks Using Deep Learning,” IEEE
Access, vol. 7, pp. 14294–14307, 2019.
[25] H. He and H. Jiang, “Deep Learning Based Energy Efficiency Opti-
mization for Distributed Cooperative Spectrum Sensing,” IEEE Wireless
Communications, vol. 26, no. 3, pp. 32–39, 2019.
[26] Z. Shi, W. Gao, S. Zhang, J. Liu, and N. Kato, “AI-Enhanced Coop-
erative Spectrum Sensing for Non-Orthogonal Multiple Access,” IEEE
Wireless Communications, vol. 27, no. 2, pp. 173–179, 2020.
[27] M. Bkassiny, Y. Li, and S. K. Jayaweera, “A Survey on Machine-
Learning Techniques in Cognitive Radios,” IEEE Communications Sur-
veys and Tutorials Tutorials, vol. 15, no. 3, pp. 1136–1159, 2013.
[28] K. M. Thilina, K. W. Choi, N. Saquib, and E. Nazmus, “Machine
Learning Techniques for Cooperative Spectrum Sensing in Cognitive
Radio Networks,” IEEE Journal on Selected Areas in Communications,
vol. 31, no. 11, pp. 2209–2221, 2013.
[29] G. Scutari and J. S. Pang, “Joint sensing and power allocation in noncon-
vex cognitive radio games: Nash equilibria and distributed algorithms,”
IEEE Transactions on Information Theory, vol. 59, no. 7, pp. 4626–
4661, 2013.
[30] A. S. Alfa, B. T. Maharaj, S. Lall, and S. Pal, “Mixed-Integer Program-
ming based Techniques for Resource Allocation in Underlay Cognitive
Radio Networks : A Survey,” Journal of Communications and Networks,
vol. 18, no. 5, pp. 744–761, 2016.
[31] S. Verma, Y. Kawamoto, Z. M. Fadlullah, H. Nishiyama, and N. Kato,
“A Survey on Network Methodologies for Real-Time Analytics of
Massive IoT Data and Open Research Issues,” IEEE Communications
Surveys and Tutorials, vol. 19, no. 3, pp. 1457–1477, 2017.
[32] C. H. Tan, K. C. Tan, and A. Tay, “Dynamic Game Difficulty Scaling
Using Adaptive Behavior-Based AI,” IEEE Transactions on Computa-
tional Intelligence and AI in Games, vol. 3, no. 4, pp. 289–301, 2011.
[33] Z. Ali, L. Jiao, T. Baker, G. Abbas, Z. H. Abbas, and S. Khaf, “A
deep learning approach for energy efficient computational offloading in
mobile edge computing,” IEEE Access, vol. 7, pp. 149623–149633,
2019.
[34] Z. Ali, S. Khaf, Z. Abbas, G. Abbas, L. Jiao, A. Irshad, K. Kwak, and
M. Bilal, “A comprehensive utility function for resource allocation in
mobile edge computing,” CMC-Computers, Materials & Continua (Tech Science Press), vol. 66, pp. 1461–1477, 2020.
[35] Z. Ali, S. Khaf, Z. H. Abbas, G. Abbas, F. Muhammad, and S. Kim, “A
deep learning approach for mobility-aware and energy-efficient resource
allocation in MEC,” IEEE Access, vol. 8, pp. 179530–179546, 2020.
[36] H. Sami, A. Mourad, H. Otrok, and J. Bentahar, “FScaler: Automatic
Resource Scaling of Containers in Fog Clusters Using Reinforcement
Learning,” 2020 International Wireless Communications and Mobile
Computing, IWCMC 2020, pp. 1824–1829, 2020.
[37] ——, “Demand-Driven Deep Reinforcement Learning for Scalable Fog
and Service Placement,” IEEE Transactions on Services Computing, pp.
1–14, 2021.
[38] H. Sami, H. Otrok, J. Bentahar, and A. Mourad, “AI-based Resource
Provisioning of IoE Services in 6G: A Deep Reinforcement Learning
Approach,” IEEE Transactions on Network and Service Management,
pp. 1–14, 2021.
[39] J. Oksanen, J. Lundén, and V. Koivunen, “Reinforcement learning
based sensing policy optimization for energy efficient cognitive radio
networks,” Neurocomputing, vol. 80, pp. 102–110, 2012.
[40] S. Mosleh, Y. Ma, J. D. Rezac, and J. B. Coder, “Dynamic Spectrum
Access with Reinforcement Learning for Unlicensed Access in 5G
and beyond,” in 2020 IEEE Vehicular Technology Conference (VTC2020-Spring), 2020.
[41] R. Sarikhani and F. Keynia, “Cooperative Spectrum Sensing Meets
Machine Learning: Deep Reinforcement Learning Approach,” IEEE
Communications Letters, early access, 2020.
[42] Y. Zhang, P. Cai, C. Pan, and S. Zhang, “Multi-Agent Deep Rein-
forcement Learning-Based Cooperative Spectrum Sensing With Upper
Confidence Bound Exploration,” IEEE Access, vol. 7, pp. 118898–
118906, 2019.
[43] J. Lundén, S. R. Kulkarni, V. Koivunen, and H. V. Poor, “Multiagent reinforcement learning based spectrum sensing policies for cognitive radio networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 858–868, 2013.
[44] J. Lundén, V. Koivunen, S. R. Kulkarni, and H. V. Poor, “Exploiting
spatial diversity in multiagent reinforcement learning based spectrum
sensing,” 2011 4th IEEE International Workshop on Computational
Advances in Multi-Sensor Adaptive Processing, CAMSAP 2011, pp. 325–
328, 2011.
[45] K. S. Shin, G. H. Hwang, and O. Jo, “Distributed reinforcement learning
scheme for environmentally adaptive IoT network selection,” Electronics
Letters, vol. 56, no. 9, pp. 441–444, 2020.
[46] H.-H. Chang, H. Song, Y. Yi, J. Zhang, H. He, and L. Liu, “Distributive
Dynamic Spectrum Access Through Deep Reinforcement Learning:
A Reservoir Computing-Based Approach,” IEEE Internet of Things
Journal, vol. 6, no. 2, pp. 1938–1948, 2019.
[47] T. Panayiotou, K. Manousakis, S. P. Chatzis, and G. Ellinas, “A Data-
Driven Bandwidth Allocation Framework with QoS Considerations for
EONs,” Journal of Lightwave Technology, vol. 37, no. 9, pp. 1853–1864,
2019.
[48] K. Zia, N. Javed, M. N. Sial, S. Ahmed, A. A. Pirzada, and F. Pervez, “A
Distributed Multi-Agent RL-Based Autonomous Spectrum Allocation
Scheme in D2D Enabled Multi-Tier HetNets,” IEEE Access, vol. 7, pp.
6733–6745, 2019.
[49] Y. Xu, J. Yu, W. C. Headley, and R. M. Buehrer, “Deep Reinforcement
Learning for Dynamic Spectrum Access in Wireless Networks,” in Proceedings of the IEEE Military Communications Conference (MILCOM), 2019, pp. 207–212.
[50] H. Chen, M. Zhou, K. Wang, and J. Li, “Joint Spectrum Sensing
and Resource Allocation Scheme in Cognitive Radio Networks with
Spectrum Sensing Data Falsification Attack,” IEEE Transactions on
Vehicular Technology, vol. 65, no. 11, pp. 9181–9191, 2016.
[51] A. Sajid, B. Khalid, M. Ali, S. Mumtaz, U. Masud, and F. Qamar,
“Securing Cognitive Radio Networks using blockchains,” Future
Generation Computer Systems, vol. 108, pp. 816–826, 2020. [Online].
Available: https://doi.org/10.1016/j.future.2020.03.020
[52] N. Haddadou, A. Rachedi, and Y. Ghamri-Doudane, “A Job Market
Signaling Scheme for Incentive and Trust Management in Vehicular Ad
Hoc Networks,” IEEE Transactions on Vehicular Technology, vol. 64,
no. 8, pp. 3657–3674, 2015.
[53] A. Ahmadfard, A. Jamshidi, and A. Keshavarz-Haddad, “Probabilistic
spectrum sensing data falsification attack in cognitive radio networks,”
Signal Processing, vol. 137, pp. 1–9, 2017. [Online]. Available:
http://dx.doi.org/10.1016/j.sigpro.2017.01.033
[54] J. Wu, C. Wang, Y. Yu, T. Song, and J. Hu, “Sequential fusion to defend
against sensing data falsification attack for cognitive Internet of Things,”
ETRI Journal, vol. 42, pp. 976–986, 2020.
[55] D. Bendouda, A. Rachedi, and H. Haffaf, “Programmable
architecture based on Software Defined Network for Internet of
Things: Connected Dominated Sets approach,” Future Generation
Computer Systems, vol. 80, pp. 188–197, 2018. [Online]. Available:
https://doi.org/10.1016/j.future.2017.09.070
[56] X. Zhang and H. Su, “CREAM-MAC: Cognitive Radio-EnAbled Multi-
Channel MAC Protocol Over Dynamic Spectrum Access Networks,”
IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 1, pp.
110–123, 2011.
[57] J. N. Tsitsiklis, “Asynchronous stochastic approximation and Q-learning,” Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
[58] A. L. Strehl, E. Wiewiora, J. Langford, and M. L. Littman, “PAC Model-
Free Reinforcement Learning,” Proceedings of the 23rd International
Conference on Machine Learning, pp. 881–888, 2006.
[59] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction,
2nd ed. The MIT Press, 2018.
[60] M. Kearns and S. Singh, “Near-optimal reinforcement learning in
polynomial time,” Machine Learning, vol. 49, no. 2, pp. 209–232, 2002.
[61] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan, “Is Q-learning prov-
ably efficient?” in Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 4863–4873.
[62] A. L. Strehl, E. Wiewiora, J. Langford, and M. L. Littman, “PAC Model-Free Reinforcement Learning,” Proceedings of the 23rd International Conference on Machine Learning, pp. 881–888, 2006.
Sadia Khaf received the B.E. degree in electrical en-
gineering from the National University of Sciences
and Technology, School of Electrical Engineering
and Computer Science (NUST-SEECS), Pakistan, in
2015. She received the M.S. degree in electrical
and electronics engineering from Bilkent University,
Turkey, in 2018. From 2015 to 2018, she was a
Research Assistant with IONOLAB, Turkey. From
2018 to 2020, she was with the Faculty of Elec-
trical Engineering, Ghulam Ishaq Khan Institute of
Engineering Sciences and Technology, Pakistan, as
a Lecturer. Currently, she is with the École de Technologie Supérieure (ÉTS),
Canada, as a Ph.D. candidate. Her research interests include cognitive radio
networks, internet of things, radio resource management, and machine learn-
ing. She received several research excellence awards and grants, including the
Fonds de recherche du Québec, Nature et technologies (FRQNT) doctoral fellowship, P.E.O. International Peace Scholarship, ÉTS Bourses D’implication aux Supérieurs, and ÉTS Palmarès Féminin pluriel award.
Mohammad T. Alkhodary received a B.S. de-
gree in Telecommunication Engineering from the
University of Science and Technology, Yemen, in
2008. He received an M.S. in Communication En-
gineering and Ph.D. in Electrical Engineering from
the King Fahd University of Petroleum and Miner-
als (KFUPM), Dhahran, Saudi Arabia, in 2012 and
2017, respectively. From 2015 to 2018, he held
research positions at King Abdullah University of
Science and Technology, Thuwal, Saudi Arabia. He
was a visiting scholar at the Georgia Institute of
Technology (Georgia Tech), Atlanta, USA, in 2017. In 2018, he joined the École de Technologie Supérieure (ÉTS) as a postdoctoral fellow. His research interests include MLOps, the application of ML and signal processing techniques to wireless communications, cloud-native architectures for B5G, node clustering, and ML for edge services.
Georges Kaddoum (Senior Member, IEEE) re-
ceived the bachelor’s degree in electrical engineering
from the École Nationale Supérieure de Techniques Avancées (ENSTA), France, the M.Sc. degree in
telecommunications and signal processing from the
Telecom Bretagne (ENSTB), Brest, in 2005, and
the Ph.D. degree in signal processing and telecom-
munications from the National Institute of Applied
Sciences (INSA), Toulouse, France, in 2009. He is
currently a Full Professor and the Tier-II Canada
Research Chair with the École de Technologie Supérieure, University of Quebec, Montréal, Canada. He has published over
150 journal and conference papers and has two pending patents. His recent
research interests include IoT wireless communication networks, resource
allocation, security and space communications, and navigation. He was
awarded the ÉTS Research Chair in physical-layer security for wireless networks in 2014; the Research Excellence Award of the Université du Québec, in 2018; the prestigious Tier 2 Canada Research Chair in wireless IoT networks in 2019; and the Research Excellence Award from the ÉTS, in recognition of his outstanding research outcomes, in 2019.