
Deep RL-assisted Energy Harvesting in CR-NOMA

Communications for NextG IoT Networks

Syed Asad Ullah∗, Shah Zeb∗, Aamir Mahmood†, Syed Ali Hassan∗, and Mikael Gidlund†

∗School of Electrical Engineering & Computer Science (SEECS),

National University of Sciences & Technology (NUST), 44000 Islamabad, Pakistan.

†Department of Information Systems & Technology, Mid Sweden University, 851 70 Sundsvall, Sweden.

Email: ∗{sullah.phdee21seecs, szeb.dphd19seecs, ali.hassan}@seecs.edu.pk, †{firstname.lastname}@miun.se

Abstract—Zero-energy radios in energy-constrained devices are envisioned as key enablers for realizing next-generation Internet-of-things (NG-IoT) networks for ultra-dense sensing and monitoring. This paper presents analytical modeling and analysis of the energy-efficient uplink transmission of an energy-constrained secondary sensor operating opportunistically among several primary sensors. The considered scenario assumes that all primary sensors transmit in a round-robin, time division multiple access-based scheme, and the secondary sensor is admitted into the time slot of each primary sensor using a cognitive radio-inspired non-orthogonal multiple access technique. The energy efficiency of the secondary sensor is maximized using a deep reinforcement learning algorithm known as the deep deterministic policy gradient (DDPG). Our results demonstrate that the DDPG-based transmission scheme outperforms the conventional random and greedy algorithms in terms of energy efficiency under different operating conditions.

Index Terms—Next-generation Internet-of-things (NG-IoT), non-orthogonal multiple access (NOMA), deep deterministic policy gradient (DDPG), energy efficiency (EE).

I. INTRODUCTION

The provision of energy-efficient wireless connectivity is becoming vital to realizing next-generation Internet-of-things (NG-IoT) networks. IoT devices usually have constrained power supplies, mandating the design of energy-efficient radios and optimized communication protocols to reduce energy consumption. In this respect, zero-energy radios are envisioned to enable ultra-dense connectivity for numerous application areas, including smart industries, smart healthcare, smart agriculture, and smart cities [1], [2]. Such radios are expected to increase the scale of sensing and monitoring without requiring operators to charge or replace batteries. Hence, the goal of NG-IoT networks is to ensure energy-efficient communication while meeting the sustainable development goals (SDGs) and the operational expenditures (OPEX) of the communication network [3], [4].

With the ever-growing size of IoT networks, maintaining the lifetime of energy-constrained sensors becomes difficult. Particularly when sensors are implanted in unreachable places, traditional battery-based solutions are impractical due to the high cost of battery replacement and recycling issues. Therefore, numerous radio frequency (RF)-based energy harvesting and green communication techniques are being investigated to address this challenge [5], [6]. In the harvest-then-transmit model, energy-constrained sensors may need to switch from transmitting to harvesting, or vice versa, depending on various dynamic factors, including battery capacity, channel conditions, transmit power, and circuit power [7]–[9]. Under these dynamics, autonomous and intelligent decision-making and optimization techniques are necessary, for which deep reinforcement learning (DRL)-based strategies are gaining momentum [10].

Nevertheless, serving multiple energy-constrained sensors remains a challenging task due to spectrum limitations. The challenge of limited spectrum can be addressed by adopting cognitive radio-inspired non-orthogonal multiple access (CR-NOMA), a prominent multiple access technique that allows multiple uplink users to be multiplexed together and served concurrently [11]–[13].

To provide energy- and spectrum-efficient communication, optimal energy harvesting and CR-NOMA-based transmission methods are being investigated in the literature. The work in [14] addressed a long-term throughput maximization problem for a point-to-point network and applied the deep deterministic policy gradient (DDPG) algorithm to achieve this goal. The authors of [12] looked into the throughput maximization problem in an extended uplink scenario where one unlicensed user uses the NOMA approach to transmit data during a licensed user's time slot. To the best of our knowledge, energy efficiency maximization and its analysis for an energy-constrained sensor in a CR-NOMA-assisted NG-IoT network have not been addressed yet.

In this work, we mathematically model the uplink transmission of an energy-constrained sensor operating in a CR-NOMA-assisted NG-IoT network while maintaining a reasonable quality of service (QoS), and provide its energy consumption analysis. A DRL-based approach is implemented to maximize the energy efficiency (EE) of the energy-constrained IoT sensor operating among several primary sensors in a round-robin time division multiple access (TDMA) scheme. The contributions of this paper are listed as follows.

• We formulate the energy efficiency metric for an energy-constrained sensor in a CR-NOMA-assisted IoT network and optimize it using the DDPG algorithm.

• We present an analysis of energy efficiency for different parameters, including the path loss exponent, distance, and circuit power, and compare the results with existing benchmark schemes, such as the greedy and random algorithms.

Fig. 1. System model diagram for uplink communication in an NG-IoT network.

The remainder of the paper is structured as follows. The system model is presented in Sec. II. Sec. III formulates our problem within the DDPG framework, and Sec. IV explores the simulation results. Finally, Sec. V concludes the paper.

II. SYSTEM MODEL

We consider an uplink communication scenario as shown in Fig. 1. There are $N$ primary users (e.g., sensors), denoted by $U_j$ for $j \in \{1, \cdots, N\}$, a base station (BS), and an energy-constrained secondary sensor, represented by $U_0$, which can harvest energy from the primary sensors when they transmit. The channel gain of the secondary sensor is denoted by $h_0$, and those of the primary sensors by $h_j$. The channel between the secondary sensor and the respective primary sensor is given by $h_{j,0}$. All primary sensors transmit based on TDMA round-robin scheduling, assisted by CR-NOMA, with a fixed slot duration $T$, and the transmission continues for a long time ($NT$) so that each primary sensor can transmit at least once.

1) CR-NOMA-enhanced scheme: For transmitting data, the energy-constrained sensor is admitted into the time slot of each primary sensor via CR-NOMA. Within each time slot $T$, the first $\tau_t T$ seconds are used by the secondary sensor for transmitting data, and the remaining time, $(1-\tau_t)T$, for harvesting energy, where $\tau_t$ denotes the time-sharing coefficient and assumes a value between 0 and 1. The following assumptions are considered in this scenario: i) the secondary sensor is aware of the channel state information of each primary sensor scheduled in that particular time slot $T$, and ii) the battery of the energy-constrained sensor is assumed to be full at the start of the communication. With these assumptions, the transmit power of the secondary sensor is constrained as

$$\tau_t T P_{0,t} \leq E_t, \qquad (1)$$

where $E_t$ denotes the energy currently in the battery of the secondary sensor at time $t$ and $P_{0,t}$ represents its transmit power at time $t$. Similarly, the energy accumulated by the secondary sensor at the start of time slot $t+1$ is given by

$$E_{t+1} = \min\left\{E_t + (1-\tau_t) T \eta P_{j_t} |h_{j_t,0}|^2 - \tau_t T P_{0,t},\; E_m\right\}, \qquad (2)$$

which fulfills the condition of no energy overflow. In (2), $E_m$ represents the secondary sensor's maximum battery capacity, $P_{j_t}$ represents the power received from the $j$-th transmitting sensor at time $t$, $\eta$ is the energy harvesting efficiency coefficient, and $h_{j_t,0}$ represents the channel between the secondary sensor and the $j$-th primary sensor at time $t$. Therefore, the EE of the secondary sensor can be defined as [15]

$$\hat{\Gamma}_{\mathrm{EE}} = \frac{\sum_{t=1}^{M} R_t(\tau_t, P_{0,t})}{P_T}, \qquad (3)$$

where $R_t(\tau_t, P_{0,t}) = \tau_t \log_2\!\left(1 + \frac{P_{0,t}|h_0|^2}{1 + P_{j_t}|h_{j_t}|^2}\right)$ and $P_T = P_c + P_{0,t}$, with $P_c$ representing the circuit power consumed by the internal circuitry of the secondary sensor. The expression for $R_t$ assumes that the BS first performs successive interference cancellation (SIC) and can correctly decode the signal from the secondary sensor. After the BS eliminates the secondary sensor's decoded signal, the signals of the primary sensors can be decoded.
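The per-slot dynamics in (1)–(3) can be sketched in code. The following is an illustrative Python sketch (not the paper's implementation); channel gains are passed as squared magnitudes $|h|^2$, and the function and argument names mirror the symbols in the equations above.

```python
import math

def battery_update(E_t, tau_t, T, eta, P_j, g_j0, P0_t, E_m):
    """Next-slot battery level per Eq. (2): harvest for (1 - tau_t)*T,
    transmit for tau_t*T, with no overflow beyond capacity E_m."""
    harvested = (1 - tau_t) * T * eta * P_j * g_j0   # g_j0 = |h_{j_t,0}|^2
    consumed = tau_t * T * P0_t                      # Eq. (1) requires consumed <= E_t
    return min(E_t + harvested - consumed, E_m)

def slot_rate(tau_t, P0_t, g0, P_j, g_j):
    """Per-slot rate R_t: the secondary signal is decoded first via SIC,
    so the primary transmission appears as interference."""
    return tau_t * math.log2(1 + P0_t * g0 / (1 + P_j * g_j))

def energy_efficiency(rate, P0_t, P_c):
    """EE per Eq. (3) for a single slot: rate over total power P_T = P_c + P0_t."""
    return rate / (P_c + P0_t)
```

A sensor that only harvests (`tau_t = 0`) accumulates energy until it hits the cap `E_m`, reproducing the no-overflow condition of (2).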

A. Problem Formulation

Our goal is to maximize EE; therefore, (3) can be formulated as a maximization problem:

$$\begin{aligned}
\max_{\tau_t, P_{0,t}} \quad & f_o(\tau_t, P_{0,t}) \\
\text{s.t.} \quad \mathrm{C1}:\; & f_1(P_{0,t}, \tau_t) = \min\{E_m, Q\}, \\
\mathrm{C2}:\; & f_2(P_{0,t}, \tau_t) \leq 0, \\
\mathrm{C3}:\; & 0 \leq f_3(\tau_t) \leq 1, \\
\mathrm{C4}:\; & 0 \leq f_4(P_{0,t}) \leq P_{sm},
\end{aligned} \qquad (4)$$

where $P_{sm}$ is the maximum transmit power of the secondary sensor, $f_o(\tau_t, P_{0,t}) = \hat{\Gamma}_{\mathrm{EE}}(\tau_t, P_{0,t})$, $f_1(P_{0,t}, \tau_t) = E_{t+1}$, $f_2(P_{0,t}, \tau_t) = \tau_t T P_{0,t} - E_t$, $f_3(\tau_t) = \tau_t$, $f_4(P_{0,t}) = P_{0,t}$, and $Q = E_t + (1-\tau_t)T\eta P_{j_t}|h_{j_t,0}|^2 - \tau_t T P_{0,t}$. Constraint C1 expresses the battery energy level of the secondary sensor at time $t+1$, where the amount of harvested energy cannot exceed its maximum battery capacity. C2 is the difference between the energy consumed and the energy available at time $t$, which ensures the non-negativity of C1. C3 limits the value of the time-sharing coefficient between 0 and 1. Finally, C4 states that the transmit power of the secondary sensor can assume a value between 0 and $P_{sm}$.

Problem (4) is non-convex because C1 is not an affine function and the two optimization variables appear as a product in C2. However, because the optimization variables are continuous, problem (4) can be solved using the DDPG algorithm. Problem (4) is first divided into two sub-problems, since the range of values of the optimization variables makes a direct implementation of DDPG challenging. The first sub-problem is defined as

The ﬁrst sub-problem is deﬁned as

max

τt,P0,t

fo(τt, P0,t)

s.t. C1 : ˆ

f1(P0,t, τt) = 0,

C2,C3,C4in (4),

(5)

where $\hat{f}_1(P_{0,t}, \tau_t) = (1-\tau_t)T\eta P_{j_t}|h_{j_t,0}|^2 - \tau_t T P_{0,t} - \bar{E}_t$ and $\bar{E}_t = (1-\tau_t)T\eta P_{j_t}|h_{j_t,0}|^2 - \tau_t T P_{0,t}$ denotes the energy fluctuation parameter. Problem (5) is solved by convex optimization, where closed-form expressions are obtained for a given $\bar{E}_t$. The corresponding closed-form expressions are given as [12]

$$P^{*}_{0,t}(\bar{E}_t) = \frac{(1-\tau^{*}_t)\,\eta P_{j_t}|h_{j_t,0}|^2}{\tau^{*}_t} - \frac{\bar{E}_t}{\tau^{*}_t T},$$

and

$$\tau^{*}_t(\bar{E}_t) = \min\{1, \max\{x^{*}, \Omega_0\}\},$$

where $\Omega_0 = \max\!\left\{1 - \dfrac{E_t + \bar{E}_t}{T\eta P_{j_t}|h_{j_t,0}|^2},\; \dfrac{T\eta P_{j_t}|h_{j_t,0}|^2 - \bar{E}_t}{T\eta P_{j_t}|h_{j_t,0}|^2 + T P_{sm}}\right\}$, $x^{*} = \dfrac{x_1 - x_2}{e^{W_0(e^{-1}(x_1 - 1)) + 1} - 1 + x_1}$, $x_1 = \dfrac{\eta P_{j_t}|h_{j_t,0}|^2 |h_0|^2}{1 + P_{j_t}|h_{j_t}|^2}$, $x_2 = \dfrac{\bar{E}_t |h_0|^2}{T(1 + P_{j_t}|h_{j_t}|^2)}$, and $W_0(\cdot)$ represents the Lambert W function.
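As a numerical sanity check, the closed-form expressions above can be evaluated directly. The sketch below is an illustrative Python implementation under the paper's notation; the Lambert W evaluation uses a simple Newton iteration rather than a library call, and reading the maximum power in $\Omega_0$ as $P_{sm}$ is an assumption based on the surrounding definitions.

```python
import math

def lambert_w0(z, tol=1e-12):
    """Principal branch W0(z) via Newton's method; intended for z > -1/e
    away from the branch point."""
    w = 0.0 if z < 1.0 else math.log(z)
    for _ in range(200):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def optimal_tau_power(E_bar, E_t, T, eta, P_j, g_j0, g_j, g0, P_sm):
    """Evaluate tau*_t and P*_{0,t} for a given energy-fluctuation action E_bar.
    g_j0 = |h_{j_t,0}|^2, g_j = |h_{j_t}|^2, g0 = |h_0|^2."""
    x1 = eta * P_j * g_j0 * g0 / (1.0 + P_j * g_j)
    x2 = E_bar * g0 / (T * (1.0 + P_j * g_j))
    x_star = (x1 - x2) / (math.exp(lambert_w0(math.exp(-1) * (x1 - 1.0)) + 1.0)
                          - 1.0 + x1)
    h = T * eta * P_j * g_j0                      # energy harvestable in a full slot
    omega0 = max(1.0 - (E_t + E_bar) / h, (h - E_bar) / (h + T * P_sm))
    tau = min(1.0, max(x_star, omega0))
    P0 = (1.0 - tau) * eta * P_j * g_j0 / tau - E_bar / (tau * T)
    return tau, P0
```

The `min/max` clipping guarantees $\tau^{*}_t \in [0, 1]$, matching constraint C3 of problem (4).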

The second sub-problem is defined as follows. As our goal is to maximize EE, from (5) we observe that the EE, $\hat{\Gamma}_{\mathrm{EE}}$, at time $t$ does not depend on $\tau_{\hat{t}}$ and $P_{0,\hat{t}}$ for $t \neq \hat{t}$. Hence, the optimization problem (4) can be reformulated as a function of $\bar{E}_t$ within the DDPG framework, which is given as

$$\begin{aligned}
\max_{\bar{E}_t} \quad & \gamma^{t-1}\, \hat{\Gamma}_{\mathrm{EE}}\!\left(\bar{E}_t \,\middle|\, \tau^{*}_t, P^{*}_{0,t}\right) \\
\text{s.t.} \quad & E_{t+1} = \min\{E_m, E_t + \bar{E}_t\},
\end{aligned} \qquad (6)$$

where $\gamma$ represents the discount factor and assumes a value between 0 and 1. From problem (6), it can be seen that the action of the energy-constrained sensor is to choose $\bar{E}_t$ for given $\tau^{*}_t$ and $P^{*}_{0,t}$. By substituting the expression of $\hat{\Gamma}_{\mathrm{EE}}$ into (6), we get the maximization problem

$$\begin{aligned}
\max_{\bar{E}_t} \quad & \frac{\sum_{t=1}^{M} \gamma^{t-1}\, \tau^{*}_t(\bar{E}_t) \log_2\!\left(1 + \frac{P^{*}_{0,t}(\bar{E}_t)|h_0|^2}{1 + P_{j_t}|h_{j_t}|^2}\right)}{P_T} \\
\text{s.t.} \quad & E_{t+1} = \min\{E_m, E_t + \bar{E}_t\}.
\end{aligned} \qquad (7)$$

It can be observed that the above objective is a continuous univariate function of $\bar{E}_t$. This makes problem (7) well suited to be solved by the DDPG algorithm.

III. IMPLEMENTATION OF DRL ALGORITHM

In this section, we provide preliminaries of the DRL algorithm, i.e., DDPG, and formulate our problem within the DDPG framework.

A. Deep Deterministic Policy Gradient

DDPG, an actor-critic algorithm, is based on the deterministic policy gradient (DPG) and the Deep Q-Network (DQN) [16]. Deep Q-learning (DQL) becomes inefficient when the action and state spaces are continuous and high-dimensional; therefore, DDPG is well suited for such scenarios [17]. In a DRL setup, the agent (or observer) initially possesses zero knowledge about the environment. The agent learns the environment over time, as it continuously monitors its surroundings and learns how to maximize a reward signal using an optimal policy.

1) DDPG Framework: In the DDPG algorithm, at a particular time step $t$, the goal of the agent is to find an action $a_t$, for an observation $s_t$, that receives a reward $r_t$ and consequently maximizes the action-value function, represented by $Q(s_t, a_t)$. Accordingly, the maximization problem is given as

$$a^{*}_t(s_t) = \arg\max_{a_t} Q(s_t, a_t), \qquad (8)$$

where $Q(s_t, a_t)$ represents the expected return. The actor network (or policy network) takes the action, whereas the critic network (or Q-network) acts as an evaluator that assesses how good the action taken by the actor network is. The policy network is parameterized by $\theta^{\mu}$; it takes $s_t$ as input and produces an action, represented by $\mu(s_t|\theta^{\mu})$. The corresponding actor target network is parameterized by $\theta^{\mu_t}$ and outputs $\mu_t(s_t|\theta^{\mu_t})$. The critic network is parameterized by $\theta^{Q}$; it takes $s_t$ and $a_t$ as inputs and produces the action-value function, represented by $Q(s_t, a_t|\theta^{Q})$. The corresponding critic target network is parameterized by $\theta^{Q_t}$ and outputs $Q_t(s_t, a_t|\theta^{Q_t})$.

2) Network Updating Process: The actor network takes the action, while the other networks ensure that the actor network is properly trained by evaluating its output (action). Let us assume a tuple $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ represents the current state, $a_t$ represents the action the agent took according to the observed state, $r_t$ is the reward for the action taken, and $s_{t+1}$ represents the next state. Based on the above tuple, the network update process is given as follows.

1) The actor network is trained by maximizing the objective in (8). Using the parameters of the actor and critic networks, (8) can be reformulated as

$$J(\theta^{\mu}) = Q(s_t, a_t = \mu(s_t|\theta^{\mu}) \,|\, \theta^{Q}). \qquad (9)$$

Taking the gradient of (9) with respect to $\theta^{\mu}$, we get

$$\nabla_{\theta^{\mu}} J(\theta^{\mu}) = \nabla_{a_t} Q(s_t, a_t|\theta^{Q})\, \nabla_{\theta^{\mu}} \mu(s_t|\theta^{\mu}). \qquad (10)$$

2) Updating the critic network involves the two target networks: first, the output of the target actor network is fed to the target critic network, which produces the target value

$$y_t = r_t + \gamma Q_t(s_{t+1}, \mu_t(s_{t+1}|\theta^{\mu_t}) \,|\, \theta^{Q_t}). \qquad (11)$$

The critic network is then updated by minimizing the loss function

$$L(\theta^{Q}) = |y_t - Q(s_t, a_t|\theta^{Q})|^2. \qquad (12)$$

3) Using a soft target update, which assumes a very small value, the parameters of both the critic target network and the actor target network are updated. This is because both target networks are updated less frequently than their corresponding counterparts. The parameters are updated as

$$\theta^{\mu_t} \leftarrow \xi \theta^{\mu} + (1-\xi)\theta^{\mu_t} \qquad (13)$$

and

$$\theta^{Q_t} \leftarrow \xi \theta^{Q} + (1-\xi)\theta^{Q_t}, \qquad (14)$$

respectively, where $\xi$ denotes the soft update parameter.
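As an illustration, the soft update in (13)–(14) amounts to Polyak averaging over the parameter arrays of each layer. The snippet below is a sketch using NumPy arrays as stand-ins for network weights; it is not tied to any particular deep learning framework.

```python
import numpy as np

def soft_update(target_params, online_params, xi=0.01):
    """Eqs. (13)-(14): target <- xi * online + (1 - xi) * target, per layer.
    A small xi keeps the target networks changing slowly relative to the
    online actor and critic."""
    return [xi * w + (1.0 - xi) * wt
            for wt, w in zip(target_params, online_params)]
```

Repeated application moves the target weights geometrically toward the online weights, which is why the targets are effectively updated "less frequently" than their counterparts.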

The replay buffer and exploration are two other important features of the DDPG algorithm. The replay buffer refers to the storage of past tuples $(s_t, a_t, r_t, s_{t+1})$ in a pool; these tuples are used to enhance the learning of the agent. Batch-sized sets of tuples are chosen randomly from the pool and passed on for updating the networks. Regarding exploration, the actor network is encouraged to explore its surroundings fully; to do so, noise is added to the actor network's output, which can be represented as

$$a(s_t) = \mu(s_t|\theta^{\mu}) + \Psi, \qquad (15)$$

where $\Psi$ represents the added noise.
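The replay buffer and the exploration step in (15) can be sketched as follows. The default capacity and batch size follow Table I, while the Gaussian noise standard deviation is an illustrative choice for $\Psi$, not a value from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Pool of past (s_t, a_t, r_t, s_{t+1}) tuples with uniform sampling."""
    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)   # oldest tuples are evicted first

    def store(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        return random.sample(self.pool, min(batch_size, len(self.pool)))

def explore(mu_action, noise_std=0.1):
    """Eq. (15): perturb the deterministic actor output with Gaussian noise."""
    return mu_action + random.gauss(0.0, noise_std)
```

Uniform sampling from the pool breaks the temporal correlation between consecutive tuples, which stabilizes the critic's regression target.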

B. Problem Formulation into DDPG Framework

The DDPG algorithm is implemented for the above problem by defining the state space, action space, and reward as follows:

1) State Space: The state space is a tuple containing the channel gains and the energy-constrained sensor's available energy, represented as

$$s_t = \left[E_t,\; |h_{j_t}|^2,\; |h_0|^2,\; |h_{j_t,0}|^2\right]^{T}. \qquad (16)$$

2) Action Space: The action space contains a single parameter, $\bar{E}_t$. The maximum and minimum values attained by $\bar{E}_t$ are given by

$$-\min\{T P_{sm}, E_t\} \leq \bar{E}_t \leq \min\{E_m - E_t,\; T\eta P_{j_t}|h_{j_t,0}|^2\}, \qquad (17)$$

where the lower bound corresponds to $\tau_t = 1$, i.e., no energy harvesting but transmission only, and is also limited by the energy available at the start of time slot $T_t$. The upper bound on $\bar{E}_t$ corresponds to $\tau_t = 0$, i.e., no transmission but energy harvesting only, and reflects the finite amount of energy that can be gathered during time slot $T_t$.

TABLE I
SIMULATION PARAMETERS

Parameter | Symbol | Value
Actor network's learning rate | α_a | 0.002
Critic network's learning rate | α_c | 0.005
Batch size | B | 64 tuples
Memory capacity | R | 10000
Noise spectral density | σ_o | -190 dBm
Signal bandwidth | W_s | 10 MHz
Maximum battery capacity | E_m | 0.2 J
Maximum transmit power | P_sm | 23 dBm
Circuit power | P_c | 15 dBm
Energy harvesting efficiency | η | 0.9
Time slot duration | T | 1 s
Discount factor | γ | 0.99
Center frequency | f_c | 914 MHz
Soft update parameter | ξ | 0.01

Since (17) can assume a much larger or much smaller value, these values can be bounded between 0 and 1; hence, $\bar{E}_t$ is normalized as follows:

$$\bar{E}_t = \zeta_t \min\left\{E_m - E_t,\; T\eta P_{j_t}|h_{j_t,0}|^2\right\} - (1 - \zeta_t)\min\left\{T P_{sm},\; E_t\right\}. \qquad (18)$$

According to (18), the action parameter for the DDPG algorithm is $\zeta_t$, where $\zeta_t \in [0, 1]$.
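The normalization in (18) amounts to mapping the network output $\zeta_t \in [0, 1]$ linearly onto the interval in (17). A minimal sketch, with the channel gain passed as a squared magnitude:

```python
def action_to_energy(zeta, E_t, E_m, T, eta, P_j, g_j0, P_sm):
    """Eq. (18): map zeta in [0, 1] to E_bar within the bounds of Eq. (17)."""
    upper = min(E_m - E_t, T * eta * P_j * g_j0)  # tau_t = 0: harvesting only
    lower = min(T * P_sm, E_t)                    # tau_t = 1: transmission only
    return zeta * upper - (1.0 - zeta) * lower
```

With `zeta = 1` the sensor harvests as much as the slot and battery headroom allow; with `zeta = 0` it spends as much stored energy as transmission permits.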

3) Reward: The reward is the EE achieved by the secondary sensor, i.e., $\hat{\Gamma}_{\mathrm{EE}}$.

IV. SIMULATION RESULTS AND ANALYSIS

In this section, we provide a performance analysis of the system model defined in Sec. II. We benchmark the performance of the DDPG algorithm against the random and greedy methods. In these benchmark methods, the transmit power of the energy-constrained sensor is fixed at $P_{sm}$; however, the selection of the time-sharing coefficient, $\tau_t$, differs. In the random algorithm, $\tau_t$ is chosen uniformly between 0 and $\min\{1, \frac{E_t}{T P_{sm}}\}$, whereas in the greedy algorithm, $\tau_t$ is set to $\min\{1, \frac{E_t}{T P_{sm}}\}$.
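The two benchmark policies reduce to one-line rules for the time-sharing coefficient. An illustrative sketch:

```python
import random

def tau_greedy(E_t, T, P_sm):
    """Greedy benchmark: spend all available energy at full power P_sm."""
    return min(1.0, E_t / (T * P_sm))

def tau_random(E_t, T, P_sm):
    """Random benchmark: draw tau_t uniformly from [0, min(1, E_t/(T*P_sm))]."""
    return random.uniform(0.0, tau_greedy(E_t, T, P_sm))
```

Both rules respect constraint (1), since transmitting at $P_{sm}$ for $\tau_t T$ seconds never consumes more than the stored energy $E_t$.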

A. Simulation Environment Setup and Parameter Selection

In our simulations, we assume that the BS is located at the origin of the x-y plane, i.e., (0, 0), and we consider large-scale path loss while ignoring small-scale random fading. Neural networks, each having two hidden layers, are used for both the actor and critic networks. In the actor network, the activation function used in the two hidden layers is the rectified linear unit (ReLU), whereas the output layer's activation function is the hyperbolic tangent. In the critic network, the ReLU activation function is used in all hidden layers. Further details of the fixed parameters chosen for the simulations are listed in Table I.

B. Results Analysis

In this section, we present a performance analysis of the DDPG scheme in comparison with the benchmark schemes, i.e., the greedy and random algorithms.


Fig. 2. Energy efficiency (bits/J) of the energy-constrained sensor versus the number of episodes for the three algorithms.

1) EE comparison against Episodes: Fig. 2 shows the comparison of episodic rewards, in terms of EE, for the DDPG algorithm and the benchmark schemes against the number of episodes. It can be observed that DDPG achieves higher rewards than the greedy and random techniques. Additionally, the DDPG algorithm almost converges after 40 episodes, with only marginal improvement in the episodic reward after that point. For clarity, a magnified view of the performance of the random and greedy algorithms is also provided in Fig. 2.

2) EE comparison against Path Loss: To evaluate the performance of the DDPG algorithm, the EE of all three schemes is plotted in Fig. 3 for various values of the path loss exponent. In this setup, the two primary sensors are located at (0 m, 1000 m) and (0 m, 1 m), respectively. The maximum transmit power of the primary sensors is fixed at $P_{um} = 30$ dBm, and the power consumed by the RF circuitry is assumed to be $P_c = 15$ dBm. It can be observed that the DDPG-based algorithm outperforms both the random and the greedy approach. This may appear contradictory, since increasing the path loss exponent usually increases energy consumption because of the denser environment assumed. However, the increase in EE arises because the throughput of the secondary sensor depends on the transmit power of the primary sensors; thus, when the path loss exponent increases, the signal of the primary sensor located at (0 m, 1 m) is more strongly attenuated than that of the secondary sensor. This benefits the secondary sensor in achieving higher EE as the path loss exponent increases.

3) EE comparison against Transmit Power of Primary Sensors: The comparison of EE against the transmit power of the primary sensors is shown in Fig. 4. Once again, the DDPG algorithm outperforms the random and greedy algorithms. In this setup, the path loss exponent is set to $n = 3$; the two primary sensors, assisting the secondary sensor, are located at (0 m, 1000 m) and (0 m, 1 m), respectively, in the x-y plane, while the secondary sensor is located at (1 m, 1 m). The power consumed by the RF circuitry


Fig. 3. Energy efficiency comparison of the three algorithms against the path loss exponent.


Fig. 4. Energy efficiency comparison of the three algorithms against the transmit power of primary sensors.

is assumed to be $P_c = 15$ dBm.

We can observe that increasing the transmit power of the primary sensors does not raise the EE of the secondary sensor by much; it exhibits nearly constant behavior. In the case of the DDPG algorithm, this is because any increase in the data-rate expression due to higher transmit power is offset by a corresponding decrease at the same time, hence the constant trend.

4) EE comparison against Distance and Circuit Power: The combined effect of distance and circuit power on the DDPG and random algorithms is depicted in Fig. 5(a) and Fig. 5(b). The distance of the secondary sensor from the BS and primary sensors is presented on the y-axis, and the power consumed by the internal circuitry of the secondary sensor on the x-axis. In this setting, the path loss exponent is assumed to be $n = 3$, the maximum transmit power of the primary sensors is $P_{um} = 30$ dBm, and the two primary sensors are located, in the x-y plane, at (0 m, 1000 m) and (0 m, 1 m), respectively. One can observe a decrease in the EE of the secondary sensor as both variables increase. In other words, as the energy-constrained sensor moves away from the primary sensors along the x-axis, more energy is required by the secondary sensor for its transmissions, hence its EE is reduced. The decrease in EE of the secondary sensor against its circuit


Fig. 5. Energy efficiency of the energy-constrained sensor against distance and circuit power: (a) DDPG algorithm, (b) random algorithm.

power can also be observed to decline as the circuit power increases. This is because an increase in the circuit power of the secondary sensor increases the total power required to transmit data, which reduces the EE of the secondary sensor.

ACKNOWLEDGMENT

This work was supported by the Swedish Knowledge Foundation (KKS) research profile NIIT.

V. CONCLUSION

This paper studied the uplink performance of an energy-constrained secondary sensor in a CR-NOMA-assisted IoT network. We mathematically modeled and formulated the EE maximization problem of the secondary sensor, which was solved using a DRL framework, i.e., the DDPG algorithm. Moreover, we analyzed and compared the obtained simulation results with the benchmark algorithms, i.e., greedy and random. The simulation results demonstrated that the considered DDPG algorithm outperforms the selected benchmark algorithms in the EE metric. We observed that the EE curve for the DDPG algorithm converged after approximately 40 episodes, while high EE performance was maintained under harsher and more diverse environmental conditions. Similarly, the results demonstrated that increasing the transmit power of the primary sensors in CR-assisted NOMA transmission leads to improved EE of the secondary sensor with DDPG. We also examined the combined effect of separation distance and circuit power, which can be helpful from a system design perspective. In future work, the model can be extended to analyze the EE of multiple energy-constrained sensors in a CR-NOMA network.

REFERENCES

[1] Y. B. Zikria, R. Ali, M. K. Afzal, and S. W. Kim, "Next-generation Internet of things (IoT): Opportunities, challenges, and solutions," Sensors, vol. 21, no. 4, p. 1174, 2021.

[2] S. Zeb, A. Mahmood, et al., "Analysis of beyond 5G integrated communication and ranging services under indoor 3-D mmWave stochastic channels," IEEE Transactions on Industrial Informatics, vol. 18, no. 10, pp. 7128–7138, 2022.

[3] S. Zeb et al., "Industry 5.0 is coming: A survey on intelligent nextG wireless networks as technological enablers," arXiv preprint arXiv:2205.09084, 2022.

[4] S. Zeb, M. A. Rathore, et al., "Edge intelligence in softwarized 6G: Deep learning-enabled network traffic predictions," in IEEE Globecom Workshops (GC Wkshps), pp. 1–6, 2021.

[5] G. G. de Oliveira Brante, M. T. Kakitani, and R. D. Souza, "Energy efficiency analysis of some cooperative and non-cooperative transmission schemes in wireless sensor networks," IEEE Transactions on Communications, vol. 59, no. 10, pp. 2671–2677, 2011.

[6] A. W. Nazar, S. A. Hassan, H. Jung, A. Mahmood, and M. Gidlund, "BER analysis of a backscatter communication system with non-orthogonal multiple access," IEEE Transactions on Green Communications and Networking, vol. 5, no. 2, pp. 574–586, 2021.

[7] S. Zeb et al., "Industrial digital twins at the nexus of nextG wireless networks and computational intelligence: A survey," Journal of Network and Computer Applications, vol. 200, p. 103309, 2022.

[8] B. Matthiesen, A. Zappone, et al., "A globally optimal energy-efficient power control framework and its efficient implementation in wireless interference networks," IEEE Transactions on Signal Processing, vol. 68, pp. 3887–3902, 2020.

[9] N. Rubab et al., "Interference mitigation in RIS-assisted 6G systems for indoor industrial IoT networks," in IEEE 12th Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 211–215, 2022.

[10] A. Mahmood et al., "Industrial IoT in 5G-and-beyond networks: Vision, architecture, and design trends," IEEE Transactions on Industrial Informatics, vol. 18, no. 6, pp. 4122–4137, 2022.

[11] F. Jameel et al., "NOMA-enabled backscatter communications: Toward battery-free IoT networks," IEEE Internet of Things Magazine, vol. 3, no. 4, pp. 95–101, 2020.

[12] Z. Ding, R. Schober, and H. V. Poor, "No-pain no-gain: DRL assisted optimization in energy-constrained CR-NOMA networks," IEEE Transactions on Communications, vol. 69, no. 9, pp. 5917–5932, 2021.

[13] S. Zeb, Q. Abbas, et al., "NOMA enhanced backscatter communication for green IoT networks," in 16th International Symposium on Wireless Communication Systems, pp. 640–644, 2019.

[14] L. Li, H. Xu, J. Ma, A. Zhou, and J. Liu, "Joint EH time and transmit power optimization based on DDPG for EH communications," IEEE Communications Letters, vol. 24, no. 9, pp. 2043–2046, 2020.

[15] G. Y. Li, Z. Xu, C. Xiong, C. Yang, S. Zhang, Y. Chen, and S. Xu, "Energy-efficient wireless communications: Tutorial, survey, and open issues," IEEE Wireless Communications, vol. 18, no. 6, pp. 28–35, 2011.

[16] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in International Conference on Machine Learning, pp. 387–395, 2014.

[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.