
Deep RL-assisted Energy Harvesting in CR-NOMA

Communications for NextG IoT Networks

Syed Asad Ullah∗, Shah Zeb∗, Aamir Mahmood†, Syed Ali Hassan∗, and Mikael Gidlund†

∗School of Electrical Engineering & Computer Science (SEECS),

National University of Sciences & Technology (NUST), 44000 Islamabad, Pakistan.

†Department of Information Systems & Technology, Mid Sweden University, 851 70 Sundsvall, Sweden.

Email: ∗{sullah.phdee21seecs, szeb.dphd19seecs, ali.hassan}@seecs.edu.pk, †{firstname.lastname}@miun.se

Abstract—Zero-energy radios in energy-constrained devices are envisioned as key enablers for realizing next-generation Internet-of-things (NG-IoT) networks for ultra-dense sensing and monitoring. This paper presents analytical modeling and analysis of the energy-efficient uplink transmission of an energy-constrained secondary sensor operating opportunistically among several primary sensors. The considered scenario assumes that all primary sensors transmit in a round-robin, time division multiple access-based scheme, and the secondary sensor is admitted into the time slot of each primary sensor using a cognitive radio-inspired non-orthogonal multiple access technique. The energy efficiency of the secondary sensor is maximized using a deep reinforcement learning algorithm known as the deep deterministic policy gradient (DDPG). Our results demonstrate that the DDPG-based transmission scheme outperforms the conventional random and greedy algorithms in terms of energy efficiency under different operating conditions.

Index Terms—Next-generation Internet-of-things (NG-IoT), non-orthogonal multiple access (NOMA), deep deterministic policy gradient (DDPG), energy efficiency (EE).

I. INTRODUCTION

The provision of energy-efficient wireless connectivity is becoming vital to realizing next-generation Internet-of-things (NG-IoT) networks. IoT devices usually have constrained power supplies, mandating the design of energy-efficient radios and optimized communication protocols to reduce energy consumption. In this respect, zero-energy radios are envisioned to enable ultra-dense connectivity for numerous application areas, including smart industries, smart healthcare, smart agriculture, and smart cities [1], [2]. Such radios are expected to increase the scale of sensing and monitoring without requiring operators to charge or replace batteries. Hence, the goal of NG-IoT networks is to ensure energy-efficient communication while meeting the sustainable development goals (SDGs) and the operational expenditures (OPEX) of the communication network [3], [4].

With the ever-growing size of IoT networks, maintaining the lifetime of energy-constrained sensors becomes difficult. Particularly when sensors are implanted in unreachable places, traditional battery-based solutions are impractical due to the high cost of battery replacement and recycling issues. Therefore, numerous radio frequency (RF)-based energy harvesting and green communication techniques are being investigated to address this challenge [5], [6]. In the harvest-then-transmit model, energy-constrained sensors may need to switch from transmitting to harvesting, or vice versa, depending on various dynamic factors, including battery capacity, channel conditions, transmit power, and circuit power [7]–[9]. Under these dynamics, autonomous and intelligent decision-making and optimization techniques are necessary, for which deep reinforcement learning (DRL)-based strategies are gaining momentum [10].

Nevertheless, serving multiple energy-constrained sensors remains a challenging task due to spectrum limitations. The challenge of limited spectrum can be addressed by adopting cognitive radio-inspired non-orthogonal multiple access (CR-NOMA), a prominent multiple access technique that allows multiple uplink users to be multiplexed together and served concurrently [11]–[13].

To provide energy- and spectrum-efficient communication, optimal energy harvesting and CR-NOMA-based transmission methods are being investigated in the literature. The work in [14] addressed a long-term throughput maximization problem for a point-to-point network and applied the deep deterministic policy gradient (DDPG) algorithm to achieve this goal. The authors of [12] looked into the throughput maximization problem in an extended uplink scenario where one unlicensed user uses the NOMA approach to transmit data during a licensed user's time slot. To the best of our knowledge, energy efficiency maximization and its analysis for an energy-constrained sensor in a CR-NOMA-assisted NG-IoT network have not been addressed yet.

In this work, we mathematically model the uplink transmission of an energy-constrained sensor operating in a CR-NOMA-assisted NG-IoT network while maintaining a reasonable quality of service (QoS), and provide its energy consumption analysis. A DRL-based approach is implemented to maximize the energy efficiency (EE) of the energy-constrained IoT sensor operating among several primary sensors in a round-robin time division multiple access (TDMA) scheme. The contributions of this paper are listed as follows.

• We formulate the energy efficiency metric for an energy-constrained sensor in a CR-NOMA-assisted IoT network and optimize it using the DDPG algorithm.

• We present an analysis of energy efficiency for different parameters, including the path loss exponent, distance, and circuit power, and compare the results with existing benchmark schemes, such as the greedy and random algorithms.

Fig. 1. System model diagram for uplink communication in an NG-IoT network.

The remainder of the paper is structured as follows. The system model is presented in Sec. II. Sec. III formulates our problem within the DDPG framework, and Sec. IV explores the simulation results. Finally, Sec. V concludes the paper.

II. SYSTEM MODEL

We consider an uplink communication scenario as shown in Fig. 1. There are $N$ primary users (e.g., sensors), denoted by $U_j$ for $j \in \{1, \cdots, N\}$, a base station (BS), and an energy-constrained secondary sensor, represented by $U_0$, which can harvest energy from the primary sensors when they transmit. The channel gain of the secondary sensor is denoted by $h_0$, and those of the primary sensors by $h_j$. The channel between the secondary sensor and the respective primary sensor is given by $h_{j,0}$. All primary sensors transmit based on TDMA round-robin scheduling, assisted by CR-NOMA, with a fixed slot duration $T$, and the transmission continues for a long time ($NT$) so that each primary sensor can transmit at least once.

1) CR-NOMA-enhanced scheme: For transmitting data, the energy-constrained sensor is admitted into the time slot of each primary sensor via CR-NOMA. Within each time slot $T$, the first $\tau_t T$ seconds are used by the secondary sensor for transmitting data, and the remaining time, $(1-\tau_t)T$, for harvesting energy, where $\tau_t$ denotes the time-sharing coefficient and assumes a value between 0 and 1. The following assumptions are considered in this scenario: i) the secondary sensor is aware of the channel state information of each primary sensor scheduled in that particular time slot $T$, and ii) the battery of the energy-constrained sensor is assumed to be full at the start of the communication. With these assumptions, the transmit power of the secondary sensor is constrained as

$$\tau_t T P_{0,t} \leq E_t, \qquad (1)$$

where $E_t$ denotes the energy currently in the battery of the secondary sensor at time $t$ and $P_{0,t}$ represents its transmit power at time $t$. Similarly, the energy accumulated by the secondary sensor at the start of time slot $t+1$ is given by

$$E_{t+1} = \min\left\{E_t + (1-\tau_t) T \eta P_{j_t} |h_{j_t,0}|^2 - \tau_t T P_{0,t},\; E_m\right\}, \qquad (2)$$

which fulfills the condition of no energy overflow. In (2), $E_m$ represents the secondary sensor's maximum battery capacity, $P_{j_t}$ represents the power received from the $j$-th transmitting sensor at time $t$, $\eta$ is the energy harvesting efficiency coefficient, and $h_{j_t,0}$ represents the channel between the secondary sensor and the $j$-th primary sensor at time $t$. Therefore, the EE of the secondary sensor can be defined as [15]

$$\hat{\Gamma}_{\mathrm{EE}} = \frac{\sum_{t=1}^{M} R_t(\tau_t, P_{0,t})}{P_T}, \qquad (3)$$

where $R_t(\tau_t, P_{0,t}) = \tau_t \log_2\!\left(1 + \frac{P_{0,t}|h_0|^2}{1 + P_{j_t}|h_{j_t}|^2}\right)$ and $P_T = P_c + P_{0,t}$, with $P_c$ representing the circuit power consumed by the internal circuitry of the secondary sensor. The expression for $R_t$ assumes that the BS first performs successive interference cancellation (SIC) and can correctly decode the signal from the secondary sensor. After the BS eliminates the secondary sensor's decoded signal, the signals of the primary sensors can be decoded.
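The per-slot dynamics in (1)–(3) can be sketched in code. The following is an illustrative Python sketch (not the paper's implementation); channel gains are passed as squared magnitudes $|h|^2$, and the function and argument names mirror the symbols in the equations above.

```python
import math

def battery_update(E_t, tau_t, T, eta, P_j, g_j0, P0_t, E_m):
    """Next-slot battery level per Eq. (2): harvest for (1 - tau_t)*T,
    transmit for tau_t*T, with no overflow beyond capacity E_m."""
    harvested = (1 - tau_t) * T * eta * P_j * g_j0   # g_j0 = |h_{j_t,0}|^2
    consumed = tau_t * T * P0_t                      # Eq. (1) requires consumed <= E_t
    return min(E_t + harvested - consumed, E_m)

def slot_rate(tau_t, P0_t, g0, P_j, g_j):
    """Per-slot rate R_t: the secondary signal is decoded first via SIC,
    so the primary transmission appears as interference."""
    return tau_t * math.log2(1 + P0_t * g0 / (1 + P_j * g_j))

def energy_efficiency(rate, P0_t, P_c):
    """EE per Eq. (3) for a single slot: rate over total power P_T = P_c + P0_t."""
    return rate / (P_c + P0_t)
```

A sensor that only harvests (`tau_t = 0`) accumulates energy until it hits the cap `E_m`, reproducing the no-overflow condition of (2).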

A. Problem Formulation

Our goal is to maximize EE; therefore, (3) can be formulated as a maximization problem:

$$\begin{aligned}
\max_{\tau_t, P_{0,t}} \quad & f_o(\tau_t, P_{0,t}) \\
\text{s.t.} \quad \mathrm{C1}:\; & f_1(P_{0,t}, \tau_t) = \min\{E_m, Q\}, \\
\mathrm{C2}:\; & f_2(P_{0,t}, \tau_t) \leq 0, \\
\mathrm{C3}:\; & 0 \leq f_3(\tau_t) \leq 1, \\
\mathrm{C4}:\; & 0 \leq f_4(P_{0,t}) \leq P_{sm},
\end{aligned} \qquad (4)$$

where $P_{sm}$ is the maximum transmit power of the secondary sensor, $f_o(\tau_t, P_{0,t}) = \hat{\Gamma}_{\mathrm{EE}}(\tau_t, P_{0,t})$, $f_1(P_{0,t}, \tau_t) = E_{t+1}$, $f_2(P_{0,t}, \tau_t) = \tau_t T P_{0,t} - E_t$, $f_3(\tau_t) = \tau_t$, $f_4(P_{0,t}) = P_{0,t}$, and $Q = E_t + (1-\tau_t)T\eta P_{j_t}|h_{j_t,0}|^2 - \tau_t T P_{0,t}$. Constraint C1 expresses the battery energy level of the secondary sensor at time $t+1$, where the amount of harvested energy cannot exceed its maximum battery capacity. C2 is the difference between the energy consumed and the energy available at time $t$, which ensures the non-negativity of C1. C3 limits the value of the time-sharing coefficient between 0 and 1. Finally, C4 states that the transmit power of the secondary sensor can assume a value between 0 and $P_{sm}$.

Problem (4) is non-convex because C1 is not an affine function and the two optimization variables appear as a product in C2. However, because the optimization variables are continuous, problem (4) can be solved using the DDPG algorithm. Problem (4) is first divided into two sub-problems, since the range of values of the optimization variables makes a direct implementation of DDPG challenging. The first sub-problem is defined as

The ﬁrst sub-problem is deﬁned as

max

τt,P0,t

fo(τt, P0,t)

s.t. C1 : ˆ

f1(P0,t, τt) = 0,

C2,C3,C4in (4),

(5)

where $\hat{f}_1(P_{0,t}, \tau_t) = (1-\tau_t)T\eta P_{j_t}|h_{j_t,0}|^2 - \tau_t T P_{0,t} - \bar{E}_t$ and $\bar{E}_t = (1-\tau_t)T\eta P_{j_t}|h_{j_t,0}|^2 - \tau_t T P_{0,t}$ denotes the energy fluctuation parameter. Problem (5) is solved by convex optimization, where closed-form expressions are obtained for a given $\bar{E}_t$. The corresponding closed-form expressions are given as [12]

$$P^{*}_{0,t}(\bar{E}_t) = \frac{(1-\tau^{*}_t)\,\eta P_{j_t}|h_{j_t,0}|^2}{\tau^{*}_t} - \frac{\bar{E}_t}{\tau^{*}_t T},$$

and

$$\tau^{*}_t(\bar{E}_t) = \min\{1, \max\{x^{*}, \Omega_0\}\},$$

where $\Omega_0 = \max\!\left\{1 - \dfrac{E_t + \bar{E}_t}{T\eta P_{j_t}|h_{j_t,0}|^2},\; \dfrac{T\eta P_{j_t}|h_{j_t,0}|^2 - \bar{E}_t}{T\eta P_{j_t}|h_{j_t,0}|^2 + T P_{sm}}\right\}$, $x^{*} = \dfrac{x_1 - x_2}{e^{W_0(e^{-1}(x_1 - 1)) + 1} - 1 + x_1}$, $x_1 = \dfrac{\eta P_{j_t}|h_{j_t,0}|^2 |h_0|^2}{1 + P_{j_t}|h_{j_t}|^2}$, $x_2 = \dfrac{\bar{E}_t |h_0|^2}{T(1 + P_{j_t}|h_{j_t}|^2)}$, and $W_0(\cdot)$ represents the Lambert W function.
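As a numerical sanity check, the closed-form expressions above can be evaluated directly. The sketch below is an illustrative Python implementation under the paper's notation; the Lambert W evaluation uses a simple Newton iteration rather than a library call, and reading the maximum power in $\Omega_0$ as $P_{sm}$ is an assumption based on the surrounding definitions.

```python
import math

def lambert_w0(z, tol=1e-12):
    """Principal branch W0(z) via Newton's method; intended for z > -1/e
    away from the branch point."""
    w = 0.0 if z < 1.0 else math.log(z)
    for _ in range(200):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def optimal_tau_power(E_bar, E_t, T, eta, P_j, g_j0, g_j, g0, P_sm):
    """Evaluate tau*_t and P*_{0,t} for a given energy-fluctuation action E_bar.
    g_j0 = |h_{j_t,0}|^2, g_j = |h_{j_t}|^2, g0 = |h_0|^2."""
    x1 = eta * P_j * g_j0 * g0 / (1.0 + P_j * g_j)
    x2 = E_bar * g0 / (T * (1.0 + P_j * g_j))
    x_star = (x1 - x2) / (math.exp(lambert_w0(math.exp(-1) * (x1 - 1.0)) + 1.0)
                          - 1.0 + x1)
    h = T * eta * P_j * g_j0                      # energy harvestable in a full slot
    omega0 = max(1.0 - (E_t + E_bar) / h, (h - E_bar) / (h + T * P_sm))
    tau = min(1.0, max(x_star, omega0))
    P0 = (1.0 - tau) * eta * P_j * g_j0 / tau - E_bar / (tau * T)
    return tau, P0
```

The `min/max` clipping guarantees $\tau^{*}_t \in [0, 1]$, matching constraint C3 of problem (4).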

The second sub-problem is defined as follows. As our goal is to maximize EE, from (5) we observe that the EE, $\hat{\Gamma}_{\mathrm{EE}}$, at time $t$ does not depend on $\tau_{\hat{t}}$ and $P_{0,\hat{t}}$ for $t \neq \hat{t}$. Hence, the optimization problem (4) can be reformulated as a function of $\bar{E}_t$ within the DDPG framework, which is given as

$$\begin{aligned}
\max_{\bar{E}_t} \quad & \gamma^{t-1}\, \hat{\Gamma}_{\mathrm{EE}}\!\left(\bar{E}_t \,\middle|\, \tau^{*}_t, P^{*}_{0,t}\right) \\
\text{s.t.} \quad & E_{t+1} = \min\{E_m, E_t + \bar{E}_t\},
\end{aligned} \qquad (6)$$

where $\gamma$ represents the discount factor and assumes a value between 0 and 1. From problem (6), it can be seen that the action of the energy-constrained sensor is to choose $\bar{E}_t$ for given $\tau^{*}_t$ and $P^{*}_{0,t}$. By substituting the expression of $\hat{\Gamma}_{\mathrm{EE}}$ into (6), we get the maximization problem

$$\begin{aligned}
\max_{\bar{E}_t} \quad & \frac{\sum_{t=1}^{M} \gamma^{t-1}\, \tau^{*}_t(\bar{E}_t) \log_2\!\left(1 + \frac{P^{*}_{0,t}(\bar{E}_t)|h_0|^2}{1 + P_{j_t}|h_{j_t}|^2}\right)}{P_T} \\
\text{s.t.} \quad & E_{t+1} = \min\{E_m, E_t + \bar{E}_t\}.
\end{aligned} \qquad (7)$$

It can be observed that the above objective is a continuous univariate function of $\bar{E}_t$. This makes problem (7) well suited to be solved by the DDPG algorithm.

III. IMPLEMENTATION OF DRL ALGORITHM

In this section, we provide preliminaries of the DRL algorithm, i.e., DDPG, and formulate our problem within the DDPG framework.

A. Deep Deterministic Policy Gradient

DDPG, an actor-critic algorithm, is based on the deterministic policy gradient (DPG) and the Deep Q-Network (DQN) [16]. Deep Q-learning (DQL) becomes inefficient when the action and state spaces are continuous and high-dimensional; therefore, DDPG is well suited for such scenarios [17]. In a DRL setup, the agent (or observer) initially possesses zero knowledge about the environment. The agent learns the environment over time, as it continuously monitors its surroundings and learns how to maximize a reward signal using an optimal policy.

1) DDPG Framework: In the DDPG algorithm, at a particular time step $t$, the goal of the agent is to find an action $a_t$, for an observation $s_t$, that receives a reward $r_t$ and consequently maximizes the action-value function, represented by $Q(s_t, a_t)$. Accordingly, the maximization problem is given as

$$a^{*}_t(s_t) = \arg\max_{a_t} Q(s_t, a_t), \qquad (8)$$

where $Q(s_t, a_t)$ represents the expected return. The actor network (or policy network) takes the action, whereas the critic network (or Q-network) acts as an evaluator that assesses how good the action taken by the actor network is. The policy network is parameterized by $\theta^{\mu}$; it takes $s_t$ as input and produces an action, represented by $\mu(s_t|\theta^{\mu})$. The corresponding actor target network is parameterized by $\theta^{\mu_t}$ and outputs $\mu_t(s_t|\theta^{\mu_t})$. The critic network is parameterized by $\theta^{Q}$; it takes $s_t$ and $a_t$ as inputs and produces the action-value function, represented by $Q(s_t, a_t|\theta^{Q})$. The corresponding critic target network is parameterized by $\theta^{Q_t}$ and outputs $Q_t(s_t, a_t|\theta^{Q_t})$.

2) Network Updating Process: The actor network takes the action, while the other networks ensure that the actor network is properly trained by evaluating its output (action). Let us assume a tuple $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ represents the current state, $a_t$ represents the action the agent took according to the observed state, $r_t$ is the reward for the action taken, and $s_{t+1}$ represents the next state. Based on the above tuple, the network update process is given as follows.

1) The actor network is trained by maximizing the objective in (8). Using the parameters of the actor and critic networks, (8) can be reformulated as

$$J(\theta^{\mu}) = Q(s_t, a_t = \mu(s_t|\theta^{\mu}) \,|\, \theta^{Q}). \qquad (9)$$

Taking the gradient of (9) with respect to $\theta^{\mu}$, we get

$$\nabla_{\theta^{\mu}} J(\theta^{\mu}) = \nabla_{a_t} Q(s_t, a_t|\theta^{Q})\, \nabla_{\theta^{\mu}} \mu(s_t|\theta^{\mu}). \qquad (10)$$

2) Updating the critic network involves the two target networks: first, the output of the target actor network is fed to the target critic network, which produces the target value

$$y_t = r_t + \gamma Q_t(s_{t+1}, \mu_t(s_{t+1}|\theta^{\mu_t}) \,|\, \theta^{Q_t}). \qquad (11)$$

The critic network is then updated by minimizing the loss function

$$L(\theta^{Q}) = |y_t - Q(s_t, a_t|\theta^{Q})|^2. \qquad (12)$$

3) Using a soft target update, which assumes a very small value, the parameters of both the critic target network and the actor target network are updated. This is because both target networks are updated less frequently than their corresponding counterparts. The parameters are updated as

$$\theta^{\mu_t} \leftarrow \xi \theta^{\mu} + (1-\xi)\theta^{\mu_t} \qquad (13)$$

and

$$\theta^{Q_t} \leftarrow \xi \theta^{Q} + (1-\xi)\theta^{Q_t}, \qquad (14)$$

respectively, where $\xi$ denotes the soft update parameter.
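As an illustration, the soft update in (13)–(14) amounts to Polyak averaging over the parameter arrays of each layer. The snippet below is a sketch using NumPy arrays as stand-ins for network weights; it is not tied to any particular deep learning framework.

```python
import numpy as np

def soft_update(target_params, online_params, xi=0.01):
    """Eqs. (13)-(14): target <- xi * online + (1 - xi) * target, per layer.
    A small xi keeps the target networks changing slowly relative to the
    online actor and critic."""
    return [xi * w + (1.0 - xi) * wt
            for wt, w in zip(target_params, online_params)]
```

Repeated application moves the target weights geometrically toward the online weights, which is why the targets are effectively updated "less frequently" than their counterparts.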

The replay buffer and exploration are two other important features of the DDPG algorithm. The replay buffer refers to the storage of past tuples $(s_t, a_t, r_t, s_{t+1})$ in a pool; these tuples are used to enhance the learning of the agent. Batch-sized sets of tuples are chosen randomly from the pool and passed on for updating the networks. Regarding exploration, the actor network is encouraged to explore its surroundings fully; to do so, noise is added to the actor network's output, which can be represented as

$$a(s_t) = \mu(s_t|\theta^{\mu}) + \Psi, \qquad (15)$$

where $\Psi$ represents the added noise.
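The replay buffer and the exploration step in (15) can be sketched as follows. The default capacity and batch size follow Table I, while the Gaussian noise standard deviation is an illustrative choice for $\Psi$, not a value from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Pool of past (s_t, a_t, r_t, s_{t+1}) tuples with uniform sampling."""
    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)   # oldest tuples are evicted first

    def store(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        return random.sample(self.pool, min(batch_size, len(self.pool)))

def explore(mu_action, noise_std=0.1):
    """Eq. (15): perturb the deterministic actor output with Gaussian noise."""
    return mu_action + random.gauss(0.0, noise_std)
```

Uniform sampling from the pool breaks the temporal correlation between consecutive tuples, which stabilizes the critic's regression target.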

B. Problem Formulation into DDPG Framework

The DDPG algorithm is implemented for the above problem by defining the state space, action space, and reward as follows:

1) State Space: The state space is a tuple containing the channel gains and the energy-constrained sensor's available energy, represented as

$$s_t = \left[E_t,\; |h_{j_t}|^2,\; |h_0|^2,\; |h_{j_t,0}|^2\right]^{T}. \qquad (16)$$

2) Action Space: The action space contains a single parameter, $\bar{E}_t$. The maximum and minimum values attained by $\bar{E}_t$ are given by

$$-\min\{T P_{sm}, E_t\} \leq \bar{E}_t \leq \min\{E_m - E_t,\; T\eta P_{j_t}|h_{j_t,0}|^2\}, \qquad (17)$$

where the lower bound corresponds to $\tau_t = 1$, i.e., no energy harvesting but transmission only, and is also limited by the energy available at the start of time slot $T_t$. The upper bound on $\bar{E}_t$ corresponds to $\tau_t = 0$, i.e., no transmission but energy harvesting only, and reflects the finite amount of energy that can be gathered during time slot $T_t$.

TABLE I
SIMULATION PARAMETERS

Parameter | Symbol | Value
Actor network's learning rate | α_a | 0.002
Critic network's learning rate | α_c | 0.005
Batch size | B | 64 tuples
Memory capacity | R | 10000
Noise spectral density | σ_o | -190 dBm
Signal bandwidth | W_s | 10 MHz
Maximum battery capacity | E_m | 0.2 J
Maximum transmit power | P_sm | 23 dBm
Circuit power | P_c | 15 dBm
Energy harvesting efficiency | η | 0.9
Time slot duration | T | 1 s
Discount factor | γ | 0.99
Center frequency | f_c | 914 MHz
Soft update parameter | ξ | 0.01

Since (17) can assume a much larger or much smaller value, these values can be bounded between 0 and 1; hence, $\bar{E}_t$ is normalized as follows:

$$\bar{E}_t = \zeta_t \min\left\{E_m - E_t,\; T\eta P_{j_t}|h_{j_t,0}|^2\right\} - (1 - \zeta_t)\min\left\{T P_{sm},\; E_t\right\}. \qquad (18)$$

According to (18), the action parameter for the DDPG algorithm is $\zeta_t$, where $\zeta_t \in [0, 1]$.
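The normalization in (18) amounts to mapping the network output $\zeta_t \in [0, 1]$ linearly onto the interval in (17). A minimal sketch, with the channel gain passed as a squared magnitude:

```python
def action_to_energy(zeta, E_t, E_m, T, eta, P_j, g_j0, P_sm):
    """Eq. (18): map zeta in [0, 1] to E_bar within the bounds of Eq. (17)."""
    upper = min(E_m - E_t, T * eta * P_j * g_j0)  # tau_t = 0: harvesting only
    lower = min(T * P_sm, E_t)                    # tau_t = 1: transmission only
    return zeta * upper - (1.0 - zeta) * lower
```

With `zeta = 1` the sensor harvests as much as the slot and battery headroom allow; with `zeta = 0` it spends as much stored energy as transmission permits.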

3) Reward: The reward is the EE achieved by the secondary sensor, i.e., $\hat{\Gamma}_{\mathrm{EE}}$.

IV. SIMULATION RESULTS AND ANALYSIS

In this section, we provide a performance analysis of the system model defined in Sec. II. We benchmark the performance of the DDPG algorithm against the random and greedy methods. In these benchmark methods, the transmit power of the energy-constrained sensor is fixed at $P_{sm}$; however, the selection of the time-sharing coefficient, $\tau_t$, differs. In the random algorithm, $\tau_t$ is chosen uniformly between 0 and $\min\{1, \frac{E_t}{T P_{sm}}\}$, whereas in the greedy algorithm, $\tau_t$ is set to $\min\{1, \frac{E_t}{T P_{sm}}\}$.
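The two benchmark policies reduce to one-line rules for the time-sharing coefficient. An illustrative sketch:

```python
import random

def tau_greedy(E_t, T, P_sm):
    """Greedy benchmark: spend all available energy at full power P_sm."""
    return min(1.0, E_t / (T * P_sm))

def tau_random(E_t, T, P_sm):
    """Random benchmark: draw tau_t uniformly from [0, min(1, E_t/(T*P_sm))]."""
    return random.uniform(0.0, tau_greedy(E_t, T, P_sm))
```

Both rules respect constraint (1), since transmitting at $P_{sm}$ for $\tau_t T$ seconds never consumes more than the stored energy $E_t$.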

A. Simulation Environment Setup and Parameter Selection

In our simulations, we assume that the BS is located at the origin of the x-y plane, i.e., (0, 0), and we consider large-scale path loss while ignoring small-scale random fading. Neural networks, each having two hidden layers, are used for both the actor and critic networks. In the actor network, the activation function used in the two hidden layers is the rectified linear unit (ReLU), whereas the output layer's activation function is the hyperbolic tangent. In the critic network, the ReLU activation function is used in all hidden layers. Further details of the fixed parameters chosen for the simulations are listed in Table I.

B. Results Analysis

In this section, we present a performance analysis of the DDPG scheme in comparison with the benchmark schemes, i.e., the greedy and random algorithms.


Fig. 2. Energy efficiency (bits/J) of the energy-constrained sensor versus the number of episodes for the three algorithms.

1) EE comparison against Episodes: Fig. 2 shows the comparison of episodic rewards, in terms of EE, for the DDPG algorithm and the benchmark schemes against the number of episodes. It can be observed that DDPG achieves higher rewards than the greedy and random techniques. Additionally, the DDPG algorithm almost converges after 40 episodes, with only marginal improvement in the episodic reward after that point. For clarity, a magnified view of the performance of the random and greedy algorithms is also provided in Fig. 2.

2) EE comparison against Path Loss: To evaluate the performance of the DDPG algorithm, the EE of all three schemes is plotted in Fig. 3 for various values of the path loss exponent. In this setup, the two primary sensors are located at (0 m, 1000 m) and (0 m, 1 m), respectively. The maximum transmit power of the primary sensors is fixed at $P_{um} = 30$ dBm, and the power consumed by the RF circuitry is assumed to be $P_c = 15$ dBm. It can be observed that the DDPG-based algorithm outperforms both the random and the greedy approach. This may appear contradictory, since increasing the path loss exponent usually increases energy consumption because of the denser environment assumed. However, the increase in EE arises because the throughput of the secondary sensor depends on the transmit power of the primary sensors; thus, when the path loss exponent increases, the signal of the primary sensor located at (0 m, 1 m) is more strongly attenuated than that of the secondary sensor. This benefits the secondary sensor in achieving higher EE as the path loss exponent increases.

3) EE comparison against Transmit Power of Primary Sensors: The comparison of EE against the transmit power of the primary sensors is shown in Fig. 4. Once again, the DDPG algorithm outperforms the random and greedy algorithms. In this setup, the path loss exponent is set to $n = 3$; the two primary sensors, assisting the secondary sensor, are located at (0 m, 1000 m) and (0 m, 1 m), respectively, in the x-y plane, while the secondary sensor is located at (1 m, 1 m). The power consumed by the RF circuitry


Fig. 3. Energy efficiency comparison of the three algorithms against the path loss exponent.


Fig. 4. Energy efficiency comparison of the three algorithms against the transmit power of primary sensors.

is assumed to be $P_c = 15$ dBm.

We can observe that increasing the transmit power of the primary sensors does not raise the EE of the secondary sensor by much; it exhibits nearly constant behavior. In the case of the DDPG algorithm, this is because any increase in the data-rate expression due to higher transmit power is offset by a corresponding decrease at the same time, hence the constant trend.

4) EE comparison against Distance and Circuit Power: The combined effect of distance and circuit power on the DDPG and random algorithms is depicted in Fig. 5(a) and Fig. 5(b). The distance of the secondary sensor from the BS and primary sensors is presented on the y-axis, and the power consumed by the internal circuitry of the secondary sensor on the x-axis. In this setting, the path loss exponent is assumed to be $n = 3$, the maximum transmit power of the primary sensors is $P_{um} = 30$ dBm, and the two primary sensors are located, in the x-y plane, at (0 m, 1000 m) and (0 m, 1 m), respectively. One can observe a decrease in the EE of the secondary sensor as both variables increase. In other words, as the energy-constrained sensor moves away from the primary sensors along the x-axis, more energy is required by the secondary sensor for its transmissions, hence its EE is reduced. The decrease in EE of the secondary sensor against its circuit


Fig. 5. Energy efficiency of the energy-constrained sensor against distance and circuit power: (a) DDPG algorithm, (b) random algorithm.

power can also be observed to decline as the circuit power increases. This is because an increase in the circuit power of the secondary sensor increases the total power required to transmit data, which reduces the EE of the secondary sensor.

ACKNOWLEDGMENT

This work was supported by the Swedish Knowledge Foundation (KKS) research profile NIIT.

V. CONCLUSION

This paper studied the uplink performance of an energy-constrained secondary sensor in a CR-NOMA-assisted IoT network. We mathematically modeled and formulated the EE maximization problem of the secondary sensor, which was solved using a DRL framework, i.e., the DDPG algorithm. Moreover, we analyzed and compared the obtained simulation results with the benchmark algorithms, i.e., greedy and random. The simulation results demonstrated that the considered DDPG algorithm outperforms the selected benchmark algorithms in the EE metric. We observed that the EE curve for the DDPG algorithm converged after approximately 40 episodes, while high EE performance was maintained under harsher and more diverse environmental conditions. Similarly, the results demonstrated that increasing the transmit power of the primary sensors in CR-assisted NOMA transmission leads to improved EE of the secondary sensor with DDPG. We also examined the combined effect of separation distance and circuit power, which can be helpful from a system design perspective. In future work, the model can be extended to analyze the EE of multiple energy-constrained sensors in a CR-NOMA network.

REFERENCES

[1] Y. B. Zikria, R. Ali, M. K. Afzal, and S. W. Kim, "Next-generation Internet of things (IoT): Opportunities, challenges, and solutions," Sensors, vol. 21, no. 4, p. 1174, 2021.

[2] S. Zeb, A. Mahmood, et al., "Analysis of beyond 5G integrated communication and ranging services under indoor 3-D mmWave stochastic channels," IEEE Transactions on Industrial Informatics, vol. 18, no. 10, pp. 7128–7138, 2022.

[3] S. Zeb et al., "Industry 5.0 is coming: A survey on intelligent nextG wireless networks as technological enablers," arXiv preprint arXiv:2205.09084, 2022.

[4] S. Zeb, M. A. Rathore, et al., "Edge intelligence in softwarized 6G: Deep learning-enabled network traffic predictions," in IEEE Globecom Workshops (GC Wkshps), pp. 1–6, 2021.

[5] G. G. de Oliveira Brante, M. T. Kakitani, and R. D. Souza, "Energy efficiency analysis of some cooperative and non-cooperative transmission schemes in wireless sensor networks," IEEE Transactions on Communications, vol. 59, no. 10, pp. 2671–2677, 2011.

[6] A. W. Nazar, S. A. Hassan, H. Jung, A. Mahmood, and M. Gidlund, "BER analysis of a backscatter communication system with non-orthogonal multiple access," IEEE Transactions on Green Communications and Networking, vol. 5, no. 2, pp. 574–586, 2021.

[7] S. Zeb et al., "Industrial digital twins at the nexus of nextG wireless networks and computational intelligence: A survey," Journal of Network and Computer Applications, vol. 200, p. 103309, 2022.

[8] B. Matthiesen, A. Zappone, et al., "A globally optimal energy-efficient power control framework and its efficient implementation in wireless interference networks," IEEE Transactions on Signal Processing, vol. 68, pp. 3887–3902, 2020.

[9] N. Rubab et al., "Interference mitigation in RIS-assisted 6G systems for indoor industrial IoT networks," in IEEE 12th Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 211–215, 2022.

[10] A. Mahmood et al., "Industrial IoT in 5G-and-beyond networks: Vision, architecture, and design trends," IEEE Transactions on Industrial Informatics, vol. 18, no. 6, pp. 4122–4137, 2022.

[11] F. Jameel et al., "NOMA-enabled backscatter communications: Toward battery-free IoT networks," IEEE Internet of Things Magazine, vol. 3, no. 4, pp. 95–101, 2020.

[12] Z. Ding, R. Schober, and H. V. Poor, "No-pain no-gain: DRL assisted optimization in energy-constrained CR-NOMA networks," IEEE Transactions on Communications, vol. 69, no. 9, pp. 5917–5932, 2021.

[13] S. Zeb, Q. Abbas, et al., "NOMA enhanced backscatter communication for green IoT networks," in 16th International Symposium on Wireless Communication Systems, pp. 640–644, 2019.

[14] L. Li, H. Xu, J. Ma, A. Zhou, and J. Liu, "Joint EH time and transmit power optimization based on DDPG for EH communications," IEEE Communications Letters, vol. 24, no. 9, pp. 2043–2046, 2020.

[15] G. Y. Li, Z. Xu, C. Xiong, C. Yang, S. Zhang, Y. Chen, and S. Xu, "Energy-efficient wireless communications: Tutorial, survey, and open issues," IEEE Wireless Communications, vol. 18, no. 6, pp. 28–35, 2011.

[16] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in International Conference on Machine Learning, pp. 387–395, 2014.

[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.