Integrated Push-and-Pull Update Model for
Goal-Oriented Effective Communication
Pouya Agheli, Graduate Student Member, IEEE, Nikolaos Pappas, Senior Member, IEEE,
Petar Popovski, Fellow, IEEE, and Marios Kountouris, Fellow, IEEE.
Abstract—This paper studies decision-making for goal-oriented
effective communication. We consider an end-to-end status up-
date system where a sensing agent (SA) observes a source,
generates and transmits updates to an actuation agent (AA), while
the AA takes actions to accomplish a goal at the endpoint. We
integrate the push- and pull-based update communication models
to obtain a push-and-pull model, which allows the transmission
controller at the SA to decide to push an update to the AA and the
query controller at the AA to pull updates by raising queries at
specific time instances. To gauge effectiveness, we utilize a grade
of effectiveness (GoE) metric incorporating updates’ freshness,
usefulness, and timeliness of actions as qualitative attributes.
We then derive effect-aware policies to maximize the expected
discounted sum of updates’ effectiveness subject to induced
costs. The effect-aware policy at the SA considers the potential
effectiveness of communicated updates at the endpoint, while
at the AA, it accounts for the probabilistic evolution of the
source and importance of generated updates. Our results show
the proposed push-and-pull model outperforms models solely
based on push- or pull-based updates both in terms of efficiency
and effectiveness. Additionally, using effect-aware policies at
both agents enhances effectiveness compared to periodic and/or
probabilistic effect-agnostic policies at either or both agents.
Index Terms—Goal-oriented effective communication, status
update systems, push-and-pull model, decision-making.
I. INTRODUCTION
The emergence of cyber-physical systems empowered with
interactive and networked sensing and actuation/monitoring
agents has caused a shift in focus from extreme to sustainable
performance. Emerging networks aim to enhance effectiveness
in the system while substantially improving resource utiliza-
tion, energy consumption, and computational efficiency. The
key is to strive for a minimalist design, frugal in resources,
which can scale effectively rather than causing network over-
provisioning. This design philosophy has crystallized into the
goal-oriented and/or semantic communication paradigm, hold-
ing the potential to enhance the efficiency of diverse network
processes through a parsimonious usage of communication
and computation resources [2], [3]. Under an effectiveness
perspective, a message is generated and conveyed by a sender
if it has the potential to have the desirable effect or the right
impact at the destination, e.g., executing a critical action, for
P. Agheli and M. Kountouris are with the Communication Systems Dept.,
EURECOM, France, email: agheli@eurecom.fr. N. Pappas is with the
Dept. of Computer and Information Science, Linköping University, Sweden,
email: nikolaos.pappas@liu.se. P. Popovski is with the Dept. of Elec-
tronic Systems, Aalborg University, Denmark, email: petarp@es.aau.dk.
M. Kountouris is also with the Dept. of Computer Science and Artificial
Intelligence, University of Granada, Spain, email: mariosk@ugr.es. Part
of this work is presented in [1].
accomplishing a specific goal. This promotes system scalabil-
ity and efficient resource usage by avoiding the acquisition,
processing, and transportation of information that turns out to
be ineffective, irrelevant, or useless.
Messages, e.g., in the form of status update packets, are
communicated over existing networked intelligent systems
mostly using a push-based communication model. Therein,
packets arriving at the source are sent to the destination based
on decisions made by the source, regardless of whether or
not the endpoint has requested or plans to utilize these
updates to accomplish a goal. In contrast, in a pull-based
model, the endpoint decides to trigger and requests packet
transmissions from the source and controls the time and the
type of generated updates [4]–[10]. Nevertheless, this model
does not consider the availability of the source to generate
updates or the usefulness of those updates. To overcome these
limitations, we propose an integrated push-and-pull model
that involves both agents/sides in the decision-making process,
thereby combining push- and pull-based paradigms in a way
that mitigates their drawbacks. In either model, decisions at
the source or endpoint could influence the effectiveness of
communicated updates. Therefore, we can categorize decision
policies into effect-aware and effect-agnostic. Under an effect-
aware policy, the source adapts its decisions by taking into ac-
count the effects of its communicated packets at the endpoint.
Likewise, the endpoint raises queries based on the evolution
of the source and the expected importance of pulled updates.
Under the effect-agnostic policy, however, decisions are made
regardless of their consequent effect on the performance.
In this work, we study a time-slotted end-to-end status
update system where a sensing agent observes an information
source and communicates updates/observations in the form of
packets with an actuation agent. The actuation agent then takes
actions based on the successfully received updates as a means
to accomplish a subscribed goal at the endpoint. We develop
an integrated push-and-pull model, which allows both agents
to make decisions based on their local policies or objectives.
In particular, a transmission controller at the sensing agent
decides to either send or drop update packets according to
their potential usefulness at the endpoint. On the other side, a
query controller at the actuation agent also determines the time
instances around which the actuator should perform actions in
the form of raising queries. In that sense, effective updates
are those that result in the right impact and actuation at the
endpoint. Those queries, however, are not communicated to
the sensing agent. Instead, the actuation agent notifies the
sensing agent of effective updates via an acknowledgment. With prior knowledge
that the effectiveness of updates depends on the actuator's availability to perform actions, the sensing agent can infer the raised queries using those acknowledgments. A time diagram that shows the processes at both agents is depicted in Fig. 1.

Fig. 1. A time diagram of processes involving the sensing and actuation agents, illustrating interactions and update communications leading to actions.
We introduce a metric to measure the effectiveness and the
significance of updates and derive a class of optimal policies
for each agent that makes effect-aware decisions to maximize
the long-term expected effectiveness of update packets com-
municated to fulfill the goal subject to induced costs. To do
so, the agent first needs to estimate the necessary system
parameters for making the right decisions. Our analytical
and simulation results show that the integrated push-and-pull
model comes with a higher energy efficiency compared to
the push-based model and better effectiveness performance
compared to the pull-based one. Moreover, we observe that
utilizing effect-aware policies at both agents significantly
improves the effectiveness performance of the system in the
majority of the cases with a large gap compared to those of
periodic and probabilistic effect-agnostic policies at either or
both agents. Accordingly, we demonstrate that the solution for finding an optimal effect-aware policy at each agent converges to a threshold-based decision framework in which the agent can decide in a timely manner based on an individual lookup map and threshold boundaries computed to satisfy the goal.
A. Related Works
This paper substantially broadens prior work on push-based and
(query-) pull-based communications by enabling both agents
to make decisions so as to maximize the effectiveness of
communicated updates in the system. The pull-based communication model has been widely analyzed, e.g., in [4]–[10]. In [4], a new metric called effective age of
information (EAoI), which comprises the effects of queries
and the freshness of updates in the form of the AoI [11]–[13],
is introduced. Query AoI (QAoI), which is similar to the EAoI,
is utilized in [5]–[9]. Following the same concept as the QAoI,
on-demand AoI is introduced in [10], [14]. Probe (query)-
based active fault detection where actuation or monitoring
agents adaptively decide to probe sensing agents to detect
probable faults at the endpoint is studied in [15]. Most prior
work has employed a pull-based communication model and
focuses on the freshness and timeliness of information. In this
work, we consider multiple information attributes and propose
a grade of effectiveness metric to measure the effectiveness
of updates, which goes beyond existing metrics, including AoI, EAoI,
QAoI, on-demand AoI, Age of Incorrect Information (AoII)
[16], and Value of Information (VoI) [17], [18]. In particular,
we focus on the freshness of successfully received updates and
the timeliness of performed actions as two attributes of interest
through the link level, as well as on the usefulness/significance
(semantics) of the updates to fulfill the goal at the source level.
This paper extends our prior work [1], which only considers
a pull-based model and an effectiveness metric with two
attributes, namely freshness and usefulness. As such, [1] conveys a
special form of the decision problem we solve here. In this
work, we generalize the problem to a push-and-pull commu-
nication model, considering that the sensing and actuation
agents individually make decisions and converge to a point
where they can transmit updates and raise queries, respectively,
which maximize the effectiveness of updates and result in the
right impact at the endpoint. Importantly, we assume that the
source distribution is not known to the actuation agent, and
the sensing agent does not have perfect knowledge of the goal.
Therefore, the agents must estimate their required parameters
separately. This approach is substantially different from the
one in our previous work and other state-of-the-art approaches.
B. Contributions
The main contributions can be briefly outlined as follows.
•We develop an integrated push-and-pull update communication model in which both agents have decision-making roles in the acquisition and transmission of updates and take appropriate actions to satisfy the goal,
following the paradigm of goal-oriented communications.
With this, the system becomes adaptable from an effec-
tiveness viewpoint compared to the conventional push-
and pull-based models.
•We use a grade of effectiveness metric to capture the
timely impact of communicated updates at the endpoint,
which relies on the freshness of successfully communi-
cated updates, the timeliness of actions performed, and
the usefulness of those updates in fulfilling the goal.
Our approach maps multiple information attributes into
a unique metric that measures the impact or effect each
status update packet traveling over the network can offer.
•We obtain optimal model-based control policies for
agents that make effect-aware decisions to maximize the
discounted sum of updates’ effectiveness while keeping
the induced costs within certain constraints. To achieve
this, we formulate an optimization problem, derive its
dual form, and propose an iterative algorithm based on
dynamic programming to solve the decision problem
separately from each agent’s perspective.
•We demonstrate that the integrated push-and-pull model
offers higher energy efficiency than the push-based model
and better effectiveness performance compared to the
pull-based one. We also show that applying effect-aware
policies at both agents results in better performance than
in the scenarios where one or both agents utilize effect-agnostic policies. We also broaden our results by deriving model-free decisions using reinforcement learning. Eventually, we provide a lookup map presenting optimal decisions for each agent that applies the effect-aware policy based on the given solution. This allows the agent to make decisions on time by merely looking up the map with the obtained threshold-based policy for the goal.

Fig. 2. End-to-end status update communication to satisfy a subscribed goal.
Notations: $\mathbb{R}$, $\mathbb{R}^{+}$, $\mathbb{R}_{0}^{+}$, and $\mathbb{Z}^{+}$ indicate the sets of real, positive real, non-negative real, and positive integer numbers, respectively. $\mathbb{E}[\cdot]$ denotes the expectation operator, $|\cdot|$ the absolute value operator, $\mathbb{1}\{\cdot\}$ the indicator function, and $\mathcal{O}(\cdot)$ the growth rate of a function.
II. SYSTEM MODEL
We consider an end-to-end communication system in which
a sensing agent (SA) sends messages in a time-slotted manner
to an actuation agent (AA) as a means to take effective
action at the endpoint and satisfy a subscribed goal (see
Fig. 2). Specifically, the SA observes a source and generates
status update packets in each time slot, and a transmission
controller decides whether to transmit that observation or not,
following a specific policy. We assume that the source has finite-dimensional realizations and that the observation at the $n$-th, $\forall n \in \mathbb{N}$, time slot is assigned a rank of importance $v_n$ from a finite set $\mathcal{V} = \{\nu_i \,|\, i \in \mathcal{I}\}$, with $\mathcal{I} = \{1, 2, \ldots, |\mathcal{V}|\}$, based on its significance or usefulness for satisfying the goal, measured or judged at the source level.$^1$ The elements of $\mathcal{V}$ are independent and identically distributed (i.i.d.) with probability $p_i = p_\nu(\nu_i)$ for the $i$-th outcome, where $p_\nu(\cdot)$ denotes a given probability mass function (pmf).$^2$
The AA is assisted by a query controller that decides to raise
queries and pull new updates according to a certain policy.
A received packet at the AA has a satisfactory or sufficient
impact at the endpoint if that update achieves a minimum ef-
fectiveness level subject to the latest query raised and the AA’s
availability to act on it. An effective update communication
is followed by an acknowledgment of effectiveness (E-ACK)
signal sent from the AA to the SA to inform about the effective
update communication. We assume all transmissions and E-ACK feedback occur over packet erasure channels (PECs), with $p_\epsilon$ and $p'_\epsilon$ being the erasure probabilities in the forward communication and the acknowledgment links, respectively. Therefore, an E-ACK is not received at the SA due to either ineffective update communication or erasure in the acknowledgment (backward) channel. With this interpretation, channel errors lead to graceful degradation of the proposed scheme. A raised query does not necessarily need to be shared with the SA. As discussed in Section III-C, the SA can deduce the raising of a query or the availability of the AA to take action from a successful E-ACK, given prior knowledge that an update can be effective only if it arrives within the period during which the AA is available to act.

$^1$To determine the usefulness of an update, we can use the same metavalue approach proposed in [19, Section III-A].

$^2$A more elaborate model could consider the importance of a realization to depend on the most recently generated update at the SA. This implies that a less important update increases the likelihood of a more significant update occurring later, which can be captured utilizing a learning algorithm.
In this model, we consider the goal to be subscribed at
the endpoint, with the AA fully aware of it. On the other
hand, the SA does not initially know the goal but learns which
updates could be useful to accomplish the goal based on the
received E-ACK and observations’ significance. Meanwhile,
the AA is not aware of the evolution of the source or the
likely importance of observations, attempting to approximate
it from arriving updates. Consequently, the agents might use
different bases to measure the usefulness of the updates and
may need to adjust their criteria or valuation frameworks to
account for possible changes in goals over time. Finally, we
assume that update acquisition, potential communication, and
waiting time for receiving an E-ACK occur in one slot.
A. Communication Model
The following three strategies can be employed for effective
communication of status updates.
1) Push-based: Under this model, the SA pushes its
updates to the AA, taken for instance based on the source
evolution, without considering whether the AA has requested
them or is available to take any action upon receipt. This
bypasses the query controller, enabling the SA to directly
influence actions at the AA side.
2) Pull-based: In this model, the query controller plays a
central role in the generation of update arrivals at the AA by
pulling those updates from the SA. Here, the AA can only take
action when queries are raised. However, this model excludes
the SA from generating and sending updates.
3) Push-and-pull: This model arises from integrating the
push- and pull-based models so that the transmission and
query controllers individually decide to transmit updates and
send queries, respectively. Thereby, the AA is provided with
a level of flexibility where it is also able to take some actions
beyond query instances within a limited time. As a result,
the effectiveness of an update packet depends on both agents’
decisions. Dismissing the decision of either agent transforms
the push-and-pull model into the push- or pull-based model.
B. Agent Decision Policies
We propose that the agents can adhere to the following
decision policies, namely effect-agnostic and effect-aware, for
transmitting updates or raising queries to satisfy the goal.
1) Effect-agnostic: This policy uses a predetermined
schedule or random process (e.g., Poisson, binomial, or
Markov [5]–[9]) to send updates (raise queries) from (by) the
SA (AA), without accounting for their impact at the desti-
nation. We define a controlled update transmission (query)
rate specifying the expected constant number of updates
(queries) to be communicated (raised) within a period. Also,
as the effect-agnostic policy does not consider what might be
happening at the other agent at the time of the decision,
there exists an aleatoric uncertainty associated with random
updates (queries).
2) Effect-aware: The effect-aware policy takes into consid-
eration the impacts of both agents’ decisions at the endpoint.
In this regard, the SA (AA) predicts the effectiveness status
at the endpoint offered by a sent update that is potentially
received at the AA (the usefulness of a possible update at the
source). Then, based on this prediction, the agent attempts to
adapt transmission (query) instants and send (pull) updates in
the right slots. This policy comes with an epistemic uncertainty
because decisions are made according to probabilistic estima-
tions, not accurate knowledge. However, such uncertainty can
be decreased using learning or prediction techniques.
III. EFFECTIVENESS ANALYSIS METRICS
To achieve the right effect at the endpoint, an update packet
that is successfully received at the AA has to satisfy a set of
qualitative attributes, captured by the metrics as follows.
A. Grade of Effectiveness Metric
We introduce a grade of effectiveness (GoE) metric that
comprises several qualitative attributes and characterizes the
amount of impact an update makes at the endpoint. Mathematically speaking, the GoE metric is modeled via a composite function $\mathrm{GoE}_n = (f \circ g)(\mathcal{I}_n)$ for the $n$-th time slot. Here, $g: \mathbb{R}^x \to \mathbb{R}^y$, $x \geq y$, is a (nonlinear) function of $x \in \mathbb{Z}^+$ information attributes $\mathcal{I}_n \in \mathbb{R}^x$, and $f: \mathbb{R}^y \to \mathbb{R}$ is a context-aware function.$^3$ The particular forms of the functions $f$ and $g$ could vary according to different subscribed goals and their relevant requirements.

In this paper, without loss of generality, we consider the freshness of updates and the timeliness of actions as the main contextual attributes. The first comes in the form of the age of information (AoI) metric, which is denoted by $\Delta_n$. The second is measured from the action's lateness, denoted by $\Theta_n$. Thereby, we can formulate the GoE metric as follows
$$\mathrm{GoE}_n = f_g\big(g_\Delta(\hat{v}_n, \Delta_n), g_\Theta(\Theta_n); g_c(C_n)\big) \qquad (1)$$
where $C_n$ represents the overall cost incurred in the $n$-th time slot. Also, $g_\Delta: \mathbb{R}_0^+ \times \mathbb{R}^+ \to \mathbb{R}_0^+$, $g_\Theta: \mathbb{R}_0^+ \to \mathbb{R}_0^+$, and $g_c: \mathbb{R}_0^+ \to \mathbb{R}_0^+$ are penalty functions, and $f_g: \mathbb{R}_0^+ \times \mathbb{R}_0^+ \times \mathbb{R}_0^+ \to \mathbb{R}_0^+$ is a non-decreasing utility function. Moreover, $g_\Delta$, $g_\Theta$, and $g_c$ are non-increasing with respect to (w.r.t.) $\Delta_n$, $\Theta_n$, and $C_n$, respectively, while $g_\Delta$ is non-decreasing w.r.t. $\hat{v}_n$. Here, $\hat{v}_n$ is the usefulness of the received update from the endpoint's viewpoint in the $n$-th slot. Thus, we assume that $\hat{v}_n$ belongs to the set $\hat{\mathcal{V}} = \{0\} \cup \{\hat{\nu}_j \,|\, \hat{\nu}_j > 0, j \in \mathcal{J}\}$ with i.i.d. elements, where $\mathcal{J} = \{1, 2, \ldots, |\hat{\mathcal{V}}| - 1\}$, the $j$-th element has probability $q_j = p_{\hat{\nu}}(\hat{\nu}_j)$, and $p_{\hat{\nu}}(\cdot)$ is a pmf derived in Section V-A. Since the packet is sent over a PEC, $\hat{v}_n = 0$ if it is erased or the update ends up being useless at the endpoint.

$^3$The GoE metric in this form can be seen as a special case of the SoI metric introduced in [2], [19], [20].

Fig. 3. The outline of the action and idle windows in different models.
1) AoI: Measuring the freshness of correctly received updates at the AA within a query slot, the AoI is defined as $\Delta_n = n - u(n)$, where $\Delta_0 = 1$ and $u(n)$ is the slot index of the latest successful update, which is given by
$$u(n) = \max\big\{m \,|\, m \leq n, \, \beta_m (1 - \epsilon_m) = 1\big\} \qquad (2)$$
with $\epsilon_m \in \{0,1\}$ being the channel erasure in the $m$-th slot. In addition, $\beta_m \in \{0,1\}$ indicates the query controller's decision, where $\beta_m = 1$ means pulling the update; otherwise, $\beta_m = 0$.
2) Action lateness: The lateness of an action performed in the $n$-th time slot with respect to a query raised at the $n'$-th slot, $n' \leq n$, is given by
$$\Theta_n = (1 - \beta_n)(n - n'), \qquad (3)$$
which is valid for $\Theta_n < \Theta_{\max}$. Herein, $\Theta_{\max}$ denotes the width of the action window within which the AA can act on each query based on update arrivals from the SA. Outside the dedicated action window for the SA, the AA might undertake other tasks or communicate with other agents. Employing the push-based, pull-based, and push-and-pull update communication models, we have $\Theta_{\max} = \infty$, $\Theta_{\max} = 1$, and $\Theta_{\max} > 1$, respectively. Fig. 3 illustrates the action and idle windows for different models. It is worth mentioning that a wider action window allows for higher flexibility in cases of heavy action loads at the cost of longer actuation availability.
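To make the bookkeeping in (2) and (3) concrete, the following Python sketch tracks the AoI and the action lateness slot by slot for a given sequence of query decisions and channel erasures. The function and variable names (`track_aoi_and_lateness`, `beta`, `erased`) are illustrative choices, not taken from the paper, and the sketch implements the formulas literally under the assumption that slot indexing starts at zero.

```python
import numpy as np

def track_aoi_and_lateness(beta, erased):
    """Track AoI (Delta) and action lateness (Theta) per slot.

    beta[n]   : 1 if the query controller pulls an update in slot n, else 0
    erased[n] : 1 if the update sent in slot n is erased on the PEC, else 0
    """
    N = len(beta)
    delta = np.zeros(N, dtype=int)   # AoI per slot
    theta = np.zeros(N, dtype=int)   # action lateness per slot
    delta[0] = 1                     # Delta_0 = 1, as in the text
    last_success = 0                 # u(n): slot of the latest successful update
    last_query = 0                   # n': slot of the latest raised query
    for n in range(1, N):
        if beta[n] == 1:
            last_query = n
            if erased[n] == 0:       # beta_n * (1 - eps_n) = 1 -> success, Eq. (2)
                last_success = n
        delta[n] = n - last_success                  # Delta_n = n - u(n)
        theta[n] = (1 - beta[n]) * (n - last_query)  # Eq. (3)
        # Theta_n is only meaningful while Theta_n < Theta_max
    return delta, theta

# Small usage example with hypothetical decisions and erasures
rng = np.random.default_rng(0)
beta = rng.integers(0, 2, size=20)
erased = (rng.random(20) < 0.2).astype(int)
delta, theta = track_aoi_and_lateness(beta, erased)
print(delta, theta, sep="\n")
```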
B. Special Forms of the GoE
The GoE metric's formulation in (1) can simply turn into the QAoI and the VoI metrics as special cases. In this regard, we obtain a penalty function of the QAoI such that $\mathrm{GoE}_n = g_\Delta(\Delta_n)$ if we set $\Theta_{\max} = 1$, assume a linear $g_\Theta(\cdot)$, and overlook the updates' usefulness and cost. In addition, by removing the concepts of query and time, hence the freshness and timeliness in the GoE's definition, we arrive at a utility function of the VoI, i.e., $\mathrm{GoE}_n = f_g(\hat{v}_n; g_c(C_n))$.
C. Effectiveness Indicator
An update in the $n$-th time slot is considered effective at the system level if its $\mathrm{GoE}_n$ is higher than a target effectiveness grade, which is called $\mathrm{GoE}_{\mathrm{tgt}}$ and is necessary to satisfy the goal. Let us define $E_n$ as an effectiveness indicator in the $n$-th time slot. Thus, we can write
$$E_n = \mathbb{1}\big\{\mathrm{GoE}_n \geq \mathrm{GoE}_{\mathrm{tgt}} \,\wedge\, \Theta_n < \Theta_{\max}\big\}. \qquad (4)$$
The second condition in (4) appears from (3). According to (4), an update could be effective only if it arrives within the action window of the AA. Hence, a consequent E-ACK shared with the SA can imply the raise of a query or the availability of the AA to take action. Given the values of $\Delta_n$ and $\Theta_n$, and by inserting (1) into (4), we reach a target usefulness level $v_{\mathrm{tgt}}$ as the importance threshold that the update should exceed to be considered effective. In this case, if $\Theta_n < \Theta_{\max}$, we have
$$v_{\mathrm{tgt}} = \min\big\{\hat{\nu}_j \,\big|\, \hat{\nu}_j \in \hat{\mathcal{V}}, \, f_g\big(g_\Delta(\hat{\nu}_j, \Delta_n), g_\Theta(\Theta_n)\big) \geq \mathrm{GoE}_{\mathrm{tgt}}\big\}; \qquad (5)$$
otherwise, $v_{\mathrm{tgt}} = \max\{\hat{\nu}_j \,|\, \hat{\nu}_j \in \hat{\mathcal{V}}\}$. In (5), $v_{\mathrm{tgt}}$ can be computed by exhaustive search.
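As a concrete illustration of (4) and (5), the sketch below evaluates the effectiveness indicator and finds the target usefulness level by exhaustive search over $\hat{\mathcal{V}}$. The multiplicative GoE form later used in (32) stands in for $f_g$ (ignoring costs), and the clamping of $\Theta_n$ to at least one slot, the fallback when no level suffices, and all function names are assumptions made for illustration only.

```python
import numpy as np

def goe(v_hat, delta, theta):
    # Illustrative stand-in for f_g(g_Delta(v_hat, Delta), g_Theta(Theta)),
    # following the shape of Eq. (32) without the cost terms.
    return v_hat / (delta * max(theta, 1))

def effectiveness(v_hat, delta, theta, goe_tgt, theta_max):
    # E_n = 1{GoE_n >= GoE_tgt and Theta_n < Theta_max}, Eq. (4)
    return int(goe(v_hat, delta, theta) >= goe_tgt and theta < theta_max)

def target_usefulness(v_hat_set, delta, theta, goe_tgt, theta_max):
    # Exhaustive search for v_tgt as in Eq. (5)
    if theta >= theta_max:
        return max(v_hat_set)
    feasible = [v for v in sorted(v_hat_set)
                if goe(v, delta, theta) >= goe_tgt]
    # If no level suffices, fall back to the largest level (an assumption).
    return feasible[0] if feasible else max(v_hat_set)

v_hat_set = np.linspace(0.0, 1.0, 11)   # hypothetical 11-level usefulness space
print(effectiveness(0.9, delta=1, theta=1, goe_tgt=0.6, theta_max=5))
print(target_usefulness(v_hat_set, delta=1, theta=1, goe_tgt=0.6, theta_max=5))
```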
IV. MODEL-BASED AGENT DECISIONS
In this section, we first formulate a decision problem for
effect-aware policies at either or both agent(s), cast it as a
constrained Markov decision process (CMDP), and then solve
it based on the problem’s dual form.
A. Problem Formulation
The objective is to maximize the expected discounted sum of the updates' effectiveness in fulfilling the subscribed goal, where each agent individually derives its decision policy subject to the relevant ensued cost by looking into the problem from its own perspective. Let us define $\pi_\alpha^*$ and $\pi_\beta^*$ as the classes of optimal policies for transmission and query controls, respectively. Therefore, we can formulate the decision problem solved at each agent as follows
$$\mathcal{P}_1:\; \max_{\pi_\gamma} \; \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}\Big[\sum_{n=1}^{N} \lambda^n E_n \,\Big|\, E_0\Big]$$
$$\text{s.t.} \;\; \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}\Big[\sum_{n=1}^{N} \lambda^n c_\gamma(\gamma_n)\Big] \leq C_{\gamma,\max} \qquad (6)$$
where $\lambda \in [0,1]$ indicates a discount factor, and $\gamma \in \{\alpha, \beta\}$ is replaced with $\alpha$ and $\beta$ for the update transmission and query decision problems, respectively, at the SA and the AA. Herein, $\gamma_n \in \{0,1\}$ denotes the decision at the relevant agent, $c_\gamma: \{0,1\} \to \mathbb{R}_0^+$ is a non-decreasing cost function, and $C_{\gamma,\max}$ shows the maximum discounted cost.
For either update communication model introduced in Section II-A, optimal decisions at the agent(s) following the effect-aware policy, i.e., $\pi_\alpha^*$ and/or $\pi_\beta^*$, are obtained by solving $\mathcal{P}_1$ in (6). However, for every agent that employs an effect-agnostic policy, with regard to Section II-B, there is a predefined/given set of decisions denoted by $\tilde{\pi}_\alpha$ or $\tilde{\pi}_\beta$ such that $\pi_\alpha = \tilde{\pi}_\alpha$ or $\pi_\beta = \tilde{\pi}_\beta$, respectively.
B. CMDP Modeling
We cast $\mathcal{P}_1$ from (6) into an infinite-horizon CMDP denoted by a tuple $(\mathcal{S}_\gamma, \mathcal{A}_\gamma, P_\gamma, r_\gamma)$ with components that are defined according to the agent that solves the decision problem.
1) Modeling at the SA: The CMDP at the SA is modeled
according to the following components:
States – The state of the system $S_{\alpha,n}$ in the $n$-th slot from the SA's perspective is depicted by a tuple $(v_n, \hat{E}_n)$, in which $v_n$ is the update's usefulness and $\hat{E}_n \in \{0,1\}$ shows the E-ACK arrival status at the SA after passing the PEC, as defined in Section II. Herein, we have $\hat{E}_n = 0$ in case $E_n = 0$ or the acknowledgment signal is erased; otherwise, $\hat{E}_n = 1$. In this regard, $S_{\alpha,n}$ belongs to a finite and countable state space $\mathcal{S}_\alpha$ with $|\mathcal{S}_\alpha| = 2 \cdot |\mathcal{V}|$ elements.
Actions – We consider $\alpha_n$ the decision for update communication in the $n$-th slot, which is a member of an action space $\mathcal{A}_\alpha = \{0, 1\}$. In this space, $0$ stands for discarding the update, and $1$ indicates transmitting the update.
Transition probabilities – The transition probability from the current state $S_{\alpha,n}$ to the future state $S_{\alpha,n+1}$ via taking the action $\alpha_n$ is written as
$$p_\alpha(S_{\alpha,n}, \alpha_n, S_{\alpha,n+1}) = \Pr\big((v_{n+1}, \hat{E}_{n+1}) \,|\, (v_n, \hat{E}_n), \alpha_n\big) = p_\nu(v_{n+1}) \Pr\big(\hat{E}_{n+1} \,|\, v_n, \alpha_n\big) \qquad (7)$$
since $\hat{E}_{n+1}$ and $\hat{E}_n$ are independent, and $\hat{E}_n$ is independent of $v_n$, $\forall n$. We can derive the conditional probability in (7) as
• $\Pr\big(\hat{E}_{n+1} = 0 \,|\, v_n, \alpha_n\big) = \Pr(\hat{v}_{\mathrm{tgt}} > \alpha_n v_n) = 1 - P_{\hat{v}_{\mathrm{tgt}}}(\alpha_n v_n)$,
• $\Pr\big(\hat{E}_{n+1} = 1 \,|\, v_n, \alpha_n\big) = P_{\hat{v}_{\mathrm{tgt}}}(\alpha_n v_n)$,
where $\hat{v}_{\mathrm{tgt}}$ is a mapped target usefulness that the SA considers, $P_{\hat{v}_{\mathrm{tgt}}}(\hat{v}_{\mathrm{tgt}}) = \sum_{\hat{v}'_{\mathrm{tgt}} \leq \hat{v}_{\mathrm{tgt}}} p_{\hat{v}_{\mathrm{tgt}}}(\hat{v}'_{\mathrm{tgt}})$ denotes its cumulative distribution function (CDF), and $p_{\hat{v}_{\mathrm{tgt}}}(\cdot)$ shows the pmf derived in Section V-B.
Rewards – The immediate reward of moving from the state $S_{\alpha,n}$ to the state $S_{\alpha,n+1}$ under the action $\alpha_n$ is equal to $r_\alpha(S_{\alpha,n}, \alpha_n, S_{\alpha,n+1}) = \hat{E}_{n+1}$, where it relies on the E-ACK status in the future state.
Despite possible erasures over the acknowledgment link, the reward defined in this model fits into the decision problem in (6), where the corresponding objective becomes maximizing the expected discounted sum of E-ACK arrivals. In this sense, $\hat{E}_n$ at the SA resembles $E_n$ at the AA plus noise in the form of the E-ACK erasure.
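A minimal sketch of how the SA-side transition kernel in (7) can be assembled is given below, assuming the pmfs $p_\nu(\cdot)$ and $p_{\hat{v}_{\mathrm{tgt}}}(\cdot)$ are available as arrays. The flattening of the state $(v_n, \hat{E}_n)$ into an index and all names are implementation choices, not the paper's; the matching reward array would simply mark next states whose E-ACK bit is one.

```python
import numpy as np

def sa_transition_kernel(p_nu, p_vtgt, nu_levels, vtgt_levels):
    """P[a, s, s'] for the SA CMDP, with states s = (i, e) flattened as 2*i + e.

    p_nu[i]   : probability of importance level nu_levels[i]
    p_vtgt[j] : estimated pmf of the mapped target usefulness vtgt_levels[j]
    """
    n_v = len(nu_levels)
    n_s = 2 * n_v
    P = np.zeros((2, n_s, n_s))
    cdf_vtgt = np.cumsum(p_vtgt)                  # P_vtgt(x) = Pr(vtgt <= x)
    for a in (0, 1):                              # 0: drop, 1: transmit
        for i, v in enumerate(nu_levels):
            # Pr(E-ACK = 1 | v_n, a_n) = P_vtgt(a_n * v_n), as in Eq. (7)
            x = a * v
            if x >= vtgt_levels[0]:
                p_ack = cdf_vtgt[np.searchsorted(vtgt_levels, x, side="right") - 1]
            else:
                p_ack = 0.0
            for e in (0, 1):                      # current E-ACK bit (no influence)
                s = 2 * i + e
                for i2, p_v2 in enumerate(p_nu):  # next importance level
                    P[a, s, 2 * i2 + 0] += p_v2 * (1.0 - p_ack)
                    P[a, s, 2 * i2 + 1] += p_v2 * p_ack
    return P

# Usage with hypothetical uniform pmfs over 10 and 11 levels
nu = np.linspace(0.05, 0.95, 10)
vt = np.linspace(0.0, 1.0, 11)
P = sa_transition_kernel(np.full(10, 0.1), np.full(11, 1 / 11), nu, vt)
print(P.shape, P.sum(axis=2).round(6).min())      # every row sums to 1
```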
2) Modeling at the AA: For modeling the problem at the AA, we have the components as follows:
States – We represent the state $S_{\beta,n}$ in the $n$-th time slot using a tuple $(\hat{v}_n, \Delta_n, \Theta_n)$, where $\hat{v}_n$ is the usefulness of the received update from the perspective of the endpoint, $\Delta_n$ is the AoI, and $\Theta_n$ denotes the action lateness, as modeled in Section III. Without loss of generality, we assume the values of $\Delta_n$ and $\Theta_n$ are truncated by the maximum values notated as $\Delta_{\max}$ and $\Theta_{\max}$, respectively, such that the conditions
$$g_\Delta(\hat{v}_n, \Delta_{\max} - 1) \leq (1 + \varepsilon_\Delta) \, g_\Delta(\hat{v}_n, \Delta_{\max}), \qquad (8)$$
for $\hat{v}_n \in \hat{\mathcal{V}}$, and
$$g_\Theta(\Theta_{\max} - 1) \leq (1 + \varepsilon_\Theta) \, g_\Theta(\Theta_{\max}) \qquad (9)$$
are met with the relevant accuracies $\varepsilon_\Delta$ and $\varepsilon_\Theta$. Given this, at the AA, $S_{\beta,n}$ is a member of a finite and countable space $\mathcal{S}_\beta$ having $|\mathcal{S}_\beta| = \Delta_{\max} \cdot \Theta_{\max} \cdot |\hat{\mathcal{V}}|$ states.
Actions – As already mentioned, $\beta_n$ shows the decision of raising a query in the $n$-th time slot and takes values from an action space $\mathcal{A}_\beta = \{0, 1\}$. Here, $0$ and $1$ depict refusing and confirming to pull an update, respectively.
Transition probabilities – The transition probability from the current state $S_{\beta,n}$ to the future state $S_{\beta,n+1}$ under the action $\beta_n$ is modeled as
$$p_\beta(S_{\beta,n}, \beta_n, S_{\beta,n+1}) = \Pr\big((\hat{v}_{n+1}, \Delta_{n+1}, \Theta_{n+1}) \,|\, (\hat{v}_n, \Delta_n, \Theta_n), \beta_n\big). \qquad (10)$$
According to (10), we can write:
• $\Pr\big((\hat{\nu}_j, \min\{\Delta_n + 1, \Delta_{\max}\}, \min\{\Theta_n + 1, \Theta_{\max}\}) \,|\, (\hat{\nu}_j, \Delta_n, \Theta_n), \beta_n\big) = 1 - \beta_n$,
• $\Pr\big((\hat{\nu}_j, \min\{\Delta_n + 1, \Delta_{\max}\}, 1) \,|\, (\hat{\nu}_j, \Delta_n, \Theta_n), \beta_n\big) = \beta_n p_\epsilon$,
• $\Pr\big((\hat{\nu}_{j'}, 1, 1) \,|\, (\hat{\nu}_j, \Delta_n, \Theta_n), \beta_n\big) = \beta_n (1 - p_\epsilon) q_{j'}$,
with $\hat{\nu}_j, \hat{\nu}_{j'} \in \hat{\mathcal{V}}$. For the rest of the transitions, we have $p_\beta(S_{\beta,n}, \beta_n, S_{\beta,n+1}) = 0$. As stated earlier, $q_j = p_{\hat{\nu}}(\hat{\nu}_j)$ with the pmf $p_{\hat{\nu}}(\cdot)$ derived in Section V-A.
Rewards – Arriving at the state $S_{\beta,n+1}$ from the state $S_{\beta,n}$ by taking the action $\beta_n$ is rewarded based on the effectiveness level provided at the future state such that
$$r_\beta(S_{\beta,n}, \beta_n, S_{\beta,n+1}) = E_{n+1} = \mathbb{1}\big\{f_g\big(g_\Delta(\hat{v}_{n+1}, \Delta_{n+1}), g_\Theta(\Theta_{n+1})\big) \geq \mathrm{GoE}_{\mathrm{tgt}}\big\} \times \mathbb{1}\big\{\Theta_{n+1} < \Theta_{\max}\big\} \qquad (11)$$
by the use of (1) and (4).
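For the AA-side kernel in (10) and the reward in (11), a compact way to enumerate the three non-zero transition types is sketched below. The flattening of $(\hat{v}, \Delta, \Theta)$ into a single index and the reuse of the multiplicative GoE form from (32) in place of a general $f_g$ are implementation assumptions, as are all function and variable names.

```python
import numpy as np

def aa_model(q, nu_hat, d_max, t_max, p_eps, goe_tgt):
    """Transition kernel P[b, s, s'] and reward R[b, s, s'] for the AA CMDP.

    q[j]         : pmf of the received update's usefulness nu_hat[j]
    d_max, t_max : truncation values Delta_max and Theta_max
    p_eps        : erasure probability of the update channel
    """
    J, n_s = len(nu_hat), len(nu_hat) * d_max * t_max

    def idx(j, d, t):                 # flatten (usefulness, AoI, lateness)
        return (j * d_max + (d - 1)) * t_max + (t - 1)

    P = np.zeros((2, n_s, n_s))
    R = np.zeros((2, n_s, n_s))
    for j in range(J):
        for d in range(1, d_max + 1):
            for t in range(1, t_max + 1):
                s = idx(j, d, t)
                d2, t2 = min(d + 1, d_max), min(t + 1, t_max)
                # b = 0: no query -> AoI and lateness grow, usefulness unchanged
                P[0, s, idx(j, d2, t2)] += 1.0
                # b = 1, erasure: AoI grows, lateness resets to 1
                P[1, s, idx(j, d2, 1)] += p_eps
                # b = 1, success: fresh update with usefulness nu_hat[j2]
                for j2 in range(J):
                    P[1, s, idx(j2, 1, 1)] += (1.0 - p_eps) * q[j2]
    # Reward: E_{n+1} evaluated at the landing state, Eq. (11); the GoE form
    # v / (Delta * Theta) is an illustrative stand-in for f_g.
    for j in range(J):
        for d in range(1, d_max + 1):
            for t in range(1, t_max + 1):
                goe = nu_hat[j] / (d * t)
                R[:, :, idx(j, d, t)] = float(goe >= goe_tgt and t < t_max)
    return P, R

nu_hat = np.linspace(0.0, 1.0, 11)          # hypothetical usefulness levels
P, R = aa_model(np.full(11, 1 / 11), nu_hat, d_max=10, t_max=5,
                p_eps=0.2, goe_tgt=0.6)
print(P.shape, P.sum(axis=2).round(6).min())
```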
3) Independence of the initial state: Before we delve
into the dual problem and solve it, we state and prove two
propositions to show that the expected discounted sum of
effectiveness in (6) is the same for all initial states.
Proposition 1. The CMDP modeled at the SA satisfies the accessibility condition.
Proof. Given the transition probabilities defined in (7), every state $S_{\alpha,m} \in \mathcal{S}_\alpha$, $m \leq N$, is accessible or reachable from the state $S_{\alpha,n}$ in finite steps with a non-zero probability, following the policy $\pi_\alpha$. Therefore, the accessibility condition holds for the CMDP model at the SA [21, Definition 4.2.1].
Proposition 2. The modeled CMDP at the AA meets the weak accessibility condition.
Proof. We divide the state space $\mathcal{S}_\beta$ into two disjoint spaces $\mathcal{T}_a$ and $\mathcal{T}_b = \mathcal{S}_\beta - \mathcal{T}_a$, where $\mathcal{T}_a$ consists of all the states whose $\Delta_n = 1$, i.e., $\mathcal{T}_a = \{S_{\beta,n} \,|\, S_{\beta,n} = (\hat{\nu}_j, 1, \Theta_n), \forall \hat{\nu}_j \in \hat{\mathcal{V}}, \Theta_n = 1, 2, \ldots, \Theta_{\max}\}$. Thus, $\mathcal{T}_b$ includes the rest of the states with $\Delta_n \geq 2$. With regard to the transition probabilities derived in (10), all states of $\mathcal{T}_b$ are transient under any policy, while every state of an arbitrary pair of two states in $\mathcal{T}_a$ is accessible from the other state. Accordingly, the weak accessibility condition in the modeled CMDP at the AA is satisfied according to [21, Definition 4.2.2].
Given Propositions 1 and 2, we can show that the expected effectiveness obtained by $\mathcal{P}_1$ in (6) is the same for all initial states [21, Proposition 4.2.3]. In this regard, $E_n$, $\forall n$, is independent of $E_0$ for either model, thus we arrive at the following decision problem:
$$\mathcal{P}_2:\; \max_{\pi_\gamma} \; \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}\Big[\sum_{n=1}^{N} \lambda^n E_n\Big] := \bar{E}_\gamma$$
$$\text{s.t.} \;\; \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}\Big[\sum_{n=1}^{N} \lambda^n c_\gamma(\gamma_n)\Big] \leq C_{\gamma,\max} \qquad (12)$$
for $\gamma \in \{\alpha, \beta\}$. Applying Propositions 1 and 2 confirms that there exist stationary optimal policies $\pi_\alpha^*$ and $\pi_\beta^*$ for $\mathcal{P}_2$ solved at the SA and the AA, respectively, where both policies are unichain [21, Proposition 4.2.6].
C. Dual Problem
To solve the decision problem $\mathcal{P}_2$ given in (12), we first define an unconstrained form of the problem via dualizing the constraint. Then, we propose an algorithm to compute the decision policies at both agents.
The unconstrained form of the problem is derived by writing the Lagrange function $\mathcal{L}(\mu; \pi_\gamma)$ as below
$$\mathcal{L}(\mu; \pi_\gamma) = \max_{\pi_\gamma} \; \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}\Big[\sum_{n=1}^{N} \lambda^n \big(E_n - \mu c_\gamma(\gamma_n)\big)\Big] + \mu C_{\gamma,\max} \qquad (13)$$
with $\mu \geq 0$ being the Lagrange multiplier. According to (13), we arrive at the following dual problem to be solved:
$$\mathcal{P}_3:\; \inf_{\mu \geq 0} \underbrace{\max_{\pi_\gamma} \mathcal{L}(\mu; \pi_\gamma)}_{:= h_\gamma(\mu)} \qquad (14)$$
where $h_\gamma(\mu) = \mathcal{L}(\mu; \pi_{\gamma,\mu}^*)$ is the Lagrange dual function with $\pi_{\gamma,\mu}^*: \mathcal{S}_\gamma \to \mathcal{A}_\gamma$ denoting a stationary $\mu$-optimal policy, which is obtained as
$$\pi_{\gamma,\mu}^* = \arg\max_{\pi_\gamma} \mathcal{L}(\mu; \pi_\gamma) \qquad (15)$$
for $\mu$ derived in the dual problem $\mathcal{P}_3$. As the dimension of the state space $\mathcal{S}_\gamma$ is finite for both defined models, the growth condition is met [22]. Also, the immediate reward is bounded below, having a non-negative value according to Section IV-B. In light of the above satisfied conditions, from [22, Corollary 12.2], we can claim that $\mathcal{P}_2$ and $\mathcal{P}_3$ converge to the same expected values, thus we have
$$\bar{E}_\gamma = \inf_{\mu \geq 0} h_\gamma(\mu) = \max_{\pi_\gamma} u(\pi_\gamma) \qquad (16)$$
under any class of policy $\pi_\gamma$. Owing to the satisfied conditions, there exist non-negative optimal values of the Lagrange multiplier $\mu^*$ such that we can define $u(\pi_\gamma) = \mathcal{L}(\mu^*; \pi_\gamma)$ in (16) [22, Theorem 12.8].
We can now proceed to derive the optimal policies at the SA and the AA from the decision problem $\mathcal{P}_3$ by applying an iterative algorithm in line with the dynamic programming approach based on (13)–(15) [14].
Algorithm 1: Solution for deriving $\pi_\gamma^*$ and $\mu^*$
Input: Known parameters $N \gg 1$, $C_{\gamma,\max}$, $\eta$, $\varepsilon_\mu$, state space $\mathcal{S}_\gamma$, and action space $\mathcal{A}_\gamma$. The form of the cost function $c_\gamma(\cdot)$. Initial values $l \leftarrow 1$, $\mu^{(0)} \leftarrow 0$, $\mu^- \leftarrow 0$, $\mu^+ \gg 1$, $\pi_{\gamma,\mu^-} \leftarrow 0$, and $\pi_{\gamma,\mu^+} \leftarrow 0$.
1: Initialize $\pi_{\gamma,\mu}^*(s)$, $\forall s \in \mathcal{S}_\gamma$, via running Utility($\mu^{(0)}$).
2: if $\mathbb{E}\big[\sum_{n=1}^{N} c_\gamma(\gamma_n)\big] \leq N C_{\gamma,\max}$ then goto 11.
3: while $|\mu^+ - \mu^-| \geq \varepsilon_\mu$ do
   Step $l$: ▷ Outer loop (bisection search)
4:   update $\mu^{(l)} \leftarrow \frac{\mu^- + \mu^+}{2}$.
5:   Improve $\pi_{\gamma,\mu}^* \leftarrow$ Utility($\mu^{(l)}$).
6:   if $\mathbb{E}\big[\sum_{n=1}^{N} c_\gamma(\gamma_n)\big] \geq N C_{\gamma,\max}$ then
7:     $\mu^- \leftarrow \mu^{(l)}$, and $\pi_{\gamma,\mu^-} \leftarrow$ Utility($\mu^-$).
8:   else $\mu^+ \leftarrow \mu^{(l)}$, and $\pi_{\gamma,\mu^+} \leftarrow$ Utility($\mu^+$).
9:   Reset $l \leftarrow l + 1$.
10: if $\mathbb{E}\big[\sum_{n=1}^{N} c_\gamma(\gamma_n)\big] < N C_{\gamma,\max}$ then
    $\pi_{\gamma,\mu}^*(s) \leftarrow \eta \pi_{\gamma,\mu^-}(s) + (1 - \eta) \pi_{\gamma,\mu^+}(s)$, $\forall s \in \mathcal{S}_\gamma$.
11: return $\mu^* = \mu^{(l)}$ and $\pi_\gamma^*(s) = \pi_{\gamma,\mu}^*(s)$, $\forall s \in \mathcal{S}_\gamma$.

Function Utility($\mu$):
Input: Known parameters $N \gg 1$, $\varepsilon_\pi$, state space $\mathcal{S}_\gamma$, and action space $\mathcal{A}_\gamma$. Initial values $k \leftarrow 1$, $\pi_{\gamma,\mu}(s) \leftarrow 0$, and $V_k^{\pi_{\gamma,\mu}}(s) \leftarrow 0$, $\forall s \in \mathcal{S}_\gamma$.
   Iteration $k$: ▷ Inner loop (value iteration)
12: for state $s \in \mathcal{S}_\gamma$ do
13:   compute $V_k^{\pi_{\gamma,\mu}}(s)$ from (18).
14:   Improve $\pi_{\gamma,\mu}(s)$ according to (19) and (20).
15: if $\mathrm{sp}\big(V_k^{\pi_{\gamma,\mu}} - V_{k-1}^{\pi_{\gamma,\mu}}\big) \geq \varepsilon_\pi$ as in (22) then
16:   step up $k \leftarrow k + 1$, and goto 12.
17: return $\pi_{\gamma,\mu}^*(s) = \pi_{\gamma,\mu}(s)$, $\forall s \in \mathcal{S}_\gamma$.
D. Iterative Algorithm
The iterative algorithm is given in Algorithm 1 and consists of an inner and an outer loop. The inner loop is for computing the $\mu$-optimal policy, i.e., $\pi_{\gamma,\mu}^*$, using the value iteration method. Over the outer loop, the optimal Lagrange multiplier $\mu^*$ is derived via the bisection search method.
1) Computing $\pi_{\gamma,\mu}^*$: Applying the value iteration method, the decision policy is iteratively improved given $\mu$ from the outer loop (bisection search). Thus, $\pi_{\gamma,\mu}(s) \in \mathcal{A}_\gamma$, $\forall s \in \mathcal{S}_\gamma$, is updated such that it maximizes the expected utility (value) $V_k^{\pi_{\gamma,\mu}}(s)$ at the $k$-th, $\forall k \in \mathbb{N}$, iteration, which is obtained as
$$V_k^{\pi_{\gamma,\mu}}(s) = \mathbb{E}\big[r_k + \lambda r_{k+1} + \lambda^2 r_{k+2} + \cdots \,|\, s_k = s\big] \approx \mathbb{E}\big[r_k + \lambda V_{k-1}^{\pi_{\gamma,\mu}} \,|\, s_k = s\big] \qquad (17)$$
where $s_k$ denotes the state at the $k$-th iteration, and $r_k$ is the corresponding reward at that state. The approximation in (17) appears after bootstrapping the rest of the discounted sum of the rewards by the value estimate $V_{k-1}^{\pi_{\gamma,\mu}}$. Under the form of the value iteration for unichain policy MDPs [23], the optimal value function is derived from Bellman's equation [24], as
$$V_k^{\pi_{\gamma,\mu}}(s) = \max_{\gamma \in \mathcal{A}_\gamma} \sum_{s' \in \mathcal{S}_\gamma} p_\gamma(s, \gamma, s') \Big[r_{\gamma,\mu}(s, \gamma, s') + \lambda V_{k-1}^{\pi_{\gamma,\mu}}(s')\Big] \qquad (18)$$
for the state $s \in \mathcal{S}_\gamma$. Consequently, the decision policy in that state is improved by
$$\pi_{\gamma,\mu}(s) \in \arg\max_{\gamma \in \mathcal{A}_\gamma} \sum_{s' \in \mathcal{S}_\gamma} p_\gamma(s, \gamma, s') \Big[r_{\gamma,\mu}(s, \gamma, s') + \lambda V_{k-1}^{\pi_{\gamma,\mu}}(s')\Big]. \qquad (19)$$
In (18) and (19), we define a net reward function as
$$r_{\gamma,\mu}(s, \gamma, s') = r_\gamma(s, \gamma, s') - \mu c_\gamma(\gamma), \qquad (20)$$
which takes into account the cost caused by the taken action.
The value iteration stops running at the $k$-th iteration once the following convergence criterion is met [23]:
$$\mathrm{sp}\big(V_k^{\pi_{\gamma,\mu}} - V_{k-1}^{\pi_{\gamma,\mu}}\big) < \varepsilon_\pi \qquad (21)$$
where $\varepsilon_\pi > 0$ is the desired convergence accuracy, and $\mathrm{sp}(\cdot)$ indicates a span function $\mathbb{R}_0^+ \to \mathbb{R}_0^+$ given as
$$\mathrm{sp}\big(V_{k'}^{\pi_{\gamma,\mu}}\big) = \max_{s \in \mathcal{S}} V_{k'}^{\pi_{\gamma,\mu}}(s) - \min_{s \in \mathcal{S}} V_{k'}^{\pi_{\gamma,\mu}}(s) \qquad (22)$$
by using the span seminorm [23, Section 6.6.1]. As the decision policies are unichain and have aperiodic transition matrices, the criterion in (21) is satisfied after finite iterations for any value of $\lambda \in [0,1]$ [23, Theorem 8.5.4].
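The inner loop of Algorithm 1, i.e., the value iteration in (18)–(22) with the net reward (20), can be sketched as follows for a generic kernel `P[a, s, s']` and reward `R[a, s, s']` (for instance, the arrays built in the earlier sketches); the function name and default parameter values are illustrative, and the stopping rule uses the span seminorm in (22).

```python
import numpy as np

def value_iteration(P, R, cost, mu, lam=0.75, eps_pi=1e-4, max_iter=10_000):
    """Return the mu-optimal policy and its value function.

    P[a, s, s'], R[a, s, s'] : transition kernel and reward of the CMDP
    cost[a]                  : action cost c_gamma(a)
    mu                       : Lagrange multiplier (net reward r - mu*c, Eq. (20))
    """
    n_a, n_s, _ = P.shape
    V = np.zeros(n_s)
    Q = np.zeros((n_a, n_s))
    for _ in range(max_iter):
        # Q[a, s] = sum_{s'} P[a, s, s'] * (R[a, s, s'] - mu*cost[a] + lam*V[s'])
        Q = (np.einsum("asx,asx->as", P, R)
             - mu * np.asarray(cost)[:, None]
             + lam * np.einsum("asx,x->as", P, V))
        V_new = Q.max(axis=0)                         # Bellman update, Eq. (18)
        span = (V_new - V).max() - (V_new - V).min()  # span seminorm, Eq. (22)
        V = V_new
        if span < eps_pi:                             # criterion (21)
            break
    policy = Q.argmax(axis=0)                         # greedy improvement, Eq. (19)
    return policy, V
```

Subtracting $\mu c_\gamma(a)$ outside the sum over next states is equivalent to placing it inside, since every row of the kernel sums to one.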
2) Computing $\mu^*$: We leverage the bisection search method to compute the optimal Lagrange multiplier over multiple steps in the outer loop based on the derived $\pi_{\gamma,\mu}^*$ from the inner loop. Starting with an initial interval $[\mu^-, \mu^+]$ such that $h_\gamma(\mu^-) h_\gamma(\mu^+) < 0$, the value of the multiplier at the $l$-th, $\forall l \in \mathbb{N}$, step is improved by $\mu^{(l)} = \frac{\mu^- + \mu^+}{2}$. As shown in Algorithm 1, at each step, the value of either $\mu^-$ or $\mu^+$ and the corresponding decision policy $\pi_{\gamma,\mu^-}$ or $\pi_{\gamma,\mu^+}$, respectively, are updated according to the cost constraint in (12) until a stopping criterion $|\mu^+ - \mu^-| < \varepsilon_\mu$ is reached with the accuracy $\varepsilon_\mu$.

Fig. 4. Time partitioning of the estimation and decision horizons.

Considering (13) and (14), $h_\gamma(\mu)$ is a linear non-increasing function of $\mu$. In this regard, the bisection method searches for the smallest Lagrange multiplier that guarantees the cost constraint. Also, one can show that $h_\gamma(\mu)$ is a Lipschitz continuous function with the Lipschitz constant
$$C_{\gamma,\max} - \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}\Big[\sum_{n=1}^{N} \lambda^n c_\gamma(\gamma_n)\Big].$$
Therefore, the bisection search converges to the optimal value of $\mu$ within $L \in \mathbb{N}$ finite steps [25, pp. 294].
After the outer loop stops running, we obtain a stationary deterministic decision policy as $\pi_\gamma^* = \pi_{\gamma,\mu}$ if the following condition holds:
$$\mathcal{C}: \;\; \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}\Big[\sum_{n=1}^{N} \lambda^n c_\gamma(\gamma_n)\Big] = C_{\gamma,\max}. \qquad (23)$$
Otherwise, the derived policy becomes randomized stationary in the shape of mixing two deterministic policies $\pi_{\gamma,\mu^-} = \lim_{\mu \to \mu^-} \pi_{\gamma,\mu}$ and $\pi_{\gamma,\mu^+} = \lim_{\mu \to \mu^+} \pi_{\gamma,\mu}$ with probability $\eta \in [0,1]$ [26]. Hence, we can write
$$\pi_\gamma^* \leftarrow \eta \, \pi_{\gamma,\mu^-} + (1 - \eta) \, \pi_{\gamma,\mu^+}, \qquad (24)$$
which implies that the decision policy is randomly chosen as $\pi_\gamma^* = \pi_{\gamma,\mu^-}$ and $\pi_\gamma^* = \pi_{\gamma,\mu^+}$ with probabilities $\eta$ and $1 - \eta$, respectively. In (24), $\eta$ is computed such that the condition $\mathcal{C}$ in (23) is maintained.
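The outer loop (bisection over $\mu$, plus the mixing step in (24)) can then be layered on top of the `value_iteration` routine sketched above. Estimating the cost of a policy by solving a discounted policy-evaluation system is only one possible way to check the constraint in (12), so the helper below is a hedged sketch under those assumptions rather than a literal transcription of Algorithm 1.

```python
import numpy as np

def discounted_cost(P, cost, policy, lam=0.75):
    """Expected discounted action cost of a deterministic policy (uniform start).

    This is a stand-in for the normalized constraint term in (12).
    """
    n_s = P.shape[1]
    c = np.asarray(cost)[policy]                    # per-state cost under policy
    P_pi = P[policy, np.arange(n_s), :]             # induced transition matrix
    # Solve (I - lam * P_pi) J = c for the discounted cost-to-go J
    J = np.linalg.solve(np.eye(n_s) - lam * P_pi, c)
    return J.mean()

def bisection(P, R, cost, c_max, lam=0.75, eps_mu=1e-4, mu_hi=100.0, eta=0.5):
    pol, _ = value_iteration(P, R, cost, mu=0.0, lam=lam)
    if discounted_cost(P, cost, pol, lam) <= c_max:
        return 0.0, pol                             # constraint already inactive
    mu_lo = 0.0
    while mu_hi - mu_lo >= eps_mu:
        mu = 0.5 * (mu_lo + mu_hi)
        pol, _ = value_iteration(P, R, cost, mu, lam=lam)
        if discounted_cost(P, cost, pol, lam) >= c_max:
            mu_lo = mu                              # cost too high: raise mu
        else:
            mu_hi = mu                              # cost satisfied: lower mu
    pol_lo, _ = value_iteration(P, R, cost, mu_lo, lam=lam)
    pol_hi, _ = value_iteration(P, R, cost, mu_hi, lam=lam)
    # Randomized mixing of the two deterministic policies, Eq. (24):
    # pick pi_{mu^-} with probability eta and pi_{mu^+} otherwise.
    pol_mix = pol_lo if np.random.random() < eta else pol_hi
    return 0.5 * (mu_lo + mu_hi), pol_mix
```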
3) Complexity analysis: The value iteration approach in the inner loop is polynomial with $\mathcal{O}(|\mathcal{A}_\gamma| |\mathcal{S}_\gamma|^2)$ arithmetic operations at each iteration. Thereby, the longest running time of Algorithm 1, in terms of the number of arithmetic operations over both loops, is given by $\mathcal{O}\big(\frac{2L |\mathcal{S}_\gamma|^2}{1-\lambda} \log\frac{1}{1-\lambda}\big)$ for fixed $\lambda$, as studied in [27] and [28]. The complexity of the algorithm increases with a larger state space, additional steps in the outer loop, and as $\lambda \to 1$.
V. MONTE CARLO PROBABILITY DISTRIBUTION ESTIMATION
In this section, we leverage the Monte Carlo estimation
method to statistically compute the estimated pmfs of the
received updates’ usefulness from the endpoint’s perspective
at the AA, and the mapped target usefulness at the SA.
To this end, we consider a time interval in the format of
an estimation horizon (E-horizon), followed by a decision
horizon (D-horizon), as illustrated in Fig. 4. The E-horizon
is exclusively reserved for the estimation processes and has a length of $M$ time slots, which is sufficiently large to enable an accurate estimation. The D-horizon represents the long-term time horizon with the sufficiently large length of $N \gg 1$ slots, as defined in Section IV, during which the agents find and apply their model-based (CMDP) decision policies.
Within the E-horizon, the SA does not make any decisions.
Instead, it focuses on communicating updates at the highest
possible rate while adhering to cost constraints. Once receiving
these updates, the AA measures their usefulness, stores them in
memory, and sends E-ACK signals for effective updates. The
SA logs whether the E-ACK has been successfully received
or not in every slot of the E-horizon. Finally, employing the
received E-ACK signals at the SA and the measured updates’
usefulness at the AA, both agents perform their estimations.
A. Usefulness Probability of Received Updates
Picking the $j$-th, $\forall j \in \mathcal{J}$, outcome from the set $\hat{\mathcal{V}}$ (defined in Section III-A) that corresponds to the received update's usefulness from the endpoint's perspective in the $m$-th slot of the E-horizon, i.e., $\hat{v}_m$, the relevant estimated probability of that outcome is given by
$$q_j = p_{\hat{\nu}}(\hat{\nu}_j) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\big\{\hat{v}_m = \hat{\nu}_j\big\}. \qquad (25)$$
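The empirical pmf in (25) is a plain frequency count over the E-horizon; a short sketch with hypothetical sample data is given below, where the function name and the generated samples are illustrative.

```python
import numpy as np

def estimate_usefulness_pmf(v_hat_samples, nu_hat_levels):
    """q_j = (1/M) * sum_m 1{v_hat_m == nu_hat_j}, Eq. (25)."""
    M = len(v_hat_samples)
    return np.array([np.sum(np.isclose(v_hat_samples, nu)) / M
                     for nu in nu_hat_levels])

# Hypothetical E-horizon measurements on an 11-level usefulness space
levels = np.linspace(0.0, 1.0, 11)
rng = np.random.default_rng(1)
samples = rng.choice(levels, size=1000, p=np.full(11, 1 / 11))
q = estimate_usefulness_pmf(samples, levels)
print(q.round(3), q.sum())
```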
B. Probability of the Mapped Target Usefulness
We assume that the mapped target usefulness, i.e., $\hat{v}_{\mathrm{tgt}}$, is a member of the set $\hat{\mathcal{V}}_{\mathrm{tgt}} = \{\vartheta_j \,|\, j \in \mathcal{J}_{\mathrm{tgt}}\}$ with i.i.d. elements, where $\mathcal{J}_{\mathrm{tgt}} = \{1, 2, \ldots, |\hat{\mathcal{V}}_{\mathrm{tgt}}|\}$, and the probability of the $j$-th element is equal to $p_{\hat{v}_{\mathrm{tgt}}}(\hat{v}_{\mathrm{tgt}} = \vartheta_j)$. Herein, as mentioned earlier, $p_{\hat{v}_{\mathrm{tgt}}}(\cdot)$ is the estimated pmf of the mapped target usefulness and is obtained by
$$p_{\hat{v}_{\mathrm{tgt}}}(\vartheta_j) = \sum_{e \in \{0,1\}} p_{\hat{v}_{\mathrm{tgt}}}\big(\vartheta_j \,|\, \hat{E} = e\big) \Pr\big(\hat{E} = e\big) \qquad (26)$$
where we find the probability of successfully receiving the E-ACK, i.e., $e = 1$, or not, i.e., $e = 0$, as follows
$$\Pr\big(\hat{E} = e\big) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\big\{\hat{E}_m = e\big\} \qquad (27)$$
where $\hat{E}_m$ indicates the E-ACK arrival status in the $m$-th slot of the E-horizon. Furthermore, to derive the conditional probability in (26), we first consider the successful arrivals of E-ACK signals such that
$$p_{\hat{v}_{\mathrm{tgt}}}\big(\vartheta_j \,|\, \hat{E} = 1\big) = \frac{\sum_{i \in \mathcal{I}} p_{\nu | \hat{E}=1}\big(\nu_i \,|\, \nu_i \geq \vartheta_j\big)}{\sum_{j \in \mathcal{J}_{\mathrm{tgt}}} \sum_{i \in \mathcal{I}} p_{\nu | \hat{E}=1}\big(\nu_i \,|\, \nu_i \geq \vartheta_j\big)}. \qquad (28)$$
Then, we have
$$p_{\hat{v}_{\mathrm{tgt}}}\big(\vartheta_j \,|\, \hat{E} = 0\big) = \frac{\sum_{i \in \mathcal{I}} p_{\nu | \hat{E}=0}\big(\nu_i \,|\, \nu_i < \vartheta_j\big)}{\sum_{j \in \mathcal{J}_{\mathrm{tgt}}} \sum_{i \in \mathcal{I}} p_{\nu | \hat{E}=0}\big(\nu_i \,|\, \nu_i < \vartheta_j\big)}. \qquad (29)$$
The pmfs $p_{\nu | \hat{E}=1}(\cdot)$ and $p_{\nu | \hat{E}=0}(\cdot)$ in (28) and (29) are associated with an observation's importance rank given the successful and unsuccessful communication of the E-ACK, respectively. In this regard, by applying Bayes' theorem we can derive the following formula:
$$p_{\nu | \hat{E}=e}\big(\nu_i\big) = \frac{\frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\big\{v_m = \nu_i \,\wedge\, \hat{E}_m = e\big\}}{\Pr\big(\hat{E} = e\big)}. \qquad (30)$$
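Putting (26)–(30) together, the SA-side estimation can be sketched as below. The inequality-conditioned sums in (28) and (29) are implemented literally, while the inputs (logged importance ranks and E-ACK bits over the E-horizon), the function name, and the level grids are hypothetical.

```python
import numpy as np

def estimate_target_pmf(v_log, ack_log, nu_levels, theta_levels):
    """Estimate p_vtgt(theta_j) at the SA from E-horizon logs, Eqs. (26)-(30).

    v_log[m]   : importance rank of the update generated in slot m
    ack_log[m] : 1 if an E-ACK was received in slot m, else 0
    """
    v_log, ack_log = np.asarray(v_log), np.asarray(ack_log)
    pr_e = np.array([(ack_log == e).mean() for e in (0, 1)])       # Eq. (27)
    p_tgt = np.zeros(len(theta_levels))
    for e in (0, 1):
        if pr_e[e] == 0:
            continue
        # Conditional importance pmf given the E-ACK outcome, Eq. (30)
        p_nu_e = np.array([np.mean((v_log == nu) & (ack_log == e))
                           for nu in nu_levels]) / pr_e[e]
        # Eq. (28) for e = 1 (nu_i >= theta_j), Eq. (29) for e = 0 (nu_i < theta_j)
        if e == 1:
            num = np.array([p_nu_e[nu_levels >= th].sum() for th in theta_levels])
        else:
            num = np.array([p_nu_e[nu_levels < th].sum() for th in theta_levels])
        cond = num / num.sum() if num.sum() > 0 else np.zeros_like(num)
        p_tgt += cond * pr_e[e]                                     # Eq. (26)
    return p_tgt

# Hypothetical logs over a short E-horizon
nu_levels = np.linspace(0.05, 0.95, 10)
theta_levels = np.linspace(0.0, 1.0, 11)
rng = np.random.default_rng(2)
v_log = rng.choice(nu_levels, size=500)
ack_log = (v_log > 0.5).astype(int) * (rng.random(500) > 0.1)
print(estimate_target_pmf(v_log, ack_log, nu_levels, theta_levels).round(3))
```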
VI. SIMULATION RESULTS
In this section, we present simulation results that corroborate
our analysis and assess the performance gains in terms of
effectiveness achieved by applying different update models and
agent decision policies in end-to-end status update systems.
A. Setup and Assumptions
We study the performance over $5 \times 10^5$ time slots, which include the E-horizon and the D-horizon with $1 \times 10^5$ and $4 \times 10^5$ slots, respectively. To model the Markovian effect-agnostic policy, we consider a Markov chain with two states, $0$ and $1$. We assume that the self-transition probability of state $0$ is $0.9$, while that of state $1$ depends on the controlled update transmission or query rate. Without loss of generality, we assume that the outcome spaces for the usefulness of generated updates, i.e., $\mathcal{V}$, the usefulness of received updates, i.e., $\hat{\mathcal{V}}$, and the mapped target usefulness, i.e., $\hat{\mathcal{V}}_{\mathrm{tgt}}$, are bounded within the span $[0, 1]$. For simplicity, we divide each space into discrete levels based on its number of elements in ascending order, where every level shows a randomized value.
TABLE I
PARAMETERS FOR SIMULATION RESULTS.

| Name | Symbol | Value |
| E-horizon length | – | $1 \times 10^5$ [slot] |
| D-horizon length | $N$ | $4 \times 10^5$ [slot] |
| Erasure probability in update channel | $p_\epsilon$ | 0.2 |
| Erasure probability in acknowledgment link | $p'_\epsilon$ | 0.1 |
| Length of generated update usefulness space | $|\mathcal{V}|$ | 10 |
| Length of received update usefulness space | $|\hat{\mathcal{V}}|$ | 11 |
| Length of mapped target usefulness space | $|\hat{\mathcal{V}}_{\mathrm{tgt}}|$ | 11 |
| Shape parameters for usefulness distribution | $a$, $b$ | 0.3, 0.3 |
| Maximum truncated AoI | $\Delta_{\max}$ | 10 [slot] |
| Action window width in pull-based model | $\Theta_{\max}$ | 1 [slot] |
| Action window width in push-and-pull model | $\Theta_{\max}$ | 5 [slot] |
| Action window width in push-based model | $\Theta_{\max}$ | 10 [slot] |
| Update transmission cost at $n$-th slot | $C_{n,1}$ | 0.1 |
| Query raising cost at $n$-th slot | $C_{n,2}$ | 0.1 |
| Actuation availability cost at $n$-th slot | $C_{n,3}$ | 0.01 |
| Maximum discounted cost in decision problem | $C_{\gamma,\max}$ | 0.08 |
| Target effectiveness grade | $\mathrm{GoE}_{\mathrm{tgt}}$ | 0.6 |
| Discount factor in CMDP | $\lambda$ | 0.75 |
| Convergence accuracy in Algorithm 1 | $\varepsilon_\mu$, $\varepsilon_\pi$ | $10^{-4}$, $10^{-4}$ |
| Mixing probability in bisection method | $\eta$ | 0.5 |
| Controlled transmission rate (value-agnostic) | – | 0.8 |
| Controlled query rate (value-agnostic) | – | 0.8 |
We also consider that the $i$-th outcome of the set $\mathcal{V}$, notated as $\nu_i$, $i \in \mathcal{I}$, occurs following a beta-binomial distribution with pmf
$$p_\nu(\nu_i) = \binom{|\mathcal{V}| - 1}{i - 1} \frac{\mathrm{Beta}(i - 1 + a, |\mathcal{V}| - i + b)}{\mathrm{Beta}(a, b)} \qquad (31)$$
where $\mathrm{Beta}(\cdot,\cdot)$ is the beta function, and $a = 0.3$ and $b = 0.3$ are shape parameters.
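The importance pmf in (31), and sampling from it, can be reproduced with a few lines; using SciPy's `beta` and `comb` functions here is an implementation choice, and the mapping of levels to values in $[0,1]$ is hypothetical.

```python
import numpy as np
from scipy.special import beta as beta_fn
from scipy.special import comb

def beta_binomial_pmf(n_levels, a=0.3, b=0.3):
    """p_nu(nu_i) for i = 1, ..., |V|, following Eq. (31)."""
    i = np.arange(1, n_levels + 1)
    p = comb(n_levels - 1, i - 1) * beta_fn(i - 1 + a, n_levels - i + b) / beta_fn(a, b)
    return p / p.sum()   # guard against rounding; the pmf already sums to 1

p_nu = beta_binomial_pmf(10)
levels = np.linspace(0.05, 0.95, 10)   # hypothetical mapping of levels to [0, 1]
rng = np.random.default_rng(3)
samples = rng.choice(levels, size=5, p=p_nu)
print(p_nu.round(3), samples)
```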
Moreover, we plot the figures based on the following form of the GoE metric$^4$, which comes from the general formulation proposed in (1):
$$\mathrm{GoE}_n = \frac{\hat{v}_n}{\Delta_n \Theta_n} - \alpha_n C_{n,1} - \beta_n C_{n,2} - C_{n,3} \qquad (32)$$
for the $n$-th time slot. Herein, $C_{n,1}$ is the communication cost, $C_{n,2}$ denotes the query cost, and $C_{n,3}$ indicates the actuation availability cost at the AA, which depends on the update communication model. However, the cost function in the decision problem's constraint is assumed to be $c_\gamma(\gamma_n) = \gamma_n$, $\forall n \in \mathbb{N}$.
The parameters used in the simulations are summarized in
Table I. In the legends of the plotted figures, we depict the
policies at the agents in the form of a tuple, with the first and
second elements referring to the policy applied in the SA and
the AA, respectively. For an effect-aware policy, we simply
use the notation “E-aware,” while for an effect-agnostic policy,
only the modeling process is mentioned in the legend.

$^4$The analysis can be easily extended to any other forms of the GoE metric.

Fig. 5. The evolution of the average effectiveness accumulated over time based on the push-and-pull model.
B. Results and Discussion
Fig. 5 illustrates the evolution of the average cumulative ef-
fectiveness over time for the push-and-pull model and different
agent decision policies. As shown, applying the effect-aware policy at both agents offers the highest effectiveness, and the gap between its performance and that of the other policies gradually widens as time passes. Nevertheless, using the other value-agnostic policies at either or both agent(s) diminishes the effectiveness performance of the system by at least 12% or 36%, respectively. Particularly, if the SA and the AA apply the effect-aware and Markovian effect-agnostic policies, respectively, the offered effectiveness is even lower than in the scenario in which both agents use periodic effect-agnostic policies. One of the reasons is that the Markovian query raising misleads the SA in estimating the mapped target usefulness of updates. It is also worth mentioning that throughout the E-horizon, from slot $1$ to $1 \times 10^5$, the scenarios where the AA raises effect-aware queries offer the same performance as one another and outperform the other scenarios. However, in the D-
horizon, a performance gap appears and evolves depending
on the applied policy at the SA. Therefore, by making effect-
aware decisions at both agents based on estimations in the
E-horizon and CMDP-based policies in the D-horizon, perfor-
mance can be significantly improved.
The bar chart in Fig. 6 shows the average rate of update
transmissions from the SA and the consequent actions per-
formed at the AA based on raised queries that result in the
system’s effectiveness, as depicted in Fig. 5. Interestingly, the
scenario where both agents apply the effect-aware policy has
lower transmission and action rates compared to the other sce-
narios except the ones with Markovian effect-agnostic query
policies. However, the number of actions that can be performed in the scenarios with Markovian queries is limited, since most of the updates are received outside the action windows.
Figs. 5 and 6 show that using effect-aware policies at both
the SA and the AA not only brings the highest effectiveness
but also needs lower update transmissions by an average of
11%, saving resources compared to the scenarios that have comparable performances. We also note that although effect-aware and periodic query decisions with effect-aware update transmission have almost the same action rate, the effect-aware case leads to more desirable effects or appropriate actions at the endpoint. This results in around 16% higher average cumulative effectiveness, referring to Fig. 5.

Fig. 6. The average update transmission and query rates of different decision policies in the push-and-pull model.
Figs. 7 (a), 7 (b), and 7 (c) present the average cumulative
GoE provided in the system over $5 \times 10^5$ time slots for the
push-and-pull, push- and pull-based communication models,
respectively. The corresponding effectiveness at the endpoint
for the primary model can be found in Fig. 5. The plots
demonstrate that when both agents decide based on effect-
aware policies, regardless of the update communication model,
the highest offered GoE of the system is reached. How-
ever, in other scenarios, the performance of some policies
exceeds those of others, depending on the update model.
For instance, having effect-aware update transmission and
Markovian queries shows 2.52 times higher average GoE for
the push-based model than the push-and-pull one. This is
because the push-based model has a larger action window.
Besides, comparing Figs. 7 (a) and 7 (b), the reason that the
provided GoE by the effect-aware decisions at both agents is
28% lower for the push-based model compared to the push-
and-pull is that the AA has to be available longer, which causes
a higher cost. Also, with the pull-based model as in Fig. 7 (c),
the average GoE within a period is less than the average
cost since the AA is only available to act at query instants,
significantly reducing the average GoE despite the high update
transmission rate. In the pull-based model, however, applying
the CMDP-based update transmission decisions at the SA can
address this issue with 16% higher average GoE.
The trade-off between the average effectiveness and the
width of the action window, i.e., Θmax, is shown in Fig. 8
for different decision policies. We see that the system cannot
offer notable effectiveness with Θmax = 1, i.e., under the pull-
based model. However, by expanding the action window width
from Θmax = 1 to 10 [slot], the average effectiveness boosts
from its lowest to the highest possible value. Since all policies
already reach their best performance before $\Theta_{\max} = 10$, which indicates the push-based model, we can conclude that the push-and-pull model with a flexible action window is more advantageous than the push-based one with a very large fixed window. In addition, the scenario where both agents use effect-aware policies outperforms the others for $\Theta_{\max} > 1$. It is worth mentioning that the performance of the scenarios where the AA raises Markovian queries converges to the highest level for very large widths. The average cumulative effectiveness of these scenarios, as in Fig. 5, visibly rises for $\Theta_{\max} \geq 9$.

Fig. 7. The evolution of the average cumulative GoE over time, following the (a) push-and-pull, (b) push-based, and (c) pull-based models.

Fig. 8. The average effectiveness versus the width of the action window for the push-and-pull model.

Fig. 9. The average effectiveness w.r.t. the target effectiveness grade in the push-and-pull model.
The interplay between the average effectiveness and the
target effectiveness grade, i.e., GoEtgt from Section III-C, is
depicted in Fig. 9 for the push-and-pull model and different
agent decision policies. Evidently, the average effectiveness
under all policies decreases gradually with the increase of the
target effectiveness grade and converges to zero for GoEtgt ≥
0.9. Using effect-aware policies at both agents offers the
highest effectiveness for medium-to-large target grades, i.e.,
GoEtgt ≥0.52 here. However, for lower target grades, the
effect-aware and periodic effect-agnostic decisions at the SA
and the AA, respectively, result in better performance. This
comes at the cost of higher transmission and action rates.
Thus, there is a trade-off between the paid cost and the offered effectiveness, and various policies can be applied depending on the cost criterion and the target effectiveness grade.
Fig. 10 depicts the average effectiveness obtained in the
system through $5 \times 10^5$ time slots versus the controlled
update transmission rate for the push-and-pull model and
various agent decision policies. Concerning Section II-B, this
controlled rate is related to the value-agnostic update transmission policies at the SA and denotes the expected number of updates to be communicated within the specified period of time. Therefore, the performance of the other policies should remain fixed for different controlled transmission rates. Fig. 10 reveals that increasing the controlled update rate increases the offered effectiveness when the SA applies the value-agnostic policies. However, even at the highest possible rate, subject to the maximum discounted cost, using effect-aware policies at both agents is necessary to ensure the highest average effectiveness, regardless of the controlled transmission rate. As an illustrative example, when the AA raises effect-aware queries, the effectiveness drops by an average of 38% (43%) if the SA transmits updates based on the periodic (Markovian) effect-agnostic policy instead of using the effect-aware one.

Fig. 10. The performance comparison between various policies under different update transmission rates but fixed query rates for value-agnostic policies.
In Fig. 11, the plot shows the same trend as Fig. 10, but
this time it focuses on the changes in the average effectiveness
versus the controlled query rate. Herein, the controlled query
rate is dedicated to the scenarios where the AA operates under
effect-agnostic policies. The results indicate that the average
effectiveness rises via the increase of the controlled query
rate for the periodic and Markovian policies. However, the
effectiveness increase is not significant for the latter one.
Despite this, even with the highest rates, the effectiveness offered by effect-aware queries is still higher than that of effect-agnostic queries. Therefore, the
highest average effectiveness is achieved when both agents
make effect-aware decisions, as depicted in Figs. 10 and 11.
In the context of the decision-making problem $\mathcal{P}_2$ in (12), altering the maximum discounted cost, i.e., $C_{\gamma,\max}$, can impact
the decisions made by each agent. To study this, we have
plotted Fig. 12 for the push-and-pull model under different
decision policies. The figure shows that the stricter the cost
constraint, the lower the average effectiveness, irrespective of
the decision policy. Also, for all cost constraints, the scenario
in which both agents use effect-aware policy yields the best
performance, whereas the other policies outperform each other
under different cost constraints. Due to CMDP-based deci-
sions, the gap between the effectiveness performance of the
best scenario and those of the others increases as the constraint decreases until $C_{\gamma,\max} = 0.1$, where the SA can transmit all updates, and the AA can raise queries without restrictions.

Fig. 11. The comparison between different policies with variable query rates but fixed update transmission rates for value-agnostic policies.

Fig. 12. The average effectiveness attained by different policies versus the maximum discounted cost in the push-and-pull model.
Next, we compare the performance of the model-based agent decisions discussed in Section IV with that of model-free decisions. The latter are based on reinforcement learning (RL), where each agent separately learns to make decisions through direct interaction with the environment. The learning process uses the state (here, observation) spaces, action sets, and rewards of Section IV-B, without relying on a predefined model. To derive model-free decisions, we employ a deep Q-network (DQN) that parameterizes an approximate value function for every agent within the E-horizon through a multilayer perceptron (MLP), assisted by an experience replay mechanism. The neural network consists of two hidden layers, each with 64 neurons, and is trained using the adaptive moment estimation (ADAM) optimizer. The default values of the RL setting are taken from [29], except for the learning rate, which is set to 10^-4, and the discount factor, which is set to 0.75, in line with Table I.
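For reference, the following is a minimal sketch of such a DQN-based agent in PyTorch. The environment interface (env, obs_dim, n_actions) is a placeholder for the agent-specific observation space, action set, and reward of Section IV-B; a target network and the other refinements of [29] are omitted for brevity; only the hyperparameters stated above (two 64-neuron hidden layers, Adam, learning rate 10^-4, discount factor 0.75, experience replay) are taken from the text.

import random
from collections import deque

import torch
import torch.nn as nn


def build_q_net(obs_dim: int, n_actions: int) -> nn.Module:
    # Two hidden layers with 64 neurons each, as in the evaluation setup.
    return nn.Sequential(
        nn.Linear(obs_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )


def dqn_train(env, obs_dim, n_actions, episodes=500, horizon=316,
              gamma=0.75, lr=1e-4, eps=0.1, batch_size=64, buffer_size=10_000):
    # `env` is assumed to expose reset() -> obs and step(a) -> (obs, reward, done).
    q_net = build_q_net(obs_dim, n_actions)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = deque(maxlen=buffer_size)  # experience replay memory

    for _ in range(episodes):
        obs = env.reset()
        for _ in range(horizon):  # the E-horizon acts as the training interval
            # Epsilon-greedy action selection on the current observation.
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    q_values = q_net(torch.as_tensor(obs, dtype=torch.float32))
                    action = int(q_values.argmax().item())

            next_obs, reward, done = env.step(action)
            replay.append((obs, action, reward, next_obs, float(done)))
            obs = next_obs

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                o, a, r, o2, d = (torch.as_tensor(x, dtype=torch.float32)
                                  for x in zip(*batch))
                q = q_net(o).gather(1, a.long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = r + gamma * q_net(o2).max(dim=1).values * (1.0 - d)
                loss = nn.functional.mse_loss(q, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if done:
                break
    return q_net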
Fig. 13. The evolution of the average cumulative effectiveness for different policies under the MC and RL approaches in the push-and-pull model.
Fig. 14. The E-horizon length's impact on the effectiveness for different MC- and RL-based policies in the push-and-pull model.
In this regard, Fig. 13 depicts the evolution of the average cumulative effectiveness over time for the push-and-pull model
and different model-based agent decisions based on the Monte
Carlo (MC) estimation method given in Section V and model-
free RL-based decisions. As shown, the MC approach outperforms the RL one under all decision policies. However, the perfor-
mance gap is not significant in the scenario where both agents
apply effect-aware policies.
The better effectiveness of the MC approach may be attributable to the small state (observation) space, especially for the model at the SA. The length of the E-horizon, which serves as the training interval, may be another contributing factor. In
order to analyze the impact of the E-horizon length on the
average effectiveness of the effect-aware and effect-agnostic
decisions based on the MC and RL approaches, we plot
Fig. 14 for the push-and-pull model. We infer that increasing
the length of the E-horizon improves effectiveness, but the
MC-based policies reach their highest performance after a
certain optimal length. This optimal E-horizon length varies
depending on the decision policy. For example, in the scenario
where both agents utilize the MC approach and apply effect-
aware policies, an E-horizon length of around 316 time slots
is needed to ensure sufficiently accurate estimates and achieve
the highest average effectiveness.
Fig. 15. The lookup maps for decisions the SA makes based on (a) Cα,max = 0.06 and (b) Cα,max = 0.08 (SA decision over the index of importance rank and the E-ACK Ên).
Fig. 16. The lookup maps the AA utilizes for (a) Cβ,max = 0.06 and (b) Cβ,max = 0.08 (AA decision over the index of usefulness and ∆n, with panels Θn = 1, Θn = 2, and Θn ≥ 3 in (a), and Θn = 1 and Θn ≥ 2 in (b)).
On the other hand, RL-
based policies consistently improve across the plotted region. However, the scenarios in which the AA applies effect-aware policies exhibit a notable challenge: outperforming the MC-based policies requires an extensive E-horizon length. It is also worth noting that the monotonic decrease in the performance of the scenario with Markovian queries under the MC approach arises because the attainable effectiveness is limited and accumulates more slowly than the time interval over which the timely effectiveness is summed, so the time average declines.
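For orientation, the following is a minimal sketch of how a finite-horizon (every-visit) Monte Carlo estimate of action values can be organized, with the E-horizon acting as the trajectory length over which discounted returns are accumulated. It is an illustrative simplification under assumed interfaces and does not reproduce the exact estimator of Section V.

from collections import defaultdict


def mc_action_values(trajectories, gamma=0.75):
    # Each trajectory is a list of (state, action, reward) triples whose length
    # equals the E-horizon; states and actions must be hashable.
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for trajectory in trajectories:
        g = 0.0
        # Walk the E-horizon backwards to accumulate discounted returns.
        for state, action, reward in reversed(trajectory):
            g = reward + gamma * g
            returns_sum[(state, action)] += g
            returns_cnt[(state, action)] += 1
    # Longer E-horizons and more trajectories average more returns per pair,
    # which is consistent with the saturation of the MC-based curves in Fig. 14.
    return {sa: returns_sum[sa] / returns_cnt[sa] for sa in returns_sum}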
C. Lookup Maps for Agent Decisions
As the modeled CMDPs in Section IV-B have finite states, we can depict the optimal model-based decisions derived by Algorithm 1 via a multi-dimensional lookup map for each decision-making agent. Figs. 15 and 16 illustrate the maps for the SA and the AA, respectively, under different maximum discounted costs. The number of dimensions of a map equals the number of elements constructing every state of the relevant CMDP, with each dimension assigned to one element. With the lookup map in hand, an agent can make optimal decisions in each slot based on its current state. When comparing the same maps for different maximum discounted costs, we observe that the more stringent the cost constraint, the narrower the agent's decision boundaries become. Thus, the maps could vary with changes in the parameter values or with goals that have different target effectiveness grades. It is noteworthy that in Fig. 16, the maps with Θn = 1 represent the pull-based model, while the push-and-pull model converges to the push-based model with Θn ≥ 3 and Θn ≥ 2 in Figs. 16(a) and 16(b), respectively.
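As a usage illustration, the following is a minimal sketch of how an agent could index such a lookup map at run time. The array shapes, the zero-initialized contents, and the clipping of Θn are illustrative placeholders; the actual maps are the binary decisions produced by Algorithm 1 and visualized in Figs. 15 and 16.

import numpy as np

# Placeholder maps: sa_map is indexed by (importance rank index, E-ACK) and
# aa_map by (usefulness index, ∆n, Θn slice); shapes are hypothetical.
sa_map = np.zeros((10, 2), dtype=np.int8)
aa_map = np.zeros((10, 11, 3), dtype=np.int8)


def sa_decision(importance_idx: int, e_ack: int) -> int:
    # Return the SA's push decision (0 or 1) for its current state;
    # importance_idx is 1-based, e_ack is 0 or 1.
    return int(sa_map[importance_idx - 1, e_ack])


def aa_decision(usefulness_idx: int, delta: int, theta: int) -> int:
    # Return the AA's query decision (0 or 1) for its current state.
    # Values of Θn beyond the last slice share that slice, mirroring the
    # "Θn ≥ 3" panel of Fig. 16(a).
    theta_slice = min(theta, aa_map.shape[2]) - 1
    return int(aa_map[usefulness_idx - 1, delta - 1, theta_slice])

With the actual maps loaded in place of the zero placeholders, each agent's per-slot decision reduces to a constant-time table lookup, which is what makes the decision boundaries in Figs. 15 and 16 directly interpretable.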
We compute an optimal threshold for each element of the
state as a decision criterion, given values of the other elements.
Let us consider Ωα and Ωβ as the decision criteria at the SA and AA, respectively. To derive the optimal decision α∗n in the n-th slot, there are two alternative ways to define the criterion: Ωα is a threshold for the index of the update's importance rank, i.e., i ∈ I, for vn = νi, ∀νi ∈ V, given the E-ACK, i.e., Ên. Thus, we have
α∗n = 1{i ≥ Ωα(Ên)} | Ên.    (33)
Alternatively, Ωα denotes a threshold for the E-ACK given the importance